In [1]:
import pandas as pd

In [2]:
file_path = "data/test.Tap.20150720.csv"
airpay_log = pd.read_csv(file_path)

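# The file is the output of a previous step, so its format and data are fairly clean,
# and its first line holds the column names.
# read_csv can therefore load it into memory directly.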

In [6]:
airpay_log.head()

,type,country,value,user_id
0,password.gesture.once,th,0,1185391
1,shop.select_channel.home,th,type.10006,1185391
2,shop.select_channel.home,th,id.20041,1185391
3,shop.select_channel.home,th,type.11003,1185391
4,shop.select_channel.list,th,id.22015,1185391


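##### usage1. Automatic data type inference
pandas infers a dtype for each column as it reads the data, deciding for example whether a column is numeric or text. Date and time columns are only parsed when requested, e.g. through the `parse_dates` argument of `read_csv`.

Knowing the data types of the source is a major prerequisite before any computation starts.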
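As a minimal sketch of this inference, the snippet below reads a few rows from an in-memory string instead of the real file. The sample rows are hypothetical and only mirror the columns shown above.

```python
import io

import pandas as pd

# Hypothetical sample that mirrors the columns of the log file (not the real data).
csv_text = """type,country,value,user_id
password.gesture.once,th,0,1185391
shop.select_channel.home,th,type.10006,1185391
shop.select_channel.list,th,id.22015,1185392
"""

# No dtype hints are given, so pandas infers one dtype per column.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.dtypes)

# "value" mixes "0" with text such as "type.10006", so it cannot be numeric;
# "user_id" contains only integers, so it is inferred as an integer dtype.
print(pd.api.types.is_integer_dtype(sample["user_id"]))
print(pd.api.types.is_numeric_dtype(sample["value"]))
```

Depending on the pandas version, text columns may be reported as `object` (as in the output in this notebook) or as a dedicated string dtype.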

In [7]:
airpay_log.dtypes

type       object
country    object
value      object
user_id     int64
dtype: object

**A closer look at the two dtypes above**
1. Strings are stored as object rather than as a string type
Here object is effectively a pointer to the corresponding string.


The dtype object comes from NumPy; it describes the type of the elements in an ndarray. Every element in an ndarray **must have the same size in bytes**. For int64 and float64, that is 8 bytes. Strings, however, have no fixed length. So instead of saving the bytes of the strings in the ndarray directly, pandas uses an object ndarray, which stores **pointers** to objects; this is why the dtype of such an ndarray is object.

Here is an example:

* the int64 array contains 4 int64 values.
* the object array contains 4 pointers to 3 string objects.
<img src="img/pFF44.png">
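The same layout can be reproduced with plain NumPy. This is a small sketch with made-up values: one int64 array of 4 values, and one object array whose 4 slots refer to only 3 distinct string objects.

```python
import numpy as np

# Fixed-width elements: every int64 occupies 8 bytes inside the array buffer.
ints = np.array([10, 20, 30, 40], dtype=np.int64)
print(ints.dtype, ints.itemsize)

# Hypothetical strings of different lengths; the first one is reused in slot 3.
first, second, third = "th", "vietnam", "id"
objs = np.array([first, second, third, first], dtype=object)

# For an object array, itemsize is the size of one stored reference,
# not the length of any string.
print(objs.dtype, objs.itemsize)

# Slots 0 and 3 refer to the very same Python str object.
print(objs[0] is objs[3])

# Number of distinct objects the 4 slots refer to.
print(len({id(s) for s in objs}))
```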

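2. Numbers are detected automatically and stored as int64

Having user_id in a numeric format is not very useful for computation here, because we never do arithmetic on the user_id values.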
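If an identifier column should not be treated as a number at all, one option is to override the inference with the `dtype` argument of `read_csv`. The sketch below uses hypothetical rows rather than the real file.

```python
import io

import pandas as pd

# Hypothetical rows; only the column layout matches the real log.
csv_text = """type,country,value,user_id
password.gesture.once,th,0,1185391
shop.pay,th,id.20041,1185391
shop.pay,th,id.22015,1000094
"""

# dtype= overrides inference for the named column only.
sample = pd.read_csv(io.StringIO(csv_text), dtype={"user_id": str})

# For a text column, describe() reports count, unique, top and freq
# instead of mean, std and quantiles.
print(sample["user_id"].describe())
```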

Still, here is a quick demonstration of the summary statistics:

In [9]:
# Quick summary statistics of the data
airpay_log.describe()

,user_id
count,32950.0
mean,1141753.530744
std,82196.546774
min,1000094.0
25%,1068384.0
50%,1147116.0
75%,1221670.0
max,1252135.0


* mean: the average
* std: the standard deviation

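```
Provided the data follow a normal distribution:
68.268949% of the area lies within one standard deviation of the mean.
95.449974% of the area lies within two standard deviations of the mean.
99.730020% of the area lies within three standard deviations of the mean.
99.993666% of the area lies within four standard deviations of the mean.
(the empirical rule)
```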
http://baike.baidu.com/view/78339.htm

http://wallstreetcn.com/node/211672
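The percentages of the empirical rule can be recomputed with the standard library alone: for a normal distribution, the probability of falling within k standard deviations of the mean equals erf(k / sqrt(2)). The helper name `within_k_sigma` is only illustrative.

```python
import math


def within_k_sigma(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


# Print the share of the area for 1 to 4 standard deviations.
for k in range(1, 5):
    print(f"{k} sigma: {within_k_sigma(k):.6%}")
```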

##### usage2. Selecting rows by condition

In [5]:
airpay_log_country = airpay_log[airpay_log.country.str.upper() == "TH"]
# String methods on a column must be called through the .str accessor
# For details, see: Working with Text Data [http://pandas.pydata.org/pandas-docs/stable/text.html]
grouped = airpay_log_country.groupby('type')
grouped.count()

type,country,value,user_id
misc.need_assistance,189,0,189
password.gesture.change,117,117,117
password.gesture.once,5109,5109,5109
shop.pay,4011,4011,4011
shop.pay.otp,125,125,125
shop.select_channel.banner,1,1,1
shop.select_channel.home,10319,10319,10319
shop.select_channel.list,11028,11028,11028
shop.select_channel.welcome_gift,239,239,239
shop.select_payment,86,86,86
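In the table above, misc.need_assistance shows 0 in the value column while the other columns show 189. That is consistent with `count()` tallying only non-missing cells, presumably because those rows carry no value. The sketch below reproduces the effect on a small hypothetical DataFrame and contrasts it with `size()`, which counts rows regardless of missing data.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the first one has a missing "value".
sample = pd.DataFrame({
    "type": ["misc.need_assistance", "shop.pay", "shop.pay", "shop.pay"],
    "country": ["th", "TH", "th", "vn"],
    "value": [np.nan, "id.20041", "id.22015", "id.20041"],
    "user_id": [1185391, 1185391, 1000094, 1000094],
})

# Boolean mask: case-insensitive match on the country code.
mask = sample["country"].str.upper() == "TH"
th_rows = sample[mask]

# count() gives non-missing cells per column; size() gives rows per group.
counts = th_rows.groupby("type").count()
sizes = th_rows.groupby("type").size()
print(counts)
print(sizes)
```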
