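# A basic goal of data analysis is to find relationships between variables. A frequency table is a simple tool for exploring data and spotting those relationships. It shows the counts for each level of a categorical variable.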

The examples below use the Titanic dataset.

* pandas.crosstab(index=, columns=, margins=)

* Proportions: divide the whole DataFrame row-wise by one of its columns: df.div(column, axis=0)

* Higher-dimensional tables
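As a quick preview of the three ideas above, here is a minimal sketch on a small made-up DataFrame. The `toy` data and its column names are illustrative assumptions, not part of the Titanic dataset.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "sex":      ["f", "f", "m", "m", "m", "f"],
    "survived": [1, 1, 0, 0, 1, 0],
})

# Two-way frequency table with row and column totals ("All")
tab = pd.crosstab(index=toy["survived"], columns=toy["sex"], margins=True)

# Row proportions: divide every row by its own total (the "All" column)
row_prop = tab.div(tab["All"], axis=0)

print(tab)
print(row_prop)
```

Passing `axis=0` tells `div` to match the divisor against the row labels, so each row is scaled by its own total.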

In [21]:
import pandas as pd
import numpy as np
%matplotlib inline

In [2]:
train=pd.read_csv('train.csv')

## One-way frequency tables

In [6]:
my_tab=pd.crosstab(index=train['Survived'],
                   columns='any_name') # single variable, so the count column can be given any name
my_tab

col_0,any_name
Survived,
0,549
1,342


In [7]:
type(my_tab)

pandas.core.frame.DataFrame
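For a single variable, `Series.value_counts` returns the same counts as a Series rather than a DataFrame. The sketch below compares the two on a made-up column; the `survived` Series is a hypothetical stand-in for train['Survived'].

```python
import pandas as pd

# Hypothetical toy column standing in for train['Survived']
survived = pd.Series([0, 1, 0, 0, 1], name="Survived")

tab = pd.crosstab(index=survived, columns="count")  # DataFrame with one column
counts = survived.value_counts().sort_index()       # Series with the same numbers

print(type(tab).__name__, type(counts).__name__)
print(tab["count"].tolist() == counts.tolist())
```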

In [8]:
pd.crosstab(train['Pclass'],
            columns='count')

col_0,count
Pclass,
1,216
2,184
3,491


Third class is the largest group, with more passengers than first and second class combined.

In [9]:
pd.crosstab(index=train.Sex,
            columns='count')

col_0,count
Sex,
female,314
male,577


There are more male passengers than female passengers.

In [13]:
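# Cast to str so that missing values (NaN) become the string 'nan'
char_cabin=train.Cabin.astype(str)
# Keep only the first character: the deck letter, or 'n' when the cabin is missing
new_cabin=np.array([cabin[0] for cabin in char_cabin])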


In [14]:
train['new_cabin']=new_cabin

In [18]:
cabin_tab=pd.crosstab(train.new_cabin,'count')
cabin_tab

col_0,count
new_cabin,
A,15
B,47
C,59
D,33
E,32
F,13
G,4
T,1
n,687
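The `n` row comes from missing Cabin values: `astype(str)` turns NaN into the string 'nan', and its first character is 'n'. A minimal sketch of an alternative that labels missing values explicitly is shown below. The `cabin` Series is made-up data and the label "missing" is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values standing in for train['Cabin']
cabin = pd.Series(["C85", np.nan, "E46", np.nan, "C123"], name="Cabin")

# .str[0] takes the first character and leaves missing values as NaN,
# so they can be given an explicit label instead of the accidental 'n'
deck = cabin.str[0].fillna("missing")

deck_tab = pd.crosstab(index=deck, columns="count")
print(deck_tab)
```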


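If the variable passed in has many unique values, as a numeric variable typically does, crosstab still counts every unique value, but the resulting frequencies are not very informative.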
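One common way around this is to group the numeric values into bins first and then count the bins. The sketch below uses `pd.cut` on made-up ages; the bin edges and labels are arbitrary assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical ages standing in for a numeric column such as train['Age']
age = pd.Series([2, 15, 22, 35, 41, 58, 63, 29], name="Age")

# Bins are right-closed: (0, 18], (18, 40], (40, 100]
age_group = pd.cut(age, bins=[0, 18, 40, 100],
                   labels=["0-18", "19-40", "41-100"])

age_tab = pd.crosstab(index=age_group, columns="count")
print(age_tab)
```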

Because pd.crosstab returns a DataFrame, all DataFrame attributes and methods are available on the result:

In [19]:
cabin_tab.sum()

col_0
count    891
dtype: int64

In [24]:
cabin_tab.iloc[1:7]

col_0,count
new_cabin,
B,47
C,59
D,33
E,32
F,13
G,4


In [None]:
cabin_tab.plot(kind='bar')

### One of the most useful operations on a frequency table is converting the counts to proportions:

In [25]:
cabin_tab/cabin_tab.sum()

col_0,count
new_cabin,
A,0.016835
B,0.05275
C,0.066218
D,0.037037
E,0.035915
F,0.01459
G,0.004489
T,0.001122
n,0.771044
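pd.crosstab also accepts a `normalize` argument that returns proportions directly. The sketch below checks, on made-up data, that it matches dividing the counts by their sum; the `deck` Series is a hypothetical stand-in for train['new_cabin'].

```python
import pandas as pd

# Hypothetical toy column standing in for train['new_cabin']
deck = pd.Series(["C", "n", "n", "B", "n", "C", "n", "n"], name="new_cabin")

counts = pd.crosstab(index=deck, columns="count")
manual = counts / counts.sum()                   # divide the counts by the column total

built_in = pd.crosstab(index=deck, columns="count",
                       normalize="columns")      # same proportions in one call
print(built_in)
```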


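## Two-way frequency tables
Also called contingency tables, these are two-dimensional tables in which each dimension represents one variable.

A contingency table gives a direct view of the relationship between two variables: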

In [35]:
survived_sex=pd.crosstab(index=train['Survived'],  # index takes the data to count; rownames only names the resulting row index
                         columns=train['Sex'])  

survived_sex.index=['died','survived']

survived_sex

Sex,female,male
died,81,468
survived,233,109


In [42]:
survived_pclass=pd.crosstab(index=train.Survived,
                            columns=train.Pclass,
                            margins=True)          ##Include row and column totals

survived_pclass

Pclass,1,2,3,All
Survived,,,,
0,80,97,372,549
1,136,87,119,342
All,216,184,491,891


In [43]:
survived_pclass.index=['died','survived','coltotal']      # rename the row and column labels
survived_pclass.columns=['class1','class2','class3','rowtotal']

survived_pclass

,class1,class2,class3,rowtotal
died,80,97,372,549
survived,136,87,119,342
coltotal,216,184,491,891


In [44]:
# To get each cell's share of all passengers, divide the table by the grand total:

survived_pclass/survived_pclass.loc['coltotal','rowtotal']  # divide the whole table by a single cell (a scalar); .ix was removed from pandas, so use .loc

,class1,class2,class3,rowtotal
died,0.089787,0.108866,0.417508,0.616162
survived,0.152637,0.097643,0.133558,0.383838
coltotal,0.242424,0.20651,0.551066,1.0


In [45]:
# Divide down the columns: each column is divided by its own column total
survived_pclass/survived_pclass.loc['coltotal']

,class1,class2,class3,rowtotal
died,0.37037,0.527174,0.757637,0.616162
survived,0.62963,0.472826,0.242363,0.383838
coltotal,1.0,1.0,1.0,1.0


In [51]:
# Divide along the rows: each row is divided by its own row total
survived_pclass.div(survived_pclass['rowtotal'],axis=0)

,class1,class2,class3,rowtotal
died,0.145719,0.176685,0.677596,1.0
survived,0.397661,0.254386,0.347953,1.0
coltotal,0.242424,0.20651,0.551066,1.0


In [56]:
# Or transpose first and then divide column-wise (the result is the transposed table):
survived_pclass.T/survived_pclass['rowtotal']

,died,survived,coltotal
class1,0.145719,0.397661,0.242424
class2,0.176685,0.254386,0.20651
class3,0.677596,0.347953,0.551066
rowtotal,1.0,1.0,1.0
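The three kinds of proportion shown above (share of the grand total, share within each column, share within each row) can also be requested through the `normalize` argument of pd.crosstab. The sketch below uses made-up data, so the numbers are not the Titanic results.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 1, 1, 0],
    "Pclass":   [3, 3, 2, 1, 1, 2, 3, 3],
})

# All cells together sum to 1
overall = pd.crosstab(toy["Survived"], toy["Pclass"], normalize="all")
# Each row sums to 1
by_row = pd.crosstab(toy["Survived"], toy["Pclass"], normalize="index")
# Each column sums to 1
by_col = pd.crosstab(toy["Survived"], toy["Pclass"], normalize="columns")

print(by_col)
```

With `normalize`, no margins row or column is needed as a divisor, which avoids having to exclude the totals afterwards.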


## Higher-dimensional tables

In [57]:
surv_sex_class=pd.crosstab(index=train.Survived,
                           columns = [train.Sex,     # outer column level
                                      train.Pclass],  # inner column level
                           margins=True)

surv_sex_class

Sex,female,female,female,male,male,male,All
Pclass,1,2,3,1,2,3,
Survived,,,,,,,
0,3,6,72,77,91,300,549
1,91,70,72,45,17,47,342
All,94,76,144,122,108,347,891


In [64]:
surv_sex_class['female']

Pclass,1,2,3
Survived,,,
0,3,6,72
1,91,70,72
All,94,76,144


In [63]:
surv_sex_class['female'][1]  # index the column levels in the order they were passed in: Sex first, then Pclass

Survived
0       3
1      91
All    94
Name: 1, dtype: int64
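Columns of a table with two column levels can also be addressed with a single tuple, and `DataFrame.xs` can select on the inner level directly. The sketch below is a minimal illustration on made-up data, built without margins to keep the column labels uniform.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
    "Sex":      ["female", "female", "female", "male",
                 "male", "male", "male", "female"],
    "Pclass":   [3, 1, 2, 3, 1, 3, 2, 1],
})

tab = pd.crosstab(index=toy["Survived"],
                  columns=[toy["Sex"], toy["Pclass"]])

# One column, addressed by its (Sex, Pclass) pair in a single step
one_col = tab[("female", 1)]

# Cross-section on the inner level: Pclass 1 for both sexes
class_1 = tab.xs(1, axis=1, level="Pclass")

print(one_col)
print(class_1)
```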

In [66]:
# Proportions within each column (.ix was removed from pandas, so use .loc):
surv_sex_class/surv_sex_class.loc['All']

Sex,female,female,female,male,male,male,All
Pclass,1,2,3,1,2,3,
Survived,,,,,,,
0,0.031915,0.078947,0.5,0.631148,0.842593,0.864553,0.616162
1,0.968085,0.921053,0.5,0.368852,0.157407,0.135447,0.383838
All,1.0,1.0,1.0,1.0,1.0,1.0,1.0


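Findings: among women, the survival rate was above 90% in first and second class and 50% in third class; among men, first class also had a higher survival rate than the other classes. Pclass is a highly informative feature.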
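Survival rates by group can also be computed without a frequency table. Because Survived is coded 0/1, its mean within a group equals that group's survival rate. The sketch below uses a groupby on made-up data; it is an alternative route rather than the crosstab approach used in this notebook, and the numbers are not the Titanic results.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
    "Sex":      ["female", "female", "female", "male",
                 "male", "male", "male", "female"],
    "Pclass":   [3, 1, 2, 3, 1, 3, 2, 1],
})

# Mean of a 0/1 column within each (Sex, Pclass) group is the survival rate;
# unstack moves Pclass into the columns for easier reading
rate = toy.groupby(["Sex", "Pclass"])["Survived"].mean().unstack("Pclass")
print(rate)
```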