Imports
- numpy, pandas; from pandas: Series, DataFrame, pivot, pivot_table, crosstab
- matplotlib

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from pandas import crosstab, pivot, pivot_table
import matplotlib.pyplot as plt
%matplotlib inline

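### Pivot tables

A pivot table is a data summarization tool commonly found in spreadsheet programs and other data analysis software.

It aggregates data by one or more keys and arranges the results in a rectangular grid, with the group keys laid out along the rows and the columns.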
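
To make the definition concrete, here is a minimal sketch on a tiny hand-made frame (the `toy` data and its column names are invented for illustration, not taken from the notebook's random data). It suggests that a pivot table can be read as a group-by aggregation reshaped into a grid: grouping by both keys, averaging, and moving one key into the columns gives the same numbers as `pivot_table`.

```python
import pandas as pd

# Hypothetical toy data, small enough to check by hand.
toy = pd.DataFrame({
    'sex':    ['M', 'M', 'F', 'F', 'M'],
    'smoke':  ['Yes', 'No', 'No', 'No', 'Yes'],
    'height': [180, 170, 160, 164, 176],
})

# Rows keyed by sex, columns keyed by smoke, each cell holds the mean height.
table = toy.pivot_table(index='sex', columns='smoke',
                        values='height', aggfunc='mean')

# The same grid built by hand: group, aggregate, then unstack 'smoke'.
by_hand = toy.groupby(['sex', 'smoke'])['height'].mean().unstack('smoke')

print(table)
print(by_hand)
```

The (F, Yes) cell is empty because no row in the toy data has that combination.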

In [2]:
df = DataFrame(data={
    'sex':np.random.choice(['男', '女'], size=10),
    'smoke': np.random.choice(['Yes', 'No'], size=10),
    'height(cm)': np.random.randint(150, 199, size=10),
    'weight(kg)': np.random.randint(40, 100, size=10)
})
df.head()

,height(cm),sex,smoke,weight(kg)
0,186,男,Yes,75
1,165,女,No,65
2,152,男,No,81
3,152,男,No,47
4,152,男,Yes,71


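Row-grouped pivot table: set the index parameter
- df.pivot_table(index)
   - index may name one or more columns to group by along the rows
   - the values of those columns become the row index, so the result has one row per group rather than one row per record
   - records in the same group are combined with aggfunc, which defaults to the mean (np.mean)
  
<font color=blue>This usually does not produce NaN values.</font>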
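
A minimal sketch of row grouping with more than one key, using invented data (the `toy` frame and its `height` and `weight` columns are hypothetical). Passing a list to `index` gives one output row per combination of keys that occurs in the data. `values` is named explicitly here so that only the numeric columns are averaged.

```python
import pandas as pd

# Hypothetical toy data: six records, four distinct (sex, smoke) combinations.
toy = pd.DataFrame({
    'sex':    ['M', 'M', 'F', 'F', 'M', 'F'],
    'smoke':  ['Yes', 'No', 'No', 'No', 'Yes', 'Yes'],
    'height': [180, 170, 160, 164, 176, 158],
    'weight': [80, 70, 50, 54, 76, 52],
})

# Group along the rows by two keys; each group is reduced to its mean.
rows = toy.pivot_table(index=['sex', 'smoke'],
                       values=['height', 'weight'],
                       aggfunc='mean')

print(rows)
print(len(toy), 'records ->', len(rows), 'groups')
```

Because every group that appears in the result comes from at least one record, no cell is left empty in this case.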

In [6]:
df.pivot_table(index='smoke', aggfunc='max')

smoke,height(cm),sex,weight(kg)
No,189,男,81
Yes,195,男,75


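Column-grouped pivot table: set the columns parameter
- df.pivot(columns)
    - groups along the columns; the number of rows is unchanged
    - usually produces NaN values, because each row has a value under only one of the new column groups
    
<font color=red>pivot() is rarely used to pivot on a single column like this.</font>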
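
The following sketch, on invented data, illustrates where the NaN values come from: `pivot` only reshapes, so each original row keeps its place and has a value under exactly one of the new columns. `pivot_table` with the same `columns` argument aggregates instead and fills every cell.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'smoke':  ['Yes', 'No', 'No', 'Yes'],
    'height': [180, 170, 160, 176],
})

# Reshape only: one output row per input row, one column per smoke value.
wide = toy.pivot(columns='smoke', values='height')
print(wide)

# Aggregate: one mean per smoke value, so there are no gaps.
summary = toy.pivot_table(columns='smoke', values='height', aggfunc='mean')
print(summary)
```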

In [9]:
df.pivot(columns='smoke')

,height(cm),height(cm),sex,sex,weight(kg),weight(kg)
smoke,No,Yes,No,Yes,No,Yes
0,,186.0,,男,,75.0
1,165.0,,女,,65.0,
2,152.0,,男,,81.0,
3,152.0,,男,,47.0,
4,,152.0,,男,,71.0
5,189.0,,男,,56.0,
6,165.0,,男,,48.0,
7,180.0,,女,,59.0,
8,,195.0,,男,,69.0
9,165.0,,女,,72.0,


In [12]:
df.pivot_table(columns=['sex'])

sex,女,男
height(cm),170.0,170.142857
weight(kg),65.333333,63.857143


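Pivot table grouped by rows and columns: set the index and columns parameters together
- df.pivot_table(index, columns, values)
- add a favorite-color column, color, with values drawn from ['red', 'white', 'gray']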

In [14]:
df.pivot_table(index='sex',
               columns='smoke',
               values='height(cm)')

smoke,No,Yes
sex,,
女,170.0,
男,164.5,177.666667


In [15]:
df['color'] = Series(np.random.choice(['red', 'white', 'gray'], size=10))
df.head()

,height(cm),sex,smoke,weight(kg),color
0,186,男,Yes,75,gray
1,165,女,No,65,red
2,152,男,No,81,gray
3,152,男,No,47,gray
4,152,男,Yes,71,red


Exercise: group the rows by sex and color, group the columns by smoke, and show height and weight.

In [16]:
df.pivot_table(index=['sex', 'color'],
               columns='smoke',
               values=('height(cm)', 'weight(kg)'))

,,height(cm),height(cm),weight(kg),weight(kg)
,smoke,No,Yes,No,Yes
sex,color,,,,
女,red,165.0,,68.5,
女,white,180.0,,59.0,
男,gray,164.333333,186.0,61.333333,75.0
男,red,,173.5,,70.0
男,white,165.0,,48.0,


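aggfunc: the aggregation function applied to each cell of the table; defaults to np.mean
- for example, show the maximum of a feature within each group
- values selects the features to display (column names; may be a str, tuple, or list)

fill_value: the value used to replace missing entries in the result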
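
As a small illustration of both parameters together, the sketch below uses invented data to take the maximum height in each cell and to replace the one empty cell with 0. The `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; there is no (F, Yes) record.
toy = pd.DataFrame({
    'sex':    ['M', 'M', 'F', 'F', 'M'],
    'smoke':  ['Yes', 'No', 'No', 'No', 'Yes'],
    'height': [180, 170, 160, 164, 176],
})

# aggfunc='max' keeps the tallest person per cell;
# fill_value=0 replaces the cell that would otherwise be NaN.
tallest = toy.pivot_table(index='sex', columns='smoke', values='height',
                          aggfunc='max', fill_value=0)
print(tallest)
```

A filled-in 0 is indistinguishable from a real measurement of 0, so a fill value is best chosen so that it cannot be mistaken for data.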

In [18]:
df.pivot_table(index=['sex', 'color'],
               columns='smoke',
               values=('height(cm)', 'weight(kg)'),
               fill_value=0)

,,height(cm),height(cm),weight(kg),weight(kg)
,smoke,No,Yes,No,Yes
sex,color,,,,
女,red,165.0,0.0,68.5,0
女,white,180.0,0.0,59.0,0
男,gray,164.333333,186.0,61.333333,75
男,red,0.0,173.5,0.0,70
男,white,165.0,0.0,48.0,0


In [21]:
pivot_table(df,
            index='color',
            fill_value=0,
            columns='smoke')

,height(cm),height(cm),weight(kg),weight(kg)
smoke,No,Yes,No,Yes
color,,,,
gray,164.333333,186.0,61.333333,75
red,165.0,173.5,68.5,70
white,172.5,0.0,53.5,0


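### Cross-tabulation

A cross-tabulation (crosstab) is a special kind of pivot table that computes group frequencies; it is mainly used to summarize counts.

pd.crosstab(index, columns)

- index: the values to group by; they form the row index of the crosstab
- columns: the values that form the column index of the crosstab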
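
A minimal sketch, on invented data, of what the frequency count amounts to: the crosstab matches a group-by on both keys followed by `size()` and `unstack`. The `margins=True` argument, which is not used elsewhere in this notebook, adds row and column totals under the label `All`.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'sex':   ['M', 'M', 'F', 'F', 'M', 'F'],
    'smoke': ['Yes', 'No', 'No', 'No', 'Yes', 'Yes'],
})

# Count how many records fall into each (smoke, sex) cell.
counts = pd.crosstab(index=toy['smoke'], columns=toy['sex'])
print(counts)

# The same counts built by hand.
by_hand = toy.groupby(['smoke', 'sex']).size().unstack('sex', fill_value=0)

# Same table with totals added.
with_totals = pd.crosstab(toy['smoke'], toy['sex'], margins=True)
print(with_totals)
```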

In [22]:
df = DataFrame(data={
    'sex':np.random.choice(['男', '女'], size=1000),
    'smoke': np.random.choice(['Yes', 'No'], size=1000),
    'height(cm)': np.random.randint(150, 199, size=1000),
    'weight(kg)': np.random.randint(40, 100, size=1000),
    'color': np.random.choice(['red', 'gray', 'white', 'black'], size=1000)
})

In [24]:
crosstab(index=df['smoke'],
         columns=df['sex'])

sex,女,男
smoke,,
No,244,258
Yes,232,266


In [28]:
crosstab(df.smoke, df.color)[['black', 'red']]

color,black,red
smoke,,
No,120,154
Yes,122,136


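Homework: load the U.S. stock market data for 2015-2016
- data/stock2015-2016.csv

List the stocks in the data
- the Ticker field

Average closing price ("Adj Close") of each stock
- pivot_table()

Highest closing price of each stock

Daily closing price of Apple stock
- the Date field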
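
One possible way to approach these steps is sketched below. The CSV file is not reproduced here, so the sketch uses a small invented stand-in frame with the three fields named above (Date, Ticker, "Adj Close"). The prices are made up, and the assumption that Apple appears under the ticker `AAPL` should be checked against the real data.

```python
import pandas as pd

# Invented stand-in for pd.read_csv('data/stock2015-2016.csv').
stocks = pd.DataFrame({
    'Date':      ['2015-01-02', '2015-01-02', '2015-01-05', '2015-01-05'],
    'Ticker':    ['AAPL', 'MSFT', 'AAPL', 'MSFT'],
    'Adj Close': [100.0, 40.0, 104.0, 42.0],
})

# Which stocks are present.
tickers = stocks['Ticker'].unique()

# Average and highest closing price per stock.
mean_close = stocks.pivot_table(index='Ticker', values='Adj Close', aggfunc='mean')
max_close = stocks.pivot_table(index='Ticker', values='Adj Close', aggfunc='max')

# One row per day, one column per stock; then pick the assumed Apple column.
daily = stocks.pivot_table(index='Date', columns='Ticker', values='Adj Close')
apple_daily = daily['AAPL']

print(tickers)
print(mean_close)
print(max_close)
print(apple_daily)
```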