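# ch09 Data Aggregation and Group Operations
This chapter covers how to:
+ split a pandas object into pieces using one or more keys (functions, arrays, or DataFrame column names);
+ compute group summary statistics, such as count, mean, standard deviation, or a user-defined function;
+ apply a variety of functions to the columns of a DataFrame;
+ apply within-group transformations or other operations, such as normalization, linear regression, ranking, or subset selection;
+ compute pivot tables and cross-tabulations;
+ perform quantile analysis and other group analyses.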

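## 9.1 GroupBy Mechanics
Grouping keys can take several forms, and the keys need not all be of the same type:
+ a list or array of values with the same length as the axis being grouped;
+ a value indicating a column name in a DataFrame;
+ a dict or Series giving a correspondence between the values on the axis being grouped and the group names;
+ a function to be invoked on the axis index or on the individual labels in the index.

The last three are just shortcuts; their ultimate purpose is to produce an array of values used to split the object.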
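
As a rough illustration of the key forms listed above, the sketch below groups one small frame four ways: by an array, by a column name, by a dict keyed on the row labels, and by a function of the row labels. The frame `demo`, its values, and the name `label_to_group` are made up for this example; all four calls should yield the same group sums.

```python
import numpy as np
import pandas as pd

# Hypothetical data, chosen so the sums are easy to check by hand.
demo = pd.DataFrame({'key': ['a', 'a', 'b', 'b', 'a'],
                     'val': [1.0, 2.0, 3.0, 4.0, 5.0]},
                    index=['r0', 'r1', 'r2', 'r3', 'r4'])

# 1. An array with one entry per row.
by_array = demo['val'].groupby(np.array(['a', 'a', 'b', 'b', 'a'])).sum()

# 2. A column name.
by_column = demo.groupby('key')['val'].sum()

# 3. A dict mapping each row label to a group name.
label_to_group = {'r0': 'a', 'r1': 'a', 'r2': 'b', 'r3': 'b', 'r4': 'a'}
by_dict = demo['val'].groupby(label_to_group).sum()

# 4. A function called once per row label.
by_func = demo['val'].groupby(lambda label: label_to_group[label]).sum()

for result in (by_array, by_column, by_dict, by_func):
    print(result.to_dict())
```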

In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame({'key1':['a','a','b','b','a'],
                  'key2':['one','two','one','two','one'],
                  'data1':np.random.randn(5),
                  'data2':np.random.randn(5)})
df

      data1     data2 key1 key2
0 -0.260166  2.714922    a  one
1 -0.461518  2.201637    a  two
2 -1.603832  1.450366    b  one
3  1.002918 -0.268999    b  two
4  0.626369  0.329957    a  one


In [2]:
# Group by key1 and compute the mean of the data1 column:
# select data1, then call groupby with the key1 column
grouped = df['data1'].groupby(df['key1'])
grouped.mean()

key1
a   -0.031771
b   -0.300457
Name: data1, dtype: float64

In [3]:
# Pass several arrays at once
means = df['data1'].groupby([df['key1'],df['key2']]).mean()
means

key1  key2
a     one     0.183102
      two    -0.461518
b     one    -1.603832
      two     1.002918
Name: data1, dtype: float64

In [4]:
means.unstack()

key2       one       two
key1
a     0.183102 -0.461518
b    -1.603832  1.002918


In [5]:
# The grouping keys can be any arrays of the right length
states = np.array(['Ohio','California','California','Ohio','Ohio'])
years = np.array([2005,2005,2006,2005,2006])
df['data1'].groupby([states,years]).mean()

California  2005   -0.461518
            2006   -1.603832
Ohio        2005    0.371376
            2006    0.626369
Name: data1, dtype: float64

In [6]:
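# Use column names (strings, numbers, or other Python objects) as the grouping keys.
# numeric_only=True skips the non-numeric key2 column, which pandas 2.0+ no longer drops silently
df.groupby('key1').mean(numeric_only=True)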

         data1     data2
key1
a    -0.031771  1.748839
b    -0.300457  0.590684


In [7]:
df.groupby(['key1','key2']).mean()

              data1     data2
key1 key2
a    one   0.183102  1.522439
     two  -0.461518  2.201637
b    one  -1.603832  1.450366
     two   1.002918 -0.268999


In [8]:
# The GroupBy size method returns a Series containing the group sizes
df.groupby(['key1','key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

### 9.1.1 Iterating Over Groups

In [10]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)

a
      data1     data2 key1 key2
0 -0.260166  2.714922    a  one
1 -0.461518  2.201637    a  two
4  0.626369  0.329957    a  one
b
      data1     data2 key1 key2
2 -1.603832  1.450366    b  one
3  1.002918 -0.268999    b  two


In [12]:
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print(k1, k2)
    print(group)

a one
      data1     data2 key1 key2
0 -0.260166  2.714922    a  one
4  0.626369  0.329957    a  one
a two
      data1     data2 key1 key2
1 -0.461518  2.201637    a  two
b one
      data1     data2 key1 key2
2 -1.603832  1.450366    b  one
b two
      data1     data2 key1 key2
3  1.002918 -0.268999    b  two
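
Because each iteration yields the group name together with its chunk of rows, a loop can collect any per-group result. A minimal sketch, using a made-up frame `demo`, that counts the rows in each group and prints the result next to what `size()` reports:

```python
import pandas as pd

demo = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                     'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Each iteration yields the group name and the rows belonging to it.
rows_per_group = {}
for name, group in demo.groupby('key1'):
    rows_per_group[name] = len(group)

print(rows_per_group)
print(demo.groupby('key1').size().to_dict())
```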


Build a dict from these data pieces:

In [14]:
pieces = dict(list(df.groupby('key1')))
pieces['b']

      data1     data2 key1 key2
2 -1.603832  1.450366    b  one
3  1.002918 -0.268999    b  two
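
When only one group is needed, building the whole dict is optional. The sketch below, on a made-up frame `demo`, fetches a single group with the `get_group` method of the GroupBy object and compares it with the dict approach:

```python
import pandas as pd

demo = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                     'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

grouped = demo.groupby('key1')
pieces = dict(list(grouped))     # every group, keyed by group name
only_b = grouped.get_group('b')  # a single group, fetched directly

print(only_b)
print(list(pieces['b'].index) == list(only_b.index))
```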


By default groupby groups on axis=0, but it can be set to group on any other axis. For example, the columns can be grouped by dtype:

In [16]:
df.dtypes

data1    float64
data2    float64
key1      object
key2      object
dtype: object

In [19]:
grouped = df.groupby(df.dtypes,axis=1)
dict(list(grouped))

{dtype('float64'):       data1     data2
 0 -0.260166  2.714922
 1 -0.461518  2.201637
 2 -1.603832  1.450366
 3  1.002918 -0.268999
 4  0.626369  0.329957, dtype('O'):   key1 key2
 0    a  one
 1    a  two
 2    b  one
 3    b  two
 4    a  one}
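
Recent pandas releases (2.1 and later) deprecate the `axis=1` argument of `groupby`. As a hedged alternative that avoids `groupby` altogether, the sketch below splits a made-up frame `demo` into one sub-frame per dtype with plain Python. The names `columns_by_dtype` and `pieces` are illustrative only, and the dict is keyed by the dtype's string name rather than by the dtype object.

```python
import pandas as pd

demo = pd.DataFrame({'data1': [0.1, 0.2, 0.3],
                     'data2': [1.5, 2.5, 3.5],
                     'key1': ['a', 'a', 'b']})

# Collect the column names that share each dtype.
columns_by_dtype = {}
for column, dtype in demo.dtypes.items():
    columns_by_dtype.setdefault(str(dtype), []).append(column)

# One sub-frame per dtype, keyed by the dtype's name.
pieces = {name: demo[cols] for name, cols in columns_by_dtype.items()}

for name, piece in pieces.items():
    print(name, list(piece.columns))
```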

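### 9.1.2 Selecting a Column or Subset of Columns
For a GroupBy object created from a DataFrame, indexing it with one column name (a single string) or several (an array of strings) selects just those columns for aggregation: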

In [20]:
df.groupby(['key1','key2'])[['data2']].mean()

             data2
key1 key2
a    one  1.522439
     two  2.201637
b    one  1.450366
     two -0.268999


The object returned by this indexing operation is a grouped DataFrame if a list or array is passed, or a grouped Series if a single column name is passed as a scalar:

In [22]:
s_grouped = df.groupby(['key1','key2'])['data2']
s_grouped

<pandas.core.groupby.SeriesGroupBy object at 0x000000000CA83320>

In [23]:
s_grouped.mean()

key1  key2
a     one     1.522439
      two     2.201637
b     one     1.450366
      two    -0.268999
Name: data2, dtype: float64
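
Indexing the GroupBy object with a column name is shorthand for selecting the column first and then grouping it by the key column. A small check on a made-up frame `demo`, which also shows that a list of names keeps the result a DataFrame:

```python
import pandas as pd

demo = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                     'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
                     'data2': [10.0, 20.0, 30.0, 40.0, 50.0]})

# Shorthand: index the GroupBy object with a scalar column name.
short = demo.groupby('key1')['data2'].mean()
# Longhand: select the column, then group it by the key column.
long = demo['data2'].groupby(demo['key1']).mean()
# A list of names yields a grouped DataFrame instead of a grouped Series.
as_frame = demo.groupby('key1')[['data2']].mean()

print(short.equals(long))
print(type(short).__name__, type(as_frame).__name__)
```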

### 9.1.3 Grouping with Dicts and Series
Besides arrays, grouping information may exist in other forms:

In [27]:
people = pd.DataFrame(np.random.randn(5,5),
                     columns = ['a','b','c','d','e'],
                     index = ['Joe','Steve','Wes','Jim','Travis'])
people.iloc[2:3, [1, 2]] = np.nan  # set columns b and c to NaN for the row at position 2 (Wes)
people

               a         b         c         d         e
Joe    -1.421120  1.111500  0.137531 -0.350362 -0.869652
Steve   0.465219 -1.672135  0.623598 -1.091333  0.159835
Wes     0.309525       NaN       NaN  1.591797 -1.356520
Jim    -0.467595 -0.917956 -0.256062  1.778099  0.168285
Travis -0.754710  2.256770 -0.493249  0.020519  0.238518


Suppose the group correspondence of the columns is known, and we want to sum the columns by group:

In [28]:
mapping = {'a':'red','b':'red','c':'blue','d':'blue','e':'red','f':'orange'}
by_column = people.groupby(mapping,axis=1)
by_column.sum()

            blue       red
Joe    -0.212831 -1.179272
Steve  -0.467734 -1.047082
Wes     1.591797 -1.046995
Jim     1.522037 -1.217266
Travis -0.472730  1.740578


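A Series offers the same functionality, since it can be viewed as a fixed-size mapping. If a Series is used as the grouping key in the example above, pandas inspects the Series to make sure its index is aligned with the axis being grouped: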

In [37]:
map_series = pd.Series(mapping)
map_series

a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

In [38]:
people.groupby(map_series,axis=1).count()

        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3
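
On pandas 2.1 and later, passing `axis=1` to `groupby` emits a deprecation warning. One way to get the same column-wise grouping, sketched here on a made-up frame `demo`, is to transpose, group the rows with the mapping, and transpose back:

```python
import pandas as pd

demo = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0],
                     'c': [5.0, 6.0], 'd': [7.0, 8.0]},
                    index=['Joe', 'Steve'])
# 'z' has no matching column; unused mapping keys are simply ignored.
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'z': 'orange'}

# After the transpose the column labels become row labels, so the
# default axis=0 grouping applies the mapping to them.
by_color = demo.T.groupby(mapping).sum().T
print(by_color)
```

The double transpose is cheap for small frames like this one, but it converts mixed-dtype frames to object dtype, so it suits frames whose columns share one dtype.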


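### 9.1.4 Grouping with Functions
Any function passed as a group key is called once per index value, and its return values are used as the group names. In the example above, to group by the length of the names, simply pass the len function: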

In [39]:
people.groupby(len).sum()

          a         b         c         d         e
3 -1.579190  0.193544 -0.118531  3.019534 -2.057887
5  0.465219 -1.672135  0.623598 -1.091333  0.159835
6 -0.754710  2.256770 -0.493249  0.020519  0.238518


Mixing functions with arrays, lists, dicts, or Series is not a problem, because everything is converted to arrays internally:

In [40]:
key_list=['one','one','one','two','one']
people.groupby([len,key_list]).min()

              a         b         c         d         e
3 one -1.421120  1.111500  0.137531 -0.350362 -1.356520
  two -0.467595 -0.917956 -0.256062  1.778099  0.168285
5 one  0.465219 -1.672135  0.623598 -1.091333  0.159835
6 one -0.754710  2.256770 -0.493249  0.020519  0.238518
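
To make the "called once per index value" behaviour concrete, the sketch below groups a made-up frame `demo` by `len` and then by the same keys computed explicitly with `Index.map`. Both routes should give the same groups.

```python
import pandas as pd

demo = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0]},
                    index=['Joe', 'Steve', 'Wes', 'Travis'])

# The function receives each index label and returns the group name.
by_func = demo.groupby(len).sum()

# The same keys, computed explicitly as an array-like of name lengths.
name_lengths = demo.index.map(len)
by_array = demo.groupby(name_lengths).sum()

print(by_func)
print(by_array)
```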


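### 9.1.5 Grouping by Index Levels
The most convenient feature of hierarchically indexed datasets is the ability to aggregate using one of the levels of the index. To do so, pass the level number or name through the level keyword: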

In [41]:
columns = pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],
                                    [1,3,5,1,3]],names=['city','tenor'])
hier_df = pd.DataFrame(np.random.randn(4,5),columns=columns)
hier_df

city         US                            JP
tenor         1         3         5         1         3
0     -1.171354  2.290931  0.606500 -0.268082  0.695004
1      0.743864 -1.924068 -0.122843 -1.940006  0.783718
2     -0.668218  0.267772 -1.347503  1.759664  0.596267
3     -0.065137 -0.969136 -1.068600 -0.089359  0.372795


In [42]:
hier_df.groupby(level='city',axis=1).count()

city  JP  US
0      2   3
1      2   3
2      2   3
3      2   3
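
The `level` keyword also works on the default axis, which gives a way to avoid `axis=1` (deprecated since pandas 2.1). A minimal sketch on a made-up frame `demo` with the same column layout: transpose so the column MultiIndex sits on the rows, group by level, and transpose back.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['city', 'tenor'])
# Deterministic placeholder values instead of random numbers.
demo = pd.DataFrame(np.arange(20.0).reshape(4, 5), columns=columns)

# Transposing moves the column MultiIndex onto the rows, where
# level= works with the default axis.
counts = demo.T.groupby(level='city').count().T
print(counts)
```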


## 9.2 Data Aggregation