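# ch09 Data Aggregation and Group Operations
This chapter covers:
+ Splitting a pandas object into pieces using one or more keys (functions, arrays, or DataFrame column names);
+ Computing group summary statistics, such as count, mean, standard deviation, or a user-defined function;
+ Applying a variety of functions to the columns of a DataFrame;
+ Applying within-group transformations or other operations, such as normalization, linear regression, ranking, or subset selection;
+ Computing pivot tables and cross-tabulations;
+ Performing quantile analysis and other group analyses.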

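## 9.1 GroupBy Mechanics
Grouping keys can take several forms, and they need not all be of the same type:
+ A list or array of values with the same length as the axis being grouped;
+ A value indicating a column name in the DataFrame;
+ A dict or Series giving the correspondence between the values on the axis being grouped and the group names;
+ A function to be invoked on the axis index or on the individual labels in the index.

The last three are just shortcuts; ultimately each of them produces an array of values that is used to split the object.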
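
To make the point above concrete, here is a minimal sketch on a small hand-made frame (the name `demo` and its values are illustrative assumptions, not the random data used in the cells below). Each of the four key forms resolves to the same group labels, so all four calls should give the same group means.

```python
import pandas as pd

# Illustrative data only; not the chapter's random frame.
demo = pd.DataFrame({'key': ['a', 'a', 'b', 'b', 'a'],
                     'val': [1.0, 2.0, 3.0, 4.0, 6.0]})

# 1. An array with the same length as the axis being grouped
by_array = demo['val'].groupby(demo['key'].values).mean()
# 2. A column name
by_name = demo.groupby('key')['val'].mean()
# 3. A dict mapping index labels to group names
by_dict = demo['val'].groupby({0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'a'}).mean()
# 4. A function called on each index label
by_func = demo['val'].groupby(lambda i: demo.loc[i, 'key']).mean()

print(by_name)
```

In every case group `a` averages to 3.0 and group `b` to 3.5.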

In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame({'key1':['a','a','b','b','a'],
                  'key2':['one','two','one','two','one'],
                  'data1':np.random.randn(5),
                  'data2':np.random.randn(5)})
df

Unnamed: 0,data1,data2,key1,key2
0,0.167154,0.839215,a,one
1,-0.583971,-0.538031,a,two
2,-0.193537,-0.304562,b,one
3,0.865375,-0.747342,b,two
4,1.505119,-0.5076,a,one


In [2]:
# Group by key1 and compute the mean of the data1 column:
# select data1, then call groupby with the key1 column
grouped = df['data1'].groupby(df['key1'])
grouped.mean()

key1
a    0.362767
b    0.335919
Name: data1, dtype: float64

In [3]:
# Pass several arrays at once
means = df['data1'].groupby([df['key1'],df['key2']]).mean()
means

key1  key2
a     one     0.836136
      two    -0.583971
b     one    -0.193537
      two     0.865375
Name: data1, dtype: float64

In [4]:
means.unstack()

key2,one,two
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,0.836136,-0.583971
b,-0.193537,0.865375


In [5]:
# Group keys can be any arrays of the right length
states = np.array(['Ohio','California','California','Ohio','Ohio'])
years = np.array([2005,2005,2006,2005,2006])
df['data1'].groupby([states,years]).mean()

California  2005   -0.583971
            2006   -0.193537
Ohio        2005    0.516265
            2006    1.505119
Name: data1, dtype: float64

In [6]:
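# Use column names (strings, numbers, or other Python objects) as the group keys.
# numeric_only=True skips the non-numeric key2 column, which pandas 2.0+ no longer drops automatically.
df.groupby('key1').mean(numeric_only=True)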

Unnamed: 0_level_0,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,0.362767,-0.068805
b,0.335919,-0.525952


In [7]:
df.groupby(['key1','key2']).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,data1,data2
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,one,0.836136,0.165807
a,two,-0.583971,-0.538031
b,one,-0.193537,-0.304562
b,two,0.865375,-0.747342


In [8]:
# The GroupBy size method returns a Series containing the group sizes
df.groupby(['key1','key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

### 9.1.1 Iterating Over Groups

In [9]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)

a
      data1     data2 key1 key2
0  0.167154  0.839215    a  one
1 -0.583971 -0.538031    a  two
4  1.505119 -0.507600    a  one
b
      data1     data2 key1 key2
2 -0.193537 -0.304562    b  one
3  0.865375 -0.747342    b  two


In [10]:
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print(k1, k2)
    print(group)

a one
      data1     data2 key1 key2
0  0.167154  0.839215    a  one
4  1.505119 -0.507600    a  one
a two
      data1     data2 key1 key2
1 -0.583971 -0.538031    a  two
b one
      data1     data2 key1 key2
2 -0.193537 -0.304562    b  one
b two
      data1     data2 key1 key2
3  0.865375 -0.747342    b  two


Build a dict from these pieces of data:

In [11]:
pieces = dict(list(df.groupby('key1')))
pieces['b']

Unnamed: 0,data1,data2,key1,key2
2,-0.193537,-0.304562,b,one
3,0.865375,-0.747342,b,two


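By default groupby groups on axis=0, but it can also be told to group on another axis. For example, the columns can be grouped by dtype (note that the axis argument of groupby is deprecated in recent pandas releases):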

In [12]:
df.dtypes

data1    float64
data2    float64
key1      object
key2      object
dtype: object

In [13]:
grouped = df.groupby(df.dtypes,axis=1)
dict(list(grouped))

{dtype('float64'):       data1     data2
 0  0.167154  0.839215
 1 -0.583971 -0.538031
 2 -0.193537 -0.304562
 3  0.865375 -0.747342
 4  1.505119 -0.507600, dtype('O'):   key1 key2
 0    a  one
 1    a  two
 2    b  one
 3    b  two
 4    a  one}

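### 9.1.2 Selecting a Column or Subset of Columns
For a GroupBy object created from a DataFrame, indexing it with a single column name (a string) or a list of column names selects just those columns for aggregation: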

In [14]:
df.groupby(['key1','key2'])[['data2']].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,data2
key1,key2,Unnamed: 2_level_1
a,one,0.165807
a,two,-0.538031
b,one,-0.304562
b,two,-0.747342


The object returned by this indexing operation is a grouped DataFrame if a list or array is passed, or a grouped Series if a single column name is passed as a scalar:

In [15]:
s_grouped = df.groupby(['key1','key2'])['data2']
s_grouped

<pandas.core.groupby.SeriesGroupBy object at 0x000000000CF7D8D0>

In [16]:
s_grouped.mean()

key1  key2
a     one     0.165807
      two    -0.538031
b     one    -0.304562
      two    -0.747342
Name: data2, dtype: float64
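
The rule above can be checked with a short sketch. The frame `demo` and its values are illustrative assumptions; the point is only that a scalar column name leads to a Series result, while a list of names leads to a DataFrame result.

```python
import pandas as pd

# Illustrative data only.
demo = pd.DataFrame({'key': ['a', 'b', 'a'],
                     'x': [1.0, 2.0, 3.0],
                     'y': [10.0, 20.0, 30.0]})
g = demo.groupby('key')

as_series = g['x'].mean()    # scalar column name: grouped Series, Series result
as_frame = g[['x']].mean()   # list of names: grouped DataFrame, DataFrame result

# Indexing the GroupBy is shorthand for grouping the selected column directly.
longhand = demo['x'].groupby(demo['key']).mean()

print(type(as_series).__name__, type(as_frame).__name__)  # Series DataFrame
```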

### 9.1.3 Grouping with Dicts and Series
Grouping information may exist in forms other than an array:

In [17]:
people = pd.DataFrame(np.random.randn(5,5),
                     columns = ['a','b','c','d','e'],
                     index = ['Joe','Steve','Wes','Jim','Travis'])
# Set Wes's b and c values to NaN; position-based selection requires iloc
people.iloc[2:3, [1, 2]] = np.nan
people

Unnamed: 0,a,b,c,d,e
Joe,-0.386099,-1.432621,0.660237,0.489086,0.40204
Steve,-0.332529,0.452466,0.506883,-0.128833,0.477774
Wes,-0.936195,,,1.390733,-0.085433
Jim,1.340927,0.055309,1.506467,-1.781724,-0.291343
Travis,-0.891619,-0.657688,1.65894,1.490788,-0.593959


Suppose we know how the columns map to groups and want to sum the columns by group:

In [18]:
mapping = {'a':'red','b':'red','c':'blue','d':'blue','e':'red','f':'orange'}
by_column = people.groupby(mapping,axis=1)
by_column.sum()

Unnamed: 0,blue,red
Joe,1.149323,-1.41668
Steve,0.37805,0.597711
Wes,1.390733,-1.021628
Jim,-0.275257,1.104894
Travis,3.149728,-2.143267


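A Series works the same way, since it can be viewed as a fixed-size mapping. If a Series is used as the group key in the example above, pandas inspects it to make sure its index is aligned with the axis being grouped: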

In [19]:
map_series = pd.Series(mapping)
map_series

a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

In [20]:
people.groupby(map_series,axis=1).count()

Unnamed: 0,blue,red
Joe,2,3
Steve,2,3
Wes,1,2
Jim,2,3
Travis,2,3
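
As a small sketch of the mapping behavior above, the example below groups the rows of an illustrative Series (`scores` and its values are assumptions) rather than the columns of `people`, which avoids the `axis` argument. The mapping contains an extra key `'f'` that matches no label, just like the `mapping` dict above, and it is simply ignored.

```python
import pandas as pd

# Illustrative data only.
scores = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'f': 'orange'}

by_dict = scores.groupby(mapping).sum()               # dict as the group key
by_series = scores.groupby(pd.Series(mapping)).sum()  # Series as the group key

print(by_dict)
```

Both calls give `blue` = 7 and `red` = 3, and no `orange` group appears because no label maps to it.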


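### 9.1.4 Grouping with Functions
Any function passed as a group key is called once per index value, and its return values are used as the group names. In the example above, to group by the length of the people's names, simply pass the len function: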

In [21]:
people.groupby(len).sum()

Unnamed: 0,a,b,c,d,e
3,0.018634,-1.377312,2.166704,0.098095,0.025264
5,-0.332529,0.452466,0.506883,-0.128833,0.477774
6,-0.891619,-0.657688,1.65894,1.490788,-0.593959
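
One way to read the result above: passing a function should be equivalent to calling it on every index label yourself and grouping by the resulting list. The sketch below shows this on an illustrative Series (`ages` and its values are assumptions, not part of the `people` data).

```python
import pandas as pd

# Illustrative data only.
ages = pd.Series([30, 41, 25, 38, 52],
                 index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

by_func = ages.groupby(len).sum()                               # function key
by_list = ages.groupby([len(name) for name in ages.index]).sum()  # precomputed labels

print(by_func)
```

The three-letter names (Joe, Wes, Jim) fall into group 3 with a sum of 93, while Steve and Travis form groups 5 and 6 on their own.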


Mixing functions with arrays, lists, dicts, or Series is not a problem either, because everything is converted to arrays internally:

In [22]:
key_list=['one','one','one','two','one']
people.groupby([len,key_list]).min()

Unnamed: 0,Unnamed: 1,a,b,c,d,e
3,one,-0.936195,-1.432621,0.660237,0.489086,-0.085433
3,two,1.340927,0.055309,1.506467,-1.781724,-0.291343
5,one,-0.332529,0.452466,0.506883,-0.128833,0.477774
6,one,-0.891619,-0.657688,1.65894,1.490788,-0.593959


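### 9.1.5 Grouping by Index Levels
A convenient feature of hierarchically indexed datasets is the ability to aggregate by one of the index levels. To do so, pass the level number or name via the level keyword: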

In [23]:
columns = pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],
                                    [1,3,5,1,3]],names=['city','tenor'])
hier_df = pd.DataFrame(np.random.randn(4,5),columns=columns)
hier_df

city,US,US,US,JP,JP
tenor,1,3,5,1,3
0,0.413016,-2.510973,-0.66258,-2.051099,-0.473012
1,2.612819,-0.015368,-0.357439,-1.003357,0.591213
2,-0.349009,-0.709934,-2.208292,1.286282,1.222846
3,-2.136297,1.208343,1.171808,-0.035197,0.696152


In [24]:
hier_df.groupby(level='city',axis=1).count()

city,JP,US
0,2,3
1,2,3
2,2,3
3,2,3
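
The same `level` keyword works when the hierarchical index is on the rows. The following sketch uses an illustrative Series (`rates` and its values are assumptions) and shows that a level can be named either by its name or by its integer position.

```python
import pandas as pd

idx = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                 [1, 3, 5, 1, 3]],
                                names=['city', 'tenor'])
rates = pd.Series([0.5, 1.0, 1.5, 0.1, 0.3], index=idx)  # illustrative values

by_name = rates.groupby(level='city').count()  # level given by name
by_number = rates.groupby(level=0).count()     # level given by position

print(by_name)
```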


## 9.2 Data Aggregation

In [25]:
df

Unnamed: 0,data1,data2,key1,key2
0,0.167154,0.839215,a,one
1,-0.583971,-0.538031,a,two
2,-0.193537,-0.304562,b,one
3,0.865375,-0.747342,b,two
4,1.505119,-0.5076,a,one


In [26]:
grouped = df.groupby('key1')
# quantile computes sample quantiles of a Series or of a DataFrame's columns
grouped['data1'].quantile(0.9)

key1
a    1.237526
b    0.759484
Name: data1, dtype: float64

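Although quantile is not explicitly implemented for GroupBy, it is a Series method and is therefore available here. Internally, GroupBy efficiently slices up the Series, calls piece.quantile(0.9) on each piece, and then assembles the results into the final object.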
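
The slice, call, and assemble steps described above can be reproduced by hand. This is only a conceptual sketch on illustrative data (`demo` is an assumption), not a description of how pandas implements it internally.

```python
import pandas as pd

# Illustrative data only.
demo = pd.DataFrame({'key': ['a', 'a', 'b', 'b', 'a'],
                     'val': [1.0, 2.0, 3.0, 5.0, 6.0]})
grouped = demo.groupby('key')['val']

# Split into pieces, call piece.quantile(0.9) on each, and reassemble.
manual = pd.Series({name: piece.quantile(0.9) for name, piece in grouped})

# The one-line equivalent.
built_in = grouped.quantile(0.9)

print(manual)
```

With linear interpolation, group `a` (values 1, 2, 6) gives about 5.2 and group `b` (values 3, 5) gives about 4.8 in both versions.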

To use your own aggregation function, pass it to the aggregate or agg method:

In [27]:
def peak_to_peak(arr):
    return arr.max()-arr.min()

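# Select the numeric columns explicitly: peak_to_peak is undefined for the string column key2
grouped[['data1', 'data2']].agg(peak_to_peak)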

Unnamed: 0_level_0,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,2.08909,1.377245
b,1.058913,0.442781


Some methods, such as describe, also work here, even though strictly speaking they are not aggregations:

In [29]:
grouped.describe()

Unnamed: 0_level_0,data1,data1,data1,data1,data1,data1,data1,data1,data2,data2,data2,data2,data2,data2,data2,data2
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
key1,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
a,3.0,0.362767,1.058193,-0.583971,-0.208409,0.167154,0.836136,1.505119,3.0,-0.068805,0.786516,-0.538031,-0.522815,-0.5076,0.165807,0.839215
b,2.0,0.335919,0.748764,-0.193537,0.071191,0.335919,0.600647,0.865375,2.0,-0.525952,0.313093,-0.747342,-0.636647,-0.525952,-0.415257,-0.304562


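Table 9-1: Optimized groupby methods

| Function      | Description            |
|:------------- |:-------------|
| count    | Number of non-NA values in the group |
| sum     | Sum of non-NA values |
| mean    | Mean of non-NA values |
| median   | Arithmetic median of non-NA values |
| std/var  | Unbiased (n - 1 denominator) standard deviation and variance |
| min/max | Minimum and maximum of non-NA values |
| prod | Product of non-NA values |
| first/last | First and last non-NA values |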
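
A short sketch of the "non-NA" wording in Table 9-1, on illustrative data with missing values (`demo` is an assumption). It contrasts `count`, which skips NA, with `size`, which counts every row in the group.

```python
import numpy as np
import pandas as pd

# Illustrative data only.
demo = pd.DataFrame({'key': ['a', 'a', 'a', 'b', 'b'],
                     'val': [1.0, np.nan, 3.0, np.nan, 4.0]})
g = demo.groupby('key')['val']

summary = pd.DataFrame({
    'count': g.count(),  # non-NA values only
    'size': g.size(),    # all rows, NA included
    'sum': g.sum(),
    'mean': g.mean(),
    'first': g.first(),  # first non-NA value
    'last': g.last(),    # last non-NA value
})
print(summary)
```

Group `b` starts with a missing value, so its `first` is 4.0 rather than NaN, and its `count` is 1 even though its `size` is 2.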

In [30]:
tips = pd.read_csv('D:/python-dataset/tips.csv')
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [31]:
# Add a column holding the tip as a fraction of the total bill
tips['tip_pct'] = tips['tip']/tips['total_bill']
tips[:6]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_pct
0,16.99,1.01,Female,No,Sun,Dinner,2,0.059447
1,10.34,1.66,Male,No,Sun,Dinner,3,0.160542
2,21.01,3.5,Male,No,Sun,Dinner,3,0.166587
3,23.68,3.31,Male,No,Sun,Dinner,2,0.13978
4,24.59,3.61,Female,No,Sun,Dinner,4,0.146808
5,25.29,4.71,Male,No,Sun,Dinner,4,0.18624


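### 9.2.1 Column-Wise and Multiple Function Application
Aggregating a Series or the columns of a DataFrame comes down to calling aggregate with a custom function or calling a method such as mean or std. Sometimes, however, you need a different aggregation function for each column, or several functions at once.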

In [34]:
# Group tips by sex and smoker:
grouped = tips.groupby(['sex','smoker'])
grouped_pct = grouped['tip_pct']
grouped_pct.agg('mean')

sex     smoker
Female  No        0.156921
        Yes       0.182150
Male    No        0.160669
        Yes       0.152771
Name: tip_pct, dtype: float64

In [35]:
# Pass a list of functions or function names
grouped_pct.agg(['mean','std',peak_to_peak])

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,std,peak_to_peak
sex,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Female,No,0.156921,0.036421,0.195876
Female,Yes,0.18215,0.071595,0.360233
Male,No,0.160669,0.041849,0.220186
Male,Yes,0.152771,0.090588,0.674707


In [36]:
# Custom column names: pass a list of (name, function) tuples
grouped_pct.agg([('foo', 'mean'), ('bar', 'std')])

Unnamed: 0_level_0,Unnamed: 1_level_0,foo,bar
sex,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1
Female,No,0.156921,0.036421
Female,Yes,0.18215,0.071595
Male,No,0.160669,0.041849
Male,Yes,0.152771,0.090588


In [37]:
# Apply several functions to several columns
functions = ['count', 'mean', 'max']
# Select multiple columns with a list (double brackets)
result = grouped[['tip_pct', 'total_bill']].agg(functions)
result

Unnamed: 0_level_0,Unnamed: 1_level_0,tip_pct,tip_pct,tip_pct,total_bill,total_bill,total_bill
Unnamed: 0_level_1,Unnamed: 1_level_1,count,mean,max,count,mean,max
sex,smoker,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
Female,No,54,0.156921,0.252672,54,18.105185,35.83
Female,Yes,33,0.18215,0.416667,33,17.977879,44.3
Male,No,97,0.160669,0.29199,97,19.791237,48.33
Male,Yes,60,0.152771,0.710345,60,22.2845,50.81


In [38]:
result['tip_pct']

Unnamed: 0_level_0,Unnamed: 1_level_0,count,mean,max
sex,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Female,No,54,0.156921,0.252672
Female,Yes,33,0.18215,0.416667
Male,No,97,0.160669,0.29199
Male,Yes,60,0.152771,0.710345


In [39]:
ftuples = [('Durchschnitt', 'mean'), ('Abweichung', 'var')]
grouped[['tip_pct', 'total_bill']].agg(ftuples)

Unnamed: 0_level_0,Unnamed: 1_level_0,tip_pct,tip_pct,total_bill,total_bill
Unnamed: 0_level_1,Unnamed: 1_level_1,Durchschnitt,Abweichung,Durchschnitt,Abweichung
sex,smoker,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Female,No,0.156921,0.001327,18.105185,53.092422
Female,Yes,0.18215,0.005126,17.977879,84.451517
Male,No,0.160669,0.001751,19.791237,76.152961
Male,Yes,0.152771,0.008206,22.2845,98.244673


In [40]:
# Apply different functions to different columns
grouped.agg({'tip':np.max,'size':'sum'})

Unnamed: 0_level_0,Unnamed: 1_level_0,tip,size
sex,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1
Female,No,5.2,140
Female,Yes,6.5,74
Male,No,9.0,263
Male,Yes,10.0,150


In [41]:
grouped.agg({'tip_pct':['min','max','mean','std'],'size':'sum'})

Unnamed: 0_level_0,Unnamed: 1_level_0,tip_pct,tip_pct,tip_pct,tip_pct,size
Unnamed: 0_level_1,Unnamed: 1_level_1,min,max,mean,std,sum
sex,smoker,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Female,No,0.056797,0.252672,0.156921,0.036421,140
Female,Yes,0.056433,0.416667,0.18215,0.071595,74
Male,No,0.071804,0.29199,0.160669,0.041849,263
Male,Yes,0.035638,0.710345,0.152771,0.090588,150


### 9.2.2 Returning Aggregated Data in "Unindexed" Form

In [42]:
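# By default the group keys become the (hierarchical) index of the result;
# pass as_index=False to groupby to disable this.
# numeric_only=True skips the remaining string columns (day, time).
tips.groupby(['sex', 'smoker'], as_index=False).mean(numeric_only=True)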

Unnamed: 0,sex,smoker,total_bill,tip,size,tip_pct
0,Female,No,18.105185,2.773519,2.592593,0.156921
1,Female,Yes,17.977879,2.931515,2.242424,0.18215
2,Male,No,19.791237,3.113402,2.71134,0.160669
3,Male,Yes,22.2845,3.051167,2.5,0.152771


In [44]:
# Calling reset_index() on the result gives the same form
tips.groupby(['sex', 'smoker']).mean(numeric_only=True).reset_index()

Unnamed: 0,sex,smoker,total_bill,tip,size,tip_pct
0,Female,No,18.105185,2.773519,2.592593,0.156921
1,Female,Yes,17.977879,2.931515,2.242424,0.18215
2,Male,No,19.791237,3.113402,2.71134,0.160669
3,Male,Yes,22.2845,3.051167,2.5,0.152771
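
The equivalence of the two forms above can be verified on a tiny frame. The data below is an illustrative assumption (it is not the tips dataset); every non-key column is numeric, so `mean()` needs no extra arguments.

```python
import pandas as pd

# Illustrative data only.
demo = pd.DataFrame({'sex': ['F', 'F', 'M', 'M'],
                     'smoker': ['No', 'Yes', 'No', 'No'],
                     'tip': [1.0, 2.0, 3.0, 5.0]})

indexed = demo.groupby(['sex', 'smoker']).mean()               # keys in the index
flat = demo.groupby(['sex', 'smoker'], as_index=False).mean()  # keys as columns

print(flat)
print(flat.equals(indexed.reset_index()))
```

Using `as_index=False` skips the intermediate hierarchical index, which can be convenient when the result is passed straight to a merge or written to a file.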


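## 9.3 Group-Wise Operations and Transformations
Aggregation is only one kind of group operation; it is a special case of data transformation. This section introduces the transform and apply methods, which enable many other kinds of group operations.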

In [45]:
# To add a column to a DataFrame holding the group mean for each row, one approach is to aggregate and then merge
df

Unnamed: 0,data1,data2,key1,key2
0,0.167154,0.839215,a,one
1,-0.583971,-0.538031,a,two
2,-0.193537,-0.304562,b,one
3,0.865375,-0.747342,b,two
4,1.505119,-0.5076,a,one


In [46]:
# numeric_only=True skips the non-numeric key2 column
k1_means = df.groupby('key1').mean(numeric_only=True).add_prefix('mean_')
k1_means

Unnamed: 0_level_0,mean_data1,mean_data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,0.362767,-0.068805
b,0.335919,-0.525952


In [48]:
pd.merge(df,k1_means,left_on = 'key1',right_index=True)

Unnamed: 0,data1,data2,key1,key2,mean_data1,mean_data2
0,0.167154,0.839215,a,one,0.362767,-0.068805
1,-0.583971,-0.538031,a,two,0.362767,-0.068805
4,1.505119,-0.5076,a,one,0.362767,-0.068805
2,-0.193537,-0.304562,b,one,0.335919,-0.525952
3,0.865375,-0.747342,b,two,0.335919,-0.525952


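This works but is somewhat inflexible. The operation can be viewed as transforming the two data columns with the np.mean function. Returning to the people DataFrame, let's use the transform method on a GroupBy object: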

In [50]:
people

Unnamed: 0,a,b,c,d,e
Joe,-0.386099,-1.432621,0.660237,0.489086,0.40204
Steve,-0.332529,0.452466,0.506883,-0.128833,0.477774
Wes,-0.936195,,,1.390733,-0.085433
Jim,1.340927,0.055309,1.506467,-1.781724,-0.291343
Travis,-0.891619,-0.657688,1.65894,1.490788,-0.593959


In [51]:
key = ['one','two','one','two','one']
people.groupby(key).mean()

Unnamed: 0,a,b,c,d,e
one,-0.737971,-1.045155,1.159589,1.123536,-0.092451
two,0.504199,0.253887,1.006675,-0.955278,0.093216


In [52]:
people.groupby(key).transform(np.mean)

Unnamed: 0,a,b,c,d,e
Joe,-0.737971,-1.045155,1.159589,1.123536,-0.092451
Steve,0.504199,0.253887,1.006675,-0.955278,0.093216
Wes,-0.737971,-1.045155,1.159589,1.123536,-0.092451
Jim,0.504199,0.253887,1.006675,-0.955278,0.093216
Travis,-0.737971,-1.045155,1.159589,1.123536,-0.092451


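As you can see, transform applies a function to each group and then places the results in the appropriate locations. If each group produces a scalar value, that value is broadcast across the rows of the group.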
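
To illustrate the broadcasting, here is a sketch on illustrative data (`demo` is an assumption) that compares `transform` with the longer route of aggregating first and then mapping the group means back onto the rows.

```python
import pandas as pd

# Illustrative data only.
demo = pd.DataFrame({'key': ['one', 'two', 'one', 'two', 'one'],
                     'val': [1.0, 2.0, 3.0, 6.0, 5.0]})

group_means = demo.groupby('key')['val'].mean()            # one value per group
broadcast = demo.groupby('key')['val'].transform('mean')   # one value per row
by_hand = demo['key'].map(group_means)                     # aggregate, then map back

print(broadcast.tolist())  # [3.0, 4.0, 3.0, 4.0, 3.0]
```

The result of `transform` keeps the original row index, so it can be assigned directly as a new column without a merge.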

Now suppose we want to subtract the mean from each group. To do this, create a demeaning function and pass it to transform:

In [53]:
def demean(arr):
    return arr-arr.mean()

demeaned = people.groupby(key).transform(demean)
demeaned

Unnamed: 0,a,b,c,d,e
Joe,0.351872,-0.387466,-0.499352,-0.63445,0.494491
Steve,-0.836728,0.198578,-0.499792,0.826446,0.384558
Wes,-0.198224,,,0.267198,0.007017
Jim,0.836728,-0.198578,0.499792,-0.826446,-0.384558
Travis,-0.153648,0.387466,0.499352,0.367252,-0.501508


In [55]:
# Check that the group means of demeaned are now zero
demeaned.groupby(key).mean().round()

Unnamed: 0,a,b,c,d,e
one,0.0,0.0,0.0,0.0,0.0
two,0.0,0.0,0.0,0.0,0.0


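### 9.3.1 apply: General "Split-Apply-Combine"
Like aggregate, transform is a specialized function with strict requirements: the function passed to it must produce either a scalar value that can be broadcast (as np.mean does) or a result array of the same size as the group.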
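
The two permitted outcomes can be seen side by side in the following sketch on illustrative data (`values` and `key` are assumptions): one function returns a scalar per group, the other returns an array of the same size as the group.

```python
import pandas as pd

# Illustrative data only.
values = pd.Series([1.0, 2.0, 3.0, 6.0, 5.0])
key = ['one', 'two', 'one', 'two', 'one']
g = values.groupby(key)

# Case 1: a scalar per group, broadcast to every row of that group
scalar_case = g.transform(lambda x: x.max())

# Case 2: an array of the same size as the group, placed back row by row
same_size_case = g.transform(lambda x: x - x.mean())

print(scalar_case.tolist())     # [5.0, 6.0, 5.0, 6.0, 5.0]
print(same_size_case.tolist())  # [-2.0, -2.0, 0.0, 2.0, 2.0]
```

In both cases the output has the same length and index as the input.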

The most general-purpose GroupBy method is apply, which is the focus of what follows.