# Groupby operations

In [2]:
import pandas as pd
import numpy as np

When working with financial time-series data, we often run into long-format data:

In [3]:
# Wide format: one row per trade date, one column per stock code
codes = ["000001.SZ", "000002.SZ", "000003.SZ", "000004.SZ", "600000.SH",
         "600001.SH", "600002.SH", "600003.SH", "600004.SH", "600005.SH"]
df_wide = pd.DataFrame(np.random.randn(10, 10),
                       index=pd.date_range("2018-01-01", periods=10, name="trade_date"),
                       columns=pd.Index(codes, name="code"))
# Long format: one row per (trade_date, code) pair, values in a "return" column
df_long = df_wide.stack().rename("return").reset_index()
# Map each stock code to its industry: first five are finance, last five are IT
industry = {code: "finance" for code in codes[:5]}
industry.update({code: "IT" for code in codes[5:]})
# Insert the industry labels as the second column
df_long.insert(1, "industry", df_long["code"].map(industry))
df_long

| | trade_date | industry | code | return |
|---|---|---|---|---|
| 0 | 2018-01-01 | finance | 000001.SZ | 0.181138 |
| 1 | 2018-01-01 | finance | 000002.SZ | -0.620575 |
| 2 | 2018-01-01 | finance | 000003.SZ | 1.357720 |
| 3 | 2018-01-01 | finance | 000004.SZ | 0.578658 |
| 4 | 2018-01-01 | finance | 600000.SH | -1.202307 |
| 5 | 2018-01-01 | IT | 600001.SH | -0.420071 |
| 6 | 2018-01-01 | IT | 600002.SH | -1.475843 |
| 7 | 2018-01-01 | IT | 600003.SH | -0.875738 |
| 8 | 2018-01-01 | IT | 600004.SH | -1.763412 |
| 9 | 2018-01-01 | IT | 600005.SH | -0.364206 |


To compute grouped statistics on data in this shape, we use groupby.

For example, to compute the average return of each industry on each day:

In [4]:
# numeric_only=True skips the string column "code", which cannot be averaged
df_long.groupby(["trade_date", "industry"]).mean(numeric_only=True)

| trade_date | industry | return |
|---|---|---|
| 2018-01-01 | IT | -0.979854 |
| 2018-01-01 | finance | 0.058927 |
| 2018-01-02 | IT | 0.134783 |
| 2018-01-02 | finance | 0.056224 |
| 2018-01-03 | IT | 0.143521 |
| 2018-01-03 | finance | -0.354444 |
| 2018-01-04 | IT | -0.121574 |
| 2018-01-04 | finance | -0.221318 |
| 2018-01-05 | IT | -0.071957 |
| 2018-01-05 | finance | 0.369501 |
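
The grouped result is indexed by both trade_date and industry. As a minimal sketch, assuming a small hand-made table (the values and the names `df`, `daily_mean`, and `wide` are illustrative, not the random data used above), such a result could be reshaped so that each industry gets its own column:

```python
import pandas as pd

# Small hand-made long-format table (illustrative values only)
df = pd.DataFrame({
    "trade_date": pd.to_datetime(["2018-01-01"] * 4 + ["2018-01-02"] * 4),
    "industry": ["finance", "finance", "IT", "IT"] * 2,
    "code": ["000001.SZ", "000002.SZ", "600001.SH", "600002.SH"] * 2,
    "return": [0.01, 0.03, -0.02, 0.04, 0.02, 0.00, 0.05, -0.01],
})

# Select the numeric column explicitly so the string column "code" is not aggregated
daily_mean = df.groupby(["trade_date", "industry"])["return"].mean()

# Move the "industry" index level into the columns:
# one row per trade date, one column per industry
wide = daily_mean.unstack("industry")
print(wide)
```

Selecting `["return"]` before aggregating is one way to keep non-numeric columns out of the calculation; passing `numeric_only=True` to `mean` is another.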


For sums, use sum:

In [5]:
# numeric_only=True keeps the string column "code" out of the sum
df_long.groupby(["trade_date", "industry"]).sum(numeric_only=True)

| trade_date | industry | return |
|---|---|---|
| 2018-01-01 | IT | -4.89927 |
| 2018-01-01 | finance | 0.294634 |
| 2018-01-02 | IT | 0.673916 |
| 2018-01-02 | finance | 0.281121 |
| 2018-01-03 | IT | 0.717606 |
| 2018-01-03 | finance | -1.772222 |
| 2018-01-04 | IT | -0.607868 |
| 2018-01-04 | finance | -1.106592 |
| 2018-01-05 | IT | -0.359784 |
| 2018-01-05 | finance | 1.847504 |


These two are not the only functions supported; the basic statistical functions are all available:

![image.png](attachment:image.png)
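
As a rough sketch of using several of these statistics at once, the `agg` method accepts a list of function names. The table below is made up for illustration, and `df` and `stats` are hypothetical names:

```python
import pandas as pd

# Small hand-made long-format table (illustrative values only)
df = pd.DataFrame({
    "trade_date": pd.to_datetime(["2018-01-01"] * 4 + ["2018-01-02"] * 4),
    "industry": ["finance", "finance", "IT", "IT"] * 2,
    "code": ["000001.SZ", "000002.SZ", "600001.SH", "600002.SH"] * 2,
    "return": [0.01, 0.03, -0.02, 0.04, 0.02, 0.00, 0.05, -0.01],
})

# Compute several statistics of "return" per industry in one call;
# each name in the list becomes a column of the result
stats = df.groupby("industry")["return"].agg(["mean", "std", "min", "max", "count"])
print(stats)
```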

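groupby also supports the apply method, which lets you implement your own operations through a custom function.

One difficulty is that it is not obvious what the function receives as input and what it should return. Here we again use the small trick of adding print calls inside the function.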

In [6]:
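def myfun(x):
    # Print the type and content of what apply passes in for each group
    print(type(x), "\n")
    print(x, "\n\n\n")
    # Return the group's "return" column transposed into a single row
    return x.loc[:, ["return"]].T
df_long.groupby(["trade_date", "industry"]).apply(myfun)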

<class 'pandas.core.frame.DataFrame'> 

  trade_date industry       code    return
5 2018-01-01       IT  600001.SH -0.420071
6 2018-01-01       IT  600002.SH -1.475843
7 2018-01-01       IT  600003.SH -0.875738
8 2018-01-01       IT  600004.SH -1.763412
9 2018-01-01       IT  600005.SH -0.364206 



<class 'pandas.core.frame.DataFrame'> 

  trade_date industry       code    return
5 2018-01-01       IT  600001.SH -0.420071
6 2018-01-01       IT  600002.SH -1.475843
7 2018-01-01       IT  600003.SH -0.875738
8 2018-01-01       IT  600004.SH -1.763412
9 2018-01-01       IT  600005.SH -0.364206 



<class 'pandas.core.frame.DataFrame'> 

  trade_date industry       code    return
0 2018-01-01  finance  000001.SZ  0.181138
1 2018-01-01  finance  000002.SZ -0.620575
2 2018-01-01  finance  000003.SZ  1.357720
3 2018-01-01  finance  000004.SZ  0.578658
4 2018-01-01  finance  600000.SH -1.202307 



<class 'pandas.core.frame.DataFrame'> 

   trade_date industry       code    return
15 2018-01

| trade_date | industry | | 90 | 91 | 92 | 93 | 94 |
|---|---|---|---|---|---|---|---|
| 2018-01-01 | IT | return | -0.420071 | -1.475843 | -0.875738 | -1.763412 | -0.364206 |
| 2018-01-01 | finance | return | 0.181138 | -0.620575 | 1.35772 | 0.578658 | -1.202307 |
| 2018-01-02 | IT | return | 1.886091 | -0.254063 | -0.312342 | 0.05898 | -0.704751 |
| 2018-01-02 | finance | return | -0.215441 | 1.878217 | 0.885557 | -1.40874 | -0.858472 |
| 2018-01-03 | IT | return | 1.521506 | 0.501957 | 0.474788 | -0.099465 | -1.681179 |
| 2018-01-03 | finance | return | -0.410214 | 0.407675 | -0.517529 | -0.706687 | -0.545466 |
| 2018-01-04 | IT | return | 2.026451 | -0.184095 | -1.304236 | -0.393476 | -0.752512 |
| 2018-01-04 | finance | return | 0.216167 | -0.082542 | 0.462316 | -1.269418 | -0.433113 |
| 2018-01-05 | IT | return | -1.016844 | 0.280763 | -0.186838 | 0.42905 | 0.134085 |
| 2018-01-05 | finance | return | -1.150968 | 1.133461 | 1.504086 | 0.288765 | 0.072161 |
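
Besides printing inside the function, the groups can also be inspected directly. The following is a minimal sketch on a made-up table (the values and the names `df`, `grouped`, and `one` are illustrative assumptions): iterating over the groupby object yields each key together with its sub-DataFrame, and `get_group` fetches a single group by key.

```python
import pandas as pd

# Small hand-made long-format table (illustrative values only)
df = pd.DataFrame({
    "trade_date": pd.to_datetime(["2018-01-01"] * 4 + ["2018-01-02"] * 4),
    "industry": ["finance", "finance", "IT", "IT"] * 2,
    "code": ["000001.SZ", "000002.SZ", "600001.SH", "600002.SH"] * 2,
    "return": [0.01, 0.03, -0.02, 0.04, 0.02, 0.00, 0.05, -0.01],
})

grouped = df.groupby(["trade_date", "industry"])

# Iterating yields (key, sub-DataFrame) pairs; with two grouping columns
# the key is a tuple of (trade_date, industry)
for key, sub in grouped:
    print(key, type(sub).__name__, len(sub))

# Fetch one specific group by its key
one = grouped.get_group((pd.Timestamp("2018-01-01"), "IT"))
print(one)
```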


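This shows that groupby essentially splits a DataFrame into several smaller DataFrames according to the given columns, passes each small DataFrame to the function given to apply, and then combines the results. In other words: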


### The process of grouping, computing, and aggregating
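
As a rough illustration of these three steps, the sketch below performs them by hand on a small made-up table and compares the result with the built-in groupby. The table values and the names `pieces`, `manual`, and `builtin` are illustrative assumptions, and this is not a description of how pandas implements groupby internally.

```python
import pandas as pd

# Small hand-made long-format table (illustrative values only)
df = pd.DataFrame({
    "trade_date": pd.to_datetime(["2018-01-01"] * 4 + ["2018-01-02"] * 4),
    "industry": ["finance", "finance", "IT", "IT"] * 2,
    "code": ["000001.SZ", "000002.SZ", "600001.SH", "600002.SH"] * 2,
    "return": [0.01, 0.03, -0.02, 0.04, 0.02, 0.00, 0.05, -0.01],
})

pieces = {}
# Step 1, group: iterate over the sub-DataFrames, one per industry
for key, sub in df.groupby("industry"):
    # Step 2, compute: apply the calculation to each sub-DataFrame
    pieces[key] = sub["return"].mean()

# Step 3, aggregate: combine the per-group results into one Series
manual = pd.Series(pieces, name="return").rename_axis("industry")

# The built-in aggregation performs the same three steps in one call
builtin = df.groupby("industry")["return"].mean()
print(manual)
print(builtin)
```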