# Groupby Operations

In [2]:
import pandas as pd
import numpy as np

When working with financial time-series data, we often encounter data in long format:

In [3]:
# Wide table: one row per trading day, one column per stock code
df_wide = pd.DataFrame(np.random.randn(10, 10),
                  index=pd.date_range("2018-01-01", periods=10, name="trade_date"),
                  columns=pd.Index(["000001.SZ", "000002.SZ", "000003.SZ", "000004.SZ", "600000.SH", "600001.SH", "600002.SH", "600003.SH", "600004.SH", "600005.SH"], name="code"))
# Stack the stock columns into rows to get the long format
df_long = df_wide.stack(-1).reset_index()
df_long.rename({0: "return"}, axis=1, inplace=True)
# Map each stock code to an industry label and insert it as the second column
ser = df_long["code"].map({"000001.SZ": "finance", "000002.SZ": "finance",
                           "000003.SZ": "finance", "000004.SZ": "finance",
                           "600000.SH": "finance", "600001.SH": "IT",
                           "600002.SH": "IT", "600003.SH": "IT",
                           "600004.SH": "IT", "600005.SH": "IT"})
df_long.insert(1, "industry", ser)
df_long

,trade_date,industry,code,return
0,2018-01-01,finance,000001.SZ,-1.021166
1,2018-01-01,finance,000002.SZ,-1.402780
2,2018-01-01,finance,000003.SZ,-0.018900
3,2018-01-01,finance,000004.SZ,0.969914
4,2018-01-01,finance,600000.SH,-0.325153
5,2018-01-01,IT,600001.SH,-0.488420
6,2018-01-01,IT,600002.SH,-0.820296
7,2018-01-01,IT,600003.SH,0.002147
8,2018-01-01,IT,600004.SH,-0.993037
9,2018-01-01,IT,600005.SH,-2.055199
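
To make the wide/long distinction concrete, here is a minimal sketch on a small hand-written table. The names `wide`, `long_df` and `wide_again` and the return values are hypothetical, and `melt` is used here in place of the `stack` call above as one alternative way to reach the same long layout.

```python
import pandas as pd

# Hypothetical wide table: one row per date, one column per stock code.
wide = pd.DataFrame(
    {"000001.SZ": [0.01, -0.02], "600000.SH": [0.03, 0.00]},
    index=pd.date_range("2018-01-01", periods=2, name="trade_date"),
)

# Wide -> long: every (date, code) pair becomes its own row.
long_df = wide.reset_index().melt(
    id_vars="trade_date", var_name="code", value_name="return"
)

# Long -> wide: pivot restores one column per stock code.
wide_again = long_df.pivot(index="trade_date", columns="code", values="return")

print(long_df)
print(wide_again)
```

The long table has one row per observation, which is the layout that grouped statistics operate on most naturally.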


To compute grouped statistics on data of this kind, we need groupby.

For example, to compute the average return of each industry on each day:

In [4]:
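# numeric_only=True skips the string column "code"; recent pandas versions no longer drop it silently
df_long.groupby(["trade_date", "industry"]).mean(numeric_only=True)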

trade_date,industry,return
2018-01-01,IT,-0.870961
2018-01-01,finance,-0.359617
2018-01-02,IT,-1.296898
2018-01-02,finance,0.037362
2018-01-03,IT,0.883723
2018-01-03,finance,0.461136
2018-01-04,IT,0.235001
2018-01-04,finance,0.587093
2018-01-05,IT,0.250471
2018-01-05,finance,-0.757273
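
Because the data above is random, the result is hard to check by eye. The following sketch repeats the same grouping on a tiny hand-written table (`toy` and its values are hypothetical) so that the group means can be verified by hand. It selects the `return` column before aggregating, which is one way to keep the non-numeric `code` column out of the calculation.

```python
import pandas as pd

# Hypothetical long-format data: two stocks per industry on a single day.
toy = pd.DataFrame({
    "trade_date": ["2018-01-01"] * 4,
    "industry": ["finance", "finance", "IT", "IT"],
    "code": ["000001.SZ", "000002.SZ", "600001.SH", "600002.SH"],
    "return": [0.02, 0.04, -0.01, 0.03],
})

# Split by (trade_date, industry), then average "return" within each group.
daily_mean = toy.groupby(["trade_date", "industry"])["return"].mean()
print(daily_mean)
```

By hand, the finance mean is (0.02 + 0.04) / 2 and the IT mean is (-0.01 + 0.03) / 2, which is what the grouped result should reproduce up to floating-point rounding.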


To sum instead, use sum:

In [5]:
# numeric_only=True keeps the string column "code" out of the sum
df_long.groupby(["trade_date", "industry"]).sum(numeric_only=True)

trade_date,industry,return
2018-01-01,IT,-4.354805
2018-01-01,finance,-1.798086
2018-01-02,IT,-6.48449
2018-01-02,finance,0.186808
2018-01-03,IT,4.418616
2018-01-03,finance,2.305678
2018-01-04,IT,1.175003
2018-01-04,finance,2.935464
2018-01-05,IT,1.252354
2018-01-05,finance,-3.786367
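
The results above carry the group keys in a two-level index. If a flat table with ordinary columns is preferred, passing `as_index=False` is one option. The sketch below uses a hypothetical `toy` table, not the data from this notebook.

```python
import pandas as pd

# Hypothetical long-format data: two stocks per industry on a single day.
toy = pd.DataFrame({
    "trade_date": ["2018-01-01"] * 4,
    "industry": ["finance", "finance", "IT", "IT"],
    "code": ["000001.SZ", "000002.SZ", "600001.SH", "600002.SH"],
    "return": [0.02, 0.04, -0.01, 0.03],
})

# as_index=False keeps trade_date and industry as regular columns
# instead of moving them into a MultiIndex.
totals = toy.groupby(["trade_date", "industry"], as_index=False)["return"].sum()
print(totals)
```

Calling `reset_index()` on an indexed result should give an equivalent table.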


These are not the only two functions supported, of course; all the basic statistical functions are available.

![image.png](attachment:image.png)
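
As an illustration, several of these statistics can be requested in one call with `agg`. This is a minimal sketch on a hypothetical `toy` table; the list of functions is only a sample, not a complete inventory of what groupby supports.

```python
import pandas as pd

# Hypothetical long-format data: two stocks per industry on a single day.
toy = pd.DataFrame({
    "trade_date": ["2018-01-01"] * 4,
    "industry": ["finance", "finance", "IT", "IT"],
    "code": ["000001.SZ", "000002.SZ", "600001.SH", "600002.SH"],
    "return": [0.02, 0.04, -0.01, 0.03],
})

# One output column per requested statistic, one row per industry.
stats = toy.groupby("industry")["return"].agg(["mean", "std", "min", "max", "count"])
print(stats)
```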

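Likewise, groupby supports the apply method, which lets you design your own operations through a custom function.

One difficulty is that it is not obvious what the function's input and output should be. Here we again use the small trick of adding print calls inside the function.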

In [7]:
def myfun(x):
    print(type(x), "\n")
    print(x, "\n\n\n")
    return x.loc[:, ["return"]].T
df_long.groupby(["trade_date", "industry"]).apply(myfun)

<class 'pandas.core.frame.DataFrame'> 

  trade_date industry       code    return
5 2018-01-01       IT  600001.SH -0.488420
6 2018-01-01       IT  600002.SH -0.820296
7 2018-01-01       IT  600003.SH  0.002147
8 2018-01-01       IT  600004.SH -0.993037
9 2018-01-01       IT  600005.SH -2.055199 



<class 'pandas.core.frame.DataFrame'> 

  trade_date industry       code    return
5 2018-01-01       IT  600001.SH -0.488420
6 2018-01-01       IT  600002.SH -0.820296
7 2018-01-01       IT  600003.SH  0.002147
8 2018-01-01       IT  600004.SH -0.993037
9 2018-01-01       IT  600005.SH -2.055199 



<class 'pandas.core.frame.DataFrame'> 

  trade_date industry       code    return
0 2018-01-01  finance  000001.SZ -1.021166
1 2018-01-01  finance  000002.SZ -1.402780
2 2018-01-01  finance  000003.SZ -0.018900
3 2018-01-01  finance  000004.SZ  0.969914
4 2018-01-01  finance  600000.SH -0.325153 



<class 'pandas.core.frame.DataFrame'> 

   trade_date industry       code    return
15 2018-01

trade_date,industry,,90,91,92,93,94
2018-01-01,IT,return,-0.48842,-0.820296,0.002147,-0.993037,-2.055199
2018-01-01,finance,return,-1.021166,-1.40278,-0.0189,0.969914,-0.325153
2018-01-02,IT,return,-0.567409,-0.076138,-1.765368,-1.500179,-2.575396
2018-01-02,finance,return,-1.007916,-0.749957,2.404985,-0.853619,0.393316
2018-01-03,IT,return,0.194626,1.634896,-0.362762,0.095297,2.85656
2018-01-03,finance,return,-0.16125,-0.421908,1.349923,0.193758,1.345156
2018-01-04,IT,return,-0.46524,1.498868,0.126732,-0.344038,0.358681
2018-01-04,finance,return,0.020973,0.925955,1.588451,0.075144,0.32494
2018-01-05,IT,return,-0.343281,0.931851,-0.359196,0.735676,0.287305
2018-01-05,finance,return,-1.995079,-0.757608,-2.089613,0.915938,0.139995
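
The printed output shows that the function receives one sub-DataFrame per group. The first group is printed twice, which is probably because older pandas releases could call the function twice on the first group to choose an execution path. A custom function does not have to return a DataFrame; the sketch below returns a single value per group. The table `toy` and the function `best_stock` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical long-format data: two stocks per industry on a single day.
toy = pd.DataFrame({
    "trade_date": ["2018-01-01"] * 4,
    "industry": ["finance", "finance", "IT", "IT"],
    "code": ["000001.SZ", "000002.SZ", "600001.SH", "600002.SH"],
    "return": [0.02, 0.04, -0.01, 0.03],
})

def best_stock(group):
    # group is the sub-DataFrame for one industry (columns: code, return).
    # Return the code of the row with the highest return.
    return group.loc[group["return"].idxmax(), "code"]

# Selecting the columns first passes only those columns to the function.
best = toy.groupby("industry")[["code", "return"]].apply(best_stock)
print(best)
```

When the function returns a scalar, apply collects the values into a Series indexed by the group keys.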


# As we can see, the original