# Example goal:
1. Use the groupby method to implement the Split-Apply-Combine strategy from data science

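# Key points:

1. Groupby can group on several columns at once and run computations within each group
2. Split: break a large dataset into smaller datasets that can be computed independently
3. Apply: run the computation on each small dataset independently
4. Combine: merge the results from the small datasets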
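
The three steps above can be written out by hand to see what groupby does internally. This is a minimal sketch on a small made-up table (the `team` and `points` columns are hypothetical and are not part of this notebook's data), compared against the one-line groupby call.

```python
import pandas as pd

# Small illustrative table (hypothetical data, not the notebook's score_df)
df = pd.DataFrame({
    'team': ['a', 'a', 'b', 'b', 'b'],
    'points': [10, 20, 1, 2, 3],
})

# Split: one sub-DataFrame per distinct value of 'team'
pieces = {key: part for key, part in df.groupby('team')}

# Apply: compute the mean of 'points' independently on each piece
means = {key: part['points'].mean() for key, part in pieces.items()}

# Combine: gather the per-group results into a single Series
manual = pd.Series(means, name='points').rename_axis('team')

# groupby performs all three steps in one call
builtin = df.groupby('team')['points'].mean()

print(manual)
print(builtin)
```

Both Series should hold the same per-team means; the manual version only makes each stage visible.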

Groupby https://zhuanlan.zhihu.com/p/101284491

In [39]:
import pandas as pd

In [40]:
score_df = pd.DataFrame([[1,50,80,70,'boy'], 
              [2,60,45,50,'boy'],
              [3,98,43,55,'boy'],
              [4,70,69,89,'boy'],
              [5,56,79,60,'girl'],
              [6,60,68,55,'girl'],
              [7,45,70,77,'girl'],
              [8,55,77,76,'girl'],
              [9,25,57,60,'girl'],
              [10,88,40,43,'girl']],columns=['student_id','math_score','english_score','chinese_score','sex'])
score_df = score_df.set_index('student_id')
score_df

student_id,math_score,english_score,chinese_score,sex
1,50,80,70,boy
2,60,45,50,boy
3,98,43,55,boy
4,70,69,89,boy
5,56,79,60,girl
6,60,68,55,girl
7,45,70,77,girl
8,55,77,76,girl
9,25,57,60,girl
10,88,40,43,girl


**Split the data with boolean indexing, then take the mean**

In [41]:
# Boolean masks select the rows for each sex
boy_score_df = score_df.loc[score_df.sex=='boy']
girl_score_df = score_df.loc[score_df.sex=='girl']

print(boy_score_df)
print(girl_score_df)
# score_df.sex and score_df['sex'] are equivalent ways to access the column
print(boy_score_df.loc[score_df.sex=='boy', ['math_score']])
print(boy_score_df.loc[score_df['sex']=='boy', ['math_score']])

            math_score  english_score  chinese_score  sex
student_id                                               
1                   50             80             70  boy
2                   60             45             50  boy
3                   98             43             55  boy
4                   70             69             89  boy
            math_score  english_score  chinese_score   sex
student_id                                                
5                   56             79             60  girl
6                   60             68             55  girl
7                   45             70             77  girl
8                   55             77             76  girl
9                   25             57             60  girl
10                  88             40             43  girl
            math_score
student_id            
1                   50
2                   60
3                   98
4                   70
            math_score
student_id         

In [42]:
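# numeric_only=True skips the string column 'sex'; pandas 2.0+ raises a TypeError without it
print(boy_score_df.mean(numeric_only=True))
print(girl_score_df.mean(numeric_only=True))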

math_score       69.50
english_score    59.25
chinese_score    66.00
dtype: float64
math_score       54.833333
english_score    65.166667
chinese_score    61.833333
dtype: float64


**Use the groupby method to take the mean directly**

In [43]:
score_df.groupby('sex').mean()

sex,math_score,english_score,chinese_score
boy,69.5,59.25,66.0
girl,54.833333,65.166667,61.833333


In [44]:
score_df.groupby(['sex']).agg(['mean'])

,math_score,english_score,chinese_score
sex,mean,mean,mean
boy,69.5,59.25,66.0
girl,54.833333,65.166667,61.833333
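
The two cells above return the same numbers, but the column layout differs: `.mean()` keeps single-level columns, while `.agg(['mean'])` adds a second level holding the function name. The sketch below shows the difference on a four-student subset of the scores (the subset is chosen here only to keep the example short).

```python
import pandas as pd

# A four-student subset of the scores used in this notebook
df = pd.DataFrame({
    'math_score': [50, 60, 56, 60],
    'english_score': [80, 45, 79, 68],
    'sex': ['boy', 'boy', 'girl', 'girl'],
})

flat = df.groupby('sex').mean()            # single-level columns
nested = df.groupby('sex').agg(['mean'])   # (column, function) MultiIndex columns

print(list(flat.columns))
print(list(nested.columns))

# Select one statistic from the nested result with a tuple key
math_mean = nested[('math_score', 'mean')]

# Dropping the function level recovers the flat layout
same_as_flat = nested.droplevel(1, axis=1)
```

The list form of `agg` is mainly useful once more than one function is requested; for a single statistic, the flat result is usually easier to work with.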


In [45]:
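# Manual equivalent of groupby: filter by each distinct value, then average the numeric columns
for sex in set(score_df['sex']):
    print(sex,': \n',score_df[score_df['sex']==sex].mean(numeric_only=True),'\n')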

girl : 
 math_score       54.833333
english_score    65.166667
chinese_score    61.833333
dtype: float64 

boy : 
 math_score       69.50
english_score    59.25
chinese_score    66.00
dtype: float64 



In [46]:
for sex in set(score_df['sex']):
    print(sex,': \n',score_df[score_df['sex']==sex]['math_score'].mean(),'\n')

girl : 
 54.833333333333336 

boy : 
 69.5 
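
The loop above computes one group at a time, and because it iterates over a `set`, the order of the groups is not guaranteed. As a sketch of the groupby equivalent, selecting a column after `groupby` returns the same per-group means in a single Series with sorted group keys. The DataFrame here is rebuilt with only the two columns needed so that the example runs on its own.

```python
import pandas as pd

# Math scores and sex for the ten students in this notebook
df = pd.DataFrame({
    'math_score': [50, 60, 98, 70, 56, 60, 45, 55, 25, 88],
    'sex': ['boy'] * 4 + ['girl'] * 6,
})

# Group by sex, pick one column, then aggregate
math_mean_by_sex = df.groupby('sex')['math_score'].mean()
print(math_mean_by_sex)

# Individual groups are read by label
boy_mean = math_mean_by_sex['boy']
girl_mean = math_mean_by_sex['girl']
```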



**Add a class column**

In [47]:
score_df['class'] = [1,2,1,2,1,2,1,2,1,2]
score_df

student_id,math_score,english_score,chinese_score,sex,class
1,50,80,70,boy,1
2,60,45,50,boy,2
3,98,43,55,boy,1
4,70,69,89,boy,2
5,56,79,60,girl,1
6,60,68,55,girl,2
7,45,70,77,girl,1
8,55,77,76,girl,2
9,25,57,60,girl,1
10,88,40,43,girl,2


In [48]:
score_df.groupby(['sex','class']).mean()

sex,class,math_score,english_score,chinese_score
boy,1,74.0,61.5,62.5
boy,2,65.0,57.0,69.5
girl,1,42.0,68.666667,65.666667
girl,2,67.666667,61.666667,58.0
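
Grouping on two columns produces a result indexed by (sex, class) pairs. The sketch below shows a few common ways to read such a result: a tuple lookup with `.loc`, `unstack` to pivot one key into columns, and `reset_index` to turn the keys back into ordinary columns. Only the math scores are rebuilt here to keep the example self-contained.

```python
import pandas as pd

# Math scores, sex, and class for the ten students in this notebook
df = pd.DataFrame({
    'math_score': [50, 60, 98, 70, 56, 60, 45, 55, 25, 88],
    'sex': ['boy'] * 4 + ['girl'] * 6,
    'class': [1, 2] * 5,
})

grouped = df.groupby(['sex', 'class'])['math_score'].mean()

# Look up one (sex, class) combination with a tuple
girl_class1 = grouped.loc[('girl', 1)]

# Pivot the 'class' level into columns: rows are sex, columns are class
wide = grouped.unstack('class')

# Turn the group keys back into regular columns
flat = grouped.reset_index()

print(wide)
print(flat)
```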


**Groupby can also group by multiple columns**

In [49]:
score_df

student_id,math_score,english_score,chinese_score,sex,class
1,50,80,70,boy,1
2,60,45,50,boy,2
3,98,43,55,boy,1
4,70,69,89,boy,2
5,56,79,60,girl,1
6,60,68,55,girl,2
7,45,70,77,girl,1
8,55,77,76,girl,2
9,25,57,60,girl,1
10,88,40,43,girl,2


**Groupby can also apply multiple aggregations to each column**

In [50]:
score_df.groupby(['sex']).agg(['mean','std'])

,math_score,math_score,english_score,english_score,chinese_score,chinese_score,class,class
sex,mean,std,mean,std,mean,std,mean,std
boy,69.5,20.680103,59.25,18.191115,66.0,17.530925,1.5,0.57735
girl,54.833333,20.566153,65.166667,14.579666,61.833333,12.952477,1.5,0.547723
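
Note that the result above also aggregates `class`, because it is a numeric column that is not used as a group key; the mean and standard deviation of a class label are rarely meaningful. One way to avoid this, sketched below, is to select the score columns before calling `agg`. The `score_cols` list is a name introduced here for illustration.

```python
import pandas as pd

# Math scores, sex, and class for the ten students in this notebook
df = pd.DataFrame({
    'math_score': [50, 60, 98, 70, 56, 60, 45, 55, 25, 88],
    'sex': ['boy'] * 4 + ['girl'] * 6,
    'class': [1, 2] * 5,
})

# Hypothetical helper list: the columns that should be aggregated
score_cols = ['math_score']

# Selecting columns after groupby keeps 'class' out of the result
stats = df.groupby('sex')[score_cols].agg(['mean', 'std'])
print(stats)

boy_math_mean = stats.loc['boy', ('math_score', 'mean')]
boy_math_std = stats.loc['boy', ('math_score', 'std')]
```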


**Groupby can also group by multiple columns and apply multiple aggregations at once**

In [51]:
score_df.groupby(['sex','class']).agg(['mean','max'])

,,math_score,math_score,english_score,english_score,chinese_score,chinese_score
sex,class,mean,max,mean,max,mean,max
boy,1,74.0,98,61.5,80,62.5,70
boy,2,65.0,70,57.0,69,69.5,89
girl,1,42.0,56,68.666667,79,65.666667,77
girl,2,67.666667,88,61.666667,77,58.0,76
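
Passing a list to `agg` applies every function to every column. When different columns need different statistics, pandas named aggregation (available since pandas 0.25) is one option: each keyword becomes an output column, mapped to a (source column, function) pair. The output names below (`math_mean`, `math_max`, `english_min`) are chosen here for illustration.

```python
import pandas as pd

# Math and English scores, sex, and class for the ten students in this notebook
df = pd.DataFrame({
    'math_score': [50, 60, 98, 70, 56, 60, 45, 55, 25, 88],
    'english_score': [80, 45, 43, 69, 79, 68, 70, 77, 57, 40],
    'sex': ['boy'] * 4 + ['girl'] * 6,
    'class': [1, 2] * 5,
})

# Each keyword names an output column: (source column, aggregation function)
summary = df.groupby(['sex', 'class']).agg(
    math_mean=('math_score', 'mean'),
    math_max=('math_score', 'max'),
    english_min=('english_score', 'min'),
)
print(summary)
```

The result has flat, descriptive column names instead of a two-level column index, which can be easier to export or merge.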
