## CMPINF 2100 Week 05

### SPLIT-APPLY-COMBINE or GROUPBY and AGGREGATE (summarize)

We have learned how to summarize data in Pandas. But we have summarized INDIVIDUAL columns ignoring all other columns!

Let's now explore how summary stats of one column CHANGE or VARY across the categories of another column!

Or...is the average different for a different group?

## Import Modules

In [1]:
import numpy as np
import pandas as pd

## Read data

Continue working with our JOINED data set.

In [2]:
df = pd.read_csv( 'joined_data.csv' )

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       12 non-null     object 
 1   B       12 non-null     float64
 2   C       12 non-null     float64
 3   D       12 non-null     object 
 4   E       12 non-null     object 
 5   F       14 non-null     int64  
 6   G       9 non-null      float64
 7   H       14 non-null     object 
dtypes: float64(3), int64(1), object(4)
memory usage: 1.0+ KB


## Review

We know how to summarize individual columns!

In [4]:
df.nunique()

A    12
B    12
C    12
D    12
E     4
F     4
G     3
H     4
dtype: int64

In [5]:
df.dtypes

A     object
B    float64
C    float64
D     object
E     object
F      int64
G    float64
H     object
dtype: object

In [6]:
df.E.value_counts()

aa    3
bb    3
cc    3
dd    3
Name: E, dtype: int64

In [7]:
df.E.value_counts(dropna=False)

aa     3
bb     3
cc     3
dd     3
NaN    2
Name: E, dtype: int64

In [8]:
df.B.mean()

5.5

In [9]:
df.C.mean()

-650.0

But what is the AVERAGE of `B` for each unique value of `E`?

We cannot answer this question by simply applying the same summary methods!

We need to do something else to support our exploration!

## Split-Apply-Combine

Split means we DIVIDE or BREAK the data into distinct and separate groups!

We must partition the data set into the unique categories of a **GROUPING VARIABLE**.

We need to know the unique values of the grouping variable.

In [11]:
df.E.unique()

array(['aa', 'bb', 'cc', 'dd', nan], dtype=object)

We must split `df` into smaller data sets. Each data set will have 1 and only 1 value of `E`.

In [12]:
df_E_aa = df.loc[ df.E == 'aa', : ].copy()

In [13]:
df_E_aa.shape

(3, 8)

In [14]:
df_E_aa

,A,B,C,D,E,F,G,H
0,a,0.0,-100.0,Jan,aa,10,100.0,AAA
1,b,1.0,-200.0,Feb,aa,20,100.0,BBB
2,c,2.0,-300.0,Mar,aa,10,100.0,AAA


In [15]:
df_E_aa.nunique()

A    3
B    3
C    3
D    3
E    1
F    2
G    1
H    2
dtype: int64

In [16]:
type( df_E_aa )

pandas.core.frame.DataFrame

We can apply all Pandas methods to our smaller DataFrame!

In [18]:
df_E_aa.B.mean()

1.0

In [19]:
df.B.mean()

5.5

In [20]:
df

,A,B,C,D,E,F,G,H
0,a,0.0,-100.0,Jan,aa,10,100.0,AAA
1,b,1.0,-200.0,Feb,aa,20,100.0,BBB
2,c,2.0,-300.0,Mar,aa,10,100.0,AAA
3,d,3.0,-400.0,Apr,bb,20,200.0,BBB
4,e,4.0,-500.0,May,bb,10,200.0,AAA
5,f,5.0,-600.0,Jun,bb,20,200.0,BBB
6,g,6.0,-700.0,Jul,cc,10,,AAA
7,h,7.0,-800.0,Aug,cc,20,,BBB
8,i,8.0,-900.0,Sep,cc,10,,AAA
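9,j,9.0,-1000.0,Oct,dd,20,400.0,BBB
10,k,10.0,-1100.0,Nov,dd,10,400.0,AAA
11,l,11.0,-1200.0,Dec,dd,20,400.0,BBB
12,,,,,,30,,CCC
13,,,,,,40,,DDD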


We need to repeat the SPLITTING process for each unique value of `E`.

In [21]:
df_E_bb = df.loc[ df.E == 'bb', : ].copy()

In [22]:
df_E_cc = df.loc[ df.E == 'cc', : ].copy()

In [23]:
df_E_dd = df.loc[ df.E == 'dd', : ].copy()

We also need to SPLIT on the MISSING values of `E`.

In [24]:
df_E_na = df.loc[ df.E.isna(), :].copy()

In [25]:
df_E_na

,A,B,C,D,E,F,G,H
12,,,,,,30,,CCC
13,,,,,,40,,DDD


We need to APPLY the summary method to `B` within each SPLIT smaller data set!

In [26]:
df_E_bb.B.mean()

4.0

In [27]:
df_E_cc.B.mean()

7.0

In [28]:
df_E_dd.B.mean()

10.0

In [29]:
df_E_dd

,A,B,C,D,E,F,G,H
9,j,9.0,-1000.0,Oct,dd,20,400.0,BBB
10,k,10.0,-1100.0,Nov,dd,10,400.0,AAA
11,l,11.0,-1200.0,Dec,dd,20,400.0,BBB


In [30]:
df_E_bb

,A,B,C,D,E,F,G,H
3,d,3.0,-400.0,Apr,bb,20,200.0,BBB
4,e,4.0,-500.0,May,bb,10,200.0,AAA
5,f,5.0,-600.0,Jun,bb,20,200.0,BBB


In [32]:
df_E_na.B.mean()

nan

But, we do NOT want to just let these values stay as stray numbers in our notebook.

We want to COLLECT or COMBINE the summary statistics PER GROUP into a new DataFrame!!!

SPLIT-APPLY-COMBINE breaks a data set into groups based on the categories of a grouping variable, applies methods to summarize variables within each smaller data set, and then combines the summary statistics per group into a new, easy-to-use DataFrame!

In [33]:
df_E_summary = pd.DataFrame({'E': df.E.unique(),
                             'B_avg': [df_E_aa.B.mean(), df_E_bb.B.mean(), df_E_cc.B.mean(), df_E_dd.B.mean(), df_E_na.B.mean()]})

In [34]:
df_E_summary

,E,B_avg
0,aa,1.0
1,bb,4.0
2,cc,7.0
3,dd,10.0
4,,
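The manual steps above can also be written as a loop. The following is only a minimal sketch on a small made-up DataFrame (`toy` is hypothetical, not the course data). It shows the SPLIT, APPLY, and COMBINE stages explicitly, including the special handling that the MISSING group needs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a grouping column with one missing value
toy = pd.DataFrame({'E': ['aa', 'aa', 'bb', 'bb', np.nan],
                    'B': [0.0, 2.0, 3.0, 5.0, np.nan]})

rows = []
for level in toy.E.unique():
    # SPLIT: the missing group needs .isna() because NaN == NaN is False
    if pd.isna(level):
        piece = toy.loc[toy.E.isna(), :]
    else:
        piece = toy.loc[toy.E == level, :]
    # APPLY: summarize B within the smaller data set
    rows.append({'E': level, 'B_avg': piece.B.mean()})

# COMBINE: one row per group in a new DataFrame
toy_summary = pd.DataFrame(rows)
print(toy_summary)
```

Building each row inside the loop keeps the group label and its summary together, so the result does not depend on typing the groups in the same order as `.unique()` returns them.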


Pandas has a method to manage SPLIT-APPLY-COMBINE for us!!!!

The `.groupby()` method will divide the dataset into smaller groups based on the categories of a grouping variable!!!

In [35]:
df.groupby('E').B.mean()

E
aa     1.0
bb     4.0
cc     7.0
dd    10.0
Name: B, dtype: float64

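By default `.groupby()` DROPS the rows where the grouping variable is MISSING. We can force `.groupby()` to INCLUDE the MISSING group by setting `dropna=False`!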

In [36]:
df.groupby('E', dropna=False).B.mean()

E
aa      1.0
bb      4.0
cc      7.0
dd     10.0
NaN     NaN
Name: B, dtype: float64
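Why does the MISSING group matter if its average is `NaN` anyway? The rows in that group can still hold values in OTHER columns. Here is a small sketch with a made-up DataFrame (`toy` and its values are assumptions for illustration only) that compares the two settings.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: E is missing for the last row, but F is not
toy = pd.DataFrame({'E': ['aa', 'aa', 'bb', np.nan],
                    'F': [10, 20, 30, 40]})

default_groups = toy.groupby('E').F.sum()            # missing group dropped
all_groups = toy.groupby('E', dropna=False).F.sum()  # missing group kept

print(len(default_groups), len(all_groups))
print(default_groups.sum(), all_groups.sum())
```

The totals differ because the default setting silently leaves out the row where `E` is missing.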

In [37]:
df.groupby('E', dropna=False)['B'].mean()

E
aa      1.0
bb      4.0
cc      7.0
dd     10.0
NaN     NaN
Name: B, dtype: float64

The bracket notation lets us supply a variable or object that holds the name of the column to summarize, instead of typing the column name as an attribute.

In [38]:
var_to_summarize = 'B'

In [39]:
df.groupby('E', dropna=False)[ var_to_summarize ].mean()

E
aa      1.0
bb      4.0
cc      7.0
dd     10.0
NaN     NaN
Name: B, dtype: float64

We can also use a string or object to identify the grouping variable.

In [40]:
var_to_group = 'E'

In [41]:
df.groupby( var_to_group, dropna=False )[ var_to_summarize ].mean()

E
aa      1.0
bb      4.0
cc      7.0
dd     10.0
NaN     NaN
Name: B, dtype: float64
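Storing the column names in variables pays off when we want to reuse the same summary for several columns. The sketch below wraps the pattern in a function. The name `group_mean` and the `toy` data are hypothetical, introduced only for illustration.

```python
import pandas as pd

def group_mean(data, group_var, summary_var):
    """Hypothetical helper: mean of summary_var per category of group_var."""
    return data.groupby(group_var, dropna=False)[summary_var].mean()

# Hypothetical toy data
toy = pd.DataFrame({'E': ['aa', 'aa', 'bb', 'bb'],
                    'B': [0.0, 2.0, 3.0, 5.0],
                    'C': [-100.0, -300.0, -400.0, -600.0]})

# Loop over column names stored as strings
results = {name: group_mean(toy, 'E', name) for name in ['B', 'C']}
print(results['B'])
print(results['C'])
```

This would not work with the dot notation, because `data.summary_var` looks for a column literally named `summary_var`.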

In [42]:
df.groupby( 'E', dropna=False ).B.std()

E
aa     1.0
bb     1.0
cc     1.0
dd     1.0
NaN    NaN
Name: B, dtype: float64

In [43]:
df.groupby( 'E', dropna=False ).C.mean()

E
aa     -200.0
bb     -500.0
cc     -800.0
dd    -1100.0
NaN       NaN
Name: C, dtype: float64

In [44]:
df.groupby( 'E', dropna=False ).C.std()

E
aa     100.0
bb     100.0
cc     100.0
dd     100.0
NaN      NaN
Name: C, dtype: float64

Apply the SEM (standard error of the mean) method!

In [45]:
df.groupby( 'E', dropna=False ).B.sem()

E
aa     0.57735
bb     0.57735
cc     0.57735
dd     0.57735
NaN        NaN
Name: B, dtype: float64

In [46]:
df.B.sem()

1.0408329997330663

In [47]:
df.B.std()

3.605551275463989
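The standard error of the mean is the standard deviation divided by the square root of the number of non-missing values. For the whole column that is 3.605551 / sqrt(12), which is about 1.040833, and within each group it is 1 / sqrt(3), which is about 0.57735. The sketch below checks the relationship per group on a small made-up DataFrame (`toy` is an assumption, not the course data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two groups of three values each
toy = pd.DataFrame({'E': ['aa'] * 3 + ['bb'] * 3,
                    'B': [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]})

grouped = toy.groupby('E').B

# SEM by hand: sample standard deviation over sqrt of the non-missing count
manual_sem = grouped.std() / np.sqrt(grouped.count())

print(manual_sem)
print(grouped.sem())
```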

If we want MULTIPLE summary stats returned we can apply the `.describe()` method!

In [48]:
df.groupby('E', dropna=False).B.describe()

E,count,mean,std,min,25%,50%,75%,max
aa,3.0,1.0,1.0,0.0,0.5,1.0,1.5,2.0
bb,3.0,4.0,1.0,3.0,3.5,4.0,4.5,5.0
cc,3.0,7.0,1.0,6.0,6.5,7.0,7.5,8.0
dd,3.0,10.0,1.0,9.0,9.5,10.0,10.5,11.0
,0.0,,,,,,,


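We do not necessarily need to identify a single column when we summarize. Applying the summary method directly to the grouped object summarizes every column it can. In recent versions of Pandas (2.0 and later) the non-numeric columns are no longer dropped silently, so pass `numeric_only=True` to skip them.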

In [49]:
df.groupby('E', dropna=False).mean()


E,B,C,F,G
aa,1.0,-200.0,13.333333,100.0
bb,4.0,-500.0,16.666667,200.0
cc,7.0,-800.0,13.333333,
dd,10.0,-1100.0,16.666667,400.0
,,,35.0,


In [50]:
df.groupby('E', dropna=False).mean(numeric_only=True)

E,B,C,F,G
aa,1.0,-200.0,13.333333,100.0
bb,4.0,-500.0,16.666667,200.0
cc,7.0,-800.0,13.333333,
dd,10.0,-1100.0,16.666667,400.0
,,,35.0,
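The `numeric_only=True` argument tells the summary method to skip the non-numeric columns. A small sketch on made-up data (`toy` is hypothetical) shows that this gives the same result as selecting the numeric columns ourselves with a list inside the brackets.

```python
import pandas as pd

# Hypothetical toy data: A is a non-numeric column
toy = pd.DataFrame({'E': ['aa', 'aa', 'bb'],
                    'A': ['a', 'b', 'c'],
                    'B': [0.0, 2.0, 3.0],
                    'F': [10, 20, 30]})

# Skip non-numeric columns automatically
means = toy.groupby('E').mean(numeric_only=True)

# Or name the columns to summarize explicitly
explicit = toy.groupby('E')[['B', 'F']].mean()

print(means)
```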


In [51]:
df.groupby('E', dropna=False).std(numeric_only=True)

E,B,C,F,G
aa,1.0,100.0,5.773503,0.0
bb,1.0,100.0,5.773503,0.0
cc,1.0,100.0,5.773503,
dd,1.0,100.0,5.773503,0.0
,,,7.071068,


In [52]:
df.groupby('E', dropna=False).sem(numeric_only=True)

E,B,C,F,G
aa,0.57735,57.735027,3.333333,0.0
bb,0.57735,57.735027,3.333333,0.0
cc,0.57735,57.735027,3.333333,
dd,0.57735,57.735027,3.333333,0.0
,,,5.0,


In [53]:
df.groupby('E', dropna=False).describe()

,B,B,B,B,B,B,B,B,C,C,...,F,F,G,G,G,G,G,G,G,G
E,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
aa,3.0,1.0,1.0,0.0,0.5,1.0,1.5,2.0,3.0,-200.0,...,15.0,20.0,3.0,100.0,0.0,100.0,100.0,100.0,100.0,100.0
bb,3.0,4.0,1.0,3.0,3.5,4.0,4.5,5.0,3.0,-500.0,...,20.0,20.0,3.0,200.0,0.0,200.0,200.0,200.0,200.0,200.0
cc,3.0,7.0,1.0,6.0,6.5,7.0,7.5,8.0,3.0,-800.0,...,15.0,20.0,0.0,,,,,,,
dd,3.0,10.0,1.0,9.0,9.5,10.0,10.5,11.0,3.0,-1100.0,...,20.0,20.0,3.0,400.0,0.0,400.0,400.0,400.0,400.0,400.0
,0.0,,,,,,,,0.0,,...,37.5,40.0,0.0,,,,,,,


In [54]:
df.groupby('E', dropna=False).describe(include='object')

,A,A,A,A,D,D,D,D,H,H,H,H
E,count,unique,top,freq,count,unique,top,freq,count,unique,top,freq
aa,3,3,a,1.0,3,3,Jan,1.0,3,2,AAA,2
bb,3,3,d,1.0,3,3,Apr,1.0,3,2,BBB,2
cc,3,3,g,1.0,3,3,Jul,1.0,3,2,AAA,2
dd,3,3,j,1.0,3,3,Oct,1.0,3,2,BBB,2
,0,0,,,0,0,,,2,2,CCC,1


In [55]:
df.groupby('E', dropna=False).describe().columns

MultiIndex([('B', 'count'),
            ('B',  'mean'),
            ('B',   'std'),
            ('B',   'min'),
            ('B',   '25%'),
            ('B',   '50%'),
            ('B',   '75%'),
            ('B',   'max'),
            ('C', 'count'),
            ('C',  'mean'),
            ('C',   'std'),
            ('C',   'min'),
            ('C',   '25%'),
            ('C',   '50%'),
            ('C',   '75%'),
            ('C',   'max'),
            ('F', 'count'),
            ('F',  'mean'),
            ('F',   'std'),
            ('F',   'min'),
            ('F',   '25%'),
            ('F',   '50%'),
            ('F',   '75%'),
            ('F',   'max'),
            ('G', 'count'),
            ('G',  'mean'),
            ('G',   'std'),
            ('G',   'min'),
            ('G',   '25%'),
            ('G',   '50%'),
            ('G',   '75%'),
            ('G',   'max')],
           )

In [56]:
df.groupby('E', dropna=False).describe()[ ('B', 'mean') ]

E
aa      1.0
bb      4.0
cc      7.0
dd     10.0
NaN     NaN
Name: (B, mean), dtype: float64

In [57]:
df.groupby('E', dropna=False).describe()[ 'mean' ]

KeyError: 'mean'
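The `KeyError` happens because the columns are a MultiIndex. A single string is matched against the FIRST level (`'B'`, `'C'`, ...), and `'mean'` lives in the SECOND level. One possible way to pull the `'mean'` entry for every variable is the `.xs()` method. The sketch below uses a made-up DataFrame (`toy` is an assumption for illustration).

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'E': ['aa', 'aa', 'bb', 'bb'],
                    'B': [0.0, 2.0, 3.0, 5.0],
                    'C': [-100.0, -300.0, -400.0, -600.0]})

summary = toy.groupby('E').describe()

# The full tuple selects exactly one column
b_mean = summary[('B', 'mean')]

# .xs() selects 'mean' from the second column level for every variable
all_means = summary.xs('mean', axis=1, level=1)

print(all_means)
```

This extra bookkeeping is one reason to prefer an approach that returns ordinary column names in the first place.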

Pandas provides multiple different ways to APPLY summary methods to GROUPED dataframes.

I personally like the `.groupby().aggregate()` approach. I feel this is the most flexible yet straightforward way to apply summary methods to DIFFERENT COLUMNS!!!

We can pick which method to apply and which column to apply it to. Each argument has the form `new_column_name = ('column', 'method')`.

In [66]:
df.groupby('E', dropna=False).\
aggregate(B_avg = ('B', 'mean'),
          B_std = ('B', 'std'),
          B_sem = ('B', 'sem'),
          C_avg = ('C', 'mean'),
          C_sem = ('C', 'sem'),
          B_numrows = ('B', 'size'),
          B_nonmissing = ('B', 'count'),
          F_nunique = ('F', 'nunique'),
          G_nunique = ('G', 'nunique'),
          A_nunique = ('A', 'nunique')).\
reset_index()

,E,B_avg,B_std,B_sem,C_avg,C_sem,B_numrows,B_nonmissing,F_nunique,G_nunique,A_nunique
0,aa,1.0,1.0,0.57735,-200.0,57.735027,3,3,2,1,3
1,bb,4.0,1.0,0.57735,-500.0,57.735027,3,3,2,1,3
2,cc,7.0,1.0,0.57735,-800.0,57.735027,3,3,2,0,3
3,dd,10.0,1.0,0.57735,-1100.0,57.735027,3,3,2,1,3
4,,,,,,,2,0,2,0,0
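Notice the difference between `'size'` and `'count'` in the result above: `'size'` counts ALL rows in the group, while `'count'` counts only the NON-MISSING values. Their difference is the number of missing values per group. Here is a minimal sketch on made-up data (the `toy` DataFrame and the `G_nmissing` column name are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: G has missing values in both groups
toy = pd.DataFrame({'E': ['aa', 'aa', 'aa', 'bb', 'bb'],
                    'G': [100.0, np.nan, 100.0, np.nan, np.nan]})

counts = toy.groupby('E').aggregate(G_numrows=('G', 'size'),
                                    G_nonmissing=('G', 'count'),
                                    G_nunique=('G', 'nunique')).reset_index()

# Number of missing values of G within each group
counts['G_nmissing'] = counts.G_numrows - counts.G_nonmissing

print(counts)
```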


Rather than just displaying the GROUPED and SUMMARIZED result...let's SAVE or ASSIGN it to a new object.

In [67]:
df_E_summary_info = df.groupby('E', dropna=False).\
aggregate(B_avg = ('B', 'mean'),
          B_std = ('B', 'std'),
          B_sem = ('B', 'sem'),
          C_avg = ('C', 'mean'),
          C_sem = ('C', 'sem'),
          B_numrows = ('B', 'size'),
          B_nonmissing = ('B', 'count'),
          F_nunique = ('F', 'nunique'),
          G_nunique = ('G', 'nunique'),
          A_nunique = ('A', 'nunique')).\
reset_index()

In [68]:
df_E_summary_info

,E,B_avg,B_std,B_sem,C_avg,C_sem,B_numrows,B_nonmissing,F_nunique,G_nunique,A_nunique
0,aa,1.0,1.0,0.57735,-200.0,57.735027,3,3,2,1,3
1,bb,4.0,1.0,0.57735,-500.0,57.735027,3,3,2,1,3
2,cc,7.0,1.0,0.57735,-800.0,57.735027,3,3,2,0,3
3,dd,10.0,1.0,0.57735,-1100.0,57.735027,3,3,2,1,3
4,,,,,,,2,0,2,0,0


In [69]:
df_E_summary_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   E             4 non-null      object 
 1   B_avg         4 non-null      float64
 2   B_std         4 non-null      float64
 3   B_sem         4 non-null      float64
 4   C_avg         4 non-null      float64
 5   C_sem         4 non-null      float64
 6   B_numrows     5 non-null      int64  
 7   B_nonmissing  5 non-null      int64  
 8   F_nunique     5 non-null      int64  
 9   G_nunique     5 non-null      int64  
 10  A_nunique     5 non-null      int64  
dtypes: float64(5), int64(5), object(1)
memory usage: 568.0+ bytes


Does the AVERAGE of `B` change across the categories of `E`?

In [70]:
df_E_summary_info

,E,B_avg,B_std,B_sem,C_avg,C_sem,B_numrows,B_nonmissing,F_nunique,G_nunique,A_nunique
0,aa,1.0,1.0,0.57735,-200.0,57.735027,3,3,2,1,3
1,bb,4.0,1.0,0.57735,-500.0,57.735027,3,3,2,1,3
2,cc,7.0,1.0,0.57735,-800.0,57.735027,3,3,2,0,3
3,dd,10.0,1.0,0.57735,-1100.0,57.735027,3,3,2,1,3
4,,,,,,,2,0,2,0,0


We can also group by MULTIPLE COLUMNS if we supply multiple column names within a list!

In [71]:
df.groupby(['E', 'H'], dropna=False).\
aggregate(B_avg = ('B', 'mean'),
          B_numrows = ('B', 'size'),
          B_nonmissing = ('B', 'count'))

E,H,B_avg,B_numrows,B_nonmissing
aa,AAA,1.0,2,2
aa,BBB,1.0,1,1
bb,AAA,4.0,1,1
bb,BBB,4.0,2,2
cc,AAA,7.0,2,2
cc,BBB,7.0,1,1
dd,AAA,10.0,1,1
dd,BBB,10.0,2,2
,CCC,,1,0
,DDD,,1,0


In [72]:
df.groupby(['E', 'H'], dropna=False).\
aggregate(B_avg = ('B', 'mean'),
          B_numrows = ('B', 'size'),
          B_nonmissing = ('B', 'count')).\
index

MultiIndex([('aa', 'AAA'),
            ('aa', 'BBB'),
            ('bb', 'AAA'),
            ('bb', 'BBB'),
            ('cc', 'AAA'),
            ('cc', 'BBB'),
            ('dd', 'AAA'),
            ('dd', 'BBB'),
            ( nan, 'CCC'),
            ( nan, 'DDD')],
           names=['E', 'H'])

The multi index is annoying and causes a lot of filtering issues.

So again I highly recommend and encourage resetting the index!!

In [73]:
df.groupby(['E', 'H'], dropna=False).\
aggregate(B_avg = ('B', 'mean'),
          B_numrows = ('B', 'size'),
          B_nonmissing = ('B', 'count')).\
reset_index()

,E,H,B_avg,B_numrows,B_nonmissing
0,aa,AAA,1.0,2,2
1,aa,BBB,1.0,1,1
2,bb,AAA,4.0,1,1
3,bb,BBB,4.0,2,2
4,cc,AAA,7.0,2,2
5,cc,BBB,7.0,1,1
6,dd,AAA,10.0,1,1
7,dd,BBB,10.0,2,2
8,,CCC,,1,0
9,,DDD,,1,0
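After resetting the index, the grouping variables are regular columns again, so the filtering we already know works as usual. The sketch below contrasts the two cases on made-up data (`toy` is hypothetical). The `.xs()` call is one possible way to filter while the MultiIndex is still in place.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'E': ['aa', 'aa', 'bb', 'bb'],
                    'H': ['AAA', 'BBB', 'AAA', 'BBB'],
                    'B': [0.0, 2.0, 3.0, 5.0]})

multi = toy.groupby(['E', 'H']).aggregate(B_avg=('B', 'mean'))
flat = multi.reset_index()

# With the MultiIndex we need an index-aware method such as .xs()
from_multi = multi.xs('AAA', level='H')

# With regular columns, ordinary boolean filtering works
only_aaa = flat.loc[flat.H == 'AAA', :]

print(only_aaa)
```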


## Realistic example

In [74]:
import seaborn as sns

In [75]:
titanic = sns.load_dataset('titanic')

In [76]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [77]:
titanic.survived.value_counts()

0    549
1    342
Name: survived, dtype: int64

In [78]:
titanic.survived.value_counts(dropna=False, normalize=True)

0    0.616162
1    0.383838
Name: survived, dtype: float64

In [79]:
titanic.survived.mean()

0.3838383838383838

But does the survival rate depend on the passenger class?

In [80]:
titanic.pclass.value_counts()

3    491
1    216
2    184
Name: pclass, dtype: int64

Let's GROUP BY `pclass` and APPLY methods to SUMMARIZE the `survived` column.

This way we can explore whether the survival rate changes across the `pclass` categories.

In [81]:
titanic.groupby('pclass', dropna=False).\
aggregate(num_rows = ('survived', 'size'),
          num_nonmissing = ('survived', 'count'),
          num_survive = ('survived', 'sum'),
          prop_survive = ('survived', 'mean'),
          survive_sem = ('survived', 'sem')).\
reset_index()

,pclass,num_rows,num_nonmissing,num_survive,prop_survive,survive_sem
0,1,216,216,136,0.62963,0.032934
1,2,184,184,87,0.472826,0.036906
2,3,491,491,119,0.242363,0.019358
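Because `survived` only takes the values 0 and 1, its mean is the PROPORTION of survivors, and its standard error can be written in terms of that proportion: sqrt( p * (1 - p) / (n - 1) ). The sketch below rebuilds the first class column from the counts in the table above (136 survivors out of 216 passengers) rather than loading the data, so the Series `survived` here is a reconstruction for illustration only.

```python
import numpy as np
import pandas as pd

# Rebuild a 0/1 column from the first class counts: 136 ones and 80 zeros
survived = pd.Series([1] * 136 + [0] * 80)

n = survived.count()   # number of non-missing values
p = survived.mean()    # proportion of ones

# Standard error of the mean for a 0/1 variable, using the sample variance
manual_sem = np.sqrt(p * (1 - p) / (n - 1))

print(round(p, 5), round(manual_sem, 6), round(survived.sem(), 6))
```

The hand calculation agrees with the `prop_survive` and `survive_sem` values reported for `pclass` 1.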


Let's group by 2 variables just to see what happens.

Let's group by `pclass` and `class`.

In [82]:
titanic.pclass.value_counts()

3    491
1    216
2    184
Name: pclass, dtype: int64

In [83]:
titanic['class'].value_counts()

Third     491
First     216
Second    184
Name: class, dtype: int64

In [84]:
titanic.groupby(['pclass', 'class'], dropna=False).\
aggregate(num_rows = ('survived', 'size'),
          num_nonmissing = ('survived', 'count'),
          num_survive = ('survived', 'sum'),
          prop_survive = ('survived', 'mean'),
          survive_sem = ('survived', 'sem')).\
reset_index()

,pclass,class,num_rows,num_nonmissing,num_survive,prop_survive,survive_sem
0,1,First,216,216,136,0.62963,0.032934
1,1,Second,0,0,0,,
2,1,Third,0,0,0,,
3,2,First,0,0,0,,
4,2,Second,184,184,87,0.472826,0.036906
5,2,Third,0,0,0,,
6,3,First,0,0,0,,
7,3,Second,0,0,0,,
8,3,Third,491,491,119,0.242363,0.019358


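The rows with 0 counts appear because `class` is a `category` column. When `observed=False`, `.groupby()` returns EVERY combination of the categories, even the combinations that never occur in the data. It can be useful to simplify the grouped and summarized result by only focusing on the OBSERVED combinations!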

In [85]:
titanic.groupby(['pclass', 'class'], dropna=False, observed=True).\
aggregate(num_rows = ('survived', 'size'),
          num_nonmissing = ('survived', 'count'),
          num_survive = ('survived', 'sum'),
          prop_survive = ('survived', 'mean'),
          survive_sem = ('survived', 'sem')).\
reset_index()

,pclass,class,num_rows,num_nonmissing,num_survive,prop_survive,survive_sem
0,1,First,216,216,136,0.62963,0.032934
1,2,Second,184,184,87,0.472826,0.036906
2,3,Third,491,491,119,0.242363,0.019358
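The effect of `observed` can be reproduced on a tiny made-up DataFrame (`toy` is hypothetical and only mimics the `pclass` and `class` columns). This is a sketch that sets `observed` explicitly in both calls, since the default value depends on the Pandas version.

```python
import pandas as pd

# Hypothetical toy data: 'class' is stored as a category column
toy = pd.DataFrame({'pclass': [1, 1, 2],
                    'class': pd.Categorical(['First', 'First', 'Second']),
                    'survived': [1, 0, 1]})

# Every combination of pclass and the class categories
all_combos = (toy.groupby(['pclass', 'class'], observed=False)
                 .aggregate(num_rows=('survived', 'size'))
                 .reset_index())

# Only the combinations that occur in the data
seen_only = (toy.groupby(['pclass', 'class'], observed=True)
                .aggregate(num_rows=('survived', 'size'))
                .reset_index())

print(len(all_combos), len(seen_only))
```

The row counts still add up to the same total in both results, because the unobserved combinations contribute 0 rows.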
