---
# Grouping for Aggregation, Filtration, and Transformation
---

In [1]:
import numpy as np
import pandas as pd

## Defining an aggregation
We examine the flights dataset and perform the simplest aggregation: a single grouping column, a single aggregating column, and a single aggregating function.
We will find the average arrival delay for each airline.

In [2]:
flights = pd.read_csv('./flights.csv')
flights.head()

Unnamed: 0,MONTH,DAY,WEEKDAY,AIRLINE,ORG_AIR,DEST_AIR,SCHED_DEP,DEP_DELAY,AIR_TIME,DIST,SCHED_ARR,ARR_DELAY,DIVERTED,CANCELLED
0,1,1,4,WN,LAX,SLC,1625,58.0,94.0,590,1905,65.0,0,0
1,1,1,4,UA,DEN,IAD,823,7.0,154.0,1452,1333,-13.0,0,0
2,1,1,4,MQ,DFW,VPS,1305,36.0,85.0,641,1453,35.0,0,0
3,1,1,4,AA,DFW,DCA,1555,7.0,126.0,1192,1935,-7.0,0,0
4,1,1,4,WN,LAX,MCI,1720,48.0,166.0,1363,2225,39.0,0,0


Define the grouping column (AIRLINE), the aggregating column (ARR_DELAY), and the aggregating function (mean). Pass the grouping column to the `.groupby` method, then call the `.agg` method with a dictionary that pairs the aggregating column with its aggregating function. When you pass a dictionary, the result is a DataFrame:

In [3]:
(
    flights.groupby(['AIRLINE'])
    .agg({'ARR_DELAY': 'mean'})
)

Unnamed: 0_level_0,ARR_DELAY
AIRLINE,Unnamed: 1_level_1
AA,5.542661
AS,-0.833333
B6,8.692593
DL,0.339691
EV,7.03458
F9,13.630651
HA,4.972973
MQ,6.860591
NK,18.43607
OO,7.593463


Alternatively, we can place the aggregating column in the index operator and then pass the aggregating function as a string to `.agg`. This will return a Series:

In [8]:
(
    flights.groupby('AIRLINE')
    ['ARR_DELAY'].agg('mean')
)

AIRLINE
AA     5.542661
AS    -0.833333
B6     8.692593
DL     0.339691
EV     7.034580
F9    13.630651
HA     4.972973
MQ     6.860591
NK    18.436070
OO     7.593463
UA     7.765755
US     1.681105
VX     5.348884
WN     6.397353
Name: ARR_DELAY, dtype: float64
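
The difference between the two return types can be checked on a small frame. The following is a minimal sketch; the `toy` DataFrame and its values are made up for illustration and are not part of the flights dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the flights dataset
toy = pd.DataFrame({
    'AIRLINE': ['AA', 'AA', 'WN', 'WN'],
    'ARR_DELAY': [10.0, 20.0, -5.0, 15.0],
})

# Dictionary form: the result is a DataFrame with one column
as_frame = toy.groupby('AIRLINE').agg({'ARR_DELAY': 'mean'})

# Column-selection form: the result is a Series
as_series = toy.groupby('AIRLINE')['ARR_DELAY'].agg('mean')

print(type(as_frame).__name__)
print(type(as_series).__name__)
```

Both forms compute the same per-airline means; only the container differs.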

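The same aggregation also accepts NumPy's `np.mean` function in place of the string `'mean'` (recent pandas releases may emit a FutureWarning for this form and recommend the string instead):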

In [5]:
(
    flights.groupby('AIRLINE')
    ['ARR_DELAY'].agg(np.mean)
    .head()
)

AIRLINE
AA    5.542661
AS   -0.833333
B6    8.692593
DL    0.339691
EV    7.034580
Name: ARR_DELAY, dtype: float64

It's possible to skip the `.agg` method altogether in this case and call the `.mean` method directly:

In [6]:
(
    flights.groupby('AIRLINE')
    ['ARR_DELAY'].mean()
    .head()
)

AIRLINE
AA    5.542661
AS   -0.833333
B6    8.692593
DL    0.339691
EV    7.034580
Name: ARR_DELAY, dtype: float64

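Passing a list of functions to `.agg` returns a DataFrame with one column per function, here the standard deviation:

In [7]: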
flights.groupby('AIRLINE')['ARR_DELAY'].agg([np.std]).head()

Unnamed: 0_level_0,std
AIRLINE,Unnamed: 1_level_1
AA,43.32316
AS,31.168354
B6,40.221718
DL,32.299471
EV,36.682336


## Grouping and aggregating with multiple columns and functions

As with any grouping operation, it helps to identify the **three** components:
- the grouping columns,
- the aggregating columns,
- and the aggregating functions.

We show the flexibility of the `.groupby` method by answering the following queries:
- Finding the number of canceled flights for every airline per weekday
- Finding the number and percentage of canceled and diverted flights for every airline per weekday
- For each origin and destination, finding the total number of flights, the number and percentage of canceled flights, and the average and variance of the airtime

In [15]:
flights.groupby(['AIRLINE', 'WEEKDAY'])['CANCELLED'].agg('sum')

AIRLINE  WEEKDAY
AA       1          41
         2           9
         3          16
         4          20
         5          18
                    ..
WN       3          18
         4          10
         5           7
         6          10
         7           7
Name: CANCELLED, Length: 98, dtype: int64

In [18]:
flights.groupby(['AIRLINE', 'WEEKDAY'])[['CANCELLED', 'DIVERTED']].agg(['sum', 'mean'])

Unnamed: 0_level_0,Unnamed: 1_level_0,CANCELLED,CANCELLED,DIVERTED,DIVERTED
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,mean,sum,mean
AIRLINE,WEEKDAY,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
AA,1,41,0.032106,6,0.004699
AA,2,9,0.007341,2,0.001631
AA,3,16,0.011949,2,0.001494
AA,4,20,0.015004,5,0.003751
AA,5,18,0.014151,1,0.000786
...,...,...,...,...,...
WN,3,18,0.014118,2,0.001569
WN,4,10,0.007911,4,0.003165
WN,5,7,0.005828,0,0.000000
WN,6,10,0.010132,3,0.003040


In [19]:
(
    flights.groupby(['ORG_AIR', 'DEST_AIR'])
    .agg(
        {
            'CANCELLED':['sum', 'mean', 'size'],
            'AIR_TIME': ['mean', 'var']
        })
 )

Unnamed: 0_level_0,Unnamed: 1_level_0,CANCELLED,CANCELLED,CANCELLED,AIR_TIME,AIR_TIME
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,mean,size,mean,var
ORG_AIR,DEST_AIR,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
ATL,ABE,0,0.000000,31,96.387097,45.778495
ATL,ABQ,0,0.000000,16,170.500000,87.866667
ATL,ABY,0,0.000000,19,28.578947,6.590643
ATL,ACY,0,0.000000,6,91.333333,11.466667
ATL,AEX,0,0.000000,40,78.725000,47.332692
...,...,...,...,...,...,...
SFO,SNA,4,0.032787,122,64.059322,11.338331
SFO,STL,0,0.000000,20,198.900000,101.042105
SFO,SUN,0,0.000000,10,78.000000,25.777778
SFO,TUS,0,0.000000,20,100.200000,35.221053


To flatten the columns, we can use the `.to_flat_index` method, which turns each MultiIndex column label into a tuple that we then join with an underscore:

In [21]:
res = (
    flights.groupby(['ORG_AIR', 'DEST_AIR']).agg(
        {
            'CANCELLED':['sum', 'mean', 'size'],
            'AIR_TIME': ['mean', 'var']
        }
    )
)
# res.columns
res.columns = ['_'.join(x) for x in res.columns.to_flat_index()]
res

Unnamed: 0_level_0,Unnamed: 1_level_0,CANCELLED_sum,CANCELLED_mean,CANCELLED_size,AIR_TIME_mean,AIR_TIME_var
ORG_AIR,DEST_AIR,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
ATL,ABE,0,0.000000,31,96.387097,45.778495
ATL,ABQ,0,0.000000,16,170.500000,87.866667
ATL,ABY,0,0.000000,19,28.578947,6.590643
ATL,ACY,0,0.000000,6,91.333333,11.466667
ATL,AEX,0,0.000000,40,78.725000,47.332692
...,...,...,...,...,...,...
SFO,SNA,4,0.032787,122,64.059322,11.338331
SFO,STL,0,0.000000,20,198.900000,101.042105
SFO,SUN,0,0.000000,10,78.000000,25.777778
SFO,TUS,0,0.000000,20,100.200000,35.221053


An elegant way to flatten the columns is to chain a helper function with the `.pipe` method:

In [22]:
def flatten_cols(df):
  df.columns = ['_'.join(x) for x in df.columns.to_flat_index()]
  return df

In [25]:
res = (
    flights.groupby(['ORG_AIR', 'DEST_AIR']).agg(
        {
            'CANCELLED':['sum', 'mean', 'size'],
            'AIR_TIME': ['mean', 'var']
        }
    ).pipe(flatten_cols)
)

res

Unnamed: 0_level_0,Unnamed: 1_level_0,CANCELLED_sum,CANCELLED_mean,CANCELLED_size,AIR_TIME_mean,AIR_TIME_var
ORG_AIR,DEST_AIR,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
ATL,ABE,0,0.000000,31,96.387097,45.778495
ATL,ABQ,0,0.000000,16,170.500000,87.866667
ATL,ABY,0,0.000000,19,28.578947,6.590643
ATL,ACY,0,0.000000,6,91.333333,11.466667
ATL,AEX,0,0.000000,40,78.725000,47.332692
...,...,...,...,...,...,...
SFO,SNA,4,0.032787,122,64.059322,11.338331
SFO,STL,0,0.000000,20,198.900000,101.042105
SFO,SUN,0,0.000000,10,78.000000,25.777778
SFO,TUS,0,0.000000,20,100.200000,35.221053
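
As an alternative to flattening after the fact, pandas also supports named aggregation, where each keyword passed to `.agg` becomes a flat output column. The sketch below is not the approach used in this notebook; the `toy` frame and the output column names are assumptions chosen to mirror the flattened labels above.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the flights dataset
toy = pd.DataFrame({
    'ORG_AIR': ['ATL', 'ATL', 'SFO', 'SFO'],
    'DEST_AIR': ['ABE', 'ABE', 'SNA', 'SNA'],
    'CANCELLED': [0, 1, 0, 0],
    'AIR_TIME': [96.0, 98.0, 64.0, 66.0],
})

# Each keyword is an output column name; each value is (input column, function)
named = toy.groupby(['ORG_AIR', 'DEST_AIR']).agg(
    CANCELLED_sum=('CANCELLED', 'sum'),
    CANCELLED_mean=('CANCELLED', 'mean'),
    AIR_TIME_mean=('AIR_TIME', 'mean'),
)

print(named.columns.tolist())
```

The trade-off is verbosity: every output column must be spelled out, whereas the dictionary form plus `flatten_cols` derives the names automatically.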


Be aware that when grouping by multiple columns, pandas creates a hierarchical index, or MultiIndex. The preceding example returned 1,130 rows. However, if one of the grouping columns is categorical (it has a category dtype, not an object dtype), pandas creates the Cartesian product of all combinations of the levels. In that case, it returns 2,710 rows, and categorical columns with higher cardinality can produce many more:

In [26]:
res = (
    flights.assign(
        ORG_AIR = flights.ORG_AIR.astype('category')
    ).groupby(['ORG_AIR', 'DEST_AIR']).agg(
        {
            'CANCELLED':['sum', 'mean', 'size'],
            'AIR_TIME': ['mean', 'var']
        }
    ).pipe(flatten_cols)
)

In [27]:
res

Unnamed: 0_level_0,Unnamed: 1_level_0,CANCELLED_sum,CANCELLED_mean,CANCELLED_size,AIR_TIME_mean,AIR_TIME_var
ORG_AIR,DEST_AIR,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
ATL,ABE,0,0.0,31,96.387097,45.778495
ATL,ABI,0,,0,,
ATL,ABQ,0,0.0,16,170.500000,87.866667
ATL,ABR,0,,0,,
ATL,ABY,0,0.0,19,28.578947,6.590643
...,...,...,...,...,...,...
SFO,TYS,0,,0,,
SFO,VLD,0,,0,,
SFO,VPS,0,,0,,
SFO,XNA,0,0.0,2,173.500000,0.500000


To remedy the combinatorial explosion, pass `observed=True` to `.groupby`. Categorical groupbys then behave like grouping by string columns and show only the observed combinations, not the Cartesian product:

In [28]:
res = (
    flights.assign(
        ORG_AIR = flights.ORG_AIR.astype('category')
    ).groupby(['ORG_AIR', 'DEST_AIR'], observed=True).agg(
        {
            'CANCELLED':['sum', 'mean', 'size'],
            'AIR_TIME': ['mean', 'var']
        }
    ).pipe(flatten_cols)
)

res

Unnamed: 0_level_0,Unnamed: 1_level_0,CANCELLED_sum,CANCELLED_mean,CANCELLED_size,AIR_TIME_mean,AIR_TIME_var
ORG_AIR,DEST_AIR,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
LAX,ABQ,1,0.018182,55,89.259259,29.403215
LAX,ANC,0,0.000000,7,307.428571,78.952381
LAX,ASE,1,0.038462,26,102.920000,102.243333
LAX,ATL,0,0.000000,174,224.201149,127.155837
LAX,AUS,0,0.000000,80,150.537500,57.897310
...,...,...,...,...,...,...
MSP,TTN,1,0.125000,8,124.428571,57.952381
MSP,TUL,0,0.000000,18,91.611111,63.075163
MSP,TUS,0,0.000000,2,176.000000,32.000000
MSP,TVC,0,0.000000,5,56.600000,10.300000
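
The effect of `observed` is easier to see on a tiny frame. This is a minimal sketch with made-up data; `observed=False` is passed explicitly because the default has been changing across pandas versions.

```python
import pandas as pd

# Hypothetical toy data: ORG_AIR is categorical, DEST_AIR is a plain string column
toy = pd.DataFrame({
    'ORG_AIR': pd.Categorical(['ATL', 'ATL', 'SFO']),
    'DEST_AIR': ['ABE', 'ABQ', 'ABE'],
})

# observed=False: one row per combination of origin category and destination
all_pairs = toy.groupby(['ORG_AIR', 'DEST_AIR'], observed=False).size()

# observed=True: only the combinations that actually occur in the data
seen_pairs = toy.groupby(['ORG_AIR', 'DEST_AIR'], observed=True).size()

print(len(all_pairs), len(seen_pairs))
```

The first result should include the SFO to ABQ pair, which never occurs in the data, with a size of 0; the second should omit it.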


## Removing the MultiIndex after grouping
MultiIndexes can appear in both the index and the columns. DataFrames with MultiIndexes are more difficult to navigate. The objective here is to manipulate the result so that the index has a single level and the column names are descriptive.

In [29]:
flights.head(2)

Unnamed: 0,MONTH,DAY,WEEKDAY,AIRLINE,ORG_AIR,DEST_AIR,SCHED_DEP,DEP_DELAY,AIR_TIME,DIST,SCHED_ARR,ARR_DELAY,DIVERTED,CANCELLED
0,1,1,4,WN,LAX,SLC,1625,58.0,94.0,590,1905,65.0,0,0
1,1,1,4,UA,DEN,IAD,823,7.0,154.0,1452,1333,-13.0,0,0


In [31]:
airline_info = (
    flights.groupby(['AIRLINE', 'WEEKDAY']).agg(
        {
            'DIST': ['sum', 'mean'],
            'ARR_DELAY': ['min', 'max']
        }
    )
    .astype(int)
)

airline_info

Unnamed: 0_level_0,Unnamed: 1_level_0,DIST,DIST,ARR_DELAY,ARR_DELAY
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,mean,min,max
AIRLINE,WEEKDAY,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
AA,1,1455386,1139,-60,551
AA,2,1358256,1107,-52,725
AA,3,1496665,1117,-45,473
AA,4,1452394,1089,-46,349
AA,5,1427749,1122,-41,732
...,...,...,...,...,...
WN,3,997213,782,-38,262
WN,4,1024854,810,-52,284
WN,5,981036,816,-44,244
WN,6,823946,834,-41,290


In [32]:
print(airline_info.columns.get_level_values(0))
print()
print(airline_info.columns.get_level_values(1))

Index(['DIST', 'DIST', 'ARR_DELAY', 'ARR_DELAY'], dtype='object')

Index(['sum', 'mean', 'min', 'max'], dtype='object')


In [33]:
airline_info.columns = ['_'.join(x) for x in 
                        airline_info.columns.to_flat_index()]
airline_info.columns

Index(['DIST_sum', 'DIST_mean', 'ARR_DELAY_min', 'ARR_DELAY_max'], dtype='object')

A quick way to get rid of the row MultiIndex is to use the `.reset_index` method:

In [34]:
airline_info.reset_index()

Unnamed: 0,AIRLINE,WEEKDAY,DIST_sum,DIST_mean,ARR_DELAY_min,ARR_DELAY_max
0,AA,1,1455386,1139,-60,551
1,AA,2,1358256,1107,-52,725
2,AA,3,1496665,1117,-45,473
3,AA,4,1452394,1089,-46,349
4,AA,5,1427749,1122,-41,732
...,...,...,...,...,...,...
93,WN,3,997213,782,-38,262
94,WN,4,1024854,810,-52,284
95,WN,5,981036,816,-44,244
96,WN,6,823946,834,-41,290


In [35]:
airline_info = (
    flights.groupby(['AIRLINE', 'WEEKDAY']).agg(
        {
            'DIST':['sum', 'min'],
            'ARR_DELAY': ['min', 'max'],
        }
    )
    .astype(int)
    .pipe(flatten_cols)
    .reset_index()
)

airline_info

Unnamed: 0,AIRLINE,WEEKDAY,DIST_sum,DIST_min,ARR_DELAY_min,ARR_DELAY_max
0,AA,1,1455386,175,-60,551
1,AA,2,1358256,175,-52,725
2,AA,3,1496665,175,-45,473
3,AA,4,1452394,175,-46,349
4,AA,5,1427749,175,-41,732
...,...,...,...,...,...,...
93,WN,3,997213,197,-38,262
94,WN,4,1024854,197,-52,284
95,WN,5,981036,197,-44,244
96,WN,6,823946,197,-41,290


By default, at the end of a groupby operation, pandas puts all of the grouping columns in the index. Set the `as_index` parameter of the `.groupby` method to `False` to avoid this behavior. Chaining the `.reset_index` method after grouping has the same effect:

In [36]:
(
    flights.groupby('AIRLINE', as_index=False)['DIST']
    .agg('mean')
    .round(0)
)

Unnamed: 0,AIRLINE,DIST
0,AA,1114.0
1,AS,1066.0
2,B6,1772.0
3,DL,866.0
4,EV,460.0
5,F9,970.0
6,HA,2615.0
7,MQ,404.0
8,NK,1047.0
9,OO,511.0
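
The equivalence of the two approaches can be checked directly. The following is a small sketch on made-up data, not the flights dataset.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'AIRLINE': ['AA', 'AA', 'WN'],
    'DIST': [1000, 1200, 800],
})

# Keep the grouping column as a regular column from the start
flat_a = toy.groupby('AIRLINE', as_index=False)['DIST'].agg('mean')

# Group normally, then move the index back into a column
flat_b = toy.groupby('AIRLINE')['DIST'].agg('mean').reset_index()

print(flat_a.equals(flat_b))
```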


In [38]:
airline_info = (
    flights.groupby(['AIRLINE', 'WEEKDAY'], as_index=False).agg(
        {
            'DIST':['sum', 'min'],
            'ARR_DELAY': ['min', 'max'],
        }
    )
    # .astype(int)
    .pipe(flatten_cols)
    # .reset_index()
)

airline_info

Unnamed: 0,AIRLINE_,WEEKDAY_,DIST_sum,DIST_min,ARR_DELAY_min,ARR_DELAY_max
0,AA,1,1455386,175,-60.0,551.0
1,AA,2,1358256,175,-52.0,725.0
2,AA,3,1496665,175,-45.0,473.0
3,AA,4,1452394,175,-46.0,349.0
4,AA,5,1427749,175,-41.0,732.0
...,...,...,...,...,...,...
93,WN,3,997213,197,-38.0,262.0
94,WN,4,1024854,197,-52.0,284.0
95,WN,5,981036,197,-44.0,244.0
96,WN,6,823946,197,-41.0,290.0
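
The `AIRLINE_` and `WEEKDAY_` labels in this output appear to come from joining each grouping column name with an empty second level. One way to avoid the trailing underscore is to strip it while flattening. The helper below, `flatten_cols_stripped`, is a hypothetical variant of `flatten_cols`, shown on a hand-built MultiIndex that mimics the column layout above.

```python
import pandas as pd

def flatten_cols_stripped(df):
    # Hypothetical variant of flatten_cols: join the levels, then drop any
    # trailing underscore left behind by an empty second level
    df = df.copy()
    df.columns = ['_'.join(x).rstrip('_') for x in df.columns.to_flat_index()]
    return df

# Columns shaped like the result of groupby(..., as_index=False) with a dict of lists
cols = pd.MultiIndex.from_tuples(
    [('AIRLINE', ''), ('WEEKDAY', ''), ('DIST', 'sum'), ('DIST', 'min')]
)
toy = pd.DataFrame([['AA', 1, 1455386, 175]], columns=cols)

flat = flatten_cols_stripped(toy)
print(flat.columns.tolist())
```

Note that `rstrip('_')` would also alter a genuine column name that ends in an underscore, so this is only a sketch for the naming pattern seen here.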


## Grouping with a custom aggregation function
pandas provides a number of aggregation functions to use with the groupby object. At some point, we may need to write our own custom user-defined function that does not exist in pandas or NumPy.

We use the college dataset to calculate the mean and standard deviation of the undergraduate population per state, and then find, for each state, the maximum number of standard deviations that any single institution's population lies from the state mean.

Find the mean and standard deviation of the undergraduate population by state:

In [40]:
college = pd.read_csv('./college.csv')
college.head(3)

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0


In [43]:
(
    college.groupby('STABBR')['UGDS'].agg(
        ['mean', 'std']
        
    )
    .round(0)
)

Unnamed: 0_level_0,mean,std
STABBR,Unnamed: 1_level_1,Unnamed: 2_level_1
AK,2493.0,4052.0
AL,2790.0,4658.0
AR,1644.0,3143.0
AS,1276.0,
AZ,4130.0,14894.0
CA,3518.0,6709.0
CO,2325.0,4670.0
CT,1874.0,2871.0
DC,2645.0,3225.0
DE,2491.0,4503.0


This output isn't quite what we desire. We are not looking for the mean and standard deviations of the entire group but the maximum number of standard deviations away from the mean for any one institution. To calculate this, we need to subtract the mean undergraduate population by state from each institution's undergraduate population and then divide by the standard deviation. This standardizes the undergraduate population for each group. We can then take the maximum of the absolute value of these scores to find the one that is farthest away from the mean. pandas does not provide a function capable of doing this. Instead, we will need to create a custom function:

In [44]:
def max_deviation(s):
  std_score = (s - s.mean()) / s.std()
  return std_score.abs().max()

In [46]:
(
    college.groupby('STABBR')['UGDS']
    .agg(max_deviation)
    .round(1)
)

STABBR
AK     2.6
AL     5.8
AR     6.3
AS     NaN
AZ     9.9
CA     6.1
CO     5.0
CT     5.6
DC     2.4
DE     3.5
FL     8.4
FM     NaN
GA     5.4
GU     1.0
HI     3.8
IA     6.5
ID     4.5
IL     7.3
IN     9.1
KS     4.9
KY     5.2
LA     6.5
MA     6.1
MD     5.3
ME     4.0
MH     NaN
MI     6.7
MN     7.8
MO     7.2
MP     NaN
MS     4.0
MT     3.9
NC     4.9
ND     3.5
NE     5.0
NH     5.3
NJ     7.1
NM     4.5
NV     4.7
NY     8.2
OH    10.3
OK     5.9
OR     5.3
PA    10.1
PR     6.0
PW     NaN
RI     2.9
SC     6.0
SD     4.2
TN     6.0
TX     7.7
UT     5.1
VA     7.0
VI     NaN
VT     3.8
WA     6.6
WI     5.8
WV     7.2
WY     2.8
Name: UGDS, dtype: float64
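
To see what `max_deviation` computes, and why some groups come out as NaN, it can help to run it on two hand-made Series. This is a sketch with made-up values; the single-value case is an assumption about why groups such as AS show NaN, which is consistent with the missing standard deviation in the earlier output.

```python
import math
import pandas as pd

def max_deviation(s):
    # Standardize each value against the group mean and standard deviation,
    # then return the largest absolute z-score
    std_score = (s - s.mean()) / s.std()
    return std_score.abs().max()

# Five values: mean is 4, sample variance is 50 / 4 = 12.5
several = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])

# A single value: the sample standard deviation is undefined (NaN)
single = pd.Series([1276.0])

print(max_deviation(several))
print(max_deviation(single))
```

For `several`, the value 10 lies 6 units above the mean, so the result is 6 divided by the square root of 12.5. For `single`, dividing by a NaN standard deviation propagates NaN to the result.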

It is possible to apply this custom function to multiple aggregating columns. We simply add more column names to the indexing operator. Note that the `max_deviation` function only works with **numeric columns**:

In [47]:
(
    college.groupby('STABBR')[['UGDS', 'SATVRMID', 'SATMTMID']]
    .agg(max_deviation)
    .round(1)
)

Unnamed: 0_level_0,UGDS,SATVRMID,SATMTMID
STABBR,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
AK,2.6,,
AL,5.8,1.6,1.8
AR,6.3,2.2,2.3
AS,,,
AZ,9.9,1.9,1.4
CA,6.1,2.7,2.5
CO,5.0,2.1,2.3
CT,5.6,3.0,2.7
DC,2.4,1.7,1.6
DE,3.5,1.2,1.1


We can also use a custom aggregation function alongside the prebuilt functions. The following does this while grouping by state and religious affiliation:

In [49]:
(
    college.groupby(['STABBR', 'RELAFFIL'])
    [['UGDS', 'SATVRMID', 'SATMTMID']]
    .agg([max_deviation, np.mean, np.std])
    .round(1)
)

Unnamed: 0_level_0,Unnamed: 1_level_0,UGDS,UGDS,UGDS,SATVRMID,SATVRMID,SATVRMID,SATMTMID,SATMTMID,SATMTMID
Unnamed: 0_level_1,Unnamed: 1_level_1,max_deviation,mean,std,max_deviation,mean,std,max_deviation,mean,std
STABBR,RELAFFIL,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
AK,0,2.1,3508.9,4539.5,,,,,,
AK,1,1.1,123.3,132.9,,555.0,,,503.0,
AL,0,5.2,3248.8,5102.4,1.6,514.9,56.5,1.7,515.8,56.7
AL,1,2.4,979.7,870.8,1.5,498.0,53.0,1.4,485.6,61.4
AR,0,5.8,1793.7,3401.6,1.9,481.1,37.9,2.0,503.6,39.0
...,...,...,...,...,...,...,...,...,...,...
WI,0,5.3,2879.1,5031.5,1.3,558.8,47.5,1.3,591.2,85.7
WI,1,3.4,1716.2,1934.6,2.1,500.1,66.0,1.8,526.6,42.5
WV,0,6.9,1873.9,6271.7,1.6,466.7,27.9,1.8,480.0,27.7
WV,1,1.3,716.4,503.6,1.9,485.7,14.6,1.7,484.8,17.7


In [52]:
def flatten_cols(df):
  df.columns = ['_'.join(x) for x in df.columns.to_flat_index()]
  return df

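# pandas labels each output column with the function's __name__,
# so renaming the function changes the flattened column names
max_deviation.__name__ = 'Max_Deviation'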

(
    college.groupby(['STABBR', 'RELAFFIL'], observed=True)
    [['UGDS', 'SATVRMID', 'SATMTMID']]
    .agg([max_deviation, np.mean, np.std])
    .round(1)
    .pipe(flatten_cols)
    .reset_index()
    .head()
)

Unnamed: 0,STABBR,RELAFFIL,UGDS_Max_Deviation,UGDS_mean,UGDS_std,SATVRMID_Max_Deviation,SATVRMID_mean,SATVRMID_std,SATMTMID_Max_Deviation,SATMTMID_mean,SATMTMID_std
0,AK,0,2.1,3508.9,4539.5,,,,,,
1,AK,1,1.1,123.3,132.9,,555.0,,,503.0,
2,AL,0,5.2,3248.8,5102.4,1.6,514.9,56.5,1.7,515.8,56.7
3,AL,1,2.4,979.7,870.8,1.5,498.0,53.0,1.4,485.6,61.4
4,AR,0,5.8,1793.7,3401.6,1.9,481.1,37.9,2.0,503.6,39.0


## Customizing aggregating functions with `*args` and `**kwargs`
The signature of `.agg` is `agg(func, *args, **kwargs)`. When `func` is a single callable, the extra positional and keyword arguments are forwarded to it.


We build a customized function for the college dataset that finds the percentage of schools, by state and religious affiliation, that have an undergraduate population between two values.

- Define a function that returns the percentage of schools with an undergraduate population between 1,000 and 3,000:


In [54]:
def pct_between_1_3k(s):
  return (
      s.between(1000, 3000)
      .mean()
      * 100
    )

-  Calculate this percentage grouping by state and religious affiliation:

In [55]:
(
    college.groupby(['STABBR', 'RELAFFIL'])
    ['UGDS'].agg(pct_between_1_3k)
    .round(1)
    .head()
)

STABBR  RELAFFIL
AK      0           14.3
        1            0.0
AL      0           23.6
        1           33.3
AR      0           27.9
Name: UGDS, dtype: float64

This function works, but it does not give the user any flexibility to choose the lower
and upper bound. Let's create a new function that allows the user to parameterize
these bounds:

In [56]:
def pct_between(s, low, high):
  return (
      s.between(low, high)
      .mean()
      * 100
    )
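
Because `.agg` forwards extra arguments to a single callable, the bounds can be supplied either positionally or by keyword. The following is a minimal sketch on made-up data; the `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

def pct_between(s, low, high):
    # Percentage of values in s that fall within [low, high], inclusive
    return s.between(low, high).mean() * 100

# Hypothetical toy data with the same column names as the college dataset
toy = pd.DataFrame({
    'STABBR': ['AK', 'AK', 'AL', 'AL', 'AL', 'AL'],
    'UGDS': [2000.0, 12000.0, 500.0, 1500.0, 2500.0, 2900.0],
})
grouped = toy.groupby('STABBR')['UGDS']

# Extra positional arguments are passed through to pct_between
by_position = grouped.agg(pct_between, 1000, 3000)

# Keyword arguments are passed through as well
by_keyword = grouped.agg(pct_between, low=1000, high=3000)

print(by_position)
```

This forwarding only applies when a single function is passed; a list of functions needs another way to bind the parameters.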

If we want to call multiple aggregation functions and some of them need parameters, we can use a Python closure to create a new function with the parameters bound in its enclosing scope:

In [64]:
def pct_between_n_m(n, m):
  def wrapper(ser):
    return pct_between(ser, n, m)
  wrapper.__name__ = f'pct_between_{n}_{m}'
  return wrapper

In [68]:
(
    college.groupby(['STABBR', 'RELAFFIL'])
    ['UGDS'].agg([pct_between_n_m(1000, 10000), 'mean', 'max'])
    .round(1)
    .head()
)

Unnamed: 0_level_0,Unnamed: 1_level_0,pct_between_1000_10000,mean,max
STABBR,RELAFFIL,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AK,0,42.9,3508.9,12865.0
AK,1,0.0,123.3,275.0
AL,0,45.8,3248.8,29851.0
AL,1,37.5,979.7,3033.0
AR,0,39.7,1793.7,21405.0
