# Chapter 10 - Data Aggregation and Group Operations

## Apply: General split-apply-combine

In [1]:
import pandas as pd
# Additional datasets
import seaborn

In [2]:
# Load from library
df = seaborn.load_dataset('titanic')
# Keep only the columns used below and drop rows with missing values
df = df[['pclass', 'sex', 'age', 'fare', 'class', 'who']].dropna()
df.head()

,pclass,sex,age,fare,class,who
0,3,male,22.0,7.25,Third,man
1,1,female,38.0,71.2833,First,woman
2,3,female,26.0,7.925,Third,woman
3,1,female,35.0,53.1,First,woman
4,3,male,35.0,8.05,Third,man


`apply()` can return more than a single value. After a `groupby()`, it can return whole chunks of data. Consider the following function, which returns the 5 most expensive tickets on the Titanic.

In [3]:
def top_5_tix(df):
    # sort_values() returns a new DataFrame, so the input is left untouched
    return df.sort_values('fare', ascending=False).head(5)

In [4]:
top_5_tix(df)

,pclass,sex,age,fare,class,who
679,1,male,36.0,512.3292,First,man
258,1,female,35.0,512.3292,First,woman
737,1,male,35.0,512.3292,First,man
438,1,male,64.0,263.0,First,man
341,1,female,24.0,263.0,First,woman


The same function can be used to get the most expensive tickets for each `pclass` by passing it to `.apply()` after a `groupby()`. The data is first split into 3 chunks, one per unique value of `pclass`. The function then picks the top 5 tickets within each chunk, and finally the chunks are glued back together to form the output.

In [5]:
df.groupby('pclass', as_index=False).apply(top_5_tix)[['pclass', 'fare']]

,,pclass,fare
0,258,1,512.3292
0,679,1,512.3292
0,737,1,512.3292
0,27,1,263.0
0,438,1,263.0
1,385,2,73.5
1,665,2,73.5
1,120,2,73.5
1,655,2,73.5
1,72,2,73.5
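
The three steps described above can also be written out by hand, which may make it clearer what `.apply()` does behind the scenes. The following is a minimal sketch, not the way pandas implements it internally: the `toy` DataFrame and the `top_n_tix` helper are made up for illustration so that the example runs without downloading the Titanic data.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'pclass' and 'fare' columns above
toy = pd.DataFrame({
    'pclass': [1, 1, 1, 2, 2, 3, 3, 3],
    'fare':   [80.0, 512.0, 263.0, 13.0, 73.5, 7.25, 56.5, 8.05],
})

def top_n_tix(chunk, n=2):
    """Return the n rows of `chunk` with the highest fare."""
    return chunk.sort_values('fare', ascending=False).head(n)

pieces = []
# Split: iterating over a GroupBy yields (key, sub-DataFrame) pairs
for key, chunk in toy.groupby('pclass'):
    # Apply: run the function on each chunk independently
    pieces.append(top_n_tix(chunk))

# Combine: glue the per-group results back together
manual = pd.concat(pieces)
print(manual)
```

Each chunk keeps its original row labels, so the combined result can still be traced back to rows of the input.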


`.apply()` also works with functions that return other kinds of objects, such as a dictionary of summary statistics. As before, the dataset is split into chunks, the function is applied to each chunk, and the results are glued back together.

In [6]:
def calc_stats(d):
    return {'min' : d.min(), 'max' : d.max(), 'std' : d.std()}

In [7]:
display(calc_stats(df['fare']))
stats = df.groupby('pclass')['fare'].apply(calc_stats)
display(stats)
# From here, use unstack() to pivot the statistic names into the columns of a DataFrame
display(stats.unstack())

{'min': 0.0, 'max': 512.3292, 'std': 52.91892950254356}

pclass     
1       max    512.329200
        min      0.000000
        std     80.857189
2       max     73.500000
        min     10.500000
        std     13.187429
3       max     56.495800
        min      0.000000
        std     10.043158
Name: fare, dtype: float64

pclass,max,min,std
1,512.3292,0.0,80.857189
2,73.5,10.5,13.187429
3,56.4958,0.0,10.043158
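
For simple named statistics like these, a similar table can probably be obtained without a custom function by passing a list of function names to `.agg()`. This is offered as an alternative sketch rather than a replacement for the `apply()` approach above. The `toy` DataFrame is hypothetical and only mimics the `pclass` and `fare` columns.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'pclass' and 'fare' columns above
toy = pd.DataFrame({
    'pclass': [1, 1, 2, 2, 3, 3],
    'fare':   [100.0, 50.0, 20.0, 10.0, 8.0, 6.0],
})

# One row per group, one column per requested statistic
stats = toy.groupby('pclass')['fare'].agg(['min', 'max', 'std'])
print(stats)
```

With `.agg()`, the columns come out in the order the statistics were requested, whereas the `unstack()` result above lists them alphabetically.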


**References:**

Python for Data Analysis, 2nd Edition, McKinney (2017)