# ADVANCED PANDAS: DATA GROUPING & AGGREGATION

## Course Outline:
- Introduction to Data Wrangling
    - Case-study: Data Preprocessing for The Absolute Beginners
- Data Cleaning & Preparation
    - Data Cleaning (Missing & Duplicated Data)
    - String Manipulation (Regular Expression)
    - Data Transformation
- Merging, Joining, and Concatenating Data
    - concat()
    - merge()
    - join()
- ***Aggregation and Grouping***
    - ***groupby()***
- Reshaping and Pivoting
    - pivot()
    - pivot_table()
    - crosstab()

##### Importing Libraries & Datasets

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

In [None]:
titanic = sns.load_dataset('titanic')
titanic

In [None]:
olympics = pd.read_csv('data/olympics.csv')
olympics

==========

# Data Aggregation (Reduction) and Grouping

### Basics of Grouping

##### DATAFRAME -> SPLIT (.loc) -> APPLY (sum()) -> COMBINE (concat()), or all three steps at once with groupby()

In [None]:
from IPython.display import Image
Image("data/groupby.png")

##### Simple Aggregation using .loc & Indexing

In [None]:
titanic[['sex','age', 'fare']].max(numeric_only=True,axis=0)

In [None]:
titanic[titanic['sex'] == 'female']['fare'].mean()

In [None]:
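# 1] Splitting the dataframe
female = titanic[titanic['sex'] == 'female']
male = titanic[titanic['sex'] == 'male']

In [None]:
# 2] Applying the aggregation function (numeric columns only)
female_mean = female.mean(numeric_only=True)
male_mean = male.mean(numeric_only=True)
female_mean

In [None]:
# 3] Combining the two aggregated results into one DataFrame
pd.concat([female_mean, male_mean], axis=1, keys=['female', 'male']).T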
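The manual split, apply, and combine steps can be compared directly with `groupby()`. The following is a minimal sketch on a small hand-made frame; the `toy` DataFrame and its values are made up for illustration and are not taken from the Titanic dataset.

```python
import pandas as pd

# Toy data (hypothetical values, not from the Titanic dataset)
toy = pd.DataFrame({
    "sex": ["female", "male", "female", "male"],
    "fare": [10.0, 20.0, 30.0, 40.0],
})

# Manual split-apply-combine: filter each group, average it, collect the results
manual = pd.Series({
    key: toy.loc[toy["sex"] == key, "fare"].mean()
    for key in ["female", "male"]
})

# The same three steps in a single groupby() call
grouped = toy.groupby("sex")["fare"].mean()

print(manual.tolist())   # [20.0, 30.0]
print(grouped.tolist())  # [20.0, 30.0]
```

Both approaches give the same numbers; `groupby()` simply avoids writing one filter per group.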

##### A Simple Aggregation Using groupby()

In [None]:
olympics.head()

In [None]:
olympics.groupby('City')['Medal'].count().sort_values()

In [None]:
olympics.groupby('City')['Medal'].value_counts().sort_values(ascending=False)

In [None]:
titanic.head()

In [None]:
titanic.groupby('pclass')['survived'].count()

In [None]:
titanic.groupby(['embarked','pclass'])['survived'].count()
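Note that `count()` counts non-null values only, while `size()` counts rows. The sketch below illustrates the difference on a made-up frame with missing ages; the `toy` data is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two of the five ages are missing
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 2],
    "age": [30.0, np.nan, 25.0, np.nan, 40.0],
})

non_null = toy.groupby("pclass")["age"].count()  # ignores NaN
rows = toy.groupby("pclass").size()              # counts every row

print(non_null.tolist())  # [1, 2]
print(rows.tolist())      # [2, 3]
```

On a column without missing values, such as `survived`, the two give the same result.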

##### Splitting with One Key

In [None]:
# Taking a slice from our dataset
titanic_gender = titanic.iloc[:10,[2,3]]
titanic_gender

In [None]:
# Grouping data by the gender
titanic_gender_group = titanic_gender.groupby('sex')
titanic_gender_group

In [None]:
# Let's see the groups
titanic_gender_group.groups

In [None]:
# Accessing a subgroup
titanic_gender_group.get_group('female')

In [None]:
titanic_gender_list = list(titanic_gender_group)
titanic_gender_list

In [None]:
titanic_gender_list[0]

In [None]:
len(titanic_gender_list[0][0])

In [None]:
type(titanic_gender_list[0][1])

In [None]:
titanic_gender_list[0][1]
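Each element of a GroupBy object is a `(key, sub-DataFrame)` pair, so it can also be looped over directly instead of being converted to a list. A minimal sketch, assuming a small made-up frame named `toy`:

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "male"],
    "age": [22.0, 38.0, 26.0, 35.0, 54.0],
})

sizes = {}
for key, frame in toy.groupby("sex"):
    # key is the group label; frame holds the rows that belong to that label
    sizes[key] = len(frame)

print(sizes)  # {'female': 2, 'male': 3}
```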

In [None]:
titanic

##### Splitting with Many Keys

In [None]:
# Splitting with many keys
titanic_class = titanic.iloc[:10,[2,3,8]]
titanic_class

In [None]:
titanic_class['class'].unique()

In [None]:
titanic_splitting = titanic_class.groupby(['sex','class'])
titanic_splitting

In [None]:
list(titanic_splitting)
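When grouping by several keys, each group label is a tuple with one entry per key. The sketch below uses a made-up frame (`toy`, hypothetical values) with plain string columns to show the labels and how to fetch one group.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "class": ["Third", "First", "Third", "First"],
    "age": [22.0, 38.0, 26.0, 54.0],
})

grouped = toy.groupby(["sex", "class"])

# Group labels are (sex, class) tuples, sorted by key
keys = [key for key, _ in grouped]
print(keys)
# [('female', 'First'), ('female', 'Third'), ('male', 'First'), ('male', 'Third')]

# A single group is selected by passing the tuple to get_group()
first_class_women = grouped.get_group(("female", "First"))
print(first_class_women["age"].tolist())  # [38.0]
```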

##### Split-Apply-Combine

In [None]:
titanic_gender

In [None]:
list(titanic_gender.groupby('sex'))[0][1]

In [None]:
list(titanic_gender.groupby('sex'))[1][1]

In [None]:
# Let's apply the aggregation function 'mean'
titanic_gender.groupby('sex').mean()

In [None]:
titanic_sum = titanic.groupby(['class','sex']).sum(numeric_only=True)
titanic_sum

In [None]:
titanic.groupby(['class','sex']).sum(numeric_only=True).index

In [None]:
titanic_gender.groupby('sex').mean()

In [None]:
titanic.groupby('sex')['survived'].mean()

In [None]:
titanic.groupby('class').count()

In [None]:
titanic.describe()

In [None]:
# Finding the overall mean fare
titanic.fare.mean()

In [None]:
# Finding the mean fare for each class
titanic.groupby('class').fare.mean()

##### Multi-indexing Using groupby()

In [None]:
# Multi-indexing using groupby() operation
titanic_sum.index

In [None]:
titanic_sum.loc['First']

In [None]:
titanic_sum.loc['First']['survived']
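A result indexed by a MultiIndex can be sliced in several ways. The following sketch, on a made-up frame (`toy`, hypothetical values), shows selection with a tuple, with `xs()` on an inner level, and reshaping with `unstack()`.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "class": ["First", "First", "First", "Third", "Third", "Third"],
    "sex": ["female", "male", "female", "male", "female", "male"],
    "survived": [1, 0, 1, 0, 1, 0],
})

summed = toy.groupby(["class", "sex"])["survived"].sum()

# Select one cell with a full (class, sex) tuple
print(int(summed.loc[("First", "female")]))  # 2

# Take a cross-section on the inner level
females = summed.xs("female", level="sex")
print(females.tolist())  # [2, 1]

# Move the 'sex' level into the columns for a wide table
wide = summed.unstack("sex")
print(int(wide.loc["Third", "female"]))  # 1
```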

##### Using agg() function

In [None]:
# Let's select specific columns
titanic_new = titanic.loc[:, ['survived', 'sex', 'age', 'fare', 'class']]
titanic_new

In [None]:
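# Restrict to numeric columns; 'class' is categorical and cannot be averaged
titanic_new.groupby('sex')[['survived', 'age', 'fare']].agg(['mean','min','sum','max'])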

In [None]:
titanic_new.groupby('sex').agg({'survived': ['min','max'], 'age': ['sum', 'count']})

In [None]:
# How about renaming these columns with meaningful names?
titanic_new.groupby('sex').agg(survived_total = ('survived', 'sum'))
titanic_new.groupby('sex').agg(survived_total = ('survived', 'sum'), survived_rate = ('survived', 'mean'))
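The dictionary form of `agg()` produces two-level column labels, whereas named aggregation produces flat, self-describing ones. A minimal sketch on made-up data (`toy` and the `fare_max` name are hypothetical):

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "sex": ["female", "male", "female", "male"],
    "survived": [1, 0, 1, 1],
    "fare": [80.0, 8.0, 20.0, 12.0],
})

# Dictionary form: columns become (column, function) pairs
nested = toy.groupby("sex").agg({"survived": ["sum", "mean"]})
print(nested.columns.tolist())  # [('survived', 'sum'), ('survived', 'mean')]

# Named aggregation: new_name=(column, function) gives flat column names
named = toy.groupby("sex").agg(
    survived_total=("survived", "sum"),
    survived_rate=("survived", "mean"),
    fare_max=("fare", "max"),
)
print(named.columns.tolist())  # ['survived_total', 'survived_rate', 'fare_max']
```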

##### Using transform() function to Handle Outliers

In [None]:
titanic

In [None]:
from scipy.stats import zscore
titanic_std = titanic.groupby('pclass')['fare'].transform(zscore)
titanic_std

In [None]:
# Flag fares more than 3 standard deviations away from their class mean
titanic_std.loc[titanic_std.abs() > 3]
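The per-group z-score can also be written out by hand, which makes explicit what `transform()` computes. `scipy.stats.zscore` uses the population standard deviation (`ddof=0`) by default, while pandas' `std()` defaults to `ddof=1`, so the sketch below passes `ddof=0` to match. The `toy` frame and the helper name `zscore_population` are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "fare": [50.0, 60.0, 70.0, 5.0, 10.0, 15.0],
})

def zscore_population(s):
    # Standardize one group: subtract its mean, divide by its population std
    return (s - s.mean()) / s.std(ddof=0)

# transform() returns one value per original row, aligned on the original index
z = toy.groupby("pclass")["fare"].transform(zscore_population)
print(z.round(4).tolist())
```

Within each class the standardized fares have mean 0, so a fare is judged against its own class rather than against the whole ship.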

In [None]:
# Another simple example to understand transform() function
values = {'keys':['a','a','b','c','c','c','a'],
      'values':[12,5,17,20,1,3,8]}
df = pd.DataFrame(values)
df

In [None]:
# Applying a function to our DataFrame
df.transform(lambda x: x*2)

In [None]:
df.groupby('keys').sum()

In [None]:
# This is a useful example for understanding transform()
df.groupby('keys').transform('sum')

In [None]:
# Now we will use our understanding for our dataset
titanic.groupby('sex')[['survived']].transform('mean').round(2)
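The key difference between an aggregation and `transform()` is the shape of the result: aggregation returns one row per group, while `transform()` broadcasts the group value back to every original row. The sketch below reuses the small `keys`/`values` frame from above; the `share` column is an added illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "keys": ["a", "a", "b", "c", "c", "c", "a"],
    "values": [12, 5, 17, 20, 1, 3, 8],
})

# Aggregation: one value per group
reduced = df.groupby("keys")["values"].sum()
print(reduced.tolist())  # [25, 17, 24]

# transform(): the group total repeated on every row of that group
broadcast = df.groupby("keys")["values"].transform("sum")
print(broadcast.tolist())  # [25, 25, 17, 24, 24, 24, 25]

# Because the shapes match, row-level arithmetic works directly
share = df["values"] / broadcast  # each row's share of its group total
```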

##### Dealing with Missing Data Using groupby()

In [None]:
titanic.head()

In [None]:
titanic.info()

In [None]:
# Filling missing data with the 'overall mean'
titanic['age'].fillna(titanic.age.mean())

In [None]:
# Here are the mean ages for each combination of sex and class
titanic.groupby(['sex', 'pclass'])['age'].mean()

In [None]:
# Attaching to every row the mean age of its (sex, pclass) group
titanic['mean_age'] = titanic.groupby(['sex', 'pclass'])['age'].transform('mean')
titanic

In [None]:
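# Assign the result back; fillna(..., inplace=True) on a selected column is unreliable in recent pandas
titanic['age'] = titanic['age'].fillna(titanic['mean_age'])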

In [None]:
titanic.info()
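The group-wise imputation can be checked on a frame small enough to verify by hand. This is a minimal sketch on made-up data (`toy`, hypothetical values), filling each missing age with the mean age of its own group.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): one missing age per group
toy = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male"],
    "age": [20.0, np.nan, 40.0, 30.0, 50.0, np.nan],
})

# Mean age of each row's group, aligned with the original rows
group_mean = toy.groupby("sex")["age"].transform("mean")

# Only the missing entries are replaced; existing ages are kept
toy["age"] = toy["age"].fillna(group_mean)

print(toy["age"].tolist())  # [20.0, 30.0, 40.0, 30.0, 50.0, 40.0]
```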

==========

# THANK YOU!