# 4. Why Tidy Data

### Summary

+ Understand why tidy data works and where it works best

### Resources
+ Read the Seaborn tutorial on [categorical data](http://seaborn.pydata.org/tutorial/categorical.html)

### Introduction
In this notebook we will see why tidy data is useful. Tidy data is supposed to make our lives easier by making aggregation, sorting, filtering, visualizing and applying machine learning easier.

To test this idea, we will perform many different data analyses on both the tidy and original datasets to see how they differ. We will use the tidied My Brother's Keeper data from the case study.

### Read in original data

In [None]:
import pandas as pd
import numpy as np

In [None]:
df_original = pd.read_csv('../data/tidy/my_brothers_keeper.csv')
df_original.head()

### Trick for removing all percent signs with the `replace` method
By default, the **`replace`** method swaps one whole value for another. If you set the **`regex`** parameter to True, the first argument is treated as a regular expression and matched inside each string, so you can be very precise about what you replace. Here, we simply replace every percent sign with an empty string.

In [None]:
df_original = df_original.replace('%', '', regex=True)
df_original.head()
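
The difference between the default behavior and `regex=True` is easiest to see on a small example. The sketch below uses a made-up two-row DataFrame (the values are illustrative and do not come from the My Brother's Keeper file).

```python
import pandas as pd

# Toy data with hypothetical values, not taken from the real dataset
toy = pd.DataFrame({'Race': ['A', 'B'],
                    'Rate': ['12.5%', '7%']})

# Without regex=True, replace only matches whole cell values,
# so no cell equals '%' and nothing changes
unchanged = toy.replace('%', '')

# With regex=True, the pattern is matched inside each string
cleaned = toy.replace('%', '', regex=True)
print(cleaned)
```

Note that the cleaned values are still strings at this point, which is why the data types need to be checked and converted next.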

### Check our data types

In [None]:
df_original.dtypes

### Convert each column to numeric
Unfortunately, the **`pd.to_numeric`** function only works on Series and not on entire DataFrames. The **`astype`** method can convert all columns to numeric but only if all of them are capable of being converted. Here, the race column will cause it to fail.
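
As a minimal sketch of that failure, assume a toy DataFrame with one text column and one column of numeric strings (made-up values). Converting the single Series works, while converting the whole frame raises an error.

```python
import pandas as pd

# Toy data with hypothetical values
toy = pd.DataFrame({'Race': ['A', 'B'], 'Rate': ['12.5', '7']})

# Converting a single Series works
rate = pd.to_numeric(toy['Rate'])

# astype on the whole frame fails because 'Race' holds non-numeric text
try:
    toy.astype(float)
    astype_failed = False
except ValueError:
    astype_failed = True

print(rate.dtype, astype_failed)
```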

### Apply `pd.to_numeric` to each column
The **`apply`** method can apply a function to each column (as a Series) independently. We simply pass the function itself as the first argument to **`apply`**. You can pass additional parameters to that function (**`pd.to_numeric`** in our case) as keyword arguments.

In [None]:
df_original = df_original.apply(pd.to_numeric, errors='ignore')
df_original.dtypes
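
Newer pandas releases (2.2 and later) deprecate `errors='ignore'` in `pd.to_numeric` and recommend catching the exception yourself. One possible alternative is sketched below; the helper name `to_numeric_if_possible` and the toy data are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def to_numeric_if_possible(col):
    """Hypothetical helper: return a numeric Series, or the original one if conversion fails."""
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col

# Toy data with hypothetical values
toy = pd.DataFrame({'Race': ['A', 'B'], 'Rate': ['12.5', '7']})

# apply hands each column to the helper as a Series
converted = toy.apply(to_numeric_if_possible)
print(converted.dtypes)
```

The text column is returned untouched and the numeric strings become floats, which mirrors what `errors='ignore'` does in older pandas versions.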

### `apply` is just a for loop
Here, **`apply`** is just a replacement for a **`for`** loop. It applies the passed function to each column in turn, so it does the same thing as the following loop.

In [None]:
for col in df_original.columns:
    df_original[col] = pd.to_numeric(df_original[col], errors='ignore')
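
To check the equivalence on something small, the sketch below runs both versions on a toy DataFrame (made-up values) where every column can be converted, and compares the results.

```python
import pandas as pd

# Toy data with hypothetical values; every column holds numeric strings
toy = pd.DataFrame({'a': ['1', '2'], 'b': ['3.5', '4.5']})

# Version 1: apply passes each column to the function as a Series
via_apply = toy.apply(pd.to_numeric)

# Version 2: the same work written as an explicit loop over a copy
via_loop = toy.copy()
for col in via_loop.columns:
    via_loop[col] = pd.to_numeric(via_loop[col])

# equals compares both the values and the dtypes
print(via_apply.equals(via_loop))
```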

# Back to our Comparison with Tidy Data
Let's read in our tidy dataset.

In [None]:
df_tidy = pd.read_csv('../data/tidy/mbk_tidy.csv')
df_tidy.head()

## Comparison #1
For our first comparison between tidy and messy data we will filter for the race **`Black`**.

In [None]:
# original
filt = df_original['Race'] == 'Black'
df_original[filt].head()

In [None]:
# tidy
filt = df_tidy['Race'] == 'Black'
df_tidy[filt].head()

### Comments for comparison #1
Since the messy dataset already has the race in a single column, the code is identical. The messy dataset might actually be preferable for readability.

## Comparison #2
Filter for the rows where the race is **`Black`** and the gender is male.

In [None]:
# original
male_columns = df_original.columns.str.contains('of male')
filt = df_original['Race'] == 'Black'

df_original.loc[filt, male_columns].head()

In [None]:
# tidy
filt = (df_tidy['Race'] == 'Black') & (df_tidy['Gender'] == 'male')
df_tidy[filt].head()

### Comments for comparison #2
This filter is a huge win for tidy data. The filtering is much more straightforward and the entire dataset is returned instead of just two columns. The birth rate is also returned specifically for males.
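
The contrast can be reproduced on a toy example. The sketch below assumes made-up, shortened column names and values; the point is only that the wide layout needs a column mask plus a row mask, while the tidy layout needs row filters alone.

```python
import pandas as pd

# Hypothetical wide table: gender is encoded in the column names
wide = pd.DataFrame({'Race': ['Black', 'White'],
                     'Distribution of male children': [51.0, 52.0],
                     'Distribution of female children': [49.0, 48.0]})

# Wide: one boolean mask for the columns and one for the rows
male_columns = wide.columns.str.contains('of male')
row_filt = wide['Race'] == 'Black'
wide_result = wide.loc[row_filt, male_columns]

# Hypothetical tidy table holding the same numbers
tidy = pd.DataFrame({'Race': ['Black', 'Black', 'White', 'White'],
                     'Gender': ['male', 'female', 'male', 'female'],
                     'Gender Percent': [51.0, 49.0, 52.0, 48.0]})

# Tidy: both conditions are row filters; each comparison needs its own parentheses
filt = (tidy['Race'] == 'Black') & (tidy['Gender'] == 'male')
tidy_result = tidy[filt]

print(wide_result)
print(tidy_result)
```

The wide result loses the `Race` column because only the columns matching the mask are kept, while the tidy result keeps every column.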

## Comparison #3
Find the average percentage of female children across all races for each age group.

In [None]:
# original
cols = ['Distribution of female children born to women ages 18-19',
       'Distribution of female children born to women ages 20-24']
df_original[cols].mean()

In [None]:
# tidy
filt = df_tidy['Gender'] == 'female'
df_tidy[filt].groupby('Age Group').agg({'Gender Percent': 'mean'})

### Comments for comparison #3
Since the messy data has both female and male observations in the same row, with one column per age group, no groupby is needed. Both approaches are fairly straightforward.
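
A small sketch shows why the two approaches agree. It assumes a made-up wide table with shortened column names and uses `melt` to build the tidy version of the same numbers.

```python
import pandas as pd

# Hypothetical wide ("messy") table: one column per age group
wide = pd.DataFrame({'Race': ['A', 'B'],
                     'female 18-19': [48.0, 50.0],
                     'female 20-24': [49.0, 51.0]})

# Wide: each age group is already a column, so a column-wise mean is enough
wide_means = wide[['female 18-19', 'female 20-24']].mean()

# Tidy version of the same numbers: one row per (Race, Age Group)
tidy = wide.melt(id_vars='Race', var_name='Age Group', value_name='Gender Percent')
tidy['Age Group'] = tidy['Age Group'].str.replace('female ', '')

# Tidy: the age group is now a value, so we group by it
tidy_means = tidy.groupby('Age Group')['Gender Percent'].mean()

print(wide_means)
print(tidy_means)
```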

## Comparison #4
Which gender has the highest average birth rate in each age group?

In [None]:
# original
age_18_19_female = (df_original['Rate of birth to women ages 18-19'] / 100 *
                    df_original['Distribution of female children born to women ages 18-19']).mean()

age_20_24_female = (df_original['Rate of birth to women ages 20-24'] / 100 *
                    df_original['Distribution of female children born to women ages 20-24']).mean()

age_18_19_male = (df_original['Rate of birth to women ages 18-19'] / 100 *
                  df_original['Distribution of male children born to women ages 18-19']).mean()

age_20_24_male = (df_original['Rate of birth to women ages 20-24'] / 100 *
                  df_original['Distribution of male children born to women ages 20-24']).mean()

In [None]:
age_18_19_female, age_20_24_female, age_18_19_male, age_20_24_male

In [None]:
# tidy
df_tidy.groupby(['Gender', 'Age Group'])['Birth Rate'].mean()

### Comments for comparison #4
Tidy is a huge winner here, as the birth rate for each gender has already been calculated and the annoyingly long column names can be avoided.

We can also use a pivot table.

In [None]:
df_tidy.pivot_table(index='Age Group', columns='Gender', values='Birth Rate')
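
The pivot table is the same aggregation as the groupby, just reshaped so that one grouping column becomes the column labels. The sketch below illustrates this on a toy tidy frame with made-up values; `pivot_table` uses the mean as its default aggregation.

```python
import pandas as pd

# Toy tidy data with hypothetical values
tidy = pd.DataFrame({
    'Gender': ['female', 'male', 'female', 'male'],
    'Age Group': ['18-19', '18-19', '20-24', '20-24'],
    'Birth Rate': [10.0, 12.0, 20.0, 22.0],
})

# pivot_table aggregates with the mean by default
pivoted = tidy.pivot_table(index='Age Group', columns='Gender', values='Birth Rate')

# The same table built from groupby, then moving Gender into the columns
unstacked = (tidy.groupby(['Age Group', 'Gender'])['Birth Rate']
                 .mean()
                 .unstack('Gender'))

print(pivoted)
```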

## Comparison #5
Which year, age group, race combination has the highest rate of births?

Original:

In [None]:
df_original.sort_values('Rate of birth to women ages 18-19', ascending=False).head(1)

In [None]:
df_original.sort_values('Rate of birth to women ages 20-24', ascending=False).head(1)

Tidy:

In [None]:
df_tidy.groupby(['Race', 'Year', 'Age Group'], as_index=False).agg({'Birth Rate':'sum'}) \
       .sort_values('Birth Rate', ascending=False).head()

### Comments for comparison #5
The original data aggregates by all births, and is not broken down by gender. We need to sort by both birth rate columns to determine which age group is the highest.

For tidy data, we need to group by race, year, and age group so that we can sum the birth rates for both genders. We can then perform one sort to get our answer.
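
The group-then-sort step can be sketched on a toy tidy frame with made-up values. Here `nlargest` is used as a shorthand for sorting in descending order and taking the top row.

```python
import pandas as pd

# Toy tidy data with hypothetical values
tidy = pd.DataFrame({
    'Race': ['A', 'A', 'B', 'B'],
    'Year': [2010, 2010, 2010, 2010],
    'Age Group': ['18-19'] * 4,
    'Gender': ['female', 'male', 'female', 'male'],
    'Birth Rate': [3.0, 4.0, 5.0, 6.0],
})

# Sum the two gender rows within each (Race, Year, Age Group) combination
totals = (tidy.groupby(['Race', 'Year', 'Age Group'], as_index=False)
              .agg({'Birth Rate': 'sum'}))

# Keep the single combination with the highest total
top = totals.nlargest(1, 'Birth Rate')
print(top)
```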

## Visualization Advantage
Huge advantages for tidy data come from plotting with the Seaborn library, which expects tidy data.

The examples below use only the tidy data. The first plot shows the percentage of each gender per year by age group.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
sns.catplot(x='Year', 
            y='Gender Percent', 
            hue='Gender', 
            col='Age Group',
            data=df_tidy, 
            kind='box', 
            height=6)

In [None]:
sns.catplot(x='Year', 
            y='Birth Rate', 
            hue='Race', 
            row='Gender', 
            col='Age Group', 
            kind='point',
            data=df_tidy, 
            ci=0, 
            height=6)

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">Tidy the dataset **`tidy/Impaired_Driving_Death_Rate.csv`**. Make a plot using seaborn comparing male to female drivers in 2012/2014.</span>

### Problem 2
<span  style="color:green; font-size:16px">Use the **`pd.read_excel`** function to read the **`tidy/genetic_engineered.xlsx`** and tidy it (very difficult).</span>