## Data Exploration and Visualization
We explore the data given in `data/wage.csv.gz` using
- [pandas](http://pandas.pydata.org/) and
- [seaborn](http://stanford.edu/~mwaskom/software/seaborn/) (hence [matplotlib](http://matplotlib.org/)).

#### Question(s):
- How does wage depend on age?
- Does the educational background influence the income?
- What about ethnicity?

#### Method(s):
- Scatter plot for income over age and data aggregation.
- Categorical plots splitting data into different subgroups.

#### Conclusion(s):
- This is your part.


In [None]:
# some imports
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import warnings
sns.set_style('whitegrid')
warnings.filterwarnings('ignore')

In [None]:
# read data in (notice that we only supply the path; pandas infers the gzip compression from the file extension)
df = pd.read_csv('data/wage.csv.gz') # data borrowed from https://cran.r-project.org/web/packages/ISLR/index.html

In [None]:
# what does this DataFrame look like?
df.head(3)

In [None]:
# for better overview, just the columns
df.columns

In [None]:
# regarding age and wage, here are some raw numbers
df.describe()[['age', 'wage']]

In [None]:
# let's see how wage evolves over age
fig, ax = plt.subplots(figsize=(8, 4))
df.plot.scatter(x='age', y='wage', title='wage ~ age', ax=ax)

In [None]:
# that is interesting but it is a bit hard to see where e.g. the median values lie
# we could group by age, compute the median and see what this looks like
median_wage = df.groupby('age')[['wage']].median()
median_wage.head()

In [None]:
# well, a picture is often nicer
fig, ax = plt.subplots(figsize=(8, 4))
df.plot.scatter(x='age', y='wage', title='wage ~ age', label='data', ax=ax)
# plot the 'wage' column as a Series so that the label shows up in the legend
median_wage['wage'].plot.line(label='median', linewidth=4, color='r', ax=ax)
ax.legend()

In [None]:
# what about education? let's draw some boxplots
fig, ax = plt.subplots(figsize=(8, 4))
sns.boxplot(y='wage', x='education', data=df, ax=ax)

In [None]:
# the order does not seem to be correct - let's fix that
order = sorted(df['education'].unique(), key=lambda x: int(x[0]))
order
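The sort key above reads only the first character of each label, so it relies on every label starting with a single digit. The sketch below shows a slightly more defensive variant that parses everything before the first dot. The labels are made up for illustration (they are assumed to follow a "<number>. <name>" pattern like the education column), and `leading_number` and `order_demo` are hypothetical names.

```python
# Hypothetical labels in a "<number>. <name>" form; not taken from the data set.
labels = ['3. Some College', '1. < HS Grad', '10. Made-up level', '2. HS Grad']


def leading_number(label):
    """Return the integer in front of the first dot of a label."""
    return int(label.split('.', 1)[0])


# Sorting numerically puts '10' after '3', which a plain string sort
# or a first-character key would not do.
order_demo = sorted(labels, key=leading_number)
print(order_demo)
```

With only single-digit prefixes, both keys give the same order, so this only matters if the number of categories grows.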

In [None]:
# replot the thing
fig, ax = plt.subplots(figsize=(8, 4))
sns.boxplot(y='wage', x='education', data=df, 
            order=order, ax=ax)

In [None]:
# would be nice if we could now split over the race variable
# well, let's do that! (and fix the order on the way)
hue_order = sorted(df['race'].unique(), key=lambda x: int(x[0]))
fig, ax = plt.subplots(figsize=(8, 4))
sns.boxplot(y='wage', x='education', hue='race', data=df, 
            order=order, hue_order=hue_order, ax=ax)

In [None]:
# if you do not like the styling, you can do
sns.set(style="ticks")
fig, ax = plt.subplots(figsize=(8, 4))
sns.boxplot(y='wage', x='education', hue='race', data=df, 
            order=order, hue_order=hue_order, ax=ax)
sns.despine(offset=10, trim=True)

In [None]:
# what you have seen in the plot is of course also possible directly
# this is called split-apply-combine
# first we split the data via group by
# then we apply some aggregation function
# and then combine the results to obtain an aggregated data set
df.groupby(['education', 'race'])[['wage']].median()  # very readable in my opinion
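To make the three steps more tangible, here is a minimal sketch that performs split, apply and combine by hand on a tiny made-up DataFrame and compares the result with the `groupby` one-liner. The data and the names `toy`, `pieces`, `medians`, `manual` and `short` are illustrative assumptions and are not part of the wage data set.

```python
import pandas as pd

# Tiny made-up data set (not the real wage data), just to show the mechanics.
toy = pd.DataFrame({
    'education': ['1. A', '1. A', '2. B', '2. B', '2. B'],
    'wage': [10.0, 30.0, 20.0, 40.0, 60.0],
})

# split: one sub-DataFrame per education level
pieces = {level: part for level, part in toy.groupby('education')}

# apply: compute the median wage of each piece
medians = {level: part['wage'].median() for level, part in pieces.items()}

# combine: put the per-group results back into a single Series
manual = pd.Series(medians, name='wage').rename_axis('education')

# the one-liner performs the same three steps for us
short = toy.groupby('education')['wage'].median()

print(manual)
print(short)
```

Both Series hold the same medians per group, which is why the one-liner is usually preferred.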