# Tidy data

The concept of *tidy data* was introduced by Hadley Wickham (Chief Scientist at RStudio and a prominent developer in the **R** community) and was inspired by relational databases. It turns out that scientists, analysts and statisticians can benefit from the same concepts. In particular, structuring your data in a tidy way makes almost any analysis you want to do easier.

The core ideas are taken from Hadley Wickham's seminal paper "Tidy data" [1].

[1] Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1-23.

In [None]:
import pandas as pd

import numpy as np

import matplotlib.pyplot as plt
import matplotlib as mpl

# %matplotlib notebook
%matplotlib inline

# increase default resolution of figures
mpl.rcParams['figure.dpi'] = 110

In [None]:
pd.__version__

# Tidying: structuring datasets to facilitate analysis

The principles of tidy data provide a standard way to organize data values within a dataset. 

Let's start with an example.

Consider the following 2 ways of presenting the same toy data. Think about whether the way we organize it makes any difference.

In [None]:
untidy = pd.DataFrame({'treatment_a':[9, 16, 3],'treatment_b':[2,11,1]}, 
                      index=['John Smith', 'Jane Doe','Mary Johnson'])
untidy

In [None]:
untidy.T

In [None]:
untidy.index.name = 'person'
untidy.columns.name = 'treatment'
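tidy = pd.melt(untidy.reset_index(), id_vars=['person'], value_name='IgG_level')
# assign the result back; replace(inplace=True) on a column selection is unreliable in recent pandas
tidy['treatment'] = tidy['treatment'].replace({'treatment_a': 'a', 'treatment_b': 'b'})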
tidy

# Definition
Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is
messy or tidy depending on how rows, columns and tables are matched up with observations,
variables and types. The core principles of tidy data are simple:
1. Each **variable** forms a **column**
2. Each **observation** forms a **row**
3. Each **type of observational unit** forms a **table**
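
As a rough illustration of the third principle, the sketch below starts from a table that mixes two kinds of observational units: people and measurements. The `age` column and its values are hypothetical and are not part of the toy data above; they only serve to show a person-level variable repeated on every measurement row. Splitting the table gives one table per type of unit.

```python
import pandas as pd

# Hypothetical mixed table: `age` describes the person, not the measurement,
# so it is repeated on every row that belongs to that person.
mixed = pd.DataFrame({
    'person': ['John Smith', 'Jane Doe', 'John Smith', 'Jane Doe'],
    'age': [34, 29, 34, 29],          # made-up values, for illustration only
    'treatment': ['a', 'a', 'b', 'b'],
    'IgG_level': [9, 16, 2, 11],
})

# One table per type of observational unit.
people = mixed[['person', 'age']].drop_duplicates().reset_index(drop=True)
measurements = mixed[['person', 'treatment', 'IgG_level']]

# The combined view can be rebuilt with a merge whenever it is needed.
rebuilt = measurements.merge(people, on='person')
```

Keeping person-level facts in their own table means each fact is stored once, so it cannot become inconsistent across rows.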

# Why use tidy data structures?

Current tools often require translation: you have to spend time munging the output from one tool so you can input it into another. Tidy datasets and tools for them work hand in hand to make data analysis easier, saving you time in the long run.

Tidying data is a topic for another time, and today we will talk about 2 tidy tools: 

- Split-Apply-Combine

- Tidy plotting (Seaborn)

# Split-Apply-Combine

Very often we need to perform some operation based on a **grouping variable**. A common example is calculating the mean of each group (e.g. the performance of each subject, or the performance on each type of stimulus). This can be thought of as 3 separate actions:
- Splitting the data based on one or more grouping variables
- Applying a function to each group separately
- Combining the resulting values back together

Based on these 3 actions, this approach is called *Split-Apply-Combine* (SAC) [1].

[1] Wickham, Hadley. "The split-apply-combine strategy for data analysis." Journal of Statistical Software 40.1 (2011): 1-29.

<img src="http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/figures/03.08-split-apply-combine.png"></img>
From the ["Aggregation and Grouping" chapter](http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.08-Aggregation-and-Grouping.ipynb) of the ["Python Data Science Handbook"](http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/Index.ipynb) by Jake VanderPlas
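
To make the three steps concrete, here is a minimal sketch that performs them by hand and then compares the result with the one-line `groupby` form. The table reuses the toy values from the beginning of the notebook; the names `df`, `groups`, `means`, `manual` and `shortcut` are chosen only for this illustration.

```python
import pandas as pd

# Small tidy table with the same toy values as above.
df = pd.DataFrame({
    'person': ['John Smith', 'Jane Doe', 'Mary Johnson'] * 2,
    'treatment': ['a'] * 3 + ['b'] * 3,
    'IgG_level': [9, 16, 3, 2, 11, 1],
})

# Split: one sub-table per value of the grouping variable.
groups = {key: sub for key, sub in df.groupby('treatment')}

# Apply: compute the mean of each group separately.
means = {key: sub['IgG_level'].mean() for key, sub in groups.items()}

# Combine: put the per-group results back into a single object.
manual = pd.Series(means, name='IgG_level').rename_axis('treatment')

# pandas performs all three steps in a single expression.
shortcut = df.groupby('treatment')['IgG_level'].mean()
```

Both `manual` and `shortcut` hold one mean per treatment. The manual version is only there to show what `groupby` does behind the scenes.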

In [None]:
untidy

In [None]:
untidy.columns

In [None]:
untidy_complex = pd.concat([untidy,untidy+10],axis='columns')
untidy_complex.columns = ['treatment_a_0600', 'treatment_b_0600', 'treatment_a_1800', 'treatment_b_1800']
untidy_complex

In [None]:
tidy

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
tidy_complex = pd.concat([tidy, tidy], ignore_index=True)
tidy_complex.loc[6:,'IgG_level']+=10
tidy_complex['hour'] = [6]*6 + [18]*6
tidy_complex

In [None]:
tidy_complex.groupby('person')['IgG_level'].apply(np.mean)

In [None]:
tidy_complex.groupby('hour')['IgG_level'].apply(np.mean)

In [None]:
pd.DataFrame(tidy_complex.groupby(['person','treatment'])['IgG_level'].apply(np.mean))
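
When more than one summary per group is useful, the apply step can return several values at once. The sketch below rebuilds a table with the same values as `tidy_complex` so that it runs on its own (the name `tc` is just a stand-in), and uses `agg` with named outputs. The output column names `mean_IgG` and `max_IgG` are arbitrary choices.

```python
import pandas as pd

people = ['John Smith', 'Jane Doe', 'Mary Johnson']
base = [9, 16, 3, 2, 11, 1]  # treatment a values, then treatment b values

# Stand-in for tidy_complex: the 18:00 measurements are the 06:00 ones plus 10.
tc = pd.DataFrame({
    'person': people * 4,
    'treatment': (['a'] * 3 + ['b'] * 3) * 2,
    'IgG_level': base + [v + 10 for v in base],
    'hour': [6] * 6 + [18] * 6,
})

# Split by two grouping variables, then compute two summaries per group.
summary = tc.groupby(['treatment', 'hour'])['IgG_level'].agg(
    mean_IgG='mean',
    max_IgG='max',
)
```

The result has one row per treatment and hour combination and one column per summary.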

# Tidy plotting

The default tool for plotting in Python is `matplotlib`. It is great, but in many cases it requires messy (as opposed to tidy) input. For instance, we could easily plot a ~~barplot~~ boxplot (never use barplots if you can help it) from the untidy dataset:

In [None]:
untidy_complex

In [None]:
plt.boxplot(untidy_complex.values)  # one box per column of the untidy table
plt.show()

In [None]:
tidy_complex

Plotting tidy datasets with matplotlib is not so easy. Of course, you can always transform a tidy dataset into an untidy one (e.g. using the `pivot_table` function), but this takes time and can be tedious with multidimensional data.
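
A minimal sketch of such a tidy-to-wide conversion, rebuilding the toy values so that the block runs on its own (the names `tc` and `wide` are illustrative). The wide layout has one column per treatment and hour combination, which is the shape that `plt.boxplot` accepted above.

```python
import pandas as pd

people = ['John Smith', 'Jane Doe', 'Mary Johnson']
base = [9, 16, 3, 2, 11, 1]  # treatment a values, then treatment b values

# Stand-in for tidy_complex.
tc = pd.DataFrame({
    'person': people * 4,
    'treatment': (['a'] * 3 + ['b'] * 3) * 2,
    'IgG_level': base + [v + 10 for v in base],
    'hour': [6] * 6 + [18] * 6,
})

# One row per person, one column per (treatment, hour) pair.
# pivot_table averages duplicates by default; here every cell has one value.
wide = tc.pivot_table(index='person', columns=['treatment', 'hour'],
                      values='IgG_level')

# Flatten the two-level column index into names such as 'treatment_a_0600'.
wide.columns = [f'treatment_{t}_{int(h):02d}00' for t, h in wide.columns]
```

Even for this small table the conversion needs a reshaping step and a renaming step, and each additional grouping variable adds another level to handle.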

# Seaborn: plotting tidy data
`seaborn` ([website](https://seaborn.pydata.org/index.html)) is a visualization package built on top of `matplotlib`. It is made specifically for working with tidy data and for creating appealing figures. It has an extensive [gallery](https://seaborn.pydata.org/examples/index.html) with lots of examples and a very good, concise [tutorial](https://seaborn.pydata.org/tutorial.html). Because it is built on top of `matplotlib`, you can still use `matplotlib` to tweak and adjust things, which is great.

In [None]:
import seaborn as sns

In [None]:
tidy_complex

In [None]:
sns.boxplot(data=tidy_complex, x='treatment', y='IgG_level', hue='hour', width=0.3)

In [None]:
births = pd.read_csv('data/births.csv')
births = births.loc[births.day.notnull()]
births = births.loc[births.births>1000]
births['day'] = pd.to_numeric(births.day)
births['month'] = pd.to_numeric(births.month)
births['year'] = pd.to_numeric(births.year)
print(births.shape)
births.head()

Note that because `seaborn` is built on top of `matplotlib`, we can use `matplotlib` to tweak the plots (uncomment the last 2 lines to rotate the X axis labels)

In [None]:
sns.boxplot(data=births, x='year', y='births')
# plt.xticks(rotation='vertical')
# plt.show()

In [None]:
sns.swarmplot(data=births.loc[births.day==1], x='gender', y='births')
plt.xticks(rotation='vertical')

In [None]:
# factorplot was renamed to catplot in seaborn 0.9 and later removed
sns.catplot(data=births, x='year', y='births', kind='point')
plt.xticks(rotation='vertical')

In [None]:
sns.catplot(data=births, x='year', y='births', kind='point', hue='gender')
plt.xticks(rotation='vertical')

In [None]:
sns.catplot(data=births.loc[births.day<=3], x='year', y='births', kind='point', hue='gender', row='day')
plt.xticks(rotation='vertical')

In [None]:
sns.catplot(data=births.loc[births.day<=3], x='month', y='births', kind='point', col='gender', hue='day')

# Conclusions
#### Considering that
- Tidying a dataset can be a bit of work
- Tidy datasets are sometimes difficult to look at (too long or too wide)

#### However
- If you have a tidy dataset and tidy tools, any analysis becomes equally easy. With messy datasets, only some analyses are easy, while others are difficult.

- Split-Apply-Combine (implemented in `pandas` by the `groupby` and `apply` methods) is a general principle that can help you analyse and summarise a tidy dataset
- `seaborn` is a visualization library for tidy datasets which helps you to easily separate different factors (grouping variables)