# Chapter 06 Tidy Data
Notes on *Pandas for Everyone*. See the author's [GitHub page](https://github.com/chendaniely/pandas_for_everyone).

*Tidy data* is a very important concept in data analysis, introduced by Hadley Wickham. His [tidy data paper](http://vita.had.co.nz/papers/tidy-data.pdf) is worth reading.

So what is *tidy data*? Hadley Wickham's paper defines it by the following criteria:

1. Each row is an observation;
2. Each column is a variable;
3. Each type of observational unit forms a table.

In my own words, think of a table as samples of a function. For example, to represent a function f(x, y) as a table, the table needs 3 columns: 'x', 'y' and 'result'. Each table in a *tidy data set* therefore represents some function: if a table has n columns, the first n-1 columns are the inputs of the function and the last column is the result.
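
To make the function analogy concrete, here is a minimal sketch that samples a function on a small grid and stores one sample per row. The function `f(x, y) = x + 2*y` and the grid values are hypothetical, chosen only for illustration.

```python
import itertools
import pandas as pd

def f(x, y):
    # Hypothetical function, used only to illustrate the analogy.
    return x + 2 * y

# One row per sample: the inputs are the variables, the last column is the result.
samples = pd.DataFrame(
    [(x, y, f(x, y)) for x, y in itertools.product([0, 1], [0, 1, 2])],
    columns=['x', 'y', 'result'],
)
print(samples)
```

Each row is one observation of the function, and each column holds exactly one variable, which is the shape the criteria above ask for.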

In [2]:
import pandas as pd

## Columns Contain Values, Not Variables

In [5]:
pew = pd.read_csv('data/pew.csv')
pew.head()

,religion,<$10k,$10-20k,$20-30k,$30-40k,$40-50k,$50-75k,$75-100k,$100-150k,>150k,Don't know/refused
0,Agnostic,27,34,60,81,76,137,122,109,84,96
1,Atheist,12,27,37,52,35,70,73,59,74,76
2,Buddhist,27,21,30,34,33,58,62,39,53,54
3,Catholic,418,617,732,670,638,1116,949,792,633,1489
4,Don’t know/refused,15,14,15,11,10,35,21,17,18,116


The table above comes from a study of religion and income level. The layout works well for presentation, but for data analysis it is worth a second look:

1. 'religion' is a variable;
2. '<$10k', '$10-20k', etc. are not variables; they are values of a single variable, "income level".

The desired table may look like:

religion | income level | count
---------|--------------|-------
Agnostic | <$10k | 27
...

### Melt Down Columns

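We can use the pd.melt() function to melt the income columns down into rows. The column named in id_vars is kept as is, the headers of the remaining columns become the values of a new 'income level' column, and the cell values go into a new 'count' column.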

In [9]:
pew_new = pd.melt(pew, id_vars='religion', var_name='income level', value_name='count')
pew_new

,religion,income level,count
0,Agnostic,<$10k,27
1,Atheist,<$10k,12
2,Buddhist,<$10k,27
3,Catholic,<$10k,418
4,Don’t know/refused,<$10k,15
...,...,...,...
175,Orthodox,Don't know/refused,73
176,Other Christian,Don't know/refused,18
177,Other Faiths,Don't know/refused,71
178,Other World Religions,Don't know/refused,8
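
The same reshaping can be tried without the CSV file. The sketch below builds a tiny wide table by hand, with counts copied from the first two rows and first two income columns of the table above, and melts it the same way. The names `wide` and `tidy` are illustrative.

```python
import pandas as pd

# A tiny wide table shaped like pew: one row per religion,
# one column per income bracket.
wide = pd.DataFrame({
    'religion': ['Agnostic', 'Atheist'],
    '<$10k': [27, 12],
    '$10-20k': [34, 27],
})

# 'religion' is kept as an identifier column; every other column header
# becomes a value of 'income level', and its cells go into 'count'.
tidy = pd.melt(wide, id_vars='religion', var_name='income level', value_name='count')
print(tidy)
```

Two religions times two income columns give four rows, just as the full table has one row for every combination of religion and income level.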


With the new table, we can easily analyze the overall income distribution for a single religion, say 'Catholic':

In [25]:
cdf = pew_new.loc[pew_new['religion'] == 'Catholic', ['income level', 'count']]
cdf

,income level,count
3,<$10k,418
21,$10-20k,617
39,$20-30k,732
57,$30-40k,670
75,$40-50k,638
93,$50-75k,1116
111,$75-100k,949
129,$100-150k,792
147,>150k,633
165,Don't know/refused,1489


Now get the most frequent income interval, ignoring the "Don't know/refused" responses:

In [29]:
cdf[cdf['income level'] != 'Don\'t know/refused'].sort_values(by=['count']).iloc[-1, 0]

'$50-75k'

In [56]:
""" Let's summarize the steps in one function """
mostFrequentIncome = lambda pew, religion: \
    pew.loc[(pew['religion'] == religion) & (pew['income level'] != 'Don\'t know/refused')].sort_values(by=['count']).iloc[-1, 1]

In [57]:
mostFrequentIncome(pew_new, 'Catholic')

'$50-75k'

In [58]:
pew_new[pew_new['religion'] == 'Hindu']

,religion,income level,count
6,Hindu,<$10k,1
24,Hindu,$10-20k,9
42,Hindu,$20-30k,7
60,Hindu,$30-40k,9
78,Hindu,$40-50k,11
96,Hindu,$50-75k,34
114,Hindu,$75-100k,47
132,Hindu,$100-150k,48
150,Hindu,>150k,54
168,Hindu,Don't know/refused,37


In [59]:
mostFrequentIncome(pew_new, 'Hindu')

'>150k'

In [60]:
mostFrequentIncome(pew_new, 'Buddhist')

'$75-100k'

In [61]:
mostFrequentIncome(pew_new, 'Agnostic')

'$50-75k'
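
Calling the function once per religion works, but the tidy layout also allows answering the question for every religion in one pass. The following is a sketch of one possible approach using `groupby` and `idxmax`, not the notebook's own method. It runs on a small stand-in for `pew_new` with a few counts copied from the tables above.

```python
import pandas as pd

# Small stand-in for pew_new; the counts are copied from the tables above.
pew_new = pd.DataFrame({
    'religion': ['Catholic'] * 3 + ['Hindu'] * 3,
    'income level': ['$50-75k', '>150k', "Don't know/refused"] * 2,
    'count': [1116, 633, 1489, 34, 54, 37],
})

# Drop the non-answers before looking for the largest count.
known = pew_new[pew_new['income level'] != "Don't know/refused"]

# idxmax returns, for each religion, the index label of the row with the largest count.
top_rows = known.loc[known.groupby('religion')['count'].idxmax()]
most_frequent = top_rows.set_index('religion')['income level']
print(most_frequent)
```

If two income levels tie, `idxmax` keeps the first one it encounters, so the result may differ from the sort-based function in that case.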