# Tidying data for analysis

This chapter covers the principles of tidy data and, more importantly, why they matter: a tidy layout makes subsequent analysis more efficient. You'll get first-hand experience reshaping and tidying data with techniques such as melting and pivoting.

## Tidy data
- “Tidy Data” paper by Hadley Wickham, PhD
- Formalize the way we describe the shape of data
- Gives us a goal when formatting our data
- “Standard way to organize data values within a dataset”

## Motivation for tidy data


|   | name   | treatment a | treatment b |
|---|---|---|---|
| 0 | Daniel | -  | 42 |
| 1 | John   | 12 | 31 |
| 2 | Jane   | 24 | 27 |


|   | 0 | 1 | 2 |
|---|---|---|---|
| name        | Daniel | John | Jane |
| treatment a | -      | 12   | 24   |
| treatment b | 42     | 31   | 27   |

## Principles of tidy data
- Columns represent separate variables
- Rows represent individual observations
- Observational units form tables


|   | name   | treatment a | treatment b |
|---|---|---|---|
| 0 | Daniel | -  | 42 |
| 1 | John   | 12 | 31 |
| 2 | Jane   | 24 | 27 |


## Converting to tidy data


|   | name   | treatment a | treatment b |
|---|---|---|---|
| 0 | Daniel | -  | 42 |
| 1 | John   | 12 | 31 |
| 2 | Jane   | 24 | 27 |

---
|   | name   | treatment   | value |
|---|---|---|---|
| 0 | Daniel | treatment a | -  |
| 1 | John   | treatment a | 12 |
| 2 | Jane   | treatment a | 24 |
| 3 | Daniel | treatment b | 42 |
| 4 | John   | treatment b | 31 |
| 5 | Jane   | treatment b | 27 |

- The first shape is better for reporting; the second is better for analysis
- Tidy data makes it easier to fix common data problems

## Converting to tidy data

The data problem we are trying to fix:
- Columns containing values instead of variables
- Solution: `pd.melt()`

```python
import pandas as pd

df = pd.read_csv('tiddy.csv')

# Keep 'name' as the identifier; stack the two treatment columns into rows
pd.melt(frame=df, id_vars='name',
        value_vars=['treatment a', 'treatment b'])
```


|   | name   | variable    | value |
|---|---|---|---|
| 0 | Daniel | treatment a | -  |
| 1 | John   | treatment a | 12 |
| 2 | Jane   | treatment a | 24 |
| 3 | Daniel | treatment b | 42 |
| 4 | John   | treatment b | 31 |
| 5 | Jane   | treatment b | 27 |


```python
# var_name and value_name replace the default 'variable' and 'value' column names
pd.melt(frame=df, id_vars='name',
        value_vars=['treatment a', 'treatment b'],
        var_name='treatment', value_name='result')
```

|   | name   | treatment   | result |
|---|---|---|---|
| 0 | Daniel | treatment a | -  |
| 1 | John   | treatment a | 12 |
| 2 | Jane   | treatment a | 24 |
| 3 | Daniel | treatment b | 42 |
| 4 | John   | treatment b | 31 |
| 5 | Jane   | treatment b | 27 |
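
To try the melt step without the CSV file, the sketch below rebuilds the three-row example in memory. The DataFrame contents are an assumption copied from the tables above, not the actual file. It also illustrates one payoff of the tidy shape: once all measurements sit in a single column, the `-` placeholder can be cleaned in one step.

```python
import pandas as pd

# In-memory stand-in for the CSV file (assumption: same three rows as the tables above).
# "-" marks the missing measurement.
df = pd.DataFrame({
    "name": ["Daniel", "John", "Jane"],
    "treatment a": ["-", 12, 24],
    "treatment b": [42, 31, 27],
})

tidy = pd.melt(
    frame=df,
    id_vars="name",
    value_vars=["treatment a", "treatment b"],
    var_name="treatment",
    value_name="result",
)

# Each of the 3 people now has one row per treatment: 3 x 2 = 6 rows.
print(tidy.shape)

# All measurements live in one column, so the "-" placeholder
# can be converted to NaN in a single call.
tidy["result"] = pd.to_numeric(tidy["result"], errors="coerce")
print(tidy)
```

In the wide layout the same cleanup would have to be repeated for every treatment column.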



---
# Pivoting data

## Pivot: un-melting data
- Opposite of melting
- In melting, we turned columns into rows
- Pivoting: turn unique values into separate columns
- Turns an analysis-friendly shape into a reporting-friendly shape
- Also fixes data that violates the tidy data principle that rows contain observations
    - Symptom: multiple variables stored in the same column
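
As a rough illustration of pivoting as the opposite of melting, the sketch below starts from the melted treatment table, pivots it back to one column per treatment, and then melts it again. The values are assumed from the earlier tables, and the variable names (`tidy`, `wide`, `remelted`) are chosen only for this example.

```python
import pandas as pd

# Tidy (melted) form of the treatment example; values assumed from the tables above.
tidy = pd.DataFrame({
    "name": ["Daniel", "John", "Jane", "Daniel", "John", "Jane"],
    "treatment": ["treatment a"] * 3 + ["treatment b"] * 3,
    "result": ["-", 12, 24, 42, 31, 27],
})

# Each unique value in 'treatment' becomes its own column again.
wide = tidy.pivot(index="name", columns="treatment", values="result")
print(wide)

# Melting the pivoted table returns to one row per (name, treatment) pair.
remelted = wide.reset_index().melt(
    id_vars="name", var_name="treatment", value_name="result"
)
print(remelted.shape)
```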

## Pivot

```python
weather_tidy = pd.read_csv('weather_tidy.csv')
# Parse the date strings into datetime values
weather_tidy['date'] = pd.to_datetime(weather_tidy['date'])

weather_tidy
```

|   | date       | element | value |
|---|---|---|---|
| 0 | 2010-01-30 | tmax | 27.8 |
| 1 | 2010-01-30 | tmin | 14.5 |
| 2 | 2010-02-02 | tmax | 27.3 |
| 3 | 2010-02-02 | tmin | 14.4 |


```python
# One row per date, one column per unique value of 'element'
weather_tidy = weather_tidy.pivot(index='date',
                                  columns='element',
                                  values='value')
weather_tidy
```

| element | tmax | tmin |
|---|---|---|
| **date** |  |  |
| 2010-01-30 | 27.8 | 14.5 |
| 2010-02-02 | 27.3 | 14.4 |
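
The two-level header in the pivoted output comes from `date` becoming the index and `element` becoming the name of the column axis. If a flat table is preferred, one possible follow-up is `reset_index()`. The sketch below uses an in-memory copy of the four weather rows shown earlier (an assumption standing in for the CSV file), and the names `weather` and `weather_flat` are introduced here for illustration.

```python
import pandas as pd

# In-memory copy of the four rows shown above (stand-in for the CSV file).
weather_tidy = pd.DataFrame({
    "date": pd.to_datetime(["2010-01-30", "2010-01-30", "2010-02-02", "2010-02-02"]),
    "element": ["tmax", "tmin", "tmax", "tmin"],
    "value": [27.8, 14.5, 27.3, 14.4],
})

weather = weather_tidy.pivot(index="date", columns="element", values="value")

# After pivoting, 'date' is the index and 'element' names the column axis.
print(weather.index.name, weather.columns.name)

# reset_index() moves 'date' back into a regular column;
# clearing columns.name removes the leftover 'element' label.
weather_flat = weather.reset_index()
weather_flat.columns.name = None
print(weather_flat)
```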


## Using pivot when you have duplicate entries

```python
import numpy as np

# weather2 is not defined in these notes; it is assumed to have the same layout
# as the tidy weather data, but with a repeated (date, element) pair
weather2_tidy = weather2.pivot(values='value',
                               index='date',
                               columns='element')
```

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-2962bb23f5a3> in <module>()
      1 weather2_tidy = weather2.pivot(values='value',
      2                                index='date',
----> 3                                columns='element')
ValueError: Index contains duplicate entries, cannot reshape
```
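
The error can be reproduced with a small in-memory frame. The data below is hypothetical: the `weather_dup` name and the second `tmin` reading are made up for illustration. The sketch also shows one way to locate the offending rows with `DataFrame.duplicated` before deciding how to combine them.

```python
import pandas as pd

# Hypothetical data: 'tmin' is recorded twice for the same date.
weather_dup = pd.DataFrame({
    "date": ["2010-01-30", "2010-01-30", "2010-01-30"],
    "element": ["tmax", "tmin", "tmin"],
    "value": [27.8, 14.5, 16.5],
})

try:
    weather_dup.pivot(index="date", columns="element", values="value")
    error_message = None
except ValueError as err:
    # pivot() cannot place two values in the same (date, element) cell
    error_message = str(err)

print(error_message)

# keep=False flags every row that shares its (date, element) pair with another row.
dupes = weather_dup[weather_dup.duplicated(subset=["date", "element"], keep=False)]
print(dupes)
```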



### Pivot table
- Has a parameter that specifies how to deal with duplicate values
- Example: can aggregate the duplicate values by taking their average


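```python
weather_tidy = pd.read_csv('weather_tidy.csv')

# aggfunc decides how duplicate (date, element) values are combined;
# the string 'mean' behaves like np.mean and avoids a warning in recent pandas
weather2_tidy = weather_tidy.pivot_table(values='value',
                                         index='date',
                                         columns='element',
                                         aggfunc='mean')
weather2_tidy
```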

| element | tmax | tmin |
|---|---|---|
| **date** |  |  |
| 1/30/2010 | 27.8 | 14.5 |
| 2/2/2010 | 27.3 | 14.4 |
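
The file above happens to contain no duplicates, so the aggregation is not visible in its output. As a minimal sketch of what `aggfunc` does when duplicates are present, the example below uses hypothetical data (the `weather_dup` frame and its second `tmin` reading are invented for illustration) and averages the two readings.

```python
import pandas as pd

# Hypothetical data: 'tmin' is recorded twice on 2010-01-30.
weather_dup = pd.DataFrame({
    "date": ["2010-01-30", "2010-01-30", "2010-01-30", "2010-02-02", "2010-02-02"],
    "element": ["tmax", "tmin", "tmin", "tmax", "tmin"],
    "value": [27.8, 14.5, 16.5, 27.3, 14.4],
})

# Duplicate (date, element) pairs are collapsed to their mean instead of raising.
wide = weather_dup.pivot_table(
    values="value", index="date", columns="element", aggfunc="mean"
)
print(wide)
```

Other aggregations, such as `'max'` or `'first'`, may suit the data better; which one is appropriate depends on why the duplicates exist.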



# Beyond melt and pivot

### Beyond melt and pivot
- Melting and pivoting are basic tools
- Another common problem:
    - Columns contain multiple bits of information

```python
tb = pd.read_csv('tb.csv')
tb.head()
```

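|   | country | year | m014 | m1524 | m2534 | m3544 | m4554 | m5564 | m65 | mu | f014 | f1524 | f2534 | f3544 | f4554 | f5564 | f65 | fu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AD | 2000 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | AE | 2000 | 2.0 | 4.0 | 4.0 | 6.0 | 5.0 | 12.0 | 10.0 | NaN | 3.0 | 16.0 | 1.0 | 3.0 | 0.0 | 0.0 | 4.0 | NaN |
| 2 | AF | 2000 | 52.0 | 228.0 | 183.0 | 149.0 | 129.0 | 94.0 | 80.0 | NaN | 93.0 | 414.0 | 565.0 | 339.0 | 205.0 | 99.0 | 36.0 | NaN |
| 3 | AG | 2000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | NaN | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN |
| 4 | AL | 2000 | 2.0 | 19.0 | 21.0 | 14.0 | 24.0 | 19.0 | 16.0 | NaN | 3.0 | 11.0 | 10.0 | 8.0 | 8.0 | 5.0 | 11.0 | NaN |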


## Melting and parsing

- Nothing inherently wrong with the original data shape
- Not conducive to analysis

```python
# With no value_vars, every column other than the id_vars is melted
tb_melt = pd.melt(frame=tb, id_vars=['country', 'year'])
tb_melt.head(6)
```

|   | country | year | variable | value |
|---|---|---|---|---|
| 0 | AD | 2000 | m014 | 0.0 |
| 1 | AE | 2000 | m014 | 2.0 |
| 2 | AF | 2000 | m014 | 52.0 |
| 3 | AG | 2000 | m014 | 0.0 |
| 4 | AL | 2000 | m014 | 2.0 |
| 5 | AM | 2000 | m014 | 2.0 |


## Melting and parsing

```python
# The first character of each former column name encodes the sex ('m' or 'f')
tb_melt['sex'] = tb_melt.variable.str[0]
tb_melt.head()
```

|   | country | year | variable | value | sex |
|---|---|---|---|---|---|
| 0 | AD | 2000 | m014 | 0.0 | m |
| 1 | AE | 2000 | m014 | 2.0 | m |
| 2 | AF | 2000 | m014 | 52.0 | m |
| 3 | AG | 2000 | m014 | 0.0 | m |
| 4 | AL | 2000 | m014 | 2.0 | m |
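
The same string slicing can plausibly pull out the rest of the encoded information. The sketch below is self-contained: it uses a two-column excerpt of the `tb` data shown earlier rather than the CSV file, and the `age_group` column name is an assumption introduced here, not something from the original dataset.

```python
import pandas as pd

# Small excerpt of the tb data shown earlier (stand-in for the CSV file).
tb = pd.DataFrame({
    "country": ["AD", "AE"],
    "year": [2000, 2000],
    "m014": [0.0, 2.0],
    "f1524": [None, 16.0],  # None becomes NaN, matching the missing AD value
})

tb_melt = pd.melt(frame=tb, id_vars=["country", "year"])

# First character: sex. Remaining characters: the age-group code.
tb_melt["sex"] = tb_melt["variable"].str[0]
tb_melt["age_group"] = tb_melt["variable"].str[1:]  # 'age_group' is a hypothetical name
print(tb_melt)
```

Each piece of information now has its own column, so rows can be filtered or grouped by sex and age group separately.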

