# Tidy Data

A huge amount of effort is spent cleaning data to get it ready for analysis, but there
has been little research on how to make data cleaning as easy and effective as possible.
This paper tackles a small, but important, component of data cleaning: data tidying.
Tidy datasets are easy to manipulate, model and visualise, and have a specific structure:
each variable is a column, each observation is a row, and each type of observational unit
is a table. This framework makes it easy to tidy messy datasets because only a small
set of tools is needed to deal with a wide range of untidy datasets. This structure
also makes it easier to develop tidy tools for data analysis, tools that both input and
output tidy datasets. The advantages of a consistent data structure and matching tools
are demonstrated with a case study free from mundane data manipulation chores.

[Source](http://vita.had.co.nz/papers/tidy-data.pdf)

[GitHub](https://github.com/hadley/tidy-data)
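
As a minimal sketch of the structure described in the abstract, the toy table below (hypothetical names and values, not taken from the paper's datasets) stores one variable across two columns. `melt` reshapes it so that each row is a single observation and each column is a single variable.

```python
import pandas as pd

# Hypothetical toy table: one row per person, one column per treatment.
messy = pd.DataFrame({
    "person": ["Ann", "Bob"],
    "treatment_a": [1, 4],
    "treatment_b": [2, 5],
})

# melt turns the two treatment columns into (treatment, result) pairs,
# so every row holds exactly one observation.
tidy = messy.melt(id_vars=["person"], var_name="treatment", value_name="result")
print(tidy)
```

The same call pattern is used on the real datasets in the sections that follow.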

In [0]:
import numpy as np
import pandas as pd

## Religion income

In [3]:
data_url = "https://raw.githubusercontent.com/davemlz/Master_of_DataScience/master/Modelos_de_Datos/Pandas/Tidy/tidy/tidy-data/data/pew.csv"
religion = pd.read_csv(data_url)
religion.head()

Unnamed: 0,religion,<$10k,$10-20k,$20-30k,$30-40k,$40-50k,$50-75k,$75-100k,$100-150k,>150k,Don't know/refused
0,Agnostic,27,34,60,81,76,137,122,109,84,96
1,Atheist,12,27,37,52,35,70,73,59,74,76
2,Buddhist,27,21,30,34,33,58,62,39,53,54
3,Catholic,418,617,732,670,638,1116,949,792,633,1489
4,Don’t know/refused,15,14,15,11,10,35,21,17,18,116


In [10]:
religionMelted = religion.melt(id_vars=['religion'], var_name='income', value_name='freq')
religionMelted

Unnamed: 0,religion,income,freq
0,Agnostic,<$10k,27
1,Atheist,<$10k,12
2,Buddhist,<$10k,27
3,Catholic,<$10k,418
4,Don’t know/refused,<$10k,15
...,...,...,...
175,Orthodox,Don't know/refused,73
176,Other Christian,Don't know/refused,18
177,Other Faiths,Don't know/refused,71
178,Other World Religions,Don't know/refused,8


In [18]:
religionMelted.groupby(['income','religion']).sum()

income,religion,freq
$10-20k,Agnostic,34
$10-20k,Atheist,27
$10-20k,Buddhist,21
$10-20k,Catholic,617
$10-20k,Don’t know/refused,14
...,...,...
Don't know/refused,Orthodox,73
Don't know/refused,Other Christian,18
Don't know/refused,Other Faiths,71
Don't know/refused,Other World Religions,8
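
Once the table is in long form, grouped calculations need no reshaping. The sketch below uses four rows copied from the melted output above and computes each religion's share within an income bracket. The `share` column is an illustrative addition, and the shares are relative to these sample rows only, not to the full survey.

```python
import pandas as pd

# Four rows copied from the melted table shown above.
sample = pd.DataFrame({
    "religion": ["Agnostic", "Atheist", "Agnostic", "Atheist"],
    "income": ["<$10k", "<$10k", "$10-20k", "$10-20k"],
    "freq": [27, 12, 34, 27],
})

# Total respondents per income bracket, broadcast back onto every row.
bracket_total = sample.groupby("income")["freq"].transform("sum")

# Share of each religion within its bracket (sample rows only).
sample["share"] = sample["freq"] / bracket_total
print(sample)
```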


In [21]:
# idxmax returns index labels, so select the rows by label with .loc
religionMelted.loc[religionMelted.groupby('income')['freq'].idxmax()]

Unnamed: 0,religion,income,freq
23,Evangelical Prot,$10-20k,869
129,Catholic,$100-150k,792
41,Evangelical Prot,$20-30k,1064
59,Evangelical Prot,$30-40k,982
77,Evangelical Prot,$40-50k,881
95,Evangelical Prot,$50-75k,1486
111,Catholic,$75-100k,949
5,Evangelical Prot,<$10k,575
154,Mainline Prot,>150k,634
167,Evangelical Prot,Don't know/refused,1529
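
`idxmax` returns index labels rather than integer positions, so label-based selection with `.loc` is the safer pairing. The two coincide only while the frame keeps its default 0..n-1 index. A small sketch with a deliberately shifted index (values copied from the tables above) shows the label-based lookup:

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "religion": ["Agnostic", "Catholic", "Agnostic", "Catholic"],
        "income": ["<$10k", "<$10k", "$10-20k", "$10-20k"],
        "freq": [27, 418, 34, 617],
    },
    index=[10, 11, 12, 13],  # deliberately not the default 0..n-1
)

# Index label of the largest freq within each income bracket.
top_labels = sample.groupby("income")["freq"].idxmax()

# Label-based lookup; positional .iloc would be out of bounds here.
top_rows = sample.loc[top_labels]
print(top_rows)
```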


## Billboard

In [4]:
data_url = "https://raw.githubusercontent.com/davemlz/Master_of_DataScience/master/Modelos_de_Datos/Pandas/Tidy/tidy/tidy-data/data/billboard.csv"

billboard = pd.read_csv(data_url, encoding='ISO-8859-1')

print(billboard.shape) # wide dataset

(317, 83)


In [5]:
billboard.head()

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,x1st.week,x2nd.week,x3rd.week,x4th.week,x5th.week,x6th.week,x7th.week,x8th.week,x9th.week,x10th.week,x11th.week,x12th.week,x13th.week,x14th.week,x15th.week,x16th.week,x17th.week,x18th.week,x19th.week,x20th.week,x21st.week,x22nd.week,x23rd.week,x24th.week,x25th.week,x26th.week,x27th.week,x28th.week,x29th.week,x30th.week,x31st.week,x32nd.week,x33rd.week,...,x37th.week,x38th.week,x39th.week,x40th.week,x41st.week,x42nd.week,x43rd.week,x44th.week,x45th.week,x46th.week,x47th.week,x48th.week,x49th.week,x50th.week,x51st.week,x52nd.week,x53rd.week,x54th.week,x55th.week,x56th.week,x57th.week,x58th.week,x59th.week,x60th.week,x61st.week,x62nd.week,x63rd.week,x64th.week,x65th.week,x66th.week,x67th.week,x68th.week,x69th.week,x70th.week,x71st.week,x72nd.week,x73rd.week,x74th.week,x75th.week,x76th.week
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,78,63.0,49.0,33.0,23.0,15.0,7.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,3.0,7.0,10.0,12.0,15.0,22.0,29.0,31.0,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,15,8.0,6.0,5.0,2.0,3.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,8.0,15.0,19.0,21.0,26.0,36.0,48.0,47.0,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,71,48.0,43.0,31.0,20.0,13.0,7.0,6.0,4.0,4.0,4.0,6.0,4.0,2.0,1.0,1.0,1.0,2.0,1.0,2.0,4.0,8.0,8.0,12.0,14.0,17.0,21.0,24.0,30.0,34.0,37.0,46.0,47.0,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,41,23.0,18.0,14.0,2.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,4.0,8.0,11.0,16.0,20.0,25.0,27.0,27.0,29.0,44.0,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,57,47.0,45.0,29.0,23.0,18.0,11.0,9.0,9.0,11.0,1.0,1.0,1.0,1.0,4.0,8.0,12.0,22.0,23.0,43.0,44.0,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [28]:
# id_vars expects column names, so pass the first seven names rather than a DataFrame slice
billboardMelted = billboard.melt(id_vars=list(billboard.columns[:7]), var_name='week', value_name='rank')
billboardMelted

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,week,rank
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,x1st.week,78.0
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,x1st.week,15.0
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,x1st.week,71.0
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,x1st.week,41.0
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,x1st.week,57.0
...,...,...,...,...,...,...,...,...,...
24087,2000,Ghostface Killah,Cherchez LaGhost,3:04,R&B,2000-08-05,2000-08-05,x76th.week,
24088,2000,"Smith, Will",Freakin' It,3:58,Rap,2000-02-12,2000-02-12,x76th.week,
24089,2000,Zombie Nation,Kernkraft 400,3:30,Rock,2000-09-02,2000-09-02,x76th.week,
24090,2000,"Eastsidaz, The",Got Beef,3:58,Rap,2000-07-01,2000-07-01,x76th.week,
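
The melted table still carries week labels such as `x76th.week` and many rows with no rank. A possible next step, sketched here on three rows copied from the output above, is to extract the week number, drop the empty ranks (assuming a missing rank means the song was off the chart that week), and derive a chart date. The `date` column is an illustrative addition.

```python
import pandas as pd

# Three rows copied from the tables above.
sample = pd.DataFrame({
    "track": ["Music", "Music", "Got Beef"],
    "date.entered": ["2000-08-12", "2000-08-12", "2000-07-01"],
    "week": ["x1st.week", "x2nd.week", "x76th.week"],
    "rank": [41.0, 23.0, None],
})

# Pull the digits out of labels such as "x76th.week".
sample["week"] = sample["week"].str.extract(r"(\d+)", expand=False).astype(int)

# Keep only the weeks in which a rank was recorded.
tidy = sample.dropna(subset=["rank"]).copy()

# Chart date = entry date plus (week - 1) weeks.
tidy["date"] = pd.to_datetime(tidy["date.entered"]) + pd.to_timedelta(
    (tidy["week"] - 1) * 7, unit="D"
)
print(tidy)
```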


## Weather

In [0]:
data_url = "https://raw.githubusercontent.com/davemlz/Master_of_DataScience/master/Modelos_de_Datos/Pandas/Tidy/tidy/tidy-data/data/weather.csv"
weather = pd.read_csv(data_url)

In [7]:
weather.head()

Unnamed: 0,id,year,month,element,d1,d2,d3,d4,d5,d6,d7,d8,d9,d10,d11,d12,d13,d14,d15,d16,d17,d18,d19,d20,d21,d22,d23,d24,d25,d26,d27,d28,d29,d30,d31
0,MX17004,2010,1,tmax,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,27.8,
1,MX17004,2010,1,tmin,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,14.5,
2,MX17004,2010,2,tmax,,27.3,24.1,,,,,,,,29.7,,,,,,,,,,,,29.9,,,,,,,,
3,MX17004,2010,2,tmin,,14.4,14.4,,,,,,,,13.4,,,,,,,,,,,,10.7,,,,,,,,
4,MX17004,2010,3,tmax,,,,,32.1,,,,,34.5,,,,,,31.1,,,,,,,,,,,,,,,


In [38]:
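# id_vars expects column names; 'dia' holds the day label (d1..d31)
weatherMelted = weather.melt(id_vars=list(weather.columns[:4]), var_name='dia', value_name='temp')
# pivot_table spreads the tmax/tmin rows into columns and drops days with no readings
weatherPivot = weatherMelted.pivot_table(values='temp', index=['id', 'year', 'month', 'dia'], columns='element')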
weatherPivot

id,year,month,dia,tmax,tmin
MX17004,2010,1,d30,27.8,14.5
MX17004,2010,2,d11,29.7,13.4
MX17004,2010,2,d2,27.3,14.4
MX17004,2010,2,d23,29.9,10.7
MX17004,2010,2,d3,24.1,14.4
MX17004,2010,3,d10,34.5,16.8
MX17004,2010,3,d16,31.1,17.6
MX17004,2010,3,d5,32.1,14.2
MX17004,2010,4,d27,36.3,16.7
MX17004,2010,5,d27,33.2,18.2
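
In the output above the day labels sort as text, so `d11` comes before `d2`. One way to fix this, sketched on two rows copied from the pivoted table, is to convert `dia` to an integer and combine it with `year` and `month` into a single date. The `day` and `date` column names are illustrative choices.

```python
import pandas as pd

# Two rows copied from the pivoted weather table above.
sample = pd.DataFrame({
    "id": ["MX17004", "MX17004"],
    "year": [2010, 2010],
    "month": [2, 1],
    "dia": ["d2", "d30"],
    "tmax": [27.3, 27.8],
    "tmin": [14.4, 14.5],
})

# Strip the leading "d" and convert the rest to an integer day.
sample["day"] = sample["dia"].str[1:].astype(int)

# to_datetime assembles a date from year/month/day columns.
sample["date"] = pd.to_datetime(sample[["year", "month", "day"]])

# Keep one date column and sort chronologically.
tidy = (
    sample.drop(columns=["year", "month", "dia", "day"])
    .sort_values("date")
    .reset_index(drop=True)
)
print(tidy)
```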


## Billboard 2

In [39]:
billboardMelted

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,week,rank
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,x1st.week,78.0
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,x1st.week,15.0
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,x1st.week,71.0
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,x1st.week,41.0
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,x1st.week,57.0
...,...,...,...,...,...,...,...,...,...
24087,2000,Ghostface Killah,Cherchez LaGhost,3:04,R&B,2000-08-05,2000-08-05,x76th.week,
24088,2000,"Smith, Will",Freakin' It,3:58,Rap,2000-02-12,2000-02-12,x76th.week,
24089,2000,Zombie Nation,Kernkraft 400,3:30,Rock,2000-09-02,2000-09-02,x76th.week,
24090,2000,"Eastsidaz, The",Got Beef,3:58,Rap,2000-07-01,2000-07-01,x76th.week,


In [43]:
# One row per song: keep the song-level columns and drop the repeats
songs = billboardMelted[['artist.inverted', 'track', 'time']].drop_duplicates()
songs

Unnamed: 0,artist.inverted,track,time
0,Destiny's Child,Independent Women Part I,3:38
1,Santana,"Maria, Maria",4:18
2,Savage Garden,I Knew I Loved You,4:07
3,Madonna,Music,3:45
4,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38
...,...,...,...
312,Ghostface Killah,Cherchez LaGhost,3:04
313,"Smith, Will",Freakin' It,3:58
314,Zombie Nation,Kernkraft 400,3:30
315,"Eastsidaz, The",Got Beef,3:58
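
A song table is easier to join against when each song has a compact key. The sketch below, on a few rows copied from the output above, adds a surrogate `song_id` after deduplication. The column name and the 1-based numbering are illustrative assumptions.

```python
import pandas as pd

# Song-level columns as they appear in the melted table (one row per week).
sample = pd.DataFrame({
    "artist.inverted": ["Santana", "Santana", "Madonna"],
    "track": ["Maria, Maria", "Maria, Maria", "Music"],
    "time": ["4:18", "4:18", "3:45"],
})

# One row per song, with a fresh 0..n-1 index.
songs = sample.drop_duplicates().reset_index(drop=True)

# Surrogate key, numbered from 1 (an arbitrary choice).
songs["song_id"] = range(1, len(songs) + 1)
print(songs)
```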


In [45]:
songs.merge(billboardMelted, on=['artist.inverted', 'track', 'time'])

Unnamed: 0,artist.inverted,track,time,year,genre,date.entered,date.peaked,week,rank
0,Destiny's Child,Independent Women Part I,3:38,2000,Rock,2000-09-23,2000-11-18,x1st.week,78.0
1,Destiny's Child,Independent Women Part I,3:38,2000,Rock,2000-09-23,2000-11-18,x2nd.week,63.0
2,Destiny's Child,Independent Women Part I,3:38,2000,Rock,2000-09-23,2000-11-18,x3rd.week,49.0
3,Destiny's Child,Independent Women Part I,3:38,2000,Rock,2000-09-23,2000-11-18,x4th.week,33.0
4,Destiny's Child,Independent Women Part I,3:38,2000,Rock,2000-09-23,2000-11-18,x5th.week,23.0
...,...,...,...,...,...,...,...,...,...
24087,Fragma,Toca's Miracle,3:22,2000,R&B,2000-10-28,2000-10-28,x72nd.week,
24088,Fragma,Toca's Miracle,3:22,2000,R&B,2000-10-28,2000-10-28,x73rd.week,
24089,Fragma,Toca's Miracle,3:22,2000,R&B,2000-10-28,2000-10-28,x74th.week,
24090,Fragma,Toca's Miracle,3:22,2000,R&B,2000-10-28,2000-10-28,x75th.week,
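
The merge above repeats every song attribute on every weekly row. If the song table carried a key, the weekly data could be reduced to a narrow rank table. The sketch below assumes a hypothetical `song_id` column on the song table and uses rows copied from the outputs above. The `validate` argument asks pandas to raise an error if a song key appears more than once in the song table.

```python
import pandas as pd

# Song table with a hypothetical surrogate key.
songs = pd.DataFrame({
    "song_id": [1, 2],
    "artist.inverted": ["Destiny's Child", "Santana"],
    "track": ["Independent Women Part I", "Maria, Maria"],
    "time": ["3:38", "4:18"],
})

# Weekly rows copied from the melted table.
melted = pd.DataFrame({
    "artist.inverted": ["Destiny's Child", "Destiny's Child", "Santana"],
    "track": ["Independent Women Part I", "Independent Women Part I", "Maria, Maria"],
    "time": ["3:38", "3:38", "4:18"],
    "week": ["x1st.week", "x2nd.week", "x1st.week"],
    "rank": [78.0, 63.0, 15.0],
})

keys = ["artist.inverted", "track", "time"]

# Attach song_id to each weekly row, then keep only the rank-level columns.
ranks = melted.merge(songs, on=keys, validate="many_to_one")[["song_id", "week", "rank"]]
print(ranks)
```

With this split, song attributes live in one table and weekly ranks in another, which matches the principle that each type of observational unit forms its own table.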
