## Tidying and Cleaning Messy Data in Python
    

Sources:

- https://www.jeannicholashould.com/tidy-data-in-python.html
- http://vita.had.co.nz/papers/tidy-data.pdf
- https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
- https://pandas.pydata.org/docs/reference/api/pandas.melt.html

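Datasets:

- `pew-raw.csv` (Pew Research Center: income and religion)
- `billboard.csv` (Billboard Top 100 weekly ranks)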

In [1]:
import pandas as pd
import seaborn as sns

## What is tidy data? 

Before we can analyze our data, we need to make sure that it is tidy. According to Hadley Wickham (http://vita.had.co.nz/papers/tidy-data.pdf), a dataset is messy or tidy depending on how its rows, columns and tables are matched up with observations, variables and types.

The data is tidy when:

- Each variable forms a column and contains values
- Each observation forms a row
- Each type of observational unit forms a table

An example of a messy dataset:

|**Name**   |**Treatment A**   |**Treatment B**|
|:----------:|:-----------------:|:--------------:|
| John | - | 2 |
| Jane | 16 | 11 |
| Mary | 3 | 1 |

An example of a tidy dataset:

|**Name**   |**Treatment**   |**Result**|
|:----------:|:-----------------:|:--------------:|
| John | a | - |
| Jane | a | 16 |
| Mary | a | 3 |
| John | b | 2 |
| Jane | b | 11 |
| Mary | b | 1 |
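
The step from the messy table to the tidy one can be sketched with `pd.melt`. This is a minimal illustration, not part of the original notebook: the frame name `messy` is made up, and the `-` entry for John is assumed to mean a missing value.

```python
import pandas as pd

# Messy layout from the example above: one column per treatment.
# Assumption: the "-" entry for John stands for a missing value.
messy = pd.DataFrame({
    "Name": ["John", "Jane", "Mary"],
    "Treatment A": [None, 16, 3],
    "Treatment B": [2, 11, 1],
})

# Unpivot the two treatment columns into (Treatment, Result) pairs.
tidy = pd.melt(
    messy,
    id_vars=["Name"],
    var_name="Treatment",
    value_name="Result",
)

# Keep only the treatment letter, lower-cased, to match the tidy table.
tidy["Treatment"] = (
    tidy["Treatment"].str.replace("Treatment ", "", regex=False).str.lower()
)
print(tidy)
```

Each row of `tidy` now holds one observation: a person, a treatment, and the result of that treatment.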

#### Example 1: Pew Research Center Dataset

For this section, we will use the Pew Research Center dataset (`pew-raw.csv`), which explores the relationship between income and religion.

We can see that the data is not tidy because the column headers are income brackets, which are values of a variable rather than variable names.

In a tidy version of this dataset, the income brackets would not be column headers but values of an `income` variable.

`pd.melt` performs this reshaping: https://pandas.pydata.org/docs/reference/api/pandas.melt.html

In [3]:
df = pd.read_csv('../pew-raw.csv')
df

Unnamed: 0,religion,<$10k,$10-20k,$20-30k,$30-40k,$40-50k,$50-75k
0,Agnostic,27,34,60,81,76,137
1,Atheist,12,27,37,52,35,70
2,Buddhist,27,21,30,34,33,58
3,Catholic,418,617,732,670,638,1116
4,Dont know/refused,15,14,15,11,10,35
5,Evangelical Prot,575,869,1064,982,881,1486
6,Hindu,1,9,7,9,11,34
7,Historically Black Prot,228,244,236,238,197,223
8,Jehovahs Witness,20,27,24,24,21,30
9,Jewish,19,19,25,25,30,95


In [4]:
df.columns

Index(['religion', ' <$10k', ' $10-20k', '$20-30k', '$30-40k', ' $40-50k',
       '$50-75k'],
      dtype='object')
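
Note that several of the labels printed above start with a space (for example `' <$10k'`). One possible way to normalize them, shown here only as a sketch on a hypothetical one-row frame called `demo`, is `Index.str.strip()`:

```python
import pandas as pd

# Hypothetical frame that reproduces only the column labels printed above.
cols = ['religion', ' <$10k', ' $10-20k', '$20-30k', '$30-40k', ' $40-50k', '$50-75k']
demo = pd.DataFrame([['Agnostic', 27, 34, 60, 81, 76, 137]], columns=cols)

# Index.str.strip() removes leading and trailing whitespace from each label.
demo.columns = demo.columns.str.strip()
print(list(demo.columns))
```

The notebook itself keeps the original labels, so the `pd.melt` call that follows spells them with their leading spaces. If the labels were stripped first, the `value_vars` list would need the stripped names as well.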

In [5]:
# id_vars = Column(s) to use as identifier variables.
# value_vars = Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
# var_name = Name to use for the ‘variable’ column. If None it uses ‘variable’. 
# value_name = Name to use for the ‘value’ column

# Some labels carry a leading space, exactly as printed by df.columns above.
tidydf = pd.melt(
    df,
    id_vars=['religion'],
    value_vars=[' <$10k', ' $10-20k', '$20-30k', '$30-40k', ' $40-50k', '$50-75k'],
    var_name='income',
    value_name='freq',
)

tidydf.head(10)

Unnamed: 0,religion,income,freq
0,Agnostic,<$10k,27
1,Atheist,<$10k,12
2,Buddhist,<$10k,27
3,Catholic,<$10k,418
4,Dont know/refused,<$10k,15
5,Evangelical Prot,<$10k,575
6,Hindu,<$10k,1
7,Historically Black Prot,<$10k,228
8,Jehovahs Witness,<$10k,20
9,Jewish,<$10k,19


In [6]:
# value_vars = If not specified, uses all columns that are not set as id_vars.

pd.melt(
    df,
    id_vars=['religion'],
    # value_vars = [' <$10k', ' $10-20k', '$20-30k', '$30-40k', ' $40-50k', '$50-75k'],
    var_name='income',
    value_name='freq',
).head()

Unnamed: 0,religion,income,freq
0,Agnostic,<$10k,27
1,Atheist,<$10k,12
2,Buddhist,<$10k,27
3,Catholic,<$10k,418
4,Dont know/refused,<$10k,15
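
One practical benefit of the long layout is that aggregations become a single `groupby`. The sketch below uses the first two rows and three income columns of the table shown earlier. The names `small`, `tidy_small`, `per_income` and `per_religion` are illustrative, and the column labels are written without the leading spaces for readability.

```python
import pandas as pd

# First two rows of the table shown earlier (three income columns only).
# Assumption: labels are written without the stray leading spaces.
small = pd.DataFrame({
    "religion": ["Agnostic", "Atheist"],
    "<$10k": [27, 12],
    "$10-20k": [34, 27],
    "$20-30k": [60, 37],
})

tidy_small = pd.melt(small, id_vars=["religion"], var_name="income", value_name="freq")

# With one row per (religion, income) pair, totals are a single groupby away.
per_income = tidy_small.groupby("income", sort=False)["freq"].sum()
per_religion = tidy_small.groupby("religion")["freq"].sum()
print(per_income)
print(per_religion)
```

Computing the same per-religion total on the wide table would require naming every income column explicitly.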


####  Example 2: Billboard Top 100 Dataset

This dataset records the weekly rank of each song from the week it enters the Billboard Top 100 through the following 75 weeks.

Problems:

- The column headers are composed of values: the week number (`x1st.week`, `x2nd.week`, etc.)


In [7]:
df = pd.read_csv('../billboard.csv')
df.sample(2)

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,x1st.week,x2nd.week,x3rd.week,...,x67th.week,x68th.week,x69th.week,x70th.week,x71st.week,x72nd.week,x73rd.week,x74th.week,x75th.week,x76th.week
102,2000,Avant,My First Love,4:28,Rock,11/4/00,12/16/00,70,62.0,56.0,...,,,,,,,,,,
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,8/5/00,10/14/00,57,47.0,45.0,...,,,,,,,,,,


In [8]:

tidydf = pd.melt(
    df,
    id_vars=['year', 'artist.inverted', 'track', 'time', 'genre', 'date.entered', 'date.peaked'],
    # value_vars omitted: all remaining (weekly) columns are unpivoted
    var_name='week',
    value_name='rank',
)

tidydf.head(5)

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,week,rank
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,9/23/00,11/18/00,x1st.week,78.0
1,2000,Santana,"Maria, Maria",4:18,Rock,2/12/00,4/8/00,x1st.week,15.0
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,10/23/99,1/29/00,x1st.week,71.0
3,2000,Madonna,Music,3:45,Rock,8/12/00,9/16/00,x1st.week,41.0
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,8/5/00,10/14/00,x1st.week,57.0


In [9]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html

# Raw string avoids an invalid-escape warning; expand=False returns a Series.
tidydf['week'] = tidydf['week'].str.extract(r'(\d+)', expand=False).astype(int)
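
To see what the extraction does in isolation, here is a small sketch on a hand-written Series of week labels (the name `weeks` is illustrative). The pattern `(\d+)` captures the first run of digits in each label.

```python
import pandas as pd

# A few week labels in the same style as the billboard column headers.
weeks = pd.Series(["x1st.week", "x2nd.week", "x3rd.week", "x76th.week"])

# With a single capture group, expand=False yields a Series of matched digits,
# which astype(int) then converts from strings to integers.
numbers = weeks.str.extract(r"(\d+)", expand=False).astype(int)
print(numbers.tolist())
```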

In [10]:
tidydf.head()

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,week,rank
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,9/23/00,11/18/00,1,78.0
1,2000,Santana,"Maria, Maria",4:18,Rock,2/12/00,4/8/00,1,15.0
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,10/23/99,1/29/00,1,71.0
3,2000,Madonna,Music,3:45,Rock,8/12/00,9/16/00,1,41.0
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,8/5/00,10/14/00,1,57.0
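
After melting, every track has one row per week column, including weeks in which it was no longer ranked (the `NaN` values visible in the wide table). A possible follow-up, sketched below on a hypothetical three-row frame `mini` with made-up ranks, is to drop those rows and derive a chart date from `date.entered` and `week`. This assumes the dates are in month/day/two-digit-year form and that week 1 corresponds to the entry date.

```python
import pandas as pd

# Hypothetical miniature of the melted frame: one made-up track, three weeks,
# with no rank recorded for the third week.
mini = pd.DataFrame({
    "track": ["Song A", "Song A", "Song A"],
    "date.entered": ["8/12/00", "8/12/00", "8/12/00"],
    "week": [1, 2, 3],
    "rank": [41.0, 23.0, None],
})

# Drop weeks in which the track was not ranked, then store rank as an integer.
mini = mini.dropna(subset=["rank"]).copy()
mini["rank"] = mini["rank"].astype(int)

# Chart date for each row: entry date plus (week - 1) weeks, expressed in days.
entered = pd.to_datetime(mini["date.entered"], format="%m/%d/%y")
mini["date"] = entered + pd.to_timedelta((mini["week"] - 1) * 7, unit="D")
print(mini)
```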
