# Tidying data for analysis

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Tidy-data" data-toc-modified-id="Tidy-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Tidy data</a></span></li><li><span><a href="#Pivoting-data" data-toc-modified-id="Pivoting-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Pivoting data</a></span></li><li><span><a href="#Beyond-melt-and-pivot" data-toc-modified-id="Beyond-melt-and-pivot-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Beyond melt and pivot</a></span></li></ul></div>

## Tidy data

- Tidy Data
    - “Tidy Data” paper by Hadley Wickham, PhD
    - Formalize the way we describe the shape of data
    - Gives us a goal when formatting our data
    - “Standard way to organize data values within a dataset”
- Principles of tidy data
    - Columns represent separate variables 
    - Rows represent individual observations 
    - Observational units form tables
- Converting to tidy data
    - Moves data from a shape that is better for reporting to one that is better for analysis
    - Tidy data makes it easier to fix common data problems
        - Columns containing values, instead of variables
    - Solution: pd.melt()
                 In [1]: pd.melt(frame=df, id_vars='name',value_vars=['treatment a', 'treatment b'])
                Out[1]:
                               name    variable       value
                            0 Daniel treatment a     _ 
                            1 John    treatment a    12 
                            2 Jane   treatment a    24 
                            3 Daniel treatment b    42 
                            4 John   treatment b    31 
                            5 Jane   treatment b    27
                 In [2]: pd.melt(frame=df, id_vars='name', value_vars=['treatment a', 'treatment b'], var_name='treatment', value_name='result')
                Out[2]:
                                name    treatment   result
                            0  Daniel  treatment a      _
                            1    John  treatment a     12
                            2    Jane  treatment a     24
                            3  Daniel  treatment b     42
                            4    John  treatment b     31
                            5    Jane  treatment b     27
        - id_vars
            - the columns you do not want to melt (they keep their current shape)
        - value_vars
            - the columns you do want to melt into rows
        - If value_vars is not provided, all columns not listed in id_vars are melted.
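
The `df` used in the melt example is not defined in these notes. A minimal sketch follows, assuming a three-row frame that mirrors the printed output, with Daniel's missing `treatment a` value (shown as `_`) stored as NaN. It also checks the last point above: omitting `value_vars` melts every column not named in `id_vars`.

```python
import pandas as pd

# Assumed reconstruction of the example frame; NaN stands in for the "_" entry.
df = pd.DataFrame({
    "name": ["Daniel", "John", "Jane"],
    "treatment a": [float("nan"), 12, 24],
    "treatment b": [42, 31, 27],
})

# Explicit value_vars, with custom names for the two new columns.
explicit = pd.melt(df, id_vars="name",
                   value_vars=["treatment a", "treatment b"],
                   var_name="treatment", value_name="result")

# No value_vars: every column not listed in id_vars is melted.
implicit = pd.melt(df, id_vars="name",
                   var_name="treatment", value_name="result")

print(explicit)
print("same result without value_vars:", explicit.equals(implicit))
```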

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import jupyterthemes.jtplot as jtplot
%matplotlib inline
jtplot.style(theme='onedork')

In [2]:
airquality = pd.read_csv('exercise/airquality.csv')
# Print the head of airquality
print(airquality.head())

# Melt airquality: airquality_melt
airquality_melt = pd.melt(frame=airquality, id_vars=['Month','Day'])

# Print the head of airquality_melt
print(airquality_melt.head())

# Melt airquality again, naming the variable and value columns: airquality_melt
airquality_melt = pd.melt(airquality, id_vars=['Month', 'Day'],
                          var_name='measurement', value_name='reading')

# Print the head of airquality_melt
print(airquality_melt.head())


   Ozone  Solar.R  Wind  Temp  Month  Day
0   41.0    190.0   7.4    67      5    1
1   36.0    118.0   8.0    72      5    2
2   12.0    149.0  12.6    74      5    3
3   18.0    313.0  11.5    62      5    4
4    NaN      NaN  14.3    56      5    5
   Month  Day variable  value
0      5    1    Ozone   41.0
1      5    2    Ozone   36.0
2      5    3    Ozone   12.0
3      5    4    Ozone   18.0
4      5    5    Ozone    NaN
   Month  Day measurement  reading
0      5    1       Ozone     41.0
1      5    2       Ozone     36.0
2      5    3       Ozone     12.0
3      5    4       Ozone     18.0
4      5    5       Ozone      NaN


## Pivoting data

- Pivot: un-melting data
    - Opposite of melting
    - Turns the unique values of a column into separate columns
    - Goes from an analysis-friendly shape to a reporting-friendly shape
    - Also fixes data that violates the tidy data principle that rows contain observations
        - Multiple variables stored in the same column
                In [1]: weather_tidy = weather.pivot(index='date',columns='element',values='value')
                In [2]: print(weather_tidy)
                element     tmax tmin
                date
                2010-01-30  27.8 14.5
                2010-02-02  27.3 14.4
    - Using pivot when you have duplicate entries
        - ValueError: Index contains duplicate entries, cannot reshape
- Pivot table
    - Has a parameter that specifies how to deal with duplicate values
    - Example: Can aggregate the duplicate values by taking their average
             In [5]: weather2_tidy = weather.pivot_table(values='value',index='date',columns='element',aggfunc=np.mean)
            Out[5]:
            element     tmax tmin
            date
            2010-01-30  27.8 14.5
            2010-02-02  27.3 15.4
    - index
        - The column(s) to keep as row labels (not pivoted)
    - columns
        - The column whose unique values become the new column names
    - values
        - The column whose values fill the new columns
    - aggfunc
        - Aggregation function applied when several rows map to the same cell (the default is the mean)
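
The difference between `.pivot()` and `.pivot_table()` on duplicate entries can be sketched with a tiny long-format frame. The `weather` data below is an assumption: the dates and most values are taken from the output above, while the second `tmin` reading of 16.4 for 2010-02-02 is invented to create a duplicate.

```python
import pandas as pd

# Assumed long-format data; the last row duplicates (2010-02-02, tmin).
weather = pd.DataFrame({
    "date": ["2010-01-30", "2010-01-30", "2010-02-02", "2010-02-02", "2010-02-02"],
    "element": ["tmax", "tmin", "tmax", "tmin", "tmin"],
    "value": [27.8, 14.5, 27.3, 14.4, 16.4],
})

# .pivot() cannot put two values into one cell, so it raises ValueError.
try:
    weather.pivot(index="date", columns="element", values="value")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print("pivot failed:", err)

# .pivot_table() aggregates the duplicates instead (here with the mean).
weather_tidy = weather.pivot_table(index="date", columns="element",
                                   values="value", aggfunc="mean")
print(weather_tidy)
```

Passing the string `"mean"` is equivalent to the `np.mean` used in these notes and avoids the NumPy import in a standalone script.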

In [3]:
# Print the head of airquality_melt
print(airquality_melt.head())

# Pivot airquality_melt: airquality_pivot
airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading')

# Print the head of airquality_pivot
print(airquality_pivot.head())
print("airquality_pivot's index is a MultiIndex: in airquality_pivot.head(), "
      "Month value 5 spans 5 different Day values")

   Month  Day measurement  reading
0      5    1       Ozone     41.0
1      5    2       Ozone     36.0
2      5    3       Ozone     12.0
3      5    4       Ozone     18.0
4      5    5       Ozone      NaN
measurement  Ozone  Solar.R  Temp  Wind
Month Day                              
5     1       41.0    190.0  67.0   7.4
      2       36.0    118.0  72.0   8.0
      3       12.0    149.0  74.0  12.6
      4       18.0    313.0  62.0  11.5
      5        NaN      NaN  56.0  14.3
airquality_pivot's index is a MultiIndex: in airquality_pivot.head(), Month value 5 spans 5 different Day values


In [4]:
# Print the index of airquality_pivot
print(airquality_pivot.index)

# Reset the index of airquality_pivot: airquality_pivot_reset
airquality_pivot_reset = airquality_pivot.reset_index()

# Print the new index of airquality_pivot_reset
print(airquality_pivot_reset.index)

# Print the head of airquality_pivot_reset
print(airquality_pivot_reset.head())


MultiIndex([(5,  1),
            (5,  2),
            (5,  3),
            (5,  4),
            (5,  5),
            (5,  6),
            (5,  7),
            (5,  8),
            (5,  9),
            (5, 10),
            ...
            (9, 21),
            (9, 22),
            (9, 23),
            (9, 24),
            (9, 25),
            (9, 26),
            (9, 27),
            (9, 28),
            (9, 29),
            (9, 30)],
           names=['Month', 'Day'], length=153)
RangeIndex(start=0, stop=153, step=1)
measurement  Month  Day  Ozone  Solar.R  Temp  Wind
0                5    1   41.0    190.0  67.0   7.4
1                5    2   36.0    118.0  72.0   8.0
2                5    3   12.0    149.0  74.0  12.6
3                5    4   18.0    313.0  62.0  11.5
4                5    5    NaN      NaN  56.0  14.3


In [5]:
# Pivot airquality_melt, stating the aggregation function explicitly: airquality_pivot
airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'],
                                               columns='measurement',
                                               values='reading', aggfunc=np.mean)

# Print the head of airquality_pivot before reset_index
print(airquality_pivot.head())

# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()

# Print the head of airquality_pivot
print(airquality_pivot.head())

# Print the head of airquality
print(airquality.head())

measurement  Ozone  Solar.R  Temp  Wind
Month Day                              
5     1       41.0    190.0  67.0   7.4
      2       36.0    118.0  72.0   8.0
      3       12.0    149.0  74.0  12.6
      4       18.0    313.0  62.0  11.5
      5        NaN      NaN  56.0  14.3
measurement  Month  Day  Ozone  Solar.R  Temp  Wind
0                5    1   41.0    190.0  67.0   7.4
1                5    2   36.0    118.0  72.0   8.0
2                5    3   12.0    149.0  74.0  12.6
3                5    4   18.0    313.0  62.0  11.5
4                5    5    NaN      NaN  56.0  14.3
   Ozone  Solar.R  Wind  Temp  Month  Day
0   41.0    190.0   7.4    67      5    1
1   36.0    118.0   8.0    72      5    2
2   12.0    149.0  12.6    74      5    3
3   18.0    313.0  11.5    62      5    4
4    NaN      NaN  14.3    56      5    5
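
The comparison printed above suggests that melting and then pivoting recovers the original table, apart from column order and the `Temp` column becoming float (67 becomes 67.0). A rough check of that round trip, assuming a three-row stand-in for `airquality.csv` built from the rows shown in the head, might look like this:

```python
import pandas as pd

# Assumed miniature of the air-quality table (rows copied from the printed head).
airquality = pd.DataFrame({
    "Ozone": [41.0, 36.0, float("nan")],
    "Solar.R": [190.0, 118.0, float("nan")],
    "Wind": [7.4, 8.0, 14.3],
    "Temp": [67, 72, 56],
    "Month": [5, 5, 5],
    "Day": [1, 2, 5],
})

# Wide -> long.
melted = airquality.melt(id_vars=["Month", "Day"],
                         var_name="measurement", value_name="reading")

# Long -> wide, then turn the MultiIndex back into ordinary columns.
restored = melted.pivot_table(index=["Month", "Day"],
                              columns="measurement",
                              values="reading").reset_index()
restored.columns.name = None  # drop the leftover 'measurement' axis label

# Compare after restoring the original column order; dtypes are not checked
# because Temp is stored as float after passing through the melted column.
pd.testing.assert_frame_equal(restored[airquality.columns], airquality,
                              check_dtype=False)
print(restored)
```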


## Beyond melt and pivot

- Melting and pivoting are basic tools 
- Another common problem:
    - Columns contain multiple bits of information
- Melting and parsing
    - tb.csv
        - Nothing inherently wrong about original data shape 
        - Not conducive for analysis
        - 'm014' column
            - represents males aged 0-14 years
            - parsing this value produces
                - a new column for gender
                - a new column for age_group
- Splitting a column with .split() and .get()
    - column names such as Cases_Guinea and Deaths_Guinea
        - cannot directly slice the variable by position
    - .split()
        - Python's built-in string method
            - on a pandas column, access the .str attribute before using .split()
        - 'Cases_Guinea'.split('_')
            - returns the list ['Cases', 'Guinea']
    - .get()
        - a pandas .str accessor method (not a built-in Python string method)
            - access the .str attribute before using .get()
        - extracts elements
            - by index
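
The two steps above can be tried in isolation on a small Series. This is only a sketch: the Series `names` is a hypothetical stand-in holding the two column names mentioned above, not part of the exercise data.

```python
import pandas as pd

# Hypothetical Series holding the combined "type_country" labels.
names = pd.Series(["Cases_Guinea", "Deaths_Guinea"])

# .str.split() applies Python's str.split to each element, giving a list per row.
parts = names.str.split("_")

# .str.get(i) picks the i-th element out of each list.
kind = parts.str.get(0)
country = parts.str.get(1)

print(pd.DataFrame({"name": names, "type": kind, "country": country}))
```

If the intermediate list column is not needed, `names.str.split("_", expand=True)` returns the pieces directly as separate columns.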
        


In [6]:
tb = pd.read_csv('exercise/tb.csv')
#print(tb.head())

# Melt tb: tb_melt
tb_melt = tb.melt(id_vars=['country', 'year'])

# Create the 'gender' column
tb_melt['gender'] = tb_melt.variable.str[0]

# Create the 'age_group' column
tb_melt['age_group'] = tb_melt.variable.str[1:]

# Print the head of tb_melt
print(tb_melt.head())


  country  year variable  value gender age_group
0      AD  2000     m014    0.0      m       014
1      AE  2000     m014    2.0      m       014
2      AF  2000     m014   52.0      m       014
3      AG  2000     m014    0.0      m       014
4      AL  2000     m014    2.0      m       014


In [7]:
ebola = pd.read_csv('exercise/ebola.csv')
#print(ebola.columns)

# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'],
                     var_name='type_country', value_name='counts')

# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt['type_country'].str.split('_')

# Create the 'type' column
ebola_melt['type'] = ebola_melt['str_split'].str.get(0)

# Create the 'country' column
ebola_melt['country'] = ebola_melt['str_split'].str.get(1)

# Print the head of ebola_melt
print(ebola_melt.head())

         Date  Day  type_country  counts        str_split   type country
0    1/5/2015  289  Cases_Guinea  2776.0  [Cases, Guinea]  Cases  Guinea
1    1/4/2015  288  Cases_Guinea  2775.0  [Cases, Guinea]  Cases  Guinea
2    1/3/2015  287  Cases_Guinea  2769.0  [Cases, Guinea]  Cases  Guinea
3    1/2/2015  286  Cases_Guinea     NaN  [Cases, Guinea]  Cases  Guinea
4  12/31/2014  284  Cases_Guinea  2730.0  [Cases, Guinea]  Cases  Guinea
