# Tidy data

# Reshaping your data using melt
Melting data is the process of turning columns of your data into rows of data. 

1. Print the head of airquality.
2. Use pd.melt() to melt the Ozone, Solar.R, Wind, and Temp columns of airquality into rows. Do this by setting id_vars to the column you do not wish to melt: 'Date'.
3. Print the head of airquality_melt.
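
As a rough illustration of what melting does to the shape of a table, here is a minimal sketch on a hypothetical three-row stand-in for airquality (the `wide` and `long` names are made up for this example, not part of the exercise):

```python
import pandas as pd

# Hypothetical three-row stand-in for the airquality data
wide = pd.DataFrame({
    "Date": ["5/1/1973", "5/2/1973", "5/3/1973"],
    "Ozone": [41.0, 36.0, 12.0],
    "Wind": [7.4, 8.0, 12.6],
})

# Keep 'Date' fixed; every other column is turned into (variable, value) rows
long = pd.melt(frame=wide, id_vars="Date")

print(long)
# 3 rows x 2 melted columns become 6 rows with columns Date, variable, value
print(long.shape)
```

Each melted column contributes one block of rows, so the row count of the result is the original row count times the number of melted columns.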

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
data = r'F:\Data Analysis\Springboard\Data Science Career Track\11.Cleaning Data in Python\Datasets\airquality1.csv'

airquality = pd.read_csv(data)

airquality.head()

Unnamed: 0,Ozone,Solar.R,Wind,Temp,Date
0,41.0,190.0,7.4,67,5/1/1973
1,36.0,118.0,8.0,72,5/2/1973
2,12.0,149.0,12.6,74,5/3/1973
3,18.0,313.0,11.5,62,5/4/1973
4,,,14.3,56,5/5/1973


In [3]:
airquality

Unnamed: 0,Ozone,Solar.R,Wind,Temp,Date
0,41.0,190.0,7.4,67,5/1/1973
1,36.0,118.0,8.0,72,5/2/1973
2,12.0,149.0,12.6,74,5/3/1973
3,18.0,313.0,11.5,62,5/4/1973
4,,,14.3,56,5/5/1973
...,...,...,...,...,...
148,30.0,193.0,6.9,70,9/26/1973
149,,145.0,13.2,77,9/27/1973
150,14.0,191.0,14.3,75,9/28/1973
151,18.0,131.0,8.0,76,9/29/1973


In [4]:
# Melt airquality: airquality_melt
airquality_melt = pd.melt(frame=airquality,  id_vars='Date')

# Print the head of airquality_melt
print(airquality_melt.head())

       Date variable  value
0  5/1/1973    Ozone   41.0
1  5/2/1973    Ozone   36.0
2  5/3/1973    Ozone   12.0
3  5/4/1973    Ozone   18.0
4  5/5/1973    Ozone    NaN


In [5]:
airquality_melt

Unnamed: 0,Date,variable,value
0,5/1/1973,Ozone,41.0
1,5/2/1973,Ozone,36.0
2,5/3/1973,Ozone,12.0
3,5/4/1973,Ozone,18.0
4,5/5/1973,Ozone,
...,...,...,...
607,9/26/1973,Temp,70.0
608,9/27/1973,Temp,77.0
609,9/28/1973,Temp,75.0
610,9/29/1973,Temp,76.0


# Customizing melted data

1. Melt the columns of airquality with the default variable column renamed to 'measurement' and the default value column renamed to 'reading'. You can do this by specifying, respectively, the var_name and value_name parameters.
2. Print the head of airquality_melt.
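
The renaming parameters can be combined with `value_vars` when only some columns should be melted. A small sketch on made-up data (the `wide` and `melted` names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical two-row sample with three measurement columns
wide = pd.DataFrame({
    "Date": ["5/1/1973", "5/2/1973"],
    "Ozone": [41.0, 36.0],
    "Solar.R": [190.0, 118.0],
    "Wind": [7.4, 8.0],
})

melted = pd.melt(
    wide,
    id_vars="Date",
    value_vars=["Ozone", "Wind"],  # melt only these; Solar.R is left out of the result
    var_name="measurement",        # replaces the default 'variable' column name
    value_name="reading",          # replaces the default 'value' column name
)

print(melted)
```

When `value_vars` is omitted, as in the exercise, every column not listed in `id_vars` is melted.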

In [6]:
# Melt airquality: airquality_melt
airquality_melt = pd.melt(airquality, id_vars='Date', var_name='measurement', value_name='reading')

# Print the head of airquality_melt
print(airquality_melt.head())

       Date measurement  reading
0  5/1/1973       Ozone     41.0
1  5/2/1973       Ozone     36.0
2  5/3/1973       Ozone     12.0
3  5/4/1973       Ozone     18.0
4  5/5/1973       Ozone      NaN


In [7]:
airquality_melt

Unnamed: 0,Date,measurement,reading
0,5/1/1973,Ozone,41.0
1,5/2/1973,Ozone,36.0
2,5/3/1973,Ozone,12.0
3,5/4/1973,Ozone,18.0
4,5/5/1973,Ozone,
...,...,...,...
607,9/26/1973,Temp,70.0
608,9/27/1973,Temp,77.0
609,9/28/1973,Temp,75.0
610,9/29/1973,Temp,76.0


# Pivoting data

Pivoting data is the opposite of melting it. Remember the tidy form that the airquality DataFrame was in before you melted it? You'll now begin pivoting it back into that form using the .pivot_table() method!

1. Pivot airquality_melt by using .pivot_table() with the rows indexed by 'Date', the columns indexed by 'measurement', and the values populated with 'reading'.
2. Print the head of airquality_pivot.
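
One detail worth knowing before reading the output: .pivot_table() sorts the row and column labels, and because 'Date' holds strings here, the rows come out in lexicographic rather than chronological order. The sketch below uses made-up readings to show this, along with one possible workaround (parsing the dates first), which is an assumption of this example and not part of the exercise:

```python
import pandas as pd

# Hypothetical long-format data; the readings are illustrative only
long = pd.DataFrame({
    "Date": ["5/2/1973", "5/10/1973", "5/2/1973", "5/10/1973"],
    "measurement": ["Wind", "Wind", "Ozone", "Ozone"],
    "reading": [8.0, 8.6, 36.0, 20.0],
})

# One row per Date, one column per measurement
wide = long.pivot_table(index="Date", columns="measurement", values="reading")
print(wide)  # '5/10/1973' sorts before '5/2/1973' because the labels are strings

# Possible workaround: parse the dates so the index sorts chronologically
chrono = long.assign(Date=pd.to_datetime(long["Date"], format="%m/%d/%Y"))
wide_chrono = chrono.pivot_table(index="Date", columns="measurement", values="reading")
print(wide_chrono)
```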

In [8]:
# Pivot airquality_melt: airquality_pivot
airquality_pivot = airquality_melt.pivot_table(index='Date', columns='measurement', values='reading')

# Print the head of airquality_pivot
print(airquality_pivot.head())

measurement  Ozone  Solar.R  Temp  Wind
Date                                   
5/1/1973      41.0    190.0  67.0   7.4
5/10/1973      NaN    194.0  69.0   8.6
5/11/1973      7.0      NaN  74.0   6.9
5/12/1973     16.0    256.0  69.0   9.7
5/13/1973     11.0    290.0  66.0   9.2


In [9]:
airquality_pivot

measurement,Ozone,Solar.R,Temp,Wind
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
5/1/1973,41.0,190.0,67.0,7.4
5/10/1973,,194.0,69.0,8.6
5/11/1973,7.0,,74.0,6.9
5/12/1973,16.0,256.0,69.0,9.7
5/13/1973,11.0,290.0,66.0,9.2
...,...,...,...,...
9/5/1973,47.0,95.0,87.0,7.4
9/6/1973,32.0,92.0,84.0,15.5
9/7/1973,20.0,252.0,80.0,10.9
9/8/1973,23.0,220.0,78.0,10.3


# Resetting the index of a DataFrame
There's a simple way to turn the pivoted DataFrame back into a flat one with 'Date' as an ordinary column: the .reset_index() method.

1. Print the index of airquality_pivot by accessing its .index attribute. This has been done for you.
2. Reset the index of airquality_pivot using its .reset_index() method.
3. Print the new index of airquality_pivot_reset.
4. Print the head of airquality_pivot_reset.
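
The effect of .reset_index() can be checked on a tiny hand-built frame that mimics a pivoted result. This is a sketch under assumptions: `pivoted`, `flat`, and `leftover` are hypothetical names, and clearing the column-axis label at the end is optional cleanup, not something the exercise asks for.

```python
import pandas as pd

# Hand-built stand-in for a pivoted frame: 'Date' is the index,
# and the column axis carries the name 'measurement'
pivoted = pd.DataFrame(
    {"Ozone": [41.0, 36.0], "Wind": [7.4, 8.0]},
    index=pd.Index(["5/1/1973", "5/2/1973"], name="Date"),
)
pivoted.columns.name = "measurement"

# Move 'Date' out of the index and back into a regular column
flat = pivoted.reset_index()
print(flat.index)    # a RangeIndex replaces the 'Date' index
print(flat.columns)  # 'Date' is now the first column

# The column axis keeps its 'measurement' label after the reset
leftover = flat.columns.name

# Optional cleanup: drop the leftover label
flat.columns.name = None
print(flat)
```

The leftover label is why 'measurement' still appears in the top-left corner of the printed output after the reset.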

In [10]:
# Print the index of airquality_pivot
print(airquality_pivot.index)

# Reset the index of airquality_pivot: airquality_pivot_reset
airquality_pivot_reset = airquality_pivot.reset_index()

# Print the new index of airquality_pivot_reset
print(airquality_pivot_reset.index)

# Print the head of airquality_pivot_reset
print(airquality_pivot_reset.head())

Index(['5/1/1973', '5/10/1973', '5/11/1973', '5/12/1973', '5/13/1973',
       '5/14/1973', '5/15/1973', '5/16/1973', '5/17/1973', '5/18/1973',
       ...
       '9/28/1973', '9/29/1973', '9/3/1973', '9/30/1973', '9/4/1973',
       '9/5/1973', '9/6/1973', '9/7/1973', '9/8/1973', '9/9/1973'],
      dtype='object', name='Date', length=153)
RangeIndex(start=0, stop=153, step=1)
measurement       Date  Ozone  Solar.R  Temp  Wind
0             5/1/1973   41.0    190.0  67.0   7.4
1            5/10/1973    NaN    194.0  69.0   8.6
2            5/11/1973    7.0      NaN  74.0   6.9
3            5/12/1973   16.0    256.0  69.0   9.7
4            5/13/1973   11.0    290.0  66.0   9.2


In [12]:
airquality_pivot_reset

measurement,Date,Ozone,Solar.R,Temp,Wind
0,5/1/1973,41.0,190.0,67.0,7.4
1,5/10/1973,,194.0,69.0,8.6
2,5/11/1973,7.0,,74.0,6.9
3,5/12/1973,16.0,256.0,69.0,9.7
4,5/13/1973,11.0,290.0,66.0,9.2
...,...,...,...,...,...
148,9/5/1973,47.0,95.0,87.0,7.4
149,9/6/1973,32.0,92.0,84.0,15.5
150,9/7/1973,20.0,252.0,80.0,10.9
151,9/8/1973,23.0,220.0,78.0,10.3


# Pivoting duplicate values

Let's say your data collection method accidentally duplicated your dataset. Such a dataset, in which each row is duplicated, has been pre-loaded as airquality_dup. In addition, the airquality_melt DataFrame from the previous exercise has been pre-loaded. Explore their shapes in the IPython Shell by accessing their .shape attributes to confirm that airquality_dup contains duplicate rows.

1. Pivot airquality_dup by using .pivot_table() with the rows indexed by 'Date', the columns indexed by 'measurement', and the values populated with 'reading'. Use np.mean for the aggregation function.
2. Print the head of airquality_pivot.
3. Flatten airquality_pivot by resetting its index.
4. Print the head of airquality_pivot and then the original airquality DataFrame to compare their structure.
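
Since airquality_dup is not available in this notebook, the idea can be sketched on a made-up frame that is duplicated by hand. The names `melted`, `dup`, and `pivot_failed` are hypothetical, and the string 'mean' is used as the aggregation function here in place of np.mean; the two are intended to be equivalent, and some recent pandas releases may warn when np.mean is passed directly.

```python
import pandas as pd

# Hypothetical long-format data with one row per (Date, measurement)
melted = pd.DataFrame({
    "Date": ["5/1/1973", "5/1/1973"],
    "measurement": ["Ozone", "Wind"],
    "reading": [41.0, 7.4],
})

# Simulate accidental duplication by stacking the frame on itself
dup = pd.concat([melted, melted], ignore_index=True)
print(melted.shape, dup.shape)  # the duplicated frame has twice as many rows

# Plain .pivot() cannot reshape when (index, columns) pairs repeat
try:
    dup.pivot(index="Date", columns="measurement", values="reading")
    pivot_failed = False
except ValueError:
    pivot_failed = True
print(pivot_failed)

# .pivot_table() aggregates the repeated entries instead
pivoted = dup.pivot_table(
    index="Date", columns="measurement", values="reading", aggfunc="mean"
)
print(pivoted)
```

Because every duplicate carries the same reading, averaging recovers the original values.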

In [None]:
# No data is available for airquality_dup, so running this code will raise an error

# Pivot airquality_dup: airquality_pivot
airquality_pivot = airquality_dup.pivot_table(index='Date', columns='measurement', values='reading', aggfunc=np.mean)

# Print the head of airquality_pivot before reset_index
print(airquality_pivot.head())

# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()

# Print the head of airquality_pivot
print(airquality_pivot.head())

# Print the head of airquality
print(airquality.head())

# Beyond melt() and pivot()


# Splitting a column with .str

1. Melt tb keeping 'country' and 'year' fixed.
2. Create a 'gender' column by slicing the first letter of the variable column of tb_melt.
3. Create an 'age_group' column by slicing the rest of the variable column of tb_melt.
4. Print the head of tb_melt. This has been done for you.
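
The slicing steps can be tried in isolation on a few column codes taken from the tb header. A minimal sketch, assuming a hypothetical `tb_toy` frame:

```python
import pandas as pd

# A few codes of the form <gender letter><age group>, as in the tb columns
tb_toy = pd.DataFrame({"variable": ["m014", "m1524", "f65", "fu"]})

# .str applies ordinary Python string indexing to every element
tb_toy["gender"] = tb_toy["variable"].str[0]      # first character
tb_toy["age_group"] = tb_toy["variable"].str[1:]  # everything after it

print(tb_toy)
```

This works only because the gender code is always exactly one character long; a variable-width prefix would need a delimiter or a pattern instead.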

In [13]:
data = r'F:\Data Analysis\Springboard\Data Science Career Track\11.Cleaning Data in Python\Datasets\tb.csv'

tb = pd.read_csv(data)

print(tb.head())

# Melt tb: tb_melt
tb_melt = pd.melt(frame=tb, id_vars=['country', 'year'])

# Create the 'gender' column
tb_melt['gender'] = tb_melt.variable.str[0]

# Create the 'age_group' column
tb_melt['age_group'] = tb_melt.variable.str[1:]

# Print the head of tb_melt
print(tb_melt.head())

  country  year  m014  m1524  m2534  m3544  m4554  m5564   m65  mu  f014  \
0      AD  2000   0.0    0.0    1.0    0.0    0.0    0.0   0.0 NaN   NaN   
1      AE  2000   2.0    4.0    4.0    6.0    5.0   12.0  10.0 NaN   3.0   
2      AF  2000  52.0  228.0  183.0  149.0  129.0   94.0  80.0 NaN  93.0   
3      AG  2000   0.0    0.0    0.0    0.0    0.0    0.0   1.0 NaN   1.0   
4      AL  2000   2.0   19.0   21.0   14.0   24.0   19.0  16.0 NaN   3.0   

   f1524  f2534  f3544  f4554  f5564   f65  fu  
0    NaN    NaN    NaN    NaN    NaN   NaN NaN  
1   16.0    1.0    3.0    0.0    0.0   4.0 NaN  
2  414.0  565.0  339.0  205.0   99.0  36.0 NaN  
3    1.0    1.0    0.0    0.0    0.0   0.0 NaN  
4   11.0   10.0    8.0    8.0    5.0  11.0 NaN  
  country  year variable  value gender age_group
0      AD  2000     m014    0.0      m       014
1      AE  2000     m014    2.0      m       014
2      AF  2000     m014   52.0      m       014
3      AG  2000     m014    0.0      m       014
4   

# Splitting a column with .split() and .get()

Another common way multiple variables are stored in columns is with a delimiter. You'll learn how to deal with such cases in this exercise, using a dataset consisting of Ebola cases and death counts by state and country. It has been pre-loaded into a DataFrame as ebola.

1. Melt ebola using 'Date' and 'Day' as the id_vars, 'type_country' as the var_name, and 'counts' as the value_name.
2. Create a column called 'str_split' by splitting the 'type_country' column of ebola_melt on '_'. Note that you will first have to access the str attribute of type_country before you can use .split().
3. Create a column called 'type' by using the .get() method to retrieve index 0 of the 'str_split' column of ebola_melt.
4. Create a column called 'country' by using the .get() method to retrieve index 1 of the 'str_split' column of ebola_melt.
5. Print the head of ebola_melt. This has been done for you.
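
The split-then-get steps above can be sketched on two column names from the ebola header. The `toy` and `parts` names are hypothetical, and the `expand=True` variant at the end is shown only as an alternative, not as the exercise's method:

```python
import pandas as pd

toy = pd.DataFrame({"type_country": ["Cases_Guinea", "Deaths_Liberia"]})

# Split each string on '_' to get a list per row
toy["str_split"] = toy["type_country"].str.split("_")

# .str.get(i) pulls element i out of each list
toy["type"] = toy["str_split"].str.get(0)
toy["country"] = toy["str_split"].str.get(1)
print(toy)

# Alternative: expand=True returns the pieces as separate columns 0 and 1
parts = toy["type_country"].str.split("_", expand=True)
print(parts)
```

The intermediate 'str_split' column is useful for inspection; with `expand=True` it is not needed.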


In [14]:
data = r'F:\Data Analysis\Springboard\Data Science Career Track\11.Cleaning Data in Python\Datasets\ebola.csv'

ebola = pd.read_csv(data)

print(ebola.head())

# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts')

# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')

# Create the 'type' column
ebola_melt['type'] = ebola_melt.str_split.str.get(0)

# Create the 'country' column
ebola_melt['country'] = ebola_melt.str_split.str.get(1)

# Print the head of ebola_melt
print(ebola_melt.head())

         Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone  \
0    1/5/2015  289        2776.0            NaN            10030.0   
1    1/4/2015  288        2775.0            NaN             9780.0   
2    1/3/2015  287        2769.0         8166.0             9722.0   
3    1/2/2015  286           NaN         8157.0                NaN   
4  12/31/2014  284        2730.0         8115.0             9633.0   

   Cases_Nigeria  Cases_Senegal  Cases_UnitedStates  Cases_Spain  Cases_Mali  \
0            NaN            NaN                 NaN          NaN         NaN   
1            NaN            NaN                 NaN          NaN         NaN   
2            NaN            NaN                 NaN          NaN         NaN   
3            NaN            NaN                 NaN          NaN         NaN   
4            NaN            NaN                 NaN          NaN         NaN   

   Deaths_Guinea  Deaths_Liberia  Deaths_SierraLeone  Deaths_Nigeria  \
0         1786.0          