# Common Messy Datasets

### Objectives
After this lesson you should be able to...
+ Explain what tidy data is
+ Spot messy data
+ Identify the type of messy data
+ Transform any messy dataset into a tidy data set
+ Use the **`str`** accessor methods to parse strings
+ Know why tidy data works and where it works best
+ Tidy real datasets

### Prepare for this lesson by...
+ Read Hadley Wickham's paper on [tidy data](http://vita.had.co.nz/papers/tidy-data.pdf)
+ Watch Hadley Wickham's talk on [tidy data](https://vimeo.com/33727555)
+ Watch Jeff Leek's video on [tidy data](https://www.youtube.com/watch?v=whDilsFoLVY)
+ Read the [reshaping pandas documentation page](http://pandas.pydata.org/pandas-docs/stable/reshaping.html)
+ Read the entire page [working with text data](http://pandas.pydata.org/pandas-docs/stable/text.html)

### Introduction
The previous notebook focused on one particular type of messy dataset: one where the column names are actually values rather than variable names. This was illustrated with a simple dataset of fruit counts. **`stack`** or **`melt`** will quickly tidy these basic datasets, but datasets often need more manipulation to become tidy.

In [2]:
import pandas as pd
import numpy as np

In [3]:
# looks so nice and clean!
df = pd.DataFrame(data=[[12, 10, 40], [9, 7, 12], [0, 14, 190]], 
                  columns=['Apple', 'Orange', 'Banana'],
                  index=['Ted', 'Penelope', 'Niko'])
df

Unnamed: 0,Apple,Orange,Banana
Ted,12,10,40
Penelope,9,7,12
Niko,0,14,190
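
As a reminder of that first case, here is a minimal sketch of tidying the fruit table with **`melt`**. The names `fruit`, `Name`, `Fruit` and `Count` are illustrative choices and do not come from the original data.

```python
import pandas as pd

# Same fruit-count table as above: the column names (fruits) are really values
fruit = pd.DataFrame(data=[[12, 10, 40], [9, 7, 12], [0, 14, 190]],
                     columns=['Apple', 'Orange', 'Banana'],
                     index=['Ted', 'Penelope', 'Niko'])

# Name the index and move it into a column so melt can use it as an identifier,
# then turn the three fruit columns into (Fruit, Count) pairs
fruit_tidy = (fruit.rename_axis('Name')
                   .reset_index()
                   .melt(id_vars='Name', var_name='Fruit', value_name='Count'))
print(fruit_tidy)
```

The result has one row per person and fruit, so each of the three variables (name, fruit, count) sits in its own column.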


### Most Common Messy Data Problems
Again, we will rely upon Hadley's paper to describe the most common problems that appear in messy datasets. 
1. Column headers are values, not variable names.
1. Multiple variables are stored in one column.
1. Variables are stored in both rows and columns.
1. Multiple types of observational units are stored in the same table.
1. A single observational unit is stored in multiple tables.

The first type of messy data was covered in the previous notebook. This notebook will cover examples of all but the last type.

# Multiple variables are stored in one column
A tidy data set needs values of a single variable stored in one column.

### Column names are values in the column
Column names appear directly as values in a single column, and the values of these variables are in another column.

Notice below how the **`Value`** column has both numeric and string data types and the **`Info`** column isn't a variable at all but column names.

In [4]:
df = pd.DataFrame(data={'Name': ['Ted', 'Penelope', 'Niko'] * 3,
                        'Info': ['Age'] * 3 + ['Salary'] * 3 + ['Hair Color'] * 3, 
                        'Value': [10, 15, 20, 3, 4, 5, 'Brown', 'Pink','Red']},
                 columns=['Name', 'Info', 'Value'])
df

Unnamed: 0,Name,Info,Value
0,Ted,Age,10
1,Penelope,Age,15
2,Niko,Age,20
3,Ted,Salary,3
4,Penelope,Salary,4
5,Niko,Salary,5
6,Ted,Hair Color,Brown
7,Penelope,Hair Color,Pink
8,Niko,Hair Color,Red


### The fix
This dataset is 'overly stacked', so pivoting it (which normally creates a messy dataset) will make it tidy. Both **`pivot`** and **`unstack`** will make this work.

In [5]:
df.pivot(index='Name', columns='Info', values='Value')

Info,Age,Hair Color,Salary
Name,,,
Niko,20,Red,5
Penelope,15,Pink,4
Ted,10,Brown,3


In [6]:
# can also use unstack
df.set_index(['Name', 'Info']).unstack()

,Value,Value,Value
Info,Age,Hair Color,Salary
Name,,,
Niko,20,Red,5
Penelope,15,Pink,4
Ted,10,Brown,3


In [7]:
# lots of extra nuisance names: Value, Info, Name still remain
# squeeze turns the one-column DataFrame into a Series so unstack doesn't create a MultiIndex in the columns
# rename_axis removes the 'Name' and 'Info' level names
df.set_index(['Name', 'Info'])\
  .squeeze()\
  .rename_axis([None, None])\
  .unstack()

Unnamed: 0,Age,Hair Color,Salary
Niko,20,Red,5
Penelope,15,Pink,4
Ted,10,Brown,3
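
One follow-up step is worth sketching. Because the original **`Value`** column mixed numbers and strings, every column of the pivoted result still has the **`object`** data type. The snippet below is a minimal sketch of restoring numeric types with **`astype`**; the variable names `wide` and `wide_typed` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Ted', 'Penelope', 'Niko'] * 3,
                   'Info': ['Age'] * 3 + ['Salary'] * 3 + ['Hair Color'] * 3,
                   'Value': [10, 15, 20, 3, 4, 5, 'Brown', 'Pink', 'Red']})

wide = df.pivot(index='Name', columns='Info', values='Value')
print(wide.dtypes)        # all columns are object after the pivot

# Convert only the columns that should be numeric; 'Hair Color' stays as text
wide_typed = wide.astype({'Age': 'int64', 'Salary': 'int64'})
print(wide_typed.dtypes)
```

With proper numeric types, operations such as `wide_typed['Age'].mean()` behave as expected.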


# Two or more values are stored in the same cell
Two or more values of the same variable, or of different variables, can be stored in the same cell of a DataFrame. These values are typically strings (formally the **`object`** data type in DataFrames), so you will need to parse the strings to extract the relevant variables.

### Parsing string columns with the `str` accessor to extract values
pandas provides a **`str`** accessor with dozens of methods, each able to extract a different piece of information from string columns. The following DataFrame packs quite a lot of information into the **`Geolocation`** column.

In [8]:
df = pd.DataFrame({'City':['Houston', 'Dallas', 'Austin'], 
                   'Geolocation':['(29.7604° N, 95.3698° W)', '32.7767° N, 96.7970° W', '30.2672° N, 97.7431° W']})
df

Unnamed: 0,City,Geolocation
0,Houston,"(29.7604° N, 95.3698° W)"
1,Dallas,"32.7767° N, 96.7970° W"
2,Austin,"30.2672° N, 97.7431° W"


The **`str`** accessor is available on Series (and Index) objects, but not on DataFrames. Most of its methods are self-explanatory. Let's see several examples on the **`City`** column.

In [9]:
# get the length of each value
df.City.str.len()

0    7
1    6
2    6
Name: City, dtype: int64

In [10]:
# make all uppercase
df.City.str.upper()

0    HOUSTON
1     DALLAS
2     AUSTIN
Name: City, dtype: object

In [11]:
# make title case after uppercase
df.City.str.upper().str.title()

0    Houston
1     Dallas
2     Austin
Name: City, dtype: object

In [12]:
# get the 4th character
df.City.str.get(3)

0    s
1    l
2    t
Name: City, dtype: object

In [13]:
# split strings by a letter and expand each split into its own column
df.City.str.split('s', expand=True)

Unnamed: 0,0,1
0,Hou,ton
1,Dalla,
2,Au,tin
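
To make clear where the accessor can be used, here is a small sketch. Besides Series, pandas also exposes **`str`** on Index objects (such as the column labels), while a DataFrame itself has no such accessor. The `State` column is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'City': ['Houston', 'Dallas', 'Austin'],
                   'State': ['TX', 'TX', 'TX']})  # 'State' is a hypothetical extra column

# Works: a single column is a Series
print(df['City'].str.lower())

# Also works: the column labels form an Index, which has its own .str accessor
upper_cols = df.columns.str.upper()
print(upper_cols)

# Does not work: a DataFrame has no .str accessor
try:
    df.str.lower()
except AttributeError as err:
    print('AttributeError:', err)

# To apply a string method to several columns, call .str on each column in turn
lowered = df.apply(lambda col: col.str.lower())
print(lowered)
```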


### Extracting coordinates
The **`Geolocation`** column has quite a lot of information packed into it. We will parse it into 4 separate variables.
+ latitude 
+ latitude direction
+ longitude
+ longitude direction

In [14]:
# strip off parentheses from ends
df.Geolocation.str.strip('()')

0    29.7604° N, 95.3698° W
1    32.7767° N, 96.7970° W
2    30.2672° N, 97.7431° W
Name: Geolocation, dtype: object

In [15]:
# split on a comma
geo_split = df.Geolocation.str.strip('()')\
              .str.split(',', expand=True)
geo_split

Unnamed: 0,0,1
0,29.7604° N,95.3698° W
1,32.7767° N,96.7970° W
2,30.2672° N,97.7431° W


In [16]:
# assign a variable to each column
lat = geo_split[0].str.split(' ', expand=True)
long = geo_split[1].str.split(' ', expand=True)

In [17]:
lat

Unnamed: 0,0,1
0,29.7604°,N
1,32.7767°,N
2,30.2672°,N


In [18]:
# give meaningful column names
lat.columns = ['latitude', 'latitude direction']
lat

Unnamed: 0,latitude,latitude direction
0,29.7604°,N
1,32.7767°,N
2,30.2672°,N


In [19]:
long

Unnamed: 0,0,1,2
0,,95.3698°,W
1,,96.7970°,W
2,,97.7431°,W


In [20]:
# the leading space after the comma produced an extra empty column. let's drop it
long = long.drop(0, axis=1)
long.columns = ['longitude', 'longitude direction']
long

Unnamed: 0,longitude,longitude direction
0,95.3698°,W
1,96.7970°,W
2,97.7431°,W
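
The empty column can also be avoided rather than dropped. The following is a minimal sketch that strips the surrounding whitespace before the second split; the names `geo` and `long_clean` are illustrative.

```python
import pandas as pd

geo = pd.Series(['(29.7604° N, 95.3698° W)',
                 '32.7767° N, 96.7970° W',
                 '30.2672° N, 97.7431° W'])

# Split on the comma as before
geo_split = geo.str.strip('()').str.split(',', expand=True)

# The longitude piece starts with a space, so strip it before splitting on ' '
long_clean = geo_split[1].str.strip().str.split(' ', expand=True)
long_clean.columns = ['longitude', 'longitude direction']
print(long_clean)
```

Stripping first yields exactly two columns, so no `drop` call is needed.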


In [21]:
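# use regex to replace non digit/decimals with nothing
# regex=True is required in pandas 2.0+, where str.replace treats the pattern literally by default
long['longitude'] = long.longitude.str.replace('[^0-9.]+', '', regex=True)
lat['latitude'] = lat.latitude.str.replace('[^0-9.]+', '', regex=True)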

In [22]:
long

Unnamed: 0,longitude,longitude direction
0,95.3698,W
1,96.7970,W
2,97.7431,W


In [23]:
lat

Unnamed: 0,latitude,latitude direction
0,29.7604,N
1,32.7767,N
2,30.2672,N


In [24]:
# data types are not right. Let's change latitude and longitude to numeric
long.dtypes

longitude              object
longitude direction    object
dtype: object

In [25]:
lat['latitude'] = pd.to_numeric(lat['latitude'])
long['longitude'] = pd.to_numeric(long['longitude'])
lat.dtypes

latitude              float64
latitude direction     object
dtype: object

In [26]:
# concatenate city column from original DataFrame with 
# two new transformed DataFrames
df_final = pd.concat([df['City'], lat, long], axis=1)
df_final

Unnamed: 0,City,latitude,latitude direction,longitude,longitude direction
0,Houston,29.7604,N,95.3698,W
1,Dallas,32.7767,N,96.797,W
2,Austin,30.2672,N,97.7431,W
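
The same result can be reached in fewer steps with a regular expression. The sketch below uses **`str.extract`** with named groups, which become the column names. The pattern is an assumption about how the coordinates are formatted, and the group names use underscores because regex group names cannot contain spaces.

```python
import pandas as pd

df = pd.DataFrame({'City': ['Houston', 'Dallas', 'Austin'],
                   'Geolocation': ['(29.7604° N, 95.3698° W)',
                                   '32.7767° N, 96.7970° W',
                                   '30.2672° N, 97.7431° W']})

# One capture group per variable; assumes the "<number>° <N|S>, <number>° <E|W>" layout
pattern = (r'(?P<latitude>[0-9.]+)° (?P<latitude_direction>[NS]), '
           r'(?P<longitude>[0-9.]+)° (?P<longitude_direction>[EW])')

# extract searches each string for the pattern, so the optional parentheses are ignored
coords = df['Geolocation'].str.extract(pattern)

# The captured numbers are still strings, so convert them to floats
coords = coords.astype({'latitude': float, 'longitude': float})

df_extracted = pd.concat([df['City'], coords], axis=1)
print(df_extracted)
```

Rows that do not match the pattern would come back as missing values, which makes malformed entries easy to spot.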


### Mini Summary of `str`
+ **`str`** is very powerful and works directly with text column data
+ **`str`** works with Series (and Index) objects, not with DataFrames
+ You will have to learn regular expressions to make **`str`** more useful
+ Messy datasets with multiple values in a single cell need **`str`** functionality to tidy them up
+ There is a separate notebook dedicated to regular expressions and the **`str`** accessor.

# Variables are stored in both rows and columns
A more difficult situation occurs when variables are stored down a column and across the column names. Pivoting and melting may have to be used together to make it tidy. Let's take a look at the example below. 

The **`Property`** column holds the names of variables, while the year column names are values of a single variable (the year). There are a few ways to tidy this dataset.

In [27]:
df = pd.read_csv('data/temp_flow_pressure.csv')
df

Unnamed: 0,Group,Property,2012,2013,2014,2015,2016
0,A,Pressure,928,873,814,973,870
1,A,Temperature,1026,1038,1009,1036,1042
2,A,Flow,819,806,861,882,856
3,B,Pressure,817,877,914,806,942
4,B,Temperature,1008,1041,1009,1002,1013
5,B,Flow,887,899,837,824,873


In [28]:
# melt the years and then pivot the columns
df_melt = df.melt(id_vars=['Group', 'Property'], 
                  value_vars=['2012', '2013', '2014', '2015', '2016'],
                  var_name='Year')
df_melt.head()

Unnamed: 0,Group,Property,Year,value
0,A,Pressure,2012,928
1,A,Temperature,2012,1026
2,A,Flow,2012,819
3,B,Pressure,2012,817
4,B,Temperature,2012,1008


In [29]:
# pivot_table accepts multiple columns in the index
# (pivot only accepts a list for index in pandas 1.1 and later)
# note that pivot_table aggregates duplicate entries, using the mean by default
df_tidy = df_melt.pivot_table(index=['Group', 'Year'], columns='Property', values='value')
df_tidy

,Property,Flow,Pressure,Temperature
Group,Year,,,
A,2012,819,928,1026
A,2013,806,873,1038
A,2014,861,814,1009
A,2015,882,973,1036
A,2016,856,870,1042
B,2012,887,817,1008
B,2013,899,877,1041
B,2014,837,914,1009
B,2015,824,806,1002
B,2016,873,942,1013


In [30]:
# get rid of level name and move out of the index as columns
df_tidy.rename_axis(None, axis=1).reset_index()

Unnamed: 0,Group,Year,Flow,Pressure,Temperature
0,A,2012,819,928,1026
1,A,2013,806,873,1038
2,A,2014,861,814,1009
3,A,2015,882,973,1036
4,A,2016,856,870,1042
5,B,2012,887,817,1008
6,B,2013,899,877,1041
7,B,2014,837,914,1009
8,B,2015,824,806,1002
9,B,2016,873,942,1013
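
On recent pandas versions the same reshaping can be done with **`pivot`** directly. This is a minimal sketch assuming pandas 1.1 or newer, where **`pivot`** accepts a list for `index`. The DataFrame is rebuilt inline from the table shown above so that the example runs without the CSV file.

```python
import pandas as pd

# Rebuilt from the displayed table; stands in for data/temp_flow_pressure.csv
df = pd.DataFrame({
    'Group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Property': ['Pressure', 'Temperature', 'Flow'] * 2,
    '2012': [928, 1026, 819, 817, 1008, 887],
    '2013': [873, 1038, 806, 877, 1041, 899],
    '2014': [814, 1009, 861, 914, 1009, 837],
    '2015': [973, 1036, 882, 806, 1002, 824],
    '2016': [870, 1042, 856, 942, 1013, 873],
})

# Melt the year columns into a single 'Year' column
df_melt = df.melt(id_vars=['Group', 'Property'], var_name='Year')

# Pivot 'Property' back out into columns, keyed by both Group and Year
df_tidy = (df_melt.pivot(index=['Group', 'Year'], columns='Property', values='value')
                  .rename_axis(None, axis=1)
                  .reset_index())
print(df_tidy)
```

Unlike **`pivot_table`**, **`pivot`** raises an error if a `Group`/`Year`/`Property` combination appears more than once, which can be a useful check that no values are being silently averaged.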


In [31]:
# the transformation is also possible with stack and unstack
df.set_index(['Group', 'Property'])\
  .stack()\
  .unstack('Property')\
  .rename_axis(['Group', 'Year'])\
  .rename_axis(None, axis=1)\
  .reset_index()

Unnamed: 0,Group,Year,Flow,Pressure,Temperature
0,A,2012,819,928,1026
1,A,2013,806,873,1038
2,A,2014,861,814,1009
3,A,2015,882,973,1036
4,A,2016,856,870,1042
5,B,2012,887,817,1008
6,B,2013,899,877,1041
7,B,2014,837,914,1009
8,B,2015,824,806,1002
9,B,2016,873,942,1013


# Problems

### Problem 1
<span  style="color:green; font-size:16px">Make the following dataset tidy by putting all the `HOUR` columns into a single column</span>

In [337]:
df = pd.read_csv('data/country_hour_price.csv')
df

Unnamed: 0,ASID,BORDER,HOUR1,HOUR2
0,21,GERMANY,2,3
1,32,FRANCE,2,3
2,99,ITALY,2,3
3,77,USA,4,5
4,66,CANADA,4,5
5,55,MEXICO,4,5
6,44,INDIA,6,7
7,88,CHINA,6,7
8,111,JAPAN,6,7
