# Tidying Data

Tidying data is the process of structuring a data set to facilitate analysis: first acknowledge that the data is not in the format you want, and then reorganize it.

## Standard Pandas Operations

### Database-style DataFrame Merges

Merge or join operations combine data sets by linking rows using one or more keys. These operations are central to relational databases. The merge function in pandas is the main entry point for using these algorithms on your data.

By default, pd.merge joins the two dataframes using the columns with overlapping names as keys.

In [2]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})
pd.merge(df1, df2) # Overlapping name: key

,data1,key,data2
0,0,b,1
1,1,b,1
2,6,b,1
3,2,a,0
4,4,a,0
5,5,a,0


If the key column has the same name in both dataframes, pass it with the on argument. If the names differ, pass both left_on and right_on.

In [4]:
pd.merge(df1, df2, on = 'key')

,data1,key,data2
0,0,b,1
1,1,b,1
2,6,b,1
3,2,a,0
4,4,a,0
5,5,a,0
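
The cell above only covers the on argument. As a minimal sketch of the left_on / right_on case, the following uses two small made-up dataframes whose key columns are named lkey and rkey (both names and all values are assumptions for illustration, not part of the original data).

```python
import pandas as pd

# Hypothetical frames whose key columns have different names (lkey / rkey)
left = pd.DataFrame({'lkey': ['b', 'a', 'c'], 'data1': [0, 1, 2]})
right = pd.DataFrame({'rkey': ['a', 'b', 'd'], 'data2': [10, 20, 30]})

# Inner join (the default): only keys present in both frames survive ('a' and 'b')
merged = pd.merge(left, right, left_on='lkey', right_on='rkey')
print(merged)
```

Note that both key columns are kept in the result, since pandas cannot assume they are interchangeable when their names differ.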


To specify the type of relational database join that you want, use the argument how.

In [5]:
pd.merge(df1, df2, on = 'key', how = 'outer') # other options include 'inner' (the default), 'left' and 'right'

,data1,key,data2
0,0.0,b,1.0
1,1.0,b,1.0
2,6.0,b,1.0
3,2.0,a,0.0
4,4.0,a,0.0
5,5.0,a,0.0
6,3.0,c,
7,,d,2.0
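
To make the difference between the join types concrete, this sketch rebuilds the same df1 and df2 and counts the rows each value of how produces. It also uses the indicator argument of pd.merge, which is not mentioned above, to label where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

# Number of rows produced by each join type
row_counts = {how: len(pd.merge(df1, df2, on='key', how=how))
              for how in ['inner', 'left', 'right', 'outer']}
print(row_counts)

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
outer = pd.merge(df1, df2, on='key', how='outer', indicator=True)
print(outer[['key', '_merge']].drop_duplicates())
```

The inner join yields 6 rows, the left and right joins 7 each (the unmatched key c or d is kept with missing values), and the outer join 8.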


Often you'll want to merge on the indexes of the dataframes. Use the arguments left_index and right_index, which are both booleans.

In [6]:
left2 = DataFrame([[1., 2.], [3., 4.], [5., 6.]], index=['a', 'c', 'e'], 
                  columns=['Ohio', 'Nevada'])
right2 = DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
                   index=['b', 'c', 'd', 'e'], columns=['Missouri', 'Alabama'])
pd.merge(left2, right2, how = 'inner', left_index = True, right_index = True)

,Ohio,Nevada,Missouri,Alabama
c,3.0,4.0,9.0,10.0
e,5.0,6.0,13.0,14.0
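
For index-on-index merges, the DataFrame.join method is a common shorthand. The sketch below rebuilds left2 and right2 and checks that the two spellings give the same table; the variable names by_merge and by_join are introduced here only for illustration.

```python
import pandas as pd

left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]], index=['a', 'c', 'e'],
                     columns=['Ohio', 'Nevada'])
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13., 14.]],
                      index=['b', 'c', 'd', 'e'], columns=['Missouri', 'Alabama'])

# Merge on both indexes, keeping only labels present in both frames
by_merge = pd.merge(left2, right2, how='inner', left_index=True, right_index=True)

# join aligns on the index; its default is how='left', so 'inner' is passed explicitly
by_join = left2.join(right2, how='inner')
print(by_join)
```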


### Concatenate along an axis

There are two axes: axis 0 runs vertically along the rows (the index), and axis 1 runs horizontally along the columns. Concatenating along axis 0 stacks the objects on top of each other; concatenating along axis 1 places them side by side. There are two things to decide when concatenating: what to do with the labels on the other axis that do not overlap, and whether you want to be able to tell which part of the result came from which input. The join argument handles the first ('outer' keeps the union of the labels, 'inner' keeps the intersection); the keys argument handles the second by building a hierarchical index.

In [8]:
df1 = DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'], columns=['one', 'two'])
df2 = DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'], columns=['three', 'four'])
pd.concat([df1, df2], join = 'inner', axis=1, keys=['level1', 'level2']) # add columns

,level1,level1,level2,level2
,one,two,three,four
a,0,1,5,6
c,4,5,7,8
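
The cell above shows join = 'inner' along axis 1. As a sketch of the other combinations, the following rebuilds the same two dataframes and concatenates them with an outer join along axis 1 and then along axis 0 with keys. The names wide and tall are illustrative only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])

# axis=1 with an outer join: every row label is kept, and row 'b'
# gets NaN in the columns that come from df2
wide = pd.concat([df1, df2], axis=1, join='outer')
print(wide)

# axis=0 with keys: rows are stacked, and an outer index level
# records which input each row came from
tall = pd.concat([df1, df2], axis=0, keys=['level1', 'level2'])
print(tall)
```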


## Reshaping and pivoting

There are a number of fundamental operations for rearranging tabular data. These are variously referred to as reshape or pivot operations.

### With hierarchical index

- stack: rotates the columns into the rows, producing a new innermost index level.
- unstack: rotates an index level (the innermost by default) into the columns.

In [9]:
data = DataFrame(np.arange(6).reshape((2, 3)),
                 index=pd.Index(['Ohio', 'Colorado'], name='state'),
                 columns=pd.Index(['one', 'two', 'three'], name='number'))
data

number,one,two,three
state,,,
Ohio,0,1,2
Colorado,3,4,5


In [10]:
data.stack()

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64

In [11]:
data.stack().unstack()

number,one,two,three
state,,,
Ohio,0,1,2
Colorado,3,4,5


In [12]:
data.stack().unstack('state')

state,Ohio,Colorado
number,,
one,0,3
two,1,4
three,2,5
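
The level to unstack can be given by name or by position. This sketch rebuilds the same small dataframe and shows both forms; the names stacked, by_name and by_pos are introduced only for this example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))

# Series indexed by the two levels (state, number)
stacked = data.stack()

# Select the level to move into the columns by name ...
by_name = stacked.unstack('number')

# ... or by position: level 0 is 'state'
by_pos = stacked.unstack(0)

print(by_name)
print(by_pos)
```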


### Reshape: Pivot and Melt

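pivot and melt reshape a table using ordinary columns rather than the index: melt goes from wide to long format, and pivot goes from long back to wide.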

In [7]:
import pandas_datareader.data as web
start = '2007-01-01'
end = '2015-01-01'
symbols = ['SPY', 'TLT', 'MSFT']
def get_px(symbol):
    # Download the adjusted close price series for one ticker
    return web.DataReader(symbol, 'yahoo', start=start, end=end)['Adj Close']
# Wide table: one column of adjusted close prices per symbol, indexed by Date
data = pd.DataFrame({sym: get_px(sym) for sym in symbols})
data = data.reset_index()
# Long table: one row per (Date, symbol) pair
data2 = pd.melt(data, id_vars='Date', var_name='Index', value_name='Value')
data2.iloc[[100, 2000, 5000]]

,Date,Index,Value
100,2007-05-29,MSFT,24.372302
2000,2014-12-11,MSFT,44.693488
5000,2010-11-10,TLT,80.580022


In [9]:
data2.set_index(['Date', 'Index']).unstack(1).head(5)

,Value,Value,Value
Index,MSFT,SPY,TLT
Date,,,
2007-01-03,23.478417,114.809403,63.358304
2007-01-04,23.439102,115.053042,63.742467
2007-01-05,23.305433,114.135342,63.465017
2007-01-08,23.533456,114.663228,63.578846
2007-01-09,23.557044,114.565777,63.578846


In [13]:
data3 = data2.pivot(index = 'Date', columns = 'Index', values = 'Value')
data3.head(5)

Index,MSFT,SPY,TLT
Date,,,
2007-01-03,23.478417,114.809403,63.358304
2007-01-04,23.439102,115.053042,63.742467
2007-01-05,23.305433,114.135342,63.465017
2007-01-08,23.533456,114.663228,63.578846
2007-01-09,23.557044,114.565777,63.578846


In [18]:
pd.melt(data3.reset_index(), id_vars = 'Date', var_name = 'index', value_name = 'value').head(5)

,Date,index,value
0,2007-01-03,MSFT,23.478417
1,2007-01-04,MSFT,23.439102
2,2007-01-05,MSFT,23.305433
3,2007-01-08,MSFT,23.533456
4,2007-01-09,MSFT,23.557044
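
The cells above depend on downloading prices from an external source. As a self-contained sketch of the same melt and pivot round trip, the following uses a tiny hand-made table; the symbols AAA and BBB and all the values are made up for illustration.

```python
import pandas as pd

# Hypothetical wide table: one column per symbol (made-up values)
wide = pd.DataFrame({
    'Date': ['2020-01-01', '2020-01-02'],
    'AAA': [1.0, 2.0],
    'BBB': [10.0, 20.0],
})

# Wide to long: one row per (Date, symbol) pair
long = pd.melt(wide, id_vars='Date', var_name='Index', value_name='Value')
print(long)

# Long to wide: Date becomes the index, each symbol becomes a column
back = long.pivot(index='Date', columns='Index', values='Value')
print(back)
```

pivot needs each (index, columns) pair to appear only once and raises an error otherwise. When pairs repeat, pivot_table with an aggregation function is the usual alternative.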


## Others

To create a new variable from the values of an existing one, map those values through a dictionary.

In [19]:
data = DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami',
                           'corned beef', 'Bacon', 'pastrami', 'honey ham','nova lox'],
                  'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


In [24]:
meat_to_animal = { 'bacon': 'pig', 'pulled pork': 'pig', 'pastrami': 'cow', 
                  'corned beef': 'cow', 'honey ham': 'pig', 'nova lox': 'salmon'}
data['animal'] = data['food'].map(str.lower).map(meat_to_animal)
data

,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon
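
One detail worth knowing when mapping through a dictionary: values with no matching key become missing (NaN). The sketch below illustrates this with a shortened dictionary and a made-up unmapped food, 'tofu'. It also uses the .str.lower() accessor, an alternative spelling of map(str.lower).

```python
import pandas as pd

# Shortened mapping; 'tofu' below is a made-up value with no entry in it
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}
food = pd.Series(['Bacon', 'pastrami', 'tofu'])

# Normalize case first, then look each value up in the dictionary
animal = food.str.lower().map(meat_to_animal)
print(animal)

# Unmapped values come back as NaN; fillna supplies a default label
animal_filled = animal.fillna('unknown')
print(animal_filled)
```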
