# Join, Combine and Reshape



## Hierarchical Indexing

- Enables you to have multiple (two or more) index levels on an axis. 
- It provides a way for you to work with higher-dimensional data in a lower-dimensional form.

In [43]:

import pandas as pd
import numpy as np

In [44]:
data = pd.Series(np.random.randn(9), index=[['a','a','a','b','b','c','c','d','d'],[1,2,3,1,3,1,2,2,3]])
data

a  1   -2.198675
   2    0.822389
   3   -0.274280
b  1    0.997504
   3   -1.135659
c  1    0.466063
   2    0.524727
d  2    1.153080
   3   -0.137732
dtype: float64

In [45]:
data.index

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )

In [46]:
# With a hierarchically indexed object, partial indexing is possible, 
# enabling you to concisely select subsets of the data

data['b']

1    0.997504
3   -1.135659
dtype: float64

In [47]:
data['b':'c']

b  1    0.997504
   3   -1.135659
c  1    0.466063
   2    0.524727
dtype: float64

In [48]:
data.loc[['b', 'd']]

b  1    0.997504
   3   -1.135659
d  2    1.153080
   3   -0.137732
dtype: float64

In [49]:
# selection is also possible on the inner level 

data.loc[:,2]

a    0.822389
c    0.524727
d    1.153080
dtype: float64

Hierarchical indexing plays an important role in reshaping data and group-based operations like forming a pivot table. 

For example, you could rearrange the data into a DataFrame using the ```unstack``` method.

In [50]:
data.unstack() # note how the second index level now runs along axis 1 (the columns)

Unnamed: 0,1,2,3
a,-2.198675,0.822389,-0.27428
b,0.997504,,-1.135659
c,0.466063,0.524727,
d,,1.15308,-0.137732


In [51]:
# the inverse operation of unstack is stack

data.unstack().stack()

a  1   -2.198675
   2    0.822389
   3   -0.274280
b  1    0.997504
   3   -1.135659
c  1    0.466063
   2    0.524727
d  2    1.153080
   3   -0.137732
dtype: float64

In [52]:
# with a DataFrame, either axis can have a hierarchical index

frame = pd.DataFrame(np.arange(12).reshape((4,3)), 
    index=[['a','a','b','b'],[1,2,1,2]],
    columns=[['ohio','ohio','colorado'],['Green', 'Red', 'Green']])

frame

Unnamed: 0_level_0,Unnamed: 1_level_0,ohio,ohio,colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [53]:
# The different levels can have names assigned

frame.index.names = ['key1', 'key2']

frame.columns.names = ['state', 'color']

frame

Unnamed: 0_level_0,state,ohio,ohio,colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [54]:
# with partial indexing you can select groups of columns

frame['ohio']

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,0,1
a,2,3,4
b,1,6,7
b,2,9,10


### Reordering and Sorting levels

```swaplevel``` takes two level numbers or names and returns a new object with the levels interchanged.

In [55]:
frame.swaplevel('key1','key2')

Unnamed: 0_level_0,state,ohio,ohio,colorado
Unnamed: 0_level_1,color,Green,Red,Green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
2,a,3,4,5
1,b,6,7,8
2,b,9,10,11


```sort_index``` can sort the data using only the values in a single level when you pass the ```level``` argument. 
When swapping levels, it is not uncommon to also use ```sort_index``` so that the result is lexicographically sorted by the indicated level. 

In [56]:
frame.sort_index(level=1) # note it sorts on key2

Unnamed: 0_level_0,state,ohio,ohio,colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
b,1,6,7,8
a,2,3,4,5
b,2,9,10,11


In [57]:
frame.swaplevel(0,1).sort_index(level=0)

Unnamed: 0_level_0,state,ohio,ohio,colorado
Unnamed: 0_level_1,color,Green,Red,Green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
1,b,6,7,8
2,a,3,4,5
2,b,9,10,11


### Summary statistics by Level

You can specify the level you want to aggregate by on a particular axis.
We can aggregate by either rows or columns.

In [58]:
# frame.sum(level='key2') # is deprecated

 Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.sum(level=1) should use df.groupby(level=1).sum().
  frame.sum(level='key2')

In [59]:
frame.groupby(level='key2').sum() # use this instead

state,ohio,ohio,colorado
color,Green,Red,Green
key2,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,6,8,10
2,12,14,16


In [60]:
frame.groupby(level='color', axis=1).sum()

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,2,1
a,2,8,4
b,1,14,7
b,2,20,10
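
As far as I know, recent pandas releases also warn about passing ```axis=1``` to ```groupby```. A minimal sketch of one way to get the same column-wise aggregation, assuming a pandas version where transposing is the suggested workaround: group the transposed frame, then transpose back. The name ```by_color``` is only for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the hierarchically indexed frame used in this section
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['ohio', 'ohio', 'colorado'], ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Transpose so the column levels become row levels, aggregate by 'color',
# then transpose back to the original orientation
by_color = frame.T.groupby(level='color').sum().T
print(by_color)
```

The result should match the ```axis=1``` output shown above: one column per color, with the original row index preserved.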


### Indexing with a DataFrame's columns

It is not unusual to want to use one or more columns as the row index, or to move the row index into the columns. 

In [61]:
frame = pd.DataFrame({'a': range(7), 
                    'b': range(7,0,-1),
                    'c': ['one','one','one','two','two','two','two'],
                    'd':[0,1,2,0,1,2,3]})

frame

Unnamed: 0,a,b,c,d
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


In [62]:
# the set_index function will create a new DataFrame using one or more of its columns as the index

frame2 = frame.set_index(['c', 'd'])
frame2

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0,0,7
one,1,1,6
one,2,2,5
two,0,3,4
two,1,4,3
two,2,5,2
two,3,6,1


In [63]:
# by default the columns are removed from the DataFrame, though you can leave them in with drop=False

frame.set_index(['c','d'], drop=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b,c,d
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
one,0,0,7,one,0
one,1,1,6,one,1
one,2,2,5,one,2
two,0,3,4,two,0
two,1,4,3,two,1
two,2,5,2,two,2
two,3,6,1,two,3


In [64]:
# reset_index does the opposite of set_index

frame2.reset_index()

Unnamed: 0,c,d,a,b
0,one,0,0,7
1,one,1,1,6
2,one,2,2,5
3,two,0,3,4
4,two,1,4,3
5,two,2,5,2
6,two,3,6,1


## Combining and Merging Datasets

- ```pandas.merge``` connects rows in DataFrames based on one or more keys. This will be familiar to SQL users, as it implements database join operations.

- ```pandas.concat``` concatenates or 'stacks' together objects along an axis.

- The ```combine_first``` method enables splicing together overlapping data to fill in missing values in one object with values from another.
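
Since ```combine_first``` is not demonstrated in this part of the notes, here is a minimal sketch of what it does. The two Series and their values are made up for illustration; the missing entries in the first are filled from the second, while existing values are kept.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'a' has gaps, 'b' has a value for every label
a = pd.Series([np.nan, 2.5, np.nan], index=['x', 'y', 'z'])
b = pd.Series([1.0, 9.9, 3.0], index=['x', 'y', 'z'])

# Values from 'a' win where present; gaps are patched from 'b'
patched = a.combine_first(b)
print(patched)
```

Note that the value at 'y' stays 2.5, because ```combine_first``` only reaches into the second object where the first is missing.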

### Database-Style DataFrame Joins

In [65]:
df1 = pd.DataFrame({'key':['b','b','a','c','a','a','b'],'data':range(7)})
df1

Unnamed: 0,key,data
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,a,5
6,b,6


In [66]:
df2 = pd.DataFrame({'key':['a','b','d'],'data2':range(3)})
df2

Unnamed: 0,key,data2
0,a,0
1,b,1
2,d,2


In [67]:
# example of many to one join (joined on 'a' and 'b')

pd.merge(df1,df2) # note that 'c' and 'd' do not appear in the result

Unnamed: 0,key,data,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0


Note that I didn't specify which column to join on. If that information is not specified, ```merge``` uses the overlapping column names as the keys. It's good practice to specify the key explicitly: 

In [68]:
pd.merge(df1,df2,on='key')

Unnamed: 0,key,data,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0


In [69]:
# if the column names are different you can specify them separately

df3 = pd.DataFrame({'lkey':['b','b','a','c','a','a','b'],'data':range(7)})

df4 = pd.DataFrame({'rkey':['a','b','d'],'data2':range(3)})

In [70]:
pd.merge(df3,df4,left_on='lkey', right_on='rkey')

Unnamed: 0,lkey,data,rkey,data2
0,b,0,b,1
1,b,1,b,1
2,b,6,b,1
3,a,2,a,0
4,a,4,a,0
5,a,5,a,0


You may notice that the 'c' and 'd' values and associated data are missing from the result.
By default ```merge``` does an 'inner' join; the keys in the result are the intersection, or the common set found in both tables.

Other possible options are 'left', 'right', and 'outer'. 

- The outer join takes the union of the keys, combining the effect of applying both left and right joins.

In [72]:
pd.merge(df1,df2, how='outer') # note that 'c' and 'd' are included

Unnamed: 0,key,data,data2
0,b,0.0,1.0
1,b,1.0,1.0
2,b,6.0,1.0
3,a,2.0,0.0
4,a,4.0,0.0
5,a,5.0,0.0
6,c,3.0,
7,d,,2.0


Different join types with the ```how``` argument

Option | Behavior
-------|---------
'inner' | Use only the key combinations observed in ***both*** tables
'left' | Use all key combinations found in the left table
'right' | Use all key combinations found in the right table
'outer' | Use all key combinations observed in both tables together
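
To make the table concrete, here is a small sketch that runs all four join types on the first ```df1``` and ```df2``` from this section and records which keys survive. The dictionary name ```keys_by_how``` is just for illustration.

```python
import pandas as pd

# The first df1/df2 pair from this section
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

keys_by_how = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(df1, df2, on='key', how=how)
    # Record the distinct keys that appear in the result for this join type
    keys_by_how[how] = sorted(merged['key'].unique())
    print(how, len(merged), keys_by_how[how])
```

'c' only exists in the left table and 'd' only in the right, so they show up only for the join types that keep unmatched keys from that side.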

In [74]:
# Many-to-many merges have well-defined, though not intuitive, behavior
df1 = pd.DataFrame({'key':['b','b','a','c','a','b'],'data1':range(6)})

df2 = pd.DataFrame({'key':['a','b','a','b','d'],'data2':range(5)})

In [75]:
df1

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [76]:
df2

Unnamed: 0,key,data2
0,a,0
1,b,1
2,a,2
3,b,3
4,d,4


In [77]:
pd.merge(df1,df2, on='key', how='left')

Unnamed: 0,key,data1,data2
0,b,0,1.0
1,b,0,3.0
2,b,1,1.0
3,b,1,3.0
4,a,2,0.0
5,a,2,2.0
6,c,3,
7,a,4,0.0
8,a,4,2.0
9,b,5,1.0
10,b,5,3.0


Many-to-many joins form the Cartesian product of the matching rows. 
Since there were three 'b' rows in the left DataFrame and two in the right one, there are six 'b' rows in the result. 
The join method passed to ```how``` only affects the distinct key values appearing in the result. 
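
The Cartesian-product rule can be checked with a quick sketch: multiply the per-key row counts of the two frames and compare with the counts in the merged result. The helper names (```left_counts```, ```right_counts```, ```expected```, ```actual```) are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})

# Rows per key on each side
left_counts = df1['key'].value_counts()
right_counts = df2['key'].value_counts()

# Keys present on only one side become NaN in the product and are dropped,
# which mirrors what an inner join keeps
expected = (left_counts * right_counts).dropna().astype(int)

merged = pd.merge(df1, df2, on='key', how='inner')
actual = merged['key'].value_counts()

print(expected.sort_index())
print(actual.sort_index())
```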

In [78]:
pd.merge(df1,df2,how='inner')

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,0,3
2,b,1,1
3,b,1,3
4,b,5,1
5,b,5,3
6,a,2,0
7,a,2,2
8,a,4,0
9,a,4,2


In [79]:
# to merge with multiple keys, pass a list of column names

left = pd.DataFrame({'key1':['foo','foo','bar'],'key2':['one','two','one'], 'lval':[1,2,3]})

right = pd.DataFrame({'key1':['foo','foo','bar','bar'],'key2':['one','one','one', 'two'], 'rval':[4,5,6,7]})

In [81]:
left

Unnamed: 0,key1,key2,lval
0,foo,one,1
1,foo,two,2
2,bar,one,3


In [82]:
right

Unnamed: 0,key1,key2,rval
0,foo,one,4
1,foo,one,5
2,bar,one,6
3,bar,two,7


In [80]:
pd.merge(left, right, on=['key1','key2'], how='outer') # note that rows 2 and 4 contain NaN values

Unnamed: 0,key1,key2,lval,rval
0,foo,one,1.0,4.0
1,foo,one,1.0,5.0
2,foo,two,2.0,
3,bar,one,3.0,6.0
4,bar,two,,7.0


In [83]:
# when the two frames share non-key column names, merge appends a suffix to each (default '_x' and '_y');
# the 'suffixes' option specifies the strings to append to overlapping names in the left and right DataFrame objects

pd.merge(left, right, on='key1') 

Unnamed: 0,key1,key2_x,lval,key2_y,rval
0,foo,one,1,one,4
1,foo,one,1,one,5
2,foo,two,2,one,4
3,foo,two,2,one,5
4,bar,one,3,one,6
5,bar,one,3,two,7


In [84]:
pd.merge(left, right, on='key1', suffixes=('_left','_right'))

Unnamed: 0,key1,key2_left,lval,key2_right,rval
0,foo,one,1,one,4
1,foo,one,1,one,5
2,foo,two,2,one,4
3,foo,two,2,one,5
4,bar,one,3,one,6
5,bar,one,3,two,7


Merge function arguments

Argument | Description
---------|------------
left | DataFrame to be merged on the left side
right | DataFrame to be merged on the right side
how | One of 'inner', 'left', 'right', 'outer'; defaults to 'inner'
on | Column names to join on. Must be found in both DataFrame objects. If not specified and no other join keys given, will use the intersection of the column names in left and right as the join keys
left_on | Columns in left DataFrame to use as join keys
right_on  | Columns in right DataFrame to use as join keys
sort | Sort merged data lexicographically by join keys; False by default in current pandas
suffixes | Tuple of string values to append to column names in case of overlap (defaults to '_x','_y')
copy | If False, avoid copying data into the resulting data structure in some exceptional cases; by default always copies
indicator | Adds a special column _merge that indicates the source of each row; values will be 'left_only', 'right_only', or 'both' based on the origin of the joined data in each row. 
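
The ```indicator``` argument is not used elsewhere in these notes, so here is a small sketch of it on the first ```df1``` and ```df2``` from this section. The names ```flagged``` and ```origin_by_key``` are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

# An outer join keeps unmatched rows from both sides;
# indicator=True adds a '_merge' column saying where each row came from
flagged = pd.merge(df1, df2, on='key', how='outer', indicator=True)

# Collapse to one label per key (every row of a given key has the same label here)
origin_by_key = dict(zip(flagged['key'], flagged['_merge'].astype(str)))
print(origin_by_key)
```

This is a handy way to find out which keys failed to match before deciding on a join type.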

### Merging on Index

Sometimes the merge key(s) in a DataFrame will be found in its index. In this case you can pass ```left_index=True``` or ```right_index=True``` (or both) to indicate that the index should be used as the merge key. 

In [86]:
left1 = pd.DataFrame({'key':['a','b','a','a','b','c'],'value':range(6)})

right1 = pd.DataFrame({'group_val':[3.5,7]},index=['a','b'])

In [87]:
left1

Unnamed: 0,key,value
0,a,0
1,b,1
2,a,2
3,a,3
4,b,4
5,c,5


In [88]:
right1

Unnamed: 0,group_val
a,3.5
b,7.0


In [89]:
pd.merge(left1, right1, left_on='key', right_index=True)

Unnamed: 0,key,value,group_val
0,a,0,3.5
2,a,2,3.5
3,a,3,3.5
1,b,1,7.0
4,b,4,7.0


In [90]:
# since the default merge method is to intersect the join keys, 
# you can instead form the union of them with an outer join

pd.merge(left1, right1, left_on='key', right_index=True, how='outer')

Unnamed: 0,key,value,group_val
0,a,0,3.5
2,a,2,3.5
3,a,3,3.5
1,b,1,7.0
4,b,4,7.0
5,c,5,
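
The examples above use the index on the right side only. A minimal sketch of the "both" case, where the merge key is the index of each frame, might look like this; ```left_idx``` and ```right_idx``` are hypothetical frames made up for the illustration.

```python
import pandas as pd

# Hypothetical frames whose merge key lives in the index on both sides
left_idx = pd.DataFrame({'value': [1, 2, 3]}, index=['a', 'b', 'c'])
right_idx = pd.DataFrame({'group_val': [3.5, 7.0]}, index=['a', 'b'])

# Index-on-index merge; the default inner join keeps only shared labels
both = pd.merge(left_idx, right_idx, left_index=True, right_index=True)
print(both)
```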


With hierarchically indexed data, things are more complicated, as joining on index is implicitly a multiple-key merge.

In [93]:
lefth = pd.DataFrame({'key1':['ohio','ohio','ohio', 'nevada','nevada'], 
                    'key2':[2000,2001,2002,2001,2002], 
                    'data': np.arange(5.)})

righth = pd.DataFrame(np.arange(12).reshape((6,2)),
                    index=[['nevada','nevada','ohio','ohio','ohio','ohio'], [2001,2000,2000,2000,2001,2002]], 
                    columns=['event1','event2'])

In [94]:
lefth

Unnamed: 0,key1,key2,data
0,ohio,2000,0.0
1,ohio,2001,1.0
2,ohio,2002,2.0
3,nevada,2001,3.0
4,nevada,2002,4.0


In [95]:
righth

Unnamed: 0,Unnamed: 1,event1,event2
nevada,2001,0,1
nevada,2000,2,3
ohio,2000,4,5
ohio,2000,6,7
ohio,2001,8,9
ohio,2002,10,11


In [96]:
# in this case you have to indicate multiple columns to merge on as a list

pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)

Unnamed: 0,key1,key2,data,event1,event2
0,ohio,2000,0.0,4,5
0,ohio,2000,0.0,6,7
1,ohio,2001,1.0,8,9
2,ohio,2002,2.0,10,11
3,nevada,2001,3.0,0,1


In [98]:
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True, how='outer')

Unnamed: 0,key1,key2,data,event1,event2
0,ohio,2000,0.0,4.0,5.0
0,ohio,2000,0.0,6.0,7.0
1,ohio,2001,1.0,8.0,9.0
2,ohio,2002,2.0,10.0,11.0
3,nevada,2001,3.0,0.0,1.0
4,nevada,2002,4.0,,
4,nevada,2000,,2.0,3.0


DataFrame has a convenient ```join``` method for merging by index. It can also be used to combine many DataFrame objects that have the same or similar indexes but non-overlapping columns. 

- By default, ```join``` performs a left join on the join keys, exactly preserving the left frame's row index
- It also supports joining the index of the passed DataFrame on one of the columns of the calling DataFrame
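
The second bullet is not shown in the cells below, so here is a short sketch using ```left1``` and ```right1``` from earlier in this section: the index of the passed frame is matched against the 'key' column of the calling frame.

```python
import pandas as pd

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

# Match right1's index against left1's 'key' column.
# join defaults to a left join, so every row of left1 is kept in order
joined = left1.join(right1, on='key')
print(joined)
```

Because it is a left join, the 'c' row stays in the result with a missing ```group_val```, and the row index of ```left1``` is unchanged.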

In [100]:
left2 = pd.DataFrame([[1.,2.],[3.,4.],[5.,6.]], index=['a','c','e'], columns=['ohio', 'nevada'])

right2 = pd.DataFrame([[7.,8.],[9.,10.],[11.,12.],[13.,14.]], index=['b','c', 'd','e'], columns=['Missouri', 'Alabama'])

In [102]:
left2

Unnamed: 0,ohio,nevada
a,1.0,2.0
c,3.0,4.0
e,5.0,6.0


In [103]:
right2

Unnamed: 0,Missouri,Alabama
b,7.0,8.0
c,9.0,10.0
d,11.0,12.0
e,13.0,14.0


In [101]:
left2.join(right2, how='outer')

Unnamed: 0,ohio,nevada,Missouri,Alabama
a,1.0,2.0,,
b,,,7.0,8.0
c,3.0,4.0,9.0,10.0
d,,,11.0,12.0
e,5.0,6.0,13.0,14.0


Lastly, for simple index-on-index merges, you can pass a list of DataFrames to ```join``` as an alternative to using the more general ```concat``` function.

In [104]:
another = pd.DataFrame(
    [[7.,8.],[9.,10.],[11.,12.],[16.,17.]], 
    index=['b','c', 'e','f'], 
    columns=['New York', 'Oregon'])

another

Unnamed: 0,New York,Oregon
b,7.0,8.0
c,9.0,10.0
e,11.0,12.0
f,16.0,17.0


In [105]:
left2.join([right2, another])

Unnamed: 0,ohio,nevada,Missouri,Alabama,New York,Oregon
a,1.0,2.0,,,,
c,3.0,4.0,9.0,10.0,9.0,10.0
e,5.0,6.0,13.0,14.0,11.0,12.0


In [106]:
left2.join([right2, another], how='outer')

Unnamed: 0,ohio,nevada,Missouri,Alabama,New York,Oregon
a,1.0,2.0,,,,
c,3.0,4.0,9.0,10.0,9.0,10.0
e,5.0,6.0,13.0,14.0,11.0,12.0
b,,,7.0,8.0,7.0,8.0
d,,,11.0,12.0,,
f,,,,,16.0,17.0
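
For comparison, a rough sketch of the ```concat``` alternative mentioned above, using the same three frames. With ```axis=1```, ```concat``` aligns on the index and by default keeps the union of the labels, so it should hold the same data as the outer join, though the row order may differ. The name ```combined``` is only for illustration.

```python
import pandas as pd

left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                     index=['a', 'c', 'e'], columns=['ohio', 'nevada'])
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13., 14.]],
                      index=['b', 'c', 'd', 'e'], columns=['Missouri', 'Alabama'])
another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
                       index=['b', 'c', 'e', 'f'], columns=['New York', 'Oregon'])

# Glue the frames side by side, aligning rows on the index labels
combined = pd.concat([left2, right2, another], axis=1)
print(combined)
```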


### Concatenating along an axis
