# Data Wrangling: Join, Combine, and Reshape

##### Focuses on ***combining, joining, and rearranging*** data

In [5]:
import pandas as pd
import numpy as np

## 1. Hierarchical Indexing  
Hierarchical indexing gives you **multiple** (two or more) index levels on an axis.

In [10]:
data = pd.Series(np.random.randn(9),
                 index = [['a','a','a','b','b','c','c','d','d'],list('123131223')])

In [11]:
data

a  1   -0.595939
   2    0.577280
   3   -0.353964
b  1    0.516423
   3   -0.610737
c  1   -0.010285
   2   -0.835307
d  2    1.249767
   3    0.000324
dtype: float64

In [12]:
data.index

MultiIndex([('a', '1'),
            ('a', '2'),
            ('a', '3'),
            ('b', '1'),
            ('b', '3'),
            ('c', '1'),
            ('c', '2'),
            ('d', '2'),
            ('d', '3')],
           )

*Partial* indexing:

In [14]:
data['b':'c']

b  1    0.516423
   3   -0.610737
c  1   -0.010285
   2   -0.835307
dtype: float64

In [16]:
data.loc[['b','d']]

b  1    0.516423
   3   -0.610737
d  2    1.249767
   3    0.000324
dtype: float64

In [46]:
data.loc[:, '2']
# selection from an inner level

a    0.577280
c   -0.835307
d    1.249767
dtype: float64
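
The selections above can also be written with a full `(outer, inner)` tuple or with `xs()`. A minimal sketch, assuming the same index layout as `data` but with fixed values (`np.arange`) instead of random ones so the results are reproducible:

```python
import numpy as np
import pandas as pd

# Same index layout as the Series above; fixed values are an assumption for reproducibility
data = pd.Series(np.arange(9.0),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        list('123131223')])

# A full (outer, inner) tuple selects a single value
single = data.loc[('b', '3')]

# xs() selects on a given level; level=1 is the inner level here
inner = data.xs('2', level=1)

# Equivalent to the partial-indexing form used above
same = data.loc[:, '2']
```

Both `inner` and `same` are indexed by the remaining outer labels (`'a'`, `'c'`, `'d'`), since the selected level is dropped.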

>unstack()

Rearranges the data into a DataFrame by pivoting the inner index level into the columns.

>stack()

The inverse operation: pivots the columns back into the inner index level.

In [28]:
data.unstack()

          1         2         3
a -0.595939  0.577280 -0.353964
b  0.516423       NaN -0.610737
c -0.010285 -0.835307       NaN
d       NaN  1.249767  0.000324


In [29]:
data.unstack().stack()

a  1   -0.595939
   2    0.577280
   3   -0.353964
b  1    0.516423
   3   -0.610737
c  1   -0.010285
   2   -0.835307
d  2    1.249767
   3    0.000324
dtype: float64
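
`unstack()` pivots the innermost level by default, but the level and the fill value for missing combinations can be chosen. A small sketch, again assuming fixed values in place of the random ones:

```python
import numpy as np
import pandas as pd

# Same index layout as the Series above; fixed values are an assumption for reproducibility
data = pd.Series(np.arange(9.0),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        list('123131223')])

# Default: the innermost level (level=-1) becomes the columns
wide = data.unstack()

# level=0 pivots the outer level instead, so 'a'..'d' become the columns
wide_outer = data.unstack(level=0)

# fill_value replaces the NaN introduced for missing (outer, inner) combinations
wide_filled = data.unstack(fill_value=0.0)
```

Here `wide` has three missing cells (`b/2`, `c/3`, `d/1`), which is why NaN appears in the unstacked output.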

With a DataFrame, either axis can have a hierarchical index:

In [30]:
frame = pd.DataFrame(np.arange(12).reshape((4,3)),
                     index = [['a','a','b','b'], [1,2,1,2]],
                     columns = [['Ohio', 'Ohio', 'Colorado'],['Green', 'Red', 'Green']])

In [31]:
frame

     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11


The hierarchical levels can have **names**:

In [36]:
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

In [37]:
frame

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11


In [38]:
frame['Ohio']

color      Green  Red
key1 key2
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10
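
Partial indexing works on both axes, and a full tuple selects a single column or cell. The sketch below rebuilds the same frame, but attaches the level names up front with `pd.MultiIndex.from_arrays`; the variable names `columns`, `index`, `ohio_red`, `rows_a`, and `cell` are only illustrative:

```python
import numpy as np
import pandas as pd

# Build each MultiIndex once, with level names attached at creation
columns = pd.MultiIndex.from_arrays(
    [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
    names=['state', 'color'])
index = pd.MultiIndex.from_arrays(
    [['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    names=['key1', 'key2'])
frame = pd.DataFrame(np.arange(12).reshape((4, 3)), index=index, columns=columns)

# A full (state, color) tuple selects one column as a Series
ohio_red = frame[('Ohio', 'Red')]

# Partial indexing on the rows: every row under key1 == 'a'
rows_a = frame.loc['a']

# A row tuple plus a column tuple selects a single cell
cell = frame.loc[('b', 2), ('Colorado', 'Green')]
```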


### 1.1 Reordering and Sorting Levels  
Rearrange the order of the levels on an axis, or sort the data by the values in one specific level.

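>swaplevel()

Takes two level numbers or names and returns a new object with those levels interchanged; the data itself is unchanged.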

In [41]:
frame.swaplevel('key1', 'key2')
#frame.swaplevel(0,1)

state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11
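
`swaplevel()` only relabels the axis; it does not move or sort any data, and it can also be applied to the columns with `axis=1`. A minimal sketch, rebuilding the frame so the block runs on its own:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Swap the two row levels; a new object is returned and frame is left untouched
swapped_rows = frame.swaplevel('key1', 'key2')

# Swap the two column levels instead
swapped_cols = frame.swaplevel('state', 'color', axis=1)

# The underlying values keep their original positions in both results
unchanged = (swapped_rows.values == frame.values).all()
```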


>sort_index(level=0)

Sorts the data using the values in a **single** level

In [43]:
frame.sort_index(level=1)

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11


In [44]:
frame.swaplevel(0,1).sort_index()

state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11
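
After a swap the index is usually no longer in lexicographic order, which is why `sort_index()` tends to follow `swaplevel()`. The sketch below checks this with `is_monotonic_increasing`, sorts by level name rather than number, and also sorts the column axis; the variable names are illustrative:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

swapped = frame.swaplevel('key1', 'key2')

# The swapped index is out of order: (2, 'a') comes before (1, 'b')
sorted_before = swapped.index.is_monotonic_increasing

# Levels can be referred to by name as well as by number
by_key2 = swapped.sort_index(level='key2')
sorted_after = by_key2.index.is_monotonic_increasing

# axis=1 sorts the hierarchical columns instead of the rows
by_cols = frame.sort_index(axis=1)
```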
