# Chapter 8 - Data Wrangling: Join, Combine, and Reshape

## 8.1 Hierarchical Index

In [1]:
import re

import pandas as pd
import numpy as np

- `.index` returns the index of a `Series` or `DataFrame`

- `Series.unstack()` converts a hierarchically indexed `Series` to a `DataFrame`; `DataFrame.stack()` does the reverse

- `DataFrame.set_index()` turns one or more existing columns into the index; `DataFrame.reset_index()` moves the index back into columns and restores the default integer index (0, 1, 2, ...)

In [2]:
es3_si_df = pd.read_csv('ES3.SI.csv')
es3_si_df.head()

,Date,Date_YYYY,Date_MM,Volume,Close
0,2017-01-03,2017,1,819500.0,2.96
1,2017-01-04,2017,1,439000.0,2.98
2,2017-01-05,2017,1,772500.0,3.01
3,2017-01-06,2017,1,893700.0,3.03
4,2017-01-09,2017,1,1096900.0,3.03


Hierarchical indexing is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis.

In [3]:
es3_agg = es3_si_df.groupby(['Date_YYYY', 'Date_MM'])['Close'].mean()
print(es3_agg)
print()
print(es3_agg.index)

Date_YYYY  Date_MM
2017       1          3.062000
           2          3.101000
           3          3.154348
           4          3.177895
2018       1          3.583636
           2          3.472105
           3          3.480952
           4          3.511476
2019       1          3.211783
           2          3.234632
           3          3.211810
           4          3.340095
Name: Close, dtype: float64

MultiIndex(levels=[[2017, 2018, 2019], [1, 2, 3, 4]],
           labels=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2], [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]],
           names=['Date_YYYY', 'Date_MM'])
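
The same two-level structure can be built by hand, which may help when reading the `MultiIndex` repr above. The following is a minimal sketch with made-up values rather than the ES3.SI data; the names `idx` and `s` are illustrative only. Note that the repr above comes from an older pandas release: in pandas 0.24 and later the `labels` attribute is called `codes`.

```python
import pandas as pd

# Hypothetical values; only the index structure mirrors es3_agg.
idx = pd.MultiIndex.from_product(
    [[2017, 2018], [1, 2]],
    names=['Date_YYYY', 'Date_MM'],
)
s = pd.Series([3.0, 3.1, 3.5, 3.4], index=idx, name='Close')

print(s.index.nlevels)               # number of index levels
print(list(s.index.names))           # name of each level
print(s.index.levels[0].tolist())    # unique labels of the outer level
print(s.index.get_level_values('Date_MM').tolist())  # inner label for each row
```

`get_level_values()` is often the easiest way to inspect one level, because it returns one label per row instead of the compressed levels/codes pair.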


In [4]:
# Pulling from outer index: 1 value
print(es3_agg[2017])

# Pulling from outer index: Using a range
print(es3_agg.loc[2018:])

# Pulling from outer index: Using multiple distinct values
# (Note the double square brackets used.)
print(es3_agg.loc[[2017, 2019]])

Date_MM
1    3.062000
2    3.101000
3    3.154348
4    3.177895
Name: Close, dtype: float64
Date_YYYY  Date_MM
2018       1          3.583636
           2          3.472105
           3          3.480952
           4          3.511476
2019       1          3.211783
           2          3.234632
           3          3.211810
           4          3.340095
Name: Close, dtype: float64
Date_YYYY  Date_MM
2017       1          3.062000
           2          3.101000
           3          3.154348
           4          3.177895
2019       1          3.211783
           2          3.234632
           3          3.211810
           4          3.340095
Name: Close, dtype: float64


In [5]:
# Selecting on both levels: a single (year, month) value
print(es3_agg[2017, 2])

# Selecting on both levels: one year, a range of months
print(es3_agg.loc[2017, 3:4])

3.101
Date_YYYY  Date_MM
2017       3          3.154348
           4          3.177895
Name: Close, dtype: float64
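
The selections above fix the year first. To pull one inner label across every outer label (for example, the same month in all years), a cross-section can be used. This is a sketch on hypothetical data, not the ES3.SI series; `s`, `feb`, and `feb_loc` are illustrative names.

```python
import pandas as pd

# Hypothetical two-level Series shaped like es3_agg.
idx = pd.MultiIndex.from_product(
    [[2017, 2018], [1, 2]],
    names=['Date_YYYY', 'Date_MM'],
)
s = pd.Series([3.0, 3.1, 3.5, 3.4], index=idx, name='Close')

# Cross-section: month 2 of every year. The Date_MM level is dropped,
# so the result is indexed by Date_YYYY only.
feb = s.xs(2, level='Date_MM')
print(feb)

# The same selection with .loc: ":" keeps every outer label.
feb_loc = s.loc[:, 2]
print(feb_loc)
```

`xs()` names the level explicitly, which tends to read more clearly once an index has more than two levels.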


`Series.unstack()` pivots the innermost index level (here `Date_MM`) into columns, turning the `Series` into a `DataFrame` with one row per `Date_YYYY`.

In [6]:
es3_agg_df = es3_agg.unstack()
display(es3_agg_df)

Date_YYYY / Date_MM,1,2,3,4
2017,3.062,3.101,3.154348,3.177895
2018,3.583636,3.472105,3.480952,3.511476
2019,3.211783,3.234632,3.21181,3.340095
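
By default `unstack()` moves the innermost level, but a different level can be chosen with the `level` argument. The sketch below uses hypothetical values; `by_month` and `by_year` are illustrative names.

```python
import pandas as pd

# Hypothetical two-level Series shaped like es3_agg.
idx = pd.MultiIndex.from_product(
    [[2017, 2018], [1, 2]],
    names=['Date_YYYY', 'Date_MM'],
)
s = pd.Series([3.0, 3.1, 3.5, 3.4], index=idx, name='Close')

# Default: the innermost level (Date_MM) becomes the columns.
by_month = s.unstack()
print(by_month)

# Naming the outer level puts the years in the columns instead.
by_year = s.unstack(level='Date_YYYY')
print(by_year)
```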


The opposite of `.unstack()` is `.stack()`, which moves the columns back into the innermost index level.

In [7]:
display(es3_agg_df.stack())

Date_YYYY  Date_MM
2017       1          3.062000
           2          3.101000
           3          3.154348
           4          3.177895
2018       1          3.583636
           2          3.472105
           3          3.480952
           4          3.511476
2019       1          3.211783
           2          3.234632
           3          3.211810
           4          3.340095
dtype: float64

In [8]:
df = pd.read_csv('dataset-B-membership.csv')
display(df.head())

,year,membership
0,2009,526089
1,2010,549878
2,2011,588014
3,2012,613418
4,2013,655126


To use an existing column as the index, call `df.set_index()`. The inverse operation is `df.reset_index()`, which moves the index back into a regular column and restores the default integer index.

In [9]:
# set_index() returns a new DataFrame, so df itself is left unchanged
df_y = df.set_index('year')
display(df_y.head())

year,membership
2009,526089
2010,549878
2011,588014
2012,613418
2013,655126


In [10]:
display(df_y.reset_index())

,year,membership
0,2009,526089
1,2010,549878
2,2011,588014
3,2012,613418
4,2013,655126
5,2014,686676
6,2015,718723
7,2016,740750
8,2017,755217
9,2018,762807


**References:**

Python for Data Analysis, 2nd Edition, McKinney (2017)