# Chapter 8 - Data Wrangling: Join, Combine, and Reshape

## 8.1 Hierarchical Index

In [1]:
import pandas as pd

- Getting an index using `.index`

- `Series.unstack()` to convert a `Series` with a hierarchical index into a `DataFrame`, and `DataFrame.stack()` to do the reverse

- `DataFrame.set_index()` to create an index from one or more existing columns of a `df`, and `DataFrame.reset_index()` to move the index back into columns and restore the default integer index (0, 1, 2, ...)

In [2]:
es3_si_df = pd.read_csv('dataset-I-ES3.csv')
es3_si_df.head()

Unnamed: 0,Date,Date_YYYY,Date_MM,Volume,Close
0,2017-01-03,2017,1,819500.0,2.96
1,2017-01-04,2017,1,439000.0,2.98
2,2017-01-05,2017,1,772500.0,3.01
3,2017-01-06,2017,1,893700.0,3.03
4,2017-01-09,2017,1,1096900.0,3.03


Hierarchical indexing is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis.

In [3]:
es3_agg = es3_si_df.groupby(['Date_YYYY', 'Date_MM'])['Close'].mean()
print(es3_agg)
print()
print(es3_agg.index)

Date_YYYY  Date_MM
2017       1          3.062000
           2          3.101000
           3          3.154348
           4          3.177895
2018       1          3.583636
           2          3.472105
           3          3.480952
           4          3.511476
2019       1          3.211783
           2          3.234632
           3          3.211810
           4          3.340095
Name: Close, dtype: float64

MultiIndex(levels=[[2017, 2018, 2019], [1, 2, 3, 4]],
           labels=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2], [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]],
           names=['Date_YYYY', 'Date_MM'])


The `labels` hold one list of integer positions per level (newer pandas releases call this attribute `codes`). The first list gives the position of each entry within the first level, and the second list gives its position within the second level. So a label of `0` in the first list and `2` in the second list corresponds to the index entry `(2017, 3)`.
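The decoding described above can be checked by hand. The sketch below builds a toy index with the same year/month structure (the index is constructed for illustration, not read from the dataset) and assumes a pandas version that exposes the positions as `codes`.

```python
import pandas as pd

# Toy index with the same year/month layout as above (illustrative only).
mi = pd.MultiIndex.from_product(
    [[2017, 2018, 2019], [1, 2, 3, 4]],
    names=['Date_YYYY', 'Date_MM'],
)

# One array of integer positions per level (printed as `labels` in old pandas).
year_codes, month_codes = mi.codes

# Decode the third entry: look up each position in its own level.
pos = 2
decoded = (mi.levels[0][year_codes[pos]], mi.levels[1][month_codes[pos]])
print(decoded)   # the pair rebuilt from levels and codes
print(mi[pos])   # the same entry read directly from the index
```

Both prints should show the same year/month pair, which is what the levels-plus-positions representation encodes.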

In [4]:
# Pulling from outer index: 1 value
print(es3_agg[2017])

# Pulling from outer index: Using a range
print(es3_agg.loc[2018:])

# Pulling from outer index: Using multiple distinct values
# (Note the double square brackets used.)
print(es3_agg.loc[[2017, 2019]])

Date_MM
1    3.062000
2    3.101000
3    3.154348
4    3.177895
Name: Close, dtype: float64
Date_YYYY  Date_MM
2018       1          3.583636
           2          3.472105
           3          3.480952
           4          3.511476
2019       1          3.211783
           2          3.234632
           3          3.211810
           4          3.340095
Name: Close, dtype: float64
Date_YYYY  Date_MM
2017       1          3.062000
           2          3.101000
           3          3.154348
           4          3.177895
2019       1          3.211783
           2          3.234632
           3          3.211810
           4          3.340095
Name: Close, dtype: float64


In [5]:
# Pulling from inner index: 1 value
print(es3_agg[2017, 2])

# Pulling from inner index: Using a range
print(es3_agg.loc[2017,2:4])

3.101
Date_YYYY  Date_MM
2017       2          3.101000
           3          3.154348
           4          3.177895
Name: Close, dtype: float64
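The cells above always start from the outer level. To select on the inner level alone, one option is `Series.xs()` with a `level` argument, or `.loc` with a full slice for the outer level. A minimal sketch on made-up values (the series `s` is an assumption, not the ES3 data):

```python
import pandas as pd

# Small stand-in for es3_agg; the numbers are invented for illustration.
idx = pd.MultiIndex.from_product(
    [[2017, 2018], [1, 2, 3]], names=['Date_YYYY', 'Date_MM']
)
s = pd.Series([3.0, 3.1, 3.2, 3.5, 3.4, 3.6], index=idx, name='Close')

# Month 2 across every year; the selected level is dropped from the result.
month2_xs = s.xs(2, level='Date_MM')

# Same selection with .loc: take all outer labels, then inner label 2.
month2_loc = s.loc[:, 2]

print(month2_xs)
print(month2_loc)
```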


Rearranging the data using `Series.unstack()` moves the inner index level into the columns, giving one column per distinct inner value.

In [6]:
es3_agg_df = es3_agg.unstack()
display(es3_agg_df)

Date_MM,1,2,3,4
Date_YYYY,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2017,3.062,3.101,3.154348,3.177895
2018,3.583636,3.472105,3.480952,3.511476
2019,3.211783,3.234632,3.21181,3.340095
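By default `unstack()` pivots the innermost level, but a different level can be passed by name or position. A short sketch on invented values (the series `s` is an assumption for illustration):

```python
import pandas as pd

# Small stand-in for es3_agg; the numbers are invented for illustration.
idx = pd.MultiIndex.from_product(
    [[2017, 2018], [1, 2, 3]], names=['Date_YYYY', 'Date_MM']
)
s = pd.Series([3.0, 3.1, 3.2, 3.5, 3.4, 3.6], index=idx, name='Close')

# Default: the innermost level (months) becomes the columns.
by_month = s.unstack()

# Pivot the outer level instead: years become the columns.
by_year = s.unstack(level='Date_YYYY')

print(by_month)
print(by_year)
```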


Change the name of the row index using `df.index.names` and the name of the column index using `df.columns.names`.

In [7]:
es3_agg_df2 = es3_agg_df.copy()
es3_agg_df2.index.names = ['Year']
display(es3_agg_df2)
es3_agg_df2.columns.names = ['Month']
display(es3_agg_df2)

Date_MM,1,2,3,4
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2017,3.062,3.101,3.154348,3.177895
2018,3.583636,3.472105,3.480952,3.511476
2019,3.211783,3.234632,3.21181,3.340095


Month,1,2,3,4
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2017,3.062,3.101,3.154348,3.177895
2018,3.583636,3.472105,3.480952,3.511476
2019,3.211783,3.234632,3.21181,3.340095
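Assigning to `.names` modifies the `df` in place. As an alternative, `DataFrame.rename_axis()` returns a renamed copy and leaves the original untouched. A minimal sketch on a made-up frame (values are illustrative):

```python
import pandas as pd

# Tiny frame shaped like the unstacked result; values are invented.
df = pd.DataFrame(
    [[3.0, 3.1], [3.5, 3.4]],
    index=pd.Index([2017, 2018], name='Date_YYYY'),
    columns=pd.Index([1, 2], name='Date_MM'),
)

# Returns a new DataFrame; df keeps its original axis names.
renamed = df.rename_axis(index='Year', columns='Month')

print(renamed.index.name, renamed.columns.name)
print(df.index.name, df.columns.name)
```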


To remove the names of the levels, assign `[None]` to them.

In [8]:
es3_agg_df2.index.names = [None]
es3_agg_df2.columns.names = [None]
display(es3_agg_df2)

Unnamed: 0,1,2,3,4
2017,3.062,3.101,3.154348,3.177895
2018,3.583636,3.472105,3.480952,3.511476
2019,3.211783,3.234632,3.21181,3.340095


In [9]:
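# Monthly mean of Close, flattened into a DataFrame with the value column named 'M'
es3_agg_mean = es3_si_df.groupby(['Date_YYYY', 'Date_MM'])['Close'].mean().reset_index(name='M')
# Rebuild the two-level (year, month) index from the columns
es3_agg_mean.set_index(['Date_YYYY', 'Date_MM'], inplace=True)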
display(es3_agg_mean)

Unnamed: 0_level_0,Unnamed: 1_level_0,M
Date_YYYY,Date_MM,Unnamed: 2_level_1
2017,1,3.062
2017,2,3.101
2017,3,3.154348
2017,4,3.177895
2018,1,3.583636
2018,2,3.472105
2018,3,3.480952
2018,4,3.511476
2019,1,3.211783
2019,2,3.234632


To analyse the data with the index levels in a different order, use `df.swaplevel()`. It returns a new `df` with the two index levels swapped. The data and the row order remain the same.

In [10]:
es3_agg_mean = es3_agg_mean.swaplevel('Date_YYYY','Date_MM')
display(es3_agg_mean)

Unnamed: 0_level_0,Unnamed: 1_level_0,M
Date_MM,Date_YYYY,Unnamed: 2_level_1
1,2017,3.062
2,2017,3.101
3,2017,3.154348
4,2017,3.177895
1,2018,3.583636
2,2018,3.472105
3,2018,3.480952
4,2018,3.511476
1,2019,3.211783
2,2019,3.234632
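A small check that `swaplevel()` only reorders the levels and leaves the rows where they were. The frame below is made up for illustration:

```python
import pandas as pd

# Stand-in for es3_agg_mean; values are invented.
idx = pd.MultiIndex.from_product(
    [[2017, 2018], [1, 2]], names=['Date_YYYY', 'Date_MM']
)
df = pd.DataFrame({'M': [3.0, 3.1, 3.5, 3.4]}, index=idx)

swapped = df.swaplevel('Date_YYYY', 'Date_MM')

print(list(swapped.index.names))  # level order is reversed
print(swapped['M'].tolist())      # values stay in their original row order
```

Because the rows are not re-sorted, the swapped index is usually no longer in sorted order, which is why a sort typically follows.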


To sort by a particular index level, use `df.sort_index(level=0)`. The dataset will be sorted by the `level` specified, given either as a position or as a level name.

In [11]:
es3_agg_mean = es3_agg_mean.sort_index(level=0)
es3_agg_mean

Unnamed: 0_level_0,Unnamed: 1_level_0,M
Date_MM,Date_YYYY,Unnamed: 2_level_1
1,2017,3.062
1,2018,3.583636
1,2019,3.211783
2,2017,3.101
2,2018,3.472105
2,2019,3.234632
3,2017,3.154348
3,2018,3.480952
3,2019,3.21181
4,2017,3.177895
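The `level` argument also accepts a level name. The sketch below sorts a made-up swapped frame by each level in turn; with the default `sort_remaining=True`, ties are broken by the remaining levels.

```python
import pandas as pd

# Month-first index in unsorted order, as produced by a swaplevel (invented values).
idx = pd.MultiIndex.from_tuples(
    [(1, 2017), (2, 2017), (1, 2018), (2, 2018)],
    names=['Date_MM', 'Date_YYYY'],
)
df = pd.DataFrame({'M': [3.0, 3.1, 3.5, 3.4]}, index=idx)

by_month = df.sort_index(level='Date_MM')    # month first, then year
by_year = df.sort_index(level='Date_YYYY')   # year first, then month

print(by_month)
print(by_year)
```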


The opposite of `.unstack()` is `.stack()`, which moves the column labels back into the innermost index level.

In [12]:
display(es3_agg_df.stack())

Date_YYYY  Date_MM
2017       1          3.062000
           2          3.101000
           3          3.154348
           4          3.177895
2018       1          3.583636
           2          3.472105
           3          3.480952
           4          3.511476
2019       1          3.211783
           2          3.234632
           3          3.211810
           4          3.340095
dtype: float64
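A quick round trip shows that `stack()` undoes `unstack()` when no values are missing. The series here is made up for illustration; with missing values the result can differ, since some pandas versions drop them when stacking.

```python
import pandas as pd

# Invented values with a complete year/month grid (no gaps).
idx = pd.MultiIndex.from_product(
    [[2017, 2018], [1, 2]], names=['Date_YYYY', 'Date_MM']
)
s = pd.Series([3.0, 3.1, 3.5, 3.4], index=idx)

wide = s.unstack()         # months become columns
long_again = wide.stack()  # columns go back into the inner index level

print(wide)
print(long_again)
```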

In [13]:
df = pd.read_csv('dataset-B-membership.csv')
display(df.head())

Unnamed: 0,year,membership
0,2009,526089
1,2010,549878
2,2011,588014
3,2012,613418
4,2013,655126


To create an index from an existing column, use `df.set_index()`. Its opposite is `df.reset_index()`, which moves the index back into a column. Note that `set_index()` also accepts a list of columns, which produces a hierarchical index.

In [14]:
# set_index() returns a new DataFrame, so df itself is left unchanged
df_y = df.set_index('year')
display(df_y.head())

Unnamed: 0_level_0,membership
year,Unnamed: 1_level_1
2009,526089
2010,549878
2011,588014
2012,613418
2013,655126


In [15]:
display(df_y.reset_index())

Unnamed: 0,year,membership
0,2009,526089
1,2010,549878
2,2011,588014
3,2012,613418
4,2013,655126
5,2014,686676
6,2015,718723
7,2016,740750
8,2017,755217
9,2018,762807
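A sketch of `set_index()` with more than one column, and of the two ways to reset. The `quarter` column and all values are hypothetical and not part of the membership dataset.

```python
import pandas as pd

# Hypothetical frame; 'quarter' and the numbers are invented for illustration.
df = pd.DataFrame({
    'year': [2017, 2017, 2018, 2018],
    'quarter': [1, 2, 1, 2],
    'membership': [100, 110, 120, 130],
})

# A list of columns produces a two-level (hierarchical) index.
indexed = df.set_index(['year', 'quarter'])

# reset_index() moves the levels back into ordinary columns...
restored = indexed.reset_index()

# ...while drop=True discards them instead of keeping them.
dropped = indexed.reset_index(drop=True)

print(indexed.index.nlevels)
print(list(restored.columns))
print(list(dropped.columns))
```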


**References:**

Python for Data Analysis, 2nd Edition, McKinney (2017)