### Hierarchical Indexing

* Higher-dimensional data: data indexed by more than one or two keys
* Commonly referred to as multi-indexing
* In this section, we’ll explore:
    * the direct creation of MultiIndex objects
    * considerations around indexing, slicing, and computing statistics across multiply indexed data
    * useful routines for converting between simple and hierarchically indexed representations of your data


In [1]:
import numpy as np
import pandas as pd

#### First, Though ... "This Is Not The Way"

In [2]:
index = [('California', 2000), ('California', 2010), ('New York', 2000), 
        ('New York', 2010), ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956, 18976457, 19378102, 20851820, 25145561]

In [3]:
pop = pd.Series(populations, index=index)
pop

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

In [4]:
# Indexing and slicing with tuple keys is still fairly straightforward
pop[('California', 2010): ('Texas', 2000)]

(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
dtype: int64

In [5]:
# However, selecting all values from 2010 gets messy: you have to loop over the index yourself
pop[[i for i in pop.index if i[1] == 2010]]

(California, 2010)    37253956
(New York, 2010)      19378102
(Texas, 2010)         25145561
dtype: int64

#### The better way: Pandas MultiIndex (.. This is the Way)

In [6]:
# Create a MultiIndex from the tuples as follows
index = pd.MultiIndex.from_tuples(index)
index

MultiIndex([('California', 2000),
            ('California', 2010),
            (  'New York', 2000),
            (  'New York', 2010),
            (     'Texas', 2000),
            (     'Texas', 2010)],
           )

In [7]:
index.levels

FrozenList([['California', 'New York', 'Texas'], [2000, 2010]])

In [8]:
# Reindex our Series with the MultiIndex above
pop = pop.reindex(index)
pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

* In this multi-index representation, any blank entry indicates the same values as the line above it

In [9]:
# Access all data for which the second index is 2010 (the result is still a Series)
pop[:, 2010]

California    37253956
New York      19378102
Texas         25145561
dtype: int64

#### MultiIndex as extra dimension
* You might notice something else here: we could easily have stored the same data using a simple DataFrame with index and column labels. 
* In fact, Pandas is built with this equivalence in mind. 
    * The unstack() method will quickly convert a multiply-indexed Series into a conventionally indexed DataFrame:

In [10]:
pop_df = pop.unstack()
pop_df

Unnamed: 0,2000,2010
California,33871648,37253956
New York,18976457,19378102
Texas,20851820,25145561


In [11]:
# Back to Series
pop_df.stack(), type(pop_df.stack())

(California  2000    33871648
             2010    37253956
 New York    2000    18976457
             2010    19378102
 Texas       2000    20851820
             2010    25145561
 dtype: int64,
 pandas.core.series.Series)

* What's the point?
    * Each extra level in a multi-index represents an extra dimension of data; taking advantage of this gives us much more flexibility in the types of data we can represent

In [12]:
# Concretely, we might want to add another column of demographic data for each state at each year
# (say, population under 18); with a MultiIndex this is as easy as adding another column to the DataFrame.
# Be mindful that 'total' takes its index from the multi-indexed Series, so the plain 'under18' list
# must have the same length (and order) as pop
pop_df = pd.DataFrame({'total': pop, 'under18': [9267089, 9284094, 4687374, 4318033, 5906301, 6879014]})
pop_df

Unnamed: 0,Unnamed: 1,total,under18
California,2000,33871648,9267089
California,2010,37253956,9284094
New York,2000,18976457,4687374
New York,2010,19378102,4318033
Texas,2000,20851820,5906301
Texas,2010,25145561,6879014


In [13]:
# Fraction of the population under 18, by state and year
frac_u18 = pop_df['under18'] / pop_df['total']
frac_u18

California  2000    0.273594
            2010    0.249211
New York    2000    0.247010
            2010    0.222831
Texas       2000    0.283251
            2010    0.273568
dtype: float64

In [14]:
frac_u18.unstack()

Unnamed: 0,2000,2010
California,0.273594,0.249211
New York,0.24701,0.222831
Texas,0.283251,0.273568


### Methods of MultiIndex Creation

In [15]:
# Passing a list of two index arrays builds a MultiIndex: the first array gives the outer level ('a', 'b'),
# the second gives the inner numeric level (1, 2) paired with each outer label
df_1 = pd.DataFrame(np.random.rand(4,2), index=[['a', 'a', 'b', 'b'],[1,2,1,2]], 
                    columns=['data1', 'data2'])
df_1

Unnamed: 0,Unnamed: 1,data1,data2
a,1,0.453884,0.8362
a,2,0.681689,0.464407
b,1,0.91377,0.955022
b,2,0.947166,0.8426


In [16]:
data_mic = {('California', 2000): 33871648,('California', 2010): 37253956,('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561, ('New York', 2000): 18976457,('New York', 2010):19378102}
data_mic

{('California', 2000): 33871648,
 ('California', 2010): 37253956,
 ('Texas', 2000): 20851820,
 ('Texas', 2010): 25145561,
 ('New York', 2000): 18976457,
 ('New York', 2010): 19378102}

In [17]:
pd.Series(data_mic)

California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64

#### Explicit MultiIndex constructors

In [18]:
# you can construct the MultiIndex from a simple list of arrays, giving the index values within each level:
pd_multi = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1,2,1,2]])
pd_multi

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [19]:
pd_multi.levels, pd_multi.values

(FrozenList([['a', 'b'], [1, 2]]),
 array([('a', 1), ('a', 2), ('b', 1), ('b', 2)], dtype=object))

In [20]:
pd_multi[:2]

MultiIndex([('a', 1),
            ('a', 2)],
           )

In [21]:
pd_tups = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

In [22]:
pd_tups.values

array([('a', 1), ('a', 2), ('b', 1), ('b', 2)], dtype=object)

* Similarly, you can construct the MultiIndex directly using its internal encoding by passing levels (a list of lists containing the available index values for each level) and codes (a list of lists of integer positions that reference those values; this argument was called labels in older pandas releases).
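
* A minimal sketch of that raw constructor, assuming a pandas version where the argument is named `codes`. The variable names `explicit` and `product` are only for illustration; the second index is built with `from_product` to check that both routes give the same result:

```python
import pandas as pd

# levels: the unique values available at each level of the index
# codes: for every row, the integer position of its value within each level
explicit = pd.MultiIndex(
    levels=[['a', 'b'], [1, 2]],
    codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
)

# The same index built as a Cartesian product of the level values
product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

print(explicit)
print(explicit.equals(product))
```

* In practice the `from_arrays`, `from_tuples`, and `from_product` helpers are usually easier to read than hand-written codes.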


In [23]:
# MultiIndex level names
pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [24]:
pop.index.names = ['state', 'year']
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

### MultiIndex for Columns

In [25]:
# MultiIndex for Columns

# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1,2]], names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']], names=['subject', 'type']) 

# mock some data
data_muic = np.round(np.random.randn(4,6), 1)
data_muic[:, ::2] *= 10
data_muic += 37

# create the DataFrame
health_data = pd.DataFrame(data_muic, index=index, columns=columns)
health_data

Unnamed: 0_level_0,subject,Bob,Bob,Guido,Guido,Sue,Sue
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,40.0,37.5,38.0,37.2,29.0,35.8
2013,2,39.0,37.6,36.0,35.9,45.0,37.4
2014,1,40.0,35.7,57.0,38.0,49.0,36.3
2014,2,38.0,37.9,35.0,37.3,24.0,37.5


* Here we see where the multi-indexing for both rows and columns can come in very handy. This is fundamentally four-dimensional data, where the dimensions are the subject, the measurement type, the year, and the visit number.
* With this in place we can, for example, index the top-level column by the person’s name and get a full DataFrame containing just that person’s information:

In [26]:
health_data['Guido']

Unnamed: 0_level_0,type,HR,Temp
year,visit,Unnamed: 2_level_1,Unnamed: 3_level_1
2013,1,38.0,37.2
2013,2,36.0,35.9
2014,1,57.0,38.0
2014,2,35.0,37.3


In [27]:
health_data.index, health_data.columns

(MultiIndex([(2013, 1),
             (2013, 2),
             (2014, 1),
             (2014, 2)],
            names=['year', 'visit']),
 MultiIndex([(  'Bob',   'HR'),
             (  'Bob', 'Temp'),
             ('Guido',   'HR'),
             ('Guido', 'Temp'),
             (  'Sue',   'HR'),
             (  'Sue', 'Temp')],
            names=['subject', 'type']))

### Indexing and Slicing a MultiIndex

In [28]:
# A Series
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [29]:
# Single Elements by indexing w/multiple terms 
pop['California', 2000]

33871648

In [30]:
# MultiIndex also supports partial indexing, or indexing just one of the levels in the index
pop['California']

year
2000    33871648
2010    37253956
dtype: int64

In [31]:
pop.loc['California':'New York']  # label slices include both endpoints; .loc is recommended for clarity but not required

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
dtype: int64

In [32]:
# Partial indexing on lower levels: pass an empty slice for every level above the one you want
pop[:, 2010] # Just 2010 for each outer level state

state
California    37253956
New York      19378102
Texas         25145561
dtype: int64
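
* As an alternative to the empty-slice form, the same selection can be written with the `xs` (cross-section) method, which names the level explicitly. This is a sketch that rebuilds the population Series so it runs on its own; `pop_2010` is just an illustrative name:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'],
)
pop = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=index,
)

# Cross-section: every row whose 'year' level equals 2010.
# The selected level is dropped, leaving a Series indexed by state.
pop_2010 = pop.xs(2010, level='year')
print(pop_2010)
```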

In [36]:
# Selection based on Boolean masks, with and without .loc
print(pop[pop > 22000000], '\n\n', pop.loc[pop > 22000000])

state       year
California  2000    33871648
            2010    37253956
Texas       2010    25145561
dtype: int64 

 state       year
California  2000    33871648
            2010    37253956
Texas       2010    25145561
dtype: int64


In [37]:
# Fancy Indexing
pop[['California', 'Texas']]

state       year
California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
dtype: int64

In [38]:
# Fancy Multiple Levels
pop.loc[[('California', 2000), ('Texas', 2010)]]

state       year
California  2000    33871648
Texas       2010    25145561
dtype: int64

In [39]:
## Back to Health Data
health_data

Unnamed: 0_level_0,subject,Bob,Bob,Guido,Guido,Sue,Sue
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,40.0,37.5,38.0,37.2,29.0,35.8
2013,2,39.0,37.6,36.0,35.9,45.0,37.4
2014,1,40.0,35.7,57.0,38.0,49.0,36.3
2014,2,38.0,37.9,35.0,37.3,24.0,37.5


In [40]:
health_data['Guido', 'HR']  # Guido's heart rate: with MultiIndex columns, index by subject and then by type

year  visit
2013  1        38.0
      2        36.0
2014  1        57.0
      2        35.0
Name: (Guido, HR), dtype: float64

In [41]:
health_data.iloc[:2, :2]

Unnamed: 0_level_0,subject,Bob,Bob
Unnamed: 0_level_1,type,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2
2013,1,40.0,37.5
2013,2,39.0,37.6


In [42]:
health_data.index, health_data.columns

(MultiIndex([(2013, 1),
             (2013, 2),
             (2014, 1),
             (2014, 2)],
            names=['year', 'visit']),
 MultiIndex([(  'Bob',   'HR'),
             (  'Bob', 'Temp'),
             ('Guido',   'HR'),
             ('Guido', 'Temp'),
             (  'Sue',   'HR'),
             (  'Sue', 'Temp')],
            names=['subject', 'type']))

In [43]:
# Each Individual index in loc or iloc can be passed a tuple of multiple indices
health_data.loc[:, ('Bob', 'HR')]

year  visit
2013  1        40.0
      2        39.0
2014  1        40.0
      2        38.0
Name: (Bob, HR), dtype: float64

In [47]:
health_data.loc[:, 'Bob':'Guido']

Unnamed: 0_level_0,subject,Bob,Bob,Guido,Guido
Unnamed: 0_level_1,type,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
2013,1,40.0,37.5,38.0,37.2
2013,2,39.0,37.6,36.0,35.9
2014,1,40.0,35.7,57.0,38.0
2014,2,38.0,37.9,35.0,37.3


In [48]:
health_data.loc[:, [('Bob', 'Temp'), ('Sue', 'HR')]]

Unnamed: 0_level_0,subject,Bob,Sue
Unnamed: 0_level_1,type,Temp,HR
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2
2013,1,37.5,29.0
2013,2,37.6,45.0
2014,1,35.7,49.0
2014,2,37.9,24.0


In [49]:
health_data.loc[[(2013, 2), (2014, 1)], [('Bob', 'HR'), ('Guido', 'Temp')]]

Unnamed: 0_level_0,subject,Bob,Guido
Unnamed: 0_level_1,type,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2
2013,2,39.0,35.9
2014,1,40.0,38.0


In [50]:
# iloc counts positions from 0: first row, columns at positions 3 and 4
health_data.iloc[:1, 3:5]

Unnamed: 0_level_0,subject,Guido,Sue
Unnamed: 0_level_1,type,Temp,HR
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2
2013,1,37.2,29.0


#### IndexSlice Object

In [51]:
# Review health_data
health_data

Unnamed: 0_level_0,subject,Bob,Bob,Guido,Guido,Sue,Sue
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,40.0,37.5,38.0,37.2,29.0,35.8
2013,2,39.0,37.6,36.0,35.9,45.0,37.4
2014,1,40.0,35.7,57.0,38.0,49.0,36.3
2014,2,38.0,37.9,35.0,37.3,24.0,37.5


In [52]:
idx = pd.IndexSlice
health_data.loc[idx[:, 1], idx[:, 'HR']]

Unnamed: 0_level_0,subject,Bob,Guido,Sue
Unnamed: 0_level_1,type,HR,HR,HR
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
2013,1,40.0,38.0,29.0
2014,1,40.0,57.0,49.0


In [59]:
health_data.loc[idx[2013, :], idx['Guido', :]]

Unnamed: 0_level_0,subject,Guido,Guido
Unnamed: 0_level_1,type,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2
2013,1,38.0,37.2
2013,2,36.0,35.9
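
* `pd.IndexSlice` exists because Python does not allow a bare `:` inside a tuple. A small sketch of the equivalence, using placeholder values (0 to 23) rather than the random measurements above; `demo`, `with_idx`, and `with_slices` are illustrative names:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=['year', 'visit'])
columns = pd.MultiIndex.from_product(
    [['Bob', 'Guido', 'Sue'], ['HR', 'Temp']], names=['subject', 'type']
)
# Placeholder values 0..23, only so the selection is easy to follow
demo = pd.DataFrame(np.arange(24).reshape(4, 6), index=index, columns=columns)

idx = pd.IndexSlice

# idx[:, 1] simply returns the tuple (slice(None), 1)
same_key = idx[:, 1] == (slice(None), 1)

# First visit of every year, heart rate of every subject, written both ways
with_idx = demo.loc[idx[:, 1], idx[:, 'HR']]
with_slices = demo.loc[(slice(None), 1), (slice(None), 'HR')]

print(same_key)
print(with_idx.equals(with_slices))
```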


### Rearranging Multi-Indices
* Many of the MultiIndex slicing operations will fail if the index is not sorted (example below)
    * That is, not "lexicographically" sorted

In [64]:
idx = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data_ridx = pd.Series(np.random.rand(6), index=idx)
data_ridx.index.names = ['char', 'int']
data_ridx

char  int
a     1      0.809697
      2      0.466859
c     1      0.716325
      2      0.952034
b     1      0.546289
      2      0.616055
dtype: float64

In [65]:
try:
    data_ridx['a':'b']
except KeyError as e:
    print(type(e))
    print(e)

<class 'pandas.errors.UnsortedIndexError'>
'Key length (1) was greater than MultiIndex lexsort depth (0)'


* Not the clearest message, but this is the result of the MultiIndex not being sorted
* For various reasons, partial slices and other similar operations require the levels in the MultiIndex to be in sorted (i.e., lexicographical) order
* sort_index() will be our friend here (older pandas releases also offered sortlevel() on Series and DataFrame, which has since been removed)
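
* One way to check for this condition before slicing is the index's `is_monotonic_increasing` property. The sketch below uses placeholder values (0 to 5) and illustrative names (`demo`, `demo_sorted`):

```python
import numpy as np
import pandas as pd

unsorted_index = pd.MultiIndex.from_product(
    [['a', 'c', 'b'], [1, 2]], names=['char', 'int']
)
demo = pd.Series(np.arange(6), index=unsorted_index)

# False here, because 'c' comes before 'b' in the outer level
print(demo.index.is_monotonic_increasing)

# Sorting the index restores lexicographic order, after which label slices work
demo_sorted = demo.sort_index()
print(demo_sorted.index.is_monotonic_increasing)
print(demo_sorted.loc['a':'b'])
```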

In [67]:
data_ridx = data_ridx.sort_index()
data_ridx

char  int
a     1      0.809697
      2      0.466859
b     1      0.546289
      2      0.616055
c     1      0.716325
      2      0.952034
dtype: float64

In [68]:
data_ridx['a':'b']

char  int
a     1      0.809697
      2      0.466859
b     1      0.546289
      2      0.616055
dtype: float64

### Stacking and unstacking indices
* It is common to push one level of a hierarchical index into the columns; the level argument of unstack() selects which index level to move

In [69]:
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [70]:
pop.unstack(level=0) # outer index (state level is pushed to column)

state,California,New York,Texas
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2000,33871648,18976457,20851820
2010,37253956,19378102,25145561


In [71]:
pop.unstack(level=1) # nested index (year pushed to column)

year,2000,2010
state,Unnamed: 1_level_1,Unnamed: 2_level_1
California,33871648,37253956
New York,18976457,19378102
Texas,20851820,25145561
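
* When the index levels are named, the level can also be given by name rather than by position. A sketch that rebuilds the population Series so it runs on its own (`wide` and `roundtrip` are illustrative names), and checks that `stack()` reverses the operation:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'],
)
pop = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=index,
)

# Same as unstack(level=1): the 'year' level becomes the columns
wide = pop.unstack(level='year')
print(wide)

# stack() moves the columns back into the innermost index level
roundtrip = wide.stack()
print(roundtrip)
```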


### Index setting & resetting
* Another way to rearrange hierarchical data is to turn the index labels into columns
    * This can be accomplished with the reset_index method
* Calling this will result in a DataFrame with your index levels pushed into separate columns (here, state and year)

In [72]:
print(type(pop))
display(pop)

<class 'pandas.core.series.Series'>


state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [73]:
# The values of the Series get their own column in the resulting DataFrame; name= sets that column's label
pop_flat = pop.reset_index(name='population')
pop_flat

Unnamed: 0,state,year,population
0,California,2000,33871648
1,California,2010,37253956
2,New York,2000,18976457
3,New York,2010,19378102
4,Texas,2000,20851820
5,Texas,2010,25145561


In [74]:
type(pop_flat)

pandas.core.frame.DataFrame

* set_index builds a MultiIndex from column values
    * the hierarchy follows the order of the columns passed in the call

In [75]:
# Original
pop_set_state_first = pop_flat.set_index(['state', 'year'])
pop_set_state_first

Unnamed: 0_level_0,Unnamed: 1_level_0,population
state,year,Unnamed: 2_level_1
California,2000,33871648
California,2010,37253956
New York,2000,18976457
New York,2010,19378102
Texas,2000,20851820
Texas,2010,25145561


In [76]:
# Set the index with year first (a little silly here, since each state appears under several years, but possible!)
pop_set_year_first = pop_flat.set_index(['year', 'state'])
pop_set_year_first

Unnamed: 0_level_0,Unnamed: 1_level_0,population
year,state,Unnamed: 2_level_1
2000,California,33871648
2010,California,37253956
2000,New York,18976457
2010,New York,19378102
2000,Texas,20851820
2010,Texas,25145561
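
* If the year-first layout is what you want, one option is to swap the levels of the existing Series and then sort, so that rows for the same year sit together. This is a sketch using `swaplevel`, with the population Series rebuilt so it runs on its own; `year_first` is an illustrative name:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'],
)
pop = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=index,
)

# Swap the two levels, then sort so the new outer level (year) is grouped
year_first = pop.swaplevel('state', 'year').sort_index()
print(year_first)
```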


### Data Aggregations on Multi-Indices
* For hierarchically indexed data, aggregation methods (e.g. mean(), min()) could be passed a level parameter in older pandas releases
    * level controls which subset of the data the aggregate is computed on
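
* The same idea works on a Series through `groupby(level=...)`. A small sketch on the population data, rebuilt here so it runs on its own (`state_mean` and `year_total` are illustrative names):

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'],
)
pop = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=index,
)

# Average over the years within each state
state_mean = pop.groupby(level='state').mean()
print(state_mean)

# Total over the three states within each year
year_total = pop.groupby(level='year').sum()
print(year_total)
```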

In [77]:
health_data

Unnamed: 0_level_0,subject,Bob,Bob,Guido,Guido,Sue,Sue
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,40.0,37.5,38.0,37.2,29.0,35.8
2013,2,39.0,37.6,36.0,35.9,45.0,37.4
2014,1,40.0,35.7,57.0,38.0,49.0,36.3
2014,2,38.0,37.9,35.0,37.3,24.0,37.5


In [78]:
# Average measurements in the two visits each year (explore index level year)
hd_year_mean = health_data.mean(level='year')
hd_year_mean


subject,Bob,Bob,Guido,Guido,Sue,Sue
type,HR,Temp,HR,Temp,HR,Temp
year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
2013,39.5,37.55,37.0,36.55,37.0,36.6
2014,39.0,36.8,46.0,37.65,36.5,36.9


* Note that the textbook version above passes level directly to mean(); this was deprecated in pandas 1.3 and removed in pandas 2.0. The groupby(level=...) form gives the same result and is the one to use going forward

In [80]:
# Year average for each type of column measurement for subject 
data_mean = health_data.groupby(level='year').mean()
data_mean

subject,Bob,Bob,Guido,Guido,Sue,Sue
type,HR,Temp,HR,Temp,HR,Temp
year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
2013,39.5,37.55,37.0,36.55,37.0,36.6
2014,39.0,36.8,46.0,37.65,36.5,36.9


In [81]:
data_mean.mean(axis=1, level='type')


type,HR,Temp
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2013,37.833333,36.9
2014,40.5,37.116667


In [83]:
# As above, we can take the mean across column levels with groupby
# Specify the level to group by (here the second column level, 'type'), then average the grouped columns within each row of data_mean
type_agg_means = data_mean.groupby(level=[1], axis=1).mean()
type_agg_means

type,HR,Temp
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2013,37.833333,36.9
2014,40.5,37.116667


In [84]:
subj_agg_means_yearly = data_mean.groupby(level=[0], axis=1).mean()
subj_agg_means_yearly

subject,Bob,Guido,Sue
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2013,38.525,36.775,36.8
2014,37.9,41.825,36.7
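
* Recent pandas releases (2.1 and later) deprecate `groupby(..., axis=1)`. One possible workaround, sketched here, is to transpose, group on the row level, and transpose back. The frame below re-enters the yearly means shown above so the block runs on its own; `type_means` and `subject_means` are illustrative names:

```python
import pandas as pd

columns = pd.MultiIndex.from_product(
    [['Bob', 'Guido', 'Sue'], ['HR', 'Temp']], names=['subject', 'type']
)
data_mean = pd.DataFrame(
    [[39.5, 37.55, 37.0, 36.55, 37.0, 36.6],
     [39.0, 36.8, 46.0, 37.65, 36.5, 36.9]],
    index=pd.Index([2013, 2014], name='year'),
    columns=columns,
)

# Transpose so the column levels become row levels, group, then transpose back
type_means = data_mean.T.groupby(level='type').mean().T
subject_means = data_mean.T.groupby(level='subject').mean().T

print(type_means)
print(subject_means)
```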


In [85]:
# With no arguments, mean() averages down the rows (the years), returning one value per (subject, type) column
data_mean.mean()

subject  type
Bob      HR      39.250
         Temp    37.175
Guido    HR      41.500
         Temp    37.100
Sue      HR      36.750
         Temp    36.750
dtype: float64