### Hierarchical Indexing

* Higher-dimensional data: data indexed by more than one or two keys
* Pandas handles this with hierarchical indexing, commonly referred to as multi-indexing
* In this section, we’ll explore:
    * the direct creation of MultiIndex objects
    * considerations around indexing, slicing, and computing statistics across multiply indexed data
    * useful routines for converting between simple and hierarchically indexed representations of your data


In [1]:
import numpy as np
import pandas as pd

#### First, Though ... "This Is Not The Way"

In [2]:
index = [('California', 2000), ('California', 2010), ('New York', 2000), 
        ('New York', 2010), ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956, 18976457, 19378102, 20851820, 25145561]

In [3]:
pop = pd.Series(populations, index=index)
pop

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

In [4]:
# Indexing and slicing on the tuple keys still work fairly straightforwardly
pop[('California', 2010): ('Texas', 2000)]

(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
dtype: int64

In [5]:
# However, selecting all values from 2010 gets messy: you have to loop over the index tuples
pop[[i for i in pop.index if i[1] == 2010]]

(California, 2010)    37253956
(New York, 2010)      19378102
(Texas, 2010)         25145561
dtype: int64

#### The better way: Pandas MultiIndex (.. This is the Way)

In [7]:
# Create a MultiIndex from the tuples as follows
index = pd.MultiIndex.from_tuples(index)
index

MultiIndex([('California', 2000),
            ('California', 2010),
            (  'New York', 2000),
            (  'New York', 2010),
            (     'Texas', 2000),
            (     'Texas', 2010)],
           )

In [8]:
index.levels

FrozenList([['California', 'New York', 'Texas'], [2000, 2010]])

In [9]:
# Reindex our series w/above MultiIndex
pop = pop.reindex(index)
pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

* In this multi-index representation, a blank entry in the first column means the value is the same as in the line above it

In [10]:
# Access all data for which the second index is 2010
pop[:, 2010]

California    37253956
New York      19378102
Texas         25145561
dtype: int64
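
* The same cross-section can also be taken with `Series.xs`, which states the level explicitly instead of relying on the position of the empty slice. A minimal sketch, rebuilding the series so the block runs on its own (the names `pop_demo` and `by_year` are illustrative and not part of the notebook above):

```python
import pandas as pd

# Rebuild the multiply indexed population series from the cells above
tuples = [('California', 2000), ('California', 2010),
          ('New York', 2000), ('New York', 2010),
          ('Texas', 2000), ('Texas', 2010)]
values = [33871648, 37253956, 18976457, 19378102, 20851820, 25145561]
pop_demo = pd.Series(values, index=pd.MultiIndex.from_tuples(tuples))

# Select every entry whose second level (position 1) equals 2010;
# the selected level is dropped from the result by default
by_year = pop_demo.xs(2010, level=1)
print(by_year)
```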

#### MultiIndex as extra dimension
* You might notice something else here: we could easily have stored the same data using a simple DataFrame with index and column labels. 
* In fact, Pandas is built with this equivalence in mind. 
    * The unstack() method will quickly convert a multiply-indexed Series into a conventionally indexed DataFrame:

In [11]:
pop_df = pop.unstack()
pop_df

                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


In [13]:
# Back to Series
pop_df.stack(), type(pop_df.stack())

(California  2000    33871648
             2010    37253956
 New York    2000    18976457
             2010    19378102
 Texas       2000    20851820
             2010    25145561
 dtype: int64,
 pandas.core.series.Series)
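
* By default `unstack()` pivots the innermost index level into columns. As a hedged sketch of the round trip described above, the block below also pivots the outer level with `level=0` and then stacks back to a Series (the names `pop_demo`, `wide_by_year`, `wide_by_state`, and `round_trip` are illustrative):

```python
import pandas as pd

tuples = [('California', 2000), ('California', 2010),
          ('New York', 2000), ('New York', 2010),
          ('Texas', 2000), ('Texas', 2010)]
values = [33871648, 37253956, 18976457, 19378102, 20851820, 25145561]
pop_demo = pd.Series(values, index=pd.MultiIndex.from_tuples(tuples))

wide_by_year = pop_demo.unstack()          # rows: states, columns: years
wide_by_state = pop_demo.unstack(level=0)  # rows: years, columns: states

# stack() moves the column labels back into the innermost index level
round_trip = wide_by_year.stack()

print(wide_by_state)
print(round_trip.equals(pop_demo))
```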

* What's the point?
    * Each extra level in a multi-index represents an extra dimension of data; taking advantage of this property gives us much more flexibility in the types of data we can represent

In [14]:
# Concretely, we might want to add another column of demographic data for each state at each year 
# (say, population under 18); with a MultiIndex this is as easy as adding another column to the DataFrame:

pop_df = pd.DataFrame({'total':pop, 'under18':[9267089, 9284094, 4687374, 4318033, 5906301,6879014]})
pop_df

                    total  under18
California 2000  33871648  9267089
           2010  37253956  9284094
New York   2000  18976457  4687374
           2010  19378102  4318033
Texas      2000  20851820  5906301
           2010  25145561  6879014


In [15]:
# Fraction of the population under 18: divide, not subtract
frac_u18 = pop_df['under18'] / pop_df['total']
frac_u18

California  2000    0.273594
            2010    0.249211
New York    2000    0.247010
            2010    0.222831
Texas       2000    0.283251
            2010    0.273568
dtype: float64

In [16]:
frac_u18.unstack()

                2000      2010
California  0.273594  0.249211
New York    0.247010  0.222831
Texas       0.283251  0.273568


### Methods of MultiIndex Creation

In [17]:
# Most straightforward way: pass a list of two or more index arrays to the constructor
df_1 = pd.DataFrame(np.random.rand(4,2), index=[['a', 'a', 'b', 'b'],[1,2,1,2]], 
                    columns=['data1', 'data2'])
df_1

        data1     data2
a 1  0.977240  0.720920
  2  0.649254  0.500668
b 1  0.380793  0.060952
  2  0.581576  0.097130


In [19]:
data_mic = {('California', 2000): 33871648,('California', 2010): 37253956,('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561, ('New York', 2000): 18976457,('New York', 2010):19378102}
data_mic

{('California', 2000): 33871648,
 ('California', 2010): 37253956,
 ('Texas', 2000): 20851820,
 ('Texas', 2010): 25145561,
 ('New York', 2000): 18976457,
 ('New York', 2010): 19378102}

In [20]:
# A dictionary with tuples as keys is recognized and given a MultiIndex automatically
pd.Series(data_mic)

California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64

#### Explicit MultiIndex constructors

In [30]:
# you can construct the MultiIndex from a simple list of arrays, giving the index values within each level:
pd_multi = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1,2,1,2]])
pd_multi

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [31]:
pd_multi.levels, pd_multi.values

(FrozenList([['a', 'b'], [1, 2]]),
 array([('a', 1), ('a', 2), ('b', 1), ('b', 2)], dtype=object))

In [37]:
pd_multi[:2]

MultiIndex([('a', 1),
            ('a', 2)],
           )

In [41]:
pd_tups = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

In [42]:
pd_tups.values

array([('a', 1), ('a', 2), ('b', 1), ('b', 2)], dtype=object)

* Similarly, you can construct the MultiIndex directly using its internal encoding by passing levels (a list of lists containing the available index values for each level) and codes (a list of lists of integer positions that reference those values; older pandas versions called this argument labels):
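
A minimal sketch of that internal encoding, assuming a recent pandas release in which the integer positions are passed as `codes` (older releases named this argument `labels`). The names `explicit` and `product` are illustrative; `from_product` is included only to show a shorter way to get the same index:

```python
import pandas as pd

# levels: the unique values available at each level
# codes: for each level, integer positions into that level's values
explicit = pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
                         codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

# The same index built as the Cartesian product of the level values
product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

print(explicit)
print(explicit.equals(product))
```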


In [45]:
# MultiIndex level names
pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [46]:
pop.index.names = ['state', 'year']
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
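
* Level names can also be supplied when the index is created, rather than assigned afterwards. A small sketch under that assumption (the names `named_index` and `pop_named` are illustrative), showing that a named level can then be looked up by name:

```python
import pandas as pd

tuples = [('California', 2000), ('California', 2010),
          ('New York', 2000), ('New York', 2010),
          ('Texas', 2000), ('Texas', 2010)]
values = [33871648, 37253956, 18976457, 19378102, 20851820, 25145561]

# Pass names= to the constructor instead of setting index.names later
named_index = pd.MultiIndex.from_tuples(tuples, names=['state', 'year'])
pop_named = pd.Series(values, index=named_index)

# A level can now be referred to by name rather than by position
years = pop_named.index.get_level_values('year')
print(list(pop_named.index.names))
print(list(years))
```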

In [47]:
# MultiIndex for Columns

# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1,2]], names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']], names=['subject', 'type']) 

# mock some data
data_muic = np.round(np.random.randn(4,6), 1)
data_muic[:, ::2] *= 10
data_muic += 37

# create the DataFrame
health_data = pd.DataFrame(data_muic, index=index, columns=columns)
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      30.0  35.8  34.0  36.8  58.0  37.9
     2      31.0  37.1   6.0  38.1  13.0  36.8
2014 1      33.0  38.1  50.0  36.5  22.0  38.0
     2      46.0  36.4  25.0  37.4  43.0  37.5


* Here we see where multi-indexing for both rows and columns can come in very handy. This is fundamentally four-dimensional data, where the dimensions are the subject, the measurement type, the year, and the visit number.
* With this in place we can, for example, index the top-level column by the person’s name and get a full DataFrame containing just that person’s information:

In [48]:
health_data['Guido']

type          HR  Temp
year visit
2013 1      34.0  36.8
     2       6.0  38.1
2014 1      50.0  36.5
     2      25.0  37.4


In [51]:
health_data.index, health_data.columns

(MultiIndex([(2013, 1),
             (2013, 2),
             (2014, 1),
             (2014, 2)],
            names=['year', 'visit']),
 MultiIndex([(  'Bob',   'HR'),
             (  'Bob', 'Temp'),
             ('Guido',   'HR'),
             ('Guido', 'Temp'),
             (  'Sue',   'HR'),
             (  'Sue', 'Temp')],
            names=['subject', 'type']))
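
* The four dimensions mentioned earlier can be read straight off these two MultiIndex objects. A hedged sketch with placeholder values (the frame `demo` and the other variable names are illustrative stand-ins, not the notebook's random `health_data`):

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# Placeholder values; only the labels matter for this illustration
demo = pd.DataFrame(np.zeros((4, 6)), index=index, columns=columns)

# Two row levels plus two column levels give four dimensions
dimension_names = list(demo.index.names) + list(demo.columns.names)
n_dimensions = demo.index.nlevels + demo.columns.nlevels
print(dimension_names, n_dimensions)

# Moving both column levels into the rows gives one long Series
# whose index carries all four levels
long_form = demo.stack(['subject', 'type'])
print(long_form.index.nlevels, len(long_form))
```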

### Indexing and Slicing a MultiIndex

In [54]:
# A Series
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [55]:
# Single Elements by indexing w/multiple terms
pop['California', 2000]

33871648

In [56]:
# MultiIndex also supports partial indexing, or indexing just one of the levels in the index
pop['California']

year
2000    33871648
2010    37253956
dtype: int64

In [60]:
pop.loc['California':'New York'] # label slices include both endpoints; .loc is recommended for clarity but not required

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
dtype: int64
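
* Label slices like the one above rely on the MultiIndex being sorted. As a cautious sketch, the block below reuses the `data_mic` ordering from earlier (where 'Texas' comes before 'New York'), tries the slice, and then sorts first. The names `data_demo`, `unsorted`, and `sorted_pop` are illustrative, and the error on the unsorted slice is what I would expect from current pandas rather than something shown in the notebook:

```python
import pandas as pd

# Same data as data_mic above: 'Texas' is listed before 'New York',
# so the resulting index is not in sorted order
data_demo = {('California', 2000): 33871648, ('California', 2010): 37253956,
             ('Texas', 2000): 20851820, ('Texas', 2010): 25145561,
             ('New York', 2000): 18976457, ('New York', 2010): 19378102}
unsorted = pd.Series(data_demo)
print(unsorted.index.is_monotonic_increasing)

try:
    unsorted.loc['California':'New York']
except KeyError as err:
    # Expected: pandas rejects label slices on an unsorted MultiIndex
    # (UnsortedIndexError is a KeyError subclass)
    print('slice failed:', err)

# Sorting the index first makes label-based slicing well defined
sorted_pop = unsorted.sort_index()
sliced = sorted_pop.loc['California':'New York']
print(sliced)
```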

In [61]:
# Partial indexing on lower levels: pass an empty slice for each level above the desired one
pop[:, 2010] # Just 2010 for each outer level state

state
California    37253956
New York      19378102
Texas         25145561
dtype: int64

In [62]:
# Selection Based on Boolean Masks
pop[pop > 22000000]

state       year
California  2000    33871648
            2010    37253956
Texas       2010    25145561
dtype: int64

In [63]:
# Fancy Indexing
pop[['California', 'Texas']]

state       year
California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
dtype: int64

In [64]:
# Fancy Multiple Levels
pop.loc[[('California', 2000), ('Texas', 2010)]]

state       year
California  2000    33871648
Texas       2010    25145561
dtype: int64

In [67]:
## Back to Health Data
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      30.0  35.8  34.0  36.8  58.0  37.9
     2      31.0  37.1   6.0  38.1  13.0  36.8
2014 1      33.0  38.1  50.0  36.5  22.0  38.0
     2      46.0  36.4  25.0  37.4  43.0  37.5


In [69]:
health_data['Guido', 'HR'] # Guido's Heart Rate Data w/simple operation

year  visit
2013  1        34.0
      2         6.0
2014  1        50.0
      2        25.0
Name: (Guido, HR), dtype: float64

In [73]:
health_data.iloc[:2, :2]

subject      Bob
type          HR  Temp
year visit
2013 1      30.0  35.8
     2      31.0  37.1


In [74]:
health_data.index, health_data.columns

(MultiIndex([(2013, 1),
             (2013, 2),
             (2014, 1),
             (2014, 2)],
            names=['year', 'visit']),
 MultiIndex([(  'Bob',   'HR'),
             (  'Bob', 'Temp'),
             ('Guido',   'HR'),
             ('Guido', 'Temp'),
             (  'Sue',   'HR'),
             (  'Sue', 'Temp')],
            names=['subject', 'type']))

In [75]:
# Each individual index in loc or iloc can be passed a tuple of multiple indices
health_data.loc[:, ('Bob', 'HR')]

year  visit
2013  1        30.0
      2        31.0
2014  1        33.0
      2        46.0
Name: (Bob, HR), dtype: float64
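
* Tuples can be used on the row axis and the column axis at the same time. A minimal sketch on a stand-in frame filled with the integers 0 to 23, so each cell is easy to identify (`demo`, `cell`, and `visit_row` are illustrative names, not part of the notebook):

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# Deterministic stand-in values 0..23 instead of random health data
demo = pd.DataFrame(np.arange(24).reshape(4, 6), index=index, columns=columns)

# A (year, visit) row tuple plus a (subject, type) column tuple picks one cell
cell = demo.loc[(2013, 2), ('Guido', 'Temp')]

# A full row tuple with a slice over the columns returns that visit as a Series
visit_row = demo.loc[(2014, 1), :]

print(cell)
print(visit_row)
```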

In [91]:
# Positional indexing with iloc: first row, columns at positions 3 and 4
health_data.iloc[:1, 3:5]

subject    Guido   Sue
type        Temp    HR
year visit
2013 1      36.8  58.0


#### IndexSlice Object

In [94]:
# Review health_data
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      30.0  35.8  34.0  36.8  58.0  37.9
     2      31.0  37.1   6.0  38.1  13.0  36.8
2014 1      33.0  38.1  50.0  36.5  22.0  38.0
     2      46.0  36.4  25.0  37.4  43.0  37.5


In [95]:
idx = pd.IndexSlice
health_data.loc[idx[:, 1], idx[:, 'HR']]

subject      Bob Guido   Sue
type          HR    HR    HR
year visit
2013 1      30.0  34.0  58.0
2014 1      33.0  50.0  22.0
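
* `pd.IndexSlice` exists because a bare `:` cannot be written inside a tuple; indexing it with `[:, 1]` simply hands back the tuple `(slice(None), 1)`. A small sketch on a stand-in frame with values 0 to 23 (`demo`, `with_index_slice`, and `with_builtin_slice` are illustrative names) showing that the two spellings select the same cells:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# Deterministic stand-in values 0..23 instead of random health data
demo = pd.DataFrame(np.arange(24).reshape(4, 6), index=index, columns=columns)

idx = pd.IndexSlice
# First visit of every year, heart-rate column of every subject
with_index_slice = demo.loc[idx[:, 1], idx[:, 'HR']]

# The same selection spelled out with the built-in slice(None)
with_builtin_slice = demo.loc[(slice(None), 1), (slice(None), 'HR')]

print(with_index_slice)
print(with_index_slice.equals(with_builtin_slice))
```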
