# Hierarchical indexing

Often it is useful to go beyond one- and two-dimensional data and store higher-dimensional data, that is, data indexed by more than one or two keys.

A common pattern in practice is to make use of **hierarchical indexing** (also known as multi-indexing) to incorporate multiple index levels within a single index. In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional Series and two-dimensional DataFrame objects.

In [1]:
import pandas as pd
import numpy as np

## A Multiply Indexed Series

Let's start by considering how we might represent **two-dimensional data within a one-dimensional Series**. For concreteness, we will consider a series of data where each point has a character key and a numerical key.

### The bad way

Suppose you would like to track data about states from two different years. Using the Pandas tools we've already covered, you might be tempted to simply use Python tuples as keys:

In [2]:
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]

populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]

pop = pd.Series(populations, index=index)
pop

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

With this indexing scheme, you can straightforwardly index or slice the series based on this multiple index:

In [3]:
pop[('California', 2010):('Texas', 2000)]

(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
dtype: int64

But the convenience ends there.

For example, if you need to select all values from 2000, you'll need to do some messy (and potentially slow) munging to make it happen:

In [4]:
sel = [i for i in pop.index if i[1] == 2000]
print(sel)
pop[sel]

[('California', 2000), ('New York', 2000), ('Texas', 2000)]


(California, 2000)    33871648
(New York, 2000)      18976457
(Texas, 2000)         20851820
dtype: int64

## The Better Way: Pandas MultiIndex

Our tuple-based indexing is essentially a rudimentary multi-index, and the Pandas MultiIndex type gives us the type of operations we wish to have. We can create a multi-index from the tuples as follows:

In [5]:
print(index)
print()
ind = pd.MultiIndex.from_tuples(index)
ind



[('California', 2000), ('California', 2010), ('New York', 2000), ('New York', 2010), ('Texas', 2000), ('Texas', 2010)]



MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
           codes=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
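
To make the `levels` and `codes` in the repr above less opaque, here is a minimal sketch (the names `demo_index` and `rebuilt` are illustrative and not part of the original notebook). It assumes a pandas version that exposes `MultiIndex.codes`, as the output above suggests. Each level stores its unique labels once, and `codes` stores, for every row, the position of its label within that level:

```python
import pandas as pd

tuples = [('California', 2000), ('California', 2010),
          ('New York', 2000), ('New York', 2010),
          ('Texas', 2000), ('Texas', 2010)]
demo_index = pd.MultiIndex.from_tuples(tuples)

# levels: the unique labels of each level
# codes: for every row, the integer position of its label in that level
levels = [list(level) for level in demo_index.levels]
codes = [list(code) for code in demo_index.codes]

# Rebuild the original tuples by looking each code up in its level
rebuilt = [(levels[0][i], levels[1][j]) for i, j in zip(codes[0], codes[1])]
print(rebuilt == tuples)
```

Storing labels once and referring to them by integer code is what keeps the hierarchical representation compact.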

In [6]:
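# Extend each (state, year) tuple with a random integer as a third key,
# giving a three-level index to experiment with
index_ext = []
for u in index:
    ul = list(u)
    ul.append(np.random.randint(19))
    index_ext.append(tuple(ul))

ind_ext = pd.MultiIndex.from_tuples(index_ext)
ind_ext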

MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010], [0, 1, 2, 7, 9, 11]],
           codes=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1], [0, 2, 5, 4, 3, 1]])

If we re-index our series with this MultiIndex, we see the hierarchical representation of the data:

In [7]:
pop_ri = pop.reindex( ind )
pop_ri

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [8]:
pop_ext = pd.Series(populations, index=index_ext)
pop_ext

(California, 2000, 0)    33871648
(California, 2010, 2)    37253956
(New York, 2000, 11)     18976457
(New York, 2010, 9)      19378102
(Texas, 2000, 7)         20851820
(Texas, 2010, 1)         25145561
dtype: int64

In [9]:
pop_ext_ri = pop_ext.reindex( ind_ext )
pop_ext_ri

California  2000  0     33871648
            2010  2     37253956
New York    2000  11    18976457
            2010  9     19378102
Texas       2000  7     20851820
            2010  1     25145561
dtype: int64

Notice that some entries are missing in the first column: in this multi-index representation, any blank entry indicates the same value as the line above it.

## MultiIndex as extra dimension

You might notice something else here: we could easily have stored the same data using a simple DataFrame with index and column labels. In fact, Pandas is built with this equivalence in mind. 

The unstack() method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame:

In [10]:
pop_ri

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [11]:
# unstack() moves the innermost (last) index level into the columns;
# the remaining level(s) stay in the row index
df_pop_ri = pop_ri.unstack()
df_pop_ri

                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561
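
Which level ends up in the columns can be chosen explicitly. The sketch below rebuilds the population Series under the illustrative name `pop_demo`, with level names `state` and `year` added for readability (both are assumptions of this example), and compares the default `unstack()` with `unstack(level=0)`:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop_demo = pd.Series([33871648, 37253956, 18976457,
                      19378102, 20851820, 25145561], index=index)

by_year = pop_demo.unstack()          # default level=-1: years become columns
by_state = pop_demo.unstack(level=0)  # states become columns instead

print(by_year)
print(by_state)
```

Here the two results are transposes of each other, since there are only two levels to choose from.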


In [12]:
pop_ext_ri

California  2000  0     33871648
            2010  2     37253956
New York    2000  11    18976457
            2010  9     19378102
Texas       2000  7     20851820
            2010  1     25145561
dtype: int64

In [13]:
# Many NaNs appear: each (state, year) pair has a value for only one
# of the third-level keys, so every other cell in its row is empty

pop_ext_ri.unstack()

                          0           1           2           7           9          11
California 2000  33871648.0         NaN         NaN         NaN         NaN         NaN
           2010         NaN         NaN  37253956.0         NaN         NaN         NaN
New York   2000         NaN         NaN         NaN         NaN         NaN  18976457.0
           2010         NaN         NaN         NaN         NaN  19378102.0         NaN
Texas      2000         NaN         NaN         NaN  20851820.0         NaN         NaN
           2010         NaN  25145561.0         NaN         NaN         NaN         NaN


In [14]:
# unstacking again moves the year level into the columns, which now form a MultiIndex

pop_ext_ri.unstack().unstack()

Unnamed: 0_level_0,0,0,1,1,2,2,7,7,9,9,11,11
Unnamed: 0_level_1,2000,2010,2000,2010,2000,2010,2000,2010,2000,2010,2000,2010
California,33871648.0,,,,,37253956.0,,,,,,
New York,,,,,,,,,,19378102.0,18976457.0,
Texas,,,,25145561.0,,,20851820.0,,,,,


In [15]:
# Unstacking a third time moves the last remaining row level (the states)
# into the columns. With no row index left, pandas returns a Series indexed
# by all three levels in reverse order, and every combination that had no
# value in the original data now appears as a row holding NaN.

pop_ext_ri.unstack().unstack().unstack()

0   2000  California    33871648.0
          New York             NaN
          Texas                NaN
    2010  California           NaN
          New York             NaN
          Texas                NaN
1   2000  California           NaN
          New York             NaN
          Texas                NaN
    2010  California           NaN
          New York             NaN
          Texas         25145561.0
2   2000  California           NaN
          New York             NaN
          Texas                NaN
    2010  California    37253956.0
          New York             NaN
          Texas                NaN
7   2000  California           NaN
          New York             NaN
          Texas         20851820.0
    2010  California           NaN
          New York             NaN
          Texas                NaN
9   2000  California           NaN
          New York             NaN
          Texas                NaN
    2010  California           NaN
          New York  

Naturally, the stack() method provides the opposite operation:

In [16]:
df_pop_ri

                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


In [17]:
# stack() turns the DataFrame back into a Series:
# the column labels become the innermost (second) level
# of the resulting MultiIndex
df_pop_ri.stack()

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
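
As a quick check that `stack()` undoes `unstack()` for complete data like this, the following sketch round-trips a Series through both methods (`pop_demo`, `wide`, and `long` are illustrative names, not from the original notebook):

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]])
pop_demo = pd.Series([33871648, 37253956, 18976457,
                      19378102, 20851820, 25145561], index=index)

wide = pop_demo.unstack()  # DataFrame: states as rows, years as columns
long = wide.stack()        # back to a multiply indexed Series

# Same labels in the same order, and the same values
print(list(long.index) == list(pop_demo.index))
print(long.tolist() == pop_demo.tolist())
```

The round trip is only lossless when every combination of labels has a value; with gaps, `unstack()` introduces NaN entries, as in the three-level example earlier.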

In [21]:
pop_df = pd.DataFrame({
    'total': pop_ri,                # the multiply indexed Series supplies the MultiIndex
    'under18': [9267089, 9284094,   # a plain list, added as an ordinary column
                4687374, 4318033,
                5906301, 6879014],
})
pop_df

                    total  under18
California 2000  33871648  9267089
           2010  37253956  9284094
New York   2000  18976457  4687374
           2010  19378102  4318033
Texas      2000  20851820  5906301
           2010  25145561  6879014


In addition, all the ufuncs and other functionality discussed in Operating on Data in Pandas work with hierarchical indices as well. 

Here we compute the fraction of people under 18 by year, given the above data:

In [22]:
# dividing one column by another yields a Series that keeps the MultiIndex
f_u18 = pop_df['under18'] / pop_df['total']
f_u18

California  2000    0.273594
            2010    0.249211
New York    2000    0.247010
            2010    0.222831
Texas       2000    0.283251
            2010    0.273568
dtype: float64

In [23]:
f_u18.unstack()

                2000      2010
California  0.273594  0.249211
New York    0.247010  0.222831
Texas       0.283251  0.273568


In [25]:
# the result of the division can be stored as a new column of the multiply indexed DataFrame
pop_df['rel_under18'] = pop_df['under18'] / pop_df['total']
pop_df

                    total  under18  rel_under18
California 2000  33871648  9267089     0.273594
           2010  37253956  9284094     0.249211
New York   2000  18976457  4687374     0.247010
           2010  19378102  4318033     0.222831
Texas      2000  20851820  5906301     0.283251
           2010  25145561  6879014     0.273568
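
The same holds for NumPy ufuncs applied directly. As a small sketch (the Series `total` and the choice of `np.log10` are illustrative assumptions, not part of the original notebook), the ufunc operates element-wise and the result keeps the hierarchical index:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'Texas'], [2000, 2010]], names=['state', 'year'])
total = pd.Series([33871648, 37253956, 20851820, 25145561], index=index)

log_total = np.log10(total)  # applied element-wise

# The result is still a Series carrying the same MultiIndex
print(type(log_total).__name__)
print(log_total.index.equals(total.index))
```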


## Methods of MultiIndex Creation

The most straightforward way to construct a multiply indexed Series or DataFrame is to simply pass a list of two or more index arrays to the constructor. For example:

In [27]:
df = pd.DataFrame(
    np.random.rand(4, 2),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=['data1', 'data2'],
)
df

        data1     data2
a 1  0.969558  0.668040
  2  0.546886  0.658323
b 1  0.292653  0.755386
  2  0.849751  0.223269


Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a MultiIndex by default:

In [28]:
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)

California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64

### Explicit MultiIndex constructors

For more flexibility in how the index is constructed, you can instead use the class method constructors available on pd.MultiIndex.

For example, as we did before, you can construct the MultiIndex from a simple **list of arrays giving the index values within each level**:

In [29]:
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

MultiIndex(levels=[['a', 'b'], [1, 2]],
           codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

You can construct it from a list of tuples giving the **multiple index values** of each point:

In [30]:
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

MultiIndex(levels=[['a', 'b'], [1, 2]],
           codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

You can even construct it from a **Cartesian product of single indices**:

In [39]:
pd.MultiIndex.from_product([['a', 'b'], [1, 2]], names=['pippo','pluto'])

MultiIndex(levels=[['a', 'b'], [1, 2]],
           codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
           names=['pippo', 'pluto'])
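
The three constructors above are different routes to the same object. A minimal sketch to confirm this (the variable names are illustrative, and level names are omitted so the comparison is on labels only):

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# All three describe the same four (letter, number) pairs in the same order
print(from_arrays.equals(from_tuples))
print(from_tuples.equals(from_product))
```

`from_product` is the most compact choice when every combination of labels is present; the other two also cover irregular combinations.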

In [37]:
pop_ri.index.names = ['state', 'year']
pop_ri

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
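
Once the levels are named, they can be referred to by name rather than by position. The sketch below is one way this might be used (`pop_demo` is an illustrative stand-in for the Series above):

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop_demo = pd.Series([33871648, 37253956, 18976457,
                      19378102, 20851820, 25145561], index=index)

# Pull out the labels of a single level by name
years = pop_demo.index.get_level_values('year')
print(list(years))

# Aggregate over a named level: largest population recorded per state
max_per_state = pop_demo.groupby(level='state').max()
print(max_per_state)
```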

## MultiIndex for columns

In a DataFrame, the rows and columns are completely symmetric, and just as the rows can have multiple levels of indices, the columns can have multiple levels as well. Consider the following, which is a mock-up of some (somewhat realistic) medical data:

In [38]:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10   # scale columns 0, 2 and 4 (the HR columns) to mimic heart-rate data
data += 37           # re-center

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      27.0  36.7  34.0  37.7  16.0  36.5
     2      32.0  39.1  37.0  37.0  20.0  37.3
2014 1      43.0  36.1  32.0  37.4  33.0  35.8
     2      46.0  37.1  52.0  37.6  12.0  37.7


In [48]:
health_data['Guido']

type          HR  Temp
year visit
2013 1      34.0  37.7
     2      37.0  37.0
2014 1      32.0  37.4
     2      52.0  37.6


In [52]:
health_data.iloc[3,5]

37.7

In [57]:
health_data.loc[2013]

subject   Bob       Guido         Sue
type       HR  Temp    HR  Temp    HR  Temp
visit
1        27.0  36.7  34.0  37.7  16.0  36.5
2        32.0  39.1  37.0  37.0  20.0  37.3

In [59]:
print( type(health_data.loc[2013,1]) )
health_data.loc[2013,1]

<class 'pandas.core.series.Series'>


subject  type
Bob      HR      27.0
         Temp    36.7
Guido    HR      34.0
         Temp    37.7
Sue      HR      16.0
         Temp    36.5
Name: (2013, 1), dtype: float64

## Indexing and Slicing a MultiIndex

Indexing and slicing on a MultiIndex is designed to be intuitive, and it helps if you think about the indices as added dimensions. We'll first look at indexing multiply indexed Series, and then multiply indexed DataFrames.

In [56]:
pop_ri

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [62]:
pop_ri['California',2000]

33871648

The MultiIndex also supports **partial indexing, or indexing just one of the levels** in the index. The result is another Series, with the lower-level indices maintained:

In [65]:
pop_ri['California']  # this sticks to the 1st level key

year
2000    33871648
2010    37253956
dtype: int64

**Partial slicing** is available as well, as long as the MultiIndex is sorted (see the discussion in Sorted and Unsorted Indices):

In [74]:
pop_ri['California':'New York']  # this sticks to the 1st level key

# pop_ri['California':'New York',2000]  # ERR

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
dtype: int64
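
The sortedness requirement can be checked and repaired before slicing. The following sketch uses a deliberately unsorted toy Series (the names `unsorted` and `sorted_demo` are illustrative). It avoids slicing the unsorted version, since pandas typically refuses a partial slice on an unsorted MultiIndex:

```python
import pandas as pd

# First-level labels deliberately out of order ('b' before 'a')
unsorted = pd.Series(
    [1, 2, 3, 4],
    index=pd.MultiIndex.from_tuples([('b', 1), ('b', 2), ('a', 1), ('a', 2)]))
print(unsorted.index.is_monotonic_increasing)

# sort_index() returns a copy whose MultiIndex is lexicographically sorted
sorted_demo = unsorted.sort_index()
print(sorted_demo.index.is_monotonic_increasing)

# Partial slicing on the first level is now possible
print(sorted_demo.loc['a':'b'])
```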

With sorted indices, partial indexing can be performed on lower levels by passing an empty slice in the first index:

In [72]:
pop_ri[:,2000]   # this goes to 2nd level key
# pop_ri[:,2000:2010] # ERR

state
California    33871648
New York      18976457
Texas         20851820
dtype: int64
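
An alternative to the empty-slice idiom is the `xs()` (cross-section) method, which selects on a level by name or position. A minimal sketch, assuming the level names `state` and `year` and an illustrative Series `pop_demo`:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop_demo = pd.Series([33871648, 37253956, 18976457,
                      19378102, 20851820, 25145561], index=index)

# All entries for the year 2000; the selected level is dropped from the result
pop_2000 = pop_demo.xs(2000, level='year')
print(pop_2000)
```

Naming the level makes the intent explicit and does not depend on the position of the level within the index.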

Other types of indexing and selection (discussed in Data Indexing and Selection) work as well; for example, selection based on **Boolean masks**:

In [77]:
pop_ri[pop_ri > 22000000]

state       year
California  2000    33871648
            2010    37253956
Texas       2010    25145561
dtype: int64

In [87]:
pop_ri[['California', 'Texas']]  # fancy indexing acts on the first-level keys
# pop_ri[['California', 'Texas'], 2000]   # does not work

# With plain [] indexing, the first level can be selected directly and the
# second level via an empty slice on the first (pop_ri[:, 2000]), but a
# list or slice on one level cannot be combined with a selection on the other.

state       year
California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
dtype: int64
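
Combining a list of labels on one level with a single label on another is possible through the `.loc` indexer rather than plain square brackets. This is a sketch under the assumption of a sorted MultiIndex, with `pop_demo` and `subset` as illustrative names:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop_demo = pd.Series([33871648, 37253956, 18976457,
                      19378102, 20851820, 25145561], index=index)

# A list on the first level together with a scalar on the second level
subset = pop_demo.loc[['California', 'Texas'], 2000]
print(subset)
```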

In [85]:
pop_ri[:,2000]

state
California    33871648
New York      18976457
Texas         20851820
dtype: int64

## Multiply indexed DataFrames

A multiply indexed DataFrame behaves in a similar manner. Consider our toy medical DataFrame from before:

In [82]:
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      27.0  36.7  34.0  37.7  16.0  36.5
     2      32.0  39.1  37.0  37.0  20.0  37.3
2014 1      43.0  36.1  32.0  37.4  33.0  35.8
     2      46.0  37.1  52.0  37.6  12.0  37.7


Remember that **columns are primary in a DataFrame**, and the syntax used for multiply indexed Series applies to the columns. For example, we can recover Guido's heart rate data with a simple operation:

In [86]:
health_data['Guido', 'HR']

year  visit
2013  1        34.0
      2        37.0
2014  1        32.0
      2        52.0
Name: (Guido, HR), dtype: float64

Also, as with the single-index case, we can use the loc and iloc indexers introduced in Data Indexing and Selection (the older ix indexer has been removed from pandas). For example:

In [88]:
health_data.iloc[:2, :2]

subject      Bob
type          HR  Temp
year visit
2013 1      27.0  36.7
     2      32.0  39.1


These indexers provide an array-like view of the underlying two-dimensional data, but each individual index in loc or iloc can be passed a tuple of multiple indices. For example:

In [89]:
health_data.loc[:, ('Bob', 'HR')]

year  visit
2013  1        27.0
      2        32.0
2014  1        43.0
      2        46.0
Name: (Bob, HR), dtype: float64

In [90]:
health_data.loc[(2013,2),:]

subject  type
Bob      HR      32.0
         Temp    39.1
Guido    HR      37.0
         Temp    37.0
Sue      HR      20.0
         Temp    37.3
Name: (2013, 2), dtype: float64
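
When slices are needed inside these tuples, writing them out by hand becomes awkward. One option is the `pd.IndexSlice` helper. The sketch below uses a stand-in DataFrame `demo` filled with consecutive integers (an assumption, so that the selected values are predictable) in place of the random `health_data`:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
demo = pd.DataFrame(np.arange(24).reshape(4, 6), index=index, columns=columns)

idx = pd.IndexSlice
# Rows: every year, first visit only. Columns: every subject, HR only.
first_visit_hr = demo.loc[idx[:, 1], idx[:, 'HR']]
print(first_visit_hr)
```

Both axes need a sorted MultiIndex for this kind of selection, which `from_product` with sorted inputs provides here.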