# Hierarchical indexing

Often it is useful to go beyond one- and two-dimensional data and store higher-dimensional data, that is, data indexed by more than one or two keys.

A common pattern in practice is to use **hierarchical indexing** (also known as multi-indexing) to incorporate multiple index levels within a single index. In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional Series and two-dimensional DataFrame objects.

In [10]:
import pandas as pd
import numpy as np

## A Multiply Indexed Series

Let's start by considering how we might represent **two-dimensional data within a one-dimensional Series**. For concreteness, we will consider a series of data where each point has a character and numerical key.

### The Bad Way

Suppose you would like to track data about states from two different years. Using the Pandas tools we've already covered, you might be tempted to simply use Python tuples as keys:

In [11]:
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]

populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]

pop = pd.Series(populations, index=index)
pop

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

With this indexing scheme, you can straightforwardly index or slice the series based on this multiple index:

In [12]:
pop[('California', 2010):('Texas', 2000)]

(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
dtype: int64
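
The slice above works because Python compares tuples element by element and the index happens to be stored in that sorted order. The following is a minimal sketch of that ordering using only built-in tuple comparison; the names `keys`, `lo`, `hi`, and `in_slice` are illustrative and not part of the original example.

```python
# Keys in the same order as the tuple-based index above
keys = [('California', 2000), ('California', 2010),
        ('New York', 2000), ('New York', 2010),
        ('Texas', 2000), ('Texas', 2010)]

# Tuples compare lexicographically: first by state, then by year
is_sorted = keys == sorted(keys)

# Keys that fall between the two slice endpoints (both inclusive)
lo, hi = ('California', 2010), ('Texas', 2000)
in_slice = [k for k in keys if lo <= k <= hi]

print(is_sorted)
print(in_slice)
```

The keys collected in `in_slice` are the same four entries that the label-based slice returns.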

But the convenience ends there.

For example, if you need to select all values from 2000, you'll need to do some messy (and potentially slow) munging to make it happen:

In [13]:
# Collect every (state, year) key whose year is 2000
sel = [i for i in pop.index if i[1] == 2000]
print(sel)
pop[sel]

[('California', 2000), ('New York', 2000), ('Texas', 2000)]


(California, 2000)    33871648
(New York, 2000)      18976457
(Texas, 2000)         20851820
dtype: int64

### The Better Way: Pandas MultiIndex

Our tuple-based indexing is essentially a rudimentary multi-index, and the Pandas MultiIndex type provides the kind of operations we want. We can create a multi-index from the tuples as follows:

In [32]:
print(index)
print()
ind = pd.MultiIndex.from_tuples(index)
ind



MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
           codes=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
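
To make the `levels` and `codes` in this representation concrete, here is a small sketch that decodes them back into the original tuples. It assumes a pandas version in which `MultiIndex.codes` is available (the output above suggests this is the case); the names `states`, `years`, and `decoded` are illustrative.

```python
import pandas as pd

index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
ind = pd.MultiIndex.from_tuples(index)

# Each level holds the unique labels; each code array holds integer
# positions into the matching level, one entry per row.
states = ind.levels[0].take(ind.codes[0])
years = ind.levels[1].take(ind.codes[1])

# Zipping the decoded labels recovers the original tuples
decoded = list(zip(states, years))
print(decoded == index)
```

Storing each distinct label once and referring to it by position is what lets the index represent repeated keys compactly.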




In [39]:
# A MultiIndex can have more than two levels: append a random
# integer (0-18) to each (state, year) tuple as a third key.
index_ext = []
for u in index:
    index_ext.append(u + (np.random.randint(19),))

ind_ext = pd.MultiIndex.from_tuples(index_ext)
ind_ext

MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010], [1, 2, 6, 7, 14]],
           codes=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1], [3, 1, 4, 2, 0, 0]])

If we re-index our series with the two-level MultiIndex ind created earlier, we see the hierarchical representation of the data:

In [42]:
pop = pop.reindex(ind)
pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

Notice that some entries are missing in the first column: in this multi-index representation, any blank entry indicates the same value as the line above it.
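
With the hierarchical index in place, the selection that required a list comprehension on the tuple-based index becomes a single call. The sketch below is one possible way to do it, using `Series.xs` to pick a value on the second level; it builds the series directly with a MultiIndex rather than through `reindex`, and the name `pop_2010` is illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('California', 2000), ('California', 2010),
     ('New York', 2000), ('New York', 2010),
     ('Texas', 2000), ('Texas', 2010)])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# Select every entry whose second level (the year) equals 2010;
# the result is indexed by the remaining level (the state).
pop_2010 = pop.xs(2010, level=1)
print(pop_2010)

# Partial indexing on the first level returns all years for one state
print(pop.loc['California'])
```

Because the year is a proper index level rather than the second element of an opaque tuple, no Python-level loop over the keys is needed.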