In [1]:
import pandas as pd
import numpy as np

In [2]:
print(pd.__version__)
print(np.__version__)

1.4.3
1.21.2


##### Points to note



- hierarchical indexing
- multiply indexed series
- multi index as extra dimension
- methods of multi index creation
- index setting and resetting

###### Hierarchical Indexing

Up to this point we’ve been focused primarily on one-dimensional and two-dimensional
data, stored in Pandas Series and DataFrame objects, respectively. Often it is useful
to go beyond this and store higher-dimensional data, that is, data indexed by more than
one or two keys. Older Pandas releases provided Panel and Panel4D objects that natively
handled three-dimensional and four-dimensional data, but these have since been removed
from the library; the far more common pattern in practice is to make use of hierarchical
indexing (also known as multi-indexing) to incorporate multiple index levels within a
single index. In this way, higher-dimensional data can be compactly represented within
the familiar one-dimensional Series and two-dimensional DataFrame objects.

In this section, we’ll explore the direct creation of MultiIndex objects; considerations
around indexing, slicing, and computing statistics across multiply indexed data; and
useful routines for converting between simple and hierarchically indexed
representations of your data.

##### A Multiply Indexed Series

Let’s start by considering how we might represent two-dimensional data within a
one-dimensional Series. For concreteness, we will consider a series of data where
each point has a character and numerical key.

##### The bad way

Suppose you would like to track data about states from two different years. Using the
Pandas tools we’ve already covered, you might be tempted to simply use Python
tuples as keys:

In [3]:
index = [('California', 2000), ('California', 2010),
                ('New York', 2000), ('New York', 2010),
                ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
                      18976457, 19378102,
                      20851820, 25145561]
pop = pd.Series(populations, index=index)
pop

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

With this indexing scheme, you can straightforwardly index or slice the series based
on this multiple index:

In [5]:
pop[('California', 2010):('Texas', 2000)]

(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
dtype: int64

But the convenience ends there. For example, if you need to select all values from
2010, you’ll need to do some messy (and potentially slow) munging to make it
happen:

In [6]:
pop[[i for i in pop.index if i[1] == 2010]]

(California, 2010)    37253956
(New York, 2010)      19378102
(Texas, 2010)         25145561
dtype: int64

This produces the desired result, but is not as clean (or as efficient for large datasets)
as the slicing syntax we’ve grown to love in Pandas.

##### The better way: Pandas MultiIndex

Fortunately, Pandas provides a better way. Our tuple-based indexing is essentially a
rudimentary multi-index, and the Pandas MultiIndex type gives us the type of
operations we wish to have. We can create a multi-index from the tuples as follows:

In [7]:
index = pd.MultiIndex.from_tuples(index)
index

MultiIndex([('California', 2000),
            ('California', 2010),
            (  'New York', 2000),
            (  'New York', 2010),
            (     'Texas', 2000),
            (     'Texas', 2010)],
           )

Notice that the MultiIndex contains multiple levels of indexing—in this case, the state
names and the years, as well as multiple labels for each data point which encode these
levels.
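
To make this encoding concrete, the following sketch inspects the levels and codes
attributes of the same index. The index is rebuilt inline so the snippet runs on its
own, and the variable names are chosen for illustration only; in current Pandas
versions the integer labels are exposed under the name codes.

```python
import pandas as pd

# Rebuild the state/year index from the tuples used above.
tuples = [('California', 2000), ('California', 2010),
          ('New York', 2000), ('New York', 2010),
          ('Texas', 2000), ('Texas', 2010)]
mi = pd.MultiIndex.from_tuples(tuples)

# levels: the unique values available at each level of the index.
state_level = list(mi.levels[0])
year_level = list(mi.levels[1])

# codes: for each data point, the integer position of its value within the level.
state_codes = list(mi.codes[0])
year_codes = list(mi.codes[1])

print(state_level, year_level)
print(state_codes, year_codes)
```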

If we reindex our series with this MultiIndex, we see the hierarchical representation
of the data:

In [8]:
pop = pop.reindex(index)
pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

Here the first two columns of the Series representation show the multiple index
values, while the third column shows the data. Notice that some entries are missing in
the first column: in this multi-index representation, any blank entry indicates the
same value as the line above it.

Now to access all data for which the second index is 2010, we can simply use the
Pandas slicing notation:

In [9]:
pop[:, 2010]

California    37253956
New York      19378102
Texas         25145561
dtype: int64

The result is a singly indexed Series with just the keys we’re interested in. This
syntax is much more convenient (and the operation is much more efficient!) than the
homespun tuple-based multi-indexing solution that we started with. We’ll now further
discuss this sort of indexing operation on hierarchically indexed data.
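
As a rough side-by-side illustration, the sketch below selects the 2010 values twice:
once by scanning tuple keys in Python, and once with the xs() cross-section method on
a MultiIndex. The data is rebuilt inline and the variable names are assumptions made
for this example.

```python
import pandas as pd

tuples = [('California', 2000), ('California', 2010),
          ('New York', 2000), ('New York', 2010),
          ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]

# Tuple-keyed Series: every key has to be inspected in a Python loop.
pop_tuples = pd.Series(populations, index=tuples)
mask = [key[1] == 2010 for key in pop_tuples.index]
by_scan = pop_tuples[mask]

# MultiIndex Series: xs() takes a cross-section at one level and drops that level.
pop_multi = pd.Series(populations, index=pd.MultiIndex.from_tuples(tuples))
by_level = pop_multi.xs(2010, level=1)

print(by_level)
```

Both selections hold the same values; the cross-section result is indexed by state
name alone.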

##### MultiIndex as extra dimension

You might notice something else here: we could easily have stored the same data
using a simple DataFrame with index and column labels. In fact, Pandas is built with
this equivalence in mind. The unstack() method will quickly convert a multiply indexed
Series into a conventionally indexed DataFrame:

In [10]:
pop_df = pop.unstack()
pop_df

                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


Naturally, the stack() method provides the opposite operation:

In [13]:
pop_df.stack()

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

Seeing this, you might wonder why we would bother with hierarchical indexing at all.
The reason is simple: just as we were able to use multi-indexing to represent
two-dimensional data within a one-dimensional Series, we can also use it to represent
data of three or more dimensions in a Series or DataFrame. Each extra level in a
multi-index represents an extra dimension of data; taking advantage of this property
gives us much more flexibility in the types of data we can represent. Concretely, we
might want to add another column of demographic data for each state at each year
(say, population under 18); with a MultiIndex this is as easy as adding another
column to the DataFrame:


In [14]:
pop_df = pd.DataFrame({'total': pop,
                        'under18': [9267089, 9284094,
                                    4687374, 4318033,
                                    5906301, 6879014]})
pop_df

                    total  under18
California 2000  33871648  9267089
           2010  37253956  9284094
New York   2000  18976457  4687374
           2010  19378102  4318033
Texas      2000  20851820  5906301
           2010  25145561  6879014

In addition, arithmetic and universal functions work with hierarchical indices as
well. Here we compute the fraction of people under 18 by year, given the above data:


In [17]:
f_u18 = pop_df['under18'] / pop_df['total']

In [18]:
f_u18.unstack()

                2000      2010
California  0.273594  0.249211
New York    0.247010  0.222831
Texas       0.283251  0.273568


##### Methods of MultiIndex Creation

The most straightforward way to construct a multiply indexed Series or DataFrame
is to simply pass a list of two or more index arrays to the constructor. For example:

In [19]:
df = pd.DataFrame(np.random.rand(4, 2),
                          index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                          columns=['data1', 'data2'])
df

        data1     data2
a 1  0.292547  0.076809
  2  0.303776  0.388062
b 1  0.666284  0.117722
  2  0.154699  0.527541


The work of creating the MultiIndex is done in the background.
Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas will
automatically recognize this and use a MultiIndex by default:

In [21]:
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)

California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64

Nevertheless, it is sometimes useful to explicitly create a MultiIndex; we’ll see a
couple of these methods here.

##### Explicit MultiIndex constructors

For more flexibility in how the index is constructed, you can instead use the class
method constructors available on pd.MultiIndex. For example, as we did before,
you can construct the MultiIndex from a simple list of arrays, giving the index values
within each level:

In [22]:
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

You can construct it from a list of tuples, giving the multiple index values of each
point:

In [23]:
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

You can even construct it from a Cartesian product of single indices:

In [24]:
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

Similarly, you can construct the MultiIndex directly using its internal encoding by
passing levels (a list of lists containing available index values for each level) and
codes (a list of lists of integer positions that reference these levels):

In [29]:
pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
              codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

You can pass any of these objects as the index argument when creating a Series or
DataFrame, or to the reindex method of an existing Series or DataFrame.
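
A minimal sketch of both uses follows. The Series values and the reversed row order
are made up for illustration.

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# Pass the MultiIndex directly as the index of a new Series.
s = pd.Series(np.arange(4), index=mi)

# Reindex the Series onto a MultiIndex with a different row order;
# values are matched by their (letter, number) labels, not by position.
reordered = pd.MultiIndex.from_tuples([('b', 2), ('b', 1), ('a', 2), ('a', 1)])
s_rev = s.reindex(reordered)

print(s_rev)
```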

##### MultiIndex level names

Sometimes it is convenient to name the levels of the MultiIndex. You can accomplish
this by passing the names argument to any of the above MultiIndex constructors, or
by setting the names attribute of the index after the fact:

In [31]:
pop.index.names = ['state', 'year']
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

With more involved datasets, this can be a useful way to keep track of the meaning of
various index values.
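
The other route, passing names to a constructor, might look like the sketch below. It
uses a two-state subset of the population data, and shows that level names can then
stand in for integer level positions.

```python
import pandas as pd

# Name the levels at construction time.
named = pd.MultiIndex.from_product([['California', 'Texas'], [2000, 2010]],
                                   names=['state', 'year'])
s = pd.Series([33871648, 37253956, 20851820, 25145561], index=named)

# Level names can be used wherever a level position is expected.
years = s.index.get_level_values('year')
texas = s.xs('Texas', level='state')

print(list(named.names))
print(texas)
```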

##### MultiIndex for columns

In a DataFrame, the rows and columns are completely symmetric, and just as the rows
can have multiple levels of indices, the columns can have multiple levels as well.
Consider the following, which is a mock-up of some (somewhat realistic) medical data:

In [34]:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37
# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      57.0  36.3  20.0  37.0  39.0  37.8
     2      30.0  36.9  40.0  36.4  28.0  37.3
2014 1      30.0  36.9  33.0  36.0  23.0  35.1
     2      31.0  36.9  35.0  36.2  35.0  37.4


Here we see where the multi-indexing for both rows and columns can come in very
handy. This is fundamentally four-dimensional data, where the dimensions are the
subject, the measurement type, the year, and the visit number. With this in place we
can, for example, index the top-level column by the person’s name and get a full
DataFrame containing just that person’s information:

In [35]:
health_data['Guido']

type          HR  Temp
year visit
2013 1      20.0  37.0
     2      40.0  36.4
2014 1      33.0  36.0
     2      35.0  36.2


For complicated records containing multiple labeled measurements across multiple
times for many subjects (people, countries, cities, etc.), use of hierarchical rows and
columns can be extremely convenient!

##### Indexing and Slicing a MultiIndex

Indexing and slicing on a MultiIndex is designed to be intuitive, and it helps if you
think about the indices as added dimensions. We’ll first look at indexing multiply
indexed Series, and then multiply indexed DataFrames.

##### Multiply indexed Series

Consider the multiply indexed Series of state populations we saw earlier:

In [36]:
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

We can access single elements by indexing with multiple terms:

In [37]:
pop['California', 2000]

33871648

The MultiIndex also supports partial indexing, or indexing just one of the levels in
the index. The result is another Series, with the lower-level indices maintained:

In [38]:
pop['California']

year
2000    33871648
2010    37253956
dtype: int64

With sorted indices, we can perform partial indexing on lower levels by passing an
empty slice in the first index:

In [39]:
pop[:, 2000]

state
California    33871648
New York      18976457
Texas         20851820
dtype: int64
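
Other selection styles carry over to multiply indexed Series too. The sketch below
rebuilds the population Series and tries a partial slice, a Boolean mask, and a list
of first-level labels; the threshold of 22 million is an arbitrary value picked for
the example.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# Partial slice on the first level (this relies on the index being sorted).
ca_to_ny = pop.loc['California':'New York']

# Boolean mask selection keeps the full two-level index.
large = pop[pop > 22000000]

# A list of first-level labels selects every row under those labels.
ca_tx = pop.loc[['California', 'Texas']]

print(large)
```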

##### Multiply indexed DataFrames

A multiply indexed DataFrame behaves in a similar manner. Consider our toy medical
DataFrame from before:

In [40]:
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      57.0  36.3  20.0  37.0  39.0  37.8
     2      30.0  36.9  40.0  36.4  28.0  37.3
2014 1      30.0  36.9  33.0  36.0  23.0  35.1
     2      31.0  36.9  35.0  36.2  35.0  37.4


Remember that columns are primary in a DataFrame, and the syntax used for multiply
indexed Series applies to the columns. For example, we can recover Guido’s heart
rate data with a simple operation:

In [41]:
health_data['Guido', 'HR']

year  visit
2013  1        20.0
      2        40.0
2014  1        33.0
      2        35.0
Name: (Guido, HR), dtype: float64
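
The loc indexer accepts tuples of labels as well, and pd.IndexSlice helps when slices
are needed within those tuples. The sketch below is one way to do this; it rebuilds
the DataFrame with the fixed values printed above (rather than random data) so that
the result is reproducible.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# Fixed values copied from the table above.
values = np.array([[57.0, 36.3, 20.0, 37.0, 39.0, 37.8],
                   [30.0, 36.9, 40.0, 36.4, 28.0, 37.3],
                   [30.0, 36.9, 33.0, 36.0, 23.0, 35.1],
                   [31.0, 36.9, 35.0, 36.2, 35.0, 37.4]])
health_data = pd.DataFrame(values, index=index, columns=columns)

# A tuple addresses a single column of the two-level column index.
bob_hr = health_data.loc[:, ('Bob', 'HR')]

# IndexSlice builds slices across levels: first visits only, heart rate only.
idx = pd.IndexSlice
first_visit_hr = health_data.loc[idx[:, 1], idx[:, 'HR']]

print(first_visit_hr)
```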

##### Rearranging Multi-Indices

One of the keys to working with multiply indexed data is knowing how to effectively
transform the data. There are a number of operations that will preserve all the
information in the dataset, but rearrange it for the purposes of various computations. We
saw a brief example of this in the stack() and unstack() methods, but there are
many more ways to finely control the rearrangement of data between hierarchical
indices and columns, and we’ll explore them here.
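
As a first taste of that finer control, unstack() accepts a level argument naming
which index level should move into the columns. The following sketch, on a rebuilt
copy of the population Series, shows both choices.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# Move the 'state' level into the columns; rows are then indexed by year.
by_year = pop.unstack(level=0)

# Move the 'year' level into the columns; rows are then indexed by state.
by_state = pop.unstack(level=1)

print(by_year)
print(by_state)
```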

##### Sorted and unsorted indices

Earlier, we briefly mentioned a caveat, but we should emphasize it more here. Many of
the MultiIndex slicing operations will fail if the index is not sorted. Let’s take a look at
this here.

We’ll start by creating some simple multiply indexed data where the indices are not
lexicographically sorted:

In [42]:
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.rand(6), index=index)
data.index.names = ['char', 'int']
data

char  int
a     1      0.562736
      2      0.715643
c     1      0.635150
      2      0.640536
b     1      0.427456
      2      0.121056
dtype: float64
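
Trying a partial slice on this index is expected to fail. The sketch below wraps the
attempt in a try/except block; it catches KeyError on the assumption that the error
raised is a KeyError subclass (UnsortedIndexError in the Pandas versions we are aware
of), and then shows the same slice succeeding on a sorted copy.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.rand(6), index=index)

# A partial slice needs a lexicographically sorted index; here it should fail.
try:
    data.loc['a':'b']
    slice_failed = False
except KeyError as e:
    slice_failed = True
    print(type(e).__name__, e)

# On a sorted copy, the same slice returns the rows under 'a' and 'b'.
sorted_slice = data.sort_index().loc['a':'b']
print(sorted_slice)
```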

If we try to take a partial slice of this index (for example, data['a':'b']), the
result is an error. Although it is not entirely clear from the error message, this is
the result of the MultiIndex not being sorted. For various reasons, partial slices and
other similar operations require the levels in the MultiIndex to be in sorted (i.e.,
lexicographical) order. Pandas provides convenience routines to perform this type of
sorting; the simplest is the sort_index() method (the older sortlevel() method is no
longer available in recent Pandas versions), which we’ll use here:

In [43]:
data = data.sort_index()
data

char  int
a     1      0.562736
      2      0.715643
b     1      0.427456
      2      0.121056
c     1      0.635150
      2      0.640536
dtype: float64

##### Index setting and resetting

Another way to rearrange hierarchical data is to turn the index labels into columns;
this can be accomplished with the reset_index method. Calling this on the population
Series will result in a DataFrame with state and year columns holding the
information that was formerly in the index. For clarity, we can optionally specify the
name of the data for the column representation:

In [44]:
pop_flat = pop.reset_index(name='population')
pop_flat

        state  year  population
0  California  2000    33871648
1  California  2010    37253956
2    New York  2000    18976457
3    New York  2010    19378102
4       Texas  2000    20851820
5       Texas  2010    25145561
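
Going the other way, the set_index method of a DataFrame builds a MultiIndex from
existing columns. A minimal sketch, with the flat table rebuilt inline so the snippet
is self-contained:

```python
import pandas as pd

pop_flat = pd.DataFrame({
    'state': ['California', 'California', 'New York',
              'New York', 'Texas', 'Texas'],
    'year': [2000, 2010, 2000, 2010, 2000, 2010],
    'population': [33871648, 37253956, 18976457,
                   19378102, 20851820, 25145561]})

# Build a two-level MultiIndex from the 'state' and 'year' columns.
pop_indexed = pop_flat.set_index(['state', 'year'])

print(pop_indexed)
```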


##### Data Aggregations on Multi-Indices

We’ve previously seen that Pandas has built-in data aggregation methods, such as
mean(), sum(), and max(). For hierarchically indexed data, these can be combined with
groupby(), naming an index level, to control which subset of the data the aggregate
is computed on. For example, we can average the measurements over the two visits in
each year:

In [46]:
data_mean = health_data.groupby(by='year').mean()
data_mean

subject   Bob       Guido          Sue
type       HR  Temp    HR  Temp     HR   Temp
year
2013     43.5  36.6  30.0  36.7   33.5  37.55
2014     30.5  36.9  34.0  36.1   29.0  36.25
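
An equivalent and slightly more explicit spelling passes the level keyword to
groupby(), which makes clear that the grouping key is an index level and not a
column. The sketch below uses the fixed values from the table printed earlier, not
random data, so the averages can be checked by hand.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# Fixed values copied from the health_data table above.
values = np.array([[57.0, 36.3, 20.0, 37.0, 39.0, 37.8],
                   [30.0, 36.9, 40.0, 36.4, 28.0, 37.3],
                   [30.0, 36.9, 33.0, 36.0, 23.0, 35.1],
                   [31.0, 36.9, 35.0, 36.2, 35.0, 37.4]])
health_data = pd.DataFrame(values, index=index, columns=columns)

# Group rows by the 'year' index level and average over the visits.
yearly_mean = health_data.groupby(level='year').mean()

print(yearly_mean)
```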


By further making use of the axis keyword, we can take the mean among levels on
the columns as well:

In [50]:
health_data.groupby(by='type', axis=1).mean()

type               HR       Temp
year visit
2013 1      38.666667  37.033333
     2      32.666667  36.866667
2014 1      28.666667  36.000000
     2      33.666667  36.833333
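
To our understanding, newer Pandas releases deprecate the axis=1 form of groupby(),
so it is worth checking this against your installed version. One alternative that
avoids the keyword is to transpose, group on the column level, and transpose back.
The sketch below is written under that assumption and uses the fixed values from the
health_data table above.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# Fixed values copied from the health_data table above.
values = np.array([[57.0, 36.3, 20.0, 37.0, 39.0, 37.8],
                   [30.0, 36.9, 40.0, 36.4, 28.0, 37.3],
                   [30.0, 36.9, 33.0, 36.0, 23.0, 35.1],
                   [31.0, 36.9, 35.0, 36.2, 35.0, 37.4]])
health_data = pd.DataFrame(values, index=index, columns=columns)

# Transpose so the column levels become row levels, average across subjects
# within each measurement type, then transpose back.
type_mean = health_data.T.groupby(level='type').mean().T

print(type_mean)
```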
