# Hierarchical indexing

Often it is useful to go beyond one- and two-dimensional data and store higher-dimensional data, that is, data indexed by more than one or two keys.

A common pattern in practice is to make **use of hierarchical indexing** (also known as multi-indexing) to incorporate multiple index levels within a single index. In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional Series and two-dimensional DataFrame objects.

In [1]:
import pandas as pd
import numpy as np

## A Multiply Indexed Series

Let's start by considering how we might represent **two-dimensional data within a one-dimensional Series**. For concreteness, we will consider a series of data where each point has a character and numerical key.

### The bad way

Suppose you would like to track data about states from two different years. Using the Pandas tools we've already covered, you might be tempted to simply use Python tuples as keys:

In [2]:
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]

populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]

pop     = pd.Series(populations, index=index)
pop

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

With this indexing scheme, you can straightforwardly index or slice the series based on this multiple index:

In [3]:
pop[ ('California', 2010):('Texas', 2000)  ]

(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
dtype: int64

But the convenience ends there.

For example, if you need to select all values from 2000, you'll need to do some messy (and potentially slow) munging to make it happen:

In [4]:
sel = [i for i in pop.index if i[1]==2000 ]
print(sel)
pop[ sel ]

[('California', 2000), ('New York', 2000), ('Texas', 2000)]


(California, 2000)    33871648
(New York, 2000)      18976457
(Texas, 2000)         20851820
dtype: int64

## The Better Way: Pandas MultiIndex

Our tuple-based indexing is essentially a rudimentary multi-index, and the Pandas MultiIndex type gives us the type of operations we wish to have. We can create a multi-index from the tuples as follows:

In [5]:
print(index)
print()
ind = pd.MultiIndex.from_tuples(index)
ind



[('California', 2000), ('California', 2010), ('New York', 2000), ('New York', 2010), ('Texas', 2000), ('Texas', 2010)]



MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
           codes=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

In [6]:
index_ext = []
for u in index:
    ul = list(u)
    ul.append( np.random.randint(19) )
    index_ext.append( tuple(ul)  )

ind_ext = pd.MultiIndex.from_tuples( index_ext )
ind_ext

MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010], [3, 4, 8, 10, 13, 16]],
           codes=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1], [1, 0, 3, 2, 5, 4]])

If we re-index our series with this MultiIndex, we see the hierarchical representation of the data:

In [7]:
pop_ri = pop.reindex( ind )
pop_ri

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [8]:
pop_ext = pd.Series(populations, index_ext )
pop_ext

(California, 2000, 4)    33871648
(California, 2010, 3)    37253956
(New York, 2000, 10)     18976457
(New York, 2010, 8)      19378102
(Texas, 2000, 16)        20851820
(Texas, 2010, 13)        25145561
dtype: int64

In [9]:
pop_ext_ri = pop_ext.reindex( ind_ext )
pop_ext_ri

California  2000  4     33871648
            2010  3     37253956
New York    2000  10    18976457
            2010  8     19378102
Texas       2000  16    20851820
            2010  13    25145561
dtype: int64

Notice that some entries are missing in the first column:
in this multi-index representation,
any blank entry indicates the same value as the line above it.
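
With the hierarchical index in place, the selection that needed a list comprehension in the tuple-based version becomes a one-liner. The sketch below is only illustrative: it rebuilds the population data directly on a MultiIndex (the names `pop_mi` and `pop_2000` are made up for this example) and selects every entry for the year 2000 through `.loc`.

```python
import pandas as pd

# Same data as above, but indexed by a MultiIndex from the start
index = pd.MultiIndex.from_tuples([
    ('California', 2000), ('California', 2010),
    ('New York', 2000), ('New York', 2010),
    ('Texas', 2000), ('Texas', 2010),
])
pop_mi = pd.Series([33871648, 37253956,
                    18976457, 19378102,
                    20851820, 25145561], index=index)

# An empty slice on the first level plus a key on the second level
# selects all states for one year; the year level is dropped from the result
pop_2000 = pop_mi.loc[:, 2000]
print(pop_2000)
```

The result is a singly indexed Series keyed by state, obtained without looping over the index in Python.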

## MultiIndex as extra dimension

You might notice something else here: we could easily have stored the same data using a simple DataFrame with index and column labels. In fact, Pandas is built with this equivalence in mind. 

The unstack() method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame:

In [10]:
pop_ri

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [11]:
# unstack() moves the innermost index level to the columns;
# the remaining levels stay in the index
df_pop_ri = pop_ri.unstack()
df_pop_ri

                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


In [12]:
pop_ext_ri

California  2000  4     33871648
            2010  3     37253956
New York    2000  10    18976457
            2010  8     19378102
Texas       2000  16    20851820
            2010  13    25145561
dtype: int64

In [13]:
# most cells are NaN because each third-level key occurs
# for only one (state, year) pair

pop_ext_ri.unstack()

                          3           4           8          10          13          16
California 2000         NaN  33871648.0         NaN         NaN         NaN         NaN
           2010  37253956.0         NaN         NaN         NaN         NaN         NaN
New York   2000         NaN         NaN         NaN  18976457.0         NaN         NaN
           2010         NaN         NaN  19378102.0         NaN         NaN         NaN
Texas      2000         NaN         NaN         NaN         NaN         NaN  20851820.0
           2010         NaN         NaN         NaN         NaN  25145561.0         NaN
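
The NaN entries also force the integer populations to be shown as floats. When a default value makes sense for the missing combinations, `unstack()` accepts a `fill_value` argument. The following is a minimal sketch, assuming that 0 is an acceptable placeholder (for real population data it may well not be); the third-level keys are copied from the output above, where they were drawn at random, and the name `pop3` is hypothetical.

```python
import pandas as pd

populations = [33871648, 37253956, 18976457, 19378102, 20851820, 25145561]

# Third-level keys copied from the output above
index = pd.MultiIndex.from_tuples([
    ('California', 2000, 4), ('California', 2010, 3),
    ('New York', 2000, 10), ('New York', 2010, 8),
    ('Texas', 2000, 16), ('Texas', 2010, 13),
])
pop3 = pd.Series(populations, index=index)

wide = pop3.unstack()                 # missing combinations become NaN
filled = pop3.unstack(fill_value=0)   # missing combinations become 0 instead

print(int(wide.isna().sum().sum()))   # number of empty cells in the NaN version
print(filled)
```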


In [14]:
# the columns now are multi-index based

pop_ext_ri.unstack().unstack()

                     3                       4                       8                      10                      13                      16
                  2000        2010        2000        2010        2000        2010        2000        2010        2000        2010        2000        2010
California         NaN  37253956.0  33871648.0         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
New York           NaN         NaN         NaN         NaN         NaN  19378102.0  18976457.0         NaN         NaN         NaN         NaN         NaN
Texas              NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN  25145561.0  20851820.0         NaN


In [15]:
# unstacking a third time moves the last remaining index level to the columns;
# with no row index left, the result is a Series indexed by (third key, year, state),
# i.e. the original data with the level order reversed,
# plus NaN rows for every combination that had no value

pop_ext_ri.unstack().unstack().unstack()

3   2000  California           NaN
          New York             NaN
          Texas                NaN
    2010  California    37253956.0
          New York             NaN
          Texas                NaN
4   2000  California    33871648.0
          New York             NaN
          Texas                NaN
    2010  California           NaN
          New York             NaN
          Texas                NaN
8   2000  California           NaN
          New York             NaN
          Texas                NaN
    2010  California           NaN
          New York      19378102.0
          Texas                NaN
10  2000  California           NaN
          New York      18976457.0
          Texas                NaN
    2010  California           NaN
          New York             NaN
          Texas                NaN
13  2000  California           NaN
          New York             NaN
          Texas                NaN
    2010  California           NaN
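          New York             NaN
          Texas         25145561.0
16  2000  California           NaN
          New York             NaN
          Texas         20851820.0
    2010  California           NaN
          New York             NaN
          Texas                NaN
dtype: float64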

Naturally, the stack() method provides the opposite operation:

In [16]:
df_pop_ri

                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


In [17]:
# stack() turns the DataFrame back into a Series;
# the column labels become the second (innermost)
# level of the MultiIndex
df_pop_ri.stack()

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
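
A quick way to convince ourselves that the two methods are inverses is a round trip. The sketch below is illustrative only (`pop_mi`, `wide` and `roundtrip` are names chosen for this example) and relies on the table having no missing cells; with gaps, the round trip would not necessarily reproduce the original rows.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'New York', 'Texas'],
                                    [2000, 2010]])
pop_mi = pd.Series([33871648, 37253956,
                    18976457, 19378102,
                    20851820, 25145561], index=index)

wide = pop_mi.unstack()      # years move to the columns
roundtrip = wide.stack()     # and back to the innermost index level

# Same values and same (state, year) labels as the original Series
print(roundtrip.tolist() == pop_mi.tolist())
print(list(roundtrip.index) == list(pop_mi.index))
```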

In [18]:
pop_df = pd.DataFrame({
    'total': pop_ri,                # this Series brings its MultiIndex to the DataFrame
    'under18': [9267089, 9284094,   # a plain list, added as an ordinary column
                4687374, 4318033,
                5906301, 6879014],
})
pop_df

                    total  under18
California 2000  33871648  9267089
           2010  37253956  9284094
New York   2000  18976457  4687374
           2010  19378102  4318033
Texas      2000  20851820  5906301
           2010  25145561  6879014


In addition, all the ufuncs and other functionality discussed in Operating on Data in Pandas work with hierarchical indices as well. 

Here we compute the fraction of people under 18 by year, given the above data:

In [19]:
# dividing two columns yields a Series that carries the same MultiIndex
f_u18 = pop_df['under18'] / pop_df['total']
f_u18

California  2000    0.273594
            2010    0.249211
New York    2000    0.247010
            2010    0.222831
Texas       2000    0.283251
            2010    0.273568
dtype: float64

In [20]:
f_u18.unstack()

                2000      2010
California  0.273594  0.249211
New York    0.247010  0.222831
Texas       0.283251  0.273568


In [21]:
# the result aligns on the MultiIndex, so it can be stored as a new column
pop_df['rel_under18'] = pop_df['under18'] / pop_df['total']
pop_df

                    total  under18  rel_under18
California 2000  33871648  9267089     0.273594
           2010  37253956  9284094     0.249211
New York   2000  18976457  4687374     0.247010
           2010  19378102  4318033     0.222831
Texas      2000  20851820  5906301     0.283251
           2010  25145561  6879014     0.273568


## Methods of MultiIndex Creation

The most straightforward way to construct a multiply indexed Series or DataFrame is to simply pass a list of two or more index arrays to the constructor. For example:

In [22]:
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df

        data1     data2
a 1  0.371249  0.664439
  2  0.857435  0.472166
b 1  0.063398  0.352680
  2  0.387557  0.631367


Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a MultiIndex by default:

In [23]:
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)

California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64

### Explicit MultiIndex constructors

For more flexibility in how the index is constructed, you can instead use the class method constructors available on pd.MultiIndex.

For example, as we did before, you can construct the MultiIndex from a simple **list of arrays giving the index values within each level**:

In [24]:
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

MultiIndex(levels=[['a', 'b'], [1, 2]],
           codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

You can construct it from a list of tuples giving the **multiple index values** of each point:

In [25]:
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

MultiIndex(levels=[['a', 'b'], [1, 2]],
           codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

You can even construct it from a **Cartesian product of single indices**:

In [26]:
pd.MultiIndex.from_product([['a', 'b'], [1, 2]], names=['pippo','pluto'])

MultiIndex(levels=[['a', 'b'], [1, 2]],
           codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
           names=['pippo', 'pluto'])
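
The three constructors are different spellings of the same index. As a small sanity check (the variable names are chosen for this sketch only), we can build the index each way and compare the results with `equals()`:

```python
import pandas as pd

# One array per level
from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

# One tuple per row
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

# Cartesian product of the per-level values
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

print(from_arrays.equals(from_tuples))
print(from_tuples.equals(from_product))
```

`from_product` is the most compact choice when every combination of labels occurs; `from_tuples` and `from_arrays` also cover indices with missing combinations.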

In [27]:
pop_ri.index.names = ['state', 'year']
pop_ri

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
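
Level names are more than decoration: wherever pandas expects a level, the name can usually be given instead of the integer position. A minimal sketch, rebuilding the named population Series under the hypothetical name `pop_named`:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop_named = pd.Series([33871648, 37253956,
                       18976457, 19378102,
                       20851820, 25145561], index=index)

# A level name can stand in for its integer position
by_name = pop_named.unstack(level='year')
by_position = pop_named.unstack(level=1)
print(by_name.equals(by_position))

# get_level_values returns the labels of a single level, one per row
states = pop_named.index.get_level_values('state')
print(list(states.unique()))
```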

## MultiIndex for columns
In a DataFrame, the rows and columns are completely symmetric, and just as the rows can have multiple levels of indices, the columns can have multiple levels as well. Consider the following, which is a mock-up of some (somewhat realistic) medical data:

In [28]:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10   # scale columns 0, 2 and 4 (the HR columns) to mimic heart-rate data
data += 37           # re-center

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

subject      Bob        Guido         Sue
type          HR  Temp     HR  Temp    HR  Temp
year visit
2013 1      27.0  36.9   33.0  38.4  28.0  37.6
     2      34.0  37.9   35.0  36.8  44.0  37.3
2014 1      32.0  34.8   30.0  38.5  42.0  36.5
     2      36.0  35.1   48.0  36.7  46.0  38.9


In [29]:
health_data['Guido']

type          HR  Temp
year visit
2013 1      33.0  38.4
     2      35.0  36.8
2014 1      30.0  38.5
     2      48.0  36.7


In [30]:
health_data.iloc[3,5]

38.9

In [31]:
health_data.loc[2013]

subject   Bob        Guido         Sue
type       HR  Temp     HR  Temp    HR  Temp
visit
1        27.0  36.9   33.0  38.4  28.0  37.6
2        34.0  37.9   35.0  36.8  44.0  37.3


In [32]:
print( type(health_data.loc[2013,1]) )
health_data.loc[2013,1]

<class 'pandas.core.series.Series'>


subject  type
Bob      HR      27.0
         Temp    36.9
Guido    HR      33.0
         Temp    38.4
Sue      HR      28.0
         Temp    37.6
Name: (2013, 1), dtype: float64

## Indexing and Slicing a MultiIndex

Indexing and slicing on a MultiIndex is designed to be intuitive, and it helps if you think about the indices as added dimensions. We'll first look at indexing multiply indexed Series, and then multiply indexed DataFrames.

In [33]:
pop_ri

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [34]:
pop_ri['California',2000]

33871648

The MultiIndex also supports **partial indexing, or indexing just one of the levels** in the index. The result is another Series, with the lower-level indices maintained:

In [35]:
pop_ri['California']  # this sticks to the 1st level key

year
2000    33871648
2010    37253956
dtype: int64

**Partial slicing** is available as well, as long as the MultiIndex is sorted (see the discussion in Sorted and Unsorted Indices):

In [36]:
pop_ri['California':'New York']  # this sticks to the 1st level key

# pop_ri['California':'New York',2000]  # ERR

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
dtype: int64

With sorted indices, partial indexing can be performed on lower levels by passing an empty slice in the first index:

In [37]:
pop_ri[:,2000]   # this goes to 2nd level key
# pop_ri[:,2000:2010] # ERR

state
California    33871648
New York      18976457
Texas         20851820
dtype: int64

Other types of indexing and selection (discussed in Data Indexing and Selection) work as well; for example, selection based on **Boolean masks**:

In [38]:
pop_ri[pop_ri > 22000000]

state       year
California  2000    33871648
            2010    37253956
Texas       2010    25145561
dtype: int64

In [39]:
pop_ri[['California', 'Texas']]  # this sticks to the 1st level key
# pop_ri[['California', 'Texas'],2000]   # does not work

# Each level can be selected on its own with []: the first level directly,
# the second via an empty slice on the first (pop_ri[:, 2000]).
# Plain [] does not accept a list or slice on one level combined with
# a key on the other.

state       year
California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
dtype: int64
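
Where plain `[]` refuses a list on one level combined with a key on the other, the label-based `.loc` indexer accepts one selector per level. The sketch below assumes a sorted MultiIndex, as in our example; `pop_named` and `subset` are names introduced here for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop_named = pd.Series([33871648, 37253956,
                       18976457, 19378102,
                       20851820, 25145561], index=index)

# A list of labels on the first level combined with a single key on the second
subset = pop_named.loc[['California', 'Texas'], 2000]
print(subset)
```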

In [40]:
pop_ri[:,2000]

state
California    33871648
New York      18976457
Texas         20851820
dtype: int64

## Multiply indexed DataFrames
A multiply indexed DataFrame behaves in a similar manner. Consider our toy medical DataFrame from before:

In [41]:
health_data

subject      Bob        Guido         Sue
type          HR  Temp     HR  Temp    HR  Temp
year visit
2013 1      27.0  36.9   33.0  38.4  28.0  37.6
     2      34.0  37.9   35.0  36.8  44.0  37.3
2014 1      32.0  34.8   30.0  38.5  42.0  36.5
     2      36.0  35.1   48.0  36.7  46.0  38.9


Remember that **columns are primary in a DataFrame**, and the syntax used for multiply indexed Series applies to the columns. For example, we can recover Guido's heart rate data with a simple operation:

In [42]:
health_data['Guido', 'HR']

year  visit
2013  1        33.0
      2        35.0
2014  1        30.0
      2        48.0
Name: (Guido, HR), dtype: float64

Also, as with the single-index case, we can use the loc and iloc indexers introduced in Data Indexing and Selection (the older ix indexer was removed in pandas 1.0). For example:

In [43]:
health_data.iloc[:2, :2]

subject      Bob
type          HR  Temp
year visit
2013 1      27.0  36.9
     2      34.0  37.9


These indexers provide an array-like view of the underlying two-dimensional data, but each individual index in loc or iloc can be passed a tuple of multiple indices. For example:

In [44]:
health_data.loc[:, ('Bob', 'HR')]

year  visit
2013  1        27.0
      2        34.0
2014  1        32.0
      2        36.0
Name: (Bob, HR), dtype: float64

In [45]:
health_data.loc[(2013,2),:]

subject  type
Bob      HR      34.0
         Temp    37.9
Guido    HR      35.0
         Temp    36.8
Sue      HR      44.0
         Temp    37.3
Name: (2013, 2), dtype: float64
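
Tuples work well for single labels, but writing a slice inside a tuple, as in `(:, 1)`, is not valid Python. For that case pandas provides the `pd.IndexSlice` helper. A minimal sketch, with the values copied from the `health_data` output shown earlier (the names `health`, `idx` and `first_visit_hr` are only illustrative):

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# Values copied from the health_data output shown earlier
values = np.array([[27.0, 36.9, 33.0, 38.4, 28.0, 37.6],
                   [34.0, 37.9, 35.0, 36.8, 44.0, 37.3],
                   [32.0, 34.8, 30.0, 38.5, 42.0, 36.5],
                   [36.0, 35.1, 48.0, 36.7, 46.0, 38.9]])
health = pd.DataFrame(values, index=index, columns=columns)

idx = pd.IndexSlice
# Rows: every year, first visit only; columns: every subject, heart rate only
first_visit_hr = health.loc[idx[:, 1], idx[:, 'HR']]
print(first_visit_hr)
```

This kind of slicing expects both the row and the column MultiIndex to be sorted, which is the case here because `from_product` was given sorted inputs.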

## Rearranging Multi-Indices

One of the keys to working with multiply indexed data is knowing how to effectively transform the data. There are a number of operations that will preserve all the information in the dataset, but rearrange it for the purposes of various computations. 

We saw a brief example of this in the `stack()` and `unstack()` methods, but there are many more ways to finely control the rearrangement of data between hierarchical indices and columns, and we'll explore them here.

We'll start by creating some simple multiply indexed data where the indices are not lexicographically sorted:

In [46]:
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.rand(6), index=index)
data.index.names = ['char', 'int']
data

char  int
a     1      0.522700
      2      0.397967
c     1      0.989260
      2      0.363968
b     1      0.744447
      2      0.973702
dtype: float64

In [47]:
try:
    data['a':'b']
except KeyError as e:
    print(type(e))
    print(e)

<class 'pandas.errors.UnsortedIndexError'>
'Key length (1) was greater than MultiIndex lexsort depth (0)'


This is the result of the **`MultiIndex` not being sorted.**

For various reasons, partial slices and other similar operations require the levels in the MultiIndex to be in sorted (i.e., lexicographical) order. Pandas provides convenience routines to perform this type of sorting, chiefly the sort_index() method of Series and DataFrame (the sortlevel() method mentioned in older material is no longer available on DataFrame in recent pandas versions). We'll use sort_index() here:

In [48]:
data = data.sort_index()
data

char  int
a     1      0.522700
      2      0.397967
b     1      0.744447
      2      0.973702
c     1      0.989260
      2      0.363968
dtype: float64

With the index sorted in this way, partial slicing will work as expected:

In [49]:
data['a':'b']

char  int
a     1      0.522700
      2      0.397967
b     1      0.744447
      2      0.973702
dtype: float64
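
Rather than waiting for the error, we can ask the index whether it is sorted before slicing. The sketch below uses the `is_monotonic_increasing` property as that check; the data are deterministic stand-ins (`np.arange(6)`) for the random values used above, and the names `unsorted` and `ordered` are hypothetical.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]],
                                   names=['char', 'int'])
unsorted = pd.Series(np.arange(6), index=index)

# The labels run a, c, b, so the index is not in sorted order
print(unsorted.index.is_monotonic_increasing)

ordered = unsorted.sort_index()
print(ordered.index.is_monotonic_increasing)

# Partial slicing on the first level works once the index is sorted
print(ordered.loc['a':'b'])
```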

## Stacking and unstacking indices

That is, turning index levels into columns, and vice versa.

As we saw briefly before, it is possible to convert a dataset from a _stacked multi-index_ to a _simple two-dimensional representation_, optionally specifying the level to use:

In [50]:
pop_ri

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [51]:
# by default the innermost (last) index level is turned into columns;
# this is the same as pop_ri.unstack(level=1)

pop_ri.unstack()

year            2000      2010
state
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


In [52]:
# here level 0 (the outermost level, 'state')
# is turned into columns instead

pop_ri.unstack( level=0 )

state  California  New York     Texas
year
2000     33871648  18976457  20851820
2010     37253956  19378102  25145561


In [53]:
pop_ri.unstack( level=0 ).stack()

year  state     
2000  California    33871648
      New York      18976457
      Texas         20851820
2010  California    37253956
      New York      19378102
      Texas         25145561
dtype: int64
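
Note that this round trip did not return the original Series: the levels now come in the order (year, state). One way to get back to (state, year), sketched here with the `swaplevel()` method followed by `sort_index()` (the names `pop_named`, `swapped` and `restored` are chosen for this example):

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop_named = pd.Series([33871648, 37253956,
                       18976457, 19378102,
                       20851820, 25145561], index=index)

# Unstacking the outer level and stacking again leaves 'year' outermost
swapped = pop_named.unstack(level=0).stack()
print(list(swapped.index.names))

# swaplevel() exchanges the two levels; sort_index() restores the row order
restored = swapped.swaplevel().sort_index()
print(restored.tolist() == pop_named.tolist())
```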

## Index setting and resetting

Another way to rearrange hierarchical data is to **turn the index labels into columns**; this can be accomplished with the reset_index method. 

Calling this on the population Series will result in a DataFrame with state and year columns holding the information that was formerly in the index. For clarity, we can optionally specify the name of the data for the column representation:

In [56]:
pop_ri

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [57]:
pop_flat = pop_ri.reset_index(name='population')
pop_flat

        state  year  population
0  California  2000    33871648
1  California  2010    37253956
2    New York  2000    18976457
3    New York  2010    19378102
4       Texas  2000    20851820
5       Texas  2010    25145561


Often when working with data in the real world, the raw input data looks like this and it's **useful to build a MultiIndex from the column values**.

This is what Pedro did in his notebook magic.

This can be done with the set_index method of the DataFrame, which returns a multiply indexed DataFrame:

In [58]:
pop_flat.set_index(['state','year', ])

                 population
state      year
California 2000    33871648
           2010    37253956
New York   2000    18976457
           2010    19378102
Texas      2000    20851820
           2010    25145561


In [59]:
pop_flat.set_index(['year', 'state', ])
# The years are not grouped in the output because set_index keeps the
# original row order; the index must be sorted to group them (next cell).

                 population
year state
2000 California    33871648
2010 California    37253956
2000 New York      18976457
2010 New York      19378102
2000 Texas         20851820
2010 Texas         25145561


In [60]:
pop_flat.set_index(['year', 'state', ]).sort_index()

                 population
year state
2000 California    33871648
     New York      18976457
     Texas         20851820
2010 California    37253956
     New York      19378102
     Texas         25145561
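
`reset_index` and `set_index` undo each other, which makes it easy to move between the flat and the hierarchical layout. A short sketch of the round trip, assuming the same population data as above (`pop_named`, `flat` and `rebuilt` are illustrative names):

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop_named = pd.Series([33871648, 37253956,
                       18976457, 19378102,
                       20851820, 25145561], index=index)

# Index levels become ordinary columns
flat = pop_named.reset_index(name='population')
print(list(flat.columns))

# Columns become index levels again; selecting the single data column
# gives back a Series
rebuilt = flat.set_index(['state', 'year'])['population']
print(rebuilt.index.equals(pop_named.index))
```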


## Data Aggregations on Multi-Indices

We've previously seen that Pandas has built-in data aggregation methods, such as `mean()`, `sum()`, and `max()`. For hierarchically indexed data, these **can be passed a `level` parameter** (in pandas versions before 2.0) that controls which subset of the data the aggregate is computed on, i.e. which index level is kept to index the results of the aggregation.

In [61]:
health_data

subject      Bob        Guido         Sue
type          HR  Temp     HR  Temp    HR  Temp
year visit
2013 1      27.0  36.9   33.0  38.4  28.0  37.6
     2      34.0  37.9   35.0  36.8  44.0  37.3
2014 1      32.0  34.8   30.0  38.5  42.0  36.5
     2      36.0  35.1   48.0  36.7  46.0  38.9


Perhaps we'd like to **average-out the measurements in the two visits each year**. We can do this by naming the index level we'd like to explore, in this case the year:

In [73]:
health_data.mean( level='year' )

subject   Bob         Guido         Sue
type       HR   Temp     HR  Temp    HR   Temp
year
2013     30.5  37.40   34.0  37.6  36.0  37.45
2014     34.0  34.95   39.0  37.6  44.0  37.70


In [71]:
health_data.mean( axis=1, level='type' )

type               HR       Temp
year visit
2013 1      29.333333  37.633333
     2      37.666667  37.333333
2014 1      34.666667  36.600000
     2      43.333333  36.900000


In [72]:
health_data.mean( axis=1, level='type' ).mean( level='year' )

type    HR       Temp
year
2013  33.5  37.483333
2014  39.0  36.750000
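
To the best of my knowledge, the `level` argument of `mean()` and similar methods was deprecated in pandas 1.3 and no longer exists in pandas 2.0 and later. The same aggregations can be written with `groupby(level=...)`, which also works in older versions. The sketch below is one possible spelling, not the only one; the values are copied from the `health_data` output above, and the names `health`, `by_year` and `by_type` are introduced only for this example.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# Values copied from the health_data output shown above
values = np.array([[27.0, 36.9, 33.0, 38.4, 28.0, 37.6],
                   [34.0, 37.9, 35.0, 36.8, 44.0, 37.3],
                   [32.0, 34.8, 30.0, 38.5, 42.0, 36.5],
                   [36.0, 35.1, 48.0, 36.7, 46.0, 38.9]])
health = pd.DataFrame(values, index=index, columns=columns)

# Row-wise aggregation: average the two visits within each year
by_year = health.groupby(level='year').mean()

# Column-wise aggregation: transpose, group the former columns by 'type',
# average, then transpose back
by_type = health.T.groupby(level='type').mean().T

print(by_year)
print(by_type)
```

Transposing before grouping avoids passing `axis=1` to `groupby`, which newer pandas versions discourage.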
