# Hierarchical Indexing

Hierarchical indexing is an important feature of pandas that allows you to have multiple index levels on one axis. This lets you work with higher-dimensional data in a lower-dimensional form.

Let's start with a simple example and create a Series whose index is built from a list of lists:

In [1]:
import pandas as pd
import numpy as np

In [2]:
hits = pd.Series([83080,20336,11376,1228,468],
                 index=[['Jupyter Tutorial',
                         'Jupyter Tutorial',
                         'PyViz Tutorial',
                         'Python Basics',
                         'Python Basics'],
                        ['de', 'en', 'de', 'de', 'en']])

hits

Jupyter Tutorial  de    83080
                  en    20336
PyViz Tutorial    de    11376
Python Basics     de     1228
                  en      468
dtype: int64

What you see is a pretty-printed view of a Series with a [pandas.MultiIndex](https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.html). The *gaps* in the index display mean that the label directly above applies.

In [3]:
hits.index

MultiIndex([('Jupyter Tutorial', 'de'),
            ('Jupyter Tutorial', 'en'),
            (  'PyViz Tutorial', 'de'),
            (   'Python Basics', 'de'),
            (   'Python Basics', 'en')],
           )
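
The structure of such an index can also be inspected programmatically. The following is a minimal sketch that rebuilds the `hits` Series from above; the variable names `n_levels`, `outer_labels` and `inner_values` are only illustrative:

```python
import pandas as pd

# Same Series as above, rebuilt so that the example is self-contained
hits = pd.Series(
    [83080, 20336, 11376, 1228, 468],
    index=[
        ["Jupyter Tutorial", "Jupyter Tutorial", "PyViz Tutorial",
         "Python Basics", "Python Basics"],
        ["de", "en", "de", "de", "en"],
    ],
)

n_levels = hits.index.nlevels                         # number of index levels
outer_labels = list(hits.index.levels[0])             # unique labels of the outer level
inner_values = list(hits.index.get_level_values(1))   # inner label of every row
```

`levels` holds each label only once, whereas `get_level_values` returns one label per row.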

With a hierarchically indexed object, so-called partial indexing is possible, with which you can select subsets of the data:

In [4]:
hits['Jupyter Tutorial']

de    83080
en    20336
dtype: int64

In [5]:
hits['Jupyter Tutorial':'Python Basics']

Jupyter Tutorial  de    83080
                  en    20336
PyViz Tutorial    de    11376
Python Basics     de     1228
                  en      468
dtype: int64

In [6]:
hits.loc[['Jupyter Tutorial', 'Python Basics']]

Jupyter Tutorial  de    83080
                  en    20336
Python Basics     de     1228
                  en      468
dtype: int64
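
If labels for all levels are given as a tuple, a single value is selected instead of a subset. A short sketch, again rebuilding `hits` so that it runs on its own (the name `value` is only illustrative):

```python
import pandas as pd

hits = pd.Series(
    [83080, 20336, 11376, 1228, 468],
    index=[
        ["Jupyter Tutorial", "Jupyter Tutorial", "PyViz Tutorial",
         "Python Basics", "Python Basics"],
        ["de", "en", "de", "de", "en"],
    ],
)

# One label per level, passed as a tuple, addresses exactly one entry
value = hits.loc[("Jupyter Tutorial", "en")]
```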

Selection is even possible from an *inner* level. The following selects all values with the label `de` from the second index level:

In [7]:
hits.loc[:, 'de']

Jupyter Tutorial    83080
PyViz Tutorial      11376
Python Basics        1228
dtype: int64
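
The same inner-level selection can presumably also be written with `Series.xs`, which names the level explicitly. This is a sketch of an alternative, not the approach used in this chapter; `de_hits` is an illustrative name:

```python
import pandas as pd

hits = pd.Series(
    [83080, 20336, 11376, 1228, 468],
    index=[
        ["Jupyter Tutorial", "Jupyter Tutorial", "PyViz Tutorial",
         "Python Basics", "Python Basics"],
        ["de", "en", "de", "de", "en"],
    ],
)

# Cross-section: keep the rows whose second level (position 1) is 'de';
# the selected level is dropped from the result
de_hits = hits.xs("de", level=1)
```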

Hierarchical indexing plays an important role in data reshaping and group-based operations such as forming a [pivot table](https://en.wikipedia.org/wiki/Pivot_table). For example, you can reorder this data into a DataFrame using the [pandas.Series.unstack](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unstack.html) method:

In [8]:
hits.unstack()

                       de       en
Jupyter Tutorial  83080.0  20336.0
PyViz Tutorial    11376.0      NaN
Python Basics      1228.0    468.0


The reverse operation of unstack is [stack](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html):

In [9]:
hits.unstack().stack()

Jupyter Tutorial  de    83080.0
                  en    20336.0
PyViz Tutorial    de    11376.0
Python Basics     de     1228.0
                  en      468.0
dtype: float64
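
The output above shows missing values where a combination of labels does not exist in the original Series. If a default is acceptable for such combinations, `unstack` accepts a `fill_value`. The following sketch assumes that `0` is a sensible default for the missing combination; `wide` is an illustrative name:

```python
import pandas as pd

hits = pd.Series(
    [83080, 20336, 11376, 1228, 468],
    index=[
        ["Jupyter Tutorial", "Jupyter Tutorial", "PyViz Tutorial",
         "Python Basics", "Python Basics"],
        ["de", "en", "de", "de", "en"],
    ],
)

# ('PyViz Tutorial', 'en') does not exist in hits; fill it with 0 instead of NaN
wide = hits.unstack(fill_value=0)
```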

`stack` and `unstack` are discussed in more detail in the chapter [Reshaping and Pivoting](reshaping-pivoting.ipynb).

In a DataFrame, each axis can have a hierarchical index:

In [10]:
version_hits = [[19651,0,30134,0,33295,0],
                [4722,1825,3497,2576,4009,3707],
                [2573,0,4873,0,3930,0],
                [525,0,427,0,276,0],
                [157,0,85,0,226,0]]

df = pd.DataFrame(version_hits,
                  index=[['Jupyter Tutorial',
                          'Jupyter Tutorial',
                          'PyViz Tutorial',
                          'Python Basics',
                          'Python Basics'],
                         ['de', 'en', 'de', 'de', 'en']],
                  columns=[['12/2021', '12/2021',
                            '01/2022', '01/2022', 
                            '02/2022', '02/2022'],
                           ['latest', 'stable',
                            'latest', 'stable',
                            'latest', 'stable']])

df

                     12/2021          01/2022          02/2022
                      latest  stable   latest  stable   latest  stable
Jupyter Tutorial de    19651       0    30134       0    33295       0
                 en     4722    1825     3497    2576     4009    3707
PyViz Tutorial   de     2573       0     4873       0     3930       0
Python Basics    de      525       0      427       0      276       0
                 en      157       0       85       0      226       0


The hierarchy levels can have names (as strings or any Python objects). If this is the case, they are displayed in the console output:

In [11]:
df.index.names = ['Title', 'Language']

In [12]:
df.columns.names = ['Month', 'Version']

In [13]:
df

Month                      12/2021          01/2022          02/2022
Version                     latest  stable   latest  stable   latest  stable
Title            Language
Jupyter Tutorial de          19651       0    30134       0    33295       0
                 en           4722    1825     3497    2576     4009    3707
PyViz Tutorial   de           2573       0     4873       0     3930       0
Python Basics    de            525       0      427       0      276       0
                 en            157       0       85       0      226       0


> **Warning:**
> 
> Do not confuse the column level names `Month` and `Version` with row labels: they are not part of the `df.index` values.
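
The distinction between level names and labels can be checked directly. The following sketch uses a shortened two-row excerpt of the data above; the variable names are illustrative:

```python
import pandas as pd

# Shortened excerpt of the DataFrame above (two rows, one month)
df = pd.DataFrame(
    [[19651, 0], [4722, 1825]],
    index=[["Jupyter Tutorial", "Jupyter Tutorial"], ["de", "en"]],
    columns=[["12/2021", "12/2021"], ["latest", "stable"]],
)
df.index.names = ["Title", "Language"]
df.columns.names = ["Month", "Version"]

row_level_names = list(df.index.names)        # names of the row levels
column_level_names = list(df.columns.names)   # names of the column levels
# Level names can be used to look up the labels of a level
languages = list(df.index.get_level_values("Language"))
```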

With partial column indexing you can select column groups in a similar way:

In [14]:
df['12/2021']

Version                    latest  stable
Title            Language
Jupyter Tutorial de         19651       0
                 en          4722    1825
PyViz Tutorial   de          2573       0
Python Basics    de           525       0
                 en           157       0
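
Partial indexing with `df[...]` addresses the outer column level. To select by the inner column level, one option is `DataFrame.xs` with `axis=1`. The following is a sketch on a shortened excerpt of the data; `latest` is an illustrative name:

```python
import pandas as pd

# Shortened excerpt of the DataFrame above (two rows, two months)
df = pd.DataFrame(
    [[19651, 0, 30134, 0], [4722, 1825, 3497, 2576]],
    index=[["Jupyter Tutorial", "Jupyter Tutorial"], ["de", "en"]],
    columns=[["12/2021", "12/2021", "01/2022", "01/2022"],
             ["latest", "stable", "latest", "stable"]],
)
df.index.names = ["Title", "Language"]
df.columns.names = ["Month", "Version"]

# All 'latest' columns across months; the 'Version' level is dropped
latest = df.xs("latest", axis=1, level="Version")
```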


With [pandas.MultiIndex.from_arrays](https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.from_arrays.html), a `MultiIndex` can be created on its own and then reused; the row index of the preceding DataFrame, including its level names, could be created in this way:

In [15]:
pd.MultiIndex.from_arrays([['Jupyter Tutorial',
                            'Jupyter Tutorial',
                            'PyViz Tutorial',
                            'Python Basics',
                            'Python Basics'],
                           ['de', 'en', 'de', 'de', 'en']],
                          names=['Title', 'Language'])

MultiIndex([('Jupyter Tutorial', 'de'),
            ('Jupyter Tutorial', 'en'),
            (  'PyViz Tutorial', 'de'),
            (   'Python Basics', 'de'),
            (   'Python Basics', 'en')],
           names=['Title', 'Language'])
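
The column index of the DataFrame above combines every month with every version, so it could presumably be built more compactly with `pandas.MultiIndex.from_product`. A minimal sketch; the variable name `columns` is illustrative:

```python
import pandas as pd

# Cartesian product of months and versions, in the order given
columns = pd.MultiIndex.from_product(
    [["12/2021", "01/2022", "02/2022"], ["latest", "stable"]],
    names=["Month", "Version"],
)
```

An index built this way can be passed as `columns=` (or `index=`) when constructing a DataFrame.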

## Rearranging and Sorting Levels

There may be times when you want to rearrange the order of the levels on an axis or sort the data by the values in a particular level. The function [pandas.DataFrame.swaplevel](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.swaplevel.html) takes two level numbers or names and returns a new object in which the levels are swapped (but the data remains unchanged):

In [16]:
df.swaplevel('Language', 'Title')

Month                      12/2021          01/2022          02/2022
Version                     latest  stable   latest  stable   latest  stable
Language Title
de       Jupyter Tutorial    19651       0    30134       0    33295       0
en       Jupyter Tutorial     4722    1825     3497    2576     4009    3707
de       PyViz Tutorial       2573       0     4873       0     3930       0
         Python Basics         525       0      427       0      276       0
en       Python Basics         157       0       85       0      226       0
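
With only two levels, swapping is the same as giving the complete new level order with `DataFrame.reorder_levels`, which generalises to more than two levels. A sketch on a shortened excerpt of the data; `swapped` and `reordered` are illustrative names:

```python
import pandas as pd

# Shortened excerpt of the DataFrame above
df = pd.DataFrame(
    [[19651, 0], [4722, 1825], [2573, 0]],
    index=pd.MultiIndex.from_arrays(
        [["Jupyter Tutorial", "Jupyter Tutorial", "PyViz Tutorial"],
         ["de", "en", "de"]],
        names=["Title", "Language"],
    ),
    columns=["latest", "stable"],
)

swapped = df.swaplevel("Language", "Title")
reordered = df.reorder_levels(["Language", "Title"])  # full new order of the levels
```

Both calls return new objects; `df` itself keeps its original level order.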


[pandas.DataFrame.sort_index](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_index.html), on the other hand, sorts the data only by the values in a single level. When swapping levels, it is not uncommon to also use `sort_index` so that the result is lexicographically sorted by the specified level:

In [17]:
df.sort_index(level=0)

Month                      12/2021          01/2022          02/2022
Version                     latest  stable   latest  stable   latest  stable
Title            Language
Jupyter Tutorial de          19651       0    30134       0    33295       0
                 en           4722    1825     3497    2576     4009    3707
PyViz Tutorial   de           2573       0     4873       0     3930       0
Python Basics    de            525       0      427       0      276       0
                 en            157       0       85       0      226       0


In [18]:
df.swaplevel(0, 1).sort_index(level=0)

Month                      12/2021          01/2022          02/2022
Version                     latest  stable   latest  stable   latest  stable
Language Title
de       Jupyter Tutorial    19651       0    30134       0    33295       0
         PyViz Tutorial       2573       0     4873       0     3930       0
         Python Basics         525       0      427       0      276       0
en       Jupyter Tutorial     4722    1825     3497    2576     4009    3707
         Python Basics         157       0       85       0      226       0


> **Note:**
> 
> Data selection performance is much better for hierarchically indexed objects if the index is sorted lexicographically, starting with the outermost level, i.e. the result of calling `sort_index(level=0)` or `sort_index()`.
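
Whether an index is already sorted in this sense can be checked with `is_monotonic_increasing`. The following sketch uses a shortened excerpt of the data; `before` and `after` are illustrative names:

```python
import pandas as pd

# Shortened excerpt of the DataFrame above
df = pd.DataFrame(
    [[19651, 0], [4722, 1825], [2573, 0]],
    index=pd.MultiIndex.from_arrays(
        [["Jupyter Tutorial", "Jupyter Tutorial", "PyViz Tutorial"],
         ["de", "en", "de"]],
        names=["Title", "Language"],
    ),
    columns=["latest", "stable"],
)

swapped = df.swaplevel(0, 1)
before = swapped.index.is_monotonic_increasing               # de, en, de: not sorted
after = swapped.sort_index().index.is_monotonic_increasing   # sorted after sort_index()
```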

## Summary statistics by level

Descriptive and summary statistics for `DataFrame` and `Series` can be aggregated by index level on a particular axis by grouping with the `level` argument of `groupby`. Consider the `DataFrame` above; we can aggregate either the rows or the columns by level as follows:

In [19]:
df.groupby(level='Language').sum()

Month     12/2021          01/2022          02/2022
Version    latest  stable   latest  stable   latest  stable
Language
de          22749       0    35434       0    37501       0
en           4879    1825     3582    2576     4235    3707


In [20]:
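# groupby(..., axis=1) is deprecated since pandas 2.1:
# transpose, group the rows by level, then transpose back
df.T.groupby(level='Month').sum().T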

Month                      01/2022  02/2022  12/2021
Title            Language
Jupyter Tutorial de          30134    33295    19651
                 en           6073     7716     6547
PyViz Tutorial   de           4873     3930     2573
Python Basics    de            427      276      525
                 en             85      226      157


Internally, the pandas [pandas.DataFrame.groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) machinery is used for this purpose, which is explained in more detail in [Group operations](group-operations.ipynb).
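
Grouping by level works for a Series in the same way. As a small sketch, the `hits` Series from the beginning of this chapter can be summed per language; `per_language` is an illustrative name:

```python
import pandas as pd

hits = pd.Series(
    [83080, 20336, 11376, 1228, 468],
    index=[
        ["Jupyter Tutorial", "Jupyter Tutorial", "PyViz Tutorial",
         "Python Basics", "Python Basics"],
        ["de", "en", "de", "de", "en"],
    ],
)

# The levels are unnamed here, so the inner level is addressed by position
per_language = hits.groupby(level=1).sum()
```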

## Indexing with the columns of a DataFrame

It is not uncommon to use one or more columns of a DataFrame as a row index; alternatively, you can move the row index into the columns of the DataFrame. Here is an example DataFrame:

In [21]:
data = [['Jupyter Tutorial', 'de', 19651,0,30134,0,33295,0],
        ['Jupyter Tutorial', 'en', 4722,1825,3497,2576,4009,3707],
        ['PyViz Tutorial', 'de', 2573,0,4873,0,3930,0],
        ['Python Basics', 'de', 525,0,427,0,276,0],
        ['Python Basics', 'en', 157,0,85,0,226,0]]
    
df = pd.DataFrame(data)

df

                  0   1      2     3      4     5      6     7
0  Jupyter Tutorial  de  19651     0  30134     0  33295     0
1  Jupyter Tutorial  en   4722  1825   3497  2576   4009  3707
2    PyViz Tutorial  de   2573     0   4873     0   3930     0
3     Python Basics  de    525     0    427     0    276     0
4     Python Basics  en    157     0     85     0    226     0


The function [pandas.DataFrame.set_index](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html) creates a new DataFrame that uses one or more of its columns as an index:

In [22]:
df2 = df.set_index([0,1])

df2

                         2     3      4     5      6     7
0                1
Jupyter Tutorial de  19651     0  30134     0  33295     0
                 en   4722  1825   3497  2576   4009  3707
PyViz Tutorial   de   2573     0   4873     0   3930     0
Python Basics    de    525     0    427     0    276     0
                 en    157     0     85     0    226     0


By default, the columns are removed from the DataFrame, but you can also leave them in by passing `drop=False` to `set_index`:

In [23]:
df.set_index([0,1], drop=False)

                                    0   1      2     3      4     5      6     7
0                1
Jupyter Tutorial de  Jupyter Tutorial  de  19651     0  30134     0  33295     0
                 en  Jupyter Tutorial  en   4722  1825   3497  2576   4009  3707
PyViz Tutorial   de    PyViz Tutorial  de   2573     0   4873     0   3930     0
Python Basics    de     Python Basics  de    525     0    427     0    276     0
                 en     Python Basics  en    157     0     85     0    226     0


[pandas.DataFrame.reset_index](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html), on the other hand, does the opposite of `set_index`; the hierarchical index levels are moved into the columns:

In [24]:
df2.reset_index()

                  0   1      2     3      4     5      6     7
0  Jupyter Tutorial  de  19651     0  30134     0  33295     0
1  Jupyter Tutorial  en   4722  1825   3497  2576   4009  3707
2    PyViz Tutorial  de   2573     0   4873     0   3930     0
3     Python Basics  de    525     0    427     0    276     0
4     Python Basics  en    157     0     85     0    226     0
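
With named columns, the round trip is easier to follow, and `reset_index` can also move just one level back into the columns. The following sketch uses a shortened excerpt of the data; the column names `Title`, `Language` and `Hits` are assumptions made for illustration:

```python
import pandas as pd

# Shortened excerpt with hypothetical column names
df = pd.DataFrame(
    [["Jupyter Tutorial", "de", 19651],
     ["Jupyter Tutorial", "en", 4722],
     ["PyViz Tutorial", "de", 2573]],
    columns=["Title", "Language", "Hits"],
)

indexed = df.set_index(["Title", "Language"])     # two columns become index levels
partial = indexed.reset_index(level="Language")   # move only 'Language' back
restored = indexed.reset_index()                  # move all levels back
```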
