# Pandas MultiIndex Tutorial

In [1]:
import pandas as pd

# What is a MultiIndex DataFrame?

A pandas MultiIndex DataFrame lets you store and manipulate data with an arbitrary number of dimensions in a 2-dimensional tabular structure (a DataFrame).

While a displayed MultiIndex df looks like little more than a neatly organized regular df, it's a powerful structure when the data warrants its use.

# When should you use one?

1. When a single column’s value isn’t enough to uniquely identify a row (e.g. multiple records on the same date means date alone isn’t a good index).
2. When data is logically hierarchical - meaning that it has multiple dimensions or “levels.”

Besides structure, MultiIndexes offer us two benefits:
- Relatively easy retrieval of complex data.
- Possibly improved efficiency when lookups and merges are frequent (this claim still needs to be verified).
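
To make the retrieval benefit concrete, here is a minimal sketch. It assumes a hypothetical toy DataFrame named `sales` (not the tutorial's dataset) indexed by date and store, and shows three common ways to pull rows out of it.

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration only)
sales = pd.DataFrame(
    {
        "Date": ["2018-07-10", "2018-07-10", "2018-07-11", "2018-07-11"],
        "Store": ["Store 1", "Store 2", "Store 1", "Store 2"],
        "Dollars": [90, 47, 104, 60],
    }
).set_index(["Date", "Store"])

# Every row for one date: select on the outer level only
one_day = sales.loc["2018-07-10"]

# One exact value: give a label for each level as a tuple
one_cell = sales.loc[("2018-07-11", "Store 2"), "Dollars"]

# Every row for one store: take a cross-section on an inner level
store_1 = sales.xs("Store 1", level="Store")

print(one_day)
print(one_cell)
print(store_1)
```

With a flat index, each of these lookups would need a boolean mask over one or more columns.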

# First, some quick groundwork

- 2-minute anatomy of a dataframe
- What’s an index in pandas?
  - The index of a DataFrame is a sequence of labels, one for each row. To be helpful, those labels should be meaningful and unique.
- Example:
  - Start w/ range index - unique, but not super useful
  - Date
  - But what about data with multiple transactions per date?
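
The progression in the example outline above can be sketched as follows. The `toy` DataFrame is a hypothetical stand-in with two transactions on the same date; it is only meant to show when an index is unique, not to reproduce the tutorial's data.

```python
import pandas as pd

# Hypothetical toy data with two transactions on the same date
toy = pd.DataFrame(
    {
        "Date": ["2018-07-10", "2018-07-11", "2018-07-11"],
        "Store": ["Store 1", "Store 1", "Store 2"],
        "Dollars": [90, 47, 47],
    }
)

# Default RangeIndex: unique labels, but they carry no meaning
range_unique = toy.index.is_unique

# Date as the index: meaningful, but no longer unique
date_unique = toy.set_index("Date").index.is_unique

# Date and Store together: meaningful and unique
pair_unique = toy.set_index(["Date", "Store"]).index.is_unique

print(range_unique, date_unique, pair_unique)  # True False True
```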

# Realistic Demo Data

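The demo data is a small, made-up set of retail sales records. Each row holds the dollars and units sold for one product (identified by its UPC EAN) at one store on one date, along with the product's category and subcategory.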

In [2]:
data = [["2018-07-10", "Store 1", "Beer", "Ales", "0736920111112", "Goose Island - Honkers Ale - 6 Pack", 90, 9],
        ["2018-07-11", "Store 1", "Beer", "Lagers", "0736920222222", "Brand2 - RandomName1 - 6 Pack", 47, 6],
        ["2018-07-11", "Store 2", "Beer", "Stouts", "0736920333333", "Brand2 - RandomName2 - 6 Pack", 47, 6],
        ["2018-07-12", "Store 1", "Beer", "Ales", "0736920111112", "Goose Island - Honkers Ale - 6 Pack", 104, 9],
        ["2018-07-12", "Store 3", "Beer", "Malts", "0736920555555", "Goose Island - Honkers Ale - 6 Pack", 90, 9],
        ["2018-07-10", "Store 3", "Wine", "Red", "0736920666666", "Goose Island - Honkers Ale - 6 Pack", 90, 9],
        ["2018-07-13", "Store 2", "Wine", "White", "0736920777777", "Goose Island - Honkers Ale - 6 Pack", 90, 9],
        ["2018-07-13", "Store 3", "Wine", "Rose", "0736920999999", "Goose Island - Honkers Ale - 6 Pack", 90, 9],
        ["2018-07-12", "Store 1", "Alcohol","Liqour", "9736920111635", "Goose Island - Honkers Ale - 6 Pack", 90, 9],
        ["2018-07-12", "Store 2", "Alcohol","Liquor", "9736920111897", "Goose Island - Honkers Ale - 6 Pack", 90, 9],
        ["2018-07-12", "Store 3", "Alcohol","Liquor", "9736920111343", "Goose Island - Honkers Ale - 6 Pack", 90, 9]]

df = pd.DataFrame(data, columns=["Date","Store","Category","Subcategory", "UPC EAN",
                                 "Description", "Dollars", "Units"])
df

,Date,Store,Category,Subcategory,UPC EAN,Description,Dollars,Units
0,2018-07-10,Store 1,Beer,Ales,0736920111112,Goose Island - Honkers Ale - 6 Pack,90,9
1,2018-07-11,Store 1,Beer,Lagers,0736920222222,Brand2 - RandomName1 - 6 Pack,47,6
2,2018-07-11,Store 2,Beer,Stouts,0736920333333,Brand2 - RandomName2 - 6 Pack,47,6
3,2018-07-12,Store 1,Beer,Ales,0736920111112,Goose Island - Honkers Ale - 6 Pack,104,9
4,2018-07-12,Store 3,Beer,Malts,0736920555555,Goose Island - Honkers Ale - 6 Pack,90,9
5,2018-07-10,Store 3,Wine,Red,0736920666666,Goose Island - Honkers Ale - 6 Pack,90,9
6,2018-07-13,Store 2,Wine,White,0736920777777,Goose Island - Honkers Ale - 6 Pack,90,9
7,2018-07-13,Store 3,Wine,Rose,0736920999999,Goose Island - Honkers Ale - 6 Pack,90,9
8,2018-07-12,Store 1,Alcohol,Liqour,9736920111635,Goose Island - Honkers Ale - 6 Pack,90,9
9,2018-07-12,Store 2,Alcohol,Liquor,9736920111897,Goose Island - Honkers Ale - 6 Pack,90,9
10,2018-07-12,Store 3,Alcohol,Liquor,9736920111343,Goose Island - Honkers Ale - 6 Pack,90,9


# Setting and Manipulating Indexes



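Right now our demo data is a flat DataFrame with a default integer index. What we want is for the columns that identify a row (the date, the store, and the product hierarchy) to become levels of the index. Let's take a look at how we can create our MultiIndex from our regular ol' DataFrame. We'll walk through the basics of setting, reordering, and resetting indexes, along with some useful tips/tricks.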

In [3]:
# Set just like the index for a DataFrame...
# ...except we give a list of column names instead of a single string column name
df.set_index(['Date', 'Store', 'Category', 'UPC EAN'], inplace=True)
df

Date,Store,Category,UPC EAN,Subcategory,Description,Dollars,Units
2018-07-10,Store 1,Beer,0736920111112,Ales,Goose Island - Honkers Ale - 6 Pack,90,9
2018-07-11,Store 1,Beer,0736920222222,Lagers,Brand2 - RandomName1 - 6 Pack,47,6
2018-07-11,Store 2,Beer,0736920333333,Stouts,Brand2 - RandomName2 - 6 Pack,47,6
2018-07-12,Store 1,Beer,0736920111112,Ales,Goose Island - Honkers Ale - 6 Pack,104,9
2018-07-12,Store 3,Beer,0736920555555,Malts,Goose Island - Honkers Ale - 6 Pack,90,9
2018-07-10,Store 3,Wine,0736920666666,Red,Goose Island - Honkers Ale - 6 Pack,90,9
2018-07-13,Store 2,Wine,0736920777777,White,Goose Island - Honkers Ale - 6 Pack,90,9
2018-07-13,Store 3,Wine,0736920999999,Rose,Goose Island - Honkers Ale - 6 Pack,90,9
2018-07-12,Store 1,Alcohol,9736920111635,Liqour,Goose Island - Honkers Ale - 6 Pack,90,9
2018-07-12,Store 2,Alcohol,9736920111897,Liquor,Goose Island - Honkers Ale - 6 Pack,90,9
2018-07-12,Store 3,Alcohol,9736920111343,Liquor,Goose Island - Honkers Ale - 6 Pack,90,9
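
After setting a MultiIndex, it can help to inspect it and to know how to undo it. The sketch below uses a hypothetical two-row DataFrame named `small` (an assumption, not the tutorial's `df`) to show `index.names`, `index.nlevels`, and `reset_index`, which moves one or all levels back into ordinary columns.

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration only)
small = pd.DataFrame(
    {
        "Date": ["2018-07-10", "2018-07-11"],
        "Store": ["Store 1", "Store 2"],
        "Category": ["Beer", "Wine"],
        "Dollars": [90, 47],
    }
)

# Non-inplace form: set_index returns a new DataFrame
indexed = small.set_index(["Date", "Store", "Category"])

# Inspect the resulting index
level_names = list(indexed.index.names)
n_levels = indexed.index.nlevels

# Move a single level back into the columns
partly_reset = indexed.reset_index(level="Category")

# Move every level back, restoring the flat DataFrame
fully_reset = indexed.reset_index()

print(level_names, n_levels)
print(list(partly_reset.columns))
```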


Uh oh - it looks like we forgot to add the 'Subcategory' column to our index, but don't worry - pandas has us covered with extra set_index parameters for MultiIndexes:

In [4]:
# We can append a column to our existing index
df.set_index('Subcategory', append=True, inplace=True)
df.head(3)

Date,Store,Category,UPC EAN,Subcategory,Description,Dollars,Units
2018-07-10,Store 1,Beer,0736920111112,Ales,Goose Island - Honkers Ale - 6 Pack,90,9
2018-07-11,Store 1,Beer,0736920222222,Lagers,Brand2 - RandomName1 - 6 Pack,47,6
2018-07-11,Store 2,Beer,0736920333333,Stouts,Brand2 - RandomName2 - 6 Pack,47,6
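
To see why `append=True` matters, the following sketch contrasts it with the default behavior. It uses a hypothetical `toy` DataFrame whose column names mirror the tutorial's; the data itself is made up.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the tutorial's
toy = pd.DataFrame(
    {
        "Date": ["2018-07-10", "2018-07-11"],
        "Store": ["Store 1", "Store 2"],
        "Subcategory": ["Ales", "Stouts"],
        "Dollars": [90, 47],
    }
).set_index(["Date", "Store"])

# Default (append=False): the new index replaces the old one,
# so the Date and Store levels are discarded
replaced = toy.set_index("Subcategory")

# append=True: the column becomes an extra, innermost index level
appended = toy.set_index("Subcategory", append=True)

print(list(replaced.index.names))
print(list(appended.index.names))
```

Forgetting `append=True` silently drops the existing levels, so it is worth checking `index.names` after the call.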


That's almost right, but we'd actually like 'Subcategory' to show up after 'Category'. We have a couple of options to get things in the right order:

In [5]:
# Option 1 is the generalized solution to reorder the index levels
# Note: We're not making an inplace change in this cell,
#       but it's worth noting that this method doesn't have an inplace parameter.
df.reorder_levels(order=['Date', 'Store', 'Category', 'Subcategory', 'UPC EAN']).head(3)

Date,Store,Category,Subcategory,UPC EAN,Description,Dollars,Units
2018-07-10,Store 1,Beer,Ales,0736920111112,Goose Island - Honkers Ale - 6 Pack,90,9
2018-07-11,Store 1,Beer,Lagers,0736920222222,Brand2 - RandomName1 - 6 Pack,47,6
2018-07-11,Store 2,Beer,Stouts,0736920333333,Brand2 - RandomName2 - 6 Pack,47,6
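
As a side note, `reorder_levels` also accepts integer level positions instead of names, which can save some typing. The sketch below assumes a hypothetical one-row `toy` DataFrame and shows that both forms give the same result, and that the original is left untouched.

```python
import pandas as pd

# Hypothetical toy index (an assumption for illustration only)
idx = pd.MultiIndex.from_tuples(
    [("2018-07-10", "0736920111112", "Ales")],
    names=["Date", "UPC EAN", "Subcategory"],
)
toy = pd.DataFrame({"Dollars": [90]}, index=idx)

# Reorder by level name...
by_name = toy.reorder_levels(["Date", "Subcategory", "UPC EAN"])

# ...or by level position (0-based, referring to the current order)
by_position = toy.reorder_levels([0, 2, 1])

print(list(by_name.index.names))
print(list(toy.index.names))  # the original DataFrame is unchanged
```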


reorder_levels() is useful, but it was a pain to have to type all five levels just to switch two. In cases like this we have a second, less verbose option:

In [6]:
# Option 2 just switches two index levels (a more common need than you'd think)
# Note: swaplevel has no inplace parameter either, so we reassign df to keep the change.
df = df.swaplevel('UPC EAN', 'Subcategory')
df.head(3)

Date,Store,Category,Subcategory,UPC EAN,Description,Dollars,Units
2018-07-10,Store 1,Beer,Ales,0736920111112,Goose Island - Honkers Ale - 6 Pack,90,9
2018-07-11,Store 1,Beer,Lagers,0736920222222,Brand2 - RandomName1 - 6 Pack,47,6
2018-07-11,Store 2,Beer,Stouts,0736920333333,Brand2 - RandomName2 - 6 Pack,47,6
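
Two details of `swaplevel` are worth a quick sketch: called with no arguments it swaps the two innermost levels, and it only relabels the index without re-sorting the rows. The `toy` DataFrame below is a hypothetical example, not the tutorial's data.

```python
import pandas as pd

# Hypothetical toy index (an assumption for illustration only)
idx = pd.MultiIndex.from_tuples(
    [
        ("Beer", "0736920222222", "Lagers"),
        ("Beer", "0736920111112", "Ales"),
    ],
    names=["Category", "UPC EAN", "Subcategory"],
)
toy = pd.DataFrame({"Units": [6, 9]}, index=idx)

# With no arguments, swaplevel swaps the last two levels
swapped = toy.swaplevel()

# Row order is unchanged by the swap; sort explicitly if order matters
swapped_sorted = swapped.sort_index()

print(list(swapped.index.names))
print(swapped["Units"].tolist())
print(swapped_sorted["Units"].tolist())
```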


Just when we thought we were done, it turns out we forgot to add the highest level of the product hierarchy - the Department - not just to our index, but to our DataFrame altogether. Luckily, all of our records belong to the same Department, so here's a neat trick for adding a new index level that holds the same value for every row:

In [7]:
# A handy function to keep around for projects
def add_constant_index_level(df, value, level_name):
    return pd.concat([df], keys=[value], names=[level_name])

df = add_constant_index_level(df, "Booooze", "Department")
df = df.reorder_levels(order=['Date', 'Store', 'Department', 'Category', 'Subcategory', 'UPC EAN'])
df.head(3)

Date,Store,Department,Category,Subcategory,UPC EAN,Description,Dollars,Units
2018-07-10,Store 1,Booooze,Beer,Ales,0736920111112,Goose Island - Honkers Ale - 6 Pack,90,9
2018-07-11,Store 1,Booooze,Beer,Lagers,0736920222222,Brand2 - RandomName1 - 6 Pack,47,6
2018-07-11,Store 2,Booooze,Beer,Stouts,0736920333333,Brand2 - RandomName2 - 6 Pack,47,6
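
For comparison, here is a sketch of one alternative way to add a constant level: assign a regular column and append it to the index. This is not the approach used above, just another option. The `toy` DataFrame is hypothetical. Note where each approach places the new level: the `pd.concat` helper puts it outermost, while `set_index(..., append=True)` puts it innermost, so either way a `reorder_levels` call may be needed afterwards.

```python
import pandas as pd

# Same helper as in the tutorial cell above
def add_constant_index_level(df, value, level_name):
    return pd.concat([df], keys=[value], names=[level_name])

# Hypothetical toy data (an assumption for illustration only)
idx = pd.MultiIndex.from_tuples(
    [("2018-07-10", "Store 1"), ("2018-07-11", "Store 2")],
    names=["Date", "Store"],
)
toy = pd.DataFrame({"Dollars": [90, 47]}, index=idx)

# concat-based helper: new level becomes the outermost level
via_concat = add_constant_index_level(toy, "Booooze", "Department")

# Alternative: add a constant column, then append it as the innermost level
via_assign = toy.assign(Department="Booooze").set_index("Department", append=True)

# After reordering, both approaches describe the same index
aligned = via_assign.reorder_levels(["Department", "Date", "Store"])

print(list(via_concat.index.names))
print(list(via_assign.index.names))
```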
