# [Multi-Indexing in Pandas](#)

Multi-indexing, also known as hierarchical indexing, is a powerful feature in Pandas that allows you to work with higher-dimensional data in a lower-dimensional form. It enables you to represent and manipulate complex datasets with multiple levels of indexing on rows or columns.


Multi-indexing extends the concept of a single-level index to multiple levels. Instead of having a single label for each row or column, you can have multiple labels, creating a hierarchy of indices. This hierarchical structure allows you to represent and analyze data with more complex relationships.


For example, consider a DataFrame with sales data:


In [1]:
import pandas as pd
import numpy as np

# Create a sample multi-index DataFrame
index = pd.MultiIndex.from_product(
    [['A', 'B', 'C'], ['Q1', 'Q2', 'Q3', 'Q4']],
    names=['Store', 'Quarter']
)
data = np.random.randn(12, 2)
df = pd.DataFrame(data, index=index, columns=['Sales', 'Profit'])

df

Store,Quarter,Sales,Profit
A,Q1,1.778306,1.666448
A,Q2,-0.675039,-0.695888
A,Q3,0.315251,-0.661563
A,Q4,1.17789,0.684904
B,Q1,-0.361242,-1.252062
B,Q2,1.096009,0.141477
B,Q3,0.33724,0.568699
B,Q4,-0.902728,2.100881
C,Q1,-0.945163,-0.426175
C,Q2,0.769792,0.107989


In this example, we have a two-level index: the first level represents stores (A, B, C), and the second level represents quarters (Q1, Q2, Q3, Q4).


Multi-indexing offers several advantages:

1. **Data Organization**: It allows you to represent complex, hierarchical data structures in a tabular format, making it easier to organize and understand your data.

2. **Efficient Data Selection**: You can select and manipulate data based on multiple levels of indexing, enabling more precise and flexible data access.

3. **Advanced Analysis**: Multi-indexing facilitates advanced grouping, aggregation, and pivoting operations, which are particularly useful for analyzing hierarchical or panel data.

4. **Dimensionality Reduction**: It helps in representing higher-dimensional data in a lower-dimensional form, which can be more manageable and easier to visualize.

5. **Memory Efficiency**: For large datasets with repetitive index values, multi-indexing can be more memory-efficient than storing the same information in separate columns, because a `MultiIndex` stores each level's unique labels once and refers to them through integer codes.
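
To make the memory-efficiency point more concrete, the sketch below inspects how a `MultiIndex` stores its labels: each level keeps its unique values once, and every row refers to them through integer codes. The index mirrors the store/quarter example above; the variable names are illustrative.

```python
import pandas as pd

# Same shape as the store/quarter index used in this lecture
index = pd.MultiIndex.from_product(
    [['A', 'B', 'C'], ['Q1', 'Q2', 'Q3', 'Q4']],
    names=['Store', 'Quarter']
)

# Unique labels are stored once per level
store_labels = list(index.levels[0])
quarter_labels = list(index.levels[1])

# Each row points at its labels through integer codes
store_codes = list(index.codes[0])
quarter_codes = list(index.codes[1])

print(store_labels, store_codes)
print(quarter_labels, quarter_codes)
```

Twelve rows need only three store labels and four quarter labels; the remainder of the index is integer codes.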


Here's a simple example demonstrating the power of multi-indexing:


In [3]:
# Selecting data for Store A
df.loc['A']

Quarter,Sales,Profit
Q1,1.778306,1.666448
Q2,-0.675039,-0.695888
Q3,0.315251,-0.661563
Q4,1.17789,0.684904


In [4]:
# Selecting data for Q1 across all stores
df.xs('Q1', level='Quarter')

Store,Sales,Profit
A,1.778306,1.666448
B,-0.361242,-1.252062
C,-0.945163,-0.426175


In [5]:
# Calculating mean sales for each store
df['Sales'].groupby(level='Store').mean()

Store
A    0.649102
B    0.042320
C   -0.461509
Name: Sales, dtype: float64

These operations become intuitive and straightforward with multi-indexing, allowing for complex data manipulations with simple, readable code.
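
The same level-based approach extends to pivoting. As a minimal sketch, using deterministic values instead of the random data above so the results are easy to check, `unstack()` moves an index level into the columns and `groupby(level=...)` aggregates over the remaining level:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [['A', 'B', 'C'], ['Q1', 'Q2', 'Q3', 'Q4']],
    names=['Store', 'Quarter']
)
# Illustrative values 0..11 rather than random numbers
sales = pd.Series(np.arange(12, dtype=float), index=index, name='Sales')

# Pivot the Quarter level into columns: one row per store
wide = sales.unstack('Quarter')

# Aggregate across stores for each quarter
quarter_totals = sales.groupby(level='Quarter').sum()

print(wide)
print(quarter_totals)
```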


As we progress through this lecture, we'll explore how to create, manipulate, and leverage multi-index DataFrames to enhance your data analysis capabilities. Multi-indexing is a key feature that sets Pandas apart in handling complex, real-world datasets, and mastering it will significantly boost your data manipulation and analysis skills.

## <a id='toc1_'></a>[Creating Multi-Index DataFrames](#toc0_)

There are several ways to create multi-index DataFrames in Pandas. We'll explore creating them from scratch, converting existing DataFrames, and using the specialized `pd.MultiIndex` constructors, followed by a few additional methods.

### <a id='toc1_1_'></a>[From Scratch](#toc0_)


You can create a multi-index DataFrame directly by specifying a list of tuples for the index:


In [6]:
# Create a multi-index DataFrame from scratch
tuples = [('A', 2020), ('A', 2021), ('B', 2020), ('B', 2021)]
index = pd.MultiIndex.from_tuples(tuples, names=['Company', 'Year'])
df = pd.DataFrame({'Revenue': [100, 120, 90, 110], 'Profit': [20, 25, 15, 22]}, index=index)

df

Company,Year,Revenue,Profit
A,2020,100,20
A,2021,120,25
B,2020,90,15
B,2021,110,22


### <a id='toc1_2_'></a>[From Existing DataFrames](#toc0_)


You can convert an existing DataFrame to a multi-index DataFrame using the `set_index()` method:


In [7]:
# Create a regular DataFrame
df = pd.DataFrame({
    'Company': ['A', 'A', 'B', 'B'],
    'Year': [2020, 2021, 2020, 2021],
    'Revenue': [100, 120, 90, 110],
    'Profit': [20, 25, 15, 22]
})

# Convert to multi-index DataFrame
df_multi = df.set_index(['Company', 'Year'])

df_multi

Company,Year,Revenue,Profit
A,2020,100,20
A,2021,120,25
B,2020,90,15
B,2021,110,22
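
The conversion is reversible. As a small sketch reusing the same data, `reset_index()` moves the index levels back into ordinary columns, which is handy when a later step expects a flat table:

```python
import pandas as pd

df = pd.DataFrame({
    'Company': ['A', 'A', 'B', 'B'],
    'Year': [2020, 2021, 2020, 2021],
    'Revenue': [100, 120, 90, 110],
    'Profit': [20, 25, 15, 22]
})

# Columns -> index levels
df_multi = df.set_index(['Company', 'Year'])

# Index levels -> columns again
df_flat = df_multi.reset_index()

print(df_multi.index.names)
print(df_flat.equals(df))
```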


### <a id='toc1_3_'></a>[Using pd.MultiIndex.from_tuples and pd.MultiIndex.from_product](#toc0_)


Pandas provides specialized functions to create multi-index objects:


**Using pd.MultiIndex.from_tuples:** This method creates a multi-index from a list of tuples, where each tuple supplies one label per level, in level order.


In [8]:
# Create multi-index from tuples
tuples = [('A', 2020), ('A', 2021), ('B', 2020), ('B', 2021)]
index = pd.MultiIndex.from_tuples(tuples, names=['Company', 'Year'])
df = pd.DataFrame({'Revenue': [100, 120, 90, 110], 'Profit': [20, 25, 15, 22]}, index=index)

df

Company,Year,Revenue,Profit
A,2020,100,20
A,2021,120,25
B,2020,90,15
B,2021,110,22


**Using pd.MultiIndex.from_product:** This method creates a multi-index from the Cartesian product of multiple lists, which is particularly useful when you want every combination of their values:


In [9]:
# Create multi-index from product
companies = ['A', 'B']
years = [2020, 2021, 2022]
index = pd.MultiIndex.from_product([companies, years], names=['Company', 'Year'])
df = pd.DataFrame(np.random.randn(6, 2), index=index, columns=['Revenue', 'Profit'])

df

Company,Year,Revenue,Profit
A,2020,0.147767,1.482246
A,2021,0.117703,-0.53013
A,2022,0.233433,-0.447835
B,2020,-0.214322,-1.225852
B,2021,1.439901,0.30656
B,2022,0.321553,-0.112776
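
To see what "Cartesian product" means here, the following sketch builds the same combinations explicitly with `itertools.product` and checks that `from_tuples` then yields an equal index. The variable names are illustrative.

```python
import itertools
import pandas as pd

companies = ['A', 'B']
years = [2020, 2021, 2022]

from_product = pd.MultiIndex.from_product(
    [companies, years], names=['Company', 'Year']
)

# Build the same combinations by hand and pass them to from_tuples
pairs = list(itertools.product(companies, years))
from_tuples = pd.MultiIndex.from_tuples(pairs, names=['Company', 'Year'])

print(len(from_product))
print(from_product.equals(from_tuples))
```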


### <a id='toc1_4_'></a>[Additional Methods](#toc0_)


You can also create multi-index DataFrames using other methods:


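**Using a dictionary of tuples:** You can create a dictionary where the keys are tuples representing the multi-index and the values are dictionaries of column values. Because the `DataFrame` constructor turns the tuple keys into multi-index columns, the result is transposed with `.T` to move them onto the rows.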


In [10]:
data = {('A', 2020): {'Revenue': 100, 'Profit': 20},
        ('A', 2021): {'Revenue': 120, 'Profit': 25},
        ('B', 2020): {'Revenue': 90, 'Profit': 15},
        ('B', 2021): {'Revenue': 110, 'Profit': 22}}

df = pd.DataFrame(data).T
df.index.names = ['Company', 'Year']

df

Company,Year,Revenue,Profit
A,2020,100,20
A,2021,120,25
B,2020,90,15
B,2021,110,22


**Using pd.MultiIndex.from_arrays:** This method creates a multi-index from a list of arrays, where each array represents the index values for a level.


In [11]:
companies = ['A', 'A', 'B', 'B']
years = [2020, 2021, 2020, 2021]
index = pd.MultiIndex.from_arrays([companies, years], names=['Company', 'Year'])
df = pd.DataFrame({'Revenue': [100, 120, 90, 110], 'Profit': [20, 25, 15, 22]}, index=index)

df

Company,Year,Revenue,Profit
A,2020,100,20
A,2021,120,25
B,2020,90,15
B,2021,110,22


Each of these methods has its advantages depending on your data source and structure. The `from_product` method is particularly useful when you want to create a Cartesian product of index values, while `from_tuples` and `from_arrays` offer more control over the specific combinations of index values.
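
As an illustration of that difference, the sketch below uses a hypothetical dataset in which company B has no 2020 record. `from_tuples` and `from_arrays` describe exactly the rows that exist, while `from_product` would also generate the missing pair:

```python
import pandas as pd

# Company B has no 2020 record, so this is not a full Cartesian product
tuples = [('A', 2020), ('A', 2021), ('B', 2021)]
idx_tuples = pd.MultiIndex.from_tuples(tuples, names=['Company', 'Year'])

# The same index expressed as one array per level
idx_arrays = pd.MultiIndex.from_arrays(
    [['A', 'A', 'B'], [2020, 2021, 2021]],
    names=['Company', 'Year']
)

# from_product generates every combination, including ('B', 2020)
idx_product = pd.MultiIndex.from_product(
    [['A', 'B'], [2020, 2021]],
    names=['Company', 'Year']
)

print(idx_tuples.equals(idx_arrays))
print(len(idx_tuples), len(idx_product))
```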


Remember that you can also create multi-index columns in a similar way, allowing for even more complex data representations:


In [12]:
# Creating multi-index columns
columns = pd.MultiIndex.from_product([['Financial', 'Operational'], ['Value', 'Change']])
index = pd.MultiIndex.from_product([['A', 'B'], [2020, 2021]])

df = pd.DataFrame(np.random.randn(4, 4), index=index, columns=columns)

df

,,Financial,Financial,Operational,Operational
,,Value,Change,Value,Change
A,2020,1.104042,1.07841,-1.066391,-0.99491
A,2021,-1.295641,0.219914,0.75871,0.651273
B,2020,-0.570631,1.336141,-2.060688,1.000543
B,2021,-0.766599,0.958292,2.468706,0.109264
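
Selecting from multi-index columns works much like selecting from a multi-index on the rows. A brief sketch, with deterministic values substituted for the random data so each selection can be checked:

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product([['Financial', 'Operational'], ['Value', 'Change']])
index = pd.MultiIndex.from_product([['A', 'B'], [2020, 2021]])

# Illustrative values 0..15 so each selection is easy to verify
df = pd.DataFrame(np.arange(16).reshape(4, 4), index=index, columns=columns)

# Outer column label: a DataFrame with the inner labels as columns
financial = df['Financial']

# Full tuple: a single column as a Series
financial_value = df[('Financial', 'Value')]

# Inner label across all outer groups
all_values = df.xs('Value', axis=1, level=1)

print(financial)
print(all_values)
```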


By mastering these methods of creating multi-index DataFrames, you'll be well-equipped to handle complex, hierarchical data structures in your Pandas workflows.

## <a id='toc2_'></a>[Accessing and Manipulating Multi-Index Data](#toc0_)

Multi-index DataFrames require specific techniques for efficient data access and manipulation. This section covers the fundamental methods for working with multi-indexed data in Pandas.


### <a id='toc2_1_'></a>[Basic Indexing and Slicing](#toc0_)


Accessing data in a multi-index DataFrame can be done using tuple-based indexing:


In [14]:
# Create a sample multi-index DataFrame
index = pd.MultiIndex.from_product([['A', 'B'], [2020, 2021]], names=['Company', 'Year'])
df = pd.DataFrame(np.random.randn(4, 2), index=index, columns=['Revenue', 'Profit'])
df

Company,Year,Revenue,Profit
A,2020,0.163309,-0.418731
A,2021,0.133003,-0.187639
B,2020,0.207918,0.604329
B,2021,-0.286341,-0.664919


In [15]:
# Basic indexing
df.loc['A']

Year,Revenue,Profit
2020,0.163309,-0.418731
2021,0.133003,-0.187639


In [16]:
# A full tuple key selects a single row as a Series
df.loc[('A', 2020)]

Revenue    0.163309
Profit    -0.418731
Name: (A, 2020), dtype: float64

In [17]:
# Slicing on the first level (label slices include both endpoints)
df.loc['A':'B']

Company,Year,Revenue,Profit
A,2020,0.163309,-0.418731
A,2021,0.133003,-0.187639
B,2020,0.207918,0.604329
B,2021,-0.286341,-0.664919


In [18]:
# Slicing between two full tuple keys (both endpoints included)
df.loc[('A', 2020):('B', 2020)]

Company,Year,Revenue,Profit
A,2020,0.163309,-0.418731
A,2021,0.133003,-0.187639
B,2020,0.207918,0.604329


### <a id='toc2_2_'></a>[Using .loc and .iloc](#toc0_)


The `.loc` and `.iloc` accessors provide powerful ways to select data:


In [19]:
# Using .loc
df.loc['A', 'Revenue']

Year
2020    0.163309
2021    0.133003
Name: Revenue, dtype: float64

In [20]:
df.loc[('A', 2020), 'Revenue']

0.16330947092709558

In [21]:
# Using .iloc (integer-based indexing)
df.iloc[0]

Revenue    0.163309
Profit    -0.418731
Name: (A, 2020), dtype: float64

In [22]:
df.iloc[0, 0]

0.16330947092709558

In [23]:
# Selecting one column for every row (all index levels are kept)
df.loc[:, 'Revenue']

Company  Year
A        2020    0.163309
         2021    0.133003
B        2020    0.207918
         2021   -0.286341
Name: Revenue, dtype: float64

In [24]:
# Select the 2020 rows for every company (slice(None) matches all labels at that level)
df.loc[(slice(None), 2020), :]

Company,Year,Revenue,Profit
A,2020,0.163309,-0.418731
B,2020,0.207918,0.604329
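
Writing `slice(None)` by hand becomes noisy with more levels. `pd.IndexSlice` lets you use the usual `:` syntax instead. A minimal sketch, with deterministic values in place of the random data:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['A', 'B'], [2020, 2021]], names=['Company', 'Year'])
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=['Revenue', 'Profit'])

idx = pd.IndexSlice

# Equivalent to df.loc[(slice(None), 2020), :]
rows_2020 = df.loc[idx[:, 2020], :]

# Combine a row selection with a column selection
revenue_2021 = df.loc[idx[:, 2021], 'Revenue']

print(rows_2020)
print(revenue_2021)
```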


### <a id='toc2_3_'></a>[Cross-section Selection with .xs](#toc0_)


The `.xs` method allows for convenient cross-sectional selection:


In [25]:
# Select all data for 2020
df.xs(2020, level='Year')

Company,Revenue,Profit
A,0.163309,-0.418731
B,0.207918,0.604329


In [26]:
# Select all data for company A
df.xs('A', level='Company')

Year,Revenue,Profit
2020,0.163309,-0.418731
2021,0.133003,-0.187639


In [27]:
# Using .xs with drop_level
df.xs('A', level='Company', drop_level=False)

Company,Year,Revenue,Profit
A,2020,0.163309,-0.418731
A,2021,0.133003,-0.187639
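
`.xs` can also fix several levels at once by passing a tuple of keys together with a list of levels. The sketch below assumes a hypothetical three-level index (Company, Year, Quarter) that does not appear elsewhere in this lecture:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [['A', 'B'], [2020, 2021], ['Q1', 'Q2']],
    names=['Company', 'Year', 'Quarter']
)
# Illustrative values 0..7
df3 = pd.DataFrame({'Revenue': np.arange(8)}, index=index)

# Fix two levels at once; the remaining level (Year) becomes the index
a_q1 = df3.xs(('A', 'Q1'), level=['Company', 'Quarter'])

# Fix a single inner level and keep the others
all_q2 = df3.xs('Q2', level='Quarter')

print(a_q1)
print(all_q2)
```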


### <a id='toc2_4_'></a>[Advanced Selection Techniques](#toc0_)


Pandas provides additional methods for complex selections:


In [28]:
# Using .query() for boolean indexing
df.query("Company == 'A' and Year == 2020")

Company,Year,Revenue,Profit
A,2020,0.163309,-0.418731


In [29]:
# Using .filter() to keep rows whose index label contains 'A' (a substring match on the label's string form)
df.filter(like='A', axis=0)

Company,Year,Revenue,Profit
A,2020,0.163309,-0.418731
A,2021,0.133003,-0.187639


In [30]:
# Using .get_level_values() to access index level values
df.index.get_level_values('Company')

Index(['A', 'A', 'B', 'B'], dtype='object', name='Company')
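
The values returned by `get_level_values()` can be used to build boolean masks, which makes it possible to combine conditions on index levels with conditions on columns. A small sketch with deterministic data:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['A', 'B'], [2020, 2021]], names=['Company', 'Year'])
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=['Revenue', 'Profit'])

years = df.index.get_level_values('Year')
companies = df.index.get_level_values('Company')

# Rows from 2021 only
recent = df[years == 2021]

# Combine a column condition with a condition on an index level
mask = (df['Revenue'] > 4) & companies.isin(['B'])
b_high = df[mask]

print(recent)
print(b_high)
```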

### <a id='toc2_5_'></a>[Modifying Data](#toc0_)


Modifying data in multi-index DataFrames follows similar patterns to regular DataFrames:


In [31]:
# Modifying a single value
df.loc[('A', 2020), 'Revenue'] = 1.5

In [32]:
# Modifying a slice of data: set every value for company A to 0
df.loc['A'] = 0

In [33]:
# Adding a new column
df['New_Column'] = np.random.randn(4)

In [35]:
df

Company,Year,Revenue,Profit,New_Column,2022
A,2020.0,0.0,0.0,0.414898,
A,2021.0,0.0,0.0,-0.703652,
B,2020.0,0.207918,0.604329,-0.650376,
B,2021.0,-0.286341,-0.664919,-1.007387,
C,,,,,
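
As a sketch with illustrative numbers, the same `.loc` assignment pattern also works with `pd.IndexSlice` to update one year across all companies, and `pd.concat` is one way to append rows for a new company while keeping both index levels intact. The company `C` and its values are assumptions made for this example.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['A', 'B'], [2020, 2021]], names=['Company', 'Year'])
df = pd.DataFrame(
    {'Revenue': [100.0, 120.0, 90.0, 110.0], 'Profit': [20.0, 25.0, 15.0, 22.0]},
    index=index
)

# Set Revenue to 0 for 2021 across all companies
idx = pd.IndexSlice
df.loc[idx[:, 2021], 'Revenue'] = 0.0

# Append a row for a hypothetical new company by concatenating a frame
# whose index has the same two named levels
new_row = pd.DataFrame(
    {'Revenue': [80.0], 'Profit': [10.0]},
    index=pd.MultiIndex.from_tuples([('C', 2020)], names=['Company', 'Year'])
)
df_extended = pd.concat([df, new_row])

print(df_extended)
```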


### <a id='toc2_6_'></a>[Reindexing](#toc0_)


Reindexing can be particularly useful for multi-index DataFrames. Passing the `level` argument to `reindex()` matches the new labels against a single index level:


In [42]:
# Reindex at a specific level
new_companies = ['A', 'B', 'C']
df_reindexed = df.reindex(new_companies, level='Company')
df_reindexed

Company,Year,Revenue,Profit,New_Column,2022,2020
A,2020.0,0.0,0.0,0.414898,,
A,2021.0,0.0,0.0,-0.703652,,
B,2020.0,0.207918,0.604329,-0.650376,,
B,2021.0,-0.286341,-0.664919,-1.007387,,
C,,,,,,
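
As a hedged sketch of what `level` does when every requested label already exists in the index, the example below keeps only the rows whose `Company` label appears in the list. The data is illustrative and separate from the DataFrame above.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['A', 'B', 'C'], [2020, 2021]], names=['Company', 'Year']
)
df = pd.DataFrame({'Revenue': [100, 120, 90, 110, 70, 80]}, index=index)

# Keep only companies A and C, matching labels on the Company level
subset = df.reindex(['A', 'C'], level='Company')

print(subset)
```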


As you work with more complex datasets, mastering these techniques will allow you to efficiently navigate and modify your data, unlocking the full potential of Pandas' multi-indexing capabilities.

## <a id='toc3_'></a>[Conclusion and Further Resources](#toc0_)

Multi-indexing in Pandas is a powerful feature that allows for efficient representation and manipulation of complex, hierarchical data structures. Throughout this lecture, we've covered the fundamentals of creating, accessing, and manipulating multi-index DataFrames, as well as advanced operations and performance considerations.


Key Takeaways:
1. Multi-indexing enables you to work with higher-dimensional data in a lower-dimensional form.
2. It provides flexibility in data organization, selection, and analysis.
3. Creating multi-index DataFrames can be done in various ways, including from scratch, from existing DataFrames, or using specialized Pandas functions.
4. Proper use of multi-indexing can lead to more intuitive data manipulation and improved performance, especially for large datasets.
5. Understanding the performance implications of multi-indexing is crucial for optimizing your data analysis workflows.


Best Practices:
- Always name your index levels for better readability and easier data manipulation.
- Sort your multi-index when possible to improve performance of data selection operations.
- Use index levels effectively in your operations instead of treating them as regular columns.
- Be mindful of operations that cause reindexing, as they can be computationally expensive.
- Leverage Pandas functions designed for multi-index operations (e.g., `xs()`, `swaplevel()`) for efficient data manipulation.
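
Several of these practices can be sketched together. The example below names the levels of an index created without names, sorts it, and then uses `swaplevel()` to view the same data grouped by year. The data is illustrative.

```python
import pandas as pd

# Index created without level names and in unsorted order
index = pd.MultiIndex.from_product([['B', 'A'], [2021, 2020]])
df = pd.DataFrame({'Revenue': [110, 90, 120, 100]}, index=index)

# Name the levels so later code can refer to them explicitly
df.index = df.index.set_names(['Company', 'Year'])

# Sort the index before label-based selection
df = df.sort_index()

# swaplevel() reorders the levels; re-sort to group rows by Year
by_year = df.swaplevel('Company', 'Year').sort_index()

print(df)
print(by_year)
```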


To deepen your understanding of multi-indexing in Pandas, consider exploring the official Pandas documentation:
- [Pandas User Guide: MultiIndex / Advanced Indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)
- [Pandas API Reference: MultiIndex](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.html)


Remember, the best way to master multi-indexing is through practice. Try to incorporate these concepts into your data analysis projects, experiment with different approaches, and don't hesitate to refer back to the documentation and these resources as needed.


By mastering multi-indexing, you'll be able to handle complex data structures more efficiently and effectively, opening up new possibilities in your data analysis and manipulation tasks with Pandas.