#### Part 17: MultiIndex and Advanced Indexing in Pandas

In this notebook, we'll explore:
- Set operations on Index objects
- Handling missing values in Index
- Working with MultiIndex
- Reordering levels with `reorder_levels`
- Renaming labels and level names
- Sorting a MultiIndex

##### Setup
First, let's import the necessary libraries:

In [1]:
import pandas as pd
import numpy as np
import random

##### 1. Set Operations on Index Objects

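The two main operations are union and intersection, called as the instance methods `Index.union()` and `Index.intersection()`. Difference is provided via the `.difference()` method. In pandas 2.0 and later, the `|` and `&` operators no longer perform set operations on an Index; they are applied element-wise instead, so the methods are the reliable way to express set logic.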

In [3]:
a = pd.Index(['c', 'b', 'a'])
b = pd.Index(['c', 'e', 'd'])

# Union
a.union(b)

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [6]:
# Intersection
a.intersection(b)

Index(['c'], dtype='object')

In [7]:
# Difference
a.difference(b)

Index(['a', 'b'], dtype='object')
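
The set methods also accept a `sort` argument, and `difference` depends on the order of its operands. A minimal sketch, reusing the `a` and `b` values from the cells above (the names `sorted_union`, `unsorted_union`, `only_in_a`, and `only_in_b` are introduced here purely for illustration):

```python
import pandas as pd

a = pd.Index(['c', 'b', 'a'])
b = pd.Index(['c', 'e', 'd'])

# Default (sort=None): the result is sorted when the values are comparable
sorted_union = a.union(b)

# sort=False: no sorting is attempted on the result
unsorted_union = a.union(b, sort=False)

# difference is not symmetric: swapping the operands changes the result
only_in_a = a.difference(b)
only_in_b = b.difference(a)

print(list(sorted_union))
print(list(unsorted_union))
print(list(only_in_a), list(only_in_b))
```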

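Also available is the `symmetric_difference()` operation, which returns the elements that appear in either `idx1` or `idx2`, but not in both. In pandas 2.0 and later, the `^` operator is not an alias for this method: on an Index it is applied element-wise as a bitwise XOR.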

In [8]:
idx1 = pd.Index([1, 2, 3, 4])
idx2 = pd.Index([2, 3, 4, 5])

# Symmetric difference using method
idx1.symmetric_difference(idx2)

Index([1, 5], dtype='int64')

In [9]:
# Not a symmetric difference: in pandas 2.0+ `^` is an element-wise XOR
# (1^2 = 3, 2^3 = 1, 3^4 = 7, 4^5 = 1)
idx1 ^ idx2

Index([3, 1, 7, 1], dtype='int64')
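
The symmetric difference can also be composed from `difference` and `union`, which makes its definition explicit. A small sketch (the names `manual` and `via_method` are illustrative, not part of the original notebook):

```python
import pandas as pd

idx1 = pd.Index([1, 2, 3, 4])
idx2 = pd.Index([2, 3, 4, 5])

# Elements only in idx1, joined with elements only in idx2
manual = idx1.difference(idx2).union(idx2.difference(idx1))

# The dedicated method should give the same labels
via_method = idx1.symmetric_difference(idx2)

print(list(manual))
print(manual.equals(via_method))
```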

When performing `Index.union()` between indexes with different dtypes, the indexes must be cast to a common dtype. Typically, though not always, this is object dtype. The exception is a union between integer and float data, where the integer values are converted to float.

In [11]:
idx1 = pd.Index([0, 1, 2])
idx2 = pd.Index([0.5, 1.5])

# Union of integer and float indices
idx1.union(idx2)

Index([0.0, 0.5, 1.0, 1.5, 2.0], dtype='float64')
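
To make the casting rule concrete, the sketch below compares the dtype of an integer/float union with that of an integer/string union. `sort=False` is passed only to avoid attempting to sort values that cannot be compared with each other; the variable names are illustrative.

```python
import pandas as pd

ints = pd.Index([0, 1, 2])
floats = pd.Index([0.5, 1.5])
strings = pd.Index(['a', 'b'])

# Integer and float: the common dtype is float64
numeric_union = ints.union(floats)

# Integer and string: no numeric common dtype, so the result is object
mixed_union = ints.union(strings, sort=False)

print(numeric_union.dtype)
print(mixed_union.dtype)
```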

##### 2. Missing Values in Index

Although an Index can hold missing values (`NaN`), this is best avoided because it can lead to unexpected results. For example, some operations exclude missing values implicitly.

`Index.fillna` fills missing values with a specified scalar value.

In [12]:
idx1 = pd.Index([1, np.nan, 3, 4])
idx1

Index([1.0, nan, 3.0, 4.0], dtype='float64')

In [13]:
# Fill NaN values with 2
idx1.fillna(2)

Index([1.0, 2.0, 3.0, 4.0], dtype='float64')

In [14]:
# DatetimeIndex with NaT
idx2 = pd.DatetimeIndex([pd.Timestamp('2011-01-01'),
                         pd.NaT,
                         pd.Timestamp('2011-01-03')])
idx2

DatetimeIndex(['2011-01-01', 'NaT', '2011-01-03'], dtype='datetime64[ns]', freq=None)

In [15]:
# Fill NaT values with a timestamp
idx2.fillna(pd.Timestamp('2011-01-02'))

DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq=None)
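
Before filling, it can help to check whether an Index contains missing values at all. The sketch below uses `hasnans`, `isna()`, and `dropna()`, and shows one operation (`min()`) that skips `NaN` by default; the names `idx`, `clean`, and `smallest` are illustrative.

```python
import numpy as np
import pandas as pd

idx = pd.Index([1, np.nan, 3, 4])

# Detect missing values
print(idx.hasnans)   # whether any label is missing
print(idx.isna())    # boolean mask, one entry per label

# Drop the missing labels instead of filling them
clean = idx.dropna()
print(list(clean))

# Reductions such as min() skip NaN by default
smallest = idx.min()
print(smallest)
```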

##### 3. Working with MultiIndex

A MultiIndex represents an ordered, tree-like structure of Python objects that provides multiple paths through the index to the same position in the data. It enables storing and manipulating data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).

In [16]:
# Create a MultiIndex from product
index = pd.MultiIndex.from_product([range(3), ['one', 'two']], names=['first', 'second'])
index

MultiIndex([(0, 'one'),
            (0, 'two'),
            (1, 'one'),
            (1, 'two'),
            (2, 'one'),
            (2, 'two')],
           names=['first', 'second'])
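
`from_product` is one of several constructors. As a sketch, the same index can be built from explicit tuples or from one array per level; the names `product`, `from_tuples`, and `from_arrays` are illustrative.

```python
import pandas as pd

# The index from the cell above, built from a Cartesian product
product = pd.MultiIndex.from_product(
    [range(3), ['one', 'two']], names=['first', 'second']
)

# Same index from explicit (first, second) pairs
tuples = [(i, s) for i in range(3) for s in ['one', 'two']]
from_tuples = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

# Same index from one array per level
from_arrays = pd.MultiIndex.from_arrays(
    [[0, 0, 1, 1, 2, 2], ['one', 'two'] * 3], names=['first', 'second']
)

print(product.equals(from_tuples))
print(product.equals(from_arrays))
```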

In [17]:
# Access levels of a MultiIndex
index.levels[1]

Index(['one', 'two'], dtype='object', name='second')

In [18]:
# Set levels of a MultiIndex
index.set_levels(["a", "b"], level=1)

MultiIndex([(0, 'a'),
            (0, 'b'),
            (1, 'a'),
            (1, 'b'),
            (2, 'a'),
            (2, 'b')],
           names=['first', 'second'])
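
The relationship between `levels`, `codes`, and the labels seen on each row can be inspected directly. The following sketch also checks that `set_levels` returns a new object rather than modifying the original; `second_labels` and `relabelled` are illustrative names.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [range(3), ['one', 'two']], names=['first', 'second']
)

# levels: the unique labels of each level
print(list(index.levels[1]))

# codes: for each row, the position of its label within the level
print(list(index.codes[1]))

# get_level_values: the label of every row for one level
second_labels = index.get_level_values('second')
print(list(second_labels))

# set_levels returns a new MultiIndex and leaves the original untouched
relabelled = index.set_levels(['a', 'b'], level=1)
print(list(index.levels[1]), list(relabelled.levels[1]))
```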

##### 4. Reordering Levels with reorder_levels

The `reorder_levels()` method generalizes the `swaplevel` method, allowing you to permute the hierarchical index levels in one step.

In [19]:
# Create a DataFrame with MultiIndex
arrays = [['one', 'one', 'zero', 'zero'], ['y', 'x', 'y', 'x']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples)
df = pd.DataFrame(np.random.randn(4, 2), index=index)
df

               0         1
one  y  0.007759 -0.974291
     x  1.098466 -1.013737
zero y  0.379129 -0.983222
     x  0.852342  1.385500


In [20]:
# Reorder levels
df.reorder_levels([1, 0], axis=0)

               0         1
y one   0.007759 -0.974291
x one   1.098466 -1.013737
y zero  0.379129 -0.983222
x zero  0.852342  1.385500
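
With only two levels, `reorder_levels([1, 0])` and `swaplevel(0, 1)` should produce the same index. The sketch below checks this on a frame with fixed values (`np.arange` is used instead of random data so the result is reproducible; `swapped` and `reordered` are illustrative names).

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('one', 'y'), ('one', 'x'), ('zero', 'y'), ('zero', 'x')]
)
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=index)

# With two levels, swapping and reordering give the same result
swapped = df.swaplevel(0, 1, axis=0)
reordered = df.reorder_levels([1, 0], axis=0)
print(swapped.index.equals(reordered.index))

# Only the level order changes; row order and data stay as they were
print(list(reordered.index))
```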


##### 5. Renaming Names of an Index or MultiIndex

The `rename()` method is used to rename the labels of a MultiIndex, and is typically used to rename the columns of a DataFrame.

In [21]:
# Rename columns
df.rename(columns={0: "col0", 1: "col1"})

            col0      col1
one  y  0.007759 -0.974291
     x  1.098466 -1.013737
zero y  0.379129 -0.983222
     x  0.852342  1.385500


In [22]:
# Rename specific labels of the main index
df.rename(index={"one": "two", "y": "z"})

               0         1
two  z  0.007759 -0.974291
     x  1.098466 -1.013737
zero z  0.379129 -0.983222
     x  0.852342  1.385500


The `rename_axis()` method is used to set the name of an Index or MultiIndex. In particular, the names of the levels of a MultiIndex can be specified.

In [23]:
# Rename axis names
df.rename_axis(index=['abc', 'def'])

                 0         1
abc  def
one  y    0.007759 -0.974291
     x    1.098466 -1.013737
zero y    0.379129 -0.983222
     x    0.852342  1.385500


In [24]:
# Rename column index name
df.rename_axis(columns="Cols").columns

RangeIndex(start=0, stop=2, step=1, name='Cols')
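
The difference between the two methods is easy to blur: `rename()` changes labels, while `rename_axis()` changes level names. A short sketch on a frame with fixed values, also using the `level` argument of `DataFrame.rename` to restrict the mapping to one level (`relabelled` and `named` are illustrative names):

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('one', 'y'), ('one', 'x'), ('zero', 'y'), ('zero', 'x')]
)
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=index)

# rename changes labels; level=0 restricts the mapping to the outer level
relabelled = df.rename(index={'one': 'two'}, level=0)
print(list(relabelled.index.get_level_values(0)))

# rename_axis changes the level names and leaves the labels alone
named = df.rename_axis(index=['abc', 'def'])
print(list(named.index.names))
print(list(named.index) == list(df.index))
```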

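When working with an Index object directly, rather than via a DataFrame, `Index.set_names()` can be used to change the names. For a MultiIndex, `rename()` is an alias of `set_names()`, which is why the cell below can pass `level=0` to it.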

In [25]:
mi = pd.MultiIndex.from_product([[1, 2], ['a', 'b']], names=['x', 'y'])
mi.names

FrozenList(['x', 'y'])

In [26]:
# Rename a specific level
mi2 = mi.rename("new name", level=0)
mi2

MultiIndex([(1, 'a'),
            (1, 'b'),
            (2, 'a'),
            (2, 'b')],
           names=['new name', 'y'])
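
For comparison, here is a sketch of the same kind of renaming written with `set_names()`, once for all levels and once for a single level selected by its current name. The new names `outer` and `inner` are made up for the example.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([[1, 2], ['a', 'b']], names=['x', 'y'])

# Rename every level at once
all_renamed = mi.set_names(['outer', 'inner'])

# Rename a single level, selected by its current name
one_renamed = mi.set_names('inner', level='y')

print(list(all_renamed.names))
print(list(one_renamed.names))
print(list(mi.names))  # the original is left unchanged
```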

You cannot set the names of the MultiIndex via a level directly. This will raise a RuntimeError:

In [27]:
# This will raise an error
# mi.levels[0].name = "name via level"
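
To see the error without interrupting the notebook, the assignment can be wrapped in `try`/`except`. This is a sketch that assumes the behaviour described above (a `RuntimeError` on assignment); the flag `raised` and the name `renamed` are illustrative.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([[1, 2], ['a', 'b']], names=['x', 'y'])

try:
    # Attempt to set the name through the level object
    mi.levels[0].name = "name via level"
    raised = False
except RuntimeError as exc:
    raised = True
    print(f"RuntimeError: {exc}")

# The supported route returns a new MultiIndex with the name changed
renamed = mi.set_names("name via level", level=0)
print(list(renamed.names))
```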

##### 6. Sorting a MultiIndex

For MultiIndex-ed objects to be indexed and sliced effectively, they need to be sorted. As with any index, you can use `sort_index()`.

In [28]:
# Create a list of tuples for MultiIndex
tuples = [('foo', 'one'), ('foo', 'two'), ('bar', 'one'), ('bar', 'two'), ('qux', 'one'), ('qux', 'two')]

# Shuffle the tuples
random.shuffle(tuples)

# Create a Series with MultiIndex
s = pd.Series(np.random.randn(6), index=pd.MultiIndex.from_tuples(tuples))
s

qux  two    1.360318
foo  two   -0.038847
     one   -1.012477
bar  two   -2.048481
qux  one    0.730574
bar  one    0.227564
dtype: float64

In [29]:
# Sort the index
s = s.sort_index()
s

bar  one    0.227564
     two   -2.048481
foo  one   -1.012477
     two   -0.038847
qux  one    0.730574
     two    1.360318
dtype: float64
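
Whether an index is sorted can be checked with `is_monotonic_increasing`, and once it is, label slices across the outer level work as expected. The sketch below uses a fixed, deliberately unsorted order and integer values instead of `random.shuffle` and random data, so the result is reproducible; the variable names are illustrative.

```python
import pandas as pd

# A fixed, deliberately unsorted order
tuples = [('qux', 'two'), ('foo', 'two'), ('foo', 'one'),
          ('bar', 'two'), ('qux', 'one'), ('bar', 'one')]
s = pd.Series(range(6), index=pd.MultiIndex.from_tuples(tuples))

# Not sorted yet
print(s.index.is_monotonic_increasing)

# sort_index returns a new, lexicographically sorted Series
s_sorted = s.sort_index()
print(s_sorted.index.is_monotonic_increasing)

# Label slicing on the outer level of the sorted Series
sliced = s_sorted.loc['bar':'foo']
print(sliced)
```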