In [2]:
# Ignore, this code will be explained later
import os
import sys

module_path = os.path.abspath(os.path.join('../..'))
if module_path not in sys.path: sys.path.append(module_path)

from src import utils

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import random

In [None]:
%reload_ext post_content
%post_content register YOUR_USER_NAME

# Indexes - hidden hero of Pandas

![](images/dataframes.jpg)

Indexes are often an afterthought for Pandas programmers. Casual users generally don't know much about them, yet they are vitally important to understand.

## What is an index

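Recall from the Series lecture that an index is a way to name values. Just as dictionary values are looked up by key and list elements are looked up by position, the values of a series are looked up by their index label.

SQL programmers can think of indexes as primary keys on a table, with one caveat: pandas does not require index values to be unique.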

When creating a series or a dataframe, if an explicit index is not provided, one is created automatically:

In [4]:
simps = pd.Series(['Homer', 'Marge', 'Lisa'])
simps

0    Homer
1    Marge
2     Lisa
dtype: object

In [5]:
simps.values

array(['Homer', 'Marge', 'Lisa'], dtype=object)

In [6]:
simps.index

RangeIndex(start=0, stop=3, step=1)

An explicit index can be provided:

In [8]:
simps_i = pd.Series(['Homer', 'Marge', 'Lisa'], index=['Dad', 'Mom', 'Daughter'])
simps_i

Dad         Homer
Mom         Marge
Daughter     Lisa
dtype: object

In [9]:
simps_i.index

Index(['Dad', 'Mom', 'Daughter'], dtype='object')
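
To make the dictionary and list analogy concrete, here is a small sketch of looking values up by label and by position. It rebuilds the `simps_i` series from above; `.loc` and `.iloc` are the standard pandas accessors, while the names `by_label`, `by_position` and `parents` are only illustrative.

```python
import pandas as pd

simps_i = pd.Series(['Homer', 'Marge', 'Lisa'], index=['Dad', 'Mom', 'Daughter'])

# Label-based lookup: behaves like a dictionary key
by_label = simps_i.loc['Mom']

# Position-based lookup: behaves like a list index
by_position = simps_i.iloc[1]

# A list of labels returns a new, smaller Series
parents = simps_i.loc[['Dad', 'Mom']]

print(by_label, by_position)  # Marge Marge
print(parents)
```

Both lookups return the same value here because 'Mom' happens to sit at position 1.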

## Indexes or indices?

Both are accepted plurals. This lecture uses "indexes".

## Combining series and dataframes based on indexes

Operations on two series are done by matching indexes:

In [13]:
hml = pd.Series([38, 36, 10], index=['Homer', 'Marge', 'Lisa'])
hml

Homer    38
Marge    36
Lisa     10
dtype: int64

In [14]:
hmm = pd.Series([38, 36, 2], index=['Homer', 'Marge', 'Maggie'])
hmm

Homer     38
Marge     36
Maggie     2
dtype: int64

In [15]:
hml + hmm

Homer     76.0
Lisa       NaN
Maggie     NaN
Marge     72.0
dtype: float64

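Notice that because Lisa and Maggie each appear in only one of the two series, pandas gives us NaN values for them. The result is also converted to float64, since NaN is a floating-point value.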

In [17]:
hml.add(hmm, fill_value=0)

Homer     76.0
Lisa      10.0
Maggie     2.0
Marge     72.0
dtype: float64

The `fill_value` argument of `.add()` lets us control the default value used when an index element appears in one series but not the other.
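
The same label matching applies to dataframes, where rows are aligned on the index and columns on their names. A minimal sketch, using made-up donut counts (the `week1` and `week2` frames are hypothetical and not part of the lecture data):

```python
import pandas as pd

# Hypothetical data: donuts eaten per person in two different weeks
week1 = pd.DataFrame({'Donuts': [3, 0]}, index=['Homer', 'Lisa'])
week2 = pd.DataFrame({'Donuts': [5, 1]}, index=['Homer', 'Maggie'])

# Plain addition: labels present in only one frame become NaN
plain = week1 + week2

# .add() with fill_value treats a missing label as 0 instead
total = week1.add(week2, fill_value=0)

print(plain)
print(total)
```

Only Homer appears in both frames, so he is the only row with a number in `plain`, while `total` keeps every person.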

## How are indexes different from series, numpy arrays, or lists?

In [21]:
idx = pd.Index(['Homer', 'Marge', 'Lisa'])
idx

Index(['Homer', 'Marge', 'Lisa'], dtype='object')

Unlike almost every other data structure we have seen so far (except tuples), an Index object can't be modified (it is immutable):

In [22]:
idx[0]

'Homer'

In [23]:
idx[0] = 'Flanders'

TypeError: Index does not support mutable operations
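
Since an Index can't be changed in place, one common workaround is to build a new Index from modified labels. A minimal sketch of that idea; the names `labels`, `new_idx` and `with_bart` are illustrative only:

```python
import pandas as pd

idx = pd.Index(['Homer', 'Marge', 'Lisa'])

# Copy the labels into a plain (mutable) list, edit it, and wrap it again
labels = idx.tolist()
labels[0] = 'Flanders'
new_idx = pd.Index(labels)

# Set-like operations also return a new Index and leave the original alone
with_bart = idx.union(pd.Index(['Bart']))

print(idx)       # the original is unchanged
print(new_idx)
print(with_bart)
```

The original `idx` still starts with 'Homer' afterwards, which is exactly what immutability promises.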

## How to not care about indexes

In [25]:
simpsons_2assignments_pd = pd.DataFrame(
    np.random.rand(5, 2) * 100,
    columns=['Assignment 1', 'Assignment 2'],
    index=['Homer', 'Marge', 'Bart', 'Lisa', 'Maggie'],
)
simpsons_2assignments_pd = simpsons_2assignments_pd.round()
simpsons_2assignments_pd

        Assignment 1  Assignment 2
Homer           91.0          26.0
Marge           23.0          61.0
Bart            54.0          48.0
Lisa            80.0          19.0
Maggie          97.0          89.0


There may be times you don't want to deal with an index differently from normal columns. You can convert an index to a regular column by calling `.reset_index()`:

In [26]:
simpsons_2assignments_pd.reset_index()

    index  Assignment 1  Assignment 2
0   Homer          91.0          26.0
1   Marge          23.0          61.0
2    Bart          54.0          48.0
3    Lisa          80.0          19.0
4  Maggie          97.0          89.0


Notice that there is now a new column called "**index**", and the old index has been replaced with a default integer index that simply numbers the rows by position.

In practice, you will want to give your new column a proper name:

In [33]:
idx2col = simpsons_2assignments_pd.reset_index().rename(columns={'index':'Names'})
idx2col

    Names  Assignment 1  Assignment 2
0   Homer          91.0          26.0
1   Marge          23.0          61.0
2    Bart          54.0          48.0
3    Lisa          80.0          19.0
4  Maggie          97.0          89.0
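
A possible alternative is to name the index first with `.rename_axis()`, so that `.reset_index()` produces a properly named column in one step. A sketch on a cut-down, two-row copy of the table (the `scores` frame is a hypothetical stand-in, since the original values are random):

```python
import pandas as pd

# Hypothetical two-row stand-in for the assignments table
scores = pd.DataFrame(
    {'Assignment 1': [91.0, 23.0], 'Assignment 2': [26.0, 61.0]},
    index=['Homer', 'Marge'],
)

# Name the index, then move it into a regular column
named = scores.rename_axis('Names').reset_index()

print(named.columns.tolist())  # ['Names', 'Assignment 1', 'Assignment 2']
```

Either route gives the same table; naming the index up front simply avoids a column called "index" ever appearing.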


On the other hand, if you want to set a column as the index, use the `.set_index()` method:

In [34]:
idx2col.set_index('Names')

        Assignment 1  Assignment 2
Names
Homer           91.0          26.0
Marge           23.0          61.0
Bart            54.0          48.0
Lisa            80.0          19.0
Maggie          97.0          89.0


In [35]:
idx2col.set_index('Names').index

Index(['Homer', 'Marge', 'Bart', 'Lisa', 'Maggie'], dtype='object', name='Names')

Notice that not only have we reverted to the original table, with names as the index, but the new index has also kept the name of the column.

## Columns are also an index!

In [32]:
simpsons_2assignments_pd.columns

Index(['Assignment 1', 'Assignment 2'], dtype='object')
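
Because the column labels are an Index, they behave like the row index: they are immutable, and transposing a dataframe simply swaps the two. A small sketch on a hypothetical two-row `scores` frame (not the lecture's random data):

```python
import pandas as pd

# Hypothetical two-row stand-in for the assignments table
scores = pd.DataFrame(
    {'Assignment 1': [91.0, 23.0], 'Assignment 2': [26.0, 61.0]},
    index=['Homer', 'Marge'],
)

# Transposing swaps the row index and the column index
flipped = scores.T

# Column labels can't be edited in place, just like row labels
try:
    scores.columns[0] = 'Quiz 1'
except TypeError as err:
    error_message = str(err)

# .rename() returns a new dataframe with a new column Index instead
renamed = scores.rename(columns={'Assignment 1': 'Quiz 1'})

print(flipped.index.tolist())    # ['Assignment 1', 'Assignment 2']
print(renamed.columns.tolist())  # ['Quiz 1', 'Assignment 2']
```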

## Hierarchical Indexes

In [46]:
simpsons_class_assignments_df = pd.DataFrame(
    np.random.rand(10, 2) * 100,
    columns=['Assignment 1', 'Assignment 2'],
)
simpsons_class_assignments_df = simpsons_class_assignments_df.round()
simpsons_class_assignments_df['Names'] = ['Homer', 'Marge', 'Bart', 'Lisa', 'Maggie'] * 2
simpsons_class_assignments_df['Class'] = ['Python', 'Linear Algebra'] * 5

simpsons_class_assignments_df = simpsons_class_assignments_df[['Names', 'Class', 'Assignment 1', 'Assignment 2']].sort_values(by=['Names', 'Class'])
simpsons_class_assignments_df

    Names           Class  Assignment 1  Assignment 2
7    Bart  Linear Algebra          30.0          11.0
2    Bart          Python          16.0           6.0
5   Homer  Linear Algebra          61.0          97.0
0   Homer          Python          19.0           1.0
3    Lisa  Linear Algebra          97.0          79.0
8    Lisa          Python          62.0          79.0
9  Maggie  Linear Algebra          39.0          96.0
4  Maggie          Python          10.0          28.0
1   Marge  Linear Algebra          32.0          18.0
6   Marge          Python          59.0          78.0


Given the table above, we have already seen how we can set one column to be the index:

In [58]:
simpsons_class_assignments_df.set_index('Names')

                 Class  Assignment 1  Assignment 2
Names
Bart    Linear Algebra          30.0          11.0
Bart            Python          16.0           6.0
Homer   Linear Algebra          61.0          97.0
Homer           Python          19.0           1.0
Lisa    Linear Algebra          97.0          79.0
Lisa            Python          62.0          79.0
Maggie  Linear Algebra          39.0          96.0
Maggie          Python          10.0          28.0
Marge   Linear Algebra          32.0          18.0
Marge           Python          59.0          78.0


We also know what happens if we run an aggregate function on this data:

In [59]:
simpsons_class_assignments_df.set_index('Names').max()

Class           Python
Assignment 1        97
Assignment 2        97
dtype: object

#### What if our index contains more than one column?

In [60]:
simpsons_class_assignments_df.set_index(['Names', 'Class'])

                       Assignment 1  Assignment 2
Names  Class
Bart   Linear Algebra          30.0          11.0
       Python                  16.0           6.0
Homer  Linear Algebra          61.0          97.0
       Python                  19.0           1.0
Lisa   Linear Algebra          97.0          79.0
       Python                  62.0          79.0
Maggie Linear Algebra          39.0          96.0
       Python                  10.0          28.0
Marge  Linear Algebra          32.0          18.0
       Python                  59.0          78.0


In [61]:
simpsons_class_assignments_df.set_index(['Names', 'Class']).index

MultiIndex([(  'Bart', 'Linear Algebra'),
            (  'Bart',         'Python'),
            ( 'Homer', 'Linear Algebra'),
            ( 'Homer',         'Python'),
            (  'Lisa', 'Linear Algebra'),
            (  'Lisa',         'Python'),
            ('Maggie', 'Linear Algebra'),
            ('Maggie',         'Python'),
            ( 'Marge', 'Linear Algebra'),
            ( 'Marge',         'Python')],
           names=['Names', 'Class'])
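
Each entry of a MultiIndex is a tuple, which also determines how rows are selected. The sketch below shows three common lookups on a cut-down copy of the table (the `grades` frame reuses the Bart and Homer rows shown above; the variable names are illustrative):

```python
import pandas as pd

# Cut-down copy of the class assignments table (Bart and Homer only)
grades = pd.DataFrame(
    {
        'Names': ['Bart', 'Bart', 'Homer', 'Homer'],
        'Class': ['Linear Algebra', 'Python', 'Linear Algebra', 'Python'],
        'Assignment 1': [30.0, 16.0, 61.0, 19.0],
        'Assignment 2': [11.0, 6.0, 97.0, 1.0],
    }
).set_index(['Names', 'Class'])

# Outer level only: all of Bart's rows, now indexed by Class
bart = grades.loc['Bart']

# Full key as a tuple: a single row, returned as a Series
bart_python = grades.loc[('Bart', 'Python')]

# Inner level only: a cross-section over every name
python_rows = grades.xs('Python', level='Class')

print(bart)
print(bart_python)
print(python_rows)
```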

What happens if we aggregate a dataframe with a two-level index?

In [62]:
simpsons_class_assignments_df.set_index(['Names', 'Class']).max()

Assignment 1    97.0
Assignment 2    97.0
dtype: float64

To see the magic of multiple indexes, we group by one index level before aggregating. (Older pandas versions accepted `.max(level=0)` for this, but the `level` argument was deprecated in pandas 1.3 and removed in 2.0.)

In [65]:
simpsons_class_assignments_df.set_index(['Names', 'Class']).groupby(level=0).max()

        Assignment 1  Assignment 2
Names
Bart            30.0          11.0
Homer           61.0          97.0
Lisa            97.0          79.0
Maggie          39.0          96.0
Marge           59.0          78.0


In [66]:
simpsons_class_assignments_df.set_index(['Names', 'Class']).groupby(level=1).max()

                Assignment 1  Assignment 2
Class
Linear Algebra          97.0          97.0
Python                  62.0          79.0


We can now aggregate values per index level!
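
Index levels can also be referred to by name rather than by number, and any aggregate works, not just the maximum. A minimal sketch using `groupby(level=...)` on a cut-down copy of the table (the `grades` frame reuses the Bart and Homer rows shown earlier):

```python
import pandas as pd

# Cut-down copy of the class assignments table (Bart and Homer only)
grades = pd.DataFrame(
    {
        'Names': ['Bart', 'Bart', 'Homer', 'Homer'],
        'Class': ['Linear Algebra', 'Python', 'Linear Algebra', 'Python'],
        'Assignment 1': [30.0, 16.0, 61.0, 19.0],
        'Assignment 2': [11.0, 6.0, 97.0, 1.0],
    }
).set_index(['Names', 'Class'])

# Average score per student, across both classes
by_name = grades.groupby(level='Names').mean()

# Average score per class, across both students
by_class = grades.groupby(level='Class').mean()

print(by_name)
print(by_class)
```

Using the level name keeps the code readable if the order of the index levels ever changes.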

#### Stack/Unstack

Let's go back to our dataframe with a two-level index:

In [67]:
simpsons_class_assignments_df.set_index(['Names', 'Class'])

                       Assignment 1  Assignment 2
Names  Class
Bart   Linear Algebra          30.0          11.0
       Python                  16.0           6.0
Homer  Linear Algebra          61.0          97.0
       Python                  19.0           1.0
Lisa   Linear Algebra          97.0          79.0
       Python                  62.0          79.0
Maggie Linear Algebra          39.0          96.0
       Python                  10.0          28.0
Marge  Linear Algebra          32.0          18.0
       Python                  59.0          78.0


Now move the 'Class' index level into the columns:

In [68]:
simpsons_class_assignments_df.set_index(['Names', 'Class']).unstack()

          Assignment 1            Assignment 2
Class   Linear Algebra  Python  Linear Algebra  Python
Names
Bart              30.0    16.0            11.0     6.0
Homer             61.0    19.0            97.0     1.0
Lisa              97.0    62.0            79.0    79.0
Maggie            39.0    10.0            96.0    28.0
Marge             32.0    59.0            18.0    78.0


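Notice that each row is now a single student, and each assignment is split into one column per class. Aggregating this table therefore gives one result per assignment and class: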

In [69]:
simpsons_class_assignments_df.set_index(['Names', 'Class']).unstack().max()

              Class         
Assignment 1  Linear Algebra    97.0
              Python            62.0
Assignment 2  Linear Algebra    97.0
              Python            79.0
dtype: float64
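
Going the other way, `.stack()` moves a column level back into the row index, so it undoes `.unstack()`. A minimal sketch of the round trip on a cut-down copy of the table (the `grades`, `wide` and `tall` names are illustrative; recent pandas 2.x releases may print a FutureWarning about the stack implementation, which does not affect the result here):

```python
import pandas as pd

# Cut-down copy of the class assignments table (Bart and Homer only)
grades = pd.DataFrame(
    {
        'Names': ['Bart', 'Bart', 'Homer', 'Homer'],
        'Class': ['Linear Algebra', 'Python', 'Linear Algebra', 'Python'],
        'Assignment 1': [30.0, 16.0, 61.0, 19.0],
        'Assignment 2': [11.0, 6.0, 97.0, 1.0],
    }
).set_index(['Names', 'Class'])

# unstack: the innermost row level (Class) becomes a column level
wide = grades.unstack()

# stack: the innermost column level (Class) becomes a row level again
tall = wide.stack()

print(wide.shape)  # (2, 4): one row per student
print(tall.shape)  # (4, 2): one row per student and class
```

After the round trip, `tall` holds the same values as `grades`, indexed by the same (Names, Class) pairs.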