In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import random

In [None]:
%reload_ext postcell
%postcell register

# Indexes - hidden hero of Pandas

![](images/dataframes.jpg)

Indexes are often an afterthought for Pandas programmers. Casual users generally don't know much about them, yet they are vitally important to understand.

## What is an index

Recall from the Series lecture that an index is a way to name values, much as dictionary values are looked up by key and list elements are looked up by position.

SQL programmers can think of indexes as primary keys on a table.

When creating a series or a dataframe, if an explicit index is not provided, one is created automatically:

In [None]:
simps = pd.Series(['Homer', 'Marge', 'Lisa'])
simps

In [None]:
simps.values

In [None]:
simps.index

An explicit index can be provided:

In [None]:
simps_i = pd.Series(['Homer', 'Marge', 'Lisa'], index=['Dad', 'Mom', 'Daughter'])
simps_i

In [None]:
simps_i.index
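
The dictionary and list analogy can be made concrete with a small sketch. It reuses the two series above and assumes nothing beyond standard pandas accessors: `.loc` looks a value up by its index label, while `.iloc` looks it up by position.

```python
import pandas as pd

# Same data as above: one series with the default index, one with labels.
simps = pd.Series(['Homer', 'Marge', 'Lisa'])
simps_i = pd.Series(['Homer', 'Marge', 'Lisa'], index=['Dad', 'Mom', 'Daughter'])

# When no index is given, pandas builds a RangeIndex (0, 1, 2, ...).
default_index_type = type(simps.index).__name__

by_label = simps_i.loc['Mom']     # label-based lookup, like a dictionary key
by_position = simps_i.iloc[1]     # position-based lookup, like a list index
```

Both lookups return the same element here because 'Mom' happens to sit at position 1.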

## Indexes or indices?

I don't know. Both spellings turn up, so compare these two searches:

* "indexes" site:pandas.pydata.org
* "indices" site:pandas.pydata.org

## Combining series and dataframes based on indexes

Operations on two series are done by matching indexes:

In [None]:
hml = pd.Series([38, 36, 10], index=['Homer', 'Marge', 'Lisa'])
hml

In [None]:
hmm = pd.Series([38, 36, 2], index=['Homer', 'Marge', 'Maggie'])
hmm

In [None]:
hml + hmm

Notice that since Lisa and Maggie are not in both series, pandas gives us NaN values.

In [None]:
hml.add(hmm, fill_value=0)

With `fill_value` we can supply a default for any index label that appears in one series but not the other.
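
Because an index behaves like a set of labels, it can also be used to work out which labels will match before doing any arithmetic. The following is a minimal sketch using the same two series; `common`, `common_sum` and `unmatched` are names introduced here for illustration.

```python
import pandas as pd

hml = pd.Series([38, 36, 10], index=['Homer', 'Marge', 'Lisa'])
hmm = pd.Series([38, 36, 2], index=['Homer', 'Marge', 'Maggie'])

# Labels present in both series.
common = hml.index.intersection(hmm.index)

# Add only the shared labels, so no NaN appears in the result.
common_sum = hml.loc[common] + hmm.loc[common]

# Labels that appear in only one series; these produce NaN in a plain hml + hmm.
unmatched = hml.index.symmetric_difference(hmm.index)
```

Restricting to the intersection is one alternative to `fill_value` when a missing label should be dropped rather than treated as zero.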

## How are indexes different from series, NumPy arrays, or lists?

In [None]:
idx = pd.Index(['Homer', 'Marge', 'Lisa'])
idx

Unlike almost every other data structure we have seen so far (except tuples), an Index object can't be modified (it is immutable):

In [None]:
idx[0]

In [None]:
idx[0] = 'Flanders'  # raises TypeError because an Index is immutable
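
Since an Index can't be changed in place, the usual approach is to build a new one. The sketch below shows the failed assignment being caught, then two possible ways to get relabelled data; the `ages` series and its values are assumed for illustration.

```python
import pandas as pd

idx = pd.Index(['Homer', 'Marge', 'Lisa'])

# Item assignment on an Index raises TypeError.
try:
    idx[0] = 'Flanders'
    assignment_failed = False
except TypeError:
    assignment_failed = True

# Build a new Index instead of changing the old one.
new_idx = idx.map(lambda name: 'Flanders' if name == 'Homer' else name)

# On a Series, rename returns a copy with the relabelled index.
ages = pd.Series([38, 36, 10], index=idx)
renamed = ages.rename({'Homer': 'Flanders'})
```

In both cases the original `idx` is left untouched.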

## How to not care about indexes

In [None]:
simpsons_2assignments_pd = pd.DataFrame(
    np.random.rand(5, 2) * 100,
    columns=['Assignment 1', 'Assignment 2'],
    index=['Homer', 'Marge', 'Bart', 'Lisa', 'Maggie'],
)
simpsons_2assignments_pd = simpsons_2assignments_pd.round()
simpsons_2assignments_pd

There may be times you don't want to deal with an index differently from normal columns. You can convert an index to a regular column by calling `.reset_index()`:

In [None]:
simpsons_2assignments_pd.reset_index()

Notice that there is now a new column called "**index**", and the old index has been replaced with a default integer index that labels the rows by their position.

In practice, you will want to give your new column a proper name:

In [None]:
idx2col = simpsons_2assignments_pd.reset_index().rename(columns={'index':'Names'})
idx2col

On the other hand, if you want to set a column as an index, use the `set_index` method:

In [None]:
idx2col.set_index('Names')

In [None]:
idx2col.set_index('Names').index

Notice that not only have we reverted to the original table, with names as an index, but the new index has kept the name of the column.
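
Because `reset_index` reuses the index name when there is one, another option is to name the index first and skip the column rename. This is a minimal sketch with fixed, made-up scores so the output is reproducible; `scores`, `as_column` and `round_trip` are illustrative names.

```python
import pandas as pd

# Fixed scores (assumed values) instead of random ones.
scores = pd.DataFrame(
    {'Assignment 1': [55.0, 90.0], 'Assignment 2': [60.0, 95.0]},
    index=['Homer', 'Lisa'],
)

# Name the index first; reset_index then uses that name for the new column.
as_column = scores.rename_axis('Names').reset_index()

# set_index moves the column back, and the index keeps the name 'Names'.
round_trip = as_column.set_index('Names')
```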

## Columns are also an index!

In [None]:
simpsons_2assignments_pd.columns
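
A short sketch of what this implies, using a small frame with assumed scores: the column labels are a `pd.Index` like the row labels, so transposing simply swaps the two, and relabelling columns means assigning a whole new Index.

```python
import pandas as pd

scores = pd.DataFrame(
    {'Assignment 1': [55.0, 90.0], 'Assignment 2': [60.0, 95.0]},
    index=['Homer', 'Lisa'],
)

columns_are_index = isinstance(scores.columns, pd.Index)

# Transposing swaps the two indexes: row labels become column labels.
flipped = scores.T

# Single labels can't be modified, but the whole Index can be replaced.
relabelled = scores.copy()
relabelled.columns = ['A1', 'A2']
```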

## Hierarchical Indexes

Note that this is an advanced feature and you are unlikely to ever have to create such indexes directly. They are generally created for you when using `groupby`.

In [None]:
simpsons_class_assignments_df = pd.DataFrame(
    np.random.rand(10, 2) * 100,
    columns=['Assignment 1', 'Assignment 2'],
)
simpsons_class_assignments_df = simpsons_class_assignments_df.round()
simpsons_class_assignments_df['Names'] = ['Homer', 'Marge', 'Bart', 'Lisa', 'Maggie'] * 2
simpsons_class_assignments_df['Class'] = ['Python', 'Linear Algebra'] * 5

simpsons_class_assignments_df = simpsons_class_assignments_df[['Names', 'Class', 'Assignment 1', 'Assignment 2']].sort_values(by=['Names', 'Class'])
simpsons_class_assignments_df

Given the table above, we have already seen how we can set one column to be the index:

In [None]:
simpsons_class_assignments_df.set_index('Names')

We also know what happens if we run an aggregate function on this data:

In [None]:
simpsons_class_assignments_df.set_index('Names').max()

#### What if our index contains more than one column?

In [None]:
simpsons_class_assignments_df.set_index(['Names', 'Class'])

In [None]:
simpsons_class_assignments_df.set_index(['Names', 'Class']).index
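
Once the index has two levels, rows can be selected by one level or by both. The following is a minimal sketch on a smaller table with made-up scores; `grades` and the result names are introduced here for illustration.

```python
import pandas as pd

# Small fixed table (assumed values) with a two-level index.
grades = pd.DataFrame({
    'Names': ['Homer', 'Homer', 'Lisa', 'Lisa'],
    'Class': ['Linear Algebra', 'Python', 'Linear Algebra', 'Python'],
    'Assignment 1': [40.0, 55.0, 98.0, 90.0],
    'Assignment 2': [45.0, 60.0, 99.0, 95.0],
}).set_index(['Names', 'Class']).sort_index()

homer_rows = grades.loc['Homer']                   # all classes for one name
homer_python = grades.loc[('Homer', 'Python')]     # one row, returned as a Series
python_rows = grades.xs('Python', level='Class')   # one class across all names
```

Selecting on the outer level drops that level from the result, which is why `homer_rows` is indexed by class only.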

What happens if we aggregate a dataframe whose index has two levels?

In [None]:
simpsons_class_assignments_df.set_index(['Names', 'Class']).max()

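We need one more parameter to see the magic of multiple index levels. Note that the `level` argument to aggregations such as `max` was deprecated in pandas 1.3 and removed in pandas 2.0, so the next two cells only run on older versions: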

In [None]:
simpsons_class_assignments_df.set_index(['Names', 'Class']).max(level=0)

In [None]:
simpsons_class_assignments_df.set_index(['Names', 'Class']).max(level=1)

Pandas suggests we do this via `groupby`:

In [None]:
simpsons_class_assignments_df.set_index(['Names', 'Class']).groupby(level=0).max()

In [None]:
simpsons_class_assignments_df.set_index(['Names', 'Class']).groupby(level=1).max()
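
Levels can also be referred to by name rather than by number, which tends to read more clearly. A minimal sketch on a small table with made-up scores (the `grades` frame is assumed for illustration):

```python
import pandas as pd

# Small fixed table (assumed values) with a two-level index.
grades = pd.DataFrame({
    'Names': ['Homer', 'Homer', 'Lisa', 'Lisa'],
    'Class': ['Linear Algebra', 'Python', 'Linear Algebra', 'Python'],
    'Assignment 1': [40.0, 55.0, 98.0, 90.0],
    'Assignment 2': [45.0, 60.0, 99.0, 95.0],
}).set_index(['Names', 'Class'])

# Equivalent to groupby(level=0) and groupby(level=1) on this index.
best_per_name = grades.groupby(level='Names').max()
best_per_class = grades.groupby(level='Class').max()
```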

#### Stack/Unstack

Let's go back to our dataframe with a two-level index:

In [None]:
simpsons_class_assignments_df.set_index(['Names', 'Class'])

Now flip the 'Class' index level to columns:

In [None]:
simpsons_class_assignments_df.set_index(['Names', 'Class']).unstack()

In [None]:
simpsons_class_assignments_df.set_index(['Names', 'Class']).unstack().stack()
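
The round trip can be checked on a small table with made-up scores. This sketch names the level explicitly in both calls; `grades`, `wide` and `long_again` are illustrative names, and recent pandas versions may emit a warning about the `stack` implementation changing.

```python
import pandas as pd

# Small fixed table (assumed values) with a two-level index.
grades = pd.DataFrame({
    'Names': ['Homer', 'Homer', 'Lisa', 'Lisa'],
    'Class': ['Linear Algebra', 'Python', 'Linear Algebra', 'Python'],
    'Assignment 1': [40.0, 55.0, 98.0, 90.0],
    'Assignment 2': [45.0, 60.0, 99.0, 95.0],
}).set_index(['Names', 'Class'])

# unstack moves the 'Class' level from the rows to the columns:
# one row per name, one column per (assignment, class) pair.
wide = grades.unstack('Class')

# stack moves it back: one row per (name, class) pair again.
long_again = wide.stack('Class')
```

After `unstack`, the columns themselves form a two-level index, so a single cell is addressed with a tuple such as `('Assignment 1', 'Python')`.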