<img src="https://snipboard.io/Kx6OAi.jpg">

# Session 3. Advanced Pandas: Multi-indexing
<div style="margin-top: -20px;">Author:  David Yerrington</div>

## Learning Objectives

- Define what a multi-index is
- Describe common methods for accessing data by multiple indices

### Prerequisite Knowledge
- Basic Pandas
  - Difference between a Series and a DataFrame
  - Bitmasks, query function, selecting data
  - Aggregations

## Environment Setup

Don't forget to set up your Python environment from [the setup guide](../environment.md) if you haven't done so yet.


In [5]:
import pandas as pd, numpy as np

## Load some Pokemon data

In [6]:
pokemon = pd.read_csv("../data/pokemon.csv", encoding = "utf8")

## 1. Let's Create a DataFrame with a Multi-Index

Much of working with a multi-index is dealing with one that results from Pandas transformations such as `.groupby` and `.agg`.  Let's look at one.

In [10]:
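# numeric_only=True skips text columns, which newer Pandas versions no longer drop automatically
poketypes = pokemon.groupby(["Type 1", "Type 2"]).mean(numeric_only=True)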

## 2. What is a Multi-Index

One of the most common forms of multi-index is the result of a `.groupby` with more than one grouping criterion.  In this grouped result, all Pokemon are grouped by their primary type first, then by their secondary type, before aggregation.  In this case we used `.mean` to aggregate the resulting subsets.

For example:
- There are 69 Pokemon in the first "bug" subset
- Of the 69 bug Pokemon, only 2 are "electric" type
- The inputs to our mean function are the values 50 and 70.  The mean of those values is 60.

The same logic applies to the rest of the dataset.  The multi-index on the left side of the DataFrame describes the relationship between the groups, and it is also literally the set of group keys produced by the aggregation.
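
To see the mechanics on something small, here is a minimal sketch that repeats the grouping on a toy frame.  The `toy` rows and values are made up for illustration (they are not taken from `pokemon.csv`), but the Bug/Electric pair reuses the 50 and 70 from the example above.

```python
import pandas as pd

# Toy stand-in for the Pokemon data: these rows and values are made up.
toy = pd.DataFrame({
    "Type 1": ["Bug", "Bug", "Bug", "Fire", "Fire", "Water"],
    "Type 2": ["Electric", "Electric", "Flying", "Flying", "Ground", "Ground"],
    "HP": [50, 70, 70, 78, 90, 100],
    "Defense": [50, 60, 40, 78, 70, 85],
})

# Grouping by two columns produces a two-level row index
toy_types = toy.groupby(["Type 1", "Type 2"]).mean()
print(toy_types)

# The Bug/Electric subset holds HP values 50 and 70, so its mean is 60.0
bug_electric_hp = toy_types.loc[("Bug", "Electric"), "HP"]
print(bug_electric_hp)
```

The two grouping columns leave the column space and become the two levels of the row index.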

**As an option, we could flatten the index by pushing its values into the row space:**
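
A minimal sketch of that flattening, assuming a small made-up `toy` frame in place of the real data: `.reset_index()` turns each index level back into an ordinary column.

```python
import pandas as pd

# Toy stand-in for the Pokemon data: these rows and values are made up.
toy = pd.DataFrame({
    "Type 1": ["Bug", "Bug", "Bug", "Fire", "Fire", "Water"],
    "Type 2": ["Electric", "Electric", "Flying", "Flying", "Ground", "Ground"],
    "HP": [50, 70, 70, 78, 90, 100],
    "Defense": [50, 60, 40, 78, 70, 85],
})
toy_types = toy.groupby(["Type 1", "Type 2"]).mean()

# Push both index levels back into regular columns
flat = toy_types.reset_index()
print(flat.columns.tolist())
print(flat)
```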

**But... we should really learn to work with multi-indexing at the row level!**
- It's descriptive
- It allows easy selection of data
- It gives us flexibility through new methods for working with our data

## Various Accessors

**`.index`**
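
The sketch below shows a few things worth inspecting on `.index` once it is a `MultiIndex`.  It runs on hypothetical toy data (the `toy` frame and its values are invented for this example), not on the real Pokemon file.

```python
import pandas as pd

# Toy stand-in for the Pokemon data: these rows and values are made up.
toy = pd.DataFrame({
    "Type 1": ["Bug", "Bug", "Bug", "Fire", "Fire", "Water"],
    "Type 2": ["Electric", "Electric", "Flying", "Flying", "Ground", "Ground"],
    "HP": [50, 70, 70, 78, 90, 100],
    "Defense": [50, 60, 40, 78, 70, 85],
})
toy_types = toy.groupby(["Type 1", "Type 2"]).mean()

idx = toy_types.index
print(type(idx).__name__)               # the index is a MultiIndex
print(idx.names)                        # one name per level
print(idx.nlevels)                      # number of levels
print(idx.tolist())                     # each row label is a tuple
print(idx.get_level_values("Type 2"))   # labels of a single level
```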

### `.loc[rows, columns]`

Remember that `.loc` takes a row reference as its first indexer.  With multi-indexes, `.loc` behaves a little differently.

`.loc[([primary/first subset name])]`

`.loc[([1st subset name, 2nd subset name])]`
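
As a rough illustration of both patterns, here is a sketch on made-up toy data (the `toy` frame is an assumption for this example).  Passing `:` for the columns makes it explicit that the tuple refers to row levels.

```python
import pandas as pd

# Toy stand-in for the Pokemon data: these rows and values are made up.
toy = pd.DataFrame({
    "Type 1": ["Bug", "Bug", "Bug", "Fire", "Fire", "Water"],
    "Type 2": ["Electric", "Electric", "Flying", "Flying", "Ground", "Ground"],
    "HP": [50, 70, 70, 78, 90, 100],
    "Defense": [50, 60, 40, 78, 70, 85],
})
toy_types = toy.groupby(["Type 1", "Type 2"]).mean()

# First level only: every secondary type under "Bug" (the "Type 1" level is dropped)
bug_rows = toy_types.loc["Bug"]
print(bug_rows)

# Both levels: a single (primary, secondary) pair comes back as a Series
bug_electric = toy_types.loc[("Bug", "Electric"), :]
print(bug_electric)
```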

### All secondary types that exist, regardless of their primary group designation
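
One way to do this is to put `slice(None)` in the first position of the row tuple, which means "any primary type".  The sketch below uses invented toy data (the `toy` frame is hypothetical) to pick every row whose secondary type is "Flying".

```python
import pandas as pd

# Toy stand-in for the Pokemon data: these rows and values are made up.
toy = pd.DataFrame({
    "Type 1": ["Bug", "Bug", "Bug", "Fire", "Fire", "Water"],
    "Type 2": ["Electric", "Electric", "Flying", "Flying", "Ground", "Ground"],
    "HP": [50, 70, 70, 78, 90, 100],
    "Defense": [50, 60, 40, 78, 70, 85],
})
toy_types = toy.groupby(["Type 1", "Type 2"]).mean()

# slice(None) matches every label in the first level;
# "Flying" filters the second level
flying = toy_types.loc[(slice(None), "Flying"), :]
print(flying)
```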

> **Review:  What does `slice()` do?**
>
> Slices simplify the process of accessing sequential types in Python and in NumPy.  A slice object is its own type, created with `slice(start, stop, step)`; `slice(None)` is the object form of a bare `:` and selects everything.  With Pandas multi-indexing and string index keys, Pandas uses slice objects internally to reference rows.

### Slicing by multiple primary and secondary groups
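
A short sketch of this, assuming made-up toy data in place of the real frame: each position in the row tuple can hold a list of labels, and only the combinations that actually exist are returned.

```python
import pandas as pd

# Toy stand-in for the Pokemon data: these rows and values are made up.
toy = pd.DataFrame({
    "Type 1": ["Bug", "Bug", "Bug", "Fire", "Fire", "Water"],
    "Type 2": ["Electric", "Electric", "Flying", "Flying", "Ground", "Ground"],
    "HP": [50, 70, 70, 78, 90, 100],
    "Defense": [50, 60, 40, 78, 70, 85],
})
toy_types = toy.groupby(["Type 1", "Type 2"]).mean()

# Primary type is Bug or Fire AND secondary type is Flying or Ground.
# There is no (Bug, Ground) row in the toy data, so it is simply absent.
subset = toy_types.loc[(["Bug", "Fire"], ["Flying", "Ground"]), :]
print(subset)
```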

> **Also related:  Cross-section**
>
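> Using `.xs`, you can also query subsets in the same way but with a few conveniences such as:
> - Selecting on a named level with the `level` argument
> - Dropping the selected level from the result, or keeping it with `drop_level=False`
>
> Note that `.xs` is for reading data only; use `.loc` to set values.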
>
> Read more about cross section features in Pandas in [the docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#cross-section).
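
For comparison, here is a small `.xs` sketch on hypothetical toy data (the `toy` frame is invented for this example).  It selects the "Flying" secondary type by naming the level, first with the default behavior and then keeping the selected level.

```python
import pandas as pd

# Toy stand-in for the Pokemon data: these rows and values are made up.
toy = pd.DataFrame({
    "Type 1": ["Bug", "Bug", "Bug", "Fire", "Fire", "Water"],
    "Type 2": ["Electric", "Electric", "Flying", "Flying", "Ground", "Ground"],
    "HP": [50, 70, 70, 78, 90, 100],
    "Defense": [50, 60, 40, 78, 70, 85],
})
toy_types = toy.groupby(["Type 1", "Type 2"]).mean()

# By default the selected level is dropped from the result
flying = toy_types.xs("Flying", level="Type 2")
print(flying)

# drop_level=False keeps the full two-level index
flying_full = toy_types.xs("Flying", level="Type 2", drop_level=False)
print(flying_full)
```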

### Selecting Multiple Values with `.query()`

This approach is a little roundabout, but it works just fine.  Within a particular level such as `Type 1` (the first level), we can select specific subsets with the `.query()` method like so:

```python
index_values = ["value 1", "value 2"]  # list of index values to keep
df.query("`index name` in @index_values")
```

In [1]:
subsets = ['Bug', 'Fire', 'Dragon']
# code here
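
One possible way to fill in the cell above, sketched on made-up toy data rather than the real frame (the `toy` rows are an assumption).  The backticks are needed because the level name `Type 1` contains a space.

```python
import pandas as pd

# Toy stand-in for the Pokemon data: these rows and values are made up.
toy = pd.DataFrame({
    "Type 1": ["Bug", "Bug", "Bug", "Fire", "Fire", "Water"],
    "Type 2": ["Electric", "Electric", "Flying", "Flying", "Ground", "Ground"],
    "HP": [50, 70, 70, 78, 90, 100],
    "Defense": [50, 60, 40, 78, 70, 85],
})
toy_types = toy.groupby(["Type 1", "Type 2"]).mean()

subsets = ["Bug", "Fire", "Dragon"]

# Keep rows whose first-level label is in the list;
# "Dragon" does not occur in the toy data and simply matches nothing
picked = toy_types.query("`Type 1` in @subsets")
print(picked)
```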

## Multi-Level Columns

Let's explore the next level of multi-dimensional data.  It's really common to use multiple aggregations when grouping data into multiple groups.  Even seasoned data practitioners glaze over when they see these types of DataFrames in a notebook during a team meeting and complain that they're too difficult to work with.  Some might argue that using Pandas and writing code to select rows and columns is more difficult than, say, Excel or Tableau, but that doesn't mean there isn't a time and place where it's useful.

**Let's take our multi-index grouped Pokemon dataset, and create a few aggregations.**
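
A minimal sketch of what that could look like, assuming invented toy data in place of `pokemon.csv`.  Passing a list of aggregation names to `.agg` gives every original column its own set of sub-columns, so the columns become a two-level index as well.

```python
import pandas as pd

# Toy stand-in for the Pokemon data: these rows and values are made up.
toy = pd.DataFrame({
    "Type 1": ["Bug", "Bug", "Bug", "Fire", "Fire", "Water"],
    "Type 2": ["Electric", "Electric", "Flying", "Flying", "Ground", "Ground"],
    "HP": [50, 70, 70, 78, 90, 100],
    "Defense": [50, 60, 40, 78, 70, 85],
})

# Three aggregations per column: the columns are now (column, aggregation) pairs
toy_multi = toy.groupby(["Type 1", "Type 2"]).agg(["min", "mean", "max"])
print(toy_multi)
print(toy_multi.columns.tolist())
```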



### Question:  Based on how selecting rows by groups works, what do you think the code looks like for selecting columns by hierarchy?

In [2]:
# inspect .columns

In [3]:
# Select columns by level and column
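
One possible answer, sketched on made-up toy data (the `toy` frame and the `toy_multi` name are assumptions for this example): column selection mirrors row selection, with the tuple placed in the second position of `.loc`.

```python
import pandas as pd

# Toy stand-in for the Pokemon data: these rows and values are made up.
toy = pd.DataFrame({
    "Type 1": ["Bug", "Bug", "Bug", "Fire", "Fire", "Water"],
    "Type 2": ["Electric", "Electric", "Flying", "Flying", "Ground", "Ground"],
    "HP": [50, 70, 70, 78, 90, 100],
    "Defense": [50, 60, 40, 78, 70, 85],
})
toy_multi = toy.groupby(["Type 1", "Type 2"]).agg(["min", "mean", "max"])

# Everything under one top-level column
hp_all = toy_multi["HP"]

# One specific (top level, second level) column, returned as a Series
hp_max = toy_multi.loc[:, ("HP", "max")]

# Several top-level columns crossed with several aggregations
min_max = toy_multi.loc[:, (["HP", "Defense"], ["min", "max"])]
print(min_max)
```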

### 3. Sorting

Sorting follows the same selection rules: with multi-level columns or indices, refer to a column by its full tuple, such as `(1st level name, 2nd level column)`.

> - `ascending` order is from __least__ to __greatest__, i.e. __a->z__ or __0->100__.
> - `ascending = False` is the reverse order, from __greatest__ to __least__, i.e. __z->a__ or __100->0__.

In [159]:
## first example:
# ascending = False
## Second example:
# ascending = [False, True]
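
Both examples could look roughly like this on hypothetical toy data (the values are invented, and two groups deliberately tie on maximum HP so the second sort key has something to do).

```python
import pandas as pd

# Toy stand-in for the Pokemon data: these rows and values are made up.
toy = pd.DataFrame({
    "Type 1": ["Bug", "Bug", "Bug", "Fire", "Fire", "Water"],
    "Type 2": ["Electric", "Electric", "Flying", "Flying", "Ground", "Ground"],
    "HP": [50, 70, 70, 78, 90, 100],
    "Defense": [50, 60, 40, 78, 70, 85],
})
toy_multi = toy.groupby(["Type 1", "Type 2"]).agg(["min", "mean", "max"])

# First example: one column, greatest to least
by_hp = toy_multi.sort_values(("HP", "max"), ascending=False)

# Second example: max HP descending, ties broken by max Defense ascending
by_two = toy_multi.sort_values(
    [("HP", "max"), ("Defense", "max")],
    ascending=[False, True],
)
print(by_two)
```

In the second result, the two Bug groups share a maximum HP of 70, so the one with the lower maximum Defense comes first.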

## Flattening Columns

### `.stack()`

You can move your columns to indices at any level.  By default, `.stack()` moves your lowest column level over to the innermost level of the row index.
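
A small sketch of the default behavior, assuming made-up toy data: the aggregation names (`min`, `mean`, `max`) move from the columns into a third row-index level.  Depending on the Pandas version, `.stack()` may emit a warning about its newer implementation and may order the result differently, so the example avoids relying on row order.

```python
import pandas as pd

# Toy stand-in for the Pokemon data: these rows and values are made up.
toy = pd.DataFrame({
    "Type 1": ["Bug", "Bug", "Bug", "Fire", "Fire", "Water"],
    "Type 2": ["Electric", "Electric", "Flying", "Flying", "Ground", "Ground"],
    "HP": [50, 70, 70, 78, 90, 100],
    "Defense": [50, 60, 40, 78, 70, 85],
})
toy_multi = toy.groupby(["Type 1", "Type 2"]).agg(["min", "mean", "max"])

# Move the lowest column level (the aggregation names) into the row index
stacked = toy_multi.stack()
print(stacked)

# Rows are now addressed by three labels, columns by one
print(stacked.loc[("Bug", "Electric", "max"), "HP"])
```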

## Join them using a basic comprehension
Sometimes you have to write a little code yourself instead of relying on built-in Pandas functions.  Let's write an example.
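
Here is one way such a comprehension might look, using invented toy data and an underscore as the separator (both are choices made for this sketch).  Each multi-level column label is a tuple of strings, so `"_".join(...)` collapses it into a single name.

```python
import pandas as pd

# Toy stand-in for the Pokemon data: these rows and values are made up.
toy = pd.DataFrame({
    "Type 1": ["Bug", "Bug", "Bug", "Fire", "Fire", "Water"],
    "Type 2": ["Electric", "Electric", "Flying", "Flying", "Ground", "Ground"],
    "HP": [50, 70, 70, 78, 90, 100],
    "Defense": [50, 60, 40, 78, 70, 85],
})
toy_multi = toy.groupby(["Type 1", "Type 2"]).agg(["min", "mean", "max"])

# Work on a copy so the multi-level version stays available
flat_cols = toy_multi.copy()

# ("HP", "min") becomes "HP_min", and so on
flat_cols.columns = ["_".join(col) for col in toy_multi.columns]
print(flat_cols.columns.tolist())
```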


# Summary

### Multi-indexing comprises both row and column index DataFrame structures.

- Multi-indices are commonly the result of `.groupby` with multiple keys and of multiple aggregations.
- `df.index` becomes multidimensional.
- Select multi-index subsets by name: `.loc[([1st subset name, 2nd subset name])]`
    - Select only 2nd-level subsets by value:  `.loc[(slice(None), 2nd subset name), [list of columns or : for all]]`
    - Select specific level subsets by list: ``.query("`index name` in @index_values")``
- Select multi-level columns by name: `.loc[:, ([1st level name, 2nd level column])]`
- Sort by multi-level columns: `.sort_values([(1st level name, 2nd level column), (1st level name, 2nd level column)], ascending = [False, True])`
- Use `.stack()` to move the lowest column level into the lowest level of the row multi-index.

In [4]:
# Build poke_multi first; string aggregation names give stable column labels,
# and restricting to the columns used below keeps text columns out of the mean
# poke_multi = pokemon.groupby(['Type 1', 'Type 2'])[['HP', 'Defense']].agg(['min', 'mean', 'max'])
# poke_multi.index

# poke_multi.loc[(slice(None), ('Fire', 'Ground', 'Normal')), :]
# poke_multi.loc[(slice(None), ('Fire', 'Dragon')), :]

# items = ['Bug', 'Water', 'Normal']
# poke_multi.query("`Type 1` in @items")

# poke_multi.sort_values([('HP', 'max'), ('Defense', 'max')], ascending = [False, True])

# poke_multi.loc[:, (['HP', 'Defense'], ('min', 'max'))]