<a href="https://colab.research.google.com/github/anyuanay/info212/blob/main/INFO212_Week5_Lecture_Pandas_Cut_Hierarchical_Indexing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# INFO 212: Data Science Programming 1
___

## Week 5: Lecture:
### 1. Pandas Cut
### 2. Hierarchical Indexing
---

## Agenda

### Cut and Hierarchical Indexing:
- pd.cut() for categorization
- data.stack()
- data.unstack()
- frame.swaplevel('key1', 'key2')
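- frame.groupby(level='key2').sum() (replaces frame.sum(level='key2'), which was removed in pandas 2.0)
- frame.T.groupby(level='color').sum().T (replaces frame.sum(level='color', axis=1))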
- frame.set_index(['c', 'd'])
- frame.reset_index()
- frame.reindex(['a', 'b'])

```
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```

# Pandas cut() function

The cut() function in pandas is used to transform continuous data into categorical ones, often referred to as "binning."

The key parameters of the cut() function are:
- x: Array-like input data.
- bins: Defines the bin edges for the segmentation. Can be an int (defining the number of equal-width bins in the range of x) or a sequence of scalars.
- right: Bool, default True. Indicates whether the bins include the rightmost edge or not.
- labels: Specifies the labels for the returned bins. If set to False, only integer indicators of the bins are returned.
- retbins: Bool, default False. Whether to return the bins or not.
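
As a small illustration of the `labels` and `retbins` parameters listed above, the sketch below bins a few made-up ages (the `ages` Series is an assumption for demonstration only, not lecture data). With `labels=False` the result holds integer bin codes, and `retbins=True` additionally returns the bin edges.

```python
import pandas as pd

# Hypothetical sample values, used only for illustration
ages = pd.Series([15, 22, 34, 45, 29])

# labels=False -> integer bin codes; retbins=True -> also return the edges
codes, edges = pd.cut(ages, bins=[0, 18, 35, 50], labels=False, retbins=True)

print(codes.tolist())  # position of the bin each age falls into
print(edges)           # the bin edges that were used
```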

```
data = {'Name': ['Anna', 'Ben', 'Charlie', 'David', 'Ella', 'Frank', 'Grace', 'Hannah', 'Ivy', 'Jack'],
        'Age': [15, 22, 34, 45, 29, 67, 54, 89, 20, 32]}

df = pd.DataFrame(data)

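# Bins are right-closed by default: (0, 18], (18, 35], (35, 50], (50, 70], (70, 100],
# so the labels name the inclusive upper edge of each bin.
bins = [0, 18, 35, 50, 70, 100]
labels = ['<=18', '19-35', '36-50', '51-70', '71+']
df['Age Group'] = pd.cut(df['Age'], bins=bins, labels=labels)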
```

We can assign custom labels to the bins:

```
bins = [0, 20, 40, 60, 80, 100]
labels = ['Very Young', 'Young', 'Middle Aged', 'Senior', 'Elderly']
df['Age Group'] = pd.cut(df['Age'], bins=bins, labels=labels)
```


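We can exclude the right edge of each bin (and include the left edge instead) by passing right=False, so the bins become [0, 20), [20, 40), and so on: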

```
bins = [0, 20, 40, 60, 80, 100]
df['Age Group'] = pd.cut(df['Age'], bins=bins, right=False)
```
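
To make the effect of `right` concrete, here is a minimal sketch that bins a single value sitting exactly on a bin edge. The `edge_value` Series is a made-up example, not part of the lecture data.

```python
import pandas as pd

edge_value = pd.Series([20])  # hypothetical value lying exactly on a bin edge
bins = [0, 20, 40]

# Default right=True: intervals are closed on the right
default_bin = pd.cut(edge_value, bins=bins).iloc[0]
# right=False: intervals are closed on the left
left_closed_bin = pd.cut(edge_value, bins=bins, right=False).iloc[0]

print(default_bin)      # the interval that ends at 20
print(left_closed_bin)  # the interval that starts at 20
```

The same value lands in a different bin depending on which side is closed, which matters whenever the data contains values equal to a bin edge.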


If we specify an integer for bins, it'll create that many equal-width bins:

```
df['Age Group'] = pd.cut(df['Age'], bins=5)
```
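
The sketch below shows which edges pandas picks for an integer `bins` value, using the same ages as the example DataFrame. It uses `labels=False` and `retbins=True` to inspect the result; note that pandas nudges the lowest edge slightly below the minimum so that the smallest value is included.

```python
import numpy as np
import pandas as pd

# Same ages as in the example DataFrame above
ages = pd.Series([15, 22, 34, 45, 29, 67, 54, 89, 20, 32])

codes, edges = pd.cut(ages, bins=5, labels=False, retbins=True)

print(edges)                          # 6 edges delimit 5 equal-width bins
print(np.bincount(codes.to_numpy()))  # number of ages falling in each bin
```

Equal-width bins do not hold equal numbers of observations; here most ages fall in the lowest bin.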


## Exercise:
Load the housing.csv as a DataFrame. Plot the median house price. Categorize the median prices into cheap, regular, and expensive. Plot the distribution of the categories.

# Indexing with a DataFrame's columns
It’s not unusual to want to use one or more columns from a DataFrame as the row
index; alternatively, you may wish to move the row index into the DataFrame’s columns.
Here’s an example DataFrame:


```
frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two',
                            'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})
frame
```

## set_index() and reset_index()
DataFrame’s set_index function will create a new DataFrame using one or more of
its columns as the index:

```
frame.set_index('c')
```

We can also set multiple columns as the index. The result has a **hierarchical index!**
```
frame2 = frame.set_index(['c', 'd'])
frame2
```

By default the columns are removed from the DataFrame, though you can leave them
in:

```
frame.set_index(['c', 'd'], drop=False)
```

reset_index, on the other hand, does the opposite of set_index; the hierarchical
index levels are moved into the columns:

```
frame2.reset_index()
```
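
As a quick check that the two methods really are inverses, this sketch rebuilds the example `frame`, moves `c` and `d` into the index, and moves them back. The names `indexed` and `restored` are introduced here for illustration only.

```python
import pandas as pd

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})

indexed = frame.set_index(['c', 'd'])  # 'c' and 'd' become index levels
restored = indexed.reset_index()       # ... and move back into the columns

print(indexed.columns.tolist())   # the remaining data columns
print(restored.columns.tolist())  # former index levels come first
```

The data is recovered, but the column order changes: the former index levels are placed in front of the data columns.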

## Exercise:
For the housing DataFrame, create the column `value_cat` using the categories cheap, regular, and expensive based on the value ranges from the previous exercise. Set the columns `ocean_proximity` and `value_cat` as the index of the DataFrame. To group the houses by `ocean_proximity`, ensure the DataFrame is first sorted on `ocean_proximity`. Store the result in df1.

# Hierarchical Indexing

Hierarchical indexing is an important feature of pandas that enables you to have multiple
(two or more) index levels on an axis. Somewhat abstractly, it provides a way for
you to work with higher dimensional data in a lower dimensional form.

```
frame2
```

What you’re seeing is a prettified view of a DataFrame with a MultiIndex as its index. The
“gaps” in the index display mean “use the label directly above”:

```
frame2.index
```

A hierarchical index has levels. Level numbers start at 0 for the outermost level and increase inward.
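
The levels can be inspected directly on the index. The sketch below rebuilds `frame2` from the example DataFrame so that it runs on its own; `idx` is just a local name for the index.

```python
import pandas as pd

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})
frame2 = frame.set_index(['c', 'd'])
idx = frame2.index

print(idx.nlevels)                         # number of levels
print(list(idx.names))                     # level names, outermost first
print(idx.get_level_values(0).tolist())    # labels of level 0, by position
print(idx.get_level_values('d').tolist())  # labels of level 'd', by name
```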

With a hierarchically indexed object, so-called partial indexing is possible, enabling
you to concisely select subsets of the data:

```
frame2.loc['one']
frame2.loc['one'].iloc[1]
```
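
Partial indexing uses only the outer level; passing a tuple with one label per level selects a single row. A minimal sketch, rebuilding `frame2` so that it runs on its own (the names `inner`, `row`, and `value` are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})
frame2 = frame.set_index(['c', 'd'])

inner = frame2.loc['two']            # partial key: all rows under 'two'
row = frame2.loc[('two', 3), :]      # full key: one row, returned as a Series
value = frame2.loc[('two', 3), 'a']  # full key plus a column label: a scalar

print(inner)
print(row)
print(value)
```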

## Exercise:
Show df1's index and its levels. Select all 'INLAND' houses and plot the distribution of value_cat for those houses. Similarly, select all 'ISLAND' houses and plot the distribution of value_cat for those houses.

Hierarchical indexing plays an important role in reshaping data and in group-based
operations like forming a pivot table (covered next, under aggregation). It also provides a convenient way to work with higher dimensional data in a lower dimensional form. For example, stack() moves the column labels into the innermost index level, converting the DataFrame to a 1-D Series:
```
frame3 = frame2.stack()
```
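
To see what stacking produces, the sketch below rebuilds `frame2` and inspects the stacked result. The column labels `a` and `b` become a third, innermost index level.

```python
import pandas as pd

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})
frame2 = frame.set_index(['c', 'd'])

frame3 = frame2.stack()

print(type(frame3).__name__)        # the result is a Series
print(frame3.index.nlevels)         # c, d, plus the former column labels
print(len(frame3))                  # 7 rows x 2 columns
print(frame3.loc[('one', 0, 'b')])  # one label per level selects one value
```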

The inverse operation of stack is unstack:

```
frame2.stack().unstack()
```
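
unstack can also be applied to a Series with a MultiIndex, and the `level` argument chooses which index level becomes the columns. A minimal sketch on column `a` of the rebuilt `frame2` (the names `a_by_d` and `a_by_c` are illustrative). Combinations that do not occur in the data, such as ('one', 3), are filled with NaN.

```python
import pandas as pd

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})
frame2 = frame.set_index(['c', 'd'])

a_by_d = frame2['a'].unstack()           # innermost level 'd' becomes the columns
a_by_c = frame2['a'].unstack(level='c')  # level 'c' becomes the columns instead

print(a_by_d)
print(a_by_c)
```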

The DataFrame xs() method takes a key and a level argument and selects the rows whose label at that level of the MultiIndex matches the key. Here we select the rows where level 1 (`d`) equals 1:
```
frame2.xs(1, level=1)
```
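
The level can be given by position or by name, and `drop_level=False` keeps the selected level in the result. A short sketch on the rebuilt `frame2` (the variable names are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})
frame2 = frame.set_index(['c', 'd'])

by_position = frame2.xs(1, level=1)               # level given by number
by_name = frame2.xs(1, level='d')                 # same selection, level by name
kept = frame2.xs(1, level='d', drop_level=False)  # keep level 'd' in the index

print(by_position)
print(kept)
```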

## Exercise:
Select all 'expensive' houses and plot their distribution by `ocean_proximity`.

### Reordering and Sorting Levels
At times you will need to rearrange the order of the levels on an axis or sort the data
by the values in one specific level. The swaplevel method takes two level numbers or names
and returns a new object with the levels interchanged (the data is otherwise
unaltered):

```
frame2.swaplevel('c', 'd')
```
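
The sketch below checks both claims on the rebuilt `frame2`: the level order is swapped in the returned object, while the row order, the values, and the original frame stay as they were.

```python
import pandas as pd

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})
frame2 = frame.set_index(['c', 'd'])

swapped = frame2.swaplevel('c', 'd')

print(list(swapped.index.names))    # 'd' is now the outer level
print(swapped.index.tolist()[:3])   # same rows, labels in swapped order
print((swapped.to_numpy() == frame2.to_numpy()).all())  # values are untouched
print(list(frame2.index.names))     # the original frame is not modified
```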

sort_index, on the other hand, sorts the data lexicographically using all the index levels
by default; passing the level argument makes it sort by that level first.
When swapping levels, it’s not uncommon to also use sort_index so that the result is
lexicographically sorted by the indicated level:

```
frame2.sort_index(level=1)
frame2.swaplevel(0, 1).sort_index(level=0)
```
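
Swapping levels leaves the rows in their old order, so the new outer level is usually unsorted. This sketch, on the rebuilt `frame2`, shows the index before and after sorting (the names `swapped` and `sorted_frame` are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})
frame2 = frame.set_index(['c', 'd'])

swapped = frame2.swaplevel(0, 1)            # 'd' becomes the outer level
sorted_frame = swapped.sort_index(level=0)  # sort by the new outer level

print(swapped.index.is_monotonic_increasing)       # unsorted after the swap
print(sorted_frame.index.is_monotonic_increasing)  # sorted after sort_index
print(sorted_frame.index.tolist())
```

After sorting, rows that share a value of `d` sit next to each other, which is what partial indexing on the new outer level relies on for efficient lookups.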

## Exercise:
Swap the index levels of df1 so that `value_cat` is at level 0. Ensure the houses are grouped by `value_cat`. Select all expensive houses located in INLAND.