## CMPINF 2100 Week 04
### Pandas DataFrame Index Deep Dive

Let's build on the previous video, but focus heavily on the `.index` attribute of a DataFrame.

## Import Modules

In [2]:
import numpy as np
import pandas as pd

## Example DataFrame

We will use the same baseball example DataFrame from the previous recording.

We will create the DF from a dictionary.

In [3]:
baseball_dict = {"City": ["Pittsburgh", "Cincinnati", "Chicago", "St. Louis", "Milwaukee"],
                "Teams": ["Pirates", "Reds", "Cubs", "Cardinals", "Brewers"],
                "Division": 5*["Central"],
                "League": 5*["NL"]}

In [4]:
baseball_dict

{'City': ['Pittsburgh', 'Cincinnati', 'Chicago', 'St. Louis', 'Milwaukee'],
 'Teams': ['Pirates', 'Reds', 'Cubs', 'Cardinals', 'Brewers'],
 'Division': ['Central', 'Central', 'Central', 'Central', 'Central'],
 'League': ['NL', 'NL', 'NL', 'NL', 'NL']}

In [5]:
baseball_df = pd.DataFrame(baseball_dict)

In [6]:
baseball_df

Unnamed: 0,City,Teams,Division,League
0,Pittsburgh,Pirates,Central,NL
1,Cincinnati,Reds,Central,NL
2,Chicago,Cubs,Central,NL
3,St. Louis,Cardinals,Central,NL
4,Milwaukee,Brewers,Central,NL


But this video is ALL about the `.index` attribute!!!

In [8]:
baseball_df.index

RangeIndex(start=0, stop=5, step=1)

In [9]:
baseball_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   City      5 non-null      object
 1   Teams     5 non-null      object
 2   Division  5 non-null      object
 3   League    5 non-null      object
dtypes: object(4)
memory usage: 288.0+ bytes


But we can change the `.index` attribute!!

I dislike that Pandas allows the `.index` attribute to be ANYTHING!!!

I feel like the `.index` attribute should simply be a ROW COUNTER.

But Pandas lets `.index` be a meaningful quantity! The `.index` attribute can therefore be a separate variable!!

For the baseball example, let's change the `.index` to be the NUMBER of GAMES BACK a team is from the division leader.

In [10]:
baseball_df.index = [31.5, 27.5, 22.5, 0, 7.5]

In [11]:
baseball_df.index

Index([31.5, 27.5, 22.5, 0.0, 7.5], dtype='float64')
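Two details of the assignment above are worth a quick check. The sketch below uses a small hypothetical `demo_df` (not the baseball DataFrame) to show that the new index must supply one label per row, and that a list mixing integers and floats is stored as a float index, which is why the `0` shows up as `0.0`.

```python
import pandas as pd

# Hypothetical 3-row DataFrame used only for this illustration
demo_df = pd.DataFrame({"Teams": ["Pirates", "Reds", "Cubs"]})

# A new index must provide exactly one label per row
try:
    demo_df.index = [31.5, 27.5]  # only 2 labels for 3 rows
    assignment_failed = False
except ValueError as err:
    assignment_failed = True
    print("Rejected:", err)

# A correctly sized list is accepted; the integer 0 is stored as a float
demo_df.index = [31.5, 27.5, 0]
print(demo_df.index)
```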

In [12]:
baseball_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 31.5 to 7.5
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   City      5 non-null      object
 1   Teams     5 non-null      object
 2   Division  5 non-null      object
 3   League    5 non-null      object
dtypes: object(4)
memory usage: 200.0+ bytes


In [13]:
baseball_df

Unnamed: 0,City,Teams,Division,League
31.5,Pittsburgh,Pirates,Central,NL
27.5,Cincinnati,Reds,Central,NL
22.5,Chicago,Cubs,Central,NL
0.0,St. Louis,Cardinals,Central,NL
7.5,Milwaukee,Brewers,Central,NL


How can we find a single row or SELECT a single row from the DataFrame?

In [14]:
baseball_df[0] 

KeyError: 0

In [15]:
baseball_df[0.0] 

KeyError: 0.0

Pandas is COLUMN or VARIABLE or field centric!!!

In [16]:
baseball_df

Unnamed: 0,City,Teams,Division,League
31.5,Pittsburgh,Pirates,Central,NL
27.5,Cincinnati,Reds,Central,NL
22.5,Chicago,Cubs,Central,NL
0.0,St. Louis,Cardinals,Central,NL
7.5,Milwaukee,Brewers,Central,NL


When we type the `[]` to access something in a Pandas DataFrame we need to provide the COLUMN NAMES!!!

In [17]:
baseball_df['Teams']

31.5      Pirates
27.5         Reds
22.5         Cubs
0.0     Cardinals
7.5       Brewers
Name: Teams, dtype: object
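The same column-centric behavior shows up with the `in` operator. The sketch below, using a hypothetical two-row `demo_df`, checks membership against the columns and against the `.index` separately, and catches the `KeyError` that `[]` raises for a row label.

```python
import pandas as pd

# Hypothetical DataFrame used only for this illustration
demo_df = pd.DataFrame({"Teams": ["Cardinals", "Brewers"]}, index=[0.0, 7.5])

# `in` on a DataFrame checks the COLUMN names, just like []
has_teams_column = "Teams" in demo_df
has_zero_column = 0.0 in demo_df

# To ask about the ROW labels, check the .index explicitly
has_zero_row_label = 0.0 in demo_df.index

# [] with a row label looks for a column with that name and fails
try:
    demo_df[0.0]
    bracket_failed = False
except KeyError:
    bracket_failed = True

print(has_teams_column, has_zero_column, has_zero_row_label, bracket_failed)
```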

The column name allows easy SLICING of the DataFrame!

In [18]:
baseball_df.columns

Index(['City', 'Teams', 'Division', 'League'], dtype='object')

But how does this help us with the ROWS??

The `.index` attribute is associated with the ROWS!!!

Pandas uses a SPECIAL ATTRIBUTE to let us manage or select ROWS!

There are two flavors.

The `.loc[]` attribute allows selecting rows based on the `.index` LABEL, that is, the LOCATION OR KEY!!!

The `.iloc[]` attribute allows selecting rows based on the INTEGER POSITION, regardless of the values stored in `.index`!!!

In [19]:
baseball_df

Unnamed: 0,City,Teams,Division,League
31.5,Pittsburgh,Pirates,Central,NL
27.5,Cincinnati,Reds,Central,NL
22.5,Chicago,Cubs,Central,NL
0.0,St. Louis,Cardinals,Central,NL
7.5,Milwaukee,Brewers,Central,NL


In [20]:
baseball_df.loc[0.0]

City        St. Louis
Teams       Cardinals
Division      Central
League             NL
Name: 0.0, dtype: object

In [21]:
baseball_df.loc[31.5]

City        Pittsburgh
Teams          Pirates
Division       Central
League              NL
Name: 31.5, dtype: object

In [22]:
baseball_df.iloc[0]

City        Pittsburgh
Teams          Pirates
Division       Central
League              NL
Name: 31.5, dtype: object

The `.loc[]` selects rows based on the LOCATION or KEY, meaning the label stored in `.index`.

The `.iloc[]` selects rows based on INTEGER POSITION.
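The difference is easiest to see when the labels are themselves integers. The sketch below uses a hypothetical `demo_df` whose integer labels are deliberately out of order, so `.loc[0]` and `.iloc[0]` point at different rows. It also passes a second argument to pick a single column from the selected row.

```python
import pandas as pd

# Hypothetical DataFrame: integer labels that do NOT match the row positions
demo_df = pd.DataFrame(
    {"Teams": ["Pirates", "Reds", "Cubs"]},
    index=[2, 0, 1],
)

# .loc uses the LABEL: the row whose index value is 0
by_label = demo_df.loc[0, "Teams"]

# .iloc uses the POSITION: the first row and the first column
by_position = demo_df.iloc[0, 0]

print(by_label)     # Reds
print(by_position)  # Pirates
```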

## Resetting Index
I do not like the `.index`. I prefer it to be a REGULAR column if it is storing MEANINGFUL VALUES!!!

Let's sort by the `.index`.

In [23]:
baseball_df.sort_index()

Unnamed: 0,City,Teams,Division,League
0.0,St. Louis,Cardinals,Central,NL
7.5,Milwaukee,Brewers,Central,NL
22.5,Chicago,Cubs,Central,NL
27.5,Cincinnati,Reds,Central,NL
31.5,Pittsburgh,Pirates,Central,NL


In [24]:
baseball_df

Unnamed: 0,City,Teams,Division,League
31.5,Pittsburgh,Pirates,Central,NL
27.5,Cincinnati,Reds,Central,NL
22.5,Chicago,Cubs,Central,NL
0.0,St. Louis,Cardinals,Central,NL
7.5,Milwaukee,Brewers,Central,NL


In [25]:
baseball_df.sort_index(inplace=True)

In [26]:
baseball_df

Unnamed: 0,City,Teams,Division,League
0.0,St. Louis,Cardinals,Central,NL
7.5,Milwaukee,Brewers,Central,NL
22.5,Chicago,Cubs,Central,NL
27.5,Cincinnati,Reds,Central,NL
31.5,Pittsburgh,Pirates,Central,NL
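The cells above show that `.sort_index()` leaves the original DataFrame alone unless `inplace=True` is used. A minimal sketch of that behavior on a hypothetical `demo_df`, which also shows the `ascending` argument for reversing the order:

```python
import pandas as pd

# Hypothetical DataFrame used only for this illustration
demo_df = pd.DataFrame(
    {"Teams": ["Pirates", "Cardinals", "Brewers"]},
    index=[31.5, 0.0, 7.5],
)

# Without inplace, a NEW sorted DataFrame is returned
sorted_copy = demo_df.sort_index()
order_after_copy = list(demo_df.index)  # original order is unchanged

# With inplace=True, the method returns None and modifies demo_df itself
returned = demo_df.sort_index(inplace=True)
order_after_inplace = list(demo_df.index)

# ascending=False sorts from the largest label to the smallest
descending = demo_df.sort_index(ascending=False)

print(order_after_copy, returned, order_after_inplace, list(descending.index))
```

Assigning the result back, as in `demo_df = demo_df.sort_index()`, is a common alternative to `inplace=True`.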


We can PULL OUT the `.index` attribute into a REGULAR column using the `.reset_index()` method!

In [27]:
baseball_df.reset_index()

Unnamed: 0,index,City,Teams,Division,League
0,0.0,St. Louis,Cardinals,Central,NL
1,7.5,Milwaukee,Brewers,Central,NL
2,22.5,Chicago,Cubs,Central,NL
3,27.5,Cincinnati,Reds,Central,NL
4,31.5,Pittsburgh,Pirates,Central,NL


In [29]:
baseball_df.reset_index().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   index     5 non-null      float64
 1   City      5 non-null      object 
 2   Teams     5 non-null      object 
 3   Division  5 non-null      object 
 4   League    5 non-null      object 
dtypes: float64(1), object(4)
memory usage: 328.0+ bytes


If you do not want to KEEP the values within the `.index` attribute when you reset...you can DROP them.

In [30]:
baseball_df.reset_index(drop=True)

Unnamed: 0,City,Teams,Division,League
0,St. Louis,Cardinals,Central,NL
1,Milwaukee,Brewers,Central,NL
2,Chicago,Cubs,Central,NL
3,Cincinnati,Reds,Central,NL
4,Pittsburgh,Pirates,Central,NL


But if you want to KEEP the `.index` attribute values...the DEFAULT name of the column `index` is VERY vague!!!

In [32]:
baseball_df.reset_index()

Unnamed: 0,index,City,Teams,Division,League
0,0.0,St. Louis,Cardinals,Central,NL
1,7.5,Milwaukee,Brewers,Central,NL
2,22.5,Chicago,Cubs,Central,NL
3,27.5,Cincinnati,Reds,Central,NL
4,31.5,Pittsburgh,Pirates,Central,NL


We can RENAME a column using the `.rename()` method!!!

To rename the columns in Pandas DataFrames, we need to use a Dictionary within the `columns` argument of the `.rename()` method!!!

The KEY is the original column name and the VALUE is the NEW or desired column name!

In [33]:
baseball_df.reset_index().rename(columns={'index': 'games_back'})

Unnamed: 0,games_back,City,Teams,Division,League
0,0.0,St. Louis,Cardinals,Central,NL
1,7.5,Milwaukee,Brewers,Central,NL
2,22.5,Chicago,Cubs,Central,NL
3,27.5,Cincinnati,Reds,Central,NL
4,31.5,Pittsburgh,Pirates,Central,NL
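Another way to avoid the vague `index` column name is to NAME the index before resetting it. This is a sketch of that alternative on a hypothetical `demo_df`, not the approach used in the recording. It relies on `.rename_axis()`, which returns a new DataFrame whose index carries the given name, and `.reset_index()` then uses that name for the new column.

```python
import pandas as pd

# Hypothetical DataFrame used only for this illustration
demo_df = pd.DataFrame({"Teams": ["Cardinals", "Brewers"]}, index=[0.0, 7.5])

# Name the index first, then pull it out into a regular column
named = demo_df.rename_axis("games_back").reset_index()

print(named.columns.tolist())
print(demo_df.index.name)  # the original DataFrame's index is still unnamed
```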


In [34]:
baseball_df

Unnamed: 0,City,Teams,Division,League
0.0,St. Louis,Cardinals,Central,NL
7.5,Milwaukee,Brewers,Central,NL
22.5,Chicago,Cubs,Central,NL
27.5,Cincinnati,Reds,Central,NL
31.5,Pittsburgh,Pirates,Central,NL


Sometimes we want to place each step in a PIPELINE or WORKFLOW of actions on a separate line.

This makes the code easier to read, especially when there are MANY, MANY actions in the workflow.

In [35]:
baseball_df_b = baseball_df.\
reset_index().\
rename(columns={'index': 'games_back'}).\
copy()

In [36]:
baseball_df_b

Unnamed: 0,games_back,City,Teams,Division,League
0,0.0,St. Louis,Cardinals,Central,NL
1,7.5,Milwaukee,Brewers,Central,NL
2,22.5,Chicago,Cubs,Central,NL
3,27.5,Cincinnati,Reds,Central,NL
4,31.5,Pittsburgh,Pirates,Central,NL
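An alternative to trailing backslashes is to wrap the whole pipeline in parentheses, which lets each step start with the dot. The sketch below applies this to a hypothetical `demo_df`. It leaves off the trailing `.copy()` on the assumption that `.reset_index()` and `.rename()` already return new DataFrames, so the result should be independent of the original.

```python
import pandas as pd

# Hypothetical DataFrame used only for this illustration
demo_df = pd.DataFrame({"Teams": ["Cardinals", "Brewers"]}, index=[0.0, 7.5])

# Parentheses allow line breaks between steps without backslashes
demo_df_b = (
    demo_df
    .reset_index()
    .rename(columns={"index": "games_back"})
)

print(demo_df_b.columns.tolist())
print(list(demo_df.index))  # the original still has its float index
```

One practical benefit of this layout is that a single step can be commented out without breaking the line continuation.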


## INDEX upon creation

The `.index` attribute can be defined when the DataFrame is created.

In [39]:
baseball_df_c = pd.DataFrame(data=baseball_dict,
                             index = [31.5, 27.5, 22.5, 0, 7.5],
                             columns = ['League', 'Division', 'City', 'Teams'])

In [40]:
baseball_df_c

Unnamed: 0,League,Division,City,Teams
31.5,NL,Central,Pittsburgh,Pirates
27.5,NL,Central,Cincinnati,Reds
22.5,NL,Central,Chicago,Cubs
0.0,NL,Central,St. Louis,Cardinals
7.5,NL,Central,Milwaukee,Brewers
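Notice that the `columns` argument controlled the column ORDER above. With a dictionary as the data, it also controls which keys are kept. The sketch below uses a hypothetical `demo_dict` and a made-up `Mascot` name to show that a dictionary key left out of `columns` is dropped, while a name that is not in the dictionary becomes a column of missing values.

```python
import pandas as pd

# Hypothetical dictionary used only for this illustration
demo_dict = {"City": ["Pittsburgh", "Chicago"], "Teams": ["Pirates", "Cubs"]}

demo_df = pd.DataFrame(
    data=demo_dict,
    index=[31.5, 22.5],
    columns=["Teams", "Mascot"],  # "City" is omitted; "Mascot" is not a key
)

print(demo_df.columns.tolist())
print(demo_df["Mascot"].isna().all())  # True: the unknown column is all missing
```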


## Summary

The `.loc[]` and `.iloc[]` attributes allow us to select rows based on the `.index` attribute!

We can also reset the index using the `.reset_index()` method.

We can change column names using the `.rename()` method.