# Introduction to Pandas DataFrame

The previous video was all about Pandas Series. A Series contains values associated with a single variable.

The DataFrame is a COLLECTION of variables! Or, a collection of Pandas Series!

## Import Modules

In [1]:
import numpy as np
import pandas as pd

## NumPy to DataFrames

We previously learned that Pandas Series are kind of like 1D NumPy arrays.

A Pandas DataFrame is kind of like a 2D NumPy array.

So let's create a DataFrame from a 2D NumPy array.

In [2]:
X = np.arange(1, 25).reshape( 6, -1 )

In [3]:
X.shape

(6, 4)

In [4]:
X.ndim

2

In [5]:
X.size

24

In [6]:
X[ 0 ]

array([1, 2, 3, 4])

In [7]:
X[ 1 ]

array([5, 6, 7, 8])

In [8]:
X[ -1 ]

array([21, 22, 23, 24])

In [9]:
X

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12],
       [13, 14, 15, 16],
       [17, 18, 19, 20],
       [21, 22, 23, 24]])

In [10]:
X[ :, 0 ]

array([ 1,  5,  9, 13, 17, 21])

In [11]:
X[ :, -1 ]

array([ 4,  8, 12, 16, 20, 24])

In [12]:
X[ 0, -1 ]

4

In [13]:
X[ -1, 0 ]

21

In [14]:
X[ :3, :2 ]

array([[ 1,  2],
       [ 5,  6],
       [ 9, 10]])

Let's convert `X` into a DataFrame!!

In [15]:
Xdf = pd.DataFrame( X )

In [16]:
%whos

Variable   Type         Data/Info
---------------------------------
X          ndarray      6x4: 24 elems, type `int32`, 96 bytes
Xdf        DataFrame        0   1   2   3\n0   1 <...>19  20\n5  21  22  23  24
np         module       <module 'numpy' from 'C:\<...>ges\\numpy\\__init__.py'>
pd         module       <module 'pandas' from 'C:<...>es\\pandas\\__init__.py'>


In [17]:
type( Xdf )

pandas.core.frame.DataFrame

In [18]:
print( Xdf )

    0   1   2   3
0   1   2   3   4
1   5   6   7   8
2   9  10  11  12
3  13  14  15  16
4  17  18  19  20
5  21  22  23  24


In [19]:
print( X )

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]
 [13 14 15 16]
 [17 18 19 20]
 [21 22 23 24]]


In [20]:
Xdf

Unnamed: 0,0,1,2,3
0,1,2,3,4
1,5,6,7,8
2,9,10,11,12
3,13,14,15,16
4,17,18,19,20
5,21,22,23,24


Provide a single index to the NumPy 2D array and we get back a ROW!

In [21]:
X[ 0 ]

array([1, 2, 3, 4])

Look what happens when we use a single index on the DataFrame! We get back a COLUMN, not a ROW!

In [22]:
Xdf[ 0 ]

0     1
1     5
2     9
3    13
4    17
5    21
Name: 0, dtype: int32

In [23]:
X[ :, 0 ]

array([ 1,  5,  9, 13, 17, 21])

In [24]:
Xdf[ :, 0 ]

InvalidIndexError: (slice(None, None, None), 0)

In [25]:
Xdf[ 0, : ]

InvalidIndexError: (0, slice(None, None, None))

There is clearly something different about the Pandas DataFrame compared to the NumPy 2D array...even though both are TABLE-LIKE.

Both have 2 dimensions.

Both have rows and columns.

But we CANNOT interact with a Pandas DataFrame using the same syntax as the NumPy 2D array!!!

Remember...the Pandas Series is like a list, dictionary, and 1D NumPy array!!!!
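
Here is a minimal sketch of that difference. The column names `a` through `d` are made up for this illustration and are not part of the example above. It shows that a single key in `[]` returns a COLUMN of the DataFrame, and that converting back with `.to_numpy()` restores the NumPy `[row, column]` syntax.

```python
import numpy as np
import pandas as pd

X = np.arange(1, 25).reshape(6, -1)

# Hypothetical column names, chosen only for this illustration.
Xdf_named = pd.DataFrame(X, columns=['a', 'b', 'c', 'd'])

# A single key in [] selects a COLUMN of the DataFrame (returned as a Series).
first_col = Xdf_named['a']

# The same values come from the first COLUMN of the NumPy array.
print(np.array_equal(first_col.to_numpy(), X[:, 0]))

# Converting back to NumPy restores the [row, column] syntax.
print(Xdf_named.to_numpy()[0, -1])
```

Naming the columns when the DataFrame is created is optional, but it makes it clearer that `[]` is looking up a column KEY rather than a row position.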

## Dictionary to DataFrame

Let's see the connection between a Dictionary and a DataFrame by converting a Dictionary into a DataFrame!

Remember that the Dictionary has KEY/VALUE pairs to define each ITEM!

In most of our previous examples, the KEY was associated with a SINGLE valued VALUE. But now, we will use a MULTI-VALUED or MULTI-ENTRY VALUE per KEY!

In [26]:
baseball_dict = {'City': ['Pittsburgh', 'Cincinnati', 'Chicago', 'St. Louis', 'Milwaukee'],
                 'Team': ['Pirates', 'Reds', 'Cubs', 'Cardinals', 'Brewers'],
                 'Division': 5 * ['Central'],
                 'League': 5 * ['NL']}

In [27]:
baseball_dict

{'City': ['Pittsburgh', 'Cincinnati', 'Chicago', 'St. Louis', 'Milwaukee'],
 'Team': ['Pirates', 'Reds', 'Cubs', 'Cardinals', 'Brewers'],
 'Division': ['Central', 'Central', 'Central', 'Central', 'Central'],
 'League': ['NL', 'NL', 'NL', 'NL', 'NL']}

In [28]:
baseball_dict['City']

['Pittsburgh', 'Cincinnati', 'Chicago', 'St. Louis', 'Milwaukee']

In [29]:
baseball_dict['Team']

['Pirates', 'Reds', 'Cubs', 'Cardinals', 'Brewers']

Convert the dictionary into a DataFrame!!!

In [30]:
baseball_df = pd.DataFrame( baseball_dict )

In [31]:
baseball_df

Unnamed: 0,City,Team,Division,League
0,Pittsburgh,Pirates,Central,NL
1,Cincinnati,Reds,Central,NL
2,Chicago,Cubs,Central,NL
3,St. Louis,Cardinals,Central,NL
4,Milwaukee,Brewers,Central,NL


The Dictionary KEYS become COLUMN NAMES!!!!!

In [32]:
baseball_df.columns

Index(['City', 'Team', 'Division', 'League'], dtype='object')

In [34]:
baseball_dict.keys()

dict_keys(['City', 'Team', 'Division', 'League'])

The dictionary VALUEs become the entries within the ROWS for each COLUMN!

In [35]:
baseball_df.index

RangeIndex(start=0, stop=5, step=1)
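
As a small sketch of what this implies (the dictionary below is a shortened, made-up variation of the baseball example), every VALUE must have the same number of entries, because each entry fills one ROW of its COLUMN. Each column we pull back out is a Pandas Series.

```python
import pandas as pd

# Shortened example data; the 'placeholder' numbers carry no meaning.
small_dict = {'Team': ['Pirates', 'Reds', 'Cubs'],
              'placeholder': [1, 2, 3]}

small_df = pd.DataFrame(small_dict)

# Each KEY is a column name, and each column is a Pandas Series.
print(list(small_df.columns) == list(small_dict.keys()))
print(type(small_df['Team']))

# Lists of different lengths cannot be lined up into rows.
try:
    pd.DataFrame({'Team': ['Pirates', 'Reds'], 'City': ['Pittsburgh']})
    mismatch_error = None
except ValueError as err:
    mismatch_error = err
    print('ValueError:', err)
```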

## DataFrame attributes

In [36]:
Xdf.index

RangeIndex(start=0, stop=6, step=1)

In [37]:
baseball_df.index

RangeIndex(start=0, stop=5, step=1)

In [38]:
Xdf.columns

RangeIndex(start=0, stop=4, step=1)

In [39]:
baseball_df.columns

Index(['City', 'Team', 'Division', 'League'], dtype='object')

In [40]:
Xdf.shape

(6, 4)

In [41]:
baseball_df.shape

(5, 4)

In [42]:
type( Xdf )

pandas.core.frame.DataFrame

In [43]:
type( baseball_df )

pandas.core.frame.DataFrame

But...we are REALLY interested in the DATA TYPE associated with each COLUMN contained within the DataFrame!!!

In [44]:
Xdf.dtypes

0    int32
1    int32
2    int32
3    int32
dtype: object

In [45]:
baseball_df.dtypes

City        object
Team        object
Division    object
League      object
dtype: object

In [47]:
type( baseball_df.dtypes )

pandas.core.series.Series

In [48]:
baseball_df.dtypes.index

Index(['City', 'Team', 'Division', 'League'], dtype='object')

In [49]:
baseball_df.columns

Index(['City', 'Team', 'Division', 'League'], dtype='object')
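
The baseball columns all share one data type, so here is a hedged sketch with MIXED data types. The `mixed_df` object and its placeholder numbers are assumptions made for this illustration only. It shows that `.dtypes` is a Series indexed by the column names, and that `.select_dtypes()` can use those data types to pick columns.

```python
import pandas as pd

mixed_df = pd.DataFrame({'Team': ['Pirates', 'Reds'],
                         'count': [1, 2],       # placeholder integers
                         'ratio': [0.5, 1.5]})  # placeholder floats

# .dtypes is a Series: its index holds the column names,
# its values hold the data type of each column.
print(mixed_df.dtypes)

# Keep only the numeric columns.
numeric_df = mixed_df.select_dtypes(include='number')
print(list(numeric_df.columns))
```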

## DataFrame methods

We will only show a few in this video. Next week is really dedicated to DataFrame methods!!!!!!

In [50]:
Xdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       6 non-null      int32
 1   1       6 non-null      int32
 2   2       6 non-null      int32
 3   3       6 non-null      int32
dtypes: int32(4)
memory usage: 224.0 bytes


In [51]:
baseball_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   City      5 non-null      object
 1   Team      5 non-null      object
 2   Division  5 non-null      object
 3   League    5 non-null      object
dtypes: object(4)
memory usage: 288.0+ bytes


In [52]:
Xdf.describe()

Unnamed: 0,0,1,2,3
count,6.0,6.0,6.0,6.0
mean,11.0,12.0,13.0,14.0
std,7.483315,7.483315,7.483315,7.483315
min,1.0,2.0,3.0,4.0
25%,6.0,7.0,8.0,9.0
50%,11.0,12.0,13.0,14.0
75%,16.0,17.0,18.0,19.0
max,21.0,22.0,23.0,24.0


In [55]:
X.mean(axis=0)

array([11., 12., 13., 14.])

In [56]:
X.std(axis=0, ddof=1)

array([7.48331477, 7.48331477, 7.48331477, 7.48331477])
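
The `ddof=1` argument above matters. NumPy's `.std()` defaults to `ddof=0` (population standard deviation), while the Pandas `.std()` and `.describe()` methods default to `ddof=1` (sample standard deviation). A short sketch to confirm this on the same array:

```python
import numpy as np
import pandas as pd

X = np.arange(1, 25).reshape(6, -1)
Xdf = pd.DataFrame(X)

numpy_default = X.std(axis=0)          # ddof=0 by default
numpy_sample = X.std(axis=0, ddof=1)   # same convention as Pandas
pandas_default = Xdf.std().to_numpy()  # ddof=1 by default

# The sample version matches Pandas, the NumPy default does not.
print(np.allclose(numpy_sample, pandas_default))
print(np.allclose(numpy_default, pandas_default))
```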

In [57]:
baseball_df.describe()

Unnamed: 0,City,Team,Division,League
count,5,5,5,5
unique,5,5,1,1
top,Pittsburgh,Pirates,Central,NL
freq,1,1,5,5


In [58]:
baseball_df

Unnamed: 0,City,Team,Division,League
0,Pittsburgh,Pirates,Central,NL
1,Cincinnati,Reds,Central,NL
2,Chicago,Cubs,Central,NL
3,St. Louis,Cardinals,Central,NL
4,Milwaukee,Brewers,Central,NL


There are many more methods but we will not show most in this video.

Instead, let's consider how to SORT or ORDER or ARRANGE the DataFrame.

In [59]:
baseball_df

Unnamed: 0,City,Team,Division,League
0,Pittsburgh,Pirates,Central,NL
1,Cincinnati,Reds,Central,NL
2,Chicago,Cubs,Central,NL
3,St. Louis,Cardinals,Central,NL
4,Milwaukee,Brewers,Central,NL


In [60]:
baseball_df.sort_values(['Team'])

Unnamed: 0,City,Team,Division,League
4,Milwaukee,Brewers,Central,NL
3,St. Louis,Cardinals,Central,NL
2,Chicago,Cubs,Central,NL
0,Pittsburgh,Pirates,Central,NL
1,Cincinnati,Reds,Central,NL


In [61]:
baseball_df

Unnamed: 0,City,Team,Division,League
0,Pittsburgh,Pirates,Central,NL
1,Cincinnati,Reds,Central,NL
2,Chicago,Cubs,Central,NL
3,St. Louis,Cardinals,Central,NL
4,Milwaukee,Brewers,Central,NL


Most Pandas methods do NOT modify in place!!!!!

In [63]:
baseball_df.sort_values(['Team'])

Unnamed: 0,City,Team,Division,League
4,Milwaukee,Brewers,Central,NL
3,St. Louis,Cardinals,Central,NL
2,Chicago,Cubs,Central,NL
0,Pittsburgh,Pirates,Central,NL
1,Cincinnati,Reds,Central,NL


Sometimes, we do not want to retain the original `.index` attribute positions!

We want to IGNORE the `.index` when we SORT!

In [64]:
baseball_df.sort_values( ['Team'], ignore_index=True )

Unnamed: 0,City,Team,Division,League
0,Milwaukee,Brewers,Central,NL
1,St. Louis,Cardinals,Central,NL
2,Chicago,Cubs,Central,NL
3,Pittsburgh,Pirates,Central,NL
4,Cincinnati,Reds,Central,NL


We could also sort in DESCENDING order.

In [67]:
baseball_df.sort_values( ['Team'], ignore_index=True, ascending=False)

Unnamed: 0,City,Team,Division,League
0,Cincinnati,Reds,Central,NL
1,Pittsburgh,Pirates,Central,NL
2,Chicago,Cubs,Central,NL
3,St. Louis,Cardinals,Central,NL
4,Milwaukee,Brewers,Central,NL


The result is NOT stored. To store or KEEP the sorted result, we need to assign it to a NEW object.

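When I assign a result to a new object, I like to force the DEEP COPY! The `.sort_values()` method already returns a new DataFrame, so the `.copy()` here is an extra safeguard rather than a requirement.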

In [68]:
baseball_df_b = baseball_df.sort_values( ['Team'], ignore_index=True, ascending=False).copy()

In [69]:
baseball_df_b

Unnamed: 0,City,Team,Division,League
0,Cincinnati,Reds,Central,NL
1,Pittsburgh,Pirates,Central,NL
2,Chicago,Cubs,Central,NL
3,St. Louis,Cardinals,Central,NL
4,Milwaukee,Brewers,Central,NL


Alternatively, we CAN force Pandas to MODIFY in place using the `inplace` argument!

In [70]:
baseball_df.sort_values( ['Team'], inplace=True )

In [71]:
baseball_df

Unnamed: 0,City,Team,Division,League
4,Milwaukee,Brewers,Central,NL
3,St. Louis,Cardinals,Central,NL
2,Chicago,Cubs,Central,NL
0,Pittsburgh,Pirates,Central,NL
1,Cincinnati,Reds,Central,NL


In [72]:
baseball_df_b

Unnamed: 0,City,Team,Division,League
0,Cincinnati,Reds,Central,NL
1,Pittsburgh,Pirates,Central,NL
2,Chicago,Cubs,Central,NL
3,St. Louis,Cardinals,Central,NL
4,Milwaukee,Brewers,Central,NL
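
One detail worth checking for yourself is what the method RETURNS in each case. The sketch below uses a one-column DataFrame named `teams_df`, which is an assumption made for this illustration. It shows that the default call returns a new DataFrame, while `inplace=True` returns `None` and changes the original.

```python
import pandas as pd

teams_df = pd.DataFrame({'Team': ['Pirates', 'Reds', 'Cubs', 'Cardinals', 'Brewers']})

# Default behavior: a NEW sorted DataFrame is returned, the original is untouched.
sorted_df = teams_df.sort_values('Team', ignore_index=True)
print(teams_df['Team'].tolist())
print(sorted_df['Team'].tolist())

# With inplace=True the original is modified and the method returns None.
result = teams_df.sort_values('Team', inplace=True)
print(result is None)
print(teams_df['Team'].tolist())
```

Because `inplace=True` returns `None`, writing `teams_df = teams_df.sort_values('Team', inplace=True)` would replace the DataFrame with `None`. Pick one approach or the other, never both together.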


## Summary

This video focused on CREATING DataFrames from NumPy and from Dictionaries.

Creating a DataFrame from a NumPy array highlighted the fact that DataFrames are 2D objects - they have 2 dimensions - ROWS AND COLUMNS!!!!!

Creating a DataFrame from a Dictionary highlighted that the KEY becomes the COLUMN NAME!!!!

We need to think of Pandas DataFrames as a combination of the NumPy 2D array and the Python Dictionary!!!!!

# Pandas DataFrame index deep dive

Let's build on the previous video, but focus heavily on the `.index` attribute of a DataFrame.

## Import Modules

In [1]:
import numpy as np
import pandas as pd

## Example DataFrame

We will use the same baseball example DataFrame from the previous recording.

We will create that DataFrame from a dictionary.

In [2]:
baseball_dict = {'City': ['Pittsburgh', 'Cincinatti', 'Chicago', 'St. Louis', 'Milwaukee'],
                 'Team': ['Pirates', 'Reds', 'Cubs', 'Cardinals', 'Brewers'],
                 'Division': 5 * ['Central'],
                 'League': 5 * ['NL']}

In [3]:
baseball_dict

{'City': ['Pittsburgh', 'Cincinatti', 'Chicago', 'St. Louis', 'Milwaukee'],
 'Team': ['Pirates', 'Reds', 'Cubs', 'Cardinals', 'Brewers'],
 'Division': ['Central', 'Central', 'Central', 'Central', 'Central'],
 'League': ['NL', 'NL', 'NL', 'NL', 'NL']}

In [4]:
baseball_df = pd.DataFrame( baseball_dict )

In [5]:
baseball_df

Unnamed: 0,City,Team,Division,League
0,Pittsburgh,Pirates,Central,NL
1,Cincinatti,Reds,Central,NL
2,Chicago,Cubs,Central,NL
3,St. Louis,Cardinals,Central,NL
4,Milwaukee,Brewers,Central,NL


But...this video is ALL about the `.index` attribute!!!!

In [6]:
baseball_df.index

RangeIndex(start=0, stop=5, step=1)

In [7]:
baseball_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   City      5 non-null      object
 1   Team      5 non-null      object
 2   Division  5 non-null      object
 3   League    5 non-null      object
dtypes: object(4)
memory usage: 288.0+ bytes


But we can change the `.index` attribute!

I dislike that Pandas allows the `.index` attribute to be ANYTHING!!!

I feel the `.index` attribute should simply be a ROW counter.

But...Pandas lets the `.index` be a meaningful quantity! The `.index` attribute can therefore be a separate variable!

For the baseball example, let's change the `.index` to be the NUMBER OF GAMES BACK a team is from the division leader.

In [8]:
baseball_df.index = [31.5, 27.5, 22.5, 0, 7.5]

In [9]:
baseball_df.index

Float64Index([31.5, 27.5, 22.5, 0.0, 7.5], dtype='float64')

In [10]:
baseball_df.info()

<class 'pandas.core.frame.DataFrame'>
Float64Index: 5 entries, 31.5 to 7.5
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   City      5 non-null      object
 1   Team      5 non-null      object
 2   Division  5 non-null      object
 3   League    5 non-null      object
dtypes: object(4)
memory usage: 200.0+ bytes


In [11]:
baseball_df

Unnamed: 0,City,Team,Division,League
31.5,Pittsburgh,Pirates,Central,NL
27.5,Cincinatti,Reds,Central,NL
22.5,Chicago,Cubs,Central,NL
0.0,St. Louis,Cardinals,Central,NL
7.5,Milwaukee,Brewers,Central,NL


How can we find a single row or SELECT a single row from the DataFrame?

In [12]:
baseball_df[ 0 ]

KeyError: 0

In [13]:
baseball_df[ 0.0 ]

KeyError: 0.0

Pandas is COLUMN or variable or field centric!!!!

In [14]:
baseball_df

Unnamed: 0,City,Team,Division,League
31.5,Pittsburgh,Pirates,Central,NL
27.5,Cincinatti,Reds,Central,NL
22.5,Chicago,Cubs,Central,NL
0.0,St. Louis,Cardinals,Central,NL
7.5,Milwaukee,Brewers,Central,NL


When we type the `[]` to access SOMETHING in a Pandas DataFrame...we need to provide the COLUMN NAME!!!!

In [15]:
baseball_df[ 'City' ]

31.5    Pittsburgh
27.5    Cincinatti
22.5       Chicago
0.0      St. Louis
7.5      Milwaukee
Name: City, dtype: object

In [16]:
baseball_df[ 'Team' ]

31.5      Pirates
27.5         Reds
22.5         Cubs
0.0     Cardinals
7.5       Brewers
Name: Team, dtype: object

The column name allows easy SLICING of the DataFrame!

In [17]:
baseball_df.columns

Index(['City', 'Team', 'Division', 'League'], dtype='object')

But how does this help us with the ROWS???

The `.index` attribute is associated with the ROWS!!!!!!!!

In [18]:
baseball_df.index

Float64Index([31.5, 27.5, 22.5, 0.0, 7.5], dtype='float64')

Pandas uses a SPECIAL ATTRIBUTE to let us manage or select ROWS!

There are two flavors.

The `.loc[]` attribute allows selecting rows based on the `.index` LABEL or KEY!!!

The `.iloc[]` attribute allows selecting rows based on the `.index` integer POSITION!!!!

In [19]:
baseball_df

Unnamed: 0,City,Team,Division,League
31.5,Pittsburgh,Pirates,Central,NL
27.5,Cincinatti,Reds,Central,NL
22.5,Chicago,Cubs,Central,NL
0.0,St. Louis,Cardinals,Central,NL
7.5,Milwaukee,Brewers,Central,NL


In [20]:
baseball_df.loc[ 0.0 ]

City        St. Louis
Team        Cardinals
Division      Central
League             NL
Name: 0.0, dtype: object

In [21]:
baseball_df.loc[ 31.5 ]

City        Pittsburgh
Team           Pirates
Division       Central
League              NL
Name: 31.5, dtype: object

In [22]:
baseball_df.iloc[ 0 ]

City        Pittsburgh
Team           Pirates
Division       Central
League              NL
Name: 31.5, dtype: object

In [23]:
baseball_df.iloc[ 3 ]

City        St. Louis
Team        Cardinals
Division      Central
League             NL
Name: 0.0, dtype: object

In [24]:
baseball_df.iloc[ -1 ]

City        Milwaukee
Team          Brewers
Division      Central
League             NL
Name: 7.5, dtype: object

The `.loc[]` attribute selects ROWS based on the `.index` LABEL or KEY!!!

The `.iloc[]` attribute selects ROWS based on `.index` INTEGER POSITION!!!
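
Here is a compact sketch that puts the two side by side, using the Team column and games back values from this video. The slicing part goes one step beyond what was shown above: a slice in `.loc[]` INCLUDES the stop label, while a slice in `.iloc[]` EXCLUDES the stop position, just like base Python.

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Pirates', 'Reds', 'Cubs', 'Cardinals', 'Brewers']},
                  index=[31.5, 27.5, 22.5, 0.0, 7.5])

# .loc[] looks up the value stored in .index
by_label = df.loc[0.0, 'Team']
# .iloc[] counts rows from the top, starting at 0
by_position = df.iloc[0, 0]
print(by_label, by_position)

# Sort the index first so that label slices are easy to reason about.
df_sorted = df.sort_index()

# Label slice: BOTH endpoints are included (0.0, 7.5, 22.5).
label_slice = df_sorted.loc[0.0:22.5]
# Position slice: the stop position is excluded (positions 0 and 1).
position_slice = df_sorted.iloc[0:2]
print(len(label_slice), len(position_slice))
```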

## Resetting Index

I do not like the `.index`. I prefer it to be a REGULAR column if it is storing MEANINGFUL VALUES!!!

In [25]:
baseball_df

Unnamed: 0,City,Team,Division,League
31.5,Pittsburgh,Pirates,Central,NL
27.5,Cincinatti,Reds,Central,NL
22.5,Chicago,Cubs,Central,NL
0.0,St. Louis,Cardinals,Central,NL
7.5,Milwaukee,Brewers,Central,NL


Let's sort by the `.index`!

In [26]:
baseball_df.sort_index()

Unnamed: 0,City,Team,Division,League
0.0,St. Louis,Cardinals,Central,NL
7.5,Milwaukee,Brewers,Central,NL
22.5,Chicago,Cubs,Central,NL
27.5,Cincinatti,Reds,Central,NL
31.5,Pittsburgh,Pirates,Central,NL


In [27]:
baseball_df

Unnamed: 0,City,Team,Division,League
31.5,Pittsburgh,Pirates,Central,NL
27.5,Cincinatti,Reds,Central,NL
22.5,Chicago,Cubs,Central,NL
0.0,St. Louis,Cardinals,Central,NL
7.5,Milwaukee,Brewers,Central,NL


In [28]:
baseball_df.sort_index( inplace=True )

In [29]:
baseball_df

Unnamed: 0,City,Team,Division,League
0.0,St. Louis,Cardinals,Central,NL
7.5,Milwaukee,Brewers,Central,NL
22.5,Chicago,Cubs,Central,NL
27.5,Cincinatti,Reds,Central,NL
31.5,Pittsburgh,Pirates,Central,NL


We can PULL OUT the `.index` attribute into a REGULAR column using the `.reset_index()` method!

In [30]:
baseball_df.reset_index()

Unnamed: 0,index,City,Team,Division,League
0,0.0,St. Louis,Cardinals,Central,NL
1,7.5,Milwaukee,Brewers,Central,NL
2,22.5,Chicago,Cubs,Central,NL
3,27.5,Cincinatti,Reds,Central,NL
4,31.5,Pittsburgh,Pirates,Central,NL


In [31]:
baseball_df.reset_index().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   index     5 non-null      float64
 1   City      5 non-null      object 
 2   Team      5 non-null      object 
 3   Division  5 non-null      object 
 4   League    5 non-null      object 
dtypes: float64(1), object(4)
memory usage: 328.0+ bytes


If...you do NOT want to KEEP the values within the `.index` attribute when you reset...you can DROP them!

In [33]:
baseball_df.reset_index(drop=True)

Unnamed: 0,City,Team,Division,League
0,St. Louis,Cardinals,Central,NL
1,Milwaukee,Brewers,Central,NL
2,Chicago,Cubs,Central,NL
3,Cincinatti,Reds,Central,NL
4,Pittsburgh,Pirates,Central,NL


But...if you want to KEEP the `.index` attribute values...the DEFAULT name of the new column `index` is VERY vague!

In [34]:
baseball_df.reset_index()

Unnamed: 0,index,City,Team,Division,League
0,0.0,St. Louis,Cardinals,Central,NL
1,7.5,Milwaukee,Brewers,Central,NL
2,22.5,Chicago,Cubs,Central,NL
3,27.5,Cincinatti,Reds,Central,NL
4,31.5,Pittsburgh,Pirates,Central,NL


We can RENAME a column using the `.rename()` method!!!!

To rename the columns in Pandas DataFrames...we need to use a DICTIONARY within the `columns` argument of the `.rename()` method!!!!

The KEY is the original column name and the VALUE is the NEW or desired column name!

In [35]:
baseball_df.reset_index().rename(columns={'index': 'games_back'})

Unnamed: 0,games_back,City,Team,Division,League
0,0.0,St. Louis,Cardinals,Central,NL
1,7.5,Milwaukee,Brewers,Central,NL
2,22.5,Chicago,Cubs,Central,NL
3,27.5,Cincinatti,Reds,Central,NL
4,31.5,Pittsburgh,Pirates,Central,NL
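
Another option, not used in this video, is to NAME the index before resetting it. The sketch below assumes the `.rename_axis()` method for that step and uses a shortened version of the baseball DataFrame. The new column then takes the index name instead of the vague default `index`.

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Cardinals', 'Brewers', 'Cubs']},
                  index=[0.0, 7.5, 22.5])

# Give the index a name, then pull it out into a regular column.
named = df.rename_axis('games_back').reset_index()
print(list(named.columns))

# The original DataFrame is unchanged.
print(list(df.columns))
```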


In [36]:
baseball_df

Unnamed: 0,City,Team,Division,League
0.0,St. Louis,Cardinals,Central,NL
7.5,Milwaukee,Brewers,Central,NL
22.5,Chicago,Cubs,Central,NL
27.5,Cincinatti,Reds,Central,NL
31.5,Pittsburgh,Pirates,Central,NL


Sometimes we want to place each step in a PIPELINE or WORKFLOW of actions on a separate line.

This makes the code easier to read especially when there are MANY, MANY actions in the WORKFLOW!

In [41]:
baseball_df_b = (
    baseball_df
    .reset_index()
    .rename(columns={'index': 'games_back'})
    .copy()
)

In [42]:
baseball_df_b

Unnamed: 0,games_back,City,Team,Division,League
0,0.0,St. Louis,Cardinals,Central,NL
1,7.5,Milwaukee,Brewers,Central,NL
2,22.5,Chicago,Cubs,Central,NL
3,27.5,Cincinatti,Reds,Central,NL
4,31.5,Pittsburgh,Pirates,Central,NL


In [43]:
baseball_df

Unnamed: 0,City,Team,Division,League
0.0,St. Louis,Cardinals,Central,NL
7.5,Milwaukee,Brewers,Central,NL
22.5,Chicago,Cubs,Central,NL
27.5,Cincinatti,Reds,Central,NL
31.5,Pittsburgh,Pirates,Central,NL


## Index upon creation

The `.index` attribute can be defined when the DataFrame is created.

In [44]:
baseball_df_c = pd.DataFrame( data = baseball_dict,
                              index = [31.5, 27.5, 22.5, 0, 7.5],
                              columns = ['League', 'Division', 'City', 'Team'])

In [45]:
baseball_df_c

Unnamed: 0,League,Division,City,Team
31.5,NL,Central,Pittsburgh,Pirates
27.5,NL,Central,Cincinatti,Reds
22.5,NL,Central,Chicago,Cubs
0.0,NL,Central,St. Louis,Cardinals
7.5,NL,Central,Milwaukee,Brewers
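
We can also go the other direction AFTER the DataFrame is created. As a hedged sketch (the `.set_index()` method is not covered in this video), an existing column can be moved into the `.index`, and `.reset_index()` moves it back out.

```python
import pandas as pd

df = pd.DataFrame({'games_back': [31.5, 27.5, 22.5, 0.0, 7.5],
                   'Team': ['Pirates', 'Reds', 'Cubs', 'Cardinals', 'Brewers']})

# Move the games_back column into the .index attribute.
indexed = df.set_index('games_back')
print(indexed.index.tolist())
print(indexed.loc[0.0, 'Team'])

# .reset_index() undoes the change and restores the regular column.
restored = indexed.reset_index()
print(restored.equals(df))
```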


## Summary

The `.loc[]` and `.iloc[]` attributes allow us to select rows based on the `.index` attribute!!!!

We can also reset the index using the `.reset_index()` method.

We can change column names using the `.rename()` method.

# Columns and Rows in the Pandas DataFrame

We have seen how to create the DataFrame. We have worked with the `.index` attribute. We now need to practice selecting columns and rows within the DataFrame.

## Import Modules

In [1]:
import numpy as np
import pandas as pd

## Create example DataFrame

We will make the Dictionary about baseball and then convert that Dictionary to a DataFrame.

In [2]:
baseball_dict = {'City': ['Pittsburgh', 'Cincinatti', 'Chicago', 'St. Louis', 'Milwaukee'],
                 'Team': ['Pirates', 'Reds', 'Cubs', 'Cardinals', 'Brewers'],
                 'Division': 5 * ['Central'],
                 'League': 5 * ['NL']}

In [3]:
baseball_dict

{'City': ['Pittsburgh', 'Cincinatti', 'Chicago', 'St. Louis', 'Milwaukee'],
 'Team': ['Pirates', 'Reds', 'Cubs', 'Cardinals', 'Brewers'],
 'Division': ['Central', 'Central', 'Central', 'Central', 'Central'],
 'League': ['NL', 'NL', 'NL', 'NL', 'NL']}

In [4]:
baseball_df = pd.DataFrame( baseball_dict,
                            columns=['League', 'Division', 'City', 'Team'])

In [5]:
baseball_df

Unnamed: 0,League,Division,City,Team
0,NL,Central,Pittsburgh,Pirates
1,NL,Central,Cincinatti,Reds
2,NL,Central,Chicago,Cubs
3,NL,Central,St. Louis,Cardinals
4,NL,Central,Milwaukee,Brewers


In [6]:
baseball_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   League    5 non-null      object
 1   Division  5 non-null      object
 2   City      5 non-null      object
 3   Team      5 non-null      object
dtypes: object(4)
memory usage: 288.0+ bytes


## Columns

### Selecting columns

In [7]:
baseball_df[ 'Team' ]

0      Pirates
1         Reds
2         Cubs
3    Cardinals
4      Brewers
Name: Team, dtype: object

In [8]:
baseball_df[ 'Division' ]

0    Central
1    Central
2    Central
3    Central
4    Central
Name: Division, dtype: object

In [9]:
type( baseball_df[ 'Division' ] )

pandas.core.series.Series

In [10]:
baseball_df[ 'Division' ].index

RangeIndex(start=0, stop=5, step=1)

In [11]:
baseball_df[ 'Team' ].index

RangeIndex(start=0, stop=5, step=1)

There is another way to access or select a single Column and return it as a Series.

The previous way is the BRACKET notation.

But this other way is known as the DOT NOTATION!

In [12]:
baseball_df.Team

0      Pirates
1         Reds
2         Cubs
3    Cardinals
4      Brewers
Name: Team, dtype: object

In [13]:
baseball_df.Division

0    Central
1    Central
2    Central
3    Central
4    Central
Name: Division, dtype: object

In [14]:
baseball_df.City

0    Pittsburgh
1    Cincinatti
2       Chicago
3     St. Louis
4     Milwaukee
Name: City, dtype: object
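
The DOT NOTATION is convenient, but it does not work for every column name. The sketch below uses made-up column names (`count` and `home city` are assumptions for this illustration) to show two cases where only the BRACKET notation reaches the column: a name that clashes with a DataFrame method, and a name that contains a space.

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Pirates', 'Reds'],
                   'count': [1, 2],                             # clashes with the DataFrame.count() method
                   'home city': ['Pittsburgh', 'Cincinnati']})  # contains a space

# Dot notation finds the METHOD here, not the column.
print(callable(df.count))

# Bracket notation always returns the column.
print(df['count'].tolist())
print(df['home city'].tolist())

# For an ordinary name, both notations return the same Series.
print(df.Team.equals(df['Team']))
```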

Still another approach is to use the FORMAL `.loc[]` attribute for selecting the COLUMN.

In [15]:
baseball_df.loc[ :, 'Team' ]

0      Pirates
1         Reds
2         Cubs
3    Cardinals
4      Brewers
Name: Team, dtype: object

In [16]:
baseball_df.loc[ :, 'Division' ]

0    Central
1    Central
2    Central
3    Central
4    Central
Name: Division, dtype: object

In [17]:
baseball_df.loc[ :, 'City' ]

0    Pittsburgh
1    Cincinatti
2       Chicago
3     St. Louis
4     Milwaukee
Name: City, dtype: object

In [18]:
type( baseball_df.loc[ :, 'City'] )

pandas.core.series.Series

In [19]:
baseball_df.iloc[ :, 0 ]

0    NL
1    NL
2    NL
3    NL
4    NL
Name: League, dtype: object

In [20]:
baseball_df.iloc[ :, -1 ]

0      Pirates
1         Reds
2         Cubs
3    Cardinals
4      Brewers
Name: Team, dtype: object

In [21]:
baseball_df.loc[:, 'Team' ]

0      Pirates
1         Reds
2         Cubs
3    Cardinals
4      Brewers
Name: Team, dtype: object

In [22]:
baseball_df.columns

Index(['League', 'Division', 'City', 'Team'], dtype='object')

However, be very careful...with how you ENTER or TYPE the column names! Wrapping a name in a LIST changes what is returned!

In [23]:
baseball_df.loc[ :, ['Team'] ]

Unnamed: 0,Team
0,Pirates
1,Reds
2,Cubs
3,Cardinals
4,Brewers


In [24]:
baseball_df

Unnamed: 0,League,Division,City,Team
0,NL,Central,Pittsburgh,Pirates
1,NL,Central,Cincinatti,Reds
2,NL,Central,Chicago,Cubs
3,NL,Central,St. Louis,Cardinals
4,NL,Central,Milwaukee,Brewers


In [25]:
type( baseball_df.loc[ :, ['Team'] ] )

pandas.core.frame.DataFrame

In [26]:
type( baseball_df.loc[ :, 'Team' ] )

pandas.core.series.Series

The returned data type differs because a LIST of column names always returns a DataFrame, even when the list holds a single name. We MUST use a LIST to select MULTIPLE COLUMNS!!!!
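
A quick sketch of the same idea on a two-row DataFrame (shortened from the baseball example). Checking `.shape` makes the difference concrete: the Series is 1D, the single-column DataFrame is 2D.

```python
import pandas as pd

df = pd.DataFrame({'City': ['Pittsburgh', 'Chicago'],
                   'Team': ['Pirates', 'Cubs']})

as_series = df.loc[:, 'Team']     # a string returns a 1D Series
as_frame = df.loc[:, ['Team']]    # a list returns a 2D DataFrame with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)

# The plain bracket notation follows the same rule.
print(type(df[['Team']]).__name__)
```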

In [27]:
baseball_df

Unnamed: 0,League,Division,City,Team
0,NL,Central,Pittsburgh,Pirates
1,NL,Central,Cincinatti,Reds
2,NL,Central,Chicago,Cubs
3,NL,Central,St. Louis,Cardinals
4,NL,Central,Milwaukee,Brewers


In [28]:
baseball_df.loc[ :, ['City', 'Team'] ]

Unnamed: 0,City,Team
0,Pittsburgh,Pirates
1,Cincinatti,Reds
2,Chicago,Cubs
3,St. Louis,Cardinals
4,Milwaukee,Brewers


In [29]:
baseball_df.loc[ :, ['Team', 'City'] ]

Unnamed: 0,Team,City
0,Pirates,Pittsburgh
1,Reds,Cincinatti
2,Cubs,Chicago
3,Cardinals,St. Louis
4,Brewers,Milwaukee


In [30]:
baseball_df.loc[ :, ['League', 'Team'] ]

Unnamed: 0,League,Team
0,NL,Pirates
1,NL,Reds
2,NL,Cubs
3,NL,Cardinals
4,NL,Brewers


In [32]:
baseball_df.loc[ :, baseball_df.columns[1:] ]

Unnamed: 0,Division,City,Team
0,Central,Pittsburgh,Pirates
1,Central,Cincinatti,Reds
2,Central,Chicago,Cubs
3,Central,St. Louis,Cardinals
4,Central,Milwaukee,Brewers


We need a LIST of names, such as the base Python LIST or a slice of the `.columns` attribute, to SELECT multiple columns from the DataFrame!!!!

### Adding and Deleting Columns

Adding columns is similar to how we added new KEY/VALUE pairs to Dictionaries.

The VALUE will be assigned to a KEY. The KEY is the NEW column Name!!!!

Let's start with the VALUE that will be added.

In [33]:
baseball_df

Unnamed: 0,League,Division,City,Team
0,NL,Central,Pittsburgh,Pirates
1,NL,Central,Cincinatti,Reds
2,NL,Central,Chicago,Cubs
3,NL,Central,St. Louis,Cardinals
4,NL,Central,Milwaukee,Brewers


In [34]:
baseball_df['games_back'] = [31.5, 27.5, 22.5, 0, 7.5]

In [35]:
baseball_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   League      5 non-null      object 
 1   Division    5 non-null      object 
 2   City        5 non-null      object 
 3   Team        5 non-null      object 
 4   games_back  5 non-null      float64
dtypes: float64(1), object(4)
memory usage: 328.0+ bytes


In [36]:
baseball_df

Unnamed: 0,League,Division,City,Team,games_back
0,NL,Central,Pittsburgh,Pirates,31.5
1,NL,Central,Cincinatti,Reds,27.5
2,NL,Central,Chicago,Cubs,22.5
3,NL,Central,St. Louis,Cardinals,0.0
4,NL,Central,Milwaukee,Brewers,7.5


In [37]:
baseball_df.sort_values(['games_back'])

Unnamed: 0,League,Division,City,Team,games_back
3,NL,Central,St. Louis,Cardinals,0.0
4,NL,Central,Milwaukee,Brewers,7.5
2,NL,Central,Chicago,Cubs,22.5
1,NL,Central,Cincinatti,Reds,27.5
0,NL,Central,Pittsburgh,Pirates,31.5


Force the sort to modify the DataFrame in place.

In [38]:
baseball_df.sort_values(['games_back'], inplace=True)

In [40]:
baseball_df

Unnamed: 0,League,Division,City,Team,games_back
3,NL,Central,St. Louis,Cardinals,0.0
4,NL,Central,Milwaukee,Brewers,7.5
2,NL,Central,Chicago,Cubs,22.5
1,NL,Central,Cincinatti,Reds,27.5
0,NL,Central,Pittsburgh,Pirates,31.5


Let's add another column which contains the number of WINS per team.

This new column will be assigned a Pandas Series rather than a List.

In [41]:
baseball_df['wins'] = pd.Series([87, 79, 64, 59, 55],
                                index=baseball_df.index)

In [42]:
baseball_df

Unnamed: 0,League,Division,City,Team,games_back,wins
3,NL,Central,St. Louis,Cardinals,0.0,87
4,NL,Central,Milwaukee,Brewers,7.5,79
2,NL,Central,Chicago,Cubs,22.5,64
1,NL,Central,Cincinatti,Reds,27.5,59
0,NL,Central,Pittsburgh,Pirates,31.5,55


In [43]:
baseball_df.wins

3    87
4    79
2    64
1    59
0    55
Name: wins, dtype: int64

In [44]:
baseball_df.Team

3    Cardinals
4      Brewers
2         Cubs
1         Reds
0      Pirates
Name: Team, dtype: object

Let's add the number of losses per team.

In [45]:
baseball_df['losses'] = pd.Series([63, 70, 85, 90, 94],
                                  index=baseball_df.index)

In [46]:
baseball_df

Unnamed: 0,League,Division,City,Team,games_back,wins,losses
3,NL,Central,St. Louis,Cardinals,0.0,87,63
4,NL,Central,Milwaukee,Brewers,7.5,79,70
2,NL,Central,Chicago,Cubs,22.5,64,85
1,NL,Central,Cincinatti,Reds,27.5,59,90
0,NL,Central,Pittsburgh,Pirates,31.5,55,94
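
Why did we pass `index=baseball_df.index` when creating those Series? When a Series is assigned to a column, Pandas lines up the values by `.index` LABEL, not by row position. The sketch below rebuilds a smaller version of the sorted DataFrame (the column names `wins_by_label` and `wins_by_position` are made up for this illustration) to show what happens with and without the matching index.

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Pirates', 'Reds', 'Cubs', 'Cardinals', 'Brewers'],
                   'games_back': [31.5, 27.5, 22.5, 0.0, 7.5]})
df = df.sort_values('games_back')   # the index is now [3, 4, 2, 1, 0]

wins = [87, 79, 64, 59, 55]         # listed in the SORTED row order

# Default index 0..4: each value is matched to the row with the same LABEL.
df['wins_by_label'] = pd.Series(wins)

# Reusing df.index: each value lands in the displayed row order.
df['wins_by_position'] = pd.Series(wins, index=df.index)

print(df[['Team', 'wins_by_label', 'wins_by_position']])
```

The Cardinals row has the label 3, so without the matching index it receives the fourth value in the list rather than the first. A plain list does not have an index, so it is always assigned by row position.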


But what if we want to add a SCALAR or CONSTANT value to the DataFrame?

For example, I want a column named `season` to store the year the data came from.

At first, we might think we need to do the following to add the column:

In [47]:
pd.Series([2022, 2022, 2022, 2022, 2022], index=baseball_df.index)

3    2022
4    2022
2    2022
1    2022
0    2022
dtype: int64

In [48]:
pd.Series( 5 * [2022], index = baseball_df.index )

3    2022
4    2022
2    2022
1    2022
0    2022
dtype: int64

But we do not need to build the Series ourselves! When we assign a SCALAR, Pandas will replicate the constant value down all rows of the new column!!!

In [49]:
baseball_df['season'] = 2022

In [50]:
baseball_df

Unnamed: 0,League,Division,City,Team,games_back,wins,losses,season
3,NL,Central,St. Louis,Cardinals,0.0,87,63,2022
4,NL,Central,Milwaukee,Brewers,7.5,79,70,2022
2,NL,Central,Chicago,Cubs,22.5,64,85,2022
1,NL,Central,Cincinatti,Reds,27.5,59,90,2022
0,NL,Central,Pittsburgh,Pirates,31.5,55,94,2022


In [51]:
baseball_df.season.index

Int64Index([3, 4, 2, 1, 0], dtype='int64')

In [53]:
baseball_df.wins.index

Int64Index([3, 4, 2, 1, 0], dtype='int64')

### Deleting or DROPPING columns

We can remove columns through the `.drop()` method.

In [54]:
baseball_df.drop(columns=['season'])

Unnamed: 0,League,Division,City,Team,games_back,wins,losses
3,NL,Central,St. Louis,Cardinals,0.0,87,63
4,NL,Central,Milwaukee,Brewers,7.5,79,70
2,NL,Central,Chicago,Cubs,22.5,64,85
1,NL,Central,Cincinatti,Reds,27.5,59,90
0,NL,Central,Pittsburgh,Pirates,31.5,55,94


In [55]:
baseball_df.drop(columns=['wins', 'losses', 'season'])

Unnamed: 0,League,Division,City,Team,games_back
3,NL,Central,St. Louis,Cardinals,0.0
4,NL,Central,Milwaukee,Brewers,7.5
2,NL,Central,Chicago,Cubs,22.5
1,NL,Central,Cincinatti,Reds,27.5
0,NL,Central,Pittsburgh,Pirates,31.5


In [56]:
baseball_df.drop(columns=['City', 'Team'])

Unnamed: 0,League,Division,games_back,wins,losses,season
3,NL,Central,0.0,87,63,2022
4,NL,Central,7.5,79,70,2022
2,NL,Central,22.5,64,85,2022
1,NL,Central,27.5,59,90,2022
0,NL,Central,31.5,55,94,2022


In [57]:
baseball_df

Unnamed: 0,League,Division,City,Team,games_back,wins,losses,season
3,NL,Central,St. Louis,Cardinals,0.0,87,63,2022
4,NL,Central,Milwaukee,Brewers,7.5,79,70,2022
2,NL,Central,Chicago,Cubs,22.5,64,85,2022
1,NL,Central,Cincinatti,Reds,27.5,59,90,2022
0,NL,Central,Pittsburgh,Pirates,31.5,55,94,2022


To RETAIN or KEEP the resulting DataFrame that has FEWER columns, we would need to either set the `inplace` argument to True or assign the result to an object!!!
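
A minimal sketch of the assignment approach, using a two-row DataFrame shortened from the example above. The name `trimmed` is an assumption made for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Cardinals', 'Brewers'],
                   'wins': [87, 79],
                   'season': [2022, 2022]})

# .drop() returns a NEW DataFrame, so assign it to keep the result.
trimmed = df.drop(columns=['season'])

print(list(df.columns))       # the original still has all three columns
print(list(trimmed.columns))  # the new object has two
```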

## Selecting Rows

We know how to use the `.loc[]` and `.iloc[]` attributes to select rows based on the `.index` attribute.

In [58]:
baseball_df.loc[ 3 ]

League               NL
Division        Central
City          St. Louis
Team          Cardinals
games_back          0.0
wins                 87
losses               63
season             2022
Name: 3, dtype: object

In [59]:
baseball_df.iloc[ 3 ]

League                NL
Division         Central
City          Cincinatti
Team                Reds
games_back          27.5
wins                  59
losses                90
season              2022
Name: 1, dtype: object

But the more interesting way to SELECT or FILTER rows is based on CONDITIONS!!!!

We want to CONDITIONALLY SUBSET the rows!!!

We must identify or SELECT the column to apply the condition!

In [61]:
baseball_df.wins > 65

3     True
4     True
2    False
1    False
0    False
Name: wins, dtype: bool

In [62]:
baseball_df.loc[ baseball_df.wins > 65 ]

Unnamed: 0,League,Division,City,Team,games_back,wins,losses,season
3,NL,Central,St. Louis,Cardinals,0.0,87,63,2022
4,NL,Central,Milwaukee,Brewers,7.5,79,70,2022


In [63]:
baseball_df.loc[ baseball_df['wins'] > 65 ]

Unnamed: 0,League,Division,City,Team,games_back,wins,losses,season
3,NL,Central,St. Louis,Cardinals,0.0,87,63,2022
4,NL,Central,Milwaukee,Brewers,7.5,79,70,2022


In [64]:
baseball_df.loc[ baseball_df.wins > 65, : ]

Unnamed: 0,League,Division,City,Team,games_back,wins,losses,season
3,NL,Central,St. Louis,Cardinals,0.0,87,63,2022
4,NL,Central,Milwaukee,Brewers,7.5,79,70,2022


We can select a subset of the columns by providing a LIST of column names!

In [65]:
baseball_df.loc[ baseball_df.wins > 65, ['Team', 'wins', 'losses'] ]

Unnamed: 0,Team,wins,losses
3,Cardinals,87,63
4,Brewers,79,70


We can also FILTER by strings. For example, let's find the row where `Team == 'Brewers'`.

In [66]:
baseball_df.loc[ baseball_df.Team == 'Brewers', : ]

Unnamed: 0,League,Division,City,Team,games_back,wins,losses,season
4,NL,Central,Milwaukee,Brewers,7.5,79,70,2022


In [67]:
baseball_df.loc[ baseball_df.Team == 'Pirates', : ]

Unnamed: 0,League,Division,City,Team,games_back,wins,losses,season
0,NL,Central,Pittsburgh,Pirates,31.5,55,94,2022


If you want to match or FILTER based on MULTIPLE conditions, you need to wrap each condition in parentheses `()` and combine the conditions with `|` (OR) or `&` (AND).

For example, let's find Brewers OR the Cardinals.

In [68]:
baseball_df.loc[ (baseball_df.Team == 'Brewers') | (baseball_df.Team == 'Cardinals'), : ]

Unnamed: 0,League,Division,City,Team,games_back,wins,losses,season
3,NL,Central,St. Louis,Cardinals,0.0,87,63,2022
4,NL,Central,Milwaukee,Brewers,7.5,79,70,2022


We can also have an AND operation.

I want to find all rows where Team is equal to Cardinals AND wins is greater than 65.

In [69]:
baseball_df.loc[ (baseball_df.Team == 'Cardinals') & (baseball_df.wins > 65), : ]

Unnamed: 0,League,Division,City,Team,games_back,wins,losses,season
3,NL,Central,St. Louis,Cardinals,0.0,87,63,2022
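
The parentheses are required because `|` and `&` bind more tightly than `==` and `>` in Python, so without them the expression is grouped the wrong way. As a hedged sketch of an alternative that was not shown in this video, the `.isin()` method expresses an OR over several values of the SAME column without chaining conditions, and `~` negates a condition.

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Cardinals', 'Brewers', 'Cubs', 'Reds', 'Pirates'],
                   'wins': [87, 79, 64, 59, 55]})

# Two conditions joined with OR, each wrapped in parentheses.
either = df.loc[(df.Team == 'Brewers') | (df.Team == 'Cardinals'), :]

# The same filter written with .isin() and a list of values.
with_isin = df.loc[df.Team.isin(['Brewers', 'Cardinals']), :]
print(either.equals(with_isin))

# ~ flips True and False, which keeps every OTHER team.
not_those = df.loc[~df.Team.isin(['Brewers', 'Cardinals']), 'Team']
print(not_those.tolist())
```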


## Summary

We have seen how to select COLUMNS using BRACKET and DOT notation.

We have seen how to select MULTIPLE columns.

We have seen how to select ROWS based on `.index` attribute via `.loc[]` and `.iloc[]`.

We have seen how to select ROWS based on CONDITIONS (FILTERING rows).

We also saw how to ADD and REMOVE (DROP) COLUMNS.