## CMPINF 2100 Week 04

### Introduction to Pandas DataFrame

The previous video was all about Pandas Series. A Series contains values associated with a single variable.

The DataFrame is a COLLECTION of variables! Or, a collection of Pandas Series!
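To make the "collection of Series" idea concrete, here is a minimal sketch that places two Series side by side as columns and then pulls one back out. The names `city`, `team`, and `small_df` are made up for illustration, and `pd.concat` is just one of several ways to combine Series.

```python
import pandas as pd

# Two Series, each holding the values of a single variable
city = pd.Series(['Pittsburgh', 'Chicago'], name='City')
team = pd.Series(['Pirates', 'Cubs'], name='Team')

# Place the Series side by side as COLUMNS of one DataFrame
small_df = pd.concat([city, team], axis=1)

# Selecting one column gives back a Series
team_column = small_df['Team']
print(type(team_column))
```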

## Import Modules

In [1]:
import numpy as np
import pandas as pd

## NumPy to DataFrames

We previously learned that Pandas Series are kind of like 1D NumPy arrays.

Pandas DataFrame is kind of like a 2D NumPy array.

So let's create a DataFrame from a 2D NumPy array.

In [2]:
X = np.arange(1, 25).reshape( 6, -1 )

In [3]:
X.shape

(6, 4)

In [4]:
X.ndim

2

In [5]:
X.size

24
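The `-1` in `.reshape( 6, -1 )` asks NumPy to work out that dimension from the number of elements. A small sketch of that behavior follows; the names `values`, `six_rows`, and `eight_cols` are hypothetical.

```python
import numpy as np

values = np.arange(1, 25)           # 24 values: 1, 2, ..., 24

six_rows = values.reshape(6, -1)    # 6 rows requested, columns inferred
eight_cols = values.reshape(-1, 8)  # 8 columns requested, rows inferred

print(six_rows.shape)
print(eight_cols.shape)
```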

In [6]:
X[ 0 ]

array([1, 2, 3, 4])

In [7]:
X[ 1 ]

array([5, 6, 7, 8])

In [8]:
X[ -1 ]

array([21, 22, 23, 24])

In [9]:
X

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12],
       [13, 14, 15, 16],
       [17, 18, 19, 20],
       [21, 22, 23, 24]])

In [10]:
X[ :, 0 ]

array([ 1,  5,  9, 13, 17, 21])

In [11]:
X[ :, -1 ]

array([ 4,  8, 12, 16, 20, 24])

In [12]:
X[ 0, -1 ]

4

In [13]:
X[ -1, 0 ]

21

In [14]:
X[ :3, :2 ]

array([[ 1,  2],
       [ 5,  6],
       [ 9, 10]])

Let's convert `X` into a DataFrame!!

In [15]:
Xdf = pd.DataFrame( X )

In [16]:
%whos

Variable   Type         Data/Info
---------------------------------
X          ndarray      6x4: 24 elems, type `int32`, 96 bytes
Xdf        DataFrame        0   1   2   3\n0   1 <...>19  20\n5  21  22  23  24
np         module       <module 'numpy' from 'C:\<...>ges\\numpy\\__init__.py'>
pd         module       <module 'pandas' from 'C:<...>es\\pandas\\__init__.py'>


In [17]:
type( Xdf )

pandas.core.frame.DataFrame

In [18]:
print( Xdf )

    0   1   2   3
0   1   2   3   4
1   5   6   7   8
2   9  10  11  12
3  13  14  15  16
4  17  18  19  20
5  21  22  23  24


In [19]:
print( X )

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]
 [13 14 15 16]
 [17 18 19 20]
 [21 22 23 24]]


In [20]:
Xdf

    0   1   2   3
0   1   2   3   4
1   5   6   7   8
2   9  10  11  12
3  13  14  15  16
4  17  18  19  20
5  21  22  23  24


Providing a single index to the NumPy 2D array selects a ROW!

In [21]:
X[ 0 ]

array([1, 2, 3, 4])

Look what happens when we use a single index on the DataFrame! It selects the COLUMN with that label, NOT a row!

In [22]:
Xdf[ 0 ]

0     1
1     5
2     9
3    13
4    17
5    21
Name: 0, dtype: int32

In [23]:
X[ :, 0 ]

array([ 1,  5,  9, 13, 17, 21])

In [24]:
Xdf[ :, 0 ]

InvalidIndexError: (slice(None, None, None), 0)

In [25]:
Xdf[ 0, : ]

InvalidIndexError: (0, slice(None, None, None))

There is clearly something different about the Pandas DataFrame compared to the NumPy 2D array...even though both are TABLE-LIKE.

Both have 2 dimensions.

Both have rows and columns.

But we CANNOT interact with a Pandas DataFrame using syntax just like the NumPy 2D array!!!
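For reference, Pandas does offer position-based selection, just not through plain square brackets. The sketch below uses the `.iloc` indexer and the `.to_numpy()` method, neither of which has been introduced in this video, so treat it as a preview rather than part of this lesson.

```python
import numpy as np
import pandas as pd

X = np.arange(1, 25).reshape(6, -1)
Xdf = pd.DataFrame(X)

first_col = Xdf.iloc[:, 0]      # all rows, first column -> Series
first_row = Xdf.iloc[0, :]      # first row, all columns -> Series
corner = Xdf.iloc[:3, :2]       # first 3 rows, first 2 columns -> DataFrame
back_to_numpy = Xdf.to_numpy()  # recover a 2D NumPy array

print(first_col.tolist())
print(corner)
```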

Remember...the Pandas Series is like a list, dictionary, and 1D NumPy array!!!!

## Dictionary to DataFrame

Let's see the connection between a Dictionary and a DataFrame by converting a Dictionary into a DataFrame!

Remember that the Dictionary has KEY/VALUE pairs to define each ITEM!

In most of our previous examples, the KEY was associated with a SINGLE valued VALUE. But now, we will use a MULTI-VALUED or MULTI-ENTRY VALUE per KEY!

In [26]:
baseball_dict = {'City': ['Pittsburgh', 'Cincinnati', 'Chicago', 'St. Louis', 'Milwaukee'],
                 'Team': ['Pirates', 'Reds', 'Cubs', 'Cardinals', 'Brewers'],
                 'Division': 5 * ['Central'],
                 'League': 5 * ['NL']}

In [27]:
baseball_dict

{'City': ['Pittsburgh', 'Cincinnati', 'Chicago', 'St. Louis', 'Milwaukee'],
 'Team': ['Pirates', 'Reds', 'Cubs', 'Cardinals', 'Brewers'],
 'Division': ['Central', 'Central', 'Central', 'Central', 'Central'],
 'League': ['NL', 'NL', 'NL', 'NL', 'NL']}

In [28]:
baseball_dict['City']

['Pittsburgh', 'Cincinnati', 'Chicago', 'St. Louis', 'Milwaukee']

In [29]:
baseball_dict['Team']

['Pirates', 'Reds', 'Cubs', 'Cardinals', 'Brewers']

Convert the dictionary into a DataFrame!!!

In [30]:
baseball_df = pd.DataFrame( baseball_dict )

In [31]:
baseball_df

         City       Team Division League
0  Pittsburgh    Pirates  Central     NL
1  Cincinnati       Reds  Central     NL
2     Chicago       Cubs  Central     NL
3   St. Louis  Cardinals  Central     NL
4   Milwaukee    Brewers  Central     NL


The Dictionary KEYS become COLUMN NAMES!!!!!

In [32]:
baseball_df.columns

Index(['City', 'Team', 'Division', 'League'], dtype='object')

In [34]:
baseball_dict.keys()

dict_keys(['City', 'Team', 'Division', 'League'])

The dictionary VALUEs become the entries within the ROWS for each COLUMN!

In [35]:
baseball_df.index

RangeIndex(start=0, stop=5, step=1)
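Because each VALUE fills the rows of one column, the lists need to have the same number of entries. The sketch below illustrates this with two small made-up dictionaries (`good_dict` and `bad_dict` are hypothetical names); the exact wording of the error message may differ between Pandas versions.

```python
import pandas as pd

# Every KEY maps to a list with the SAME number of entries
good_dict = {'Team': ['Pirates', 'Reds', 'Cubs'],
             'League': 3 * ['NL']}
good_df = pd.DataFrame(good_dict)

# 'League' has only 2 entries here, so the columns cannot line up
bad_dict = {'Team': ['Pirates', 'Reds', 'Cubs'],
            'League': ['NL', 'NL']}
try:
    pd.DataFrame(bad_dict)
    lengths_ok = True
except ValueError as err:
    lengths_ok = False
    print('Could not build the DataFrame:', err)
```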

## DataFrame attributes

In [36]:
Xdf.index

RangeIndex(start=0, stop=6, step=1)

In [37]:
baseball_df.index

RangeIndex(start=0, stop=5, step=1)

In [38]:
Xdf.columns

RangeIndex(start=0, stop=4, step=1)

In [39]:
baseball_df.columns

Index(['City', 'Team', 'Division', 'League'], dtype='object')
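The `RangeIndex` columns of `Xdf` reflect that a NumPy array carries no column names, so Pandas falls back to 0, 1, 2, 3. As a sketch, names can be supplied through the `columns` argument of `pd.DataFrame`; the letters used here are arbitrary placeholders.

```python
import numpy as np
import pandas as pd

X = np.arange(1, 25).reshape(6, -1)

# Hypothetical column names, one per column of X
Xdf_named = pd.DataFrame(X, columns=['a', 'b', 'c', 'd'])

print(Xdf_named.columns)
print(Xdf_named['a'])   # select a column by its NAME, like a dictionary KEY
```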

In [40]:
Xdf.shape

(6, 4)

In [41]:
baseball_df.shape

(5, 4)

In [42]:
type( Xdf )

pandas.core.frame.DataFrame

In [43]:
type( baseball_df )

pandas.core.frame.DataFrame

But...we are REALLY interested in the DATA TYPE associated with each COLUMN contained within the DataFrame!!!

In [44]:
Xdf.dtypes

0    int32
1    int32
2    int32
3    int32
dtype: object

In [45]:
baseball_df.dtypes

City        object
Team        object
Division    object
League      object
dtype: object

In [47]:
type( baseball_df.dtypes )

pandas.core.series.Series

In [48]:
baseball_df.dtypes.index

Index(['City', 'Team', 'Division', 'League'], dtype='object')

In [49]:
baseball_df.columns

Index(['City', 'Team', 'Division', 'League'], dtype='object')
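Since `.dtypes` is a Series indexed by column name, a single column's data type can be looked up by name. The sketch below uses a small made-up DataFrame (`mixed_df`, with placeholder values) whose columns have different data types; the exact integer and text dtype names printed can depend on the platform and Pandas version.

```python
import pandas as pd

# Hypothetical mixed-type data: text, integers, and floats
mixed_df = pd.DataFrame({'name': ['a', 'b'],
                         'count': [1, 2],
                         'ratio': [0.5, 1.5]})

col_types = mixed_df.dtypes
print(col_types)
print(col_types['ratio'])   # look up one column's dtype by column name
```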

## DataFrame methods

We will only show a few in this video. Next week is really dedicated to DataFrame methods!!!!!!

In [50]:
Xdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       6 non-null      int32
 1   1       6 non-null      int32
 2   2       6 non-null      int32
 3   3       6 non-null      int32
dtypes: int32(4)
memory usage: 224.0 bytes


In [51]:
baseball_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   City      5 non-null      object
 1   Team      5 non-null      object
 2   Division  5 non-null      object
 3   League    5 non-null      object
dtypes: object(4)
memory usage: 288.0+ bytes


In [52]:
Xdf.describe()

              0         1         2         3
count       6.0       6.0       6.0       6.0
mean       11.0      12.0      13.0      14.0
std    7.483315  7.483315  7.483315  7.483315
min         1.0       2.0       3.0       4.0
25%         6.0       7.0       8.0       9.0
50%        11.0      12.0      13.0      14.0
75%        16.0      17.0      18.0      19.0
max        21.0      22.0      23.0      24.0


In [55]:
X.mean(axis=0)

array([11., 12., 13., 14.])

In [56]:
X.std(axis=0, ddof=1)

array([7.48331477, 7.48331477, 7.48331477, 7.48331477])
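The `ddof=1` in the NumPy call matters: Pandas computes the sample standard deviation (`ddof=1`) by default, while NumPy defaults to `ddof=0`. A short sketch of the comparison, with hypothetical variable names:

```python
import numpy as np
import pandas as pd

X = np.arange(1, 25).reshape(6, -1)
Xdf = pd.DataFrame(X)

pandas_std = Xdf.std().to_numpy()     # Pandas default: ddof=1
numpy_default = X.std(axis=0)         # NumPy default:  ddof=0
numpy_sample = X.std(axis=0, ddof=1)  # ask NumPy for the sample version

print(np.allclose(pandas_std, numpy_sample))
print(np.allclose(pandas_std, numpy_default))
```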

In [57]:
baseball_df.describe()

              City     Team Division League
count            5        5        5      5
unique           5        5        1      1
top     Pittsburgh  Pirates  Central     NL
freq             1        1        5      5


In [58]:
baseball_df

         City       Team Division League
0  Pittsburgh    Pirates  Central     NL
1  Cincinnati       Reds  Central     NL
2     Chicago       Cubs  Central     NL
3   St. Louis  Cardinals  Central     NL
4   Milwaukee    Brewers  Central     NL


There are many more methods but we will not show most in this video.

Instead, let's consider how to SORT or ORDER or ARRANGE the DataFrame.

In [59]:
baseball_df

         City       Team Division League
0  Pittsburgh    Pirates  Central     NL
1  Cincinnati       Reds  Central     NL
2     Chicago       Cubs  Central     NL
3   St. Louis  Cardinals  Central     NL
4   Milwaukee    Brewers  Central     NL


In [60]:
baseball_df.sort_values(['Team'])

         City       Team Division League
4   Milwaukee    Brewers  Central     NL
3   St. Louis  Cardinals  Central     NL
2     Chicago       Cubs  Central     NL
0  Pittsburgh    Pirates  Central     NL
1  Cincinnati       Reds  Central     NL


In [61]:
baseball_df

         City       Team Division League
0  Pittsburgh    Pirates  Central     NL
1  Cincinnati       Reds  Central     NL
2     Chicago       Cubs  Central     NL
3   St. Louis  Cardinals  Central     NL
4   Milwaukee    Brewers  Central     NL



Most Pandas methods do NOT modify in place!!!!!

In [63]:
baseball_df.sort_values(['Team'])

         City       Team Division League
4   Milwaukee    Brewers  Central     NL
3   St. Louis  Cardinals  Central     NL
2     Chicago       Cubs  Central     NL
0  Pittsburgh    Pirates  Central     NL
1  Cincinnati       Reds  Central     NL


Sometimes, we do not want to retain the original `.index` labels after sorting!

Setting `ignore_index=True` IGNORES the original `.index` when we SORT and relabels the sorted rows starting from 0!

In [64]:
baseball_df.sort_values( ['Team'], ignore_index=True )

         City       Team Division League
0   Milwaukee    Brewers  Central     NL
1   St. Louis  Cardinals  Central     NL
2     Chicago       Cubs  Central     NL
3  Pittsburgh    Pirates  Central     NL
4  Cincinnati       Reds  Central     NL
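As a sketch of what `ignore_index=True` is doing, the same result can be reached by sorting first and then resetting the index with `.reset_index(drop=True)`. That method is not covered in this video, and the small `teams` DataFrame below is a cut-down stand-in for `baseball_df`.

```python
import pandas as pd

teams = pd.DataFrame({'City': ['Pittsburgh', 'Cincinnati', 'Chicago'],
                      'Team': ['Pirates', 'Reds', 'Cubs']})

sorted_keep = teams.sort_values(['Team'])                       # index: 2, 0, 1
sorted_ignore = teams.sort_values(['Team'], ignore_index=True)  # index: 0, 1, 2
sorted_reset = teams.sort_values(['Team']).reset_index(drop=True)

print(sorted_keep.index.tolist())
print(sorted_ignore.index.tolist())
print(sorted_ignore.equals(sorted_reset))
```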


We could also sort in DESCENDING order.

In [67]:
baseball_df.sort_values( ['Team'], ignore_index=True, ascending=False)

         City       Team Division League
0  Cincinnati       Reds  Central     NL
1  Pittsburgh    Pirates  Central     NL
2     Chicago       Cubs  Central     NL
3   St. Louis  Cardinals  Central     NL
4   Milwaukee    Brewers  Central     NL
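The column names are passed to `.sort_values()` as a list, which hints that more than one column can be used. The sketch below sorts a made-up DataFrame (`demo_df`, with placeholder values) by two columns, giving `ascending` a list with one entry per column.

```python
import pandas as pd

# Hypothetical data with repeated values in the 'group' column
demo_df = pd.DataFrame({'group': ['b', 'a', 'b', 'a'],
                        'value': [1, 2, 3, 4]})

# Sort by 'group' ascending, then by 'value' descending within each group
demo_sorted = demo_df.sort_values(['group', 'value'], ascending=[True, False])

print(demo_sorted)
```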


The result is NOT stored. So to store or KEEP the sorting...we need to assign the result to a NEW object.

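When I assign a result to a new object, I like to force the DEEP COPY! (`.sort_values()` already returns a new DataFrame, so the `.copy()` here is a defensive habit rather than a requirement.)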

In [68]:
baseball_df_b = baseball_df.sort_values( ['Team'], ignore_index=True, ascending=False).copy()

In [69]:
baseball_df_b

         City       Team Division League
0  Cincinnati       Reds  Central     NL
1  Pittsburgh    Pirates  Central     NL
2     Chicago       Cubs  Central     NL
3   St. Louis  Cardinals  Central     NL
4   Milwaukee    Brewers  Central     NL
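A quick way to confirm that assigning to a new object leaves the original alone is to compare the two after sorting. This is a minimal sketch with a small stand-in DataFrame named `teams`.

```python
import pandas as pd

teams = pd.DataFrame({'Team': ['Pirates', 'Reds', 'Cubs']})

# Store the sorted result under a NEW name
teams_sorted = teams.sort_values(['Team'], ignore_index=True)

print(teams_sorted['Team'].tolist())   # sorted order
print(teams['Team'].tolist())          # original order is untouched
print(teams_sorted is teams)           # two different objects
```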


Alternatively, we CAN force Pandas to MODIFY in place using the `inplace` argument!

In [70]:
baseball_df.sort_values( ['Team'], inplace=True )

In [71]:
baseball_df

         City       Team Division League
4   Milwaukee    Brewers  Central     NL
3   St. Louis  Cardinals  Central     NL
2     Chicago       Cubs  Central     NL
0  Pittsburgh    Pirates  Central     NL
1  Cincinnati       Reds  Central     NL


In [72]:
baseball_df_b

         City       Team Division League
0  Cincinnati       Reds  Central     NL
1  Pittsburgh    Pirates  Central     NL
2     Chicago       Cubs  Central     NL
3   St. Louis  Cardinals  Central     NL
4   Milwaukee    Brewers  Central     NL
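One thing to watch for with `inplace=True`: the method modifies the DataFrame and returns `None`, so the result should NOT be assigned back to a name. A small sketch, again with a stand-in `teams` DataFrame:

```python
import pandas as pd

teams = pd.DataFrame({'Team': ['Pirates', 'Reds', 'Cubs']})

# With inplace=True the DataFrame itself is reordered...
result = teams.sort_values(['Team'], inplace=True)

# ...and the method hands back None instead of a DataFrame
print(result)
print(teams['Team'].tolist())
```

Writing `teams = teams.sort_values(['Team'], inplace=True)` would therefore replace the DataFrame with `None`.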


## Summary

This video focused on CREATING DataFrames from NumPy and from Dictionaries.

Creating a DataFrame from a NumPy 2D array highlights the fact that DataFrames are 2D objects - they have 2 dimensions - ROWS AND COLUMNS!!!!!

Creating a DataFrame from a Dictionary highlights that each KEY becomes a COLUMN NAME!!!!

We need to think of Pandas DataFrames as a combination of the NumPy 2D array and the Python Dictionary!!!!!