In [67]:
import pandas as pd

### The file 'game_plays.csv' is unnecessarily large and not easily added to GitHub, so I used the following code to trim it down.

First, I took a look at the initial shape:

In [68]:
df = pd.read_csv('game_plays.csv')
df.shape

(5050529, 18)

5050529 rows and 18 columns! To see what would be easy to filter on for this project, I look at the columns and their datatypes below.

In [69]:
print(df.columns)
print(df.dtypes)
df.head()

Index(['play_id', 'game_id', 'team_id_for', 'team_id_against', 'event',
       'secondaryType', 'x', 'y', 'period', 'periodType', 'periodTime',
       'periodTimeRemaining', 'dateTime', 'goals_away', 'goals_home',
       'description', 'st_x', 'st_y'],
      dtype='object')
play_id                 object
game_id                  int64
team_id_for            float64
team_id_against        float64
event                   object
secondaryType           object
x                      float64
y                      float64
period                   int64
periodType              object
periodTime               int64
periodTimeRemaining    float64
dateTime                object
goals_away               int64
goals_home               int64
description             object
st_x                   float64
st_y                   float64
dtype: object


,play_id,game_id,team_id_for,team_id_against,event,secondaryType,x,y,period,periodType,periodTime,periodTimeRemaining,dateTime,goals_away,goals_home,description,st_x,st_y
0,2016020045_1,2016020045,,,Game Scheduled,,,,1,REGULAR,0,1200.0,2016-10-18 23:40:58,0,0,Game Scheduled,,
1,2016020045_2,2016020045,,,Period Ready,,,,1,REGULAR,0,1200.0,2016-10-19 01:35:28,0,0,Period Ready,,
2,2016020045_3,2016020045,,,Period Start,,,,1,REGULAR,0,1200.0,2016-10-19 01:40:50,0,0,Period Start,,
3,2016020045_4,2016020045,16.0,4.0,Faceoff,,0.0,0.0,1,REGULAR,0,1200.0,2016-10-19 01:40:50,0,0,Jonathan Toews faceoff won against Claude Giroux,0.0,0.0
4,2016020045_5,2016020045,16.0,4.0,Shot,Wrist Shot,-71.0,9.0,1,REGULAR,54,1146.0,2016-10-19 01:41:44,0,0,Artem Anisimov Wrist Shot saved by Michal Neuv...,71.0,-9.0


Filtering by date is an obvious choice here, given the project parameters. I check the range of dates below:

In [70]:
df.aggregate({'dateTime':['min','max']})

,dateTime
min,2000-10-05 00:00:00
max,2020-09-29 04:05:31


In [71]:
# dateTime holds ISO-formatted strings, so these comparisons are lexicographic
filtered_df = df.query("dateTime >= '2016-04-13' and dateTime < '2017-06-11'")
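The query above compares `dateTime` as strings, which works because the values are in ISO format and sort in date order. As a minimal sketch of an alternative, the column could be parsed with `pd.to_datetime` first and then filtered with a boolean mask. The `toy` frame and its values are made up for illustration and are not taken from `game_plays.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the real data; only the dateTime column matters here.
toy = pd.DataFrame({
    "play_id": ["a", "b", "c", "d"],
    "dateTime": [
        "2016-04-12 23:59:59",
        "2016-04-13 00:00:00",
        "2017-06-10 23:59:59",
        "2017-06-11 00:00:00",
    ],
})

# Parse the strings into real timestamps once.
toy["dateTime"] = pd.to_datetime(toy["dateTime"])

start = pd.Timestamp("2016-04-13")
end = pd.Timestamp("2017-06-11")

# Half-open interval [start, end): the start date is included, the end date is not.
mask = (toy["dateTime"] >= start) & (toy["dateTime"] < end)
toy_filtered = toy.loc[mask]

print(toy_filtered["play_id"].tolist())  # ['b', 'c']
```

Parsing costs a little time up front on a file this size, but it guards against surprises if some timestamps turn out not to be in ISO format.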


Keeping both x, y and st_x, st_y also appears to be redundant. Furthermore, the text descriptions will not be needed for this project.

In [72]:
filtered_df = filtered_df.drop(columns=['st_y', 'st_x', 'description'])
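Before dropping columns like these, the redundancy claim could be checked directly. The sketch below assumes that `st_x`/`st_y` differ from `x`/`y` by at most a sign flip, which is what the sample rows in the `head()` output suggest but which is not confirmed for the whole file. The `toy` frame is a hypothetical stand-in built from values like those sample rows.

```python
import pandas as pd

# Hypothetical rows modelled on the head() output: raw and "st_" coordinates.
toy = pd.DataFrame({
    "x": [0.0, -71.0, None],
    "y": [0.0, 9.0, None],
    "st_x": [0.0, 71.0, None],
    "st_y": [0.0, -9.0, None],
})

def same_magnitude(a: pd.Series, b: pd.Series) -> pd.Series:
    """True where both values are missing or their absolute values match."""
    both_missing = a.isna() & b.isna()
    return both_missing | a.abs().eq(b.abs())

# Assumption: redundancy means the columns agree up to sign on every row.
redundant = bool(
    (same_magnitude(toy["x"], toy["st_x"])
     & same_magnitude(toy["y"], toy["st_y"])).all()
)
print(redundant)  # True
```

Run against the real frame before the `drop`, a result of `False` would mean the `st_` columns carry information that `x` and `y` do not.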


In [73]:
filtered_df.shape

(450788, 15)

450788 rows and 15 columns! Much more manageable. Let's save it as a new CSV file.

In [74]:
# index=False keeps the row index from being written as an extra unnamed column
filtered_df.to_csv('game_plays_trimmed.csv', index=False)
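One detail worth knowing when saving: by default `to_csv` also writes the row index, and reading the file back with `read_csv` then produces an extra `Unnamed: 0` column. The sketch below shows the difference on a made-up two-row frame, using in-memory buffers instead of real files.

```python
import io
import pandas as pd

# Hypothetical frame with a non-default index, as a filtered frame would have.
toy = pd.DataFrame(
    {"game_id": [2016020045, 2016020045], "event": ["Faceoff", "Shot"]},
    index=[3, 4],
)

# Default behaviour: the index is written as a leading column with no name.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'game_id', 'event']
print(cols_without_index)  # ['game_id', 'event']
```

If the original row numbers are needed later, keeping the index and reading the file back with `index_col=0` is the other reasonable choice.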