# Summary

## inputs
This notebook has three inputs:
1. A reviewed and slightly cleaned version of NFL **nflplaybyplay2009to2016** having many rows per game - each row is a 'play' in the games
2. A **dimensions** dataset from our initial review notebook - this categorized each column by how it should be treated
3. An NFL **nfl_teams_scraped** dataset that matches team names to the abbreviation used in the gameplay data  (e.g. Green Bay Packers == GB)
    3.1 I've since found another list on kaggle - but this one works well enough

The gameplay data has many nulls(), but they make sense once we recognize that not every field is applicable for every type of play.
>> for example,
> If a row represents a passing play, then the rushing data does not make sense, so it's all null
>


## goal
To create datasets that might not yet be completely prepared for ML, but can be queried for many uses, including ML

## cleanup
1. Separate the data into core **facts** - these are columns that apply to every play, and should never be null
2. Create a separate dataset for all the **dimensions** columns that are only good for specific kinds of plays
3. Add in facts that are inferred by the sparse dimensions columns, but don't explicitly exist as facts:
>> for example:
>     If there was a defensive two point conversion - the def_two_point will be non-null
>           but it is null for every other case
>           see we create a def_two_point_key that is always 1 or 0 in the fact table
>           and we move def_two_point the dimensions
> There are cases where we could just fill the def_two_point with 'Not Applicable' when it's null,
> but that's not ging to solve every issue
>
4. Identify boolean keys that are important pivots in the facts table:
    (a) whether a pass was attempted
    (b) whether a RUSH was attempted
    (c) whether there was a penalty on the play
    (d) an offensive or defensive two point conversion
    (e) whether there was a sack
    (f) whether a pass was attempted
 ... and more...

## outputs
1. A cleaned NFL `gameplay` dataset - having many rows per game - each row is a 'play' in the games
2. A column-level metrics dataset that holds some key metrics from describe(), dtypes, etc. and also some configurations



## metrics
Looking at the metrics data - the final output gameplay dataset should be almost complete, with a small enough set of nulls that can be reviewd manually (52)
The completeness column is just the amount of non-null records divided by the total record count

<span><img src="metrics_clean_01.png" width="2500"></span>


## TODO

In [8]:
# todo - is the play_recorded key really helpful?
# todo - remove the inconsistent playtype column or update it
# todo - review the No Play conversion - if the playtype is no good, why fix it halfway?

# Prepare

In [9]:
%load_ext autoreload
%autoreload 2

import re
import os
import warnings

import numpy as np
import pandas as pd
import os
import sys

warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

pd.set_option('display.float_format', lambda x: '%.5f' % x)
# comments: <span style="color:#20B2AA">


np.random.seed(0)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [10]:
module_path = os.path.abspath(os.path.join('../src'))
print("Adding modules", module_path)
if module_path not in sys.path:
    sys.path.append(module_path)

Adding modules /Users/christopherlomeli/Source/courses/datascience/nfl_capstone/src


In [11]:
from src.features.wrangling.database_loader import DatabaseLoader
from src.features.wrangling.get_metrics import GetMetrics, update_by_lookp
from src.data.s3utils import download_from_s3

## setup

In [12]:

RAW_DATA_PATH = '../data/raw'
INTERIM_DATA_PATH='../data/interim'
USE_CONNECTION="DB_FILENAME_URL"   # DB_FILENAME_URL for csv or DB_CONNECTION_URL for postgres

#inputs
DATA_FILE = os.path.join(INTERIM_DATA_PATH,"nflplaybyplay2009to2016_reviewed_01.parquet")
TEAMS_DATA = os.path.join(RAW_DATA_PATH,"nfl_teams_scraped.csv")
DIMENSIONS_DATA = os.path.join(RAW_DATA_PATH,"dimensions.csv")

#outputs
GAMEPLAY_FACTS_DF_NAME=os.path.join(INTERIM_DATA_PATH, "gameplay_facts_cleaned_01.parquet")
GAMEPLAY_DIM_DF_NAME=os.path.join(INTERIM_DATA_PATH, "gameplay_dimensions_cleaned_01.parquet")
ANALYTICS_DF_NAME=os.path.join(INTERIM_DATA_PATH, "analytic_events_cleaned_01.parquet")
ADMIN_DF_NAME=os.path.join(INTERIM_DATA_PATH, "admin_events_cleaned_01.parquet")
READ_ME = os.path.join(INTERIM_DATA_PATH,"README.02-cjl-clean.txt")

# tables
METRICS_INPUT_TABLE_NAME="nfl_metrics"
CATEGORY_OUTPUT_TABLE_NAME="nfl_cleaned_categories"
METRICS_OUTPUT_TABLE_NAME="nfl_cleaned_metrics"


## read data

In [13]:
download_from_s3(prefix="nfl_capstone/data/raw", local_dir=os.path.abspath(RAW_DATA_PATH), wishlist=['nfl_teams_scraped.csv', 'dimensions.csv'])

already exists:  /Users/christopherlomeli/Source/courses/datascience/nfl_capstone/data/raw/dimensions.csv
we don't need games.csv right now.
we don't need nfl_stadiums.csv right now.
we don't need nfl_teams.csv right now.
already exists:  /Users/christopherlomeli/Source/courses/datascience/nfl_capstone/data/raw/nfl_teams_scraped.csv
we don't need NFL Play by Play 2009-2016 (v3).csv right now.
we don't need NFL Play by Play 2009-2017 (v4).csv right now.
we don't need NFL Play by Play 2009-2018 (v5).csv right now.
we don't need spreadspoke_scores.csv right now.


In [14]:
if not os.path.exists(DATA_FILE):
    raise Exception(f"Can't find the input file {DATA_FILE} .  Have you run the preceeding notebooks? ")

In [15]:
# Read the data file
data_df = pd.read_parquet(DATA_FILE)
data_df.head()

Unnamed: 0,date,game_id,drive,qtr,down,time,time_under,time_secs,play_time_diff,sideof_field,...,yac_epa,home_wp_pre,away_wp_pre,home_wp_post,away_wp_post,win_prob,wpa,air_wpa,yac_wpa,season
0,2009-09-10,2009091000,1,1,,15:00,15,3600.0,0.0,TEN,...,,0.48567,0.51433,0.54643,0.45357,0.48567,0.06076,,,2009
1,2009-09-10,2009091000,1,1,1.0,14:53,15,3593.0,7.0,PIT,...,1.14608,0.54643,0.45357,0.55109,0.44891,0.54643,0.00465,-0.03224,0.0369,2009
2,2009-09-10,2009091000,1,1,2.0,14:16,15,3556.0,37.0,PIT,...,,0.55109,0.44891,0.51079,0.48921,0.55109,-0.04029,,,2009
3,2009-09-10,2009091000,1,1,3.0,13:35,14,3515.0,41.0,PIT,...,-5.03142,0.51079,0.48921,0.46122,0.53878,0.51079,-0.04958,0.10666,-0.15624,2009
4,2009-09-10,2009091000,1,1,4.0,13:27,14,3507.0,8.0,PIT,...,,0.46122,0.53878,0.55893,0.44107,0.46122,0.09771,,,2009


In [16]:
db = DatabaseLoader(connection_string_env_url=USE_CONNECTION)

In [17]:
# Read in the metrics file we used to inspect the data manually
metrics_df = db.read_table(METRICS_INPUT_TABLE_NAME)
metrics_df.head()


Unnamed: 0,column_name,data_type,unique_counts,feature_type,c_dimension,row_count,good_count,missing_count,completeness,quality,mean,std,min,max,median,top,freq
0,def_two_point,object,2,category,dim(twoopoint),407688,24,407664,6e-05,poor,,,,,,Failure,19.0
1,blocking_player,object,101,category,dim(block),407688,117,407571,0.00029,poor,,,,,,D.Watson,3.0
2,two_point_conv,object,2,category,,407688,605,407083,0.00148,poor,,,,,,Failure,322.0
3,chal_replay_result,object,2,key,fact,407688,3402,404286,0.00834,poor,,,,,,Upheld,1986.0
4,rec_fumb_player,object,1827,category,dim(fumble),407688,4373,403315,0.01073,poor,,,,,,M.Adams,15.0


# Conversions

## 01 rename key indicators with a _key postfix

In [18]:
metrics_df.head()

Unnamed: 0,column_name,data_type,unique_counts,feature_type,c_dimension,row_count,good_count,missing_count,completeness,quality,mean,std,min,max,median,top,freq
0,def_two_point,object,2,category,dim(twoopoint),407688,24,407664,6e-05,poor,,,,,,Failure,19.0
1,blocking_player,object,101,category,dim(block),407688,117,407571,0.00029,poor,,,,,,D.Watson,3.0
2,two_point_conv,object,2,category,,407688,605,407083,0.00148,poor,,,,,,Failure,322.0
3,chal_replay_result,object,2,key,fact,407688,3402,404286,0.00834,poor,,,,,,Upheld,1986.0
4,rec_fumb_player,object,1827,category,dim(fumble),407688,4373,403315,0.01073,poor,,,,,,M.Adams,15.0


In [19]:
for col in metrics_df.loc[(metrics_df.feature_type=='key'), 'column_name']:
     new_name = f"{col}_key".replace(".","_")
     if col in data_df.columns:
        data_df.rename(columns={col: new_name}, inplace=True)
        print(f"Renamed {col} to {new_name}")


Renamed chal_replay_result to chal_replay_result_key
Renamed timeout_indicator to timeout_indicator_key
Renamed play_attempted to play_attempted_key
Renamed sp to sp_key
Renamed touchdown to touchdown_key
Renamed safety to safety_key
Renamed onsidekick to onsidekick_key
Renamed pass_attempt to pass_attempt_key
Renamed qb_hit to qb_hit_key
Renamed interception_thrown to interception_thrown_key
Renamed rush_attempt to rush_attempt_key
Renamed reception to reception_key
Renamed fumble to fumble_key
Renamed sack to sack_key


In [20]:
for col in metrics_df.loc[(metrics_df.feature_type=='key'), 'column_name']:
    new_name = f"{col}_key".replace(".","_")
    if new_name in data_df.columns:
        print("CHANGED ", new_name)
    else:
        print("BAD - did not change ", col)

CHANGED  chal_replay_result_key
CHANGED  timeout_indicator_key
CHANGED  play_attempted_key
CHANGED  sp_key
CHANGED  touchdown_key
CHANGED  safety_key
CHANGED  onsidekick_key
CHANGED  pass_attempt_key
CHANGED  qb_hit_key
CHANGED  interception_thrown_key
CHANGED  rush_attempt_key
CHANGED  reception_key
CHANGED  fumble_key
CHANGED  sack_key


## 02 add additional fact keys to the database based on query

In [21]:
# create a little function to add a new key based on another clumn
def add_key(newkey, depends_on):
    data_df[newkey] = 0
    data_df.loc[(data_df[depends_on].notnull()), newkey] = 1

# add keys base on nullity of another field
add_key("def_two_point_key", "def_two_point" )
add_key("ex_point_result_key", "ex_point_result" )
add_key("return_key", "returner" )
add_key("tackle_key", "tackler1" )
add_key("two_point_conv_key", "two_point_conv" )

# add penalty_key if penalty.yards > 0
data_df["penalty_key"] = 0
data_df.loc[(data_df["penalty_yards"] > 0), "penalty_key"] = 1


## 03 move analytics to a separate dataframe

In [22]:
# save analytics columns to a separate df, and remove from the dataset
all_columns = set(data_df.columns)
analytics_columns = set(metrics_df.loc[(metrics_df.feature_type == 'analytics'), 'column_name'])

analytics_df =  data_df[analytics_columns].copy()
data_df.drop(columns=analytics_columns, inplace=True)


## 04 update the playattempted column
Several rows are for administrative events such as timeout, END Game, or End Quarter -- which creates a lot of nulls and not-applicable values
We probably don't want them, so at least segment them as gameplay = Yes or No, where gameplay=No signifies an administrative event, not a real play

In [23]:
data_df["playattempted"] = 1

# maybe drop or add category [gameplay vs admin]
data_df.loc[data_df["play_type"].isin([
    'Quarter End',
    'Two Minute Warning',
    'End of Game',
    'Half End',
    'Timeout'
]), "play_attempted"] = 0


In [24]:
print("Review results - admin events should now have a gameplay == 0")
data_df.loc[data_df.playattempted == 0, ['play_attempted', 'play_type']].value_counts()

Review results - admin events should now have a gameplay == 0


Series([], dtype: int64)

## 05 fill in missing passers
In some cases the passer is NaN - in these cases we can get it from the passer_id from other good records

In [25]:
# create a unique list of passer_id and passer name
passers_df = data_df.loc[(data_df.passer.notna()) & (data_df.passer_id.notna()) & (data_df.passer_id!='None') , ['passer', 'passer_id']]

# use this utility to update
data_df = update_by_lookp(
    left_df=data_df, left_col='passer', left_on='passer_id',
    lookup_df=passers_df, lookup_col='passer_fix', lookup_on='passer_id', nulls_only=True)

## 06 add playrecorded_key column
There are many `playtype`=="No Play" rows, and they are only "No Play" because the play was replayed due to penalty or some other event.
But the initial play had a value, such as Pass, Punt, etc.
We want to know what the original play was, so create a different column for No play where if `play_recorded`=False then it's a reply
We then update to the original `playtype` in the `playtype` field and use the `play_recorded` field to know whether there was a replay (play_recorded=False) or not.

In [26]:
## 06 add playrecorded_key column
# set play_recorded = True (1) for all rows
data_df["play_recorded_key"] = 1

# Set the play_recorde to False (0) if the playtype is currently == No Play
data_df.loc[data_df["play_type"] == 'No Play', "play_recorded_key"] = 0
data_df[['play_type', 'play_recorded_key']].value_counts()


play_type           play_recorded_key
Pass                1                    159353
Run                 1                    120831
Kickoff             1                     23403
Punt                1                     22003
No Play             0                     21414
Timeout             1                     16206
Sack                1                     10649
Extra Point         1                     10063
Field Goal          1                      8928
Quarter End         1                      4914
QB Kneel            1                      3530
End of Game         1                      1973
Spike               1                       640
Half End            1                        40
dtype: int64

## 07 convert 'No Play' `playtypes` to their original playtype
"No play" means that there was some sort of penalty or stoppage, and the play was cancelled - to be replayed.
The penalty could have occurred after the play started or before it started (e.g. False Start).
If a play actually did get underway then we want to know what the play was.

So add a new field (play_recorded) that tells us whether the play was cancelled and needs to be replayed or was counted.
And then back-fill the playtype field for these row with the original play (when we can figure it out)

In [27]:
data_df["pass_attempt_key"]

0         0
1         1
2         0
3         1
4         0
         ..
407683    0
407684    1
407685    1
407686    0
407687    0
Name: pass_attempt_key, Length: 407688, dtype: int64

In [28]:
# maybe drop or add category [gameplay vs admin]
data_df.loc[(data_df["play_type"] == 'No Play') & (data_df["pass_attempt_key"] == 1), 'play_type'] = 'Pass'
data_df.loc[(data_df["play_type"] == 'No Play') & (data_df["field_goal_result"].notna()), 'play_type'] = 'Field Goal'
data_df.loc[(data_df["play_type"] == 'No Play') & (data_df["punt_result"].notna()), 'play_type'] = 'Punt'
data_df.loc[(data_df["play_type"] == 'No Play') & (data_df["penalty_type"].notna()), 'play_type'] = 'Penalty'

print("Remaining No Plays")
data_df.loc[(data_df["play_recorded_key"] == 0) , ['play_type', 'play_recorded_key']].value_counts()




Remaining No Plays


play_type   play_recorded_key
Penalty     0                    9700
Pass        0                    8703
No Play     0                    2516
Punt        0                     366
Field Goal  0                     129
dtype: int64

## 08 populate missing penalties
There are many missing `penaltytype` columns, but we can see what they should have been by looking at the desc field.
Use that field to parse out the actual `penaltytype` where we can

In [29]:
# create a function to parse the penalty from the description"
def parse_penalty(value):
    result = re.search(r"(?i)(PENALTY on)([^\,]*)\,([^\,]*)", value)
    try:
        v =  result.group(3)
    except AttributeError:
        v = ''
    return v


In [30]:
# get the subset of all rows where we know it's a penalty, but penaltytype is NA
missing_penalties = data_df.loc[(data_df['accepted_penalty'] == 1) & (data_df['penalty_type'].isna()) & (data_df['desc'].notna()), ['desc']]

# parse the penalty from the description
missing_penalties['penalty_fix'] = missing_penalties.desc.apply(parse_penalty)

# drop the desc column - we have what we need
missing_penalties.drop(columns = ['desc'], inplace=True)

# merge our fix in as penalty_fix field
df = pd.merge(data_df, missing_penalties, left_index=True, right_index=True, how='outer')
df.loc[(data_df['accepted_penalty'] == 1) & (data_df['penalty_type'].isna()) & (df['penalty_fix'].notna()), ['penalty_type', 'penalty_fix']]

# replace empty penaltytype with the penalty_fix
df.loc[(data_df['accepted_penalty'] == 1) & (data_df['penalty_type'].isna()) & (df['penalty_fix'].notna()), 'penalty_type'] = df['penalty_fix']

# drop the fix column
df.drop(columns = ['penalty_fix'], inplace=True)
print("Shape of the original dataframe", data_df.shape)
print("Shape of our updated dataframe", df.shape)

# assign df to data_df
data_df = df

# verify results - should be zero -- assert?
print("Are there any leftover penalties - this should be empty")
data_df.loc[(data_df['accepted_penalty'] == 1) & (data_df['penalty_type'].isna()) & (data_df['desc'].notna()), ['desc', 'penalty_type']]

Shape of the original dataframe (407688, 95)
Shape of our updated dataframe (407688, 95)
Are there any leftover penalties - this should be empty


Unnamed: 0,desc,penalty_type


## 09 validate team names
Several fields are populated with the team abbreviation (e.g. LA Rams == 'LAR')
Some of these abbreviations are historical and no longer exist
Others are errors -

In [31]:
# Get a control list of team names and abbreviations
team_df = pd.read_csv(TEAMS_DATA)
teams = list(team_df.Abbreviation)
teams

['ARI',
 'ATL',
 'BAL',
 'BUF',
 'CAR',
 'CHI',
 'CIN',
 'CLE',
 'DAL',
 'DEN',
 'DET',
 'GB',
 'HOU',
 'IND',
 'KC',
 'MIA',
 'MIN',
 'NE',
 'NO',
 'NYG',
 'NYJ',
 'PHI',
 'PIT',
 'SF',
 'SEA',
 'TB',
 'TEN',
 'WAS',
 'SD',
 'LAC',
 'LV',
 'OAK',
 'LAR',
 'STL',
 'JAX',
 'JAC']

In [32]:
# create a function to list al team abbreviations that are not in our control list
team_columns = ['home_team', 'away_team', 'defensive_team', 'posteam', 'rec_fumb_team','timeout_team']

def validate_teams():
    for t in team_columns:
        print("COLUMN: ", t)
        print(data_df.loc[~data_df[t].isin(list(teams)), t].unique())
        print("-----------------------------------")

validate_teams()

COLUMN:  home_team
['LA']
-----------------------------------
COLUMN:  away_team
['LA']
-----------------------------------
COLUMN:  defensive_team
[None 'LA']
-----------------------------------
COLUMN:  posteam
[None 'LA']
-----------------------------------
COLUMN:  rec_fumb_team
[None 'LA']
-----------------------------------
COLUMN:  timeout_team
['None' 'LA']
-----------------------------------


In [33]:
# cleanup the ones we know about
for t in team_columns:
    print("Update COLUMN: ", t)
    data_df.loc[data_df[t]=='LA', t] = 'LAR'

Update COLUMN:  home_team
Update COLUMN:  away_team
Update COLUMN:  defensive_team
Update COLUMN:  posteam
Update COLUMN:  rec_fumb_team
Update COLUMN:  timeout_team


## 10 fillna for sparse columns
Some categorical columns have null values because the column is not applicable to the play itself.
For example, if there was no pass, then the pass-outcome column would be null.
So this is really a 'Not Applicable' sort of value
We are just cleaning at this point, and we may use this data in many ways, so rather than having a null, it's simple to update tose nulls to 'NA' for these categorical fields.

Depending on how we use the data later, we'll drop these, or segment them out for different uses.

In [34]:
sparse_columns = ['chal_replay_result_key',
                  'def_two_point',
                  'ex_point_result',
                  'field_goal_result',
                  'interceptor',
                  'passer',
                  'passer_id',
                  'pass_length',
                  'pass_location',
                  'pass_outcome',
                  'penalized_player',
                  'penalized_team',
                  'penalty_yards',
                  'penalty_type',
                  'punt_result',
                  'receiver',
                  'rec_fumb_player',
                  'rec_fumb_team',
                  'returner',
                  'return_result',
                  'run_location',
                  'rusher',
                  'safety_key',
                  'two_point_conv',
                  'run_gap']

for col in sparse_columns:
    print(f"{col} Nulls before:  {data_df[col].isna().sum()}", end=" ")
    data_df[col].fillna("NA", inplace = True)
    print(f"{col} Nulls after:  {data_df[col].isna().sum()}")
    print('----------------------------------------------------------')


chal_replay_result_key Nulls before:  404286 chal_replay_result_key Nulls after:  0
----------------------------------------------------------
def_two_point Nulls before:  407664 def_two_point Nulls after:  0
----------------------------------------------------------
ex_point_result Nulls before:  397578 ex_point_result Nulls after:  0
----------------------------------------------------------
field_goal_result Nulls before:  398629 field_goal_result Nulls after:  0
----------------------------------------------------------
interceptor Nulls before:  403168 interceptor Nulls after:  0
----------------------------------------------------------
passer Nulls before:  228359 passer Nulls after:  0
----------------------------------------------------------
passer_id Nulls before:  0 passer_id Nulls after:  0
----------------------------------------------------------
pass_length Nulls before:  240520 pass_length Nulls after:  0
----------------------------------------------------------
pass_

# Segment into Gameplay, Admin, Analytics datasets

## 11 separate gameplay events from admin events to separate dataframes
These tend to have a lot of nulls because they are really not gameplay events - and they might not be good for all kinds of predictions

In [35]:
print("original shape", data_df.shape)

admin_events_df = data_df.loc[data_df.playattempted != 1].copy()
print("admin_events shape", admin_events_df.shape)

gameplay_facts_df = data_df.loc[data_df.playattempted == 1].copy()
print("game_events shape", gameplay_facts_df.shape)


original shape (407688, 95)
admin_events shape (0, 95)
game_events shape (407688, 95)


In [36]:
gameplay_facts_columns = set(metrics_df.loc[(metrics_df.c_dimension == 'fact'), 'column_name'])
gameplay_dim_columns = set(metrics_df.loc[(metrics_df.c_dimension != 'fact'), 'column_name'])

all_gameplay_columns = set(gameplay_facts_df.columns)

adjusted_dim_columns = gameplay_dim_columns.intersection(all_gameplay_columns)  # we already got rid of the analytics columns - so make sure we are only choosing columns that still exists

print("Facts ::", gameplay_facts_columns)
print("Sparse dimensions ::", adjusted_dim_columns)

gameplay_dimensions_df = gameplay_facts_df[adjusted_dim_columns].copy()
gameplay_facts_df.drop(columns=adjusted_dim_columns, inplace=True)



Facts :: {'goal_to_go', 'yards_gained', 'pos_team_score', 'first_down', 'away_team', 'play_attempted', 'timeout_indicator', 'sp', 'rush_attempt', 'sack', 'down', 'yrdline100', 'abs_score_diff', 'posteam_timeouts_pre', 'play_time_diff', 'home_team', 'game_id', 'onsidekick', 'ydsnet', 'ydstogo', 'chal_replay_result', 'def_team_score', 'date', 'posteam', 'desc', 'qtr', 'pass_attempt', 'reception', 'defensive_team', 'yards_after_catch', 'safety', 'drive', 'yrdln', 'qb_hit', 'time_secs', 'time_under', 'fumble', 'play_type', 'sideof_field', 'score_diff', 'time', 'season', 'interception_thrown', 'touchdown'}
Sparse dimensions :: {'ex_point_prob', 'ex_point_result', 'tackler2', 'home_timeouts_remaining_post', 'returner', 'receiver_id', 'rusher_id', 'rec_fumb_player', 'tackler1', 'penalized_team', 'accepted_penalty', 'run_location', 'penalty_type', 'two_point_prob', 'pass_outcome', 'field_goal_prob', 'interceptor', 'safety_prob', 'air_yards', 'blocking_player', 'timeout_team', 'penalized_player

## 12 fillna down and firstdown to zero

In [37]:
gameplay_facts_df.down = gameplay_facts_df.down.fillna(0)
gameplay_facts_df.first_down = gameplay_facts_df.first_down.fillna(0)
gameplay_facts_df.loc[(gameplay_facts_df.down.isna()), 'play_type'].value_counts()

Series([], Name: play_type, dtype: int64)

## 13 set firstdown to zero for Kickoffs and Extra points

In [38]:
# firstdown

gameplay_facts_df.loc[(gameplay_facts_df.first_down.isna()) & (gameplay_facts_df.play_type.isin(['Kickoff', 'Extra Point'])), 'first_down'] = 0
gameplay_facts_df.loc[(gameplay_facts_df.first_down.isna()), 'play_type'].value_counts()

Series([], Name: play_type, dtype: int64)

# Wind up

## 14 create new metrics with these changes

In [39]:
dimensions_df = pd.DataFrame()
print("read in dimensions data: ", DIMENSIONS_DATA)
try:
    dimensions_df = pd.read_csv(DIMENSIONS_DATA)
except Exception:
    pass

dimensions_df.head()

read in dimensions data:  ../data/raw/dimensions.csv


Unnamed: 0,column_name,feature_type,c_dimension
0,abs_score_diff,continuous,fact
1,accepted._penalty,category,dim (penalty)
2,air_epa,analytics,analytics
3,air_wpa,analytics,analytics
4,air_yards,continuous,"dim(pass, run)"


In [40]:
metrics = GetMetrics()

fact_metrics_df = metrics.get_metrics(gameplay_facts_df, dimensions_df)
fact_categories_df = metrics.get_categories(data_df=gameplay_facts_df, unique_count_threshold=40)

# db = DatabaseLoader(relative_dir="../working_data/fact_metrics.db")
db = DatabaseLoader(connection_string_env_url="DB_CONNECTION_URL")

db.load_table(fact_metrics_df, METRICS_OUTPUT_TABLE_NAME)
db.load_table(fact_categories_df, CATEGORY_OUTPUT_TABLE_NAME)


## 15 save data to disk
can only really save these interim files for small data

In [41]:
gameplay_facts_df.to_parquet(GAMEPLAY_FACTS_DF_NAME, engine='fastparquet',  compression='snappy')
gameplay_dimensions_df.to_parquet(GAMEPLAY_DIM_DF_NAME, engine='fastparquet',  compression='snappy')
analytics_df.to_parquet(ANALYTICS_DF_NAME, engine='fastparquet',  compression='snappy')
admin_events_df.to_parquet(ADMIN_DF_NAME, engine='fastparquet',  compression='snappy')

In [42]:
with open(READ_ME, 'w') as f:
    f.write(f"\n{os.path.basename(GAMEPLAY_FACTS_DF_NAME)}\ta clean version of gameplay with just the core facts")
    f.write(f"\n{os.path.basename(GAMEPLAY_DIM_DF_NAME)}\ta less-clean version of gameplay dimensions that are only non-null for specific kinds of facts")
    f.write(f"\n{os.path.basename(ANALYTICS_DF_NAME)}\tmove all probabilities and stats into a separate dataset - they could be useful later")
    f.write(f"\n{os.path.basename(ADMIN_DF_NAME)}\tmove all gameplay records that are not really plays (e.g. 'Quarter end' to this dataset")
    f.write(f"\n{os.path.basename(METRICS_OUTPUT_TABLE_NAME)}.csv\toptionally - we might save the metrics in a file instead of a database")
    f.write(f"\n{os.path.basename(CATEGORY_OUTPUT_TABLE_NAME)}.csv\toptionally we might save the metrics categories in a file instead of a database")
