## <font color=#0e0654>NFL ingest and load process</font>
Chris Lomeli
Springboard Capstone

## <font color=#0e0654>This notebook is for demonstration of the NFL ingest and load process</font>

The goal of this notebook is to demonstrate the workflow of (1) downloading NFL data from [NFLVerse](https://github.com/nflverse), (2) wrangling and splitting dimensional data into a relational database, and (3) transforming a subset of the data into an input dataset for a team/week win/loss prediction model.

The steps to accomplish this are:

- download data from NFLVerse
- clean and restructure the data into semi-normalized relational tables
- prepare the data by querying a subset for a team/week level input dataset that can be used to predict win/loss
- perform feature selection to find the right features, then aggregate their weighted averages into power scores
- merge game and power score data into the input dataset for our experiment

This job would normally be run from nfl_main_job.py, which autonomously executes the individual steps in the same way you see in this notebook.


<img src="../images/nfl.png" alt="NFLVerse Ingest">


<div style="border: 1px solid rgba(147, 112, 219, 0.1); margin: 1px 0;"></div>

<div style="background-color: teal; padding: 10px;">
    <h3 style="color: white;">imports</h3>
</div>

In [1]:
import logging
import os
import sys

from src import configs

sys.path.append(os.path.abspath("../src"))


In [2]:
from src.nfl_00_load_nflverse_data import read_nflverse_datasets
from src.nfl_01_load_nfl_database import create_nfl_database
from src.nfl_02_prepare_weekly_stats import prepare_team_week_dataset
from src.nfl_03_perform_feature_selection import perform_team_week_feature_selection
from src.nfl_04_merge_game_feature_selection import merge_team_week_features

<div style="background-color: teal; padding: 10px;">
    <h3 style="color: white;">housekeeping</h3>
</div>


In [3]:
# Get the logger
logger = configs.configure_logging("pbp_logger")
logger.setLevel(logging.INFO)


<div style="background-color: teal; padding: 10px;">
    <h3 style="color: white;">download nflverse data</h3>
</div>

our goal in this step is to get the information downloaded with as little risk as possible

this protects us from any unexpected changes in the nflverse datasets or issues with our own transformations

once we have the data safely in our system we can dimension and store the data in a database with as many validations and retries as necessary

normally we will run this from a python job 'src.nfl_main_job.py' but for demo purposes we'll run these here



### <font color="#0e0654">read nflverse data to output folders</font>

<font color=#0e0654>

- our goal is to get the data stored without risk of failures, so we store the files directly to local disk or s3
- the files we want are configured in the configs.py code
- downloads use a synchronous http client
- and run on a fixed-size thread pool executor
</font>
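
As a rough illustration of the pattern described above (a blocking http client driven by a fixed-size thread pool, with files written untouched), here is a minimal sketch. It is not the code in `read_nflverse_datasets`; the names `fetch_bytes` and `download_all`, the `urls` mapping and the worker count are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Tuple
from urllib.request import urlopen


def fetch_bytes(url: str) -> bytes:
    """Blocking (synchronous) HTTP GET using only the standard library."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def download_all(
    urls: Dict[str, str],
    output_dir: Path,
    fetch: Callable[[str], bytes] = fetch_bytes,
    max_workers: int = 4,
) -> Dict[str, Path]:
    """Save every URL to output_dir as-is, using a fixed-size thread pool.

    `urls` maps a target file name to its source URL (hypothetical layout).
    """
    output_dir.mkdir(parents=True, exist_ok=True)

    def download_one(item: Tuple[str, str]) -> Tuple[str, Path]:
        name, url = item
        target = output_dir / name
        # Store the raw bytes; no parsing or transformation at this stage.
        target.write_bytes(fetch(url))
        return name, target

    # The pool never grows beyond max_workers threads.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(download_one, urls.items()))
```

Passing the `fetch` function in as a parameter is a design choice for this sketch only: it lets the download logic be exercised with a fake client, without touching the network.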

In [4]:
read_nflverse_datasets()

2023-07-20 15:16:20,931 - INFO - begin downloads ...


............................................

2023-07-20 15:16:28,754 - INFO - downloads complete ...


....

<div style="border: 1px solid rgba(147, 112, 219, 0.4); margin: 1px 0;"></div>

### <font color="#0e0654">load to database</font>

<font color=#0e0654>

In this step we read from the output directory

we then split the data into several dimensions based on cardinality

<br>

for example,

<br>

* explode the player participation array columns into their own datasets
* create a play_actions dataset that contains key play-by-play info
* pull out all the player events in the play-by-play data and cross-reference them to the team they are playing for in that week
* pull out game data, e.g. game date, final scores, home and away teams, etc.

<br>

we load all dimensions to a relational database for availability to other experiments


todo: the load process should use a bulk/copy load instead of inserts
</font>
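
To make the "explode" step above more concrete, here is a small sketch of turning an array column into its own dataset with pandas. The column names (`play_id`, `offense_players`, `player_id`) and the values are made up for illustration and may not match the real tables.

```python
import pandas as pd

# Hypothetical play-by-play rows: each play carries a list of player ids.
plays = pd.DataFrame(
    {
        "play_id": ["2022_01_BAL_NYJ_1", "2022_01_BAL_NYJ_2"],
        "offense_players": [["p1", "p2"], ["p1", "p3"]],
    }
)

# explode() emits one row per list element and repeats the play_id key,
# so the array column becomes a narrow dataset that joins back on play_id.
participation = (
    plays.explode("offense_players")
    .rename(columns={"offense_players": "player_id"})
    .reset_index(drop=True)
)
```

Each play now contributes one row per participating player, which is the shape a relational table needs.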

In [5]:
create_nfl_database()

2023-07-20 15:16:28,759 - INFO - Reading all files from pbp
2023-07-20 15:16:32,896 - INFO - Impute columns to 0
2023-07-20 15:16:33,092 - INFO - impute non binary pbp columns ...
2023-07-20 15:16:33,866 - INFO - Impute columns to 0
2023-07-20 15:16:34,489 - INFO - Impute columns to 0:00
2023-07-20 15:16:35,875 - INFO - Impute columns to NA
2023-07-20 15:16:41,408 - INFO - moving play_id to play_counter, and creating a joinable play_id key
2023-07-20 15:16:42,259 - INFO - Conform key actions like pass, rush, kickoff, etc.... 
2023-07-20 15:16:54,577 - INFO - Validate actions dimension ...
2023-07-20 15:16:54,858 - INFO - checking play_action counts...
2023-07-20 15:16:54,870 - INFO - Creating new drive dimension...
2023-07-20 15:16:54,941 - INFO - Validate drive_df dimension ...
2023-07-20 15:16:55,168 - INFO - Creating new situations dimension...
2023-07-20 15:16:55,235 - INFO - Validate situation_df dimension ...
2023-07-20 15:16:55,446 - INFO - Creating new metrics dimension...
2023

<div style="border: 1px solid rgba(147, 112, 219, 0.4); margin: 1px 0;"></div>

<div style="background-color: teal; padding: 10px;">
    <h3 style="color: white;">prepare data for team/week win/loss experiment</h3>
</div>

Our goal is to use the nfl dimensions we have created to prepare data for a team/week experiment

The team_week experiment aims to select the best features in the nflverse data to predict win/loss

The expectation is not that we can predict win/loss any better than current statistical approaches, but to explore the potential for ML

Normally we will run this from a python job 'src.nfl_main_job.py' but for demo purposes we'll run these here


### <font color="#0e0654">query and build a team/week dataset</font>

<font color=#0e0654>

In this step we use SQL queries to pull play actions, events and statistics from the nfl database

We use lagging and leading values (back and forward fill by week) to fill incomplete data in the original dataset

We are merging all statistics into defense and offense datasets so they can be attributed to specific teams

for example, for the Ravens (BAL) vs Jets (NYJ) in week 1 of 2022 we want separate sets for BAL and NYJ - so each team has its own stats

We are aggregating up to the Season, Week, Team level for this application

We can concatenate available statistics, which are also at the Season, Week, Team level

finally, we create play_action, offense and defense datasets in our data directory and optionally back to the database

</font>
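
The fill and aggregation steps described above can be sketched in pandas as follows. This is only an illustration: the column names and numbers are invented, and the fill order (forward, then backward) is an assumption rather than something taken from `prepare_team_week_dataset`.

```python
import pandas as pd

# Hypothetical weekly metrics with gaps (None) for both teams.
weekly = pd.DataFrame(
    {
        "season": [2022] * 6,
        "week": [1, 2, 3, 1, 2, 3],
        "team": ["BAL"] * 3 + ["NYJ"] * 3,
        "avg_air_yards": [7.5, None, 8.1, None, 6.0, 6.4],
    }
)

# Fill gaps within each season/team only, so one team's numbers
# never leak into another team's rows.
weekly = weekly.sort_values(["season", "team", "week"])
weekly["avg_air_yards"] = weekly.groupby(["season", "team"])[
    "avg_air_yards"
].transform(lambda s: s.ffill().bfill())

# Hypothetical play-level rows rolled up to the season, week, team level.
plays = pd.DataFrame(
    {
        "season": [2022] * 4,
        "week": [1] * 4,
        "team": ["BAL", "BAL", "NYJ", "NYJ"],
        "yards_gained": [5, 12, 3, 0],
    }
)
team_week = plays.groupby(["season", "week", "team"], as_index=False)[
    "yards_gained"
].sum()
```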


In [6]:
prepare_team_week_dataset(store_to_db=False)

2023-07-20 15:25:12,001 - INFO - Build a 'control' dataset with all seasons and weeks...
2023-07-20 15:25:12,601 - INFO - query and modify play actions data ...
2023-07-20 15:25:16,297 - INFO - double checking play actions before save...
2023-07-20 15:25:16,319 - INFO - query and modify game info data ...
2023-07-20 15:25:16,356 - INFO - query and modify next gen passing data ...
2023-07-20 15:25:16,398 - INFO - query and modify next gen rushing data ...
2023-07-20 15:25:16,421 - INFO - query and modify play-by-play player events ...
2023-07-20 15:25:21,086 - INFO - query defense player_stats data ...
2023-07-20 15:25:21,192 - INFO - query offense player_stats data ...
2023-07-20 15:25:21,365 - INFO - back and forward fill ngs_air_power metrics by week ...
2023-07-20 15:25:21,382 - INFO - back and forward fill ngs_ground_power metrics by week ...
2023-07-20 15:25:21,392 - INFO - back and forward fill pbp_events metrics by week ...
2023-07-20 15:25:21,400 - INFO - back and forward fill 

ok
ok


<div style="border: 1px solid rgba(147, 112, 219, 0.4); margin: 1px 0;"></div>

### <font color="#0e0654">feature selection</font>

<font color=#0e0654>

In this step we use the weekly datasets we've created to find the best features to predict game win/loss

We perform some data prep, including scaling and categorical encoding

We run sklearn correlations, both generally and specifically against the target win/loss column

After some automl experiments, we use xgboost to determine the best features for predicting win/loss

We separate the top features for defense and offense and calculate a weighted average of each group to get a single defense_power and offense_power score

We perform a sanity check to validate that a model trained on the power score columns learns as well as one trained on the individual stats

</font>
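
The weighted-average idea above can be sketched as follows. The feature names, scaled values and importances are hypothetical; in the actual step the importances would come from the fitted model, whereas here they are hard-coded so the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical scaled offense features, one row per team for a given week.
features = pd.DataFrame(
    {
        "passing_yards": [0.8, 0.2],
        "rushing_yards": [0.5, 0.5],
        "first_downs": [0.1, 0.9],
    },
    index=["BAL", "NYJ"],
)

# Hypothetical feature importances standing in for a model's output.
importances = pd.Series(
    {"passing_yards": 6.0, "rushing_yards": 3.0, "first_downs": 1.0}
)

# Percentage contribution of each feature: the weights sum to 1.
weights = importances / importances.sum()

# Weighted average per row: sum of (feature value * feature weight).
offense_power = features[weights.index].dot(weights)
```

For BAL this works out to 0.8 × 0.6 + 0.5 × 0.3 + 0.1 × 0.1 = 0.64, so a single score stands in for the three individual stats.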


In [7]:
perform_team_week_feature_selection()

2023-07-20 15:25:22,235 - INFO - SelectNFLFeatures
2023-07-20 15:25:22,236 - INFO - load tmp_weekly_offense
2023-07-20 15:25:22,244 - INFO - prepare a features dataset
2023-07-20 15:25:22,244 - INFO - encode the target win/loss column
2023-07-20 15:25:22,245 - INFO - create a features dataframe for feature selection ...
2023-07-20 15:25:22,246 - INFO - scale all features  ...
2023-07-20 15:25:23,174 - INFO - get percentage contribution of offensive and defensive features
2023-07-20 15:25:23,177 - INFO - calculate weighted average of offensive and defensive features
2023-07-20 15:25:23,184 - INFO - Writing to tmp_offense_week_features
2023-07-20 15:25:23,206 - INFO - SelectNFLFeatures
2023-07-20 15:25:23,207 - INFO - load tmp_weekly_defense
2023-07-20 15:25:23,211 - INFO - prepare a features dataset
2023-07-20 15:25:23,211 - INFO - encode the target win/loss column
2023-07-20 15:25:23,212 - INFO - create a features dataframe for feature selection ...
2023-07-20 15:25:23,213 - INFO - sca

<div style="border: 1px solid rgba(147, 112, 219, 0.4); margin: 1px 0;"></div>

### <font color="#0e0654">merge offense and defense performance features into play action</font>

<font color=#0e0654>

In this step we merge defense and offense data back into play_action data

<br>

For example:


Taking the Ravens (BAL) vs Jets (NYJ) game in week 1 of 2022

for each drive the offense and defense changes:  BAL is offense in drive 1, then defense in drive 2

we create two different slices for that single game - one focused on BAL and the other on NYJ

we then fold in the offense and defense stats for each drive from our defense and offense datasets

<br>

* for drive 1 where BAL is playing offense:

    - the offense's offense_power (offense_op) will come from BAL's offense stats
    - the offense's defense_power (offense_dp) will come from BAL's defense stats
    - the defense's offense_power (defense_op) will come from NYJ's offense stats
    - the defense's defense_power (defense_dp) will come from NYJ's defense stats


</font>
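
The drive-level merge described above can be sketched with two left joins, one keyed on the team with the ball and one on the team defending. Everything here is an assumption for illustration: the column names (`posteam`, `defteam`), the numbers, and the single combined `power` table, which stands in for the separate offense and defense feature datasets.

```python
import pandas as pd

# Hypothetical drive-level rows: posteam has the ball, defteam defends.
play_actions = pd.DataFrame(
    {
        "season": [2022, 2022],
        "week": [1, 1],
        "drive": [1, 2],
        "posteam": ["BAL", "NYJ"],
        "defteam": ["NYJ", "BAL"],
    }
)

# Hypothetical team/week power scores.
power = pd.DataFrame(
    {
        "season": [2022, 2022],
        "week": [1, 1],
        "team": ["BAL", "NYJ"],
        "offense_power": [0.64, 0.36],
        "defense_power": [0.55, 0.45],
    }
)


def attach_power(df: pd.DataFrame, side_col: str, prefix: str) -> pd.DataFrame:
    """Join a team's power scores onto df for the side named by side_col."""
    renamed = power.rename(
        columns={
            "team": side_col,
            "offense_power": f"{prefix}_op",
            "defense_power": f"{prefix}_dp",
        }
    )
    # A left join keeps every drive even if a team has no scores that week.
    return df.merge(renamed, on=["season", "week", side_col], how="left")


merged = attach_power(play_actions, "posteam", "offense")
merged = attach_power(merged, "defteam", "defense")
```

In drive 1, `offense_op` and `offense_dp` come from BAL and `defense_op` and `defense_dp` come from NYJ; in drive 2 the roles swap.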

In [8]:
merge_team_week_features()

2023-07-20 15:25:24,018 - INFO - loading weekly features into a single game dataset...
2023-07-20 15:25:24,019 - INFO - Reading from tmp_weekly_play_actions
2023-07-20 15:25:24,145 - INFO - Reading from tmp_offense_week_features
2023-07-20 15:25:24,149 - INFO - Reading from tmp_defense_week_features
2023-07-20 15:25:24,152 - INFO - merge stats into play_actions...
2023-07-20 15:25:25,020 - INFO - merging offense_OP...
2023-07-20 15:25:26,048 - INFO - merging offense_DP...
2023-07-20 15:25:27,169 - INFO - merging defense_OP...
2023-07-20 15:25:27,884 - INFO - merging defense_DP...
2023-07-20 15:25:27,910 - INFO - aggregate game dataset weekly stats by season, week, team...
2023-07-20 15:25:28,172 - INFO - writing file weekly_game_stats


<div style="border: 1px solid rgba(147, 112, 219, 0.4); margin: 1px 0;"></div>