## <font color=teal>imports</font>

In [2]:
import logging
import os
import sys

from src import configs

sys.path.append(os.path.abspath("../src"))


In [3]:

from src.nfl_00_load_nflverse_data import read_nflverse_datasets
from src.nfl_01_load_nfl_database import create_nfl_database
from src.nfl_02_prepare_weekly_stats import prepare_team_week_dataset
from src.nfl_03_perform_feature_selection import perform_team_week_feature_selection
from src.nfl_04_merge_game_feature_selection import merge_team_week_features

## <font color=teal>housekeeping</font>

In [4]:
# Get the logger
logger = configs.configure_logging("pbp_logger")
logger.setLevel(logging.INFO)

## <font color=teal>ingest data from NFLVerse into local storage, then transform to database dimensions</font>

Our goal in this step is to download the data with as little risk as possible.

This protects us from unexpected changes in the nflverse datasets and from issues with our own transformations.

Once the data is safely in our system, we can dimension it and store it in a database with as many validations and retries as necessary.

Normally we run this from a Python job, 'src.nfl_main_job.py', but for demo purposes we'll run the steps here.



### <font color="#9370DB">read nflverse data to output folders</font>

<font color=purple>

- Our goal is to get the data stored without risk of failure, so we store the files directly to local disk or S3
- The files we want are configured in configs.py
- Downloads use a synchronous HTTP client
- and a fixed-size executor thread pool
</font>
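The bullets above can be sketched roughly as follows: a blocking HTTP fetch run inside a fixed-size thread pool, with each file written straight to an output folder. This is not the code in `src.nfl_00_load_nflverse_data`; the names `fetch_bytes` and `download_all`, the worker count, and the use of `urllib` are assumptions made for illustration.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from urllib.request import urlopen

logger = logging.getLogger("pbp_logger")


def fetch_bytes(url, timeout=30.0):
    """Blocking (synchronous) HTTP GET. Hypothetical helper."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, output_dir, fetch=fetch_bytes, max_workers=4):
    """Download every URL into output_dir using a fixed-size thread pool.

    Returns a dict mapping each URL to True (stored) or False (failed),
    so one bad file does not stop the others.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    def download_one(url):
        # Name the local file after the last path segment of the URL.
        target = out / url.rsplit("/", 1)[-1]
        target.write_bytes(fetch(url))
        return target

    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(download_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                future.result()
                results[url] = True
                logger.info("Success: %s", url)
            except Exception:
                results[url] = False
                logger.exception("Failed: %s", url)
    return results
```

Passing `fetch` as a parameter keeps the pool logic testable without network access, and would let an S3 writer or a retrying client be swapped in later.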

In [5]:
read_nflverse_datasets()

2023-07-20 09:37:31,478 - INFO - Success: https://github.com/nflverse/nflverse-data/releases/download/nextgen_stats/ngs_2016_rushing.csv.gz
2023-07-20 09:37:31,542 - INFO - Success: https://github.com/nflverse/nflverse-data/releases/download/nextgen_stats/ngs_2016_passing.csv.gz
2023-07-20 09:37:31,600 - INFO - Success: https://github.com/nflverse/nflverse-data/releases/download/nextgen_stats/ngs_2016_receiving.csv.gz
2023-07-20 09:37:31,626 - INFO - Success: https://github.com/nflverse/nflverse-data/releases/download/injuries/injuries_2016.parquet
2023-07-20 09:37:31,714 - INFO - Success: https://github.com/nflverse/nflverse-data/releases/download/pbp/play_by_play_2016.parquet
2023-07-20 09:37:31,930 - INFO - Success: https://github.com/nflverse/nflverse-data/releases/download/players/players.parquet
2023-07-20 09:37:32,305 - INFO - Success: https://github.com/nflverse/nflverse-data/releases/download/player_stats/player_stats.parquet
2023-07-20 09:37:32,483 - INFO - Success: https://g

### <font color="#9370DB">load to database</font>

<font color=purple>

In this step we read from the output directory.

We then split the data into several dimensions based on cardinality.

<br>

For example, we:

<br>

* explode the player participation array columns into their own datasets
* create a play_actions dataset that contains key play-by-play info
* pull out all the player events in the play-by-play data and cross-reference them to the team each player is playing for in that week
* pull out game data, e.g. game date, final scores, home and away teams, etc.

<br>

We load all dimensions to a relational database so they are available to other experiments.


todo:  the load process should use a bulk/copy load instead of inserts
</font>
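As a rough illustration of the first bullet and of the database load, the sketch below explodes a participation column into one row per (play, player) and inserts the result in a single batch. It uses only the standard library with an in-memory SQLite database. The column name `offense_players`, the `;` separator, the table name, and the sample ids are all assumptions, not the project's actual schema.

```python
import sqlite3


def explode_players(plays, column="offense_players", sep=";"):
    """Turn one row per play into one row per (play_id, player_id)."""
    rows = []
    for play in plays:
        raw = play.get(column) or ""  # treat missing/None as no players
        for player_id in filter(None, raw.split(sep)):
            rows.append((play["play_id"], player_id))
    return rows


def load_dimension(conn, table, rows):
    """Create the table if needed and insert all rows in one transaction."""
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} (play_id TEXT, player_id TEXT)"
    )
    with conn:  # commits once for the whole batch
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)


# Made-up sample data for illustration only.
plays = [
    {"play_id": "play-1", "offense_players": "P1;P2;P3"},
    {"play_id": "play-2", "offense_players": ""},
]
conn = sqlite3.connect(":memory:")
rows = explode_players(plays)
load_dimension(conn, "offense_participation", rows)
```

Batching with `executemany` inside one transaction is only a small step toward the bulk/copy load mentioned in the todo; a database-specific bulk loader would likely be faster for full seasons of play-by-play data.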

In [6]:
create_nfl_database()

2023-07-20 09:37:40,213 - INFO - Reading all files from pbp
2023-07-20 09:37:40,214 - INFO -   + Reading pbp_2019.parquet
2023-07-20 09:37:40,371 - INFO -   + Reading pbp_2018.parquet
2023-07-20 09:37:40,469 - INFO -   + Reading pbp_2022.parquet
2023-07-20 09:37:40,569 - INFO -   + Reading pbp_2021.parquet
2023-07-20 09:37:40,665 - INFO -   + Reading pbp_2017.parquet
2023-07-20 09:37:40,765 - INFO -   + Reading pbp_2016.parquet
2023-07-20 09:37:40,864 - INFO -   + Reading pbp_2020.parquet
2023-07-20 09:37:43,921 - INFO - Impute columns to 0
2023-07-20 09:37:44,091 - INFO - impute non binary pbp columns ...
2023-07-20 09:37:44,641 - INFO - Impute columns to 0
2023-07-20 09:37:45,290 - INFO - Impute columns to 0:00
2023-07-20 09:37:46,596 - INFO - Impute columns to NA
2023-07-20 09:37:52,080 - INFO - moving play_id to play_counter, and creating a joinable play_id key
2023-07-20 09:37:52,817 - INFO - Conform key actions like pass, rush, kickoff, etc. and add a single category field called

## <font color=teal>prepare data for team/week win/loss experiment</font>

Our goal is to use the NFL dimensions we have created to prepare data for a team/week experiment.

The team_week experiment aims to select the features in the nflverse data that best predict win/loss.

The expectation is not that we can predict win/loss any better than current statistical approaches, but to explore the potential of ML.

Normally we run this from a Python job, 'src.nfl_main_job.py', but for demo purposes we'll run the steps here.


### <font color="#9370DB">prepare weekly stats</font>

<font color=purple>

In this step we use SQL queries to pull play actions, events and statistics from the NFL database.

We lag and lead (forward and back fill) to fill in incomplete data in the original dataset.

We merge all statistics into defense and offense datasets so they can be attributed to specific teams.

For example, for the Ravens (BAL) vs Jets (NYJ) game in week 1 of 2022 we want separate sets for BAL and NYJ, so each team has its own stats.

We aggregate up to the Season, Week, Team level for this application.

We can then concatenate the other available statistics, which are also at the Season, Week, Team level.

Finally, we create play_action, offense and defense datasets in our data directory and optionally write them back to the database.

</font>
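The fill step described above can be sketched in plain Python: within each (season, team) group, rows are ordered by week, gaps are forward filled from the previous week, and any leading gaps are then back filled from the next known week. The function name `fill_by_week`, the metric name `avg_air_yards`, and the sample values are assumptions for illustration; the project itself may do this in SQL or pandas.

```python
from collections import defaultdict


def fill_by_week(rows, metric):
    """Forward fill, then back fill, `metric` per (season, team) by week.

    Returns new row dicts; the input rows are not modified.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["season"], row["team"])].append(dict(row))

    filled = []
    for key in sorted(groups):
        weeks = sorted(groups[key], key=lambda r: r["week"])

        last_seen = None
        for row in weeks:  # forward fill: carry the previous week's value
            if row[metric] is None:
                row[metric] = last_seen
            else:
                last_seen = row[metric]

        next_seen = None
        for row in reversed(weeks):  # back fill: cover leading gaps
            if row[metric] is None:
                row[metric] = next_seen
            else:
                next_seen = row[metric]

        filled.extend(weeks)
    return filled


# Made-up values for illustration only.
rows = [
    {"season": 2022, "week": 1, "team": "BAL", "avg_air_yards": None},
    {"season": 2022, "week": 2, "team": "BAL", "avg_air_yards": 7.5},
    {"season": 2022, "week": 3, "team": "BAL", "avg_air_yards": None},
    {"season": 2022, "week": 1, "team": "NYJ", "avg_air_yards": 6.0},
]
filled = fill_by_week(rows, "avg_air_yards")
```

Grouping by team before filling matters: without it, a gap in one team's data could be filled with another team's value.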


In [7]:
prepare_team_week_dataset(store_to_db=False)

2023-07-20 09:46:12,034 - INFO - Build a 'control' dataset with all seasons and weeks...
2023-07-20 09:46:12,641 - INFO - query and modify play actions data ...
2023-07-20 09:46:16,236 - INFO - Validating game 2016_01_BUF_BAL values at location: double checking play actions before save...
2023-07-20 09:46:16,257 - INFO - query and modify game info data ...
2023-07-20 09:46:16,284 - INFO - query and modify next gen passing data ...
2023-07-20 09:46:16,321 - INFO - query and modify next gen rushing data ...
2023-07-20 09:46:16,338 - INFO - query and modify play-by-play player events ...
2023-07-20 09:46:20,983 - INFO - query defense player_stats data ...
2023-07-20 09:46:21,110 - INFO - query offense player_stats data ...
2023-07-20 09:46:21,267 - INFO - back and forward fill ngs_air_power metrics by week ...
2023-07-20 09:46:21,285 - INFO - back and forward fill ngs_ground_power metrics by week ...
2023-07-20 09:46:21,294 - INFO - back and forward fill pbp_events metrics by week ...
202

Shape of ngs_air_power                 :  (3961, 21),	 Contains 7 seasons, starting with 2016 and ending in 2022 min week: 1, max week : 22
Shape of ngs_ground_power              :  (3961, 11),	 Contains 7 seasons, starting with 2016 and ending in 2022 min week: 1, max week : 22
Shape of pbp_events                    :  (3961, 9),	 Contains 7 seasons, starting with 2016 and ending in 2022 min week: 1, max week : 22
Shape of defense_stats                 :  (3961, 13),	 Contains 7 seasons, starting with 2016 and ending in 2022 min week: 1, max week : 22
Shape of possession_stats              :  (3961, 28),	 Contains 7 seasons, starting with 2016 and ending in 2022 min week: 1, max week : 22
Shape of game info                     :  (3812, 11),	 Contains 7 seasons, starting with 2016 and ending in 2022 min week: 1, max week : 22
ok
ok


### <font color="#9370DB">feature selection</font>

<font color=purple>

In this step we use the weekly datasets we've created to find the best features for predicting game win/loss.

We perform some data prep, including scaling and categorical encoding.

We run sklearn correlations, both generally and specifically against the target win/loss column.

After some AutoML experiments, we use XGBoost to determine the best features for predicting win/loss.

We separate the top features for defense and offense and calculate a weighted average of each set to get a single defense_power and offense_power score.

We perform a sanity check to validate that a model can still learn from the power score columns, and that it learns as well from them as from the individual stats.

</font>
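The weighted-average step can be illustrated with a small sketch. It assumes feature importances are already available (for example from a fitted XGBoost model), converts them to percentage contributions, and combines the scaled feature values into one score. The function names, feature names, and numbers below are hypothetical and are not taken from the project's results.

```python
def contribution_weights(importances):
    """Normalize raw importances so the weights sum to 1."""
    total = sum(importances.values())
    if total == 0:
        raise ValueError("importances must not all be zero")
    return {name: value / total for name, value in importances.items()}


def power_score(scaled_row, weights):
    """Weighted average of the selected (already scaled) features."""
    return sum(weights[name] * scaled_row[name] for name in weights)


# Made-up importances and scaled feature values for one team-week.
importances = {"passing_yards": 3.0, "rushing_yards": 1.0}
weights = contribution_weights(importances)
row = {"passing_yards": 0.5, "rushing_yards": 0.25}
offense_power = power_score(row, weights)
```

The same two functions would be applied separately to the offense and defense feature sets to produce `offense_power` and `defense_power`.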


In [8]:
perform_team_week_feature_selection()

2023-07-20 09:46:28,362 - INFO - SelectNFLFeatures
2023-07-20 09:46:28,363 - INFO - load tmp_weekly_offense
2023-07-20 09:46:28,371 - INFO - prepare a features dataset
2023-07-20 09:46:28,371 - INFO - encode the target win/loss column
2023-07-20 09:46:28,373 - INFO - create a features dataframe for feature selection ...
2023-07-20 09:46:28,373 - INFO - scale all features  ...
2023-07-20 09:46:29,998 - INFO - get percentage contribution of offensive and defensive features
2023-07-20 09:46:30,012 - INFO - calculate weighted average of offensive and defensive features
2023-07-20 09:46:30,029 - INFO - Writing to tmp_offense_week_features
2023-07-20 09:46:30,052 - INFO - SelectNFLFeatures
2023-07-20 09:46:30,052 - INFO - load tmp_weekly_defense
2023-07-20 09:46:30,067 - INFO - prepare a features dataset
2023-07-20 09:46:30,067 - INFO - encode the target win/loss column
2023-07-20 09:46:30,068 - INFO - create a features dataframe for feature selection ...
2023-07-20 09:46:30,069 - INFO - sca

### <font color="#9370DB">merge offense and defense features into play actions</font>

<font color=purple>

In this step we merge defense and offense data back into the play_action data.

<br>

For example:


Take the Ravens (BAL) vs Jets (NYJ) game in week 1 of 2022.

The offense and defense change with each drive:  BAL is on offense in drive 1, then on defense in drive 2.

We create two different slices for that single game, one focused on BAL and the other on NYJ.

We then fold in the offense and defense stats for each drive from our defense and offense datasets.

<br>

* for drive 1, where BAL is playing offense:

    - the offense's offense_power (offense_op) will come from BAL's offense stats
    - the offense's defense_power (offense_dp) will come from BAL's defense stats
    - the defense's offense_power (defense_op) will come from NYJ's offense stats
    - the defense's defense_power (defense_dp) will come from NYJ's defense stats


</font>
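The mapping in the bullets above can be sketched as a lookup keyed by (season, week, team). The function name `attach_power_scores`, the field names `offense_team` and `defense_team`, and the scores are assumptions for illustration; only the four output column names come from the description.

```python
def attach_power_scores(drive, offense_power, defense_power):
    """Add both teams' offense and defense power scores to one drive row."""
    offense_key = (drive["season"], drive["week"], drive["offense_team"])
    defense_key = (drive["season"], drive["week"], drive["defense_team"])
    return {
        **drive,
        "offense_op": offense_power[offense_key],  # offense team's offense stats
        "offense_dp": defense_power[offense_key],  # offense team's defense stats
        "defense_op": offense_power[defense_key],  # defense team's offense stats
        "defense_dp": defense_power[defense_key],  # defense team's defense stats
    }


# Made-up weekly scores for illustration only.
offense_power = {(2022, 1, "BAL"): 0.62, (2022, 1, "NYJ"): 0.41}
defense_power = {(2022, 1, "BAL"): 0.55, (2022, 1, "NYJ"): 0.48}

drive_1 = {"season": 2022, "week": 1, "drive": 1,
           "offense_team": "BAL", "defense_team": "NYJ"}
drive_2 = {"season": 2022, "week": 1, "drive": 2,
           "offense_team": "NYJ", "defense_team": "BAL"}

merged_1 = attach_power_scores(drive_1, offense_power, defense_power)
merged_2 = attach_power_scores(drive_2, offense_power, defense_power)
```

When possession flips in drive 2, the same lookups return the swapped values, so `offense_op` then carries NYJ's offense score and `defense_dp` carries BAL's defense score.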

In [9]:
merge_team_week_features()

2023-07-20 09:46:30,918 - INFO - loading weekly features into a single game dataset...
2023-07-20 09:46:30,919 - INFO - Reading from tmp_weekly_play_actions
2023-07-20 09:46:31,029 - INFO - Reading from tmp_offense_week_features
2023-07-20 09:46:31,033 - INFO - Reading from tmp_defense_week_features
2023-07-20 09:46:31,037 - INFO - merge stats into play_actions...
2023-07-20 09:46:31,874 - INFO - Validating game 2016_01_BUF_BAL values at location: merging offense_OP...
2023-07-20 09:46:32,894 - INFO - Validating game 2016_01_BUF_BAL values at location: merging offense_DP...
2023-07-20 09:46:34,008 - INFO - Validating game 2016_01_BUF_BAL values at location: merging defense_OP...
2023-07-20 09:46:34,700 - INFO - Validating game 2016_01_BUF_BAL values at location: merging defense_DP...
2023-07-20 09:46:34,728 - INFO - aggregate game dataset weekly stats by season, week, team...
2023-07-20 09:46:34,980 - INFO - writing file weekly_game_stats  to /Users/christopherlomeli/Source/courses/dat