
<font color=teal>
_______________________________________
</font>


### <font color=teal>Goal:</font>

- Build a dimensional version of the nflverse data, separating data with different cardinalities

### <font color=teal>Input:</font>

- output directory where we downloaded nflverse files


### <font color=teal>Steps:</font>
- Split data into smaller dimensions, e.g. game info vs. play info vs. play analytics
- Insert the data into DB tables
- Keep the tables in a database for further experimentation
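
The splitting step can be pictured with a small sketch. The column names and the `split_dimensions` helper are hypothetical and chosen only for illustration; the real logic lives in `transform_pbp` and may differ.

```python
import pandas as pd

# Hypothetical wide play-by-play frame: game-level columns repeat on every play row.
pbp_sample = pd.DataFrame({
    "game_id": ["g1", "g1", "g2"],
    "play_id": [1, 2, 1],
    "home_team": ["KC", "KC", "SF"],
    "away_team": ["DEN", "DEN", "LA"],
    "down": [1, 2, 1],
    "yards_gained": [5, 0, 12],
})


def split_dimensions(df: pd.DataFrame) -> dict:
    """Separate low-cardinality game columns from per-play columns (illustrative only)."""
    # One row per game: drop the repeats that come from the play-level grain.
    game_info = (
        df[["game_id", "home_team", "away_team"]]
        .drop_duplicates()
        .reset_index(drop=True)
    )
    # One row per play: keep game_id so the two tables can be joined back together.
    play_info = df[["game_id", "play_id", "down", "yards_gained"]].copy()
    return {"game_info": game_info, "play_info": play_info}


dims = split_dimensions(pbp_sample)
```

Keeping `game_id` in both frames is what lets the smaller tables be joined again later in SQL.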

### <font color=teal>Output:</font>

- DB tables


![nflverse database](../images/database.png)



<font color=teal>
_______________________________________
</font>


# <font color=teal>imports</font>
Most processing is performed in Python code, and a Python module is available to run everything here without manual steps.

In [1]:
import logging
import os
import sys

# Add ../src to the module search path before any project imports run.
sys.path.append(os.path.abspath("../src"))

from src import configs



Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)


In [2]:
from src.nflverse_transform_job import load_files
from src.pbp_fact import transform_pbp
from src.pbp_participation import transform_pbp_participation
from src.player_stats import transform_player_stats, merge_injuries, transform_players
from src.player_injuries import prep_player_injuries
from src.db_utils import load_dims_to_db


# <font color=teal>housekeeping</font>

In [3]:
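# Set to False to skip the database load at the end of the notebook.
LOAD_TO_DB = True
# Target schema for every table created by the load step.
database_schema = 'controls'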

# Get the logger
logger = configs.configure_logging("pbp_logger")
logger.setLevel(logging.INFO)

---

# <font color=teal>load and transform play by play datasets</font>

### <font color="#9370DB">load</font>

In [4]:

%%time
pbp = load_files(data_subdir='pbp')


2023-07-18 07:43:21,208 - INFO - Reading all files from pbp
2023-07-18 07:43:21,209 - INFO -   + Reading pbp_2019.parquet
2023-07-18 07:43:21,381 - INFO -   + Reading pbp_2018.parquet
2023-07-18 07:43:21,510 - INFO -   + Reading pbp_2022.parquet
2023-07-18 07:43:21,636 - INFO -   + Reading pbp_2021.parquet
2023-07-18 07:43:21,756 - INFO -   + Reading pbp_2017.parquet
2023-07-18 07:43:21,860 - INFO -   + Reading pbp_2016.parquet
2023-07-18 07:43:21,974 - INFO -   + Reading pbp_2020.parquet


CPU times: user 2.64 s, sys: 571 ms, total: 3.21 s
Wall time: 2.21 s
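
Judging by the log, `load_files` reads every file in a data subdirectory and returns a single dataset. A minimal sketch of that pattern follows, assuming one file per season that can simply be stacked row-wise. The `read_all_files` helper is hypothetical and is not the notebook's actual implementation.

```python
from pathlib import Path

import pandas as pd


def read_all_files(directory: str) -> pd.DataFrame:
    """Read every supported file in `directory` and stack the rows (hypothetical helper)."""
    frames = []
    # Sorting is a choice made here for reproducible row order.
    for path in sorted(Path(directory).iterdir()):
        if path.suffix == ".parquet":
            frames.append(pd.read_parquet(path))  # needs pyarrow or fastparquet installed
        elif path.name.endswith((".csv", ".csv.gz")):
            frames.append(pd.read_csv(path))  # compression is inferred from the extension
    if not frames:
        raise FileNotFoundError(f"no parquet or csv files found in {directory}")
    return pd.concat(frames, ignore_index=True)
```

Failing loudly on an empty directory is a design choice: a silent empty frame would only surface later as a confusing error in the transform step.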


### <font color="#9370DB">transform</font>

In [5]:

%%time
datasets = transform_pbp(pbp)

2023-07-18 07:43:25,967 - INFO - Impute columns to 0
2023-07-18 07:43:26,141 - INFO - impute non binary pbp columns ...
2023-07-18 07:43:26,811 - INFO - Impute columns to 0
2023-07-18 07:43:27,400 - INFO - Impute columns to 0:00
2023-07-18 07:43:28,900 - INFO - Impute columns to NA
2023-07-18 07:43:35,290 - INFO - moving play_id to play_counter, and creating a joinable play_id key
2023-07-18 07:43:36,124 - INFO - Conform key actions like pass, rush, kickoff, etc. and add a single category field called actions... 
2023-07-18 07:43:47,972 - INFO - Validate actions dimension ...
2023-07-18 07:43:48,299 - INFO - Creating new drive dimension...
2023-07-18 07:43:48,371 - INFO - Validate drive_df dimension ...
2023-07-18 07:43:48,627 - INFO - Creating new situations dimension...
2023-07-18 07:43:48,697 - INFO - Validate situation_df dimension ...
2023-07-18 07:43:48,935 - INFO - Creating new metrics dimension...
2023-07-18 07:43:48,982 - INFO - Validate play_metrics_df dimension ...
2023-07-1

CPU times: user 26.2 s, sys: 5.63 s, total: 31.8 s
Wall time: 32.6 s
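
Two of the logged steps, imputing missing flags to 0 and turning `play_id` into a joinable key, can be sketched as below. The key format (`game_id` plus the counter) and the `touchdown` column are assumptions for illustration; `transform_pbp` may build the key differently.

```python
import pandas as pd

plays = pd.DataFrame({
    "game_id": ["2022_01_BUF_LA", "2022_01_BUF_LA", "2022_01_NO_ATL"],
    "play_id": [1, 43, 1],            # repeats across games, so not unique on its own
    "touchdown": [None, 1.0, None],   # binary flag with missing values
})

# Keep the original per-game counter under a new name.
plays["play_counter"] = plays["play_id"]
# Build a key that is unique across games (assumed format: game_id + "_" + counter).
plays["play_id"] = plays["game_id"] + "_" + plays["play_counter"].astype(str)
# Impute the missing binary flag to 0 and store it as an integer.
plays["touchdown"] = plays["touchdown"].fillna(0).astype(int)
```

With a key like this, every dimension table split out of the play-by-play data can be joined back on a single column.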


---

# <font color=teal>load and transform play by play participation datasets</font>

### <font color="#9370DB">load</font>

In [6]:
%%time
pbp_participation_df = load_files('pbp-participation')


2023-07-18 06:27:27,412 - INFO - Reading all files from pbp-participation
2023-07-18 06:27:27,413 - INFO -   + Reading pbp-participation_2019.parquet
2023-07-18 06:27:27,443 - INFO -   + Reading pbp-participation_2018.parquet
2023-07-18 06:27:27,469 - INFO -   + Reading pbp-participation_2017.parquet
2023-07-18 06:27:27,493 - INFO -   + Reading pbp-participation_2021.parquet
2023-07-18 06:27:27,517 - INFO -   + Reading pbp-participation_2020.parquet
2023-07-18 06:27:27,540 - INFO -   + Reading pbp-participation_2016.parquet
2023-07-18 06:27:27,562 - INFO -   + Reading pbp-participation_2022.parquet


CPU times: user 280 ms, sys: 96.3 ms, total: 376 ms
Wall time: 269 ms


### <font color="#9370DB">transform</font>

In [7]:
%%time
player_df, player_events_df = transform_pbp_participation(
    participation_df=pbp_participation_df,
    player_events=datasets['player_events'])

datasets.update({
    'player_participation': player_df,
    'player_events': player_events_df,
})

2023-07-18 06:27:27,684 - INFO - pbp_participation:  move play_id to a play_count column and create a unique play_id that can be used in joins...
2023-07-18 06:27:27,938 - INFO - Calculating defense and offense team names by player and play...
2023-07-18 06:27:30,745 - INFO - Exploding offensive players to their own dataset...
2023-07-18 06:27:32,282 - INFO - Exploding defense_players to their own dataset...


CPU times: user 16.3 s, sys: 1.46 s, total: 17.7 s
Wall time: 17.8 s
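
The "exploding" steps in the log turn one row per play into one row per player per play. A minimal sketch with pandas follows, assuming the participation data stores player ids as a `;`-delimited string in an `offense_players` column. The sample values are made up.

```python
import pandas as pd

participation = pd.DataFrame({
    "play_id": ["g1_1", "g1_2"],
    "offense_players": ["00-001;00-002", "00-003"],  # assumed ';'-delimited id lists
})

offense = (
    participation
    # Split the delimited string into a Python list per row.
    .assign(player_id=participation["offense_players"].str.split(";"))
    # Emit one row per list element, repeating the other columns.
    .explode("player_id")
    .drop(columns="offense_players")
    .reset_index(drop=True)
)
```

The same pattern would apply to a defensive player column; the row count grows by roughly the number of players on the field per play, which is consistent with this being one of the slower transforms.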


---

# <font color=teal>transform player injuries</font>

### <font color="#9370DB">load</font>

In [8]:
%%time
injuries_df = load_files('injuries')

2023-07-18 06:27:45,499 - INFO - Reading all files from injuries
2023-07-18 06:27:45,500 - INFO -   + Reading injuries_2017.parquet
2023-07-18 06:27:45,506 - INFO -   + Reading injuries_2021.parquet
2023-07-18 06:27:45,511 - INFO -   + Reading injuries_2020.parquet
2023-07-18 06:27:45,515 - INFO -   + Reading injuries_2016.parquet
2023-07-18 06:27:45,520 - INFO -   + Reading injuries_2022.parquet
2023-07-18 06:27:45,524 - INFO -   + Reading injuries_2019.parquet
2023-07-18 06:27:45,529 - INFO -   + Reading injuries_2018.parquet


CPU times: user 47.9 ms, sys: 8.66 ms, total: 56.5 ms
Wall time: 44.9 ms


### <font color="#9370DB">transform</font>

In [9]:
%%time
injuries_df = prep_player_injuries(injuries_df)

2023-07-18 06:27:45,547 - INFO - Prep injury data...
2023-07-18 06:27:45,547 - INFO - Conforming names (e.g. gsis_id -> player_id)
2023-07-18 06:27:45,557 - INFO - Merge sparse injury columns
2023-07-18 06:27:45,560 - INFO - Get best values for null report_statuses...
2023-07-18 06:27:45,633 - INFO - check that all positions are correct...


CPU times: user 88.6 ms, sys: 3.84 ms, total: 92.5 ms
Wall time: 94.3 ms
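
The renaming and "merge sparse injury columns" steps might look like the sketch below. The column names and the `'none'` placeholder are assumptions for illustration; `prep_player_injuries` may choose different columns or precedence.

```python
import pandas as pd

injuries = pd.DataFrame({
    "gsis_id": ["00-001", "00-002", "00-003"],
    "report_primary_injury": ["Knee", None, None],
    "practice_primary_injury": [None, "Ankle", None],
})

# Conform the key name so it matches the other player tables.
injuries = injuries.rename(columns={"gsis_id": "player_id"})
# Prefer the game-report value and fall back to the practice-report value.
injuries["primary_injury"] = (
    injuries["report_primary_injury"]
    .combine_first(injuries["practice_primary_injury"])
    .fillna("none")
)
# Drop the sparse source columns once they have been merged.
injuries = injuries.drop(columns=["report_primary_injury", "practice_primary_injury"])
```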


---

# <font color=teal>transform player stats</font>

In [10]:
%%time
stats_df = load_files('player-stats')
stats_df = transform_player_stats(stats_df)
stats_df = merge_injuries(player_stats=stats_df, player_injuries=injuries_df)

2023-07-18 06:27:45,644 - INFO - Reading all files from player-stats
2023-07-18 06:27:45,648 - INFO -   + Reading player-stats.parquet
2023-07-18 06:27:45,690 - INFO - fix specific player_stats: <function player_stats_fixes at 0x2acdc4280>..
2023-07-18 06:27:45,772 - INFO - replace empty position_groups with position info...
2023-07-18 06:27:45,788 - INFO - replace empty player_name with player_display_name info...
2023-07-18 06:27:45,801 - INFO - replace empty headshot_url with 'none'...
2023-07-18 06:27:45,812 - INFO - fillna(0) for all binary columns...
2023-07-18 06:27:45,813 - INFO - Impute columns to 0


CPU times: user 374 ms, sys: 52.5 ms, total: 427 ms
Wall time: 375 ms
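
Attaching injuries to weekly stats is naturally a left join, so that players without an injury report keep their stat rows. A minimal sketch follows, assuming the join keys are `player_id`, `season`, and `week` and that a missing status is filled with `'none'`. Both are assumptions; `merge_injuries` may differ.

```python
import pandas as pd

stats = pd.DataFrame({
    "player_id": ["00-001", "00-002"],
    "season": [2022, 2022],
    "week": [1, 1],
    "passing_yards": [250, 0],
})
injury_reports = pd.DataFrame({
    "player_id": ["00-002"],
    "season": [2022],
    "week": [1],
    "report_status": ["Questionable"],
})

merged = stats.merge(
    injury_reports,
    on=["player_id", "season", "week"],
    how="left",               # keep every stats row, injured or not
    validate="many_to_one",   # raise if a player has duplicate reports for a week
)
# Players with no report get a placeholder status (assumed value).
merged["report_status"] = merged["report_status"].fillna("none")
```

The `validate` argument guards against silently duplicating stat rows when the injury table has more than one row per key.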


---

# <font color=teal>direct loads </font>

### <font color="#9370DB">adv stats</font>

In [11]:
%%time

advstats_def_df = load_files('advstats-season-def')
advstats_pass_df = load_files('advstats-season-pass')
advstats_rec_df = load_files('advstats-season-rec')
advstats_rush_df = load_files('advstats-season-rush')


2023-07-18 06:27:46,022 - INFO - Reading all files from advstats-season-def
2023-07-18 06:27:46,024 - INFO -   + Reading advstats-season-def.parquet
2023-07-18 06:27:46,029 - INFO - Reading all files from advstats-season-pass
2023-07-18 06:27:46,030 - INFO -   + Reading advstats-season-pass.parquet
2023-07-18 06:27:46,032 - INFO - Reading all files from advstats-season-rec
2023-07-18 06:27:46,033 - INFO -   + Reading advstats-season-rec.parquet
2023-07-18 06:27:46,037 - INFO - Reading all files from advstats-season-rush
2023-07-18 06:27:46,037 - INFO -   + Reading advstats-season-rush.parquet


CPU times: user 19.1 ms, sys: 6.67 ms, total: 25.8 ms
Wall time: 18.4 ms


### <font color="#9370DB">nextgen stats</font>

In [12]:
%%time
next_pass_df = load_files('nextgen-passing')


2023-07-18 06:27:46,046 - INFO - Reading all files from nextgen-passing
2023-07-18 06:27:46,047 - INFO -   + Reading nextgen-passing_2017.csv.gz
2023-07-18 06:27:46,052 - INFO -   + Reading nextgen-passing_2021.csv.gz
2023-07-18 06:27:46,056 - INFO -   + Reading nextgen-passing_2019.csv.gz
2023-07-18 06:27:46,061 - INFO -   + Reading nextgen-passing_2020.csv.gz
2023-07-18 06:27:46,065 - INFO -   + Reading nextgen-passing_2016.csv.gz
2023-07-18 06:27:46,070 - INFO -   + Reading nextgen-passing_2022.csv.gz
2023-07-18 06:27:46,075 - INFO -   + Reading nextgen-passing_2018.csv.gz


CPU times: user 29.9 ms, sys: 5.15 ms, total: 35.1 ms
Wall time: 36.7 ms


In [13]:
%%time
next_rec_df = load_files('nextgen-receiving')


2023-07-18 06:27:46,085 - INFO - Reading all files from nextgen-receiving
2023-07-18 06:27:46,087 - INFO -   + Reading nextgen-receiving_2021.csv.gz
2023-07-18 06:27:46,093 - INFO -   + Reading nextgen-receiving_2017.csv.gz
2023-07-18 06:27:46,099 - INFO -   + Reading nextgen-receiving_2019.csv.gz
2023-07-18 06:27:46,106 - INFO -   + Reading nextgen-receiving_2016.csv.gz
2023-07-18 06:27:46,113 - INFO -   + Reading nextgen-receiving_2020.csv.gz
2023-07-18 06:27:46,120 - INFO -   + Reading nextgen-receiving_2018.csv.gz
2023-07-18 06:27:46,127 - INFO -   + Reading nextgen-receiving_2022.csv.gz


CPU times: user 42.6 ms, sys: 7.87 ms, total: 50.5 ms
Wall time: 51.1 ms


In [14]:
%%time
next_rush_df = load_files('nextgen-rushing')

2023-07-18 06:27:46,139 - INFO - Reading all files from nextgen-rushing
2023-07-18 06:27:46,141 - INFO -   + Reading nextgen-rushing_2018.csv.gz
2023-07-18 06:27:46,145 - INFO -   + Reading nextgen-rushing_2022.csv.gz
2023-07-18 06:27:46,150 - INFO -   + Reading nextgen-rushing_2016.csv.gz
2023-07-18 06:27:46,153 - INFO -   + Reading nextgen-rushing_2020.csv.gz
2023-07-18 06:27:46,156 - INFO -   + Reading nextgen-rushing_2019.csv.gz
2023-07-18 06:27:46,159 - INFO -   + Reading nextgen-rushing_2021.csv.gz
2023-07-18 06:27:46,164 - INFO -   + Reading nextgen-rushing_2017.csv.gz


CPU times: user 23.8 ms, sys: 5.32 ms, total: 29.2 ms
Wall time: 29.4 ms


### <font color="#9370DB">players</font>

In [15]:
%%time
players_df = load_files('players')
players_df = transform_players(players_df)

2023-07-18 06:27:46,172 - INFO - Reading all files from players
2023-07-18 06:27:46,174 - INFO -   + Reading players.parquet
2023-07-18 06:27:46,203 - INFO - Process players dataset...
2023-07-18 06:27:46,203 - INFO - drop players without gsis_ids - they won't link to player_stats
2023-07-18 06:27:46,221 - INFO - fill empty players status to 'NONE'
2023-07-18 06:27:46,228 - INFO - rename gsis_id to player_id...


CPU times: user 63.5 ms, sys: 6.42 ms, total: 69.9 ms
Wall time: 58.2 ms


---

# <font color=teal>store to database so we can perform some SQL operations</font>

In [16]:
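def load_all_datasets_to_db(data: dict):
    """Tag the datasets with the target schema and load them into the database."""
    # Note: this adds a 'schema' key to the dict that was passed in.
    data['schema'] = database_schema
    load_dims_to_db(data)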


In [17]:
%%time
if LOAD_TO_DB:
    datasets.update({
        'players': players_df,
        'player_stats': stats_df,
        'adv_stats_def': advstats_def_df,
        'adv_stats_pass': advstats_pass_df,
        'adv_stats_rec': advstats_rec_df,
        'adv_stats_rush': advstats_rush_df,
        'nextgen_pass': next_pass_df,
        'nextgen_rec': next_rec_df,
        'nextgen_rush': next_rush_df
    })
    load_all_datasets_to_db(datasets)

2023-07-18 06:27:46,236 - INFO - create table play_actions in schema controls
2023-07-18 06:28:34,283 - INFO - create table game_drive in schema controls
2023-07-18 06:29:00,262 - INFO - create table play_analytics in schema controls
2023-07-18 06:30:25,748 - INFO - create table play_situations in schema controls
2023-07-18 06:30:57,303 - INFO - create table play_metrics in schema controls
2023-07-18 06:31:22,067 - INFO - create table player_events in schema controls
2023-07-18 06:31:33,618 - INFO - create table game_info in schema controls
2023-07-18 06:31:33,890 - INFO - create table player_participation in schema controls
2023-07-18 06:35:58,738 - INFO - create table players in schema controls
2023-07-18 06:36:00,617 - INFO - create table player_stats in schema controls
2023-07-18 06:36:19,564 - INFO - create table adv_stats_def in schema controls
2023-07-18 06:36:20,041 - INFO - create table adv_stats_pass in schema controls
2023-07-18 06:36:20,108 - INFO - create table adv_stats_r

CPU times: user 4min 28s, sys: 43.9 s, total: 5min 12s
Wall time: 8min 39s
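
For readers without access to the project's database, the load step can be approximated with an in-memory SQLite database. This is a sketch under stated assumptions, not how `load_dims_to_db` works: the `load_frames_to_sqlite` helper is hypothetical, and because SQLite has no schemas the `'schema'` entry is simply skipped.

```python
import sqlite3

import pandas as pd


def load_frames_to_sqlite(frames: dict, conn: sqlite3.Connection) -> None:
    """Write each DataFrame in `frames` to a table named after its key (hypothetical helper)."""
    for table_name, df in frames.items():
        if not isinstance(df, pd.DataFrame):
            continue  # skip non-table entries such as the 'schema' string
        df.to_sql(table_name, conn, if_exists="replace", index=False)


conn = sqlite3.connect(":memory:")
load_frames_to_sqlite(
    {
        "schema": "controls",
        "players": pd.DataFrame({"player_id": ["00-001"], "status": ["ACT"]}),
    },
    conn,
)
# The tables are now queryable with plain SQL.
row_count = conn.execute("SELECT COUNT(*) FROM players").fetchone()[0]
```

Using `if_exists="replace"` makes the load repeatable, at the cost of dropping any table that already exists under the same name.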
