# Summary

## inputs
This notebook has three inputs:
1. An NFL `scores` dataset with one row per game
2. Our NFL `gameplay` dataset with many rows per game, where each row is a single play
3. An NFL `teams` dataset that maps team names to the abbreviations used in the gameplay data (e.g. Green Bay Packers == GB)

The gameplay data does not have a clear final score. It is primarily concerned with the plays themselves, so recovering the actual score from it is hit-and-miss.

## goal
* The goal is to clean and enrich the data so that we can join gameplay data to the actual scores for each game

## cleanup
* The gameplay data has one or two incorrect dates
* The scores data uses present-day abbreviations for historical games (e.g. Jacksonville was 'JAC' until 2013 and then became 'JAX', but the scores data always uses 'JAX')
* The gameplay data keeps Jacksonville as 'JAC' all the way to 2016 (it should be 2013); we leave that as-is for now and conform our scores data to it
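
The Jacksonville rule above can be sketched in plain Python. This is a minimal illustration rather than the notebook's implementation: `conform_team_id` and `JAC_CUTOFF_SEASON` are hypothetical names, and the 2016 cutoff is taken from the gameplay data's behavior described above.

```python
# Hypothetical helper: map a present-day abbreviation to the one the
# gameplay data uses for a given season.
JAC_CUTOFF_SEASON = 2016  # assumption: gameplay data uses 'JAC' for seasons before 2016


def conform_team_id(team_id: str, season: int) -> str:
    """Return the abbreviation the gameplay data would use for this season."""
    if team_id == "JAX" and season < JAC_CUTOFF_SEASON:
        return "JAC"
    return team_id


print(conform_team_id("JAX", 2012))  # JAC
print(conform_team_id("JAX", 2016))  # JAX
print(conform_team_id("GB", 2012))   # GB
```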

## outputs
Join-able versions of:
1. A cleaned NFL `scores` dataset having 1 row per game
2. A cleaned NFL `gameplay` dataset - having many rows per game - each row is a 'play' in the games




# 01 - Prepare

## 01.1 - imports

In [206]:
%load_ext autoreload
%load_ext dotenv
%dotenv
%autoreload 2

import warnings

import numpy as np
import pandas as pd
import os
import sys

warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

pd.set_option('display.float_format', lambda x: '%.5f' % x)

np.random.seed(0)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


In [207]:
module_path = os.path.abspath(os.path.join('../src'))
print("Adding modules", module_path)
if module_path not in sys.path:
    sys.path.append(module_path)

Adding modules /Users/christopherlomeli/Source/courses/datascience/nfl_capstone/src


## 01.2 - setup

In [208]:
FILE_TO_CLEAN="gameplay_facts_cleaned_01.parquet"

AWS_S3_BUCKET = os.getenv('AWS_S3_BUCKET')
AWS_S3_PREFIX = os.getenv('AWS_S3_RAW_PREFIX')

RAW_DATA_PATH = '../data/raw'
INTERIM_DATA_PATH='../data/interim'

# inputs
GAME_PLAYS=os.path.join(INTERIM_DATA_PATH,FILE_TO_CLEAN)
TEAM_SCORES=os.path.join(RAW_DATA_PATH,"spreadspoke_scores.csv")
TEAM_NAMES=os.path.join(RAW_DATA_PATH,"nfl_teams.csv")


# output
CLEAN_FACTS_DF_NAME=os.path.join(INTERIM_DATA_PATH, "gameplay_facts_cleaned_02.parquet")
CLEAN_SCORES_DF_NAME=os.path.join(INTERIM_DATA_PATH, "nfl_scores.parquet")
READ_ME = os.path.join(INTERIM_DATA_PATH,"README.03-cjl-clean.txt")



In [209]:
from src.data.s3utils import download_from_s3

## 01.3 - check for supporting input files

In [210]:
download_from_s3(
    bucket=AWS_S3_BUCKET,
    prefix=AWS_S3_PREFIX,
    local_dir=os.path.abspath(RAW_DATA_PATH),
    wishlist=['spreadspoke_scores.csv', 'nfl_teams.csv']
)

Already exists:  /Users/christopherlomeli/Source/courses/datascience/nfl_capstone/data/raw/nfl_teams.csv
Already exists:  /Users/christopherlomeli/Source/courses/datascience/nfl_capstone/data/raw/spreadspoke_scores.csv


In [211]:
if not os.path.exists(GAME_PLAYS):
    raise FileNotFoundError(f"Can't find the input file {GAME_PLAYS}. Have you run the preceding notebooks?")

# 02 - Get scores data

## 02.1 - load spreadspoke scores

In [212]:
# load
scores_df = pd.read_csv(TEAM_SCORES, parse_dates=['schedule_date'])

# clean up column names and data that we'll join on later
scores_df.drop(columns=['team_favorite_id', 'spread_favorite', 'over_under_line', 'weather_detail'], inplace=True)
scores_df['team_away'] = scores_df['team_away'].str.strip()
scores_df['team_home'] = scores_df['team_home'].str.strip()
scores_df.rename(columns={
    'schedule_date': 'date',
    'schedule_season': 'season',
    'schedule_week': 'week',
    'team_home': 'home_team',
    'team_away': 'away_team'
}, inplace=True)

scores_df.dtypes

date                   datetime64[ns]
season                          int64
week                           object
schedule_playoff                 bool
home_team                      object
score_home                      int64
score_away                      int64
away_team                      object
stadium                        object
stadium_neutral                  bool
weather_temperature           float64
weather_wind_mph              float64
weather_humidity              float64
dtype: object

## 02.2 - load teams list

In [213]:
# load
team_df = pd.read_csv(TEAM_NAMES)

# clean up column names and data that we'll join on later
team_df['team_name'] = team_df['team_name'].str.strip()
team_df['team_id'] = team_df['team_id'].str.strip()
team_df.drop(columns=['team_name_short', 'team_id_pfr','team_conference', 'team_conference_pre2002', 'team_division', 'team_division_pre2002'], inplace=True)

team_df.dtypes

team_name    object
team_id      object
dtype: object

## 02.3 - merge team_ids into scores_df
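
As a small illustration of the lookup in this step, the sketch below left-merges a toy scores frame against a toy teams frame and uses pandas' `indicator=True` to flag names that found no match. The frames and the misspelled team name are made up for the example.

```python
import pandas as pd

# Toy stand-ins for the scores and teams data (values are illustrative only)
toy_scores = pd.DataFrame({
    "season": [2017, 2017, 2017],
    "home_team": ["Green Bay Packers", "Detroit Lions", "Detriot Lions"],  # last one is a deliberate typo
})
toy_teams = pd.DataFrame({
    "team_name": ["Green Bay Packers", "Detroit Lions"],
    "team_id": ["GB", "DET"],
})

# how='left' keeps every scores row; indicator=True adds a '_merge' column
merged = toy_scores.merge(
    toy_teams, left_on="home_team", right_on="team_name", how="left", indicator=True
)

# rows tagged 'left_only' had no matching team name
unmatched = merged.loc[merged["_merge"] == "left_only", "home_team"].tolist()
print(unmatched)
```

Listing the unmatched names, rather than only asserting that there are none, makes it easier to see which spelling needs fixing.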

In [214]:
def merge_team_id(scores_df: pd.DataFrame, team_df: pd.DataFrame, home_or_away_team: str) -> pd.DataFrame:
    """02.3 - merge team_ids into scores_df

    scores_df stores full team names, but joins with other data need the abbreviation.
    Use team_df to add the abbreviation to the scores and to validate that every team name is known.

    :param scores_df: spreadspoke scores df
    :param team_df: all team names and abbreviations
    :param home_or_away_team: name of the team column to look up, either 'home_team' or 'away_team'
    :return: a copy of scores_df with an added '<home_or_away_team>_id' column
    """

    id_name = home_or_away_team.strip() + "_id"
    # perform the merge; indicator=True adds a '_merge' column that flags unmatched rows
    df2 = scores_df.merge(team_df, left_on=home_or_away_team, right_on='team_name', how='left', indicator=True)

    # Jacksonville was JAC prior to 2013, but the gameplay dataset keeps JAC until 2016 - conform to 2016
    df2.loc[(df2.season < 2016) & (df2['team_id'] == 'JAX'), 'team_id'] = 'JAC'

    # team_id holds the abbreviation, so label it clearly as the home or away id
    df2.rename(columns={'team_id': id_name}, inplace=True)

    # quick validation: every team name in the scores must have matched a row in team_df
    unmatched = df2.loc[df2['_merge'] != 'both', home_or_away_team].unique().tolist()
    assert not unmatched, f"team names missing from team_df: {unmatched}"

    # ok, now drop the columns we no longer need and pass the merged scores_df back
    df2.drop(columns=['_merge', 'team_name'], inplace=True)
    return df2

In [215]:
print("add the team abbreviation to the scores data:  scores now has a home_team_id (abbreviation)")
scores_df = merge_team_id(scores_df=scores_df, team_df=team_df, home_or_away_team='home_team')
scores_df[['home_team', 'home_team_id']].head()

add the team abbreviation to the scores data:  scores now has a home_team_id (abbreviation)


|   | home_team | home_team_id |
|---|---|---|
| 0 | Miami Dolphins | MIA |
| 1 | Houston Oilers | TEN |
| 2 | San Diego Chargers | SD |
| 3 | Miami Dolphins | MIA |
| 4 | Green Bay Packers | GB |


In [216]:
print("scores now has an away_team_id (abbreviation)")
scores_df = merge_team_id(scores_df=scores_df, team_df=team_df, home_or_away_team='away_team')
scores_df[['away_team', 'away_team_id']].head()

scores now has an away_team_id (abbreviation)


|   | away_team | away_team_id |
|---|---|---|
| 0 | Oakland Raiders | OAK |
| 1 | Denver Broncos | DEN |
| 2 | Buffalo Bills | BUF |
| 3 | New York Jets | NYJ |
| 4 | Baltimore Colts | IND |


In [217]:
# validate that there are no null team abbreviations (ids)
assert scores_df.home_team_id.notna().all()
assert scores_df.away_team_id.notna().all()

## 02.4 - get gameplay data

In [218]:
# load the actual play by play data
gameplay_df = pd.read_parquet(GAME_PLAYS)
gameplay_df.dtypes

play_id                 float32
game_id                  object
old_game_id              object
home_team                object
away_team                object
                         ...   
special                 float32
play                    float32
out_of_bounds           float32
home_opening_kickoff    float32
admin_event               int64
Length: 156, dtype: object

# 03 - conform score and gameplay data for joins
Whether or not we end up joining, we still want to cross-check the gameplay data against the scores data.
For example, look at the Detroit Lions 2017 season: both datasets should have 16 games.

## 03.1 - check gameplay dataset Detroit Lions season

In [219]:
# grab the first season we find - we'll use it to match the two datasets
s = pd.Series(gameplay_df.season).unique().tolist()
validate_year = s[0]

## 03.2 - check that the scores_df and gameplay_df match for a random team in a season
TODO: automate this validation
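
One possible way to automate this check (a sketch only, not part of the notebook) is to reduce both datasets to one key per game and compare them as sets of `(date, home, away)` tuples across every season and team at once. The function name `find_game_mismatches` and the toy frames are hypothetical; the column names follow the ones used in this notebook.

```python
import pandas as pd


def find_game_mismatches(scores: pd.DataFrame, plays: pd.DataFrame) -> dict:
    """Compare the games in both frames and report keys found on only one side."""
    # one (date, home, away) tuple per game in the scores data
    score_keys = set(
        zip(pd.to_datetime(scores["date"]), scores["home_team_id"], scores["away_team_id"])
    )
    # the gameplay data has many rows per game, so a set collapses the duplicates
    play_keys = set(
        zip(pd.to_datetime(plays["game_date"]), plays["home_team"], plays["away_team"])
    )
    return {
        "only_in_scores": sorted(score_keys - play_keys),
        "only_in_gameplay": sorted(play_keys - score_keys),
    }


# Toy data (illustrative values only)
toy_scores = pd.DataFrame({
    "date": ["2017-09-10", "2017-09-18"],
    "home_team_id": ["DET", "NYG"],
    "away_team_id": ["ARI", "DET"],
})
toy_plays = pd.DataFrame({
    "game_date": ["2017-09-10", "2017-09-10", "2017-09-18"],
    "home_team": ["DET", "DET", "NYG"],
    "away_team": ["ARI", "ARI", "DET"],
})

result = find_game_mismatches(toy_scores, toy_plays)
print(result)
```

If the scores data covers seasons that the gameplay data does not, the scores frame would likely need to be filtered to the gameplay seasons first; otherwise every older game would show up under `only_in_scores`.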

In [250]:
check_columns = ['season', 'game_date', 'game_id','home_team', 'away_team']
gdf = gameplay_df.loc[
    (gameplay_df.season == validate_year) & ((gameplay_df.home_team=='DET') | (gameplay_df.away_team=='DET')) , check_columns]\
    .groupby(check_columns).count().sort_values(by='game_date').reset_index()

In [251]:
check_columns = ['season', 'date', 'home_team_id', 'away_team_id']
sdf = scores_df.loc[(scores_df.season==validate_year ) & ((scores_df.home_team=='Detroit Lions') | (scores_df.away_team=='Detroit Lions')), check_columns].sort_values(by='date')

In [None]:
# basic check for same size
assert sdf.shape[0] == gdf.shape[0]

In [261]:
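# check that every value in the scores column also appears in the gameplay column
def compare_col(sdf, scolumn, gdf, gcolumn):
    s = set(sdf[scolumn])
    g = set(gdf[gcolumn])
    print(f"Checking {scolumn} against {gcolumn} ::", end=" ")
    missing = s.difference(g)
    assert not missing, f"values in scores but not in gameplay: {missing}"
    print("OK")

# game_date is only converted to a real date in the cleanup cells further down, so coerce it here for the comparison
compare_col(sdf, 'date', gdf.assign(game_date=pd.to_datetime(gdf['game_date'])), 'game_date')
compare_col(sdf, 'home_team_id', gdf, 'home_team')
compare_col(sdf, 'away_team_id', gdf, 'away_team')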

Checking date against game_date :: OK
Checking home_team_id against home_team :: OK
Checking away_team_id against away_team :: OK


## 03.3 - iterate - check and fix any discrepancies
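
The Raiders fix in this section relies on the original `game_id` still carrying the historical abbreviation. The sketch below assumes the id has the form `season_week_away_home` (for example `2019_01_DEN_OAK`), which is inferred from the slicing used in this notebook rather than confirmed by it. `teams_from_game_id` and `restore_abbreviation` are hypothetical helpers, and the sample id is illustrative. Splitting on the underscore avoids depending on fixed character positions, which matters for two-letter abbreviations such as `GB`.

```python
def teams_from_game_id(game_id: str) -> tuple[str, str]:
    """Return (away, home) from an id assumed to look like '2019_01_DEN_OAK'."""
    parts = game_id.strip().split("_")
    if len(parts) != 4:
        raise ValueError(f"unexpected game_id format: {game_id!r}")
    _season, _week, away, home = parts
    return away, home


def restore_abbreviation(current: str, from_id: str) -> str:
    """Use the abbreviation embedded in the id when the column says 'LV' but the id says 'OAK'."""
    if current == "LV" and from_id == "OAK":
        return "OAK"
    return current


away, home = teams_from_game_id("2019_01_DEN_OAK")
print(restore_abbreviation("LV", home))   # OAK
print(restore_abbreviation("DEN", away))  # DEN
```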

In [222]:
gameplay_df.home_team = gameplay_df.home_team.str.strip()
gameplay_df.away_team = gameplay_df.away_team.str.strip()
gameplay_df.game_id = gameplay_df.game_id.str.strip()

In [224]:
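# the Raiders appear under their present-day abbreviation 'LV', but the original game_id
# still carries 'OAK' for the Oakland-era games, so restore 'OAK' for those rows
gameplay_df.loc[(gameplay_df.away_team == 'LV') & (gameplay_df.game_id.str[8:11] == 'OAK'), 'away_team'] = 'OAK'
gameplay_df.loc[(gameplay_df.home_team == 'LV') & (gameplay_df.game_id.str[-3:] == 'OAK'), 'home_team'] = 'OAK'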

In [225]:
# the gameplay data date is still an object, and that's good for concatenating later, but we also want a real date
gameplay_df['date_string'] = gameplay_df['game_date']
gameplay_df['game_date'] = pd.to_datetime(gameplay_df['game_date'])

In [226]:
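# rebuild game_id as yyyymmdd + home + away (lower case) so it can be matched against the scores data
gameplay_df['game_id'] = gameplay_df['date_string'].astype('string').str.replace("-", "") + gameplay_df.home_team.str.lower() + gameplay_df.away_team.str.lower()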

In [227]:
# build the same yyyymmdd + home + away key on the scores side
scores_df['game_id'] = scores_df['date'].astype('string').str.replace("-", "") + scores_df.home_team_id.str.lower() + scores_df.away_team_id.str.lower()
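
The key built in the two cells above is the game date with the dashes removed, followed by the lower-cased home and away abbreviations. A plain-Python sketch of the same recipe (the function name `make_game_id` is hypothetical, and the sample values are illustrative):

```python
from datetime import date


def make_game_id(game_date: date, home_team: str, away_team: str) -> str:
    """Build a join key such as '20170910detari' from a date and two abbreviations."""
    return f"{game_date:%Y%m%d}{home_team.strip().lower()}{away_team.strip().lower()}"


print(make_game_id(date(2017, 9, 10), "DET", "ARI"))  # 20170910detari
```

Because both sides are normalized the same way, any remaining difference in dates or abbreviations shows up as a missed join in the test merge.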

## 03.4 - test the join - iterate and clean until it's 100%
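
While iterating, it helps to see which games fail to join rather than only whether the join is complete. The sketch below uses toy frames with made-up values and the `_merge` indicator to list the distinct unmatched `game_id` values; the variable names are hypothetical.

```python
import pandas as pd

# Toy gameplay rows: two plays in one game, one play in a game with a mismatched key
toy_plays = pd.DataFrame({
    "game_id": ["20170910detari", "20170910detari", "20191103oakdet"],
    "play_id": [1, 2, 1],
})
# Toy scores rows: the second game is keyed with 'lv' instead of 'oak'
toy_scores = pd.DataFrame({
    "game_id": ["20170910detari", "20191103lvdet"],
    "score_home": [21, 17],  # made-up scores
    "score_away": [14, 10],
})

test = toy_plays.merge(toy_scores, on="game_id", how="left", indicator=True)

# distinct game_ids from the gameplay side that found no scores row
missing = sorted(test.loc[test["_merge"] != "both", "game_id"].unique())
print(missing)
```

Each id in the list points at a game whose date or abbreviations still need conforming on one side or the other.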

In [228]:
# try a test merge
test_df = gameplay_df.merge(scores_df, on='game_id', how='left', indicator=True)

In [229]:
# every gameplay row must have found a matching scores row
if (test_df['_merge'] == 'both').all():
    print("Yahoo - we have a complete join between scores and gameplay")
else:
    raise Exception("still not a complete join - needs more work!!!")

Yahoo - we have a complete join between scores and gameplay


# 05 - output gameplay and score data

In [232]:
gameplay_df.to_parquet(CLEAN_FACTS_DF_NAME, engine='fastparquet',  compression='snappy')
scores_df.to_parquet(CLEAN_SCORES_DF_NAME, engine='fastparquet',  compression='snappy')


In [236]:
if not os.path.exists(READ_ME):
    print("Writing readme to local path:",READ_ME)

    # 'file' must be a plain string; wrapping it in braces would store a one-element set instead
    pd.DataFrame([
        {'file': os.path.basename(CLEAN_FACTS_DF_NAME), 'desc': 'clean version of gameplay with just the core facts'},
        {'file': os.path.basename(CLEAN_SCORES_DF_NAME), 'desc': 'cleaned scores data, one row per game, with team abbreviations and a game_id for joining to gameplay'}]).to_csv(READ_ME, index=False)
else:
    print(READ_ME, " already exists")

Writing readme to local path: ../data/interim/README.03-cjl-clean.txt
