<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#NFL-PxP-Data" data-toc-modified-id="NFL-PxP-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>NFL PxP Data</a></span></li><li><span><a href="#Processing-the-PxP-data" data-toc-modified-id="Processing-the-PxP-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Processing the PxP data</a></span></li><li><span><a href="#Save-DataFrame" data-toc-modified-id="Save-DataFrame-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Save DataFrame</a></span></li></ul></div>

In [None]:
import numpy as np
import pandas as pd

# Handling NFL Play-by-Play

This notebook is an optional companion to the homework and shows how the play-by-play data was handled to extract the necessary components for modeling 4th downs.

## NFL PxP Data



NFL play-by-play (PxP) data is loaded from a CSV file.  The original data came as larger files, one per year, containing all of the fields.  The dataset was pared down before this notebook to save space.

Below is a list of the available columns we will use.  Many are self-explanatory, so a description is given only where needed.  Note that there are many more fields available that we will not use.
+ GameID
+ Drive - index giving the number of the drive within the game
+ qtr
+ down
+ yrdline100 - the yard line expressed on a scale of 1 to 99 instead of 1 to 50 and back to 1.
+ ydstogo - yards to go for a first down
+ Yards.Gained - yards gained on the play
+ posteam - possessing team
+ DefensiveTeam - defensive team
+ ~~desc - play description~~  <-- Had to drop this due to memory problems
+ PlayType - label for what type of play
+ Touchdown - 0 or 1, indicating whether a TD was scored
+ FieldGoalResult - label indicating good, blocked, or no good.
+ FieldGoalDistance
+ PosTeamScore - Score of the possessing team.  This will flip when the possession flips.
+ DefTeamScore - Score of the defensive team.  This will flip when the possession flips.
+ HomeTeam
+ AwayTeam


A few convenience fields are added to ease computation of possession value.

+ half
+ yrdregion - region of the field: Inside the 10, 10 to 20, and beyond 20.
+ HomeScore & AwayScore - The scores of the home and away teams.  The raw data gives only the scores of the possessing and defensive teams, which swap whenever the ball changes possession
+ nextposteam - The team possessing the ball in the next play. Non-plays are ignored
+ nextyrdline100 - Where the ball is on the next play. Non-plays are ignored
+ nextdown - The down for the next play
+ 1stdownconversion - Whether the current play converted a first down (0 or 1 value)
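
To make the `half` and `yrdregion` definitions concrete, here is a minimal sketch of the binning on a handful of made-up values (the `toy` frame is purely illustrative and not part of the dataset).  It assumes the same bin edges used later in this notebook.

```python
import pandas as pd

# Made-up values chosen to sit on the bin edges; not taken from the real data.
toy = pd.DataFrame({
    'qtr': [1, 2, 3, 4, 5],
    'yrdline100': [5, 9, 10, 20, 21],
})

# Quarters 1-2 -> half 1, quarters 3-4 -> half 2, overtime (qtr 5) -> half 3.
# include_lowest=True keeps qtr == 1 inside the first bin.
toy['half'] = pd.cut(
    toy['qtr'], [1, 2, 4, 5], labels=[1, 2, 3], include_lowest=True)

# Bins are closed on the right: (0, 9], (9, 20], (20, 100].
toy['yrdregion'] = pd.cut(
    toy['yrdline100'], [0., 9., 20., 100.],
    labels=['Inside10', '10to20', 'Beyond20'])

print(toy)
```

Because the bins are closed on the right, the 20 yard line itself falls in `10to20` and the 9 yard line in `Inside10`.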

In [None]:
pxp = pd.read_csv('data/nfl_pxp_2009_2016_reduced.csv.gz')
pxp.head(10)

## Processing the PxP data

The PxP data doesn't have everything we want, so we compute the missing pieces:
1. We use `pd.cut`, which cuts the values of an array into bins.  An easy example of this is how we cut `qtr` to get `half`.  Another example is how we cut `yrdline100` to get regions on the field.
2. We need to compute the home team score and the away team score.  Using `posteam`, `DefensiveTeam`, `PosTeamScore`, `DefTeamScore`, `HomeTeam`, and `AwayTeam`, we can piece these values together with some simple logic.  For some reason certain events leave null values in the dataset, so we fill those values in.  We forward fill the score values since none of these events leads to a score change.
3. We compute `nextposteam`, `nextyrdline100`, `nextdown`, and `1stdownconversion`.  This basically tracks the result of the current play to find where things stand at the end of the play.  This is a rather involved part that is not possible with the `datascience` package.  For details, see the code comments.
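
The "simple logic" in step 2 can be sketched on a few hypothetical rows.  The team names and scores below are invented for illustration, and the middle row mimics one of the events that leaves nulls behind.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: DEN hosts OAK.  The middle row has nulls, as some events do.
toy = pd.DataFrame({
    'HomeTeam':      ['DEN', 'DEN', 'DEN'],
    'AwayTeam':      ['OAK', 'OAK', 'OAK'],
    'posteam':       ['DEN', None, 'OAK'],
    'DefensiveTeam': ['OAK', None, 'DEN'],
    'PosTeamScore':  [7.0, np.nan, 3.0],
    'DefTeamScore':  [3.0, np.nan, 7.0],
})

# 0/1 indicators for which side (possessing or defending) is home or away.
pos_is_home = np.where(toy['HomeTeam'] == toy['posteam'], 1, 0)
def_is_home = np.where(toy['HomeTeam'] == toy['DefensiveTeam'], 1, 0)
pos_is_away = np.where(toy['AwayTeam'] == toy['posteam'], 1, 0)
def_is_away = np.where(toy['AwayTeam'] == toy['DefensiveTeam'], 1, 0)

# Exactly one indicator per pair is 1 on a normal row, so the sum selects
# the right score.  On the null row the result is NaN, which is forward filled.
toy['HomeScore'] = (pos_is_home * toy['PosTeamScore']
                    + def_is_home * toy['DefTeamScore']).ffill()
toy['AwayScore'] = (pos_is_away * toy['PosTeamScore']
                    + def_is_away * toy['DefTeamScore']).ffill()

print(toy[['posteam', 'PosTeamScore', 'DefTeamScore', 'HomeScore', 'AwayScore']])
```

In this toy frame `PosTeamScore` flips from 7 to 3 when possession changes, while `HomeScore` and `AwayScore` stay put.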

In [None]:
def process_pxp(pxp):
    # Step 1
    # Compute half.  For OT, use half = 3.
    pxp['half'] = pd.cut(
        pxp['qtr'], [1, 2, 4, 5], labels=[1, 2, 3], include_lowest=True)

    # Compute field region
    pxp['yrdregion'] = pd.cut(
        pxp['yrdline100'], [0., 9., 20., 100.], labels=['Inside10', '10to20', 'Beyond20'])
    
    # Step 2
    # Convert PosTeamScore/DefTeamScore to HomeScore/AwayScore
    pos_is_home = np.where(pxp['HomeTeam'] == pxp['posteam'], 1, 0)
    def_is_home = np.where(pxp['HomeTeam'] == pxp['DefensiveTeam'], 1, 0)
    pos_is_away = np.where(pxp['AwayTeam'] == pxp['posteam'], 1, 0)
    def_is_away = np.where(pxp['AwayTeam'] == pxp['DefensiveTeam'], 1, 0)

    pxp['HomeScore'] = pos_is_home * pxp['PosTeamScore'] + \
        def_is_home * pxp['DefTeamScore']
    # Forward fill nulls; the events that leave them do not change the score
    pxp['HomeScore'] = pxp['HomeScore'].ffill()
    pxp['AwayScore'] = pos_is_away * pxp['PosTeamScore'] + \
        def_is_away * pxp['DefTeamScore']
    pxp['AwayScore'] = pxp['AwayScore'].ffill()

    # Step 3
    # Compute nextposteam, nextyrdline100, nextdown, and 1stdownconversion.  
    # Must be within a game and a half since possession doesn't carry between 
    # halves and obviously not between games.
    # Non-relevant plays are ignored so that computations aren't mangled.

    # Group by game halves
    game_halves = pxp.groupby(['GameID', 'half'])
    
    for col in ['posteam', 'yrdline100', 'down']:
        next_col = 'next' + col
        # Within a game half, shift the values for the column backward so that 
        # the n+1st entry becomes the nth entry
        next_col_vals = game_halves[col].shift(-1)
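        # Back-fill missing values (e.g. from plays we want to ignore) with
        # the next available value.  Note that this fill runs over the whole
        # column rather than within each game half.
        next_col_vals = next_col_vals.bfill()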
        # add the new column to the table
        pxp[next_col] = next_col_vals

    mask = (pxp['nextdown'] == 1.) | (pxp['ydstogo'] <= pxp['Yards.Gained']) | \
        (pxp['Touchdown'] == 1)
    pxp['1stdownconversion'] = np.where(mask, 1, 0)
    
    # Cleanup NaNs
    for c in pxp.select_dtypes(include=['object']).columns:
        pxp[c] = pxp[c].fillna('None')
    
    # Clean index
    pxp.reset_index(drop=True, inplace=True)

    return pxp

In [None]:
pxp = process_pxp(pxp)
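
For readers who want to see the core of step 3 in isolation, the following is a minimal sketch of the grouped `shift(-1)` and the conversion mask on invented plays.  It deliberately leaves out the back-fill so that the end of each half stays visible as `NaN`, so it is not a drop-in replacement for `process_pxp`.

```python
import numpy as np
import pandas as pd

# Invented plays: one game, three plays in the first half, two in the second.
toy = pd.DataFrame({
    'GameID':       [1, 1, 1, 1, 1],
    'half':         [1, 1, 1, 2, 2],
    'down':         [3, 4, 1, 1, 2],
    'ydstogo':      [2, 1, 10, 10, 6],
    'Yards.Gained': [1, 3, 0, 4, 2],
    'Touchdown':    [0, 0, 0, 0, 0],
})

# shift(-1) inside each (GameID, half) group pulls the following play's down
# onto the current row.  The last play of each group has no successor -> NaN.
toy['nextdown'] = toy.groupby(['GameID', 'half'])['down'].shift(-1)

# A play converts if the next play is a 1st down, if it gained at least the
# yards to go, or if it scored a touchdown.
mask = (toy['nextdown'] == 1.) | (toy['ydstogo'] <= toy['Yards.Gained']) | \
    (toy['Touchdown'] == 1)
toy['1stdownconversion'] = np.where(mask, 1, 0)

print(toy)
```

Only the 4th-and-1 play that gains 3 yards is flagged as a conversion here; comparisons against `NaN` evaluate to `False`, so the last play of each half is not flagged by the `nextdown` test.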

In [None]:
pxp.info()

## Save DataFrame

In [None]:
pxp.to_csv('data/nfl_pxp_2009_2016.csv.gz', compression='gzip', index=False)