In [1]:
import numpy as np
import pandas as pd
%matplotlib inline

## NFL PxP Data

NFL play-by-play data is loaded from CSV format.  

Below is a list of the available columns we will use.  Many are self-explanatory, so a description is given only where needed.  Note that many more fields are available that we will not use.  Also note that the original NFL play-by-play data has been manipulated to provide some of these columns, so this is not exactly what you get in raw form.
+ GameID
+ Drive - index of the drive within the game
+ Quarter
+ Half
+ Down
+ Yardline100 - the yard line expressed on a scale of 1 to 99 instead of 1 to 50 and back to 1.
+ YardstoGo - yards to go for a first down
+ Yards.Gained - yards gained on the play
+ YrdRegion - region of the field: Inside the 10, 10 to 20, and beyond 20.
+ PossessionTeam - possessing team
+ DefensiveTeam - defensive team
+ ~~desc - play description~~  <-- Had to drop this due to memory problems
+ PlayType - label for what type of play
+ Touchdown - 0,1 indicating if a TD was scored
+ FieldGoalResult - label indicating good, blocked, or no good.
+ FieldGoalDistance
+ HomeScore & AwayScore - Scores of the home and away teams.  Unlike PosTeamScore and DefTeamScore, these do not flip when the ball changes possession.
+ PosTeamScore - Score of the possessing team.  This will flip when the possession flips.
+ DefTeamScore - Score of the defensive team.  This will flip when the possession flips.
+ HomeTeam
+ AwayTeam
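
Since memory was tight enough to force dropping `desc`, one option is to read only the columns that are needed. The sketch below uses a tiny in-memory CSV as a stand-in for the real file; the column subset and the `category` dtype are illustrative assumptions, not the settings used in this notebook.

```python
import io
import pandas as pd

# Toy stand-in for the real file (hypothetical rows; the real file has many more columns).
csv_text = """GameID,Drive,Down,desc,PlayType
2009091000,1,,kick description,Kickoff
2009091000,1,1,run description,Run
"""

# Hypothetical subset of the columns listed above.
wanted = ['GameID', 'Drive', 'Down', 'PlayType']

# usecols skips unwanted columns at parse time; a category dtype
# keeps repeated string labels compact in memory.
small = pd.read_csv(
    io.StringIO(csv_text),
    usecols=wanted,
    dtype={'PlayType': 'category'},
)
```

With the real file, the same `usecols` and `dtype` arguments would be passed alongside the file path.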

In [2]:
pxp = pd.read_csv('nfl_pxp_2009_2016.csv.gz')



In [3]:
pxp.head()

Unnamed: 0,GameID,Drive,Quarter,Down,Time,TimeUnder,TimeSecs,PlayTimeDiff,SideofField,Yardline,...,HomeScore,AwayScore,PosScore.onplay,DefScore.onplay,HomeScore.onplay,AwayScore.onplay,NextPossessionTeam,NextYardline100,NextDown,1stDownConversion
0,2009091000,1,1,0,15:00,15,3600.0,0.0,TEN,30.0,...,0.0,0.0,0,0,0,0,PIT,58.0,1.0,1
1,2009091000,1,1,1,14:53,15,3593.0,7.0,PIT,42.0,...,0.0,0.0,0,0,0,0,PIT,53.0,2.0,0
2,2009091000,1,1,2,14:16,15,3556.0,37.0,PIT,47.0,...,0.0,0.0,0,0,0,0,PIT,56.0,3.0,0
3,2009091000,1,1,3,13:35,14,3515.0,41.0,PIT,44.0,...,0.0,0.0,0,0,0,0,PIT,56.0,4.0,0
4,2009091000,1,1,4,13:27,14,3507.0,8.0,PIT,44.0,...,0.0,0.0,0,0,0,0,TEN,98.0,1.0,1


### Extract Kickoffs and Possessions

We need to extract kickoffs and possession starts in order to build a possession value calculator.  To do that, we follow this process:
1. **Extract kickoffs by using PlayType.**
2. **Extract possessions**:
    + Drive Starts: Drill down by GameID and Drive # (ignoring kickoffs) and take the first play of the drive.
    + First Downs: Drill down by GameID and Drive # and take all 1st and 10 or 1st and Goal to Go plays.
3. **Find the next score in the game for each possession.**  This is the hardest computation.  We compute the change in the home and away scores from one state to the next and then fill those changes backward.  We treat home scores as positive and away scores as negative.  We only consider possession value within a half, so if there is no score before halftime or the end of the game, the value is 0.
4. **Compute possession value.**  We multiply the next score value by +1 or -1 depending on whether the current possessing team is the home team or the away team.  If it's the home team, multiply by +1 because the next score is already oriented to the home team.  If it's the away team, multiply by -1 because a positive next score is a negative for the away team.
5. **Restrict possessions to the first and third quarter.**  We want to avoid end of half/game effects like settling for points at the end of the first half or playing to win at the end of the game.
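
To make steps 3 and 4 concrete, here is a minimal sketch on a toy half with five states. The scores are made up, the per-half backfill is a stand-in for the backfill step, and the sign is simplified to +1/-1 on the assumption that the possessing team is always either the home or the away team.

```python
import numpy as np
import pandas as pd

# Toy states for one half of one game (hypothetical values), already in game order.
toy = pd.DataFrame({
    'GameID': [1, 1, 1, 1, 1],
    'Half': [1, 1, 1, 1, 1],
    'HomeTeam': ['PIT'] * 5,
    'AwayTeam': ['TEN'] * 5,
    'PossessionTeam': ['PIT', 'TEN', 'PIT', 'TEN', 'PIT'],
    'HomeScore': [0, 0, 0, 7, 7],
    'AwayScore': [0, 0, 3, 3, 3],
})

# Step 3: score at the next state within the same half (NaN at the end of the half).
halves = toy.groupby(['GameID', 'Half'])
next_home = halves['HomeScore'].shift(-1)
next_away = halves['AwayScore'].shift(-1)

# Home scores count as positive, away scores as negative.
score_change = (next_home - toy['HomeScore']) - (next_away - toy['AwayScore'])

# Hide the zeros, backfill within each half, and use 0 when no score follows.
next_score = (
    score_change.where(score_change != 0)
    .groupby([toy['GameID'], toy['Half']])
    .bfill()
    .fillna(0)
)

# Step 4: orient the next score to the possessing team.
sign = np.where(toy['PossessionTeam'] == toy['HomeTeam'], 1, -1)
toy['NextScore'] = next_score
toy['PossessionValue'] = next_score * sign
```

In this toy half, the away field goal is the next score for the first two states, the home touchdown is the next score for the third, and the last two states have no later score, so their value is 0.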

Some caveats:
+ The dataset is not perfect, so while we expect this procedure to work the vast majority of the time, it may miss some results because of holes in the data.  This is unlikely to affect the analysis much.
+ While we restricted to the first and third quarters, we did not exclude blowouts.  Competitive games lead to more reliable results, so this is probably the first issue to address going forward.
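
As a rough sketch of how the blowout caveat might be addressed later, the states could be filtered on the absolute score difference. This assumes a column like the `AbsScoreDiff` produced by the extraction function; the cutoff `MAX_SCORE_DIFF` is a hypothetical value chosen only for illustration.

```python
import pandas as pd

# Toy states (hypothetical values) with the absolute score difference at each state.
toy_states = pd.DataFrame({
    'AbsScoreDiff': [0, 7, 17, 24],
    'PossessionValue': [7, -3, 0, 7],
})

# Hypothetical cutoff for what counts as a competitive game state.
MAX_SCORE_DIFF = 16

# Keep only states where the game is still within the cutoff.
competitive = toy_states.loc[toy_states['AbsScoreDiff'] <= MAX_SCORE_DIFF]
```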

In [4]:
def extract_states(pxp, get_first_downs=False):
    # Step 1: extract kickoffs
    ko_mask = pxp['PlayType'] == 'Kickoff'
    kickoffs = pxp.loc[ko_mask].copy()
    
    # Step 2: extract end of halves
    half_ends = pxp.groupby(['GameID', 'Half']).tail(1).copy()

    # Step 3: extract relevant states
    if not get_first_downs:
        # drives must start on first down
        valid_drive_starts = (pxp.Down == 1)
        # group by GameID and Drive and take the first play (dataset is already sorted)
        possessions = pxp.loc[valid_drive_starts].\
            groupby(['GameID', 'Drive']).head(1).copy()
        
    else:
        # 1st and 10 as well as 1st and goal to go
        first_down_mask = (pxp.Down == 1) & \
            ((pxp.YardstoGo == 10) | ((pxp.GoalToGo == 1.) & (pxp.YardstoGo <= 10.)))
        possessions = pxp.loc[first_down_mask].copy()
    
    # Step 4:
    # Boolean value to denote a half end
    possessions['HalfEnd'] = 0
    kickoffs['HalfEnd'] = 0
    half_ends['HalfEnd'] = 1

    # Step 5:
    # Concatenate kickoffs and drive starts, sort, and reindex
    states = pd.concat([kickoffs, possessions, half_ends])
    states.sort_values(
        ['GameID', 'Drive', 'TimeSecs'],
        ascending=[True, True, False],
        na_position='first', 
        inplace=True
    )
    states.reset_index(drop=True, inplace=True)
    
    # Step 6:
    # Group into halves
    game_halves = states.groupby(['GameID', 'Half'])

    # Step 7: Find the next score
    # Shift the Home and Away scores back 1 position
    states['HomeScore.nextstate'] = game_halves['HomeScore'].shift(-1)
    states['AwayScore.nextstate'] = game_halves['AwayScore'].shift(-1)
    
    # Step 8:
    # Group by game halves again (now with *Score.nextstate)
    game_halves = states.groupby(['GameID', 'Half'])
    
    # Step 9:
    # Compute changes in the scores.  + for Home and - for Away.
    score_change = (states['HomeScore.nextstate'] - states['HomeScore']) - \
        (states['AwayScore.nextstate'] - states['AwayScore'])
    # Mask the zeros and backfill within each half so that each possession
    # gets the next score in that half (0 if there is none)
    next_score = score_change.where(score_change != 0).\
        groupby([states['GameID'], states['Half']]).bfill().fillna(0)
    states['NextScore'] = next_score

    # Step 10: Compute possession value
    # Determine if the possessing team is home or away
    posteam = states['PossessionTeam']
    hometeam = states['HomeTeam']
    awayteam = states['AwayTeam']
    posteam_is_home = np.where(posteam == hometeam, 1, 0)
    posteam_is_away = np.where(posteam == awayteam, 1, 0)
    # NextScore is unchanged if posteam == hometeam and negated if posteam == awayteam
    states['PossessionValue'] = states['NextScore'] * \
        (posteam_is_home - posteam_is_away)

    # Step 11: Restrict to first and third quarters
    first_and_third_qtr = (states['Quarter'] == 1) | (states['Quarter'] == 3)
    states = states.loc[first_and_third_qtr].copy()

    # Step 12: Finalize
    # Compute some extra values
    states['AbsScoreDiff'] = np.abs(states['HomeScore'] - states['AwayScore']) 
    states['PossessionType'] = np.where(states['PlayType'] == 'Kickoff', 'Kickoff', 'FirstDown')
    
    cols = ['GameID', 'Drive', 'Half', 'Quarter', 'Yardline100', 'YrdRegion',
            'HomeTeam', 'AwayTeam',  'PossessionType', 'PossessionTeam',
            'AbsScoreDiff', 'NextScore', 'PossessionValue']
    # Drop the half ends, subset the columns
    states = states.loc[(states['HalfEnd'] != 1), cols].copy()
    states.reset_index(drop=True, inplace=True)
    return states

## Extract just Drive Starts

In [5]:
states_drive_starts = extract_states(pxp)
states_drive_starts.to_csv(
    'nfl_drive_starts_2009_2016.csv.gz', compression='gzip', index=False)
states_drive_starts.head(10)

Unnamed: 0,GameID,Drive,Half,Quarter,Yardline100,YrdRegion,HomeTeam,AwayTeam,PossessionType,PossessionTeam,AbsScoreDiff,NextScore,PossessionValue
0,2009091000,1,1,1,30.0,Beyond20,PIT,TEN,Kickoff,PIT,0.0,7.0,7.0
1,2009091000,1,1,1,58.0,Beyond20,PIT,TEN,FirstDown,PIT,0.0,7.0,7.0
2,2009091000,2,1,1,98.0,Beyond20,PIT,TEN,FirstDown,TEN,0.0,7.0,-7.0
3,2009091000,3,1,1,43.0,Beyond20,PIT,TEN,FirstDown,PIT,0.0,7.0,7.0
4,2009091000,4,1,1,89.0,Beyond20,PIT,TEN,FirstDown,TEN,0.0,7.0,-7.0
5,2009091000,5,1,1,73.0,Beyond20,PIT,TEN,FirstDown,PIT,0.0,7.0,7.0
6,2009091000,6,1,1,74.0,Beyond20,PIT,TEN,FirstDown,TEN,0.0,7.0,-7.0
7,2009091000,7,1,1,79.0,Beyond20,PIT,TEN,FirstDown,PIT,0.0,7.0,7.0
8,2009091000,8,1,1,44.0,Beyond20,PIT,TEN,FirstDown,TEN,0.0,7.0,-7.0
9,2009091000,14,2,3,30.0,Beyond20,PIT,TEN,Kickoff,TEN,0.0,-3.0,3.0


## Extract First Downs

In [6]:
states_first_downs = extract_states(pxp, get_first_downs=True)
states_first_downs.to_csv(
    'nfl_first_downs_2009_2016.csv.gz', compression='gzip', index=False)
states_first_downs.head(10)

Unnamed: 0,GameID,Drive,Half,Quarter,Yardline100,YrdRegion,HomeTeam,AwayTeam,PossessionType,PossessionTeam,AbsScoreDiff,NextScore,PossessionValue
0,2009091000,1,1,1,30.0,Beyond20,PIT,TEN,Kickoff,PIT,0.0,7.0,7.0
1,2009091000,1,1,1,58.0,Beyond20,PIT,TEN,FirstDown,PIT,0.0,7.0,7.0
2,2009091000,2,1,1,98.0,Beyond20,PIT,TEN,FirstDown,TEN,0.0,7.0,-7.0
3,2009091000,3,1,1,43.0,Beyond20,PIT,TEN,FirstDown,PIT,0.0,7.0,7.0
4,2009091000,3,1,1,30.0,Beyond20,PIT,TEN,FirstDown,PIT,0.0,7.0,7.0
5,2009091000,4,1,1,89.0,Beyond20,PIT,TEN,FirstDown,TEN,0.0,7.0,-7.0
6,2009091000,4,1,1,42.0,Beyond20,PIT,TEN,FirstDown,TEN,0.0,7.0,-7.0
7,2009091000,4,1,1,22.0,Beyond20,PIT,TEN,FirstDown,TEN,0.0,7.0,-7.0
8,2009091000,5,1,1,73.0,Beyond20,PIT,TEN,FirstDown,PIT,0.0,7.0,7.0
9,2009091000,6,1,1,74.0,Beyond20,PIT,TEN,FirstDown,TEN,0.0,7.0,-7.0
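
One simple way to sanity-check the extracted states is to average `PossessionValue` by possession type and field region. The sketch below runs on a few made-up rows rather than the real output, and the `Inside10` label is an assumed spelling for the "Inside the 10" region.

```python
import pandas as pd

# Toy rows shaped like the extracted states (hypothetical values).
toy_states = pd.DataFrame({
    'PossessionType': ['Kickoff', 'FirstDown', 'FirstDown', 'FirstDown'],
    'YrdRegion': ['Beyond20', 'Beyond20', 'Beyond20', 'Inside10'],
    'PossessionValue': [7.0, -7.0, 3.0, 7.0],
})

# Mean possession value and sample size for each (type, region) pair.
summary = (
    toy_states.groupby(['PossessionType', 'YrdRegion'])['PossessionValue']
    .agg(['mean', 'count'])
)
```

Keeping the count next to the mean makes it easier to spot groups that are too small to trust.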
