# Expected Points in Football

This demo covers building a preliminary Expected Points model for football.  The model gives the expected value of having possession of the ball, 1st and 10, at a particular yard line.  For example, what's the expected next score given a drive starting at our own 42 yard line?  Or starting at our own 1 yard line?  Or at the opponent's 1 yard line?  The interpretation is the expected value of the next score in the game, and hence how much the possession is worth.

We first define _Possession Value_ as the value of the next score in the game to the possessing team.  For example, if the 49ers hold the ball 1st and 10 at the 42 yardline and the next score in the game occurs 4 drives later (after the 49ers and Raiders trade punts) and is a field goal by the 49ers, then the _Possession Value_ when they held the ball 1st and 10 at the 42 yardline was +3 points.

We average the possession value by yardline to get the expected points:
$$
    \text{Expected Points at Yardline $X$} = \frac{1}{\text{# of Drives Starting at Yardline $X$}} \sum_i \text{Possession Value of Drive $i$ Starting at Yardline $X$}
$$
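
As a minimal sketch of the formula above, the average can be computed with a group-by on a small table. The numbers here are made up for illustration and are not taken from the NFL data; only the column names `Yardline100` and `PossessionValue` match the data used later.

```python
import pandas as pd

# Toy data (made up for illustration): five possessions at two yardlines.
toy = pd.DataFrame({
    "Yardline100": [75, 75, 75, 20, 20],
    "PossessionValue": [7, -3, 0, 7, 3],
})

# Expected points at yardline X = mean possession value of the
# possessions that started at yardline X.
toy_epv = toy.groupby("Yardline100")["PossessionValue"].mean()
print(toy_epv)
```

Each row of the result is one yardline, and the value is the plain average of the possession values observed there.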


We will use kickoffs together with either drive starts or first downs, all extracted from NFL play-by-play data.  We restrict the analysis to the first and third quarters only, since play in the second and fourth quarters is distorted by the clock and by the score.
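
A restriction like this could be applied with a simple filter. The sketch below uses made-up rows and assumes a `Quarter` column like the one in the possession data; it is an illustration of the idea, not the preprocessing that produced the files.

```python
import pandas as pd

# Toy possessions (made up for illustration).
toy = pd.DataFrame({
    "Quarter": [1, 2, 3, 4, 3],
    "PossessionValue": [3, 7, -7, 0, 7],
})

# Keep only possessions from the first and third quarters.
q1_q3 = toy.loc[toy["Quarter"].isin([1, 3])]
print(q1_q3)
```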

In [None]:
%run ../../utils/notebook_setup.py

In [None]:
import numpy as np
import pandas as pd

## 1. NFL PxP Data

What does play-by-play data even look like for the NFL?  We will be using a processed form of the data but here is a view of the granularity provided by the play-by-play data. This isn't even a complete view but rather a subset of what's available.

In [None]:
pxp = pd.read_csv('nfl_pxp_2009_2016.csv.gz')
pxp.head(10)

In [None]:
pxp.iloc[1]

## 2. NFL Possession Data: Drive Starts

NFL possession data is loaded from csv format.  We first load data that contains all kickoffs and drive starts.  The two columns we'll chiefly be interested in are `Yardline100` and `PossessionValue`.  The rest of the columns are:

+ GameID
+ Drive - index of the drive within the game
+ Quarter
+ Half
+ Down
+ Yardline100 - the distance in yards to the opponent's goal line, on a scale of 1 to 99 instead of 1 to 50 and back to 1 (for example, a team's own 40 yardline is 60).
+ YrdRegion - region of the field: Inside the 10, 10 to 20, and beyond 20.
+ PossessionType - either a first down or a kickoff
+ PossessionTeam - possessing team
+ AbsScoreDiff
+ NextScore - Next score in the game (+ for home team, - for away team)
+ PossessionValue - value of possession (+ if NextScore favors possessing team)
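
To make the sign conventions of `NextScore` and `PossessionValue` concrete, here is a minimal sketch of how one could be derived from the other. The rows are made up, and `HomeTeam` is a hypothetical column that is not in the list above; it stands in for whatever identifies the home team.

```python
import numpy as np
import pandas as pd

# Toy rows (made up). `HomeTeam` is a hypothetical column used only
# to decide whether the possessing team is the home team.
toy = pd.DataFrame({
    "PossessionTeam": ["SF", "OAK", "SF"],
    "HomeTeam": ["SF", "SF", "SF"],
    "NextScore": [3, 3, -7],
})

# NextScore is + for the home team, so keep its sign when the home team
# has the ball and flip it when the away team has the ball.
sign = np.where(toy["PossessionTeam"] == toy["HomeTeam"], 1, -1)
toy["PossessionValue"] = sign * toy["NextScore"]
print(toy)
```

In the second row the away team has the ball and the next score is a home field goal, so the possession is worth -3 to the possessing team.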

In [None]:
states_drive_starts = pd.read_csv('nfl_drive_starts_2009_2016.csv.gz')

In [None]:
states_drive_starts.head(10)

### What is the Expected Points value for a Kickoff?

The first thing we can ask is what is the expected value of receiving a kickoff?  What is the expected value of the next score?

To do this, we do two things: get all the kickoffs and compute the average possession value.

In [None]:
# Restrict to kickoff events
ko = states_drive_starts['PossessionType'] == 'Kickoff'

In [None]:
# Compute the average kickoff value
ekv = states_drive_starts.loc[ko, 'PossessionValue'].mean()
print(f"Expected Kickoff Value: {ekv:.3f} pts")

_Questions_

1. If the kickoff has an expected value of about 0.6 points to the receiving team, then how much is a touchdown or field goal really worth in expectation?
2. How might this affect your decision making as a coach?  If you're deciding between trying to score a field goal or punting the ball away, would it matter if a field goal was not worth 3 points but rather worth something less?

### Expected Points for Drive Starts

We can group by each yardline where a drive started and compute the expected points.  This is straightforward: we group by `Yardline100` and compute the mean of `PossessionValue`.  We can plot the result too.

In [None]:
drive_starts = states_drive_starts.loc[~ko]

# Mean possession value at each starting yardline
raw_drive_start_epv = (
    drive_starts
    .groupby('Yardline100')['PossessionValue']
    .mean()
)

raw_drive_start_epv.plot();

_Questions_

1. Expected points generally increases as we get closer to the goal (Yardline100 near 0).  Does this confirm your own intuitions about scoring in football?
2. About where is the breakeven point, i.e., the point where the possession is worth 0 points and thus even between the possessing team and the defending team?
3. Why is the line jagged?  In a lot of cases, it seems to tell us that if we move a yard closer, we'll have a lower expected points value.  Why do you think that's wrong?

#### How often do possessions even start at some yardlines?

We just computed the average possession value for drives starting at each yardline.  Not many drives start at the opponent's 1 yardline, given that it requires a turnover or a big return to start there.  A worthwhile thing to ask is, how often _do_ we start at these yardlines?

In some cases, not often.

In [None]:
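# Half-integer edges give each yardline from 1 to 99 its own bin
# (integer edges 0..99 would merge yardlines 98 and 99 into the last bin)
bins = np.arange(0.5, 100.5)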
drive_starts.hist(column='Yardline100', bins=bins);

In [None]:
drive_starts.groupby('Yardline100')['PossessionValue'].count().head(20)
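
The counts above can also be placed next to the means, so that each yardline's average is read alongside how many possessions produced it. The sketch below uses made-up rows; the standard error of the mean (`sem`) is added here as one possible summary of uncertainty and is not part of the original analysis.

```python
import pandas as pd

# Toy data (made up): few possessions at yardline 1, more at yardline 80.
toy = pd.DataFrame({
    "Yardline100": [1, 1, 80, 80, 80, 80],
    "PossessionValue": [7, 3, 7, 0, -3, 0],
})

# Mean, sample size, and standard error of the mean per yardline.
summary = (
    toy
    .groupby("Yardline100")["PossessionValue"]
    .agg(["mean", "count", "sem"])
)
print(summary)
```

A yardline with a small `count` and a large `sem` is one where the raw average should be read with caution.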

_Question_

1. If barely any possessions start inside the opponent's 10 yardline, should we trust our model for expected points based on drive starts?

## 3. NFL Possession Data Part 2: First Downs

Similar to the drive start data, we can also consider all first downs (regular 1st and 10 and 1st and Goal plays, not 1st and 5 due to a penalty).  We now load data that contains all kickoffs and first downs.  See the column list in the drive start section above for more information about the columns.
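
As a rough sketch of what "regular" first downs means, the filter below keeps 1st and 10 and 1st and Goal plays and drops the rest. The rows are made up, and `ToGo` (yards needed for a first down) is a hypothetical column name; the actual extraction from the play-by-play data may have been done differently.

```python
import pandas as pd

# Toy plays (made up). `ToGo` is a hypothetical column name.
plays = pd.DataFrame({
    "Down": [1, 1, 1, 2, 1],
    "ToGo": [10, 5, 4, 10, 15],
    "Yardline100": [60, 45, 4, 60, 70],
})

# Regular 1st and 10.
first_and_10 = (plays["Down"] == 1) & (plays["ToGo"] == 10)
# 1st and Goal: the distance to go equals the distance to the goal line.
first_and_goal = (plays["Down"] == 1) & (plays["ToGo"] == plays["Yardline100"])

regular_first_downs = plays.loc[first_and_10 | first_and_goal]
print(regular_first_downs)
```

The 1st and 5 and 1st and 15 rows, which would typically follow a penalty, are excluded along with the 2nd down play.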

In [None]:
states_first_downs = pd.read_csv('nfl_first_downs_2009_2016.csv.gz')

In [None]:
states_first_downs.head(10)

### How often is there a first down at a given yardline?

We've expanded our dataset.  Now we consider every time a team had a first down at any point during a drive, not just the first play of the drive.

From the histogram, we see there are a _lot_ more observations for each yardline (the 80/20 and 75/25 yardlines are about the same).

In [None]:
# Flag the kickoff rows again, this time in the first-down data
ko_2 = states_first_downs['PossessionType'] == 'Kickoff'

first_downs = states_first_downs.loc[~ko_2]
first_downs.hist(column='Yardline100', bins=bins);

_Question_

1. What are the patterns we're seeing in the histogram?  Why are there little spikes at the 1 and 99/1 yardlines?  Why is there a mini spike at the 75/25 yardline?  Why is there a jump at the 70/30 yardline?

### What do the Possession Values look like?

How often are touchdowns scored from the 1 yardline?  What about from our own 1 yardline?

In [None]:
# Half-integer edges give each integer score from -7 to 7 its own bin
first_downs.loc[first_downs.Yardline100 == 1].\
    hist('PossessionValue', bins=np.arange(-7.5, 8.5));

In [None]:
# Same bins as above, now for first downs at our own 1 yardline
first_downs.loc[first_downs.Yardline100 == 99].\
    hist('PossessionValue', bins=np.arange(-7.5, 8.5));

### Did the value of the kickoff change?

It's worth asking, with this different dataset did something change about kickoffs?

Nope.  And why should it change?  We only incorporated more first downs, we didn't change how we viewed value.

In [None]:
# Compute the average kickoff value
ekv = states_first_downs.loc[ko_2, 'PossessionValue'].mean()
print(f"Expected Kickoff Value: {ekv:.3f} pts")

### Expected Points for a First Down

As before, we group by each yardline and compute the expected points for the first downs.

The plot looks a lot less jagged.  On the right-hand side, from about 60 to 100, it looks identical to before, but it looks a lot better on the left-hand side.

In [None]:
# Mean possession value at each first-down yardline
raw_first_down_epv = (
    first_downs
    .groupby('Yardline100')['PossessionValue']
    .mean()
)

raw_first_down_epv.plot();

_Questions_

1. Suppose your team has a 1st and 10 at its own 40 yardline (60 in Yardline100).  What is the added value of a 20 yard play?
2. Why did expanding our dataset to include all first downs make our expected points model less jagged?
3. Brainstorm some ideas for how we can take the above plot and produce a smooth line for expected points?
4. Regardless of how we would do it, what are some ways we would like to extend this model beyond its current valuation of first downs?