# Expected Points


The following data was downloaded through a link provided on [drivebyfootball.com](https://www.drivebyfootball.com/)
on [a page describing their **Expected Points** (EP) Model.](https://www.drivebyfootball.com/2011/06/our-expected-points-model.html)
An expected points model estimates the average number of points scored from any
game state. Expected points models are used to evaluate teams, and sometimes players,
by computing the expected points **added** on any given play: the expected points
of the game state after the play minus the expected points of the game state before the
play. For more information on what such models mean and where they come from,
look [here.](https://www.espn.com/nfl/story/_/id/8379024/nfl-explaining-expected-points-metric)
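
As a minimal sketch of the expected points added computation, the snippet below takes EPA to be the EP of the state after a play minus the EP of the state before it. The two `Markov EP` values are copied from the table shown later in this notebook; the play itself is hypothetical, and reading `Ydline` as the distance to the opponent's goal line is an assumption (it is suggested by the high touchdown rate for state `1-1-1`). The name `expected_points_added` is made up for illustration.

```python
# Markov EP values copied from the table in this notebook.
markov_ep = {
    "1-10-22": 4.10,  # 1st and 10 at yardline 22
    "1-10-10": 4.78,  # 1st and 10 at yardline 10
}

def expected_points_added(ep_before, ep_after):
    """EPA = EP of the state after the play minus EP of the state before it."""
    return ep_after - ep_before

# Hypothetical play: a 12-yard gain from the 22 that earns a new first down
# at the 10 (assumes a smaller Ydline means closer to scoring).
epa = expected_points_added(markov_ep["1-10-22"], markov_ep["1-10-10"])
print(round(epa, 2))  # 0.68
```

A positive value means the play left the offense in a better state than it started in; a negative value means the opposite.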

This is not necessarily the most sophisticated EP model. For example, a game state here
consists only of the down, the yards left to go, and the yardline. The model therefore
does not take into account how much time is left in the half, as some
expected points models do.
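
In the data loaded below, each game state is labeled by a string in the `State` column such as `1-10-22`, and the three parts match the `Down`, `DTG`, and `Ydline` columns of the same row. A small sketch of splitting such a label into its components follows; `GameState` and `parse_state` are hypothetical names, not part of the original data or code.

```python
from typing import NamedTuple

class GameState(NamedTuple):
    down: int    # 1 through 4
    dtg: int     # distance (yards) to go for a first down
    ydline: int  # yardline

def parse_state(label):
    """Split a State label such as '1-10-22' into its three components."""
    down, dtg, ydline = (int(part) for part in label.split("-"))
    return GameState(down, dtg, ydline)

state = parse_state("1-10-22")
print(state)  # GameState(down=1, dtg=10, ydline=22)
```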

This is the [link to the Google Docs spreadsheet](https://docs.google.com/spreadsheets/d/1aWwUTp8FlXBFHgEdwHtvx2n7Uws_mRdBDSND9yF0nwo/edit?authkey=CNWl2cUF&hl=en&authkey=CNWl2cUF&hl=en&authkey=CNWl2cUF#gid=0)
that the drivebyfootball folks provided.

The code below loads a text file created from the Google Docs spreadsheet. The
text file was created by saving the Google Doc as a **.pdf** and using Adobe Acrobat
to export that as a Text (Accessible) file. A little editing was required to ensure that each entry
of the header line occupied exactly one line (some column names were broken up because they
contained spaces).

In [153]:
import os.path
wd = '/Users/gawron/Desktop/src/sphinx/python_for_ss_extras/colab_notebooks/python-for-social-science/pandas'
# Shell-escaped form of the file name (raw string, so the backslashes are literal)
fn0 = r'Markov\ Drive\ Analysis\ -\ Google\ Sheets.txt'
fn = 'Markov Drive Analysis - Google Sheets.txt'
new_fn = 'Markov_Drive_Analysis_NFL.csv'
path = os.path.join(wd,fn)
new_path = os.path.join(wd,new_fn)
#  We need this list for reading the data back in with pd.read_csv
columns = ['State',
 'Downs',
 'FGMade',
 'FGMiss',
 'FUM',
 'INT',
 'PUNT',
 'SAF',
 'TD',
 'Frequency',
 'E[Plays Remaining]',
 'Markov EP',
 'Recursive EP',
 'Regressive EP',
 'Down',
 'DTG',
 'Ydline']

## Read in the text file (only needs to be done once)

The code snippet below converts the one-item-per-line format of the text file `path` to
a list of lists, where each inner list contains the data in one row of the Google Docs spreadsheet.
The simplest way to do this requires knowing that there are 17 columns in the spreadsheet, so that every 17 lines
of `path` should be collected into a list representing one row of the spreadsheet.
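
The fixed-width grouping just described can be sketched as follows, using a toy input with 3 columns instead of 17. The helper name `chunk_rows` and the sample values are illustrative only.

```python
def chunk_rows(items, num_cols):
    """Group a flat list of cell values into rows of num_cols items each."""
    if len(items) % num_cols != 0:
        raise ValueError("number of items is not a multiple of num_cols")
    return [items[i:i + num_cols] for i in range(0, len(items), num_cols)]

# Toy input: one cell per "line", 3 columns per spreadsheet row.
lines = ["State", "Down", "DTG",
         "1-1-1", "1.0", "1.0",
         "1-10-10", "1.0", "10.0"]
rows = chunk_rows(lines, num_cols=3)
print(rows[0])  # ['State', 'Down', 'DTG']
print(rows[2])  # ['1-10-10', '1.0', '10.0']
```

The weakness of this approach is that a single missing or extra line shifts every later row, and the length check only catches the problem when the total count is off.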

A better idea: exploit the fact that the first column (`State`) is the only column containing strings, so each non-numeric line marks the start of a new row.

In [250]:
buffer = []
res = []
bad_rows = []
num_cols = 17
row_num=0

def not_floatable_line(line):
    if line == "":
        # We will turn empty strings into NaNs, so they are floatable.
        return False
    try:
        float(line)
    except ValueError:
        return True
    return False

def make_float(item):
    try:
        return float(item)
    except ValueError:
        return item
    

In [221]:
with open(path,'r') as fh:
    for (i, line) in enumerate(fh):
        line = line.strip()
        if line == "":
            continue
        if i>=num_cols and not_floatable_line(line):
            ## A non-numeric line starts a new row, so wrap up the previous one
            if len(buffer) != num_cols:
                ## Error row: missing or extra data
                print("Bad row")
                bad_rows.append(row_num)
            row_num += 1
            res.append(buffer)
            buffer = []
        if i>=num_cols:
            # Data lines: convert to float where possible (State labels stay strings)
            buffer.append(make_float(line))
        else:
            # The first num_cols lines are the column names
            buffer.append(line)

# Flush the final row if the file ends right after a complete row
if len(buffer) == num_cols:
    res.append(buffer)

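Rows with missing or extra data are recorded in `bad_rows`. The bad row displayed below, from an earlier run, corresponds to row 58 in the Google Doc.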

In [223]:
bad_rows

[]

In [218]:
res[bad_rows[0]]

['1-16-46',
 0.0489,
 0.1977,
 0.0543,
 0.0586,
 0.1001,
 0.2979,
 0.0,
 0.2424,
 111.0,
 5.22,
 2.28,
 1.55,
 2.07,
 1.0,
 16.0,
 46.0,
 '',
 '',
 '']

The first "line" (`res[0]`) contains the column names.

In [224]:
res[0]

['State',
 'Downs',
 'FGMade',
 'FGMiss',
 'FUM',
 'INT',
 'PUNT',
 'SAF',
 'TD',
 'Frequency',
 'E[Plays Remaining]',
 'Markov EP',
 'Recursive EP',
 'Regressive EP',
 'Down',
 'DTG',
 'Ydline']

The number of rows in the spreadsheet, including the header row:

In [225]:
len(res)

1016

In [226]:
res[1]

['1-1-1',
 0.0155,
 0.0717,
 0.0027,
 0.0222,
 0.0135,
 0.0002,
 0.0,
 0.8742,
 1504.0,
 1.95,
 6.32,
 6.22,
 6.24,
 1.0,
 1.0,
 1.0]

In [227]:
len(res[99])

17

## Create the DataFrame (Done once)

In [229]:
import pandas as pd
df = pd.DataFrame(res[1:],columns=res[0])

In [230]:
df

|  | State | Downs | FGMade | FGMiss | FUM | INT | PUNT | SAF | TD | Frequency | E[Plays Remaining] | Markov EP | Recursive EP | Regressive EP | Down | DTG | Ydline |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1-1-1 | 0.0155 | 0.0717 | 0.0027 | 0.0222 | 0.0135 | 0.0002 | 0.0 | 0.8742 | 1504.0 | 1.95 | 6.32 | 6.22 | 6.24 | 1.0 | 1.0 | 1.0 |
| 1 | 1-10-10 | 0.0299 | 0.3318 | 0.0265 | 0.0265 | 0.0409 | 0.0025 | 0.0 | 0.5419 | 4326.0 | 3.50 | 4.78 | 4.57 | 4.69 | 1.0 | 10.0 | 10.0 |
| 2 | 1-10-14 | 0.0287 | 0.3307 | 0.0379 | 0.0311 | 0.0489 | 0.0038 | 0.0 | 0.5189 | 4293.0 | 3.92 | 4.61 | 4.37 | 4.45 | 1.0 | 10.0 | 14.0 |
| 3 | 1-10-18 | 0.0299 | 0.3504 | 0.0542 | 0.0345 | 0.0505 | 0.0082 | 0.0 | 0.4722 | 4822.0 | 4.30 | 4.35 | 4.07 | 4.20 | 1.0 | 10.0 | 18.0 |
| 4 | 1-10-22 | 0.0314 | 0.3537 | 0.0655 | 0.0404 | 0.0554 | 0.0179 | 0.0 | 0.4357 | 5235.0 | 4.61 | 4.10 | 3.78 | 3.95 | 1.0 | 10.0 | 22.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1010 | 4-9-74 | 0.0002 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.9832 | 0.0 | 0.0165 | 124.0 | 1.04 | 0.12 | -0.85 | -1.48 | 4.0 | 9.0 | 74.0 |
| 1011 | 4-9-78 | 0.0055 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.9724 | 0.0 | 0.0220 | 184.0 | 1.03 | 0.15 | -0.81 | -1.72 | 4.0 | 9.0 | 78.0 |
| 1012 | 4-9-82 | 0.0119 | 0.0012 | 0.0003 | 0.0008 | 0.0011 | 0.9590 | 0.0 | 0.0257 | 86.0 | 1.11 | 0.18 | -0.78 | -1.97 | 4.0 | 9.0 | 82.0 |
| 1013 | 4-9-86 | 0.0189 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.9807 | 0.0 | 0.0004 | 53.0 | 1.02 | 0.00 | -0.98 | -2.22 | 4.0 | 9.0 | 86.0 |


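Shift the index so that it agrees with the spreadsheet row numbering (row 1 of the spreadsheet is the header, so the first data row is row 2).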

In [237]:
df.index = df.index + 2
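
The effect of the index shift can be checked on a toy frame. This is only an illustrative sketch: `toy` and `toy_df` are made-up names, and the three rows stand in for the full list `res`.

```python
import pandas as pd

# Toy stand-in for the parsed rows: a header row followed by data rows.
toy = [["State", "Down"], ["1-1-1", 1.0], ["1-10-10", 1.0], ["1-10-14", 1.0]]
toy_df = pd.DataFrame(toy[1:], columns=toy[0])

# Spreadsheet row 1 holds the header, so data row 0 sits in spreadsheet row 2.
toy_df.index = toy_df.index + 2
print(list(toy_df.index))      # [2, 3, 4]
print(toy_df.loc[2, "State"])  # 1-1-1
```

After the shift, `df.loc[n]` returns the data found in row `n` of the spreadsheet, which makes spot checks against the original document straightforward.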

In [238]:
df

|  | State | Downs | FGMade | FGMiss | FUM | INT | PUNT | SAF | TD | Frequency | E[Plays Remaining] | Markov EP | Recursive EP | Regressive EP | Down | DTG | Ydline |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2 | 1-1-1 | 0.0155 | 0.0717 | 0.0027 | 0.0222 | 0.0135 | 0.0002 | 0.0 | 0.8742 | 1504.0 | 1.95 | 6.32 | 6.22 | 6.24 | 1.0 | 1.0 | 1.0 |
| 3 | 1-10-10 | 0.0299 | 0.3318 | 0.0265 | 0.0265 | 0.0409 | 0.0025 | 0.0 | 0.5419 | 4326.0 | 3.50 | 4.78 | 4.57 | 4.69 | 1.0 | 10.0 | 10.0 |
| 4 | 1-10-14 | 0.0287 | 0.3307 | 0.0379 | 0.0311 | 0.0489 | 0.0038 | 0.0 | 0.5189 | 4293.0 | 3.92 | 4.61 | 4.37 | 4.45 | 1.0 | 10.0 | 14.0 |
| 5 | 1-10-18 | 0.0299 | 0.3504 | 0.0542 | 0.0345 | 0.0505 | 0.0082 | 0.0 | 0.4722 | 4822.0 | 4.30 | 4.35 | 4.07 | 4.20 | 1.0 | 10.0 | 18.0 |
| 6 | 1-10-22 | 0.0314 | 0.3537 | 0.0655 | 0.0404 | 0.0554 | 0.0179 | 0.0 | 0.4357 | 5235.0 | 4.61 | 4.10 | 3.78 | 3.95 | 1.0 | 10.0 | 22.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1012 | 4-9-74 | 0.0002 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.9832 | 0.0 | 0.0165 | 124.0 | 1.04 | 0.12 | -0.85 | -1.48 | 4.0 | 9.0 | 74.0 |
| 1013 | 4-9-78 | 0.0055 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.9724 | 0.0 | 0.0220 | 184.0 | 1.03 | 0.15 | -0.81 | -1.72 | 4.0 | 9.0 | 78.0 |
| 1014 | 4-9-82 | 0.0119 | 0.0012 | 0.0003 | 0.0008 | 0.0011 | 0.9590 | 0.0 | 0.0257 | 86.0 | 1.11 | 0.18 | -0.78 | -1.97 | 4.0 | 9.0 | 82.0 |
| 1015 | 4-9-86 | 0.0189 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.9807 | 0.0 | 0.0004 | 53.0 | 1.02 | 0.00 | -0.98 | -2.22 | 4.0 | 9.0 | 86.0 |


## Save the DataFrame (Done once)

In [239]:
df.to_csv(new_path)

## Reading the DataFrame back in (Done every other time we use the data)

In [251]:
import numpy as np

def float_fn(x_str):
    """
    If x_str can be coerced to a float, do so.

    Otherwise leave it alone (this keeps the State labels as strings).
    """
    try:
        return float(x_str)
    except ValueError:
        # For debugging, collect x_str in a list here or return np.nan instead
        return x_str

# Apply float_fn to every column; index_col=0 restores the saved row numbers
converters = {col:float_fn for col in columns}
df2 = pd.read_csv(new_path,index_col=0,converters=converters)

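The converters coerce every entry that can be read as a number to a float and leave the `State` labels as strings. (Without converters, `pd.read_csv` infers column types on its own, so purely numeric columns are normally read back as floats anyway; the converters just make the conversion explicit.)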
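
The save-and-reload round trip can be tried on a toy frame without touching the disk. The sketch below is illustrative only: it writes to an in-memory `io.StringIO` buffer rather than to `new_path`, uses a two-row frame with made-up variable names, and reads the data back both with explicit converters and with pandas' own type inference.

```python
import io
import pandas as pd

def float_fn(x_str):
    """Coerce x_str to a float if possible; otherwise return it unchanged."""
    try:
        return float(x_str)
    except ValueError:
        return x_str

# Toy frame with the spreadsheet-style index (starting at 2).
toy_df = pd.DataFrame(
    {"State": ["1-1-1", "1-10-10"], "Markov EP": [6.32, 4.78]},
    index=[2, 3],
)

# Write to an in-memory buffer instead of a file on disk.
buf = io.StringIO()
toy_df.to_csv(buf)

# Read back with explicit converters.
buf.seek(0)
converters = {col: float_fn for col in toy_df.columns}
with_conv = pd.read_csv(buf, index_col=0, converters=converters)

# Read back again, letting pandas infer the column types.
buf.seek(0)
inferred = pd.read_csv(buf, index_col=0)

print(with_conv["Markov EP"].tolist())  # [6.32, 4.78]
print(list(with_conv.index))            # [2, 3]
print(inferred["Markov EP"].dtype)      # float64
```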

In [233]:
converters

{'State': <function __main__.float_fn(x_str)>,
 'Downs': <function __main__.float_fn(x_str)>,
 'FGMade': <function __main__.float_fn(x_str)>,
 'FGMiss': <function __main__.float_fn(x_str)>,
 'FUM': <function __main__.float_fn(x_str)>,
 'INT': <function __main__.float_fn(x_str)>,
 'PUNT': <function __main__.float_fn(x_str)>,
 'SAF': <function __main__.float_fn(x_str)>,
 'TD': <function __main__.float_fn(x_str)>,
 'Frequency': <function __main__.float_fn(x_str)>,
 'E[Plays Remaining]': <function __main__.float_fn(x_str)>,
 'Markov EP': <function __main__.float_fn(x_str)>,
 'Recursive EP': <function __main__.float_fn(x_str)>,
 'Regressive EP': <function __main__.float_fn(x_str)>,
 'Down': <function __main__.float_fn(x_str)>,
 'DTG': <function __main__.float_fn(x_str)>,
 'Ydline': <function __main__.float_fn(x_str)>}

In [252]:
df2

|  | State | Downs | FGMade | FGMiss | FUM | INT | PUNT | SAF | TD | Frequency | E[Plays Remaining] | Markov EP | Recursive EP | Regressive EP | Down | DTG | Ydline |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2 | 1-1-1 | 0.0155 | 0.0717 | 0.0027 | 0.0222 | 0.0135 | 0.0002 | 0.0 | 0.8742 | 1504.0 | 1.95 | 6.32 | 6.22 | 6.24 | 1.0 | 1.0 | 1.0 |
| 3 | 1-10-10 | 0.0299 | 0.3318 | 0.0265 | 0.0265 | 0.0409 | 0.0025 | 0.0 | 0.5419 | 4326.0 | 3.50 | 4.78 | 4.57 | 4.69 | 1.0 | 10.0 | 10.0 |
| 4 | 1-10-14 | 0.0287 | 0.3307 | 0.0379 | 0.0311 | 0.0489 | 0.0038 | 0.0 | 0.5189 | 4293.0 | 3.92 | 4.61 | 4.37 | 4.45 | 1.0 | 10.0 | 14.0 |
| 5 | 1-10-18 | 0.0299 | 0.3504 | 0.0542 | 0.0345 | 0.0505 | 0.0082 | 0.0 | 0.4722 | 4822.0 | 4.30 | 4.35 | 4.07 | 4.20 | 1.0 | 10.0 | 18.0 |
| 6 | 1-10-22 | 0.0314 | 0.3537 | 0.0655 | 0.0404 | 0.0554 | 0.0179 | 0.0 | 0.4357 | 5235.0 | 4.61 | 4.10 | 3.78 | 3.95 | 1.0 | 10.0 | 22.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1012 | 4-9-74 | 0.0002 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.9832 | 0.0 | 0.0165 | 124.0 | 1.04 | 0.12 | -0.85 | -1.48 | 4.0 | 9.0 | 74.0 |
| 1013 | 4-9-78 | 0.0055 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.9724 | 0.0 | 0.0220 | 184.0 | 1.03 | 0.15 | -0.81 | -1.72 | 4.0 | 9.0 | 78.0 |
| 1014 | 4-9-82 | 0.0119 | 0.0012 | 0.0003 | 0.0008 | 0.0011 | 0.9590 | 0.0 | 0.0257 | 86.0 | 1.11 | 0.18 | -0.78 | -1.97 | 4.0 | 9.0 | 82.0 |
| 1015 | 4-9-86 | 0.0189 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.9807 | 0.0 | 0.0004 | 53.0 | 1.02 | 0.00 | -0.98 | -2.22 | 4.0 | 9.0 | 86.0 |


In [253]:
(df==df2).all()

State                 True
Downs                 True
FGMade                True
FGMiss                True
FUM                   True
INT                   True
PUNT                  True
SAF                   True
TD                    True
Frequency             True
E[Plays Remaining]    True
Markov EP             True
Recursive EP          True
Regressive EP         True
Down                  True
DTG                   True
Ydline                True
dtype: bool

If the column-wise comparison reports a mismatch, the following loop finds the rows that differ:

In [242]:
num_bad_rows=0
for i in range(len(df2)):
    if (df.loc[i+2]== df2.loc[i+2]).all():
        continue
    print(i)
    num_bad_rows+=1
print("Bad", num_bad_rows)

Bad 0
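
One caveat about comparing with `==`: a missing value (NaN) never compares equal to itself, so element-wise comparison would report a mismatch for any NaN cell even when the two frames agree. The comparisons above all return `True`, so this does not appear to be an issue here, but as a hedged sketch of the alternative, `DataFrame.equals` treats NaNs in the same position as equal. The frames `a` and `b` below are toy data.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"State": ["1-1-1", "1-10-10"], "TD": [0.8742, np.nan]})
b = a.copy()

# Element-wise comparison: NaN != NaN, so the TD column reports a mismatch.
print((a == b).all().tolist())  # [True, False]

# DataFrame.equals treats NaNs in the same position as equal.
print(a.equals(b))              # True
```

Note that `equals` also requires the column dtypes to match, so it is stricter than `==` in that respect.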


## Confirming data matches spreadsheet

In [246]:
df.loc[1016]

State                 4-9-90
Downs                    0.0
FGMade                   0.0
FGMiss                   0.0
FUM                      0.0
INT                      0.0
PUNT                  0.9732
SAF                      0.0
TD                    0.0268
Frequency               39.0
E[Plays Remaining]      1.05
Markov EP               0.19
Recursive EP           -0.77
Regressive EP          -2.46
Down                     4.0
DTG                      9.0
Ydline                  90.0
Name: 1016, dtype: object

In [247]:
df.loc[2]

State                  1-1-1
Downs                 0.0155
FGMade                0.0717
FGMiss                0.0027
FUM                   0.0222
INT                   0.0135
PUNT                  0.0002
SAF                      0.0
TD                    0.8742
Frequency             1504.0
E[Plays Remaining]      1.95
Markov EP               6.32
Recursive EP            6.22
Regressive EP           6.24
Down                     1.0
DTG                      1.0
Ydline                   1.0
Name: 2, dtype: object

In [248]:
df.loc[10]

State                 1-10-38
Downs                  0.0518
FGMade                 0.2466
FGMiss                 0.0582
FUM                    0.0514
INT                     0.077
PUNT                   0.1724
SAF                       0.0
TD                     0.3424
Frequency              6858.0
E[Plays Remaining]       5.68
Markov EP                3.13
Recursive EP             2.57
Regressive EP            2.97
Down                      1.0
DTG                      10.0
Ydline                   38.0
Name: 10, dtype: object

In [249]:
df.loc[100]

State                 1-5-30
Downs                  0.045
FGMade                0.2892
FGMiss                0.0623
FUM                    0.069
INT                   0.0692
PUNT                  0.0609
SAF                      0.0
TD                    0.4044
Frequency               69.0
E[Plays Remaining]      5.54
Markov EP               3.69
Recursive EP            3.23
Regressive EP            3.8
Down                     1.0
DTG                      5.0
Ydline                  30.0
Name: 100, dtype: object
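
Beyond spot checks against the spreadsheet, one further sanity check is possible. In the rows displayed above, the eight columns from `Downs` through `TD` sum to approximately 1. Reading them as the probabilities of the different ways a drive can end is an assumption based on the column names and on these sums, not something stated in the source. The sketch below re-enters three of the displayed rows by hand; on the real data the same check could be run on `df` directly.

```python
import pandas as pd

outcome_cols = ["Downs", "FGMade", "FGMiss", "FUM", "INT", "PUNT", "SAF", "TD"]

# Three rows copied from the outputs above (spreadsheet rows 2, 10, and 100).
sample = pd.DataFrame(
    [
        ["1-1-1",   0.0155, 0.0717, 0.0027, 0.0222, 0.0135, 0.0002, 0.0, 0.8742],
        ["1-10-38", 0.0518, 0.2466, 0.0582, 0.0514, 0.0770, 0.1724, 0.0, 0.3424],
        ["1-5-30",  0.0450, 0.2892, 0.0623, 0.0690, 0.0692, 0.0609, 0.0, 0.4044],
    ],
    columns=["State"] + outcome_cols,
    index=[2, 10, 100],
)

# Sum the outcome columns across each row.
row_sums = sample[outcome_cols].sum(axis=1)
print(row_sums.round(4).tolist())

# Allow a small tolerance for rounding in the published values.
close_to_one = (row_sums - 1.0).abs() < 1e-3
print(close_to_one.all())
```

A row whose sum falls well outside the tolerance would be a good candidate for a transcription error introduced by the PDF-to-text conversion.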