# A Least Squares Predictor for Fantasy Football

In Fantasy Football, contestants choose from a pool of available (American) football players to build a team.  Contestants' teams score points depending on how their chosen players performed in real life.  The more points scored, the better!

There are literally hundreds of websites and blogs dedicated to predicting who will have a good game.  They use a variety of methodologies (including no methodology at all) to generate their predictions.  We will try to develop a predictor using Linear Least Squares that will answer the question: "Should I pick this player?"

We'll import our standard packages, along with `pandas`, which is a `python` data analysis library.

In [1]:
import numpy as np
import numpy.linalg as la
import pandas as pd

There are two data sets, `FF-data-2018.csv` and `FF-data-2019.csv`, that were collected using scoring from the Yahoo Fantasy Football platform.  The 2018 data was collected from [here](http://rotoguru1.com/cgi-bin/fyday.pl?week=16&year=2018&game=yh&scsv=1).  You can choose other years going back to 2011 from a variety of platforms.

Let's read in the data and see what it looks like.

In [2]:
ff_2018 = pd.read_csv('FF-data-2018.csv')
ff_2018

Unnamed: 0,Week,Year,GID,Name,Pos,Team,h/a,Oppt,YH points,YH salary
0,1,2018,1242,Fitzpatrick; Ryan,QB,tam,a,nor,42.28,25.0
1,1,2018,1151,Brees; Drew,QB,nor,h,tam,31.56,33.0
2,1,2018,1231,Rivers; Philip,QB,lac,h,kan,29.96,31.0
3,1,2018,1523,Mahomes II; Patrick,QB,kan,a,lac,28.34,27.0
4,1,2018,1252,Rodgers; Aaron,QB,gnb,h,chi,24.94,39.0
...,...,...,...,...,...,...,...,...,...,...
6350,16,2018,7013,Indianapolis,Def,ind,h,nyg,2.00,13.0
6351,16,2018,7010,Denver,Def,den,a,oak,1.00,16.0
6352,16,2018,7029,Tampa Bay,Def,tam,a,dal,1.00,10.0
6353,16,2018,7015,Kansas City,Def,kan,a,sea,-1.00,13.0


There are 6,355 data points which have a number of fields.  They are:
- **Week**: The NFL season features 17 weeks of games, and each team plays 16 games in this time period.  This column tells you which week the player's game was.  I didn't include week 17, because many of the best players take that week off.


- **Year**: Which year the game was played.  For this data set, all the year values are equal to 2018.


- **GID**: A unique ID tag for each player.  We'll ignore this column.


- **Name**: The actual name of the player.  In the case of defenses, the defense of the entire team is included, so in that case, this is the name of a city.


- **Pos**: This is the position of the player.  The available choices are quarterback (QB), running back (RB), wide receiver (WR), tight end (TE), and defense (Def).


- **Team**: An abbreviation that indicates which team the player belongs to.  Ryan Fitzpatrick was a member of the Tampa Bay Buccaneers, so his Team value is "tam".


- **h/a**: Whether the player's game was played at home or on the road.  The possible values are 'h' (home) and 'a' (away).


- **Oppt**: The opposing team that the player faced.  Ryan Fitzpatrick played against the New Orleans Saints in week 1, so his Oppt value is "nor".


- **YH points**: The amount of points the player scored that week.  Ryan Fitzpatrick scored a whopping 42.28 points in week 1.


- **YH salary**: On many Fantasy Football sites, you start with a certain budget, and select a team of players within the constraints of that budget.  Ryan Fitzpatrick only took 25.0 "dollars" of your available budget if you selected him on your team.  It gives an indication of how the platform judges the quality of a player.

We can access the labels and put them in a list:

In [3]:
labels = list(ff_2018.columns)
print(labels)

['Week', 'Year', 'GID', 'Name', 'Pos', 'Team', 'h/a', 'Oppt', 'YH points', 'YH salary']


We can print out the available values of the positions for the data set by passing the key `Pos` as a string to the data set.

In [4]:
print(ff_2018['Pos'].values)

['QB' 'QB' 'QB' ... 'Def' 'Def' 'Def']


To remove all the duplicates, we can call the function `numpy.unique` to access all distinct values (just like every other time you use a new function, review the documentation of `numpy.unique`. You can do so by running a cell with the following command: `np.unique?`).

**Check your answer:**

Create the variable `positions` to store the unique player positions found in the column `Pos` of the dataframe:

In [5]:
#grade_clear
positions = np.unique(ff_2018['Pos'])
print(positions)
print(type(positions))
print(positions.shape)
print(type(positions[0]))

['Def' 'QB' 'RB' 'TE' 'WR']
<class 'numpy.ndarray'>
(5,)
<class 'str'>
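
As a small illustration of what `numpy.unique` does, the sketch below runs it on a made-up array of position labels (not the real data). The optional `return_counts=True` argument also reports how often each distinct value occurs; the names `pos_column`, `positions_demo` and `counts_demo` are only used for this example.

```python
import numpy as np

# Made-up position labels, standing in for ff_2018['Pos'].values
pos_column = np.array(['QB', 'WR', 'QB', 'Def', 'WR', 'WR'])

# np.unique returns the distinct values in sorted order;
# return_counts=True also returns how many times each one occurs
positions_demo, counts_demo = np.unique(pos_column, return_counts=True)

print(positions_demo)  # ['Def' 'QB' 'WR']
print(counts_demo)     # [1 2 3]
```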


Since the positions in football are so different, we really want to focus on one at a time.  It would be very ambitious to try and create a general predictor for all positions.  Let's focus on quarterbacks first.  

**Check your answer:**

Extract all the data for quarterbacks by finding the rows in the dataframe that have position equal to `QB`. Create a **copy** of this smaller dataframe as a new dataset named `df_POS`.

In [8]:
POS = 'QB'

In [9]:
#grade_clear
df_POS = ff_2018[ff_2018['Pos'] == POS].copy()
df_POS

Unnamed: 0,Week,Year,GID,Name,Pos,Team,h/a,Oppt,YH points,YH salary
0,1,2018,1242,Fitzpatrick; Ryan,QB,tam,a,nor,42.28,25.0
1,1,2018,1151,Brees; Drew,QB,nor,h,tam,31.56,33.0
2,1,2018,1231,Rivers; Philip,QB,lac,h,kan,29.96,31.0
3,1,2018,1523,Mahomes II; Patrick,QB,kan,a,lac,28.34,27.0
4,1,2018,1252,Rodgers; Aaron,QB,gnb,h,chi,24.94,39.0
...,...,...,...,...,...,...,...,...,...,...
5968,16,2018,1536,Allen; Kyle,QB,car,h,atl,1.52,0.0
5969,16,2018,1507,Sudfeld; Nate,QB,phi,h,hou,0.00,20.0
5970,16,2018,1484,Mannion; Sean,QB,lar,a,ari,-0.20,20.0
5971,16,2018,1336,Hoyer; Brian,QB,nwe,h,buf,-0.20,20.0


We can access the names of all the quarterbacks by referring to the column `Name`:

In [10]:
df_POS['Name']

0         Fitzpatrick; Ryan
1               Brees; Drew
2            Rivers; Philip
3       Mahomes II; Patrick
4            Rodgers; Aaron
               ...         
5968            Allen; Kyle
5969          Sudfeld; Nate
5970          Mannion; Sean
5971           Hoyer; Brian
5972           Hill; Taysom
Name: Name, Length: 586, dtype: object

Linear Least Squares works with numerical data, not strings.  Eventually, we will want our predictive models to incorporate whether the player played at home or on the road, or how good their opponent was.  But the columns `h/a` and `Oppt` are strings:

In [11]:
df_POS['h/a']

0       a
1       h
2       h
3       a
4       h
       ..
5968    h
5969    h
5970    a
5971    h
5972    h
Name: h/a, Length: 586, dtype: object

In [12]:
df_POS['Oppt']

0       nor
1       tam
2       kan
3       lac
4       chi
       ... 
5968    atl
5969    hou
5970    ari
5971    buf
5972    pit
Name: Oppt, Length: 586, dtype: object

At this point, we need to make decisions about what numerical values these should take.  

For the home/away column, let's make an array with the value +1.0 when the game is played at home, and -1.0 when the game is played away.

**Check your answer:** 

You can use the function `np.where` (https://numpy.org/doc/stable/reference/generated/numpy.where.html) to accomplish that:
```python
np.where(df_POS['h/a']=='a',-1,1)
```



Store this numerical array as another column in the pandas dataframe `df_POS`, with label `home_away`

In [13]:
#grade_clear
df_POS['home_away'] = np.where(df_POS['h/a']=='a',-1,1)
df_POS

Unnamed: 0,Week,Year,GID,Name,Pos,Team,h/a,Oppt,YH points,YH salary,home_away
0,1,2018,1242,Fitzpatrick; Ryan,QB,tam,a,nor,42.28,25.0,-1
1,1,2018,1151,Brees; Drew,QB,nor,h,tam,31.56,33.0,1
2,1,2018,1231,Rivers; Philip,QB,lac,h,kan,29.96,31.0,1
3,1,2018,1523,Mahomes II; Patrick,QB,kan,a,lac,28.34,27.0,-1
4,1,2018,1252,Rodgers; Aaron,QB,gnb,h,chi,24.94,39.0,1
...,...,...,...,...,...,...,...,...,...,...,...
5968,16,2018,1536,Allen; Kyle,QB,car,h,atl,1.52,0.0,1
5969,16,2018,1507,Sudfeld; Nate,QB,phi,h,hou,0.00,20.0,1
5970,16,2018,1484,Mannion; Sean,QB,lar,a,ari,-0.20,20.0,-1
5971,16,2018,1336,Hoyer; Brian,QB,nwe,h,buf,-0.20,20.0,1
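
One thing to keep in mind with `np.where` is that it only tests a single condition, so any entry that is not `'a'` is treated as a home game. The sketch below, on a made-up column with one unexpected value, compares it with `pandas.Series.map`, which is one possible alternative that leaves `NaN` for values missing from the dictionary. The names `ha`, `via_where` and `via_map` are only used for this illustration.

```python
import numpy as np
import pandas as pd

# Made-up column; 'x' stands in for an unexpected or mistyped entry
ha = pd.Series(['a', 'h', 'h', 'x'])

# np.where only tests for 'a', so everything else (including 'x') becomes 1
via_where = np.where(ha == 'a', -1, 1)

# Series.map leaves NaN wherever the value is not a key of the dictionary
via_map = ha.map({'h': 1.0, 'a': -1.0})

print(via_where)         # [-1  1  1  1]
print(via_map.tolist())  # [-1.0, 1.0, 1.0, nan]
```

For this data set the column only contains `'h'` and `'a'`, so both approaches encode the same information.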


For the opponents, we need some kind of information about how many points they give up to a position on average.  We have compiled that information in a separate file, called `team_rankings.py`.  Importing this file will give us access to a collection of dictionaries that provides the information we need.

In [14]:
from team_rankings import *  # asterisk just means we import everything from that namespace

Since we are using data from 2018, the stored dictionary `vs_2018` has the relevant ranking. Take a look at the keys in the dictionary:

In [15]:
vs_2018.keys() 

dict_keys(['QB', 'WR', 'RB', 'TE', 'Def'])

Note that the keys are just the player positions. Let's see the information for the key `QB` (we have been storing this string in the variable `POS`)

In [16]:
vs_2018[POS]

{'ari': 28,
 'atl': 1.0,
 'bal': 29.0,
 'buf': 32.0,
 'car': 9.0,
 'chi': 31.0,
 'cin': 3.0,
 'cle': 13.0,
 'dal': 24.0,
 'den': 27.0,
 'det': 15.0,
 'gnb': 12.0,
 'hou': 19.0,
 'ind': 21.0,
 'jac': 23.0,
 'kan': 5.0,
 'lac': 25.0,
 'lar': 20.0,
 'mia': 10.0,
 'min': 30.0,
 'nor': 2.0,
 'nwe': 18,
 'nyg': 16.0,
 'nyj': 6.0,
 'oak': 8,
 'phi': 11.0,
 'pit': 17.0,
 'sea': 22.0,
 'sfo': 7.0,
 'tam': 4.0,
 'ten': 26.0,
 'was': 14}

In [17]:
print(vs_2018[POS]['atl'])
print(vs_2018[POS]['buf'])

1.0
32.0


There are 32 football teams in the NFL.  

The fact that `vs_2018['QB']['atl']` has the value 1.0 means that the Atlanta Falcons gave up the **most** points to quarterbacks on average in the 2018 season.  

Since `vs_2018['QB']['buf']` has the value 32.0, the Buffalo Bills gave up the **fewest** points to quarterbacks on average in the 2018 season.

So, we would expect a better performance out of a quarterback if he is playing the Atlanta Falcons, compared to the Buffalo Bills. 

The rankings can be very different for different positions:

In [18]:
print(vs_2018['RB']['atl'])
print(vs_2018['RB']['buf'])
print()
print(vs_2018['WR']['atl'])
print(vs_2018['WR']['buf'])
print()
print(vs_2018['TE']['atl'])
print(vs_2018['TE']['buf'])
print()
print(vs_2018['Def']['atl'])
print(vs_2018['Def']['buf'])
print()

4.0
7.0

6.0
29.0

20.0
32.0

21.0
2.0



For the quarterback position (`POS = 'QB'`), we want to convert the strings in the column `Oppt` into their corresponding numerical values using the dictionary `vs_2018`. 

To accomplish that, we will use the `apply` function (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)

In [19]:
def get_rank(x):
    # in this example, x will have row entries of the column in df_POS with label 'Oppt'
    # i.e. x can be 'car', 'buf', 'atl', etc (the abbreviation of the team name)
    # the function will return the ranking for that team given by the dictionary 'vs_2018'
    return vs_2018[POS][x]

df_POS['Oppt'].apply(get_rank)

0        2.0
1        4.0
2        5.0
3       25.0
4       31.0
        ... 
5968     1.0
5969    19.0
5970    28.0
5971    32.0
5972    17.0
Name: Oppt, Length: 586, dtype: float64

**Check your answer:** 

Store this new numeric information as another column of the pandas dataframe `df_POS` with label `oppt_rank`:

In [20]:
#grade_clear
df_POS['oppt_rank'] = df_POS['Oppt'].apply(get_rank)

Your dataframe now should have 586 rows and 12 columns:

In [21]:
df_POS

Unnamed: 0,Week,Year,GID,Name,Pos,Team,h/a,Oppt,YH points,YH salary,home_away,oppt_rank
0,1,2018,1242,Fitzpatrick; Ryan,QB,tam,a,nor,42.28,25.0,-1,2.0
1,1,2018,1151,Brees; Drew,QB,nor,h,tam,31.56,33.0,1,4.0
2,1,2018,1231,Rivers; Philip,QB,lac,h,kan,29.96,31.0,1,5.0
3,1,2018,1523,Mahomes II; Patrick,QB,kan,a,lac,28.34,27.0,-1,25.0
4,1,2018,1252,Rodgers; Aaron,QB,gnb,h,chi,24.94,39.0,1,31.0
...,...,...,...,...,...,...,...,...,...,...,...,...
5968,16,2018,1536,Allen; Kyle,QB,car,h,atl,1.52,0.0,1,1.0
5969,16,2018,1507,Sudfeld; Nate,QB,phi,h,hou,0.00,20.0,1,19.0
5970,16,2018,1484,Mannion; Sean,QB,lar,a,ari,-0.20,20.0,-1,28.0
5971,16,2018,1336,Hoyer; Brian,QB,nwe,h,buf,-0.20,20.0,1,32.0


Now, players' names are repeated in the column `Name` for every game they played.  We will find it convenient to have another array collecting the names without these repeats.  We'll use `pandas.Series.unique` to do this.

In [22]:
unique_players = df_POS['Name'].unique()
len(unique_players)

73

So 73 quarterbacks played in 2018.  But there are only 32 teams!  Who are all these people?

In [23]:
print(unique_players[7])
print(unique_players[72])

Brady; Tom
Sudfeld; Nate


I know who Tom Brady is, but I've never heard of Nate Sudfeld. Let's count how many games each player played.

We can use `groupby` to group players by Name, and then count the number of times each player appears.

In [24]:
df_POS.groupby('Name')['Name'].count()

Name
Allen; Brandon      1
Allen; Josh        11
Allen; Kyle         1
Anderson; Derek     2
Barkley; Matt       1
                   ..
Webb; Joe           2
Weeden; Brandon     1
Wentz; Carson      11
Wilson; Russell    15
Winston; Jameis    10
Name: Name, Length: 73, dtype: int64

We want to add the frequency back to the original dataframe. For that we will use `transform`, which returns a result aligned with the original index (one value per row rather than one per player).

In [25]:
df_POS.groupby('Name')['Name'].transform('count')

0        8
1       15
2       15
3       15
4       15
        ..
5968     1
5969     1
5970     2
5971     4
5972    15
Name: Name, Length: 586, dtype: int64
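
To see the difference between the two calls more concretely, here is a minimal sketch on a made-up four-row dataframe (the names `toy`, `per_player` and `per_row` are only used for this illustration). `count()` returns one entry per player, while `transform('count')` returns one entry per row of the original dataframe.

```python
import pandas as pd

# Made-up miniature of df_POS: one row per game played
toy = pd.DataFrame({
    'Name': ['Brees; Drew', 'Allen; Kyle', 'Brees; Drew', 'Brees; Drew'],
    'YH points': [31.56, 1.52, 20.0, 25.0],
})

# count() collapses the data to one row per player, indexed by name
per_player = toy.groupby('Name')['Name'].count()

# transform('count') keeps the original row index, so the result can be
# assigned straight back to the dataframe as a new column
per_row = toy.groupby('Name')['Name'].transform('count')

print(per_player.index.tolist())  # ['Allen; Kyle', 'Brees; Drew']
print(per_player.tolist())        # [1, 3]
print(per_row.tolist())           # [3, 1, 3, 3]
```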

**Check your answer:** 

Add this new array as a column of `df_POS` with label `game_count`.

In [26]:
#grade_clear
df_POS['game_count'] = df_POS.groupby('Name')['Name'].transform('count')
df_POS

Unnamed: 0,Week,Year,GID,Name,Pos,Team,h/a,Oppt,YH points,YH salary,home_away,oppt_rank,game_count
0,1,2018,1242,Fitzpatrick; Ryan,QB,tam,a,nor,42.28,25.0,-1,2.0,8
1,1,2018,1151,Brees; Drew,QB,nor,h,tam,31.56,33.0,1,4.0,15
2,1,2018,1231,Rivers; Philip,QB,lac,h,kan,29.96,31.0,1,5.0,15
3,1,2018,1523,Mahomes II; Patrick,QB,kan,a,lac,28.34,27.0,-1,25.0,15
4,1,2018,1252,Rodgers; Aaron,QB,gnb,h,chi,24.94,39.0,1,31.0,15
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5968,16,2018,1536,Allen; Kyle,QB,car,h,atl,1.52,0.0,1,1.0,1
5969,16,2018,1507,Sudfeld; Nate,QB,phi,h,hou,0.00,20.0,1,19.0,1
5970,16,2018,1484,Mannion; Sean,QB,lar,a,ari,-0.20,20.0,-1,28.0,2
5971,16,2018,1336,Hoyer; Brian,QB,nwe,h,buf,-0.20,20.0,1,32.0,4


Note that Nate Sudfeld only played in 1 game in 2018.  He probably took over when the starter was injured, or when his team was involved in a lopsided game.  We probably want to remove his data, since it won't be very helpful.

**Check your answer:**

Create a new (smaller) dataframe (remember to use the `copy` function) that only includes the rows from `df_POS` in which the number of games played (stored in the column `game_count`) is greater than or equal to `min_games`. For now, we will use `min_games = 5`. 

Store this new dataframe as `df_new` (Check: this new dataframe will have 512 rows and 13 columns).

In [27]:
min_games = 5

In [28]:
#grade_clear
df_new = df_POS[df_POS['game_count']>=min_games].copy()
df_new

Unnamed: 0,Week,Year,GID,Name,Pos,Team,h/a,Oppt,YH points,YH salary,home_away,oppt_rank,game_count
0,1,2018,1242,Fitzpatrick; Ryan,QB,tam,a,nor,42.28,25.0,-1,2.0,8
1,1,2018,1151,Brees; Drew,QB,nor,h,tam,31.56,33.0,1,4.0,15
2,1,2018,1231,Rivers; Philip,QB,lac,h,kan,29.96,31.0,1,5.0,15
3,1,2018,1523,Mahomes II; Patrick,QB,kan,a,lac,28.34,27.0,-1,25.0,15
4,1,2018,1252,Rodgers; Aaron,QB,gnb,h,chi,24.94,39.0,1,31.0,15
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5962,16,2018,1466,Mariota; Marcus,QB,ten,h,was,5.10,23.0,1,14.0,14
5963,16,2018,1340,Stafford; Matthew,QB,det,h,min,4.64,21.0,1,30.0,15
5964,16,2018,1442,Bortles; Blake,QB,jac,a,mia,4.06,20.0,-1,10.0,12
5967,16,2018,1492,Kessler; Cody,QB,jac,a,mia,2.44,20.0,-1,10.0,5


**Check your answer:**

Use the `unique` function to get an array with the name of all the players that are relevant to our analysis (i.e., included in the dataframe `df_new`). Store this data in a 1d numpy array named `relevant_players`, which should be sorted in alphabetical order. 

In [29]:
#grade_clear
relevant_players =  np.sort(df_new['Name'].unique())

In [30]:
print(relevant_players.shape)
print(type(relevant_players))
relevant_players

(43,)
<class 'numpy.ndarray'>


array(['Allen; Josh', 'Beathard; C.J.', 'Bortles; Blake', 'Brady; Tom',
       'Brees; Drew', 'Carr; Derek', 'Cousins; Kirk', 'Dalton; Andy',
       'Daniel; Chase', 'Darnold; Sam', 'Dobbs; Joshua', 'Driskel; Jeff',
       'Fitzpatrick; Ryan', 'Flacco; Joe', 'Gabbert; Blaine',
       'Goff; Jared', 'Heinicke; Taylor', 'Hill; Taysom',
       'Jackson; Lamar', 'Keenum; Case', 'Kessler; Cody', 'Luck; Andrew',
       'Mahomes II; Patrick', 'Manning; Eli', 'Mariota; Marcus',
       'Mayfield; Baker', 'Mullens; Nick', 'Newton; Cam',
       'Osweiler; Brock', 'Prescott; Dak', 'Rivers; Philip',
       'Rodgers; Aaron', 'Roethlisberger; Ben', 'Rosen; Josh',
       'Ryan; Matt', 'Smith; Alex', 'Stafford; Matthew',
       'Tannehill; Ryan', 'Trubisky; Mitchell', 'Watson; Deshaun',
       'Wentz; Carson', 'Wilson; Russell', 'Winston; Jameis'],
      dtype=object)

Note that we only consider 43 quarterbacks playing in 2018.

### Let's put all of this together! 


**Check your answer:**

Write a function `prepare_data` that creates the dataframe `df_POS` for a given player position `POS`.  The function returns this dataframe along with the sorted array of relevant players, i.e., those who played at least `min_games` games:


In [31]:
#grade_clear
#clear
def prepare_data(ff_data,POS,min_games): 
    # enter your code here (this is basically a compilation of the steps above)   
    
    df_POS = ...
    relevant_players = ...
    
    #clear
    
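    # keep only the rows for the requested position
    df_POS = ff_data[ff_data['Pos'] == POS].copy()
    # +1 for home games, -1 for away games
    df_POS['home_away'] = np.where(df_POS['h/a']=='a',-1,1)
    # note: get_rank (defined above) looks up vs_2018 with the global POS,
    # not with the POS argument of this function
    df_POS['oppt_rank'] = df_POS['Oppt'].apply(get_rank)
    # number of games each player appears in
    df_POS['game_count'] = df_POS.groupby('Name')['Name'].transform('count')
    # players with at least min_games games, sorted alphabetically
    df_new = df_POS[df_POS['game_count']>=min_games].copy()
    relevant_players =  np.sort(df_new['Name'].unique())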
    
    #clear
    
    return(df_POS, relevant_players)
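
The helper `get_rank` defined earlier reads the global variable `POS`, so the rankings it returns follow whatever `POS` is set to globally rather than the position passed to `prepare_data`. One possible way to make the lookup depend on an explicit position is to hand the per-position dictionary to `pandas.Series.map`. The sketch below assumes a small stand-in dictionary `toy_vs` that holds only the values printed earlier, and a hypothetical helper named `rank_opponents`; neither is part of `team_rankings.py`.

```python
import pandas as pd

# Stand-in for vs_2018, holding only the values printed earlier
toy_vs = {
    'QB': {'atl': 1.0, 'buf': 32.0},
    'RB': {'atl': 4.0, 'buf': 7.0},
    'WR': {'atl': 6.0, 'buf': 29.0},
}

def rank_opponents(oppt, pos, rankings):
    # hypothetical helper: look up each opponent in the table for `pos`
    return oppt.map(rankings[pos])

oppt = pd.Series(['atl', 'buf', 'atl'])
print(rank_opponents(oppt, 'QB', toy_vs).tolist())  # [1.0, 32.0, 1.0]
print(rank_opponents(oppt, 'WR', toy_vs).tolist())  # [6.0, 29.0, 6.0]
```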


Test out that your function works as expected (i.e., before clicking "Save&Grade", make sure that the function returns something reasonable).

In [32]:
df_test,players_test = prepare_data(ff_2018,'WR',3)
df_test

Unnamed: 0,Week,Year,GID,Name,Pos,Team,h/a,Oppt,YH points,YH salary,home_away,oppt_rank,game_count
144,1,2018,5485,Hill; Tyreek,WR,kan,a,lac,38.8,28.0,-1,25.0,15
145,1,2018,5459,Thomas; Michael,WR,nor,h,tam,30.0,37.0,1,4.0,15
146,1,2018,3770,Jackson; DeSean,WR,tam,a,nor,29.1,14.0,-1,2.0,12
147,1,2018,5125,Cobb; Randall,WR,gnb,h,chi,24.7,15.0,1,31.0,8
148,1,2018,5212,Stills; Kenny,WR,mia,h,ten,24.6,17.0,1,26.0,14
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6231,16,2018,5570,Cole; Keelan,WR,jac,a,mia,0.0,10.0,-1,10.0,15
6232,16,2018,5387,Hardy; Justin,WR,atl,a,car,0.0,10.0,-1,9.0,15
6233,16,2018,5692,Beebe; Chad,WR,min,a,det,0.0,10.0,-1,15.0,3
6234,16,2018,5595,Hall; Marvin,WR,atl,a,car,0.0,10.0,-1,9.0,15


# Simple Model - Last $n$ games

We'll start with a simple linear model. For now, we will keep using our example where we constructed a dataset for quarterbacks in the variable `df_POS`, along with `relevant_players`.

The points scored in the previous $n$ games will be the only data considered when making a prediction.  Let's look at what the model would look like for only one player, say Andy Dalton, with $n = 3$.

In [33]:
pl = relevant_players[7]
pl_points = df_POS[df_POS['Name']==pl]['YH points'].values

print('Player:', pl)
print('Number of games played:', len(pl_points))
print('Points:', pl_points)

Player: Dalton; Andy
Number of games played: 11
Points: [17.52 26.6  18.08 25.78 13.92 17.16  8.92 20.2   8.92 19.34  9.1 ]


Andy Dalton played 11 games.  So we could try to build a model that predicted the points he scored in his 4th game, based on his first 3,  and write:

$$ s_1 x_1 + s_2 x_2 + s_3 x_3 = s_4 $$

where $s_i$ is the score of game $i$ and $x_i$ are the coefficients we want to find.

Similarly, we can try to predict the points he scored in the 5th game based on games 2, 3, and 4.

$$ s_2 x_1 + s_3 x_2 + s_4 x_3 = s_5 $$

or more generally:

$$ s_{j-3} x_1 + s_{j-2} x_2 + s_{j-1} x_3 = s_{j} $$

I.e. a "local" least squares system might look something like

$$\mathbf{Ax}\cong \mathbf{b}$$

where

$$\mathbf{A} = \begin{pmatrix} 17.52 & 26.6 & 18.08\\ 26.6 & 18.08 & 25.78 \\ 18.08 & 25.78 & 13.92 \\
25.78 & 13.92 & 17.16 \\ 13.92 & 17.16 & 8.92 \\ 17.16 & 8.92 & 20.2 \\ 8.92 & 20.2 & 8.92 \\
20.2 & 8.92 & 19.34 \end{pmatrix}, \hspace{5mm} \mathbf{b}= \begin{pmatrix} 25.78 \\ 13.92 \\ 17.16 \\ 8.92 \\
20.2 \\ 8.92 \\ 19.34 \\ 9.1\end{pmatrix}$$

This was with $n = 3$ games.  If instead, we base our "local" least squares on the previous $n = 4$ games, then our system would instead look like:

$$\mathbf{A} = \begin{pmatrix} 17.52 & 26.6 & 18.08 & 25.78\\ 26.6 & 18.08 & 25.78 & 13.92 \\ 
18.08 & 25.78 & 13.92 & 17.16\\ 25.78 & 13.92 & 17.16 & 8.92 \\ 13.92 & 17.16 & 8.92 & 20.2 \\ 
17.16 & 8.92 & 20.2 & 8.92\\ 8.92 & 20.2 & 8.92 & 19.34 \end{pmatrix},\hspace{4mm} \mathbf{b}= \begin{pmatrix} 13.92 \\ 17.16 \\ 8.92 \\
20.2 \\ 8.92 \\ 19.34 \\ 9.1\end{pmatrix} $$

**Check your answer:**

Write a function that generates this local system for a given (relevant) player. The function returns the arrays ${\bf A}$ and ${\bf b}$.

In [35]:
#grade_clear
#clear
def player_point_history(df, pl, n_games):   
    # df: dataframe
    # pl (string): name of a player
    # n_games (int): number of games used for the prediction
    
    pts = df[df['Name']==pl]['YH points'].values
    
    # complete the rest of the function to get A and b  
    A = ...
    b = ...
    
    #clear
      
    m = pts[n_games:].shape[0]
    A = np.zeros((m,n_games))
    for k in range(n_games):
        A[:,k] = pts[k:-n_games + k]
    b = pts[n_games:]
      
    #clear
    return A,b

Use the example above to debug your function (i.e., data for Andy Dalton). Your function should, however, work for any player.

In [36]:
A,b = player_point_history(df_POS, relevant_players[7], 4) 
print(A)
print(b)

[[17.52 26.6  18.08 25.78]
 [26.6  18.08 25.78 13.92]
 [18.08 25.78 13.92 17.16]
 [25.78 13.92 17.16  8.92]
 [13.92 17.16  8.92 20.2 ]
 [17.16  8.92 20.2   8.92]
 [ 8.92 20.2   8.92 19.34]]
[13.92 17.16  8.92 20.2   8.92 19.34  9.1 ]
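
As a cross-check, the same local system can be built without an explicit loop. The sketch below uses `numpy.lib.stride_tricks.sliding_window_view` (available in NumPy 1.20 and later) on Andy Dalton's points copied from the output above. This is an alternative construction, not the reference solution, and the names `windows`, `A_win` and `b_win` are only used here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Andy Dalton's points, copied from the output above
pts = np.array([17.52, 26.6, 18.08, 25.78, 13.92, 17.16,
                8.92, 20.2, 8.92, 19.34, 9.1])
n_games = 4

# Each window holds n_games past scores followed by the score to predict
windows = sliding_window_view(pts, n_games + 1)
A_win = windows[:, :n_games]   # the n_games previous scores
b_win = windows[:, n_games]    # the score that followed them

print(A_win.shape)  # (7, 4)
print(b_win)
```

The arrays returned this way are read-only views of `pts`, so call `.copy()` on them before modifying anything. Note also that `sliding_window_view` raises an error when a player has fewer than `n_games + 1` games, so such players would need to be skipped.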


Now, with this function, we can loop over the relevant players, generate their local systems, and "stack" them on top of each other to generate the global system.  We'll do this with $n = 3$

In [37]:
n_games = 3

# empty array for right hand side of size M x 1
pts_scored = np.array([])

# empty array for matrix of size M x n_games.  We had to reshape to size 0 x n_games to allow for "stacking" 
game_hist = np.array([]).reshape(0,n_games)

for pl in relevant_players:
    # generate local system
    a,c = player_point_history(df_POS,pl,n_games)
    
    # use numpy.append to append local system to global vector
    pts_scored = np.append(pts_scored,c)
    
    # use numpy.vstack (i.e. "vertical stack") to stack the global matrix and the local matrix
    game_hist = np.vstack((game_hist,a))
    
print(pts_scored.shape)
print(game_hist.shape)

(383,)
(383, 3)
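
Calling `np.vstack` and `np.append` inside the loop copies the growing arrays on every iteration. That is perfectly fine at this size; for larger data sets, a common pattern is to collect the pieces in Python lists and concatenate once at the end. The sketch below uses made-up local systems in place of the output of `player_point_history`, and the names ending in `_demo` are only used for this illustration.

```python
import numpy as np

# Made-up local systems (A, b) for two players, with n_games = 3
local_systems = [
    (np.ones((4, 3)), np.ones(4)),    # e.g. a player with 7 games
    (np.zeros((2, 3)), np.zeros(2)),  # e.g. a player with 5 games
]

A_blocks, b_blocks = [], []
for a, c in local_systems:
    A_blocks.append(a)
    b_blocks.append(c)

# One concatenation at the end instead of one per player
game_hist_demo = np.vstack(A_blocks)
pts_scored_demo = np.concatenate(b_blocks)

print(game_hist_demo.shape)   # (6, 3)
print(pts_scored_demo.shape)  # (6,)
```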


### When should we start a player?

It would be an overly ambitious task to try to predict a player's exact point total.  What we can do instead is set a "threshold".  I.e. if a player's points exceed this threshold, then we can deem them "startable".  If they don't exceed this threshold, then we should choose a different player.

What threshold should we use?  That's debatable, but I've compiled the following dictionary based on additional data I collected from nfl.com.

In [38]:
start_threshold = {'QB': 19.3999, 'RB': 14.599, 'WR': 15.099, 'TE': 7.899, 'Def': 7.499}

So, if a quarterback scores more than 19.3999, we declare them startable.  If a defense scores less than 7.499, then we should pick a different defense, etc.

**We can finally set up our least squares system.**

Set the matrix `A_stack` to the variable `game_hist` defined above. 

In [39]:
A_stack = game_hist

**Check your answer:**

The components of the vector `b_stack` should have a value of +1.0 if the corresponding component of `pts_scored` exceeds the threshold, and -1.0 if it lies below the threshold.  (I chose the thresholds so that it is impossible for the points to equal the threshold). Define the 1d numpy array `b_stack`. Recall we are still setting up our problem to select the quarterback players.

In [40]:
#grade_clear
threshold = start_threshold[POS]
b_stack = np.sign(pts_scored - threshold)

**Check your answer:**

Solve the Linear Least Squares ${\bf A}_{stack} {\bf x}_s \cong {\bf b}_{stack}$ problem for $\mathbf{x}_s$.  You can use `numpy.linalg.lstsq` to compute the least-squares solution. Store your result as `xs`.

In [41]:
#grade_clear
LSTQ = la.lstsq(A_stack,b_stack,rcond=None)
xs = LSTQ[0]
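
Recall that the least-squares solution minimizes $\|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2$, which means the residual $\mathbf{b} - \mathbf{A}\mathbf{x}$ is orthogonal to the columns of $\mathbf{A}$, i.e. $\mathbf{A}^T(\mathbf{b} - \mathbf{A}\mathbf{x}) = \mathbf{0}$ (the normal equations). The sketch below checks this on the small $n = 3$ system for Andy Dalton, using his points from above and the quarterback threshold 19.3999; the names `A_demo`, `b_demo` and `x_demo` are only used for this illustration.

```python
import numpy as np

# Andy Dalton's points, copied from the output above
pts = np.array([17.52, 26.6, 18.08, 25.78, 13.92, 17.16,
                8.92, 20.2, 8.92, 19.34, 9.1])
n_games = 3

# Local system: previous n_games scores -> +1/-1 label of the next game
A_demo = np.column_stack([pts[k:len(pts) - n_games + k] for k in range(n_games)])
b_demo = np.sign(pts[n_games:] - 19.3999)

x_demo = np.linalg.lstsq(A_demo, b_demo, rcond=None)[0]

# The residual of a least-squares solution is orthogonal to the columns of A
residual = b_demo - A_demo @ x_demo
print(b_demo)
print(np.allclose(A_demo.T @ residual, 0.0, atol=1e-8))  # True
```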

Once you have the model coefficients in `xs`, you can compute the numpy array `b_predict`:

In [42]:
b_predict = np.sign(A_stack@xs)

There are four possible outcomes for each prediction:
- The prediction tells you to start a player that ends up performing poorly (a "false positive")
- The prediction tells you to exclude a player that ends up performing well (a "false negative")
- The prediction tells you to start a player that ends up performing well (a correct prediction)
- The prediction tells you to exclude a player that ends up performing poorly (also a correct prediction)

Here we compute the number of false positives, false negatives, and correct predictions.  We store them as `false_positive`, `false_negative` and `correct_prediction` respectively.

In [43]:
false_positive = np.sum(b_predict > b_stack)
false_negative = np.sum(b_stack > b_predict)
correct_prediction = np.sum(b_stack == b_predict)

What percentage of each do we obtain on the data?

In [44]:
print(false_positive)
print(false_negative)
print(correct_prediction)
print()
print(false_positive/b_stack.shape[0])
print(false_negative/b_stack.shape[0])
print(correct_prediction/b_stack.shape[0])

13
138
232

0.033942558746736295
0.360313315926893
0.6057441253263708


The model is correct only 60.57% of the time.  However, false positives make up only 3.39% of all predictions, which is very nice: the model rarely tells you to start a player who then performs poorly.  The price is a large share of false negatives (36.03%): it often tells you to exclude a player who goes on to have a good game.

**Check your answer:**

Let's put it all together into a single function `linear_predictor`.  This will mostly be copying and pasting from above.  The function should return the variables `A`, `b`, `x`.  
```python
def linear_predictor(ff_data, Pos, min_games, n_games, threshold):    
    df,relevant_players = prepare_data(ff_data,Pos,min_games)   
    # complete the code  
    return A, b, x
```

In [45]:
#grade_clear
#clear
def linear_predictor(ff_data, Pos, min_games, n_games, threshold):
    
    df,relevant_players = prepare_data(ff_data,Pos,min_games)
    
    # complete the code
    
    A = ...
    b = ...
    x = ...
    
    #clear   
    pts_scored = np.array([])
    game_hist = np.array([]).reshape(0,n_games)

    for pl in relevant_players:
        a,c = player_point_history(df,pl,n_games)
        pts_scored = np.append(pts_scored,c)
        game_hist = np.vstack((game_hist,a))
    
    A = game_hist
    b = np.sign(pts_scored - threshold)
    
    LSTQ = np.linalg.lstsq(A,b,rcond = None)
    x = LSTQ[0]
    #clear

    
    return A, b, x

We can call the routine for any position, and we can tweak the values of `min_games` and `n_games`.  You can also tweak the threshold.  Try changing the input variables and see how this affects model accuracy.

In [46]:
Pos = 'WR'
min_games = 5  # minimum number of games for a player to be included as "relevant"
n_games = 3    # number of prior games to be used for the prediction
threshold = start_threshold[Pos]

A, b, x = linear_predictor(ff_2018, Pos, min_games, n_games, threshold)
b_predict = np.sign(A@x)

**Try this!**

Compute the number of false positives, false negatives, and correct predictions.  Store them as `false_positive`, `false_negative` and `correct_prediction` respectively.

In [47]:
#clear
false_negative = np.sum(b > b_predict)
false_positive = np.sum(b_predict > b)
correct_prediction = np.sum(b == b_predict)

In [48]:
print(false_negative)
print(false_positive)
print(correct_prediction)
print()
print(false_negative/b.shape[0])
print(false_positive/b.shape[0])
print(correct_prediction/b.shape[0])

201
144
1287

0.12316176470588236
0.08823529411764706
0.7886029411764706


# Enriched model

Notice we didn't make use of whether a player is playing at home or on the road, or of the ranking of the opponent.  Let's try to enrich the features used in this problem to include this data.  Let's go back to Andy Dalton:

In [50]:
pl = relevant_players[7]
pl_points = df_POS[df_POS['Name']==pl]['YH points'].values
pl_home_away = df_POS[df_POS['Name']==pl]['home_away'].values
pl_oppt_rank = df_POS[df_POS['Name']==pl]['oppt_rank'].values

print('Player:', pl)
print('Points:', pl_points)
print('Location:', pl_home_away)
print('Opp Rank:', pl_oppt_rank)

Player: Dalton; Andy
Points: [17.52 26.6  18.08 25.78 13.92 17.16  8.92 20.2   8.92 19.34  9.1 ]
Location: [-1  1 -1 -1  1  1 -1  1  1 -1  1]
Opp Rank: [21. 29.  9.  1. 10. 17.  5.  4.  2. 29. 13.]


When $n = 3$ we had the following system when we only took previous games played:

$$\mathbf{A} = \begin{pmatrix} 17.52 & 26.6 & 18.08\\ 26.6 & 18.08 & 25.78 \\ 18.08 & 25.78 & 13.92 \\
25.78 & 13.92 & 17.16 \\ 13.92 & 17.16 & 8.92 \\ 17.16 & 8.92 & 20.2 \\ 8.92 & 20.2 & 8.92 \\
20.2 & 8.92 & 19.34 \end{pmatrix}, \hspace{5mm} \mathbf{b}= \begin{pmatrix} 25.78 \\ 13.92 \\ 17.16 \\ 8.92 \\
20.2 \\ 8.92 \\ 19.34 \\ 9.1\end{pmatrix}$$

With the location and opponent data, it should now look like this:

$$\mathbf{A} = \begin{pmatrix} 17.52 & 26.6 & 18.08 & -1 & 1\\ 26.6 & 18.08 & 25.78 & 1 & 10 \\ 18.08 & 25.78 & 13.92 & 1 & 17\\ 25.78 & 13.92 & 17.16 & -1 & 5\\ 13.92 & 17.16 & 8.92 & 1 & 4\\ 17.16 & 8.92 & 20.2 & 1 & 2\\ 8.92 & 20.2 & 8.92 & -1 & 29 \\
20.2 & 8.92 & 19.34  & 1 & 13\end{pmatrix}, \hspace{5mm} \mathbf{b}= \begin{pmatrix} 25.78 \\ 13.92 \\ 17.16 \\ 8.92 \\
20.2 \\ 8.92 \\ 19.34 \\ 9.1\end{pmatrix}$$

which represents the model:

$$ s_{j-3}\, x_1 + s_{j-2}\, x_2 + s_{j-1}\, x_3 + h_{j} \,x_4 + r_{j}\, x_5 = s_{j} $$

where $h_j$ indicates if game $j$ was played home or away, and $r_j$ indicates the ranking of the opponent team.
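
To make the layout of the enriched matrix concrete, the sketch below rebuilds it for Andy Dalton from the three arrays printed above. Note that the location and opponent rank are taken from game $j$ itself (the game being predicted), which is why both arrays are sliced with `[n_games:]`. This is only an illustration for one player, and the names `history`, `A_rich` and `b_rich` are not used elsewhere.

```python
import numpy as np

# Andy Dalton's data, copied from the output above
pts = np.array([17.52, 26.6, 18.08, 25.78, 13.92, 17.16,
                8.92, 20.2, 8.92, 19.34, 9.1])
home_away = np.array([-1, 1, -1, -1, 1, 1, -1, 1, 1, -1, 1])
oppt_rank = np.array([21., 29., 9., 1., 10., 17., 5., 4., 2., 29., 13.])
n_games = 3

# Columns 1..n_games: scores of the previous n_games games
history = np.column_stack([pts[k:len(pts) - n_games + k] for k in range(n_games)])

# Last two columns: location and opponent rank of the game being predicted
A_rich = np.column_stack((history, home_away[n_games:], oppt_rank[n_games:]))
b_rich = pts[n_games:]

print(A_rich.shape)  # (8, 5)
print(A_rich[0])     # first row: 17.52, 26.6, 18.08, -1, 1
```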

**Check your answer:**

Create an enriched linear regression, by adding these two extra columns to the matrix $\mathbf{A}$.  

Complete the function `linear_predictor_enriched` which is very similar to the function `linear_predictor` from above, with the modified `A` matrix that includes these two additional columns.

In [51]:
#grade_clear
#clear
def linear_predictor_enriched(ff_data, Pos, min_games, n_games, threshold):
    
    df,relevant_players = prepare_data(ff_data,Pos,min_games)
    
    A = ...
    b = ...
    x = ...
    
    #clear   
    
    pts_scored = np.array([])
    game_hist = np.array([]).reshape(0,n_games+2)

    for pl in relevant_players:
        a,c = player_point_history(df,pl,n_games)
        location = df[df['Name']==pl]['home_away'].values
        opponent = df[df['Name']==pl]['oppt_rank'].values
        last_two_columns = np.vstack((location[n_games:],opponent[n_games:])).T
        anew = np.hstack((a,last_two_columns ))
        
        pts_scored = np.append(pts_scored,c)
        game_hist = np.vstack((game_hist,anew))
        
    
    b = np.sign(pts_scored - threshold)
    A = game_hist
    
    LSTQ = np.linalg.lstsq(A,b,rcond = None)
    x = LSTQ[0]
    
    #clear

    return A, b, x

The enriched version is considerably better for running backs.  First, here is the standard model with our usual inputs:

In [52]:
Pos = 'RB'
min_games = 5
n_games = 3
threshold = start_threshold[Pos]

A, b, x  = linear_predictor(ff_2018, Pos, min_games, n_games, threshold)
b_predict = np.sign(A@x)

Compute the number of false positives, false negatives, and correct predictions.  Store them as `false_positive`, `false_negative`, and `correct_prediction`, respectively.

In [53]:
false_negative = np.sum(b > b_predict)
false_positive = np.sum(b_predict > b)
correct_prediction = np.sum(b == b_predict)

In [54]:
print('Standard Model')
print('Fraction of false negatives:    ', false_negative/b.shape[0])
print('Fraction of false positives:    ', false_positive/b.shape[0])
print('Fraction of correct predictions:', correct_prediction/b.shape[0])

Standard Model
Fraction of false negatives:     0.161400512382579
Fraction of false positives:     0.161400512382579
Fraction of correct predictions: 0.677198975234842
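
To see what the three comparisons count, here is a small made-up example (the arrays `b_toy` and `b_predict_toy` are hypothetical, not football data). A false negative is a `+1` outcome predicted as `-1`, and a false positive is the reverse.

```python
import numpy as np

# made-up true outcomes and predictions, encoded as +1 / -1
b_toy = np.array([1, 1, -1, -1, 1])
b_predict_toy = np.array([1, -1, -1, 1, 1])

# true +1 predicted as -1: only possible where b_toy > b_predict_toy
fn_toy = np.sum(b_toy > b_predict_toy)
# true -1 predicted as +1: only possible where b_predict_toy > b_toy
fp_toy = np.sum(b_predict_toy > b_toy)
# entries where the prediction matches the outcome
correct_toy = np.sum(b_toy == b_predict_toy)

print(fn_toy, fp_toy, correct_toy)  # 1 1 3
```

The three counts always add up to the number of entries, which is why the three fractions printed above sum to 1.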


And with the enriched model:

In [55]:
A, b, x = linear_predictor_enriched(ff_2018, Pos, min_games, n_games, threshold)
b_predict = np.sign(A@x)

Compute the number of false positives, false negatives, and correct predictions.  Store them as `false_positive`, `false_negative`, and `correct_prediction`, respectively.

In [56]:
false_negative = np.sum(b > b_predict)
false_positive = np.sum(b_predict > b)
correct_prediction = np.sum(b == b_predict)

In [57]:
print('Enriched Model')
print('Fraction of false negatives:    ', false_negative/b.shape[0])
print('Fraction of false positives:    ', false_positive/b.shape[0])
print('Fraction of correct predictions:', correct_prediction/b.shape[0])

Enriched Model
Fraction of false negatives:     0.10930828351836037
Fraction of false positives:     0.05807002561912895
Fraction of correct predictions: 0.8326216908625107


# Validation set
Of course, you should never draw conclusions about a model from the data you used to construct it.  You should validate its accuracy on a different data set, and we can do so using football data from another year.  We can also use this validation set to select the best **hyperparameters** (a fancy word for the settings we choose before fitting, as opposed to the coefficients `x` that least squares computes).

Some questions to ask as you test the model on the validation set:

- Should we include the home/away and opponent data or not?
- Is our decision to exclude players that have played fewer than 5 games a good one?  Should we bump that number up to 7 games?  Or down to 3?
- How many games should we include in our history?  Is 3 games really the best choice?  What about 5?  What about just the last game?

In other words, the inclusion of the extra data, the minimum number of games, and the history length are the **hyperparameters** for this model.
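
One way to explore these questions systematically is a small grid search: fit on one season, score on the other, and repeat for every combination of hyperparameters. The sketch below assumes the predictor functions take `(data, Pos, min_games, n_games, threshold)` and return `(A, b, x)`, as `linear_predictor` and `linear_predictor_enriched` do; the function `grid_search` and its argument names are hypothetical.

```python
import itertools
import numpy as np

def grid_search(predictors, train_data, val_data, Pos, threshold,
                min_games_options, n_games_options):
    """Rank hyperparameter combinations by accuracy on the validation data.

    predictors: dict mapping a label to a predictor function (assumed
    signature: predictor(data, Pos, min_games, n_games, threshold) -> A, b, x).
    """
    results = []
    combos = itertools.product(predictors.items(),
                               min_games_options, n_games_options)
    for (label, predictor), min_games, n_games in combos:
        # fit the coefficients x on the training season
        _, _, x = predictor(train_data, Pos, min_games, n_games, threshold)
        # build A and b for the validation season (its own x is discarded)
        A_val, b_val, _ = predictor(val_data, Pos, min_games, n_games, threshold)
        # fraction of validation outcomes predicted correctly
        accuracy = np.mean(np.sign(A_val @ x) == b_val)
        results.append((accuracy, label, min_games, n_games))
    # best combination first
    return sorted(results, key=lambda t: t[0], reverse=True)
```

For example, once `ff_2019` has been loaded, `grid_search({'standard': linear_predictor, 'enriched': linear_predictor_enriched}, ff_2018, ff_2019, 'WR', threshold, [3, 5, 7], [1, 3, 5])` would rank all 18 combinations. Accuracy alone hides the split between false positives and false negatives, so it is worth inspecting those for the top few candidates as well.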

In [58]:
ff_2019 = pd.read_csv('FF-data-2019.csv')

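# position
Pos = 'WR'
# note: threshold is not reassigned in this cell, so it keeps the value
# set earlier for 'RB'; set threshold = start_threshold[Pos] to match Pos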

# these are your hyperparameters
min_games = 5
n_games = 2
enriched = True

# build model on 2018 data and retrieve least squares solution x
if enriched:
    OUT_2018 = linear_predictor_enriched(ff_2018, Pos, min_games,n_games,threshold)
    x = OUT_2018[2]
else:
    OUT_2018 = linear_predictor(ff_2018, Pos, min_games,n_games,threshold)
    x = OUT_2018[2]
    

# retrieve Data matrix A and outcomes vector b using 2019 data
if enriched:
    OUT_2019 = linear_predictor_enriched(ff_2019, Pos, min_games,n_games,threshold)
    A,b = OUT_2019[0], OUT_2019[1]
else:
    OUT_2019 = linear_predictor(ff_2019, Pos, min_games,n_games,threshold)
    A,b = OUT_2019[0], OUT_2019[1]
    
# predict results from 2019, using the model parameters (array x) from the 2018 data
b_predict = np.sign(A@x)

Compute the number of false positives, false negatives, and correct predictions.  Store them as `false_positive`, `false_negative`, and `correct_prediction`, respectively.

In [59]:
false_negative = np.sum(b > b_predict)
false_positive = np.sum(b_predict > b)
correct_prediction = np.sum(b == b_predict)

In [60]:
print('Fraction of false negatives:    ', false_negative/b.shape[0])
print('Fraction of false positives:    ', false_positive/b.shape[0])
print('Fraction of correct predictions:', correct_prediction/b.shape[0])

Fraction of false negatives:     0.11243781094527364
Fraction of false positives:     0.022885572139303482
Fraction of correct predictions: 0.8646766169154229


## What do you think? Tune your hyperparameters, check different positions. 