Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- ✔️ Choose your target. Which column in your tabular dataset will you predict?
- ✔️ Is your problem regression or classification?
- ✔️ How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- ✔️ Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- ✔️ Choose your evaluation metric(s).
    - Classification: Is your majority class frequency between 50% and 70%? If so, accuracy alone is a reasonable choice. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?

# TARGET: PLAYER

Type of problem: *classification*

In [50]:
import pandas as pd
df0 = pd.read_csv('https://socratesdidnothingwrong.com/nfl/qbcf/master.txt')
print(df0.shape)
df0.head()

(10076, 19)


Unnamed: 0,player,date,team,home,opp,game,week,day,completions,passatt,passyards,passtds,ints,sacks,sackyards,rushatt,rushyards,rushtds,fumbles
0,Geno Smith,2013-12-01,NYJ,1,MIA,12,13,Sun,4,10,29,0,1,1.0,8.0,1,2,0,0
1,Ryan Tannehill,2013-12-01,MIA,0,NYJ,12,13,Sun,28,43,331,2,1,1.0,3.0,3,22,0,0
2,Brandon Weeden,2013-12-01,CLE,1,JAX,12,13,Sun,24,40,370,3,2,3.0,28.0,2,5,0,2
3,Joe Flacco,2013-11-28,BAL,1,PIT,12,13,Thu,24,35,251,1,0,2.0,14.0,4,7,0,1
4,Matt Flynn,2013-11-28,GNB,0,DET,12,13,Thu,10,20,139,0,1,7.0,37.0,2,4,0,2


In [51]:
df0['player'].nunique()

262

In [52]:
df0['player'].value_counts(normalize=True)

Tom Brady             0.026002
Drew Brees            0.025109
Eli Manning           0.024315
Ben Roethlisberger    0.023720
Philip Rivers         0.022926
                        ...   
Tim Boyle             0.000099
Tony Pike             0.000099
Dominique Davis       0.000099
Cardale Jones         0.000099
Garrett Grayson       0.000099
Name: player, Length: 262, dtype: float64
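
With this many classes, the top normalized count is also the accuracy of a baseline that always predicts the most common player. The sketch below shows that check on a small made-up series standing in for `df0['player']`. The labels `A`, `B`, `C` and their counts are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Illustrative stand-in for df0['player']; labels and counts are made up.
players = pd.Series(['A'] * 5 + ['B'] * 3 + ['C'] * 2, name='player')

# Share of rows per class, sorted from most to least common.
freqs = players.value_counts(normalize=True)

majority_class = freqs.index[0]
majority_freq = float(freqs.iloc[0])

# Accuracy of a baseline that always predicts the majority class.
print(majority_class, majority_freq)

# Rule of thumb from the assignment: accuracy alone is reasonable only
# when the majority class frequency is strictly between 50% and 70%.
accuracy_ok = bool(0.5 < majority_freq < 0.7)
print(accuracy_ok)
```

Applied to the real column, the same check would report a majority frequency of roughly 2.6%, far below the 50% to 70% band.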

In [86]:
# Remove "games" with no stats by the QB
df1 = df0[(df0['passatt'] > 0) | (df0['rushatt'] > 0)]
print(df1.shape)
print(df1['player'].nunique())
df1.head()

(9683, 19)
259


Unnamed: 0,player,date,team,home,opp,game,week,day,completions,passatt,passyards,passtds,ints,sacks,sackyards,rushatt,rushyards,rushtds,fumbles
0,Geno Smith,2013-12-01,NYJ,1,MIA,12,13,Sun,4,10,29,0,1,1.0,8.0,1,2,0,0
1,Ryan Tannehill,2013-12-01,MIA,0,NYJ,12,13,Sun,28,43,331,2,1,1.0,3.0,3,22,0,0
2,Brandon Weeden,2013-12-01,CLE,1,JAX,12,13,Sun,24,40,370,3,2,3.0,28.0,2,5,0,2
3,Joe Flacco,2013-11-28,BAL,1,PIT,12,13,Thu,24,35,251,1,0,2.0,14.0,4,7,0,1
4,Matt Flynn,2013-11-28,GNB,0,DET,12,13,Thu,10,20,139,0,1,7.0,37.0,2,4,0,2


In [93]:
# Remove QBs that don't have at least half a season (8 games) of data
import numpy as np
# Map each row's player to that player's total number of games
games_played = df1['player'].map(df1['player'].value_counts())
df2 = df1[games_played >= 8]
print(df2.shape)
print(df2['player'].nunique())
df2.head()

(9417, 19)
175


Unnamed: 0,player,date,team,home,opp,game,week,day,completions,passatt,passyards,passtds,ints,sacks,sackyards,rushatt,rushyards,rushtds,fumbles
0,Geno Smith,2013-12-01,NYJ,1,MIA,12,13,Sun,4,10,29,0,1,1.0,8.0,1,2,0,0
1,Ryan Tannehill,2013-12-01,MIA,0,NYJ,12,13,Sun,28,43,331,2,1,1.0,3.0,3,22,0,0
2,Brandon Weeden,2013-12-01,CLE,1,JAX,12,13,Sun,24,40,370,3,2,3.0,28.0,2,5,0,2
3,Joe Flacco,2013-11-28,BAL,1,PIT,12,13,Thu,24,35,251,1,0,2.0,14.0,4,7,0,1
4,Matt Flynn,2013-11-28,GNB,0,DET,12,13,Thu,10,20,139,0,1,7.0,37.0,2,4,0,2


In [92]:
vc = df2['player'].value_counts(normalize=True).to_dict()
for k in vc.keys():
  print(vc[k] * 100, k)

2.782202399915047 Tom Brady
2.6866305617500266 Drew Brees
2.6016778167144525 Eli Manning
2.537963257937772 Ben Roethlisberger
2.4423914197727514 Philip Rivers
2.028246787724328 Matt Ryan
2.028246787724328 Peyton Manning
1.996389508335988 Aaron Rodgers
1.9751513220770944 Carson Palmer
1.9645322289476477 Joe Flacco
1.8371031113942868 Alex Smith
1.6459594350642455 Jay Cutler
1.592863969417012 Matthew Stafford
1.5291494106403312 Ryan Fitzpatrick
1.5079112243814379 Matt Hasselbeck
1.4654348518636509 Tony Romo
1.4017202930869703 Russell Wilson
1.4017202930869703 Cam Newton
1.3911011999575236 Andy Dalton
1.2636720824041627 Matt Schaub
1.2424338961452692 Michael Vick
1.210576616756929 Brett Favre*
1.1150047785919084 Matt Cassel
1.1043856854624616 Donovan McNabb
0.998194754167994 Andrew Luck
0.9663374747796539 Ryan Tannehill
0.9450992885207603 Jason Campbell
0.9450992885207603 Josh McCown
0.9344801953913136 Kirk Cousins
0.91324200913242 Kyle Orton
0.9026229160029733 Jake Delhomme
0.892003822873

# CLASSES: 175

Distribution of classes:

- Tom Brady has the most games (262 for 2.78%)
- 5 players are tied for fewest (8 for 0.08%)
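
The two summary figures above can be read directly off `value_counts()`. This is a minimal sketch on a toy series; the labels and counts are made up, and with the real data `players` would be `df2['player']`.

```python
import pandas as pd

# Toy stand-in for df2['player']; labels and counts are made up.
players = pd.Series(['A'] * 6 + ['B'] * 2 + ['C'] * 2, name='player')

counts = players.value_counts()

# Class with the most rows, and its share of all rows in percent.
most_player = counts.idxmax()
most_games = int(counts.max())
most_share = 100 * most_games / len(players)

# Smallest class size, and every class tied at that size.
fewest_games = int(counts.min())
tied_for_fewest = sorted(counts[counts == fewest_games].index)

print(f"{most_player} has the most games ({most_games} for {most_share:.2f}%)")
print(f"{len(tied_for_fewest)} players are tied for fewest ({fewest_games})")
```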

In [96]:
df2.dtypes

player                 object
date           datetime64[ns]
team                   object
home                    int64
opp                    object
game                    int64
week                    int64
day                    object
completions             int64
passatt                 int64
passyards               int64
passtds                 int64
ints                    int64
sacks                 float64
sackyards             float64
rushatt                 int64
rushyards               int64
rushtds                 int64
fumbles                 int64
dtype: object

In [100]:
# Only go to week 7 of 2019
df2 = df2.copy()  # work on a copy so the assignment below does not trigger SettingWithCopyWarning
df2['date'] = pd.to_datetime(df2['date'])
# Parenthesize each comparison: `|` binds more tightly than `<`
df3 = df2[(df2['date'].dt.year < 2019) | (df2['week'] < 8)].sort_values('date')
df3.head(10)


Unnamed: 0,player,date,team,home,opp,game,week,day,completions,passatt,passyards,passtds,ints,sacks,sackyards,rushatt,rushyards,rushtds,fumbles
5775,Peyton Manning,2004-09-09,IND,0,NWE,1,1,Thu,16,29,256,2,1,1.0,12.0,2,18,0,0
5774,Tom Brady,2004-09-09,NWE,1,IND,1,1,Thu,26,38,335,3,1,2.0,15.0,1,-1,0,0
5773,Steve McNair,2004-09-11,TEN,0,MIA,1,1,Sat,9,14,73,1,0,2.0,12.0,2,11,0,0
5772,Jay Fiedler,2004-09-11,MIA,1,TEN,1,1,Sat,5,13,42,0,2,1.0,1.0,1,0,0,0
5771,A.J. Feeley,2004-09-11,MIA,1,TEN,1,1,Sat,21,31,168,1,1,2.0,11.0,0,0,0,0
5753,Jeff Garcia,2004-09-12,CLE,1,BAL,1,1,Sun,15,24,180,1,0,2.0,15.0,3,13,1,0
5754,Trent Green,2004-09-12,KAN,0,DEN,1,1,Sun,16,32,174,0,1,1.0,23.0,1,3,0,0
5751,Ken Dorsey,2004-09-12,SFO,1,ATL,1,1,Sun,9,15,111,0,0,1.0,7.0,0,0,0,0
5750,Daunte Culpepper,2004-09-12,MIN,1,DAL,1,1,Sun,17,23,242,5,0,2.0,0.0,6,25,0,0
5748,David Carr,2004-09-12,HOU,1,SDG,1,1,Sun,19,25,229,0,2,2.0,3.0,3,12,0,0


# TRAIN/TEST SPLIT

Training: all games through 2010

Testing: all games from 2011 onward

(These numbers are based on a 20% test size)
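
One way to derive a cutoff date from a target test size is to sort by date and take the date at the 80% position. The sketch below assumes that approach; `time_split` and its parameters are hypothetical names, and the toy frame stands in for `df3`.

```python
import pandas as pd

def time_split(df, date_col='date', test_size=0.2):
    """Split df by date so that roughly the latest `test_size` share of rows is the test set.

    Hypothetical helper for illustration; returns (train, test, cutoff).
    """
    ordered = df.sort_values(date_col)
    # Position of the first test row when rows are in chronological order.
    cutoff_pos = int(len(ordered) * (1 - test_size))
    cutoff = ordered[date_col].iloc[cutoff_pos]
    # Split on the date itself so games played on the same day stay together.
    train = ordered[ordered[date_col] < cutoff]
    test = ordered[ordered[date_col] >= cutoff]
    return train, test, cutoff

# Toy frame with ten distinct dates; the values are made up.
toy = pd.DataFrame({
    'date': pd.date_range('2004-01-01', periods=10, freq='365D'),
    'passyards': range(10),
})

train, test, cutoff = time_split(toy, test_size=0.2)
print(len(train), len(test), cutoff.date())
```

Because the split is on the date rather than the row position, the realized test share can differ slightly from `test_size` when many games share the cutoff date.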

# EVALUATION METRIC: PRECISION
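
With 175 classes, precision has to be averaged across classes in some way, and the notebook does not say which averaging is intended. The sketch below assumes macro averaging (the unweighted mean of per-class precision) and computes it in plain Python. `macro_precision` is a hypothetical helper, and the labels are made up. scikit-learn's `precision_score(y_true, y_pred, average='macro', zero_division=0)` is the library counterpart.

```python
from collections import Counter

def macro_precision(y_true, y_pred):
    """Unweighted mean of per-class precision (hypothetical helper).

    A class that is never predicted contributes a precision of 0.
    """
    labels = set(y_true) | set(y_pred)
    predicted = Counter(y_pred)  # number of times each class was predicted
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # true positives per class
    per_class = [
        correct[label] / predicted[label] if predicted[label] else 0.0
        for label in labels
    ]
    return sum(per_class) / len(per_class)

# Made-up labels for illustration only.
y_true = ['Brady', 'Brady', 'Brees', 'Rivers']
y_pred = ['Brady', 'Brees', 'Brees', 'Brady']

score = macro_precision(y_true, y_pred)
print(score)
```

Macro averaging weights every quarterback equally, so players with only 8 games count as much as those with over 200. A weighted average would instead favor the frequent classes.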