America's Top Sport: NFL Breakdown

Anish Vallabhaneni

INTRODUCTION

In modern society, two of the most popular activities for our generation are watching sports and gambling. Of all the sports in America, the one that consistently draws the most viewers is football. The NFL is a multibillion-dollar industry and is also the most bet-on sport in the United States. These bets come from a place of knowledge of the game, but with a better model the outcomes can be predicted much more accurately.

To create a model to predict the results of NFL games, I needed a source with plenty of data on the NFL. The company I got this data from is Pro Football Focus (PFF). I interned at PFF last summer and used their resources to get plenty of advanced data on the NFL. Not only do they have raw stats available, but they also provide access to advanced statistics that can help make much more accurate predictions.

From PFF, I gathered the data by examining 6 CSV files, 3 for offense and 3 for defense. From there I tidied up the data by removing any columns that would not help predict scores. The next step was to visualize the gathered data and choose specific stats, to understand what information we should feed to our machine learning model.

The final step in the project was to create the actual machine learning model that would allow me to predict these scores. By gathering stats from previous weeks, our machine learning model is able to predict scores for a certain week. The model will be taking offensive and defensive stats from both the home and away team and using that information to create score predictions.

IMPORTS

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as stat
import sklearn.model_selection 
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
import seaborn as sns
import warnings
header_row = 0

PROCESSING AND READING DATA

The three offensive CSVs are from Pro Football Focus offensive stats and the three defensive CSVs are from Pro Football Focus defensive stats. The specific stats are "Team Offense", "Passing Offense" and "Rushing Offense" along with "Team Defense", "Passing Defense" and "Rushing Defense". I believe that this information most efficiently tells us how good a team is offensively or defensively, which in turn indicates how well they will perform.

In [None]:
offense_stats = pd.read_csv('offense_stats.csv')
defense_stats = pd.read_csv('defense_stats.csv')
passing_stats = pd.read_csv('passing_stats.csv')
passdef_stats = pd.read_csv('passdef_stats.csv')
rushing_stats = pd.read_csv('rushing_stats.csv')
rushdef_stats = pd.read_csv('rushdef_stats.csv')

offense_stats.columns = offense_stats.iloc[header_row]
offense_stats = offense_stats.drop(header_row)

defense_stats.columns = defense_stats.iloc[header_row]
defense_stats = defense_stats.drop(header_row)
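
Because row 0 of the team offense and defense files holds the real column names, the same result can probably be reached at read time. The sketch below uses a small made-up CSV (the `raw` string and its values are hypothetical, not PFF data) to compare the manual promotion used above with `pd.read_csv(..., header=1)`.

```python
import io
import pandas as pd

# Hypothetical CSV with a two-row header: a group-label row, then the real column names.
raw = ",,Tot Yds & TO,Passing\nRk,Tm,Yds,Cmp\n1,Team A,5000,300\n2,Team B,4500,280\n"

# Approach used in the notebook: read everything, promote row 0 to the header, then drop it.
manual = pd.read_csv(io.StringIO(raw))
manual.columns = manual.iloc[0]
manual = manual.drop(0)

# Alternative: tell pandas that the second line (index 1) is the header row.
direct = pd.read_csv(io.StringIO(raw), header=1)

print(manual.dtypes)  # every column stays as strings (object dtype)
print(direct.dtypes)  # numeric columns are inferred as numbers
```

One difference worth noting: with the manual approach every value stays a string, which is why the later cells call `.astype(float)`. With `header=1`, pandas infers numeric types, but it would also rename repeated headers such as `Yds` to `Yds.1`, so the later column handling would need adjusting.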

TIDYING DATA

After collecting all the initial data, it is necessary to remove information that does not apply to our model. Once that is done, we are left with the important stats to base our model on.

In [None]:
offense_stats = pd.read_csv('offense_stats.csv') # Offensive Stats
offense_stats.columns = offense_stats.iloc[header_row]
offense_stats = offense_stats.drop(header_row)
offense_stats = offense_stats.drop(columns=['Pen'])
offense_stats = offense_stats.drop(columns=['TO%'])
offense_stats = offense_stats.drop(columns=['Sc%'])
offense_stats = offense_stats.drop(columns=['Rk'])
offense_stats = offense_stats.drop(columns=['1stPy'])
offense_stats = offense_stats.iloc[: , :-2]
offense_stats = offense_stats.drop(columns=['1stD'])
offense_stats = offense_stats.drop(columns=['FL'])
offense_stats.columns = ['Team', 'Games','Total Points For','Total Yards', 'Total Plays', 'Total Yards/Play', 'Total Turnovers', 'Passing Completions', 'Passing Attempts', 'Passing Yards',\
    'Passing TD', 'Passing Int', 'Net Yards/Attempt', 'Rushing Attempts', 'Total Rushing Yards', 'Rushing TDs', 'Yards/Attempt']
offense_stats = offense_stats.drop([33, 34,35])
offense_stats = offense_stats.sort_values(by=['Team']).reset_index(drop=True)
offense_stats.head()

I removed certain stats to keep noise out of the model. Specifically, I removed penalties, rank, first downs, first down percentage, and fumbles lost. Furthermore, stats like turnover percentage and sack percentage were a better fit when taken from the opposing defense. The stats we focused on were: 'Team', 'Games', 'Total Points For', 'Total Yards', 'Total Plays', 'Total Yards/Play', 'Total Turnovers', 'Passing Completions', 'Passing Attempts', 'Passing Yards', 'Passing TD', 'Passing Int', 'Net Yards/Attempt', 'Rushing Attempts', 'Total Rushing Yards', 'Rushing TDs', 'Yards/Attempt'.
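
The column removal described above could also be written as a single `drop` call with a list of names. This is only a sketch on a toy frame: the numbers are made up and `unused_columns` is a hypothetical helper name, with just a few of the notebook's column names reused for illustration.

```python
import pandas as pd

# Toy frame with made-up numbers; only the column names mirror the notebook.
toy = pd.DataFrame({
    "Rk": [1, 2],
    "Tm": ["Team A", "Team B"],
    "Pen": [90, 85],
    "TO%": [10.5, 12.1],
    "Yds": [5000, 4500],
})

# Columns judged unhelpful for the model, dropped in one call.
unused_columns = ["Rk", "Pen", "TO%"]
tidy = toy.drop(columns=unused_columns)

print(tidy.columns.tolist())
```

Keeping the names in one list makes it easier to see at a glance which stats were excluded.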

In [None]:
passing_stats = pd.read_csv('passing_stats.csv') # Passing Stats
passing_stats = passing_stats.drop(columns=['4QC'])
passing_stats = passing_stats.drop(columns=['GWD'])
passing_stats = passing_stats.drop(columns=['Sk%'])
passing_stats = passing_stats.drop(columns=['Rk'])
passing_stats = passing_stats.drop(columns=['EXP'])
passing_stats = passing_stats.drop(columns=['Yds.1'])
passing_stats = passing_stats.drop(columns=['Int%'])
passing_stats = passing_stats.drop(columns=['TD%'])
passing_stats = passing_stats.drop(columns=['Rate'])
passing_stats = passing_stats.drop(columns=['Sk'])
passing_stats = passing_stats.drop(columns=['Lng'])
passing_stats = passing_stats.rename(columns={"Tm": "Team", "G": "Games", "Ply": "Total Ply", "Y/P":"Total Y/P", 'Cmp':'Completions', 'Cmp%':'Completion %'})
passing_stats.rename(columns={passing_stats.columns[5]: "Total Yards" }, inplace = True)
passing_stats = passing_stats.drop([32, 33,34])
passing_stats = passing_stats.sort_values(by=['Team']).reset_index(drop=True)
passing_stats.head()

As with the passing stats, we need to tidy up the rushing data. Stats such as longest run and experience have no bearing on how well a team will perform.

In [None]:
rushing_stats = pd.read_csv('rushing_stats.csv') 
rushing_stats = rushing_stats.drop(columns=['Rk'])
rushing_stats = rushing_stats.drop(columns=['EXP'])
rushing_stats = rushing_stats.drop(columns=['Lng'])
rushing_stats = rushing_stats.rename(columns={"Tm": "Team", "G": "Games"})
rushing_stats.rename(columns={rushing_stats.columns[3]: "Total Yards" }, inplace = True)
rushing_stats = rushing_stats.drop([32,33,34])
rushing_stats = rushing_stats.sort_values(by=['Team']).reset_index(drop=True)
rushing_stats.head()

It is also necessary to tidy up the defensive data. The CSV contains a plethora of defensive stats, and the vast majority of them are not needed for our model.

In [None]:
defense_stats = pd.read_csv('defense_stats.csv', index_col=False) # Defensive Stats
defense_stats.columns = defense_stats.iloc[header_row]
defense_stats = defense_stats.drop(header_row)
defense_stats = defense_stats.drop(columns= ["FL"])
defense_stats = defense_stats.drop(columns= ["1stD"])
defense_stats = defense_stats.drop(columns= ["TO%"])
defense_stats = defense_stats.drop(columns= ["Sc%"])
defense_stats = defense_stats.drop(columns= ["EXP"])
defense_stats = defense_stats.drop(columns= ["Pen"])
defense_stats = defense_stats.drop(columns= ["1stPy"])
defense_stats = defense_stats.drop(columns= ["Int"])
defense_stats = defense_stats.drop(columns= ["TD"])
defense_stats = defense_stats.drop(columns= ["Cmp"])
defense_stats = defense_stats.drop(columns= ["Att"])
# defense_stats = defense_stats.drop(columns= defense_stats.columns[[16]], axis=1)
defense_stats = defense_stats.iloc[:,~defense_stats.columns.duplicated()]
defense_stats = defense_stats.rename(columns={"Tm": "Team", "G": "Games", "Yds": "Total Yards Allowed", "Ply": "Total Ply", "NY/P":"Net Yards per Pass Allowed", "Y/A":"Rushing Yards Allowed"\
    , "TO":"Total TOs", "PA":"Points Allowed"})
defense_stats = defense_stats.drop([33, 34,35])
defense_stats = defense_stats.sort_values(by=['Team']).reset_index(drop=True)
defense_stats.head()

Continuing the tidying, we move on to the pass defense data.

In [None]:
passdef_stats = pd.read_csv('passdef_stats.csv') # Passing Defense Stats
passdef_stats = passdef_stats.drop(columns= ["Rk"])
passdef_stats = passdef_stats.drop(columns= ["QBHits"])
passdef_stats = passdef_stats.drop(columns= ["PD"])
passdef_stats = passdef_stats.drop(columns= ["Rate"])
passdef_stats = passdef_stats.drop(columns= ["Sk"])
passdef_stats = passdef_stats.drop(columns= ["Yds.1"]) 
passdef_stats = passdef_stats.drop(columns= ["TFL"])
passdef_stats = passdef_stats.drop(columns= ["Sk%"])
passdef_stats = passdef_stats.drop(columns= ["EXP"])
passdef_stats = passdef_stats.drop([32,33, 34])
passdef_stats = passdef_stats.sort_values(by=['Tm']).reset_index(drop=True)
passdef_stats.head()

Finally, we must tidy our rushing defense data and remove unnecessary data points from this CSV as well.

In [None]:
rushdef_stats = pd.read_csv('rushdef_stats.csv') 
rushdef_stats = rushdef_stats.drop(columns= ["EXP"])
rushdef_stats = rushdef_stats.drop(columns= ["Rk"])
rushdef_stats = rushdef_stats.drop([32,33, 34])
rushdef_stats = rushdef_stats.sort_values(by=['Tm']).reset_index(drop=True)
rushdef_stats.head()
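
The later cells combine columns from different tables by row position, so it may be worth confirming that every tidied table lists the same teams in the same order. The helper below is a hypothetical check, not part of the original notebook, shown on made-up team names.

```python
import pandas as pd

def same_team_order(*team_columns):
    """Return True when every column lists the same teams in the same order."""
    first = list(team_columns[0])
    return all(list(col) == first for col in team_columns[1:])

# Made-up example: two tables agree, a third is sorted differently.
offense_teams = pd.Series(["Team A", "Team B", "Team C"])
defense_teams = pd.Series(["Team A", "Team B", "Team C"])
shuffled_teams = pd.Series(["Team B", "Team A", "Team C"])

print(same_team_order(offense_teams, defense_teams))
print(same_team_order(offense_teams, shuffled_teams))
```

In the notebook this could be called with `offense_stats['Team']`, `passdef_stats['Tm']` and the other team columns, assuming they use identical team names.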

DATA ANALYSIS

POINTS ALLOWED PER GAME VS POINTS SCORED PER GAME

In [None]:
offense_stats['Points Per Game'] = offense_stats['Total Points For'].astype(float) / offense_stats['Games'].astype(float)
ppg = offense_stats['Points Per Game']
defense_stats['Points Allowed Per Game'] = defense_stats['Points Allowed'].astype(float) / defense_stats['Games'].astype(float)
papg = defense_stats['Points Allowed Per Game']

plt.scatter(offense_stats['Points Per Game'], defense_stats['Points Allowed Per Game'])
plt.xlabel("Points Per Game")
plt.ylabel("Points Allowed Per Game")
m, b = np.polyfit(offense_stats['Points Per Game'].head(32), defense_stats['Points Allowed Per Game'], 1)
plt.plot(offense_stats['Points Per Game'], m*offense_stats['Points Per Game'] + b)
for i, row in offense_stats.iterrows():
    plt.annotate(row.Team, (ppg[i], papg[i]))

Our first visual breaks down how teams are performing in terms of scoring points versus allowing them. Teams in the bottom right of the chart are generally the best, as they score the most points and allow the fewest. Meanwhile, teams in the top left are generally the worst teams in the NFL, as they allow opposing teams to score plenty of points while struggling to do so themselves.
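
The "bottom right is best" reading of this chart can be summarized in one number per team: points scored per game minus points allowed per game. The sketch below uses made-up numbers, and the `Point Differential Per Game` column is an assumption added for illustration, not a column from the PFF data.

```python
import pandas as pd

# Made-up per-game numbers for three hypothetical teams.
toy = pd.DataFrame({
    "Team": ["Team A", "Team B", "Team C"],
    "Points Per Game": [30.0, 18.0, 24.0],
    "Points Allowed Per Game": [20.0, 27.0, 24.0],
})

# Positive values sit toward the bottom right of the scatter plot, negative toward the top left.
toy["Point Differential Per Game"] = toy["Points Per Game"] - toy["Points Allowed Per Game"]
ranked = toy.sort_values("Point Differential Per Game", ascending=False).reset_index(drop=True)

print(ranked)
```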

DEFENSIVE VS OFFENSIVE PASSING EFFICIENCY

In [None]:
passing_stats['Passing Efficiency'] = ((passing_stats["Completion %"].astype(float) * (passing_stats["Total Yards"].astype(float)/passing_stats["Att"].astype(float))) + \
    (passing_stats["Completion %"].astype(float) * (passing_stats["TD"].astype(float)/passing_stats["Att"].astype(float))*100*6) - \
    (passing_stats["Completion %"].astype(float) * (passing_stats["Int"].astype(float)/passing_stats["Att"].astype(float))*100*3)) / 100
passdef_stats['Defensive Passing Efficiency'] = (passdef_stats["Int%"].astype(float))*3 - \
    (passdef_stats["TD%"].astype(float)/passing_stats["Att"].astype(float)*100)*(6) - \
    (passdef_stats["Y/C"].astype(float)/passing_stats["Att"].astype(float)*100*3)

plt.scatter(passing_stats['Passing Efficiency'], passdef_stats['Defensive Passing Efficiency'])
plt.xlabel("Passing Efficiency")
plt.ylabel("Defensive Passing Efficiency")

m, b = np.polyfit(passing_stats['Passing Efficiency'].head(32), passdef_stats['Defensive Passing Efficiency'], 1)
plt.plot(passing_stats['Passing Efficiency'], m*passing_stats['Passing Efficiency'] + b)
for i, row in passing_stats.iterrows():
    plt.annotate(row.Team, (passing_stats['Passing Efficiency'][i], passdef_stats['Defensive Passing Efficiency'][i]))

The second visual breaks down how efficient each team's passing game is. A high Passing Efficiency tells us that a team is generally successful when passing the ball. Similarly, Defensive Passing Efficiency measures a team's ability to stop the opposing team's passing offense. Teams with a high Defensive Passing Efficiency are able to limit their opponents' success in the passing game.
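
To make the Passing Efficiency number easier to interpret, here is the same arithmetic as the cell above wrapped in a function and applied to one made-up stat line. The function name `passing_efficiency` and the input values are hypothetical.

```python
def passing_efficiency(cmp_pct, yards, attempts, touchdowns, interceptions):
    """Completion-weighted yards per attempt, with a bonus for TDs and a penalty for INTs."""
    yards_term = cmp_pct * (yards / attempts)
    td_term = cmp_pct * (touchdowns / attempts) * 100 * 6          # touchdowns weighted by 6
    int_term = cmp_pct * (interceptions / attempts) * 100 * 3      # interceptions weighted by 3
    return (yards_term + td_term - int_term) / 100

# Hypothetical season: 65% completions, 4000 yards, 500 attempts, 30 TDs, 10 INTs.
example = passing_efficiency(65.0, 4000, 500, 30, 10)
print(round(example, 2))
```

With these inputs the yards term is 65 * 8, the touchdown term is 65 * 36 and the interception term is 65 * 6, so the result works out to 65 * 38 / 100 = 24.7.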

DEFENSIVE VS OFFENSIVE RUSHING EFFICIENCY

In [None]:
rushing_stats['Rushing Efficiency'] = (rushing_stats["Y/A"].astype(float))\
     + (rushing_stats["TD"].astype(float) /rushing_stats["Att"].astype(float)) * 100*6\
     - (rushing_stats["Fmb"].astype(float) /rushing_stats["Att"].astype(float)) * 100*3
rushdef_stats['Defensive Rushing Efficiency'] = (rushdef_stats["Y/A"].astype(float)) - ((rushdef_stats["TD"].astype(float) /rushdef_stats["Att"].astype(float)) * 100)*6
plt.scatter(rushing_stats['Rushing Efficiency'], rushdef_stats['Defensive Rushing Efficiency'])
plt.xlabel("Rushing Efficiency")
plt.ylabel("Defensive Rushing Efficiency")

m, b = np.polyfit(rushing_stats['Rushing Efficiency'].head(32), rushdef_stats['Defensive Rushing Efficiency'], 1)
plt.plot(rushing_stats['Rushing Efficiency'], m*rushing_stats['Rushing Efficiency'] + b)
for i, row in rushing_stats.iterrows():
    plt.annotate(row.Team, (rushing_stats['Rushing Efficiency'][i], rushdef_stats['Defensive Rushing Efficiency'][i]))

This visual operates very similarly to the last one, except that it focuses on rushing efficiency. It plots each team's offensive rushing efficiency against its defensive rushing efficiency, with a fitted trend line for reference.
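
One way to compare each team with the league average on both rushing metrics is to subtract the column means. The sketch below is an illustration on made-up values; only the two column names are taken from the notebook.

```python
import pandas as pd

# Made-up efficiency values for three hypothetical teams.
toy = pd.DataFrame({
    "Team": ["Team A", "Team B", "Team C"],
    "Rushing Efficiency": [12.0, 9.0, 6.0],
    "Defensive Rushing Efficiency": [-10.0, -14.0, -12.0],
})

metrics = ["Rushing Efficiency", "Defensive Rushing Efficiency"]

# Positive numbers mean the team is above the league mean on that metric.
deviation = toy[metrics] - toy[metrics].mean()
deviation.insert(0, "Team", toy["Team"])

print(deviation)
```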

TYPE OF PLAYS RUN

In [None]:
# Run vs Pass
offense_stats['Total Plays (Only Run and Pass)'] = offense_stats['Passing Attempts'].astype(float) + offense_stats['Rushing Attempts'].astype(float)
offense_stats['Pass Percent'] = offense_stats['Passing Attempts'].astype(float)/offense_stats['Total Plays (Only Run and Pass)'].astype(float)
offense_stats['Run Percent'] = offense_stats['Rushing Attempts'].astype(float)/offense_stats['Total Plays (Only Run and Pass)'].astype(float)

plt.scatter(offense_stats['Pass Percent'], offense_stats['Run Percent'])
plt.xlabel("Pass Percent")
plt.ylabel("Run Percent")
m, b = np.polyfit(offense_stats['Pass Percent'].head(32), offense_stats['Run Percent'], 1)
plt.plot(offense_stats['Pass Percent'], m*offense_stats['Pass Percent'] + b)
for i, row in offense_stats.iterrows():
    plt.annotate(row.Team, (offense_stats['Pass Percent'][i], offense_stats['Run Percent'][i]))

This visual shows what type of plays teams are running. It breaks down how often each team passes or runs and puts it on a chart. It may be surprising that the points fall exactly on a line, but here every play is counted as either a pass or a run, so the two percentages always sum to 1: teams that pass more must run less, and teams that run more must pass less.
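
The straight line follows directly from how the two percentages are defined. The short check below, using made-up attempt counts, confirms that the shares sum to 1 and that a fitted line therefore has slope -1 and intercept 1.

```python
import numpy as np

# Made-up passing and rushing attempts for four hypothetical teams.
pass_att = np.array([600.0, 550.0, 500.0, 450.0])
rush_att = np.array([400.0, 450.0, 500.0, 550.0])

total = pass_att + rush_att
pass_pct = pass_att / total
run_pct = rush_att / total

# Because run_pct = 1 - pass_pct, the best-fit line is exactly y = -x + 1.
slope, intercept = np.polyfit(pass_pct, run_pct, 1)
print(slope, intercept)
```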

MODEL PREDICTIONS

TEAM PLAY STYLE

In [None]:
df = pd.DataFrame(offense_stats['Team'])
df['RushPercent'] = offense_stats['Run Percent']
df['RushEfficiency'] = rushing_stats['Rushing Efficiency']
df['PassPercent'] = offense_stats['Pass Percent']
df['PassEfficiency'] = passing_stats['Passing Efficiency']
df['DRushEfficiency'] = rushdef_stats['Defensive Rushing Efficiency']
df['DPassEfficiency'] = passdef_stats['Defensive Passing Efficiency']
df['DefenseRank'] = defense_stats['Rk']
df['OffensiveTO%'] = offense_stats['Total Turnovers'].astype(float)/offense_stats['Total Plays'].astype(float)
df['DefensiveTO%'] = defense_stats['Total TOs'].astype(float)/defense_stats['Total Ply'].astype(float)
df.insert(10, "Value", 1)
df.head()

This dataframe shows how often teams run and pass, along with their offensive efficiency at each and their defensive efficiency against each. It will serve as an important element of our model's ability to create predictions.

MODEL FORMULA

The most important part of the model is creating an effective formula to project the results of each game. By using efficiency numbers along with projecting the likelihood of a turnover we can make an effective formula.

The formula I created was:
Value = ((Rush Percent * Rush Efficiency) + (Pass Percent * Pass Efficiency)) - ((Rush Percent * RushDef Efficiency) - (Pass Percent * PassDef Efficiency)) - (Defense Rank * (Offensive Turnover% - Defensive Turnover%) + 1.0)

The formula uses both rushing and passing efficiency and play rate on the offensive and defensive sides. By adding turnover probability to the formula, I believe we can create a reasonable first set of predictions for NFL games.
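
As a worked example, the sketch below evaluates the formula for a single hypothetical team. The function name `team_value` and all input numbers are made up for illustration; the arithmetic mirrors the notebook's `Value` calculation.

```python
def team_value(rush_pct, rush_eff, pass_pct, pass_eff,
               d_rush_eff, d_pass_eff, defense_rank, off_to_rate, def_to_rate):
    """Combine play mix, efficiency and turnover rates into a single team score."""
    offense = (rush_pct * rush_eff) + (pass_pct * pass_eff)
    defense = (rush_pct * d_rush_eff) - (pass_pct * d_pass_eff)
    turnovers = defense_rank * (off_to_rate - def_to_rate) + 1.0
    return offense - defense - turnovers

# Hypothetical team: 40% runs, 60% passes, 10th-ranked defense.
example_value = team_value(
    rush_pct=0.4, rush_eff=10.0, pass_pct=0.6, pass_eff=20.0,
    d_rush_eff=-5.0, d_pass_eff=-2.0, defense_rank=10,
    off_to_rate=0.02, def_to_rate=0.03,
)
print(round(example_value, 2))
```

Here the offensive part is 4 + 12 = 16, the defensive part is -2 - (-1.2) = -0.8, and the turnover part is 10 * (-0.01) + 1 = 0.9, giving 16 + 0.8 - 0.9 = 15.9.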

In [None]:
df['Value'] = ((df['RushPercent'] * df['RushEfficiency']) + (df['PassPercent'] * df['PassEfficiency'])) - ((df['RushPercent'] * df['DRushEfficiency']) - (df['PassPercent'] * df['DPassEfficiency'])) - (df['DefenseRank'].astype(float) * (df['OffensiveTO%'] - df['DefensiveTO%']) + 1.0)
df

WIN PROBABILITY

In [None]:
teams = df.Team.unique()
rows = []
for index in range(len(teams)):
    arr = []
    for index2 in range(len(teams)):
        # Share of the combined Value that belongs to the team in this row
        total = df["Value"][index] + df["Value"][index2]
        arr.append(df["Value"][index] / total)
    rows.append(arr)
# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
new_df = pd.DataFrame(rows, index=teams, columns=teams)
new_df

This data represents each team's chances of beating every other team. With it we can make a chart that plots the information and shows which teams are the most likely to win or lose.
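
The matchup table is each team's Value divided by the combined Value of the two teams. The same table can be sketched with NumPy broadcasting, shown here on three made-up Values; the variable name `win_share` is hypothetical.

```python
import numpy as np

# Made-up Values for three hypothetical teams.
values = np.array([15.0, 10.0, 5.0])

# Entry [i, j] is values[i] / (values[i] + values[j]): team i's share against team j.
win_share = values[:, None] / (values[:, None] + values[None, :])

print(win_share)
```

Two properties follow from the definition: a team against itself always gets 0.5, and the entries for A vs B and B vs A sum to 1. The ratio only behaves like a probability when every Value is positive, which is an assumption worth checking on the real data.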

In [None]:
from matplotlib.pyplot import figure
figure(figsize=(32, 16), dpi=80)
for i in range(len(df.Team.unique())):
    plt.plot(df.Team.unique(), new_df.iloc[i], label = df.Team[i])
plt.title('Winning Percentage Against Team')
plt.xlabel('Team')
plt.ylabel('Winning Percent')
plt.legend()

LEAST SQUARES REGRESSION

GETTING TRAINING DATA

To perform our least squares regression, the first thing we must do is split up training and test data. 75 percent of our data will serve as training while the other 25 percent will serve as test data.

In [None]:
X = df.drop(columns=['Value', 'Team'], axis=1)
Y = df['Value']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.25, random_state=0)
X_train['Value'] = Y_train

REGRESSION

Using our previously created formula and passing it into an OLS function we can create a Least Squares Regression Model with our training data.

In [None]:
from statsmodels.formula.api import ols
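# Column names containing '%' are not valid Python identifiers, so they are wrapped in patsy's Q().
# Note that inside a formula string, '*', '+' and '-' are patsy term operators rather than arithmetic.
formula_str = 'Value ~ ((RushPercent * RushEfficiency) + (PassPercent * PassEfficiency)) - ((RushPercent * DRushEfficiency) - (PassPercent * DPassEfficiency)) - (DefenseRank * (Q("OffensiveTO%") - Q("DefensiveTO%")))'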
mod = ols(formula=formula_str, data=X_train).fit()
warnings.filterwarnings('ignore')
mod.summary()

The R^2 value indicates how much of the variation in the value can be explained by variation in the columns (Rush Efficiency, Rush Percent, etc.). If we predict our test values, how would they come out?
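
To make the R^2 idea concrete, the sketch below fits a one-variable least squares line on made-up training points and then scores it on made-up held-out points. It uses plain NumPy rather than the statsmodels model above, and `r_squared` is a hypothetical helper written for this example.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up training and test data (not NFL data).
x_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = np.array([2.1, 3.9, 6.2, 7.8])
x_test = np.array([5.0, 6.0])
y_test = np.array([10.1, 11.9])

# Ordinary least squares with an intercept column.
design = np.column_stack([np.ones_like(x_train), x_train])
coef, *_ = np.linalg.lstsq(design, y_train, rcond=None)

y_pred = coef[0] + coef[1] * x_test
test_r2 = r_squared(y_test, y_pred)
print(test_r2)
```

Scoring on held-out rows in this way is what answers the question of how the test values come out: an R^2 near 1 on the test set means the fitted relationship carries over to data the model has not seen.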

CONCLUSION

The model that I created is based on general information about how efficiently teams operate on offense and defense, along with the type of plays they call and their turnover rates. On an actual football field there are many more factors at play. Injuries and matchups are a huge factor in picking games and mean a lot more than simply ranking teams based on these efficiency numbers.

With my current knowledge of computer science, this model did an effective job of creating a basic representation of football games for an NFL fan. For a bettor who needs consistent accuracy at scale, however, it does not do an excellent job.

To create a more effective model and improve on this one, it is necessary to understand all of the various little factors at play in the NFL. Furthermore, each factor carries a different weight, and I would need to look through historical and current data to determine how much each factor matters.

Football is also a week-to-week game where any team can upset any other. No model will ever be perfectly accurate, and in a game where one play can make all the difference there is an even greater amount of variance. Regardless, after this class is over I will attempt to make this model more advanced and be the individual who takes down gambling empires by finding a way to beat the odds.