# NFL Score Predictor

For this project we will be looking at prediciting NFL Scores based on a database given to us. This database contains the date a game was played, the season, the elo rating for both teams, the elo probability, the team tags, whether it was a playoff game, whether it was played at a nuetral area or not, the scores of the games and the results. I think it would be useful to have the average points allowed and points scored for each team, but the average should be a rolling average from the more recent games. Since the goal is to predict future games, the scores and results will not be included in the training data since this will not be given for any games that haven't been played yet. However, I think it could be useful to train a classifier to predict the winner before hand and then have that prediction be used to train the regression model that predicts the scores of each game.

In [1]:
import pandas as pd
import numpy as np
import matplotlib as plt
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import statistics as stc
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')

In [2]:
data = pd.read_csv("nfl-elo-game-master/data/nfl_games.csv" )

check for empty data points

In [3]:
data.isnull().sum()

date          0
season        0
neutral       0
playoff      48
team1         0
team2         0
elo1          0
elo2          0
elo_prob1     0
score1        0
score2        0
result1       0
dtype: int64

fill the values with 0

In [4]:
data = data.fillna(0)

now lets take out any data from before 1980 because football was a lot different back then.

In [5]:
data = data.truncate(before=9056)

In [6]:
data = data.set_index("date")

now lets split our data set into test train datasets, one group will be for predicting score1 and the other will be for predicting score 2

In [7]:
X_train, X_test, y_train, y_test = train_test_split(data.drop(columns = ["team1","team2", "score1", "score2","result1"]), data["score1"], test_size=.2, random_state=41)
X2_train, X2_test, y2_train, y2_test = train_test_split(data.drop(columns = ["team1","team2", "score1", "score2","result1"]), data["score2"], test_size=.2, random_state=41)

lets create two models, one for each score to predict. XGBoost regressors will be used for these models.

In [8]:
model = xgb.XGBRegressor()
model.fit(X_train, y_train)
model2 = xgb.XGBRegressor()
model2.fit(X2_train, y2_train)



XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)

now lets make predictions with the model.

In [9]:
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
y2_pred = model2.predict(X_test)
predictions2 = [round(value) for value in y2_pred]

We will check how the model does with Mean Squared Error measurement

In [10]:
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print("RMSE: %f" % (rmse))
rmse2 = np.sqrt(mean_squared_error(y_test, predictions2))
print("RMSE: %f" % (rmse2))

RMSE: 9.663482
RMSE: 11.509092


we seem to have a Mean Squared error that is pretty high, but we have not done much editing to the database. Now lets do some feature engineering and see if we can reduce that MSE.

I think that it would be useful to take an average points allowed and average points scored column for each team, we want to have this for both the home and away team. Although we do not want this average to be from the entire database, it would be more useful if it were a moving average of say about 16 games.

In [11]:
pointsscored = {}
pointsallowed = {}
for team in data["team1"].unique():
    pointsscored[team] = [10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10]
    pointsallowed[team] = [10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10]


In [12]:
pfadata = pd.read_csv("nfl-elo-game-master/data/nfl_games.csv" ).truncate(before=9056)
pfadat = pd.DataFrame({"team1ps": np.nan,"team1pa": np.nan,"team2ps": np.nan,"team1pa": np.nan},  index=pfadata.index)

In [13]:
for index, row in pfadata.iterrows():
    
    pfadat.at[index, "team1ps"] = (round(stc.mean(pointsscored[row[4]])))
    pfadat.at[index, "team1pa"] = (round(stc.mean(pointsallowed[row[4]])))
    pfadat.at[index, "team2ps"] = (round(stc.mean(pointsscored[row[5]])))
    pfadat.at[index, "team2pa"] = (round(stc.mean(pointsallowed[row[5]])))
    
    pointsscored[row[4]].append(row[9])
    pointsscored[row[4]].pop(0)
    pointsallowed[row[4]].append(row[10])
    pointsallowed[row[4]].pop(0)
    
    pointsscored[row[5]].append(row[10])
    pointsscored[row[5]].pop(0)
    pointsallowed[row[5]].append(row[9])
    pointsallowed[row[5]].pop(0)
    

In [14]:
newdata = pd.concat([pfadata, pfadat], axis=1, sort=False)
newdata = newdata.fillna(0)

#label encode the teams
newdata["team1"] = newdata["team1"].astype("category").cat.codes
newdata["team2"] = newdata["team2"].astype("category").cat.codes

Here we will create another machine learning model that will try and predict the winner of the game. This prediction will then be place in our dataset and be used by our regression model to help try and improve the MSE.

In [15]:
winner = xgb.XGBClassifier()
winner.fit(newdata.drop(columns = [ "date", "score1", "score2","result1"]), newdata["result1"])
y_pred = winner.predict(newdata.drop(columns = [ "date", "score1", "score2","result1"]))
predictions = [round(value) for value in y_pred]



accuracy = accuracy_score(newdata["result1"].astype(int), predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 67.36%


In [16]:
newdata["predResult"] = predictions

Split the data up again.

In [17]:
X_train, X_test, y_train, y_test = train_test_split(newdata.drop(columns = [ "date", "score1", "score2","result1"]), newdata["score1"], test_size=.2, random_state=41)
X2_train, X2_test, y2_train, y2_test = train_test_split(newdata.drop(columns = ["date",  "score1", "score2","result1"]), newdata["score2"], test_size=.2, random_state=41)

In [18]:
model = xgb.XGBRegressor()
model.fit(X_train, y_train)
model2 = xgb.XGBRegressor()
model2.fit(X2_train, y2_train)



XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)

make our predictions again with our new data.

In [19]:
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
y2_pred = model2.predict(X_test)
predictions2 = [round(value) for value in y2_pred]

In [20]:
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print("RMSE: %f" % (rmse))
rmse2 = np.sqrt(mean_squared_error(y_test, predictions2))
print("RMSE: %f" % (rmse2))

RMSE: 9.572480
RMSE: 11.482973


Our RMSE went down a little bit but not a lot.

In [21]:
newdata.tail()

Unnamed: 0,date,season,neutral,playoff,team1,team2,elo1,elo2,elo_prob1,score1,score2,result1,team1ps,team1pa,team2ps,team2pa,predResult
16584,9/27/2020,2020,0,0.0,27,8,1592.877541,1526.916862,0.680021,38,31,1.0,26.0,25.0,27.0,21.0,1.0
16585,9/27/2020,2020,0,0.0,9,29,1474.230438,1501.272864,0.554409,10,28,0.0,18.0,20.0,30.0,28.0,0.0
16586,9/27/2020,2020,0,0.0,0,10,1487.264238,1382.367502,0.726712,23,26,0.0,23.0,27.0,22.0,28.0,1.0
16587,9/27/2020,2020,0,0.0,21,11,1589.800167,1618.642744,0.551848,30,37,0.0,29.0,21.0,26.0,22.0,1.0
16588,9/28/2020,2020,0,0.0,2,15,1672.269182,1685.130853,0.574475,20,34,0.0,32.0,17.0,30.0,19.0,1.0


Lets try our predictions on the entire dataset.

In [22]:
y_pred = model.predict(newdata.drop(columns = [ "date", "score1", "score2","result1"]))
predictions = [round(value) for value in y_pred]
y2_pred = model2.predict(newdata.drop(columns = [ "date", "score1", "score2","result1"]))
predictions2 = [round(value) for value in y2_pred]

In [23]:
rmse = np.sqrt(mean_squared_error(newdata["score1"], predictions))
print("RMSE: %f" % (rmse))
rmse2 = np.sqrt(mean_squared_error(newdata["score2"], predictions2))
print("RMSE: %f" % (rmse2))

RMSE: 9.388746
RMSE: 9.153151


The RMSE for this well a lot lower than our split database, especially on our score2 prediction.

In [24]:
newdata["predScore1"] = predictions
newdata["predScore2"] = predictions2
newdata.tail()

Unnamed: 0,date,season,neutral,playoff,team1,team2,elo1,elo2,elo_prob1,score1,score2,result1,team1ps,team1pa,team2ps,team2pa,predResult,predScore1,predScore2
16584,9/27/2020,2020,0,0.0,27,8,1592.877541,1526.916862,0.680021,38,31,1.0,26.0,25.0,27.0,21.0,1.0,29.0,25.0
16585,9/27/2020,2020,0,0.0,9,29,1474.230438,1501.272864,0.554409,10,28,0.0,18.0,20.0,30.0,28.0,0.0,20.0,26.0
16586,9/27/2020,2020,0,0.0,0,10,1487.264238,1382.367502,0.726712,23,26,0.0,23.0,27.0,22.0,28.0,1.0,25.0,23.0
16587,9/27/2020,2020,0,0.0,21,11,1589.800167,1618.642744,0.551848,30,37,0.0,29.0,21.0,26.0,22.0,1.0,31.0,27.0
16588,9/28/2020,2020,0,0.0,2,15,1672.269182,1685.130853,0.574475,20,34,0.0,32.0,17.0,30.0,19.0,1.0,27.0,28.0


lets use our model to predict the upcomoing games this week.

In [25]:
def getPredictions(filename):
    week4 = pd.read_csv(filename)
    pfadat = pd.DataFrame({"team1ps": np.nan,"team1pa": np.nan,"team2ps": np.nan,"team1pa": np.nan},  index=week4.index)
    for index, row in week4.iterrows():
    
        pfadat.at[index, "team1ps"] = (round(stc.mean(pointsscored[row[4]])))
        pfadat.at[index, "team1pa"] = (round(stc.mean(pointsallowed[row[4]])))
        pfadat.at[index, "team2ps"] = (round(stc.mean(pointsscored[row[5]])))
        pfadat.at[index, "team2pa"] = (round(stc.mean(pointsallowed[row[5]])))
    
    week4 = pd.concat([week4, pfadat], axis=1, sort=False)
    week4 = week4.fillna(0)

    tempt1 = week4["team1"]
    tempt2 = week4["team2"]
    week4["team1"] = week4["team1"].astype("category").cat.codes
    week4["team2"] = week4["team2"].astype("category").cat.codes
        
    y_pred = winner.predict(week4.drop(columns = [ "date", "score1", "score2","result1"]))
    predictions = [round(value) for value in y_pred]
    week4["predResult"] = predictions
    y_pred = model.predict(week4.drop(columns = [ "date", "score1", "score2","result1"]))
    predictions = [round(value) for value in y_pred]
    y2_pred = model2.predict(week4.drop(columns = [ "date", "score1", "score2","result1"]))
    predictions2 = [round(value) for value in y2_pred]
    week4["predScore1"] = predictions
    week4["predScore2"] = predictions2
    
    week4["team1"] = tempt1
    week4["team2"] = tempt2
    return week4

In [26]:
def printWinners(dataset):
    for el in range(len(dataset)):
        if(dataset.at[el,"predScore1"] > dataset.at[el,"predScore2"] ):
            string = str(dataset.at[el,"team1"]) + " is predicted to beat " + str(dataset.at[el,"team2"]) + " with a score of " + str(dataset.at[el,"predScore1"]) + " - " + str(dataset.at[el,"predScore2"])
        else:
            string = str(dataset.at[el,"team2"]) + " is predicted to beat " + str(dataset.at[el,"team1"]) + " with a score of " + str(dataset.at[el,"predScore2"]) + " - " + str(dataset.at[el,"predScore1"])
        
        print (string)

In [27]:
test = getPredictions("week4.csv")
printWinners(test)


NYJ is predicted to beat DEN with a score of 21.0 - 20.0
TB is predicted to beat LAC with a score of 30.0 - 22.0
CHI is predicted to beat IND with a score of 26.0 - 21.0
JAX is predicted to beat CIN with a score of 23.0 - 22.0
CAR is predicted to beat ARI with a score of 24.0 - 23.0
DAL is predicted to beat CLE with a score of 28.0 - 20.0
SEA is predicted to beat MIA with a score of 29.0 - 22.0
MIN is predicted to beat HOU with a score of 26.0 - 23.0
BAL is predicted to beat WSH with a score of 30.0 - 15.0
TEN is predicted to beat PIT with a score of 30.0 - 24.0
NO is predicted to beat DET with a score of 30.0 - 20.0
LAR is predicted to beat NYG with a score of 30.0 - 20.0
KC is predicted to beat NE with a score of 30.0 - 24.0
BUF is predicted to beat OAK with a score of 28.0 - 22.0
SF is predicted to beat PHI with a score of 32.0 - 22.0
GB is predicted to beat ATL with a score of 29.0 - 21.0
