<h1>NFL Quarterback Hall of Fame Predictor</h1>
    <h2>Adam Yamarik</h2>
        
        

The purpose of the model defined below is to learn which statistics matter most for making the Hall of Fame, and to use that model to predict whether a player will eventually make the HoF.

This will be accomplished through both logistic regression and a kernelized perceptron. The logistic regression model will also be used to learn the weights of the different features, which will be helpful in applying the kernelized perceptron.

We start with the implementation of the logistic regression, but first we need to import the data and translate it into something that we can work with. Each instance will be of the form:<br>
Games played, completions, total pass attempts, total pass yards, total touchdowns, total interceptions, championships, name<br>
The name is not used as a feature; it is kept only as a way to keep track of who is who.
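As a small illustration of the row format described above, here is a minimal sketch of parsing one line into its seven numeric stats and the trailing name. The sample line and the `parse_row` helper are hypothetical and are not taken from `xData.txt`.

```python
# Hypothetical sample row in the format described above (not real data)
sample_line = "100, 2000, 3200, 24000, 150, 90, 1, Example Player"


def parse_row(line):
    """Split one comma-separated row into 7 integer stats and the player name."""
    fields = [field.strip() for field in line.split(",")]
    stats = [int(value) for value in fields[:7]]  # the numeric columns
    name = fields[7]                              # kept only for bookkeeping
    return stats, name


stats, name = parse_row(sample_line)
print(stats)
print(name)
```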
   

In [1]:
# import cell

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [2]:
# Import the data

xlabels = np.zeros((192, 7))
ylabels = np.zeros(192)

# Read both files up front; the context managers close them automatically
with open("xData.txt") as raw_data:
    splitData = raw_data.read().splitlines()
with open("yData.txt") as true_labels:
    splitLabels = true_labels.read().splitlines()

for i in range(192):
    # Only the first 7 comma-separated fields are numeric; the trailing name is skipped
    currArray = splitData[i].split(',')
    for j in range(7):
        xlabels[i][j] = int(currArray[j].strip())

for i in range(192):
    ylabels[i] = int(splitLabels[i])

Now with the data imported, we need to make a few modifications to it. The whole point of this model is to predict whether any quarterback will make the HoF, including players who are still active. So some stats are not very helpful when making comparisons to those who are already eligible for the Hall. Generally, these are career totals, like total completions, total touchdowns, etc. If we compared someone who retired after 15 years with someone who has only played 2 or 3, it would be nearly impossible for the latter player to be predicted to make the Hall.

For the specific changes, we can break the totals down on a per-year or even a per-game basis. The smaller the scope, the better the comparison. Here we use per-game values.
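The per-game conversion can also be written with NumPy broadcasting instead of an explicit loop. This is only a sketch: the two rows in `totals` are hypothetical values, and the names `totals`, `per_game`, and `features` are not part of the notebook.

```python
import numpy as np

# Two hypothetical rows: games, completions, attempts, yards, TDs, INTs, championships
totals = np.array([
    [100.0, 2000.0, 3200.0, 24000.0, 150.0, 90.0, 1.0],
    [50.0, 900.0, 1500.0, 10000.0, 60.0, 40.0, 0.0],
])

games = totals[:, [0]]                 # shape (n, 1) so it broadcasts across columns
per_game = totals[:, 1:6] / games      # the five counting stats divided by games played
features = np.hstack([per_game, totals[:, [6]]])  # championships stay a raw count

print(features.shape)
print(features[0])
```

Dividing by the games column assumes every player has at least one game played; a row with zero games would need to be filtered out first.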

In [3]:
# Note: indices for the stats are as follows:
# 0: Total games played; used only as the divisor, then discarded for the reason described above
# 1: Total completions; changed to completions per game
# 2: Total attempts; changed to attempts per game
# 3: Total pass yards; changed to yards per game
# 4: Total passing TDs; changed to TDs per game
# 5: Total interceptions; changed to ints per game
# 6: Championships; unchanged

newXData = np.zeros((192, 6))

for i in range(0, 192):
    compPerGame = (xlabels[i][1] / xlabels[i][0])
    attPerGame = (xlabels[i][2] / xlabels[i][0])
    yardsPerGame = (xlabels[i][3] / xlabels[i][0])
    tdPerGame = (xlabels[i][4] / xlabels[i][0])
    intPerGame = (xlabels[i][5] / xlabels[i][0])
    
    newXData[i][0] = compPerGame
    newXData[i][1] = attPerGame
    newXData[i][2] = yardsPerGame
    newXData[i][3] = tdPerGame
    newXData[i][4] = intPerGame
    newXData[i][5] = xlabels[i][6]
    

The data is now in a form that can be used for our purposes. Next, we need to split the data and run logistic regression on it.

In [4]:
# Split the data into testing and training
x_train, x_test, y_train, y_test = train_test_split(newXData, ylabels, test_size=.25, random_state=2)

# Define the logistic regression model
logReg = LogisticRegression(penalty='l2', max_iter = 1000)

logReg.fit(x_train, y_train)

LogisticRegression(max_iter=1000)

Now we can use the test set to see how well we did.

In [5]:
#print(x_test)
logReg.predict(x_test)

array([0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.])

In [6]:
print(y_test)

[0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1.]


In [7]:
predictions = logReg.predict(x_test)


In [8]:
score = logReg.score(x_test, y_test)
print(score)

0.875
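Accuracy alone hides which kind of mistakes the model makes. As a sketch, the predictions and true labels printed in cells [5] and [6] can be tallied into a confusion matrix by hand; the variable names below are illustrative and not part of the notebook.

```python
# Predictions copied from the output of cell [5]
predicted = [
    0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
]
# True labels copied from the output of cell [6]
actual = [
    0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
]

pairs = list(zip(predicted, actual))
tp = sum(1 for p, a in pairs if p == 1 and a == 1)  # predicted HoF, is HoF
fp = sum(1 for p, a in pairs if p == 1 and a == 0)  # predicted HoF, is not
fn = sum(1 for p, a in pairs if p == 0 and a == 1)  # missed Hall of Famers
tn = sum(1 for p, a in pairs if p == 0 and a == 0)  # correctly rejected

accuracy = (tp + tn) / len(pairs)
recall = tp / (tp + fn)
print(tp, fp, fn, tn, accuracy, recall)
```

The tally reproduces the 0.875 score, but it also shows that only 5 of the 10 actual Hall of Famers in the test set are recovered, so most of the errors are missed inductees rather than false alarms.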


Let's get the weight vector and analyze what the model says about the stats.

In [9]:
weights = logReg.coef_
print(weights)

[[-0.12354417  0.05243707  0.03272731  1.21452078 -0.03722771  1.54188867]]


Let's proceed by breaking down the weights.<br>
The weights for features 0, 1, and 2 are pretty small. So according to the model, completions, attempts, and yards per game don't hold much importance for making the HoF, although these features take much larger values than the others (yards per game is in the hundreds), so a small weight does not by itself mean a small effect. The last weight is championships, which has a significant positive weight. This makes sense, since it is generally accepted that winning Super Bowls is the best way to leave your mark for the HoF. The weight for touchdowns is also significant. I find the interceptions weight interesting. While it does have a negative effect, it is very close to zero, so this model does not value interceptions highly for this prediction.

Now let's run a couple of tests on players who are not yet eligible for the HoF. Let's start with a player who will almost certainly be a first-ballot Hall of Famer after he retires, Tom Brady.

In [10]:
totGames = 318
totComp = 7263
totAtt = 11317
totYards = 84520
totTD = 624
totInt = 203
totChamp = 7

compPG = totComp / totGames
attPG = totAtt / totGames
yardsPG = totYards / totGames
tdPG = totTD / totGames
intPG = totInt / totGames

playerTomBrady = [[compPG, attPG, yardsPG, tdPG, intPG, totChamp]]

In [11]:
predTB = logReg.predict(playerTomBrady)
print(predTB)

[1.]


As expected, Tom Brady is predicted to make the Hall of Fame. Now let's test another likely candidate, Aaron Rodgers.

In [12]:
totGames = 213
totComp = 4651
totAtt = 9380
totYards = 71940
totTD = 449
totInt = 93
totChamp = 1

compPG = totComp / totGames
attPG = totAtt / totGames
yardsPG = totYards / totGames
tdPG = totTD / totGames
intPG = totInt / totGames

playerAaronRodgers = [[compPG, attPG, yardsPG, tdPG, intPG, totChamp]]

In [13]:
predAR = logReg.predict(playerAaronRodgers)
print(predAR)

[1.]


Once again, we see the model has the expected outcome. Now let's try the other side, a player who will most likely not make the HoF. We can test using Case Keenum.

In [14]:
totGames = 76
totComp = 1356
totAtt = 2173
totYards = 14876
totTD = 78
totInt = 48
totChamp = 0

compPG = totComp / totGames
attPG = totAtt / totGames
yardsPG = totYards / totGames
tdPG = totTD / totGames
intPG = totInt / totGames

playerCaseKeenum = [[compPG, attPG, yardsPG, tdPG, intPG, totChamp]]

In [15]:
predCK = logReg.predict(playerCaseKeenum)
print(predCK)

[0.]


As expected, Keenum is predicted to not make the HoF. The above examples are pretty simple, and most people would agree with the model. But what about players on whom analysts and fans have differing opinions? Let's try the model on Eli Manning, perhaps the most controversial player in this discussion. Eli has two Super Bowl wins but was never viewed as a top-tier QB, so there is more ambiguity about whether he will make the HoF. For reference, the site I got my data from, Pro Football Reference, includes a stat of its own called the Hall of Fame Monitor score, which is its way of estimating the likelihood of a player making the HoF. They put Eli Manning at 85.01. An average QB in the HoF has a score of 109, so Eli is a bit below that threshold.

In [16]:
totGames = 236
totComp = 4895
totAtt = 8119
totYards = 57023
totTD = 366
totInt = 244
totChamp = 2

compPG = totComp / totGames
attPG = totAtt / totGames
yardsPG = totYards / totGames
tdPG = totTD / totGames
intPG = totInt / totGames

playerEliManning = [[compPG, attPG, yardsPG, tdPG, intPG, totChamp]]

In [17]:
predEM = logReg.predict(playerEliManning)
print(predEM)

[1.]


Interestingly, our model predicts that Eli Manning will make the Hall of Fame. Let's dig a little into why this is the case. First, 2 Super Bowl wins is much higher than average: most QBs will not get anywhere close to a Super Bowl in their careers, and the weight vector supports that, giving the highest weight to the championships feature. Our model also doesn't value interceptions highly, so the just over 1 int per game that Manning has doesn't hurt his prediction much.<br>
Since Eli is not yet eligible to be inducted into the HoF, we won't know whether the model is right for a while, and even if he does make it, it may not be on the first ballot. Although Manning is a more ambiguous case in this discussion, I would say the model's prediction is justifiable, but it would have been justifiable if it had predicted '0' as well.
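To make the reasoning above concrete, each feature's contribution to the decision function is its weight times its value. The sketch below uses the coefficients printed in cell [9] and Manning's totals from cell [16]; the intercept (`logReg.intercept_`) is never printed in this notebook, so the sum is only the weighted part of the score, not the full decision value.

```python
# Coefficients copied from the output of cell [9]
weights = [-0.12354417, 0.05243707, 0.03272731, 1.21452078, -0.03722771, 1.54188867]
names = ["completions/game", "attempts/game", "yards/game",
         "TDs/game", "INTs/game", "championships"]

# Eli Manning's totals copied from cell [16]
games = 236
features = [4895 / games, 8119 / games, 57023 / games, 366 / games, 244 / games, 2]

# Contribution of each feature = weight * feature value
contributions = {name: w * x for name, w, x in zip(names, weights, features)}
for name, value in contributions.items():
    print(f"{name:18s} {value:+.3f}")

# Intercept omitted on purpose: it is not shown anywhere in the notebook
weighted_sum = sum(contributions.values())
print(f"weighted sum (no intercept): {weighted_sum:+.3f}")
```

Two things stand out. The interception term is tiny, which matches the discussion above. The yards-per-game term is the largest one even though its weight is small, because the feature itself is in the hundreds, so the two championships are not the only thing pushing the prediction toward 1.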

One last test. The winner of the previous Super Bowl, Super Bowl 56, was the Los Angeles Rams. Their quarterback, Matthew Stafford, had high expectations placed on him after his trade from the Detroit Lions, and he lived up to them. But many people still believe that Stafford is not worthy of being in the Hall of Fame, so let's see what the model predicts. (For reference, Stafford's Hall of Fame Monitor score is 68.44.)

In [49]:
totGames = 182
totComp = 4302
totAtt = 6825
totYards = 49995
totTD = 323
totInt = 161
totChamp = 1

compPG = totComp / totGames
attPG = totAtt / totGames
yardsPG = totYards / totGames
tdPG = totTD / totGames
intPG = totInt / totGames

playerMattStafford = [[compPG, attPG, yardsPG, tdPG, intPG, totChamp]]

In [50]:
predMS = logReg.predict(playerMattStafford)
print(predMS)

[0.]
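The player cells above all repeat the same per-game conversion. A small helper could remove that duplication; the sketch below is one way to write it, and the name `per_game_features` is hypothetical rather than something defined in the notebook.

```python
def per_game_features(games, completions, attempts, yards,
                      touchdowns, interceptions, championships):
    """Return one feature row in the column order the model was trained on."""
    return [[
        completions / games,
        attempts / games,
        yards / games,
        touchdowns / games,
        interceptions / games,
        championships,          # championships stay a raw count
    ]]


# Totals copied from the Tom Brady cell above
player_tom_brady = per_game_features(318, 7263, 11317, 84520, 624, 203, 7)
print(player_tom_brady)
```

The nested list matches the shape that `logReg.predict` is given in the cells above, so the result could be passed to it directly.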


Now let's use the weights to apply a kernelized perceptron to the data and see how the two classifiers compare. The kernelized classifier used here is scikit-learn's SVC with a polynomial kernel.

In [20]:
from sklearn.svm import SVC

In [21]:
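weightedXData = np.zeros((192,6))

# Scale each feature by its logistic regression weight.
# Note: int() truncates toward zero, so the fractional part of both the feature
# and the weight is dropped here (the weights above become 0, 0, 0, 1, 0, 1).
for i in range(0,192):
    for j in range(0,6):
        x = int(newXData[i][j])
        currWeight = int(weights[0][j])
        weightedXData[i][j] = x * currWeight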
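The `int()` casts in the loop above change the weighting quite a bit, while the player cells later in the notebook multiply by the unrounded weights. The sketch below contrasts the two on hypothetical rows standing in for `newXData`; it is an illustration of the difference, not a replacement for the cell, and the recorded outputs in this notebook come from the truncated version.

```python
import numpy as np

# Coefficients copied from the output of cell [9]
weights = np.array([[-0.12354417, 0.05243707, 0.03272731,
                     1.21452078, -0.03722771, 1.54188867]])

# Hypothetical per-game feature rows standing in for newXData
sample = np.array([
    [20.0, 32.0, 240.0, 1.5, 0.9, 1.0],
    [18.0, 30.0, 200.0, 1.2, 0.8, 0.0],
])

# Float weighting: broadcasting multiplies every row by the weight vector
weighted_float = sample * weights[0]

# Truncated weighting: mirrors the int() casts used in the loop above
weighted_truncated = sample.astype(int) * weights[0].astype(int)

print(weighted_float[0])
print(weighted_truncated[0])
```

With truncation, only the whole-number part of touchdowns per game and the championship count survive; every other column becomes zero.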
                           

In [22]:
# Split the data into testing and training
x_train, x_test, y_train, y_test = train_test_split(weightedXData, ylabels, test_size=.25, random_state=2)


perceptron = SVC(kernel='poly')
perceptron.fit(x_train, y_train)

SVC(kernel='poly')

In [23]:
perceptron.predict(x_test)

array([0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.])

In [24]:
print(y_test)

[0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1.]


In [25]:
perScore = perceptron.score(x_test, y_test)

In [26]:
print(perScore)

0.8333333333333334


So we see that the score is a bit worse than the logistic regression's, but that is OK. The kernelized perceptron still does a good job of predicting the data. I was curious to see what would happen if we used unweighted values, so I tested that below.

In [27]:
# Split the data into testing and training
x_train, x_test, y_train, y_test = train_test_split(newXData, ylabels, test_size=.25, random_state=2)


perceptron2 = SVC(kernel='poly')
perceptron2.fit(x_train, y_train)

SVC(kernel='poly')

In [28]:
perceptron2.predict(x_test)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [29]:
perScore2 = perceptron2.score(x_test, y_test)

In [30]:
print(perScore2)

0.7916666666666666


We can see two interesting results from this. The first is that the unweighted data is more commonly classified as 0, whereas the weighted predictions are closer to the true split in the data. In fact, with the split given, none of the test values are classified as 1 by the unweighted model. The second is that the unweighted values give us a worse model with a lower score than the weighted one. So it seems that using logistic regression to help tailor the features helped the perceptron.
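A useful reference point for these scores is the accuracy of always predicting the most common class. The sketch below computes that baseline from the true test labels printed in cell [24]; the variable names are illustrative.

```python
# True test labels copied from the output of cell [24]
actual = [
    0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
]

# Most frequent label and the accuracy obtained by always predicting it
majority_label = max(set(actual), key=actual.count)
baseline_accuracy = actual.count(majority_label) / len(actual)

print(majority_label, baseline_accuracy)
```

The baseline is 38/48, about 0.792, which is the same as the unweighted model's score. That is consistent with the all-zero prediction array above: on this split the unweighted model does no better than always answering 0.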

Let's see if this model disagrees with the logistic regression one on the examples used before.

In [31]:
totGames = 318
totComp = 7263
totAtt = 11317
totYards = 84520
totTD = 624
totInt = 203
totChamp = 7 * weights[0][5]

compPG = (totComp / totGames) * weights[0][0]
attPG = (totAtt / totGames) * weights[0][1]
yardsPG = (totYards / totGames) * weights[0][2]
tdPG = (totTD / totGames) * weights[0][3]
intPG = (totInt / totGames) * weights[0][4]

playerTomBrady = [[compPG, attPG, yardsPG, tdPG, intPG, totChamp]]

In [32]:
predTB = perceptron.predict(playerTomBrady)
print(predTB)

[1.]


In [33]:
totGames = 213
totComp = 4651
totAtt = 9380
totYards = 71940
totTD = 449
totInt = 93
totChamp = 1 * weights[0][5]

compPG = (totComp / totGames)  * weights[0][0]
attPG = (totAtt / totGames) * weights[0][1]
yardsPG = (totYards / totGames) * weights[0][2]
tdPG = (totTD / totGames) * weights[0][3]
intPG = (totInt / totGames) * weights[0][4]

playerAaronRodgers = [[compPG, attPG, yardsPG, tdPG, intPG, totChamp]]

In [34]:
predAR = perceptron.predict(playerAaronRodgers)
print(predAR)

[1.]


In [35]:
totGames = 76
totComp = 1356
totAtt = 2173
totYards = 14876
totTD = 78
totInt = 48
totChamp = 0 * weights[0][5]

compPG = totComp / totGames * weights[0][0]
attPG = totAtt / totGames * weights[0][1]
yardsPG = totYards / totGames * weights[0][2]
tdPG = totTD / totGames * weights[0][3]
intPG = totInt / totGames * weights[0][4]

playerCaseKeenum = [[compPG, attPG, yardsPG, tdPG, intPG, totChamp]]

In [36]:
predCK = perceptron.predict(playerCaseKeenum)
print(predCK)

[0.]


In [52]:
totGames = 236
totComp = 4895
totAtt = 8119
totYards = 57023
totTD = 366
totInt = 244
totChamp = 2 * weights[0][5]

compPG = (totComp / totGames) * weights[0][0]
attPG = (totAtt / totGames) * weights[0][1]
yardsPG = (totYards / totGames) * weights[0][2]
tdPG = (totTD / totGames) * weights[0][3]
intPG = (totInt / totGames) * weights[0][4]

playerEliManning = [[compPG, attPG, yardsPG, tdPG, intPG, totChamp]]

In [53]:
predEM = perceptron.predict(playerEliManning)
print(predEM)

[1.]


In [54]:
totGames = 182
totComp = 4302
totAtt = 6825
totYards = 49995
totTD = 323
totInt = 161
totChamp = 1 * weights[0][5]

compPG = totComp / totGames * weights[0][0]
attPG = totAtt / totGames * weights[0][1]
yardsPG = totYards / totGames * weights[0][2]
tdPG = totTD / totGames * weights[0][3]
intPG = totInt / totGames * weights[0][4]

playerMattStafford = [[compPG, attPG, yardsPG, tdPG, intPG, totChamp]]

In [55]:
predMS = perceptron.predict(playerMattStafford)
print(predMS)

[1.]


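The predictions are unchanged for Brady, Rodgers, Keenum, and Manning, but per the outputs above Stafford flips from 0 under logistic regression to 1 here. Let's see if we can find other players who differ between the two.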

In [41]:
totGames = 178
totComp = 3771
totAtt = 6108
totYards = 41269
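totTD = 3227  # NOTE: looks like a data-entry typo (about 18 TDs per game); the output below was produced with this value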
totInt = 144
totChamp = 1

compPG = totComp / totGames
attPG = totAtt / totGames
yardsPG = totYards / totGames
tdPG = totTD / totGames
intPG = totInt / totGames

playerJoeFlacco= [[compPG, attPG, yardsPG, tdPG, intPG, totChamp]]

In [42]:
predJF = logReg.predict(playerJoeFlacco)
print(predJF)

[1.]


In [43]:
totGames = 178
totComp = 3771
totAtt = 6108
totYards = 41269
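totTD = 3227  # NOTE: looks like a data-entry typo (about 18 TDs per game); the output below was produced with this value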
totInt = 144
totChamp = 1 * weights[0][5]

compPG = totComp / totGames * weights[0][0]
attPG = totAtt / totGames * weights[0][1]
yardsPG = totYards / totGames * weights[0][2]
tdPG = totTD / totGames * weights[0][3]
intPG = totInt / totGames * weights[0][4]

playerJoeFlacco= [[compPG, attPG, yardsPG, tdPG, intPG, totChamp]]

In [44]:
predJF = perceptron.predict(playerJoeFlacco)
print(predJF)

[1.]


In [45]:
totGames = 58
totComp = 857
totAtt = 1329
totYards = 9967
totTD = 84
totInt = 31
totChamp = 0

compPG = totComp / totGames
attPG = totAtt / totGames
yardsPG = totYards / totGames
tdPG = totTD / totGames
intPG = totInt / totGames

playerLamarJackson= [[compPG, attPG, yardsPG, tdPG, intPG, totChamp]]

In [46]:
predLJ = logReg.predict(playerLamarJackson)
print(predLJ)

[0.]


In [47]:
totGames = 58
totComp = 857
totAtt = 1329
totYards = 9967
totTD = 84
totInt = 31
totChamp = 0 * weights[0][5]

compPG = totComp / totGames * weights[0][0]
attPG = totAtt / totGames * weights[0][1]
yardsPG = totYards / totGames * weights[0][2]
tdPG = totTD / totGames * weights[0][3]
intPG = totInt / totGames * weights[0][4]

playerLamarJackson= [[compPG, attPG, yardsPG, tdPG, intPG, totChamp]]

In [48]:
predLJ = perceptron.predict(playerLamarJackson)
print(predLJ)

[1.]


So Joe Flacco is predicted by both to make it, which is interesting in and of itself (although the touchdown total entered for him, 3227, is implausibly high and likely drives that result), but we have found a discrepancy in Lamar Jackson. Lamar is predicted to not make the Hall of Fame by the logistic regression model, and predicted to make the HoF by the kernelized perceptron. I think the majority of people would predict that Jackson's current time in the NFL is not enough to make the Hall of Fame. Much as our weights suggest, the best thing for him to do is to win a Super Bowl, but it is interesting to see that Jackson is a quarterback with this discrepancy.
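Rather than comparing the printed outputs by eye, the two models' predictions can be lined up programmatically. The sketch below hard-codes the predictions recorded in the cells above as (logistic regression, kernelized model) pairs; the `recorded` dictionary is just a transcription of those outputs, not something the notebook builds.

```python
# (logistic regression, kernelized model) predictions copied from the outputs above
recorded = {
    "Tom Brady": (1, 1),
    "Aaron Rodgers": (1, 1),
    "Case Keenum": (0, 0),
    "Eli Manning": (1, 1),
    "Matthew Stafford": (0, 1),
    "Joe Flacco": (1, 1),
    "Lamar Jackson": (0, 1),
}

# Keep only the players on whom the two models disagree
disagreements = [name for name, (log_reg, kernel) in recorded.items()
                 if log_reg != kernel]
print(disagreements)
```

On the recorded outputs this lists Matthew Stafford as well as Lamar Jackson, and in both cases the kernelized model is the more generous of the two.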