# Linear Regression Assignment

## PART 1: MLR

### Create a multiple linear regression model that would predict the rating based on all the available features (except year and name) using linear algebra approach (inv or pinv). Provide the model parameters.

In [4]:
import numpy as np
import pandas as pd

In [5]:
# Read the CSV file
df = pd.read_csv('QB2022_MLR.csv')
df.head()

Unnamed: 0,Year,Player,Pass Yds,Yds/Att,Att,Cmp,Cmp %,TD,INT,Rate
0,2022,Jared Goff,4438,7.6,587,382,65.1,29,7,99.3
1,2022,Josh Allen,4283,7.6,567,359,63.3,35,14,96.6
2,2022,Geno Smith,4282,7.5,572,399,69.8,30,11,100.9
3,2022,Trevor Lawrence,4113,7.0,584,387,66.3,25,8,95.2
4,2022,Jalen Hurts,3701,8.0,460,306,66.5,22,6,101.6


In [6]:
# Drop the year and player columns
dfrate = df.drop(columns = ['Year', 'Player'])
dfrate.head()

Unnamed: 0,Pass Yds,Yds/Att,Att,Cmp,Cmp %,TD,INT,Rate
0,4438,7.6,587,382,65.1,29,7,99.3
1,4283,7.6,567,359,63.3,35,14,96.6
2,4282,7.5,572,399,69.8,30,11,100.9
3,4113,7.0,584,387,66.3,25,8,95.2
4,3701,8.0,460,306,66.5,22,6,101.6


In [7]:
# Use linear algebra to predict the rating
# Let the output (vector b) be Rate
b = dfrate['Rate']
print(b.shape)
# This shows us the shape of the input (first 7 columns)
dfrate.iloc[:,:7].shape

(70,)


(70, 7)

In [8]:
# Find matrix A with the first 7 columns of the dataframe
# Use hstack to stack arrays next to each other
A = np.hstack([np.ones((dfrate.shape[0],1)), dfrate.iloc[:,:7]])
print(A.shape)

(70, 8)
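As a small illustration of the design-matrix construction above, the following sketch builds the same kind of matrix from the first three rows shown in `dfrate.head()`. The names `features` and `A_small` are placeholders chosen for this example only. The leading column of ones is what lets the first model parameter act as the intercept.

```python
import numpy as np

# First three rows of the seven feature columns shown in dfrate.head():
# Pass Yds, Yds/Att, Att, Cmp, Cmp %, TD, INT
features = np.array([
    [4438, 7.6, 587, 382, 65.1, 29, 7],
    [4283, 7.6, 567, 359, 63.3, 35, 14],
    [4282, 7.5, 572, 399, 69.8, 30, 11],
])

# Prepend a column of ones so the first parameter acts as the intercept
A_small = np.hstack([np.ones((features.shape[0], 1)), features])
print(A_small.shape)
```

With three rows and seven features, the result has eight columns: one for the intercept and one per feature.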


In [9]:
# Solve the normal equations, x = (A^T A)^-1 A^T b
# (the pseudoinverse of A applied to b), to find the linear regression model
x = np.linalg.inv(A.T@A)@A.T@b
x

array([-4.27770064e+01, -6.69029754e-03,  4.52455396e+00,  2.65961280e-01,
       -3.34880718e-01,  1.47996495e+00,  1.40284500e+00, -2.28659449e+00])

Our linear regression model above has the following parameters:

Intercept = -4.27770064e+01

Pass Yds Coefficient = -6.69029754e-03

Yds/Att Coefficient = 4.52455396e+00

Att Coefficient = 2.65961280e-01

Cmp Coefficient = -3.34880718e-01

Cmp % Coefficient = 1.47996495e+00

TD Coefficient = 1.40284500e+00

INT Coefficient = -2.28659449e+00
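The assignment allows either `inv` or `pinv`. The sketch below, run on synthetic data rather than the quarterback file, suggests that the explicit normal-equation formula, `np.linalg.pinv`, and `np.linalg.lstsq` should all agree when the design matrix is well conditioned. The data, the "true" parameters, and every variable name here are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the design matrix: 70 rows, intercept + 7 features
X = rng.normal(size=(70, 7))
A_demo = np.hstack([np.ones((70, 1)), X])
true_x = np.arange(1.0, 9.0)  # hypothetical "true" parameters
b_demo = A_demo @ true_x + rng.normal(scale=0.1, size=70)

# Normal-equation solution, the same formula as in the cell above
x_inv = np.linalg.inv(A_demo.T @ A_demo) @ A_demo.T @ b_demo

# Moore-Penrose pseudoinverse
x_pinv = np.linalg.pinv(A_demo) @ b_demo

# Least-squares solver, which avoids forming A.T @ A explicitly
x_lstsq, *_ = np.linalg.lstsq(A_demo, b_demo, rcond=None)

print(np.allclose(x_inv, x_pinv), np.allclose(x_inv, x_lstsq))
```

When columns are strongly correlated, as raw counts such as `Att` and `Cmp` tend to be, `pinv` or `lstsq` is usually the more numerically robust choice than inverting `A.T @ A` directly.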

### Repeat the above using the library in sklearn. Provide the model parameters.

In [12]:
from sklearn import linear_model

reg = linear_model.LinearRegression()

reg.fit(np.array(dfrate.iloc[:,:7]), np.array(dfrate['Rate']))
reg.intercept_, reg.coef_

(-42.777006364868726,
 array([-0.0066903 ,  4.52455396,  0.26596128, -0.33488072,  1.47996495,
         1.402845  , -2.28659449]))

Our linear regression model above has the following parameters:

Intercept = -42.777006364868726

Pass Yds Coefficient = -0.0066903

Yds/Att Coefficient = 4.52455396

Att Coefficient = 0.26596128

Cmp Coefficient = -0.33488072

Cmp % Coefficient = 1.47996495

TD Coefficient = 1.402845

INT Coefficient = -2.28659449

The parameters from sklearn match those from the linear algebra approach to the displayed precision, which confirms that both methods produce the same least-squares fit for predicting the rating from all the available features.
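Rather than comparing the two parameter lists by eye, the match can be checked programmatically. The following is a minimal sketch on synthetic data (not the quarterback file) that labels each parameter with its feature name and compares the two approaches with `np.allclose`. The names `reg_demo`, `x_demo`, and `params` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

feature_names = ['Pass Yds', 'Yds/Att', 'Att', 'Cmp', 'Cmp %', 'TD', 'INT']

# Synthetic features and target, used only to demonstrate the comparison
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(70, len(feature_names)))
y_demo = X_demo @ np.arange(1.0, 8.0) + 5.0 + rng.normal(scale=0.1, size=70)

# Fit with sklearn
reg_demo = linear_model.LinearRegression().fit(X_demo, y_demo)

# Fit with linear algebra (pseudoinverse of the design matrix)
A_demo = np.hstack([np.ones((X_demo.shape[0], 1)), X_demo])
x_demo = np.linalg.pinv(A_demo) @ y_demo

# Put both parameter vectors side by side, labeled by name
params = pd.DataFrame(
    {'sklearn': np.r_[reg_demo.intercept_, reg_demo.coef_],
     'linear algebra': x_demo},
    index=['Intercept'] + feature_names,
)
print(params)
print(np.allclose(params['sklearn'], params['linear algebra']))
```

Labeling the rows this way also guards against pairing a coefficient with the wrong feature when writing up the results.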

### Make a prediction for each player in QB2022_MLR_test.csv.  Download QB2022_MLR_test.csv and print the result.

In [16]:
# Read the CSV file
dftest = pd.read_csv('QB2022_MLR_test.csv')
dftest.head()

Unnamed: 0,Year,Player,Pass Yds,Yds/Att,Att,Cmp,Cmp %,TD,INT,Rate
0,2022,Patrick Mahomes,5250,8.1,648,435,67.1,41,12,105.2
1,2022,Justin Herbert,4739,6.8,699,477,68.2,25,10,93.2
2,2022,Tom Brady,4694,6.4,733,490,66.8,25,9,90.7
3,2022,Kirk Cousins,4547,7.1,643,424,65.9,29,14,92.5
4,2022,Joe Burrow,4475,7.4,606,414,68.3,35,12,100.8


In [17]:
# Drop the year and player columns
dftest = dftest.drop(columns=['Year', 'Player'])
dftest.head()

Unnamed: 0,Pass Yds,Yds/Att,Att,Cmp,Cmp %,TD,INT,Rate
0,5250,8.1,648,435,67.1,41,12,105.2
1,4739,6.8,699,477,68.2,25,10,93.2
2,4694,6.4,733,490,66.8,25,9,90.7
3,4547,7.1,643,424,65.9,29,14,92.5
4,4475,7.4,606,414,68.3,35,12,100.8


In [18]:
# Use sklearn to predict each player's rate
pred = reg.predict(np.array(dftest.iloc[:,:7]))
pred

array([114.8007751 ,  95.59226257,  98.98738212,  94.15009489,
       106.03957719])

Patrick Mahomes has a predicted rating of 114.8007751

Justin Herbert has a predicted rating of 95.59226257

Tom Brady has a predicted rating of 98.98738212

Kirk Cousins has a predicted rating of 94.15009489

Joe Burrow has a predicted rating of 106.03957719
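One way to print these results next to the actual ratings is to collect them in a small table. The sketch below copies the player names, predictions, and actual `Rate` values from the outputs above. The names `results`, `predicted`, and `actual_rate` are placeholders for this example.

```python
import numpy as np
import pandas as pd

players = ['Patrick Mahomes', 'Justin Herbert', 'Tom Brady',
           'Kirk Cousins', 'Joe Burrow']
# Predictions and actual ratings copied from the notebook outputs
predicted = np.array([114.8007751, 95.59226257, 98.98738212,
                      94.15009489, 106.03957719])
actual_rate = np.array([105.2, 93.2, 90.7, 92.5, 100.8])

results = pd.DataFrame({
    'Player': players,
    'Predicted Rate': predicted.round(1),
    'Actual Rate': actual_rate,
})
# Positive error means the model over-predicted the rating
results['Error'] = (results['Predicted Rate'] - results['Actual Rate']).round(1)
print(results.to_string(index=False))
```

For these five players, every predicted rating is higher than the actual one.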

### Calculate MSE for the data points you made a prediction for.

In [21]:
# Create an array for the actual rate values
actual = np.array(dftest['Rate'])
actual

array([105.2,  93.2,  90.7,  92.5, 100.8])

In [22]:
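# Find the mean squared error using sklearn
from sklearn.metrics import mean_squared_error
# The signature is mean_squared_error(y_true, y_pred); MSE is symmetric,
# so the value is the same either way, but this order follows the convention
mse = mean_squared_error(actual, pred)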
mse

39.35089751300326

The mean squared error of the predicted ratings on the test set is 39.35089751300326 (about 39.35).
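As a cross-check, the MSE is simply the average of the squared differences between actual and predicted ratings, so it can be computed directly with NumPy. This sketch uses the values printed above. The variable names are chosen for illustration.

```python
import numpy as np

# Actual and predicted ratings copied from the notebook outputs
actual_rate = np.array([105.2, 93.2, 90.7, 92.5, 100.8])
predicted = np.array([114.8007751, 95.59226257, 98.98738212,
                      94.15009489, 106.03957719])

# MSE = mean of squared residuals
residuals = actual_rate - predicted
mse_manual = np.mean(residuals ** 2)
print(mse_manual)
```

Up to the rounding of the printed predictions, this should agree with the value returned by `mean_squared_error`.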

## PART 2: Feature Engineering

### Create a new feature called "Cmp/Att" that is the ratio of completions to number of attempts. Use the Cmp and Att columns to calculate the new feature.

In [26]:
dfrate['Cmp/Att'] = dfrate['Cmp'] / dfrate['Att']
dfrate

Unnamed: 0,Pass Yds,Yds/Att,Att,Cmp,Cmp %,TD,INT,Rate,Cmp/Att
0,4438,7.6,587,382,65.1,29,7,99.3,0.650767
1,4283,7.6,567,359,63.3,35,14,96.6,0.633157
2,4282,7.5,572,399,69.8,30,11,100.9,0.697552
3,4113,7.0,584,387,66.3,25,8,95.2,0.662671
4,3701,8.0,460,306,66.5,22,6,101.6,0.665217
...,...,...,...,...,...,...,...,...,...
65,90,6.0,15,10,66.7,0,0,82.6,0.666667
66,59,11.8,5,3,60.0,0,1,61.7,0.600000
67,58,7.2,8,6,75.0,1,1,94.8,0.750000
68,52,4.3,12,8,66.7,1,0,103.5,0.666667


### Create a new feature called "TD/Att" that is the ratio of touchdowns to number of attempts. Use the TD and Att columns to calculate the new feature.

In [28]:
dfrate['TD/Att'] = dfrate['TD'] / dfrate['Att']
dfrate

Unnamed: 0,Pass Yds,Yds/Att,Att,Cmp,Cmp %,TD,INT,Rate,Cmp/Att,TD/Att
0,4438,7.6,587,382,65.1,29,7,99.3,0.650767,0.049404
1,4283,7.6,567,359,63.3,35,14,96.6,0.633157,0.061728
2,4282,7.5,572,399,69.8,30,11,100.9,0.697552,0.052448
3,4113,7.0,584,387,66.3,25,8,95.2,0.662671,0.042808
4,3701,8.0,460,306,66.5,22,6,101.6,0.665217,0.047826
...,...,...,...,...,...,...,...,...,...,...
65,90,6.0,15,10,66.7,0,0,82.6,0.666667,0.000000
66,59,11.8,5,3,60.0,0,1,61.7,0.600000,0.000000
67,58,7.2,8,6,75.0,1,1,94.8,0.750000,0.125000
68,52,4.3,12,8,66.7,1,0,103.5,0.666667,0.083333
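The two ratio features can also be created in one step with `DataFrame.assign`. The sketch below uses the first five rows shown above and additionally checks that `Cmp/Att` carries essentially the same information as the existing `Cmp %` column, just on a 0 to 1 scale. The `sample` frame and the 0.05 tolerance (chosen because `Cmp %` is rounded to one decimal place) are assumptions for this example.

```python
import numpy as np
import pandas as pd

# First five rows of the relevant columns, copied from the table above
sample = pd.DataFrame({
    'Att': [587, 567, 572, 584, 460],
    'Cmp': [382, 359, 399, 387, 306],
    'Cmp %': [65.1, 63.3, 69.8, 66.3, 66.5],
    'TD': [29, 35, 30, 25, 22],
})

# Add both ratio features at once
sample = sample.assign(**{
    'Cmp/Att': sample['Cmp'] / sample['Att'],
    'TD/Att': sample['TD'] / sample['Att'],
})

# Cmp/Att * 100 should reproduce Cmp % up to rounding
same_info = np.allclose(sample['Cmp/Att'] * 100, sample['Cmp %'], atol=0.05)
print(sample.round(6))
print(same_info)
```

If a row ever had zero attempts, these divisions would produce `inf` or `NaN`, so such rows would need to be filtered or handled before fitting.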


### Use sklearn to create a multiple linear regression model that would predict the rating based on 3 features: "Yds/Att", "Cmp/Att", and "TD/Att". Provide the model parameters.

In [30]:
reg_1 = linear_model.LinearRegression()
reg_1.fit(np.array(dfrate[['Yds/Att','Cmp/Att','TD/Att']]), 
          np.array(dfrate['Rate']))
reg_1.intercept_, reg_1.coef_

(-22.431076218617463, array([  3.28389874, 110.49147089, 365.44531551]))

Our linear regression model above has the following parameters:

Intercept = -22.431076218617463

Yds/Att Coefficient = 3.28389874

Cmp/Att Coefficient = 110.49147089

TD/Att Coefficient = 365.44531551

### Calculate the MSE of this model for the test set. What is MSE of the new model?


In [32]:
# Create new features for dftest, then find predicted rate
dftest['Cmp/Att'] = dftest['Cmp'] / dftest['Att']
dftest['TD/Att'] = dftest['TD'] / dftest['Att']
pred = reg_1.predict(np.array(dftest[['Yds/Att','Cmp/Att','TD/Att']]))
pred

array([101.46333041,  88.36948459,  84.91186976,  90.22565899,
        98.4606243 ])

In [33]:
# Find the mean squared error (arguments are y_true, y_pred)
actual = np.array(dftest['Rate'])
mean_squared_error(actual, pred)

16.28886722823315

The MSE of the new model is 16.28886722823315.

### Which model performs better? The one with more input features or the ones with fewer input features?

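The model with fewer input features performs better on the test set: its MSE is about 16.29, compared with about 39.35 for the model that uses all seven features. Because the test set contains only five players, this comparison should be read with some caution.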
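To make the comparison easier to interpret, both errors can be expressed as a root mean squared error (RMSE), which is in the same units as the rating itself. The sketch below recomputes both from the predictions printed earlier. The helper `mse` and the variable names are hypothetical.

```python
import numpy as np

# Actual ratings and both sets of predictions, copied from the outputs above
actual_rate = np.array([105.2, 93.2, 90.7, 92.5, 100.8])
pred_full = np.array([114.8007751, 95.59226257, 98.98738212,
                      94.15009489, 106.03957719])   # seven-feature model
pred_ratio = np.array([101.46333041, 88.36948459, 84.91186976,
                       90.22565899, 98.4606243])    # three-ratio model


def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return float(np.mean((y_true - y_pred) ** 2))


mse_full = mse(actual_rate, pred_full)
mse_ratio = mse(actual_rate, pred_ratio)

# RMSE is the square root of MSE, in rating points
print(f"Seven features: MSE={mse_full:.2f}, RMSE={np.sqrt(mse_full):.2f}")
print(f"Three ratios:   MSE={mse_ratio:.2f}, RMSE={np.sqrt(mse_ratio):.2f}")
```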