## Simple Linear Regression in Python

Model how many points a team earns (dependent variable) as its squad value increases (independent variable).

This might help us see how much a club needs to invest in its squad to avoid relegation or reach the European spots, or help us set a data-driven target for our team.

_Initial set-up & exploration_

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

In [None]:
data = pd.read_csv('../data/positionsvsValue.csv')

In [None]:
data.head()

In [None]:
data.describe()

We have a 220-row dataset, with each row being one team in one Premier League season since 2008/09. For each team, we get squad size, average age and squad value (in Euros), as well as performance data: goal difference, points & position. The values are taken from Transfermarkt.

In [None]:
sns.pairplot(data[['Season', 'GD', 'Squad Value', 'Points', 'Position']])

Some interesting points to keep in mind:

- Points & goal difference correlate really strongly, as you might expect.
- Squad value goes up as goal difference and points go up, but as more of a curve than a line.
- Squad value has increased over time.

In [None]:
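#Is squad value more strongly correlated with position than with points?
#Position correlates negatively (1st is best), so compare its absolute value
abs(data['Squad Value'].corr(data['Position'])) > data['Squad Value'].corr(data['Points'])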

_Building our Model_

1) Get and reshape the two columns that we want to use in our model: Points & Squad Value

2) Split each of the two variables into a training set and a test set. The training set is used to fit the model; the test set lets us see how well the model performs on data it has not seen.

3) Create an empty linear regression model, then fit it against our two training sets

4) Examine and test the model

In [None]:
#1- Get our two columns into variables, then reshape them

X = data['Squad Value']
y = data['Points']

X = X.values.reshape(-1,1)
y = y.values.reshape(-1,1)

We can use train_test_split to easily create our training and test sets. There are a few arguments we have to pass, in addition to the variables that will be split. test_size tells the function what proportion of the data should go into the test set (0.25 here, so 25%). random_state is optional, but it sets a starting point for the random number generation involved in the split, so we get the same split every time the cell is run.

In [None]:
#2- Use the train_test_split function to create our training sets & test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101)
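To make the roles of test_size and random_state concrete, here is a minimal plain-Python sketch of a seeded shuffle-and-cut split. `simple_split` is a hypothetical helper written only for illustration. It is not how scikit-learn implements train_test_split, and it will not pick the same rows.

```python
import math
import random

def simple_split(rows, test_size=0.25, seed=101):
    """Shuffle a copy of rows with a fixed seed, then use the first share as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)        # same seed -> same shuffle order
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

rows = list(range(8))                            # stand-in for 8 rows of data
train, test = simple_split(rows)
train_again, test_again = simple_split(rows)

print(len(train), len(test))
print(test == test_again)
```

With 8 rows and test_size=0.25, two rows land in the test set, and re-running with the same seed reproduces the same split.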

In [None]:
#3- Create an empty model, then fit it against the training sets
lm = LinearRegression()
lm.fit(X_train,y_train)

The final part is examining the model. This means seeing what conclusions it gives to answer our main question (value -> performance), and importantly, how valid they are.

We can start by checking the coefficient. This is the amount that we expect our response variable (points) to change for every one-unit change in our predictor variable (squad value, in millions of Euros). Put simply: for every extra million we put into our squad value, how many extra points should we get?

In [None]:
print(lm.coef_)
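To see where a coefficient like this comes from, here is a small sketch of the ordinary least squares formulas for a single predictor, written in plain Python. The function name `fit_line` and the four data points are made up for illustration; they are not taken from the dataset.

```python
def fit_line(xs, ys):
    """Least squares fit of y = intercept + slope * x for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

squad_values = [100, 200, 300, 400]  # made-up values, in millions of Euros
points = [35, 45, 55, 65]            # made-up points, chosen to sit exactly on a line

slope, intercept = fit_line(squad_values, points)
print(round(slope, 3), round(intercept, 3))
```

In this toy data every extra 100m is worth 10 points, so the slope is 0.1 points per million.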

We now need to test the model by checking predictions from the trained model against the test data that we know is true.

In [None]:
predictions = lm.predict(X_test)

In [None]:
plt.scatter(X_test, y_test,  color='purple')
plt.plot(X_test, predictions, color='green', linewidth=3)
plt.title("EPL Squad value vs points - Model One")

plt.show()

In [None]:
#Actual points (x-axis) against predicted points (y-axis)
plt.scatter(y_test,predictions)

Next, we plot a histogram of the differences between the true data and the predictions.

In [None]:
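plt.title('How many points out is each prediction?')
#sns.distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot((y_test-predictions).flatten(), bins=50, color='purple', kde=True)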

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, predictions))
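Mean absolute error is simply the average size of the misses, ignoring whether the prediction was too high or too low. A minimal sketch of that calculation in plain Python, using made-up numbers and a hypothetical helper name (`mae_by_hand`):

```python
def mae_by_hand(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [40, 50, 60, 70]      # made-up true points totals
predicted = [44, 47, 60, 75]   # made-up model predictions

print(mae_by_hand(actual, predicted))
```

The misses here are 4, 3, 0 and 5 points, so the predictions are out by 3 points on average.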

In [None]:
df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': predictions.flatten()})
df.head()

In [None]:
df['Actual'].corr(df['Predicted'])

_Improving the model_

When we took an exploratory look at the data, we found that squad values had increased over the seasons. As such, comparing a €100m squad in 2008 to a €100m squad in 2018 probably isn’t fair.

To counter this, we are going to create a new ‘Relative Value’ column. This will take each team in a season and divide its squad value by the highest squad value in that season. These values will fall between 0 and 1, with the most expensive squad at 1, and should give a better impression of comparative buying power, and hence of expected performance in the league.

In [None]:
#Find the highest squad value within each season, aligned to every row
season_max = data.groupby('Season')['Squad Value'].transform('max')

#Divide each row's value by its season's maximum and store it in a new column
data["Relative Value"] = data['Squad Value'] / season_max

data.head()
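A tiny worked example shows why the relative figure is easier to compare across seasons. The four rows below are made up for illustration and use plain Python dictionaries rather than the real dataframe.

```python
# Made-up squads: values in millions of Euros
teams = [
    {"Season": 2008, "Team": "A", "Squad Value": 400},
    {"Season": 2008, "Team": "B", "Squad Value": 100},
    {"Season": 2018, "Team": "A", "Squad Value": 800},
    {"Season": 2018, "Team": "B", "Squad Value": 200},
]

# Highest squad value seen in each season
season_max = {}
for team in teams:
    season = team["Season"]
    season_max[season] = max(season_max.get(season, 0), team["Squad Value"])

# Each squad's value as a share of its own season's maximum
relative_values = [team["Squad Value"] / season_max[team["Season"]] for team in teams]
print(relative_values)
```

Team B's squad value doubles between the two seasons, but relative to the most expensive squad it stays at a quarter in both.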

Let’s use a pairplot to check out the new column’s relationship with the others.

In [None]:
sns.pairplot(data[['GD', 'Squad Value', 'Relative Value', 'Points', 'Position']])

In [None]:
#Assign relevant columns to variables and reshape them
X = data['Relative Value']
y = data['Points']
X = X.values.reshape(-1,1)
y = y.values.reshape(-1,1)

#Create training and test sets for each of the two variables
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101)

#Create an empty model, then train it against the variables
lm = LinearRegression()
lm.fit(X_train,y_train)

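We'll look at the coefficient to see what our model tells us to expect. The coefficient is the change in points for moving across the full 0 to 1 range of relative value, so we’ll divide it by 10 to see how many points we should gain by increasing our squad value by 10% of the most expensive team's value.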

In [None]:
print(lm.coef_/10)

In [None]:
predictions = lm.predict(X_test)

In [None]:
plt.scatter(X_test, y_test,  color='purple')
plt.plot(X_test, predictions, color='green', linewidth=3)
plt.title("Relative Squad value vs points - Model Two")
plt.show()

The model predicts just over 5 points for each 10% step. This seems to make sense: scaled up to the full range of relative value that is around 53 points, and the difference between top and bottom would often be in that region.

So for every 10% of the most expensive team's value that your squad falls short by, our model suggests that you should expect to drop 5.3 points.
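The arithmetic behind that reading is just the coefficient multiplied by the gap in relative value. A small sketch, assuming a coefficient of 53 as quoted in the text; `expected_points_gap` is a hypothetical helper, and in the notebook the coefficient would be read from the fitted model rather than typed in.

```python
def expected_points_gap(coef, relative_gap):
    """Points difference a linear model implies for a given gap in relative value."""
    return coef * relative_gap

coef = 53.0  # assumed for illustration, taken from the figure quoted in the text

print(round(expected_points_gap(coef, 0.10), 1))  # squad 10% behind the top value
print(round(expected_points_gap(coef, 0.50), 1))  # squad at half the top value
```

Under this assumption a squad worth half of the most expensive one would be expected to finish about 26.5 points behind it. Keep in mind this is a straight-line extrapolation, so it says nothing about how well the line fits at the extremes.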

Let’s run the same tests as before to check out whether or not this new model performs better.

In [None]:
#Actual points (x-axis) against predicted points (y-axis)
plt.scatter(y_test,predictions)

In [None]:
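plt.title('How many points out is each prediction?')
#sns.distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot((y_test-predictions).flatten(), bins=50, color='purple', kde=True)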

In [None]:
print('MAE:', metrics.mean_absolute_error(y_test, predictions))

So that’s nearly an 8% improvement in mean absolute error over the first model…
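A percentage improvement like this is the drop in error divided by the original error. A quick sketch of that calculation follows; the two MAE values are hypothetical placeholders chosen to give a round answer, not the notebook's actual outputs.

```python
def relative_improvement(old_error, new_error):
    """Share of the old error that the new model removes."""
    return (old_error - new_error) / old_error

mae_model_one = 10.0  # hypothetical placeholder, not the real Model One MAE
mae_model_two = 9.2   # hypothetical placeholder, not the real Model Two MAE

improvement = relative_improvement(mae_model_one, mae_model_two)
print(f"{improvement:.1%}")
```

To reproduce the figure for the real models, store each model's MAE in its own variable when it is printed and pass the two values to the same function.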