# <font color='darkred'> Model</font>

**In this notebook I will train different models on the dataset. I will perform a grid search to find out which hyperparameters give the best model. I will use both classification models and regression models. Since the regression models give floats instead of integers as output, I will round their predictions to the closest integer. Lastly, I will use the best model to make predictions on the 2020 games and create a dataframe with these predictions. This dataframe will be the input to an HTML file that I have created, which predicts the score of a match when two teams are entered.**

---


**Importing useful packages**

In [1]:
# Import all the necessary libraries.
import pandas as pd
from pandas import DataFrame
import numpy as np
import pickle
import random as rn

# Making sure the results are reproducible
np.random.seed(40)
rn.seed(40)

# Tools
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV

# Model_packages:
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVC, SVR
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.dummy import DummyRegressor, DummyClassifier
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression, SGDClassifier, Lasso, ElasticNet

#When training several different models, I want to avoid the mess of some warnings
import warnings
warnings.filterwarnings('ignore')
pd.set_option("display.max_columns", None)

**Reading the CSV file created during data preparation. I separate the 2020 games from the rest, since I cannot train, validate or test on matches whose scores are not known.**

In [2]:
#Reading the dataset
imported_df = pd.read_csv("cleaned_data.csv")

get_dummy_df = pd.get_dummies(imported_df, columns = ['Team', 'Opponent', 'Home_Away'])
data = (get_dummy_df.loc[(get_dummy_df['Season'] != 2020)])
data = data.drop(['Season' , 'Match ID'], axis = 1)

data_20 = get_dummy_df.loc[(get_dummy_df['Season'] == 2020)]
data_20 = data_20.drop(['Goals', 'Season', 'Match ID'], axis = 1)
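
The reason for applying the dummy encoding before the 2020 rows are split off can be illustrated with a small sketch. The toy frame and its values are made up for illustration; only the column names mirror the notebook. Encoding first means that both parts end up with identical feature columns, even for a team that only appears in 2020.

```python
import pandas as pd

# Hypothetical toy data: team C only appears in the 2020 season
toy_df = pd.DataFrame({
    'Season': [2019, 2019, 2020],
    'Team': ['A', 'B', 'C'],
    'Goals': [1.0, 2.0, None],
})

# One-hot encode BEFORE splitting, as in the cell above
toy_encoded = pd.get_dummies(toy_df, columns=['Team'])

toy_known = toy_encoded[toy_encoded['Season'] != 2020].drop(['Season'], axis=1)
toy_2020 = toy_encoded[toy_encoded['Season'] == 2020].drop(['Goals', 'Season'], axis=1)

# The feature columns line up, so a model trained on toy_known can predict on toy_2020
feature_cols = [c for c in toy_known.columns if c != 'Goals']
print(feature_cols == list(toy_2020.columns))
```

If the encoding were done separately on each part, the 2020 frame could gain or lose columns and the model would refuse to predict on it.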

**Splitting into features (X) and label (y), and using train_test_split to create training, validation and test sets**

In [3]:
X = pd.get_dummies(data.loc[:, data.columns != 'Goals'])
y = pd.DataFrame(data['Goals'].values) # Goals is what I want to predict

In [4]:
#Splitting my dataset into train(70%), val(15%) and test(15%) data 
X_train, X_valtest, y_train, y_valtest = train_test_split(X, y, test_size=0.3, random_state = 50, shuffle=False)
X_val, X_test, y_val, y_test = train_test_split(X_valtest, y_valtest, test_size=0.5, random_state = 50, shuffle=False)
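
As a rough sanity check of the 70/15/15 split, the same two calls can be run on made-up data. This is only a sketch: the arrays below are hypothetical. Note that with `shuffle=False` the rows are cut in their original order, so `random_state` has no effect on the result.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows whose label equals the row number
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.arange(100)

# First cut: 70% train, 30% left over
X_tr, X_rest, y_tr, y_rest = train_test_split(X_demo, y_demo, test_size=0.3, shuffle=False)
# Second cut: the leftover is halved into validation and test
X_va, X_te, y_va, y_te = train_test_split(X_rest, y_rest, test_size=0.5, shuffle=False)

split_sizes = (len(X_tr), len(X_va), len(X_te))
print(split_sizes)
print(y_tr[-1], y_va[0], y_te[0])  # last train row, first validation row, first test row
```

Because the order is preserved, the validation and test sets consist of the last rows of the file.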

### Dummy models
**We always want to test whether a simple model performs satisfactorily. Before I test multiple models, with dummy models among them, I want to find out how such simple baselines perform. I create them from scratch by counting the number of times each score appears and finding the most common values. I use the matches from the training set to pick the constants and predict the score for the validation set. Finally, I compute the root mean squared error for each of these models.**

In [5]:
print(y_train.value_counts())
print('the mean of goals is: ', round(y_train.mean(), 2))

1    158
0    148
2    127
3     62
4     24
5     19
6     18
9      1
dtype: int64
the mean of goals is:  0    1.63
dtype: float64


**'1' is the most common score, and the average is above 1.5. Predicting the mean is also a common baseline, but since a football score must always be an integer, the mean itself would not be a valid prediction.**

**I therefore test always predicting 0, 1 and 2 below:**

In [6]:
y_val2_0 = np.zeros(len(y_val))
y_val2_1 = np.ones(len(y_val))
y_val2_2 = np.full(len(y_val), 2.0)

In [7]:
print('The RMSE of always predicting 0 is: ', mean_squared_error(y_val, y_val2_0, squared = False))
print('The RMSE of always predicting 1 is: ', mean_squared_error(y_val, y_val2_1, squared = False))
print('The RMSE of always predicting 2 is: ', mean_squared_error(y_val, y_val2_2, squared = False))

The RMSE of always predicting 0 is:  2.0681660775599595
The RMSE of always predicting 1 is:  1.5062893357603013
The RMSE of always predicting 2 is:  1.5034973234697402


- Since football scores have a high variance and are quite difficult to predict, an RMSE of about 1.5 might not be a very bad result. Guessing 1 in every match might therefore score reasonably well, but it is not the most useful tool.
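
The numbers above follow directly from the definition $\mathrm{RMSE} = \sqrt{\tfrac{1}{n}\sum_i (y_i - c)^2}$ for a constant prediction $c$. A minimal sketch on made-up scores (not taken from the dataset), where `constant_rmse` is a hypothetical helper name:

```python
import numpy as np

def constant_rmse(y_true, constant):
    """RMSE of always predicting the same value."""
    y_true = np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean((y_true - constant) ** 2)))

toy_scores = [0, 1, 1, 2, 3]  # hypothetical goals, not taken from the dataset
toy_mean = float(np.mean(toy_scores))

for c in (0, 1, 2, toy_mean):
    print(c, round(constant_rmse(toy_scores, c), 4))
```

Among all constants, the mean gives the lowest RMSE, which is why the unrounded mean is a natural baseline even though it is not a valid score.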

---
# Model building

I have defined a number of different models, each with a grid of hyperparameters to search over. They are stored in a dictionary where the key is the name of the model and the value is a tuple with the estimator as the first element and a dictionary of hyperparameter values as the second.

**WHY BOTH CLASSIFICATION- AND REGRESSION MODELS**

**The reason we can use both classification and regression models is that we use continuous data to predict a score that does not have infinitely many possible outcomes. For football scores, it rarely makes sense to predict a score larger than, say, 5. Therefore, classification models will be able to predict which of [0, 1, 2, 3, 4, 5] the score will be. Because the output has a hierarchy, meaning that a score of 3 is larger than a score of 2, and a score of 2 is larger than a score of 0, the regression models will also be able to make a prediction. The regression models will, however, not give the answer as an integer, but I can round the prediction to the closest integer to get a reasonable prediction.**
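
The rounding step can be sketched as follows. The raw predictions are made up, and the clipping to non-negative values is an extra assumption that the notebook itself does not apply. Note that `np.around` rounds halves to the nearest even integer, so 0.5 becomes 0 and 2.5 becomes 2.

```python
import numpy as np

# Hypothetical raw output from a regression model
raw_predictions = np.array([-0.3, 0.5, 1.5, 2.49, 2.5, 3.7])

# Round to the closest integer (halves go to the nearest even value)
rounded = np.around(raw_predictions)

# Assumption for illustration: a score can never be negative, so clip at 0
goals = np.clip(rounded, 0, None).astype(int)
print(goals)
```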

**Prediction for model performance**

Since we are trying to find the model with the lowest loss, I expect the regression models to perform better overall. This is also because of the aforementioned fact that the target is ordered: a classification model treats the scores as unrelated classes and will not always try to predict a value close to the true score.

I also believe the dummy models might perform well, since football scores vary to a large degree, which makes it hard to be precise.

I will train 18 models: 3 dummy models, 7 models in the classification group and 8 models in the regression group.

In [8]:
models = {
# DUMMY MODELS 
    
    'Dummy_Regressor (Mean)' : (DummyRegressor(strategy = 'mean'), {}),
    
    'Dummy_Regressor (Median)' : (DummyRegressor(strategy = 'median'), {}),
    
    # I only add a constant of 0, because after rounding the mean and median strategies already give constant predictions of 2 and 1 respectively
    'Dummy_Regressor (Constant_0)' : (DummyRegressor(strategy = 'constant', constant = 0), {}),
      

    
# CLASSIFICATION MODELS
    'Logistic Regression' : (LogisticRegression(), {}), 
    
    
    'Stochastic Gradient Descent' : (SGDClassifier(loss = 'modified_huber', shuffle = True, random_state = 100), {}),
    
    'k Nearest Neighbour' : (KNeighborsClassifier(), {
        'n_neighbors' : [1, 3, 5, 7]}),
    
    
    'DecisionTree Classifier' : (DecisionTreeClassifier(), {
        'criterion' : ["entropy", "gini"],
        'splitter' : ["random", "best"],
        'min_samples_split' : [2, 3, 5, 7],  # an integer value must be at least 2
        'min_samples_leaf': [1, 3, 5, 7]}),
    
    'RFC' : (RandomForestClassifier(), {
        'n_estimators': [100, 500, 800],
        'max_features': ['sqrt', 'log2']}),  # 'auto' meant 'sqrt' for classifiers and is removed in newer scikit-learn
    
    'SVC' : (SVC(), { 
        'gamma': [0.1, 0.01, 0.001, 0.0005], 
        'C': [5, 15, 75, 150]}),
    
    'Naive Bayes (GaussianNB)' : (GaussianNB(), {
        'var_smoothing': np.logspace(0, -9, num = 100)
    }),
    
    
# REGRESSION MODELS (Need to be rounded to create plausible predictions)  
    
    'Kernel_Ridge' : (KernelRidge(alpha=0.1, kernel='linear', gamma=0.1), {
        'alpha' :  [0.1, 0.5, 1, 5],
        'kernel' : ['linear', 'rbf', 'sigmoid', 'poly'],
        'gamma' : [0.1, 0.5, 1, 5],
        'degree' : [3]}),
    
    'Lasso' : (Lasso(), {
        'alpha': [0.005, 0.01, 0.1, 1, 5],
        'fit_intercept': [False, True],
        'max_iter': [50, 750, 7500]}),
    
    'RFR' : (RandomForestRegressor(), {
        'max_depth': [10, 50, 75, 100, 150],
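        'max_features': [None, 'sqrt', 'log2'],  # None uses all features, which is what 'auto' meant for regressors
        'min_samples_leaf': [1, 3, 5, 7],
        'min_samples_split': [2, 10],  # an integer value must be at least 2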
        'n_estimators': [10, 50, 100, 150]}),
    
    'Linear_Regression' : (LinearRegression(), {}),

    'SVR' : (SVR(), {
        'gamma': [0.01, 0.001, 0.0005],
        'C': [5, 15, 75, 150],}),
    
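    # Note: MultinomialNB is strictly a classifier, even though it sits in this group; it also requires non-negative features
    'Multinomial_Naive_Bayes' : (MultinomialNB(), {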
        'alpha' : [0.05, 0.5, 1, 5, 10],
        'fit_prior': [True, False]}),
    
    'Decision_Tree_Regressor' : (DecisionTreeRegressor(), {
        'criterion' : ['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],
        'splitter' : ['best', 'random'],
        'min_samples_split' : [2, 5],  # an integer value must be at least 2
        'min_samples_leaf': [1, 2, 5]}),

    'KNeighbors_Regressor' : (KNeighborsRegressor(), {
        'n_neighbors' : [1, 3, 5, 7]})
}

**The reason behind the choice of hyperparameters and their values is that these are the most common hyperparameters and values for each model.**

**Grid searching for the best hyperparameters**

In [9]:
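model_grid = {}
for model_name, model_info in models.items():
    model, parameters = model_info
    # Cross-validated grid search; each model is scored with its own default metric
    alternative_model = GridSearchCV(estimator = model, param_grid = parameters, n_jobs = -1)
    # ravel() turns the single-column label frame into the 1-D array scikit-learn expects
    alternative_model.fit(X_train, y_train.values.ravel())
    model_grid[model_name] = alternative_model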

The run time for the model fitting is about 4 minutes.
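
By default, GridSearchCV ranks parameter combinations with the estimator's own `score` method (accuracy for classifiers, R² for regressors). Since the goal here is a low RMSE, one option is to pass `scoring='neg_root_mean_squared_error'`. The sketch below shows the same dictionary pattern on synthetic data; the data, the two-model dictionary and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data, made up for this sketch
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo @ np.array([1.0, 0.0, -2.0]) + rng.normal(scale=0.1, size=60)

# Same structure as `models`: name -> (estimator, parameter grid)
demo_models = {
    'Lasso': (Lasso(), {'alpha': [0.01, 0.1, 1.0]}),
    'KNN': (KNeighborsRegressor(), {'n_neighbors': [1, 3, 5]}),
}

demo_grid = {}
for name, (estimator, params) in demo_models.items():
    search = GridSearchCV(estimator, params,
                          scoring='neg_root_mean_squared_error', cv=3)
    search.fit(X_demo, y_demo)
    demo_grid[name] = search

for name, search in demo_grid.items():
    # best_score_ is the negated RMSE, so flip the sign for reading
    print(name, search.best_params_, round(-search.best_score_, 3))
```

With this scoring, the hyperparameters are chosen by the same kind of loss that is later used to compare the models on the validation set.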

**The goal of this assignment is to find the model with the lowest root mean squared error (RMSE). In the next cell I calculate this on the validation set for the best hyperparameters of each model.**

In [10]:
# Root mean squared error for models on validation data
rmse_dict = {}
for model_name, alternative_model in model_grid.items():
    rmse_dict[model_name] = round(mean_squared_error(y_val, np.around(alternative_model.predict(X_val)), squared = False),6)

rmse_table = pd.DataFrame(rmse_dict, index = ["Loss - RMSE"])

rmse_table = rmse_table.transpose().sort_values(by="Loss - RMSE")
rmse_table

| Model | Loss - RMSE |
|---|---|
| Lasso | 1.36277 |
| RFR | 1.402279 |
| KNeighbors_Regressor | 1.417181 |
| Kernel_Ridge | 1.420143 |
| SVR | 1.452324 |
| Linear_Regression | 1.469579 |
| Decision_Tree_Regressor | 1.472436 |
| Dummy_Regressor (Mean) | 1.503497 |
| Dummy_Regressor (Median) | 1.506289 |
| Stochastic Gradient Descent | 1.509076 |


**From the table we can see that the top seven models were all regression models, while the classification models performed considerably worse overall: the best classification model, Stochastic Gradient Descent, only comes in at tenth place. It is alarming that the DummyRegressor with the mean strategy was the 8th best model, with an RMSE only about 0.14 above the best model. This may imply that football scores are too random for a machine learning model to predict well.**
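
The rounded RMSE in the table can also be computed without the `squared` argument, which newer scikit-learn releases have deprecated. A minimal sketch, where `rounded_rmse` is a hypothetical helper name and the numbers are made up:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rounded_rmse(y_true, y_pred):
    """RMSE after rounding the predictions to the nearest integer."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.around(np.asarray(y_pred, dtype=float).ravel())
    # Take the square root of the MSE ourselves instead of passing squared=False
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

demo_true = [0, 1, 2, 3]
demo_pred = [0.4, 1.6, 2.2, 1.0]  # rounds to 0, 2, 2, 1
print(round(rounded_rmse(demo_true, demo_pred), 4))
```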

In [11]:
best_model = model_grid[rmse_table.index[0]]

print('The best model on the validation data was:', best_model.best_estimator_)

The best model on the validation data was: Lasso(alpha=0.005, fit_intercept=False, max_iter=750)


---

## Best model on test data

**Lastly, I want to test the model's ability to generalize by evaluating it on the unseen test data**

In [12]:
print('The root mean square error on the test set is:', round(mean_squared_error(y_test, np.around(best_model.predict(X_test)), squared = False),4))

The root mean square error on the test set is: 1.5193


### Comments about best model

**The best model's ability to generalise to new data is represented by its score on the test set. As we can see, the RMSE has risen from 1.36 on the validation set to 1.52 on the test set, which is probably a result of overfitting to the validation set. When as many models are compared as I have done here, it is natural that some of them perform better on the validation set than they would on new data.**
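
This selection effect can be illustrated with a small simulation. It is only a sketch under strong assumptions: all 18 models are given the same true RMSE, and the validation and test scores are that value plus independent Gaussian noise. None of the numbers come from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_models = 2000, 18
true_rmse, noise_sd = 1.45, 0.05  # assumed values, chosen for illustration only

# Noisy validation and test scores for equally good models
val_scores = true_rmse + rng.normal(0, noise_sd, size=(n_trials, n_models))
test_scores = true_rmse + rng.normal(0, noise_sd, size=(n_trials, n_models))

# In each trial, pick the model with the lowest validation RMSE
winners = val_scores.argmin(axis=1)
rows = np.arange(n_trials)
mean_val_of_winner = val_scores[rows, winners].mean()
mean_test_of_winner = test_scores[rows, winners].mean()

print(round(mean_val_of_winner, 3), round(mean_test_of_winner, 3))
```

The selected model looks better on the validation set than on the test set even though, by construction, no model is actually better than the others.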

Saving the model to disk:

In [13]:
# Saving the model; the with-block makes sure the file is closed afterwards
with open('model_best.pkl', 'wb') as f:
    pickle.dump(best_model, f)
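
Loading the model back works the same way in reverse. A minimal sketch with a stand-in model and a temporary file (both are assumptions for illustration; in the notebook the file is `model_best.pkl` and the object is the fitted grid search):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import Lasso

# Stand-in for best_model, fitted on made-up data
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0.0, 1.0, 2.0, 3.0])
demo_model = Lasso(alpha=0.005).fit(X_demo, y_demo)

# Write the model to a temporary file
path = os.path.join(tempfile.mkdtemp(), 'model_demo.pkl')
with open(path, 'wb') as f:
    pickle.dump(demo_model, f)

# Read it back and check that it predicts the same values
with open(path, 'rb') as f:
    restored_model = pickle.load(f)

same_predictions = bool(np.allclose(demo_model.predict(X_demo),
                                    restored_model.predict(X_demo)))
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and preferably with the same scikit-learn version that was used to save them.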

---
# Predictions on 2020 set (Predictions.csv)

**In the cells below I create the file 'predictions.csv', which is part of the hand-in and which I will also use as input to the HTML file.**

In [14]:
# The input data for the model is the 2020 rows from the .csv file imported at the top of the notebook.
predicted20 = np.around(best_model.predict(data_20))
# .copy() gives an independent frame, which avoids pandas' SettingWithCopyWarning below
output_df = imported_df.loc[(imported_df['Season'] == 2020)].copy()
# Adding the score column, which holds the predicted scores for 2020 as integers
output_df['Score'] = predicted20.astype(int)
output_df = output_df[['Team', 'Opponent', 'Home_Away', 'Score', 'Match ID']]

**Splitting the dataframe into two tables, one with the home rows and one with the away rows:**

In [15]:
output_df1 = output_df[output_df['Home_Away'] == 'Home']
output_df2 = output_df[output_df['Home_Away'] == 'Away']

**Merging the two dataframes so that the result contains one row per match:**

In [16]:
output_df = pd.merge(output_df1, output_df2, left_on='Match ID', right_on='Match ID', how='left', suffixes=('_Home', '_Away')).drop(['Home_Away_Away', 'Home_Away_Home', 'Opponent_Home', 'Opponent_Away'], axis=1)
output_df['Score'] = output_df["Score_Home"].astype(str) + '-' + output_df["Score_Away"].astype(str)
output_df = output_df[['Match ID', 'Team_Home', 'Team_Away', 'Score']]
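
The split-and-merge step above can be checked on a tiny example. The two matches and their scores below are made up; only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical data: two rows per match, one from each team's point of view
toy = pd.DataFrame({
    'Match ID': [1, 1, 2, 2],
    'Team': ['A', 'B', 'C', 'D'],
    'Opponent': ['B', 'A', 'D', 'C'],
    'Home_Away': ['Home', 'Away', 'Home', 'Away'],
    'Score': [2, 1, 0, 3],
})

home = toy[toy['Home_Away'] == 'Home']
away = toy[toy['Home_Away'] == 'Away']

# One row per match: home columns get the _Home suffix, away columns _Away
merged = pd.merge(home, away, on='Match ID', how='left', suffixes=('_Home', '_Away'))
merged['Score'] = merged['Score_Home'].astype(str) + '-' + merged['Score_Away'].astype(str)
toy_result = merged[['Match ID', 'Team_Home', 'Team_Away', 'Score']]
print(toy_result)
```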

**The entire table:**

In [17]:
pd.set_option("display.max_rows", None)
#Showing every single match in the dataframe below
output_df

| | Match ID | Team_Home | Team_Away | Score |
|---|---|---|---|---|
| 0 | 399 | Trondheims-Ørn | LSK Kvinner | 1-3 |
| 1 | 400 | Røa | Arna-Bjørnar | 2-2 |
| 2 | 401 | Avaldsnes | Lyn | 2-1 |
| 3 | 402 | Sandviken | Vålerenga | 1-1 |
| 4 | 403 | Klepp | Kolbotn | 2-0 |
| 5 | 404 | Kolbotn | Avaldsnes | 1-1 |
| 6 | 405 | Vålerenga | Røa | 2-1 |
| 7 | 406 | Lyn | Trondheims-Ørn | 2-2 |
| 8 | 407 | Arna-Bjørnar | Klepp | 1-1 |
| 9 | 408 | LSK Kvinner | Sandviken | 3-1 |

**Exporting the dataframe to csv**

In [18]:
output_df.to_csv("predictions.csv", index = False)

---

# Final Comments

I have now trained 18 different models, distributed among dummy models, classification models and regression models. I have tuned each model over different hyperparameters and found the model that performed best on the validation data. The risk of training a lot of different models is that one model might 'get lucky' and overperform on the validation set. If its RMSE on the test set is much larger, this might have happened.

The model that performed best was the **Lasso** model with hyperparameters **alpha = 0.005**, **fit_intercept = False** and **max_iter = 750**. Lasso is an abbreviation for 'Least Absolute Shrinkage and Selection Operator'. I will not go too deeply into the theory behind this model, but try to give a brief overview of the hyperparameters. Lasso is a linear regression model that tries to avoid overfitting by adding a penalty on the absolute size of the coefficients, which shrinks them and can set some of them to exactly zero, so that fewer variables are used. alpha specifies how strong this penalty is. Here it is 0.005, which is the lowest of the options I gave alpha in the grid search. fit_intercept = False means that the model is fitted without an intercept term. max_iter = 750 means that the solver runs at most 750 iterations and then stops, even if it has not converged by then.
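
To make the role of alpha more concrete: scikit-learn's Lasso minimises $\frac{1}{2n}\lVert y - Xw\rVert_2^2 + \alpha\lVert w\rVert_1$. The sketch below fits the model on synthetic data (made up for illustration, not the match data) with a small and a large alpha and counts the non-zero coefficients.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first two of five features influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

nonzero_counts = {}
for alpha in (0.005, 1.0):
    lasso = Lasso(alpha=alpha, max_iter=750).fit(X_demo, y_demo)
    # Count how many coefficients the penalty has left different from zero
    nonzero_counts[alpha] = int(np.count_nonzero(lasso.coef_))

print(nonzero_counts)
```

A larger alpha pushes more coefficients to exactly zero, while a small alpha such as 0.005 keeps the model close to ordinary least squares.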
