# PART 2: Model

**Why are the data normalized?**
Normalization is a common technique in the data preparation stage of machine learning.
The goal of normalization is to bring the values of numeric columns in the dataset to a common scale without distorting the differences in the ranges of values.
Here, columns such as `Age`, `Goals90min`, and `GoalDifference` have noticeably different ranges.  

An attempt was also made without normalized data, and the models performed worse.
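As a rough illustration of what min-max normalization does, here is a minimal pure-Python sketch. The helper `min_max_scale` and the sample values are hypothetical and only show how two columns with very different ranges end up on the same [0, 1] scale; the notebook itself relies on `MinMaxScaler`.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical values, chosen only to show two very different ranges
ages = [22.0, 26.0, 30.0]
goal_differences = [-40.0, 0.0, 40.0]

scaled_ages = min_max_scale(ages)
scaled_goal_differences = min_max_scale(goal_differences)
```

After scaling, both columns span the same interval, so neither dominates purely because of its units.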

## Baseline
*Always start with a stupid model, no exceptions*

A baseline is a model that is both simple to set up and has a reasonable chance of providing decent results. It is usually quick and cheap to build.

In this project I use `DummyRegressor`, a regressor that makes predictions using simple rules.

According to the documentation, this regressor is useful as a simple baseline to compare with other (real) regressors.
I use the **mean** strategy to generate predictions.
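To make the idea of a mean baseline concrete, the sketch below reproduces it in plain Python. The function names and the sample scores are hypothetical; the point is only that the prediction ignores the features and repeats the per-target mean of the training scores.

```python
def fit_mean_baseline(y_train):
    """Return the per-target means of the training scores."""
    n = len(y_train)
    n_targets = len(y_train[0])
    return [sum(row[j] for row in y_train) / n for j in range(n_targets)]


def predict_mean_baseline(means, n_samples):
    """Predict the same mean vector for every sample, ignoring the features."""
    return [list(means) for _ in range(n_samples)]


# Hypothetical (home, away) scores, for illustration only
y_train = [[2, 1], [0, 0], [1, 2], [1, 1]]
means = fit_mean_baseline(y_train)
predictions = predict_mean_baseline(means, n_samples=2)
```

Any real model should beat this on held-out data; if it does not, the features are probably not adding information.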

## Random Forest
**Random forest** is a supervised machine learning algorithm that grows and combines multiple decision trees to create a “forest.” 

1. Random Forest grows multiple decision trees, which are merged together for a more accurate prediction.
2. The underlying logic is that multiple uncorrelated models (trees) perform much better as a group than they do alone.
3. For classification, each tree gives a classification, or “vote.”
4. The forest chooses the classification with the majority of the “votes.”
5. When using Random Forest for regression, the forest picks the average of the outputs of all trees.
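The aggregation step described in the list above can be sketched as follows. This is only an illustration of majority voting and averaging, with hypothetical tree outputs; it does not train any trees and is not how `RandomForestRegressor` is implemented internally.

```python
from collections import Counter


def forest_classify(tree_votes):
    """Return the class that receives the most votes from the trees."""
    return Counter(tree_votes).most_common(1)[0][0]


def forest_regress(tree_outputs):
    """Return the average of the trees' numeric outputs."""
    return sum(tree_outputs) / len(tree_outputs)


# Hypothetical outputs of five already-trained trees
label = forest_classify(["win", "draw", "win", "loss", "win"])
score = forest_regress([1.0, 2.0, 1.0, 3.0, 3.0])
```

Since this project predicts scores, the regression variant (averaging) is the one that applies here.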

**Why random forest?**
- A random forest is generally more accurate than a single decision tree.
- It reduces the problem of over-fitting.
- It can cope with missing data and usually maintains the model's accuracy.
- That said, a random forest can still be less efficient than a neural network.

## Multiple Linear Regression
**Linear regression**
In this exercise the response variables (`ScoreHome` and `ScoreAway`) are affected by more than one predictor variable, so the **Multiple Linear Regression** algorithm is used.

Multiple Linear Regression is an extension of Simple Linear Regression: it uses more than one predictor variable to predict the response variable.

I used this approach to see how a regression model, whose output is a continuous variable, would perform. Regression models are used to predict continuous data such as home prices, temperature, or profits. The task is not a classification problem, since the output is neither binary nor drawn from a set of predefined labels.
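A fitted multiple linear regression predicts with an intercept plus a weighted sum of the predictors. The sketch below shows that calculation with hypothetical coefficients and feature values; in the notebook the actual coefficients are estimated by `LinearRegression`.

```python
def predict_linear(intercept, coefficients, features):
    """Return intercept + sum of (coefficient * feature value)."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# Hypothetical coefficients and feature values, for illustration only
intercept = 1.0
coefficients = [0.5, -0.25]  # one coefficient per predictor
features = [2.0, 4.0]

prediction = predict_linear(intercept, coefficients, features)
```

With two response variables, as here, the same calculation is simply repeated with a separate intercept and coefficient vector for each target.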

## K-Nearest Neighbors
The **KNN** algorithm can be used for both classification and regression problems, although it is most often used for classification.
The KNN algorithm uses feature similarity to predict the values of new data points: a new point is assigned a value based on how closely it resembles the points in the training set.
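For regression, "resembling the training points" can be read as averaging the targets of the closest neighbours. The following is a minimal sketch under that assumption, using Euclidean distance and uniform weights; the function name and sample data are hypothetical.

```python
import math


def knn_predict(train_X, train_y, query, k=3):
    """Average the targets of the k training points closest to `query`."""
    distances = [
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    ]
    # Sort by distance so the nearest points come first
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Hypothetical two-feature training points and targets
train_X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
train_y = [1.0, 2.0, 3.0, 10.0]

prediction = knn_predict(train_X, train_y, query=[0.1, 0.1], k=3)
```

Because the method depends on distances, it is sensitive to feature scale, which is one reason the normalization step matters for this model.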


In [None]:
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd

data = pd.read_csv('data.csv')
# Drop rows with missing values (teams that did not play in a given season)
data = data.dropna()


X = data[['NonPenaltyGoalsHome', 'AgeHome', 'Goals90minHome',
          'NonPenaltyGoalsAway', 'AgeAway', 'Goals90minAway', 'RankingPlaceHome',
          'GoalForHome', 'GoalAgainstHome', 'GoalDifferenceHome', 'PointsHome',
          'TopTeamScorerGoalsHome', 'RankingPlaceAway', 'GoalForAway',
          'GoalAgainstAway', 'GoalDifferenceAway', 'PointsAway',
          'TopTeamScorerGoalsAway', 'HomeT', 'AwayT', 'Season']]


normaliza = MinMaxScaler() 
X_normal = normaliza.fit_transform(X)
X = pd.DataFrame(X_normal)


y = data[['ScoreHome', 'ScoreAway']]

X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=0.6)

# Split the remaining 40% evenly into validation and test sets
test_size = 0.5
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rem, y_rem, test_size=test_size)


# Baseline model, followed by the candidate regressors
dummy_regr = DummyRegressor(strategy="mean")
model = LinearRegression()
model_KN = KNeighborsRegressor()
model_T = DecisionTreeRegressor()
model_RF = RandomForestRegressor(n_estimators = 1000, random_state = 42)
en = ElasticNet()

In [None]:
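def run_model_calc_errors(model, X_train, y_train, X_val, y_val, X_test, y_test):
    """Fit `model` and return [train MSE, validation MSE, test RMSE]."""
    model.fit(X=X_train, y=y_train)

    # Error on the data the model was fitted on
    y_pred_train = model.predict(X_train)
    MSE_train = mean_squared_error(y_train, y_pred_train)

    # Error on the validation set
    y_pred_val = model.predict(X_val)
    MSE_val = mean_squared_error(y_val, y_pred_val)

    # Error on the held-out test set, reported as RMSE
    y_pred_test = model.predict(X_test)
    MSE_test = mean_squared_error(y_test, y_pred_test)
    RMSE = np.sqrt(MSE_test)

    return [MSE_train, MSE_val, RMSE]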

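models = [model, model_KN, model_RF, model_T, en, dummy_regr]

# Collect one row of errors per model, in the same order as `models`.
# The loop variable is named `m` so that `model` keeps pointing at the
# LinearRegression instance, which is used again when building the CSV.
rows = []
for m in models:
    res = run_model_calc_errors(m, X_train, y_train, X_valid, y_valid, X_test, y_test)
    acc_score = m.score(X_test, y_test)  # R^2 on the test set (not reported below)
    rows.append({'MSE_Train': res[0], 'MSE_Valid': res[1], 'RMSE_Test': res[2]})

# DataFrame.append was removed in pandas 2.0, so build the frame in one go
error = pd.DataFrame(rows, columns=['MSE_Train', 'MSE_Valid', 'RMSE_Test'])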





error.rename(index={0:'MultipleLinearRegression',
                    1:'KNeighborsRegressor',
                    2: 'RandomForestRegressor',
                    3: 'DecisionTreeRegressor',
                    4: 'ElasticNet',
                    5:  'Dummy_regr'},inplace=True)

In [None]:
from IPython.display import display, HTML

display(HTML(error.to_html()))

A perfect MSE and RMSE value is 0.0, which means that all predictions matched the expected values exactly.

This is almost never the case, and if it happens, it suggests that the predictive modeling problem is trivial.

What counts as a good MSE or RMSE is relative to your specific dataset.
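For reference, both metrics can be computed by hand. The sketch below uses hypothetical goal counts and a plain-Python helper (not the `mean_squared_error` function from scikit-learn used in the notebook) to show how MSE and RMSE relate.

```python
import math


def mean_squared_error_plain(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical goal counts, for illustration only
y_true = [2.0, 0.0, 1.0, 3.0]
y_pred = [1.0, 0.0, 3.0, 3.0]

mse = mean_squared_error_plain(y_true, y_pred)
rmse = math.sqrt(mse)
```

RMSE is expressed in the same units as the target (goals, in this project), which makes it easier to interpret than MSE.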

Comparing the baseline with the other models shows that the dummy model is not the worst performer: the decision tree regressor gives even worse results according to the error on the validation dataset.

I choose `RandomForestRegressor` as the best-performing model of all those tried.

## Create CSV

In [None]:
# Load the fixtures for which predictions will be written to the CSV
test = pd.read_csv('2020/test.csv')

home = []
away = []
for ind in test.index:
    if (test['Venue'][ind]) == 'Home':
        home.append(test['Team'][ind])
        away.append(test['Opponent'][ind])
    else:
        home.append(test['Opponent'][ind])
        away.append(test['Team'][ind])

test['Home'] = home
test['Away'] = away
test = test.drop(columns=['Date', 'Team', 'Opponent', 'Venue'])

y_pred_Linear_csv_home = []
y_pred_Linear_csv_away = []

y_pred_KN_csv_home = []
y_pred_KN_csv_away = []

y_pred_T_csv_home = []
y_pred_T_csv_away = []

y_pred_RF_csv_home = []
y_pred_RF_csv_away = []

team_transformed = pd.DataFrame()
team_transformed['Home'] = data.Home
team_transformed['SquadNO'] = data.HomeT
description = team_transformed.drop_duplicates(
    ['Home', 'SquadNO'], keep='last')


# Get the X values that will be used by the models (needed to predict)
get_data = data[['Home', 'Away', 'NonPenaltyGoalsHome', 'AgeHome', 'Goals90minHome',
                 'NonPenaltyGoalsAway', 'AgeAway', 'Goals90minAway', 'RankingPlaceHome',
                 'GoalForHome', 'GoalAgainstHome', 'GoalDifferenceHome', 'PointsHome',
                 'TopTeamScorerGoalsHome', 'RankingPlaceAway', 'GoalForAway',
                 'GoalAgainstAway', 'GoalDifferenceAway', 'PointsAway',
                 'TopTeamScorerGoalsAway', 'HomeT', 'AwayT', 'Season']]


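for ind in test.index:
    get_h = test['Home'][ind]
    get_a = test['Away'][ind]
    # Look up the feature row(s) for this home/away pairing
    get = get_data[(get_data['Home'] == get_h) & (get_data['Away'] == get_a)]
    get = get.drop(columns=['Home', 'Away'])
    # The models were fitted on min-max scaled features, so apply the same
    # fitted scaler here before predicting
    get = normaliza.transform(get)
    pred1 = model.predict(get)
    pred2 = model_KN.predict(get)
    pred3 = model_T.predict(get)
    pred4 = model_RF.predict(get)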

    
    
    y_pred_Linear_csv_home.append(round(pred1[0][0]))    
    y_pred_Linear_csv_away.append(round(pred1[0][1]))   
    
    y_pred_KN_csv_home.append(round(pred2[0][0]))   
    y_pred_KN_csv_away.append(round(pred2[0][1]))

    y_pred_T_csv_home.append(round(pred3[0][0]))    
    y_pred_T_csv_away.append(round(pred3[0][1]))    
    
    y_pred_RF_csv_home.append(round(pred4[0][0]))    
    y_pred_RF_csv_away.append(round(pred4[0][1]))

test['HomeScore_tree'] = y_pred_T_csv_home
test['AwayScore_tree'] = y_pred_T_csv_away

test['HomeScore_Linear'] = y_pred_Linear_csv_home
test['AwayScore_Linear'] = y_pred_Linear_csv_away

test['HomeScore_KN'] = y_pred_KN_csv_home
test['AwayScore_KN'] = y_pred_KN_csv_away

test['HomeScore_RF'] = y_pred_RF_csv_home
test['AwayScore_RF'] = y_pred_RF_csv_away
test.to_csv(r'/home/edyta/git/INF161/Project/app/predictions.csv')