Milestone 5 - Model Training

Task 2: Perform a feature selection

This task will focus on the second step of the workflow mentioned in the previous task. Feature selection is a process that allows you to select a subset of the features in your dataset. This is useful if you have a lot of features and you don't want to use all of them. It's great if you have a model that overfits the data, and you want to reduce the number of features. Perform feature selection, so you can use the most important features to train your model. You can use LASSO regression for this task, but it's not recommended since it doesn't work very well with multiclass classification. Instead, you can simply take a look at the weights of your features and see which ones are important. Remove those that have low weights and check again the performance. Before you do that, you should check the performance of the model without feature selection on both training and testing sets, and observe if, by removing some features, the metrics on both sets get closer. Don't worry if you underfit right now, you will improve your model later.

In [40]:
import re
import pandas as pd
import os
import numpy as np
from csv import reader
import plotly.express as px
import missingno as msno
import pickle

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

from sklearn.preprocessing import MinMaxScaler
from numpy import set_printoptions

from sklearn.preprocessing import StandardScaler

pd.options.mode.chained_assignment = None

In [2]:
# READ IN cleaned_dataset.csv
full_pd = pd.read_csv("cleaned_dataset_b.csv")
full_pd

Unnamed: 0,League,Season,Round,Home_Team,Away_Team,Elo_home,Elo_away,HOMETEAM_HOME_GOAL_SO_FAR,HOMETEAM_AWAY_GOAL_SO_FAR,AWAYTEAM_HOME_GOAL_SO_FAR,AWAYTEAM_AWAY_GOAL_SO_FAR,HOME_LASTEST_GOAL_DIFF,AWAY_LASTEST_GOAL_DIFF,Result
0,championship,2021,4,Coventry City,AFC Bournemouth,46.0,62.0,3,2,4,2,0,2,0
1,championship,2021,4,Norwich City,Derby County,62.0,60.0,2,2,0,6,0,-7,0
2,championship,2021,4,Blackburn Rovers,Cardiff City,58.0,60.0,5,0,1,4,8,-1,0
3,championship,2021,4,Luton Town,Wycombe Wanderers,51.0,41.0,2,1,0,3,1,-8,1
4,championship,2021,4,Middlesbrough,Barnsley,61.0,46.0,1,1,0,1,-1,-3,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
111647,serie_b,1997,38,Pescara,Padova,59.0,54.0,32,15,22,15,5,-3,0
111648,serie_b,1997,38,Genoa,Palermo FC,61.0,58.0,33,12,24,24,2,-1,1
111649,serie_b,1997,38,Torino,Ravenna FC,63.0,54.0,27,23,22,18,-2,-2,0
111650,serie_b,1997,38,Salernitana,Reggina,52.0,52.0,20,7,23,18,-2,3,0


In [45]:
# Create functions to filter different league
def getLeagueData(data, league, season=None):
    if season is None:
        league_pd =  data[(data["League"]==league)]
    else:
        league_pd =  data[(data["League"]==league) & (data["Season"]==season)]
    return league_pd

In [47]:
def get_ELO_diff(record):
    hscore = record['Elo_home']
    ascore = record['Elo_away']
    return (hscore - ascore)

In [48]:
def get_recent_goal_diff_diff(record):
    hscore = record['HOME_LASTEST_GOAL_DIFF']
    ascore = record['AWAY_LASTEST_GOAL_DIFF']
    return hscore - ascore

In [49]:
def get_home_away_total_goal_diff(record):
    hgoal = record['HOMETEAM_HOME_GOAL_SO_FAR']
    agoal = record['AWAYTEAM_AWAY_GOAL_SO_FAR']
    return hgoal - agoal

In [52]:
model_pd = getLeagueData(full_pd, "eerste_divisie")
model_pd

Unnamed: 0,League,Season,Round,Home_Team,Away_Team,Elo_home,Elo_away,HOMETEAM_HOME_GOAL_SO_FAR,HOMETEAM_AWAY_GOAL_SO_FAR,AWAYTEAM_HOME_GOAL_SO_FAR,AWAYTEAM_AWAY_GOAL_SO_FAR,HOME_LASTEST_GOAL_DIFF,AWAY_LASTEST_GOAL_DIFF,Result


In [55]:
# load all directory as league name list
dir = "./Results"
leagues = [name for name in os.listdir(dir) if os.path.isdir(os.path.join(dir, name))]

# loop to open csv
result_with_goal_sofar_pd = pd.DataFrame()
for league in leagues:
    model_pd = getLeagueData(full_pd, league)
    model_pd = model_pd.dropna()

    if (model_pd.shape[0]==0):
        continue

    elo_diff_pd = model_pd.apply(get_ELO_diff, axis=1)
    model_pd.drop('Elo_home', inplace=True, axis=1)
    model_pd.drop('Elo_away', inplace=True, axis=1)
    model_pd.insert(loc=5, column="ELO_DIFF", value=elo_diff_pd.astype('Int64')) 
    
    recent_perf_diff_pd = model_pd.apply(get_recent_goal_diff_diff, axis=1)
    model_pd.drop('HOME_LASTEST_GOAL_DIFF', inplace=True, axis=1)
    model_pd.drop('AWAY_LASTEST_GOAL_DIFF', inplace=True, axis=1)
    model_pd.insert(loc=6, column="RECENT_PERF_DIFF", value=recent_perf_diff_pd.astype('Int64')) 

    goal_diff_pd = model_pd.apply(get_home_away_total_goal_diff, axis=1)
    model_pd.drop('HOMETEAM_HOME_GOAL_SO_FAR', inplace=True, axis=1)
    model_pd.drop('HOMETEAM_AWAY_GOAL_SO_FAR', inplace=True, axis=1)
    model_pd.drop('AWAYTEAM_HOME_GOAL_SO_FAR', inplace=True, axis=1)
    model_pd.drop('AWAYTEAM_AWAY_GOAL_SO_FAR', inplace=True, axis=1)
    model_pd.insert(loc=7, column="HOME_AWAY_GOAL_DIFF", value=recent_perf_diff_pd.astype('Int64')) 

    # delete no value column
    model_pd.drop('League', inplace=True, axis=1)
    model_pd.drop('Season', inplace=True, axis=1)
    model_pd.drop('Round', inplace=True, axis=1)
    model_pd.drop('Home_Team', inplace=True, axis=1)
    model_pd.drop('Away_Team', inplace=True, axis=1)

    array = model_pd.values
    X = array[:,0:(array.shape[1]-1)].astype('int')
    y = array[:,(array.shape[1]-1)].astype('int')

    # Scaler
    scaler = MinMaxScaler(feature_range=(0, 8))
    rescaledX = scaler.fit_transform(X)

    # summarize transformed data
    set_printoptions(precision=3)

    # Or Standardize
    #scaler = StandardScaler().fit(X)
    #rescaledX = scaler.transform(X)

    test_size = 0.3
    seed = 7
    X_train, X_test, Y_train, Y_test = train_test_split(rescaledX, y, test_size=test_size,
    random_state=seed)

    model = LogisticRegression() 
    model.fit(X_train, Y_train)

    print(league)


    result = model.score(X_train, Y_train) 
    print("Accuracy for train: %.3f%%" % (result*100.0))

    result = model.score(X_test, Y_test) 
    print("Accuracy for test: %.3f%%" % (result*100.0))
    

championship
Accuracy for train: 56.667%
Accuracy for test: 55.605%
primeira_liga
Accuracy for train: 63.976%
Accuracy for test: 63.979%
ligue_1
Accuracy for train: 59.726%
Accuracy for test: 59.981%
segunda_division
Accuracy for train: 56.377%
Accuracy for test: 57.125%
2_liga
Accuracy for train: 57.325%
Accuracy for test: 56.816%
serie_a
Accuracy for train: 64.270%
Accuracy for test: 63.238%
bundesliga
Accuracy for train: 60.166%
Accuracy for test: 61.699%
primera_division
Accuracy for train: 61.307%
Accuracy for test: 60.402%
ligue_2
Accuracy for train: 56.782%
Accuracy for test: 57.017%
premier_league
Accuracy for train: 62.647%
Accuracy for test: 62.785%
eredivisie
Accuracy for train: 66.198%
Accuracy for test: 64.830%
segunda_liga
Accuracy for train: 94.702%
Accuracy for test: 92.308%
serie_b
Accuracy for train: 57.626%
Accuracy for test: 56.017%


In [18]:
# Save the model
from joblib import dump, load
dump(model, 'baseline_t2.joblib')

['baseline_t2.joblib']

In [31]:
# Load the model
loaded_model = load('baseline.joblib') 

take a look at the weights of your features and see which ones are important. Remove those that have low weights and check again the performance. Before you do that, you should check the performance of the model without feature selection on both training and testing sets, and observe if, by removing some features, the metrics on both sets get closer. Don't worry if you underfit right now, you will improve your model later.