Milestone 4 - Model Training

Task 2: Perform feature selection

This task focuses on the second step of the workflow described in the previous task. Feature selection means choosing a subset of the features in your dataset. It is useful when you have many features and do not want to use all of them, and it is especially helpful when the model overfits and you want to reduce the number of features. Perform feature selection so that you train your model on the most important features only. You could use LASSO regression for this, but it is not recommended here because it does not work very well with multiclass classification. Instead, look at the weights the model assigns to each feature and see which ones matter. Remove the features with low weights and check the performance again. Before you do that, measure the performance of the model without feature selection on both the training and testing sets, then observe whether removing features brings the metrics on the two sets closer together. Don't worry if the model underfits at this point; you will improve it later.
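As a rough illustration of "looking at the weights", the sketch below fits a multiclass logistic regression on synthetic data and ranks the features by the mean absolute value of their coefficients. The data, the feature names (`FEATURE_A`, `FEATURE_B`, `NOISE`) and the choice of averaging over classes are assumptions made for the example, not part of the original notebook. Weights are only comparable when the features are on a similar scale, which holds here by construction.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the match data (assumption): three features on the
# same scale, of which only the first two influence the three-class label.
rng = np.random.default_rng(0)
feature_names = ["FEATURE_A", "FEATURE_B", "NOISE"]  # hypothetical names
X = rng.normal(size=(3000, 3))
score = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=3000)
y = np.digitize(score, bins=[-1.0, 1.0]) - 1  # labels -1, 0, 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=7)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# coef_ has one row per class; averaging the absolute values gives a single
# importance-like number per feature.
weights = pd.Series(np.abs(model.coef_).mean(axis=0), index=feature_names)
weights = weights.sort_values(ascending=False)
print(weights)
```

Features at the bottom of this ranking are the candidates for removal.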

In [1]:
import re
import pandas as pd
import os
import numpy as np
from csv import reader
import plotly.express as px
import missingno as msno
import pickle

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

pd.options.mode.chained_assignment = None

High weight:
RECENT_PERF_DIFF

Low weight:
ELO_DIFF
GOAL_SO_FAR_DIFF

Try removing ELO_DIFF and GOAL_SO_FAR_DIFF, then test the accuracy again.
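One way to run this kind of comparison is a small helper that fits the model on a chosen set of columns and reports the training and testing accuracy side by side. The helper name `train_test_accuracy` and the synthetic data are hypothetical; the split parameters mirror the ones used later in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train_test_accuracy(X, y, columns, seed=7):
    """Fit on the chosen columns and return (train accuracy, test accuracy)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X[:, columns], y, test_size=0.3, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)

# Synthetic data (assumption): only column 0 carries signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=2000)).astype(int)

subsets = {"all features": [0, 1, 2], "column 0 only": [0]}
for name, cols in subsets.items():
    train_acc, test_acc = train_test_accuracy(X, y, cols)
    gap = train_acc - test_acc
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} gap={gap:+.3f}")
```

A smaller gap after removing features suggests the removed features were contributing more to overfitting than to predictive power.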

In [40]:
# load csv (reset data)
result_with_goal_sofar_pd = pd.read_csv('cleaned_dataset.csv')
result_with_goal_sofar_pd

Unnamed: 0,Home_Team,Away_Team,Result,Home_Score,Away_Score,HOME_TOTAL_GOAL_SO_FAR,AWAY_TOTAL_GOAL_SO_FAR,HOME_LASTEST_GOAL_DIFF,AWAY_LASTEST_GOAL_DIFF,Link,ELO_HOME,ELO_AWAY,Season,Round,League
0,Watford,Middlesbrough,1-0,1.0,0.0,0,0,,,https://www.besoccer.com/match/watford-fc/midd...,65.0,60.0,2021,1,championship
1,Birmingham City,Brentford,1-0,1.0,0.0,0,0,,,https://www.besoccer.com/match/birmingham-city...,52.0,59.0,2021,1,championship
2,Wycombe Wanderers,Rotherham United,0-1,0.0,1.0,0,0,,,https://www.besoccer.com/match/wycombe-wandere...,41.0,48.0,2021,1,championship
3,AFC Bournemouth,Blackburn Rovers,3-2,3.0,2.0,0,0,,,https://www.besoccer.com/match/afc-bournemouth...,63.0,57.0,2021,1,championship
4,Barnsley,Luton Town,0-1,0.0,1.0,0,0,,,https://www.besoccer.com/match/barnsley-fc/lut...,47.0,50.0,2021,1,championship
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
146636,Pescara,Padova,1-2,1.0,2.0,49,39,5.0,-3.0,https://www.besoccer.com/match/pescara-calcio/...,59.0,54.0,1997,38,serie_b
146637,Genoa,Palermo FC,4-1,4.0,1.0,54,39,2.0,-1.0,https://www.besoccer.com/match/genoa/palermo/1...,61.0,58.0,1997,38,serie_b
146638,Torino,Ravenna FC,0-4,0.0,4.0,45,39,-2.0,-2.0,https://www.besoccer.com/match/torino-fc/raven...,63.0,54.0,1997,38,serie_b
146639,Salernitana,Reggina,1-3,1.0,3.0,30,37,-2.0,3.0,https://www.besoccer.com/match/salernitana-cal...,52.0,52.0,1997,38,serie_b


In [41]:
def get_ELO_diff(record):
    hscore = record['ELO_HOME']
    ascore = record['ELO_AWAY']
    return hscore - ascore

elo_diff_pd = result_with_goal_sofar_pd.apply(get_ELO_diff, axis=1)

result_with_goal_sofar_pd.drop('ELO_HOME', inplace=True, axis=1)
result_with_goal_sofar_pd.drop('ELO_AWAY', inplace=True, axis=1)

result_with_goal_sofar_pd.insert(loc=0, column="ELO_DIFF", value=elo_diff_pd.astype('Int64')) 

result_with_goal_sofar_pd

Unnamed: 0,ELO_DIFF,Home_Team,Away_Team,Result,Home_Score,Away_Score,HOME_TOTAL_GOAL_SO_FAR,AWAY_TOTAL_GOAL_SO_FAR,HOME_LASTEST_GOAL_DIFF,AWAY_LASTEST_GOAL_DIFF,Link,Season,Round,League
0,5,Watford,Middlesbrough,1-0,1.0,0.0,0,0,,,https://www.besoccer.com/match/watford-fc/midd...,2021,1,championship
1,-7,Birmingham City,Brentford,1-0,1.0,0.0,0,0,,,https://www.besoccer.com/match/birmingham-city...,2021,1,championship
2,-7,Wycombe Wanderers,Rotherham United,0-1,0.0,1.0,0,0,,,https://www.besoccer.com/match/wycombe-wandere...,2021,1,championship
3,6,AFC Bournemouth,Blackburn Rovers,3-2,3.0,2.0,0,0,,,https://www.besoccer.com/match/afc-bournemouth...,2021,1,championship
4,-3,Barnsley,Luton Town,0-1,0.0,1.0,0,0,,,https://www.besoccer.com/match/barnsley-fc/lut...,2021,1,championship
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
146636,5,Pescara,Padova,1-2,1.0,2.0,49,39,5.0,-3.0,https://www.besoccer.com/match/pescara-calcio/...,1997,38,serie_b
146637,3,Genoa,Palermo FC,4-1,4.0,1.0,54,39,2.0,-1.0,https://www.besoccer.com/match/genoa/palermo/1...,1997,38,serie_b
146638,9,Torino,Ravenna FC,0-4,0.0,4.0,45,39,-2.0,-2.0,https://www.besoccer.com/match/torino-fc/raven...,1997,38,serie_b
146639,0,Salernitana,Reggina,1-3,1.0,3.0,30,37,-2.0,3.0,https://www.besoccer.com/match/salernitana-cal...,1997,38,serie_b
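The row-wise `apply` used for the difference columns can also be written as plain column arithmetic, which is usually much faster on a frame of this size. The sketch below uses a two-row stand-in frame (values taken from the first two rows shown above) rather than the real dataset.

```python
import pandas as pd

# Tiny stand-in frame with the same column names as the dataset above.
df = pd.DataFrame({
    "Home_Team": ["Watford", "Birmingham City"],
    "ELO_HOME": [65.0, 52.0],
    "ELO_AWAY": [60.0, 59.0],
})

# Column arithmetic performs the subtraction for every row at once.
elo_diff = (df["ELO_HOME"] - df["ELO_AWAY"]).astype("Int64")

# Drop both source columns in a single call and put the new column first.
df = df.drop(columns=["ELO_HOME", "ELO_AWAY"])
df.insert(loc=0, column="ELO_DIFF", value=elo_diff)
print(df)
```

The same pattern applies to the other home/away difference columns.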


In [42]:
def get_goal_so_far_diff(record):
    hscore = record['HOME_TOTAL_GOAL_SO_FAR']
    ascore = record['AWAY_TOTAL_GOAL_SO_FAR']
    return hscore - ascore

goal_so_far_diff_pd = result_with_goal_sofar_pd.apply(get_goal_so_far_diff, axis=1)

result_with_goal_sofar_pd.drop('HOME_TOTAL_GOAL_SO_FAR', inplace=True, axis=1)
result_with_goal_sofar_pd.drop('AWAY_TOTAL_GOAL_SO_FAR', inplace=True, axis=1)

result_with_goal_sofar_pd.insert(loc=1, column="GOAL_SO_FAR_DIFF", value=goal_so_far_diff_pd.astype('Int64')) 

result_with_goal_sofar_pd

Unnamed: 0,ELO_DIFF,GOAL_SO_FAR_DIFF,Home_Team,Away_Team,Result,Home_Score,Away_Score,HOME_LASTEST_GOAL_DIFF,AWAY_LASTEST_GOAL_DIFF,Link,Season,Round,League
0,5,0,Watford,Middlesbrough,1-0,1.0,0.0,,,https://www.besoccer.com/match/watford-fc/midd...,2021,1,championship
1,-7,0,Birmingham City,Brentford,1-0,1.0,0.0,,,https://www.besoccer.com/match/birmingham-city...,2021,1,championship
2,-7,0,Wycombe Wanderers,Rotherham United,0-1,0.0,1.0,,,https://www.besoccer.com/match/wycombe-wandere...,2021,1,championship
3,6,0,AFC Bournemouth,Blackburn Rovers,3-2,3.0,2.0,,,https://www.besoccer.com/match/afc-bournemouth...,2021,1,championship
4,-3,0,Barnsley,Luton Town,0-1,0.0,1.0,,,https://www.besoccer.com/match/barnsley-fc/lut...,2021,1,championship
...,...,...,...,...,...,...,...,...,...,...,...,...,...
146636,5,10,Pescara,Padova,1-2,1.0,2.0,5.0,-3.0,https://www.besoccer.com/match/pescara-calcio/...,1997,38,serie_b
146637,3,15,Genoa,Palermo FC,4-1,4.0,1.0,2.0,-1.0,https://www.besoccer.com/match/genoa/palermo/1...,1997,38,serie_b
146638,9,6,Torino,Ravenna FC,0-4,0.0,4.0,-2.0,-2.0,https://www.besoccer.com/match/torino-fc/raven...,1997,38,serie_b
146639,0,-7,Salernitana,Reggina,1-3,1.0,3.0,-2.0,3.0,https://www.besoccer.com/match/salernitana-cal...,1997,38,serie_b


In [25]:
def get_recent_goal_diff_diff(record):
    hscore = record['HOME_LASTEST_GOAL_DIFF']
    ascore = record['AWAY_LASTEST_GOAL_DIFF']
    return hscore - ascore

recent_perf_diff_pd = result_with_goal_sofar_pd.apply(get_recent_goal_diff_diff, axis=1)

result_with_goal_sofar_pd.drop('HOME_LASTEST_GOAL_DIFF', inplace=True, axis=1)
result_with_goal_sofar_pd.drop('AWAY_LASTEST_GOAL_DIFF', inplace=True, axis=1)

#result_with_goal_sofar_pd.insert(loc=1, column="RECENT_PERF_DIFF", value=recent_perf_diff_pd.astype('Int64')) 

result_with_goal_sofar_pd.insert(loc=0, column="RECENT_PERF_DIFF", value=recent_perf_diff_pd.astype('Int64')) 

result_with_goal_sofar_pd

Unnamed: 0,RECENT_PERF_DIFF,Home_Team,Away_Team,Result,Home_Score,Away_Score,HOME_TOTAL_GOAL_SO_FAR,AWAY_TOTAL_GOAL_SO_FAR,Link,ELO_HOME,ELO_AWAY,Season,Round,League
0,,Watford,Middlesbrough,1-0,1.0,0.0,0,0,https://www.besoccer.com/match/watford-fc/midd...,65.0,60.0,2021,1,championship
1,,Birmingham City,Brentford,1-0,1.0,0.0,0,0,https://www.besoccer.com/match/birmingham-city...,52.0,59.0,2021,1,championship
2,,Wycombe Wanderers,Rotherham United,0-1,0.0,1.0,0,0,https://www.besoccer.com/match/wycombe-wandere...,41.0,48.0,2021,1,championship
3,,AFC Bournemouth,Blackburn Rovers,3-2,3.0,2.0,0,0,https://www.besoccer.com/match/afc-bournemouth...,63.0,57.0,2021,1,championship
4,,Barnsley,Luton Town,0-1,0.0,1.0,0,0,https://www.besoccer.com/match/barnsley-fc/lut...,47.0,50.0,2021,1,championship
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
146636,8,Pescara,Padova,1-2,1.0,2.0,49,39,https://www.besoccer.com/match/pescara-calcio/...,59.0,54.0,1997,38,serie_b
146637,3,Genoa,Palermo FC,4-1,4.0,1.0,54,39,https://www.besoccer.com/match/genoa/palermo/1...,61.0,58.0,1997,38,serie_b
146638,0,Torino,Ravenna FC,0-4,0.0,4.0,45,39,https://www.besoccer.com/match/torino-fc/raven...,63.0,54.0,1997,38,serie_b
146639,-5,Salernitana,Reggina,1-3,1.0,3.0,30,37,https://www.besoccer.com/match/salernitana-cal...,52.0,52.0,1997,38,serie_b


In [43]:
# drop every record that contains a missing value (NaN) in any column
result_with_goal_sofar_pd = result_with_goal_sofar_pd.dropna()
result_with_goal_sofar_pd

Unnamed: 0,ELO_DIFF,GOAL_SO_FAR_DIFF,Home_Team,Away_Team,Result,Home_Score,Away_Score,HOME_LASTEST_GOAL_DIFF,AWAY_LASTEST_GOAL_DIFF,Link,Season,Round,League
36,-16,-1,Coventry City,AFC Bournemouth,1-3,1.0,3.0,0.0,2.0,https://www.besoccer.com/match/coventry-city/a...,2021,4,championship
37,2,2,Norwich City,Derby County,0-1,0.0,1.0,0.0,-7.0,https://www.besoccer.com/match/norwich-city-fc...,2021,4,championship
38,-2,8,Blackburn Rovers,Cardiff City,0-0,0.0,0.0,8.0,-1.0,https://www.besoccer.com/match/blackburn-rover...,2021,4,championship
39,10,3,Luton Town,Wycombe Wanderers,2-0,2.0,0.0,1.0,-8.0,https://www.besoccer.com/match/luton-town-fc/w...,2021,4,championship
40,15,2,Middlesbrough,Barnsley,2-1,2.0,1.0,-1.0,-3.0,https://www.besoccer.com/match/middlesbrough-f...,2021,4,championship
...,...,...,...,...,...,...,...,...,...,...,...,...,...
146636,5,10,Pescara,Padova,1-2,1.0,2.0,5.0,-3.0,https://www.besoccer.com/match/pescara-calcio/...,1997,38,serie_b
146637,3,15,Genoa,Palermo FC,4-1,4.0,1.0,2.0,-1.0,https://www.besoccer.com/match/genoa/palermo/1...,1997,38,serie_b
146638,9,6,Torino,Ravenna FC,0-4,0.0,4.0,-2.0,-2.0,https://www.besoccer.com/match/torino-fc/raven...,1997,38,serie_b
146639,0,-7,Salernitana,Reggina,1-3,1.0,3.0,-2.0,3.0,https://www.besoccer.com/match/salernitana-cal...,1997,38,serie_b


In [44]:
# drop the columns that are not used as features
result_with_goal_sofar_pd = result_with_goal_sofar_pd.drop(columns=[
    'Result', 'Link', 'League', 'Season', 'Round',
    'Home_Team', 'Away_Team',
    'HOME_LASTEST_GOAL_DIFF', 'AWAY_LASTEST_GOAL_DIFF',
])
result_with_goal_sofar_pd

Unnamed: 0,ELO_DIFF,GOAL_SO_FAR_DIFF,Home_Score,Away_Score
36,-16,-1,1.0,3.0
37,2,2,0.0,1.0
38,-2,8,0.0,0.0
39,10,3,2.0,0.0
40,15,2,2.0,1.0
...,...,...,...,...
146636,5,10,1.0,2.0
146637,3,15,4.0,1.0
146638,9,6,0.0,4.0
146639,0,-7,1.0,3.0


In [45]:
# encode the match result: 1 = home win, 0 = draw, -1 = away win
def get_result(record):
    hscore = record['Home_Score']
    ascore = record['Away_Score']
    # pd.isna catches both pd.NA and float NaN
    if pd.isna(hscore) or pd.isna(ascore):
        return pd.NA
    if hscore==ascore:
        return 0
    elif hscore>ascore:
        return 1
    else:
        return -1

result_pd = result_with_goal_sofar_pd.apply(get_result, axis=1)

result_with_goal_sofar_pd.drop('Home_Score', inplace=True, axis=1)
result_with_goal_sofar_pd.drop('Away_Score', inplace=True, axis=1)

result_with_goal_sofar_pd.insert(loc=len(result_with_goal_sofar_pd.columns), column="Result", value=result_pd.astype('Int64')) 
result_with_goal_sofar_pd

Unnamed: 0,ELO_DIFF,GOAL_SO_FAR_DIFF,Result
36,-16,-1,-1
37,2,2,-1
38,-2,8,0
39,10,3,1
40,15,2,1
...,...,...,...
146636,5,10,-1
146637,3,15,1
146638,9,6,-1
146639,0,-7,-1
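The same 1 / 0 / -1 encoding can be obtained without a row-wise function by taking the sign of the score difference. This is a minimal sketch on three hand-written rows (scores copied from the table above), assuming no missing scores remain after `dropna()`.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "Home_Score": [1.0, 0.0, 2.0],
    "Away_Score": [3.0, 0.0, 0.0],
})

# sign(home - away): 1 = home win, 0 = draw, -1 = away win
result = np.sign(scores["Home_Score"] - scores["Away_Score"]).astype("Int64")
print(result.tolist())
```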


In [46]:
array = result_with_goal_sofar_pd.values
array

array([[-16, -1, -1],
       [2, 2, -1],
       [-2, 8, 0],
       ...,
       [9, 6, -1],
       [0, -7, -1],
       [10, 0, 1]], dtype=object)

In [47]:
X = array[:,0:2].astype('int')
y = array[:,2].astype('int')

In [48]:
# rescale every feature to the range [0, 3]
from sklearn.preprocessing import MinMaxScaler
from numpy import set_printoptions

scaler = MinMaxScaler(feature_range=(0, 3))
# the scaler is fitted on all rows, before the train/test split
rescaledX = scaler.fit_transform(X)

# limit NumPy print precision to 3 decimals
set_printoptions(precision=3)

In [49]:
test_size = 0.3
seed = 7
X_train, X_test, Y_train, Y_test = train_test_split(
    rescaledX, y, test_size=test_size, random_state=seed)

model = LogisticRegression() 
model.fit(X_train, Y_train)
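In the cells above the scaler is fitted on the full dataset before the split, so the test rows influence the minimum and maximum used for scaling. A common alternative, sketched here on synthetic data, is to wrap the scaler and the model in a scikit-learn `Pipeline` so that the scaler only ever sees the training rows. The data below is an assumption for illustration; on the real dataset the effect on accuracy may well be negligible.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic integer-valued features (assumption), two columns as in the notebook.
rng = np.random.default_rng(7)
X = rng.integers(-20, 21, size=(1000, 2)).astype(float)
y = np.sign(X[:, 0] + rng.normal(scale=10, size=1000)).astype(int)

# Split first, then let the pipeline fit the scaler on the training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=7)

pipeline = make_pipeline(
    MinMaxScaler(feature_range=(0, 3)),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(X_train, y_train)

print("Accuracy for train: %.3f%%" % (pipeline.score(X_train, y_train) * 100.0))
print("Accuracy for test: %.3f%%" % (pipeline.score(X_test, y_test) * 100.0))
```

Calling `pipeline.score(X_test, y_test)` applies the training-set scaling to the test rows automatically.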

In [50]:
result = model.score(X_train, Y_train) 
print("Accuracy for train: %.3f%%" % (result*100.0))

Accuracy for train: 49.396%


In [51]:
result = model.score(X_test, Y_test) 
print("Accuracy for test: %.3f%%" % (result*100.0))

Accuracy for test: 48.824%
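A single train/test split gives one number per set, and an accuracy near 49% is hard to judge without a reference point. The sketch below, on synthetic three-class data, uses `KFold` (already imported at the top of the notebook) with `cross_val_score` to average accuracy over several folds and compares it with a majority-class baseline. The `DummyClassifier` baseline and the data are assumptions added for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic three-class data (assumption): labels -1, 0, 1 driven by column 0.
rng = np.random.default_rng(3)
X = rng.normal(size=(1500, 2))
y = np.digitize(X[:, 0] + rng.normal(size=1500), bins=[-0.5, 0.5]) - 1

cv = KFold(n_splits=5, shuffle=True, random_state=7)

# Accuracy of the model on each of the 5 held-out folds.
model_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

# Baseline that always predicts the most frequent class in the training fold.
baseline_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X, y, cv=cv)

print("Model accuracy:    %.3f%%" % (model_scores.mean() * 100.0))
print("Baseline accuracy: %.3f%%" % (baseline_scores.mean() * 100.0))
```

If the model's cross-validated accuracy is only slightly above the baseline, the remaining features carry little information about the result.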
