The modeling approach has two parts:
- a primary model, which predicts each match's final outcome (home win, away win or draw)
- secondary models, which predict the number of goals scored by the home and away teams, given the match's final outcome

This notebook focuses on the primary model and has the following sections:
- feature engineering
- feature selection
- fitting
- model evaluation

In [None]:
import pandas as pd
import numpy as np
import optuna
import os
import sys
import matplotlib.pyplot as plt
import seaborn as sns
import re

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report, log_loss
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

root_path = os.path.abspath(os.path.join(os.getcwd(), ".."))
sys.path.append(root_path)

from src.config import load_config
from src.feature_engineering import create_diff_features
from src.feature_selection import find_highly_correlated_cols, remove_low_variance_features, select_top_features
from src.primary_modeling import run_grid_searches, load_model, evaluate_model_metrics

# Load config.yaml
config_file = 'config.yaml'
config_path = os.path.join(root_path, config_file)
config = load_config(config_path)

# Loading the preprocessed data

In [None]:
preprocessed_data_path = os.path.join(root_path, config['preprocessed_dir'])
df_train_path = os.path.join(preprocessed_data_path, f"{config['preprocessed_train_df_name']}.csv")
df_test_path = os.path.join(preprocessed_data_path, f"{config['preprocessed_test_df_name']}.csv")

df_train = pd.read_csv(df_train_path)
df_test = pd.read_csv(df_test_path)
df_train.head()

In [None]:
primary_target = config['final_result_column']
secondary_target_home = config['nb_goals_home_column']
secondary_target_away = config['nb_goals_away_column']

X_train = df_train.drop(columns=[primary_target, secondary_target_home, secondary_target_away, config['date_column'], config['season_column']])
X_test = df_test.drop(columns=[primary_target, secondary_target_home, secondary_target_away, config['date_column'], config['season_column']])

y_train_primary = df_train[primary_target]
y_test_primary = df_test[primary_target]

In [None]:
X_train.columns

# Feature engineering

Since we want to model the final outcome of a match, we do not need the full set of separate home-team and away-team features. Instead, for each pair of matching home and away columns, we compute the difference between the two.

In [None]:
patterns = [
        ("_home_team_ranking_at_home", "_away_team_ranking_away"),
        ("_home_team_at_home", "_away_team_away"),
        ("_home_team", "_away_team"),
        ("_at_home", "_away"),
        ("_home", "_away")
]

X_train_primary = create_diff_features(X_train, patterns=patterns)
X_test_primary = create_diff_features(X_test, patterns=patterns)

X_train_primary.head()
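
The differencing step described above can be sketched as follows. This is a minimal illustration, not the actual `create_diff_features` from `src.feature_engineering`: the function name `diff_features_sketch`, the `_diff` naming convention, and the demo column names are all assumptions. Suffix pairs are tried in the order given, which in this notebook runs from most specific to least specific.

```python
import pandas as pd


def diff_features_sketch(df, patterns):
    """Replace each numeric (home, away) column pair with one difference column.

    Hypothetical stand-in for create_diff_features; the real helper may differ.
    """
    out = df.copy()
    for home_suffix, away_suffix in patterns:
        # Iterate over a snapshot of the columns, since `out` changes in the loop.
        for home_col in list(out.columns):
            if home_col not in out.columns or not home_col.endswith(home_suffix):
                continue
            prefix = home_col[: -len(home_suffix)]
            away_col = prefix + away_suffix
            if not prefix or away_col not in out.columns:
                continue
            both_numeric = (
                pd.api.types.is_numeric_dtype(out[home_col])
                and pd.api.types.is_numeric_dtype(out[away_col])
            )
            if not both_numeric:
                continue
            # Positive values favor the home team, negative values the away team.
            out[prefix + "_diff"] = out[home_col] - out[away_col]
            out = out.drop(columns=[home_col, away_col])
    return out


# Made-up columns, for illustration only.
demo = pd.DataFrame({
    "goals_scored_home_team": [2.0, 1.0],
    "goals_scored_away_team": [0.5, 1.5],
    "points_home_team_at_home": [10, 4],
    "points_away_team_away": [7, 6],
    "home_team": ["A", "B"],  # categorical, left untouched
})
patterns_demo = [
    ("_home_team_at_home", "_away_team_away"),
    ("_home_team", "_away_team"),
]
demo_diff = diff_features_sketch(demo, patterns_demo)
print(demo_diff)
```

The team-name column survives unchanged, which matches the later cells where the home and away team columns are one-hot encoded.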

# Feature Selection

We implemented 3 methods to select features:
- remove highly correlated features
- remove low variance features
- select the top K features that best explain the primary target

## Correlation method

In [None]:
corr = X_train_primary.corr(numeric_only=True)
plt.figure(figsize=(12,8))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation matrix")
plt.show()

In [None]:
highly_correlated_cols = find_highly_correlated_cols(X_train_primary)
highly_correlated_cols

In [None]:
X_train_primary = X_train_primary.drop(columns=highly_correlated_cols)
X_test_primary = X_test_primary.drop(columns=highly_correlated_cols)
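
One common way to flag highly correlated columns is to scan the upper triangle of the absolute correlation matrix and mark any column that correlates strongly with an earlier one. The sketch below illustrates that idea only; the name `highly_correlated_sketch` and the 0.9 threshold are assumptions, and `find_highly_correlated_cols` may use a different rule or threshold.

```python
import numpy as np
import pandas as pd


def highly_correlated_sketch(df, threshold=0.9):
    """Return columns whose absolute correlation with an earlier column exceeds threshold.

    Hypothetical helper; the threshold value is an assumption.
    """
    corr = df.corr(numeric_only=True).abs()
    # Keep only the strict upper triangle so each pair is considered once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + 1.0,    # linear function of "a", so perfectly correlated
    "b": rng.normal(size=200),  # independent noise
})
to_drop = highly_correlated_sketch(demo)
print(to_drop)
```

As in the cells above, the list is computed on the training set only and then dropped from both train and test, so the test set does not influence which features are kept.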

## Low variance method

In [None]:
low_variance_cols = remove_low_variance_features(X_train_primary)
low_variance_cols

In [None]:
X_train_primary = X_train_primary.drop(columns=low_variance_cols)
X_test_primary = X_test_primary.drop(columns=low_variance_cols)
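
A low-variance filter can be sketched with scikit-learn's `VarianceThreshold`. This is an illustration under assumptions: the name `low_variance_sketch` and the 0.01 threshold are hypothetical, and `remove_low_variance_features` may be implemented differently.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def low_variance_sketch(df, threshold=0.01):
    """Return numeric columns whose variance does not exceed the threshold.

    Hypothetical helper; the threshold value is an assumption.
    """
    num = df.select_dtypes(include="number")
    selector = VarianceThreshold(threshold=threshold)
    selector.fit(num)
    keep = selector.get_support()  # boolean mask of columns that pass the filter
    return [col for col, kept in zip(num.columns, keep) if not kept]


demo = pd.DataFrame({
    "constant": [1, 1, 1, 1],
    "varied": [0, 1, 2, 3],
    "almost_constant": [0.0, 0.0, 0.0, 0.01],
})
low_var = low_variance_sketch(demo)
print(low_var)
```

Note that raw variance depends on the unit of each feature, so a single threshold is easiest to interpret when the features share a comparable scale.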

In [None]:
X_train_primary.columns

## Top K features 

In [None]:
top_k_cols = select_top_features(X_train_primary, y_train_primary)
top_k_cols

In [None]:
# Optional: only select these top k features
# X_train_primary = X_train_primary[top_k_cols]
# X_test_primary = X_test_primary[top_k_cols]
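
Top-K selection can be sketched with scikit-learn's `SelectKBest`. The scoring function used by `select_top_features` is not shown in this notebook, so the ANOVA F-test (`f_classif`) below is an assumption, as are the name `top_k_sketch` and the synthetic data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif


def top_k_sketch(X, y, k=2):
    """Return the k numeric features with the highest univariate F-score.

    Hypothetical helper; the real select_top_features may score differently.
    """
    num = X.select_dtypes(include="number")
    k_eff = min(k, num.shape[1])
    selector = SelectKBest(score_func=f_classif, k=k_eff)
    selector.fit(num, y)
    scores = pd.Series(selector.scores_, index=num.columns)
    return scores.sort_values(ascending=False).head(k_eff).index.tolist()


rng = np.random.default_rng(0)
y_demo = rng.integers(0, 3, size=300)  # three classes, like home/draw/away
X_demo = pd.DataFrame({
    "noise": rng.normal(size=300),                   # unrelated to the target
    "moderate": y_demo + rng.normal(size=300),       # noisy signal
    "informative": y_demo + 0.1 * rng.normal(size=300),  # strong signal
})
top_k = top_k_sketch(X_demo, y_demo, k=2)
print(top_k)
```

Univariate scores ignore interactions between features, which is one reason to keep this step optional, as in the cell above.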

# Fitting

Three types of classifiers will be tested:
- logistic regression
- random forest
- XGBoost

Hyperparameters are tuned with a grid search; tuning with Optuna is still to be done.

## Logistic regression

In [None]:
cat_cols = [config['home_column'], config['away_column']]
num_cols = X_train_primary.select_dtypes(include=['int64','float64']).columns.tolist()
    
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
])

pipe_lr = Pipeline([
    ('pre', preprocessor),
    # multi_class is deprecated in recent scikit-learn; saga fits a multinomial model by default on multiclass targets
    ('clf', LogisticRegression(solver='saga', max_iter=5000))
])

## Random forest

In [None]:
cat_cols = [config['home_column'], config['away_column']]
num_cols = X_train_primary.select_dtypes(include=['int64','float64']).columns.tolist()
    
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
])

pipe_rf = Pipeline([
    ('pre', preprocessor),
    ('clf', RandomForestClassifier(random_state=42, n_jobs=1))
])

## XGBoost

In [None]:
cat_cols = [config['home_column'], config['away_column']]
num_cols = X_train_primary.select_dtypes(include=['int64','float64']).columns.tolist()
    
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
])

pipe_xgb = Pipeline([
    ('pre', preprocessor),
    # use_label_encoder is deprecated in recent XGBoost; labels are encoded with LabelEncoder in the Run section
    ('clf', XGBClassifier(objective='multi:softprob', eval_metric='mlogloss'))
])

## Run

In [None]:
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train_primary)

# run_grid_searches(X=X_train_primary,
#                   y=y_train_enc,
#                   param_grid_lr=config['param_grid_lr'],
#                   param_grid_rf=config['param_grid_rf'],
#                   param_grid_xgb=config['param_grid_xgb'],
#                   preprocessing_pipeline_lr=pipe_lr,
#                   preprocessing_pipeline_rf=pipe_rf,
#                   preprocessing_pipeline_xgb=pipe_xgb,
#                   outdir=config['primary_models_dir'])
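
For reference, a grid search over one of these pipelines can be sketched with `GridSearchCV`. This is not the body of `run_grid_searches`, which lives in `src.primary_modeling` and is not shown here; the synthetic data, the parameter grid, the `neg_log_loss` scoring and the stratified 3-fold split are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the training data (column names are made up).
rng = np.random.default_rng(0)
n = 150
X_demo = pd.DataFrame({
    "form_diff": rng.normal(size=n),
    "ranking_diff": rng.normal(size=n),
    "home_team": rng.choice(["A", "B", "C"], size=n),
})
# Three encoded classes (0, 1, 2) loosely driven by form_diff.
signal = X_demo["form_diff"].to_numpy() + 0.5 * rng.normal(size=n)
y_demo = np.digitize(signal, bins=[-0.5, 0.5])

pre = ColumnTransformer([
    ("num", StandardScaler(), ["form_diff", "ranking_diff"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["home_team"]),
])
pipe = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])

# Pipeline step parameters are addressed as <step name>__<parameter>.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="neg_log_loss", cv=cv)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

Because the preprocessing sits inside the pipeline, the scaler and encoder are refit on each training fold, which avoids leaking validation data into the preprocessing step.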

## Optuna

In [None]:
# TODO

# Model evaluation

In [None]:
y_test_enc = le.transform(y_test_primary)

## Logistic Regression

In [None]:
best_lr = load_model(os.path.join('..', config['primary_models_dir'], 'logistic_grid.joblib'))
metrics = evaluate_model_metrics(best_lr, X_test_primary, y_test_enc)
metrics

## Random Forest

In [None]:
best_rf = load_model(os.path.join('..', config['primary_models_dir'], 'rf_grid.joblib'))
metrics = evaluate_model_metrics(best_rf, X_test_primary, y_test_enc)
metrics

## XGBoost

In [None]:
best_xgb = load_model(os.path.join('..', config['primary_models_dir'], 'xgb_grid.joblib'))
metrics = evaluate_model_metrics(best_xgb, X_test_primary, y_test_enc)
metrics
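
The contents of the `metrics` object depend on `evaluate_model_metrics`, which is not shown in this notebook. As a rough sketch of what a three-class evaluation might compute, the hypothetical helper below derives accuracy, macro F1, log loss and a confusion matrix from predicted probabilities; the function name and the toy probabilities are assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, log_loss


def evaluate_sketch(y_true, proba, labels):
    """Compute common multiclass metrics from class probabilities.

    Hypothetical helper; evaluate_model_metrics may return different metrics.
    """
    # Predicted class = label with the highest probability in each row.
    y_pred = np.asarray(labels)[np.argmax(proba, axis=1)]
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
        "log_loss": log_loss(y_true, proba, labels=labels),
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=labels),
    }


# Toy example: 6 matches, 3 encoded classes, one misclassified match.
y_true = np.array([0, 1, 2, 2, 1, 0])
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
    [0.5, 0.2, 0.3],  # true class is 2, predicted class is 0
    [0.3, 0.4, 0.3],
    [0.6, 0.3, 0.1],
])
results = evaluate_sketch(y_true, proba, labels=[0, 1, 2])
print(results)
```

Log loss is worth reporting next to accuracy here because it scores the predicted probabilities themselves, not only the most likely class.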