# Machine Learning Methods

In this phase, we apply supervised machine learning techniques to the dataset
to model and predict aspects of home advantage in football.

Two tasks are considered:
1. A classification task: predict whether a team wins more often at home than away over a season.
2. A regression task: model home points per match as a measure of the intensity of home advantage.



In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier

In [2]:
df = pd.read_csv("data/main_merged_and_cleaned.csv")
df.head()

,id,team,country,league,season_start,season_end,final_league_pos,home_matches,home_wins,home_draws,...,home_yellow_cards_won,home_yellow_cards_conceded,home_red_cards_won,home_red_conceded,away_fouls_won_avg,away_fouls_conceded_avg,away_yellow_cards_won,away_yellow_cards_conceded,away_red_cards_won,away_red_cards_conceded
0,1,Hoffenheim,de,Bundesliga,2014,2015,8,17,9,3,...,25.294118,20.0,2.352941,0.0,181.176471,190.588235,19.411765,18.823529,1.176471,1.176471
1,2,Udinese,it,Serie A,2014,2015,16,19,6,5,...,25.789474,21.052632,0.526316,1.578947,136.842105,150.0,20.526316,26.842105,2.105263,1.578947
2,3,Bayern Munich,de,Bundesliga,2014,2015,1,17,14,1,...,13.529412,6.470588,0.588235,1.176471,162.941177,140.0,16.470588,15.294118,0.588235,0.0
3,4,Augsburg,de,Bundesliga,2014,2015,5,17,9,4,...,24.705882,15.294118,0.588235,1.176471,150.0,161.176471,18.235294,20.588235,1.764706,0.588235
4,5,Hellas Verona,it,Serie A,2014,2015,13,19,7,5,...,,,,,,,,,,


## Target Variable Definition

We define a binary home advantage outcome variable based on match results.

The target variable:
- `home_advantage_win`: 1 if home win rate > away win rate, 0 otherwise (including seasons where the two rates are equal).

This formulation allows us to model whether a team exhibits home advantage
using season-level statistics.

In [3]:
df['home_advantage_win'] = (df['home_win_rate'] > df['away_win_rate']).astype(int)

df['home_advantage_win'].value_counts()

home_advantage_win,count
1,997
0,284
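
With 997 of the 1,281 team-seasons labelled 1, a classifier that always predicts home advantage is already right most of the time. A minimal sketch of that majority-class baseline, using only the counts printed above (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above.
class_counts = {1: 997, 0: 284}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = class_counts[majority_class] / total

print(f"Total team-seasons: {total}")
print(f"Majority class: {majority_class}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

On the full dataset this baseline is about 0.78, so classifier accuracy is best read against that figure rather than against 0.5. The exact baseline shifts once rows with missing features are dropped.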


## Feature Selection

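Features are selected to reflect performance indicators (goals and expected goals, home and away)
and a referee-related indicator (penalties won), while avoiding data leakage: the win-rate columns
that define the target are not used as inputs. Rows with a missing value in any selected feature are dropped.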



In [4]:
feature_cols = [
    'home_goals_for_per_90',
    'home_goals_against_per_90',
    'away_goals_for_per_90',
    'away_goals_against_per_90',
    'home_xg_avg',
    'away_xg_avg',
    'home_penalties_won',
    'away_penalties_won'
]

X = df[feature_cols]
y = df['home_advantage_win']

# Drop rows with any missing feature, then realign the target to the rows kept.
X = X.dropna()
y = y.loc[X.index]
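
Dropping rows changes both the sample size and the class balance, so it is worth reporting how many rows are lost. A small sketch on a toy frame (the values are made up; only the column names come from the notebook) that filters once and keeps features and target aligned by construction:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the notebook.
toy = pd.DataFrame({
    "home_xg_avg": [1.4, np.nan, 1.1, 1.8],
    "away_xg_avg": [1.0, 0.9, np.nan, 1.3],
    "home_advantage_win": [1, 0, 1, 1],
})
cols = ["home_xg_avg", "away_xg_avg"]

# Drop incomplete rows once, then split into features and target,
# so X and y share the same index.
complete = toy.dropna(subset=cols)
X_toy = complete[cols]
y_toy = complete["home_advantage_win"]

n_dropped = len(toy) - len(complete)
print(f"Dropped {n_dropped} of {len(toy)} rows with missing features")
print(y_toy.value_counts())
```

In the notebook itself, the 156-row test set produced by the 20% split implies that roughly 780 of the 1,281 rows survive this step.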

In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

In [6]:
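scaler = StandardScaler()

# Fit the scaler on the training split only, then reuse it on the test split,
# so no test-set statistics leak into training.
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)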

log_model = LogisticRegression(max_iter=1000, random_state=42)
log_model.fit(X_train_scaled, y_train)

y_pred_log = log_model.predict(X_test_scaled)

print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_log))
print(classification_report(y_test, y_pred_log))

Logistic Regression Accuracy: 0.8205128205128205
              precision    recall  f1-score   support

           0       0.71      0.54      0.61        41
           1       0.85      0.92      0.88       115

    accuracy                           0.82       156
   macro avg       0.78      0.73      0.75       156
weighted avg       0.81      0.82      0.81       156



*The logistic regression model reaches 82% accuracy on the test set, against 74% (115 of 156) for always
predicting the majority class, indicating that basic match performance indicators carry some information about
the presence of home advantage. Recall is much higher for the home-advantage class (0.92) than for the other class (0.54),
so the model detects seasons where home advantage exists more reliably than seasons where it does not.*
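
A gap in recall between the two classes is common when one class dominates. One possible response, sketched here on synthetic data rather than the project's dataset, is to pass `class_weight='balanced'` to the logistic regression and compare per-class recall. The data generator settings and variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real data (about 22% class 0).
X_syn, y_syn = make_classification(
    n_samples=800, n_features=8, n_informative=4,
    weights=[0.22, 0.78], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

recalls = {}
for label, weight in [("unweighted", None), ("balanced", "balanced")]:
    # Scaling lives inside the pipeline, so it is fit on training data only.
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, class_weight=weight),
    )
    model.fit(X_tr, y_tr)
    y_hat = model.predict(X_te)
    # Recall per class: index 0 is the minority class, index 1 the majority.
    recalls[label] = recall_score(y_te, y_hat, average=None)
    print(label, "recall [class 0, class 1]:", recalls[label].round(2))
```

Reweighting usually trades some majority-class recall for minority-class recall; whether that trade is worthwhile depends on which error matters more for the analysis.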

## Random Forest Classification

A Random Forest classifier is used to capture potential nonlinear relationships
between match statistics and home advantage.

In [7]:
rf_model = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
    class_weight='balanced'
)

# Tree-based models do not require feature scaling, so the unscaled features are used.
rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)

print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))

Random Forest Accuracy: 0.8141025641025641
              precision    recall  f1-score   support

           0       0.83      0.37      0.51        41
           1       0.81      0.97      0.89       115

    accuracy                           0.81       156
   macro avg       0.82      0.67      0.70       156
weighted avg       0.82      0.81      0.79       156



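*The Random Forest model performs similarly to logistic regression in terms of accuracy (0.81 versus 0.82),
suggesting that nonlinear relationships provide limited additional explanatory power
for this classification task. Its recall for class 0 is lower (0.37 versus 0.54), so despite
`class_weight='balanced'` it misses more of the seasons without home advantage.*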
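
Accuracy alone hides how each model treats the minority class. The sketch below rebuilds the confusion-matrix counts from the two classification reports above (supports of 41 and 115, together with the printed accuracy and recall) and computes balanced accuracy with scikit-learn. The counts are a reconstruction from rounded figures, not an output of the notebook, and the helper `labels_from_counts` is hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

def labels_from_counts(tn, fp, fn, tp):
    """Build y_true / y_pred arrays that reproduce a 2x2 confusion matrix."""
    y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
    y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)
    return y_true, y_pred

# Counts inferred from the rounded reports (class 0 treated as the negative class);
# they are a reconstruction, not values printed by the notebook.
reconstructed = {
    "logistic_regression": dict(tn=22, fp=19, fn=9, tp=106),
    "random_forest": dict(tn=15, fp=26, fn=3, tp=112),
}

summary = {}
for name, counts in reconstructed.items():
    y_true, y_pred = labels_from_counts(**counts)
    summary[name] = {
        "accuracy": accuracy_score(y_true, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        "recall_class_0": recall_score(y_true, y_pred, pos_label=0),
    }
    print(name, {k: round(v, 3) for k, v in summary[name].items()})
```

On these reconstructed counts, balanced accuracy is about 0.73 for logistic regression and 0.67 for the Random Forest, so the near-identical accuracies mask a weaker result on the minority class for the forest.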

*Overall, both classification models demonstrate that home advantage can be predicted
from a small set of performance-related features, although the prediction accuracy
remains moderate. This aligns with earlier findings that home advantage exists but is not overwhelmingly strong.*
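
Both classifiers are scored on a single 80/20 split, so the reported figures carry some split-to-split noise. A hedged sketch of stratified 5-fold cross-validation, run on synthetic data because the project's dataset is not reproduced here (the generator settings are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in: 8 features, roughly 22% of samples in class 0.
X_cv, y_cv = make_classification(
    n_samples=800, n_features=8, n_informative=4,
    weights=[0.22, 0.78], random_state=0,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
forest = RandomForestClassifier(
    n_estimators=100, random_state=42, class_weight="balanced"
)

# One balanced-accuracy score per fold; the spread indicates how much
# a single 80/20 split could vary.
scores = cross_val_score(forest, X_cv, y_cv, cv=cv, scoring="balanced_accuracy")
print("Fold scores:", scores.round(3))
print(f"Mean {scores.mean():.3f}, std {scores.std():.3f}")
```

Reporting the mean and standard deviation across folds gives a steadier basis for comparing the two models than one test set of 156 rows.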

## Regression Task

We model the magnitude of home advantage using the continuous variable:

- `home_points_per_match`

This allows us to estimate how match statistics explain home performance strength.

In [8]:
regression_target = 'home_points_per_match'

X_reg = df[feature_cols]
y_reg = df[regression_target]

X_reg = X_reg.dropna()
y_reg = y_reg.loc[X_reg.index]

In [9]:
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

In [10]:
lin_model = LinearRegression()
lin_model.fit(X_train_r, y_train_r)

y_pred_r = lin_model.predict(X_test_r)

print("Regression Performance:")
print("MSE:", mean_squared_error(y_test_r, y_pred_r))
print("MAE:", mean_absolute_error(y_test_r, y_pred_r))
print("R2:", r2_score(y_test_r, y_pred_r))

Regression Performance:
MSE: 0.022659006881768442
MAE: 0.11931758748599892
R2: 0.9059943472673437


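*The regression model explains about 91% of the variance in home points per match (R² = 0.906),
with a mean absolute error of about 0.12 points per match. This suggests that match performance indicators are strong predictors of
home performance intensity when modeled in a continuous framework. Part of this fit is to be expected,
because goals scored and conceded largely determine match results and therefore points.*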
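
A high R² alone does not show which inputs carry the fit. One way to check, sketched here on simulated data, is to fit the regression on standardised features and compare coefficient magnitudes. The feature values and the formula generating the target are assumptions chosen for illustration, not the project's data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 300

# Synthetic features; names echo the notebook, values are simulated.
X_demo = pd.DataFrame({
    "home_goals_for_per_90": rng.normal(1.5, 0.4, n),
    "home_goals_against_per_90": rng.normal(1.1, 0.3, n),
    "home_penalties_won": rng.poisson(3, n).astype(float),
})
# Simulated target: driven by goals only, plus noise (an assumption).
y_demo = (
    1.4
    + 0.8 * (X_demo["home_goals_for_per_90"] - 1.5)
    - 0.7 * (X_demo["home_goals_against_per_90"] - 1.1)
    + rng.normal(0, 0.1, n)
)

# Standardising first puts coefficients on a common per-standard-deviation scale.
X_std = StandardScaler().fit_transform(X_demo)
model = LinearRegression().fit(X_std, y_demo)

coefs = pd.Series(model.coef_, index=X_demo.columns)
coefs = coefs.sort_values(key=np.abs, ascending=False)
print(coefs.round(3))
```

Applied to the real features, the same ranking would show how much of the fit comes from the goal-based columns compared with expected goals or penalties.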


In [11]:
pred_df = pd.DataFrame({
    "actual_home_ppm": y_test_r.values,
    "predicted_home_ppm": y_pred_r,
    "residual": y_test_r.values - y_pred_r
})

pred_df.to_csv("regression_predictions.csv", index=False)

pred_df.head()

,actual_home_ppm,predicted_home_ppm,residual
0,1.74,1.721531,0.018469
1,0.89,1.199929,-0.309929
2,1.0,1.210654,-0.210654
3,2.11,2.232996,-0.122996
4,1.68,1.767047,-0.087047
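
To read the errors in the units of the target, the reported MSE can be converted to an RMSE, and the sign of each residual shows the direction of the miss. A short sketch using only numbers printed above; the five rows from `pred_df.head()` are too few to be representative, and the name `head_rows` is illustrative.

```python
import math
import pandas as pd

# MSE copied from the regression output above.
reported_mse = 0.022659006881768442
rmse = math.sqrt(reported_mse)  # same units as the target (points per match)
print(f"RMSE: {rmse:.3f} points per match")

# The five rows shown by pred_df.head().
head_rows = pd.DataFrame({
    "actual_home_ppm": [1.74, 0.89, 1.0, 2.11, 1.68],
    "predicted_home_ppm": [1.721531, 1.199929, 1.210654, 2.232996, 1.767047],
})
head_rows["residual"] = head_rows["actual_home_ppm"] - head_rows["predicted_home_ppm"]

# A negative residual means the model over-predicted home points per match.
n_over = int((head_rows["residual"] < 0).sum())
print(head_rows.round(3))
print(f"Over-predicted rows: {n_over} of {len(head_rows)}")
```

The RMSE comes to about 0.15 points per match. Running the same sign check on the full `pred_df` would show whether the model over-predicts systematically or only in these first rows.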


*Overall, the machine learning results reinforce the main findings of the project. Home advantage is present and partially predictable using performance metrics,
but its magnitude remains moderate, consistent with the observed post-VAR decline
rather than a complete disappearance.*
