# Football Match Prediction Model v4

This notebook implements an optimized machine learning model for football match outcome prediction using pre-tuned hyperparameters.

## Objectives:
- Load the master dataset with comprehensive match features
- Perform exploratory data analysis and feature engineering
- Build and evaluate prediction models using **optimized hyperparameters**
- Generate predictions and model insights
- **Skip hyperparameter tuning** - reuse the best parameters found in earlier runs for faster execution

## Key Improvements in v4:
- ⚡ **Faster execution** - no hyperparameter optimization required
- 🎯 **Pre-optimized parameters** - uses best parameters found in previous experiments
- 📊 **Same analysis depth** - maintains all evaluation and visualization capabilities
- 🔄 **Reproducible results** - consistent performance with fixed parameters

## Dataset Features:
- Match statistics (possession, shots, passes, etc.)
- Team wage information and squad details
- Historical form metrics (rolling 5-match averages)
- Rest days and contextual match information
- Both team and opponent perspectives
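
The form metrics listed above are built upstream in the data engineering notebook, which is not shown here. As a rough sketch of how a leakage-free rolling average could be computed, assuming a hypothetical long-format frame with `team_id`, `date`, and `goals_for` columns (the helper name `add_form_avg` is also an assumption, not the pipeline's actual code):

```python
import pandas as pd


def add_form_avg(matches: pd.DataFrame, col: str, window: int = 5) -> pd.DataFrame:
    """Add `<col>_form_avg`: mean of the previous `window` matches per team.

    Hypothetical helper for illustration; the real pipeline may differ.
    """
    out = matches.sort_values(["team_id", "date"]).copy()
    out[f"{col}_form_avg"] = (
        out.groupby("team_id")[col]
        # shift(1) excludes the current match, so only past results are used
        .transform(lambda s: s.shift(1).rolling(window, min_periods=1).mean())
    )
    return out


toy = pd.DataFrame({
    "team_id": ["A"] * 4,
    "date": pd.to_datetime(["2024-01-01", "2024-01-08", "2024-01-15", "2024-01-22"]),
    "goals_for": [2, 0, 1, 3],
})
form = add_form_avg(toy, "goals_for")
print(form[["date", "goals_for", "goals_for_form_avg"]])
```

The first match of each team has no history, so its form value is missing. That is consistent with the missing values reported for the `*_form_*` columns later in this notebook.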

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

from sklearn.feature_selection import mutual_info_classif, f_classif
from scipy.stats import pearsonr,ranksums

# Machine Learning libraries
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

# Set display options
pd.set_option('display.max_columns', None)
plt.style.use('default')
sns.set_palette("husl")

print("✓ Libraries imported successfully")



✓ Libraries imported successfully


In [2]:
# Set up paths
project_root = Path().resolve().parent.parent.parent
data_masters = project_root / 'data' / 'prod' / 'processed' / 'masters'
models_dir = project_root / 'models' / 'premier_league'

# Create models directory if it doesn't exist
models_dir.mkdir(parents=True, exist_ok=True)

print(f"Project root: {project_root}")
print(f"Masters data: {data_masters}")
print(f"Models directory: {models_dir}")

Project root: C:\Users\50230\OneDrive\Escritorio\Proyectos y trabajos\Personales\Pronósticos Football
Masters data: C:\Users\50230\OneDrive\Escritorio\Proyectos y trabajos\Personales\Pronósticos Football\data\prod\processed\masters
Models directory: C:\Users\50230\OneDrive\Escritorio\Proyectos y trabajos\Personales\Pronósticos Football\models\premier_league


## Data Loading

Load the master dataset created in the data engineering phase.

In [3]:
# Load master dataset (prefer parquet for efficiency)
dataset_file = data_masters / 'match_stats_master_complete_v1.parquet'

if dataset_file.exists():
    df = pd.read_parquet(dataset_file)
    print(f"✓ Dataset loaded successfully")
    print(f"  Shape: {df.shape}")
    print(f"  Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
else:
    print("✗ Master dataset not found. Please run master_creation_v1.ipynb first.")
    df = None

✓ Dataset loaded successfully
  Shape: (5711, 258)
  Memory usage: 15.7 MB


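## Feature Dropping

Group the columns by role, then drop the raw same-match statistics for both teams (the `*_validacion` lists). Those values are only known after the match has been played, so only the form aggregates are kept as candidate features.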

In [4]:
id_cols = [
    'date',
    'comp',
    'round',
    'season',
    'team_id',
    'full_match_report_url',
    'team_name',
    'opponent',
    'opponent_id',
    
]

match_cols = [
    'referee',
    'start_time',
    'dayofweek'
]

target_cols = [
    'result',
]

stats_team_A = [
    'venue',
    'Aerials Won_favor_form_avg',
    'Aerials Won_favor_form_sum',
    'Clearances_favor_form_avg',
    'Clearances_favor_form_sum',
    'Corners_favor_form_avg',
    'Corners_favor_form_sum',
    'Crosses_favor_form_avg',
    'Crosses_favor_form_sum',
    'Fouls_favor_form_avg',
    'Fouls_favor_form_sum',
    'Goal Kicks_favor_form_avg',
    'Goal Kicks_favor_form_sum',
    'Interceptions_favor_form_avg',
    'Interceptions_favor_form_sum',
    'Long Balls_favor_form_avg',
    'Long Balls_favor_form_sum',
    'Offsides_favor_form_avg',
    'Offsides_favor_form_sum',
    'Passing Accuracy_favor_form_avg',
    'Passing Accuracy_favor_form_sum',
    'Possession_favor_form_avg',
    'Possession_favor_form_sum',
    'Saves_favor_form_avg',
    'Saves_favor_form_sum',
    'Shots on Target_favor_form_avg',
    'Shots on Target_favor_form_sum',
    'Tackles_favor_form_avg',
    'Tackles_favor_form_sum',
    'Throw Ins_favor_form_avg',
    'Throw Ins_favor_form_sum',
    'Touches_favor_form_avg',
    'Touches_favor_form_sum',
    'Aerials Won_against_form_avg',
    'Aerials Won_against_form_sum',
    'Clearances_against_form_avg',
    'Clearances_against_form_sum',
    'Corners_against_form_avg',
    'Corners_against_form_sum',
    'Crosses_against_form_avg',
    'Crosses_against_form_sum',
    'Fouls_against_form_avg',
    'Fouls_against_form_sum',
    'Goal Kicks_against_form_avg',
    'Goal Kicks_against_form_sum',
    'Interceptions_against_form_avg',
    'Interceptions_against_form_sum',
    'Long Balls_against_form_avg',
    'Long Balls_against_form_sum',
    'Offsides_against_form_avg',
    'Offsides_against_form_sum',
    'Passing Accuracy_against_form_avg',
    'Passing Accuracy_against_form_sum',
    'Possession_against_form_avg',
    'Possession_against_form_sum',
    'Saves_against_form_avg',
    'Saves_against_form_sum',
    'Shots on Target_against_form_avg',
    'Shots on Target_against_form_sum',
    'Tackles_against_form_avg',
    'Tackles_against_form_sum',
    'Throw Ins_against_form_avg',
    'Throw Ins_against_form_sum',
    'Touches_against_form_avg',
    'Touches_against_form_sum',
    'points_form_avg',
    'points_form_sum',
    'rest_days',
    'rest_days_form_avg',
    'rest_days_form_sum',
    'xg_for_form_avg',
    'xg_for_form_sum',
    'xg_against_form_avg',
    'xg_against_form_sum',
    'goals_for_form_avg',
    'goals_for_form_sum',
    'goals_against_form_avg',
    'goals_against_form_sum'
]

columnas_team_A_validacion = [
    'Aerials Won_favor',
    'Clearances_favor',
    'Corners_favor',
    'Crosses_favor',
    'Fouls_favor',
    'Goal Kicks_favor',
    'Interceptions_favor',
    'Long Balls_favor',
    'Offsides_favor',
    'Passing Accuracy_favor',
    'Possession_favor',
    'Saves_favor',
    'Shots on Target_favor',
    'Tackles_favor',
    'Throw Ins_favor',
    'Touches_favor',
    'Aerials Won_against',
    'Clearances_against',
    'Corners_against',
    'Crosses_against',
    'Fouls_against',
    'Goal Kicks_against',
    'Interceptions_against',
    'Long Balls_against',
    'Offsides_against',
    'Passing Accuracy_against',
    'Possession_against',
    'Saves_against',
    'Shots on Target_against',
    'Tackles_against',
    'Throw Ins_against',
    'Touches_against',
    'points',
    'xg_for',
    'xg_against',
    'goals_for',
    'goals_against'
]

players_team_A = [
    'age_mean',
    'squad_size',
    'age_max',
    'age_min',
    'avg_wage_dollars',
    'total_wage_bill_dollars',
    'max_wage_dollars',
    'min_wage_dollars'
 
]

stats_team_B = [
    'Aerials Won_favor_opponent_form_avg',
    'Aerials Won_favor_opponent_form_sum',
    'Clearances_favor_opponent_form_avg',
    'Clearances_favor_opponent_form_sum',
    'Corners_favor_opponent_form_avg',
    'Corners_favor_opponent_form_sum',
    'Crosses_favor_opponent_form_avg',
    'Crosses_favor_opponent_form_sum',
    'Fouls_favor_opponent_form_avg',
    'Fouls_favor_opponent_form_sum',
    'Goal Kicks_favor_opponent_form_avg',
    'Goal Kicks_favor_opponent_form_sum',
    'Interceptions_favor_opponent_form_avg',
    'Interceptions_favor_opponent_form_sum',
    'Long Balls_favor_opponent_form_avg',
    'Long Balls_favor_opponent_form_sum',
    'Offsides_favor_opponent_form_avg',
    'Offsides_favor_opponent_form_sum',
    'Passing Accuracy_favor_opponent_form_avg',
    'Passing Accuracy_favor_opponent_form_sum',
    'Possession_favor_opponent_form_avg',
    'Possession_favor_opponent_form_sum',
    'Saves_favor_opponent_form_avg',
    'Saves_favor_opponent_form_sum',
    'Shots on Target_favor_opponent_form_avg',
    'Shots on Target_favor_opponent_form_sum',
    'Tackles_favor_opponent_form_avg',
    'Tackles_favor_opponent_form_sum',
    'Throw Ins_favor_opponent_form_avg',
    'Throw Ins_favor_opponent_form_sum',
    'Touches_favor_opponent_form_avg',
    'Touches_favor_opponent_form_sum',
    'Aerials Won_against_opponent_form_avg',
    'Aerials Won_against_opponent_form_sum',
    'Clearances_against_opponent_form_avg',
    'Clearances_against_opponent_form_sum',
    'Corners_against_opponent_form_avg',
    'Corners_against_opponent_form_sum',
    'Crosses_against_opponent_form_avg',
    'Crosses_against_opponent_form_sum',
    'Fouls_against_opponent_form_avg',
    'Fouls_against_opponent_form_sum',
    'Goal Kicks_against_opponent_form_avg',
    'Goal Kicks_against_opponent_form_sum',
    'Interceptions_against_opponent_form_avg',
    'Interceptions_against_opponent_form_sum',
    'Long Balls_against_opponent_form_avg',
    'Long Balls_against_opponent_form_sum',
    'Offsides_against_opponent_form_avg',
    'Offsides_against_opponent_form_sum',
    'Passing Accuracy_against_opponent_form_avg',
    'Passing Accuracy_against_opponent_form_sum',
    'Possession_against_opponent_form_avg',
    'Possession_against_opponent_form_sum',
    'Saves_against_opponent_form_avg',
    'Saves_against_opponent_form_sum',
    'Shots on Target_against_opponent_form_avg',
    'Shots on Target_against_opponent_form_sum',
    'Tackles_against_opponent_form_avg',
    'Tackles_against_opponent_form_sum',
    'Throw Ins_against_opponent_form_avg',
    'Throw Ins_against_opponent_form_sum',
    'Touches_against_opponent_form_avg',
    'Touches_against_opponent_form_sum',
    'points_opponent_form_avg',
    'points_opponent_form_sum',
    'rest_days_opponent',
    'rest_days_opponent_form_avg',
    'rest_days_opponent_form_sum',
    'xg_for_opponent_form_avg',
    'xg_for_opponent_form_sum',
    'xg_against_opponent_form_avg',
    'xg_against_opponent_form_sum',
    'goals_for_opponent_form_avg',
    'goals_for_opponent_form_sum',
    'goals_against_opponent_form_avg',
    'goals_against_opponent_form_sum'
]

columnas_team_B_validacion = [
    'Aerials Won_favor_opponent',
    'Clearances_favor_opponent',
    'Corners_favor_opponent',
    'Crosses_favor_opponent',
    'Fouls_favor_opponent',
    'Goal Kicks_favor_opponent',
    'Interceptions_favor_opponent',
    'Long Balls_favor_opponent',
    'Offsides_favor_opponent',
    'Passing Accuracy_favor_opponent',
    'Possession_favor_opponent',
    'Saves_favor_opponent',
    'Shots on Target_favor_opponent',
    'Tackles_favor_opponent',
    'Throw Ins_favor_opponent',
    'Touches_favor_opponent',
    'Aerials Won_against_opponent',
    'Clearances_against_opponent',
    'Corners_against_opponent',
    'Crosses_against_opponent',
    'Fouls_against_opponent',
    'Goal Kicks_against_opponent',
    'Interceptions_against_opponent',
    'Long Balls_against_opponent',
    'Offsides_against_opponent',
    'Passing Accuracy_against_opponent',
    'Possession_against_opponent',
    'Saves_against_opponent',
    'Shots on Target_against_opponent',
    'Tackles_against_opponent',
    'Throw Ins_against_opponent',
    'Touches_against_opponent',
    'points_opponent',
    'xg_for_opponent',
    'xg_against_opponent',
    'goals_for_opponent',
    'goals_against_opponent'
]

players_team_B = [
    'opp_age_mean',
    'opp_squad_size',
    'opp_age_max',
    'opp_age_min',
    'opp_avg_wage_dollars',
    'opp_total_wage_bill_dollars',
    'opp_max_wage_dollars',
    'opp_min_wage_dollars'
]

# Same-match statistics for both teams; not available before kickoff
cols_drop = columnas_team_A_validacion + columnas_team_B_validacion

df.drop(columns=cols_drop, inplace=True)

## Feature Engineering & Data Preparation

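Add team-minus-opponent difference features, summarize the dataset structure and target distribution, then encode the categorical columns and derive composite form features.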

In [5]:
cols_add_A = [
 'age_mean',
 'squad_size',
 'age_max',
 'age_min',
 'avg_wage_dollars',
 'total_wage_bill_dollars',
 'max_wage_dollars',
 'min_wage_dollars'
]

cols_add_B = [
'opp_age_mean',
 'opp_squad_size',
 'opp_age_max',
 'opp_age_min',
 'opp_avg_wage_dollars',
 'opp_total_wage_bill_dollars',
 'opp_max_wage_dollars',
 'opp_min_wage_dollars' 
]

# Team-minus-opponent difference for each squad and wage attribute
for col_a, col_b in zip(cols_add_A, cols_add_B):
    df[f'{col_a}_diff'] = df[col_a] - df[col_b]

if df is not None:
    # Basic dataset info
    print("Dataset Overview:")
    print(f"  Rows: {len(df):,}")
    print(f"  Columns: {len(df.columns):,}")
    print(f"  Date range: {df['date'].min()} to {df['date'].max()}")
    print(f"  Seasons: {sorted(df['season'].unique())}")
    print(f"  Teams: {len(df['team_id'].unique())} unique teams")
    
    # Target variable distribution
    print("\nTarget Variable (Result) Distribution:")
    result_counts = df['result'].value_counts()
    result_pct = df['result'].value_counts(normalize=True) * 100
    
    for result in ['W', 'D', 'L']:
        if result in result_counts:
            print(f"  {result}: {result_counts[result]:,} ({result_pct[result]:.1f}%)")
    
    # Missing values summary
    missing_summary = df.isnull().sum().sort_values(ascending=False)
    missing_pct = (missing_summary / len(df) * 100).round(1)
    
    print(f"\nColumns with missing values (top 10):")
    for col, missing in missing_summary.head(10).items():
        if missing > 0:
            print(f"  {col}: {missing:,} ({missing_pct[col]}%)")

Dataset Overview:
  Rows: 5,711
  Columns: 192
  Date range: 2019-08-04 00:00:00 to 2025-05-28 00:00:00
  Seasons: ['2019-2020', '2020-2021', '2021-2022', '2022-2023', '2023-2024', '2024-2025']
  Teams: 27 unique teams

Target Variable (Result) Distribution:
  W: 2,406 (42.1%)
  D: 1,257 (22.0%)
  L: 2,048 (35.9%)

Columns with missing values (top 10):
  rest_days_opponent_form_sum: 817 (14.3%)
  rest_days_opponent_form_avg: 817 (14.3%)
  xg_against_opponent_form_sum: 796 (13.9%)
  xg_against_opponent_form_avg: 796 (13.9%)
  xg_for_opponent_form_sum: 796 (13.9%)
  xg_for_opponent_form_avg: 796 (13.9%)
  Saves_favor_opponent_form_avg: 794 (13.9%)
  Long Balls_favor_opponent_form_sum: 794 (13.9%)
  Offsides_favor_opponent_form_avg: 794 (13.9%)
  Offsides_favor_opponent_form_sum: 794 (13.9%)


In [6]:
df['dayofweek'] = df['dayofweek'].map({
      'Mon': 'midweek',    
      'Tue': 'midweek',    
      'Wed': 'midweek',    
      'Thu': 'midweek',    
      'Fri': 'Fri',    
      'Sat': 'Sat',    
      'Sun': 'Sun'
})

dummies = pd.get_dummies(
    df['dayofweek'],
    drop_first=False,
    dtype=int
)

df.drop('dayofweek', axis=1, inplace=True)

df = pd.concat([df, dummies], axis=1)

df['venue'] = df['venue'].map({
    'Away': 0,
    'Home': 1,
    'Neutral': 0
})

df['result'] = df['result'].astype('category')
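
One thing to keep in mind with the `.map(...)` calls above: any label that is not a key in the mapping becomes `NaN`, and `pd.get_dummies` then emits an all-zero row for it. A small sketch on toy data (the `DAY_BUCKETS` name and the sample values are illustrative assumptions) shows how to list unmapped labels before encoding:

```python
import pandas as pd

DAY_BUCKETS = {
    "Mon": "midweek", "Tue": "midweek", "Wed": "midweek", "Thu": "midweek",
    "Fri": "Fri", "Sat": "Sat", "Sun": "Sun",
}

# "Sunday" is deliberately not a key, to show what an unexpected label does
days = pd.Series(["Mon", "Sat", "Sun", "Fri", "Sunday"], name="dayofweek")
buckets = days.map(DAY_BUCKETS)

# Labels that the mapping silently turned into NaN
unmapped = days[buckets.isna()].unique().tolist()

# NaN rows get zeros in every dummy column
dummies = pd.get_dummies(buckets, dtype=int)

print(unmapped)
print(dummies)
```

Running the same check on `df['dayofweek']` and `df['venue']` before mapping would confirm that every value in the real data is covered.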

In [7]:
# Goal-specific features:
df['xg_diff'] = df['xg_for_form_avg'] - df['xg_against_form_avg']
df['shots_ratio'] = df['Shots on Target_favor_form_avg'] / (df['Shots on Target_against_form_avg'] + 1)
df['possession_dominance'] = df['Possession_favor_form_avg'] - 50  # Possession share relative to an even 50% split
df['attacking_pressure'] = df['Corners_favor_form_avg'] + df['Crosses_favor_form_avg']

# Defensive solidity
df['defensive_actions'] = df['Tackles_favor_form_avg'] + df['Interceptions_favor_form_avg'] + df['Clearances_favor_form_avg']

In [8]:
informacion_competencias = pd.get_dummies(
    df['comp'],
    dtype=int
 )

df = pd.concat([df, informacion_competencias], axis=1)

## Feature Selection

### Features to Evaluate

In [9]:
id_cols = [
    'date',
    'comp',
    'round',
    'season',
    'team_id',
    'full_match_report_url',
    'team_name',
    'opponent',
    'opponent_id',
    'referee',
    'start_time'
]

features = [
    'Passing Accuracy_against_opponent_form_avg',
    'goals_for_opponent_form_avg',
    'Throw Ins_favor_opponent_form_sum',
    'Passing Accuracy_favor_opponent_form_sum',
    'Saves_against_form_sum',
    'Touches_favor_form_sum',
    'Clearances_against_form_sum',
    'Shots on Target_against_form_sum',
    'Touches_favor_opponent_form_avg',
    'Interceptions_against_opponent_form_avg',
    'rest_days_opponent',
    'xg_against_opponent_form_avg',
    'Aerials Won_favor_opponent_form_sum',
    'goals_against_opponent_form_sum',
    'points_opponent_form_avg',
    'rest_days_form_avg',
    'Community Shield',
    'Corners_favor_form_sum',
    'min_wage_dollars',
    'Sun',
    'Possession_favor_opponent_form_avg',
    'Clearances_against_opponent_form_avg',
    'xg_for_opponent_form_sum',
    'Tackles_against_form_avg',
    'Throw Ins_against_opponent_form_sum',
    'Passing Accuracy_against_form_avg',
    'Goal Kicks_against_form_sum',
    'Fouls_against_opponent_form_avg',
    'Touches_against_form_avg',
    'Tackles_favor_form_avg',
    'Tackles_against_opponent_form_avg',
    'rest_days_opponent_form_sum',
    'Crosses_against_opponent_form_sum',
    'FA Cup',
    'Long Balls_favor_form_avg',
    'Interceptions_against_form_avg',
    'Offsides_against_opponent_form_sum',
    'Interceptions_favor_opponent_form_avg',
    'Throw Ins_favor_form_sum',
    'goals_against_form_sum',
    'goals_for_opponent_form_sum',
    'opp_squad_size',
    'points_form_sum',
    'Europa Lg',
    'Offsides_favor_opponent_form_sum',
    'Corners_favor_opponent_form_sum',
    'age_min',
    'min_wage_dollars_diff',
    'Passing Accuracy_favor_form_sum',
    'Tackles_favor_opponent_form_sum',
    'Possession_favor_opponent_form_sum',
    'total_wage_bill_dollars_diff',
    'Crosses_against_opponent_form_avg',
    'squad_size',
    'Offsides_against_form_avg',
    'xg_against_form_avg',
    'Aerials Won_favor_form_avg',
    'rest_days_opponent_form_avg',
    'points_opponent_form_sum',
    'squad_size_diff',
    'rest_days_form_sum',
    'opp_min_wage_dollars',
    'Corners_against_opponent_form_avg',
    'Offsides_against_opponent_form_avg',
    'Saves_favor_form_sum',
    'Shots on Target_against_opponent_form_avg',
    'Offsides_favor_form_sum',
    'Passing Accuracy_favor_form_avg',
    'Shots on Target_favor_form_avg',
    'Fouls_against_form_avg',
    'Aerials Won_against_opponent_form_avg',
    'Long Balls_against_form_avg',
    'opp_age_min',
    'Clearances_favor_opponent_form_sum',
    'EFL Cup',
    'Touches_against_opponent_form_avg',
    'total_wage_bill_dollars',
    'FA Community Shield',
    'Tackles_favor_form_sum',
    'Offsides_favor_form_avg',
    'goals_against_form_avg',
    'goals_against_opponent_form_avg',
    'Possession_against_opponent_form_avg',
    'Throw Ins_favor_opponent_form_avg',
    'Tackles_against_form_sum',
    'Possession_favor_form_avg',
    'Long Balls_favor_form_sum',
    'Crosses_favor_form_sum',
    'Touches_against_form_sum',
    'Shots on Target_favor_form_sum',
    'rest_days',
    'Clearances_favor_form_sum',
    'Corners_against_opponent_form_sum',
    'Saves_against_form_avg',
    'opp_total_wage_bill_dollars',
    'Touches_favor_opponent_form_sum',
    'Tackles_favor_opponent_form_avg',
    'age_mean_diff',
    'Possession_against_form_avg',
    'age_max',
    'Long Balls_favor_opponent_form_sum',
    'Shots on Target_favor_opponent_form_sum',
    'Premier League',
    'Long Balls_against_opponent_form_avg',
    'Clearances_against_form_avg',
    'xg_against_form_sum',
    'Fouls_favor_form_sum',
    'Goal Kicks_favor_form_avg',
    'Offsides_against_form_sum',
    'opp_max_wage_dollars',
    'Saves_favor_form_avg',
    'Passing Accuracy_favor_opponent_form_avg',
    'Goal Kicks_favor_opponent_form_sum',
    'Throw Ins_against_form_avg',
    'xg_for_form_sum',
    'Aerials Won_against_form_sum',
    'Tackles_against_opponent_form_sum',
    'goals_for_form_avg',
    'Shots on Target_favor_opponent_form_avg',
    'Interceptions_against_form_sum',
    'Corners_favor_form_avg',
    'Crosses_favor_opponent_form_sum',
    'Sat',
    'opp_age_mean',
    'goals_for_form_sum',
    'Goal Kicks_against_opponent_form_avg',
    'Passing Accuracy_against_opponent_form_sum',
    'Possession_against_opponent_form_sum',
    'Aerials Won_against_form_avg',
    'max_wage_dollars_diff',
    'Throw Ins_against_form_sum',
    'Fouls_favor_opponent_form_avg',
    'age_mean',
    'age_max_diff',
    'Saves_against_opponent_form_avg',
    'Possession_favor_form_sum',
    'opp_age_max',
    'Saves_against_opponent_form_sum',
    'Passing Accuracy_against_form_sum',
    'Crosses_favor_opponent_form_avg',
    'xg_for_opponent_form_avg',
    'Interceptions_favor_form_sum',
    'Goal Kicks_against_opponent_form_sum',
    'Corners_against_form_avg',
    'Interceptions_favor_form_avg',
    'Fouls_favor_opponent_form_sum',
    'Interceptions_favor_opponent_form_sum',
    'Throw Ins_against_opponent_form_avg',
    'xg_against_opponent_form_sum',
    'Goal Kicks_favor_form_sum',
    'Throw Ins_favor_form_avg',
    'opp_avg_wage_dollars',
    'Fouls_favor_form_avg',
    'Touches_against_opponent_form_sum',
    'max_wage_dollars',
    'Touches_favor_form_avg',
    'Saves_favor_opponent_form_avg',
    'Champions Lg',
    'Conf Lg',
    'age_min_diff',
    'Goal Kicks_against_form_avg',
    'venue',
    'Shots on Target_against_form_avg',
    'avg_wage_dollars_diff',
    'Fri',
    'Goal Kicks_favor_opponent_form_avg',
    'Offsides_favor_opponent_form_avg',
    'Long Balls_against_opponent_form_sum',
    'Clearances_favor_opponent_form_avg',
    'Long Balls_against_form_sum',
    'xg_for_form_avg',
    'Crosses_against_form_sum',
    'Aerials Won_favor_form_sum',
    'Fouls_against_form_sum',
    'Corners_against_form_sum',
    'Clearances_against_opponent_form_sum',
    'points_form_avg',
    'Aerials Won_against_opponent_form_sum',
    'Fouls_against_opponent_form_sum',
    'Aerials Won_favor_opponent_form_avg',
    'Saves_favor_opponent_form_sum',
    'Clearances_favor_form_avg',
    'Shots on Target_against_opponent_form_sum',
    'Super Cup',
    'Crosses_favor_form_avg',
    'Interceptions_against_opponent_form_sum',
    'Crosses_against_form_avg',
    'Long Balls_favor_opponent_form_avg',
    'avg_wage_dollars',
    'Possession_against_form_sum',
    'Corners_favor_opponent_form_avg',
    'midweek',
    'xg_diff',
    'shots_ratio',
    'possession_dominance',
    'attacking_pressure',
    'defensive_actions'
    ]   

# NOTE: not used by the classification cells below, which train on `result`
target = [
    'goal_diff'
    ]

## Data Preparation for ML

In [10]:
df_train = df[
    df['season']!='2024-2025'
    ]
df_test = df[
    (df['season']=='2024-2025') & 
    (df['comp']=='Premier League')
    ]


X_train = df_train[id_cols + features]
y_train = df_train['result']

X_test = df_test[id_cols + features]
y_test = df_test['result']
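
The split above holds out the most recent season, so every test match is played after the training matches. The same idea can be sketched as a small reusable helper with a sanity check on toy data (the function name `season_holdout` and the toy rows are assumptions for illustration):

```python
import pandas as pd


def season_holdout(df: pd.DataFrame, test_season: str, test_comp=None):
    """Split by season: everything outside `test_season` is training data.

    Optionally restrict the test set to a single competition.
    """
    is_test = df["season"] == test_season
    train = df[~is_test]
    test = df[is_test]
    if test_comp is not None:
        test = test[test["comp"] == test_comp]
    return train, test


toy = pd.DataFrame({
    "date": pd.to_datetime(["2023-09-01", "2024-02-10", "2024-09-14", "2024-10-02"]),
    "season": ["2023-2024", "2023-2024", "2024-2025", "2024-2025"],
    "comp": ["Premier League", "FA Cup", "Premier League", "EFL Cup"],
})

train, test = season_holdout(toy, "2024-2025", test_comp="Premier League")

# Sanity check: no training match is played after the first test match
assert train["date"].max() < test["date"].min()
print(len(train), len(test))
```

A random `train_test_split` would mix seasons and let the model see matches played after the ones it is evaluated on, which is why a season-based holdout is the safer choice here.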

## Model Training & Evaluation

Train a LightGBM model using **pre-optimized hyperparameters** (no tuning required).

In [11]:
# Import LightGBM and required libraries
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, log_loss
import joblib
from datetime import datetime

print("✓ LightGBM and required libraries imported successfully")

✓ LightGBM and required libraries imported successfully


In [12]:
# Data preparation for LightGBM
print("Preparing data for LightGBM...")

# Handle missing values by filling with median for numerical features
print(f"Missing values before handling: {X_train[features].isnull().sum().sum()}")

# Fill missing values
X_train_clean = X_train[features].fillna(X_train[features].median())
X_test_clean = X_test[features].fillna(X_train[features].median())  # Use training median for test set

print(f"Missing values after handling: {X_train_clean.isnull().sum().sum()}")

# Encode target variable for LightGBM (LabelEncoder sorts labels: D=0, L=1, W=2)
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

print(f"Target encoding: {dict(zip(label_encoder.classes_, range(len(label_encoder.classes_))))}")
print(f"Training set shape: {X_train_clean.shape}")
print(f"Test set shape: {X_test_clean.shape}")
print(f"Training target distribution: {np.bincount(y_train_encoded)}")
print("✓ Data preparation completed")

Preparing data for LightGBM...
Missing values before handling: 63954
Missing values after handling: 0
Target encoding: {'D': 0, 'L': 1, 'W': 2}
Training set shape: (4758, 197)
Test set shape: (760, 197)
Training target distribution: [1039 1715 2004]
✓ Data preparation completed
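
Two details of this cell are easy to get wrong: `LabelEncoder` assigns codes in sorted label order (hence `D=0, L=1, W=2` in the output above), and the test set is filled with medians computed on the training set only. A minimal sketch on toy values, with the column name `xg` chosen purely for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Codes follow the sorted order of the labels, not their order of appearance
encoder = LabelEncoder()
codes = encoder.fit_transform(["W", "D", "L", "W"])
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# Impute both sets with the TRAINING median so no test information leaks in
train = pd.DataFrame({"xg": [1.0, 2.0, np.nan, 3.0]})
test = pd.DataFrame({"xg": [np.nan, 10.0]})

train_median = train.median()
train_filled = train.fillna(train_median)
test_filled = test.fillna(train_median)
print(test_filled)
```

Because the encoder order is alphabetical, any hand-written list of class names passed to a report has to follow `encoder.classes_` rather than the more natural W/D/L order.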


## Pre-Optimized Hyperparameters

Using the best hyperparameters found through previous optimization (skipping the time-consuming tuning process).

In [13]:
# Load pre-optimized hyperparameters from previous optimization study
print("Loading pre-optimized hyperparameters...")

# Known-good parameters used whenever the study cannot be loaded
FALLBACK_PARAMS = {
    'num_leaves': 150,
    'learning_rate': 0.05,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'min_child_samples': 20,
    'reg_alpha': 0.1,
    'reg_lambda': 0.1,
    'max_depth': 8,
    'min_split_gain': 0.1,
    'subsample_for_bin': 200000
}
FALLBACK_CV_ACCURACY = 0.55  # Approximate expected accuracy

best_params_from_study = FALLBACK_PARAMS.copy()
cv_accuracy = FALLBACK_CV_ACCURACY

try:
    # Try to load the optimization study to get best parameters
    study_path = models_dir / 'lgb_optimization_study.joblib'
    if study_path.exists():
        study = joblib.load(study_path)
        best_params_from_study = study.best_params.copy()
        cv_accuracy = study.best_value
        print(f"✓ Loaded parameters from study with CV accuracy: {cv_accuracy:.4f}")
        print("Best parameters from optimization:")
        for key, value in best_params_from_study.items():
            print(f"  {key}: {value}")
    else:
        print("⚠ No optimization study found, using default optimized parameters")

except Exception as e:
    print(f"⚠ Error loading study: {e}")
    print("Using fallback optimized parameters")
    # Reset both values in case the failure happened midway through loading
    best_params_from_study = FALLBACK_PARAMS.copy()
    cv_accuracy = FALLBACK_CV_ACCURACY

# Add fixed parameters
optimized_params = best_params_from_study.copy()
optimized_params.update({
    'objective': 'multiclass',
    'num_class': 3,
    'metric': 'multi_logloss',
    'boosting_type': 'gbdt',
    'verbose': 3,
    'random_state': 42
})

print("\n🚀 READY TO TRAIN WITH OPTIMIZED PARAMETERS!")
print("⚡ This will be much faster than hyperparameter optimization")

Loading pre-optimized hyperparameters...
✓ Loaded parameters from study with CV accuracy: 0.5586
Best parameters from optimization:
  num_leaves: 159
  learning_rate: 0.028313172914310722
  feature_fraction: 0.4001572673828278
  bagging_fraction: 0.47335018732867257
  bagging_freq: 6
  min_child_samples: 90
  reg_alpha: 8.294667356727873
  reg_lambda: 2.438991646607493
  max_depth: 4
  min_split_gain: 0.030033635030515426
  subsample_for_bin: 287502

🚀 READY TO TRAIN WITH OPTIMIZED PARAMETERS!
⚡ This will be much faster than hyperparameter optimization


In [15]:
# Train final model with optimized parameters
print("Training final model with pre-optimized parameters...")
print("="*60)

start_time = datetime.now()

# Create full training dataset
train_data = lgb.Dataset(X_train_clean, label=y_train_encoded)

# Train final model
final_model = lgb.train(
    optimized_params,
    train_data,
    num_boost_round=1000,
    callbacks=[lgb.log_evaluation(100)]
)

end_time = datetime.now()
training_time = end_time - start_time

print(f"\n✅ MODEL TRAINING COMPLETED!")
print(f"⏱ Training time: {training_time}")
print(f"🎯 Expected CV accuracy: {cv_accuracy:.1%}")

# Save the final model
model_path = models_dir / 'lgb_final_model_v4.joblib'
joblib.dump(final_model, model_path)
print(f"💾 Final model saved to: {model_path}")

# Save feature names and label encoder
joblib.dump(features, models_dir / 'feature_names_v4.joblib')
joblib.dump(label_encoder, models_dir / 'label_encoder_v4.joblib')
print("💾 Feature names and label encoder saved")

Training final model with pre-optimized parameters...
[LightGBM] [Debug] Dataset::GetMultiBinFromSparseFeatures: sparse rate 0.911990
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.030725
[LightGBM] [Debug] init for col-wise cost 0.000520 seconds, init for row-wise cost 0.003463 seconds
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.004785 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 25558
[LightGBM] [Info] Number of data points in the train set: 4758, number of used features: 193
[LightGBM] [Info] Start training from score -1.521569
[LightGBM] [Info] Start training from score -1.020414
[LightGBM] [Info] Start training from score -0.864682
[LightGBM] [Debug] Re-bagging, using 2264 data to train
[LightGBM] [Debug] Trained a tree with leaves = 12 and depth = 4
[LightGBM] [Debug] Trained a tree with leaves = 8 and depth = 4
[LightGBM] [Debug] Trained a tree with leaves = 9 an

In [16]:
# Evaluate model on test set
print("Evaluating model on test set...")
print("="*50)

# Make predictions on test set
# No early stopping was used in training, so this predicts with all trained trees
y_test_pred_proba = final_model.predict(X_test_clean, num_iteration=final_model.best_iteration)
y_test_pred_classes = np.argmax(y_test_pred_proba, axis=1)

# Convert back to original labels
y_test_pred_labels = label_encoder.inverse_transform(y_test_pred_classes)
y_test_true_labels = label_encoder.inverse_transform(y_test_encoded)

# Calculate test accuracy
test_accuracy = accuracy_score(y_test_encoded, y_test_pred_classes)

print(f"🎯 TEST SET ACCURACY: {test_accuracy:.4f} ({test_accuracy*100:.2f}%)")
print(f"Cross-validation accuracy: {cv_accuracy:.4f} ({cv_accuracy*100:.2f}%)")

# Detailed classification report
print("\n📊 DETAILED CLASSIFICATION REPORT:")
print(classification_report(y_test_true_labels, y_test_pred_labels, 
                          target_names=['Draw', 'Loss', 'Win']))

# Confusion Matrix
print("\n📈 CONFUSION MATRIX:")
cm = confusion_matrix(y_test_true_labels, y_test_pred_labels, labels=['W', 'D', 'L'])
print("     Predicted")
print("     W    D    L")
for i, actual in enumerate(['W', 'D', 'L']):
    print(f"{actual}  {cm[i][0]:3d} {cm[i][1]:3d} {cm[i][2]:3d}")

# Per-class recall: share of each actual outcome that was predicted correctly
print("\n🎯 PER-CLASS ACCURACY:")
for i, class_name in enumerate(['Win', 'Draw', 'Loss']):
    class_accuracy = cm[i, i] / cm[i].sum()
    print(f"  {class_name}: {class_accuracy:.3f} ({class_accuracy*100:.1f}%)")

# Feature importance
print("\n🔍 TOP 20 MOST IMPORTANT FEATURES:")
feature_imp = pd.DataFrame({
    'feature': features,
    'importance': final_model.feature_importance(importance_type='gain')
}).sort_values('importance', ascending=False)

for i, row in feature_imp.head(20).iterrows():
    print(f"  {row['feature']}: {row['importance']:.0f}")

# Test set distribution
print(f"\n📋 TEST SET INFO:")
print(f"  Total matches: {len(y_test)}")
print(f"  Wins: {sum(y_test_true_labels == 'W')} ({sum(y_test_true_labels == 'W')/len(y_test)*100:.1f}%)")
print(f"  Draws: {sum(y_test_true_labels == 'D')} ({sum(y_test_true_labels == 'D')/len(y_test)*100:.1f}%)")
print(f"  Losses: {sum(y_test_true_labels == 'L')} ({sum(y_test_true_labels == 'L')/len(y_test)*100:.1f}%)")

print("\n✅ MODEL EVALUATION COMPLETED!")
print(f"⚡ Total notebook execution was much faster without hyperparameter optimization!")

Evaluating model on test set...
🎯 TEST SET ACCURACY: 0.5026 (50.26%)
Cross-validation accuracy: 0.5586 (55.86%)

📊 DETAILED CLASSIFICATION REPORT:
              precision    recall  f1-score   support

        Draw       0.70      0.09      0.15       186
        Loss       0.49      0.67      0.57       287
         Win       0.50      0.61      0.55       287

    accuracy                           0.50       760
   macro avg       0.56      0.45      0.42       760
weighted avg       0.55      0.50      0.46       760


📈 CONFUSION MATRIX:
     Predicted
     W    D    L
W  175   4 108
D   80  16  90
L   93   3 191

🎯 PER-CLASS ACCURACY:
  Win: 0.610 (61.0%)
  Draw: 0.086 (8.6%)
  Loss: 0.666 (66.6%)

🔍 TOP 20 MOST IMPORTANT FEATURES:
  avg_wage_dollars_diff: 2601
  total_wage_bill_dollars_diff: 1304
  max_wage_dollars_diff: 1151
  venue: 671
  age_mean_diff: 598
  min_wage_dollars_diff: 597
  xg_diff: 525
  Aerials Won_favor_form_avg: 513
  Touches_favor_form_avg: 491
  total_wage_
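
The headline numbers in the output above can be cross-checked from the confusion matrix alone. The following is a minimal sketch, not part of the original notebook: it hard-codes the printed matrix (rows are actual W/D/L, columns are predicted W/D/L) and recomputes overall accuracy, per-class recall (what the notebook labels "per-class accuracy"), per-class precision, and a majority-class baseline. All variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the printed output above.
# Rows = actual class, columns = predicted class, both in W, D, L order.
labels = ["W", "D", "L"]
cm = np.array([
    [175,  4, 108],
    [ 80, 16,  90],
    [ 93,  3, 191],
])

total = cm.sum()
accuracy = np.trace(cm) / total            # correct predictions / all matches
recall = np.diag(cm) / cm.sum(axis=1)      # correct / actual matches per class
precision = np.diag(cm) / cm.sum(axis=0)   # correct / predicted matches per class
baseline = cm.sum(axis=1).max() / total    # always predict the most common class

print(f"accuracy: {accuracy:.4f}")
print(f"majority-class baseline: {baseline:.4f}")
for name, r, p in zip(labels, recall, precision):
    print(f"{name}: recall={r:.3f} precision={p:.3f}")
```

Read this way, the model clears the majority-class baseline (287 of 760 matches), but it predicts a draw for only 23 of the 760 test matches, which is why draw recall is so low despite reasonable draw precision.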

## Production Model Training

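Train the final production model on **all available data**, with no train/test split, for real-world predictions. Because nothing is held out, the test-set metrics above serve as the performance estimate for this model.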

In [19]:
# 🚀 PRODUCTION MODEL - Train with ENTIRE DATASET for real predictions
print("="*70)
print("🚀 TRAINING PRODUCTION MODEL WITH ENTIRE DATASET")
print("="*70)

start_time = datetime.now()

# Prepare FULL dataset (df) for production model
print("Preparing FULL dataset for production model...")

# Get all available data (no train/test split)
X_full = df[features].fillna(df[features].median())
y_full = df['result']

print(f"Full dataset shape: {X_full.shape}")
print(f"Total matches: {len(y_full):,}")
print(f"Date range: {df['date'].min()} to {df['date'].max()}")

# Encode target variable (this refits the encoder; classes sort alphabetically as D, L, W)
y_full_encoded = label_encoder.fit_transform(y_full)
print(f"Full target distribution: {np.bincount(y_full_encoded)} (D/L/W)")

# Create production dataset
production_data = lgb.Dataset(X_full, label=y_full_encoded)

print("\nTraining production model with optimized parameters...")
print("This model will use ALL available historical data!")

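# Train production model for a fixed 1000 boosting rounds.
# No validation set is passed, so log_evaluation has no metrics to print
# and early stopping cannot be applied here.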
production_model = lgb.train(
    optimized_params,
    production_data,
    num_boost_round=1000,
    callbacks=[lgb.log_evaluation(200)]
)

end_time = datetime.now()
training_time = end_time - start_time

print(f"\n✅ PRODUCTION MODEL TRAINING COMPLETED!")
print(f"⏱ Training time: {training_time}")
print(f"📊 Total training samples: {len(y_full):,}")
print(f"🗓 Data from {len(df['season'].unique())} seasons: {sorted(df['season'].unique())}")

# Save production model
production_model_path = models_dir / 'lgb_production_model_v4.joblib'
joblib.dump(production_model, production_model_path)
print(f"💾 Production model saved to: {production_model_path}")

# Save production artifacts
joblib.dump(features, models_dir / 'production_feature_names_v4.joblib')
joblib.dump(label_encoder, models_dir / 'production_label_encoder_v4.joblib')
joblib.dump(df[features].median(), models_dir / 'production_feature_medians_v4.joblib')
print("💾 Production artifacts saved (features, encoder, medians)")

print(f"\n🎯 PRODUCTION MODEL SUMMARY:")
print(f"  • Model type: LightGBM with optimized hyperparameters")
print(f"  • Training data: {len(y_full):,} matches from {len(df['season'].unique())} seasons")
print(f"  • Features: {len(features)} predictive features")
print(f"  • Target classes: Win/Draw/Loss")
print(f"  • Ready for real predictions!")

print("\n🌟 This model is now ready for production use!")
print("Use this model to make predictions on new, unseen matches.")

🚀 TRAINING PRODUCTION MODEL WITH ENTIRE DATASET
Preparing FULL dataset for production model...
Full dataset shape: (5711, 197)
Total matches: 5,711
Date range: 2019-08-04 00:00:00 to 2025-05-28 00:00:00
Full target distribution: [1257 2048 2406] (D/L/W)

Training production model with optimized parameters...
This model will use ALL available historical data!
[LightGBM] [Debug] Dataset::GetMultiBinFromSparseFeatures: sparse rate 0.912450
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.030685
[LightGBM] [Debug] init for col-wise cost 0.000149 seconds, init for row-wise cost 0.004785 seconds
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.005823 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 26091
[LightGBM] [Info] Number of data points in the train set: 5711, number of used features: 193
[LightGBM] [Info] Start training from score -1.513666
[LightGBM] [Info] Start training from 
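
The `Start training from score` line in the log above is consistent with LightGBM initialising each class's raw score at the log of that class's share of the training data. This is a hedged reading of the log rather than something the notebook states; the sketch below only checks the arithmetic against the class counts printed under "Full target distribution".

```python
import math

# Class counts from the "Full target distribution" output, in D/L/W order.
class_counts = {"D": 1257, "L": 2048, "W": 2406}
total = sum(class_counts.values())  # 5711 matches

# Natural log of each class's share of the data.
init_scores = {
    label: math.log(count / total) for label, count in class_counts.items()
}

for label, score in init_scores.items():
    prior = class_counts[label] / total
    print(f"{label}: prior={prior:.4f}, log prior={score:.6f}")
```

The value computed for draws agrees with the logged score of -1.513666 to six decimal places.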

In [20]:
# Production model feature importance analysis
print("\n🔍 PRODUCTION MODEL - TOP 20 MOST IMPORTANT FEATURES:")
production_feature_imp = pd.DataFrame({
    'feature': features,
    'importance': production_model.feature_importance(importance_type='gain')
}).sort_values('importance', ascending=False)

for i, row in production_feature_imp.head(20).iterrows():
    print(f"  {row['feature']}: {row['importance']:.0f}")

# Save feature importance for production use
production_feature_imp.to_csv(models_dir / 'production_feature_importance_v4.csv', index=False)
print(f"\n💾 Feature importance saved to: {models_dir / 'production_feature_importance_v4.csv'}")

print(f"\n🎉 PRODUCTION MODEL IS READY!")
print(f"📁 All files saved in: {models_dir}")
print(f"🚀 Use 'lgb_production_model_v4.joblib' for real predictions!")


🔍 PRODUCTION MODEL - TOP 20 MOST IMPORTANT FEATURES:
  avg_wage_dollars_diff: 2696
  total_wage_bill_dollars_diff: 1494
  max_wage_dollars_diff: 1179
  venue: 825
  xg_diff: 791
  xg_for_opponent_form_avg: 647
  total_wage_bill_dollars: 625
  shots_ratio: 624
  defensive_actions: 611
  Saves_favor_form_sum: 603
  age_mean_diff: 583
  max_wage_dollars: 574
  Tackles_against_form_avg: 569
  Aerials Won_favor_form_avg: 545
  avg_wage_dollars: 539
  xg_for_opponent_form_sum: 524
  xg_against_form_avg: 514
  opp_max_wage_dollars: 510
  xg_for_form_avg: 508
  opp_avg_wage_dollars: 490

💾 Feature importance saved to: C:\Users\50230\OneDrive\Escritorio\Proyectos y trabajos\Personales\Pronósticos Football\models\premier_league\production_feature_importance_v4.csv

🎉 PRODUCTION MODEL IS READY!
📁 All files saved in: C:\Users\50230\OneDrive\Escritorio\Proyectos y trabajos\Personales\Pronósticos Football\models\premier_league
🚀 Use 'lgb_production_model_v4.joblib' for real predictions!
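
A minimal sketch of how the saved artifacts might be used at prediction time. It assumes the model's `predict` method returns one row of class probabilities per match, as a LightGBM `Booster` trained with a multiclass objective does. The helper `predict_outcomes` and the `StubModel` used in the demonstration are hypothetical and not part of the notebook. In real use the model, feature names, label encoder, and medians would be loaded with `joblib.load` from the files saved above, and `class_labels` would be `label_encoder.classes_`.

```python
import numpy as np
import pandas as pd


def predict_outcomes(model, feature_names, class_labels, medians, new_matches):
    """Return the predicted label and class probabilities for each new match."""
    # Align columns to the training order; features missing from the input become NaN.
    X = new_matches.reindex(columns=feature_names)
    # Impute with the medians saved at training time, mirroring the notebook.
    X = X.fillna(medians)
    proba = np.asarray(model.predict(X))
    result = pd.DataFrame(
        proba,
        columns=[f"p_{label}" for label in class_labels],
        index=new_matches.index,
    )
    result.insert(0, "predicted", np.asarray(class_labels)[proba.argmax(axis=1)])
    return result


class StubModel:
    """Hypothetical stand-in for the saved Booster, for demonstration only."""

    def predict(self, X):
        # Turn the single feature into three made-up class scores, then softmax.
        x = X.to_numpy()[:, 0]
        scores = np.column_stack([np.zeros_like(x), -x, x])
        exp = np.exp(scores - scores.max(axis=1, keepdims=True))
        return exp / exp.sum(axis=1, keepdims=True)


# Toy inputs: one feature, with one missing value to exercise the imputation.
feature_names = ["avg_wage_dollars_diff"]
class_labels = ["D", "L", "W"]  # same order as label_encoder.classes_
medians = pd.Series({"avg_wage_dollars_diff": 0.0})
new_matches = pd.DataFrame({"avg_wage_dollars_diff": [2.0, np.nan, -1.5]})

predictions = predict_outcomes(
    StubModel(), feature_names, class_labels, medians, new_matches
)
print(predictions)
```

Reindexing against the saved feature list keeps the column order identical to training, and reusing the saved medians avoids recomputing imputation values from whatever data happens to be available at prediction time.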
