# ML Model Training and Persistence - Prototyping Notebook

This notebook is part of **Story 2.1: ML Model Training and Persistence**.

Goals:
- Load football match data from SQLite.
- Preprocess features.
- Train a classifier on match outcomes.
- Save the trained model for inference.

Steps:
1. Setup and Data Loading
2. Data Exploration
3. Data Preprocessing
4. Model Training, Evaluation, and Persistence

## 1. Setup and Data Loading

In [4]:
import sqlite3
import pandas as pd
import os
from pandas.io.sql import DatabaseError

# Path to database
db_path = '../football.db'

if not os.path.exists(db_path):
    raise FileNotFoundError(f"Database '{db_path}' not found. Run db_setup.py first.")

conn = sqlite3.connect(db_path)

try:
    df = pd.read_sql_query("SELECT * FROM matches", conn)
    print(f"✅ Loaded {len(df)} rows from 'matches' table")
except DatabaseError as e:
    df = pd.DataFrame()
    print(f"❌ Error loading data: {e}")
finally:
    conn.close()

✅ Loaded 1508 rows from 'matches' table
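
The loading cell above reads every column of `matches`. As a minimal sketch of a narrower load, the hypothetical helper `load_matches` below (not part of the project) selects only named columns, checks that they exist, and parses `Date` into a datetime. It assumes the table is called `matches`, as in the query above, and uses an in-memory database with one demo row so it runs without `football.db`.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def load_matches(conn, columns):
    """Hypothetical helper: read selected columns from the 'matches' table."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    available = {row[1] for row in conn.execute("PRAGMA table_info(matches)")}
    missing = [c for c in columns if c not in available]
    if missing:
        raise KeyError(f"Columns not found in 'matches': {missing}")
    quoted = ", ".join(f'"{c}"' for c in columns)
    # parse_dates converts the text column to datetime64 on load.
    return pd.read_sql_query(f"SELECT {quoted} FROM matches", conn, parse_dates=["Date"])


# Demo on an in-memory database; closing() guarantees the connection is released.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute(
        'CREATE TABLE matches ("Date" TEXT, "HomeTeam" TEXT, "AwayTeam" TEXT, "FTR" TEXT)'
    )
    conn.execute("INSERT INTO matches VALUES ('2024-09-01', 'Gent', 'Antwerp', 'D')")
    demo = load_matches(conn, ["Date", "HomeTeam", "AwayTeam", "FTR"])

print(demo.dtypes)
```

Passing an open connection rather than a path keeps the helper easy to test; whether a narrower load is worthwhile depends on how many of the 93 columns later steps actually use.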


## 2. Data Exploration

In [5]:
if df.empty:
    print("❌ DataFrame is empty. Cannot proceed.")
else:
    display(df.head())
    print("\nDataFrame Info:")
    df.info()  # prints its own summary and returns None; wrapping it in display() only echoes "None"
    print("\nData Description:")
    display(df.describe(include='all'))

| | Div | Date | Time | HomeTeam | AwayTeam | FTHG | FTAG | FTR | HTHG | HTAG | ... | AvgC<2.5 | AHCh | B365CAHH | B365CAHA | PCAHH | PCAHA | MaxCAHH | MaxCAHA | AvgCAHH | AvgCAHA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B1 | 2024-09-01 | 18:15:00 | Kortrijk | St Truiden | 1 | 1 | D | 0.0 | 1.0 | ... | 1.84 | -0.25 | 1.75 | 2.05 | 1.79 | 2.11 | 1.83 | 2.12 | 1.78 | 2.05 |
| 1 | B1 | 2024-09-01 | 17:30:00 | St. Gilloise | Anderlecht | 0 | 0 | D | 0.0 | 0.0 | ... | 1.98 | -0.5 | 1.93 | 1.93 | 1.94 | 1.94 | 2.0 | 1.94 | 1.95 | 1.85 |
| 2 | B1 | 2024-09-01 | 15:00:00 | Gent | Antwerp | 1 | 1 | D | 1.0 | 1.0 | ... | 2.13 | 0.0 | 1.8 | 2.05 | 1.85 | 2.05 | 1.91 | 2.06 | 1.84 | 2.0 |
| 3 | B1 | 2024-09-01 | 12:30:00 | Club Brugge | Cercle Brugge | 3 | 0 | H | 2.0 | 0.0 | ... | 2.58 | -1.0 | 1.9 | 1.95 | 1.93 | 1.96 | 1.93 | 2.01 | 1.88 | 1.93 |
| 4 | B1 | 2024-08-31 | 19:45:00 | Oud-Heverlee Leuven | Standard | 2 | 0 | H | 1.0 | 0.0 | ... | 1.8 | -0.5 | 1.98 | 1.88 | 1.96 | 1.93 | 1.98 | 1.93 | 1.95 | 1.86 |



DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1508 entries, 0 to 1507
Data columns (total 93 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Div        1508 non-null   object 
 1   Date       1508 non-null   object 
 2   Time       1508 non-null   object 
 3   HomeTeam   1508 non-null   object 
 4   AwayTeam   1508 non-null   object 
 5   FTHG       1508 non-null   int64  
 6   FTAG       1508 non-null   int64  
 7   FTR        1508 non-null   object 
 8   HTHG       1504 non-null   float64
 9   HTAG       1504 non-null   float64
 10  HTR        1504 non-null   object 
 11  HS         1503 non-null   float64
 12  AS         1503 non-null   float64
 13  HST        1503 non-null   float64
 14  AST        1503 non-null   float64
 15  HF         1503 non-null   float64
 16  AF         1503 non-null   float64
 17  HC         1503 non-null   float64
 18  AC         1503 non-null   float64
 19  HY         1504 non-null   floa

None


Data Description:


| | Div | Date | Time | HomeTeam | AwayTeam | FTHG | FTAG | FTR | HTHG | HTAG | ... | AvgC<2.5 | AHCh | B365CAHH | B365CAHA | PCAHH | PCAHA | MaxCAHH | MaxCAHA | AvgCAHH | AvgCAHA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1508 | 1508 | 1508 | 1508 | 1508 | 1508.0 | 1508.0 | 1508 | 1504.0 | 1504.0 | ... | 1508.0 | 1508.0 | 1508.0 | 1508.0 | 1508.0 | 1508.0 | 1508.0 | 1508.0 | 1508.0 | 1508.0 |
| unique | 1 | 566 | 18 | 23 | 23 | | | 3 | | | ... | | | | | | | | | | |
| top | B1 | 2023-04-23 | 19:45:00 | Gent | St Truiden | | | H | | | ... | | | | | | | | | | |
| freq | 1508 | 8 | 314 | 89 | 89 | | | 658 | | | ... | | | | | | | | | | |
| mean | | | | | | 1.594828 | 1.323607 | | 0.706782 | 0.580452 | ... | 2.20882 | -0.253481 | 1.921187 | 1.927712 | 1.943607 | 1.950351 | 1.998137 | 2.004595 | 1.913833 | 1.92067 |
| std | | | | | | 1.31991 | 1.184415 | | 0.849128 | 0.753022 | ... | 0.326122 | 0.779494 | 0.092483 | 0.092205 | 0.091927 | 0.091882 | 0.094537 | 0.098014 | 0.084488 | 0.085021 |
| min | | | | | | 0.0 | 0.0 | | 0.0 | 0.0 | ... | 1.5 | -2.75 | 1.6 | 1.7 | 1.71 | 1.73 | 1.8 | 1.76 | 1.72 | 1.72 |
| 25% | | | | | | 1.0 | 0.0 | | 0.0 | 0.0 | ... | 1.98 | -0.75 | 1.85 | 1.85 | 1.88 | 1.88 | 1.92 | 1.93 | 1.85 | 1.85 |
| 50% | | | | | | 1.0 | 1.0 | | 1.0 | 0.0 | ... | 2.16 | -0.25 | 1.93 | 1.93 | 1.93 | 1.95 | 2.0 | 2.0 | 1.91 | 1.92 |
| 75% | | | | | | 2.0 | 2.0 | | 1.0 | 1.0 | ... | 2.3825 | 0.25 | 2.0 | 2.0 | 2.0125 | 2.02 | 2.07 | 2.07 | 1.98 | 1.99 |
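
The description output reports `H` as the most frequent `FTR` value, with 658 of 1508 rows (about 0.44). That share is the accuracy a constant "always home win" guess would reach on the full table, so it is a useful reference point for any classifier trained later. The sketch below shows one way to compute the class shares and the per-column missing counts; the `toy` frame and its values are made up for illustration and stand in for `df`.

```python
import pandas as pd

# Made-up stand-in for df, using the notebook's column names.
toy = pd.DataFrame({
    "FTR": ["H", "D", "A", "H"],
    "HS": [12.0, None, 9.0, 15.0],
    "AS": [8.0, 10.0, None, None],
})

# Share of each outcome; the largest share is the majority-class baseline.
class_share = toy["FTR"].value_counts(normalize=True)
majority_baseline = class_share.max()

# Columns that contain missing values, most affected first.
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)

print(class_share)
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
print(missing)
```

On the real data the same two calls would be applied to `df` in place of `toy`.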


## 3. Data Preprocessing

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import numpy as np

# Basic cleaning and feature definition
df = df.dropna(subset=['HomeTeam', 'AwayTeam', 'FTR'])  # keep rows where both teams and the target are present
X = df[['HomeTeam', 'AwayTeam', 'HS', 'AS']]             # feature selection
y = df['FTR']                                            # target: full-time result

# Column definitions
categorical_features = ['HomeTeam', 'AwayTeam']
numerical_features = ['HS', 'AS']

# Pipelines
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine the pipelines
preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, numerical_features),
    ('cat', categorical_pipeline, categorical_features)
])

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("✅ Data preprocessing pipeline with imputers is ready.")


✅ Data preprocessing pipeline with imputers is ready.
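
The split above is purely random, so the proportions of `H`, `D`, and `A` can differ between the training and test sets. One common option, shown here only as a sketch, is to pass `stratify` to `train_test_split` so both sets keep the overall class mix. The `toy_X` and `toy_y` objects are made-up stand-ins for `X` and `y`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 10 home wins, 5 draws, 5 away wins.
toy_X = pd.DataFrame({"HS": range(20), "AS": range(20, 40)})
toy_y = pd.Series(["H"] * 10 + ["D"] * 5 + ["A"] * 5, name="FTR")

# stratify=toy_y keeps the class proportions equal in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(y_tr.value_counts(normalize=True).sort_index())
print(y_te.value_counts(normalize=True).sort_index())
```

Whether stratification is the right choice here is a judgment call; for match data, a split by date may be preferable if the model is meant to predict future fixtures.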


## 4. Model Training, Evaluation, and Persistence

In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import joblib

# Full pipeline: preprocessing + model
model_pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Training
model_pipeline.fit(X_train, y_train)

# Evaluation
y_pred = model_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"🎯 Model Accuracy: {accuracy:.2f}")

# Persistence
os.makedirs('models', exist_ok=True)
model_path = 'models/prediction_model.pkl'
joblib.dump(model_pipeline, model_path)
print(f"💾 Model saved to {model_path}")

🎯 Model Accuracy: 0.44
💾 Model saved to models/prediction_model.pkl
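
Because the saved object is the full pipeline (preprocessing plus classifier), inference code can load it and pass raw rows with the same four columns. The sketch below rebuilds a small pipeline of the same shape on made-up data, saves it to a temporary directory, reloads it with `joblib.load`, and predicts on new rows. The team name `Unseen FC`, the toy values, and the reduced `n_estimators` are assumptions for illustration only.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows using the notebook's feature columns.
train = pd.DataFrame({
    "HomeTeam": ["Gent", "Antwerp", "Genk", "Gent", "Genk", "Antwerp"],
    "AwayTeam": ["Antwerp", "Genk", "Gent", "Genk", "Antwerp", "Gent"],
    "HS": [14.0, 9.0, 11.0, 16.0, 8.0, 10.0],
    "AS": [7.0, 12.0, 10.0, 6.0, 13.0, 11.0],
})
labels = ["H", "A", "D", "H", "A", "D"]

numeric = Pipeline([("imputer", SimpleImputer(strategy="mean")),
                    ("scaler", StandardScaler())])
categorical = Pipeline([("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
                        ("onehot", OneHotEncoder(handle_unknown="ignore"))])
preprocessor = ColumnTransformer([
    ("num", numeric, ["HS", "AS"]),
    ("cat", categorical, ["HomeTeam", "AwayTeam"]),
])
pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=20, random_state=42)),
])
pipeline.fit(train, labels)

# Round-trip through disk, as the persistence step does.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "prediction_model.pkl")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# New rows: one unseen team and one missing shot count.
new_matches = pd.DataFrame({
    "HomeTeam": ["Gent", "Unseen FC"],
    "AwayTeam": ["Genk", "Antwerp"],
    "HS": [13.0, None],
    "AS": [9.0, 10.0],
})
predictions = restored.predict(new_matches)
probabilities = restored.predict_proba(new_matches)
print(predictions)
print(restored.classes_, probabilities.round(2))
```

The unseen team is encoded as all zeros thanks to `handle_unknown='ignore'`, and the missing `HS` value is filled by the mean imputer, so neither case raises an error. A pickle created with one scikit-learn version is not guaranteed to load under another, so pinning the version used for training is a reasonable precaution.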
