# Probabilistic Machine Learning - Project Report

**Course:** Probabilistic Machine Learning (SoSe 2025)

**Lecturer:** Alvaro Diaz-Ruelas

**Student(s) Name(s):**  Niklas Nesseler, Lauren Pommer

**GitHub Username(s):**  @NiklasNesseler

**Date:**  20th of July 2025

**PROJECT-ID:** 21-1NNPLXX_EPL_match_results

---


## 1. Introduction


### Brief description of the dataset and problem
The dataset contains detailed match-level data from the English Premier League spanning from September 2020 to May 2024, comprising 4,788 entries and 28 features. Key information includes match context (teams, venue, round), results (goals for/against, win/draw/loss), and performance metrics (xG, possession, shots, etc.).
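The task is to use this dataset to predict match outcomes. In-game statistics such as goals, xG, and possession are only known once a match has been played, so the models in this report rely on features computed from earlier matches.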

### Motivation for our project
The Premier League is one of the most watched and competitive football leagues globally. Predicting match results has broad applications in sports analytics, betting, and fan engagement. It also serves as a solid real-world use case for machine learning classification.

### Hypothesis or research question
Can shallow learning models such as SVMs and XGBoost accurately predict the match outcome (win/draw/loss) using features like xG, possession, venue, and team-related context? Will XGBoost outperform SVC?


## 2. Data Loading and Exploration

The dataset was loaded using pandas.
Initial exploration showed 4,788 entries with 28 features. Basic statistics and structure checks were conducted (`df.info()`, `df.describe()`). The dataset includes match-level information such as team names, match date, xG, possession, goals, and outcome. No critical features were missing; minor data cleaning steps were deferred to preprocessing.
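
The loading and inspection steps can be sketched as follows. This is a minimal illustration, not the project's notebook code: the in-memory CSV, the column names, and the team labels are placeholders, and the real file would be read from disk instead.

```python
import io
import pandas as pd

# Tiny stand-in for the Kaggle CSV (columns and values are illustrative).
# With the real file this would be pd.read_csv("<path to csv>", parse_dates=["date"]).
csv_text = """date,team,opponent,venue,result,gf,ga,xg,poss
2023-08-12,Team A,Team B,Home,W,2,1,1.8,61
2023-08-12,Team B,Team A,Away,L,1,2,0.9,39
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

df.info()                    # dtypes and non-null counts per column
print(df.describe())         # summary statistics for numeric columns

missing = df.isna().sum()    # number of missing values per column
print(missing[missing > 0])  # empty if nothing is missing
```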



## 3. Data Preprocessing

### Timeframe Filtering
* The dataset was filtered to include only matches from August 5, 2022 to May 19, 2024.

* This reduced the number of entries from 4,788 to 1,520.

* Reason: Focus on recent seasons (2022/23 and 2023/24) to ensure relevance and reflect current playing styles, tactics, and team dynamics.

* Trade-off: Fewer samples, but higher quality and temporal consistency.
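
A date filter of this kind could look like the sketch below. The `date` column name and the toy rows are assumptions; only the two boundary dates come from the text above.

```python
import pandas as pd

# Toy frame standing in for the full dataset (values are illustrative)
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-05-23", "2022-08-05", "2023-03-11", "2024-05-19"]),
    "team": ["A", "B", "C", "D"],
})

start, end = pd.Timestamp("2022-08-05"), pd.Timestamp("2024-05-19")

# Series.between is inclusive on both ends by default
recent = df[df["date"].between(start, end)].reset_index(drop=True)
print(len(df), "->", len(recent))
```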


### Match Outcome Encoding
* Original results (W/D/L) were encoded as:

* Win = +1, Draw = 0, Loss = −1

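* Purpose: Provides a numeric target for classification models (e.g., SVM, XGBoost). In the team-centric data, wins and losses are balanced by construction (591 each), because every decided match appears once from each team's perspective; draws make up the remaining rows and are the minority class.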
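
A minimal sketch of this encoding, assuming the outcome is stored as the strings `W`, `D`, and `L` in a column called `result` (the column name is an assumption):

```python
import pandas as pd

results = pd.Series(["W", "D", "L", "W", "L"], name="result")

outcome_map = {"W": 1, "D": 0, "L": -1}
encoded = results.map(outcome_map)

# map() yields NaN for unexpected labels, so check that nothing was missed
assert encoded.notna().all()
print(encoded.value_counts().sort_index())
```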

### De-duplication & Match-Based Restructuring
* The dataset is team-centric: each match appears twice, once from each team's perspective.

* Transformed into a match-based structure, sorted by date.

* Duplicates were removed to ensure clean, non-redundant records.
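
One way to perform this restructuring is to keep only the home team's row of each match. The sketch below assumes columns named `team`, `opponent`, and `venue`; the project's actual implementation may differ.

```python
import pandas as pd

# Two matches, each listed once per team (toy data)
team_rows = pd.DataFrame({
    "date": pd.to_datetime(["2023-08-12", "2023-08-12", "2023-08-05", "2023-08-05"]),
    "team": ["A", "B", "C", "A"],
    "opponent": ["B", "A", "A", "C"],
    "venue": ["Home", "Away", "Home", "Away"],
    "result": [1, -1, 0, 0],   # encoded from the row's own team perspective
})

matches = (
    team_rows[team_rows["venue"] == "Home"]       # one row per match
    .rename(columns={"team": "home_team", "opponent": "away_team"})
    .drop(columns="venue")
    .drop_duplicates(subset=["date", "home_team", "away_team"])
    .sort_values("date")
    .reset_index(drop=True)
)
print(matches)
```

After this step `result` is expressed from the home team's point of view, so +1 means home win and -1 means away win.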

## 4. Feature Engineering

### Engineered Features
We developed several key features to capture different dimensions of team performance and match context:

#### **Recent Performance Indicators**
**Recent Form (Last 5 Matches)**
* Type: Rolling average points metric
* Idea: Captures short-term team momentum using a standardized scoring system (Win=3, Draw=1, Loss=0)
* Feature: form_last_5_avg
* Why it matters: Recent form often better predicts immediate outcomes than season-long statistics, reflecting current squad fitness, confidence, and tactical adjustments.
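
A rolling form feature along these lines can be sketched as below. The column names and the use of `shift(1)` to exclude the current match are assumptions about the implementation; the scoring system follows the description above.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2023-08-01", periods=7, freq="7D"),
    "team": ["A"] * 7,
    "result": [1, 1, 0, -1, 1, 1, 0],   # +1 win, 0 draw, -1 loss
})

df = df.sort_values(["team", "date"])
df["points"] = df["result"].map({1: 3, 0: 1, -1: 0})

# shift(1) drops the current match, so only earlier matches feed the feature
df["form_last_5_avg"] = df.groupby("team")["points"].transform(
    lambda s: s.shift(1).rolling(window=5, min_periods=1).mean()
)
print(df[["date", "points", "form_last_5_avg"]])
```

The first match of each team has no history and therefore gets a missing value, which has to be filled or dropped before training.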

#### **Venue-Specific Strength Metrics**
**Home vs Away Performance**
* Type: Separate strength calculations by venue
* Idea: Teams often perform differently at home versus away due to crowd support, travel fatigue, and familiarity with conditions
* Features: home_strength and away_strength
* Calculation: Average points per game segmented by venue from historical matches

#### **Advanced Attacking Metrics**
**Goals and Expected Goals (Last 10 Matches)**
* Type: Rolling averages of actual and expected performance
* Idea: Combines recent scoring output with underlying shot quality to assess true attacking strength
* Features: avg_goals_last_10 and avg_xg_last_10
* Why it works: The gap between actual goals and expected goals reveals whether a team is over/under-performing their chances, indicating sustainability of current form.

#### **Defensive Stability Indicators**
**Goals Conceded (Last 10 Matches)**
* Type: Rolling defensive performance metric
* Idea: Recent defensive record as a predictor of future vulnerability
* Feature: avg_goals_conceded_last_10
* Application: Quantifies defensive consistency and identifies teams in poor defensive form.

#### **Tactical Style Indicators**
**Possession Control (Last 10 Matches)**
* Type: Rolling average of ball possession percentage
* Idea: Reflects team's tactical approach and ability to control game tempo
* Feature: avg_possession_last_10
* Data handling: Missing values filled with neutral 50% when fewer than 10 prior matches available
* Insight: High possession often correlates with territorial dominance and scoring opportunities.
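
The 10-match rolling average with the neutral 50% fallback could be computed as in this sketch. The `poss` column name and the toy values are assumptions; the same pattern would apply to the 10-match goal, xG, and goals-conceded averages.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["A"] * 12,
    "date": pd.date_range("2023-08-01", periods=12, freq="7D"),
    "poss": [60, 55, 58, 62, 50, 48, 65, 57, 53, 52, 70, 40],
})
df = df.sort_values(["team", "date"])

# Require a full window of 10 earlier matches; otherwise the value stays NaN
rolled = df.groupby("team")["poss"].transform(
    lambda s: s.shift(1).rolling(window=10).mean()
)

# Neutral fallback when fewer than 10 prior matches are available
df["avg_possession_last_10"] = rolled.fillna(50.0)
print(df[["date", "poss", "avg_possession_last_10"]])
```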

#### **Historical Context Features**
**Head-to-Head Performance**
* Type: Normalized historical matchup score
* Idea: Captures psychological advantages, tactical familiarity, and stylistic matchups between specific teams
* Application: Accounts for rivalry dynamics and historical dominance patterns not reflected in current form alone.

#### **Officiating Context**
**Referee Bias Indicators**
* Type: Statistical analysis of referee-team win rate patterns
* Idea: Identifies referees with statistically significant deviations in team-specific outcomes
* Trade-Off: Correlation may reflect that certain referees officiate more high-profile matches involving stronger teams rather than actual bias
* Usage: Included for completeness but treated as contextual rather than predictive.

### Data Leakage Prevention
**Removed Future Information**
* Challenge: Ensuring model only uses information available at prediction time
* Dropped features: `home_goals`, `away_goals`, `home_xg`, `away_xg`, `home_poss`
* These values are only known after the match has been played. Removing them prevents target leakage, which would inflate evaluation scores and make the model unusable for real pre-match prediction.

## 5. Probabilistic Modeling Approach

### Chosen Models

We implemented and evaluated two machine learning models to predict Premier League match outcomes:

#### 1. **Support Vector Machines (SVM)**

* **Type**: Linear / non-linear classifier
* **Core Idea**: Finds the *optimal hyperplane* that best separates classes in the feature space.
* **Why it fits**:

  * Match outcomes form three **discrete classes** (Win, Draw, Loss), which suits a margin-based classifier.
  * Performs well on **small to medium datasets** (\~760 samples).
  * Effective with **engineered, high-dimensional features**.
  * With a linear kernel, provides **interpretable decision boundaries** and coefficient-based insights into feature importance; non-linear kernels such as RBF are harder to interpret.

#### 2. **XGBoost (Extreme Gradient Boosting)**

* **Type**: Ensemble of boosted decision trees
* **Core Idea**: Builds trees sequentially, each correcting the previous one's errors.
* **Why it fits**:

  * Captures **complex, non-linear relationships**.
  * Handles **numerical features** without scaling and needs only light encoding of **categorical features**.
  * Learns **interactions between features** automatically.
  * Often achieves **state-of-the-art performance** on structured/tabular data.

### Feature Importance Analysis

To interpret model behavior and guide feature selection, we used three methods:

* **Random Forest Feature Importance** – based on impurity reduction.
* **Mutual Information** – measures non-linear dependency between input and target.
* **Pearson Correlation** – measures linear association between feature and target.

All three were normalized and compared across four visualizations.

#### Top-Ranked Features (Consensus)

Based on average rank across all three methods:

1. home_avg_goals_10
2. home_avg_xg_10
3. away_avg_xg_10
4. home_team_form
5. away_avg_goals_10
6. away_team_form
7. home_avg_conceded_10
8. h2h_stat
9. away_avg_conceded_10
10. home_avg_possession_10

These features were included in both models.
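
The consensus ranking can be illustrated with the sketch below. It runs on synthetic data with made-up feature names, and the choice to rank by absolute Pearson correlation and to average the ranks with equal weight is an assumption about how the three methods were combined.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "home_team_form": rng.normal(size=n),
    "away_team_form": rng.normal(size=n),
    "noise": rng.normal(size=n),          # deliberately uninformative
})
# Synthetic target in {-1, 0, 1} driven by the form difference
signal = X["home_team_form"] - X["away_team_form"] + rng.normal(scale=0.5, size=n)
y = np.select([signal > 0.5, signal < -0.5], [1, -1], default=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
scores = pd.DataFrame({
    "rf_importance": rf.feature_importances_,
    "mutual_info": mutual_info_classif(X, y, random_state=0),
    "abs_pearson": X.corrwith(pd.Series(y, index=X.index)).abs(),
}, index=X.columns)

normalized = scores / scores.max()       # each method scaled to [0, 1] for plotting
ranks = scores.rank(ascending=False)     # rank 1 = most important per method
consensus = ranks.mean(axis=1).sort_values()
print(consensus)
```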




# 6. Model Training and Evaluation

## Training Process

Both models were trained using the same 80/20 train-test split, resulting in 608 training samples and 152 test samples. The training process involved several key steps:

### Support Vector Machine (SVM):
* **Feature scaling**: StandardScaler was applied to standardize all features (zero mean, unit variance), which is crucial for SVM performance
* **Baseline model**: Initial RBF kernel SVM with default parameters achieved 69.08% accuracy
* **Hyperparameter optimization**: GridSearchCV with 5-fold cross-validation tested 72 parameter combinations across:
  * C values: [0.1, 1, 10, 100]
  * Gamma values: ['scale', 'auto', 0.001, 0.01, 0.1, 1]
  * Kernels: ['rbf', 'poly', 'sigmoid']
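
The SVM search described above can be sketched as follows. The data is synthetic, and wrapping the scaler in a `Pipeline` (so it is refit inside every cross-validation fold) is a suggested pattern rather than a confirmed detail of the project code. The `max_iter` cap is only a safeguard to keep this sketch fast.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, ParameterGrid, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": ["scale", "auto", 0.001, 0.01, 0.1, 1],
    "svc__kernel": ["rbf", "poly", "sigmoid"],
}
print(len(ParameterGrid(param_grid)))  # 4 * 6 * 3 = 72

# Synthetic stand-in for the engineered features and the 3-class target
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = np.select([X[:, 0] > 0.4, X[:, 0] < -0.4], [1, -1], default=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("svc", SVC(max_iter=20000)),      # iteration cap: sketch-only safeguard
])
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_, round(search.best_score_, 3))
print("test accuracy:", round(search.score(X_test, y_test), 3))
```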

### XGBoost:
* **No feature scaling required**: Tree-based splits depend only on the ordering of feature values, so XGBoost needs no standardization step
* **Label remapping**: Target labels were converted from {-1, 0, 1} to {0, 1, 2} for XGBoost compatibility
* **Baseline model**: Default XGBoost classifier achieved 66.45% accuracy
* **Hyperparameter optimization**: GridSearchCV with 5-fold cross-validation tested 16 parameter combinations across:
  * Number of estimators: [100, 200]
  * Maximum depth: [4, 6]
  * Learning rate: [0.1, 0.2]
  * Subsample ratio: [0.8, 1.0]
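
The label remapping and the size of the search grid can be checked with the sketch below. The helper `make_xgb_search` is hypothetical; it imports `xgboost` lazily so the rest of the snippet runs without that package, and its constructor arguments are assumptions rather than the project's settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, ParameterGrid

y = np.array([-1, 0, 1, 1, -1, 0])
y_xgb = y + 1              # {-1, 0, 1} -> {0, 1, 2}
y_back = y_xgb - 1         # map predictions back to the original coding
assert set(y_xgb.tolist()) == {0, 1, 2}

param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [4, 6],
    "learning_rate": [0.1, 0.2],
    "subsample": [0.8, 1.0],
}
print(len(ParameterGrid(param_grid)))  # 2 * 2 * 2 * 2 = 16


def make_xgb_search(cv=5):
    """Hypothetical helper: build the grid search over an XGBoost classifier."""
    from xgboost import XGBClassifier  # lazy import, needs xgboost installed
    model = XGBClassifier(objective="multi:softprob", random_state=42)
    return GridSearchCV(model, param_grid, cv=cv, scoring="accuracy")
```

Usage would be `make_xgb_search().fit(X_train, y_train_xgb)` with the remapped labels.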

## Model Evaluation

### Performance Metrics

**SVM Results:**
* **Best parameters**: C=1, gamma=0.01, kernel='rbf'
* **Test accuracy**: 66.45%
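* **Cross-validation score**: 64.96% ± 4.07% (this spread is twice the ± 2.04% listed in the cross-validation section, so it appears to be mean ± two standard deviations)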
* **Class-specific performance**:
  * Away Win: 63% precision, 78% recall
  * Draw: 0% precision, 0% recall (the model never predicted a draw)
  * Home Win: 68% precision, 90% recall

**XGBoost Results:**
* **Best parameters**: learning_rate=0.1, max_depth=4, n_estimators=100, subsample=1.0
* **Test accuracy**: 67.11%
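* **Cross-validation score**: 61.68% ± 3.79% (this spread is twice the ± 1.90% listed in the cross-validation section, so it appears to be mean ± two standard deviations)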
* **Class-specific performance**:
  * Away Win: 65% precision, 80% recall
  * Draw: 45% precision, 15% recall (poor but not zero)
  * Home Win: 71% precision, 83% recall
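
Per-class precision and recall of this kind can be computed as in the sketch below, which uses toy labels in which the classifier never predicts a draw. With `zero_division=0`, scikit-learn reports the undefined draw precision as 0, which is how a "0% precision, 0% recall" row can arise.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

labels = [-1, 0, 1]  # away win, draw, home win
y_true = np.array([1, 1, 1, 0, 0, -1, -1, 1, 0, -1])
y_pred = np.array([1, 1, 1, 1, -1, -1, -1, 1, 1, 1])  # no draw predictions

precision, recall, _, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred, labels=labels))
for name, p, r, s in zip(["Away win", "Draw", "Home win"], precision, recall, support):
    print(f"{name}: precision={p:.2f} recall={r:.2f} support={s}")
```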

### Key Performance Insights

1. **Draw prediction challenge**: Both models struggle significantly with predicting draws. This is common in football prediction: draws are the least frequent outcome in this data and are hard to separate from narrow wins and losses using pre-match features.

2. **Home advantage bias**: Both models show strong performance for home wins, reflecting the statistical home advantage in football.

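3. **Model comparison**: XGBoost achieved slightly higher test accuracy (67.11% vs 66.45%), a gap that corresponds to a single match on the 152-match test set (102 vs 101 correct). SVM achieved the higher mean cross-validation score (64.96% vs 61.68%). The tuned SVM's test accuracy is also below its untuned baseline (69.08%), so differences of this size are likely within test-set noise.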
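
To put the reported test accuracies into perspective, the sketch below converts them into counts of correctly predicted matches and a binomial standard error. The counts (105, 101, 102) are inferred from the reported percentages and the 152-match test set; they are not stated in the source.

```python
import math

n_test = 152
implied_correct = {          # inferred from the reported percentages
    "SVM baseline": 105,
    "SVM tuned": 101,
    "XGBoost tuned": 102,
}

summary = {}
for name, correct in implied_correct.items():
    acc = correct / n_test
    se = math.sqrt(acc * (1 - acc) / n_test)   # binomial standard error
    summary[name] = (round(100 * acc, 2), round(100 * se, 1))
    print(f"{name}: {100 * acc:.2f}% accuracy, standard error about {100 * se:.1f} pp")
```

The standard error of a single accuracy estimate on 152 matches is several percentage points, far larger than the 0.66-point gap between the two tuned models.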

### Feature Importance Analysis

**SVM Feature Importance** (estimated with a linear-kernel approximation, since the tuned RBF model does not expose per-feature coefficients):
1. **home_team_form** (30.44%): Dominant feature indicating recent home team performance
2. **away_team_form** (16.32%): Second most important, showing away team recent form
3. **referee_encoded** (6.82%): Referee influence on match outcomes
4. **home_avg_goals_10** (6.28%): Home team's recent scoring ability
5. **home_avg_possession_10** (5.61%): Home team's ball control metrics

**XGBoost Feature Importance**:
1. **home_team_form** (21.40%): Consistently the most important feature
2. **away_team_form** (12.50%): Second most important across both models
3. **home_avg_goals_10** (6.01%): Recent goal-scoring form
4. **away_avg_goals_10** (5.87%): Away team scoring metrics
5. **home_avg_conceded_10** (5.72%): Defensive performance metrics

### Cross-Validation and Uncertainty Quantification

**Cross-Validation Results:**
* **SVM**: 5-fold CV scores ranged from 61.98% to 68.03%, with mean 64.96% ± 2.04%
* **XGBoost**: 5-fold CV scores ranged from 59.50% to 64.46%, with mean 61.68% ± 1.90%
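
The sketch below shows how a 5-fold mean and spread can be computed, on synthetic data. It prints both one and two standard deviations, because the ± values reported in this document differ by a factor of two between sections; reading them as 1σ versus 2σ is an assumption. The shuffled `StratifiedKFold` is likewise an illustrative choice.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the match features
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = np.select([X[:, 0] > 0.4, X[:, 0] < -0.4], [1, -1], default=0)

model = make_pipeline(StandardScaler(), SVC(C=1, gamma=0.01, kernel="rbf"))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

mean, std = scores.mean(), scores.std()
print("fold scores:", np.round(scores, 4))
print(f"mean +/- 1 std: {mean:.4f} +/- {std:.4f}")
print(f"mean +/- 2 std: {mean:.4f} +/- {2 * std:.4f}")
```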

**Confidence Analysis:**
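* **SVM**: Mean decision-function score of 2.21 for the predicted class. These scores are not probabilities, so the observation that all predictions exceed 0.8 is not comparable with the XGBoost probabilities below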
* **XGBoost**: Mean confidence of 74.26%, with 73 high-confidence predictions (>0.8) and 23 low-confidence predictions (<0.5)
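
The difference between decision scores and probabilities can be seen in the sketch below, which uses synthetic data. Enabling `probability=True` (Platt scaling) is one way to obtain comparable confidence values from an SVM; it is shown as an option, not as something the project did.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(240, 4))
y = np.select([X[:, 0] > 0.4, X[:, 0] < -0.4], [2, 0], default=1)

svm = SVC(kernel="rbf", probability=True, random_state=0).fit(X[:180], y[:180])

scores = svm.decision_function(X[180:])  # per-class scores, not probabilities
proba = svm.predict_proba(X[180:])       # calibrated, each row sums to 1

confidence = proba.max(axis=1)           # probability of the predicted class
print("largest decision score:", round(float(scores.max()), 2))
print("mean confidence:", round(float(confidence.mean()), 3))
print("high (>0.8):", int((confidence > 0.8).sum()),
      "low (<0.5):", int((confidence < 0.5).sum()))
```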

**Temporal Stability**: Monthly accuracy analysis revealed significant variation over time, with accuracy ranging from 41.7% to 100% depending on the month. Each monthly figure rests on a small number of matches, so part of this variation is likely sampling noise; it may also indicate seasonal effects or data drift that should be monitored in production deployment.
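
A monthly breakdown of this kind can be sketched as below, using toy predictions. Reporting the number of matches next to each accuracy value makes it easier to judge how much weight a single month deserves; the column names are assumptions.

```python
import pandas as pd

preds = pd.DataFrame({
    "date": pd.to_datetime(["2023-08-12", "2023-08-19", "2023-08-26",
                            "2023-09-02", "2023-09-16"]),
    "y_true": [1, 0, -1, 1, 1],
    "y_pred": [1, 1, -1, 1, -1],
})
preds["correct"] = preds["y_true"] == preds["y_pred"]

# Accuracy and sample size per calendar month
monthly = (
    preds.groupby(preds["date"].dt.to_period("M"))["correct"]
         .agg(accuracy="mean", n_matches="size")
)
print(monthly)
```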

# 7. Results

## Key Findings

Both machine learning models achieved comparable performance on football match outcome prediction, with test accuracies around 66-67%. However, significant class imbalance challenges emerged, particularly in predicting draw outcomes.

### Overall Performance:
* **XGBoost**: 67.11% test accuracy (best performing)
* **SVM**: 66.45% test accuracy
* Both models clearly outperformed random guessing (33.3% for a 3-class problem); a majority-class baseline that always predicts a home win would be a stricter reference point

### Critical Insight: Draw Prediction Challenge
The most striking finding was the models' inability to reliably predict draws. SVM completely failed to predict any draws (0% precision/recall), while XGBoost managed only 45% precision and 15% recall for draw outcomes. This reflects the inherent unpredictability of drawn matches in football.

**Feature Importance Consistency:**
Both models consistently identified **team form** as the dominant predictive factor, with home and away team form accounting for 46.76% (SVM) and 33.90% (XGBoost) of total feature importance respectively.

## Model Comparison

| Metric | SVM | XGBoost | Winner |
|--------|-----|---------|--------|
| **Test Accuracy** | 66.45% | 67.11% | XGBoost |
| **Cross-Validation** | 64.96% ± 2.04% | 61.68% ± 1.90% | SVM |
| **Draw Prediction** | Complete failure | Poor but functional | XGBoost |
| **Confidence Estimates** | Limited interpretability | Probabilistic outputs | XGBoost |
| **Preprocessing** | Requires feature scaling | No scaling needed | XGBoost |

**Recommendation**: XGBoost emerges as the preferred model due to its marginally better test performance, its ability to predict draws (albeit poorly), and its native probability estimates, which make prediction confidence easier to interpret. However, the performance gap is minimal, and SVM's higher mean cross-validation score suggests both models are viable options for this prediction task.

The results highlight that football match prediction remains a challenging domain where even sophisticated machine learning approaches struggle with the sport's inherent unpredictability, particularly for draw outcomes.


# 8. Discussion

## Interpretation of Results

The roughly 67% accuracy achieved represents a meaningful improvement over random chance, but highlights the fundamental challenge of football prediction. The models' strong performance on home wins (83-90% recall) is consistent with the well-documented home advantage effect, while the weak draw prediction (0% recall for SVM, 15% for XGBoost) reflects football's inherent randomness, where evenly matched teams often produce unpredictable outcomes.

The dominance of **team form** features (34-47% of model importance when home and away form are combined) supports the intuition that recent performance is the strongest predictor of future results. This is in line with the common emphasis in football analytics on momentum and current squad fitness over long-run historical statistics.

## Limitations of the Approach

### Data Limitations:
* Missing crucial contextual factors: player injuries, transfers, motivation levels, and weather conditions
* Limited historical depth may not capture long-term tactical evolution
* No real-time data integration (lineups, pre-match news)

### Methodological Constraints:
* Class imbalance severely impacts draw prediction capability
* Static feature engineering may miss complex temporal dependencies
* Monthly accuracy analysis shows temporal instability, suggesting potential data drift

### Scope Restrictions:
* Single league focus limits generalizability across different football cultures
* Binary encoding of categorical variables loses nuanced team relationships

## Possible Improvements and Extensions

### Enhanced Feature Engineering:
* Incorporate player-level statistics and injury reports
* Refine the existing head-to-head feature (`h2h_stat`), for example with a longer matchup history
* Include market odds as baseline probability estimates

### Advanced Modeling Approaches:
* Ensemble methods combining multiple algorithms
* Deep learning with LSTM networks for temporal sequence modeling
* Ordinal regression treating outcomes as ordered categories rather than independent classes

### Real-World Deployment:
* Live data integration for dynamic prediction updates
* Confidence-based betting strategies focusing on high-certainty predictions
* Multi-league expansion with transfer learning techniques


## 9. Summary

This project aimed to predict Premier League match outcomes using machine learning techniques on a comprehensive dataset spanning from 2022 to 2024. The study compared Support Vector Machines and XGBoost models to classify matches into three categories: home win, draw, and away win.

### **Main Outcomes**

**Model Performance**: Both models achieved comparable results, with XGBoost (67.11% test accuracy) slightly ahead of SVM (66.45%). Both are well above the 33.3% expected from random guessing among three classes, indicating that machine learning can extract predictive signal from football data.

**Feature Engineering Success**: The comprehensive feature engineering approach proved effective, with engineered features like recent team form, rolling performance metrics, and venue-specific strengths becoming the most important predictors. Team form features dominated both models, accounting for roughly 34-47% of total feature importance (home and away form combined), which supports the intuition that recent performance is crucial for prediction.

**Critical Challenge - Draw Prediction**: The most significant limitation was the models' inability to reliably predict draws. SVM completely failed to predict any draws, while XGBoost managed only modest performance. This reflects the inherent unpredictability of evenly matched contests in football.

**Model Comparison**: XGBoost emerged as the preferred choice due to its marginally better accuracy, functional draw prediction capability, and probabilistic outputs that make confidence easier to interpret. However, SVM's higher mean cross-validation score suggests both models are viable options.

**Key Insights**: The analysis confirmed several football analytics principles, including the strong home advantage effect and the dominance of recent form over historical statistics. The temporal instability observed in monthly accuracy variations highlighted potential seasonal effects and data drift concerns for production deployment.

**Limitations and Future Directions**: The study identified several areas for improvement, including the need for player-level data, injury reports, real-time information integration, and advanced modeling approaches like ensemble methods or deep learning architectures. The class imbalance problem and missing contextual factors remain significant challenges.

Overall, the project demonstrated that shallow machine learning techniques can provide meaningful insights into football match prediction, reaching accuracy levels that could be valuable for sports analytics applications. It also highlighted the sport's fundamental unpredictability, which continues to challenge even sophisticated analytical approaches.

## 10. References

* Kaggle dataset `premier-league-matches` (uploaded by user `mhmdkardosha`): https://www.kaggle.com/datasets/mhmdkardosha/premier-league-matches