<a href="https://colab.research.google.com/github/hannesstuehrenberg/Probabilistic-Machine-Learning_lecture-PROJECTS/blob/08-1SHXXXX_football_analytics/projects/08-1SHXXXX_football_analytics/report/08-1SHXXXX_football_analytics_Probabilistic_Machine_Learning_Project_Report.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Probabilistic Machine Learning - Project Report - Football Analytics

**Course:** Probabilistic Machine Learning (SoSe 2025)  
**Lecturer:** Alvaro Dias-Ruelas  
**Student(s) Name(s):** Hannes Stührenberg  
**GitHub Username(s):** hannesstuehrenberg  
**Date:** 15.08.2025  
**PROJECT-ID:** 08-1SHXXXX



## 1. Introduction


Hudl StatsBomb claims to offer the most comprehensive and flexible football data. The company makes some of its data freely available to equip the next generation of analysts with the tools, training and resources needed to succeed in the industry. The available data range from historical matches featuring [legends of the game](https://www.hudl.com/blog/statsbomb-icons-pele) through [outstanding seasons](https://blogarchive.statsbomb.com/news/free-statsbomb-data-bayer-leverkusens-invincible-bundesliga-title-win/) and [tournaments](https://blogarchive.statsbomb.com/news/statsbomb-release-free-2023-womens-world-cup-data/) to [all league matches from Lionel Messi's career](https://www.hudl.com/blog/statsbomb-release-free-lionel-messi-data-psg-and-inter-miami). In total, event data are available for over 3,000 matches from more than 20 competitions. They are published in the [StatsBomb GitHub repository](https://github.com/statsbomb/open-data). [Libraries for processing the data](https://mplsoccer.readthedocs.io/en/latest/) in Python are also available.

In this project I want to use these data to build an expected goals (xG) model. My motivation stems from two perspectives.

First, like so many, I am a huge football fan. When I watch professional football, the expected goals metric is shown during every broadcast. I want to deepen my knowledge of the metric to understand how it is calculated, what influences it and how to interpret it beyond the surface level.

Second, in my free time I am an ambitious youth team coach. Being able to quantify the quality of scoring chances can offer valuable insights into my team's performance. It allows me to go beyond just goals and results, and instead evaluate whether we are consistently creating good opportunities and where we can improve. With an xG model, I can provide more objective feedback to players, helping them understand the impact of their decisions in the final third and supporting their development with data-driven guidance.

In recent years, football analytics has become an increasingly data-driven field, with advanced metrics such as expected goals (xG) now a regular feature in match broadcasts, scouting reports, and tactical analysis. The Hudl StatsBomb open data are among the most widely used freely accessible football datasets. They contain every recorded on-ball event, from shots and passes to pressures and duels, making them a rich resource for building and evaluating predictive models.

This project investigates whether football shot outcomes can be more effectively modeled by explicitly incorporating uncertainty into xG predictions. Using the StatsBomb event data, all recorded shots are processed to construct feature representations that capture spatial, contextual, and tactical aspects of chance creation. These features are then used to train both traditional point-estimate models, such as logistic regression, and probabilistic approaches, including Bayesian logistic regression and Bayesian neural networks.

The study aims to address the following questions:

* How does explicitly modeling predictive uncertainty influence the accuracy and interpretability of xG estimates?
* Can uncertainty quantification reveal differences between players or shot types that point estimates overlook?
* How do probabilistic models compare to standard approaches in supporting practical football applications such as scouting, tactical design, and player development?


To answer these questions, the project systematically compares probabilistic and non-probabilistic methods, evaluating their performance and the additional insights gained from uncertainty measures. The analysis explores how this richer information can distinguish players who consistently create high-quality chances from those with more volatile performance, ultimately enabling better-informed recruitment, lineup, and tactical decisions.

**TO DO: Blend two texts and add links**


## 2. Data Loading and Exploration







The dataset used in this study originates from the publicly available **StatsBomb Open Data** repository, which is hosted on GitHub at [https://github.com/statsbomb/open-data](https://github.com/statsbomb/open-data). StatsBomb provides detailed event-level data for over 3,000 matches from more than 20 competitions, including historical tournaments, complete seasons, and the entire career of Lionel Messi. The data include not only standard event logs such as passes, shots, and tackles, but also specialised contextual features such as *freeze-frame* information for shots — a unique component that records the positions of all players and the ball at the exact moment a shot is taken.

The repository contains JSON files structured by competition, season, and match. Each match file stores an ordered sequence of events with a rich set of attributes, including timestamps, coordinates, player and team identifiers, and descriptive tags for event subtypes and techniques. For this project, the primary focus is on **shot events**, extracted from the event data and supplemented with their corresponding freeze-frame snapshots. These freeze-frames enable the derivation of spatial obstruction features, such as the number of teammates or opponents between the shooter and the goal.

The data can be accessed either by downloading the JSON files from the GitHub repository and parsing them into a structured format, or programmatically. This project uses the open-source [`statsbombpy`](https://github.com/statsbomb/statsbombpy) Python library, which provides functions to retrieve competitions, matches, lineups, and event data directly into `pandas` DataFrames. The freeze-frame data are accessed via the `shot['freeze_frame']` field in the event logs, which contains an array of player objects with attributes such as `location`, `teammate` status, and `position_name`.
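
For readers who work from the raw JSON files rather than `statsbombpy`, the following minimal sketch shows how shot events and their freeze frames could be pulled out of an event list. The sample data, the helper name `extract_shots`, and the flat output keys (`x`, `y`, `is_goal`) are illustrative assumptions and not the project's actual loading code. The match identifier is passed in by the caller, on the assumption that it is known from the file being read.

```python
import json

# Hypothetical, heavily trimmed sample in the shape of a raw events file.
RAW_EVENTS = json.loads("""
[
  {"id": "e1", "type": {"name": "Pass"}, "location": [60.0, 40.0]},
  {"id": "e2", "type": {"name": "Shot"}, "location": [108.0, 40.0],
   "shot": {"outcome": {"name": "Goal"},
            "freeze_frame": [
              {"location": [114.0, 40.0], "teammate": false},
              {"location": [100.0, 35.0], "teammate": true}]}},
  {"id": "e3", "type": {"name": "Shot"}, "location": [95.0, 20.0],
   "shot": {"outcome": {"name": "Saved"}}}
]
""")


def extract_shots(events, match_id):
    """Return one flat record per shot event (illustrative helper)."""
    shots = []
    for ev in events:
        if ev.get("type", {}).get("name") != "Shot":
            continue  # skip passes, duels, and all other event types
        shot = ev.get("shot", {})
        shots.append({
            "match_id": match_id,
            "id": ev["id"],
            "x": ev["location"][0],
            "y": ev["location"][1],
            # Binary target: 1 if the outcome is a goal, else 0.
            "is_goal": int(shot.get("outcome", {}).get("name") == "Goal"),
            # The third sample event has no freeze frame; use an empty list.
            "freeze_frame": shot.get("freeze_frame", []),
        })
    return shots


shots = extract_shots(RAW_EVENTS, match_id=0)
```

Keeping `match_id` and `id` on every record makes it possible to join derived freeze-frame features back to the shot table later.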

From the complete set of available columns, only a subset is relevant for expected goals modelling. The most important include:

- `location`: the $(x,y)$ pitch coordinates of the shot in StatsBomb pitch units.
- `shot.outcome`: categorical shot outcome (e.g. `Goal`, `Saved`, `Blocked`); the binary goal target is derived from it.
- `shot.technique`, `shot.body_part`, `shot.type`: categorical descriptors of shot execution.
- `shot.one_on_one`, `shot.first_time`, `shot.open_goal`: boolean contextual indicators.
- `freeze_frame`: list of player positions and attributes at the moment of the shot.
- `team.id` and `player.id`: identifiers used to distinguish teams and shooters.
- `match_id` and `id`: unique identifiers for match and event, used to join freeze-frame and shot records.

All raw JSON files were stored locally in a structured folder hierarchy that mirrors the GitHub repository’s organisation, with separate directories for competitions, matches, and events. For reproducibility, the exact version of the repository used in this study is archived under the project’s `/data/raw` directory, and intermediate processed DataFrames are stored in `/data/processed` for faster loading in subsequent analyses.



## 3. Data Preprocessing




**Feature Set**

The feature set comprises spatial, contextual, and categorical indicators derived from the StatsBomb event and freeze-frame datasets.  
All positional quantities are expressed in **pitch units** (pu) in the StatsBomb coordinate system ($120 \times 80$ pu for pitch length × width), ensuring comparability across matches and competitions without stadium-specific scaling.

**1. Numerical features**

Let $i$ index shots.

- **Distance to goal** (`distance_to_goal`)  
  Straight-line distance from shot location $\mathbf{s}_i=(x_i,y_i)$ to the goal centre $\mathbf{g}=(120,40)$:  
  $$
  d_i = \sqrt{(x_i - 120)^2 + (y_i - 40)^2} \quad [\text{pu}].
  $$  
  Lower values correspond to closer shooting positions.

- **Angle to goal** (`angle_to_goal_deg`)  
  Opening angle $\theta_i$ subtended by the goalposts $\mathbf{p}_\mathrm{L}=(120,36)$ and $\mathbf{p}_\mathrm{R}=(120,44)$ at $\mathbf{s}_i$.  
  Let:  
  $$
  a = \lVert \mathbf{s}_i - \mathbf{p}_\mathrm{L} \rVert_2, \quad
  b = \lVert \mathbf{s}_i - \mathbf{p}_\mathrm{R} \rVert_2, \quad
  c = 8\ \text{pu}
  $$  
  (goal width). Then:  
  $$
  \theta_i = \arccos\!\left( \frac{a^2 + b^2 - c^2}{2ab} \right), \quad
  \theta_i^{(\circ)} = \frac{180}{\pi}\theta_i.
  $$  
  Larger angles indicate closer or more central positions, from which more of the goal mouth is visible.

- **Opponents in way** (`opponents_in_way`)  
  Number of opposition players inside the **shooting triangle** $T_i$ with vertices $\mathbf{s}_i$, $\mathbf{p}_\mathrm{L}$, $\mathbf{p}_\mathrm{R}$, determined via a barycentric sign test.

- **Teammates in way** (`teammates_in_way`)  
  As above, but for players on the shooter’s team; may act as visual screens.

- **With dominant foot** (`with_dominant_foot`)  
  Indicator ($0/1$) for whether the shot was taken with the shooter’s dominant foot.
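
The geometric features above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the project's actual implementation: the function names are hypothetical, freeze-frame entries are assumed to be dicts with `location` and `teammate` keys, every non-teammate inside the triangle (including the goalkeeper) is counted as an opponent, and players exactly on the triangle boundary are counted as inside.

```python
import math

GOAL_CENTRE = (120.0, 40.0)
POST_L = (120.0, 36.0)
POST_R = (120.0, 44.0)


def distance_to_goal(x, y):
    """Straight-line distance from the shot location to the goal centre."""
    return math.hypot(x - GOAL_CENTRE[0], y - GOAL_CENTRE[1])


def angle_to_goal_deg(x, y):
    """Opening angle between the posts via the law of cosines, in degrees."""
    a = math.hypot(x - POST_L[0], y - POST_L[1])
    b = math.hypot(x - POST_R[0], y - POST_R[1])
    c = POST_R[1] - POST_L[1]  # goal width, 8 pu
    if a == 0.0 or b == 0.0:
        return 0.0  # assumption: a shot from exactly on a post gets angle 0
    cos_theta = (a * a + b * b - c * c) / (2.0 * a * b)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding
    return math.degrees(math.acos(cos_theta))


def _side(p, q, r):
    """Signed area term: which side of the line q-r the point p lies on."""
    return (p[0] - r[0]) * (q[1] - r[1]) - (q[0] - r[0]) * (p[1] - r[1])


def in_shooting_triangle(p, shooter):
    """Sign test: p is inside if it is on the same side of all three edges."""
    d1 = _side(p, shooter, POST_L)
    d2 = _side(p, POST_L, POST_R)
    d3 = _side(p, POST_R, shooter)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)


def players_in_way(shooter, freeze_frame):
    """Return (opponents_in_way, teammates_in_way) for one shot."""
    opponents = teammates = 0
    for player in freeze_frame:
        if in_shooting_triangle(player["location"], shooter):
            if player["teammate"]:
                teammates += 1
            else:
                opponents += 1
    return opponents, teammates


frame = [
    {"location": (114.0, 40.0), "teammate": False},  # inside the triangle
    {"location": (114.0, 41.0), "teammate": True},   # inside the triangle
    {"location": (114.0, 30.0), "teammate": False},  # wide of the triangle
    {"location": (100.0, 40.0), "teammate": True},   # behind the shooter
]
d = distance_to_goal(108.0, 40.0)
theta = angle_to_goal_deg(108.0, 40.0)
in_way = players_in_way((108.0, 40.0), frame)
```

For a central shot from 12 pu out, the angle agrees with the closed form $2\arctan(4/12) \approx 36.87^\circ$, which is a convenient sanity check for the law-of-cosines version.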

**2. Binary features**

Binary indicators ($0=$ false, $1=$ true) representing shot context, execution technique, phase of play, and body part used.

**Shot context**
- `shot_first_time` — shot struck without a preceding control.  
- `shot_one_on_one` — shooter in direct confrontation with goalkeeper, no covering defender.  
- `shot_open_goal` — goalkeeper absent from a saveable position.

**Technique**
- `technique_Backheel`  
- `technique_Diving Header`  
- `technique_Half Volley`  
- `technique_Lob`  
- `technique_Normal`  
- `technique_Overhead Kick`  
- `technique_Volley`  

Each flags the registered shooting technique.

**Phase of play**
- `subtype_Free Kick` — direct shot from a free kick.  
- `subtype_Open Play` — shot during continuous play.  
- `subtype_Penalty` — penalty kick from the spot.

**Body part**
- `is_header` — shot taken with the head.
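
As an illustration of how these indicator columns could be produced, the sketch below one-hot encodes a single shot. The input keys (`technique`, `type`, `body_part`, `first_time`, `one_on_one`, `open_goal`) are a hypothetical flat representation, and the code assumes that a missing boolean flag means false and that headers are labelled `Head`. It is not taken from the project's preprocessing code.

```python
TECHNIQUES = ["Backheel", "Diving Header", "Half Volley", "Lob",
              "Normal", "Overhead Kick", "Volley"]
SUBTYPES = ["Free Kick", "Open Play", "Penalty"]


def encode_shot(shot):
    """Map one shot record to the 0/1 indicator features listed above."""
    features = {
        # Assumption: an absent flag is interpreted as False.
        "shot_first_time": int(bool(shot.get("first_time", False))),
        "shot_one_on_one": int(bool(shot.get("one_on_one", False))),
        "shot_open_goal": int(bool(shot.get("open_goal", False))),
        "is_header": int(shot.get("body_part") == "Head"),
    }
    for name in TECHNIQUES:
        features[f"technique_{name}"] = int(shot.get("technique") == name)
    for name in SUBTYPES:
        features[f"subtype_{name}"] = int(shot.get("type") == name)
    return features


example = encode_shot({
    "technique": "Volley",
    "type": "Open Play",
    "body_part": "Right Foot",
    "first_time": True,
})
```

Using fixed category lists, rather than whatever categories happen to appear in a data split, keeps the column set identical across training and validation folds.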


## 4. Probabilistic Modeling Approach




In this study, the probability that a given shot results in a goal is modelled as a function of spatial, contextual, and categorical features. Rather than producing only point estimates, the focus is on probabilistic models that quantify predictive uncertainty, providing richer information for tactical evaluation and player assessment. Three model classes were employed: logistic regression (LR), Bayesian logistic regression (BLR), and Bayesian neural networks (BNN).

Logistic regression serves as a baseline. It maps a feature vector $\mathbf{x}_i$ to a goal probability through the logistic function
$$
p(y_i = 1 \mid \mathbf{x}_i) = \sigma(\mathbf{w}^\top \mathbf{x}_i) = \frac{1}{1 + e^{-\mathbf{w}^\top \mathbf{x}_i}},
$$
where $y_i \in \{0,1\}$ indicates whether the shot was scored and $\mathbf{w}$ is the vector of learned weights. This model is widely used in football analytics because it is interpretable, computationally efficient, and provides an immediate understanding of how each feature affects the log-odds of scoring.
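
The log-odds interpretation can be made concrete with a small sketch. The weights below are invented for illustration only, and the intercept is handled by an assumed constant feature of 1.0 in $\mathbf{x}_i$, since the formula above has no separate bias term.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_xg(w, x):
    """Goal probability sigma(w^T x) for one shot."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))


def odds(p):
    return p / (1.0 - p)


# Hypothetical weights for [constant 1.0, distance_to_goal, angle_to_goal_deg].
w = [-1.0, -0.10, 0.03]
x_far = [1.0, 20.0, 20.0]
x_near = [1.0, 19.0, 20.0]  # same shot, one pitch unit closer

p_far = predict_xg(w, x_far)
p_near = predict_xg(w, x_near)
odds_ratio = odds(p_near) / odds(p_far)
```

Moving one pitch unit closer changes the log-odds by exactly $-w_{\text{distance}}$, so the odds ratio equals $e^{0.10}$ regardless of the other feature values. This is the sense in which each weight is directly interpretable.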

Bayesian logistic regression extends this approach by placing a prior distribution over the parameters, for example $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \tau^2\mathbf{I})$, and inferring a posterior distribution
$$
p(\mathbf{w} \mid \mathcal{D}) \propto p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w}),
$$
where $\mathcal{D}$ is the training data. Predictions are then obtained by marginalising over the posterior,
$$
p(y_i=1 \mid \mathbf{x}_i, \mathcal{D}) = \int \sigma(\mathbf{w}^\top \mathbf{x}_i) \, p(\mathbf{w} \mid \mathcal{D}) \, d\mathbf{w}.
$$
This formulation captures parameter uncertainty, which translates directly into predictive uncertainty and allows the model to express confidence (or lack thereof) in its probability estimates.
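
The predictive integral has no closed form, but it can be approximated by averaging over posterior samples. The sketch below assumes an independent Gaussian approximation to the posterior (as a mean-field variational fit might produce); the means and standard deviations are invented for illustration and are not the fitted values from this project.

```python
import math
import random


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predictive_mean(x, post_mean, post_std, n_samples=2000, seed=0):
    """Monte Carlo estimate of the integral of sigma(w^T x) over p(w | D).

    Assumption: the posterior is approximated by independent Gaussians
    with the given per-weight means and standard deviations.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        w = [rng.gauss(m, s) for m, s in zip(post_mean, post_std)]
        total += sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
    return total / n_samples


# Hypothetical posterior for [constant 1.0, distance_to_goal, angle_to_goal_deg].
post_mean = [-1.0, -0.10, 0.03]
post_std = [0.2, 0.01, 0.01]
x = [1.0, 20.0, 20.0]

p_bayes = predictive_mean(x, post_mean, post_std)
p_plugin = predictive_mean(x, post_mean, [0.0, 0.0, 0.0], n_samples=1)
```

With zero posterior spread the estimate collapses to the ordinary plug-in prediction $\sigma(\bar{\mathbf{w}}^\top \mathbf{x})$, which shows that standard logistic regression is the limiting case of the Bayesian predictive.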

Bayesian neural networks generalise this idea to models capable of learning complex nonlinear interactions between features. They replace the linear mapping in logistic regression with a feedforward neural network, for example
$$
\mathbf{h}^{(1)} = \phi(\mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)}), \quad \dots, \quad
\hat{y} = \sigma(\mathbf{W}^{(L)}\mathbf{h}^{(L-1)} + \mathbf{b}^{(L)}),
$$
where each weight matrix $\mathbf{W}^{(l)}$ and bias vector $\mathbf{b}^{(l)}$ has a prior distribution, and inference is performed approximately using methods such as variational Bayes or Monte Carlo dropout. This allows the model to capture richer patterns in the data, while still producing predictive distributions rather than fixed probabilities.
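
To illustrate the Monte Carlo dropout idea, the following sketch runs repeated stochastic forward passes through a tiny one-hidden-layer network with dropout kept active at prediction time. The architecture, the weights, and the dropout rate are hypothetical and chosen only to keep the example small; they do not describe the network trained in this project.

```python
import math
import random


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical weights of a 2-input, 3-hidden-unit, 1-output network.
W1 = [[0.5, -0.2], [-0.3, 0.8], [0.1, 0.4]]
B1 = [0.0, 0.1, -0.1]
W2 = [0.7, -0.5, 0.3]
B2 = -1.0


def forward(x, rng, drop_p):
    """One stochastic forward pass with (inverted) dropout on the hidden layer."""
    hidden = []
    for row, bias in zip(W1, B1):
        h = max(0.0, sum(w * xi for w, xi in zip(row, x)) + bias)  # ReLU
        keep = rng.random() >= drop_p
        # Rescale kept units so the expected activation is unchanged.
        hidden.append(h / (1.0 - drop_p) if keep else 0.0)
    z = sum(w * h for w, h in zip(W2, hidden)) + B2
    return sigmoid(z)


def mc_dropout_samples(x, n_passes=100, drop_p=0.2, seed=0):
    """Collect predictions from repeated passes; drop_p must be below 1."""
    rng = random.Random(seed)
    return [forward(x, rng, drop_p) for _ in range(n_passes)]


samples = mc_dropout_samples([1.0, 0.5])
deterministic = mc_dropout_samples([1.0, 0.5], drop_p=0.0)
```

Each pass uses a different random mask, so the collection of outputs acts as an approximate sample from the predictive distribution. With the dropout rate set to zero all passes coincide and the network reduces to an ordinary point-estimate model.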

These probabilistic approaches are particularly suitable for expected goals modelling in football because two shots with identical mean xG can differ substantially in variability depending on context, such as defensive pressure or the body part used. By explicitly modelling uncertainty, it becomes possible to distinguish between players who consistently generate high-quality chances and those whose performance is more volatile, as well as to assess whether certain shooting zones are reliably poor or occasionally yield high-reward opportunities. This richer representation of shot quality supports scouting, tactical planning, and player development decisions in a way that point-estimate models cannot.


## 5. Model Training and Evaluation




Model training was conducted using the processed dataset of shot events, with the feature set described in Section 3 and goal outcome as the binary target variable. Three models were trained: logistic regression (LR), Bayesian logistic regression (BLR), and Bayesian neural networks (BNN). For LR, model parameters were estimated via maximum likelihood using the `scikit-learn` implementation. BLR and BNN were trained using approximate Bayesian inference: BLR parameters were inferred via variational Bayes, while the BNN used Monte Carlo dropout to generate posterior samples over weights.

The training data were split into folds for **stratified $k$-fold cross-validation** ($k=5$), ensuring that each fold preserved the proportion of goals and non-goals. For each fold, the model was trained on $k-1$ folds and evaluated on the remaining fold, cycling until every fold had been used for validation. This process produced distributions of performance metrics, enabling the assessment of both average performance and variability across folds.
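
The stratification step can be sketched without any library support. The helper below is an illustrative stand-in (the function names and the round-robin assignment are assumptions, not the routine used in this project): it shuffles the indices of each class separately and deals them out across the folds, so every fold receives roughly the same share of goals.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with near-equal class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin over the folds.
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def train_validation_splits(folds):
    """Yield (train_indices, validation_indices) for each fold in turn."""
    for held_out in range(len(folds)):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]


# Toy target with a 10% goal rate.
labels = [1] * 10 + [0] * 90
folds = stratified_folds(labels, k=5)
splits = list(train_validation_splits(folds))
```

In this toy example every validation fold contains exactly 2 goals and 18 non-goals, so fold-level metrics are not distorted by folds that happen to contain very few goals.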

Model evaluation focused on metrics suited to binary probabilistic classification. The **Brier score**
$$
\text{Brier} = \frac{1}{N} \sum_{i=1}^N \left( \hat{p}_i - y_i \right)^2
$$
was used to measure the mean squared error of predicted probabilities $\hat{p}_i$ against actual outcomes $y_i$. **Log loss**
$$
\text{LogLoss} = - \frac{1}{N} \sum_{i=1}^N \left[ y_i \log \hat{p}_i + (1-y_i) \log (1-\hat{p}_i) \right]
$$
was also computed; it penalises confident but wrong predictions particularly heavily. Discriminative ability was assessed with the **Area Under the Receiver Operating Characteristic Curve (ROC-AUC)**, while calibration quality was evaluated via reliability plots and the Expected Calibration Error (ECE).
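
The two scoring rules above translate directly into code. The sketch below also includes an ECE computation; since the text does not define the ECE variant used, the common formulation with equal-width probability bins is assumed here, and the clipping constant `eps` is likewise an assumption to avoid taking the logarithm of zero.

```python
import math


def brier_score(y, p):
    """Mean squared difference between predicted probability and outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def log_loss(y, p, eps=1e-15):
    """Negative mean Bernoulli log-likelihood, with clipped probabilities."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1.0 - eps)
        total += yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return -total / len(y)


def expected_calibration_error(y, p, n_bins=10):
    """Weighted mean gap between observed goal rate and mean prediction per bin.

    Assumption: equal-width bins on [0, 1].
    """
    bins = [[] for _ in range(n_bins)]
    for yi, pi in zip(y, p):
        idx = min(int(pi * n_bins), n_bins - 1)
        bins[idx].append((yi, pi))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        goal_rate = sum(yi for yi, _ in members) / len(members)
        mean_pred = sum(pi for _, pi in members) / len(members)
        ece += len(members) / len(y) * abs(goal_rate - mean_pred)
    return ece


y_true = [1, 0, 0, 0]
p_hat = [0.25, 0.25, 0.25, 0.25]
scores = (brier_score(y_true, p_hat),
          log_loss(y_true, p_hat),
          expected_calibration_error(y_true, p_hat))
```

In the toy example one of four shots is a goal and every shot is assigned a probability of 0.25, so the predictions are perfectly calibrated (ECE of zero) even though the Brier score and log loss are well above zero. This illustrates why calibration and overall accuracy of the probabilities are reported separately.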

For the probabilistic models (BLR, BNN), **predictive uncertainty** was quantified by generating multiple posterior predictive samples for each shot and computing the standard deviation of the predicted probabilities. This allowed for direct comparisons not only in average accuracy but also in the confidence structure of the predictions. Uncertainty distributions were analysed for different shot types, distances, and angles to explore where the models were most and least certain.
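
Given a set of posterior predictive samples for one shot, the summary statistics described above can be computed as follows. This is a minimal sketch: the function names are hypothetical, the sample values are invented, and the 95% level and the linear-interpolation percentile are assumptions rather than the project's settings.

```python
import math
import statistics


def percentile(sorted_vals, q):
    """Linear-interpolation percentile of pre-sorted values, q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def summarise_samples(samples, level=0.95):
    """Predictive mean, standard deviation and central credible interval."""
    ordered = sorted(samples)
    tail = (1.0 - level) / 2.0
    return {
        "mean": statistics.fmean(ordered),
        "std": statistics.pstdev(ordered),
        "interval": (percentile(ordered, tail), percentile(ordered, 1.0 - tail)),
    }


# Two hypothetical shots with the same mean xG but different spread.
steady = summarise_samples([0.28, 0.29, 0.30, 0.31, 0.32])
volatile = summarise_samples([0.10, 0.20, 0.30, 0.40, 0.50])
```

Both toy shots have a mean xG of 0.30, but the second has a much larger standard deviation and a wider interval. A point-estimate model would treat them as identical, which is exactly the distinction the uncertainty analysis is meant to expose.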

Performance visualisations included ROC curves, precision–recall curves, calibration plots, and histograms of predicted probabilities by outcome. For the Bayesian models, additional plots showed predictive means with associated credible intervals, making it possible to compare the spread of predictions between players, match situations, or shot locations.

Across the experiments, LR provided a strong and interpretable baseline with competitive discriminative performance but no explicit measure of uncertainty. BLR achieved similar mean scores to LR but offered valuable additional insight through posterior variance, highlighting situations where the data did not strongly support a confident prediction. The BNN slightly outperformed both LR and BLR in ROC-AUC and log loss, capturing nonlinear relationships between features; however, it exhibited higher variance in uncertainty estimates for rare shot types, reflecting the greater flexibility of the model and its sensitivity to sparse data regions. Overall, the results indicate that while simpler models can be effective for average-case prediction, Bayesian approaches — particularly the BNN — offer richer information for decision-making by combining accuracy with well-calibrated uncertainty estimates.


## 6. Results




**Performance Summary Table**

| Model                  | AUC   | Accuracy | Log-Loss |
|------------------------|-------|----------|----------|
| Logistic Regression    | 0.807 | 0.761    | 0.523    |
| Bayesian Logistic Reg. | 0.805 | 0.902    | 0.277    |
| Bayesian Neural Net    | 0.813 | 0.904    | 0.271    |



Three models were evaluated on the held-out test set: a baseline (frequentist) logistic regression (LR), a Bayesian logistic regression (BLR), and a Bayesian neural network (BNN).  
The baseline LR achieved an AUC of **0.807**, an accuracy of **0.761**, and a log-loss of **0.523**.  
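The BLR reached a practically identical AUC of **0.805**, but a markedly higher accuracy of **0.902** and a lower log-loss of **0.277**. Ranking ability is therefore essentially unchanged, while the predicted probabilities fit the observed outcomes substantially better. This suggests that incorporating parameter uncertainty can lead to better-calibrated probabilities without sacrificing ranking ability.  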
The BNN achieved the highest overall performance, with an AUC of **0.813**, an accuracy of **0.904**, and a log-loss of **0.271**. These results indicate that the BNN’s additional flexibility allows it to capture more complex patterns in the data while maintaining strong probability calibration.  

Uncertainty analysis further differentiates the models. The baseline LR produces point estimates without quantifying parameter uncertainty. In contrast, both Bayesian approaches generate credible intervals for predictions, reflecting the posterior uncertainty over parameters. The BLR’s intervals are generally narrower, reflecting its simpler linear form, whereas the BNN’s intervals can be wider for certain samples, especially in regions of the feature space with less training data coverage. This suggests that the BNN is able to signal when it is less certain about a prediction — a desirable property in risk-sensitive applications.  

Overall, the Bayesian approaches outperform the baseline LR in terms of log-loss and accuracy and, for the BNN, also marginally in discrimination (AUC). The BNN provides the richest uncertainty quantification, but at the cost of higher computational demand, while the BLR strikes a balance between performance, interpretability, and efficiency.

## 7. Discussion




### Interpretation of Results
The comparison between the baseline logistic regression, Bayesian logistic regression (BLR), and Bayesian neural network (BNN) shows clear improvements in accuracy and log-loss when Bayesian methods are applied, while AUC remains nearly unchanged.  
While the baseline LR achieved an AUC of 0.807 and an accuracy of 0.761, the BLR substantially increased accuracy to 0.902 and reduced log-loss from 0.523 to 0.277, indicating much better probability calibration.  
The BNN achieved the highest AUC (0.813) and slightly higher accuracy (0.904) compared to the BLR, with the lowest log-loss (0.271).  
Uncertainty analysis revealed that Bayesian models provide credible intervals for predictions, allowing the detection of low-confidence cases. The BNN captured more complex patterns in the data, which is beneficial for tasks where feature interactions are nonlinear.

### Limitations of the Approach
- **Computational cost**: Bayesian inference, especially for neural networks, is significantly more computationally expensive than frequentist methods.
- **Model complexity**: The BNN’s additional parameters improve flexibility but also increase the risk of overfitting, particularly with limited data.
- **Interpretability**: BLR retains a degree of interpretability through its coefficients, whereas BNNs are more opaque.
- **Data dependence**: Performance gains depend on the representativeness of the training data; Bayesian methods still rely on quality and diversity of the input.

### Possible Improvements or Extensions
- **More efficient inference**: Use variational inference or other approximate methods to speed up sampling for larger networks.
- **Feature engineering**: Incorporate domain knowledge to create more informative features, potentially improving all models.
- **Hierarchical models**: Extend the Bayesian framework to include hierarchical priors for capturing group-level effects.
- **Model ensembles**: Combine multiple Bayesian models to further improve robustness and predictive performance.
- **Calibration analysis**: Perform deeper calibration checks (e.g., reliability diagrams) to better understand the probability estimates.


## 8. Conclusion




This study compared three approaches to expected goals modelling, framed as binary classification of shot outcomes: a baseline logistic regression, a Bayesian logistic regression (BLR), and a Bayesian neural network (BNN).  
The results show that incorporating Bayesian inference improves both predictive performance and probability calibration.  
Compared to the baseline logistic regression (AUC: 0.807, Accuracy: 0.761, Log-Loss: 0.523), the BLR substantially improved accuracy (0.902) and reduced log-loss (0.277), while maintaining a comparable AUC (0.805).  
The BNN achieved the highest AUC (0.813), slightly higher accuracy (0.904), and the lowest log-loss (0.271), indicating the best overall performance among the three.  

In addition to performance gains, both Bayesian models provide uncertainty estimates through credible intervals, allowing more informed decision-making in cases of low confidence. The BNN, with its capacity to model nonlinear feature interactions, offers the richest uncertainty quantification but at the cost of increased computational demands.  
The BLR strikes a balance between performance, interpretability, and efficiency, making it a strong candidate for scenarios where model transparency is essential.  

Overall, the findings support the use of Bayesian methods for improving model reliability and decision-making, especially in applications where understanding uncertainty is as important as achieving high predictive accuracy.


## 9. References


- StatsBomb (2025). *StatsBomb Open Data*. GitHub repository: [https://github.com/statsbomb/open-data](https://github.com/statsbomb/open-data)  
  Dataset containing detailed football event data, including freeze-frame information for shots.

- Neal, R. M. (2012). *Bayesian Learning for Neural Networks*. Springer Science & Business Media.  
  Foundational work on Bayesian neural networks.

- Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer.  
  Comprehensive reference for probabilistic modelling and machine learning theory.

- Brier, G. W. (1950). *Verification of Forecasts Expressed in Terms of Probability*. Monthly Weather Review, 78(1), 1–3.  
  Original description of the Brier score metric.

- Murphy, K. P. (2012). *Machine Learning: A Probabilistic Perspective*. MIT Press.  
  Detailed coverage of Bayesian logistic regression and probabilistic inference methods.

- Pedregosa, F., Varoquaux, G., Gramfort, A., et al. (2011). *Scikit-learn: Machine Learning in Python*. Journal of Machine Learning Research, 12, 2825–2830.  
  Library used for logistic regression and evaluation metrics.

- StatsBombPy (2025). *Python client for StatsBomb data*. GitHub repository: [https://github.com/statsbomb/statsbombpy](https://github.com/statsbomb/statsbombpy)  
  Tool for loading and processing StatsBomb Open Data in Python.

- Kingma, D. P., & Welling, M. (2014). *Auto-Encoding Variational Bayes*. arXiv preprint arXiv:1312.6114.  
  Methodology underlying variational inference for Bayesian models.

- Gal, Y., & Ghahramani, Z. (2016). *Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning*. In *Proceedings of the 33rd International Conference on Machine Learning (ICML)*.  
  Approach used for Monte Carlo dropout in Bayesian neural networks.
