<a href="https://colab.research.google.com/github/hannesstuehrenberg/Probabilistic-Machine-Learning_lecture-PROJECTS/blob/08-1SHXXXX_football_analytics/projects/08-1SHXXXX_football_analytics/report/08-1SHXXXX_football_analytics_Probabilistic_Machine_Learning_Project_Report.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Probabilistic Machine Learning - Project Report - Football Analytics

**Course:** Probabilistic Machine Learning (SoSe 2025)  
**Lecturer:** Alvaro Dias-Ruelas  
**Student(s) Name(s):** Hannes Stührenberg  
**GitHub Username(s):** hannesstuehrenberg  
**Date:** 15.08.2025  
**PROJECT-ID:** 08-1SHXXXX



## 1. Introduction


Hudl StatsBomb claims to offer the most comprehensive and flexible football data. The company makes some of its data freely available to equip the next generation of analysts with the tools, training and resources needed to succeed in the industry. The available data range from historical matches for analysing [legends of the game](https://www.hudl.com/blog/statsbomb-icons-pele), through [outstanding seasons](https://blogarchive.statsbomb.com/news/free-statsbomb-data-bayer-leverkusens-invincible-bundesliga-title-win/) and [tournaments](https://blogarchive.statsbomb.com/news/statsbomb-release-free-2023-womens-world-cup-data/), to [all league matches from Lionel Messi's career](https://www.hudl.com/blog/statsbomb-release-free-lionel-messi-data-psg-and-inter-miami). In total, the data cover every recorded event in over 3,000 matches from more than 20 competitions. They are published in the [StatsBomb GitHub repository](https://github.com/statsbomb/open-data). [Libraries for processing the data](https://mplsoccer.readthedocs.io/en/latest/) in Python are also available.

In this project I want to use these data to create an expected goals model (xG). My motivation stems from two perspectives.

First, like so many, I am a huge football fan. Watching professional football, the expected goals metric is shown during every broadcast. I want to deepen my knowledge of the metric to understand how it is calculated, what influences it and how to interpret it beyond the surface level.

Second, in my free time I am an ambitious youth team coach. Being able to quantify the quality of scoring chances can offer valuable insights into my team's performance. It allows me to go beyond just goals and results, and instead evaluate whether we are consistently creating good opportunities and where we can improve. With an xG model, I can provide more objective feedback to players, helping them understand the impact of their decisions in the final third and supporting their development with data-driven guidance.

In recent years, football analytics has become an increasingly data-driven field, with advanced metrics such as expected goals (xG) now a regular feature in match broadcasts, scouting reports, and tactical analysis. One of the most widely used and freely accessible football datasets is provided by Hudl StatsBomb, offering detailed event data from over 3,000 matches across more than 20 competitions — including full career coverage of Lionel Messi. The dataset contains every recorded on-ball event, from shots and passes to pressures and duels, making it a rich resource for building and evaluating predictive models.

This project investigates whether football shot outcomes can be more effectively modeled by explicitly incorporating uncertainty into xG predictions. Using the StatsBomb event data, all recorded shots are processed to construct feature representations that capture spatial, contextual, and tactical aspects of chance creation. These features are then used to train both traditional point-estimate models, such as logistic regression, and probabilistic approaches, including Bayesian logistic regression and Bayesian neural networks.

The study aims to address the following questions:

* How does explicitly modeling predictive uncertainty influence the accuracy and interpretability of xG estimates?
* Can uncertainty quantification reveal differences between players or shot types that point estimates overlook?
* How do probabilistic models compare to standard approaches in supporting practical football applications such as scouting, tactical design, and player development?


To answer these questions, the project systematically compares probabilistic and non-probabilistic methods, evaluating their performance and the additional insights gained from uncertainty measures. The analysis explores how this richer information can distinguish players who consistently create high-quality chances from those with more volatile performance, ultimately enabling better-informed recruitment, lineup, and tactical decisions.

**TO DO: Blend two texts and add links**


## 2. Data Loading and Exploration

- Code to load data
- Basic exploration (plots, statistics, missing data, etc.)  




- Explain Data used. Where and how to access them? Are there special features (e.g. like freeze frame)
- How to access the data? Which tools are available?
- Which columns are useful?
- Where were the data stored? With links



## 3. Data Preprocessing

- Steps taken to clean or transform the data



**Feature Set**

The feature set comprises spatial, contextual, and categorical indicators derived from the StatsBomb event and freeze-frame datasets.  
All positional quantities are expressed in **pitch units** (pu) in the StatsBomb coordinate system ($120 \times 80$ pu for pitch length × width), ensuring comparability across matches and competitions without stadium-specific scaling.

**1. Numerical features**

Let $i$ index shots.

- **Distance to goal** (`distance_to_goal`)  
  Straight-line distance from shot location $\mathbf{s}_i=(x_i,y_i)$ to the goal centre $\mathbf{g}=(120,40)$:  
  $$
  d_i = \sqrt{(x_i - 120)^2 + (y_i - 40)^2} \quad [\text{pu}].
  $$  
  Lower values correspond to closer shooting positions.

- **Angle to goal** (`angle_to_goal_deg`)  
  Opening angle $\theta_i$ subtended by the goalposts $\mathbf{p}_\mathrm{L}=(120,36)$ and $\mathbf{p}_\mathrm{R}=(120,44)$ at $\mathbf{s}_i$.  
  Let:  
  $$
  a = \lVert \mathbf{s}_i - \mathbf{p}_\mathrm{L} \rVert_2, \quad
  b = \lVert \mathbf{s}_i - \mathbf{p}_\mathrm{R} \rVert_2, \quad
  c = 8\ \text{pu}
  $$  
  (goal width). Then:  
  $$
  \theta_i = \arccos\!\left( \frac{a^2 + b^2 - c^2}{2ab} \right), \quad
  \theta_i^{(\circ)} = \frac{180}{\pi}\theta_i.
  $$  
  Larger angles indicate more central positions with greater visible goal width.

- **Opponents in way** (`opponents_in_way`)  
  Number of opposition players inside the **shooting triangle** $T_i$ with vertices $\mathbf{s}_i$, $\mathbf{p}_\mathrm{L}$, $\mathbf{p}_\mathrm{R}$, determined via a barycentric sign test.

- **Teammates in way** (`teammates_in_way`)  
  As above, but for players on the shooter’s team; may act as visual screens.

- **With dominant foot** (`with_dominant_foot`)  
  Indicator ($0/1$) for whether the shot was taken with the shooter’s dominant foot.
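
The geometric features above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the project's actual implementation: the function names are hypothetical, the triangle test uses cross-product signs as one common form of a sign test, and points on the triangle boundary are counted as inside. Shots taken exactly from a goalpost position (where $a$ or $b$ is zero) are not handled.

```python
import numpy as np

# StatsBomb pitch coordinates as stated above: goal centre and posts.
GOAL_CENTRE = np.array([120.0, 40.0])
POST_L = np.array([120.0, 36.0])
POST_R = np.array([120.0, 44.0])


def distance_to_goal(shot):
    """Euclidean distance from the shot location to the goal centre (pu)."""
    return float(np.linalg.norm(np.asarray(shot, dtype=float) - GOAL_CENTRE))


def angle_to_goal_deg(shot):
    """Opening angle between the posts seen from the shot (law of cosines)."""
    shot = np.asarray(shot, dtype=float)
    a = np.linalg.norm(shot - POST_L)
    b = np.linalg.norm(shot - POST_R)
    c = np.linalg.norm(POST_R - POST_L)  # goal width, 8 pu
    cos_theta = (a**2 + b**2 - c**2) / (2 * a * b)
    # Clipping guards against rounding slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))


def _cross(o, p, q):
    """z-component of (p - o) x (q - o); its sign tells on which side of o->p the point q lies."""
    return (p[0] - o[0]) * (q[1] - o[1]) - (p[1] - o[1]) * (q[0] - o[0])


def players_in_triangle(shot, players):
    """Count players inside the triangle (shot, POST_L, POST_R)."""
    shot = np.asarray(shot, dtype=float)
    count = 0
    for p in np.asarray(players, dtype=float).reshape(-1, 2):
        d1 = _cross(shot, POST_L, p)
        d2 = _cross(POST_L, POST_R, p)
        d3 = _cross(POST_R, shot, p)
        has_neg = (d1 < 0) or (d2 < 0) or (d3 < 0)
        has_pos = (d1 > 0) or (d2 > 0) or (d3 > 0)
        # Inside (or on the boundary) when the three signs do not disagree.
        count += not (has_neg and has_pos)
    return int(count)


# Example: a shot from the penalty spot with three illustrative player positions.
penalty_spot = (108.0, 40.0)
example_players = [(114.0, 40.0), (114.0, 30.0), (100.0, 40.0)]
d = distance_to_goal(penalty_spot)
theta = angle_to_goal_deg(penalty_spot)
n_blocking = players_in_triangle(penalty_spot, example_players)
```

The same `players_in_triangle` call would serve for both `opponents_in_way` and `teammates_in_way`, given the respective subsets of freeze-frame positions.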

**2. Binary features**

Binary indicators ($0=$ false, $1=$ true) representing shot context, execution technique, phase of play, and body part used.

**Shot context**
- `shot_first_time` — shot struck without a preceding control.  
- `shot_one_on_one` — shooter in direct confrontation with goalkeeper, no covering defender.  
- `shot_open_goal` — goalkeeper absent from a saveable position.

**Technique**
- `technique_Backheel`  
- `technique_Diving Header`  
- `technique_Half Volley`  
- `technique_Lob`  
- `technique_Normal`  
- `technique_Overhead Kick`  
- `technique_Volley`  

Each flags the registered shooting technique.

**Phase of play**
- `subtype_Free Kick` — direct shot from a free kick.  
- `subtype_Open Play` — shot during continuous play.  
- `subtype_Penalty` — penalty kick from the spot.

**Body part**
- `is_header` — shot taken with the head.
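
The naming pattern of the indicator columns (`technique_*`, `subtype_*`) matches what one-hot encoding with a prefix produces. The following is a small sketch of that step, assuming pandas is used; the raw column names `technique`, `subtype` and `body_part` and the example rows are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical raw shot table; column names and rows are assumptions.
shots = pd.DataFrame({
    "technique": ["Normal", "Volley", "Half Volley", "Normal"],
    "subtype": ["Open Play", "Penalty", "Open Play", "Free Kick"],
    "body_part": ["Right Foot", "Left Foot", "Head", "Right Foot"],
})

# One 0/1 column per category, named <prefix>_<category>.
encoded = pd.get_dummies(
    shots[["technique", "subtype"]],
    prefix=["technique", "subtype"],
    dtype=int,
)

# Single indicator for headed shots.
encoded["is_header"] = (shots["body_part"] == "Head").astype(int)
```

Note that `get_dummies` only creates columns for categories present in the data, so rare techniques such as `Backheel` would need a fixed category list to appear consistently across data splits.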


## 4. Probabilistic Modeling Approach

- Description of the models chosen
- Why they are suitable for your problem
- Mathematical formulations (if applicable)



## 5. Model Training and Evaluation

- Training process
- Model evaluation (metrics, plots, performance)
- Cross-validation or uncertainty quantification



To evaluate the classification performance for predicting goal probabilities, three probabilistic approaches of increasing complexity were applied: Logistic Regression, Bayesian Logistic Regression, and a Bayesian Neural Network. Each model outputs a probability $p(y_i = 1 \mid \mathbf{x}_i)$ that a given shot $i$ results in a goal, based on its feature vector $\mathbf{x}_i$.

Logistic Regression serves as a baseline model that assumes a linear relationship between the input features and the log-odds of the target event. Mathematically, the model is defined as:
$$
\log \frac{p_i}{1 - p_i} = \beta_0 + \boldsymbol{\beta}^\top \mathbf{x}_i,
$$

where $p_i$ is the predicted probability, $\beta_0$ is the intercept, and $\boldsymbol{\beta}$ are the feature weights. Parameters are estimated using maximum likelihood. While logistic regression produces well-calibrated probability estimates, it captures uncertainty only through sampling variability and does not explicitly model parameter uncertainty. Despite this limitation, it remains valuable for its interpretability: coefficients directly indicate the direction and strength of each feature’s influence on the outcome.
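
As a minimal sketch of the maximum-likelihood fit, the snippet below estimates $\beta_0$ and $\boldsymbol{\beta}$ by gradient ascent on the Bernoulli log-likelihood. The data are synthetic and the function names, learning rate and iteration count are assumptions made for illustration; in practice a library solver would typically be used instead.

```python
import numpy as np


def sigmoid(z):
    """Logistic function mapping log-odds to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic_mle(X, y, lr=0.1, n_iter=5000):
    """Maximise the mean Bernoulli log-likelihood by gradient ascent."""
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb @ beta)
        beta += lr * Xb.T @ (y - p) / len(y)  # gradient of the mean log-likelihood
    return beta  # beta[0] is the intercept, beta[1:] the feature weights


# Synthetic data standing in for shot features (NOT StatsBomb data).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
true_beta = np.array([-2.0, 1.0, -0.5])
y = rng.binomial(1, sigmoid(true_beta[0] + X @ true_beta[1:]))

beta_hat = fit_logistic_mle(X, y)
linear_predictor = beta_hat[0] + X @ beta_hat[1:]
p_hat = sigmoid(linear_predictor)
```

The log-odds of `p_hat` equal `linear_predictor` by construction, which is the linearity assumption stated in the equation above.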

Bayesian Logistic Regression extends this approach by treating the parameters $(\beta_0, \boldsymbol{\beta})$ as random variables with prior distributions, commonly Gaussian:
$$
\beta_j \sim \mathcal{N}(0, \sigma^2), \quad j = 0, \dots, d.
$$

The posterior distribution is then inferred using Bayes’ theorem, yielding:
$$
p(\boldsymbol{\beta} \mid \mathcal{D}) \propto p(\mathcal{D} \mid \boldsymbol{\beta})\, p(\boldsymbol{\beta}).
$$

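Predictions marginalize over the posterior:
$$
p(y_i = 1 \mid \mathbf{x}_i, \mathcal{D}) = \int \sigma(\beta_0 + \boldsymbol{\beta}^\top \mathbf{x}_i)\, p(\boldsymbol{\beta} \mid \mathcal{D})\, d\boldsymbol{\beta}.
$$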

This allows the model to quantify parameter uncertainty, providing predictive distributions rather than single-point estimates — a crucial advantage when evaluating the variability in scoring chances.
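
The predictive integral has no closed form, so it is approximated in practice. The sketch below uses a Laplace approximation (a Gaussian centred at the MAP estimate) followed by Monte Carlo averaging. This is one possible inference scheme chosen for brevity, not necessarily the one used in this project; MCMC or variational inference are common alternatives. The prior scale `sigma=2.0`, the synthetic data and all function names are assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def neg_log_posterior_grad_hess(beta, Xb, y, sigma):
    """Gradient and Hessian of the negative log-posterior under beta_j ~ N(0, sigma^2)."""
    p = sigmoid(Xb @ beta)
    prior_prec = np.eye(len(beta)) / sigma**2
    grad = Xb.T @ (p - y) + prior_prec @ beta
    hess = (Xb * (p * (1 - p))[:, None]).T @ Xb + prior_prec
    return grad, hess


def laplace_posterior(Xb, y, sigma=2.0, n_iter=50):
    """MAP estimate via Newton steps, plus the Laplace covariance (inverse Hessian)."""
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        grad, hess = neg_log_posterior_grad_hess(beta, Xb, y, sigma)
        beta = beta - np.linalg.solve(hess, grad)
    _, hess = neg_log_posterior_grad_hess(beta, Xb, y, sigma)
    return beta, np.linalg.inv(hess)


def predictive(Xb_new, mean, cov, n_samples=2000, seed=0):
    """Monte Carlo estimate of the predictive mean and its spread over parameter draws."""
    rng = np.random.default_rng(seed)
    betas = rng.multivariate_normal(mean, cov, size=n_samples)  # (S, d)
    probs = sigmoid(Xb_new @ betas.T)  # (n, S): one probability per draw
    return probs.mean(axis=1), probs.std(axis=1)


# Small synthetic data set (NOT StatsBomb data); first column is the intercept.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
Xb = np.column_stack([np.ones(len(X)), X])
y = rng.binomial(1, sigmoid(Xb @ np.array([-1.0, 1.0, -0.5])))

beta_map, cov = laplace_posterior(Xb, y)
pred_mean, pred_std = predictive(Xb[:5], beta_map, cov)
```

Here `pred_mean` plays the role of the xG value for each shot, while `pred_std` summarises how much that value varies across plausible parameter settings.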

Bayesian Neural Networks (BNNs) generalize this concept to non-linear models by placing probability distributions over the weights and biases of a neural network. Given a network function $f_{W,b}(\mathbf{x}_i)$ with priors:
$$
w_{jk} \sim \mathcal{N}(0, \sigma_w^2), \quad b_k \sim \mathcal{N}(0, \sigma_b^2),
$$

predictions are computed as:
$$
p(y_i = 1 \mid \mathbf{x}_i, \mathcal{D}) = \int \sigma\big(f_{W,b}(\mathbf{x}_i)\big)\, p(W, b \mid \mathcal{D})\, dW\, db.
$$

By integrating over these distributions, BNNs capture both model uncertainty (arising from limited data) and data uncertainty (noise inherent in football events). Their non-linear structure enables them to learn complex feature interactions, such as subtle combinations of shot angle, distance, and defensive pressure, that may influence goal probability.
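
The BNN predictive integral is likewise approximated by averaging the network output over weight samples. The sketch below shows only this Monte Carlo step for a hypothetical one-hidden-layer network with `tanh` activation. The weight samples are drawn from the Gaussian priors purely as stand-ins so that the code runs; in a real model they would come from posterior inference (for example MCMC or variational inference). The architecture, sizes and names are assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def network_logit(x, W1, b1, w2, b2):
    """One-hidden-layer network f_{W,b}(x) with tanh activation; one logit per row of x."""
    hidden = np.tanh(x @ W1 + b1)
    return hidden @ w2 + b2


def sample_weights(rng, n_in, n_hidden, sigma_w=1.0, sigma_b=1.0):
    """Draw one set of weights and biases from independent Gaussian priors."""
    return (
        rng.normal(0.0, sigma_w, size=(n_in, n_hidden)),
        rng.normal(0.0, sigma_b, size=n_hidden),
        rng.normal(0.0, sigma_w, size=n_hidden),
        rng.normal(0.0, sigma_b),
    )


def mc_predictive(x, weight_samples):
    """Average sigmoid(f_{W,b}(x)) over weight samples; also return the spread."""
    probs = np.stack([sigmoid(network_logit(x, *w)) for w in weight_samples])  # (S, n)
    return probs.mean(axis=0), probs.std(axis=0)


rng = np.random.default_rng(0)
x_new = rng.normal(size=(5, 3))  # five illustrative shots with three features each
# Stand-in draws from the PRIOR, used only to make the sketch runnable.
weight_samples = [sample_weights(rng, n_in=3, n_hidden=8) for _ in range(500)]
bnn_mean, bnn_std = mc_predictive(x_new, weight_samples)
```

The spread `bnn_std` across weight samples reflects the model-uncertainty component; the data-uncertainty component is carried by the Bernoulli outcome given each predicted probability.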

## 6. Results

- Present key findings
- Comparison of models if multiple approaches were used



## 7. Discussion

- Interpretation of results
- Limitations of the approach
- Possible improvements or extensions



## 8. Conclusion

- Summary of main outcomes



## 9. References

- Cite any papers, datasets, or tools used