# Documentation of UEFA EURO DATA 
## by Valentina Gruber

## 1 Introduction

This project uses a UEFA Euro 2024 dataset from FBref: https://fbref.com/en/comps/676/stats/UEFA-Euro-Stats.
FBref says it is a leading source for football analytics. It covers more than 20 competitions, including the big five European men’s leagues, the Champions League, World Cups, the Copa Libertadores, and top leagues in Brazil, Portugal, the Netherlands, the USA, and Mexico. It also covers the top women’s leagues in the USA, Germany, France, England, Italy, Spain, and Australia (https://fbref.com/en/about/).
FBref is a project of Sports Reference LLC, a US company that has run several major sports statistics sites since 2000. This long track record suggests stable and trustworthy data (https://www.sports-reference.com/about.html).
FBref also works with Opta (https://www.sports-reference.com/blog/2022/10/fbref-leagues-%F0%9F%87%B5%F0%9F%87%B9-leagues-%F0%9F%87%A7%F0%9F%87%B7-leagues-%F0%9F%87%B2%F0%9F%87%BD-expanded-womens-and-mens-data-new-data-partner/).
Opta is a UK sports analytics company founded in 1996. Its data is known as independent, consistent, and accurate. It is used worldwide by coaches, players, journalists, broadcasters, brands, and sports apps (https://www.statsperform.com/opta-points/).

In this project I focus on Euro 2024 data. The dataset has 11 tables. Each table covers a different topic, such as goalkeeper stats, defensive actions, shooting, passing, and other match metrics.

The goal is to find player types without using the official position. I group players based on their stats using probabilistic clustering.

I assume that player types exist beyond classic positions like “striker” or “defender”, and that these types can be identified from match metrics.

## 2 Data Loading

The data has 11 tables with 621 players and 219 columns.
When loading the data, many cells were empty. Some players only played a few minutes and have very few stats. On average, a player is missing about one third of all metrics.
There is also high correlation between several features. Some metrics appear in different units: per 90 minutes, totals, or percentages.

## 3 Data Preprocessing

I created a unique ID for each player. The ID uses the first two letters of the first name, position, country code, age, and year. I wrote code to detect duplicate IDs and fix them.
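A minimal sketch of how such an ID and the duplicate fix could look. The function names, the underscore separator, the `_2`, `_3` suffix scheme, and the example rows are assumptions for illustration; the actual notebook code may differ.

```python
import pandas as pd


def make_player_id(first_name: str, pos: str, nation: str, age: int, year: int) -> str:
    """Combine first two letters of the first name, position, country code, age, and year."""
    return f"{first_name[:2].upper()}_{pos}_{nation}_{age}_{year}"


def deduplicate_ids(ids: pd.Series) -> pd.Series:
    """Append _2, _3, ... to repeated IDs so that every ID is unique."""
    occurrence = ids.groupby(ids).cumcount()  # 0 for the first occurrence, 1 for the second, ...
    suffix = occurrence.map(lambda n: "" if n == 0 else f"_{n + 1}")
    return ids + suffix


# Made-up rows for illustration only (not taken from the dataset).
players = pd.DataFrame({
    "first_name": ["Anna", "Anton", "Bea"],
    "pos": ["MF", "MF", "DF"],
    "nation": ["GER", "GER", "ESP"],
    "age": [25, 25, 30],
})
players["player_id"] = [
    make_player_id(r.first_name, r.pos, r.nation, r.age, 2024) for r in players.itertuples()
]
players["player_id"] = deduplicate_ids(players["player_id"])
```

The first two rows collide on the raw ID, so the second one receives a suffix while the first keeps its original value.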

Merging: I joined the 11 tables into one master table using the ID. I removed duplicate columns, because each table had basic fields like age and team. I kept the other features, even if they used different units (percent, totals, per 90). The dataset was not too large, so this was acceptable.
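The merge step can be sketched as follows. This assumes an outer join on a `player_id` column and that a shared column such as `age` is kept only from the first table that contains it; `merge_tables` and the two toy tables are hypothetical.

```python
from functools import reduce

import pandas as pd


def merge_tables(tables: list[pd.DataFrame], key: str = "player_id") -> pd.DataFrame:
    """Join all tables on the key and skip columns that an earlier table already provided."""
    def join(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
        new_cols = [c for c in right.columns if c == key or c not in left.columns]
        return left.merge(right[new_cols], on=key, how="outer")
    return reduce(join, tables)


# Made-up tables for illustration only.
shooting = pd.DataFrame({"player_id": ["a", "b"], "age": [25, 30], "Sh": [3, 1]})
passing = pd.DataFrame({"player_id": ["a", "b"], "age": [25, 30], "KP": [2, 5]})
master = merge_tables([shooting, passing])
```

One caveat of this design: with an outer join, a player who appears only in a later table would get missing values in the shared basic fields, so it works best when the first table lists every player.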

Cleaning: I removed players with very little game time (see 01_cleaning.ipynb). Out of 621 players, 493 played at least once, and 359 played at least 90 minutes. All later analysis uses these 359 players.
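As a small illustration of the two thresholds, assuming the minutes column is called `Min` (an assumption about the column name) and using made-up values:

```python
import pandas as pd

# Made-up minutes for illustration only.
stats = pd.DataFrame({"player_id": ["a", "b", "c", "d"], "Min": [0, 45, 90, 270]})

played = stats[stats["Min"] > 0]       # played at least once
regulars = stats[stats["Min"] >= 90]   # played at least 90 minutes in total
```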

## 4 Data Exploration

I did an initial exploration to understand the data. I looked at player strength and the age of top players. I used metrics like minutes, team performance while on the pitch, touches, goals, carries, tackles, and interceptions.  
Results looked plausible: players like Nuno Mendes, Joshua Kimmich, Toni Kroos, and Pepe appeared as top performers.  
I also ran a first K-Means clustering. I grouped features into four areas: Completion & Creativity, Passing & Build-up, Ball Control & Progression, and Defensive Behavior.  
The clusters made sense. Likely positions were visible even without using the position label. Known players matched clusters consistent with their play style.  

## 5 Probabilistic Modeling Approach

Next, I used a Gaussian Mixture Model (GMM). It is a probabilistic clustering method. It assumes the data comes from a mix of several normal distributions.
Unlike K-Means, GMM gives soft assignments. A player can belong to several clusters with different probabilities. GMM also allows flexible cluster shapes and gives probabilities that help with interpretation and uncertainty.
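In standard notation (the symbols are introduced here and are not taken from the project code), a GMM with $K$ components models a feature vector $x$ as

$$
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1 .
$$

The soft assignment of a player to cluster $k$ is the posterior probability

$$
\gamma_k(x) = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)} .
$$

The hard label is the cluster with the largest $\gamma_k(x)$, and that largest value can serve as a confidence score.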

## 6 Model Training and Evaluation

I loaded player_stats_clean.csv. I set a minimum completeness of 70% and a maximum feature correlation of 0.90. I removed non-feature columns (name, position, team, nation, season, IDs) and “Unnamed” technical columns.  

I defined a starting feature list:  
offense per 90 (e.g., Gls/90, Ast/90, xG/90, Sh/90), chance creation (SCA90, GCA90, KP, PPA, 1/3, CrsPA, PrgP), progression and receiving (PrgC, PrgR), defense (Tkl, TklW, Int, Blocks, Clr, Won%, Aerials Won/Lost), possession/carries/take-ons (Touches, Att 3rd, Mid 3rd, Att Pen, Carries, Succ%, Mis, Dis), and efficiency (G/Sh, G/SoT, npxG/Sh).
I resolved alias names across tables. I also added all numeric /90 columns as candidates. The final feature set is the intersection of domain features, /90 columns, and numeric columns.

Quality checks removed columns with <70% completeness. I filled remaining missing values with the column median.  
The chosen features were: SCA90, GCA90, KP, PPA, 1/3, CrsPA, PrgP, Int, Clr, Touches, TeamSuccess+/-90, TeamSuccess(xG)xG+/-90, StandardSh/90, StandardSoT/90.  
Non-normalized features were converted to per-90.  
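The quality check, the median imputation, and the per-90 conversion can be sketched as below. The helper names and the toy values are assumptions; only the 70% threshold, the median fill, and the per-90 idea come from the text above.

```python
import numpy as np
import pandas as pd


def filter_and_impute(features: pd.DataFrame, min_completeness: float = 0.70) -> pd.DataFrame:
    """Drop columns below the completeness threshold, then fill gaps with the column median."""
    completeness = features.notna().mean()  # share of non-missing values per column
    kept = features.loc[:, completeness >= min_completeness]
    return kept.fillna(kept.median())


def to_per_90(totals: pd.Series, minutes: pd.Series) -> pd.Series:
    """Convert a tournament total into a per-90-minutes rate."""
    return totals / minutes * 90


# Made-up values for illustration only.
toy = pd.DataFrame({
    "Touches": [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan],                 # 90% complete, kept
    "CrsPA": [1, np.nan, np.nan, np.nan, np.nan, 6, 7, 8, 9, 10],   # 60% complete, dropped
})
clean = filter_and_impute(toy)
rate = to_per_90(pd.Series([180.0]), pd.Series([180.0]))  # 180 touches in 180 minutes
```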

For GMM training I excluded players with >2 missing values in the selected features and imputed the rest with the mean. I removed low-variance features and highly correlated ones (|r| > 0.95). Then I standardized all features.  
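A possible sketch of the correlation filter and the standardization, using synthetic data. Dropping the later column of each highly correlated pair is an assumption; the text does not say which column of a pair was kept. The low-variance filter is left out here.

```python
import numpy as np
import pandas as pd


def drop_correlated(features: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """For every pair with |r| above the threshold, drop the later column."""
    corr = features.corr().abs()
    # Keep only the upper triangle so that each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)


def standardize(features: pd.DataFrame) -> pd.DataFrame:
    """Scale every column to zero mean and unit variance."""
    return (features - features.mean()) / features.std(ddof=0)


# Synthetic data: y is almost a copy of x, z is independent.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.01, size=200),
    "z": rng.normal(size=200),
})
scaled = standardize(drop_correlated(toy))
```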

Model selection: I trained GMMs for k = 2…10 and covariance types (full, tied, diag, spherical). I computed BIC for each and plotted the results.  
Validation: I ran 5-fold cross-validation with test log-likelihood. For each (k, covariance) I computed mean and standard deviation.  
BIC and K-Fold gave almost the same result (BIC: k=10, diag; K-Fold: k=9, diag). I chose k = 9 with diagonal covariance because it is the simpler model and reduces the risk of small, unstable clusters.
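The two selection procedures can be sketched with scikit-learn as follows. The helper names, the fixed random seed, and the synthetic two-blob data are assumptions; the toy grid uses a smaller range of k than the k = 2…10 used in the project so that it runs quickly.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

COV_TYPES = ("full", "tied", "diag", "spherical")


def bic_grid(X, ks, cov_types=COV_TYPES, seed=0):
    """Return {(k, covariance_type): BIC}; lower BIC is better."""
    return {
        (k, cov): GaussianMixture(n_components=k, covariance_type=cov, random_state=seed).fit(X).bic(X)
        for k in ks
        for cov in cov_types
    }


def cv_loglik(X, k, cov, n_splits=5, seed=0):
    """Mean and std of the held-out average log-likelihood over the folds; higher is better."""
    scores = []
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        gmm = GaussianMixture(n_components=k, covariance_type=cov, random_state=seed).fit(X[train_idx])
        scores.append(gmm.score(X[test_idx]))  # average log-likelihood per test sample
    return float(np.mean(scores)), float(np.std(scores))


# Synthetic stand-in for the standardized feature matrix: two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(10, 1, size=(100, 2))])

bic = bic_grid(X, ks=range(1, 4))
best_by_bic = min(bic, key=bic.get)
cv_mean_k1, cv_std_k1 = cv_loglik(X, 1, "diag")
cv_mean_k2, cv_std_k2 = cv_loglik(X, 2, "diag")
```

BIC is computed on the full data and penalizes the number of parameters, while the cross-validated log-likelihood measures fit on unseen players, so comparing both guards against picking a model that only looks good in-sample.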

I trained the final GMM. I saved hard labels and soft probabilities, plus a cluster confidence (max probability). I visualized players in a PCA scatter plot.
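A minimal sketch of this final step with scikit-learn. The random matrix is only a placeholder for the standardized features (its shape is an assumption), and the variable names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Random placeholder for the standardized feature matrix (players x features).
rng = np.random.default_rng(0)
X = rng.normal(size=(359, 14))

gmm = GaussianMixture(n_components=9, covariance_type="diag", random_state=0).fit(X)

hard_labels = gmm.predict(X)         # most likely cluster per player
soft_probs = gmm.predict_proba(X)    # one probability per player and cluster
confidence = soft_probs.max(axis=1)  # cluster confidence = largest probability

# Two principal components as coordinates for the scatter plot.
coords = PCA(n_components=2).fit_transform(X)
```

Plotting `coords` colored by `hard_labels`, with marker size or transparency driven by `confidence`, is one way to make uncertain players stand out.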

## 7 Results and Discussion

Most players are assigned to one cluster with high probability. For some, the assignment is less certain. Many players in Cluster 4 and Cluster 1 still have a notable second option. In Clusters 0, 5, and 8 only one player each shows a second option.

PCA:  
PC1 loads on SCA90, KP/90, PPA/90, PrgP/90, and shots/SoT per 90. It represents creativity, finishing, and progression.  
PC2 loads on Touches/90, 1/3/90, Interceptions, and Clearances (partly TeamSuccess(xG)). It separates “high involvement/defensive” from “final-third/finishing”.  

Roles from GMM (k=9, diag):  
Cluster 0: high-volume creators/regisseurs (high creation, progression, many touches).  
Cluster 1: classic centre-backs (many clearances, little creation).  
Cluster 2: deep controllers (many touches, some chance creation, moderate finishing).  
Cluster 3: finishers (many shots, low involvement).  
Cluster 4: two-way wide players/progressors (good progression, mixed defense, some creation).  
Cluster 5: “no-nonsense” centre-backs (clear, block, little build-up).  
Cluster 6: low-creation recyclers (safe passes, position discipline).  
Cluster 7: creative attackers (high creation, above-average finishing, fewer touches).  
Cluster 8: progressive passers/controllers in build-up (solid creation, good involvement).  

There is no “best” cluster. It depends on the metric. Cluster 0 ranks high for team impact. Cluster 3 is best for scoring. Clusters 5 (and sometimes 1) lead for defensive clearances. Clusters 2 and 8 lead for ball circulation and involvement.  

Clusters vs. real positions:  
Goalkeepers map almost cleanly to Cluster 5 (~96%).  
Defenders split: many in Cluster 4 (progressive full-backs/two-way), then Cluster 8 (ball-playing build-up), and Cluster 1 (classic CB).  
Midfielders are mostly in Cluster 4, plus Clusters 2/8 (controller/progressor) and some in Cluster 7 (creator/10).  
Forwards are mainly in Cluster 3 (finisher), with some in Clusters 4/6 (working/pressing forwards). A few show very low involvement and fall near Cluster 5.  
Mixed labels (DF/MF, MF/DF) land mainly in Clusters 4/8 (progressive, build-up roles).  
So the clusters match positions well overall.  

Outliers and hybrids:  
Outliers are often between two roles or have rare feature mixes. Variance is high and distance to the cluster centre is large, which is visible in PCA.  
Example: Álex Grimaldo (Cluster 0) has very high PPA/90, CrsPA/90, KP/90, so he is a very creative progressive wide player. His shots/90 are below the cluster mean, so he is a creator, not a finisher.  
Gonçalo Inácio (Cluster 2) has high 1/3/90, PrgP/90, Touches/90, KP/90 for a centre-back. He is an active ball-playing CB.    
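One way to compute outlier scores for a diagonal-covariance GMM is sketched below: the per-sample log-likelihood, the Mahalanobis distance to the assigned component, and the per-feature z² contributions. The function name and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def outlier_scores(gmm: GaussianMixture, X: np.ndarray):
    """Return log-likelihood, Mahalanobis distance, and per-feature z² for each sample.

    Assumes covariance_type="diag", so covariances_ has shape (n_components, n_features).
    """
    labels = gmm.predict(X)
    z2 = (X - gmm.means_[labels]) ** 2 / gmm.covariances_[labels]
    mahalanobis = np.sqrt(z2.sum(axis=1))
    return gmm.score_samples(X), mahalanobis, z2


# Synthetic data: two blobs for fitting, plus one point that is unusual in the second feature.
rng = np.random.default_rng(0)
blobs = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(10, 1, size=(100, 2))])
gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0).fit(blobs)

X = np.vstack([blobs, [[0.0, 8.0]]])
loglik, mahalanobis, z2 = outlier_scores(gmm, X)
```

The row of `z2` for a flagged player shows which features drive the distance, which is how a statement like "very high PPA/90 for this cluster" can be backed by numbers.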

The heatmap of players with a cluster confidence below 90% shows likely hybrids. Bellingham (0.45/0.41) sits between creative attacker and deep controller. De Bruyne (0.88/0.12) is clearly a creator with some attacking DNA.  
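Selecting such hybrids from the soft probabilities can be sketched with NumPy. The probability rows are illustrative: the first two reuse the top-two values quoted above, and the remaining entries are made up so that each row sums to 1.

```python
import numpy as np

# Illustrative probability rows (players x clusters), not real model output.
soft_probs = np.array([
    [0.45, 0.41, 0.14],
    [0.88, 0.12, 0.00],
    [0.97, 0.02, 0.01],
])

top2 = np.sort(soft_probs, axis=1)[:, ::-1][:, :2]  # largest and second-largest probability
is_hybrid = top2[:, 0] < 0.90                       # confidence below 90%
margin = top2[:, 0] - top2[:, 1]                    # small margin = ambiguous role
```

The margin between the two largest probabilities separates a near coin-flip case like 0.45/0.41 from a player who merely falls just under the 90% line.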

Teams:  
Many squads focus on progression/build-up roles (Clusters 4/8). Portugal, England, and Germany show broad role diversity. Scotland, Belgium, Hungary, and Italy focus more on Cluster 4. The Netherlands have many build-up controllers (Cluster 8).  

Limitations:  
This is one tournament with a small sample (359 players ≥90 min). Team and opponent effects may exist. I had missing values and mixed units (%, totals, per 90). Some features are highly correlated, and feature choice was partly heuristic. GMM assumes Gaussian components with diagonal covariance, which is a simplification. There is no true ground truth for clusters; positions were only used for checks.  
Improvements: add league data and more tournaments, use event-level data, test robustness with resampling and stability measures, and try models like DP-GMM or Mixture of Factor Analyzers.  

<p align="center">
  <img src="../results/figures_GMM/GMM_ScatterPlot.png" alt="GMM Scatter Plot" width="700">
</p>  

<p align="center">
  <img src="../results/figures_GMM/GMM_outliers.png" alt="GMM Scatter Plot" width="700">
</p>  

<p align="center">
  <img src="../results/figures_GMM/GMM_soft.png" alt="GMM Scatter Plot" width="700">
</p>  



## 8 Conclusion

The GMM with nine clusters (diagonal covariance) finds meaningful player types without using positions. The PCA is easy to read: PC1 is creation/finishing/progression; PC2 separates involvement/defense from final-third play. The roles align with real football profiles (goalkeepers separate; defenders split into classic, ball-playing, and progressive wide; midfielders into controller/progressor/creator; forwards into finisher/support). Hybrids are visible through soft assignments (e.g., Bellingham between creative attacker and deep controller).  
On team level, many squads emphasise progression/build-up roles, and role diversity differs by nation.  
Technically, the workflow includes per-90 normalisation, merging 11 tables, cleaning, model selection with BIC and K-Fold, and outlier analysis with log-likelihood, Mahalanobis distance, and per-feature z² contributions.