<a href="https://colab.research.google.com/github/IvaroEkel/Probabilistic-Machine-Learning_lecture-PROJECTS/blob/main/TEMPLATE_Probabilistic_Machine_Learning_Project_Report.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Probabilistic Machine Learning - Project Report

**Course:** Probabilistic Machine Learning (SoSe 2025) <br>
**Lecturer:** [Alvaro Diaz-Ruelas] <br>
**Student(s) Name(s):** [Luca Thale-Bombien] <br>
**GitHub Username(s):** [Kavlahkaff] <br>
**Date:** [15.08.2025] <br>
**PROJECT-ID:** [13-2TL] <br>

---


## 1. Introduction

The goal of this project is to understand how well a chess player’s online Elo rating can be predicted from features derived from one or more games. The primary question is whether a player's strength (Elo rating) can be predicted purely from the data of a single game using a probabilistic modelling approach, while also quantifying the uncertainty of the predictions. The analysis is structured along three hypotheses of increasing complexity: <br>
1.	A player's choice of opening, as a reflection of their theoretical knowledge, is a strong predictor of their Elo rating. <br>
2.	Using more diverse in-game metrics such as blunders or doubled pawns will perform better than openings alone. <br>
3.	Aggregating information across games will further improve performance. <br>

The goal is not just to create a black-box predictor but to build an interpretable model that allows us to understand what guides its predictions and thus learn about the constituents of a chess player's strength.



## 2. Data Loading and Exploration

The dataset was created using the Lichess Standard Games Repository on Hugging Face, which consists of over 6.77 billion chess games (4.68 TB). The data is organized by year and month; each month consists of shards of around 1 GB and includes every game’s moves and metadata such as time control, player names, Elo, opening and more. The initial download of the data was handled by the download_data.py script, which allows specific shards to be downloaded flexibly from Hugging Face.
This project was written in Python using several tools to facilitate the processing and analysis of the data: Pandas for data manipulation, Scikit-learn for data analysis and modelling, and Matplotlib and Seaborn for creating visualizations.

Due to resource limitations, this project used only a subset of the Lichess Dataset, namely the first 30 shards of March 2025. The data is complete, so no imputation had to be done. Still, some filters were applied to ease the analysis and ensure consistency:

1.	Time-control is ten-minute rapid chess with no time increment.
2.	The movetext had to contain engine evaluations.
3.	Checkmate had to be possible, so games needed a minimum of 4 moves.
4.	Neither player's Elo should be equal to 1500, as this is the Lichess starting Elo.

After filtering, the dataset consisted of 683,253 games.
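The four filters above can be sketched in pandas as follows. This is a minimal illustration, not the project's actual script: the column names (`TimeControl`, `WhiteElo`, `BlackElo`, `movetext`), the `600+0` encoding of the time control, the `[%eval` marker and the helper `toy_movetext` are assumptions, and "4 moves" is read here as four full moves.

```python
import re

import pandas as pd

# Matches a move number such as "12. " but not "12..." or the "0." in "0.17".
MOVE_NUMBER = re.compile(r"\b\d+\.(?=\s)")


def filter_games(df: pd.DataFrame, min_moves: int = 4) -> pd.DataFrame:
    """Keep 10+0 games with engine evals, enough moves, and no 1500-rated player."""
    n_moves = df["movetext"].map(lambda text: len(MOVE_NUMBER.findall(text)))
    has_eval = df["movetext"].str.contains("[%eval", regex=False)
    keep = (
        (df["TimeControl"] == "600+0")      # assumed encoding of 10 min, no increment
        & has_eval
        & (n_moves >= min_moves)
        & (df["WhiteElo"] != 1500)
        & (df["BlackElo"] != 1500)
    )
    return df[keep].reset_index(drop=True)


def toy_movetext(n_moves: int, with_eval: bool = True) -> str:
    """Build a dummy movetext with n_moves full moves (illustration only)."""
    note = " { [%eval 0.17] }" if with_eval else ""
    return " ".join(f"{i}. e4{note} {i}... e5{note}" for i in range(1, n_moves + 1))


games = pd.DataFrame({
    "TimeControl": ["600+0", "300+0", "600+0", "600+0", "600+0"],
    "WhiteElo": [1620, 1700, 1500, 1800, 1450],
    "BlackElo": [1580, 1710, 1490, 1790, 1460],
    "movetext": [
        toy_movetext(20),                   # passes every filter
        toy_movetext(20),                   # wrong time control
        toy_movetext(20),                   # White sits at the starting Elo
        toy_movetext(20, with_eval=False),  # no engine evaluations
        toy_movetext(2),                    # too short
    ],
})
filtered = filter_games(games)
print(len(filtered), "of", len(games), "toy games kept")
```

Only the first toy game survives, since each of the other rows violates exactly one filter.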

| Color | Mean Elo | Median Elo | Max Elo | Min Elo | STD Elo |
| ----- | -------- | ---------- | ------- | ------- | ------- |
| White | 1537.38  | 1543.0     | 3119    | 400     | 396.46  |
| Black | 1538.20  | 1545.0     | 3116    | 400     | 396.43  |

**Table 1:** Mean, median, maximum, minimum and standard deviation of the players' Elo in the final dataset.

| Number of Moves | Value   |
|-----------------| ------- |
| Mean            | 33.32   |
| Std Dev         | 14.92   |
| Min             | 4.00    |
| Max             | 227.00  |

**Table 2:** Mean, standard deviation, minimum and maximum game length (in moves) in the final dataset.

![image info](./results/WhiteEloDist.png)

**Figure 1:** Elo distribution for White.


## 3. Data Preprocessing

After cleaning, the data was transformed for the analysis.
The move text was parsed into individual moves and converted into the PGN format, and the engine evaluations were extracted. Using the PyChess library, 18 in-game features based on chess theory were computed for both the black and the white player. These include the average centipawn loss (a measure of how far a player deviates from the engine's best move), the total game length, and the number of moves until the first win opportunity. For a full list of features and their explanations, see the functions in utils.py.
The final feature space contained 500 openings, 3 termination types and 18 in-game features per player (36 in total).
For later stages of the experiments, the 11 most informative features were aggregated in two ways: once over the white player and once over the ECO codes. For each aggregation, the mean, median and standard deviation were computed.
For the regression tasks, experiments showed the best performance when predicting the average Elo of the white and black players. For the classification task, the games were divided into 10 equally sized bins. This number of bins was chosen to be comparable with related research while offering a balanced trade-off between interpretability and granularity. The final DataFrames were converted to Parquet files and saved for re-use in the analysis. Experiments were run on two variants of the dataset: one consisting of 10 shards (220k games) and a more extensive one consisting of 30 shards (680k games).
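The two targets described above can be sketched as follows. This is a minimal example under assumptions: the column names `WhiteElo` and `BlackElo` are hypothetical, and "equally sized bins" is read as equal-frequency (quantile) bins, which matches the near-identical class supports reported in the results.

```python
import numpy as np
import pandas as pd


def add_targets(df: pd.DataFrame, n_bins: int = 10) -> pd.DataFrame:
    """Add the regression target (average Elo) and the classification target (Elo bin)."""
    out = df.copy()
    out["avg_elo"] = (out["WhiteElo"] + out["BlackElo"]) / 2
    # qcut places bin edges at quantiles, so every bin holds roughly the same number of games.
    out["elo_bin"] = pd.qcut(out["avg_elo"], q=n_bins, labels=False)
    return out


# Toy data: 1000 games with distinct ratings, purely for illustration.
toy = pd.DataFrame({
    "WhiteElo": np.arange(1000, 2000),
    "BlackElo": np.arange(1000, 2000),
})
toy = add_targets(toy)
bin_counts = toy["elo_bin"].value_counts().sort_index()
print(bin_counts)
```

With real ratings, ties at the quantile edges can make the bins slightly uneven, so the counts are only approximately equal.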






## 4. Probabilistic Modeling Approach

This section describes the modelling strategy used to predict chess Elo from the engineered features. A set of simple, strong baselines was evaluated alongside a range of probabilistic methods that provide probability estimates and uncertainty quantification. Models were chosen to cover a spectrum of assumptions about the data (linear vs. non-linear; homoscedastic vs. heteroscedastic; independent vs. correlated features) so that predictive performance can be compared while also assessing how well uncertainty estimates align with empirical error.

### 4.1 Problem framing

Depending on the chosen target formulation, the task can be treated either as a **regression** problem (predicting a continuous Elo value) or as a **classification** problem (predicting Elo bands). Instead of coarse bands such as beginner/intermediate/advanced, this project uses 10 evenly populated classes.
The modelling choices below identify which methods are used for regression and which for classification, and highlight how probabilistic approaches help with predictions and uncertainty estimates.


### 4.2 Baseline models

**Linear Regression (baseline, regression)**
A simple ordinary least squares (OLS) linear model is used as a transparent baseline for continuous Elo prediction. Linear regression fits a linear combination of features and is easy to interpret (coefficients indicate feature importance). It is sensitive to outliers and multicollinearity, and will underperform when relationships are strongly non-linear, but serves as a low-variance benchmark and a sanity check for more complex models.

**Random Forest (baseline, classifier)**
Random Forests are an ensemble of decision trees trained with bagging and feature subsampling. For classification, they produce class probabilities by averaging the class votes of the trees.

Random Forests are robust to noisy and heterogeneous features, capture non-linear interactions without heavy feature engineering, and are resilient to overfitting in many practical settings. Random Forest was used both as a strong non-probabilistic baseline (RandomForestClassifier) and as a tool to gauge non-linear structure in the feature set. It is also the best-performing approach found in the scientific literature reviewed for this project (Tijhuis et al., 2023).
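As a rough illustration of how the baseline yields class probabilities and feature importances, the sketch below fits scikit-learn's `RandomForestClassifier` on synthetic data. The data, the number of trees and the seeds are assumptions for the example; the project's tuned hyperparameters are not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered chess features (3 classes, 8 features).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

proba = forest.predict_proba(X_test)       # one probability per class, averaged over trees
importances = forest.feature_importances_  # impurity-based importances, normalized to sum to 1

ranking = np.argsort(importances)[::-1]
print("Features ranked by importance:", ranking.tolist())
print("Test accuracy:", round(forest.score(X_test, y_test), 3))
```

Impurity-based importances tend to favour high-cardinality and correlated features, so they are best read as a ranking rather than as effect sizes.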


### 4.3 Probabilistic approaches

**Bayesian Ridge Regression (regression)**
Bayesian ridge regression places a Gaussian prior on the linear coefficients and performs Bayesian inference for the posterior distribution over weights. The prior acts as regularization, improving stability when features are correlated or when the number of features is large relative to samples. The Bayesian formulation yields a posterior predictive distribution for Elo, giving both point predictions (e.g., posterior mean) and uncertainty intervals (posterior variance), which is useful when reporting confidence about a player’s predicted Elo.
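A minimal sketch of how such predictive intervals can be obtained with scikit-learn's `BayesianRidge`, whose `predict` method accepts `return_std=True`. The synthetic features, weights and noise level are assumptions chosen only to make the example self-contained.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)

# Synthetic data: Elo as a linear function of 5 features plus Gaussian noise.
X = rng.normal(size=(300, 5))
true_w = np.array([120.0, -80.0, 0.0, 40.0, 0.0])
y = 1500 + X @ true_w + rng.normal(scale=100.0, size=300)

model = BayesianRidge().fit(X[:200], y[:200])

# Posterior predictive mean and standard deviation for held-out games.
mean, std = model.predict(X[200:], return_std=True)

# Approximate 95% interval under the Gaussian predictive distribution.
lower, upper = mean - 1.96 * std, mean + 1.96 * std
coverage = np.mean((y[200:] >= lower) & (y[200:] <= upper))
print(f"Mean predictive std: {std.mean():.1f}, empirical coverage: {coverage:.2f}")
```

Comparing the empirical coverage with the nominal 95% is one simple check of whether the reported uncertainty matches the observed error.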

**Gaussian Naive Bayes (GNB, classification)**
GNB assumes each feature is conditionally independent given the class and that each feature’s class-conditional distribution is Gaussian. The independence assumption makes GNB very fast and robust for high-dimensional problems, and it naturally outputs posterior probabilities. While the independence assumption rarely holds exactly, GNB often performs well in practice and provides a useful probabilistic baseline.

**Linear Discriminant Analysis (LDA, classification)**
LDA models class-conditional feature distributions as multivariate Gaussians that share a common covariance matrix; classification follows from Bayes’ theorem using these Gaussian likelihoods. LDA projects data onto a lower-dimensional subspace that maximizes class separability, and yields posterior class probabilities under its Gaussian assumptions. It is computationally inexpensive and performs well when the equal-covariance assumption is approximately true and classes are roughly Gaussian in the feature space.

**Quadratic Discriminant Analysis (QDA, classification)**
QDA relaxes LDA’s equal-covariance assumption by allowing each class to have its own covariance matrix. This makes QDA more flexible for classes with different dispersion or elliptical orientations, at the cost of estimating more parameters (thus requiring more data to be stable). QDA is appropriate when class boundaries are non-linear and the Gaussian assumption is still reasonable.
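The three generative classifiers described above share the same scikit-learn interface, which makes a side-by-side comparison straightforward. The sketch below uses synthetic data; the value `reg_param=0.1` for QDA is an arbitrary illustrative choice, not a setting taken from the project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(
    n_samples=1500, n_features=10, n_informative=5, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

models = {
    "GNB": GaussianNB(),                                  # diagonal covariance per class
    "LDA": LinearDiscriminantAnalysis(),                  # one shared full covariance
    "QDA": QuadraticDiscriminantAnalysis(reg_param=0.1),  # one full covariance per class
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        "accuracy": model.score(X_test, y_test),
        "proba": model.predict_proba(X_test),  # posterior class probabilities
    }
    print(f"{name}: accuracy={results[name]['accuracy']:.3f}")
```

The only difference between the three models is the covariance structure assumed for the class-conditional Gaussians, which is what drives their different sensitivity to correlated features.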

**Weighted Bayesian Logistic Regression (classification)**
When the target is categorical (e.g., Elo bands) Bayesian logistic regression is used with class-weighting to handle class imbalance. The model is the logistic (sigmoid) transformation of a linear predictor, augmented with priors on the weights.
- Class weights increase the effective loss for underrepresented classes.
- The Bayesian treatment produces posterior predictive probabilities rather than raw, uncalibrated scores.

This approach is particularly useful for predicting a chess player’s Elo from in-game features, where uncertainty estimation can be as important as the point prediction itself. In chess, player performance can vary due to psychological factors, preparation depth, or variance in opponent styles. By explicitly modeling parameter uncertainty, Bayesian logistic regression provides calibrated probabilities for a player belonging to a given Elo band, rather than just a hard classification. This allows for more nuanced interpretations, such as estimating the likelihood that a player is performing above their current rating and helps identify when the model’s confidence is low. The use of priors enables the incorporation of domain knowledge, such as known relationships between engine evaluations, blunder rates, and structure, while the regularization effect might improve generalization across diverse player populations.
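A lightweight way to get a feel for the weighted model is its MAP point estimate: an L2-penalized logistic regression corresponds to the posterior mode under a zero-mean Gaussian prior on the weights. The sketch below uses scikit-learn's `LogisticRegression` with `class_weight="balanced"` on synthetic, imbalanced data. It is a stand-in for illustration only and does not perform full posterior inference; the data, the value of `C` and the seeds are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: class 0 dominates, class 2 is rare.
X, y = make_classification(
    n_samples=1500, n_features=8, n_informative=4, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# C is the inverse regularization strength; a smaller C means a tighter Gaussian prior.
map_model = LogisticRegression(C=1.0, class_weight="balanced", max_iter=1000)
map_model.fit(X_train, y_train)

proba = map_model.predict_proba(X_test)
recalls = recall_score(y_test, map_model.predict(X_test), average=None)
print("Per-class recall:", np.round(recalls, 2))
```

A fully Bayesian treatment would replace the single weight vector with a posterior distribution over weights and average the predictions over it.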

More information on each modelling approach can be found in the /notebooks folder, where each approach has its own notebook.



## 5. Model Training and Evaluation

For the analysis, the data was first loaded from the Parquet files. Numerical features were then scaled and categorical features were one-hot encoded using scikit-learn. For the classification tasks, the Elo was divided into 10 equally sized bins. All experiments use a 70/30 train-test split with a random state of 42.
Two non-probabilistic baseline models were trained for comparison: first, a simple linear regression using the one-hot-encoded ECO code (shorthand for the opening played) as its only feature, and second, a random forest classifier (RFC) using all features. An RFC was chosen because it is the best-performing model I was able to find in other research (Tijhuis et al., 2023), where it achieved an accuracy of 19%. The hyperparameters of the RFC were optimized using the Optuna library. After training the model, the feature importance was calculated for each feature.
In addition to the non-probabilistic baselines, several probabilistic modelling techniques were implemented to provide not only point predictions but also estimates of uncertainty. The key advantage of probabilistic models in this context is their ability to quantify the confidence in each prediction, which is crucial when player performance varies due to factors not observed in a single game.
The first approach was Bayesian Ridge Regression, using only the opening choice as input. This model places a Gaussian prior on the regression coefficients and infers a posterior distribution based on the training data. Predictions are then expressed as posterior means with corresponding credible intervals, allowing an assessment of the uncertainty associated with each Elo estimate.
Both regression models were evaluated using standard metrics such as RMSE and R². Additionally, the regression was reframed to allow an accuracy estimate: a prediction counts as correct if it deviates from the real Elo by no more than an allowed tolerance.
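The regression evaluation described above can be sketched as follows: a 70/30 split with random state 42, a one-hot-encoded opening code as the only feature, and RMSE, R² and a tolerance-based accuracy as metrics. The synthetic data, the four ECO codes and the helper name `within_tolerance_accuracy` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder


def within_tolerance_accuracy(y_true, y_pred, tol=50.0):
    """Share of predictions that lie within +/- tol Elo points of the true value."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred) <= tol))


# Synthetic data: each opening shifts the expected Elo a little, with large noise.
rng = np.random.default_rng(0)
eco = rng.choice(["B01", "C20", "C50", "D00"], size=400)
shift = pd.Series(eco).map({"B01": -100, "C20": -200, "C50": 50, "D00": 150}).to_numpy()
elo = 1500 + shift + rng.normal(scale=300, size=400)
X = pd.DataFrame({"ECO": eco})

X_train, X_test, y_train, y_test = train_test_split(
    X, elo, test_size=0.3, random_state=42
)

model = make_pipeline(OneHotEncoder(handle_unknown="ignore"), LinearRegression())
model.fit(X_train, y_train)
pred = model.predict(X_test)

rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
r2 = r2_score(y_test, pred)
acc50 = within_tolerance_accuracy(y_test, pred, tol=50)
print(f"RMSE={rmse:.1f}  R2={r2:.3f}  accuracy within 50 Elo={acc50:.3f}")
```

Because the tolerance is a free parameter, the resulting accuracy is only comparable between models evaluated at the same `tol`.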

Four different probabilistic models were trained and evaluated for the classification task: Gaussian Naive Bayes, Linear Discriminant Analysis, Quadratic Discriminant Analysis and a weighted Bayesian logistic regression. The primary evaluation metric for all of these models was accuracy, though other metrics such as precision and recall were also included. Confusion matrices were created for all models to support interpretation of the results. For the LDA, several additional experiments were run to gauge how well the model is able to predict certain classes.

To test hypotheses two and three, the analysis was first done using only single-game features and then repeated using the aggregated features calculated over several games of the same player.
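A minimal sketch of such a player-level aggregation, assuming a hypothetical player column `White` and hypothetical feature columns `acpl` and `game_length`. Each game receives the mean, median and standard deviation of the features over all games of its white player.

```python
import pandas as pd


def aggregate_by_player(games: pd.DataFrame, feature_cols, player_col: str = "White") -> pd.DataFrame:
    """Attach per-player mean/median/std of the given features to every game."""
    stats = games.groupby(player_col)[feature_cols].agg(["mean", "median", "std"])
    # Flatten the (feature, statistic) MultiIndex into names such as "acpl_mean".
    stats.columns = [f"{feature}_{stat}" for feature, stat in stats.columns]
    return games.join(stats, on=player_col)


games = pd.DataFrame({
    "White": ["anna", "anna", "ben"],
    "acpl": [40.0, 60.0, 90.0],
    "game_length": [30, 50, 25],
})
aggregated = aggregate_by_player(games, ["acpl", "game_length"])
print(aggregated)
```

Players with a single game get a missing standard deviation, so such rows need to be dropped or imputed before model fitting.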




## 6. Results

### Linear Regression

The linear regression using the one-hot-encoded openings achieved an RMSE of 363, an R² of 0.14 and an accuracy of 9.93% at an allowed deviation of 50 Elo points.

![image info](./results/linear_regression.png)

**Figure 2:** Real vs. Predicted Elo of the linear regression model.

### Bayesian Ridge Regression

The Bayesian ridge regression performed almost the same as the linear regression while additionally providing an uncertainty estimate. The predictive uncertainty was generally very high, averaging 363.

![image info](./results/Bayesian_Ridge.png)

**Figure 3:** Real vs. Predicted Elo of the Bayesian ridge regression model, visualizing uncertainty.

### Random Forest Classifier

The random forest classifier achieved an accuracy of 21% using all single-game features; the most important features included game length, first blunder and first win opportunity. Using the aggregated features improved performance to around 25% and also changed which features were most important: in this second run, the aggregated features carried the most information.

| Class         | Precision | Recall | F1-Score | Support     |
| ------------- | --------- | ------ | -------- | ----------- |
| 0             | 0.37      | 0.69   | 0.48     | 20,568      |
| 1             | 0.19      | 0.21   | 0.20     | 20,572      |
| 2             | 0.16      | 0.12   | 0.14     | 20,371      |
| 3             | 0.16      | 0.11   | 0.13     | 20,585      |
| 4             | 0.16      | 0.11   | 0.13     | 20,405      |
| 5             | 0.15      | 0.08   | 0.11     | 20,526      |
| 6             | 0.16      | 0.14   | 0.15     | 20,337      |
| 7             | 0.17      | 0.11   | 0.13     | 20,576      |
| 8             | 0.20      | 0.21   | 0.20     | 20,653      |
| 9             | 0.38      | 0.71   | 0.50     | 20,384      |
| **Accuracy**  |           |        | **0.25** | **204,977** |
| **Macro Avg** | 0.21      | 0.25   | 0.22     | 204,977     |

**Table 3:** Evaluation metrics for the Random Forest Classifier trained on aggregated features.


![image info](./results/feature_importance.png)

**Figure 4:** Feature importance of the random forest classifier using aggregated features.

### Gaussian Naive Bayes

The Gaussian Naive Bayes achieved an accuracy of only 12%.

### Linear Discriminant Analysis

LDA achieved an accuracy of 21%, matching the performance of the baseline Random Forest Classifier. When comparing the accuracy starting with the outer bins and then iteratively adding one more bin, a linear decline in performance can be seen. When comparing only classes 0 and 9, the accuracy is around 90%, but it drops to around 70% when comparing classes 0-4 vs. 5-9.

![image info](./results/LDA_accuracy_across_bins.png)

**Figure 5:** Accuracy of the LDA model across differently sized bins.

Comparing only two classes at a time and plotting the results in a heatmap shows a higher accuracy when comparing the extreme ends of the classes, while performance for the middle classes is worse. Additionally, neighbouring classes are differentiated worse than classes that are further apart. The highest accuracy, 91.6%, is achieved when comparing classes 0 and 9, and the lowest, 54.4%, when comparing class 4 against class 5.

![image info](./results/LDA_pairwise.png)

**Figure 6:** Pairwise class comparison of LDA accuracy.
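A pairwise comparison of the kind shown in Figure 6 can be sketched as follows. The example assumes that a separate LDA model is fitted for every pair of classes, which may differ from how the notebook computes the heatmap; the toy data (one latent skill variable driving two noisy features, cut into five quantile bins) is purely illustrative.

```python
from itertools import combinations

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split


def pairwise_lda_accuracy(X, y, test_size=0.3, random_state=42):
    """Fit one binary LDA per class pair and return a symmetric accuracy matrix."""
    classes = np.unique(y)
    acc = np.full((len(classes), len(classes)), np.nan)  # diagonal stays NaN
    for i, j in combinations(range(len(classes)), 2):
        mask = np.isin(y, [classes[i], classes[j]])
        X_tr, X_te, y_tr, y_te = train_test_split(
            X[mask], y[mask], test_size=test_size,
            random_state=random_state, stratify=y[mask],
        )
        score = LinearDiscriminantAnalysis().fit(X_tr, y_tr).score(X_te, y_te)
        acc[i, j] = acc[j, i] = score
    return acc


rng = np.random.default_rng(0)
skill = rng.normal(size=2000)
X = np.column_stack([
    skill + rng.normal(scale=1.0, size=2000),
    -skill + rng.normal(scale=1.0, size=2000),
])
y = np.digitize(skill, np.quantile(skill, [0.2, 0.4, 0.6, 0.8]))  # bins 0..4

acc = pairwise_lda_accuracy(X, y)
print(np.round(acc, 2))
```

On this toy data the same qualitative pattern appears: the two extreme bins are separated far more reliably than two neighbouring middle bins.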

Reducing the feature space by eliminating highly correlated features slightly improves the performance of the LDA model, to 21.6% accuracy.

Using a GMM to first cluster the data into 10 groups based on the features and then running LDA to predict the cluster labels drastically improves performance to 77%.

![image info](./results/GMM_cluster.png)

**Figure 7:** 2D PCA visualization of GMM clustering
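The cluster-then-classify experiment can be sketched as below. Note that the labels being predicted are the GMM cluster assignments rather than Elo bins. The synthetic blobs, the number of components and the choice to fit the GMM on all data before splitting are assumptions for this illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

# Synthetic features with some group structure (illustration only).
X, _ = make_blobs(n_samples=1500, centers=4, n_features=6, cluster_std=2.0, random_state=0)

# Step 1: cluster the feature space with a Gaussian mixture model.
gmm = GaussianMixture(n_components=4, random_state=0).fit(X)
cluster_labels = gmm.predict(X)

# Step 2: train LDA to predict the cluster label from the same features.
X_tr, X_te, c_tr, c_te = train_test_split(
    X, cluster_labels, test_size=0.3, random_state=42
)
lda = LinearDiscriminantAnalysis().fit(X_tr, c_tr)
cluster_accuracy = lda.score(X_te, c_te)
print(f"Accuracy on cluster labels: {cluster_accuracy:.3f}")
```

Because the clusters are defined from the very features the classifier sees, this accuracy measures how separable the feature space is, and it is not directly comparable to accuracy on Elo bins.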

### Quadratic Discriminant Analysis

QDA achieved an accuracy of only 10%, matching the linear regression baseline. Reducing dimensionality via PCA increased model performance, peaking at around 38 components with an accuracy of 19%. Regularizing the QDA model also improved performance; when using only the identity matrices, it matched the LDA model.

![image info](./results/QDA_PCA.png)

**Figure 8:** Performance of the QDA model with different number of PCA components.
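A sweep over the number of PCA components, as in Figure 8, might look like the following sketch. The synthetic data, the component grid and `reg_param=0.01` are assumptions and do not reproduce the settings of the project notebook.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=2000, n_features=40, n_informative=8, n_redundant=0,
    n_classes=4, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

scores = {}
for k in (2, 5, 10, 20, 40):
    pipe = make_pipeline(
        StandardScaler(),
        PCA(n_components=k, random_state=0),  # decorrelates and compresses the features
        QuadraticDiscriminantAnalysis(reg_param=0.01),  # shrinks each class covariance
    )
    scores[k] = pipe.fit(X_tr, y_tr).score(X_te, y_te)

best_k = max(scores, key=scores.get)
print({k: round(v, 3) for k, v in scores.items()}, "best:", best_k)
```

Putting the scaler and PCA inside the pipeline ensures both are fitted on the training split only.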

### Weighted Bayesian Logistic Regression

The Bayesian logistic regression achieved an accuracy of 21.2%, slightly improving over the baseline random forest classifier and the LDA model. The confusion matrix shows a trend similar to that of the LDA: the model tends to over-predict the outer classes while being more uncertain about the middle classes.

![image info](./results/logistic_confusion.png)

**Figure 9:** Confusion matrix of actual vs. predicted classes of the weighted Bayesian logistic regression.

Plotting the ROC curves of the different classes reveals the same pattern: better performance for the outer classes and worse performance for the middle classes. Class 0, for example, has an AUC of 0.84, while class 5 has an AUC of 0.54.

![image info](./results/logistic_AUC.png)

**Figure 10:** Multiclass ROC curve of the weighted Bayesian logistic regression model, showing true vs. false positive rate.
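Per-class AUC values like those quoted above are typically computed one-vs-rest from the predicted class probabilities. A minimal sketch with a hand-made probability matrix follows; the helper name `per_class_auc` and the toy numbers are assumptions, and the helper expects at least three classes.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize


def per_class_auc(y_true, proba, classes):
    """One-vs-rest ROC AUC for each class, given an (n_samples, n_classes) probability matrix."""
    y_bin = label_binarize(y_true, classes=classes)  # one indicator column per class
    return {c: roc_auc_score(y_bin[:, k], proba[:, k]) for k, c in enumerate(classes)}


y_true = np.array([0, 0, 1, 1, 2, 2])
proba = np.array([
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.3, 0.4, 0.3],
    [0.5, 0.3, 0.2],
    [0.1, 0.2, 0.7],
    [0.2, 0.3, 0.5],
])
aucs = per_class_auc(y_true, proba, classes=[0, 1, 2])
print(aucs)
```

In this toy example the outer classes are ranked perfectly, while the middle class is partly confused with its neighbours.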


## 7. Discussion

### Interpretation of Results

Coming back to our initial hypotheses, the results allow for several interpretations and learnings. The first hypothesis was that a player's choice of opening, as a reflection of their theoretical knowledge, would be a strong predictor of their Elo rating. While this is not wrong per se, the regression models trained only on openings did not perform well. Linear regression and Bayesian ridge regression performed similarly, with a low R² (0.14) and a high RMSE (363). This suggests a weak linear relationship between one-hot-encoded openings and Elo rating, although Bayesian Ridge added the benefit of uncertainty estimates. The opening alone therefore does not convey enough information to accurately distinguish players. This makes sense, since openings are often only very short move sequences: their use alone does not indicate player strength, because what happens after the opening matters as well.

The second hypothesis was that using more diverse in-game metrics such as blunders or doubled pawns would perform better than openings. Our experiments support this hypothesis, as both the Random Forest Classifier and the probabilistic models trained on these features performed better than the models trained only on the openings.

The third hypothesis stated that aggregating information across games would further improve performance. This held as well, though the effects were mostly visible in the Random Forest Classifier. The baseline random forest achieved **21% accuracy**, improving to **25%** when using aggregated features, highlighting the value of summarizing player behavior across games. Game length and blunder timing emerged as strong predictors, but precision and recall varied greatly across Elo bins, with outer classes (lowest and highest Elo groups) predicted more reliably than middle ranges. LDA and Bayesian logistic regression achieved **\~21% accuracy**, matching the baseline random forest. Both models showed strong performance when distinguishing extreme Elo classes (e.g., class 0 vs. 9 at over 90% accuracy) but struggled with adjacent or mid-range classes. The Bayesian logistic regression additionally quantified uncertainty and showed higher AUC for extreme classes compared to middle ones.

Using a GMM to cluster the data based on the features and then predicting the cluster labels using LDA shows improvements over using the 10 arbitrarily chosen bins. While choosing these 10 bins is possibly closer to the original classification goal, this highlights that the structure of our data can be harnessed to improve performance. Achieving an accuracy of over 70% shows a strong structure in the features that can be used by our models. Future work could use the GMM to define a hybrid approach that identifies and uses more natural bin ranges for its classification.

The Gaussian Naive Bayes and QDA did not perform well. This is due to the structure and characteristics of our dataset. GNB assumes that the features are independent given the class and follow a Gaussian distribution. The feature elimination and correlation matrix showed that our features are highly collinear, so GNB is not the right choice for this project.

The same holds for QDA, which cannot handle highly collinear features. Another reason for the collapse of QDA is the number of parameters relative to the number of examples. QDA must estimate p(p+3)/2 parameters per class (a mean vector and a full covariance matrix) plus a class prior; with over 500 features this amounts to more than 125,000 parameters per class, which is far more than our dataset can support.
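The parameter count behind this argument can be checked with a few lines of arithmetic. The function names are illustrative, and `p = 500` is used as a round lower bound for the feature count rather than the exact dimensionality of the project's feature matrix.

```python
def qda_params_per_class(p: int) -> int:
    """Mean vector (p) plus symmetric covariance matrix (p(p+1)/2), i.e. p(p+3)/2."""
    return p + p * (p + 1) // 2


def lda_params(p: int, n_classes: int) -> int:
    """One mean vector per class plus a single shared covariance matrix."""
    return n_classes * p + p * (p + 1) // 2


p, n_classes = 500, 10
per_class = qda_params_per_class(p)
print("QDA parameters per class:", per_class)
print("QDA parameters in total: ", n_classes * per_class)
print("LDA parameters in total: ", lda_params(p, n_classes))
```

Sharing one covariance matrix across classes is what keeps the LDA parameter count roughly an order of magnitude smaller in the 10-class setting.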

Overall, the models demonstrate **clear asymmetry in predictive power**: extreme Elo players are more easily identified, while players in the mid-tier have overlapping in-game behavior patterns that confound classification. Aggregated behavioral features appear more informative than single-game metrics, suggesting that **future work should emphasize longitudinal, player-level modeling** over isolated match analysis.


### Limitations

Several factors limit the reliability and generalizability of the results in this study:

1. **Small player sample size.** The dataset was restricted to the first 30 shards of March 2025 (≈ half a month of games). This limits opponent diversity and may bias models toward trends present only in this subset.
2. **Single time control.** Only rapid games (10+0) were included. Performance and feature importance may differ in blitz or classical formats, where decision patterns and time management vary.
3. **Unstabilized Elo ratings.** Players with few games were not filtered out, introducing variance from newly created or unstable accounts that can distort probabilistic estimates.
4. **Computational constraints.** Some probabilistic methods (e.g., Bayesian logistic regression) required hours to train on even 10,000 games, preventing extensive hyperparameter tuning, nested cross-validation, or multiple runs to assess posterior stability.

### Future work

To address these limitations and strengthen probabilistic inference, future work should consider:

1. **Broader and more balanced sampling.** Use a temporally and competitively diverse dataset and filter for players with ≥20 games to obtain more stable Elo estimates and better-calibrated posteriors.
2. **Multiple time controls.** Extend the analysis to blitz, bullet, and classical games to evaluate model robustness across formats.
3. **Data-driven class definitions.** Replace the arbitrary 10-bin Elo split with approaches such as GMM clustering or continuous regression with uncertainty intervals to capture more natural skill groupings.
4. **Computational scaling.** Employ approximate Bayesian inference (e.g., variational inference, expectation propagation) to enable larger datasets and more iterations without prohibitive runtimes.
5. **Leverage time-series structure.** Incorporate move-level or clock-time features (e.g., time spent per move, move sequence dynamics) to better capture temporal aspects of decision-making and potentially improve predictive performance.


## 8. Conclusion

This study demonstrated that probabilistic models can match, and in some cases exceed, the performance of non-probabilistic baselines while also providing valuable measures of predictive uncertainty. Several important findings emerged:
- **Feature Collinearity:** Many initial features were highly collinear. Dimensionality reduction via Principal Component Analysis (PCA) improved performance but reduced interpretability. Recursive feature elimination may offer a better balance between predictive accuracy and interpretability in future applications.
- **Probabilistic Competitiveness:** Bayesian models achieved comparable accuracy to the Random Forest baseline and in some cases improved upon state-of-the-art results reported in the literature, while offering interpretable credible intervals and posterior probability distributions.
- **Value of Aggregation:** Aggregating player statistics across multiple games produced more informative features, reducing predictive uncertainty and improving performance.
- **Overall Predictive Limitations:** Despite these improvements, the models remain too inaccurate and uncertain for reliable Elo prediction in real-world settings, especially for players in the mid-range ratings where class overlap is high.


The probabilistic framework, however, provides a solid foundation for future refinement. With more representative data, better feature engineering, and scalable Bayesian methods, these models could become both accurate and interpretable tools for chess rating estimation.




## 9. References

T. Tijhuis, P. M. Blom and P. Spronck, "Predicting Chess Player Rating Based on a Single Game," 2023 IEEE Conference on Games (CoG), Boston, MA, USA, 2023, pp. 1-8, doi: 10.1109/CoG57401.2023.10333133.

Lichess, "standard-chess-games" dataset, Hugging Face. https://huggingface.co/datasets/Lichess/standard-chess-games/viewer