# Classifying Mushrooms with Probabilistic Machine Learning

**Course:** Probabilistic Machine Learning (SoSe 2025) <br>
**Lecturer:** Dr. Alvaro Diaz Ruelas <br>
**Student(s) Name(s):** Tom Kieback <br>
**GitHub Username(s):** tom4917 <br>
**Date:** 12.08.2025 <br>
**PROJECT-ID:** 09-2KTXXXX_mushroom_classification 

---


## 1. Introduction
The goal of this project is to classify mushrooms as poisonous or edible based on their features. Machine learning models should learn this task so that anyone without prior knowledge about mushrooms can decide whether a mushroom is edible. Since classifying a poisonous mushroom as edible could have serious consequences, I used probabilistic machine learning approaches to quantify the uncertainty of each classification.

The data I used consists of hypothetical samples corresponding to 23 different species of gilled mushrooms [1]. Each sample is labeled as either poisonous or edible. The data contains descriptive features corresponding to the attributes of the mushrooms. These features include information on:
- the cap, such as color and shape
- the odor
- the gill and the stalk
- population and habitat 



## 2. Data Loading and Exploration

The data contains 8124 samples of mushrooms, each with 22 descriptive features. The classes edible (e) and poisonous (p) are roughly balanced, with 4208 mushrooms classified as edible and 3916 as poisonous. Only one column, 'stalk-root', had missing values, which are encoded as '?'. It contained 2480 missing values, which equals roughly 31% of the data. In addition, the feature 'veil-type' had only one unique value.
To see whether the features are correlated, I calculated Cramér's V [2] for each pair of features, since all features are categorical. This resulted in the following correlation matrix:

<img src="results/correlation.png" width="550">

The data has some high correlations with values over 0.5. In addition, gill-attachment has three correlations with values close to one. This could be related to gill-attachment having just two unique values, one of which makes up only 2% of the data. Veil-type is not correlated with any other feature, which makes sense since it has only one unique value. It is also worth noting that the odor seems to be highly correlated with the class of the mushrooms.
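
The pairwise association measure can be sketched as follows. This is a minimal illustration of Cramér's V without bias correction, not necessarily the exact computation used for the figure above. The function name `cramers_v` and the toy columns are hypothetical.

```python
import numpy as np


def cramers_v(x, y):
    """Cramér's V for two equally long categorical sequences (no bias correction)."""
    x_levels, x_codes = np.unique(np.asarray(x), return_inverse=True)
    y_levels, y_codes = np.unique(np.asarray(y), return_inverse=True)

    # Contingency table of observed co-occurrence counts.
    observed = np.zeros((len(x_levels), len(y_levels)))
    np.add.at(observed, (x_codes, y_codes), 1)

    n = observed.sum()
    k = min(observed.shape) - 1
    if k == 0:
        # A feature with a single unique value cannot be associated with anything.
        return 0.0

    # Expected counts under independence, then the chi-squared statistic.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * k)))


# Hypothetical toy columns using the dataset's letter codes.
toy = {
    "class": ["p", "p", "e", "e", "p", "e"],
    "odor": ["f", "f", "n", "n", "f", "n"],
    "veil-type": ["p", "p", "p", "p", "p", "p"],
    "cap-shape": ["x", "b", "x", "b", "x", "b"],
}
names = list(toy)
matrix = np.array([[cramers_v(toy[a], toy[b]) for b in names] for a in names])
print(np.round(matrix, 2))
```

In this toy data, 'odor' determines the class completely and gets a value of 1, while the constant 'veil-type' column gets 0 against every other column.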


## 3. Data Preprocessing

The column 'veil-type' has only one unique value. Therefore, it has 0 correlation with any other feature and the class. Since it holds no information, I dropped it.

The column 'stalk-root' had 31% missing values. It was correlated with some other features, with values for Cramér's V over 0.5. Since imputing the missing values would introduce guessed values for 31% of the data, I decided to drop the column instead.

As for the other correlations, I decided to keep all features since the main models I used should be able to handle correlation between features quite well.

All the models I used need numerical instead of categorical data. To make the data usable, I used one-hot encoding for all models except Categorical Naive Bayes, for which I used ordinal encoding.
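
The preprocessing steps described in this section could look roughly like the sketch below. It assumes pandas and the scikit-learn encoders `OneHotEncoder` and `OrdinalEncoder`. The tiny data frame is hypothetical and only reuses the column names and letter codes of the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical four-row sample; the real data has 8124 rows and 22 features.
df = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor": ["f", "n", "a", "f"],
    "cap-color": ["n", "y", "w", "n"],
    "veil-type": ["p", "p", "p", "p"],   # constant column
    "stalk-root": ["e", "c", "?", "?"],  # '?' marks a missing value
})

# Drop the target, the constant column, and the column with many missing values.
features = df.drop(columns=["class", "veil-type", "stalk-root"])
y = (df["class"] == "p").astype(int)  # poisonous is the positive class (1)

# One-hot encoding: one 0/1 column per category of each feature.
X_onehot = OneHotEncoder(handle_unknown="ignore").fit_transform(features).toarray()

# Ordinal encoding: one integer code per category, as used for Categorical Naive Bayes.
X_ordinal = OrdinalEncoder().fit_transform(features)

print(X_onehot.shape, X_ordinal.shape)
```

The integer codes of the ordinal encoding carry no meaningful order here. They only serve as category indices.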

## 4. Probabilistic Modeling Approach

For classification, I used different probabilistic modelling approaches which are able to classify the mushrooms into edible and poisonous as well as give an uncertainty to that estimation.

### 4.1 Naive Bayes Classifiers

Naive Bayes classifiers are a group of classifiers which use Bayes' theorem to predict the probability of a class given a sample x. They are called naive because they assume conditional independence given the class for all features. They estimate the likelihood and prior from the samples of the training data and use these estimations to calculate the posterior. Because of the assumption of independence, the likelihood of the data becomes the product of the likelihood of all features. The classifiers differ by the assumptions they make about the likelihood. The prior in all models is assumed to be a discrete probability distribution estimated by counting the occurrences in the training data.[3]
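
Written out for a sample with features x_1, ..., x_n, the factorization described above takes the following form. The symbols n and ŷ are notation introduced here for illustration.

$$
P(y \mid x_1, \dots, x_n) = \frac{P(y)\,\prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}
$$

$$
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The denominator does not depend on y, so it can be ignored when choosing the class and only matters for normalizing the posterior to a probability.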

These models should fit this case since they are able to give an uncertainty estimate for a prediction, although they might have some trouble with the independence assumption, which is violated in the data according to the correlation matrix above.

#### 4.1.1 Categorical Naive Bayes
Categorical NB models each feature as a discrete categorical probability distribution conditioned on the class. Therefore, the likelihood can simply be calculated by counting the occurrences in the training data. For example, for a feature x_i = t and class = y, the likelihood would be:
P(x_i = t | y) = N(x_i = t, class = y) / N(class = y), where N(·) is the number of training samples that satisfy the condition. [3]

Since all features in the dataset are categorical, this model is directly applicable to the data.
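
The counting estimate can be illustrated with a few lines of plain Python. This is a sketch without smoothing. The helper name `categorical_likelihood` and the toy values are hypothetical.

```python
def categorical_likelihood(feature_values, labels, value, label):
    """Estimate P(x_i = value | class = label) by counting, without smoothing."""
    class_count = sum(1 for y in labels if y == label)
    joint_count = sum(
        1 for x, y in zip(feature_values, labels) if x == value and y == label
    )
    return joint_count / class_count


# Hypothetical toy data: odor codes and class labels for six mushrooms.
odor = ["f", "f", "n", "n", "n", "a"]
label = ["p", "p", "p", "e", "e", "e"]

print(categorical_likelihood(odor, label, "n", "e"))  # 2 of 3 edible samples
print(categorical_likelihood(odor, label, "f", "e"))  # never seen among edible samples
```

A category that never occurs together with a class gets a likelihood of exactly 0, which would zero out the whole product. scikit-learn's `CategoricalNB` avoids this by default through additive smoothing (parameter `alpha`).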

#### 4.1.2 Bernoulli Naive Bayes
The Bernoulli Naive Bayes expects data that follows a multivariate Bernoulli distribution, so each feature is either 1 or 0. The likelihood p(x_i = 1 | y) of a feature given a class is estimated by counting the occurrences in the training data. The likelihood for a feature with x_i = 0 is then given by 1 − p(x_i = 1 | y). [3]

Since the data is categorical, I used one-hot encoding to get 0/1 for each category. This fulfils the input requirements of the Bernoulli NB. Note, however, that the independence assumption is violated by construction, since OHE columns derived from the same feature are mutually exclusive and therefore strongly dependent, which might interfere with model performance.
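
To make the Bernoulli likelihood concrete, the following sketch evaluates it for the one-hot columns of a single feature. The probabilities are hypothetical values chosen for illustration, not estimates from the dataset.

```python
import numpy as np


def bernoulli_log_likelihood(x, p):
    """Log-likelihood of a 0/1 vector x under independent Bernoulli columns."""
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    # Present columns contribute log p, absent columns contribute log(1 - p).
    return float(np.sum(x * np.log(p) + (1 - x) * np.log(1 - p)))


# Hypothetical p(x_j = 1 | y) for the three one-hot columns of one feature.
p_given_class = [0.5, 0.25, 0.25]
sample = [1, 0, 0]  # the first category is present

likelihood = np.exp(bernoulli_log_likelihood(sample, p_given_class))
print(likelihood)
```

A categorical model would assign this observation a likelihood of 0.5. The Bernoulli product additionally multiplies in a factor for each absent column, which is one way to see the dependence between one-hot columns of the same feature.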

#### 4.1.3 Gaussian Naive Bayes
The Gaussian Naive Bayes assumes that the likelihood of each feature given a class follows a Gaussian distribution. The parameters of the Gaussian distribution, the mean and the variance, are estimated from the samples of the training data. [3]

I used one-hot encoding for this model. One-hot encoded features are binary and therefore not Gaussian, so this might interfere with performance.
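
As a rough illustration of what the Gaussian assumption means for a binary column, the sketch below estimates a mean and variance per class and evaluates the Gaussian log-density. The column values and labels are hypothetical.

```python
import numpy as np


def gaussian_log_density(x, mean, var):
    """Log of the univariate Gaussian density with the given mean and variance."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)


# Hypothetical one-hot column and class labels (1 = poisonous).
column = np.array([1, 1, 0, 1, 0, 0, 0, 1])
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Estimate mean and variance of the column separately for each class.
params = {}
for c in (0, 1):
    values = column[labels == c]
    params[c] = (values.mean(), values.var())

# Class-conditional log-density of observing a 1 in this column.
log_density = {c: gaussian_log_density(1.0, *params[c]) for c in (0, 1)}
print(params, log_density)
```

A column that is constant within a class would have zero variance and make the density undefined. scikit-learn's `GaussianNB` guards against this with its `var_smoothing` parameter.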

### 4.2 Logistic Regression
The logistic regression consists of two parts: a weighted linear combination of all features and a sigmoid function. The output of the linear combination is passed into the sigmoid function. This scales the output to values between 0 and 1, which can be seen as the probability for class = 1. The model is then trained using binary cross-entropy loss. This means the loss increases the further the model’s prediction is from the true label. By training, the model updates the weights of the linear combination. [4]

The model should be appropriate for the data and question since it outputs probabilities. By assigning a weight to each category (after OHE), it should be able to handle some correlation by adjusting the weights accordingly. However, it cannot capture feature interactions, since all categories are combined linearly.
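
The two parts of the model and the training objective can be sketched in NumPy as follows. This is a simplified illustration with plain gradient descent on hypothetical one-hot data. scikit-learn's `LogisticRegression` uses a different solver and applies L2 regularization by default.

```python
import numpy as np


def sigmoid(z):
    """Map the linear combination to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


def binary_cross_entropy(y_true, p, eps=1e-12):
    """Mean binary cross-entropy; grows as predictions move away from the labels."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))


# Hypothetical one-hot rows (three categories of one feature) and labels.
X = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
y = np.array([1, 1, 0, 0], dtype=float)

w = np.zeros(X.shape[1])
b = 0.0
learning_rate = 0.5  # illustrative value
losses = []
for _ in range(200):
    p = sigmoid(X @ w + b)            # predicted probability of class = 1
    losses.append(binary_cross_entropy(y, p))
    grad_w = X.T @ (p - y) / len(y)   # gradient of the loss w.r.t. the weights
    grad_b = np.mean(p - y)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(losses[0], losses[-1], sigmoid(X @ w + b))
```

Each one-hot column receives its own weight, so the category that only occurs with class 1 ends up with a positive weight and the others with negative weights.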

### 4.3 Gaussian Discriminant Analysis
Gaussian Discriminant Analysis models the likelihood of the data conditioned on the class. The obtained likelihoods are then used to calculate the posterior for each class with Bayes' theorem. The class that maximizes the posterior is chosen as the predicted class. The posterior P of the chosen class can be seen as the probability of that prediction, giving 1 − P as the uncertainty.
In Gaussian Discriminant Analysis, the data conditioned on the class is assumed to follow a multivariate Gaussian distribution. During training, the models estimate the parameters of this distribution, the mean vector and the covariance matrix, using the samples of each class from the training set. [5]

#### 4.3.1 Linear Discriminant Analysis
Linear Discriminant Analysis (LDA) is a special case of Gaussian Discriminant Analysis in which the covariance matrix is assumed to be the same for all classes. This assumption simplifies the calculations and leads to a linear decision boundary. [5]
This model should be appropriate for the task. It allows direct calculation of the posterior by modelling the likelihood of the data, which makes it possible to express the uncertainty of a prediction. By directly including the covariance in its calculation, the model is able to handle correlated data. However, just like logistic regression, it is linear and therefore cannot capture feature interactions.

#### 4.3.2 Quadratic Discriminant Analysis
Quadratic Discriminant Analysis (QDA) allows a different covariance matrix for each class. This makes the calculation of the posterior more complex, resulting in quadratic terms in x, which are scaled differently for each class depending on its covariance matrix. This leads to a quadratic decision boundary and allows the model to capture pairwise feature interactions. [5]

The model, just like LDA, allows direct calculation of the posterior, which makes it possible to include the uncertainty of a prediction. It is also able to handle correlated data by including the covariance in its calculations. On top of that, it is able to capture feature interactions (of two categories together after OHE). It should be suited for the given task.
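
The difference between the two variants comes down to how the covariance is estimated. The NumPy sketch below illustrates this on hypothetical continuous toy data, so that the covariance matrices are invertible. It is not the scikit-learn implementation, and the names `fit_gda` and `posterior` are made up for this example.

```python
import numpy as np


def fit_gda(X, y, shared_covariance):
    """Estimate class priors, mean vectors, and covariance matrices."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    means = {c: X[y == c].mean(axis=0) for c in classes}
    covs = {c: np.cov(X[y == c], rowvar=False) for c in classes}
    if shared_covariance:
        # LDA: pool the class covariances into one shared matrix.
        dof = len(y) - len(classes)
        pooled = sum((np.sum(y == c) - 1) * covs[c] for c in classes) / dof
        covs = {c: pooled for c in classes}
    return classes, priors, means, covs


def posterior(x, model):
    """Posterior class probabilities for one sample via Bayes' theorem."""
    classes, priors, means, covs = model
    log_joint = []
    for c in classes:
        diff = x - means[c]
        _, logdet = np.linalg.slogdet(covs[c])
        mahalanobis = diff @ np.linalg.solve(covs[c], diff)
        # Log prior plus Gaussian log-likelihood (constant terms cancel out).
        log_joint.append(np.log(priors[c]) - 0.5 * (logdet + mahalanobis))
    log_joint = np.array(log_joint)
    probs = np.exp(log_joint - log_joint.max())  # subtract max for stability
    return probs / probs.sum()


# Hypothetical toy data: two Gaussian classes with different spread.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0.0, 0.0], [1.0, 1.0], size=(50, 2)),
    rng.normal([3.0, 3.0], [1.0, 2.0], size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

lda = fit_gda(X, y, shared_covariance=True)    # one covariance for all classes
qda = fit_gda(X, y, shared_covariance=False)   # one covariance per class
x_new = np.array([1.5, 1.5])
print(posterior(x_new, lda), posterior(x_new, qda))
```

With one-hot encoded inputs, the columns of each feature sum to one, so the estimated covariance matrices are singular. A real implementation needs some way to handle this, for example regularization.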

## 5. Model Training and Evaluation
I trained all models using a train-test split with a test size of 0.3. This resulted in a training set of 5686 samples and a test set of 2438 samples.
For all models except for the Categorical Naive Bayes, I used a one-hot encoding of the data.
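
The split sizes can be checked with a short sketch. It assumes scikit-learn's `train_test_split`. The placeholder arrays and the `random_state` value are hypothetical and only stand in for the encoded data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 8124                  # number of samples in the dataset
X = np.zeros((n_samples, 1))      # placeholder for the encoded features
y = np.arange(n_samples) % 2      # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
print(len(X_train), len(X_test))  # 5686 2438
```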
### 5.1 Evaluation

In that setting, the models achieved the following results:
- Categorical Naive Bayes: accuracy of 0.95 with 5 false positives (FP) and 127 false negatives (FN), with poisonous mushrooms as the positive class
- Bernoulli Naive Bayes: accuracy of 0.93 with 16 FP and 144 FN
- Gaussian Naive Bayes: accuracy of 0.96 with 86 FP and 1 FN
- Logistic Regression: accuracy of 1 with perfect classification
- LDA: accuracy of approximately 1 with one misclassification
- QDA: accuracy of 1 with perfect classification
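
The reported quantities can be derived from predictions as in the sketch below, which assumes scikit-learn's `accuracy_score` and `confusion_matrix`. The label vectors are hypothetical and do not reproduce the results above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 1 = poisonous (positive class), 0 = edible.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# With labels=[0, 1], ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
accuracy = accuracy_score(y_true, y_pred)

print(f"accuracy={accuracy:.2f}, FP={fp}, FN={fn}")
```

A false negative is a poisonous mushroom predicted as edible, which is the more dangerous type of error for this use case.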
#### 5.1.1 Uncertainty Estimation
For the uncertainty estimation, I concentrated on the models with perfect or nearly perfect classification.
Since these are all probabilistic models, each of them can give a pointwise estimate of the uncertainty of a prediction. The output of all these models can be interpreted as the probability p of class = 1. This gives 1 − p as the uncertainty for a prediction of 1 and p as the uncertainty for a prediction of 0.
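
This rule can be written down in a few lines. The sketch assumes a decision threshold of 0.5, and the function name `prediction_uncertainty` and the probabilities are hypothetical.

```python
import numpy as np


def prediction_uncertainty(p_poisonous, threshold=0.5):
    """Return the predicted class and its uncertainty for each probability."""
    p = np.asarray(p_poisonous, dtype=float)
    predicted = (p >= threshold).astype(int)
    # Uncertainty is 1 - p for a prediction of 1 and p for a prediction of 0.
    uncertainty = np.where(predicted == 1, 1.0 - p, p)
    return predicted, uncertainty


predicted, uncertainty = prediction_uncertainty([0.97, 0.097, 0.5, 0.0])
print(predicted, uncertainty)
```

With scikit-learn models, the probability of class = 1 would typically come from the second column of `predict_proba`.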

I analyzed these uncertainty estimations for all models. Uncertainty estimates can give additional information on whether eating a mushroom is safe or not. For that, I plotted the probability estimates for class = 1 (poisonous) against the true label of the data.

<img src="results/log0,3.png" width="450" height="200">

<img src="results/LDA0,3.png" width="450" height="200">

<img src="results/QDA0,3.png" width="450" height="200">

The plots show that only Logistic Regression produced varying uncertainty in its estimates. QDA and LDA showed no uncertainty for any sample, outputting probabilities of exactly 1 or 0. Even the false positive of LDA carried no uncertainty, with the model predicting 1 for a true 0.

I did another run for these three models in which I tried to increase their uncertainty by reducing the training set to 10% of the data. This resulted in the following misclassifications:

- Logistic Regression: 8 FN
- LDA: 16 FP
- QDA: 16 FP

This led to these uncertainty scores:

<img src="results/log0,9.png" width="450" height="200">

<img src="results/LDA0,9.png" width="450" height="200">

<img src="results/QDA0,9.png" width="450" height="200">

Decreasing the training size to a very low amount did increase the overall uncertainty of the Logistic Regression model, which can be seen quite well in the boxplot. Also, the model was never completely sure in any of its misclassifications. The lowest predicted probability for a false negative (true label 1) was 0.097. This means the model was still roughly 10% uncertain when predicting a negative (0).

QDA and LDA, on the other hand, were still fully confident in their predictions, outputting only probabilities of 1 or 0. Both models also predicted a probability of 1 for their false positives, meaning they estimated no uncertainty in their wrong predictions.
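
A check of this kind, looking at how confident a model was on its wrong predictions, could be sketched as follows. The labels and probabilities are hypothetical and chosen only to mirror the two behaviours described above.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities of class = 1 (poisonous).
y_true = np.array([1, 1, 0, 0, 1, 0])
p_poisonous = np.array([0.92, 0.097, 0.03, 1.0, 0.40, 0.10])
y_pred = (p_poisonous >= 0.5).astype(int)  # assumed threshold of 0.5

false_negatives = (y_true == 1) & (y_pred == 0)
false_positives = (y_true == 0) & (y_pred == 1)

# Lowest probability assigned to a truly poisonous sample predicted as edible.
lowest_fn_probability = p_poisonous[false_negatives].min()

# Wrong predictions made with a probability of exactly 0 or 1.
wrong = false_negatives | false_positives
fully_confident_errors = int(np.sum(wrong & np.isin(p_poisonous, [0.0, 1.0])))

print(lowest_fn_probability, fully_confident_errors)
```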

## 6. Results
None of the Naive Bayes classifiers achieved good results compared to the other models. Among the Naive Bayes classifiers, the Gaussian variant performed best.
However, the many misclassifications make these models inappropriate for this use case.
Logistic Regression and QDA achieved perfect results on the test dataset with a reasonable training set size of 70%. LDA had one misclassification in this setting.

LDA and QDA had no useful uncertainty estimates, giving a probability of either 1 or 0. This led to one fully confident misclassification by LDA in the first setting. With reduced training size, both models still predicted probabilities of either 1 or 0, remaining fully confident even though each misclassified 16 samples.

The Logistic Regression had better uncertainty estimates in both settings, even becoming more unsure when the training size decreased.

In the context of the project question, this makes QDA and LDA unsuitable. A single misclassification could already be extremely harmful, and since the models are 100% sure of their prediction even when it is wrong, there is no way to "trust" a prediction based on its uncertainty.

The only model which could be used for this use case would be Logistic Regression, since it has perfect classification while also giving reasonable uncertainty estimates to the prediction.



## 7. Discussion

All Naive Bayes classifiers performed poorly compared to the other models. This could be because the features are correlated, which violates the assumption of conditional independence. The Gaussian variant performed best, which is interesting since the OHE data is not Gaussian. This could mean that the classes are well separable in the encoding space and that the Gaussian assumption is a workable approximation on the OHE data.
LDA and Logistic Regression achieved perfect or nearly perfect scores, which suggests that the data is roughly linearly separable. This means feature interactions are not necessarily needed to classify the data correctly. QDA also achieved a perfect score, as a quadratic decision boundary does no harm when the data is roughly linearly separable.

Both QDA and LDA had poor uncertainty estimates, being overconfident in their predictions even when wrong. Regarding QDA in the first setting, this could be due to the data being too easy to separate. In the second setting, the smaller training size could be a reason, which makes the estimation of the covariance difficult, especially for QDA, which has one matrix for each class.

The reason for the bad uncertainty estimates could also be that the OHE of the data is not truly Gaussian, which could interfere with the assumptions.

To improve this project, I would need a larger and more difficult dataset. With that, I could test whether the uncertainty estimates improve with more training data in a setting where not all samples can be classified correctly. This would show whether these models are capable of giving good uncertainty estimates with OHE data as input.

A more difficult dataset would also be interesting for testing whether the QDA model can outperform Logistic Regression when the data is not linearly separable and feature interactions are needed for correct classification.


## 8. Conclusion

All Naive Bayes models had poor results for this use case and could not be used to decide whether or not to eat a mushroom. Both LDA and QDA gave poor uncertainty estimates, being completely sure all the time, which makes them unsuitable for this use case.

The only model to be considered would be Logistic Regression, since it gave uncertainty estimates and perfect predictions.

It is possible to predict whether or not a mushroom is edible, but to safely eat it, you would need reasonable uncertainty estimates. This only worked with Logistic Regression.

However, more data, and more difficult data, would be needed to verify this and to make sure that difficult cases can also be solved with reliable uncertainty estimates.

## 9. References

[1] Mushroom [Dataset]. (1981). UCI Machine Learning Repository. https://doi.org/10.24432/C5959T.

[2] Wikipedia. (n.d.). Cramér’s V. In Wikipedia. Retrieved August 12, 2025, from https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V

[3] scikit-learn developers. (n.d.). Gaussian Naive Bayes. In scikit-learn documentation (Version stable 1.7.1). Retrieved August 12, 2025, from https://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes

[4] scikit-learn developers. (n.d.). Linear Models. In scikit-learn documentation (Version stable 1.7.1). Retrieved August 12, 2025, from https://scikit-learn.org/stable/modules/linear_model.html

[5] scikit-learn developers. (n.d.). LDA & QDA. In scikit-learn documentation (Version stable 1.7.1). Retrieved August 12, 2025, from https://scikit-learn.org/stable/modules/lda_qda.html