This document compares two predictive models to determine which one better predicts whether a mushroom is poisonous or edible based on its characteristics. The first is a Multinomial Naive Bayes model and the second is a Random Forest model. They are compared on prediction accuracy.

Naive Bayes Model:

In [11]:
import pandas as pd
import numpy as np
import random

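# Load the dataset; the 'class' column holds the label ('p' marks poisonous).
df = pd.read_csv("mushrooms.csv")

# Separate the target column from the feature columns.
y = df['class'].values
X = df.drop(columns=['class'])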

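def encode_features(df):
    # Map each category in every column to an integer code (0, 1, 2, ...)
    # in order of first appearance. Modifies df in place and also returns
    # the per-column mapping so the codes can be traced back to categories.
    encoded = {}
    for col in df.columns:
        encoded[col] = {val: i for i, val in enumerate(df[col].unique())}
        df[col] = df[col].map(encoded[col])
    return df, encoded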

X, encoding_dict = encode_features(X)

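X = X.values
# Binary target: 1 = poisonous ('p'), 0 = edible.
y = np.array([1 if label == 'p' else 0 for label in y])

# Shuffle the row indices and split 80/20 into training and test sets.
# No random seed is set, so the split (and the accuracy) varies between runs.
indices = list(range(len(X)))
random.shuffle(indices)
split = int(0.8 * len(X))
train_idx, test_idx = indices[:split], indices[split:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]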

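# Class priors P(c), estimated from the training labels.
class_counts = np.bincount(y_train)
prior_probs = class_counts / len(y_train)

feature_counts = {}
feature_probs = {}

# For each class, sum the encoded feature values, add 1 (Laplace smoothing),
# and normalise so the per-class feature probabilities sum to 1.
for c in [0, 1]:
    feature_counts[c] = X_train[y_train == c].sum(axis=0) + 1
    feature_probs[c] = feature_counts[c] / feature_counts[c].sum()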

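def predict(X):
    predictions = []
    for sample in X:
        # Start from the log-priors, then add each class's log-likelihood.
        class_probs = np.log(prior_probs)
        for c in [0, 1]:
            class_probs[c] += np.sum(np.log(feature_probs[c]) * sample)
        # Pick the class with the highest log-posterior score.
        predictions.append(np.argmax(class_probs))
    return np.array(predictions)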

y_pred = predict(X_test)
accuracy = np.mean(y_pred == y_test)

print(f"Accuracy: {accuracy:.4f}")


Accuracy: 0.8898


Here I implemented a Multinomial Naive Bayes model with NumPy to predict whether mushrooms are poisonous or edible based on their characteristics. Each categorical feature was converted to integer codes, and the model treats those codes as counts when estimating the per-class feature probabilities. I applied Laplace (add-one) smoothing to avoid zero probabilities, and summed log-probabilities instead of multiplying raw probabilities to avoid numerical underflow.
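
To make the two numerical safeguards concrete, here is a minimal sketch on made-up numbers. The counts and the array `tiny` are hypothetical and are not taken from the mushroom data; the sketch only illustrates why add-one smoothing and log-probabilities are used.

```python
import numpy as np

# Hypothetical toy counts for one class (not from the mushroom dataset).
raw_counts = np.array([3, 0, 5, 2])

# Without smoothing, the zero count yields probability 0, and log(0) is -inf.
unsmoothed = raw_counts / raw_counts.sum()

# Laplace (add-one) smoothing: add 1 to every count before normalising.
smoothed = (raw_counts + 1) / (raw_counts + 1).sum()

# Multiplying many small probabilities underflows to 0.0 in float64 ...
tiny = np.full(500, 1e-3)
product = np.prod(tiny)

# ... while the equivalent sum of logs stays finite and comparable.
log_sum = np.sum(np.log(tiny))

print("unsmoothed:", unsmoothed)
print("smoothed:  ", smoothed)
print("product:", product, "| sum of logs:", log_sum)
```

Because the logarithm is monotonic, comparing summed log-probabilities selects the same class as comparing the products would, as long as the products do not underflow.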

The dataset was split 80/20 into training and test sets. The final model achieved an accuracy of approximately 88.98% on the test set, indicating reasonably strong predictive power.
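
Since the split above uses `random.shuffle` without a seed, the reported accuracy will differ slightly from run to run. One way to make the split repeatable is sketched below. The helper name `train_test_indices`, its parameters, and the seed value are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

def train_test_indices(n_samples, train_fraction=0.8, seed=0):
    """Return shuffled (train_idx, test_idx) arrays; the same seed gives the same split."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    split = int(train_fraction * n_samples)
    return indices[:split], indices[split:]

# Hypothetical sample count, used only to show the 80/20 proportions.
train_idx_demo, test_idx_demo = train_test_indices(100)
print(len(train_idx_demo), len(test_idx_demo))
```

The returned index arrays can be used in the same way as `train_idx` and `test_idx` in the cell above, for example `X[train_idx_demo]`.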

Random Forest Model:

In [14]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

rf_pred = rf_model.predict(X_test)

rf_accuracy = accuracy_score(y_test, rf_pred)
print(f"Random Forest Accuracy: {rf_accuracy:.4f}")

Random Forest Accuracy: 1.0000


Because it reuses the variables defined above, implementing the Random Forest model was straightforward. The model was trained on the same training set and evaluated on the same test set as the Naive Bayes model to ensure a fair comparison of the two. The Random Forest model achieved 100% accuracy. While this suggests that the Random Forest model is better at predicting whether a mushroom is poisonous, accuracy alone does not show how errors are distributed across the two classes. Therefore, a classification report is generated for both models to provide another evaluation of their performance.
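
A single train/test split gives only one estimate of accuracy. As one possible additional check, the sketch below averages accuracy over k folds. It is written against a generic `fit_predict` callable so that it could wrap either the hand-written Naive Bayes code or the Random Forest. The names `k_fold_accuracy` and `majority_class` and the synthetic data are hypothetical and used only for illustration.

```python
import numpy as np

def k_fold_accuracy(X, y, fit_predict, k=5, seed=0):
    """Mean and standard deviation of accuracy over k folds.

    fit_predict(X_train, y_train, X_test) must return predicted labels for X_test.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th, then test on the i-th.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        y_hat = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        scores.append(np.mean(y_hat == y[test_idx]))
    return float(np.mean(scores)), float(np.std(scores))

# Hypothetical stand-in classifier: always predicts the most common training label.
def majority_class(X_train, y_train, X_test):
    return np.full(len(X_test), np.bincount(y_train).argmax())

# Synthetic data for illustration only (not the mushroom dataset).
X_demo = np.zeros((100, 3), dtype=int)
y_demo = np.array([0] * 70 + [1] * 30)
mean_acc, std_acc = k_fold_accuracy(X_demo, y_demo, majority_class)
print(f"{mean_acc:.2f} +/- {std_acc:.2f}")
```

To apply this to the Random Forest, a wrapper such as `lambda Xtr, ytr, Xte: RandomForestClassifier(n_estimators=100, random_state=42).fit(Xtr, ytr).predict(Xte)` could be passed as `fit_predict`, with the full `X` and `y` arrays as input.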

Classification Report:

In [15]:
from sklearn.metrics import classification_report

print("Naive Bayes Classification Report:")
print(classification_report(y_test, y_pred))

print("Random Forest Classification Report:")
print(classification_report(y_test, rf_pred))

Naive Bayes Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.94      0.90       853
           1       0.92      0.84      0.88       772

    accuracy                           0.89      1625
   macro avg       0.89      0.89      0.89      1625
weighted avg       0.89      0.89      0.89      1625

Random Forest Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       853
           1       1.00      1.00      1.00       772

    accuracy                           1.00      1625
   macro avg       1.00      1.00      1.00      1625
weighted avg       1.00      1.00      1.00      1625



The Random Forest model achieved 100% accuracy, with perfect precision and recall for both classes according to the classification report. Because the test set is roughly balanced (853 edible and 772 poisonous samples), the perfect accuracy is not an artifact of class imbalance, and the model works extremely well on this split of the mushroom dataset.

In contrast, the Naive Bayes model reached 89% accuracy. Its recall for the poisonous class (1) was 0.84, so roughly 16% of the poisonous mushrooms in the test set were classified as edible, and its precision for the edible class (0) was 0.86. The Naive Bayes model is therefore a reasonable performer, but it does not match the performance of the Random Forest model on this particular dataset.
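
To show how the precision and recall figures in the report are derived, here is a minimal sketch that computes them from true positives, false positives, and false negatives. The function name `binary_report` and the example labels are hypothetical; the labels are not the notebook's predictions.

```python
import numpy as np

def binary_report(y_true, y_pred, positive=1):
    """Count TP/FP/FN for the positive class and derive precision and recall."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}

# Hypothetical labels for illustration (1 = poisonous, 0 = edible).
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 0, 1]
report = binary_report(y_true_demo, y_pred_demo)
print(report)
```

With class 1 as the positive class, a false negative is a poisonous mushroom predicted to be edible, which is why recall on that class is the figure to watch when comparing the two models.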