# Mushroom Edibility Prediction: Odor vs. Cap Color Accuracy

This notebook analyzes which feature, **Odor** or **Cap Color**, better predicts whether a mushroom is poisonous. We will load the data, convert categorical variables into numerical format using **One-Hot Encoding**, and train **Logistic Regression** models to compare predictive accuracy.

### 1. Data Loading

We begin by importing the necessary libraries and loading the Mushroom dataset. Then we create a subset containing only the target variable (edibility) and the two predictors (odor and cap color).

In [144]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Direct URL for the Mushroom dataset from the UCI archive
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data'

# Column names are necessary as the file lacks a header
column_names = [
    'poisonous', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
    'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape',
    'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring',
    'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color',
    'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat'
]

# Load the data from a local copy of the file downloaded from the URL above
df = pd.read_csv("agaricus-lepiota.data", header=None, names=column_names)

# Create the subset with the target and two predictor variables
df_subset = df[["poisonous", "odor", "cap-color"]].copy()
df_subset.columns = ["edibility", "odor", "cap_color"]
df_subset.head()

|   | edibility | odor | cap_color |
|---|-----------|------|-----------|
| 0 | p | p | n |
| 1 | e | a | y |
| 2 | e | l | w |
| 3 | p | p | w |
| 4 | e | n | g |


In [145]:
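# Map the letter codes to integers.
# Edibility becomes the 0/1 target (0 = edible, 1 = poisonous).
# The odor and cap-color integers are arbitrary labels (order of first
# appearance) and are one-hot encoded in the next section.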
edibility_map = {"e": 0, "p": 1}
odor_map = {v: i for i, v in enumerate(df_subset["odor"].unique())}
cap_color_map = {v: i for i, v in enumerate(df_subset["cap_color"].unique())}

df_subset["edibility"] = df_subset["edibility"].map(edibility_map)
df_subset["odor"] = df_subset["odor"].map(odor_map)
df_subset["cap_color"] = df_subset["cap_color"].map(cap_color_map)

df_subset.head()

|   | edibility | odor | cap_color |
|---|-----------|------|-----------|
| 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 1 |
| 2 | 0 | 2 | 2 |
| 3 | 1 | 0 | 2 |
| 4 | 0 | 3 | 3 |


In [146]:
# Separate target (y)
y = df_subset["edibility"]
print(y.head())

0    1
1    0
2    0
3    1
4    0
Name: edibility, dtype: int64
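
Because the models below are compared by accuracy, it can help to know the class balance first: always predicting the most frequent class already achieves an accuracy equal to that class's share. The following is a minimal sketch of that check. The helper name `majority_baseline_accuracy` and the toy labels are hypothetical and stand in for the notebook's `y`.

```python
import pandas as pd


def majority_baseline_accuracy(y: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    return float(y.value_counts(normalize=True).max())


# Hypothetical toy labels (0 = edible, 1 = poisonous), not the real data
y_toy = pd.Series([0, 0, 0, 1, 1], name="edibility")

print(y_toy.value_counts(normalize=True))  # share of each class
print(f"Majority-class baseline: {majority_baseline_accuracy(y_toy):.2f}")
```

In the notebook, the same helper could be called as `majority_baseline_accuracy(y)` to get a reference point for the accuracies reported later.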


### 2. Feature Preparation (One-Hot Encoding)
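To use categorical variables in Logistic Regression, we convert odor and cap color into binary indicator columns using One-Hot Encoding. Each category becomes its own column. Because we pass `drop_first=True`, the first category of each feature is dropped and serves as the baseline, so a feature with k categories yields k - 1 columns.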
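
As a small illustration of what `pd.get_dummies` produces with and without `drop_first=True`, here is a sketch on a hypothetical three-category toy column (not the real data):

```python
import pandas as pd

# Hypothetical toy column with three categories, not the real data
toy = pd.DataFrame({"odor": ["a", "n", "p", "n"]})

full = pd.get_dummies(toy, columns=["odor"], prefix="odor")
reduced = pd.get_dummies(toy, columns=["odor"], prefix="odor", drop_first=True)

print(list(full.columns))     # one indicator column per category
print(list(reduced.columns))  # first category ('a') dropped as the baseline
```

A row with all remaining indicators equal to zero then represents the dropped baseline category.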

In [147]:
# Predictor 1: Odor
X_odor = df_subset[["odor"]]
X_odor_encoded = pd.get_dummies(X_odor, columns=['odor'], prefix='odor', drop_first=True)

# Predictor 2: Cap Color
X_cap_color = df_subset[["cap_color"]]
X_cap_color_encoded = pd.get_dummies(X_cap_color, columns=['cap_color'], prefix='cap_color', drop_first=True)

# Combine for the final model
X_combined_encoded = pd.concat([X_odor_encoded, X_cap_color_encoded], axis=1)

print(f"Odor features created: {X_odor_encoded.shape[1]}")
print(f"Cap Color features created: {X_cap_color_encoded.shape[1]}")
print(f"Total features for combined model: {X_combined_encoded.shape[1]}")

Odor features created: 8
Cap Color features created: 9
Total features for combined model: 17


### 3. Model Training and Evaluation
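We train three separate Logistic Regression models (Odor-Only, Cap Color-Only, and Combined) using a 70/30 train/test split. Every split uses the same `random_state` and the same target, so all three models are evaluated on the same test rows.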
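
The three subsections below repeat the same split, fit, and score steps. One possible way to express that shared procedure is a small helper. This is only a sketch: the function name `fit_and_score` and the toy data are hypothetical, and the notebook itself keeps the steps written out per model.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def fit_and_score(X, y, test_size=0.3, random_state=42):
    """Split, fit a Logistic Regression, and return test-set accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model = LogisticRegression(solver="liblinear", random_state=random_state)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Hypothetical toy data: the label is 1 exactly when odor == 'p'
toy = pd.DataFrame({"odor": ["a", "n", "p"] * 40})
y_toy = (toy["odor"] == "p").astype(int)
X_toy = pd.get_dummies(toy, columns=["odor"], drop_first=True)

toy_accuracy = fit_and_score(X_toy, y_toy)
print(f"Toy accuracy: {toy_accuracy:.4f}")
```

With the notebook's variables, the equivalent calls would be `fit_and_score(X_odor_encoded, y)`, `fit_and_score(X_cap_color_encoded, y)`, and `fit_and_score(X_combined_encoded, y)`.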

#### 3.1 Odor Predictor Analysis

This section trains a Logistic Regression model using only odor as the predictor and evaluates its accuracy on the test set.

In [148]:
# Split data for the Odor model
X_train_o, X_test_o, y_train_o, y_test_o = train_test_split(
    X_odor_encoded, y, test_size=0.3, random_state=42
)

# Initialize and train the Logistic Regression model
model_odor = LogisticRegression(solver='liblinear', random_state=42)
model_odor.fit(X_train_o, y_train_o)

# Predict and evaluate
y_pred_o = model_odor.predict(X_test_o)
accuracy_odor = accuracy_score(y_test_o, y_pred_o)

print("--- Odor Model Results ---")
print(f"Test Set Accuracy (Odor): {accuracy_odor:.4f}")

--- Odor Model Results ---
Test Set Accuracy (Odor): 0.9840


#### 3.2 Cap Color Predictor Analysis

This section trains a Logistic Regression model using only cap color as the predictor and evaluates its accuracy.

In [149]:
# Split data for the Cap Color model
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_cap_color_encoded, y, test_size=0.3, random_state=42
)

# Initialize and train the Logistic Regression model
model_cap_color = LogisticRegression(solver='liblinear', random_state=42)
model_cap_color.fit(X_train_c, y_train_c)

# Predict and evaluate
y_pred_c = model_cap_color.predict(X_test_c)
accuracy_cap_color = accuracy_score(y_test_c, y_pred_c)

print("--- Cap Color Model Results ---")
print(f"Test Set Accuracy (Cap Color): {accuracy_cap_color:.4f}")

--- Cap Color Model Results ---
Test Set Accuracy (Cap Color): 0.5870


#### 3.3 Combined Predictor Analysis (Odor + Cap Color)

Here we train a model using both predictors to see if combining features improves accuracy.

In [150]:
# Split the combined data
X_train_comb, X_test_comb, y_train_comb, y_test_comb = train_test_split(
    X_combined_encoded, y, test_size=0.3, random_state=42
)

# Initialize and train the Logistic Regression model
model_combined = LogisticRegression(solver='liblinear', random_state=42)
model_combined.fit(X_train_comb, y_train_comb)

# Predict and evaluate
y_pred_comb = model_combined.predict(X_test_comb)
accuracy_combined = accuracy_score(y_test_comb, y_pred_comb)

print("--- Combined Model Results ---")
print(f"Test Set Accuracy (Odor + Cap Color): {accuracy_combined:.4f}")

--- Combined Model Results ---
Test Set Accuracy (Odor + Cap Color): 0.9840


### 4. Conclusions and Recommendations

Based on the Logistic Regression models, odor is by far the stronger predictor of whether a mushroom is edible or poisonous. The odor-only model reached a test accuracy of 0.9840, while the cap-color-only model reached only 0.5870. Combining odor and cap color gave 0.9840, identical to the odor-only model. In this linear setting, nearly all of the useful signal comes from odor, and adding cap color does not improve test accuracy.
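
One way to see why a single feature can be this predictive is to tabulate the class shares within each category: a category whose rows fall almost entirely into one class is easy for any classifier. The sketch below assumes the raw letter codes (before the integer mapping in `In [145]`) and uses a hypothetical toy frame rather than the real data.

```python
import pandas as pd

# Hypothetical toy rows using the dataset's letter codes, not the real data
toy = pd.DataFrame({
    "edibility": ["e", "e", "p", "p", "e", "p"],
    "odor":      ["a", "n", "f", "f", "n", "n"],
})

# Row-normalized table: share of edible vs. poisonous within each odor
odor_table = pd.crosstab(toy["odor"], toy["edibility"], normalize="index")
print(odor_table)
```

Running the same `pd.crosstab` call on the real `odor` and `cap-color` columns would show how cleanly each category separates the two classes.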

These results suggest that odor is the key feature for this prediction task. Because the combined model performs no better than the odor-only model, cap color adds no measurable predictive value once odor is known, at least with Logistic Regression. The odor-only model is therefore the simpler choice: it matches the combined model's accuracy with fewer features.

For further analysis, it would be useful to check whether nonlinear models can pick up patterns that Logistic Regression cannot. Decision Trees, Random Forests, or Gradient Boosting might find interactions between odor and cap color that a model with only additive effects cannot represent. It could also help to visualize the features, for example through clustering or dimensionality reduction, to see whether natural groups form. These approaches would show whether cap color has value in more complex models or whether odor is simply the dominant feature overall.
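
As a starting point for that comparison, the sketch below fits a Logistic Regression and a Decision Tree on the same split and reports both test accuracies. The function name `compare_models` is hypothetical, and the data are synthetic: the toy label is built as a pure interaction between two features to show the kind of pattern a tree can capture and an additive linear model cannot. It says nothing about whether such an interaction exists in the mushroom data.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def compare_models(X, y, test_size=0.3, random_state=42):
    """Return test accuracy for Logistic Regression and a Decision Tree."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    models = {
        "logistic_regression": LogisticRegression(
            solver="liblinear", random_state=random_state
        ),
        "decision_tree": DecisionTreeClassifier(random_state=random_state),
    }
    return {
        name: accuracy_score(y_test, model.fit(X_train, y_train).predict(X_test))
        for name, model in models.items()
    }


# Synthetic toy data with a pure interaction: the label depends on the
# combination of the two features, not on either one alone.
toy = pd.DataFrame({
    "odor":      ["a", "a", "n", "n"] * 50,
    "cap_color": ["w", "y", "w", "y"] * 50,
})
y_toy = ((toy["odor"] == "a") ^ (toy["cap_color"] == "w")).astype(int)
X_toy = pd.get_dummies(toy, columns=["odor", "cap_color"], drop_first=True)

scores = compare_models(X_toy, y_toy)
print(scores)
```

In the notebook, `compare_models(X_combined_encoded, y)` would apply the same comparison to the real features. If the tree's accuracy there is no higher than 0.9840, that would support the reading that odor simply dominates.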