## Mushroom Attributes
![mushrooms](\images\dataset-cover.jpg)

This dataset describes mushrooms by their physical characteristics, and each mushroom is classified as either poisonous or edible. The model below predicts that class for new input.

#### Imports

In [197]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

print(tf.__version__)

2.6.0


#### Pre-processing

Inspecting data

In [198]:
df = pd.read_csv("mushroom.csv")
df.head()

|   | cap-shape | cap-surface | cap-color | bruises%3F | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | ... | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b'x' | b's' | b'n' | b't' | b'p' | b'f' | b'c' | b'n' | b'k' | b'e' | ... | b'w' | b'w' | b'p' | b'w' | b'o' | b'p' | b'k' | b's' | b'u' | b'p' |
| 1 | b'x' | b's' | b'y' | b't' | b'a' | b'f' | b'c' | b'b' | b'k' | b'e' | ... | b'w' | b'w' | b'p' | b'w' | b'o' | b'p' | b'n' | b'n' | b'g' | b'e' |
| 2 | b'b' | b's' | b'w' | b't' | b'l' | b'f' | b'c' | b'b' | b'n' | b'e' | ... | b'w' | b'w' | b'p' | b'w' | b'o' | b'p' | b'n' | b'n' | b'm' | b'e' |
| 3 | b'x' | b'y' | b'w' | b't' | b'p' | b'f' | b'c' | b'n' | b'n' | b'e' | ... | b'w' | b'w' | b'p' | b'w' | b'o' | b'p' | b'k' | b's' | b'u' | b'p' |
| 4 | b'x' | b's' | b'g' | b'f' | b'n' | b'f' | b'w' | b'b' | b'k' | b't' | ... | b'w' | b'w' | b'p' | b'w' | b'o' | b'e' | b'n' | b'a' | b'g' | b'e' |


In [199]:
df.shape

(8124, 23)

In [200]:
column_names = df.columns.tolist()

Building a dictionary of each column's cardinality to check whether one-hot encoding is viable for every column.

In [201]:
object_nunique = list(map(lambda col: df[col].nunique(), column_names))
d = dict(zip(column_names, object_nunique))

The column with the highest cardinality is 'gill-color', with 12 distinct categories. Every mushroom in the dataset has the same veil-type, so that column is removed.

In [202]:
sorted(d.items(), key=lambda x: x[1])

[('veil-type', 1),
 ('bruises%3F', 2),
 ('gill-attachment', 2),
 ('gill-spacing', 2),
 ('gill-size', 2),
 ('stalk-shape', 2),
 ('class', 2),
 ('ring-number', 3),
 ('cap-surface', 4),
 ('stalk-surface-above-ring', 4),
 ('stalk-surface-below-ring', 4),
 ('veil-color', 4),
 ('stalk-root', 5),
 ('ring-type', 5),
 ('cap-shape', 6),
 ('population', 6),
 ('habitat', 7),
 ('odor', 9),
 ('stalk-color-above-ring', 9),
 ('stalk-color-below-ring', 9),
 ('spore-print-color', 9),
 ('cap-color', 10),
 ('gill-color', 12)]

In [203]:
df.drop("veil-type", axis=1, inplace=True)
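The cardinality check and the search for constant columns can also be done directly with pandas. This is a minimal sketch on a small made-up frame (`toy` and its values are assumptions, standing in for `mushroom.csv`):

```python
import pandas as pd

# Hypothetical stand-in for the mushroom data; the values are made up.
toy = pd.DataFrame({
    "veil-type": ["p", "p", "p", "p"],
    "gill-size": ["n", "b", "b", "n"],
    "odor": ["p", "a", "l", "n"],
})

# DataFrame.nunique() counts the distinct values of every column in one call.
cardinality = toy.nunique().sort_values()
print(cardinality)

# A column with a single distinct value cannot help separate the classes.
constant_cols = cardinality[cardinality == 1].index.tolist()
print(constant_cols)

# Drop the constant columns without modifying the original frame.
toy_reduced = toy.drop(columns=constant_cols)
```

Selecting the constant columns programmatically avoids hard-coding the column name if the data changes.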

Splitting training and testing data.

In [204]:
df_train, df_test = train_test_split(df, test_size=0.2, random_state=1)

Setting the target to "class". 0 denotes poisonous and 1 denotes edible.

In [205]:
y_train = df_train.pop("class")
y_test = df_test.pop("class")
y_train = y_train.replace({"b'p'": 0, "b'e'": 1})
y_test = y_test.replace({"b'p'": 0, "b'e'": 1})
X_train = df_train
X_test = df_test
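One way to guard the label encoding against unexpected values is `Series.map`, which yields NaN for anything missing from the dictionary. The sketch below uses a made-up `labels` series (an assumption) with the same string form and the same mapping as the cell above:

```python
import pandas as pd

# Hypothetical labels in the same string form the notebook's CSV uses.
labels = pd.Series(["b'p'", "b'e'", "b'e'", "b'p'"], name="class")

# Same mapping as the cell above: poisonous -> 0, edible -> 1.
class_codes = {"b'p'": 0, "b'e'": 1}

# Series.map returns NaN for any value that is not a key of the dict,
# so a label outside the mapping is easy to detect.
encoded = labels.map(class_codes)
assert encoded.notna().all(), "found a label outside the mapping"

# Safe to cast once every label is known to be mapped.
encoded = encoded.astype(int)
print(encoded.tolist())
```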

One-hot encoding all columns. This should expand the dataset to 116 columns, which hopefully isn't too many.

In [206]:
# Get list of columns - In this dataset, all columns are categorical so no extra steps needed
cols = X_train.columns.tolist()

# Applying one-hot encoding to training and testing sets
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoded_X_train = pd.DataFrame(encoder.fit_transform(X_train[cols]))
encoded_X_test = pd.DataFrame(encoder.transform(X_test[cols]))

In [207]:
encoded_X_train.shape

(6499, 116)

In [208]:
encoded_X_test.shape

(1625, 116)
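The 116-column figure can be checked by hand: one-hot encoding produces one column per category, so the expected width is the sum of the feature cardinalities. The sketch below copies the counts from the sorted listing earlier in the notebook and excludes the dropped "veil-type" column and the "class" target. It assumes every category occurs in the training split, which the shapes above are consistent with.

```python
# Cardinalities copied from the sorted listing earlier in the notebook.
cardinality = {
    "veil-type": 1, "bruises%3F": 2, "gill-attachment": 2, "gill-spacing": 2,
    "gill-size": 2, "stalk-shape": 2, "class": 2, "ring-number": 3,
    "cap-surface": 4, "stalk-surface-above-ring": 4,
    "stalk-surface-below-ring": 4, "veil-color": 4, "stalk-root": 5,
    "ring-type": 5, "cap-shape": 6, "population": 6, "habitat": 7,
    "odor": 9, "stalk-color-above-ring": 9, "stalk-color-below-ring": 9,
    "spore-print-color": 9, "cap-color": 10, "gill-color": 12,
}

# "veil-type" was dropped and "class" is the target, so neither is encoded.
not_encoded = {"veil-type", "class"}

# One output column per category of every remaining feature.
expected_columns = sum(n for col, n in cardinality.items() if col not in not_encoded)
print(expected_columns)  # 116
```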

The data should be ready now. I will be using an XGBClassifier model.

#### XGBClassifier Model - All data

In [209]:
from xgboost import XGBClassifier

In [210]:
xgb = XGBClassifier(n_estimators=500, learning_rate=0.01, max_depth=4, random_state=1)

In [211]:
xgb.fit(encoded_X_train, y_train)

In [212]:
y_pred = xgb.predict(encoded_X_test)

In [213]:
from sklearn.metrics import accuracy_score

In [214]:
# accuracy_score expects (y_true, y_pred); the value is the same either way
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.9969230769230769
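Accuracy alone does not show which kind of mistake the model makes, and for mushrooms a poisonous sample predicted as edible is the costly one. A confusion matrix separates the two error types. The labels below are made up for illustration (an assumption) and use the same coding as the notebook, 0 = poisonous and 1 = edible:

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical true labels and predictions; 0 = poisonous, 1 = edible.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 0, 1, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])
print(cm)

# Poisonous mushrooms that the model labelled edible.
poisonous_called_edible = cm[0, 1]

# Share of poisonous mushrooms that were correctly flagged.
poisonous_recall = recall_score(y_true, y_hat, pos_label=0)
print(poisonous_called_edible, poisonous_recall)
```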


I suspect the model is overfitting. Adjusting the parameters of the XGBClassifier barely changes the accuracy, so I am going to rebuild the model using fewer features to see if that helps.
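A high test score on its own does not show overfitting; the usual signal is a training score well above the held-out score. The sketch below compares the two and adds 5-fold cross-validation. It uses synthetic data from `make_classification` and a `DecisionTreeClassifier` as stand-ins (both are assumptions, not part of this notebook); `XGBClassifier` follows the scikit-learn estimator interface and could be swapped in.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; in the notebook this would be the encoded features.
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# Stand-in model; any scikit-learn compatible classifier can be used here.
model = DecisionTreeClassifier(max_depth=4, random_state=1)
model.fit(X_tr, y_tr)

# A large positive gap between these two scores would suggest overfitting.
train_acc = model.score(X_tr, y_tr)
test_acc = model.score(X_te, y_te)
print(f"train={train_acc:.3f} test={test_acc:.3f} gap={train_acc - test_acc:.3f}")

# Cross-validation shows how stable the score is across different splits.
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"cv mean={cv_scores.mean():.3f} std={cv_scores.std():.3f}")
```

If the training and test scores are close and the cross-validation scores are stable, the high accuracy is more likely a property of the data than of one lucky split.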

#### XGBClassifier Model - Reduced features

In [215]:
df = pd.read_csv("mushroom.csv")
features = ["cap-shape", "cap-surface", "cap-color", "gill-size", "stalk-shape", "stalk-root"]
target = ["class"]
df = df.loc[:, features + target]
df.head()

|   | cap-shape | cap-surface | cap-color | gill-size | stalk-shape | stalk-root | class |
|---|---|---|---|---|---|---|---|
| 0 | b'x' | b's' | b'n' | b'n' | b'e' | b'e' | b'p' |
| 1 | b'x' | b's' | b'y' | b'b' | b'e' | b'c' | b'e' |
| 2 | b'b' | b's' | b'w' | b'b' | b'e' | b'c' | b'e' |
| 3 | b'x' | b'y' | b'w' | b'n' | b'e' | b'e' | b'p' |
| 4 | b'x' | b's' | b'g' | b'b' | b't' | b'e' | b'e' |


In [216]:
df_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
y_train = df_train.pop("class")
y_test = df_test.pop("class")
y_train = y_train.replace({"b'p'": 0, "b'e'": 1})
y_test = y_test.replace({"b'p'": 0, "b'e'": 1})
X_train = df_train
X_test = df_test

In [217]:
cols = X_train.columns.tolist()
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoded_X_train = pd.DataFrame(encoder.fit_transform(X_train[cols]))
encoded_X_test = pd.DataFrame(encoder.transform(X_test[cols]))

In [218]:
xgb = XGBClassifier(n_estimators=500, learning_rate=0.05, max_depth=4, random_state=1)
xgb.fit(encoded_X_train, y_train)
y_pred = xgb.predict(encoded_X_test)
# accuracy_score expects (y_true, y_pred); the value is the same either way
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.9981538461538462
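With only a handful of categorical features, many rows can share the same feature combination, so a test row may have an exact copy in the training set. That possibility is worth measuring before trusting the score. A minimal sketch on made-up frames (`train`, `test` and their values are assumptions, not the notebook's data):

```python
import pandas as pd

# Hypothetical train/test feature frames; the values are made up.
train = pd.DataFrame({"cap-shape": ["x", "x", "b", "f"],
                      "gill-size": ["n", "b", "b", "n"]})
test = pd.DataFrame({"cap-shape": ["x", "b", "k"],
                     "gill-size": ["n", "b", "n"]})

# Distinct feature combinations seen during training.
seen = train.drop_duplicates()

# Left-merge on every column; the indicator column marks test rows
# whose exact combination also occurs in the training data.
merged = test.merge(seen, how="left", on=list(test.columns), indicator=True)
overlap = (merged["_merge"] == "both").mean()
print(f"{overlap:.0%} of test rows have an identical row in train")
```

A high overlap would mean the test set mostly measures how well the model recalls combinations it has already seen.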


Again, I tried many different parameters here and all yielded about the same result. Maybe mushroom classification is just simple, but as a beginner I am skeptical of such high accuracy. I'm submitting this one as is, but I'd welcome any feedback on how to improve the model.