# Classifying Mushrooms

We'll use a data set that's available on the University of California Irvine (UCI) Machine Learning Repository. This data set has 8124 instances of "hypothetical samples" corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family. There are 22 features in the data set, all of which are categorical. Each mushroom species is identified as definitely edible, definitely poisonous, or of unknown edibility and therefore "not recommended". For this data set, the last class was combined with the poisonous one.

Our goal here is to classify mushrooms as "edible" or "poisonous" based on features such as "cap-shape", "odor", and "habitat". You can read more about this data set and its features on the UCI Repository webpage; we won't go into what each feature means here, beyond noting that they are all categorical. Feature 11 ("stalk-root") has some missing data, encoded as a question mark "?". For simplicity, we will treat "?" as just another possible value of that feature.

In [1]:
import pandas as pd

df = pd.read_csv('https://dataincubator-course.s3.amazonaws.com/coursedata/agaricus-lepiota.data',
                 header=None,
                 names=['edible', 'cap-shape', 'cap-surface', 'cap-color', 'bruises?', 'odor', 
                        'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 
                        'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 
                        'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 
                        'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 
                        'population', 'habitat']   #  Column names taken from UCI description
                )

df.head()

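|  | edible | cap-shape | cap-surface | cap-color | bruises? | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |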


The first column ("e" vs "p", i.e. "edible" vs "poisonous") is what we want to predict, so let's save it in a separate variable and then drop it from the DataFrame. We'll also convert it from a string label to a numeric one.

In [2]:
#  Define our 'target', i.e. what we want to predict, as numeric values: 1 if edible, 0 if not
target = (df['edible'] == 'e').astype(int).values

#  Then drop the "edible" column so the DataFrame holds only the features
df = df.drop(columns='edible')

Before proceeding further, let's do a train/test split so that we can test our model later, holding back 20% of the data.

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df, target, test_size = 0.2, 
                                                    random_state=42)

Every feature is a categorical string, so we one-hot encode the features before passing them to a classifier. Here is what the encoded training set looks like.

In [4]:
from sklearn.preprocessing import OneHotEncoder

#  Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`, and `sparse` was removed in 1.4
OneHotEncoder(categories='auto', sparse=False).fit_transform(X_train)

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 1., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
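
To make the encoded array above more concrete, here is a small sketch on a made-up two-column sample (the rows and the `toy` name are illustrative, not taken from the data set). It suggests how each distinct value, including the "?" placeholder, gets its own indicator column, and how `handle_unknown='ignore'` (the option used in the pipelines in this notebook) encodes a value not seen during fitting as all zeros for that feature. Calling `.toarray()` on the default sparse result sidesteps the `sparse`/`sparse_output` naming difference between scikit-learn versions.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical mini-sample: columns stand for (cap-shape, stalk-root)
toy = [['x', 'n'],
       ['b', '?'],
       ['x', '?']]

enc = OneHotEncoder(handle_unknown='ignore')
encoded = enc.fit_transform(toy).toarray()

# One sorted list of categories per input column; '?' is just another value
categories = [list(c) for c in enc.categories_]
print(categories)
print(encoded)

# A value never seen during fit ('f') becomes all zeros for that feature
unseen = enc.transform([['f', 'n']]).toarray()
print(unseen)
```

Each row of `encoded` has exactly one 1 per original feature, which is why the real encoded matrix is much wider than the 22 original columns.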

# BernoulliNB

In [10]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.naive_bayes import BernoulliNB

from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('ohe', OneHotEncoder(categories='auto', sparse=False, handle_unknown='ignore')),
    ('bnb', BernoulliNB())
])

pipe.fit(X_train, y_train)

Pipeline(steps=[('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False)),
                ('bnb', BernoulliNB())])

In [11]:
pipe.score(X_test, y_test)

0.936

In [12]:
from sklearn.metrics import confusion_matrix, precision_score, classification_report

y_pred = pipe.predict(X_test)

precision_score(y_test, y_pred)

0.9038251366120219

In [13]:
confusion_matrix(y_test, y_pred)

array([[694,  88],
       [ 16, 827]], dtype=int64)

Nearly 10% of the samples predicted to be "edible" (88 out of 915) are actually poisonous. It should go without saying that you don't want to use this classifier (or, to be honest, any other machine learning classifier) as a guide to what mushrooms you can safely eat.
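
As a quick check of that arithmetic, the figures can be recomputed by hand from the confusion matrix printed above. This is a minimal sketch in plain Python; the variable names are illustrative, and it assumes scikit-learn's convention that rows are true classes and columns are predicted classes, in the order (poisonous, edible).

```python
# Confusion matrix copied from the BernoulliNB output above:
# rows are true classes, columns are predicted classes, order (poisonous, edible)
cm = [[694, 88],
      [16, 827]]

poisonous_called_edible = cm[0][1]   # poisonous mushrooms predicted as edible
edible_called_edible = cm[1][1]      # edible mushrooms predicted as edible

# Everything the classifier labelled "edible"
predicted_edible = poisonous_called_edible + edible_called_edible

# Precision for the "edible" class, and the share of "edible" calls that are wrong
precision_edible = edible_called_edible / predicted_edible
wrongly_edible_share = poisonous_called_edible / predicted_edible

# Overall accuracy: correct predictions on the diagonal over all samples
total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

print(predicted_edible)
print(round(precision_edible, 4))
print(round(wrongly_edible_share, 4))
print(round(accuracy, 4))
```

The precision computed this way should agree with the `precision_score` result above, since that function treats label 1 ("edible") as the positive class by default.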

In [14]:
print(classification_report(y_test, y_pred, target_names=['poisonous', 'edible']))

              precision    recall  f1-score   support

   poisonous       0.98      0.89      0.93       782
      edible       0.90      0.98      0.94       843

    accuracy                           0.94      1625
   macro avg       0.94      0.93      0.94      1625
weighted avg       0.94      0.94      0.94      1625



# MultinomialNB

In [15]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.naive_bayes import MultinomialNB

from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('ohe', OneHotEncoder(categories='auto', sparse=False, handle_unknown='ignore')),
    ('mnb', MultinomialNB())
])

pipe.fit(X_train, y_train)

Pipeline(steps=[('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False)),
                ('mnb', MultinomialNB())])

In [16]:
pipe.score(X_test, y_test)

0.9507692307692308

In [17]:
from sklearn.metrics import confusion_matrix, precision_score, classification_report

y_pred = pipe.predict(X_test)

precision_score(y_test, y_pred)

0.9187705817782656

In [18]:
confusion_matrix(y_test, y_pred)

array([[708,  74],
       [  6, 837]], dtype=int64)

In [19]:
print(classification_report(y_test, y_pred, target_names=['poisonous', 'edible']))

              precision    recall  f1-score   support

   poisonous       0.99      0.91      0.95       782
      edible       0.92      0.99      0.95       843

    accuracy                           0.95      1625
   macro avg       0.96      0.95      0.95      1625
weighted avg       0.95      0.95      0.95      1625



MultinomialNB gives slightly better accuracy (0.951 vs. 0.936) and precision (0.919 vs. 0.904) on this data set. The decision rules of the two classifiers differ: according to the Scikit-Learn documentation, BernoulliNB "penalizes the non-occurrence of a feature $i$ that is an indicator for [a class], where the multinomial variant would simply ignore a non-occurring feature." BernoulliNB isn't always applicable (it assumes binary features), but when it is, as with one-hot encoded data, it's worth comparing MultinomialNB and BernoulliNB on whichever metric you are interested in optimizing.
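
The difference in how the two models treat an absent feature can be sketched in a few lines of plain Python. This is not scikit-learn's implementation: the functions below only compute the per-class log-likelihood term described in the quote, the class prior is left out, and the probabilities `p` and `theta` are made-up values for illustration.

```python
import math

def bernoulli_log_likelihood(x, p):
    """Sum of log P(x_i | class), treating each feature as a Bernoulli variable.

    An absent feature (x_i = 0) contributes log(1 - p_i), so the absence of a
    feature that is usually present for this class lowers the score.
    """
    return sum(math.log(p_i) if x_i else math.log(1.0 - p_i)
               for x_i, p_i in zip(x, p))

def multinomial_log_likelihood(x, theta):
    """Sum of x_i * log(theta_i); an absent feature (x_i = 0) contributes nothing."""
    return sum(x_i * math.log(t_i) for x_i, t_i in zip(x, theta))

# Hypothetical per-class parameters for three indicator features
p = [0.9, 0.5, 0.1]                    # P(feature present | class)
theta = [p_i / sum(p) for p_i in p]    # the same weights normalised to sum to 1

# Feature 0, a strong indicator of this class, is absent from the sample
sample = [0, 1, 0]

print(bernoulli_log_likelihood(sample, p))
print(multinomial_log_likelihood(sample, theta))
```

In the Bernoulli score, the missing feature 0 adds a `log(1 - 0.9)` term, a sizeable penalty, while the multinomial score depends only on the feature that is present.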

# Logistic Regression

In [23]:
from sklearn.linear_model import LogisticRegression

In [24]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('ohe', OneHotEncoder(categories='auto', sparse=False, handle_unknown='ignore')),
    ('lr', LogisticRegression())
])

pipe.fit(X_train, y_train)

Pipeline(steps=[('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False)),
                ('lr', LogisticRegression())])

In [25]:
pipe.score(X_test, y_test)

1.0

In [26]:
from sklearn.metrics import confusion_matrix, precision_score, classification_report

y_pred = pipe.predict(X_test)

precision_score(y_test, y_pred)

1.0

In [27]:
confusion_matrix(y_test, y_pred)

array([[782,   0],
       [  0, 843]], dtype=int64)

In [28]:
print(classification_report(y_test, y_pred, target_names=['poisonous', 'edible']))

              precision    recall  f1-score   support

   poisonous       1.00      1.00      1.00       782
      edible       1.00      1.00      1.00       843

    accuracy                           1.00      1625
   macro avg       1.00      1.00      1.00      1625
weighted avg       1.00      1.00      1.00      1625



Logistic Regression gives 100% accuracy on the held-out test set. With 22 features and enough samples, LR is able to find a boundary that perfectly separates the two classes (edible and poisonous), even on test data it never saw during fitting. (I still don't think I'm going to trust it to tell me what I can eat or not!)