# Naive Bayes Classification of Mushrooms

In this notebook, we take a dataset of mushroom features from Kaggle and build a model that classifies a mushroom as either poisonous or edible based on its features, using Naive Bayes classification.

In [55]:
import pandas as pd
import numpy as np

# Fraction reserved for training
TRAIN_FRAC = 0.75

Naive Bayes classification takes advantage of Bayes' Theorem, a theorem that seems simple and intuitive but is nonetheless profound:

$$
P(A|B) = \frac{P(B|A)P(A)}{P(B)}
$$

You have an initial probability of some event $A$ happening (the prior $P(A)$), and, as more evidence $B$ is uncovered, the probability of $A$ is updated (the posterior $P(A|B)$). In this case, $A$ would be whether or not a mushroom is poisonous. If the mushroom cap is colored red, for example, it might be more or less probable for the mushroom to be poisonous given this new information. This is determined by the likelihood $P(B|A)$ of a poisonous mushroom being red, relative to the overall probability $P(B)$ of any mushroom being red.
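
To make the update concrete, here is a small worked example. The three input probabilities are made-up numbers chosen only for illustration; they are not taken from the dataset.

```python
# Hypothetical numbers, chosen only to illustrate the update
p_poisonous = 0.48            # prior P(A)
p_red_given_poisonous = 0.20  # likelihood P(B|A)
p_red_given_edible = 0.05     # likelihood P(B|not A)

# Total probability of the evidence, P(B)
p_red = (p_red_given_poisonous * p_poisonous
         + p_red_given_edible * (1 - p_poisonous))

# Posterior P(A|B) from Bayes' Theorem
p_poisonous_given_red = p_red_given_poisonous * p_poisonous / p_red
print(f"P(poisonous | red cap) = {p_poisonous_given_red:.3f}")
```

With these assumed inputs, seeing a red cap raises the probability of "poisonous" from the prior of 0.48 to roughly 0.79, because red caps are four times as likely among the poisonous mushrooms in this toy setup.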

As he usually does, [3Blue1Brown](https://youtu.be/HZGCoVF3YvM) explains this so much better. Check out his video.

So, we have a set of mushroom features $\textbf{x}$, and we use Bayes' Theorem to formulate the probability of a mushroom being poisonous given that it has these features:

$$
P(\text{Poisonous} | \textbf{x}) = \frac{P(\textbf{x} | \text{Poisonous})P(\text{Poisonous})}{P(\textbf{x})}
$$

The "naive" in Naive Bayes means we assume all features are independent of one another within each class. This is not necessarily true, as features often appear together: mushrooms belong to discrete species, and members of a species share the same set of properties. However, making this assumption allows the likelihood $P(\textbf{x} | \text{Poisonous})$ to be factored into a product of individual likelihoods, which gives

$$
P(\text{Poisonous} | \textbf{x}) \propto P(\text{Poisonous}) \, P(x_1 | \text{Poisonous}) \, P(x_2 | \text{Poisonous}) \, P(x_3 | \text{Poisonous}) \cdots P(x_n | \text{Poisonous})
$$

We do the same for "Edible".

Since $P(\textbf{x})$ is the same for both classes for a given feature set, it does not change which class has the larger posterior, so it can be ignored when trying to determine a class given a set of features.
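
The following sketch checks this claim numerically. The prior and likelihood values are hypothetical, and the `toy_` names are introduced only for this illustration.

```python
import numpy as np

# Hypothetical prior and per-feature likelihoods for one mushroom;
# rows are classes in the order [edible, poisonous]
classes = ["e", "p"]
toy_priors = np.array([0.52, 0.48])
toy_feature_likelihoods = np.array([
    [0.30, 0.10, 0.60],   # P(x_i | edible)
    [0.05, 0.40, 0.50],   # P(x_i | poisonous)
])

# Unnormalized score: prior times the product of the likelihoods
scores = toy_priors * toy_feature_likelihoods.prod(axis=1)

# Dividing by the sum over classes (the role played by P(x)
# under the naive assumption) turns the scores into posteriors
posteriors = scores / scores.sum()

# The winning class is the same either way
assert scores.argmax() == posteriors.argmax()
print(classes[scores.argmax()], posteriors.round(3))
```

Normalizing rescales both scores by the same positive number, so it only matters if we want actual probabilities rather than just the predicted class.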

So, the plan is to take a dataset, use it to compute prior probabilities and likelihoods for each feature and class, and use this information to construct a classifier that can accurately predict whether new mushrooms are poisonous or not.

In [85]:
# Read dataset. We can split the dataset into training data for 
# building the model and testing data for testing the built model
df = pd.read_csv('data/mushrooms.csv')
train_df = df.sample(frac=TRAIN_FRAC)
test_df = df.drop(train_df.index)

# Print out a sample
df.sample(10)

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
4851,p,x,y,g,f,f,f,c,b,p,...,k,n,b,p,w,o,l,h,v,d
4237,p,f,f,y,f,f,f,c,b,h,...,k,b,n,p,w,o,l,h,v,d
3697,p,x,s,p,f,c,f,c,n,u,...,s,w,w,p,w,o,p,k,v,d
7428,p,k,y,n,f,y,f,c,n,b,...,k,p,p,p,w,o,e,w,v,d
2692,e,x,y,n,t,n,f,c,b,w,...,s,w,p,p,w,o,p,n,v,d
3493,e,f,y,g,t,n,f,c,b,u,...,s,p,w,p,w,o,p,n,v,d
4401,p,f,f,y,f,f,f,c,b,h,...,k,b,b,p,w,o,l,h,v,g
5477,p,f,y,w,t,n,f,c,b,r,...,s,w,w,p,w,t,p,r,v,m
4165,e,f,f,c,f,n,f,w,n,w,...,f,w,n,p,w,o,e,w,v,l
2696,e,f,f,n,t,n,f,c,b,p,...,s,g,p,p,w,o,p,k,y,d


In this dataset, the "class" column holds the labels: `e` for edible and `p` for poisonous.

You may notice that Bayes' Theorem starts from a prior probability. So what prior do we start with? We could assume 50/50 odds for either class, but a better idea is to estimate the probability of each class from how often it appears in the training data.

In [57]:
priors = train_df.groupby('class').size() / len(train_df)
priors

class
e    0.518792
p    0.481208
dtype: float64

Next we sift through each feature and compute $P(\text{feature} | \text{class})$. This can be computed from

$$
P(\text{feature} | \text{class}) = \frac{P(\text{feature}\ \&\ \text{class})}{P(\text{class})}
$$

Again, $P(\text{class})$ is our prior, the overall probability of each class, while $P(\text{feature}\ \&\ \text{class})$ is the probability that a datapoint with that feature value and that class appears in the dataset. In practice, the total number of datapoints cancels in the ratio, so the likelihood can be computed by counting the number of instances of each feature/class pair and dividing by the number of instances of each class.
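
As a sanity check on this formula, the sketch below computes one likelihood table two ways on a tiny made-up dataset: directly from normalized counts, and as joint probability divided by prior. The `toy_` names and the data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Tiny made-up dataset; column names mirror the mushroom data
toy_df = pd.DataFrame({
    "class":     ["e", "e", "e", "p", "p"],
    "cap-color": ["n", "n", "w", "n", "e"],
})

# Route 1: count feature/class pairs and normalize within each class
toy_table = pd.crosstab(toy_df["cap-color"], toy_df["class"],
                        normalize="columns")

# Route 2: joint probability divided by the class prior
toy_joint = pd.crosstab(toy_df["cap-color"], toy_df["class"],
                        normalize="all")
toy_prior = toy_df["class"].value_counts(normalize=True)
toy_ratio = toy_joint.div(toy_prior, axis="columns")

# Both routes agree, and each class column sums to 1
toy_ratio = toy_ratio.loc[toy_table.index, toy_table.columns]
assert np.allclose(toy_ratio.to_numpy(), toy_table.to_numpy())
print(toy_table)
```

Here $P(\text{cap-color} = n | e)$ comes out as 2/3, since two of the three edible rows have a brown (`n`) cap.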

We create a dictionary containing the likelihood table for each feature:

In [67]:
likelihoods = {
    column: train_df.groupby([column, 'class']).size() / train_df.groupby('class').size()
    for column in train_df.columns if column != 'class'
}
likelihoods

{'cap-shape': cap-shape  class
 b          e        0.096172
            p        0.012278
 c          p        0.001364
 f          e        0.377412
            p        0.395634
 k          e        0.053780
            p        0.152115
 s          e        0.007276
 x          e        0.465359
            p        0.438608
 dtype: float64,
 'cap-surface': cap-surface  class
 f            e        0.362227
              p        0.193724
 g            p        0.001364
 s            e        0.274280
              p        0.359482
 y            e        0.363493
              p        0.445430
 dtype: float64,
 'cap-color': cap-color  class
 b          e        0.012022
            p        0.030696
 c          e        0.008225
            p        0.003411
 e          e        0.152800
            p        0.225784
 g          e        0.235369
            p        0.204297
 n          e        0.301803
            p        0.260914
 p          e        0.012971
            p  

With this information, we can create our classifier function. This takes a single sample and runs through each feature, updating the probability.

In [83]:
def naive_bayes_classify(sample):
    """
    Determine whether a mushroom sample is
    poisonous or not given its features
    
    :param sample: single row of a dataset
    
    :return: predicted class
    """
    # Start with a prior probability
    probability = priors.copy()
    
    # For each column with known likelihood table
    # Update our label probabilities using new likelihood
    for column, table in likelihoods.items():
        probability *= table[sample[column]].reindex_like(probability).fillna(0)
        
    # Get class with maximum probability
    return probability.idxmax()
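
To follow the update by hand, here is the same logic applied to a toy training set that is small enough to check manually. The data and the `toy_` names are invented for this sketch; `toy_scores` returns the per-class scores instead of only the winning class.

```python
import pandas as pd

# Made-up training data with two features; not from the real dataset
toy_train = pd.DataFrame({
    "class":     ["e", "e", "e", "p", "p"],
    "odor":      ["n", "n", "a", "f", "f"],
    "cap-color": ["n", "w", "n", "n", "e"],
})

toy_class_counts = toy_train.groupby("class").size()
toy_priors = toy_class_counts / len(toy_train)
toy_likelihoods = {
    column: toy_train.groupby([column, "class"]).size() / toy_class_counts
    for column in toy_train.columns if column != "class"
}

def toy_scores(sample):
    """Return the unnormalized score of each class for one sample."""
    probability = toy_priors.copy()
    for column, table in toy_likelihoods.items():
        # A value never seen with a class gets likelihood 0 for that class
        probability *= table[sample[column]].reindex_like(probability).fillna(0)
    return probability

scores = toy_scores(pd.Series({"odor": "f", "cap-color": "n"}))
print(scores)
print("Predicted class:", scores.idxmax())
```

In this toy data the odor `f` only ever appears with class `p`, so the score for `e` drops to zero after the first feature and stays there. The same thing happens in the full classifier whenever a feature value was never observed with one of the classes in the training split.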

The last thing we can do is test the accuracy of our model using the held-out test split, which the model did not see while the priors and likelihoods were computed. We compare the predicted classes with the true classes in the test dataset and determine the fraction of samples that were predicted correctly.

In [84]:
expected = test_df['class']
actual = test_df.drop('class', axis=1).apply(naive_bayes_classify, axis=1)
accuracy = (expected == actual).sum() / len(test_df)
print('Model Accuracy: {:0.2%}'.format(accuracy))

Model Accuracy: 99.75%
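
A single accuracy number does not show which kind of mistake the model makes, and for mushrooms, calling a poisonous sample edible is the costly one. As a hedged sketch, the snippet below breaks predictions down by class with `pd.crosstab`. The label Series are made up and simply stand in for `expected` and `actual`.

```python
import pandas as pd

# Made-up labels standing in for the `expected` and `actual` Series
toy_expected = pd.Series(["e", "e", "p", "p", "p", "e"], name="expected")
toy_actual = pd.Series(["e", "e", "p", "e", "p", "e"], name="actual")

# Overall accuracy; the mean of a boolean Series equals sum / length
toy_accuracy = (toy_expected == toy_actual).mean()

# Rows are true classes, columns are predicted classes
toy_confusion = pd.crosstab(toy_expected, toy_actual)
print(f"Accuracy: {toy_accuracy:.2%}")
print(toy_confusion)
```

The entry in row `p`, column `e` counts poisonous mushrooms that were predicted edible, which is the cell worth watching on the real test set.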
