# Mushroom Classification: Edible or Not

In this notebook I build a Bernoulli Naive Bayes model with sklearn to classify a mushroom dataset, predicting whether a given mushroom is edible or poisonous. The algorithm is BernoulliNB, which is designed for datasets whose features are boolean (binary).

The basic idea behind BernoulliNB is the same as for MultinomialNB: both use Bayes' theorem, P(Y|X) = P(X|Y) * P(Y) / P(X).

Calculating P(X|Y) is different for every naive Bayes algorithm. For BernoulliNB it is calculated as follows.

For a feature X_i:

P(X_i|Y) = P(i|Y) * X_i + (1 - P(i|Y)) * (1 - X_i)

Here P(i|Y) is the probability that the i-th feature equals 1 in class Y.
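
To make the formula concrete, here is a minimal sketch that evaluates it with NumPy. The probabilities and the example vector are made-up numbers chosen only for illustration, and `bernoulli_likelihood` is a hypothetical helper name, not part of sklearn.

```python
import numpy as np

def bernoulli_likelihood(x, p):
    """Per-feature P(X_i | Y) for binary features x, where p[i] = P(i | Y)."""
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    # p is used where the feature is 1, (1 - p) where it is 0.
    return p * x + (1.0 - p) * (1.0 - x)

# Hypothetical values, not estimated from the mushroom data.
p_given_edible = np.array([0.9, 0.2, 0.5])  # P(feature i = 1 | edible)
x = np.array([1, 0, 1])                     # one example with three binary features

per_feature = bernoulli_likelihood(x, p_given_edible)

# Naive Bayes treats the features as independent given the class,
# so P(X | Y) is the product of the per-feature terms.
likelihood = per_feature.prod()
print(per_feature, likelihood)
```

Note that a feature with value 0 still contributes to the product through the (1 - P(i|Y)) term, so the absence of a feature counts as evidence too.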

In [19]:
import numpy as np
import pandas as pd

We load the dataset and see that the data is entirely categorical. Because every attribute holds category labels rather than numbers, popular supervised learning algorithms such as Logistic Regression, Neural Networks and SVM cannot be applied to the raw columns directly and are not a natural fit here.

In [20]:
df = pd.read_csv('mushrooms.csv')
df.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [21]:
df.shape

(8124, 23)

Split the data so that the first 7000 examples go into the training data and the rest into the test data.

In [22]:
train_df = df[:7000].copy()  # copy, so that adding columns later does not write to a slice of df
test_df = df[7000:].copy()
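
The split above is purely positional. As an alternative (not what this notebook does), a shuffled, class-balanced split could be sketched with sklearn's `train_test_split`. The toy frame, the 25% test size and the random seed below are assumptions made only for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the mushroom data.
toy = pd.DataFrame({
    'class': ['e', 'p'] * 10,
    'odor':  ['a', 'f', 'n', 'p', 'l'] * 4,
})

# Hold out 25% of the rows, shuffled, keeping the e/p ratio similar in both parts.
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, random_state=0, stratify=toy['class'])

print(toy_train.shape, toy_test.shape)
print(toy_test['class'].value_counts())
```

Shuffling matters when the rows of a file are ordered in some systematic way, because a positional split can then give training and test sets with different characteristics.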

Let's take a look at how the classes e (= edible) and p (= poisonous) are distributed.

In [23]:
target = train_df['class']
target.value_counts(normalize=True)

e    0.534857
p    0.465143
Name: class, dtype: float64

Now, to use this dataset with the BernoulliNB classifier, we first need to make it boolean. To do that we apply one-hot encoding to all the columns. I am not using a ready-made one-hot encoder; instead I show a very easy way to build your own from scratch. For each column and each unique value of that column, we create a new column that holds 1 (True) in the rows where the original value matches and 0 (False) elsewhere.

This way every original column is replaced by several boolean columns, one per distinct value. Finally we delete the original columns.

In [24]:
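del train_df['class']  # the labels are already stored in `target`
cols = list(train_df)  # names of the original categorical columns
for f in cols:
    # Take the unique values from the full df so that train and test get the same columns.
    for elem in df[f].unique():
        train_df[f + '_' + str(elem)] = (train_df[f] == elem)
# Drop the original categorical columns, keeping only the boolean ones.
train_df = train_df.drop(columns=cols)
train_df.head()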



Unnamed: 0,cap-shape_x,cap-shape_b,cap-shape_s,cap-shape_f,cap-shape_k,cap-shape_c,cap-surface_s,cap-surface_y,cap-surface_f,cap-surface_g,...,population_v,population_y,population_c,habitat_u,habitat_g,habitat_m,habitat_d,habitat_p,habitat_w,habitat_l
0,True,False,False,False,False,False,True,False,False,False,...,False,False,False,True,False,False,False,False,False,False
1,True,False,False,False,False,False,True,False,False,False,...,False,False,False,False,True,False,False,False,False,False
2,False,True,False,False,False,False,True,False,False,False,...,False,False,False,False,False,True,False,False,False,False
3,True,False,False,False,False,False,False,True,False,False,...,False,False,False,True,False,False,False,False,False,False
4,True,False,False,False,False,False,True,False,False,False,...,False,False,False,False,True,False,False,False,False,False
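
As a cross-check on the hand-rolled encoding, the sketch below compares it with pandas' built-in `pd.get_dummies` on a tiny made-up frame. The frame `toy` and its values are hypothetical and are not taken from the mushroom data.

```python
import pandas as pd

# Hypothetical toy frame, not the mushroom data.
toy = pd.DataFrame({'odor': ['a', 'n', 'a'], 'bruises': ['t', 'f', 't']})

# Manual encoding, following the same idea as the loop above.
manual = pd.DataFrame(index=toy.index)
for col in toy.columns:
    for value in toy[col].unique():
        manual[col + '_' + str(value)] = (toy[col] == value)

# Built-in equivalent; cast to bool because older pandas versions return 0/1 integers.
builtin = pd.get_dummies(toy).astype(bool)

# The two results hold the same columns and values, up to column order.
manual_sorted = manual[sorted(manual.columns)]
builtin_sorted = builtin[sorted(builtin.columns)]
print(list(manual_sorted.columns))
print((manual_sorted.to_numpy() == builtin_sorted.to_numpy()).all())
```

One design point is worth keeping from the manual version: it takes the unique values from the full `df`, so the training and test frames end up with identical columns. With `pd.get_dummies`, the same effect would require encoding the full frame before splitting.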


Now we import BernoulliNB from sklearn.naive_bayes. We create the classifier and fit it to the training data.
We get 93.8% accuracy on the training data.

In [25]:
from sklearn.naive_bayes import BernoulliNB
clf_ber = BernoulliNB()
train_x = train_df.to_numpy()  # as_matrix() was removed in pandas 1.0
clf_ber.fit(train_x, target)
print("Training data accuracy = " + str(clf_ber.score(train_x, target)))

Training data accuracy = 0.938428571429
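
To connect the fitted classifier back to the formula at the top of the notebook, the following sketch recomputes one prediction by hand and compares it with `predict_proba`. It uses a small made-up dataset rather than the mushroom data, and relies on the `feature_log_prob_` and `class_log_prior_` attributes of a fitted `BernoulliNB`.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data: 6 examples, 3 binary features, labels 'e' / 'p'.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 0],
              [1, 0, 0],
              [0, 1, 1]])
y = np.array(['e', 'e', 'e', 'p', 'p', 'p'])

clf = BernoulliNB()  # default smoothing, alpha=1.0
clf.fit(X, y)

# p[c, i] = smoothed estimate of P(feature i = 1 | class c).
p = np.exp(clf.feature_log_prob_)
prior = np.exp(clf.class_log_prior_)  # P(Y) for each class

# Apply P(X_i|Y) = p*X_i + (1-p)*(1-X_i), multiply over the features,
# then multiply by the class prior P(Y).
x = X[0]
joint = np.array([np.prod(p[c] * x + (1 - p[c]) * (1 - x)) * prior[c]
                  for c in range(len(clf.classes_))])
posterior = joint / joint.sum()  # dividing by P(X) normalises the result

print(posterior)
print(clf.predict_proba(x.reshape(1, -1))[0])
```

The two printed vectors should agree up to floating-point error, which shows that the classifier's prediction is the Bernoulli likelihood combined with the class prior through Bayes' theorem.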


In [26]:
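test_y = test_df['class']
del test_df['class']
# Encode the test data with the same (column, value) pairs as the training data.
for f in cols:
    for elem in df[f].unique():
        test_df[f + '_' + str(elem)] = (test_df[f] == elem)
# Drop the original categorical columns, keeping only the boolean ones.
test_df = test_df.drop(columns=cols)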



In [27]:
test_x = test_df.to_numpy()  # as_matrix() was removed in pandas 1.0
clf_ber.score(test_x, test_y)

0.95551601423487542