# Case Study - 10

# Probability for Data Science

Problem Statement:
To make a suitable machine learning algorithm to predict if the mushroom is
edible or poisonous (e or p) using the given dataset.
(Along with other ML algorithms, Naïve Bayes’ Classifier should be applied)
Also, if some data pre-processing is necessary do that as well.

In [1]:
# import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


import warnings
warnings.filterwarnings('ignore')

In [2]:
#read the dataset
data=pd.read_csv("mushrooms.csv")

In [3]:
# checking null values
data.isna().sum()

class                       0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64

There are no null values in the dataset, hence it is clean.

In [4]:
data.describe()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
count,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124,...,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124
unique,2,6,4,10,2,9,2,2,2,12,...,4,9,9,1,4,3,5,9,6,7
top,e,x,y,n,f,n,f,c,b,b,...,s,w,w,p,w,o,p,w,v,d
freq,4208,3656,3244,2284,4748,3528,7914,6812,5612,1728,...,4936,4464,4384,8124,7924,7488,3968,2388,4040,3148


In [5]:
# checking the columns and type
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   class                     8124 non-null   object
 1   cap-shape                 8124 non-null   object
 2   cap-surface               8124 non-null   object
 3   cap-color                 8124 non-null   object
 4   bruises                   8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill-attachment           8124 non-null   object
 7   gill-spacing              8124 non-null   object
 8   gill-size                 8124 non-null   object
 9   gill-color                8124 non-null   object
 10  stalk-shape               8124 non-null   object
 11  stalk-root                8124 non-null   object
 12  stalk-surface-above-ring  8124 non-null   object
 13  stalk-surface-below-ring  8124 non-null   object
 14  stalk-color-above-ring  

 Since machines cannot understand categorical values, we should convert them to numerical values. Here we use label encoding to convert them into numerical values

In [6]:
from sklearn.preprocessing import LabelEncoder
enc=LabelEncoder()

In [7]:
#encoding the classification column
data['class']=enc.fit_transform(data['class']) 

In [8]:
#creating traing and testing set
X=data.drop('class',axis=1)
Y=data['class']

In [9]:
X=pd.get_dummies(X,drop_first=True)

In [10]:
X.head()

Unnamed: 0,cap-shape_c,cap-shape_f,cap-shape_k,cap-shape_s,cap-shape_x,cap-surface_g,cap-surface_s,cap-surface_y,cap-color_c,cap-color_e,...,population_n,population_s,population_v,population_y,habitat_g,habitat_l,habitat_m,habitat_p,habitat_u,habitat_w
0,0,0,0,0,1,0,1,0,0,0,...,0,1,0,0,0,0,0,0,1,0
1,0,0,0,0,1,0,1,0,0,0,...,1,0,0,0,1,0,0,0,0,0
2,0,0,0,0,0,0,1,0,0,0,...,1,0,0,0,0,0,1,0,0,0
3,0,0,0,0,1,0,0,1,0,0,...,0,1,0,0,0,0,0,0,1,0
4,0,0,0,0,1,0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [11]:
#spliting the data into training and testing set
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,random_state=42,test_size=.25)

In [12]:
# importing all the necessary libraries to use in various classification algorithms
from sklearn.linear_model import LogisticRegression # for logistic Regression Algorithm
from sklearn.neighbors import KNeighborsClassifier # for K Nearest Neighbours
from sklearn.svm import SVC # for Support Vector Machine(SVM) Classifier Algorithm
from sklearn.tree import DecisionTreeClassifier # for using Decision Tree Algorithm
from sklearn.ensemble import RandomForestClassifier # for using Random Forest Algorithm


In [13]:
# importing necessary libraries for checking the model accuracy
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

# 1.Logistic Regression 

In [14]:
logitreg = LogisticRegression(max_iter=700)
logitreg.fit(X_train,Y_train)
Y_predict=logitreg.predict(X_test)

In [15]:
print(accuracy_score(Y_test,Y_predict))
confusion_matrix(Y_test,Y_predict)

1.0


array([[1040,    0],
       [   0,  991]], dtype=int64)

In [16]:
logitreg_as= accuracy_score(Y_test,Y_predict)
logitreg_as

1.0

# 2.K-Nearest Neighbours(KNN)

In [17]:
knnmodel=KNeighborsClassifier(n_neighbors=3)
knnmodel.fit(X_train,Y_train)
Y_predict1=knnmodel.predict(X_test)

In [18]:
print(accuracy_score(Y_test,Y_predict1))
confusion_matrix(Y_test,Y_predict1)

1.0


array([[1040,    0],
       [   0,  991]], dtype=int64)

In [19]:
logitreg_as1= accuracy_score(Y_test,Y_predict1)
logitreg_as1

1.0

# 3.Decision Tree

In [20]:
dt=DecisionTreeClassifier()
dt.fit(X_train,Y_train)
Y_predict2= dt.predict(X_test)

In [21]:
print(accuracy_score(Y_test,Y_predict2))
confusion_matrix(Y_test,Y_predict2)

1.0


array([[1040,    0],
       [   0,  991]], dtype=int64)

In [22]:
logitreg_as2 = accuracy_score(Y_test,Y_predict2)
logitreg_as2

1.0

# 4.Random Forest

In [23]:
rf=RandomForestClassifier()
rf.fit(X_train,Y_train)
Y_predict3=rf.predict(X_test)

In [24]:
print(accuracy_score(Y_test,Y_predict3))
confusion_matrix(Y_test,Y_predict3)

1.0


array([[1040,    0],
       [   0,  991]], dtype=int64)

In [25]:
logitreg_as3= accuracy_score(Y_test,Y_predict3)
logitreg_as3

1.0

# 5.Support Vector Machine (SVM)

In [26]:
svm_model = SVC(kernel ='rbf',C=30,gamma='auto')
svm_model.fit(X_train,Y_train)
Y_predict4= svm_model.predict(X_test)

In [27]:
print(accuracy_score(Y_test,Y_predict3))
confusion_matrix(Y_test,Y_predict4)

1.0


array([[1040,    0],
       [   0,  991]], dtype=int64)

In [28]:
logitreg_as4= accuracy_score(Y_test,Y_predict4)
logitreg_as4

1.0

# 6.Naive Bayes

In [29]:
from sklearn.naive_bayes import GaussianNB
classifier=GaussianNB()
classifier.fit(X_train,Y_train)
Y_predict5= classifier.predict(X_test)

In [30]:
print(accuracy_score(Y_test,Y_predict5))
confusion_matrix(Y_test,Y_predict)

0.9428852781880847


array([[1040,    0],
       [   0,  991]], dtype=int64)

In [31]:
logitreg_as5= accuracy_score(Y_test,Y_predict5)
logitreg_as5


0.9428852781880847

In [32]:
from sklearn.naive_bayes import BernoulliNB
classifier1=BernoulliNB()
classifier1.fit(X_train,Y_train)
Y_predict6= classifier1.predict(X_test)

In [33]:
print(accuracy_score(Y_test,Y_predict6))
confusion_matrix(Y_test,Y_predict6)

0.9374692269817824


array([[1031,    9],
       [ 118,  873]], dtype=int64)

In [34]:
logitreg_as6=accuracy_score(Y_test,Y_predict6)
logitreg_as6

0.9374692269817824