# Case Study on Probability for Data Science
#### Problem Statement:
Build a suitable machine learning model to predict whether a mushroom is edible or poisonous (e or p) using the given dataset.
(Along with other ML algorithms, a Naïve Bayes classifier should be applied.)
Also, carry out any data pre-processing that is necessary.
## Attribute Information:
• cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
• cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s
• cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
• bruises: bruises=t, no=f
• odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
• gill-attachment: attached=a, descending=d, free=f, notched=n
• gill-spacing: close=c, crowded=w, distant=d
• gill-size: broad=b, narrow=n
• gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
• stalk-shape: enlarging=e, tapering=t
• stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
• stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s
• stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s
• stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
• stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
• veil-type: partial=p, universal=u
• veil-color: brown=n, orange=o, white=w, yellow=y
• ring-number: none=n, one=o, two=t
• ring-type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
• spore-print-color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
• population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
• habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d


# Importing Libraries

In [20]:
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

# Loading Data

In [21]:
# reading the data
data = pd.read_csv('mushrooms.csv')
data.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


# Preprocessing Steps

In [22]:
#checking for Null values 
data.isna().sum()

class                       0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64
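
Note that the attribute list marks missing stalk-root values with `?`, so `isna()` reports zero nulls even if such placeholders are present. A minimal sketch of how they could be counted, using a hypothetical four-row frame (`toy`) in place of the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the mushroom data
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "stalk-root": ["e", "?", "b", "?"],
})

# isna() finds nothing because "?" is an ordinary string, not NaN
null_counts = toy.isna().sum()

# Count the "?" placeholders per column instead
placeholder_counts = (toy == "?").sum()

# Optionally turn the placeholder into a real missing value
toy_nan = toy.replace("?", np.nan)
nan_counts = toy_nan.isna().sum()

print(placeholder_counts)
```

On the real data the same check would be `(data == "?").sum()`; whether to drop, impute, or keep `?` as its own category is a modelling choice.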

In [23]:
data.shape

(8124, 23)


In [25]:
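# Separating features (x) from the target (y); 'class' is the first column
x = data.iloc[:, 1:]
y = data.iloc[:, 0]

# Label-encoding every column: each category letter becomes an integer code
from sklearn.preprocessing import LabelEncoder
x = x.apply(LabelEncoder().fit_transform)
y = LabelEncoder().fit_transform(y)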
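
`LabelEncoder` assigns integer codes in sorted label order, so for the target `e` becomes 0 and `p` becomes 1. Because the cell above creates throwaway encoders, the mapping is not kept. A small sketch of how an encoder object (here the hypothetical name `le`) could be retained to read the mapping back:

```python
from sklearn.preprocessing import LabelEncoder

# Keep a reference to the encoder so the mapping can be inspected later
le = LabelEncoder()
encoded = le.fit_transform(["p", "e", "e", "p", "e"])

# classes_ holds the sorted labels; the position of a label is its integer code
classes = list(le.classes_)

# inverse_transform maps integer codes back to the original letters
decoded = list(le.inverse_transform([0, 1]))

print(classes, list(encoded), decoded)
```

With this mapping, a predicted 1 means poisonous and a predicted 0 means edible.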

In [26]:
#to split data into test and train
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,random_state=42)

In [27]:
x_train.shape

(6093, 22)

In [28]:
y_train.shape

(6093,)

In [29]:
x_test.shape

(2031, 22)

In [30]:
y_test.shape

(2031,)
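
The shapes above follow from `test_size=0.25` on 8124 rows: 2031 rows go to the test set and 6093 to the training set. A sketch that reproduces the sizes on dummy arrays and, as an optional extra that the notebook does not use, passes `stratify` so both parts keep the same class ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy arrays with the same dimensions as the encoded mushroom data
n_rows, n_features = 8124, 22
X_dummy = np.zeros((n_rows, n_features))
y_dummy = np.arange(n_rows) % 2   # hypothetical balanced 0/1 labels

# stratify=y_dummy keeps the class proportions equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, y_dummy, test_size=0.25, random_state=42, stratify=y_dummy
)

print(X_tr.shape, X_te.shape)
```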

In [31]:
# Random forest classifier
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier()
rf.fit(x_train,y_train)
y_pred=rf.predict(x_test)

In [32]:
#to evaluate the model 
from sklearn.metrics import confusion_matrix,accuracy_score
confusion_matrix(y_test,y_pred)

array([[1040,    0],
       [   0,  991]], dtype=int64)

In [33]:
accuracy_score(y_test,y_pred)

1.0

Insight:
    The random forest classifier reaches an accuracy of 1.0 on the test set.
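
A perfect score on one 75/25 split is a single measurement. One way to check that it does not depend on that particular split is k-fold cross-validation. The sketch below uses synthetic stand-in data from `make_classification` (an assumption, not the mushroom data); in the notebook the encoded `x` and `y` would take its place.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (hypothetical); replace with the encoded x and y
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Five folds: each fold is used once as the held-out part
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")

print(scores.mean(), scores.std())
```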

In [34]:
# feature scaling: fit the scaler on the training set only,
# then apply the same transformation to the test set
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
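
The scaler's mean and standard deviation should come from the training set only; the test set is then transformed with those same statistics. A tiny worked sketch on hypothetical one-column arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy values, one feature
train = np.array([[0.0], [2.0], [4.0]])
test = np.array([[2.0], [6.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training mean and std

print(scaler.mean_, test_scaled.ravel())
```

Here the training mean is 2, so the test value 2 maps to 0. Refitting on the test set would instead centre the test values on their own mean of 4.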

In [35]:
#fitting Gaussian Naive Bayes to the training set
from sklearn.naive_bayes import GaussianNB
classifier=GaussianNB()
classifier.fit(x_train,y_train)

GaussianNB()
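
`GaussianNB` treats each feature as a continuous, normally distributed value, whereas the features here are integer codes for categories. As a possible alternative (a swapped-in technique, not what this notebook uses), scikit-learn's `CategoricalNB` models each feature as a categorical distribution. A minimal sketch on hypothetical toy data:

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: two categorical features, labels e/p
X_raw = np.array([["a", "x"], ["a", "y"], ["f", "x"],
                  ["f", "y"], ["a", "x"], ["f", "y"]])
y_raw = np.array(["e", "e", "p", "p", "e", "p"])

# CategoricalNB expects non-negative integer codes per feature
enc = OrdinalEncoder()
X_codes = enc.fit_transform(X_raw).astype(int)

nb = CategoricalNB()  # default alpha=1.0 applies Laplace smoothing
nb.fit(X_codes, y_raw)

# New samples must be encoded with the same fitted encoder
X_new = enc.transform(np.array([["a", "y"], ["f", "x"]])).astype(int)
pred = nb.predict(X_new)

print(pred)
```

No feature scaling is needed for this variant, since the integer codes are only used as category indices.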

In [36]:
y_pred=classifier.predict(x_train)
y_pred

array([1, 1, 0, ..., 0, 1, 0])

In [37]:
accuracy_score(y_test,y_pred)

ValueError: Found input variables with inconsistent numbers of samples: [2031, 6093]

In [38]:
confusion_matrix(y_test,y_pred)

ValueError: Found input variables with inconsistent numbers of samples: [2031, 6093]

HELP: I have tried my best and this error still keeps popping up. I hope you can provide some insight.
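
On the cause of the ValueError: in cell In [36] the predictions come from `classifier.predict(x_train)`, so `y_pred` holds 6093 values, while `y_test` holds 2031. `accuracy_score` and `confusion_matrix` need both arrays to have the same length, which is what the message `[2031, 6093]` reports. Predicting on `x_test` instead should clear the error. A self-contained sketch of the pattern on synthetic data (the arrays and the name `clf` are hypothetical):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (hypothetical)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

clf = GaussianNB().fit(X_train, y_train)

# Predict on the TEST features so the result lines up with y_test
y_pred_test = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred_test)
cm = confusion_matrix(y_test, y_pred_test)
print(acc)
print(cm)
```

In the notebook this corresponds to `y_pred = classifier.predict(x_test)` before calling `accuracy_score(y_test, y_pred)`.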