# Supervised Machine Learning: Naive Bayes

Bayes’ theorem gives the probability of an event given that another, related event has already occurred. For a classification dataset, we can apply Bayes’ theorem in the following way:

$$P(y \mid X) = \frac{P(X \mid y)\, P(y)}{P(X)}$$

where $y$ is the class variable and $X$ is a feature vector of size $n$:

$$X = (x_1, x_2, x_3, \ldots, x_n)$$

## 1. Naive Bayes Classification

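Naive Bayes rests on two simplifying assumptions:

1. The features are conditionally independent of one another, given the class.
2. Each feature is given the same importance (weight) in the outcome.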
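Under the independence assumption, the likelihood in Bayes’ theorem factorises over the individual features. The step below is the standard derivation written in the notation used above; the symbol $\hat{y}$ (the predicted class) is introduced here for illustration and does not appear in the original text.

$$P(y \mid x_1, \ldots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(X)}$$

Because $P(X)$ is the same for every class, it can be dropped when comparing classes:

$$\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$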

In [32]:
import pandas as pd

# Assign the feature and label variables
weather = ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast', 'Sunny', 'Sunny',
           'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']
temp = ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild']

play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']

# Collect the three columns into a DataFrame
df = pd.DataFrame(data={'weather': weather, 'temp': temp, 'play': play})

In [34]:
df.head()

|   | weather  | temp | play |
|---|----------|------|------|
| 0 | Sunny    | Hot  | No   |
| 1 | Sunny    | Hot  | No   |
| 2 | Overcast | Hot  | Yes  |
| 3 | Rainy    | Mild | Yes  |
| 4 | Rainy    | Cool | Yes  |


In [35]:
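from sklearn import preprocessing

# LabelEncoder assigns integer codes in sorted (alphabetical) order of the labels.
# The same encoder object is refit for each column, so after these lines
# `le` only holds the mapping for 'play'.
le = preprocessing.LabelEncoder()
df['weather'] = le.fit_transform(df['weather'])
df['temp'] = le.fit_transform(df['temp'])
df['play'] = le.fit_transform(df['play'])
df.head()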

|   | weather | temp | play |
|---|---------|------|------|
| 0 | 2       | 1    | 0    |
| 1 | 2       | 1    | 0    |
| 2 | 0       | 1    | 1    |
| 3 | 1       | 2    | 1    |
| 4 | 1       | 0    | 1    |
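To check which integer stands for which category, one option is to keep a separate encoder per column and read its `classes_` attribute. The sketch below rebuilds the same three lists; the names `raw`, `encoders`, and `mappings` are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Same data as in the notebook, kept in its original string form
raw = pd.DataFrame({
    'weather': ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast',
                'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy'],
    'temp': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
             'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
    'play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
             'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No'],
})

# One encoder per column, so every mapping stays available after fitting
encoders = {col: LabelEncoder().fit(raw[col]) for col in raw.columns}

# classes_ is sorted; a label's position in it is its integer code
mappings = {
    col: {str(label): code for code, label in enumerate(enc.classes_)}
    for col, enc in encoders.items()
}

for col, mapping in mappings.items():
    print(col, mapping)
```

With per-column encoders, `encoders['weather'].inverse_transform(...)` can also turn codes back into labels, which a single reused encoder cannot do for the earlier columns.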


In [37]:
from sklearn.naive_bayes import GaussianNB

features = df[['weather', 'temp']]
label = df['play']

# Create a Gaussian Naive Bayes classifier
model = GaussianNB()

# Train the model on the full dataset (there is no held-out test set here)
model.fit(features, label)

GaussianNB()

In [38]:
model.predict([[0,2]]) # 0:Overcast, 2:Mild

array([1])

In [46]:
model.score(features, label)  # accuracy on the same data the model was trained on

0.7142857142857143
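As a cross-check on the prediction for (Overcast, Mild), the naive Bayes rule can also be applied by hand, estimating each probability from category counts. This is a minimal sketch of count-based naive Bayes without smoothing; it is not what `GaussianNB` computes internally, and the function name `naive_bayes_scores` is hypothetical.

```python
from collections import Counter

weather = ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast',
           'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']
temp = ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
        'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild']
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
        'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']


def naive_bayes_scores(weather_value, temp_value):
    """Return the unnormalised P(y) * P(weather | y) * P(temp | y) per class y."""
    n = len(play)
    scores = {}
    for cls, count in Counter(play).items():
        # Count rows of this class that match each feature value
        weather_hits = sum(1 for w, p in zip(weather, play)
                           if p == cls and w == weather_value)
        temp_hits = sum(1 for t, p in zip(temp, play)
                        if p == cls and t == temp_value)
        prior = count / n
        scores[cls] = prior * (weather_hits / count) * (temp_hits / count)
    return scores


scores = naive_bayes_scores('Overcast', 'Mild')

# Normalise so the class scores sum to 1, then pick the largest
total = sum(scores.values())
posteriors = {cls: score / total for cls, score in scores.items()}
prediction = max(posteriors, key=posteriors.get)

print(scores)
print(prediction)
```

In this dataset every Overcast day is labelled Yes, so the score for No is zero and the count-based rule agrees with the `array([1])` returned above. On larger datasets a zero count like this is usually softened with smoothing, which this sketch leaves out.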

## 2. Gaussian Naive Bayes
In Gaussian Naive Bayes, the continuous values of each feature are assumed to follow a Gaussian (normal) distribution within each class. Fitting the model then amounts to calculating the mean and standard deviation of each input variable $x$ for each class value:

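- Mean: $\operatorname{mean}(x) = \frac{1}{n} \sum_{i=1}^{n} x_i$
- Standard deviation: $\operatorname{sd}(x) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(x_i - \operatorname{mean}(x)\right)^2}$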
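The two statistics above can be computed per class with NumPy and plugged into the normal density to classify a sample. The sketch below does this on the iris data and compares the result with scikit-learn; the helper `gaussian_pdf` and the variable names are assumptions made for illustration, and scikit-learn additionally adds a small smoothing term to the variances that is ignored here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# Per-class mean and population (1/n) standard deviation of every feature
means = np.array([X[y == c].mean(axis=0) for c in classes])
stds = np.array([X[y == c].std(axis=0) for c in classes])  # ddof=0 by default

# Class priors estimated as relative class frequencies
priors = np.array([(y == c).mean() for c in classes])


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))


# Log-posterior (up to a constant) of one sample under each class
sample = X[0]
log_posterior = np.log(priors) + np.log(gaussian_pdf(sample, means, stds)).sum(axis=1)
pred = classes[np.argmax(log_posterior)]

# Compare with scikit-learn's fitted per-class means and its prediction
gnb = GaussianNB().fit(X, y)
print(np.allclose(means, gnb.theta_))
print(pred, gnb.predict(sample.reshape(1, -1))[0])
```

Summing log densities instead of multiplying densities avoids numerical underflow when there are many features.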

In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

In [3]:
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

In [6]:
gnb = GaussianNB()
gnb.fit(X_train, y_train)

GaussianNB()

In [7]:
gnb.score(X_test, y_test)

0.9466666666666667

In [8]:
y_pred = gnb.predict(X_test)

In [9]:
print("Number of mislabeled points out of a total %d points : %d"
      % (X_test.shape[0], (y_test != y_pred).sum()))

Number of mislabeled points out of a total 75 points : 4
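The score and the mislabeled count above describe the same result: 4 errors out of 75 test points gives an accuracy of 1 - 4/75 ≈ 0.9467. As a sketch of how to see which classes are being confused, the split and model can be rebuilt and passed to `confusion_matrix` and `accuracy_score` from `sklearn.metrics`; the variable names `errors`, `accuracy`, and `cm` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Same data, split, and model as in the cells above
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
y_pred = GaussianNB().fit(X_train, y_train).predict(X_test)

# Accuracy is one minus the fraction of mislabeled points
errors = int((y_test != y_pred).sum())
accuracy = accuracy_score(y_test, y_pred)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)

print(cm)
print(accuracy, 1 - errors / len(y_test))
```

Off-diagonal entries of the matrix are the mislabeled points, so they sum to the error count.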
