#Naive Bayes Classification
Naive Bayes is a classification algorithm based on Bayes' theorem. Although the underlying theorem is always the same, the algorithm comes in several implementations. Today, we will build our model using the Categorical Naive Bayes implementation from `sklearn`, `CategoricalNB`.
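
For reference, the decision rule can be written as below. This is the standard textbook form of naive Bayes rather than anything taken from the `sklearn` source; here $x_1, \dots, x_n$ stand for the feature values of one datapoint and $c$ for a class label, and the "naive" part is the assumption that the features are independent given the class.

$$
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$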

#Libraries for the Implementation
We will use several libraries today: NumPy, Matplotlib, pandas and sklearn.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import CategoricalNB

The dataset we will use as an example today, the "Golf Data Sample", contains qualitative (categorical) attributes. We import the dataset into a dataframe and split it into two NumPy arrays: `X`, containing the values of the predictors, and `y`, containing the label values.

In [None]:
dataset = pd.read_csv('Golf Data Sample.csv')
X = dataset.iloc[:, :3].values   # first three columns: the predictors
y = dataset.iloc[:, 3:].values   # last column: the label, kept 2-D with shape (14, 1)
print(X)
print(y)

[['Sunny' 'Hot' 'Low']
 ['Sunny' 'Hot' 'High']
 ['Sunny' 'Hot' 'High']
 ['Sunny' 'Mild' 'High']
 ['Sunny' 'Mild' 'Low']
 ['Overcast' 'Mild' 'High']
 ['Overcast' 'Hot' 'High']
 ['Overcast' 'Mild' 'Low']
 ['Overcast' 'Hot' 'High']
 ['Rain' 'Mild' 'High']
 ['Rain' 'Mild' 'Low']
 ['Rain' 'Hot' 'High']
 ['Rain' 'Mild' 'High']
 ['Rain' 'Mild' 'High']]
[['Play']
 ['Dont Play']
 ['Dont Play']
 ['Dont Play']
 ['Play']
 ['Play']
 ['Play']
 ['Play']
 ['Play']
 ['Dont Play']
 ['Dont Play']
 ['Play']
 ['Play']
 ['Play']]
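
To make the idea concrete before turning to `sklearn`, the naive Bayes score for one datapoint can be worked out by hand from the 14 rows printed above. The sketch below is only an illustration: the helper `posterior_scores` is a hypothetical name, it uses all 14 rows rather than a training split, and it assumes Laplace smoothing with `alpha=1.0`, which is the default of `CategoricalNB`.

```python
from collections import Counter

# The 14 rows of the golf sample, copied from the printed output above.
rows = [
    ("Sunny", "Hot", "Low", "Play"),
    ("Sunny", "Hot", "High", "Dont Play"),
    ("Sunny", "Hot", "High", "Dont Play"),
    ("Sunny", "Mild", "High", "Dont Play"),
    ("Sunny", "Mild", "Low", "Play"),
    ("Overcast", "Mild", "High", "Play"),
    ("Overcast", "Hot", "High", "Play"),
    ("Overcast", "Mild", "Low", "Play"),
    ("Overcast", "Hot", "High", "Play"),
    ("Rain", "Mild", "High", "Dont Play"),
    ("Rain", "Mild", "Low", "Dont Play"),
    ("Rain", "Hot", "High", "Play"),
    ("Rain", "Mild", "High", "Play"),
    ("Rain", "Mild", "High", "Play"),
]


def posterior_scores(rows, query, alpha=1.0):
    """Return the unnormalised score P(c) * prod_i P(x_i | c) for each class c."""
    class_counts = Counter(r[-1] for r in rows)
    # Number of distinct values per feature, needed for Laplace smoothing.
    n_values = [len({r[j] for r in rows}) for j in range(len(query))]
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(rows)  # class prior P(c)
        for j, value in enumerate(query):
            matches = sum(1 for r in rows if r[-1] == c and r[j] == value)
            score *= (matches + alpha) / (n_c + alpha * n_values[j])
        scores[c] = score
    return scores


scores = posterior_scores(rows, ("Sunny", "Hot", "High"))
prediction = max(scores, key=scores.get)
print(scores)
print(prediction)
```

On this sample the "Dont Play" score comes out higher (about 0.055 against about 0.046 for "Play"), so the hand calculation picks "Dont Play" for a sunny, hot, high-humidity day.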


Since the dataset contains qualitative (categorical) values, we need to encode them as discrete numerical values. We use `LabelEncoder` for this operation. Notice that we split the 2-D array `X` into three arrays: `LabelEncoder` can only process 1-D arrays, so each column of `X` is put into a 1-D array and then encoded. Afterwards, we combine the encoded columns into another 2-D NumPy array, `X_converted`.

In [None]:

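# One LabelEncoder is reused throughout; every fit_transform call refits it
# and discards the mapping learned by the previous call.
le = LabelEncoder()

# y has shape (14, 1), so sklearn flattens it and emits a DataConversionWarning.
y_converted = le.fit_transform(y)

print(y)
print(y_converted)

# Outlook: take the first column, flatten it to 1-D, then encode it.
x_outlook = X[:, :1]
x_outlook = x_outlook.reshape(14)
print(x_outlook.shape)
print(x_outlook)
x_outlook = le.fit_transform(x_outlook)
print(x_outlook)

# Temperature: second column.
x_temperature = X[:, 1:2]
x_temperature = x_temperature.reshape(14)
print(x_temperature.shape)
print(x_temperature)
x_temperature = le.fit_transform(x_temperature)
print(x_temperature)

# Humidity: third column.
x_humidity = X[:, 2:3]
x_humidity = x_humidity.reshape(14)
print(x_humidity.shape)
print(x_humidity)
x_humidity = le.fit_transform(x_humidity)
print(x_humidity)

# Caution: np.array([...]) has shape (3, 14), one row per feature. reshape(14, 3)
# refills those values in row-major order instead of transposing, so the rows of
# X_converted no longer line up with the rows of X (compare the printed output).
# np.column_stack([x_outlook, x_temperature, x_humidity]) would keep the alignment.
X_converted = np.array([x_outlook, x_temperature, x_humidity])
X_converted = X_converted.reshape(14, 3)
print(X_converted.shape)
print(X_converted)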

[['Play']
 ['Dont Play']
 ['Dont Play']
 ['Dont Play']
 ['Play']
 ['Play']
 ['Play']
 ['Play']
 ['Play']
 ['Dont Play']
 ['Dont Play']
 ['Play']
 ['Play']
 ['Play']]
[1 0 0 0 1 1 1 1 1 0 0 1 1 1]
(14,)
['Sunny' 'Sunny' 'Sunny' 'Sunny' 'Sunny' 'Overcast' 'Overcast' 'Overcast'
 'Overcast' 'Rain' 'Rain' 'Rain' 'Rain' 'Rain']
[2 2 2 2 2 0 0 0 0 1 1 1 1 1]
(14,)
['Hot' 'Hot' 'Hot' 'Mild' 'Mild' 'Mild' 'Hot' 'Mild' 'Hot' 'Mild' 'Mild'
 'Hot' 'Mild' 'Mild']
[0 0 0 1 1 1 0 1 0 1 1 0 1 1]
(14,)
['Low' 'High' 'High' 'High' 'Low' 'High' 'High' 'Low' 'High' 'High' 'Low'
 'High' 'High' 'High']
[1 0 0 0 1 0 0 1 0 0 1 0 0 0]
(14, 3)
[[2 2 2]
 [2 2 0]
 [0 0 0]
 [1 1 1]
 [1 1 0]
 [0 0 1]
 [1 1 0]
 [1 0 1]
 [1 0 1]
 [1 1 0]
 [0 0 1]
 [0 0 1]
 [0 0 1]
 [0 0 0]]


  y = column_or_1d(y, warn=True)
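
When separately encoded 1-D columns are combined into a 2-D array, how they are stacked matters. The small sketch below uses three made-up columns (`col_a`, `col_b`, `col_c` are illustrative names, not part of the notebook) to compare building a `(3, n)` array and reshaping it with stacking the columns side by side using `np.column_stack`.

```python
import numpy as np

# Three encoded columns for four hypothetical rows.
col_a = np.array([0, 1, 2, 3])
col_b = np.array([10, 11, 12, 13])
col_c = np.array([20, 21, 22, 23])

# np.array puts each column into its own ROW: shape (3, 4).
stacked = np.array([col_a, col_b, col_c])

# reshape keeps the flat (row-major) order of the values, so it does not
# transpose: the first output row is made of three values from col_a.
reshaped = stacked.reshape(4, 3)

# column_stack places each 1-D array into its own COLUMN: shape (4, 3),
# so row i holds the i-th value of every feature.
aligned = np.column_stack([col_a, col_b, col_c])

print(reshaped)
print(aligned)
```

Only `aligned` (equivalently `stacked.T`) keeps each row's three feature values together; `reshaped` has the right shape but mixes values from different rows.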


In this step, we perform a train-test split on our dataset. The test set size we select is 45%, i.e. `test_size = 0.45`.

In [None]:

X_train, X_test, y_train, y_test = train_test_split(X_converted, y_converted, test_size = 0.45, random_state=0)
print(X_train)
print('*****')
print(X_test)

[[2 2 0]
 [1 0 1]
 [0 0 1]
 [1 1 1]
 [2 2 2]
 [0 0 1]
 [0 0 1]]
*****
[[1 0 1]
 [1 1 0]
 [1 1 0]
 [0 0 1]
 [0 0 0]
 [0 0 0]
 [1 1 0]]
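
With only 14 rows, it helps to see how `test_size = 0.45` translates into row counts. The sketch below uses placeholder features (`X_demo`) together with the encoded labels printed earlier. The `stratify` argument in the second call is an optional extra that is not used in this notebook; it asks `train_test_split` to keep the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(28).reshape(14, 2)  # 14 placeholder rows, 2 features
y_demo = np.array([1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1])  # encoded labels from above

# 0.45 * 14 = 6.3, which is rounded up to 7 test rows, leaving 7 for training.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.45, random_state=0
)
print(len(X_tr), len(X_te))

# Optional: a stratified split keeps both classes represented in each part.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.45, random_state=0, stratify=y_demo
)
print(np.bincount(y_tr_s), np.bincount(y_te_s))
```

This matches the seven training rows and seven test rows printed above.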


We will now train the model and then predict the labels for the 45% of the data held out as the test set.

In [None]:


clf = CategoricalNB()
clf.fit(X_train, y_train)

y_pred=clf.predict(X_test)
print(y_pred)



[1 0 0 1 1 1 0]
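
Besides hard labels, `CategoricalNB` can also report how confident it is through `predict_proba`. The sketch below uses a tiny made-up dataset (`X_toy`, `y_toy` and `clf_toy` are hypothetical, not the golf data) purely to show the shape of the result.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Hypothetical encoded training data: two categorical features, binary label.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]])
y_toy = np.array([0, 0, 1, 1, 1, 0])

clf_toy = CategoricalNB()  # alpha=1.0 (Laplace smoothing) by default
clf_toy.fit(X_toy, y_toy)

query = np.array([[1, 0]])
proba = clf_toy.predict_proba(query)  # one row per query, one column per class
label = clf_toy.predict(query)

print(clf_toy.classes_)
print(proba)
print(label)
```

Each row of `proba` sums to 1, and `predict` returns the class whose column holds the largest probability.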


We can also check the accuracy with the `accuracy_score` function, which compares the predicted and actual values of the label, as below:

In [None]:

ac = accuracy_score(y_test,y_pred)
print(ac)

0.5714285714285714
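
An accuracy of 0.5714... is 4/7, i.e. four of the seven test rows were classified correctly. The sketch below shows what `accuracy_score` computes, using made-up label arrays (`y_true_demo` and `y_pred_demo` are hypothetical, chosen to give the same 4/7), and adds a confusion matrix to show where the errors fall.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true_demo = np.array([1, 1, 0, 1, 0, 1, 0])  # hypothetical actual labels
y_pred_demo = np.array([1, 0, 0, 1, 1, 1, 1])  # hypothetical predictions

# Accuracy is simply the fraction of positions where the two arrays agree.
manual = np.mean(y_true_demo == y_pred_demo)
ac_demo = accuracy_score(y_true_demo, y_pred_demo)
print(manual, ac_demo)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
```

With so few test rows, a single misclassification moves the accuracy by about 14 percentage points, so the figure should be read with caution.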


We can try predicting the outcome for a new datapoint to see whether golf will be "Play" or "Don't Play". Here, the new, unseen datapoint is stored in a NumPy array.

In [None]:

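new_row = np.array(['Sunny', 'Hot', 'High'])
# Caution: fit_transform refits the encoder on these three strings alone, so the
# codes (High=0, Hot=1, Sunny=2) are unrelated to the per-column codes used in training.
new_row_converted = le.fit_transform(new_row)
new_row_converted = new_row_converted.reshape(1, 3)
print(new_row_converted.shape)
y_pred = clf.predict(new_row_converted)
print(y_pred)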

(1, 3)
[0]
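
One way to keep the encoding of a new datapoint consistent with the training data is to fit the encoders once and afterwards only call `transform`. The following is a minimal sketch under stated assumptions, not the method used in this notebook: it swaps in `OrdinalEncoder` (which encodes all columns of a 2-D array at once), fits on all 14 rows printed earlier rather than on the 7-row training split, and uses illustrative names (`feature_encoder`, `label_encoder`, `model`). Its prediction therefore need not match the output above.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Golf sample rows, copied from the printed output earlier in the notebook.
X_raw = np.array([
    ["Sunny", "Hot", "Low"], ["Sunny", "Hot", "High"], ["Sunny", "Hot", "High"],
    ["Sunny", "Mild", "High"], ["Sunny", "Mild", "Low"], ["Overcast", "Mild", "High"],
    ["Overcast", "Hot", "High"], ["Overcast", "Mild", "Low"], ["Overcast", "Hot", "High"],
    ["Rain", "Mild", "High"], ["Rain", "Mild", "Low"], ["Rain", "Hot", "High"],
    ["Rain", "Mild", "High"], ["Rain", "Mild", "High"],
])
y_raw = np.array([
    "Play", "Dont Play", "Dont Play", "Dont Play", "Play", "Play", "Play",
    "Play", "Play", "Dont Play", "Dont Play", "Play", "Play", "Play",
])

# Fit one encoder for the features (one mapping per column) and one for the label.
feature_encoder = OrdinalEncoder(dtype=np.int64)
X_enc = feature_encoder.fit_transform(X_raw)
label_encoder = LabelEncoder()
y_enc = label_encoder.fit_transform(y_raw)

model = CategoricalNB()
model.fit(X_enc, y_enc)

# transform (not fit_transform) reuses the mappings learned from the training data.
new_point = np.array([["Sunny", "Hot", "High"]])
new_point_enc = feature_encoder.transform(new_point)
predicted = label_encoder.inverse_transform(model.predict(new_point_enc))

print(new_point_enc)
print(predicted)
```

Because the label encoder is kept separate from the feature encoder, `inverse_transform` can turn the numeric prediction back into the original label string.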


#Question:1
Using the "London Train Sample" dataset, predict the class for a new datapoint where Day is "Saturday", Season is "Spring", Rain is "None", and Wind is "Slight". You can use the code given above for help.

In [None]:
#Solution:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import accuracy_score

# Load the data: the first four columns are the predictors, the last one is the class.
dataset = pd.read_csv('London Train Sample.csv')
X = dataset.iloc[:, :4].values
y = dataset.iloc[:, 4:].values
print(X)
print(y)

# Encode the class labels (y is 2-D here, so sklearn warns and flattens it).
le = LabelEncoder()
y_converted = le.fit_transform(y)
print(y)
print(y_converted)

# Encode each predictor column separately as a 1-D array.
x_Day = X[:, :1]
x_Day = x_Day.reshape(16)
print(x_Day.shape)
print(x_Day)
x_Day = le.fit_transform(x_Day)
print(x_Day)

x_Season = X[:, 1:2]
x_Season = x_Season.reshape(16)
print(x_Season.shape)
print(x_Season)
x_Season = le.fit_transform(x_Season)
print(x_Season)

x_Wind = X[:, 2:3]
x_Wind = x_Wind.reshape(16)
print(x_Wind.shape)
print(x_Wind)
x_Wind = le.fit_transform(x_Wind)
print(x_Wind)

x_Rain = X[:, 3:]
x_Rain = x_Rain.reshape(16)
print(x_Rain.shape)
print(x_Rain)
x_Rain = le.fit_transform(x_Rain)
print(x_Rain)

# Caution: this builds a (4, 16) array; reshape(16, 4) does not transpose it,
# so rows get mixed. np.column_stack would keep each row's values together.
X_converted = np.array([x_Day, x_Season, x_Wind, x_Rain])
X_converted = X_converted.reshape(16, 4)
print(X_converted.shape)
print(X_converted)

# Hold out 45% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X_converted, y_converted, test_size = 0.45, random_state=0)
print(X_train)
print(X_test)

# Train the model, predict on the test rows and measure the accuracy.
clf = CategoricalNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(y_pred)
ac = accuracy_score(y_test, y_pred)
print(ac)

# Caution: refitting the encoder on the new row alone assigns codes that are
# unrelated to the training codes and can fall outside the range the model saw,
# which is a likely cause of the IndexError in the output.
new_row = np.array(['Saturday', 'Spring', 'None', 'Slight'])
new_row_converted = le.fit_transform(new_row)
new_row_converted = new_row_converted.reshape(1, 4)
print(new_row_converted.shape)
y_pred = clf.predict(new_row_converted)
print(y_pred)

[['Weekday' 'Spring' 'None' 'None']
 ['Weekday' 'Winter' 'None' 'Slight']
 ['Weekday' 'Weekday' 'None' 'Slight']
 ['Weekday' 'Weekday' 'High' 'Heavy']
 ['Saturday' 'Summer' 'Normal' 'None']
 ['Weekday' 'Autumn' 'Normal' 'None']
 ['Holiday' 'Summer' 'High' 'Slight']
 ['Sunday' 'Summer' 'Normal' 'None']
 ['Weekday' 'Winter' 'High' 'Heavy']
 ['Weekday' 'Summer' 'None' 'Slight']
 ['Saturday' 'Spring' 'High' 'Heavy']
 ['Weekday' 'Summer' 'High' 'Slight']
 ['Saturday' 'Winter' 'Normal' 'None']
 ['Weekday' 'Summer' 'High' 'None']
 ['Weekday' 'Winter' 'Normal' 'Heavy']
 ['Holiday' 'Spring' 'Normal' 'Slight']]
[['Ontime']
 ['Ontime']
 ['Ontime']
 ['Late']
 ['Ontime']
 ['Very Late']
 ['Ontime']
 ['Ontime']
 ['Very Late']
 ['Ontime']
 ['Cancelled']
 ['Ontime']
 ['Late']
 ['Ontime']
 ['Very Late']
 ['Ontime']]
[['Ontime']
 ['Ontime']
 ['Ontime']
 ['Late']
 ['Ontime']
 ['Very Late']
 ['Ontime']
 ['Ontime']
 ['Very Late']
 ['Ontime']
 ['Cancelled']
 ['Ontime']
 ['Late']
 ['Ontime']
 ['Very Late']
 [

  y = column_or_1d(y, warn=True)


IndexError: ignored
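
`CategoricalNB` raises an `IndexError` when it is asked to predict a category code larger than any it saw for that feature during training. A likely cause here is that the encoder was refitted on the new row by itself, or that a category is missing from the small training split. The sketch below is one hedged way around both problems, not a definitive solution: it inlines the 16 rows printed above instead of reading the CSV, keeps the new row exactly as written in the solution (column order Day, Season, Wind, Rain, following the variable names there), encodes with a single fitted `OrdinalEncoder`, and passes `min_categories` so the model knows how many values each feature can take. The names `feature_encoder`, `label_encoder` and `n_categories` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Rows copied from the printed output above (Day, Season, Wind, Rain).
X_raw = np.array([
    ["Weekday", "Spring", "None", "None"], ["Weekday", "Winter", "None", "Slight"],
    ["Weekday", "Weekday", "None", "Slight"], ["Weekday", "Weekday", "High", "Heavy"],
    ["Saturday", "Summer", "Normal", "None"], ["Weekday", "Autumn", "Normal", "None"],
    ["Holiday", "Summer", "High", "Slight"], ["Sunday", "Summer", "Normal", "None"],
    ["Weekday", "Winter", "High", "Heavy"], ["Weekday", "Summer", "None", "Slight"],
    ["Saturday", "Spring", "High", "Heavy"], ["Weekday", "Summer", "High", "Slight"],
    ["Saturday", "Winter", "Normal", "None"], ["Weekday", "Summer", "High", "None"],
    ["Weekday", "Winter", "Normal", "Heavy"], ["Holiday", "Spring", "Normal", "Slight"],
])
y_raw = np.array([
    "Ontime", "Ontime", "Ontime", "Late", "Ontime", "Very Late", "Ontime", "Ontime",
    "Very Late", "Ontime", "Cancelled", "Ontime", "Late", "Ontime", "Very Late", "Ontime",
])

# Fit the encoders once; each feature column gets its own mapping.
feature_encoder = OrdinalEncoder(dtype=np.int64)
X_enc = feature_encoder.fit_transform(X_raw)
label_encoder = LabelEncoder()
y_enc = label_encoder.fit_transform(y_raw)

X_train, X_test, y_train, y_test = train_test_split(
    X_enc, y_enc, test_size=0.45, random_state=0
)

# Tell the model how many categories each feature has, so codes that happen
# to be absent from the training split do not cause an out-of-range lookup.
n_categories = [len(c) for c in feature_encoder.categories_]
clf = CategoricalNB(min_categories=n_categories)
clf.fit(X_train, y_train)

# Encode the new row with the already fitted encoder (transform, not fit_transform).
new_row = np.array([["Saturday", "Spring", "None", "Slight"]])
new_row_enc = feature_encoder.transform(new_row)
predicted = label_encoder.inverse_transform(clf.predict(new_row_enc))

print(new_row_enc)
print(predicted)
```

With only eight training rows, the predicted class depends heavily on which rows end up in the training split, so the result is best treated as an illustration of the workflow rather than a reliable forecast.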