# Naive Bayes

Naive Bayes is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that, within a given class, the presence of a particular feature is unrelated to the presence of any other feature.

For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, the classifier treats each of them as contributing independently to the probability that the fruit is an apple. That is why it is known as ‘Naive’.

A Naive Bayes model is easy to build and particularly useful for very large data sets. Despite its simplicity, Naive Bayes often performs well and on some problems is competitive with far more sophisticated classification methods.

Naive Bayes is a classification algorithm for binary (two-class) and multi-class classification problems. The technique is easiest to understand when described using binary or categorical input values.

It is called naive Bayes because the calculation of the probabilities for each hypothesis is simplified to make it tractable. Rather than attempting to calculate the joint probability of the attribute values, P(d1, d2, d3|h), the attributes are assumed to be conditionally independent given the target value, so the probability is calculated as P(d1|h) * P(d2|h) * P(d3|h).
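The factorisation described above can be sketched in a few lines of plain Python. The priors and per-feature likelihoods below are hypothetical numbers invented for the apple example, not values estimated from any dataset; the point is only to show how the per-feature terms are multiplied and then normalised.

```python
from math import prod

# Hypothetical probabilities, invented purely for illustration.
priors = {"apple": 0.5, "other": 0.5}
likelihoods = {
    "apple": {"red": 0.8, "round": 0.9},
    "other": {"red": 0.2, "round": 0.5},
}

def naive_posteriors(observed, priors, likelihoods):
    """Return P(h | observed) for each hypothesis h under the naive assumption."""
    # Unnormalised score: P(h) * P(d1|h) * P(d2|h) * ...
    scores = {
        h: priors[h] * prod(likelihoods[h][d] for d in observed)
        for h in priors
    }
    total = sum(scores.values())  # plays the role of P(d)
    return {h: score / total for h, score in scores.items()}

posteriors = naive_posteriors(["red", "round"], priors, likelihoods)
print(posteriors)
```

With these made-up numbers the "apple" hypothesis scores 0.5 * 0.8 * 0.9 = 0.36 against 0.5 * 0.2 * 0.5 = 0.05 for "other", so it wins after normalisation.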

## Bayes' Theorem
Bayes’ Theorem is stated as:

P(h|d) = (P(d|h) * P(h)) / P(d)

Where

###### P(h|d) is the probability of hypothesis h given the data d. This is called the posterior probability.
###### P(d|h) is the probability of data d given that the hypothesis h was true.
###### P(h) is the probability of hypothesis h being true (regardless of the data). This is called the prior probability of h.
###### P(d) is the probability of the data (regardless of the hypothesis).
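As a small worked example of the formula, the sketch below computes the posterior for a binary hypothesis. The function name `bayes_posterior` and the input numbers are assumptions made for illustration; P(d) is obtained from the law of total probability over h and not-h.

```python
def bayes_posterior(prior_h, p_d_given_h, p_d_given_not_h):
    """Compute P(h|d) = P(d|h) * P(h) / P(d) for a binary hypothesis."""
    # P(d) = P(d|h) * P(h) + P(d|not h) * P(not h)
    p_d = p_d_given_h * prior_h + p_d_given_not_h * (1.0 - prior_h)
    return p_d_given_h * prior_h / p_d

# Hypothetical inputs: P(h) = 0.3, P(d|h) = 0.9, P(d|not h) = 0.2
posterior = bayes_posterior(0.3, 0.9, 0.2)
print(round(posterior, 4))
```

Here the data raise the probability of h from the prior 0.3 to 0.27 / 0.41, roughly 0.66.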

### Useful Libraries

#### Load Dataset. Use "bank-data.csv"

In [266]:
# import dataset
import pandas as pd
df = pd.read_csv('bank-data.csv',index_col=0)
df.head() 

id,age,sex,region,income,married,children,car,save_act,current_act,mortgage,pep
ID12101,48,FEMALE,INNER_CITY,17546.0,NO,1,NO,NO,NO,NO,YES
ID12102,40,MALE,TOWN,30085.1,YES,3,YES,NO,YES,YES,NO
ID12103,51,FEMALE,INNER_CITY,16575.4,YES,0,YES,YES,YES,NO,NO
ID12104,23,FEMALE,TOWN,20375.4,YES,3,NO,NO,YES,NO,NO
ID12105,57,FEMALE,RURAL,50576.3,YES,0,NO,YES,NO,NO,NO


#### Preprocess the data

In [267]:
# import library for preprocessing
from sklearn import preprocessing as ps

In [268]:
# Transform the data using the "fit_transform(attribute)" function
lab_encoder=ps.LabelEncoder()   # create the label encoder, which assigns an integer label to each category
df.sex=lab_encoder.fit_transform(df.sex)  # fit the encoder on a categorical attribute and replace it with integer codes
df.region=lab_encoder.fit_transform(df.region)
df.married=lab_encoder.fit_transform(df.married)
df.car=lab_encoder.fit_transform(df.car)
df.save_act=lab_encoder.fit_transform(df.save_act)
df.current_act=lab_encoder.fit_transform(df.current_act)
df.mortgage=lab_encoder.fit_transform(df.mortgage)
df.head()


id,age,sex,region,income,married,children,car,save_act,current_act,mortgage,pep
ID12101,48,0,0,17546.0,0,1,0,0,0,0,YES
ID12102,40,1,3,30085.1,1,3,1,0,1,1,NO
ID12103,51,0,0,16575.4,1,0,1,1,1,0,NO
ID12104,23,0,3,20375.4,1,3,0,0,1,0,NO
ID12105,57,0,1,50576.3,1,0,0,1,0,0,NO
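The integer codes in the encoded table follow from how `LabelEncoder` numbers categories: each distinct value gets the position it has in the sorted list of distinct values. The plain-Python sketch below mimics that behaviour; `label_encode` is a hypothetical helper, and "SUBURBAN" is an assumed fourth region value added only so that code 2 has an owner.

```python
def label_encode(values):
    """Map each distinct value to its index in the sorted list of distinct values."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes

# Region values as in the first rows of the table; "SUBURBAN" is an assumption.
regions = ["INNER_CITY", "TOWN", "INNER_CITY", "TOWN", "RURAL", "SUBURBAN"]
codes, classes = label_encode(regions)
print(classes)  # sorted distinct categories
print(codes)    # integer code for each row
```

Note that the codes are arbitrary labels: a model that treats them as numbers, as Gaussian Naive Bayes does, will read TOWN (3) as "larger" than INNER_CITY (0) even though the categories have no natural order.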


#### Select independent variables and target column

In [269]:
# Select the independent variables and the target attribute
X = df[df.columns[:-1]] # Selecting the independent variables
Y = df[df.columns[-1]] # Selecting only the target (label) column


#### Import Naive Bayes Classifier library 

In [270]:
# import Classifier library
from sklearn.naive_bayes import GaussianNB

In [271]:
# Call the Classifier
clf = GaussianNB()
clf

GaussianNB(priors=None, var_smoothing=1e-09)
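To make the idea behind a Gaussian Naive Bayes classifier concrete, here is a minimal from-scratch sketch: for each class it estimates a prior plus a mean and variance per feature, then predicts the class with the highest log prior plus summed Gaussian log-likelihoods. It is an illustration under simplifying assumptions, not scikit-learn's implementation; in particular the `eps` term is a hypothetical absolute constant, whereas scikit-learn's `var_smoothing` is scaled by the largest feature variance. The toy data are invented.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate the prior and per-feature mean/variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # eps keeps the variance strictly positive (assumed absolute constant)
        variances = [
            sum((v - m) ** 2 for v in col) / n + eps
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model

def predict_one(model, row):
    """Return the class maximising log P(h) + sum of log N(d_i; mean, var)."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy data with two numeric features
X_toy = [[1.0, 20.0], [1.2, 22.0], [0.8, 19.0], [3.0, 40.0], [3.2, 42.0], [2.9, 41.0]]
y_toy = ["NO", "NO", "NO", "YES", "YES", "YES"]
model = fit_gaussian_nb(X_toy, y_toy)
pred_low = predict_one(model, [1.1, 21.0])
pred_high = predict_one(model, [3.1, 40.5])
print(pred_low, pred_high)
```

Working in log space avoids numerical underflow when many small per-feature probabilities are multiplied together.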

#### Predict the target column and find the performance of the model

In [272]:
# Divide the dataset into training and testing partition
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)

In [273]:
# Print Number of mislabeled points
y_pred=clf.fit(X_train,Y_train).predict(X_test)

print("No. of mislabeled points out of a total %d points : %d"%(X_test.shape[0],(Y_test!=y_pred).sum()))

No. of mislabeled points out of a total 144 points : 56


### Prediction and Evaluation

In [274]:
# import required libraries
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import accuracy_score


In [275]:
# Calculate and print confusion matrix and other performance measures (Refer previous labsheet)
print(classification_report(Y_test,y_pred))
print("Confusion Matrix")
print(confusion_matrix(Y_test,y_pred))
print("\n Accuracy")
print(accuracy_score(Y_test,y_pred))

              precision    recall  f1-score   support

          NO       0.62      0.78      0.69        80
         YES       0.59      0.41      0.48        64

   micro avg       0.61      0.61      0.61       144
   macro avg       0.61      0.59      0.59       144
weighted avg       0.61      0.61      0.60       144

Confusion Matrix
[[62 18]
 [38 26]]

 Accuracy
0.6111111111111112
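The figures in the report can be reproduced by hand from the confusion matrix, which helps when reading it. The sketch below uses the matrix printed above and assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = [[62, 18],
      [38, 26]]
labels = ["NO", "YES"]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))
accuracy = correct / total
mislabeled = total - correct

metrics = {}
for i, label in enumerate(labels):
    tp = cm[i][i]
    predicted = sum(cm[r][i] for r in range(len(cm)))  # column sum: predicted as this class
    actual = sum(cm[i])                                # row sum: truly this class
    metrics[label] = {"precision": tp / predicted, "recall": tp / actual}

print(accuracy, mislabeled)
print(metrics)
```

The 56 mislabeled points are the off-diagonal entries (18 + 38), and the accuracy is 88 / 144.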


#### Q1: Consider "current_act" as an irrelevant attribute. Remove it and find the accuracy of Naive Bayes classifier

In [276]:
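# display the first 5 rows of the dataframe before and after dropping "current_act"
df1=df.copy()   # work on a copy so that the original dataframe is left unchanged
df1.head(5)
df1.drop(['current_act'],axis=1,inplace=True)
df1.head(5)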

id,age,sex,region,income,married,children,car,save_act,mortgage,pep
ID12101,48,0,0,17546.0,0,1,0,0,0,YES
ID12102,40,1,3,30085.1,1,3,1,0,1,NO
ID12103,51,0,0,16575.4,1,0,1,1,0,NO
ID12104,23,0,3,20375.4,1,3,0,0,0,NO
ID12105,57,0,1,50576.3,1,0,0,1,0,NO


In [277]:
# Selecting the independent variables
X1 = df1[df1.columns[:-1]] # Selecting the independent variables
 


In [278]:
# Selecting only the target (label) column
Y1 = df1[df1.columns[-1]]

In [279]:
# Apply the classifier and Print Number of mislabeled points
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X1, Y1, test_size=0.30, random_state = 30)
y_pred=clf.fit(X_train,Y_train).predict(X_test)

print("No. of mislabeled points out of a total %d points : %d"%(X_test.shape[0],(Y_test!=y_pred).sum()))

No. of mislabeled points out of a total 144 points : 56


In [280]:
# Calculate and print confusion matrix and other performance measures
print(classification_report(Y_test,y_pred))
print("Confusion Matrix")
print(confusion_matrix(Y_test,y_pred))
print("\n Accuracy")
print(accuracy_score(Y_test,y_pred))

              precision    recall  f1-score   support

          NO       0.62      0.76      0.69        80
         YES       0.59      0.42      0.49        64

   micro avg       0.61      0.61      0.61       144
   macro avg       0.60      0.59      0.59       144
weighted avg       0.61      0.61      0.60       144

Confusion Matrix
[[61 19]
 [37 27]]

 Accuracy
0.6111111111111112


#### Q2: Write your observation

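After removing the irrelevant attribute "current_act", the accuracy of the Naive Bayes classifier is the same as that of the classifier built earlier (0.6111111111111112). The confusion matrix changes slightly, but the total number of mislabeled points remains 56.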

### Load "car.csv" dataset. 

#### Q3: Apply Naive Bayes classifier on this dataset

In [281]:
# Load the data
import pandas as pd
df2 = pd.read_csv('car.csv')
df2.columns = ["price", "maintenance_cost", "doors", "person_capacity", "luggage_boot_size", "safety", "class"]

# shuffle the DataFrame rows 
df2=df2.sample(frac=1)
df2.head() 


Unnamed: 0,price,maintenance_cost,doors,person_capacity,luggage_boot_size,safety,class
734,high,med,6,2,big,low,unacc
139,vhigh,high,3,2,med,high,unacc
484,high,vhigh,3,6,big,high,unacc
582,high,high,3,4,big,med,acc
567,high,high,3,2,small,med,unacc


In [282]:
# Preprocess and transform the data using the "fit_transform(attribute)" function
lab_encoder=ps.LabelEncoder()   # create the label encoder, which assigns an integer label to each category
df2.price=lab_encoder.fit_transform(df2.price)  # fit the encoder on a categorical attribute and replace it with integer codes
df2.maintenance_cost=lab_encoder.fit_transform(df2.maintenance_cost)
df2.doors=lab_encoder.fit_transform(df2.doors)
df2.person_capacity=lab_encoder.fit_transform(df2.person_capacity)
df2.luggage_boot_size=lab_encoder.fit_transform(df2.luggage_boot_size)
df2.safety=lab_encoder.fit_transform(df2.safety)
df2.head()

Unnamed: 0,price,maintenance_cost,doors,person_capacity,luggage_boot_size,safety,class
734,0,2,3,0,0,1,unacc
139,3,0,1,0,1,0,unacc
484,0,3,1,2,0,0,unacc
582,0,0,1,1,0,2,acc
567,0,0,1,0,2,2,unacc


In [283]:
# Select the independent variables and the target attribute
X2 = df2[df2.columns[:-1]] # Selecting the independent variables
Y2 = df2[df2.columns[-1]] # Selecting only the target (label) column

In [284]:
# Apply the classifier
from sklearn.naive_bayes import GaussianNB
clf1 = GaussianNB()
clf1

GaussianNB(priors=None, var_smoothing=1e-09)

In [285]:
# Divide the dataset into training and testing partition
# predictions for testing partition
X_train, X_test, Y_train, Y_test = train_test_split(X2, Y2, test_size=0.30, random_state = 30)
y_pred=clf1.fit(X_train,Y_train).predict(X_test)




In [286]:
# Print Number of mislabeled points
print("No. of mislabeled points out of a total %d points : %d"%(X_test.shape[0],(Y_test!=y_pred).sum()))

No. of mislabeled points out of a total 519 points : 183


In [287]:
# Calculate and print confusion matrix and other performance measures
print(classification_report(Y_test,y_pred))
print("Confusion Matrix")
print(confusion_matrix(Y_test,y_pred))
print("\n Accuracy")
print(accuracy_score(Y_test,y_pred))

              precision    recall  f1-score   support

         acc       0.57      0.09      0.16       129
        good       0.00      0.00      0.00        19
       unacc       0.84      0.87      0.85       353
       vgood       0.14      1.00      0.24        18

   micro avg       0.65      0.65      0.65       519
   macro avg       0.39      0.49      0.31       519
weighted avg       0.72      0.65      0.63       519

Confusion Matrix
[[ 12   0  54  63]
 [  5   0   6   8]
 [  4   0 306  43]
 [  0   0   0  18]]

 Accuracy
0.6473988439306358
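For the multi-class case, the macro and weighted averages in the report can also be recomputed from the confusion matrix. The sketch below uses the 4x4 matrix printed above, again assuming rows are true classes and columns are predicted classes. No sample is ever predicted as "good", so its precision is 0/0; the hypothetical `safe_div` helper returns 0.0 in that case, matching the 0.00 shown in the report.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
labels = ["acc", "good", "unacc", "vgood"]
cm = [[12, 0, 54, 63],
      [5, 0, 6, 8],
      [4, 0, 306, 43],
      [0, 0, 0, 18]]

def safe_div(a, b):
    """Return a / b, or 0.0 when the denominator is zero (undefined metric)."""
    return a / b if b else 0.0

n = len(cm)
support = [sum(row) for row in cm]                      # true samples per class
predicted = [sum(cm[r][c] for r in range(n)) for c in range(n)]
precision = [safe_div(cm[i][i], predicted[i]) for i in range(n)]
recall = [safe_div(cm[i][i], support[i]) for i in range(n)]
total = sum(support)

# Macro average: every class counts equally, however rare it is
macro_precision = sum(precision) / n
macro_recall = sum(recall) / n
# Weighted average: each class is weighted by its support
weighted_precision = sum(p * s for p, s in zip(precision, support)) / total
weighted_recall = sum(r * s for r, s in zip(recall, support)) / total

print(dict(zip(labels, precision)))
print(macro_precision, macro_recall, weighted_precision, weighted_recall)
```

The support-weighted recall equals the overall accuracy, while the macro averages are pulled down by the rare "good" class, which the classifier never predicts.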




#### Q4: Find the correlation between the attributes of the dataset.

In [288]:
# Find the pairwise correlation of the numeric attributes and arrange in ascending order
cor=df2.select_dtypes(include='number').corr().abs()   # exclude the non-numeric "class" column
s=cor.unstack()
so=s.sort_values(kind="quicksort")
so

luggage_boot_size  safety               7.713233e-19
safety             luggage_boot_size    7.713233e-19
doors              safety               8.450158e-19
safety             doors                8.450158e-19
                   person_capacity      3.085293e-18
person_capacity    safety               3.085293e-18
price              safety               4.225079e-18
safety             price                4.225079e-18
maintenance_cost   safety               4.788423e-18
safety             maintenance_cost     4.788423e-18
luggage_boot_size  person_capacity      8.693132e-04
person_capacity    luggage_boot_size    8.693132e-04
price              luggage_boot_size    9.523677e-04
luggage_boot_size  price                9.523677e-04
                   maintenance_cost     9.523677e-04
maintenance_cost   luggage_boot_size    9.523677e-04
person_capacity    price                9.523677e-04
price              person_capacity      9.523677e-04
person_capacity    doors                9.5236
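The listing above shows every pair twice, once as (a, b) and once as (b, a), because the full correlation matrix is unstacked. A minimal sketch of an alternative is given below: it computes the Pearson correlation directly and visits each unordered pair once. The column names and values are hypothetical toy data, not taken from car.csv.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical toy columns; "b" is an exact multiple of "a"
columns = {
    "a": [0, 1, 2, 3],
    "b": [0, 2, 4, 6],
    "c": [1, 0, 1, 0],
}

# One entry per unordered pair, sorted by absolute correlation (ascending)
pairs = sorted(
    (abs(pearson(columns[p], columns[q])), p, q)
    for p, q in combinations(columns, 2)
)
for value, p, q in pairs:
    print(p, q, value)
```

The most strongly correlated pair ends up last in the sorted list, which makes it easy to pick a candidate attribute to drop.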

#### Q5: Remove one of the highly correlated attributes and apply Naive Bayes classifier

In [289]:
# Drop highly correlated attribute
df3=df2.copy()   # work on a copy so that df2 is left unchanged
df3.head(5)
df3.drop(['doors'],axis=1,inplace=True)
df3.head(5)

Unnamed: 0,price,maintenance_cost,person_capacity,luggage_boot_size,safety,class
734,0,2,0,0,1,unacc
139,3,0,0,1,0,unacc
484,0,3,2,0,0,unacc
582,0,0,1,0,2,acc
567,0,0,0,2,2,unacc


In [290]:
# Apply the classifier
# Divide the dataset into training and testing partition
# predictions for testing partition
# Print Number of mislabeled points

clf2 = GaussianNB()

# Select the independent variables and the target attribute
X3 = df3[df3.columns[:-1]] # Selecting the independent variables
Y3 = df3[df3.columns[-1]] # Selecting only the target (label) column

X_train, X_test, Y_train, Y_test = train_test_split(X3, Y3, test_size=0.30, random_state = 30)
y_pred=clf2.fit(X_train,Y_train).predict(X_test)
print("No. of mislabeled points out of a total %d points : %d"%(X_test.shape[0],(Y_test!=y_pred).sum()))

No. of mislabeled points out of a total 519 points : 183


In [291]:
# Calculate and print confusion matrix and other performance measures

print(classification_report(Y_test,y_pred))
print("Confusion Matrix")
print(confusion_matrix(Y_test,y_pred))
print("\n Accuracy")
print(accuracy_score(Y_test,y_pred))

              precision    recall  f1-score   support

         acc       0.64      0.05      0.10       129
        good       0.00      0.00      0.00        19
       unacc       0.82      0.88      0.85       353
       vgood       0.14      1.00      0.24        18

   micro avg       0.65      0.65      0.65       519
   macro avg       0.40      0.48      0.30       519
weighted avg       0.72      0.65      0.61       519

Confusion Matrix
[[  7   0  59  63]
 [  4   0   7   8]
 [  0   0 311  42]
 [  0   0   0  18]]

 Accuracy
0.6473988439306358




#### Q6: Write your observation below on the performance of the models in Q3 and Q5

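In the outputs shown above, the accuracy of the Naive Bayes classifier is 0.6473988439306358 both before and after dropping the attribute "doors" (the attribute treated as highly correlated), although the confusion matrices differ slightly. The rows of the dataset are shuffled without a fixed seed, so these figures change from run to run. The values originally noted for this question, 0.6242774566473989 before and 0.6184971098265896 after dropping "doors", appear to come from a different run in which the accuracy was slightly reduced.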