# Naive Bayes

Naive Bayes is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, the classifier treats each of them as contributing independently to the probability that this fruit is an apple. That is why it is known as ‘Naive’.

A Naive Bayes model is easy to build and particularly useful for very large data sets. Despite its simplicity, Naive Bayes can perform competitively with more sophisticated classification methods on some problems.

Naive Bayes is a classification algorithm for binary (two-class) and multi-class classification problems. The technique is easiest to understand when described using binary or categorical input values.

It is called naive Bayes because the calculation of the probabilities for each hypothesis is simplified to make it tractable. Rather than attempting to calculate the joint probability of the attribute values, P(d1, d2, d3|h), the attributes are assumed to be conditionally independent given the target value, so the probability is calculated as P(d1|h) * P(d2|h) * P(d3|h).
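
The factorization described above can be sketched in a few lines of Python. The likelihoods and priors below are made-up illustrative numbers, not estimates from any dataset, and `naive_score` is a hypothetical helper name introduced only for this sketch.

```python
from math import prod

# Hypothetical per-feature likelihoods P(d_i | h) for two hypotheses.
# These numbers are illustrative assumptions, not estimates from real data.
likelihoods = {
    "apple":  {"red": 0.7, "round": 0.9, "about_3_inches": 0.8},
    "cherry": {"red": 0.9, "round": 0.9, "about_3_inches": 0.01},
}
priors = {"apple": 0.5, "cherry": 0.5}  # assumed prior P(h)


def naive_score(hypothesis, features):
    """Return P(h) * P(d1|h) * P(d2|h) * ..., the unnormalized posterior."""
    return priors[hypothesis] * prod(likelihoods[hypothesis][f] for f in features)


observed = ["red", "round", "about_3_inches"]
scores = {h: naive_score(h, observed) for h in priors}
best = max(scores, key=scores.get)  # hypothesis with the largest score
print(scores)
print("Most probable hypothesis:", best)
```

Because every hypothesis is divided by the same P(d), comparing the unnormalized scores is enough to pick the most probable class.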

## Bayes' Theorem
Bayes’ Theorem is stated as:

P(h|d) = (P(d|h) * P(h)) / P(d)

Where

- P(h|d) is the probability of hypothesis h given the data d. This is called the posterior probability.
- P(d|h) is the probability of data d given that the hypothesis h was true.
- P(h) is the probability of hypothesis h being true (regardless of the data). This is called the prior probability of h.
- P(d) is the probability of the data (regardless of the hypothesis).
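
As a small worked example of the formula, the sketch below computes a posterior from assumed values. All the probabilities are hypothetical, and P(d) is obtained with the law of total probability over h and not-h.

```python
def posterior(p_d_given_h, p_h, p_d):
    """Bayes' Theorem: P(h|d) = P(d|h) * P(h) / P(d)."""
    return p_d_given_h * p_h / p_d


# Assumed example values (hypothetical, for illustration only)
p_h = 0.4               # prior P(h)
p_d_given_h = 0.5       # P(d|h)
p_d_given_not_h = 0.25  # P(d|not h)

# P(d) = P(d|h) * P(h) + P(d|not h) * P(not h)
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

p_h_given_d = posterior(p_d_given_h, p_h, p_d)
print("P(d)   = %.4f" % p_d)
print("P(h|d) = %.4f" % p_h_given_d)
```

Observing the data raises the probability of h from the prior 0.4 to a posterior of about 0.57, because the data is twice as likely under h as under not-h.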

### Useful Libraries
To load dataset:    import pandas as pd
Preprocessing:      from sklearn import preprocessing
NB Classifier:      from sklearn.naive_bayes import GaussianNB
Train & Test Split: from sklearn.model_selection import train_test_split
K-Fold:             from sklearn.model_selection import cross_val_score

For Prediction and Evaluation

from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import accuracy_score
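
`cross_val_score` is listed above for K-Fold evaluation. A minimal sketch of how it might be used with `GaussianNB` follows. The data here is synthetic, generated with `make_classification` as a stand-in, so the scores say nothing about "bank-data.csv".

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (an assumption for illustration; not bank-data.csv)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# 5-fold cross-validation: train on 4 folds, score accuracy on the held-out fold
scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=5, scoring="accuracy")
print(scores)
print("Mean accuracy: %.3f" % scores.mean())
```

Averaging over folds gives an estimate that depends less on one particular train/test split than a single `train_test_split` run.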


#### Load Dataset. Use "bank-data.csv"

In [95]:
# import dataset
import pandas as pd
df = pd.read_csv('bank-data.csv',index_col=0)
df.head()


id,age,sex,region,income,married,children,car,save_act,current_act,mortgage,pep
ID12101,48,FEMALE,INNER_CITY,17546.0,NO,1,NO,NO,NO,NO,YES
ID12102,40,MALE,TOWN,30085.1,YES,3,YES,NO,YES,YES,NO
ID12103,51,FEMALE,INNER_CITY,16575.4,YES,0,YES,YES,YES,NO,NO
ID12104,23,FEMALE,TOWN,20375.4,YES,3,NO,NO,YES,NO,NO
ID12105,57,FEMALE,RURAL,50576.3,YES,0,NO,YES,NO,NO,NO


#### Preprocess the data

In [96]:
# import library for preprocessing
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()

In [97]:
# Transform the categorical columns to integer codes using the "fit_transform(attribute)" function
df.sex=le.fit_transform(df.sex)
df.married=le.fit_transform(df.married)
df.car=le.fit_transform(df.car)
df.save_act=le.fit_transform(df.save_act)
df.current_act=le.fit_transform(df.current_act)
df.mortgage=le.fit_transform(df.mortgage)
df.pep=le.fit_transform(df.pep)
df.region=le.fit_transform(df.region)
df

id,age,sex,region,income,married,children,car,save_act,current_act,mortgage,pep
ID12101,48,0,0,17546.0,0,1,0,0,0,0,1
ID12102,40,1,3,30085.1,1,3,1,0,1,1,0
ID12103,51,0,0,16575.4,1,0,1,1,1,0,0
ID12104,23,0,3,20375.4,1,3,0,0,1,0,0
ID12105,57,0,1,50576.3,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
ID12575,31,0,3,22678.1,0,1,1,1,1,1,1
ID12576,33,0,3,12178.5,1,2,0,1,1,1,0
ID12577,43,1,1,26106.7,0,1,0,0,1,0,1
ID12578,40,1,0,27417.6,1,0,0,1,1,1,0


#### Select independent variables and target column

In [98]:
# Select the independent variables and the target attribute
X = df[df.columns[:-1]]  # all columns except the last are the independent variables
Y = df[df.columns[-1]]   # the last column (pep) is the target label


#### Import Naive Bayes Classifier library 

In [99]:
# import Classifier library
from sklearn.naive_bayes import GaussianNB

In [100]:
# Call the Classifier
gnb=GaussianNB()

#### Predict the target column and find the performance of the model

In [101]:
# Divide the dataset into training and testing partition
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)

In [102]:
# Fit the classifier, predict on the testing partition, and print the number of mislabeled points
gnb.fit(X_train,Y_train)
predictions = gnb.predict(X_test)
print("Number of mislabeled points out of a total %d points : %d" % (X_test.shape[0], (Y_test != predictions).sum()))
#sum(Y_test!=predictions)


Number of mislabeled points out of a total 144 points : 56


### Prediction and Evaluation

In [103]:
# import required libraries
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import accuracy_score

In [104]:
# Calculate and print confusion matrix and other performance measures (Refer previous labsheet)
print(classification_report(Y_test,predictions))
print("Confusion Matrix")
print(confusion_matrix(Y_test,predictions))
print("\n Accuracy")
print(accuracy_score(Y_test,predictions))

              precision    recall  f1-score   support

           0       0.62      0.78      0.69        80
           1       0.59      0.41      0.48        64

    accuracy                           0.61       144
   macro avg       0.61      0.59      0.59       144
weighted avg       0.61      0.61      0.60       144

Confusion Matrix
[[62 18]
 [38 26]]

 Accuracy
0.6111111111111112
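
The figures in the classification report can be derived by hand from the confusion matrix reported above. The sketch below does this for class 1, assuming scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
# Confusion matrix reported above: rows are true classes, columns are predictions
cm = [[62, 18],
      [38, 26]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total  # correct predictions / all predictions

# Metrics for class 1
precision_1 = cm[1][1] / (cm[0][1] + cm[1][1])  # true 1 among predicted 1
recall_1 = cm[1][1] / (cm[1][0] + cm[1][1])     # predicted 1 among true 1
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Off-diagonal entries are the mislabeled points
mislabeled = cm[0][1] + cm[1][0]

print("accuracy    = %.4f" % accuracy)
print("precision_1 = %.2f" % precision_1)
print("recall_1    = %.2f" % recall_1)
print("f1_1        = %.2f" % f1_1)
print("mislabeled  = %d of %d" % (mislabeled, total))
```

The low recall for class 1 shows that the model misses more than half of the true class 1 cases, which the overall accuracy alone does not reveal.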


#### Q1: Consider "current_act" as an irrelevant attribute. Remove it and find the accuracy of Naive Bayes classifier

In [105]:
# Drop the "current_act" column and display the remaining independent variables
X_new=X.drop(columns='current_act')
X_new

id,age,sex,region,income,married,children,car,save_act,mortgage
ID12101,48,0,0,17546.0,0,1,0,0,0
ID12102,40,1,3,30085.1,1,3,1,0,1
ID12103,51,0,0,16575.4,1,0,1,1,0
ID12104,23,0,3,20375.4,1,3,0,0,0
ID12105,57,0,1,50576.3,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...
ID12575,31,0,3,22678.1,0,1,1,1,1
ID12576,33,0,3,12178.5,1,2,0,1,1
ID12577,43,1,1,26106.7,0,1,0,0,0
ID12578,40,1,0,27417.6,1,0,0,1,1


In [106]:
# Divide the reduced dataset into training and testing partitions
X_train, X_test, Y_train, Y_test = train_test_split(X_new, Y, test_size=0.30, random_state = 30)

In [107]:
# Display the target column of the testing partition
Y_test

id
ID12350    0
ID12201    1
ID12229    0
ID12127    0
ID12318    0
          ..
ID12279    1
ID12467    1
ID12113    1
ID12345    0
ID12309    0
Name: pep, Length: 144, dtype: int64

In [108]:
# Apply the classifier and Print Number of mislabeled points
gnb1=GaussianNB()
gnb1.fit(X_train,Y_train)
predictions1 = gnb1.predict(X_test)
print("Number of mislabeled points out of a total %d points : %d" % (X_test.shape[0], (Y_test != predictions1).sum()))
#sum(Y_test!=predictions1)

Number of mislabeled points out of a total 144 points : 56


In [109]:
# Calculate and print confusion matrix and other performance measures
print(classification_report(Y_test,predictions1))
print("Confusion Matrix")
print(confusion_matrix(Y_test,predictions1))
print("\n Accuracy")
print(accuracy_score(Y_test,predictions1))

              precision    recall  f1-score   support

           0       0.62      0.76      0.69        80
           1       0.59      0.42      0.49        64

    accuracy                           0.61       144
   macro avg       0.60      0.59      0.59       144
weighted avg       0.61      0.61      0.60       144

Confusion Matrix
[[61 19]
 [37 27]]

 Accuracy
0.6111111111111112


#### Q2: Write your observation

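Accuracy did not change after removing "current_act": it stayed at 0.611, with 56 of 144 test points mislabeled in both runs. Only the confusion matrix shifted slightly (one more false positive and one fewer false negative for class 1).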

## Correlation Checking

In [114]:
c = X.corr().abs()
print(c)
s = c.unstack()
so = s.sort_values(kind="quicksort")
print(so)

                  age       sex    region    income   married  children  \
age          1.000000  0.113416  0.031045  0.746954  0.002897  0.032233   
sex          0.113416  1.000000  0.025683  0.045622  0.013560  0.004089   
region       0.031045  0.025683  1.000000  0.010030  0.003875  0.029435   
income       0.746954  0.045622  0.010030  1.000000  0.000374  0.052406   
married      0.002897  0.013560  0.003875  0.000374  1.000000  0.031679   
children     0.032233  0.004089  0.029435  0.052406  0.031679  1.000000   
car          0.088710  0.061259  0.004216  0.112592  0.043234  0.039704   
save_act     0.203436  0.012936  0.072585  0.295660  0.049677  0.013860   
current_act  0.045850  0.033821  0.013958  0.035265  0.038228  0.032741   
mortgage     0.048380  0.070235  0.014856  0.023663  0.001784  0.047236   

                  car  save_act  current_act  mortgage  
age          0.088710  0.203436     0.045850  0.048380  
sex          0.061259  0.012936     0.033821  0.070235  
reg
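
Sorting the unstacked matrix lists every pair twice and includes the diagonal of 1.0 values. One possible way to list each attribute pair once is to keep only the upper triangle of the matrix, as sketched below. The `demo` frame holds hypothetical values and the 0.7 threshold is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Small stand-in frame (hypothetical values) with one strongly related pair
demo = pd.DataFrame({
    "age":      [23, 31, 40, 48, 57, 62],
    "income":   [12000, 21000, 30000, 34000, 50000, 55000],
    "children": [3, 0, 1, 2, 0, 1],
})

corr = demo.corr().abs()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
print(pairs)

# Pairs above an arbitrary threshold are candidates for dropping one attribute
strong = pairs[pairs > 0.7]
print(strong)
```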

In [119]:
X_new=X.drop(columns='income')

In [120]:
gnb4=GaussianNB()
X_train, X_test, Y_train, Y_test = train_test_split(X_new, Y, test_size=0.30, random_state = 30)
gnb4.fit(X_train,Y_train)
predictions4 = gnb4.predict(X_test)
print("Number of mislabeled points out of a total %d points : %d" % (X_test.shape[0], (Y_test != predictions4).sum()))

Number of mislabeled points out of a total 144 points : 55


In [121]:
print(classification_report(Y_test,predictions4))
print("Confusion Matrix")
print(confusion_matrix(Y_test,predictions4))
print("\n Accuracy")
print(accuracy_score(Y_test,predictions4))

              precision    recall  f1-score   support

           0       0.64      0.70      0.67        80
           1       0.58      0.52      0.55        64

    accuracy                           0.62       144
   macro avg       0.61      0.61      0.61       144
weighted avg       0.61      0.62      0.62       144

Confusion Matrix
[[56 24]
 [31 33]]

 Accuracy
0.6180555555555556


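#### Q6: Write your observation below on the performance of the model

Accuracy increased slightly for the bank dataset, from 0.611 to 0.618 (55 of 144 test points mislabeled instead of 56), after removing income, which is strongly correlated with age (0.747).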

### Load "car.csv" dataset. 

#### Q3: Apply Naive Bayes classifier on this dataset

In [76]:
# Load the data; header=None is used, so the column names are supplied explicitly
df = pd.read_csv('car.csv',header=None,names=['buying','maint','doors','persons','lug_boot','safety','class'])
df


,buying,maint,doors,persons,lug_boot,safety,class
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc
...,...,...,...,...,...,...,...
1723,low,low,6,6,med,med,good
1724,low,low,6,6,med,high,vgood
1725,low,low,6,6,big,low,unacc
1726,low,low,6,6,big,med,good


In [80]:
# Preprocess: transform every column to integer codes using the "fit_transform(attribute)" function
for col in df.columns:
    df[col]=le.fit_transform(df[col])
df

,buying,maint,doors,persons,lug_boot,safety,class
0,3,3,0,0,2,1,2
1,3,3,0,0,2,2,2
2,3,3,0,0,2,0,2
3,3,3,0,0,1,1,2
4,3,3,0,0,1,2,2
...,...,...,...,...,...,...,...
1723,1,1,3,2,1,2,1
1724,1,1,3,2,1,0,3
1725,1,1,3,2,0,1,2
1726,1,1,3,2,0,2,1


In [81]:
# Select the independent variables and the target attribute
X = df[df.columns[:-1]]  # all columns except the last are the independent variables
Y = df[df.columns[-1]]   # the last column (class) is the target label
X

,buying,maint,doors,persons,lug_boot,safety
0,3,3,0,0,2,1
1,3,3,0,0,2,2
2,3,3,0,0,2,0
3,3,3,0,0,1,1
4,3,3,0,0,1,2
...,...,...,...,...,...,...
1723,1,1,3,2,1,2
1724,1,1,3,2,1,0
1725,1,1,3,2,0,1
1726,1,1,3,2,0,2


In [82]:
# Apply the classifier
gnb2=GaussianNB()

In [83]:
# Divide the dataset into training and testing partition
# predictions for testing partition
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)
gnb2.fit(X_train,Y_train)
predictions2 = gnb2.predict(X_test)


In [85]:
# Print Number of mislabeled points
print("Number of mislabeled points out of a total %d points : %d" % (X_test.shape[0], (Y_test != predictions2).sum()))

Number of mislabeled points out of a total 519 points : 195


In [86]:
# Calculate and print confusion matrix and other performance measures
print(classification_report(Y_test,predictions2))
print("Confusion Matrix")
print(confusion_matrix(Y_test,predictions2))
print("\n Accuracy")
print(accuracy_score(Y_test,predictions2))

              precision    recall  f1-score   support

           0       0.45      0.12      0.19       111
           1       0.00      0.00      0.00        21
           2       0.85      0.79      0.82       368
           3       0.13      1.00      0.23        19

    accuracy                           0.62       519
   macro avg       0.36      0.48      0.31       519
weighted avg       0.70      0.62      0.63       519

Confusion Matrix
[[ 13   0  45  53]
 [  2   0   7  12]
 [ 14   0 292  62]
 [  0   0   0  19]]

 Accuracy
0.6242774566473989
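
`GaussianNB` models each feature as normally distributed within a class, whereas the car attributes are label-encoded categories. scikit-learn also provides `CategoricalNB`, which estimates a probability per category instead. The sketch below shows how it could be applied. The `toy` frame is a tiny hypothetical sample in the style of "car.csv", not the real data, and whether this variant improves accuracy on the full dataset is not tested here.

```python
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Tiny hypothetical sample in the style of car.csv (not the real data)
toy = pd.DataFrame({
    "safety":  ["low", "low", "med", "high", "high", "med", "low", "high"],
    "persons": ["2", "4", "4", "4", "more", "2", "more", "more"],
    "class":   ["unacc", "unacc", "acc", "acc", "acc", "unacc", "unacc", "acc"],
})

# Encode each categorical feature as integer codes 0..n_categories-1
encoder = OrdinalEncoder()
X_toy = encoder.fit_transform(toy[["safety", "persons"]]).astype(int)
y_toy = toy["class"]

# CategoricalNB estimates P(category | class) per feature with Laplace smoothing
cnb = CategoricalNB()
cnb.fit(X_toy, y_toy)

# Classify a new car using the same encoder that was fitted on the training data
query = pd.DataFrame({"safety": ["high"], "persons": ["more"]})
pred = cnb.predict(encoder.transform(query).astype(int))
print(pred)
```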


#### Q4: Find the correlation between the attributes of the dataset.

In [91]:
# Find the pairwise correlation of attributes and arrange in ascending order
c = df.corr().abs()
print(c)
s = c.unstack()
so = s.sort_values(kind="quicksort")
print(so)

            buying     maint     doors   persons  lug_boot    safety     class
buying    1.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.051424
maint     0.000000  1.000000  0.000000  0.000000  0.000000  0.000000  0.040194
doors     0.000000  0.000000  1.000000  0.000000  0.000000  0.000000  0.031327
persons   0.000000  0.000000  0.000000  1.000000  0.000000  0.000000  0.299468
lug_boot  0.000000  0.000000  0.000000  0.000000  1.000000  0.000000  0.033184
safety    0.000000  0.000000  0.000000  0.000000  0.000000  1.000000  0.021044
class     0.051424  0.040194  0.031327  0.299468  0.033184  0.021044  1.000000
persons   buying      0.000000
safety    buying      0.000000
lug_boot  safety      0.000000
          persons     0.000000
          doors       0.000000
          maint       0.000000
          buying      0.000000
safety    persons     0.000000
persons   safety      0.000000
          lug_boot    0.000000
          doors       0.000000
          maint       0.000

#### Q5: Remove one of the highly correlated attributes and apply Naive Bayes classifier

In [92]:
# Drop "persons": no pair of attributes is correlated, so the attribute with the highest correlation to the class (0.299) is removed
X_new=X.drop(columns='persons')
X_new

,buying,maint,doors,lug_boot,safety
0,3,3,0,2,1
1,3,3,0,2,2
2,3,3,0,2,0
3,3,3,0,1,1
4,3,3,0,1,2
...,...,...,...,...,...
1723,1,1,3,1,2
1724,1,1,3,1,0
1725,1,1,3,0,1
1726,1,1,3,0,2


In [93]:
# Apply the classifier
gnb3=GaussianNB()
# Divide the dataset into training and testing partition
X_train, X_test, Y_train, Y_test = train_test_split(X_new, Y, test_size=0.30, random_state = 30)
# Fit the model and make predictions for the testing partition
gnb3.fit(X_train,Y_train)
predictions3 = gnb3.predict(X_test)
# Print Number of mislabeled points
print("Number of mislabeled points out of a total %d points : %d" % (X_test.shape[0], (Y_test != predictions3).sum()))

Number of mislabeled points out of a total 519 points : 202


In [94]:
# Calculate and print confusion matrix and other performance measures
print(classification_report(Y_test,predictions3))
print("Confusion Matrix")
print(confusion_matrix(Y_test,predictions3))
print("\n Accuracy")
print(accuracy_score(Y_test,predictions3))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00       111
           1       0.00      0.00      0.00        21
           2       0.82      0.81      0.81       368
           3       0.12      1.00      0.22        19

    accuracy                           0.61       519
   macro avg       0.23      0.45      0.26       519
weighted avg       0.58      0.61      0.58       519

Confusion Matrix
[[  0   0  58  53]
 [  0   0   9  12]
 [  0   0 298  70]
 [  0   0   0  19]]

 Accuracy
0.6107899807321773


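#### Q6: Write your observation below on the performance of the model in Q4 and Q6

Accuracy decreased for the car dataset, from 0.624 to 0.611 (202 of 519 test points mislabeled instead of 195), after removing persons. The pairwise correlation between the attributes was 0, so no attribute was redundant, and removing persons discarded the attribute most correlated with the class (0.299).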