# ***Naive Bayes/Gaussian Classifier***

First, we mount Google Drive so that the notebook can read the dataset.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Importing the necessary libraries.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn import preprocessing

Link for the dataset:

https://drive.google.com/file/d/1Tz6E-K3hWu4vxveTh8-7PqsarUetNeIS/view?usp=sharing

Loading the dataset.

In [None]:
dataset=pd.read_csv("/content/drive/MyDrive/SML/Datasets/titanicsurvival.csv")
dataset

Unnamed: 0,Pclass,Sex,Age,Fare,Survived
0,3,male,22.0,7.2500,0
1,1,female,38.0,71.2833,1
2,3,female,26.0,7.9250,1
3,1,female,35.0,53.1000,1
4,3,male,35.0,8.0500,0
...,...,...,...,...,...
886,2,male,27.0,13.0000,0
887,1,female,19.0,30.0000,1
888,3,female,,23.4500,0
889,1,male,26.0,30.0000,1


In [None]:
dataset.shape

(891, 5)

There are 891 rows and 5 columns in the dataset. That is, there are 5 variables of interest.

In [None]:
dataset.isnull().values.any()

True

The check returns True, so the dataset contains at least one null value.
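The check above only says whether any null exists. A minimal sketch of counting nulls per column follows, using a small hypothetical DataFrame (`toy`) rather than the Titanic file so that it runs on its own; in the notebook the same calls would be applied to `dataset`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Age": [22.0, np.nan, 26.0],
    "Fare": [7.25, 71.2833, np.nan],
})

# True if at least one cell anywhere in the frame is null
has_nulls = toy.isnull().values.any()

# Number of null cells in each column
nulls_per_column = toy.isnull().sum()

print(has_nulls)
print(nulls_per_column)
```

Counting per column shows which variables are affected before deciding how to handle the missing values.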

In [None]:
print(dataset['Sex'].unique())

['male' 'female']


Now that we know the Sex column has exactly two categories, we map them to numeric codes.

Here, we map male to 0 and female to 1.

In [None]:
# Encode the two categories as integers: male -> 0, female -> 1
dataset['Sex']=dataset['Sex'].map({'male':0,'female':1}).astype(int)
print(dataset)

     Pclass  Sex   Age     Fare  Survived
0         3    0  22.0   7.2500         0
1         1    1  38.0  71.2833         1
2         3    1  26.0   7.9250         1
3         1    1  35.0  53.1000         1
4         3    0  35.0   8.0500         0
..      ...  ...   ...      ...       ...
886       2    0  27.0  13.0000         0
887       1    1  19.0  30.0000         1
888       3    1   NaN  23.4500         0
889       1    0  26.0  30.0000         1
890       3    0  32.0   7.7500         0

[891 rows x 5 columns]
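The mapping relies on every value in the column being a key of the dictionary. A small sketch of how to check this follows; the Series `sex` and its deliberately misspelled entry are hypothetical and only serve to show that `map` leaves unmatched values as NaN.

```python
import pandas as pd

# Hypothetical column with one value ("Male") that is not a key in the mapping
sex = pd.Series(["male", "female", "Male"])

mapped = sex.map({"male": 0, "female": 1})

# Values without a matching key become NaN, so count them before casting to int
unmapped_count = int(mapped.isna().sum())
print(mapped)
print("Unmapped values:", unmapped_count)
```

In this notebook the unique values are exactly 'male' and 'female', so no value is left unmapped and the cast to int succeeds.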


In [None]:
dataset=dataset.dropna()
dataset.shape

(714, 5)

Because the dataset contains null values, we drop them.

Note that dropna() works row-wise: every row that contains at least one null value is removed.

After dropping the null values, the dataset shrinks from [891 x 5] to [714 x 5].
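As a rough illustration of the row-wise behaviour, the sketch below applies `dropna()` to a small hypothetical DataFrame (`toy`). It also shows that the original index labels are kept, which is why the labels in the `head(10)` output jump from 4 to 6; `reset_index(drop=True)` is one optional way to renumber them.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows 1 and 2 each contain one null value
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 8.05, np.nan, 53.1],
})

# Drop every row that has at least one null cell
cleaned = toy.dropna()
print(toy.shape, "->", cleaned.shape)

# Surviving rows keep their original labels
print(list(cleaned.index))

# Optional: renumber the rows from 0
renumbered = cleaned.reset_index(drop=True)
print(list(renumbered.index))
```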

In [None]:
dataset.head(10)

Unnamed: 0,Pclass,Sex,Age,Fare,Survived
0,3,0,22.0,7.25,0
1,1,1,38.0,71.2833,1
2,3,1,26.0,7.925,1
3,1,1,35.0,53.1,1
4,3,0,35.0,8.05,0
6,1,0,54.0,51.8625,0
7,3,0,2.0,21.075,0
8,3,1,27.0,11.1333,1
9,2,1,14.0,30.0708,1
10,3,1,4.0,16.7,1


We separate the variables into independent variables (X, all columns except the last) and the dependent variable (Y, the Survived column) as follows.

In [None]:
X=dataset.iloc[:,:-1].values
X

array([[ 3.    ,  0.    , 22.    ,  7.25  ],
       [ 1.    ,  1.    , 38.    , 71.2833],
       [ 3.    ,  1.    , 26.    ,  7.925 ],
       ...,
       [ 1.    ,  1.    , 19.    , 30.    ],
       [ 1.    ,  0.    , 26.    , 30.    ],
       [ 3.    ,  0.    , 32.    ,  7.75  ]])

In [None]:
Y=dataset.iloc[:,-1].values
Y

array([0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1,
       1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1,
       0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1,
       1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.25,random_state=0)

We split the dataset into training data and test data.

The test set is 25% of the data; the remaining 75% is used for training.

Below are the dimensions of the training and test sets for both the independent and dependent variables.

In [None]:
X_train.shape

(535, 4)

In [None]:
X_test.shape

(179, 4)

In [None]:
y_train.shape

(535,)

In [None]:
y_test.shape

(179,)
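The sizes above can be checked by hand: 25% of 714 is 178.5, and the shapes shown (179 test rows, 535 training rows) are consistent with the test count being rounded up. The sketch below reproduces this with placeholder arrays of the same shape; `X_dummy` and `y_dummy` are hypothetical stand-ins, not the Titanic data.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 714  # rows left after dropping nulls

# Placeholder arrays with the same shapes as X and Y
X_dummy = np.zeros((n_samples, 4))
y_dummy = np.zeros(n_samples, dtype=int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, y_dummy, test_size=0.25, random_state=0
)

# 0.25 * 714 = 178.5, rounded up to 179 test rows
expected_test = math.ceil(0.25 * n_samples)
print(X_tr.shape, X_te.shape, expected_test)
```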

Now we build a Gaussian Naive Bayes classifier.

In [None]:
from sklearn.naive_bayes import GaussianNB

# Build a Gaussian Classifier
model = GaussianNB()

# Model training
model.fit(X_train, y_train)
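To make the fitted model less of a black box, here is a minimal sketch of the idea behind Gaussian Naive Bayes: for each class, estimate a per-feature mean and variance, then pick the class with the highest log prior plus summed Gaussian log-likelihood. The helper `manual_gnb_predict` and the synthetic data are hypothetical and written only for illustration. The sketch also omits the small variance-smoothing term that scikit-learn adds, so it is an approximation of `GaussianNB`, not its exact implementation.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data (illustrative only, not the Titanic data)
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 2)),
    rng.normal(loc=4.0, scale=1.0, size=(50, 2)),
])
y_toy = np.array([0] * 50 + [1] * 50)


def manual_gnb_predict(X_fit, y_fit, X_new):
    """Hypothetical helper: Gaussian Naive Bayes without variance smoothing."""
    classes = np.unique(y_fit)
    scores = []
    for c in classes:
        Xc = X_fit[y_fit == c]
        mean = Xc.mean(axis=0)  # per-feature mean for class c
        var = Xc.var(axis=0)    # per-feature variance for class c
        log_prior = np.log(len(Xc) / len(X_fit))
        # Features are treated as independent, so their log-likelihoods add up
        log_lik = -0.5 * np.sum(
            np.log(2 * np.pi * var) + (X_new - mean) ** 2 / var, axis=1
        )
        scores.append(log_prior + log_lik)
    # Choose the class with the highest posterior score for each sample
    return classes[np.argmax(np.array(scores), axis=0)]


manual_pred = manual_gnb_predict(X_toy, y_toy, X_toy)
sklearn_pred = GaussianNB().fit(X_toy, y_toy).predict(X_toy)
agreement = np.mean(manual_pred == sklearn_pred)
print("Agreement with GaussianNB:", agreement)
```

The independence assumption between features is what makes the classifier "naive"; it is what allows the per-feature log-likelihoods to be summed.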

Then we predict the output for a single test sample as a quick spot check of the model.

In [None]:
# Predict Output
predicted = model.predict([X_test[1]])

print("Actual Value:", y_test[1])
print("Predicted Value:", predicted[0])

Actual Value: 0
Predicted Value: 1
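This particular sample is misclassified (actual 0, predicted 1), but a single sample says little about overall performance. A sketch of evaluating a fitted model on a whole test set follows, using `accuracy_score`, `confusion_matrix` and `classification_report` from `sklearn.metrics`. The data here is synthetic and the names `X_demo`, `y_demo` and `demo_model` are hypothetical; in the notebook the same calls would take `model`, `X_test` and `y_test`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data standing in for the real features and labels
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 4)),
    rng.normal(loc=2.0, scale=1.0, size=(100, 4)),
])
y_demo = np.array([0] * 100 + [1] * 100)

Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
demo_model = GaussianNB().fit(Xtr, ytr)

# Predict every test sample at once instead of a single row
y_pred = demo_model.predict(Xte)

acc = accuracy_score(yte, y_pred)
cm = confusion_matrix(yte, y_pred)  # rows: actual class, columns: predicted class

print("Accuracy:", acc)
print(cm)
print(classification_report(yte, y_pred))
```

The confusion matrix separates the two kinds of error (predicting survival for a non-survivor and the reverse), which a single accuracy number hides.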
