# Naive Bayes Classifier (GaussianNB)

In [1]:
import pandas as pd
df = pd.read_csv('titanic.csv')
df

Unnamed: 0,PassengerId,Name,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,1,"Braund, Mr. Owen Harris",3,male,22.0,1,0,A/5 21171,7.2500,,S,0
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,"Heikkinen, Miss. Laina",3,female,26.0,0,0,STON/O2. 3101282,7.9250,,S,1
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,female,35.0,1,0,113803,53.1000,C123,S,1
4,5,"Allen, Mr. William Henry",3,male,35.0,0,0,373450,8.0500,,S,0
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,"Montvila, Rev. Juozas",2,male,27.0,0,0,211536,13.0000,,S,0
887,888,"Graham, Miss. Margaret Edith",1,female,19.0,0,0,112053,30.0000,B42,S,1
888,889,"Johnston, Miss. Catherine Helen ""Carrie""",3,female,,1,2,W./C. 6607,23.4500,,S,0
889,890,"Behr, Mr. Karl Howell",1,male,26.0,0,0,111369,30.0000,C148,C,1


**Removing Unnecessary Columns:**

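Columns such as PassengerId, Name, Ticket, and Cabin are identifiers or sparse, free-form fields that are unlikely to help predict survival, so we drop them. To keep the example small, SibSp, Parch, and Embarked are dropped as well, leaving Pclass, Sex, Age, and Fare as features.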





In [2]:
df.drop(['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin', 'Embarked'], axis='columns', inplace=True)
df

Unnamed: 0,Pclass,Sex,Age,Fare,Survived
0,3,male,22.0,7.2500,0
1,1,female,38.0,71.2833,1
2,3,female,26.0,7.9250,1
3,1,female,35.0,53.1000,1
4,3,male,35.0,8.0500,0
...,...,...,...,...,...
886,2,male,27.0,13.0000,0
887,1,female,19.0,30.0000,1
888,3,female,,23.4500,0
889,1,male,26.0,30.0000,1


**Separating Features and Target:**

In [3]:
target = df.Survived
inputs = df.drop('Survived', axis='columns')

**Converting "Sex" column to Numbers:**

The Sex column contains strings ("male" and "female"), but scikit-learn estimators such as GaussianNB expect numeric input, so we encode the column as numbers.

In [4]:
dummies = pd.get_dummies(inputs.Sex)
dummies.head()

Unnamed: 0,female,male
0,False,True
1,True,False
2,True,False
3,True,False
4,False,True
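To see what `pd.get_dummies` produces in isolation, here is a minimal sketch on a made-up Series (the values are illustrative, not rows from `titanic.csv`). It also shows that the two indicator columns are complementary, and that `drop_first=True` is one way to keep a single column; the notebook itself keeps both.

```python
import pandas as pd

# Made-up values standing in for the Sex column
sex = pd.Series(["male", "female", "female", "male"], name="Sex")

# One indicator column per category, ordered alphabetically
dummies = pd.get_dummies(sex)
print(dummies)

# Exactly one indicator is set in every row, so the columns are complementary
row_totals = dummies.astype(int).sum(axis="columns")
print(row_totals.tolist())

# drop_first=True drops the first category and keeps a single indicator column
single = pd.get_dummies(sex, drop_first=True)
print(list(single.columns))
```

Because `female` and `male` carry the same information, a model that treats features as independent effectively counts that information twice. Whether this matters in practice depends on the data.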


**Using one-hot encoding to create separate columns for male and female, then removing the original `Sex` column:**

In [5]:
inputs = pd.concat([inputs, dummies], axis='columns')
inputs.head()

Unnamed: 0,Pclass,Sex,Age,Fare,female,male
0,3,male,22.0,7.25,False,True
1,1,female,38.0,71.2833,True,False
2,3,female,26.0,7.925,True,False
3,1,female,35.0,53.1,True,False
4,3,male,35.0,8.05,False,True


In [6]:
inputs.drop('Sex', axis='columns', inplace=True)
inputs.head()

Unnamed: 0,Pclass,Age,Fare,female,male
0,3,22.0,7.25,False,True
1,1,38.0,71.2833,True,False
2,3,26.0,7.925,True,False
3,1,35.0,53.1,True,False
4,3,35.0,8.05,False,True


**Handling Missing Values:**

First, check which columns contain NaN values:

In [7]:
inputs.columns[inputs.isna().any()]

Index(['Age'], dtype='object')

In [8]:
import math
# Fill missing ages with the mean age, rounded down to a whole year
inputs.Age = inputs.Age.fillna(math.floor(inputs.Age.mean()))
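The imputation step can be checked on a small frame. The sketch below uses a made-up four-row DataFrame (an assumption, not the real data) to count missing values per column and apply the same floor-of-mean fill.

```python
import math
import pandas as pd

# Made-up rows for illustration; one Age value is missing
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.28, 23.45, 8.05],
})

# Number of NaN values in each column
missing_counts = toy.isna().sum()
print(missing_counts)

# Same strategy as the notebook: floor of the column mean
fill_value = math.floor(toy["Age"].mean())
toy["Age"] = toy["Age"].fillna(fill_value)
print(toy["Age"].tolist())
```

The mean is a simple default. When the column has outliers, the median (`toy["Age"].median()`) may be a more robust choice.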

**Splitting Data for Training & Testing:**

In [9]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(inputs, target, test_size=0.2)
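Without a fixed seed, `train_test_split` shuffles differently on every run, so the accuracy reported later will vary between runs. A minimal sketch of a reproducible, class-balanced split follows. The toy lists, the seed value `42`, and the use of `stratify` are assumptions for illustration, not part of the original notebook.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for `inputs` and `target`
toy_inputs = [[i] for i in range(20)]
toy_target = [0] * 12 + [1] * 8

def reproducible_split():
    # random_state fixes the shuffle; stratify keeps class proportions similar
    return train_test_split(
        toy_inputs, toy_target,
        test_size=0.2, random_state=42, stratify=toy_target,
    )

split_a = reproducible_split()
split_b = reproducible_split()
X_train, X_test, y_train, y_test = split_a

print(len(X_train), len(X_test))
print(split_a == split_b)  # the same seed gives the same split
```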

**Training the Naïve Bayes Model:**

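Naive Bayes assumes that the features (e.g., Age, Fare) are conditionally independent given the class (here, Survived). This assumption rarely holds exactly, but it makes training and prediction fast and efficient.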

If the data consists of word counts (e.g., text analysis) → Use MultinomialNB.

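If the data has continuous numerical values (e.g., real-world measurements) → Use GaussianNB, which models each feature within a class as normally distributed. Age and Fare are continuous, so GaussianNB is used here.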
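To make the independence assumption concrete, here is a simplified pure-Python sketch of the Gaussian Naive Bayes calculation: a class prior multiplied by one normal density per feature, then normalized. The helper names (`gaussian_pdf`, `fit_gaussian_nb`, `posterior`) and the toy (age, fare) rows are hypothetical. scikit-learn's `GaussianNB` follows the same idea but works in log space and applies its own variance smoothing, so its numbers will not match this sketch exactly.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gaussian_nb(rows, labels):
    """Estimate a prior plus per-feature mean and variance for every class."""
    params = {}
    for cls in sorted(set(labels)):
        cls_rows = [r for r, y in zip(rows, labels) if y == cls]
        n = len(cls_rows)
        n_features = len(cls_rows[0])
        means = [sum(r[j] for r in cls_rows) / n for j in range(n_features)]
        # Tiny constant avoids division by zero (a simplification)
        variances = [
            sum((r[j] - means[j]) ** 2 for r in cls_rows) / n + 1e-9
            for j in range(n_features)
        ]
        params[cls] = (n / len(rows), means, variances)
    return params

def posterior(params, x):
    """P(class | x): prior times one Gaussian per feature, then normalized."""
    scores = {}
    for cls, (prior, means, variances) in params.items():
        score = prior
        for value, mean, var in zip(x, means, variances):
            score *= gaussian_pdf(value, mean, var)
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}

# Made-up (age, fare) rows, not taken from titanic.csv
toy_rows = [(22.0, 7.0), (35.0, 8.0), (40.0, 15.0),
            (38.0, 70.0), (26.0, 55.0), (30.0, 80.0)]
toy_labels = [0, 0, 0, 1, 1, 1]

params = fit_gaussian_nb(toy_rows, toy_labels)
probs = posterior(params, (29.0, 65.0))
print(probs)
```

In this toy data the fare of 65 lies far from the class-0 fares, so almost all of the probability mass goes to class 1.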

In [10]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()

In [11]:
model.fit(X_train, y_train)

**Evaluating the Model’s Accuracy:**

In [12]:
model.score(X_test, y_test)

0.8044692737430168

In [13]:
X_test[:10]

Unnamed: 0,Pclass,Age,Fare,female,male
832,3,29.0,7.2292,False,True
856,1,45.0,164.8667,True,False
274,3,29.0,7.75,True,False
817,2,31.0,37.0042,False,True
22,3,15.0,8.0292,True,False
873,3,47.0,9.0,False,True
449,1,52.0,30.5,False,True
19,3,29.0,7.225,True,False
606,3,30.0,7.8958,False,True
326,3,61.0,6.2375,False,True


In [14]:
y_test[:10]

Unnamed: 0,Survived
832,0
856,1
274,1
817,0
22,1
873,0
449,1
19,1
606,0
326,0


In [15]:
model.predict(X_test[:10])

array([0, 1, 1, 0, 1, 0, 0, 1, 0, 0])
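`model.score` reports accuracy: the fraction of rows whose predicted label equals the true label. The sketch below repeats that calculation by hand for the ten rows displayed above, with both lists copied from the `y_test[:10]` and `model.predict(X_test[:10])` outputs.

```python
# True labels and predictions for the ten rows shown above
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 1, 0, 1, 0, 0, 1, 0, 0]

# Accuracy = correct predictions / total predictions
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(accuracy)  # 0.9

# Positions (0-based, within this slice) where the prediction is wrong
mismatches = [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]
print(mismatches)  # [6]
```

Position 6 corresponds to the row with index 449, which survived but was predicted not to. Accuracy on ten rows is only a spot check; the 0.80 figure above is computed on the whole test set.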

In [16]:
model.predict_proba(X_test[:10])

array([[9.88264208e-01, 1.17357919e-02],
       [3.56313018e-06, 9.99996437e-01],
       [5.99801993e-02, 9.40019801e-01],
       [9.71075254e-01, 2.89247456e-02],
       [4.54459622e-02, 9.54554038e-01],
       [9.88901734e-01, 1.10982664e-02],
       [9.10552553e-01, 8.94474466e-02],
       [5.98345333e-02, 9.40165467e-01],
       [9.88444621e-01, 1.15553785e-02],
       [9.86044547e-01, 1.39554527e-02]])
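Each row of `predict_proba` holds the estimated probabilities for class 0 (did not survive) and class 1 (survived), in that order, and `predict` returns the class with the larger value. A short sketch using the array printed above:

```python
import numpy as np

# Probabilities copied from the predict_proba output above
proba = np.array([
    [9.88264208e-01, 1.17357919e-02],
    [3.56313018e-06, 9.99996437e-01],
    [5.99801993e-02, 9.40019801e-01],
    [9.71075254e-01, 2.89247456e-02],
    [4.54459622e-02, 9.54554038e-01],
    [9.88901734e-01, 1.10982664e-02],
    [9.10552553e-01, 8.94474466e-02],
    [5.98345333e-02, 9.40165467e-01],
    [9.88444621e-01, 1.15553785e-02],
    [9.86044547e-01, 1.39554527e-02],
])

# Each row sums to 1 (up to rounding in the printed values)
row_sums = proba.sum(axis=1)

# The predicted label is the column with the highest probability
labels = proba.argmax(axis=1)
print(labels.tolist())  # [0, 1, 1, 0, 1, 0, 0, 1, 0, 0]

# Row where the model is least confident in its prediction
least_confident = int(proba.max(axis=1).argmin())
print(least_confident)  # 6
```

The least confident row in this slice (position 6, index 449) is also the one the model gets wrong.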