# Naive Bayes Classifier from Scratch – Titanic Survival Prediction

This notebook implements a **Naive Bayes classifier from scratch** using NumPy to predict whether a passenger survived the Titanic disaster. The aim is to understand the fundamental principles of probabilistic classification, including likelihood estimation and applying Bayes' theorem step by step. The cells below fit scikit-learn's `GaussianNB` and compare it with logistic regression and k-nearest neighbours.

**Key Steps:**

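1. Loading the data and performing exploratory data analysis (EDA)
2. Data preprocessing (imputing missing values, dropping redundant columns, one-hot encoding categorical features)
3. Fitting Gaussian Naive Bayes (class priors for Survived vs. Not Survived, per-class feature likelihoods) and tuning `var_smoothing` with grid search
4. Evaluating performance using accuracy and comparing against logistic regression and KNN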

**Goal:**
To build a simple probabilistic model on a real-world dataset and gain an intuitive understanding of how Naive Bayes works under the hood.
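
The decision rule behind this goal can be written compactly. The block below is a sketch in notation introduced here, not taken from the notebook: $c$ is a class (0 = not survived, 1 = survived), $x_j$ is the $j$-th of $d$ features, and $\mu_{cj}$, $\sigma_{cj}^2$ are the mean and variance of feature $j$ within class $c$. It assumes, as Gaussian Naive Bayes does, that features are conditionally independent given the class and normally distributed within each class.

```latex
\hat{y} = \arg\max_{c \in \{0, 1\}}
  \left[ \log P(c) + \sum_{j=1}^{d} \log \mathcal{N}\!\left(x_j \mid \mu_{cj}, \sigma_{cj}^2\right) \right],
\qquad
\mathcal{N}\!\left(x \mid \mu, \sigma^2\right)
  = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
```

Working with logarithms turns the product of per-feature likelihoods into a sum, which avoids numerical underflow when there are many features.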


In [68]:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB

df = sns.load_dataset('titanic')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [69]:
df.head()

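|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |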


In [70]:
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [71]:
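# fill missing ages with the mean age and missing embarkation codes with the most common code
df['age'] = df['age'].fillna(df['age'].mean())
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
# drop the columns that either have too many missing values (e.g. `deck`, 688 of 891 missing)
# or that duplicate other columns (e.g. `alive` is the target `survived` written as yes/no)
# note: `embarked` is dropped here, so the fill above does not reach the final data, and
# `embark_town` is kept with its 2 missing values
df = df.drop(['alive', 'deck', 'adult_male', 'who', 'class', 'embarked'], axis=1)
df.shape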
df.shape

(891, 9)
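
The two imputation strategies used above can be checked on a small example. This is a minimal sketch on a made-up frame (`toy` is hypothetical, not the Titanic data); it fills the numeric column with its mean and the categorical column with its mode. Filling `embark_town` rather than `embarked` is an assumption about which column the preprocessing means to keep.

```python
import numpy as np
import pandas as pd

# hypothetical toy data standing in for the Titanic frame
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "embark_town": ["Southampton", None, "Southampton"],
})

# numeric column: replace missing values with the column mean (NaN is skipped by mean())
toy["age"] = toy["age"].fillna(toy["age"].mean())
# categorical column: replace missing values with the most frequent value
toy["embark_town"] = toy["embark_town"].fillna(toy["embark_town"].mode()[0])

missing_after = int(toy.isnull().sum().sum())
print(toy)
print("missing values left:", missing_after)
```

Mean imputation keeps the column average unchanged but shrinks its variance, which matters for a model that estimates a per-class variance for every feature.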

In [72]:
df['embark_town'].unique()

array(['Southampton', 'Cherbourg', 'Queenstown', nan], dtype=object)

In [73]:
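# we need to convert the categorical columns (`sex` and `embark_town`) to numerical ones
df = pd.get_dummies(df)
# `get_dummies` creates one indicator column per category, so one column in each group is
# redundant and we drop it. For example, having both male and female columns is redundant,
# because male being 0 implies female being 1.
# note: the 2 rows with a missing `embark_town` get 0 in every `embark_town_*` column, so
# after this drop they look the same as Cherbourg rows
df = df.drop(['sex_female', 'embark_town_Cherbourg'], axis=1)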
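
The redundancy described in the comments can be seen on a small example. This sketch uses a made-up frame (`toy` is hypothetical) and also shows `drop_first=True`, a `pd.get_dummies` option that drops the first category of each encoded column, as an alternative to dropping columns by name.

```python
import pandas as pd

# hypothetical toy data with the same two categorical columns
toy = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "embark_town": ["Southampton", "Cherbourg", "Queenstown"],
})

# one indicator column per category
full = pd.get_dummies(toy)
# same encoding, minus the first (alphabetically smallest) category of each column
reduced = pd.get_dummies(toy, drop_first=True)

# the two sex indicators always sum to 1, so one of them carries no extra information
sex_sum = full["sex_male"].astype(int) + full["sex_female"].astype(int)

print(sorted(full.columns))
print(sorted(reduced.columns))
print(sex_sum.tolist())
```

On this toy frame `drop_first=True` removes `sex_female` and `embark_town_Cherbourg`, the same two columns the cell above drops by hand.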

In [74]:
label = df['survived']
data = df.drop(['survived'], axis=1)

In [75]:
x_train, x_test, y_train, y_test = train_test_split(data, label, test_size=0.25, random_state=42)
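
The class priors listed in the key steps are the class frequencies in the training labels. A minimal NumPy sketch, using made-up labels (`y` is hypothetical, not `y_train`):

```python
import numpy as np

# hypothetical training labels: 0 = not survived, 1 = survived
y = np.array([0, 0, 1, 0, 1, 0, 0, 1])

# count how often each class occurs and normalise to probabilities
classes, counts = np.unique(y, return_counts=True)
priors = counts / counts.sum()

for c, p in zip(classes, priors):
    print(f"P(survived={c}) = {p:.3f}")
```

Applied to `y_train`, the same two lines should give the values that a fitted `GaussianNB` stores in its `class_prior_` attribute.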

In [76]:
# grid search with 5-fold cross-validation on the training set to find the value of
# `var_smoothing` with the best mean validation accuracy
param_grid = {'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]}
gnb_grid = GaussianNB()
clf = GridSearchCV(gnb_grid, param_grid, cv=5)
clf.fit(x_train, y_train)
clf.best_params_

{'var_smoothing': 1e-05}
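
According to the scikit-learn documentation, `var_smoothing` is the portion of the largest feature variance that is added to all variances for calculation stability. The sketch below illustrates that idea with NumPy on a made-up array (`X` is hypothetical) in which one feature is constant, so its variance is zero and the Gaussian density would otherwise divide by zero.

```python
import numpy as np

# hypothetical data: the first feature is constant, the second is not
X = np.array([[1.0, 10.0],
              [1.0, 20.0],
              [1.0, 30.0]])

var = X.var(axis=0)                 # per-feature variance, first entry is 0
var_smoothing = 1e-5                # value selected by the grid search above
epsilon = var_smoothing * var.max() # fraction of the largest variance
smoothed = var + epsilon            # every variance is now strictly positive

print("raw variances:     ", var)
print("smoothed variances:", smoothed)
```

Because the added term scales with the largest variance, a wide-ranging feature such as `fare` sets how much smoothing every other feature receives.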

In [77]:
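# refit on the full training set with the best `var_smoothing` found above,
# then measure accuracy on the held-out test set
gnb = GaussianNB(var_smoothing=1e-5)
gnb.fit(x_train, y_train)
gnb_score = gnb.score(x_test, y_test)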
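
What `fit` and `score` do here can be sketched from scratch with NumPy. The functions `fit_gaussian_nb` and `predict_gaussian_nb` below are hypothetical names written for illustration, and the data is synthetic rather than the Titanic features. The sketch follows the textbook formulation (class priors, per-class means and variances, log-posterior argmax) and is intended to mirror `GaussianNB`, though it has not been checked against it here.

```python
import numpy as np

def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Estimate class priors and per-class feature means and variances."""
    classes = np.unique(y)
    # smoothing term: a fraction of the largest overall feature variance
    epsilon = var_smoothing * X.var(axis=0).max()
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + epsilon
    return classes, priors, means, variances

def predict_gaussian_nb(model, X):
    """Pick the class with the largest log-posterior for each row of X."""
    classes, priors, means, variances = model
    # diff has shape (n_samples, n_classes, n_features)
    diff = X[:, None, :] - means[None, :, :]
    # Gaussian log-density per feature, summed over features (independence assumption)
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :]).sum(axis=2)
    log_post = np.log(priors)[None, :] + log_lik
    return classes[np.argmax(log_post, axis=1)]

# synthetic, well-separated two-class data (not the Titanic features)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(4.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

model = fit_gaussian_nb(X, y)
pred = predict_gaussian_nb(model, X)
accuracy = float(np.mean(pred == y))
print("training accuracy on synthetic data:", accuracy)
```

The normalising constant of Bayes' theorem is the same for every class, so it can be left out when only the argmax is needed.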

In [78]:
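# logistic regression baseline; `max_iter` is raised above the default of 100 to give
# the solver more iterations to converge
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter=250)
lr.fit(x_train, y_train)
lr_score = lr.score(x_test, y_test)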

In [79]:
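# k-nearest neighbours baseline with the default of 5 neighbours; KNN is distance-based
# and the features are not scaled here, which may contribute to its lower score
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
knn_score = knn.score(x_test, y_test)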

In [80]:
print("Gaussian Naive Bayes Accuracy:", gnb_score, "\nLogistic Regression Accuracy:", lr_score, "\nKNN Accuracy:", knn_score)

Gaussian Naive Bayes Accuracy: 0.7937219730941704 
Logistic Regression Accuracy: 0.7982062780269058 
KNN Accuracy: 0.7040358744394619
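
Accuracy is the fraction of test passengers classified correctly. With `test_size=0.25` on 891 rows the test set holds 223 passengers, so the printed scores can be turned back into counts. The counts below are inferred from the printed accuracies, not taken from the notebook, and the `accuracy` helper is a hypothetical from-scratch equivalent of `score`.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# tiny made-up example: 3 of 4 predictions are correct
example = accuracy([1, 0, 1, 1], [1, 0, 0, 1])

# scores printed above, converted back to numbers of correct predictions
n_test = 223
scores = {
    "gnb": 0.7937219730941704,
    "lr": 0.7982062780269058,
    "knn": 0.7040358744394619,
}
implied_correct = {name: round(score * n_test) for name, score in scores.items()}

print(example)
print(implied_correct)
```

Read this way, Gaussian Naive Bayes and logistic regression differ by a single test passenger, so this split alone does not separate the two models.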
