### Naive Bayes Classifier: Understanding and Implementation


# What is Naive Bayes?
# Naive Bayes is a probabilistic classification algorithm based on Bayes' theorem.
# It assumes that features are conditionally independent given the class (hence "naive"), which simplifies the calculations.
# It is widely used for classification tasks such as spam detection, sentiment analysis, and medical diagnosis.

# Why use Naive Bayes?
# - Simple and fast to train.
# - Works well with small datasets.
# - Handles both categorical and numerical data through different variants (for example, GaussianNB for continuous features).
# - Performs well in text classification and other probabilistic tasks.

# How does Naive Bayes help in Classification?
# - It calculates the probability of a data point belonging to a specific class.
# - Uses Bayes' theorem to update probabilities based on given data.
# - Assumes features are independent given the class, so the likelihood factors into a product of per-feature terms.

# Bayes' Theorem Formula:
# P(Class | Data) = ( P(Data | Class) * P(Class) ) / P(Data)
# Where:
# P(Class | Data) is the posterior probability.
# P(Data | Class) is the likelihood.
# P(Class) is the prior probability.
# P(Data) is the evidence (normalizing constant).
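
# A worked example can make the formula concrete. The sketch below applies it to a
# hypothetical spam filter. The three input probabilities are invented for illustration
# and do not come from any dataset.

```python
# Hypothetical inputs (illustrative numbers only)
p_spam = 0.2               # P(Class): prior probability that a message is spam
p_word_given_spam = 0.6    # P(Data | Class): the word "free" appears in a spam message
p_word_given_ham = 0.05    # P(Data | other class): "free" appears in a normal message

# P(Data): total probability of seeing the word, summed over both classes
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# P(Class | Data): posterior probability of spam given that the word appears
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(f"P(word) = {p_word:.3f}")                    # P(word) = 0.160
print(f"P(spam | word) = {p_spam_given_word:.3f}")  # P(spam | word) = 0.750
```

# Seeing the word raises the probability of spam from the prior of 0.2 to a posterior of 0.75.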


# Import necessary libraries

In [14]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder

# Load the dataset 

In [15]:
# Load dataset
df = pd.read_csv("titanic.csv")

df.head()

,PassengerId,Name,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,1,"Braund, Mr. Owen Harris",3,male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,"Heikkinen, Miss. Laina",3,female,26.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,female,35.0,1,0,113803,53.1,C123,S,1
4,5,"Allen, Mr. William Henry",3,male,35.0,0,0,373450,8.05,,S,0


In [16]:
# Drop unnecessary columns
df.drop(['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin', 'Embarked'], axis='columns', inplace=True)
df.head()

,Pclass,Sex,Age,Fare,Survived
0,3,male,22.0,7.25,0
1,1,female,38.0,71.2833,1
2,3,female,26.0,7.925,1
3,1,female,35.0,53.1,1
4,3,male,35.0,8.05,0


In [17]:
# Define inputs and target variable
inputs = df.drop('Survived', axis='columns')
target = df.Survived


In [18]:
# Convert categorical variable 'Sex' to dummy variables
dummies = pd.get_dummies(inputs.Sex)
inputs = pd.concat([inputs, dummies], axis='columns')
inputs.drop(['Sex', 'male'], axis='columns', inplace=True)
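
# The encoding step can be checked on a tiny example. This sketch uses a hypothetical
# three-row frame (toy) in place of the Titanic data and assumes pandas is installed.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the 'Sex' column
toy = pd.DataFrame({"Sex": ["male", "female", "female"]})

# get_dummies creates one indicator column per category
dummies = pd.get_dummies(toy.Sex)
print(list(dummies.columns))

# 'male' is always the opposite of 'female', so one column carries all the information
encoded = dummies.drop("male", axis="columns")

# Cast to int so the output looks the same whether pandas returns bool or uint8 dummies
print(encoded.female.astype(int).tolist())
```

# Dropping one of the two columns avoids feeding the model two perfectly redundant features.
# Note that pd.get_dummies(..., drop_first=True) would drop 'female' instead, because it
# removes the first category in sorted order.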


In [19]:
# Handle missing values in 'Age' column
inputs.Age = inputs.Age.fillna(inputs.Age.mean())
inputs.head()


,Pclass,Age,Fare,female
0,3,22.0,7.25,False
1,1,38.0,71.2833,True
2,3,26.0,7.925,True
3,1,35.0,53.1,True
4,3,35.0,8.05,False


In [20]:
# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(inputs, target, test_size=0.3, random_state=42)


In [21]:
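# Encode any remaining text columns, then train the model.
# After the get_dummies step, no object-dtype columns should remain,
# so this loop is a safeguard and is expected to do nothing here.
X_train, X_test = X_train.copy(), X_test.copy()  # work on copies, not views of inputs
label_encoders = {}
for col in X_train.select_dtypes(include=['object']).columns:
    label_encoders[col] = LabelEncoder()
    X_train[col] = label_encoders[col].fit_transform(X_train[col])
    # Reuse the fitted encoder so train and test share the same mapping
    X_test[col] = label_encoders[col].transform(X_test[col])

# Fill any remaining missing values in both sets
X_train = X_train.fillna(0)
X_test = X_test.fillna(0)

# Train Naive Bayes model
model = GaussianNB()
model.fit(X_train, y_train)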
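
# To show what fitting a Gaussian Naive Bayes model involves, here is a minimal
# from-scratch sketch. It is not scikit-learn's implementation: fit_gaussian_nb,
# class_probabilities, the toy rows, and the fixed 1e-9 variance floor are all
# assumptions made for illustration.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate a prior plus per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    params = {}
    for label, members in grouped.items():
        n = len(members)
        means = [sum(col) / n for col in zip(*members)]
        # Small constant keeps the variance away from zero (assumption of this sketch)
        variances = [sum((x - m) ** 2 for x in col) / n + 1e-9
                     for col, m in zip(zip(*members), means)]
        params[label] = (n / len(rows), means, variances)
    return params

def class_probabilities(params, row):
    """Return P(class | row) for every class, normalized to sum to 1."""
    log_scores = {}
    for label, (prior, means, variances) in params.items():
        log_score = math.log(prior)
        for x, m, v in zip(row, means, variances):
            # Log of the Gaussian density; adding logs equals multiplying likelihoods
            log_score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        log_scores[label] = log_score
    # Subtract the largest log score before exponentiating, for numerical stability
    top = max(log_scores.values())
    weights = {k: math.exp(s - top) for k, s in log_scores.items()}
    total = sum(weights.values())  # plays the role of P(Data)
    return {k: w / total for k, w in weights.items()}

# Hypothetical toy data: [age, fare] rows with a 0/1 label
rows = [[22.0, 7.0], [30.0, 8.0], [25.0, 9.0],
        [40.0, 70.0], [35.0, 55.0], [38.0, 60.0]]
labels = [0, 0, 0, 1, 1, 1]

params = fit_gaussian_nb(rows, labels)
probs = class_probabilities(params, [37.0, 65.0])
print(probs)
```

# The independence assumption appears in the inner loop: each feature contributes its own
# term, and the terms are simply added in log space.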

In [22]:
# Evaluate model accuracy
accuracy = model.score(X_test, y_test)
print("Model Accuracy:", accuracy)

Model Accuracy: 0.7761194029850746
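
# For a classifier, score reports accuracy: the fraction of test samples predicted correctly.
# The sketch below computes it by hand, along with confusion counts, on hypothetical labels
# (y_true and y_pred are made up and are not taken from the Titanic split).

```python
# Hypothetical true labels and predictions
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]
pairs = list(zip(y_true, y_pred))

# Accuracy is the share of predictions that match the true label
correct = sum(t == p for t, p in pairs)
accuracy = correct / len(y_true)

# Confusion counts show which kind of mistake was made
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

print("Accuracy:", accuracy)           # Accuracy: 0.75
print("TP, TN, FP, FN:", tp, tn, fp, fn)  # TP, TN, FP, FN: 3 3 1 1
```

# Accuracy alone hides the split between false positives and false negatives, so the
# confusion counts are worth checking when the two kinds of error matter differently.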


In [26]:

# Predict classes for the first 30 test samples
sample_predictions = model.predict(X_test[:30])
print("Predictions:", sample_predictions)


Predictions: [0 0 0 1 1 1 1 0 1 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 1]


In [24]:
# Predict class probabilities for the first 10 test samples (columns: not survived, survived)
prediction_probabilities = model.predict_proba(X_test[:10])
print("Prediction Probabilities:\n", prediction_probabilities)

Prediction Probabilities:
 [[0.96936275 0.03063725]
 [0.93717177 0.06282823]
 [0.96222728 0.03777272]
 [0.15327261 0.84672739]
 [0.3768726  0.6231274 ]
 [0.02067746 0.97932254]
 [0.46303706 0.53696294]
 [0.95888508 0.04111492]
 [0.38848423 0.61151577]
 [0.08683721 0.91316279]]
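
# Each row holds one probability per class, in the order of model.classes_ (here 0, then 1).
# The sketch below copies the ten rows printed above and shows that picking the larger
# column reproduces the first ten class predictions.

```python
# Rows copied from the printed predict_proba output
probabilities = [
    [0.96936275, 0.03063725],
    [0.93717177, 0.06282823],
    [0.96222728, 0.03777272],
    [0.15327261, 0.84672739],
    [0.3768726, 0.6231274],
    [0.02067746, 0.97932254],
    [0.46303706, 0.53696294],
    [0.95888508, 0.04111492],
    [0.38848423, 0.61151577],
    [0.08683721, 0.91316279],
]

# Every row should sum to 1 (up to rounding in the printed values)
row_sums_ok = all(abs(sum(row) - 1.0) < 1e-6 for row in probabilities)

# The predicted class is the index of the largest probability in each row
predicted = [row.index(max(row)) for row in probabilities]

print(row_sums_ok)  # True
print(predicted)    # [0, 0, 0, 1, 1, 1, 1, 0, 1, 1]
```

# The seventh row (0.463 vs 0.537) is a close call, which a plain class prediction would not reveal.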


In [25]:
# Perform 5-fold cross-validation on the training set
cv_scores = cross_val_score(GaussianNB(), X_train, y_train, cv=5)
print("Cross-validation Scores:", cv_scores)


Cross-validation Scores: [0.752      0.864      0.704      0.74193548 0.80645161]
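
# Five separate scores are easier to compare once summarized. The sketch below copies the
# printed scores and reports their mean and spread using Python's statistics module.

```python
from statistics import mean, stdev

# Scores copied from the printed cross-validation output
cv_scores = [0.752, 0.864, 0.704, 0.74193548, 0.80645161]

cv_mean = mean(cv_scores)
cv_spread = stdev(cv_scores)  # sample standard deviation across the 5 folds

print(f"Mean accuracy: {cv_mean:.3f}")    # Mean accuracy: 0.774
print(f"Std deviation: {cv_spread:.3f}")  # Std deviation: 0.062
```

# The fold scores range from 0.704 to 0.864, so a single train/test split can look noticeably
# better or worse than the average depending on which rows land in the test set.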
