### Titanic Survival Prediction

You are a data scientist / AI engineer working on a binary classification problem: predicting which passengers survived the sinking of the Titanic. You have been provided with a dataset named **`"titanic.csv"`**, which contains various passenger features along with a survival label. The dataset comprises the following columns:

- `passenger_id`: The unique identifier for each passenger.
- `name`: The name of the passenger.
- `p_class`: The passenger class (1 = 1st class, 2 = 2nd class, 3 = 3rd class).
- `sex`: The gender of the passenger.
- `age`: The age of the passenger.
- `sib_sp`: The number of siblings or spouses the passenger had aboard the Titanic.
- `parch`: The number of parents or children the passenger had aboard the Titanic.
- `ticket`: The ticket number of the passenger.
- `fare`: The fare the passenger paid for the ticket.
- `cabin`: The cabin number where the passenger stayed.
- `embarked`: The port where the passenger boarded the Titanic (C = Cherbourg; Q = Queenstown; S = Southampton).
- `survived`: Whether the passenger survived (1) or not (0).

Your task is to use this dataset to build and evaluate a `Gaussian Naive Bayes` model to predict whether a passenger survived based on their features. You will also evaluate the model's performance using precision, recall, and other classification metrics.

**Import Necessary Libraries**

In [50]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


### Task 1: Data Preparation and Exploration

1. Import the data from the `"titanic.csv"` file and store it in a variable df.
2. Display the number of rows and columns in the dataset.
3. Display the first few rows of the dataset to get an overview.
4. Check for any missing values in the dataset.
5. Drop columns that do not add much value `(passenger_id, name, sib_sp, parch, ticket, cabin, embarked)`.
6. Visualize the distributions of the target variable `survived` and of `p_class` using bar charts.
7. Visualize the distribution of `sex` using a pie chart (percentage).
8. Visualize the distribution of `age` and `fare` using histograms.

In [51]:
# Step 1: Import the data from the "titanic.csv" file and store it in a variable df
df = pd.read_csv("titanic.csv")


# Step 2: Display the number of rows and columns in the dataset
df.shape


# Step 3: Display the first few rows of the dataset to get an overview
df.head()


Unnamed: 0,passenger_id,name,p_class,sex,age,sib_sp,parch,ticket,fare,cabin,embarked,survived
0,1,"Braund, Mr. Owen Harris",3,male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,"Heikkinen, Miss. Laina",3,female,26.0,0,0,STON/O2. 3101282,,,S,1
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,female,35.0,1,0,113803,53.1,C123,S,1
4,5,"Allen, Mr. William Henry",3,male,35.0,0,0,373450,8.05,,S,0


In [52]:
# Step 4: Check for any missing values in the dataset 
df.isnull().sum()


passenger_id      0
name              0
p_class           0
sex               0
age             177
sib_sp            0
parch             0
ticket            0
fare              9
cabin           687
embarked          2
survived          0
dtype: int64

In [53]:
# Step 5: Drop columns that do not add much value (passenger_id, name, sib_sp, parch, ticket, cabin, embarked)
df.drop(["passenger_id", "name", "sib_sp", "parch", "ticket", "cabin", "embarked"], axis=1, inplace=True)

df.head()

Unnamed: 0,p_class,sex,age,fare,survived
0,3,male,22.0,7.25,0
1,1,female,38.0,71.2833,1
2,3,female,26.0,,1
3,1,female,35.0,53.1,1
4,3,male,35.0,8.05,0


In [54]:
# Step 6: Visualize the distribution

#'survived'
df["survived"].value_counts()

# 'p_class'
df["p_class"].value_counts()

p_class
3    491
1    216
2    184
Name: count, dtype: int64
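
The cell above prints counts rather than drawing the bar charts Step 6 asks for (and, since a cell only displays its last expression, only the `p_class` counts appear). A minimal sketch of the plots follows, assuming matplotlib is available; the small `df` built here is a hypothetical stand-in so the example runs on its own, and in the notebook you would use the `df` loaded from `"titanic.csv"` instead.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the notebook's df (not the real data).
df = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0, 0, 1, 0],
    "p_class":  [3, 1, 3, 1, 3, 2, 2, 3],
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Count each category and draw the counts as bars, one panel per column.
for ax, column in zip(axes, ["survived", "p_class"]):
    counts = df[column].value_counts().sort_index()
    counts.plot(kind="bar", ax=ax, rot=0)
    ax.set_title(f"Distribution of {column}")
    ax.set_xlabel(column)
    ax.set_ylabel("Number of passengers")

fig.tight_layout()
# In a notebook the figure renders at the end of the cell;
# in a plain script, call plt.show() here.
```

Sorting by index keeps the bars in category order (0, 1 and 1, 2, 3) instead of ordering them by frequency.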

In [55]:
# Step 7: Visualize the distribution of 'sex' using a pie chart (percentage)
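
This cell was left empty. One possible way to draw the pie chart is sketched below, assuming matplotlib; the `df` here is a hypothetical stand-in with made-up values, and in the notebook the same calls would run against the real `sex` column (before it is one-hot encoded in Task 2).

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the notebook's df (not the real data).
df = pd.DataFrame({
    "sex": ["male", "female", "female", "female", "male", "male", "male", "male"],
})

sex_counts = df["sex"].value_counts()

fig, ax = plt.subplots(figsize=(5, 5))
# autopct formats each wedge's share of the total as a percentage label.
ax.pie(
    sex_counts.values,
    labels=list(sex_counts.index),
    autopct="%1.1f%%",
    startangle=90,
)
ax.set_title("Distribution of sex")
```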


In [56]:
# Step 8: Visualize the distribution of 'age' using a histogram


In [57]:
# Step 9: Visualize the distribution of 'fare' using a histogram
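
The histogram cells for `age` and `fare` were also left empty. The sketch below draws both side by side, assuming matplotlib. The first five rows of the stand-in `df` mirror the `df.head()` output shown earlier; the remaining rows are invented for illustration, and the bin count of 20 is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for the notebook's df; rows beyond the first five are hypothetical.
df = pd.DataFrame({
    "age":  [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0],
    "fare": [7.25, 71.2833, np.nan, 53.1, 8.05, 8.4583, 51.8625, 21.075],
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

for ax, column in zip(axes, ["age", "fare"]):
    # Missing values are dropped here only for plotting; they are imputed in Task 2.
    ax.hist(df[column].dropna(), bins=20, edgecolor="black")
    ax.set_title(f"Distribution of {column}")
    ax.set_xlabel(column)
    ax.set_ylabel("Number of passengers")

fig.tight_layout()
```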


### Task 2: Data Preprocessing

1. Fill in missing values in the `age` and `fare` columns with their median values.
2. Encode the `sex` column using one-hot encoding.
3. Standardize the `fare` column using `StandardScaler`.
4. Select the features `(p_class, sex, age, fare)` and the target variable `(survived)` for modeling.
5. Split the dataset into training and testing sets with a test size of 30%.

In [58]:
# Step 1: Fill in missing values in the 'age' and 'fare' columns with their median values
median_age = df["age"].median()
median_fare  = df["fare"].median()

# Assign the result back instead of calling fillna(..., inplace=True) on a
# column selection, which pandas warns about and which stops working in pandas 3.0.
df["age"] = df["age"].fillna(median_age)
df["fare"] = df["fare"].fillna(median_fare)


In [59]:
df["fare"].isnull().sum()

np.int64(0)

In [60]:
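# Step 2: Encode the 'sex' column using one-hot encoding
# drop_first=True keeps a single indicator column, sex_male
df = pd.get_dummies(df, columns=["sex"], drop_first=True)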

df.head()

Unnamed: 0,p_class,age,fare,survived,sex_male
0,3,22.0,7.25,0,True
1,1,38.0,71.2833,1,False
2,3,26.0,14.4542,1,False
3,1,35.0,53.1,1,False
4,3,35.0,8.05,0,True


In [70]:
# Step 3: Standardize the 'fare' column using StandardScaler
from sklearn.preprocessing import StandardScaler

scaler  = StandardScaler()

df["standardized_fare"] = scaler.fit_transform(df[["fare"]])

df.drop(["fare"], axis=1, inplace=True)


In [71]:
df.head()

Unnamed: 0,p_class,age,survived,sex_male,standardized_fare
0,3,22.0,0,True,-0.500819
1,1,38.0,1,False,0.788518
2,3,26.0,1,False,-0.35576
3,1,35.0,1,False,0.42239
4,3,35.0,0,True,-0.484711
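
Note that the cell above fits the scaler on the full dataset, before the train/test split, so the test rows influence the mean and standard deviation used for scaling. A common alternative, which is not what this notebook does, is to split first and fit the scaler on the training rows only. The sketch below illustrates that ordering on a small hypothetical frame; the values are invented and the variable names are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data (not the real dataset).
df = pd.DataFrame({
    "fare": [7.25, 71.2833, 14.4542, 53.1, 8.05,
             8.4583, 51.8625, 21.075, 11.1333, 30.0708],
    "survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})

X = df[["fare"]]
y = df["survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scaler = StandardScaler()
# Learn the mean and standard deviation from the training rows only...
train_fare_scaled = scaler.fit_transform(X_train[["fare"]])
# ...and reuse those statistics, unchanged, on the test rows.
test_fare_scaled = scaler.transform(X_test[["fare"]])
```

With this ordering the scaled training fares have zero mean by construction, while the scaled test fares generally do not, which is expected.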


In [73]:
# Step 4: Select the features and target variable for modeling
X = df.drop(["survived"], axis=1)
y = df["survived"]


# Step 5: Split the dataset into training and testing sets with a test size of 30%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

### Task 3: Model Training Using Gaussian Naive Bayes

1. Initialize and train a `Gaussian Naive Bayes` model using the training data.
2. Make predictions on the test set using the trained model.
3. Evaluate the model using a classification report and print the report.
4. Visualize the confusion matrix for the model.

In [14]:
# Step 1: Initialize and train a Gaussian Naive Bayes model using the training data


# Step 2: Make predictions on the test set using the trained model
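
These two cells were left empty. A minimal sketch of Steps 1 and 2 follows. Because `"titanic.csv"` is not available here, it builds a hypothetical synthetic dataset with columns resembling the preprocessed frame (the column name `fare_scaled` and the labelling rule are invented for illustration); in the notebook, the `fit` and `predict` calls would use the `X_train`, `X_test`, and `y_train` from Task 2 instead.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical synthetic stand-in for the preprocessed data (not the real dataset).
rng = np.random.default_rng(0)
n = 200
is_male = rng.integers(0, 2, size=n).astype(bool)
p_class = rng.integers(1, 4, size=n)
X = pd.DataFrame({
    "p_class": p_class,
    "age": rng.normal(30, 12, size=n).clip(1, 80),
    "sex_male": is_male,
    "fare_scaled": rng.normal(0, 1, size=n),
})
# Made-up labelling rule so there is something to learn; not derived from real data.
y = pd.Series((~is_male | (p_class == 1)).astype(int), name="survived")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Step 1: initialize the model and fit it on the training data.
model = GaussianNB()
model.fit(X_train, y_train)

# Step 2: predict class labels (0 or 1) for the held-out test rows.
y_pred = model.predict(X_test)
print(y_pred[:10])
```

`GaussianNB` models each feature within each class as a normal distribution, which is a rough fit for binary or ordinal columns such as `sex_male` and `p_class`; the exercise accepts that simplification.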


In [15]:
# Step 3: Evaluate the model using a classification report and print the report
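
This cell was left empty. The sketch below shows the `classification_report` call on two short hypothetical label lists, standing in for the notebook's `y_test` and the model's `y_pred`; the class names passed as `target_names` are also an assumption.

```python
from sklearn.metrics import classification_report

# Hypothetical true and predicted labels; in the notebook these come from
# the train/test split and from model.predict(X_test).
y_test = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# The report lists precision, recall, F1-score, and support for each class.
report = classification_report(
    y_test, y_pred, target_names=["did not survive", "survived"]
)
print(report)
```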


In [16]:
# Step 4: Visualize the confusion matrix for the model
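
This cell was left empty as well. One way to plot the confusion matrix is sketched below, assuming matplotlib and scikit-learn's `ConfusionMatrixDisplay`; the label lists are hypothetical stand-ins for the notebook's `y_test` and `y_pred`.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Hypothetical true and predicted labels (not results from the real model).
y_test = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)

fig, ax = plt.subplots(figsize=(5, 4))
display = ConfusionMatrixDisplay(
    confusion_matrix=cm, display_labels=["did not survive", "survived"]
)
display.plot(ax=ax, cmap="Blues")
ax.set_title("Gaussian Naive Bayes confusion matrix")
```

Reading the matrix alongside the classification report helps show whether errors are mostly false positives (predicted survival for passengers who did not survive) or false negatives.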
