# 🛳️ Titanic Survival Prediction (Machine Learning Project)

In this project, we build a machine learning model to predict whether a passenger survived the Titanic disaster.
The dataset contains passenger information such as age, sex, ticket class, fare, and port of embarkation.
The goal is to understand which factors influence survival and compare different classification models.

📌 Steps in This Project

1. Import Libraries

2. Load the Dataset

3. Explore the Data

4. Check Missing Values

5. Handle Missing Values

6. Encode Categorical Features

7. Prepare Features and Target (X and y)

8. Train–Test Split

9. Model 1: Logistic Regression

10. Model 2: Decision Tree

11. Model 3: Random Forest

12. Evaluate the Model (Random Forest)

13. Final Conclusion

# 🧩 Step 1: Import Libraries

We import the required Python libraries:

pandas for data manipulation

seaborn to load the Titanic dataset

In [54]:
import pandas as pd
import seaborn as sns


# 🧩 Step 2: Load the Dataset

We load the Titanic dataset using Seaborn.
This dataset contains details of passengers such as age, sex, fare, and survival status.

In [55]:
df = sns.load_dataset("titanic")
df.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


# 🔍 Step 3: Explore the Data

Before building models, we explore the dataset to understand:

    structure of the data

    data types

    summary statistics

    class distribution

    missing values

In [56]:
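df.info()                      # structure, dtypes, and non-null counts (printed)
df.describe()                  # summary statistics; not displayed, since a cell shows only its last expression
df["survived"].value_counts()  # class distribution (displayed as the cell's result)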


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


survived,count
0,549
1,342
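
The class counts above also give a useful reference point: a model that always predicts the majority class already reaches a certain accuracy, so any trained model should be judged against it. A minimal sketch using only the counts printed above (the variable names are illustrative and not part of the notebook):

```python
# Class counts copied from the value_counts() output above
counts = {0: 549, 1: 342}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a "model" that always predicts the majority class
baseline_accuracy = counts[majority_class] / total

print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.3f}")
```

About 61.6% of passengers did not survive, so the accuracies reported later are best read relative to roughly 0.62 rather than 0.5.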


# ❓ Step 4: Check Missing Values

Many machine learning models, including logistic regression, cannot handle missing values directly, so we first check how many missing values each column has.

In [57]:
df.isnull().sum()


column,missing
survived,0
pclass,0
sex,0
age,177
sibsp,0
parch,0
fare,0
embarked,2
class,0
who,0
adult_male,0
deck,688
embark_town,2
alive,0
alone,0
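
Raw counts are easier to compare as fractions of the 891 rows. The sketch below derives the missing fraction for the affected columns from the non-null counts in the df.info() output of Step 3; the dictionary and variable names are illustrative.

```python
# Non-null counts copied from the df.info() output in Step 3
total_rows = 891
non_null = {"age": 714, "embarked": 889, "deck": 203, "embark_town": 889}

# Fraction of missing values per column
# (with the DataFrame loaded, df.isnull().mean() gives the same fractions)
missing_fraction = {
    col: (total_rows - n) / total_rows for col, n in non_null.items()
}

# Print columns from most to least incomplete
for col, frac in sorted(missing_fraction.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:12s} {frac:.1%}")
```

Roughly 77% of deck is missing, compared with about 20% of age and well under 1% of embarked and embark_town, which motivates treating these columns differently.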


# 🧹 Step 5: Handle Missing Values

We handle missing values using appropriate strategies:

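Drop the deck column because most of its values are missing (only 203 of 891 are non-null)

Fill age with the median: age is an informative numerical feature, and the median is less sensitive to outliers than the mean

Fill embarked and embark_town with the mode (most frequent value), since they are categorical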

In [58]:
# drop deck because most of its values are missing
df = df.drop(columns=["deck"])

# fill missing age with median
df["age"] = df["age"].fillna(df["age"].median())

# fill embarked and embark_town with mode
df["embarked"] = df["embarked"].fillna(df["embarked"].mode()[0])
df["embark_town"] = df["embark_town"].fillna(df["embark_town"].mode()[0])
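
To make the two fill strategies concrete, here is a small sketch on a hypothetical four-row frame (the `toy` DataFrame is invented for illustration and is not Titanic data). It applies the same `fillna` calls as the cell above.

```python
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "age": [22.0, None, 30.0, 40.0],
    "embarked": ["S", "C", None, "S"],
})

# Numerical column: fill with the median of the observed values (22, 30, 40)
toy["age"] = toy["age"].fillna(toy["age"].median())

# Categorical column: fill with the most frequent observed value
toy["embarked"] = toy["embarked"].fillna(toy["embarked"].mode()[0])

print(toy)
```

Note that `mode()` returns a Series because several values can tie for most frequent; `[0]` picks the first of them.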


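# 🔄 Step 6: Encode Categorical Features

Most scikit-learn estimators require numeric input, so we convert the categorical columns into numerical form using one-hot encoding.

Before encoding, we drop four redundant columns: class duplicates pclass, embark_town duplicates embarked, who is derived from sex and age, and alive is the target (survived) in text form, so keeping it would leak the answer into the features.

We use drop_first=True to drop one dummy column per feature, which avoids perfectly collinear columns (the dropped level is implied when all remaining dummies are False).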

In [59]:
df = df.drop(columns=["class", "who", "alive", "embark_town"])

df_encoded = pd.get_dummies(
    df,
    columns=["sex", "embarked", "adult_male", "alone"],
    drop_first=True
)

df_encoded.head()


Unnamed: 0,survived,pclass,age,sibsp,parch,fare,sex_male,embarked_Q,embarked_S,adult_male_True,alone_True
0,0,3,22.0,1,0,7.25,True,False,True,True,False
1,1,1,38.0,1,0,71.2833,False,False,False,False,False
2,1,3,26.0,0,0,7.925,False,False,True,False,True
3,1,1,35.0,1,0,53.1,False,False,True,False,False
4,0,3,35.0,0,0,8.05,True,False,True,True,True
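
The effect of drop_first=True is easiest to see side by side. This sketch encodes a hypothetical three-row frame (`toy` is invented for illustration) with and without the option and prints the resulting column names.

```python
import pandas as pd

# Hypothetical toy data covering every category level
toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "embarked": ["S", "C", "Q"],
})

# One dummy column per category level
full = pd.get_dummies(toy, columns=["sex", "embarked"])

# drop_first=True removes the first (alphabetically smallest) level of each feature
reduced = pd.get_dummies(toy, columns=["sex", "embarked"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

The dropped levels (female and C) become the reference categories: a row with sex_male False is female, and a row with embarked_Q and embarked_S both False embarked at C.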


# 🧩 Step 7: Prepare Features (X) and Target (y)

X → input features

y → target variable (survived)

In [60]:
X = df_encoded.drop("survived", axis=1)
y = df_encoded["survived"]

X.shape, y.shape



((891, 10), (891,))

# ✂️ Step 8: Train–Test Split

We split the data into:

Training set (80%) → used to train the models

Testing set (20%) → used to evaluate model performance

This helps check how well the model generalizes to unseen data.

In [61]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_train.shape, X_test.shape


((712, 10), (179, 10))
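
Because the two classes are imbalanced, a purely random split can leave the test set with a different survival rate than the training set. One common option, which this notebook does not use, is the `stratify` argument of `train_test_split`. The sketch below shows it on hypothetical toy data (`X_toy` and `y_toy` are invented for illustration).

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 40% in the positive class
X_toy = [[i] for i in range(100)]
y_toy = [1] * 40 + [0] * 60

# stratify=y_toy keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Positive share in train:", sum(y_tr) / len(y_tr))
print("Positive share in test: ", sum(y_te) / len(y_te))
```

Applied to the notebook's data, the equivalent call would pass `stratify=y`; doing so would change the split and therefore the accuracies reported below.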

# 🤖 Step 9: Model 1 - Logistic Regression

Logistic Regression is a simple and widely used classification algorithm.
It works well for binary classification problems like survival prediction.

In [62]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

log_model = LogisticRegression(max_iter=5000)
log_model.fit(X_train, y_train)

y_pred_log = log_model.predict(X_test)

print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_log))


Logistic Regression Accuracy: 0.8156424581005587
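
For intuition about what the fitted model computes: logistic regression passes a weighted sum of the features through the sigmoid function and thresholds the resulting probability. The sketch below does this by hand; the weights and bias are hypothetical values chosen for illustration, not the coefficients learned by log_model.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, weights, bias, threshold=0.5):
    """Return (probability, predicted class) for one sample."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    prob = sigmoid(score)
    return prob, int(prob >= threshold)

# Hypothetical weights for two features (not the fitted coefficients)
weights = [-1.0, 2.0]
bias = 0.5

prob, label = predict([1.0, 1.0], weights, bias)
print(f"probability={prob:.3f}, predicted class={label}")
```

In scikit-learn, `log_model.predict_proba(X_test)` returns the per-class probabilities, and `log_model.predict` applies the 0.5 threshold.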


# 🌳 Step 10: Model 2 - Decision Tree Classifier

Decision Trees split data using decision rules.
They can capture non-linear patterns but may overfit if not controlled.

In [63]:
from sklearn.tree import DecisionTreeClassifier

dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

y_pred_dt = dt_model.predict(X_test)

print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))


Decision Tree Accuracy: 0.776536312849162
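
One way to control overfitting is to limit tree depth. The sketch below compares an unrestricted tree with a depth-limited one on synthetic stand-in data from `make_classification` (not the Titanic features); the choice of `max_depth=3` is an arbitrary illustration, not a tuned value.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 10% label noise (not the Titanic features)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=10, flip_y=0.1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Unrestricted tree: grows until every training sample is classified correctly
deep_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

# Depth-limited tree: at most three levels of splits
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)

for name, tree in [("unrestricted", deep_tree), ("max_depth=3", shallow_tree)]:
    print(
        name,
        "| depth:", tree.get_depth(),
        "| train acc:", round(tree.score(X_tr, y_tr), 3),
        "| test acc:", round(tree.score(X_te, y_te), 3),
    )
```

A large gap between training and test accuracy for the unrestricted tree is the typical sign of overfitting; other parameters such as `min_samples_leaf` serve a similar purpose.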


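# 🌲 Step 11: Model 3 - Random Forest Classifier

Random Forest is an ensemble model that combines many decision trees, each trained on a bootstrap sample of the data and considering a random subset of features at each split.
Averaging the trees usually reduces overfitting compared with a single tree and often improves accuracy.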

In [64]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(
    n_estimators=100,
    random_state=42
)
rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)

print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))


Random Forest Accuracy: 0.8212290502793296
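
All three accuracies come from a single split with 179 test passengers, where one passenger is worth about 0.56 percentage points. Cross-validation gives a more stable comparison. The sketch below runs 5-fold cross-validation on synthetic stand-in data (not the Titanic features); with the notebook's variables, `X_syn` and `y_syn` would be replaced by `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the Titanic features)
X_syn, y_syn = make_classification(n_samples=300, n_features=10, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=5000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# 5-fold cross-validated accuracy for each model
cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_syn, y_syn, cv=5, scoring="accuracy")
    cv_results[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Comparing the mean scores together with their spread across folds shows whether a difference between models is larger than the variation between splits.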


# 📊 Step 12: Model Evaluation (Random Forest)

We evaluate the best-performing model using:

    Confusion Matrix

    Precision

    Recall

    F1 Score

In [65]:
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

cm = confusion_matrix(y_test, y_pred_rf)
print("Confusion Matrix:\n", cm)

print("Precision:", precision_score(y_test, y_pred_rf))
print("Recall:", recall_score(y_test, y_pred_rf))
print("F1 Score:", f1_score(y_test, y_pred_rf))


Confusion Matrix:
 [[91 14]
 [18 56]]
Precision: 0.8
Recall: 0.7567567567567568
F1 Score: 0.7777777777777778
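
The reported metrics follow directly from the four cells of the confusion matrix. As a check, this sketch recomputes them by hand from the matrix printed above, using scikit-learn's layout (rows are actual classes, columns are predicted classes).

```python
# Confusion matrix from the output above: rows = actual, columns = predicted
cm = [[91, 14],
      [18, 56]]
(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted survivors who survived
recall = tp / (tp + fn)                      # share of actual survivors who were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
```

The model misses 18 of the 74 actual survivors in the test set (false negatives) and wrongly predicts survival for 14 of the 105 non-survivors (false positives).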


# ✅ Final Conclusion

In this project, we built a machine learning model to predict Titanic passenger survival.

We trained three models:

    Logistic Regression

    Decision Tree Classifier

    Random Forest Classifier

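Among them, Random Forest achieved the highest test accuracy (0.821), narrowly ahead of Logistic Regression (0.816) and clearly ahead of the Decision Tree (0.777). On the 179-passenger test set, the gap between Random Forest and Logistic Regression amounts to a single passenger, so the two models perform about equally well on this split. Precision (0.80), recall (0.76), and F1 score (0.78) were computed only for Random Forest, so they describe that model rather than compare it with the others.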
