# Titanic Survival Prediction Using Random Forest

## Objective
Predict whether a passenger survived the Titanic disaster using a Random Forest classifier.
We will also examine which features were most important in predicting survival.


In [28]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

 

## Load Dataset

We load the Titanic dataset from Kaggle and inspect the first few rows.


In [29]:
data = pd.read_csv('/kaggle/input/titanic-dataset/Titanic-Dataset.csv')
data.head()


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


## Data Preprocessing

Steps:
- Fill missing values in `Age` and `Embarked`
- Encode categorical variables (`Sex`, `Embarked`)
- Drop non-informative columns (`Name`, `Ticket`, `Cabin`)


In [30]:
# Fill missing values
data['Age'] = data['Age'].fillna(data['Age'].median())
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])

# Encode categorical variables
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})
data['Embarked'] = data['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

# Drop non-informative columns
data.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True, errors='ignore')
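The fill-and-encode steps can be checked on a small frame before running them on the full dataset. The sketch below uses made-up rows (the `toy` frame and its values are assumptions, not taken from the Kaggle file) and adds one extra check: `Series.map` turns any category missing from the mapping into `NaN`, so it is worth confirming that nothing was lost.

```python
import pandas as pd

# Toy frame reusing the notebook's column names; the values are invented.
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "S"],
})

# Median of the observed ages (22, 26, 35) is 26.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
# The most frequent observed port is "S".
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Same integer encodings as in the cell above.
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})
toy["Embarked"] = toy["Embarked"].map({"S": 0, "C": 1, "Q": 2})

# Count values that are still missing after filling and mapping.
remaining_missing = int(toy.isna().sum().sum())
print(toy)
print("Missing values left:", remaining_missing)
```

If `remaining_missing` is not zero on real data, a category outside the mapping (for example an unexpected `Embarked` code) is the first thing to look for.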


## Features and Target

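- Target variable: `Survived`
- Features: all remaining columns (at this point these still include `PassengerId`, which is a row identifier rather than a passenger attribute)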


In [31]:
X = data.drop('Survived', axis=1)
y = data['Survived']
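Because `PassengerId` is only a row identifier, one option is to keep it aside for reporting and leave it out of the feature matrix. The sketch below shows this on a made-up frame; the names `toy`, `passenger_ids`, `X_no_id`, and `y_toy` are hypothetical, and this is a variation on the notebook's approach rather than what the cells here do.

```python
import pandas as pd

# Invented rows using the notebook's column names.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": [0, 1, 1],
})

# Keep the identifier for later reporting, but do not train on it.
passenger_ids = toy["PassengerId"]
X_no_id = toy.drop(columns=["Survived", "PassengerId"])
y_toy = toy["Survived"]

print(list(X_no_id.columns))
```

Making this change in the notebook would alter the accuracy and importance figures reported further down, so those outputs would need to be regenerated.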


## Train-Test Split

Split the dataset into:
- 80% training data
- 20% testing data

`random_state=42` fixes the shuffle, so the split is reproducible.


In [32]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
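With an imbalanced target, a plain random split can leave the train and test sets with slightly different class proportions. A minimal sketch of a stratified split follows, using synthetic labels with an assumed 40% positive rate (`X_demo` and `y_demo` are placeholders, not the Titanic data). The only change from the call above is the `stratify` argument.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 40 positives and 60 negatives (an assumed ratio).
y_demo = np.array([1] * 40 + [0] * 60)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both parts keep the overall positive rate.
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```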


## Train Random Forest Classifier

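Random Forest is an ensemble of decision trees. With scikit-learn's defaults, each tree is trained on a bootstrap sample of the rows and considers a random subset of features at each split.
The ensemble can capture non-linear patterns and feature interactions without manual feature engineering.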


In [33]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
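To make the "ensemble of trees" idea concrete, the sketch below fits a small forest on synthetic data (from `make_classification`, not the Titanic file) and averages the class probabilities of the individual trees by hand. In scikit-learn, the forest's `predict_proba` is this mean over the trees in `estimators_`; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification problem.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Collect each tree's class probabilities: shape (n_trees, n_samples, n_classes).
per_tree = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
manual_proba = per_tree.mean(axis=0)

# Compare the hand-computed average with the forest's own output.
matches = np.allclose(manual_proba, forest.predict_proba(X_demo))
print("manual average matches forest:", matches)
```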


## Predictions and Accuracy

Predict survival on the test set and evaluate model accuracy.


In [34]:
y_pred = rf_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Random Forest Accuracy:", accuracy)


Random Forest Accuracy: 0.8324022346368715
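The figure above comes from a single 80/20 split, so it depends on which rows landed in the test set. A common way to get a less split-dependent estimate is k-fold cross-validation. The sketch below runs it on synthetic data (an assumption made to keep the example self-contained), so its scores say nothing about the Titanic model itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the feature matrix and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)

# Five folds: each row is used for testing exactly once.
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
print("fold accuracies:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```

On the notebook's data, the same call with `X` and `y` in place of the synthetic arrays would give a mean and spread to report alongside the single-split accuracy.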


## Feature Importance

Examine which features contributed most to the model’s predictions. `feature_importances_` reports impurity-based importances, normalized so that they sum to 1.


In [35]:
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

print(feature_importance)


       Feature  Importance
2          Sex    0.257114
0  PassengerId    0.198014
6         Fare    0.192336
3          Age    0.164777
1       Pclass    0.079702
4        SibSp    0.043360
5        Parch    0.032479
7     Embarked    0.032218
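Impurity-based importances are computed on the training data and tend to favor features with many distinct values, which may explain why an identifier such as `PassengerId` ranks so high here. Permutation importance on held-out data is one way to cross-check. The sketch below uses synthetic data with an uninformative `row_id` column added on purpose; all names and data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic features plus a row identifier that carries no label information.
X_arr, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])
X_demo["row_id"] = np.arange(len(X_demo))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on the test set and measure the accuracy drop.
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)

comparison = pd.DataFrame({
    "Feature": X_demo.columns,
    "ImpurityImportance": forest.feature_importances_,
    "PermutationImportance": result.importances_mean,
}).sort_values("PermutationImportance", ascending=False)
print(comparison)
```

A feature that scores well on the impurity measure but near zero under permutation is a candidate for removal.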


## Sample Predictions

View some sample predictions along with Passenger IDs to inspect results.


In [36]:
predictions_df = pd.DataFrame({
    'PassengerId': data.loc[X_test.index, 'PassengerId'],
    'ActualSurvival': y_test,           # The true labels
    'RandomForestPred': y_pred          # Model predictions
})

predictions_df.head(10)


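| | PassengerId | ActualSurvival | RandomForestPred |
|---|---|---|---|
| 709 | 710 | 1 | 0 |
| 439 | 440 | 0 | 0 |
| 840 | 841 | 0 | 0 |
| 720 | 721 | 1 | 1 |
| 39 | 40 | 1 | 0 |
| 290 | 291 | 1 | 1 |
| 300 | 301 | 1 | 1 |
| 333 | 334 | 0 | 0 |
| 208 | 209 | 1 | 1 |
| 136 | 137 | 1 | 1 |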
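Accuracy alone does not show which kind of error the model makes. As a small worked example, the sketch below builds a confusion matrix from just the ten rows shown above, typed in by hand. It describes only this sample, not the full test set.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# The ten (ActualSurvival, RandomForestPred) pairs from the table above.
actual = [1, 0, 0, 1, 1, 1, 1, 0, 1, 1]
predicted = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]

# Rows are actual classes (0, 1); columns are predicted classes (0, 1).
cm = confusion_matrix(actual, predicted, labels=[0, 1])
sample_accuracy = accuracy_score(actual, predicted)

print(cm)               # [[3 0]
                        #  [2 5]]
print(sample_accuracy)  # 0.8
```

Both errors in this sample are survivors predicted as non-survivors (passengers 710 and 40). Running `confusion_matrix(y_test, y_pred)` in the notebook would show whether that pattern holds across the whole test set.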


## Model Summary

We used a **Random Forest Classifier** to predict Titanic passenger survival.

### Key Points:
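- **Accuracy:** The model reaches 83.24% accuracy on the held-out test set (a single 80/20 split with `random_state=42`).
- **Feature Importance:**
  - **Sex** is the most important predictor (0.257). Historically, women survived more often because they were prioritized during evacuation.
  - `PassengerId` ranks second (0.198), but it is an arbitrary row identifier. Impurity-based importances tend to inflate features with many distinct values, so this score most likely reflects the model fitting noise rather than real signal; excluding `PassengerId` from the features would give a cleaner picture.
  - `Fare` (0.192) and `Age` (0.165) are also important.
  - `Pclass` (0.080), `SibSp`, `Parch`, and `Embarked` have lower influence in this dataset.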
- **Insights:** 
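  - Random Forest can capture **non-linear patterns** and **feature interactions** that a linear model such as Logistic Regression might miss unless interaction terms are added by hand.
  - The importance ranking is consistent with historical accounts of the evacuation, although feature importances alone do not show the direction of an effect.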
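The claim about feature interactions can be illustrated with a deliberately constructed example. In the sketch below the label depends only on whether two synthetic features share a sign, which is a pure interaction with no linear signal. The data and variable names are assumptions made for illustration; no such comparison was run on the Titanic data in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(1000, 2))
# Label is 1 when both features have the same sign (an XOR-like pattern).
y_demo = (X_demo[:, 0] * X_demo[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

linear = LogisticRegression().fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Mean accuracy on the held-out quarter of the data.
lr_acc = linear.score(X_te, y_te)
rf_acc = forest.score(X_te, y_te)
print(f"logistic regression: {lr_acc:.3f}, random forest: {rf_acc:.3f}")
```

On real data the gap is usually much smaller, so fitting a Logistic Regression baseline on the same split is the more direct way to test the claim for this dataset.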

This notebook demonstrates a **complete workflow**: data preprocessing → model training → prediction → feature analysis.
