**Titanic Survival Prediction (using Random Forest)**



In [12]:
# Importing required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load Titanic dataset
# Point this at the location of the Titanic CSV file on your machine.
titanic_file_path = 'Titanic.csv'
titanic_df = pd.read_csv(titanic_file_path)

# Display the first few rows of the dataset to understand its structure (optional).
# print(titanic_df.head())

# Step 1: Handle missing values
# The 'age' column has missing values; fill them with the median age.
# The 'embarked' column also has missing values; fill them with the mode (the most common embarkation point).
# Assign the result of fillna() back to the column rather than calling
# titanic_df['age'].fillna(..., inplace=True), which triggers a FutureWarning in recent pandas versions.
titanic_df['age'] = titanic_df['age'].fillna(titanic_df['age'].median())  # Replace missing ages with the median
titanic_df['embarked'] = titanic_df['embarked'].fillna(titanic_df['embarked'].mode()[0])  # Replace missing embarkation points with the mode


# Step 2: Encoding categorical features
# The columns 'sex', 'embarked', 'class', 'who', and 'alone' contain categorical data.
# scikit-learn estimators expect numeric input, so these columns must be converted before training.
# We'll use one-hot encoding, which turns each category into its own binary indicator column.
# For example, 'sex' would become 'sex_female' and 'sex_male'; drop_first=True keeps only 'sex_male',
# because the dropped column is fully determined by the remaining one (this avoids redundant, collinear features).
titanic_df = pd.get_dummies(titanic_df, columns=['sex', 'embarked', 'class', 'who', 'alone'], drop_first=True)

# Step 3: Define the feature matrix (X) and target variable (y)
# X will contain all the features (input data), and y will contain the target ('survived' column).
X_titanic = titanic_df.drop('survived', axis=1)  # Drop the 'survived' column to get the features
y_titanic = titanic_df['survived']  # The target variable we are trying to predict

# Step 4: Split the dataset into training and testing sets
# We'll use 80% of the data for training and 20% for testing the model.
X_train_titanic, X_test_titanic, y_train_titanic, y_test_titanic = train_test_split(X_titanic, y_titanic, test_size=0.2, random_state=42)

# Step 5: Train a Random Forest Classifier
# Random Forest is an ensemble method that trains many decision trees on bootstrap samples of the data
# and combines them; scikit-learn averages the trees' predicted class probabilities.
# It is often a strong baseline for tabular classification tasks like this one.
rf_model_titanic = RandomForestClassifier(random_state=42)  # Set random_state for reproducibility
rf_model_titanic.fit(X_train_titanic, y_train_titanic)  # Fit the model on the training data

# Step 6: Make predictions on the test set
# Now that the model is trained, we'll use it to make predictions on the test data.
y_pred_titanic = rf_model_titanic.predict(X_test_titanic)

# Step 7: Evaluate the model
# We'll use accuracy, precision, recall, and F1-score to evaluate how well the model performs.
# Accuracy is the fraction of test passengers classified correctly, while precision, recall, and F1
# are reported separately for each class (0 = didn't survive, 1 = survived).
accuracy_titanic = accuracy_score(y_test_titanic, y_pred_titanic)  # Calculate the accuracy of the model
classification_report_titanic = classification_report(y_test_titanic, y_pred_titanic)  # Detailed classification report

# Step 8: Print the results
print(f"Accuracy: {accuracy_titanic:.4f}")  # Print accuracy with 4 decimal places
print("Classification Report:\n", classification_report_titanic)  # Print the detailed classification report


Accuracy: 0.8212
Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.88      0.85       105
           1       0.81      0.74      0.77        74

    accuracy                           0.82       179
   macro avg       0.82      0.81      0.81       179
weighted avg       0.82      0.82      0.82       179
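
The per-class figures in the report follow directly from four confusion-matrix counts. The sketch below recomputes them by hand. Note that the notebook never prints the confusion matrix: the counts used here are reconstructed from the rounded report (supports of 105 and 74, recalls of 0.88 and 0.74), so treat them as an assumption. The helper name `precision_recall_f1` is hypothetical.

```python
# Confusion-matrix counts reconstructed from the rounded report above
# (an assumption; the notebook itself never prints them).
tn, fp = 92, 13   # actual class 0 (didn't survive): 92 + 13 = 105 passengers
fn, tp = 19, 55   # actual class 1 (survived):       19 + 55 = 74 passengers


def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) for one class from raw counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


accuracy = (tp + tn) / (tp + tn + fp + fn)

# For class 1, the "positives" are survivors.
survived = precision_recall_f1(tp, fp, fn)
# For class 0 the roles swap: a true positive is a correctly predicted non-survivor.
not_survived = precision_recall_f1(tn, fn, fp)

print(f"accuracy: {accuracy:.4f}")
print("class 0:", [round(v, 2) for v in not_survived])
print("class 1:", [round(v, 2) for v in survived])
```

With these counts the accuracy is 147/179, and the rounded per-class values agree with the two class rows of the report.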

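
To make the effect of `drop_first=True` in Step 2 concrete, here is a small sketch on a three-row frame. The `toy_df` data is invented for illustration and is not taken from `Titanic.csv`.

```python
import pandas as pd

# Toy data invented for illustration; these are not rows from Titanic.csv.
toy_df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0],
    "sex": ["male", "female", "female"],
    "embarked": ["S", "C", "Q"],
})

# Each category becomes its own indicator column; drop_first=True removes the
# first category (in sorted order) of every encoded column.
encoded = pd.get_dummies(toy_df, columns=["sex", "embarked"], drop_first=True)

print(list(encoded.columns))
```

Here 'female' and 'C' act as the dropped baseline categories: a row whose `sex_male` indicator is false is female, and a row where both `embarked_Q` and `embarked_S` are false embarked at 'C'.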

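
The averaging described in Step 5 can be checked directly. This is a minimal sketch on synthetic data from `make_classification` (an assumption made so the example runs without `Titanic.csv`); it compares the forest's `predict_proba` output with the mean of the individual trees' outputs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the Titanic features are not used here.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X, y)

# Average the class probabilities predicted by each fitted tree.
mean_tree_proba = np.mean(
    [tree.predict_proba(X) for tree in forest.estimators_], axis=0
)

# The forest's own probabilities should match that average.
matches = np.allclose(mean_tree_proba, forest.predict_proba(X))
print("forest equals mean of trees:", matches)
```

The predicted label from `predict` is then the class with the highest averaged probability, which is why a forest can disagree with a simple majority vote of its trees in close cases.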