<h1>Titanic Survival Prediction Project</h1>

<h2>Introduction</h2>

This project predicts whether passengers aboard the Titanic survived the disaster. It is a classic binary classification problem in supervised machine learning: each passenger is assigned one of two outcomes, "Survived" (represented as "1") or "Did Not Survive" (represented as "0").


<h2>Exploratory Data Analysis (EDA)</h2>

Before constructing our predictive model, we conducted an exploratory data analysis (EDA) to achieve the following objectives:

* Gain insights into the dataset.
* Clean and preprocess the data to make it suitable for model building.

In the EDA phase, we examined various aspects of the data to better understand its characteristics and potential patterns. This analysis will inform our subsequent steps in constructing a predictive model for Titanic survival.

<h2>Data Cleaning</h2>

In [None]:
#Load libraries into code environment

import numpy as np
import pandas as pd

<h3>Data Loading and Initial Assessment</h3>

In this stage of the project, we loaded the training and testing datasets into our code environment. We utilized the .info() method to gain a preliminary understanding of the data, including its basic characteristics and the presence of missing values.

<h3>Identifying Missing Data</h3>

Upon inspecting the datasets, we observed that both the training and testing datasets have missing values in the "Age" and "Cabin" columns. In addition, the training dataset has a few missing values in the "Embarked" column, and the testing dataset has a missing value in the "Fare" column. Missing data must be addressed during data preprocessing to ensure the quality of our model and predictions.
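As a minimal sketch of how missing values can be counted per column, the snippet below uses a small made-up DataFrame. The column names mirror the Titanic data, but the values and the names toy, missing_counts, and missing_share are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy data standing in for a few Titanic columns (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 8.05, 8.46],
})

# Number of missing entries in each column
missing_counts = toy.isnull().sum()

# Share of missing entries in each column (mean of the True/False mask)
missing_share = toy.isnull().mean()

print(missing_counts)
print(missing_share)
```

Looking at the share of missing entries, rather than the raw count alone, can help when deciding whether a column is worth imputing or should be dropped.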

In [None]:
#Load test and train datasets into code environment

test = pd.read_csv('/kaggle/input/titanic/test.csv')
train = pd.read_csv('/kaggle/input/titanic/train.csv')

#Preview and get some info about the dataset
test.info()
train.info()

test.Age.isnull().sum()

<h3>Data Preprocessing: Handling Missing Data</h3>

To address the missing data in the "Age" column of the training dataset, we applied the following data imputation technique:

<h3>Imputing Missing Age Values</h3>

We used the SimpleImputer class from the scikit-learn library to fill in the missing values in the "Age" column. Specifically, we replaced each missing value with the mean age of the passengers in the training dataset. This approach keeps every row in the dataset, so it remains complete and ready for further analysis and model building.
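To make the arithmetic behind mean imputation concrete, here is a plain-Python sketch on a short made-up list of ages. It is an illustration of the idea only, not the scikit-learn implementation, and the names ages, observed, mean_age, and imputed are hypothetical.

```python
import math

# Made-up ages; NaN marks a missing value
ages = [22.0, float("nan"), 38.0, 30.0, float("nan")]

# Mean of the observed (non-missing) ages only
observed = [a for a in ages if not math.isnan(a)]
mean_age = sum(observed) / len(observed)

# Replace each missing entry with that mean; observed entries are unchanged
imputed = [mean_age if math.isnan(a) else a for a in ages]

print(mean_age)
print(imputed)
```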

In [None]:
#Fill missing values in Age column of train dataset with mean of the column

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')

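#Select the Age column as a 2-D frame, the shape SimpleImputer expects
age_column_train = train[['Age']]

#Learn the mean age from the train dataset
imputer.fit(age_column_train)

#Replace the missing ages with the learned mean
train['Age'] = imputer.transform(age_column_train).ravel()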




<h2>Feature Engineering</h2>

In the feature engineering phase, we aim to prepare our dataset for model building by selecting and transforming relevant features. Here are the key steps we took:

<h3>Identifying the Target Variable</h3>
We identified the target variable for our prediction task as the 'Survived' column in the training dataset.

<h3>Feature Selection</h3>
To enhance the quality of our model, we conducted feature selection by excluding columns that are not significant for predicting the target variable. Notably, we dropped the 'Cabin' column due to a large number of missing values (687 out of 891 datapoints).

<h3>Handling Categorical and Ordinal Data</h3>
To make the dataset compatible with various machine learning algorithms and to improve model performance, we converted categorical and ordinal data into numeric format. This transformation ensures that our model can effectively utilize these variables for prediction.
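The following sketch shows what one-hot encoding does to categorical columns, using a three-row made-up frame. The dtype=int argument is our own choice here so the output prints as 0/1 rather than True/False; the names toy and encoded are hypothetical.

```python
import pandas as pd

# Made-up rows with two categorical columns
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# Each category becomes its own indicator column named <column>_<category>
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=int)

print(encoded)
```

Only categories that actually occur in the data produce a column, which is why "Embarked_Q" is absent from this toy result.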






In [None]:
#Drop the columns that are less significant

train_rev = train.drop(['Name', 'Cabin', 'Ticket', 'PassengerId'], axis=1)




In [None]:
#Convert column with ordinal data to numeric data using Label Encoder
#(Pclass is already numeric; the encoder maps 1, 2, 3 to 0, 1, 2)

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
train_rev['Pclass'] = le.fit_transform(train_rev['Pclass'])


#Convert columns with categorical data to numeric data using one-hot encoding
train_rev = pd.get_dummies(train_rev, columns=['Sex', 'Embarked'])


train_rev.info()


<h3>More Feature Engineering</h3>

After successfully converting all relevant data to numeric format, we now have a dataset with 11 columns: 10 features plus the target variable.

<h3>Correlation Analysis</h3>

To better understand the relationships between these features and our target variable, "Survived," we conducted correlation analysis. This analysis provides insights into how each feature may influence the likelihood of survival and is visualized to facilitate our understanding.
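By default, the pandas corr() method computes the Pearson correlation coefficient. As a rough illustration of what that number measures, the sketch below computes it by hand for two made-up indicator columns. The helper pearson and the sample values are assumptions for this example, not data from the Titanic files.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up indicator columns (1 = yes, 0 = no)
sex_female = [1, 1, 0, 0, 1, 0]
survived = [1, 1, 0, 0, 0, 1]

r = pearson(sex_female, survived)
print(round(r, 3))
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship.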

In [None]:
#Check relationship between features and target variable using correlation analysis

import seaborn as sns
import matplotlib.pyplot as plt

target_variable = 'Survived'
correlation_matrix = train_rev.corr()
target_correlations = correlation_matrix[target_variable]


plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='cividis', fmt=".2f")
plt.title(f'Correlation Heatmap (Target Variable: {target_variable})')
plt.show()


<h3>Correlation Analysis Results</h3>

Upon performing correlation analysis, we observed that some of the strongest correlations with the target variable, "Survived," are found within the gender-related features.

<h3>Feature Selection Based on Correlation</h3>

In light of these findings, we made informed decisions regarding feature selection. Specifically, we opted to drop some of the features with the lowest correlation to the target variable. This step was taken because these features showed little to no relationship with survival, and retaining them would not significantly contribute to the predictive power of our model.
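One possible way to make this kind of selection explicit is to rank features by the absolute value of their correlation with the target and flag those below a cutoff. The sketch below does this on made-up data; the threshold of 0.1 and the names toy, target_corr, ranked, and weak_features are hypothetical choices for illustration, not values used in this project.

```python
import pandas as pd

# Made-up data with one informative and one uninformative feature
toy = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 0, 1],
    "Sex_female": [1, 1, 0, 0, 1, 0],
    "Parch": [0, 2, 1, 0, 2, 1],
})

# Correlation of every feature with the target (the target itself is excluded)
target_corr = toy.corr()["Survived"].drop("Survived")

# Rank by strength of the relationship, ignoring its sign
ranked = target_corr.abs().sort_values(ascending=False)

# Hypothetical cutoff below which a feature is considered weak
threshold = 0.1
weak_features = ranked[ranked < threshold].index.tolist()

print(ranked)
print(weak_features)
```

Note that correlation only captures linear relationships, so a low value is a hint rather than proof that a feature is uninformative.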

In [None]:
#Drop features with the lowest correlation to the target variable

train_rev = train_rev.drop(['Parch','Embarked_C', 'Embarked_Q', 'Embarked_S'], axis=1)

train_rev.head()

<h2>Model Training</h2>

In this phase of the project, we have carefully selected three models suitable for binary classification problems. These models will be trained using the preprocessed training dataset, and their performance will be rigorously evaluated.

<h3>Model Selection</h3>
We chose three distinct models, each known for its effectiveness in binary classification tasks. These models will serve as candidates for predicting the survival outcome:

* Logistic Regression
* Decision Tree Classifier
* Random Forest Classifier

In [None]:
#Import the necessary sklearn libraries
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

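#Separate the features from the target variable
X = train_rev.drop("Survived", axis=1)
#Select the target as a 1-D Series to avoid a DataConversionWarning when fitting
y = train_rev['Survived']

#Split the train dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)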

#Initializing models
models = [
    ("Logistic Regression", LogisticRegression()),
    ("Random Forest", RandomForestClassifier()),
    ("Decision Tree", DecisionTreeClassifier())
        ]

<h2>Model Evaluation</h2>

In this section, we evaluate the performance of our models on the held-out portion of the training data and with cross-validation.

<h3>Performance Metrics</h3>

To assess the performance of these models, we have established three key metrics:

* Accuracy: The fraction of all predictions that are correct.
* Precision: The fraction of passengers predicted as "Survived" who actually survived.
* Recall: The fraction of passengers who actually survived that the model correctly identifies.
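The three metrics can be written in terms of the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The sketch below uses hypothetical counts chosen for easy arithmetic; they are not results from this project.

```python
def accuracy(tp, tn, fp, fn):
    # Correct predictions divided by all predictions
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    # Correct positive predictions divided by all positive predictions
    return tp / (tp + fp)

def recall(tp, fn):
    # Correct positive predictions divided by all actual positives
    return tp / (tp + fn)

# Hypothetical counts for illustration only
tp, tn, fp, fn = 30, 50, 10, 10

acc = accuracy(tp, tn, fp, fn)
prec = precision(tp, fp)
rec = recall(tp, fn)

print(acc, prec, rec)
```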

<h3>Confusion Matrix</h3>
We use confusion matrices to visually interpret the performance of each model. Each matrix breaks the predictions down into true positives, true negatives, false positives, and false negatives, giving us insight into the strengths and weaknesses of each model. Ultimately, this analysis will guide our selection of the model that best suits our prediction task.
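As a small worked example of how the four cells are counted, the sketch below tallies them from two made-up label lists. The layout [[TN, FP], [FN, TP]], with rows for actual and columns for predicted labels, matches what scikit-learn's confusion_matrix returns for labels 0 and 1; the variable names are our own.

```python
# Made-up actual and predicted labels (1 = survived, 0 = did not survive)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))

tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted 1, actually 1
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # predicted 0, actually 0
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted 1, actually 0
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted 0, actually 1

# Rows are actual labels (0, 1); columns are predicted labels (0, 1)
matrix = [[tn, fp], [fn, tp]]

print(matrix)
```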

<h3>Cross Validation</h3>

To obtain a more reliable estimate of how well our models generalize, we employ 5-fold cross-validation. This approach trains and evaluates each model five times, each time holding out a different fifth of the training data for evaluation. By using our selected metrics as the scoring criteria during cross-validation, we reduce the dependence of our results on any single train/test split and make fuller use of our dataset.
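To illustrate how the data is partitioned, here is a simplified plain-Python sketch of k-fold splitting. It uses contiguous, unshuffled folds; for classifiers, scikit-learn's cross_validate with cv=5 uses stratified folds instead, so this is an approximation of the idea rather than the exact procedure. The helper k_fold_indices is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))

for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Every sample appears in exactly one validation fold, so each observation is used for both training and evaluation across the five rounds.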

In [None]:
#Metric 1 - Accuracy

for model_name,model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    print(f"{model_name} Accuracy: {accuracy}")


In [None]:
# Metric 2 - Precision

for model_name,model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    precision = precision_score(y_test, y_pred)
    
    print(f"{model_name} Precision: {precision}")

In [None]:
#Metric 3 - Recall

for model_name,model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    recall = recall_score(y_test, y_pred)
    
    print(f"{model_name} Recall: {recall}")

In [None]:
for model_name,model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    cm = confusion_matrix(y_test, y_pred)
    
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.title(f"Confusion Matrix for {model_name}")
    plt.show()

In [None]:
scoring = ['accuracy', 'precision', 'recall']

In [None]:
# Perform cross-validation and evaluate models
for model_name, model in models:
    # Perform 5-fold cross-validation on the training split
    # (np.ravel passes the target as the 1-D array sklearn expects)
    cv_results = cross_validate(model, X_train, np.ravel(y_train), cv=5, scoring=scoring)

    # Report the per-fold scores and their mean for each metric
    for metric in scoring:
        scores = cv_results[f"test_{metric}"]
        print(f"{model_name} - {metric.capitalize()} Scores: {scores} (mean: {scores.mean():.3f})")

<h2>Observation and Model Selection</h2>

After conducting cross-validation, we observed a considerable reduction in the variance of model performance. Among the three models we trained, the Confusion Matrix of the Random Forest model stood out:

<h3>Random Forest Model Performance</h3>

The Random Forest model exhibited the highest number of True Positives.
This model demonstrated some of the best scores when evaluated using our selected metrics, including Accuracy, Precision, and Recall.

<h3>Model Selection</h3>

Taking these observations into account, we selected the Random Forest Classifier as our preferred model. It will be used to make predictions on the test data, as it showed promising performance during our evaluation phase.







<h2>Inference</h2>

In this phase, we apply our trained Random Forest model to make predictions on a new dataset, the Test data. Before proceeding with predictions, it is crucial to ensure that the Test data is preprocessed and formatted consistently with our training dataset.
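One common way to keep the two datasets consistent is to learn preprocessing statistics from the training data only and reuse them on the test data. The sketch below illustrates this with SimpleImputer on made-up age columns; the arrays and the name imputer_demo are assumptions for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up age columns; NaN marks a missing value
train_age = np.array([[20.0], [np.nan], [40.0]])
test_age = np.array([[np.nan], [50.0]])

imputer_demo = SimpleImputer(strategy="mean")

# Learn the mean from the training column only
imputer_demo.fit(train_age)

# Reuse the learned mean to fill the gap in the test column
test_filled = imputer_demo.transform(test_age)

print(imputer_demo.statistics_)
print(test_filled)
```

Fitting once and calling only transform on new data means the model sees test inputs prepared with the same values it was trained with.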

<h3>Data Preprocessing</h3>

To prepare the Test data for prediction, we perform the following steps:

* Data Cleaning: We fill the missing values in the "Age" and "Fare" columns so that the remaining features contain no missing entries.
* Data Formatting: The Test data is formatted to match the structure of our training dataset, ensuring that it contains the same features and data types.
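When one-hot encoding is applied separately to two datasets, a category missing from one of them produces mismatched columns. The sketch below shows one possible safeguard, reindexing the encoded test frame to the training columns. This is an illustrative technique on made-up data, not a step taken in this notebook, and all names in it are hypothetical.

```python
import pandas as pd

# Made-up frames; the test frame happens to contain only one port
train_toy = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Fare": [7.25, 71.28, 8.46]})
test_toy = pd.DataFrame({"Embarked": ["S", "S"], "Fare": [8.05, 13.0]})

train_enc = pd.get_dummies(train_toy, columns=["Embarked"], dtype=int)
test_enc = pd.get_dummies(test_toy, columns=["Embarked"], dtype=int)

# Give the test frame exactly the training columns, in the same order;
# categories absent from the test data become all-zero columns
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_aligned)
```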

<h3>Prediction</h3>
Once the Test data is appropriately cleaned and formatted, we utilize our previously trained Random Forest model to make predictions. These predictions will provide insights into the likelihood of survival for passengers in the Test dataset, based on the patterns and information learned from our training data.

In [None]:
#Prepare the test dataset with the same steps used for the train dataset

#Fill missing data in Age column with the mean learned from the train dataset
test['Age'] = imputer.transform(test[['Age']]).ravel()

#Drop columns that were dropped in train dataset
test_rev = test.drop(['Name', 'Cabin', 'Ticket', 'Embarked', 'Parch', 'PassengerId'], axis=1)

#Transform ordinal data to numeric data, reusing the encoder fitted on the train dataset
test_rev['Pclass'] = le.transform(test_rev['Pclass'])

#Convert the categorical Sex column to numeric data using one-hot encoding
test_rev = pd.get_dummies(test_rev, columns=['Sex'])

#Fill the missing data in the Fare column with the most frequent value
imputer2 = SimpleImputer(strategy='most_frequent')
fare_column_test = test_rev[['Fare']]
imputer2.fit(fare_column_test)
test_rev['Fare'] = imputer2.transform(fare_column_test).ravel()

test_rev.info()

In [None]:
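#Retrieve the Random Forest model from the list of candidate models
random_forest_model = dict(models)["Random Forest"]

#Fit on the training split; np.ravel flattens the target to the 1-D shape sklearn expects
random_forest_model.fit(X_train, np.ravel(y_train))

#Predict survival for the passengers in the test dataset
predictions = random_forest_model.predict(test_rev)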
        

In [None]:
predictions = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': predictions})


predictions.to_csv('submissions.csv', index=False)

#Count the passengers predicted to survive
predictions['Survived'].sum()


In [None]:

# To download the file in Jupyter Notebook

from IPython.display import FileLink
FileLink('submissions.csv')



<h2>Prediction and Result Storage</h2>

Following the application of our trained Random Forest model to the Test dataset, predictions were generated and stored in the 'Survived' column of a predictions dataframe.

In our run, the model predicted that 138 passengers in the Test dataset survived the Titanic disaster.

To facilitate submission and further analysis, the predictions dataframe has been converted and saved as a CSV file named 'submissions.csv'. This file contains the predicted survival outcomes for the passengers in the Test dataset, allowing for easy sharing and assessment of our model's performance.