# Titanic - Machine Learning from Disaster

This is the legendary Titanic ML competition – the best, first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works.

The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

**Author:** Gerley Adriano Miranda Cruz

## 1. Setup

### 1.1 Importing the common packages

First, let's import all the packages that we will use in this notebook. We will use the following packages:

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # data visualization using matplotlib
import seaborn as sns # data visualization using seaborn
import os # accessing directory structure

from sklearn.ensemble import RandomForestClassifier # Random Forest algorithm
from sklearn.model_selection import train_test_split # splitting the data into train and test
from sklearn.metrics import accuracy_score # accuracy score

### 1.2 Loading the important files

Now, let's load the important files that we will use in this notebook. We will use the following files:

In [None]:
# Input data files are available in the read-only "input" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

for dirname, _, filenames in os.walk('input'):
    for filename in filenames: # print all files in the input directory
        print(os.path.join(dirname, filename)) # print the path of all files in the input directory

### 1.3 Building the dataframes

Now, let's build the dataframes that we will use in this notebook. We will use the following dataframes:

In [2]:
# Loading the training data
train_df = pd.read_csv("input/train.csv")

# Show the first five rows of the training data
train_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [3]:
# Loading the test data
test_df = pd.read_csv("input/test.csv")

# Show the first five rows of the test data
test_df.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


## 2. Exploratory Data Analysis

### 2.1 Overview

In this section, we will take a first look at the dataframes that we will use in this notebook: their shapes, their column types and their summary statistics.

In [None]:
# Show the shape of the training data
train_df.shape

In [None]:
# Show the shape of the test data
test_df.shape

In [None]:
# Show the general information of the training data
train_df.info()

In [None]:
# Show the general information of the test data
test_df.info()

Note that the `info()` output shows that the dataframes do contain missing values: `Age`, `Cabin` and `Embarked` in the training data, and `Age`, `Fare` and `Cabin` in the test data. The `Cabin` column is dropped in section 3.1 and the remaining gaps are imputed in section 3.2 before modelling.
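
A quick way to check this is to count the null entries per column. The sketch below uses a small made-up frame (`toy_df`, a hypothetical stand-in for `train_df`) so that it runs on its own; the same `isna().sum()` call should apply to the real dataframes.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for train_df; the values are made up.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, "C123"],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# Count missing entries per column and keep only the columns that have any.
missing_counts = toy_df.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```

Columns that do not appear in the result have no missing entries and need no imputation.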

In [None]:
# Show the summary statistics of the training data
train_df.describe(include='all').T

### 2.2 Passenger Class Distribution

In this section, we will analyze the distribution of the passenger class in the train dataframe, split by survival status. The distribution of the passenger class in the train dataframe is:

In [None]:
# Show the passenger class distribution of the training data, split by survival, using a bar plot (countplot)
sns.set(rc={'figure.figsize':(8,6)})  # set the figure size before plotting so that it applies to this figure
sns.countplot(x='Pclass', data=train_df, palette='hls', hue='Survived',
            order=train_df['Pclass'].value_counts().index)
plt.xlabel('Passenger Class')
plt.ylabel('Count')
# Add a title to the plot
plt.title('Passenger Class Distribution - Training Data')
plt.show()

### 2.2 Survival Distribution

In this section, we will analyze how many passengers survived and how many did not in the train dataframe. The distribution of the survival status in the train dataframe is:

In [None]:
# Show the survival distribution of the training data using a bar plot (countplot)
survived_counts = train_df['Survived'].value_counts()
print(survived_counts)

sns.countplot(data=train_df, x='Survived', palette='hls')
plt.title('Survival Distribution - Training Data')
plt.xlabel('Survived (0 = No, 1 = Yes)')
plt.ylabel('Count')
plt.show()

### 2.3 Age Distribution by Survival Status

In this section, we will analyze the distribution of the age by survival status in the train dataframe. The distribution of the age by survival status in the train dataframe is:

In [None]:
# Show the passenger age distribution of the training data using a stacked histogram with a KDE curve
sns.histplot(data=train_df, x='Age', hue='Survived', kde=True,
             palette='hls', multiple='stack', edgecolor='.3', linewidth=.5)
plt.xlabel('Age')
plt.title('Passenger Age Distribution - Training Data (Survived vs Non-Survived)')
plt.show()

### 2.4 Fare Distribution by Survival Status

In this section, we will analyze the distribution of the fare by survival status in the train dataframe. The distribution of the fare by survival status in the train dataframe is:

In [None]:
# Show the fare distribution of the training data using a stacked histogram with a KDE curve
sns.histplot(data=train_df, x='Fare', hue='Survived', kde=True,
                palette='hls', multiple='stack', edgecolor='.3', linewidth=.5)
plt.title('Passenger Fare Distribution - Training Data (Survived vs Non-Survived)')
plt.show()

### 2.5 Survival Status by Port of Embarkation

In this section, we will analyze the survival rate by port of embarkation in the train dataframe. The survival rate by port of embarkation in the train dataframe is:

In [None]:
# Show the survival rate by port of embarkation of the training data using a bar plot
sns.barplot(x='Embarked', y='Survived', data=train_df, palette='hls',
            errorbar=None, order=['S', 'C', 'Q'])
plt.xlabel('Embarked (S = Southampton, C = Cherbourg, Q = Queenstown)')
plt.ylabel('Survival Rate')
plt.title('Survival Rate by Embarked - Training Data (S, C, Q)')
plt.show()

### 2.6 Correlation Matrix

In this section, we will analyze the correlation matrix in the train dataframe. The correlation matrix in the train dataframe is:

In [None]:
# Show the correlation matrix of the training data using a heatmap
correlation_matrix = train_df[['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm',
            linewidths=0.2, vmin=-1, vmax=1, linecolor='white', cbar=False,
            square=True, fmt='.2f', mask=np.triu(correlation_matrix), center=0, robust=False,
            yticklabels=True, xticklabels=True, ax=None)
plt.title('Correlation Matrix - Training Data')
plt.show()
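
The `mask` argument hides every heatmap cell where the mask is truthy, so passing `np.triu(correlation_matrix)` hides the upper triangle together with the diagonal. The sketch below builds the same kind of mask explicitly as booleans on made-up data (`toy_df` and its values are hypothetical) and uses `DataFrame.mask` to show which cells remain visible.

```python
import numpy as np
import pandas as pd

# Made-up numeric data standing in for the Titanic columns.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame(rng.normal(size=(50, 4)),
                      columns=["Survived", "Pclass", "Age", "Fare"])
corr = toy_df.corr()

# Boolean mask: True on and above the main diagonal, False below it.
upper_mask = np.triu(np.ones(corr.shape, dtype=bool))

# DataFrame.mask replaces the True cells with NaN, mirroring what the heatmap hides.
lower_only = corr.mask(upper_mask)
print(lower_only.round(2))
```

Passing `k=1` to `np.triu` would keep the diagonal visible; an explicit boolean mask also avoids relying on no upper-triangle correlation being exactly zero.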

### 2.7 Distribution of the survival status by family size

In this section, we will analyze the distribution of the survival status by family size in the train dataframe. The distribution of the survival status by family size in the train dataframe is:


In [None]:
# Show the distribution of the survival status by family size
train_df['FamilySize'] = train_df['SibSp'] + train_df['Parch'] + 1
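# Sort the bars by family size (not by frequency) so that the x-axis reads in numeric order
sns.barplot(x='FamilySize', y='Survived', data=train_df, palette='hls', errorbar=None,
            order=sorted(train_df['FamilySize'].unique()), estimator=np.mean)
plt.xlabel('Family Size (1 = Alone)')
plt.ylabel('Survival Rate')  # mean of Survived, a fraction between 0 and 1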
plt.title('Survival Rate by Family Size - Training Data')
plt.show()
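
The bar heights above are group means of `Survived`, so it can help to see them next to the number of passengers in each group, since a rate computed from very few passengers is unreliable. A minimal sketch on made-up rows (`toy_df` is hypothetical, not the competition data):

```python
import pandas as pd

# Made-up rows standing in for train_df.
toy_df = pd.DataFrame({
    "SibSp": [1, 0, 0, 3, 1],
    "Parch": [0, 0, 2, 1, 0],
    "Survived": [0, 1, 1, 0, 1],
})

# Same derived feature as in the notebook: siblings/spouses + parents/children + the passenger.
toy_df["FamilySize"] = toy_df["SibSp"] + toy_df["Parch"] + 1

# Survival rate and group size per family size, sorted by family size.
summary = toy_df.groupby("FamilySize")["Survived"].agg(
    survival_rate="mean", passengers="count"
)
print(summary)
```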

### 2.8 Distribution of the survival status by gender

In [None]:
# Show the distribution of the survival rate by each gender
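# Fix the bar order explicitly so that the tick labels set below match the bars
sns.barplot(x='Sex', y='Survived', data=train_df, palette='hls', errorbar=None,
            order=['female', 'male'])
plt.xlabel('Sex')
plt.ylabel('Survival Rate')
plt.title('Survival Rate by Gender')
plt.xticks([0, 1], ['Female', 'Male'])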
plt.show()

### 2.9 Distribution of the relation between the cabin location and the distance to lifeboats

The dataset contains no direct measurement of the distance from a cabin to the lifeboats. As a rough stand-in, we extract the deck letter from the `Cabin` column and plot the mean fare per deck. This is only a loose proxy for cabin position and should not be read as an actual distance. The mean fare by cabin location in the train dataframe is:

In [None]:
# Show the mean fare per cabin deck (a rough proxy only; the data contains no distance to the lifeboats)
train_df['Cabin_Location'] = train_df['Cabin'].str.extract('([A-Za-z])', expand=False)
estimated_distance = train_df.groupby('Cabin_Location')['Fare'].mean()

sns.barplot(x=estimated_distance.index, y=estimated_distance.values, palette='hls', errorbar=None)
plt.xlabel('Cabin Location (Deck Letter)')
plt.ylabel('Mean Fare (Proxy for Distance to Lifeboats)')
plt.title('Mean Fare by Cabin Location - Training Data')
plt.show()

## 3. Feature Engineering

In this step, we will perform the feature engineering in the train and test dataframes.

### 3.1 Select the data

In this section, we will select the data that we will use in the feature engineering. The data that we will use in the feature engineering is:

In [4]:
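# Select the features to be used for training the model
# (FamilySize and Cabin_Location only exist if the EDA cells above were run, hence errors='ignore')
train_df = train_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin', 'FamilySize', 'Cabin_Location'],
                         axis=1, errors='ignore')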

# Select the features to be used for testing the model
test_df = test_df.drop(['Name', 'Ticket', 'Cabin'], axis=1)

### 3.2 Missing Values

In this section, we will fill the missing values in the train and test dataframes. Numeric columns (`Age`, `Fare`) are filled with the median and the categorical column `Embarked` with the most frequent value:

In [5]:
# Fill the missing values in the training data
# (assign the result back instead of using inplace=True on a column, which newer pandas versions deprecate)
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].median())
train_df['Embarked'] = train_df['Embarked'].fillna(train_df['Embarked'].mode()[0])

# Fill the missing values in the test data
test_df['Age'] = test_df['Age'].fillna(test_df['Age'].median())
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].median())
test_df['Embarked'] = test_df['Embarked'].fillna(test_df['Embarked'].mode()[0])
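
The cell above fills each dataframe with its own statistics. A common alternative, sketched here as an assumption rather than as what this notebook does, is to compute the fill values on the training data only and reuse them for the test data, so that both sets are transformed identically. The frames `toy_train` and `toy_test` are made up.

```python
import numpy as np
import pandas as pd

# Made-up frames standing in for train_df and test_df.
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0],
                          "Fare": [7.0, 8.0, 50.0, 9.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 60.0],
                         "Fare": [np.nan, 10.0]})

# Learn the fill values from the training data only (NaN is skipped by median).
fill_values = toy_train[["Age", "Fare"]].median()

# fillna with a Series fills each column with the value under the matching label.
toy_train = toy_train.fillna(fill_values)
toy_test = toy_test.fillna(fill_values)
print(toy_test)
```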

### 3.3 Categorical Variables

In this section, we will encode the categorical variables `Sex` and `Embarked` as integers in the train and test dataframes, using the same mapping for both:

In [6]:
# Encode the categorical variables of the training data as integers
train_df['Sex'] = train_df['Sex'].map({'female': 0, 'male': 1}).astype(int)
train_df['Embarked'] = train_df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

# Encode the categorical variables of the test data with the same mapping
test_df['Sex'] = test_df['Sex'].map({'female': 0, 'male': 1}).astype(int)
test_df['Embarked'] = test_df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
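
Mapping `Embarked` to 0, 1, 2 implies an order among the ports that does not really exist. Tree-based models usually cope with this, but one-hot encoding is a common alternative. The sketch below is an illustration on made-up rows (`toy_df` is hypothetical), not the encoding used in this notebook.

```python
import pandas as pd

# Made-up rows standing in for train_df.
toy_df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"],
                       "Fare": [7.25, 71.28, 8.46, 8.05]})

# Fixing the category list keeps the output columns identical for train and test,
# even if one port happens to be absent from one of them.
toy_df["Embarked"] = pd.Categorical(toy_df["Embarked"], categories=["S", "C", "Q"])

# One indicator column per port instead of a single ordered integer.
encoded = pd.get_dummies(toy_df, columns=["Embarked"], dtype=int)
print(encoded)
```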

## 4. Modelling and Evaluation

In this step, we will perform the modelling and evaluation in the train and test dataframes.

In [7]:
# Select the features and the target variable for the training data
X = train_df.drop('Survived', axis=1)
y = train_df['Survived']

# Split the training data into training and validation data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### 4.1 Model Selection

The model selected for this problem is the Random Forest Classifier. A random forest is an ensemble of decision trees, each trained on a bootstrap sample of the data, whose predictions are combined by majority vote. Random forests can be used for both classification and regression problems; here we use the classification variant.

In [8]:
# Create the model and train it with the training data
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
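
The notebook imports `accuracy_score` and holds out 20% of the training data, so the fitted model can be scored on that held-out split. The sketch below shows the pattern on synthetic data (`X_toy`, `y_toy` and `toy_model` are made-up stand-ins) so that it runs on its own; in the notebook the equivalent call would be `accuracy_score(y_test, model.predict(X_test))`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic features and labels standing in for X and y.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(int)

# Hold out 20% of the rows for validation, as in the notebook.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

toy_model = RandomForestClassifier(n_estimators=100, random_state=42)
toy_model.fit(X_tr, y_tr)

# Fraction of held-out rows whose label is predicted correctly.
val_accuracy = accuracy_score(y_val, toy_model.predict(X_val))
print(f"Validation accuracy: {val_accuracy:.3f}")
```

The validation accuracy is only an estimate of how the model will score on the Kaggle test set, and it depends on the particular split.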

### 4.2 Predictions

In this section, we will use the trained model to predict the survival status of the passengers in the test dataframe. The predictions for the test dataframe are:

In [17]:
# Drop the passenger id column from the test data
test_data = test_df.drop('PassengerId', axis=1)

# Make predictions using the test data
predictions = model.predict(test_data)

print(predictions)

[0 0 0 1 1 0 0 0 1 0 0 0 1 0 1 1 0 1 0 1 1 0 1 0 1 0 1 1 1 0 0 0 1 0 1 1 0
 0 0 1 0 1 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 0 1 1 0 0 0
 1 0 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0
 1 1 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0
 1 0 1 0 0 1 0 0 1 0 1 1 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 1 1 0 1
 0 1 0 0 0 0 0 1 0 1 0 1 0 0 0 1 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0
 1 1 1 0 0 1 0 0 0 1 0 0 1 0 1 1 1 1 1 1 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0
 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 1 1 0 0 0 1 1
 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0
 0 1 1 1 1 0 0 1 0 0 0]


## 5. Submission

In this step, we will build the submission file from the passenger ids and the predictions for the test dataframe.

In [None]:
# Create a dataframe with the passenger id and the predictions

submission = pd.DataFrame({'PassengerId': test_df['PassengerId'], 'Survived': predictions})

# Save the dataframe as a csv file

submission.to_csv('working/submission.csv', index=False)
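
`to_csv` does not create missing directories, so writing to `working/submission.csv` fails if the `working` folder does not exist. The sketch below, using made-up ids and predictions and a temporary directory (all hypothetical), creates the folder first and then reloads the file to check that it has the two columns `PassengerId` and `Survived`.

```python
import os
import tempfile

import pandas as pd

# Made-up ids and predictions standing in for test_df['PassengerId'] and predictions.
toy_ids = [892, 893, 894]
toy_predictions = [0, 1, 0]
submission_toy = pd.DataFrame({"PassengerId": toy_ids, "Survived": toy_predictions})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create the output folder if needed; exist_ok avoids an error when it is already there.
    out_dir = os.path.join(tmp_dir, "working")
    os.makedirs(out_dir, exist_ok=True)

    out_path = os.path.join(out_dir, "submission.csv")
    submission_toy.to_csv(out_path, index=False)

    # Reload the file to verify its layout before uploading it.
    reloaded = pd.read_csv(out_path)

print(reloaded)
```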