<a href="https://www.kaggle.com/code/aletbm/titanic-eda-and-feature-engineering?scriptVersionId=143048057" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 🚢 Titanic - EDA, Data wrangling and Feature engineering
<img src=https://cdn.wallpapersafari.com/9/99/g7mtvV.jpg>

In [None]:
!pip install missingno
!pip install mplcyberpunk

# Loading packages

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, MinMaxScaler

from scipy.stats.mstats import winsorize

from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.feature_selection import SequentialFeatureSelector

from imblearn.over_sampling import SMOTE

import xgboost as xgb

import plotly.graph_objects as go
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns

import re
import missingno as msno

import mplcyberpunk
import random

plt.style.use("cyberpunk")

In [None]:
seed_value = 42
os.environ['PYTHONHASHSEED'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)

# About dataset

* **survival**: Survival (0 = No, 1 = Yes)
* **pclass**: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
* **sex**: Sex
* **Age**: Age in years
* **sibsp**: Number of siblings / spouses aboard the Titanic
* **parch**: Number of parents / children aboard the Titanic
* **ticket**: Ticket number
* **fare**: Passenger fare
* **cabin**: Cabin number
* **embarked**: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

## Variable Notes
**pclass**: A proxy for socio-economic status (SES)
> 1st = Upper
> 
> 2nd = Middle
> 
> 3rd = Lower

**age**: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

**sibsp**: The dataset defines family relations in this way...
> Sibling = brother, sister, stepbrother, stepsister
> 
> Spouse = husband, wife (mistresses and fiancés were ignored)

**parch**: The dataset defines family relations in this way...
> Parent = mother, father
> 
> Child = daughter, son, stepdaughter, stepson
> 
> Some children travelled only with a nanny, therefore parch=0 for them.

# Loading data

The pandas package helps us work with our datasets. We load the train and test sets as pandas DataFrames so that we can run operations on them.

In [None]:
train = pd.read_csv("../input/titanic/train.csv", index_col="PassengerId")
test = pd.read_csv("../input/titanic/test.csv", index_col="PassengerId")

df_all = pd.concat([train.drop(["Survived"], axis=1), test], axis=0)

Let's take a first look at the dataset.

In [None]:
df_all.info()

Both datasets contain object-type features, which must be processed later. Furthermore, some features contain several NaN values, a problem that we must solve.

In [None]:
df_all.head(10)

In [None]:
target = "Survived"

# Looking for NaN values and duplicate data

In [None]:
msno.matrix(df_all)

In [None]:
df_all.isna().sum()

In [None]:
print(f"NaN Cabin values represent {round(df_all.Cabin.isna().sum()*100/len(df_all), 3)}% of the set.")
print(f"NaN Age values represent {round(df_all.Age.isna().sum()*100/len(df_all), 3)}% of the set.")
print(f"NaN Embarked values represent {round(df_all.Embarked.isna().sum()*100/len(df_all), 3)}% of the set.")
print(f"NaN Fare values represent {round(df_all.Fare.isna().sum()*100/len(df_all), 3)}% of the set.")

## Filling NaN values

## Age

I scraped a web page to obtain an auxiliary dataset; my [notebook](https://colab.research.google.com/drive/1SIajxlnX5emncjESxMtsaxZvJMHdBOCM?authuser=1) shows how.

In [None]:
df_aux = pd.read_csv("/kaggle/input/my-titanic-extra-data/my_titanic_csv")
df_aux

### Searching for passengers by name

First, we need to give the names the same format in both datasets.

In [None]:
for i, name in enumerate(df_aux["Name"]):
    df_aux.loc[i, "Name"] = re.sub(r'[^\w\s-]', '', name).upper()
    
for i, name in enumerate(df_all["Name"]):
    df_all.loc[i+1, "Name_clean"] = re.sub(r'[^\w\s-]', '', name).upper()
df_all.Name_clean = df_all.Name_clean.str.split()
df_aux.head()
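
The cleaning step above can be tried on a single name in isolation. This is a minimal sketch using only the standard library; `normalize_name` is a hypothetical helper (it is not part of the notebook) that applies the same regular expression as the cell above and then splits the result into tokens.

```python
import re

def normalize_name(name):
    """Strip punctuation (hyphens are kept), upper-case, and split into tokens."""
    cleaned = re.sub(r"[^\w\s-]", "", name).upper()
    return cleaned.split()

# Parentheses, commas and periods disappear; the words themselves are kept.
tokens = normalize_name("Cumings, Mrs. John Bradley (Florence Briggs Thayer)")
print(tokens)
```

Working with tokens rather than whole strings is what lets the later matching step tolerate differences in word order between the two sources.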

In [None]:
df_all.head()

In [None]:
df_all["df_aux_index"] = np.nan
df_all["df_aux_index"] = df_all["df_aux_index"].astype('object')

for i, name in enumerate(df_all["Name_clean"]):
    sub_df = df_aux.copy()
    for x in name:
        rows = len(sub_df[sub_df["Name"].str.contains(x)])
        if rows == 1:
            df_all.loc[i+1, "df_aux_index"] = sub_df[sub_df["Name"].str.contains(x)].index.values
            break
        elif rows == 0:
            if len(sub_df) == 1:
                df_all.loc[i+1, "df_aux_index"] = sub_df.index.values
                break
        else:
            sub_df = sub_df[sub_df["Name"].str.contains(x)]
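
The loop above narrows the candidate rows one name token at a time until a single candidate is left. The sketch below reproduces that idea on plain Python lists. `match_by_tokens` and the sample names are hypothetical, and it uses plain substring tests, whereas `str.contains` in the notebook interprets each token as a regular expression.

```python
def match_by_tokens(tokens, candidates):
    """Return the position of the single candidate that survives the narrowing, or None."""
    remaining = dict(enumerate(candidates))
    for token in tokens:
        hits = {i: name for i, name in remaining.items() if token in name}
        if len(hits) == 1:
            # Exactly one candidate contains this token: accept it.
            return next(iter(hits))
        if len(hits) == 0:
            # No candidate contains the token; accept only if one was left anyway.
            if len(remaining) == 1:
                return next(iter(remaining))
            continue
        # Several candidates remain: keep narrowing with the next token.
        remaining = hits
    return None

aux_names = ["BRAUND MR OWEN HARRIS", "BRAUND MR LEWIS RICHARD", "HEIKKINEN MISS LAINA"]
print(match_by_tokens(["BRAUND", "MR", "OWEN", "HARRIS"], aux_names))
```

When the function returns `None`, the passenger has to be matched another way, which is the situation handled next with the ticket number.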

We can't find some passengers in the auxiliary dataset because their names are written differently in the original dataset. Let's see who they are:

In [None]:
df_all[(df_all["Age"].isna()) & (df_all["df_aux_index"].isna())]

### Searching for passengers by ticket
For example, [the Lefebvre family](https://www.encyclopedia-titanica.org/lefebvre.html):

In [None]:
ticket = "4133"
df_all[df_all["Ticket"].str.contains(ticket)]

In [None]:
df_aux[df_aux["Ticket"].str.contains(ticket)]

In this way, we locate the remaining passengers in the auxiliary dataset and record their indices by hand:

In [None]:
df_all.loc[20, "df_aux_index"] = 1606
df_all.loc[27, "df_aux_index"] = 2039
df_all.loc[37, "df_aux_index"] = 1430
df_all.loc[48, "df_aux_index"] = 653
df_all.loc[66, "df_aux_index"] = 1589
df_all.loc[177, "df_aux_index"] = 1288
df_all.loc[224, "df_aux_index"] = 1631
df_all.loc[242, "df_aux_index"] = 1604
df_all.loc[257, "df_aux_index"] = 2233
df_all.loc[261, "df_aux_index"] = 2107
df_all.loc[325, "df_aux_index"] = 1969
df_all.loc[352, "df_aux_index"] = 2397
df_all.loc[410, "df_aux_index"] = 1289
df_all.loc[412, "df_aux_index"] = 950
df_all.loc[496, "df_aux_index"] = 1081
df_all.loc[508, "df_aux_index"] = 281
df_all.loc[525, "df_aux_index"] = 20
df_all.loc[553, "df_aux_index"] = 1669
df_all.loc[558, "df_aux_index"] = 1911
df_all.loc[569, "df_aux_index"] = 547
df_all.loc[612, "df_aux_index"] = 1112
df_all.loc[698, "df_aux_index"] = 1595
df_all.loc[710, "df_aux_index"] = 1590
df_all.loc[826, "df_aux_index"] = 746
df_all.loc[840, "df_aux_index"] = 1389
df_all.loc[879, "df_aux_index"] = 161
df_all.loc[921, "df_aux_index"] = 1983
df_all.loc[980, "df_aux_index"] = 634
df_all.loc[1008, "df_aux_index"] = 2226
df_all.loc[1019, "df_aux_index"] = 1451
df_all.loc[1024, "df_aux_index"] = 1287
df_all.loc[1025, "df_aux_index"] = 2190
df_all.loc[1117, "df_aux_index"] = 1588
df_all.loc[1174, "df_aux_index"] = 738
df_all.loc[1178, "df_aux_index"] = 710
df_all.loc[1184, "df_aux_index"] = 1618
df_all.loc[1224, "df_aux_index"] = 2225
df_all.loc[1234, "df_aux_index"] = 1966

In [None]:
df_all.loc[df_all['Age'].isna(), "Age"] = df_all[df_all['Age'].isna()].merge(df_aux, left_on='df_aux_index', right_index=True)["Age_y"]
df_all.drop(["df_aux_index", "Name_clean"], axis=1, inplace=True)
df_all["Age"] = df_all["Age"].astype("float64")
msno.matrix(df_all)

Perfect! However, we still have some null values in the Age feature. We can fill these cases with the median age of passengers of the same sex and class.

In [None]:
# transform returns a result with the original index, so it aligns with df_all row by row
df_all["Age"] = df_all.groupby(['Sex', 'Pclass'])['Age'].transform(lambda x: x.fillna(x.median()))
msno.matrix(df_all)
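
The group-wise median fill can be checked on a toy frame. This is a minimal sketch with made-up values; the column `Age_filled` is only for illustration. It uses `groupby(...).transform`, which returns a result aligned with the original rows.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1],
    "Age": [20.0, 30.0, np.nan, 40.0, np.nan],
})

# Each missing age is replaced by the median age of its own (Sex, Pclass) group.
toy["Age_filled"] = (
    toy.groupby(["Sex", "Pclass"])["Age"]
       .transform(lambda s: s.fillna(s.median()))
)
print(toy)
```

The missing male third-class age becomes 25.0 (the median of 20 and 30), and the missing female first-class age becomes 40.0.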

## Embarked

In [None]:
df_all[df_all["Embarked"].isna()]

Both passengers have the same ticket number, which means that they embarked from the same port. [Rose Amélie Icard](https://www.encyclopedia-titanica.org/titanic-survivor/amelia-icard.html) embarked at Southampton (S) with her employer [Martha Evelyn Stone](https://www.encyclopedia-titanica.org/titanic-survivor/martha-evelyn-stone.html).

In [None]:
df_all.loc[df_all["Embarked"].isna(), "Embarked"] = "S"
msno.matrix(df_all)

## Fare

In [None]:
df_all[df_all["Fare"].isna()]

We don't know the fare that [Storey, Mr. Thomas](https://www.encyclopedia-titanica.org/titanic-victim/thomas-storey.html) paid, but we know that he traveled in 3rd class and embarked at Southampton. We therefore fill the missing fare with the median fare of passengers with the same sex, port and class:

In [None]:
# transform returns a result with the original index, so it aligns with df_all row by row
df_all["Fare"] = df_all.groupby(['Sex', 'Embarked', 'Pclass'])['Fare'].transform(lambda x: x.fillna(x.median()))
df_all[df_all["Name"] == "Storey, Mr. Thomas"]

In [None]:
msno.matrix(df_all)

## Cabin

In [None]:
df_all.drop(["Cabin"], axis=1, inplace=True)
train.drop(["Cabin"], axis=1, inplace=True)
test.drop(["Cabin"], axis=1, inplace=True)
msno.matrix(df_all)

# Duplicate data

In [None]:
train.duplicated().sum()

In [None]:
test.duplicated().sum()

In [None]:
train.loc[:, df_all.columns] = df_all.loc[train.index]
test.loc[:, df_all.columns] = df_all.loc[test.index]

# Data Visualization

## Percentage ratios:

#### Percentage of men and women in the training set:

In [None]:
sizes = train["Sex"].value_counts()
plt.figure(figsize=(15, 15))
_, _, autotexts = plt.pie(sizes, labels=train["Sex"].unique(), autopct='%1.2f%%')
for autotext in autotexts:
    autotext.set_color('black')
plt.title("Sex")
plt.show()

In [None]:
plt.figure(figsize=(20, 8))
sns.countplot(data=train, y="Sex", hue="Survived")

The "Sex" feature seems to be a determining factor in predicting whether or not an individual survives: women had a markedly greater probability of surviving than men. This is expected, because women and children were given priority during the sinking.

#### Percentage of survivors by passenger class in the training set:

In [None]:
plt.figure(figsize=(20, 8))
sns.countplot(data=train, y="Pclass", hue="Survived")

First-class passengers clearly had a higher probability of surviving than passengers in the other classes, so this feature will be important for the prediction.

#### Percentage of survivors by port of embarkation in the dataset:

In [None]:
plt.figure(figsize=(20, 8))
sns.countplot(data=train, y="Embarked", hue="Survived")

The "Embarked" feature also helps to determine whether or not an individual survives: passengers who embarked at Southampton were less likely to survive.

#### Percentages for accompanied passengers or lonely passengers:

In [None]:
pd.DataFrame({"Amount":train['SibSp'].value_counts(),
                "%":train['SibSp'].value_counts()*100/train.shape[0]}, index=train['SibSp'].value_counts().keys())

We can see that more than half of the passengers boarded without siblings or a spouse. The next largest group boarded with exactly one sibling or spouse.

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(data=train, y="SibSp", hue="Survived")

The group of people who boarded with one or two siblings/spouses had a higher percentage of survivors than the group who boarded with none. At first sight, boarding with a companion seems important for predicting whether or not an individual survives. However, among passengers who boarded with more than two siblings/spouses the percentage of survivors is lower, and in some cases it is zero.

In [None]:
pd.DataFrame({"Amount":train['Parch'].value_counts(),
                "%":train['Parch'].value_counts()*100/train.shape[0]}, index=train['Parch'].value_counts().keys())

Something similar can be observed for "Parch": passengers who boarded without parents or children make up more than 50% of the passengers.

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(data=train, y="Parch", hue="Survived")

The pattern is similar to the one observed for the "SibSp" feature. In short, there is a greater chance of survival if the individual doesn't board the ship alone. This calls for a new feature that captures this situation, and perhaps we can then leave out the "SibSp" and "Parch" features.

#### Percentage of survivors

In [None]:
sizes = train["Survived"].value_counts()
plt.figure(figsize=(15, 15))
_, _, autotexts = plt.pie(sizes, labels=["Non-survivors", "Survivors"], autopct='%1.2f%%')
for autotext in autotexts:
    autotext.set_color('black')
plt.title("Survived")
plt.show()

The dataset is imbalanced: there are more non-survivors than survivors.

#### Features by condition of survival

In [None]:
colors = [
    '#08F7FE',  # teal/cyan
    '#FE53BB',  # pink
    '#F5D300',  # yellow
    '#00ff41', # matrix green
    '#1E22AA',
]

In [None]:
plt.subplots_adjust(hspace=0.2)

fig,axs = plt.subplots(4,2, figsize = (20,18))
i=1
for feature in train.columns:
    if feature not in ["PassengerId", "Survived", "Name", "Ticket", "Cabin"]:
        plt.subplot(4,2,i)
        sns.histplot(data=train, x=feature, kde=True, hue='Survived')
        i+=1

Important observations:
+ The highest number of deaths corresponds to the third-class passengers; the second-class passengers show a more balanced picture, and the first-class passengers have the highest number of survivors.
+ We also confirm that the highest number of deaths was among male passengers.
+ A large majority of the passengers were between 20 and 30 years of age. In addition, for passengers under 16 years of age the curve of survivors tends to lie above the curve of deaths.
+ The highest number of deaths corresponds to passengers travelling alone.
+ A large majority of the passengers who did not survive paid a fare of less than 50. We also see atypical cases where some passengers paid a fare greater than 200.
+ A large part of the passengers boarded the Titanic at the port of Southampton, and more than half of them did not survive. Of the passengers who boarded at the port of Cherbourg, more than half survived. We will see later why this is so.

#### Features by condition of sex

In [None]:
fig,axs = plt.subplots(4,2, figsize = (20,18))
i=1
for feature in train.columns:
    if feature not in ["Sex", "Name", "Ticket", "Cabin"]:
        plt.subplot(4,2,i)
        sns.histplot(data=train, x=feature, kde=True, hue='Sex')
        i+=1

Important observations:
+ As already mentioned, most of the survivors were women; less than half of the survivors were men.
+ In every passenger class most of the deaths were men, mostly male passengers of the third class.
+ The modal age of men was around 25 to 30 years, while the modal age of women was around 20 to 25 years.
+ The percentage of men travelling alone was higher than the percentage of women travelling alone.
+ As for fares, the majority of both men and women paid less than 50.
+ The majority of both men and women boarded at the port of Southampton.

#### Features by condition of passenger class 

In [None]:
fig,axs = plt.subplots(4,2, figsize = (20,18))
i=1
for feature in train.columns:
    if feature not in ["Pclass", "Name", "Ticket", "Cabin"]:
        plt.subplot(4,2,i)
        sns.histplot(data=train, x=feature, kde=True, hue='Pclass',palette=colors[:3])
        i+=1

Important observations:
+ As already mentioned, most of the deaths correspond to the third-class passengers, yet they also account for a large number of survivors; this is probably due to the disproportionate number of passengers in the third class.
+ The largest group was male third-class passengers.
+ The typical age of the passengers increases with the class of the ticket: the mode was between 20 and 25 years for third-class passengers, between 25 and 35 years for second-class passengers, and between 35 and 40 years for first-class passengers. This suggests a relation between age and economic status.
+ Most of the men travelling alone were in the third class.
+ As expected, third-class passengers generally paid lower fares than second- and first-class passengers, although in some cases a third-class passenger paid more than a second- or first-class passenger.
+ Finally, more than half of the passengers who boarded at the port of Cherbourg survived; a likely reason is that a large share of those who boarded there were first-class passengers.

In [None]:
plt.figure(figsize=(20,10))
sns.scatterplot(data=train, x="Age", y="Fare", hue="Survived")

# Detecting outliers

In [None]:
train.describe().T

In [None]:
fig,axs = plt.subplots(2,1, figsize = (20,10))
i=1
for feature in ["Fare", "Age"]:
    plt.subplot(2,1,i)
    sns.boxplot(data=train, x=feature)
    i+=1

The boxplots clearly show the outliers in the Fare and Age features.

In [None]:
def detect_outliers(df, columns, method="IQR", llimit=0.05, ulimit=0.05):
    df_ = df.copy()
    for col in columns:
        if method == "IQR":
            Q1 = df_[col].quantile(0.25)
            Q3 = df_[col].quantile(0.75)
            IQR = Q3 - Q1
            interval = ((df_[col] < Q1 - 1.5*IQR) | (df_[col] > Q3 + 1.5*IQR))
            df_.loc[interval, col] = df_.loc[~interval, col].mean()
        if method == "WIN":
            df_[col] = winsorize(df_[col], limits = [llimit,ulimit])
    return df_

train = detect_outliers(train, ["Age", "Fare"], "WIN")

fig,axs = plt.subplots(2,1, figsize = (20,5))
i=1
for feature in ["Fare", "Age"]:
    plt.subplot(2,1,i)
    sns.boxplot(data=train, x=feature)
    i+=1
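
The `"IQR"` branch of `detect_outliers` flags values that fall outside the fences Q1 - 1.5·IQR and Q3 + 1.5·IQR. The sketch below computes those fences with the standard library only; `iqr_bounds` and the sample values are hypothetical. It assumes the `"inclusive"` method of `statistics.quantiles`, which interpolates linearly between data points as the default of pandas' `quantile` does.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

fares = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # made-up values with one extreme fare
low, high = iqr_bounds(fares)
outliers = [v for v in fares if v < low or v > high]
print(low, high, outliers)
```

Here Q1 = 3 and Q3 = 7, so the fences are -3 and 13 and only the value 100 is flagged. The `"WIN"` branch used in the notebook instead replaces the most extreme values at each tail with less extreme ones taken from the data itself.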

In [None]:
train = detect_outliers(train, ["Fare"], "WIN", ulimit=0.13)

plt.figure(figsize = (20,3))
sns.boxplot(data=train, x="Fare")  # plot the re-winsorized column, not the leftover loop variable

# Feature Engineering

As explained [here](https://triangleinequality.wordpress.com/2013/09/08/basic-feature-engineering-with-the-titanic-data/), the title contained in the name is an important feature for the prediction, so we'll process this information to obtain a numerical feature. The raw "Name" feature is also difficult to correlate with the output label, since each row represents a different category, which is not useful.

That blog post gives us an idea for processing the names: it basically consists of looking at the title that each person holds. Let's see which ones exist:

## Working on the "Name" feature

In [None]:
df_all['Title'] = df_all["Name"].str.extract(r' ([A-Za-z]+)\.', expand=False)
pd.DataFrame({"Amount":df_all['Title'].value_counts(),
                "%":df_all['Title'].value_counts()*100/df_all.shape[0]}, index=df_all['Title'].value_counts().keys())

We are interested in the four most frequent titles: Mr, Miss, Mrs and Master. We will group the rest into two categories called "Military" and "Other". Later we will assign a number to each category to convert this feature into a numerical one.

Keep in mind that "Mlle" stands for "Mademoiselle", the French equivalent of "Miss"; likewise, Mme corresponds to Mrs and Ms to Miss.
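
The title extraction can be illustrated on individual names with the standard library. This is a minimal sketch; `extract_title` is a hypothetical helper that applies the same pattern as the `str.extract` call above and returns the first match.

```python
import re

TITLE_PATTERN = re.compile(r" ([A-Za-z]+)\.")

def extract_title(name):
    """Return the first word that follows a space and ends with a period, or None."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None

print(extract_title("Braund, Mr. Owen Harris"))
print(extract_title("Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)"))
```

The pattern relies on the dataset's "Surname, Title. Given names" layout; a name without a period after the title yields `None`.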

In [None]:
# Collapse rare titles into broader groups. The result is assigned back because
# calling replace(..., inplace=True) on a selected column is unreliable in recent pandas.
title_groups = {
    "Dr": "Other", "Countess": "Other", "Sir": "Other", "Don": "Other", "Jonkheer": "Other",
    "Rev": "Military", "Major": "Military", "Col": "Military", "Capt": "Military",
    "Ms": "Miss", "Lady": "Miss", "Mlle": "Miss",
    "Mme": "Mrs", "Dona": "Mrs",
}
df_all["Title"] = df_all["Title"].replace(title_groups)
df_all['Title'].value_counts()

In [None]:
plt.figure(figsize=(20, 5))
sns.countplot(data=df_all, y="Title", order=df_all["Title"].value_counts().sort_values(ascending=False).index)

Thus, we can now leave out the "Name" feature.

## Working on the "Ticket" feature

In [None]:
df_all["Ticket_number"] = np.nan
df_all["Ticket_split"] = df_all["Ticket"].str.split()
for i, ticket in enumerate(df_all["Ticket_split"]):
    row = ticket.copy()
    for x in row:
        if not re.match(r'^([\s\d]+)$', x):
            ticket.remove(x)
    if len(ticket) > 0:
        df_all.loc[i+1, "Ticket_number"] = max(ticket, key=len)
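
The rule applied by the loop above is: split the ticket on whitespace, keep the tokens made only of digits, and take the longest one. The sketch below shows that rule on single strings; `ticket_number` is a hypothetical helper, and unlike the notebook (which stores the token first and casts the column later) it converts to `int` immediately.

```python
import re

def ticket_number(ticket):
    """Return the longest all-digit token of a ticket string as an int, or None."""
    digit_tokens = [tok for tok in ticket.split() if re.fullmatch(r"\d+", tok)]
    if not digit_tokens:
        return None  # tickets such as "LINE" carry no number
    return int(max(digit_tokens, key=len))

for raw in ["A/5 21171", "PC 17599", "113803", "LINE"]:
    print(raw, "->", ticket_number(raw))
```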

In [None]:
df_all[df_all["Ticket_number"].isna()]

We will assign the number 0 to the LINE tickets.

In [None]:
df_all.loc[df_all["Ticket_number"].isna(), "Ticket_number"] = 0

In [None]:
top_ticket = df_all["Ticket_number"].value_counts().sort_values(ascending=False).iloc[0:10]

plt.figure(figsize=(20, 7))
sns.countplot(data=df_all[df_all["Ticket_number"].isin(top_ticket.index)], y="Ticket_number", order=top_ticket.index)

In [None]:
df_all["Ticket_number"] = df_all["Ticket_number"].astype("int64")

## The same ticket is shared by several passengers. Did they travel with family and friends?

In [None]:
# Number of passengers sharing each ticket number, computed per row (group size)
df_all["Family"] = df_all.groupby("Ticket_number")["Ticket_number"].transform("count")
df_all.head()

In [None]:
plt.figure(figsize=(20, 7))
sns.countplot(data=df_all, y="Family")

## Dropping features

In [None]:
df_all.drop(columns=["Ticket", "Name", "Ticket_split"], axis=1, inplace=True)
df_all.head()

#### Is the passenger alone?

In [None]:
df_all["Alone"] = 0
df_all["Alone"] = df_all["Alone"].astype("uint8")
df_all.loc[df_all["Family"] == 1, "Alone"] = 1
print(f"The {round(df_all['Alone'].sum()*100/len(df_all), 3)}% of passengers were alone.")

In [None]:
train_ = df_all.loc[train.index]
train_["Survived"] = train["Survived"]

In [None]:
plt.figure(figsize=(20, 5))
sns.countplot(data=train_, y="Alone", hue="Survived")

It appears that traveling alone affects the passenger's chance of survival.

#### Combining age and passenger classes

In [None]:
df_all["Age"] = np.floor(df_all["Age"]).astype('int32')
df_all["AgexPclass"] = df_all["Age"] * df_all["Pclass"]
df_all.head()

#### Features by age of the passengers:

In [None]:
df_all["AgeBand"] = 1
df_all.loc[ df_all['Age'] <= 16, 'AgeBand'] = 1
df_all.loc[(df_all['Age'] > 16) & (df_all['Age'] <= 32), 'AgeBand'] = 2
df_all.loc[(df_all['Age'] > 32) & (df_all['Age'] <= 48), 'AgeBand'] = 3
df_all.loc[(df_all['Age'] > 48) & (df_all['Age'] <= 64), 'AgeBand'] = 4
df_all.loc[ df_all['Age'] > 64, 'AgeBand'] = 5
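
The five age bands above are defined by four upper limits that are inclusive on the right. As a compact cross-check of those limits, the same rule can be written with a sorted list and `bisect`; `age_band` is a hypothetical helper, and pandas offers `pd.cut` for the same kind of binning on whole columns.

```python
from bisect import bisect_left

AGE_EDGES = [16, 32, 48, 64]  # inclusive upper limits of bands 1 to 4

def age_band(age):
    """Map an age to a band from 1 to 5 using right-inclusive limits."""
    return bisect_left(AGE_EDGES, age) + 1

print([age_band(a) for a in (0, 16, 17, 32, 48, 49, 64, 65)])
```

Ages exactly on a limit (16, 32, 48, 64) stay in the lower band, matching the `<=` comparisons in the cell above.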

In [None]:
fig,axs = plt.subplots(4,2, figsize = (20,18))
i=1
for feature in df_all.columns:
    if feature not in ["Title", "Alone", "AgexPclass", "Age", "AgeBand"]:
        ax = plt.subplot(4,2,i)
        sns.histplot(data=df_all, x=feature, kde=True, hue='AgeBand', palette=colors)
        i += 1

#### Features by price of the fare:
I would like to see how the other features relate to the fare. To do so, I will create a new feature that reduces the number of distinct values that "Fare" contains, using the quartiles as the limits between groups.

In [None]:
low_cost = df_all["Fare"].describe()["25%"]
medium_cost = df_all["Fare"].describe()["50%"]
expensive = df_all["Fare"].describe()["75%"]

df_all["FareGroup"] = "Low"
df_all.loc[(df_all["Fare"] > low_cost) & (df_all["Fare"] <= medium_cost), "FareGroup"] = "Standard"
df_all.loc[(df_all["Fare"] > medium_cost) & (df_all["Fare"] <= expensive), "FareGroup"] = "Expensive"
df_all.loc[df_all["Fare"] > expensive, "FareGroup"] = "Too Expensive"
df_all.head()
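
The quartile-based grouping can be reproduced on a short list to see where each limit falls. This is a minimal sketch with made-up fares; `fare_groups` is a hypothetical helper, and it assumes the `"inclusive"` method of `statistics.quantiles`, which interpolates linearly as the quartiles reported by `describe()` do.

```python
import statistics
from bisect import bisect_left

FARE_LABELS = ["Low", "Standard", "Expensive", "Too Expensive"]

def fare_groups(fares):
    """Label each fare by the quartile interval it falls in (right-inclusive)."""
    cuts = statistics.quantiles(fares, n=4, method="inclusive")  # [Q1, Q2, Q3]
    return [FARE_LABELS[bisect_left(cuts, fare)] for fare in fares]

sample_fares = [5, 7, 8, 10, 14, 20, 30, 50, 80]  # made-up values
groups = fare_groups(sample_fares)
print(list(zip(sample_fares, groups)))
```

With these values the quartiles are 8, 14 and 30, so a fare equal to a quartile stays in the lower group, as with the `<=` comparisons in the notebook.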

In [None]:
fig,axs = plt.subplots(4,2, figsize = (20,18))
i=1
for feature in df_all.columns:
    if feature not in ["Fare", "FareGroup", "AgeBand", "Title", "Alone", "AgexPclass"]:
        plt.subplot(4,2,i)
        sns.histplot(data=df_all, x=feature, kde=True, hue='FareGroup', palette=colors[:4])
        i+=1

Several observations are possible, but the most important for me is that passengers who paid higher fares had a better chance of survival than those who paid lower fares.

# Encoding Features

## Converting FareGroup from categorical to numerical:

In [None]:
fare_map = {"Low":1, "Standard":2, "Expensive":3, "Too Expensive":4}
df_all["FareGroup"] = df_all["FareGroup"].map(fare_map)
df_all[["FareGroup"]].head()

## Converting Sex from categorical to numerical:

In [None]:
df_all["Sex"] = df_all["Sex"].map({"male":0, "female":1})
df_all[["Sex"]].head()

## Converting Embarked from categorical to numerical:

In [None]:
df_all["Embarked"] = df_all["Embarked"].map({'C':0, 'Q':1, 'S':2})
df_all[["Embarked"]].head()

## Converting Title from categorical to numerical:

In [None]:
title_map = {"Mr":1, "Miss":2, "Mrs":3, "Master":4, "Military":5, "Other":0}
df_all["Title"] = df_all["Title"].map(title_map)
df_all[["Title"]].head()

In [None]:
train_ = df_all.loc[train.index, :]
train_["Survived"] = train["Survived"]
train = train_.copy()

test = df_all.loc[test.index, :].copy()

# Correlations

In [None]:
plt.figure(figsize=(15,15))
sns.heatmap(train.corr(), annot=True ,cmap="BuPu", vmin=-1, vmax=1);

## Dealing with multicollinearity

In [None]:
def detect_VIF(df):
    df_ = df.copy()
    df_.drop(["Survived"], axis=1, inplace=True)
    df_['intercept'] = 1
    with np.errstate(divide='ignore'):
        while(True):
            df_vif = pd.DataFrame(columns=["Features", "VIF"])
            df_vif["Features"] = df_.columns
            df_vif["VIF"] = [variance_inflation_factor(df_.values, i) for i in range(len(df_.columns))]
            df_vif = df_vif[df_vif["Features"] != "intercept"].sort_values("VIF", ascending=False)
            if df_vif.iloc[0]["VIF"] > 5:
                df_.drop([df_vif.iloc[0]["Features"]], axis=1, inplace=True)
            else:
                break
    df_.drop(["intercept"], axis=1, inplace=True)
    return df[df_.columns.tolist() + ["Survived"]], df_vif
        
train_clean, df_vif = detect_VIF(train)
df_vif
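
For reference, the variance inflation factor of feature $j$ is commonly defined from the coefficient of determination $R_j^2$ obtained by regressing that feature on all the other features:

$VIF_j = \frac{1}{1 - R_j^2}$

Under this definition, the threshold of 5 used in `detect_VIF` corresponds to $R_j^2 = 0.8$, that is, a feature is dropped when the remaining features explain more than 80% of its variance.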

In [None]:
plt.figure(figsize=(15, 15))
sns.heatmap(train_clean.corr(), annot=True ,cmap="BuPu", vmin=-1, vmax=1);

# Splitting into features and labels

In [None]:
target = "Survived"
continuos_col = train.select_dtypes(['float64']).columns.tolist()
discretes_col = train.select_dtypes(['int64', 'int32']).columns.tolist()
discretes_col.remove(target)

In [None]:
X = train.drop([target], axis=1)
y = train[target].copy()

# Dealing with imbalanced data

In [None]:
sm = SMOTE(k_neighbors=8,random_state=42)
X, y = sm.fit_resample(X, y)
y.value_counts()

# Splitting into training, test and validation set

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.30, random_state=seed_value, stratify=y)
X_valid, X_test, y_valid, y_test = train_test_split(X_valid, y_valid, test_size=0.30, random_state=seed_value, stratify=y_valid)
test_ = test.copy()

print(f'Amount of data in training dataset: {X_train.shape[0]}')
print(f'Amount of data in validation dataset: {X_valid.shape[0]}')
print(f'Amount of data in test dataset: {X_test.shape[0]}')

# Standardization and Normalization

In [None]:
std = StandardScaler()
mms = MinMaxScaler()

X_train[continuos_col+discretes_col] = std.fit_transform(X_train[continuos_col+discretes_col], y=y_train)
X_train[continuos_col+discretes_col] = mms.fit_transform(X_train[continuos_col+discretes_col], y=y_train)

X_valid[continuos_col+discretes_col] = std.transform(X_valid[continuos_col+discretes_col])
X_valid[continuos_col+discretes_col] = mms.transform(X_valid[continuos_col+discretes_col])

X_test[continuos_col+discretes_col] = std.transform(X_test[continuos_col+discretes_col])
X_test[continuos_col+discretes_col] = mms.transform(X_test[continuos_col+discretes_col])

test_[continuos_col+discretes_col] = std.transform(test_[continuos_col+discretes_col])
test_[continuos_col+discretes_col] = mms.transform(test_[continuos_col+discretes_col])
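
The cell above chains standardization and min-max scaling. The sketch below, written with the standard library on made-up ages, illustrates what each step does; `standardize` and `min_max` are hypothetical helpers that follow the usual formulas (the population standard deviation is assumed for standardization).

```python
import statistics

def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def min_max(values):
    """Rescale linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [2.0, 22.0, 38.0, 54.0, 80.0]  # made-up values
direct = min_max(ages)
chained = min_max(standardize(ages))
print(direct)
print(chained)
```

Because standardization is only a shift followed by a positive rescaling, the chained result on the fitted data equals plain min-max scaling up to floating-point error, so the final features lie in the range 0 to 1 either way.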

In [None]:
X_train

# Training Models

We will train several supervised learning models for classification and use grid search to tune each model's hyperparameters. We will evaluate their performance with a confusion matrix, where:

<table>
    <tr>
        <th colspan="2" rowspan="2"></th>
        <th colspan="2">Predicted</th>
    </tr>
    <tr>
        <td>Negative</td>
        <td>Positive</td>
    </tr>
    <tr>
        <th rowspan="2">Actual</th>
        <td>Negative</td>
        <td>TN</td>
        <td>FP</td>
    </tr>
    <tr>
        <td>Positive</td>
        <td>FN</td>
        <td>TP</td>
    </tr>
</table>

Taking into account that:
+ Case negative: The passenger did not survive
+ Case positive: The passenger survived
+ TN: The prediction tells us that the passenger did not survive when actually the passenger did not survive.
+ TP: The prediction tells us that the passenger survived when the passenger actually survived.
+ FN: The prediction tells us that the passenger did not survive when actually the passenger survived.
+ FP: The prediction tells us that the passenger survived when actually the passenger did not survive.

The worst case is an FN prediction, since we would be declaring that the passenger did not survive and thereby overlooking a living passenger. For this reason, we will focus on reducing this type of prediction.
However, the number of FP predictions shouldn't be too large either, or our model will be of little use.
The score that helps us analyze the number of FN predictions is the recall, where:

$Recall = \frac{TP}{TP + FN}\quad\text{if}\quad FN \rightarrow 0 \Longrightarrow Recall \rightarrow 1$

Also:

$Precision = \frac{TP}{TP + FP}\quad\text{if}\quad FP \rightarrow 0 \Longrightarrow Precision \rightarrow 1$

and

$F1 = \frac{TP}{TP + \frac{FN + FP}{2}}\quad\text{if}\quad FN, FP \rightarrow 0 \Longrightarrow F1 \rightarrow 1$
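
The three formulas can be checked by hand on a small example. This is a minimal sketch in plain Python with made-up labels; `confusion_counts` and `scores` are hypothetical helpers that follow the definitions above, with 1 meaning "survived" and 0 meaning "did not survive".

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN and TP for binary labels (1 = positive, 0 = negative)."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

def scores(tn, fp, fn, tp):
    """Return recall, precision and F1 as defined by the formulas above."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = tp / (tp + (fn + fp) / 2)
    return recall, precision, f1

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # made-up labels
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
recall, precision, f1 = scores(*counts)
print(counts, recall, precision, f1)
```

In this example one survivor is missed (FN = 1) and one non-survivor is predicted as a survivor (FP = 1), so recall, precision and F1 are all 3/4.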

#### Logistic Regression

In [None]:
lr = LogisticRegression(random_state=seed_value)

param_grid = {"max_iter": [100, 150, 200],
             "C": np.linspace(1, 10, 10),
             "tol":np.linspace(5e-4, 0.005, 11)}

grid = GridSearchCV(lr, param_grid, cv=10, scoring=["f1", "recall"], refit="f1")
grid.fit(X_valid, y_valid)
grid.best_params_

In [None]:
lr_best = grid.best_estimator_
lr_best.fit(X_train, y_train)

In [None]:
def my_cm(y_test, y_pred):
    plt.figure(figsize=(10, 10))
    cm_val = confusion_matrix(y_test, y_pred)
    cm_pgs = np.round(confusion_matrix(y_test, y_pred, normalize='true')*100, 4)

    formatted_text = (np.asarray([f"{pgs}%\n({val})" for val, pgs in zip(cm_val.flatten(), cm_pgs.flatten())])).reshape(2, 2)

    sns.heatmap(cm_pgs, annot=formatted_text, fmt='', cmap='BuPu')
    plt.title("Confusion matrix")
    plt.xlabel("Prediction")
    plt.ylabel("Actual")
    return

y_pred_lr = lr_best.predict(X_test)
my_cm(y_test, y_pred_lr)

In [None]:
pd.DataFrame(data=classification_report(y_test, y_pred_lr, digits=6, output_dict=True)).transpose()

#### Support Vector Classification

In [None]:
svc = SVC(random_state=seed_value)

param_grid = {"max_iter": [300, 350, 400],
              "kernel": ["linear", "poly", "rbf"],
              "tol": np.linspace(1e-1, 1, 11)}

grid = GridSearchCV(svc, param_grid, cv=10, scoring=["f1", "recall"], refit="f1")
grid.fit(X_valid, y_valid)
grid.best_params_

In [None]:
svc_best = grid.best_estimator_
svc_best.fit(X_train, y_train)

In [None]:
y_pred_svc = svc_best.predict(X_test)
my_cm(y_test, y_pred_svc)

In [None]:
pd.DataFrame(data=classification_report(y_test, y_pred_svc, digits=6, output_dict=True)).transpose()

#### K-Nearest Neighbors

In [None]:
knn = KNeighborsClassifier()

param_grid = {"n_neighbors": [5, 6, 7, 8, 10],
              # leaf_size must be an integer, so build the grid with an integer dtype
              "leaf_size": np.linspace(30, 40, 11, dtype="int32"),
              "weights": ["uniform", "distance"]}

grid = GridSearchCV(knn, param_grid, cv=10, scoring=["f1", "recall"], refit="f1")
grid.fit(X_valid, y_valid)
grid.best_params_

In [None]:
knn_best = grid.best_estimator_
knn_best.fit(X_train, y_train)

In [None]:
y_pred_knn = knn_best.predict(X_test)
my_cm(y_test, y_pred_knn)

In [None]:
pd.DataFrame(data=classification_report(y_test, y_pred_knn, digits=6, output_dict=True)).transpose()

#### Decision Tree

In [None]:
dt = DecisionTreeClassifier(random_state=seed_value)

param_grid = {"criterion":["gini", "entropy"],
              "max_leaf_nodes": np.linspace(2, 11, 10, dtype="int32"),
              "max_depth": np.linspace(1, 10, 10, dtype="int32")}

grid = GridSearchCV(dt, param_grid, cv=10, scoring=["f1", "recall"], refit="f1")
grid.fit(X_valid, y_valid)
grid.best_params_

In [None]:
dt_best = grid.best_estimator_
dt_best.fit(X_train, y_train)

In [None]:
y_pred_dt = dt_best.predict(X_test)
my_cm(y_test, y_pred_dt)

In [None]:
pd.DataFrame(data=classification_report(y_test, y_pred_dt, digits=6, output_dict=True)).transpose()

#### Random Forest

In [None]:
rf = RandomForestClassifier(random_state=seed_value)

param_grid = {"n_estimators":[250, 300, 350, 400],
              "criterion":["gini", "entropy"]}

grid = GridSearchCV(rf, param_grid, cv=10, scoring=["f1", "recall"], refit="f1")
grid.fit(X_valid, y_valid)
grid.best_params_

In [None]:
rf_best = grid.best_estimator_
rf_best.fit(X_train, y_train)

In [None]:
y_pred_rf = rf_best.predict(X_test)
my_cm(y_test, y_pred_rf)

In [None]:
pd.DataFrame(data=classification_report(y_test, y_pred_rf, digits=6, output_dict=True)).transpose()

#### XGBoost Classifier

In [None]:
xgb_model = xgb.XGBClassifier()

params = {
    "gamma": np.linspace(0.5, 1, 5),
    "learning_rate": np.linspace(0.01, 0.1, 10),
    "n_estimators": [100, 200, 300]
}

grid = GridSearchCV(xgb_model, param_grid=params, cv=5, scoring=["f1", "recall"], refit="f1")

grid.fit(X_valid, y_valid)
grid.best_params_

In [None]:
xgb_best = grid.best_estimator_
xgb_best.fit(X_train, y_train)

In [None]:
y_pred_xgb = xgb_best.predict(X_test)
my_cm(y_test, y_pred_xgb)

In [None]:
pd.DataFrame(data=classification_report(y_test, y_pred_xgb, digits=6, output_dict=True)).transpose()

#### CatBoost Classifier

In [None]:
from catboost import CatBoostClassifier

catboost_model = CatBoostClassifier(
    iterations=200, 
    learning_rate=0.1,
)
catboost_model.fit(X_train, y_train)

In [None]:
y_pred_cat = catboost_model.predict(X_test)
my_cm(y_test, y_pred_cat)

In [None]:
pd.DataFrame(data=classification_report(y_test, y_pred_cat, digits=6, output_dict=True)).transpose()

# 🤔 Model comparison

## With the training set

In [None]:
def print_scores(y, y_pred, pp_scores=False):
    # All scores are returned as percentages.
    # Note: precision and recall are macro-averaged, while F1 is weighted by class support.
    ac = accuracy_score(y, y_pred) * 100
    pr = precision_score(y, y_pred, average='macro') * 100
    rc = recall_score(y, y_pred, average='macro') * 100
    f1 = f1_score(y, y_pred, average='weighted') * 100
    if pp_scores:
        print(f"Accuracy:{ac}")
        print(f"Precision:{pr}")
        print(f"Recall:{rc}")
        print(f"F1-score:{f1}")
    return {'Accuracy': ac, 'Precision': pr, 'Recall': rc, 'F1-score': f1}

y_train_lr = lr_best.predict(X_train)
y_train_svc = svc_best.predict(X_train)
y_train_dt = dt_best.predict(X_train)
y_train_knn = knn_best.predict(X_train)
y_train_rf = rf_best.predict(X_train)
y_train_xgb = xgb_best.predict(X_train)
y_train_cat = catboost_model.predict(X_train)

lr_scores = print_scores(y_train, y_train_lr)
svc_scores = print_scores(y_train, y_train_svc)
dt_scores = print_scores(y_train, y_train_dt)
knn_scores = print_scores(y_train, y_train_knn)
rf_scores = print_scores(y_train, y_train_rf)
xgb_scores = print_scores(y_train, y_train_xgb)
cat_scores = print_scores(y_train, y_train_cat)

scores = pd.DataFrame(data=[list(lr_scores.values()),
                            list(svc_scores.values()),
                            list(dt_scores.values()),
                            list(knn_scores.values()),
                            list(rf_scores.values()),
                            list(xgb_scores.values()),
                            list(cat_scores.values()),
                           ], columns=list(lr_scores.keys()))

scores = scores.transpose()
# Column names must follow the row order used above: lr, svc, dt, knn, rf, xgb, cat
scores = scores.rename(columns={0:"Logistic Regression",
                                1:"Support Vector Machine",
                                2:"Decision Tree",
                                3:"K-Nearest Neighbors",
                                4:"Random Forest",
                                5:"XGBoost",
                                6:"CatBoost"
                               })
scores.style.highlight_min(color = 'red', axis = 1).highlight_max(color = 'green', axis = 1)

## With the testing set

In [None]:
lr_scores = print_scores(y_test, y_pred_lr)
svc_scores = print_scores(y_test, y_pred_svc)
dt_scores = print_scores(y_test, y_pred_dt)
knn_scores = print_scores(y_test, y_pred_knn)
rf_scores = print_scores(y_test, y_pred_rf)
xgb_scores = print_scores(y_test, y_pred_xgb)
cat_scores = print_scores(y_test, y_pred_cat)

scores = pd.DataFrame(data=[list(lr_scores.values()),
                            list(svc_scores.values()),
                            list(dt_scores.values()),
                            list(knn_scores.values()),
                            list(rf_scores.values()),
                            list(xgb_scores.values()),
                            list(cat_scores.values()),
                           ], columns=list(lr_scores.keys()))

scores = scores.transpose()
# Column names must follow the row order used above: lr, svc, dt, knn, rf, xgb, cat
scores = scores.rename(columns={0:"Logistic Regression",
                                1:"Support Vector Machine",
                                2:"Decision Tree",
                                3:"K-Nearest Neighbors",
                                4:"Random Forest",
                                5:"XGBoost",
                                6:"CatBoost"
                               })
scores.style.highlight_min(color = 'red', axis = 1).highlight_max(color = 'green', axis = 1)

In my view, the CatBoost classifier is the best of these models, since its predictions are the most balanced.

In [None]:
feature_importances = pd.Series(catboost_model.get_feature_importance(), index=X_train.columns)
colors = [
    "#00ff9f",
    '#08F7FE',  # teal/cyan
    "#001eff",
    '#FE53BB',  # pink
    '#F5D300',  # yellow
    '#00ff41',  # matrix green
    "#bd00ff",
    "#F600ff"
]
fig, ax = plt.subplots(figsize=(20, 20))
feature_importances.sort_values().plot.barh(color=list(reversed(colors)), ax=ax)
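ax.set_title("Feature importances")
# In a horizontal bar chart the importance values run along the x-axis.
# CatBoost importances are not impurity decreases, so a neutral label is used.
ax.set_xlabel("CatBoost feature importance")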
fig.tight_layout()

# Submission

In [None]:
y_pred = catboost_model.predict(test_)

In [None]:
submission = pd.DataFrame({
        "PassengerId": test.index,
        "Survived": y_pred
    })
submission.to_csv('./submission.csv', index=False)

In [None]:
sizes = submission["Survived"].value_counts()

plt.figure(figsize=(15, 15))
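# value_counts() sorts by frequency, so derive each label from its class value
labels = ["Survived" if cls == 1 else "Did not survive" for cls in sizes.index]
_, _, autotexts = plt.pie(sizes, labels=labels, autopct='%1.2f%%')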

for autotext in autotexts:
    autotext.set_color('black')
plt.title("Survived")
plt.show()

## My position on the leaderboard at the time of editing this notebook:  
😃 Position 1734 of 14,140 teams - My submission scored 0.78468