# Data preprocessing - (train.csv)

## Importing libraries and reading data

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
train = pd.read_csv('../data/train.csv')

## Exploratory analysis

In [None]:
train

In [None]:
train.head()

In [None]:
# Passenger counts per outcome: 0 = did not survive, 1 = survived
np.unique(train['Survived'], return_counts=True)

In [None]:
train.isnull().sum()

The 'Age' column has 177 NaN values, which will be replaced with the median. I chose to drop the 'Cabin' column: it has many NaN values, and the 'Fare' and 'Pclass' columns already give a rough indication of whether a cabin was on a lower or higher deck. Finally, the two NaN values in the 'Embarked' column will be replaced with the mode.
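To back a decision like this with numbers, it can help to look at the share of missing values per column rather than the raw counts. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration); in the notebook the same expression could be applied to `train`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `train`; the values are made up.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

A column that is mostly empty is a candidate for dropping, while a column with only a few gaps is usually cheaper to impute.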

## Data Cleaning

In [None]:
# Assign the result back; chained inplace fillna on a column stops working in pandas 3.0
train['Age'] = train['Age'].fillna(train['Age'].median())

In [None]:
train.drop('Cabin', axis=1, inplace=True)

In [None]:
# Assign the result back; chained inplace fillna on a column stops working in pandas 3.0
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])

In [None]:
train.isnull().sum()

In [None]:
train.describe()

The train dataset now has no missing values.

## Data Analysis

### Categorical Variable X Survived

In [None]:
sns.countplot(x=train['Pclass'], hue=train['Survived']);
plt.title('How many passengers survived per class?')
plt.show()

Here we can see a clear pattern: passengers in 3rd class have a lower survival rate than those in 1st or 2nd class.
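A count plot shows absolute numbers, so a rate has to be read off by eye. A small helper can compute the survival rate and group size directly. This is a sketch under assumptions: `survival_rate_by` is a hypothetical helper name, and the toy frame below is made up; in the notebook it could be called as `survival_rate_by(train, 'Pclass')`.

```python
import pandas as pd


def survival_rate_by(df, column):
    """Return the share of survivors and the group size for each value of `column`."""
    return df.groupby(column)["Survived"].agg(rate="mean", n="size")


# Hypothetical toy data, not the real Titanic rows.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

rates = survival_rate_by(toy, "Pclass")
print(rates)
```

Keeping the group size next to the rate makes it easier to spot rates that rest on very few passengers.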

In [None]:
sns.countplot(x=train['Sex'], hue=train['Survived']);
plt.title('How many passengers of each gender have survived?')
plt.show()

Comparing the two genders, the plot indicates that female passengers had a higher chance of survival.

In [None]:
sns.countplot(x=train['Embarked'], hue=train['Survived']);
plt.title('Number of survivors by port of embarkation')
plt.show()

Passengers who embarked at Southampton have a lower survival rate than passengers who embarked at Queenstown.

### Numerical Variables X Survived

In [None]:
sns.histplot(x=train['Age'], hue=train['Survived'], kde=True);

Comparing survival by age, there is only one range where the number of survivors is at least as high as the number of deaths: children aged 10 years or younger. This is consistent with the 'women and children first' protocol having been followed.
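The same comparison can be made numerically by binning ages and computing the survival rate per bin. Keep in mind that the 177 imputed ages all sit at the median, so the bin containing the median is inflated. The sketch below uses a made-up toy frame, and the bin edges and labels are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the real Titanic rows.
toy = pd.DataFrame({
    "Age":      [4, 8, 15, 25, 30, 45, 60, 70],
    "Survived": [1, 1, 0, 0, 1, 0, 0, 0],
})

# Assumed bin edges; intervals are right-closed, e.g. (0, 10].
bins = [0, 10, 18, 40, 80]
labels = ["0-10", "11-18", "19-40", "41-80"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Share of survivors in each age group.
age_rates = toy.groupby("AgeGroup", observed=True)["Survived"].mean()
print(age_rates)
```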

In [None]:
sns.histplot(x=train['Fare'], hue=train['Survived'], kde=True);

Ticket fare gives us a proxy for socio-economic status and passenger class. Passengers in 1st Class had cabins on the upper decks, giving them easier and faster access to the lifeboats. On the other hand, passengers in 3rd Class had cabins on the lower decks, making their escape far more difficult.

In [None]:
sns.histplot(x=train['SibSp'], hue=train['Survived'], kde=True);

The graph shows that passengers who traveled alone (SibSp = 0) had a low survival rate. A probable explanation is the lack of support during the evacuation. Passengers who traveled with one companion (SibSp = 1) had the highest survival rate, probably because they could help each other evacuate the ship. Finally, for those who traveled in larger families (SibSp >= 2), the survival rate drops sharply; this may be because coordinating the evacuation of a larger family was more difficult and time-consuming.

In [None]:
sns.histplot(x=train['Parch'], hue=train['Survived'], kde=True);

Survival by Family Size (Parents/Children - Parch)

This graph shows that survival rates varied based on the number of parents or children (Parch) a passenger had on board.

- Traveling without Dependents (Parch = 0): Passengers traveling without parents or children had a lower survival rate.
- Small Families (Parch = 1 or 2): This was the "sweet spot." Traveling with one or two family members in a parent/child relationship offered the highest chance of survival.
- Large Families (Parch >= 3): The survival rate dropped significantly for larger families, likely due to the difficulty of keeping the group together and evacuating.

### Relationship between variables

In [None]:
# Pass the DataFrame and column names; faceting with `col` is most reliable this way
sns.catplot(data=train, x='Pclass', hue='Survived', col='Sex', kind='count');

Survival by Sex and Passenger Class

This chart provides a decisive look at how gender and social class were the two most critical factors for survival on the Titanic.

- Gender was the primary factor: The "women and children first" rule is clearly visible. Women had a vastly higher survival rate than men, regardless of class.
- Class was the secondary factor: Within each gender, a higher class meant a better chance of survival. First-class passengers had better odds than second-class, who in turn had better odds than third-class.
- A Woman in 3rd Class > A Man in 1st Class: The data shows that being female was a greater survival advantage than being wealthy. A woman in third class had a better chance of surviving than a man in first class.
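The comparison across sex and class can also be written as a small table of survival rates. This is a minimal sketch on made-up toy data; in the notebook the same `pivot_table` call could be applied to `train`.

```python
import pandas as pd

# Hypothetical toy data, not the real Titanic rows.
toy = pd.DataFrame({
    "Sex":      ["female"] * 4 + ["male"] * 6,
    "Pclass":   [1, 1, 3, 3, 1, 1, 1, 1, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
})

# Rows: sex, columns: passenger class, cells: share of survivors.
rate_table = toy.pivot_table(
    index="Sex", columns="Pclass", values="Survived", aggfunc="mean"
)
print(rate_table)
```

Reading along a row compares classes within one sex; reading down a column compares the sexes within one class.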

In [None]:
sns.boxplot(x=train['Pclass'], y=train['Fare'], hue=train['Survived']);

Fare, Class, and Survival: A Deeper Look

This box plot provides a statistical summary of how ticket fare related to survival within each passenger class, revealing a key nuance.

- In First Class, Money Talked: Passengers who survived in 1st Class paid a significantly higher median fare than those who did not. This suggests that more expensive tickets, which likely corresponded to better-located cabins, were correlated with a higher chance of survival.
- In 2nd & 3rd Class, Fare Was Not a Factor: For these passengers, the fare distributions for survivors and non-survivors are nearly identical. This indicates that within the middle and lower classes, the specific price of a ticket had no discernible impact on survival odds.
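The medians that the box plot draws can be computed directly, which makes the within-class comparison easier to check. The sketch below uses a made-up toy frame; the fares are illustrative only.

```python
import pandas as pd

# Hypothetical toy data, not the real Titanic rows.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Survived": [0, 0, 1, 1, 0, 0, 1, 1],
    "Fare":     [30.0, 50.0, 80.0, 120.0, 7.0, 9.0, 7.0, 9.0],
})

# Median fare per class, with one column per survival outcome.
medians = toy.groupby(["Pclass", "Survived"])["Fare"].median().unstack("Survived")
print(medians)
```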

In [None]:
sns.violinplot(x=train['Sex'], y=train['Age'], hue=train['Survived'], split=True);

Age, Sex, and Survival: A Violin Plot Analysis

This split violin plot shows the age distribution for survivors (orange side) and non-survivors (blue side), separated by gender. It is consistent with the "women and children first" protocol.

- Young Boys Stand Out Among Male Survivors: The plot for males shows that while the largest group of casualties were adults aged 20-40, the survivor distribution has a clear extra peak among young boys. For an adult male, age had little bearing on their low chance of survival.
- Female Survival Was High Across Most Ages: The plot for females shows that survivors outnumbered casualties across a wide age range. Being female provided a high chance of survival, regardless of whether they were a child, a young adult, or middle-aged.

## Division between predictors and classes

In [None]:
train.head()

In [None]:
X_predictors = train.drop(['Survived', 'Name', 'PassengerId', 'Ticket'], axis=1)
X_predictors

In [None]:
X_predictors.shape

In [None]:
Y_train = train['Survived']
Y_train

## Scaling values

In [None]:
X_predictors.describe()

Looking at the summary, 'Fare' has a strong outlier. This rules out the StandardScaler, because the maximum value of about 512 would distort the mean and standard deviation, and the MinMaxScaler, because the outlier would be mapped to 1 while 75% of passengers paid about 31 or less and would be squeezed close to 0. Other columns such as 'Age', 'SibSp' and 'Parch' also contain outliers. Therefore, we will use the RobustScaler, which centers on the median and scales by the interquartile range, and is better suited to data with outliers.
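To make the difference concrete, the sketch below scales a handful of made-up fares (one of them an extreme value) with both scalers and checks that RobustScaler, with its default settings, matches the manual formula (x - median) / IQR. The `fares` array is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical fares: five typical values and one extreme outlier.
fares = np.array([[7.0], [8.0], [13.0], [26.0], [31.0], [512.0]])

robust = RobustScaler().fit_transform(fares)
standard = StandardScaler().fit_transform(fares)

# Manual version of what RobustScaler does with its default quantile range (25, 75).
median = np.median(fares)
q1, q3 = np.percentile(fares, [25, 75])
manual = (fares - median) / (q3 - q1)

# Spread of the five typical fares after each scaling.
robust_spread = robust[:5].max() - robust[:5].min()
standard_spread = standard[:5].max() - standard[:5].min()
print(robust_spread, standard_spread)
```

With StandardScaler the outlier inflates the standard deviation, so the typical fares end up packed into a narrow range; RobustScaler keeps them spread out because the median and quartiles barely move.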

In [None]:
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

In [None]:
numeric_cols = ['Age', 'Fare', 'SibSp', 'Parch', 'Pclass']
categorical_cols = ['Sex', 'Embarked']

In [None]:
scaler = RobustScaler()
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

In [None]:
# Scaling of numerical features
X_train_num_scaled = scaler.fit_transform(X_predictors[numeric_cols])
X_train_num_df = pd.DataFrame(X_train_num_scaled, columns=numeric_cols, index=X_predictors.index)

In [None]:
X_train_num_df.head()

In [None]:
# Encoding of categorical features
X_train_categ_encoded = encoder.fit_transform(X_predictors[categorical_cols])
X_train_categ_df = pd.DataFrame(X_train_categ_encoded, columns=encoder.get_feature_names_out(categorical_cols), index=X_predictors.index)

In [None]:
X_train_categ_df.head()

In [None]:
X_train = pd.concat([X_train_num_df, X_train_categ_df], axis=1)

In [None]:
X_train

In [None]:
X_train.shape

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_naive_train, X_naive_test, Y_naive_train, Y_naive_test = train_test_split(X_train, Y_train, test_size=0.15, random_state=0)

In [None]:
X_naive_train.shape, Y_naive_train.shape

In [None]:
X_naive_test.shape, Y_naive_test.shape

## Saving the variables

In [None]:
import pickle

In [None]:
with open('naive-titanic.pkl', mode='wb') as f:
    pickle.dump([X_naive_train, Y_naive_train, X_naive_test, Y_naive_test], f)
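For completeness, here is a minimal round-trip sketch showing how a list saved this way can be read back in another notebook. The objects and the temporary file path are hypothetical stand-ins; with the real file, the list would be unpacked in the same order it was dumped.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical stand-ins for the saved splits.
X_demo = pd.DataFrame({"Age": [0.5, -0.1], "Fare": [-0.3, 4.1]})
y_demo = pd.Series([0, 1], name="Survived")

# Temporary path used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "demo-titanic.pkl")

with open(path, mode="wb") as f:
    pickle.dump([X_demo, y_demo], f)

# Unpack in the same order the objects were dumped.
with open(path, mode="rb") as f:
    X_loaded, y_loaded = pickle.load(f)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.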

# Data preprocessing - (test.csv)

In [58]:
test_final = pd.read_csv('../data/test.csv')
test_final.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


In [59]:
test_final.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [63]:
median_age_from_train = train['Age'].median()
median_fare_from_train = train['Fare'].median()

In [64]:
# Fill the gaps in the test set with statistics computed on the training data.
# Assigning the result back avoids the chained inplace fillna, which stops working in pandas 3.0.
test_final['Age'] = test_final['Age'].fillna(median_age_from_train)
test_final['Fare'] = test_final['Fare'].fillna(median_fare_from_train)


In [67]:
test_final.drop('Cabin', axis=1, inplace=True)

In [68]:
test_final.isnull().sum()

PassengerId    0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64
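The idea of imputing the test set with statistics learned from the training set can also be expressed with scikit-learn's `SimpleImputer`, which stores the training medians and reuses them. This is a sketch of an alternative, not what the notebook does; `train_demo` and `test_demo` are made-up frames.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for train / test_final.
train_demo = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Fare": [7.0, 10.0, 100.0]})
test_demo = pd.DataFrame({"Age": [np.nan, 50.0], "Fare": [8.0, np.nan]})

# Learn the medians from the training data only.
imputer = SimpleImputer(strategy="median").fit(train_demo)

# Apply the stored training medians to the test data.
test_filled = pd.DataFrame(
    imputer.transform(test_demo),
    columns=test_demo.columns,
    index=test_demo.index,
)
print(test_filled)
```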

In [69]:
X_test_num_scaled = scaler.transform(test_final[numeric_cols])
X_test_num_df = pd.DataFrame(X_test_num_scaled, columns=numeric_cols, index=test_final.index)

In [70]:
X_test_categ_encoded = encoder.transform(test_final[categorical_cols])
X_test_categ_df = pd.DataFrame(X_test_categ_encoded, columns=encoder.get_feature_names_out(categorical_cols), index=test_final.index)

In [72]:
X_test_final = pd.concat([X_test_num_df, X_test_categ_df], axis=1)
X_test_final

| | Age | Fare | SibSp | Parch | Pclass | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.500000 | -0.286926 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 1 | 1.461538 | -0.322838 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 2.615385 | -0.206444 | 0.0 | 0.0 | -1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 3 | -0.076923 | -0.250836 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 4 | -0.461538 | -0.093839 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | 0.000000 | -0.277363 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 414 | 0.846154 | 4.090404 | 0.0 | 0.0 | -2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 415 | 0.807692 | -0.312011 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 416 | 0.000000 | -0.277363 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |


In [73]:
with open('X_test_final.pkl', 'wb') as f:
    pickle.dump(X_test_final, f)