# Titanic Competition

## 1. Introduction

### 1.1. Problem Statement

This competition revolves around a classification problem on the Titanic dataset. Not everyone aboard the Titanic died when the ship sank. One might expect survival to have been purely random, but there appears to be a correlation between certain attributes of the passengers and their chance of survival.
Here are some of these attributes:

| **Attribute**       | **Description**                                | **Column in dataset** |
|---------------------|:-----------------------------------------------|:----------------------|
| Ticket class        | Three possible classes: 1st, 2nd, 3rd          | *pclass*              |
| Gender              | The sex of the passenger                       | *sex*                 |
| Age                 | Age of the passenger in years                  | *age*                 |
| Siblings/spouses    | *Number* of siblings or spouses aboard Titanic | *sibsp*               |
| Parents/children    | *Number* of parents or children aboard Titanic | *parch*               |
| Ticket              | Ticket number of the passenger                 | *ticket*              |
| Fare                | Passenger fare                                 | *fare*                |
| Cabin               | Cabin number of the passenger                  | *cabin*               |
| Port of embarkation | The port at which the passenger boarded        | *embarked*            |

### 1.2. Dataset

There are two ```.csv``` files for this problem, both placed in the ```./data/``` directory:

- **train.csv**: contains the attributes in the above table along with an extra column called *survived*, which indicates whether the passenger survived (1 for survived and 0 for not survived). This dataset is meant to be used for model training.
- **test.csv**: contains only the attributes of the above table, without the *survived* column, since this dataset is meant for testing the model and its answers are hidden. The performance of your model on this dataset is used to rank the model in the competition.

### 1.3. Output

The output of the model is the set of answers it gives for the examples in the test.csv file. Therefore, the output is a ```.csv``` file that contains two columns:

- **PassengerId**: the passenger ID of each passenger in the test.csv file.
- **Survived**: which is 1 for people who survived and 0 otherwise. These values are the predictions given by the trained model.
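
As a rough illustration of this format, the sketch below builds a two-column submission table. The passenger IDs and predictions are made-up placeholders, not values from the competition, and the column is spelled `PassengerId` to match the column name used in the code cells of this notebook.

```python
import pandas as pd

# Hypothetical passenger IDs and model predictions (placeholders only)
passenger_ids = [892, 893, 894]
predictions = [0, 1, 0]

submission = pd.DataFrame({
    'PassengerId': passenger_ids,
    'Survived': predictions,
})

# index=False keeps the output to exactly the two required columns
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a file path to `to_csv` instead of calling it without one would write the same content to disk.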

# 2. Data Insight

## 2.1. Loading data and importing modules

First we need to import the necessary modules and packages:

- **pandas**: for data manipulation
- **numpy**: for array operations
- **matplotlib.pyplot**: for data visualization
- **seaborn**: for data visualization
- **scipy**: for statistical operations
- **scikit-learn**: for preprocessing, data splitting and the prediction models

Then we set the directories to read the data from, and finally we load the data into a pandas DataFrame.

In [None]:
# Importing required packages
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Setting up the directories.
CURRENT_DIR = './'
DATA_DIR = CURRENT_DIR + 'data/'
TRAIN_FILE_NAME = 'titanic_train.csv'
TEST_FILE_NAME = 'titanic_test.csv'

In [None]:
# Loading the data and showing the first few rows
train_ds = pd.read_csv(DATA_DIR + TRAIN_FILE_NAME)
print(f'The dataset has {train_ds.shape[0]} data points.')
train_ds.head()

## 2.2 Data Analysis

Next, we need to analyze our data to understand it and gain more insight into it. This is an important step since it leads to a better design of the prediction model. The steps in the analysis include:

- Checking for duplicate values and deciding how to manage them.
- Checking Null (or NaN) values and deciding how to manage them.
- Understanding the distribution of each attribute.
- Understanding the relationship between different attributes and the target variable (i.e., Survived).
- Understanding the relationship between input attributes to understand if there exists any redundant information in the dataset.

### 2.2.1 Duplicate value handling

Handling duplicate values is important since they bias the final system: the prediction model gives more weight to the repeated rows. This in turn can cause overfitting, which prevents the prediction system from generalizing well to unseen data points.

Based on the tables shown above, there are two ways that we can check if a row is duplicate:

- Check by the name of the passengers
- Check by the ticket number of passengers.

In [None]:
# Checking for duplicate values among person names
# duplicated() marks repeated names as True, so the sum is the number of duplicates
duplicate_num = train_ds.duplicated(subset=['Name']).sum()
if duplicate_num > 0:
    print(f'We have {duplicate_num} duplicate persons in the dataset')
else:
    print('There are no duplicate persons based on name')

In [None]:
# Showing the duplicate values based on tickets
train_ds[train_ds.duplicated(subset=['Ticket'])]

Analyzing the ticket numbers shows that there are multiple passengers with the same ticket number. However, the result of the next cell (for a single ticket number) shows that this is because the members of one family share the same ticket number. Therefore, we can conclude that there is no duplicate passenger information in this table and no need to handle duplicate values.

In [None]:
train_ds[train_ds['Ticket'] == '382652']

### 2.2.2 Handling missing values

Now we need to address the missing values in the dataset. The next cell shows how many missing values are present in each column of our dataset.

In [None]:
for c in train_ds.columns:
    print(f'{c} column has {train_ds[c].isna().sum()} NaN values')

From the above cell we can see that most columns do not have missing values. Among the columns that do (i.e., Age, Cabin, Embarked), *Cabin* has the most missing values while *Embarked* has the fewest. Now we need to decide whether to fill the missing values with some other value or to remove a column completely. Since the *Cabin* column is missing for roughly three quarters of the rows, and since cabin number is unlikely to carry much information about whether a person survived, we remove this column for now. We also drop the ticket number, name and passenger ID, since they are likely to contribute only a negligible amount of information to the final prediction model.

As for *Age* and *Embarked*, we replace the missing values with the median and mode, respectively. We choose the median for *Age* since it is a numerical value and the median is less sensitive to outliers than the mean, and the mode for *Embarked* since it is a categorical value.

In [None]:
# Replacing NaNs in Age and Embarked columns with median and most frequent values
train_ds[['Age']] = train_ds[['Age']].fillna(value=train_ds['Age'].median())
train_ds[['Embarked']] = train_ds[['Embarked']].fillna(train_ds['Embarked'].mode()[0])

In [None]:
# Dropping the columns we decided not to use
train_ds.drop(['Cabin', 'Name', 'Ticket', 'PassengerId'], axis=1, inplace=True)

### 2.2.3 Univariate Analysis

Now we need to analyze each column (or attribute) individually. The best way to do this is to look at the distribution of the values of each attribute. We plot the distributions of the *numerical* and *categorical* attributes in separate cells.

In [None]:
CATEGORICAL = ['Sex', 'Pclass', 'Survived', 'Embarked']
NUMERICAL = ['Age', 'Parch', 'SibSp', 'Fare']

In [None]:
# Distribution of Numerical parameters
plt.rcParams.update({'font.size': 10})
train_ds.hist(column=NUMERICAL, figsize=(10.0, 10.0), grid=True)

As can be seen from the cell above, the distributions of the numerical attributes are fairly skewed. Many ML models (linear models and neural networks in particular) work better with data that is roughly normally distributed, or at least zero-centered, so this is an indication that we need to transform our numerical attributes before feeding them into a model. Note also that for *SibSp* (number of siblings on board), *Parch* (number of parents on board) and *Fare*, most people have a value very close to 0. For *SibSp* and *Parch*, which take discrete values, this means that most passengers have a value of exactly 0. When an attribute takes the same value for most data points, it carries little information (low entropy). The bivariate analysis in the next sections gives a better idea of whether each of these attributes carries significant information about our target variable (i.e., Survived).
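
One common way to reduce this kind of right skew is a logarithmic transform. The sketch below is only an illustration on made-up fare values; it is not the transformation applied later in this notebook, which standardizes the numerical columns instead.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed fares (placeholders, not rows from the dataset)
fares = pd.Series([7.25, 7.90, 8.05, 13.0, 26.0, 30.0, 71.28, 263.0, 512.33])

# log1p(x) = log(1 + x) compresses large values and is defined at 0
log_fares = np.log1p(fares)

# Sample skewness: values near 0 indicate a roughly symmetric distribution
skew_before = fares.skew()
skew_after = log_fares.skew()
print(f'Skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}')
```

Whether such a transform helps in practice depends on the model, so it is worth comparing dev-set accuracy with and without it.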

In [None]:
# Bar chart of the value counts of each categorical attribute
figure = plt.figure(figsize=(10.0, 10.0))
for i, c in enumerate(CATEGORICAL):
    plt.subplot(2, 2, i + 1)
    train_ds[c].value_counts().plot(kind='bar')
    plt.title(c)

Finally, for the categorical attributes we can see that the data is less unbalanced. More people died than survived, and there were more men on board than women. The cell above also shows that most passengers held a 3rd class ticket and that most boarded the ship at port *S* (Southampton).

### 2.2.4 Bivariate Analysis

Now it is time to perform an analysis to better understand the relationship between the different attributes and the target variable (i.e., Survived). First, note that our target variable is categorical, not numerical (i.e., either survived or not survived). The fact that we represent these two possible values with the integers 1 and 0 does not make the variable numerical. Secondly, we should remember that among the other attributes, some are categorical (such as Sex) and some are numerical (such as Age).

What we want to understand here is (for each attribute) whether there exists a meaningful relationship between the target variable and the attribute. In other words (and statistically speaking), if we assume that there is no relationship between the attribute and the target variable (i.e., the null hypothesis), are we able to reject this null hypothesis or not? If yes, then we have shown that there exists a meaningful relationship between the attribute and the target variable. Otherwise, we fail to reject the null hypothesis, which means that we could not show that the two are related.

For categorical variables (and given that the target variable is also categorical), we can use a chi-square test on the ```crosstab``` table of the attribute and the target variable. The chi-square test gives a p-value, which is the **probability of observing an association at least as strong as the one in our data if the null hypothesis were true**. It is not the probability that the null hypothesis holds. If the p-value is below 0.05, we reject the null hypothesis at the 5% significance level (often described as 95% confidence). This is a conventional threshold, and we use it to reject the null hypothesis in the next cell.

In [None]:
for c in CATEGORICAL:
    # Testing Survived against itself is meaningless, so skip the target
    if c == 'Survived':
        continue
    p_value = chi2_contingency(pd.crosstab(index=train_ds[c], columns=train_ds['Survived']))[1]
    if p_value < 0.05:
        print(f'Null hypothesis rejected -> Survived and {c} columns are related')
    else:
        print(f'We failed to reject the null hypothesis for {c}')

As we can see from the cell above, all of our categorical attributes seem to have a meaningful relationship with the target variable so it is wise to keep them in the dataset.
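
To make the test above more concrete, here is a small self-contained sketch on made-up data (100 hypothetical passengers, not rows from the dataset). It shows the contingency table that ```crosstab``` produces and the values that `chi2_contingency` returns.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up passengers (placeholders): women mostly survive, men mostly do not
toy = pd.DataFrame({
    'Sex': ['female'] * 50 + ['male'] * 50,
    'Survived': [1] * 40 + [0] * 10 + [1] * 10 + [0] * 40,
})

# Rows are the attribute values, columns are the target values
table = pd.crosstab(index=toy['Sex'], columns=toy['Survived'])

# chi2_contingency returns the statistic, the p-value, the degrees of
# freedom and the table of counts expected under independence
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print(f'chi2={chi2:.2f}, p-value={p_value:.2e}, dof={dof}')
```

With such a lopsided table the p-value falls far below 0.05, so the null hypothesis of independence is rejected for this toy data.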

Now we have to also understand the relationship that exists between our numerical attributes and the target variable. To do this, I prefer to use a visualization method. We will analyze each of the attributes one at a time. First, we take a look at the relationship between *Age* and *Survived* in the next cell. For this purpose, we plot a stacked bar chart and a scatterplot with each dot painted as either blue (demised) or orange (survived).

In [None]:
# A stacked bar to understand the correlation between age attribute and survived attribute
figure = plt.figure(figsize=(20.0, 10.0))

# Stacked bar representation
min_age = train_ds['Age'].min()
max_age = train_ds['Age'].max()
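age_bins = np.linspace(min_age, max_age, 10)
# Bin the ages once; include_lowest=True keeps the youngest passenger in the first bin
age_groups = pd.cut(train_ds['Age'], age_bins, include_lowest=True)
survived = train_ds[train_ds['Survived']==1].groupby(age_groups, observed=False)['Age'].count()
n_survived = train_ds[train_ds['Survived']==0].groupby(age_groups, observed=False)['Age'].count()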
plt.subplot(1,2,1)
plt.bar(age_bins[1:], survived, color='blue', width=0.5, label='survived')
plt.bar(age_bins[1:], n_survived, color='red', bottom=survived, width=0.5, label='not survived')
plt.title('Age-Survived distribution in bar format')
plt.xlabel('Age ranges')
plt.ylabel('Counts')
plt.legend()

# Scatterplot representation
plt.subplot(1,2,2)
for i in range(1, 7):
    plt.axhline(y=i*10, color='r', linestyle='-')
sns.scatterplot(data=train_ds, x=train_ds.index, y='Age', hue='Survived', legend='auto')
plt.title('Age-Survived distribution in scatter format (orange survived, blue demised)')
plt.ylabel('Age')
plt.xlabel('Index')
plt.legend()

plt.show()

As we can see from the cell above, there seems to be a relationship between age and the chance of survival. It seems that younger people had a greater chance of survival than older people.
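
The same impression can be checked numerically by computing the fraction of survivors in each age bin. The sketch below uses made-up passengers and hypothetical bin edges, so the numbers only illustrate the technique.

```python
import pandas as pd

# Made-up passengers (placeholders, not rows from the dataset)
toy = pd.DataFrame({
    'Age':      [2, 5, 9, 22, 25, 31, 38, 45, 58, 64],
    'Survived': [1, 1, 0,  1,  0,  0,  1,  0,  0,  0],
})

# Hypothetical bin edges chosen for illustration only
age_bins = [0, 12, 40, 80]
age_groups = pd.cut(toy['Age'], bins=age_bins)

# The mean of a 0/1 column is the fraction of survivors in each bin
survival_rate = toy.groupby(age_groups, observed=False)['Survived'].mean()
print(survival_rate)
```

A table of rates like this complements the stacked bar chart, where bins with few passengers are hard to compare by eye.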

Now, we plot the same stacked bar plot for number of parents and number of siblings on board in the next cell.

In [None]:
# Relationship between the number of siblings / parents and the survived attribute.
figure = plt.figure(figsize=(20.0, 10.0))
counter = 1
for c in ['SibSp', 'Parch']:
    sib_bins = train_ds[c].unique()
    # Count survivors and non-survivors per value; values absent from a group count as 0
    survived = train_ds[train_ds['Survived']==1].groupby(c)[c].count().reindex(sib_bins, fill_value=0)
    n_survived = train_ds[train_ds['Survived']==0].groupby(c)[c].count().reindex(sib_bins, fill_value=0)
    
    plt.subplot(1, 2, counter)
    counter += 1
    
    plt.bar(sib_bins, survived, color='blue', width=0.5, label='survived')
    plt.bar(sib_bins, n_survived, color='red', bottom=survived, width=0.5, label='not survived')
    if c == 'SibSp':
        plt.title('Stacked bar relationship between # siblings and survived attribute.')
        plt.xlabel('# Siblings')
    else:
        plt.title('Stacked bar relationship between # parents and survived attribute.')
        plt.xlabel('# parents')
    plt.ylabel('Count')
    plt.legend()
plt.show()

Again from the cell above, there seems to be a meaningful relationship between the number of parents/siblings and the chance of survival. Specifically, it seems that those who had 1, 2 or 3 parents on board had a higher chance of survival than those who had none. The number of siblings also seems to have had an effect: those who had one or two siblings on board seem to have had a higher chance of survival than other passengers (maybe they could help each other, whereas those with more siblings struggled so much to help one another that they ended up drowning).

Lastly, we can take a look at the effect of *Fare* on the chance of survival. We can do so by plotting a scatterplot in the next cell.

In [None]:
# plotting a scatter plot to show the dependency of fare and survived attribute
plt.figure(figsize=(10.0, 10.0))
plt.axhline(y=200, color='r', linestyle='-')
plt.axhline(y=100, color='r', linestyle='-')
sns.scatterplot(data=train_ds, x=train_ds.index, y='Fare', hue='Survived')

Once again we can see that people who paid higher fares seem to have had a higher chance of survival than those who paid lower fares (maybe the crew had a protocol to aid the wealthier passengers first in case of emergency).

All in all we can conclude that all of our attributes have a meaningful relationship with the target variable and we should keep them in our dataset for modelling.
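
The visual comparison for a numerical attribute can be complemented with a statistical test, in the same spirit as the chi-square test used for the categorical attributes. The sketch below uses the Mann-Whitney U test from scipy on made-up fares; this test is an assumption of this example and is not part of the original analysis.

```python
from scipy.stats import mannwhitneyu

# Made-up fares (placeholders, not rows from the dataset)
fares_survived = [71.3, 53.1, 30.0, 26.6, 80.0, 146.5, 83.2, 76.7]
fares_not_survived = [7.25, 8.05, 8.46, 7.90, 13.0, 7.75, 21.1, 7.23]

# Null hypothesis: both groups of fares come from the same distribution
stat, p_value = mannwhitneyu(fares_survived, fares_not_survived,
                             alternative='two-sided')

if p_value < 0.05:
    print(f'Null hypothesis rejected (p-value={p_value:.4f})')
else:
    print(f'We failed to reject the null hypothesis (p-value={p_value:.4f})')
```

The test is rank-based, so it does not assume normally distributed values, which suits a skewed attribute such as *Fare*.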

## 2.3 Data Manipulation

Now it is time to prepare our dataset so that it can be fed to a Machine Learning model. Most of the ML models that we know of are statistical models that try to capture the underlying statistical pattern in the data. Therefore, any attribute that is fed to them should be in numerical format so that the algorithm can process it.

### 2.3.1 One-hot Encoding
As you have already seen, some of the attributes in our dataset are categorical and need to be converted into numerical values (i.e., assigning a numerical value to each category).

For example, the *Embarked* attribute has three possible values which are *C*, *Q* and *S*. Although we can simply assign numbers 1, 2 and 3 to these categories, it is better to represent this attribute with a one-hot encoded vector. This is because *C*, *Q* and *S* probably have no connection with each other in real life while numbers 1, 2 and 3 have a logical and arithmetic relation to each other. Therefore, to better capture this independence, it is better to use one-hot encoding. As an example, if the *Embarked* attribute for a certain data point is *S*, then we can represent the *Embarked* attribute for that data point with the [0, 0, 1] vector where the first element represents *C* value, the second element represents *Q* value and the third element represents *S* value. We need to do this for *Sex* and *Pclass* attributes as well since they are also categorical.
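
A minimal sketch of this encoding on a made-up column of four values (placeholders, not rows from the dataset) shows the [0, 0, 1] vector for *S*:

```python
import pandas as pd

# Made-up Embarked values for four hypothetical passengers
embarked = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

# One column per category, in sorted order: C, Q, S
one_hot = pd.get_dummies(embarked, dtype=int)
print(one_hot)

# The first passenger boarded at S, so the row is [0, 0, 1]
first_row = one_hot.iloc[0].tolist()
print(first_row)
```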

In [None]:
# One hot encoding for the categorical attributes
one_hot_cols = ['Sex', 'Embarked', 'Pclass']
concat_dfs = [train_ds]

# Creating a one-hot encoded version of categorical columns
# which will result in a new DataFrame which will be appended
# to a collection of dataframes that should later be concatenated
# with each other
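for c in one_hot_cols:
    # prefix keeps the new column names unique strings (e.g. Pclass_1);
    # dtype=int yields 0/1 columns instead of booleans
    concat_dfs.append(pd.get_dummies(train_ds[c], prefix=c, dtype=int))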

# Creating an encoded training dataset
train_ds_enc = pd.concat(concat_dfs, axis=1)

# Dropping the non-encoded columns
for c in one_hot_cols:
    train_ds_enc.drop(c, axis=1, inplace=True)

# Showing the one-hot encoded dataframe
train_ds_enc.head()

### 2.3.2 Train-Dev Split

In [None]:
# Splitting the dataset into train and dev set since the test dataset is already given
x_features = list(train_ds_enc.columns)
x_features.remove('Survived')
X_train, X_dev, y_train, y_dev = train_test_split(train_ds_enc[x_features], train_ds_enc['Survived'], test_size=0.3,
                                                   random_state=42)

### 2.3.3 Data Normalization and Scaling

In [None]:
# Scaling the numerical values
scaler = StandardScaler()
scaler.fit(X_train[NUMERICAL])
X_train_norm = scaler.transform(X_train[NUMERICAL])
X_dev_norm = scaler.transform(X_dev[NUMERICAL])
X_train[NUMERICAL] = X_train_norm
X_dev[NUMERICAL] = X_dev_norm

# 3 Model Training and Evaluation

In [None]:
# Building a logistic regression model
log_reg_model = LogisticRegression(penalty='l2', fit_intercept=True, class_weight='balanced', warm_start=False,
                                   max_iter=500, solver='saga')

log_reg_model.fit(X_train.to_numpy(), y_train)
train_score = log_reg_model.score(X_train.to_numpy(), y_train)
dev_score = log_reg_model.score(X_dev.to_numpy(), y_dev)

print(f"The training accuracy is {train_score * 100}%")
print(f"The dev accuracy is {dev_score * 100}%")

In [None]:
log_reg_cv_model = LogisticRegressionCV(penalty='l2', cv=5, fit_intercept=True, class_weight='balanced')
log_reg_cv_model.fit(pd.concat([X_train, X_dev]).to_numpy(), pd.concat([y_train, y_dev]))

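# Note: the model above was fit on train+dev, so this is a training accuracy,
# not an estimate of performance on unseen data
log_reg_cv_model.score(X_train.to_numpy(), y_train)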

In [None]:
# Building a neural model
hidden_layers = [1024, 256, 64, 16, 4]

neural_model = MLPClassifier(solver='adam', beta_1=0.9, alpha=1e-2, hidden_layer_sizes=hidden_layers,
                             max_iter=10**3, learning_rate_init=1e-4, random_state=42)

neural_model.fit(X_train.to_numpy(), y_train)

train_score = neural_model.score(X_train.to_numpy(), y_train)
dev_score = neural_model.score(X_dev.to_numpy(), y_dev)

print(f"The training accuracy is {train_score * 100}%")
print(f"The dev accuracy is {dev_score * 100}%")

In [None]:
max_depth_values = list(range(2, 10))

train_score_per_depth = []
dev_score_per_depth = []

for d in max_depth_values:
    # Building a random forest model
    rf_model = RandomForestClassifier(max_depth=d, random_state=0)
    rf_model.fit(X_train.to_numpy(), y_train)

    train_score = rf_model.score(X_train.to_numpy(), y_train)
    dev_score = rf_model.score(X_dev.to_numpy(), y_dev)

    train_score_per_depth.append(train_score * 100)
    dev_score_per_depth.append(dev_score * 100)
    
plt.plot(max_depth_values, train_score_per_depth, label='Train Set Scores', color='blue')
plt.plot(max_depth_values, dev_score_per_depth, label='Dev Set Scores', color='red')
plt.title('Accuracy as a function of maximum depth of the tree')
plt.xlabel('Maximum Depth')
plt.ylabel('Accuracy')
plt.legend()
plt.show()