# Assessments

<div style='text-align: justify;'>
    In this challenge we will solve <a href="https://www.kaggle.com/competitions/titanic">Kaggle's Titanic</a> problem. The task consists of processing the data, building the model, and deploying it. Let's start!
</div>


## Problem #2: Titanic

<p style='text-align: justify;'>
The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in 1502 deaths out of 2224 passengers and crew.
</p>
While some element of luck was involved in surviving, some groups were more likely to survive than others.
<p style='text-align: justify;'>
In this problem, we ask you to build a predictive model that answers the question: "What sorts of people were more likely to survive?" using passenger data (i.e., name, age, gender, socio-economic class, etc.).
    </p>

In this competition, you will gain access to two similar datasets that include passenger information like name, age, gender, socio-economic class, etc. One dataset is titled `train.csv`, and the other is titled `test.csv`.

`train.csv` will contain the details of a subset of the passengers on board ($891$ to be exact) and, importantly, will reveal whether they survived, also known as the _ground truth_.

The `test.csv` dataset contains similar information but does not disclose the _ground truth_ for each passenger. It is your job to predict these outcomes.

Using the patterns in the `train.csv` data, predict whether the other $418$ passengers on board (found in `test.csv`) survived.

Check out the *Data* tab to explore the datasets even further. Once you feel you have created a competitive model, please submit it to Kaggle to see where your model stands on our leaderboard against other Kagglers.

### Data dictionary

| Variable | Definition | Key |
|----------|------------|-----|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

#### Variable notes

**pclass**: A proxy for socio-economic status (SES)

* 1st = Upper
* 2nd = Middle
* 3rd = Lower

---

**age**: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.
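
The encoding rule above can be checked mechanically. A minimal sketch, assuming ages are stored as floats; the helper name `is_estimated_age` is hypothetical and is not part of the dataset or of this notebook:

```python
def is_estimated_age(age: float) -> bool:
    """Return True if an age of 1 or more ends in .5 (the 'estimated' marker)."""
    # Ages below 1 are fractional by design, so they are not treated as estimates.
    return age >= 1 and age % 1 == 0.5

ages = [0.42, 28.0, 28.5, 0.5]  # toy values
flags = [is_estimated_age(a) for a in ages]
```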

----

**sibsp**: The dataset defines family relations in this way...

* Sibling = brother, sister, stepbrother, stepsister
* Spouse = husband, wife (mistresses and fiancés were ignored)

----
**parch**: The dataset defines family relations in this way...

* Parent = mother, father
* Child = daughter, son, stepdaughter, stepson
* Some children travelled only with a nanny, therefore parch=0 for them.
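
Because `sibsp` and `parch` both count relatives aboard, they are often combined into a single feature. A minimal sketch on made-up rows; the `FamilySize` column is an assumption for illustration and is not used in the solution that follows:

```python
import pandas as pd

# Toy rows, not taken from the real dataset.
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Relatives aboard plus the passenger themselves.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
```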

# ☆ Solution to problem #2 - Titanic ☆

## Data preparation

#### ⊗ Import Python packages 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#### ⊗ Loading the training file

In [None]:
path = "../datasets/titanic"

In [None]:
train_data = pd.read_csv(f"{path}/train.csv", sep=",")
train_data

In [None]:
train_data.describe().T

In [None]:
def display_missing(df):
    """Print the number of missing values in each column of df."""
    for col in df.columns:
        print(f'{col} missing data: {df[col].isnull().sum()}')
    print('\n')

display_missing(train_data)
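
The same per-column counts can also be obtained without a loop. A small sketch on a toy frame (the data is made up); `isnull().sum()` returns one count per column:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Age": [22.0, np.nan, 30.0], "Cabin": [None, None, "C85"]})

# One missing-value count per column, as a Series indexed by column name.
missing_counts = toy.isnull().sum()
print(missing_counts)
```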

#### ⊗ Loading the test file

In [None]:
test_data = pd.read_csv(f"{path}/test.csv", sep=",")
test_data

In [None]:
test_data.describe().T

#### ⊗ Treating the data

1. Remove `Name` and `Ticket`.

2. Map `Sex` to:

* male: 0
* female: 1

3. Map `Embarked` to:

* C: 0
* Q: 1
* S: 2

In [None]:
mapping_sex = {'male': 0, 'female': 1}
train_data = train_data.replace({'Sex': mapping_sex})
test_data = test_data.replace({'Sex': mapping_sex})

In [None]:
mapping_embarked = {'C': 0, 'Q': 1, 'S': 2}
train_data = train_data.replace({'Embarked': mapping_embarked})
test_data = test_data.replace({'Embarked': mapping_embarked})
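
A point worth checking with this kind of mapping is what happens to values outside the dictionary. The sketch below uses `Series.map` on toy data as an alternative to `replace`; unlike `replace`, `map` turns unmapped or missing values into `NaN`, which makes gaps easy to spot:

```python
import pandas as pd

mapping_embarked = {"C": 0, "Q": 1, "S": 2}
toy = pd.Series(["S", "C", None, "Q"])  # toy values

# The missing entry stays missing, so the encoded column is float.
encoded = toy.map(mapping_embarked)
```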

We need to fill in the data that has no age. For this, let's try to find the correlation between age and other attributes:

In [None]:
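# numeric_only=True skips text columns such as Name and Ticket (required on pandas >= 2.0)
train_data_corr = train_data.corr(numeric_only=True).abs().unstack().sort_values(kind="quicksort", ascending=False).reset_index()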
train_data_corr.rename(columns={"level_0": "Feature 1", "level_1": "Feature 2", 0: 'Correlation Coefficient'}, inplace=True)
train_data_corr[train_data_corr['Feature 1'] == 'Age']

Among the other attributes, Pclass has a comparatively high correlation with age, so we can use Pclass to fill in the missing ages.

In addition, we suspect that gender also shifts the typical age within each social class.

In [None]:
# Median age for every (Sex, Pclass) combination
age_by_pclass_sex = train_data.groupby(['Sex', 'Pclass'])['Age'].median()

sex_labels = {0: 'male', 1: 'female'}
for pclass in range(1, 4):
    for sex in [1, 0]:
        print('Median age of Pclass {} {}s: {}'.format(pclass, sex_labels[sex], age_by_pclass_sex.loc[(sex, pclass)]))
print('Median age of all: {}'.format(train_data['Age'].median()))

The median age differs considerably across Pclass and gender, so we can use these two attributes to fill in the missing values:

In [None]:
# transform keeps the original row index, so the result aligns with train_data
train_data['Age'] = train_data.groupby(['Sex', 'Pclass'])['Age'].transform(lambda x: x.fillna(x.median()))
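
To see what group-wise median imputation does, here is a minimal sketch on made-up rows. It uses `transform`, which returns a result aligned with the original index:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Sex":    [0, 0, 0, 1, 1, 1],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Age":    [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# Each missing Age is replaced by the median of its (Sex, Pclass) group.
toy["Age"] = toy.groupby(["Sex", "Pclass"])["Age"].transform(
    lambda s: s.fillna(s.median())
)
```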

In [None]:
display_missing(train_data)

In [None]:
display_missing(test_data)
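
If the test set reports missing `Age` or `Fare` values, they also need to be filled before the data reaches the model, since `NaN` inputs propagate through the network. One option, sketched here on toy frames, is to reuse medians computed on the training set. The helper name `fill_from_train` is hypothetical:

```python
import numpy as np
import pandas as pd

def fill_from_train(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Fill missing Age and Fare in `test` using medians computed on `train`."""
    test = test.copy()
    # Median Age per (Sex, Pclass) group, learned from the training data only.
    group_medians = (
        train.groupby(["Sex", "Pclass"])["Age"].median().rename("AgeMedian").reset_index()
    )
    merged = test.merge(group_medians, on=["Sex", "Pclass"], how="left")
    test["Age"] = test["Age"].fillna(pd.Series(merged["AgeMedian"].values, index=test.index))
    # Fall back to the overall training median for groups unseen in training.
    test["Age"] = test["Age"].fillna(train["Age"].median())
    test["Fare"] = test["Fare"].fillna(train["Fare"].median())
    return test

toy_train = pd.DataFrame({"Sex": [0, 0, 1], "Pclass": [3, 3, 1],
                          "Age": [20.0, 30.0, 40.0], "Fare": [7.0, 9.0, 80.0]})
toy_test = pd.DataFrame({"Sex": [0, 1], "Pclass": [3, 1],
                         "Age": [np.nan, 35.0], "Fare": [8.0, np.nan]})
filled = fill_from_train(toy_train, toy_test)
```

Using training statistics keeps the test set from influencing its own preprocessing.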

Now let's treat the missing cabins. A passenger's location on the ship plausibly influenced survival (for example, those closest to the point of impact may have been affected first).

We also know that cabins were usually allocated by social class, so let's try to relate the first letter of the cabin (which should identify the section of the ship) to the social class.

In [None]:
# Putting M for missing on missing data
train_data['Cabin'] = train_data['Cabin'].apply(lambda s: s[0] if pd.notnull(s) else 'M')
test_data['Cabin'] = test_data['Cabin'].apply(lambda s: s[0] if pd.notnull(s) else 'M')
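
The same first-letter extraction can be written with vectorized string methods. A sketch on toy values:

```python
import numpy as np
import pandas as pd

toy_cabin = pd.Series(["C85", np.nan, "E46", "B57 B59"])  # toy values

# Keep the first character; missing cabins become 'M'.
deck = toy_cabin.str[0].fillna("M")
```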

In [None]:
train_data['Cabin']

Trying to relate the cabin letters to the class.

In [None]:
display_missing(train_data)

In [None]:
display_missing(test_data)

In [None]:
train_data_cabin_corr = train_data.groupby(['Cabin', 'Pclass']).count().drop(columns=['Survived', 'Sex', 'Age', 'SibSp', 'Parch', 
                                                                        'Fare', 'Embarked', 'PassengerId', 'Ticket']).rename(columns={'Name': 'Count'}).transpose()

In [None]:
train_data_cabin_corr
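
A cross-tabulation gives the same letter-by-class counts more directly. A minimal sketch on made-up rows using `pd.crosstab`:

```python
import pandas as pd

toy = pd.DataFrame({
    "Cabin":  ["A", "B", "M", "M", "F", "M"],
    "Pclass": [1, 1, 3, 2, 2, 3],
})

# Rows are ticket classes, columns are cabin letters, cells are passenger counts.
deck_by_class = pd.crosstab(toy["Pclass"], toy["Cabin"])
print(deck_by_class)
```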

In [None]:
mapping_cabin = {'A': 0, 'B': 0, 'C': 0, 'D': 1, 'E': 1, 'F': 2, 'G': 2, 'T': 2, 'M': 3}
train_data = train_data.replace({'Cabin': mapping_cabin})
test_data = test_data.replace({'Cabin': mapping_cabin})

In [None]:
display_missing(train_data)

Now we just need to treat `Embarked`, which has 2 missing values.

In [None]:
# numeric_only=True skips the remaining text columns (required on pandas >= 2.0)
train_data_corr = train_data.corr(numeric_only=True).abs().unstack().sort_values(kind="quicksort", ascending=False).reset_index()
train_data_corr.rename(columns={"level_0": "Feature 1", "level_1": "Feature 2", 0: 'Correlation Coefficient'}, inplace=True)
train_data_corr[train_data_corr['Feature 1'] == 'Embarked']

In [None]:
train_data[train_data['Embarked'].isnull()]

In [None]:
train_data = train_data.dropna()
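
Dropping the affected rows is the simplest option. An alternative that keeps every passenger is to fill the gaps with the most frequent port. This is a sketch of that alternative on toy data, not what this notebook does:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Embarked": [2.0, 2.0, 0.0, np.nan, 1.0]})  # toy, already encoded

# mode() returns a Series; its first entry is the most frequent value.
most_common = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common)
```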

In [None]:
train_data = train_data[[col for col in train_data.columns if col not in ['Name', 'Ticket']]]
test_data = test_data[[col for col in test_data.columns if col not in ['Name', 'Ticket']]]

In [None]:
for col in train_data.columns:
    plt.figure(figsize=[16, 12])
    train_data[col].plot()
    plt.title(col)

## MLP solution

#### ⊗ Import library packages

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

#### ⊗ Create model

In [None]:
class MLP(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, num_classes)
        
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        # Return raw logits: nn.CrossEntropyLoss applies log-softmax itself
        x = self.fc3(x)
        return x

#### ⊗ Data preprocessing

In [None]:
mlp_train_datax = train_data[[col for col in train_data.columns if col not in ['Survived']]]
mlp_train_datay = train_data[['Survived']]

In [None]:
mlp_x_train = mlp_train_datax.values.astype(np.float32)
mlp_y_train = mlp_train_datay.values.astype(np.float32)

In [None]:
mlp_x_test = test_data.values.astype(np.float32)

#### ⊗ Data standardization

In [None]:
scaler = StandardScaler()
mlp_x_train = scaler.fit_transform(mlp_x_train)
mlp_x_test = scaler.transform(mlp_x_test)
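
For reference, standardization subtracts each column's mean and divides by its standard deviation, with both statistics taken from the training set only. A NumPy sketch of the computation on toy arrays (`StandardScaler` uses the population standard deviation, which matches NumPy's default `ddof=0`):

```python
import numpy as np

x_train = np.array([[1.0, 10.0], [3.0, 30.0]], dtype=np.float32)  # toy data
x_test = np.array([[2.0, 40.0]], dtype=np.float32)

mean = x_train.mean(axis=0)
std = x_train.std(axis=0)  # ddof=0

# The test set is scaled with the training statistics, never its own.
x_train_scaled = (x_train - mean) / std
x_test_scaled = (x_test - mean) / std
```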

#### ⊗  Preparation of training labels

In [None]:
num_classes = 2
mlp_y_train = np.eye(num_classes)[mlp_y_train.reshape(-1).astype(int)]
mlp_y_train = torch.from_numpy(mlp_y_train)
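
Indexing an identity matrix with the labels is a compact way to one-hot encode them, and `argmax` inverts the encoding. A NumPy sketch on toy labels:

```python
import numpy as np

labels = np.array([0, 1, 1, 0])  # toy Survived values
num_classes = 2

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes)[labels]
recovered = one_hot.argmax(axis=1)
```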

In [None]:
input_size = mlp_x_train.shape[1]
hidden_size = 9

Model, loss function, and optimizer definition:

In [None]:
mlp_model = MLP(input_size, hidden_size, num_classes)

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adamax(mlp_model.parameters())
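
`nn.CrossEntropyLoss` applies log-softmax internally, so it expects raw, unnormalized scores (logits) together with integer class indices. The NumPy sketch below reproduces that computation for a small batch to make the expected inputs explicit; it illustrates the formula and is not PyTorch's implementation:

```python
import numpy as np

def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean cross-entropy of raw logits (N, C) against class indices (N,)."""
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the true class for each sample.
    return float(-log_probs[np.arange(len(targets)), targets].mean())

logits = np.array([[2.0, 0.0], [0.0, 0.0]])  # toy scores
targets = np.array([0, 1])
loss = cross_entropy(logits, targets)
```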

#### ⊗  Model training

In [None]:
num_epochs = 200
batch_size = 32

mlp_x_train = torch.from_numpy(mlp_x_train)
mlp_dataset = torch.utils.data.TensorDataset(mlp_x_train, mlp_y_train)
mlp_dataloader = torch.utils.data.DataLoader(mlp_dataset, batch_size=batch_size, shuffle=True)

In [None]:
mlp_model.train()
mlp_loss_history = []

for epoch in range(num_epochs):
    epoch_loss = 0.0
    for inputs, targets in mlp_dataloader:
        optimizer.zero_grad()
        outputs = mlp_model(inputs)
        loss = criterion(outputs, torch.argmax(targets, dim=1))
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()

    epoch_loss /= len(mlp_dataloader)
    mlp_loss_history.append(epoch_loss)
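
The loop above trains on all labelled rows, so it gives no estimate of how well the model generalizes. One common option, shown here only as a sketch with a hypothetical helper `split_indices`, is to hold out part of the training data for validation:

```python
import numpy as np

def split_indices(n_samples: int, val_fraction: float = 0.2, seed: int = 0):
    """Return shuffled (train_idx, val_idx) index arrays with no overlap."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_val = int(n_samples * val_fraction)
    return order[n_val:], order[:n_val]

train_idx, val_idx = split_indices(10, val_fraction=0.2)
```

The two index arrays could then be used to slice the feature and label tensors before building separate data loaders.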

In [None]:
mlp_model.eval()
with torch.no_grad():
    # The scaled test features are a NumPy array; convert them to a tensor first
    mlp_y_pred = mlp_model(torch.from_numpy(mlp_x_test))
    mlp_predictions = torch.argmax(mlp_y_pred, dim=1)

mlp_predictions = mlp_predictions.numpy()

In [None]:
torch.save(mlp_model.state_dict(), f"{path}/mlp.pth")

In [None]:
plt.figure(figsize=(20, 5))
plt.plot(mlp_loss_history, color='blue')
plt.title('Model loss', fontsize=20)
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.show()

In [None]:
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': mlp_predictions})
output.to_csv(f"{path}/mlp_submission.csv", index=False)
print("Your submission was successfully saved!")
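
Before uploading, it can help to sanity-check the file. The sketch below assumes the submission layout written by the cell above (a `PassengerId` column and a binary `Survived` column, one row per test passenger); `check_submission` is a hypothetical helper:

```python
import pandas as pd

def check_submission(df: pd.DataFrame, expected_rows: int) -> bool:
    """Return True if `df` looks like a valid Titanic submission."""
    has_columns = list(df.columns) == ["PassengerId", "Survived"]
    right_length = len(df) == expected_rows
    binary = has_columns and bool(df["Survived"].isin([0, 1]).all())
    unique_ids = has_columns and bool(df["PassengerId"].is_unique)
    return has_columns and right_length and binary and unique_ids

toy = pd.DataFrame({"PassengerId": [892, 893], "Survived": [0, 1]})  # toy rows
ok = check_submission(toy, expected_rows=2)
```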

## Clear the memory

Before moving on, please uncomment and execute the following cell to restart the kernel and free the CPU memory. This is required to move on to the next notebook.

In [None]:
#import IPython
#app = IPython.Application.instance()
#app.kernel.do_shutdown(True)

## Clear the temporary files

After finishing the assessment, please execute the following cell to clean up the temporary files.

In [None]:
!rm -rf ../models/handwritten-model.pt ../datasets/cifar-10-python.tar.gz ../datasets/cifar-10-batches-py ../datasets/MNIST cifar-10*