# Lab Description

This lab introduces the practical considerations of performing machine learning in the real world. Real data is generally not clean: we may be given data with gaps, outliers or irrelevant features. For models to perform well, we need to clean the data to fix these issues, and decide which columns to use as features and which to drop.

> __Problem Statement__: Your aim is to predict if a given person survived the Titanic's maiden voyage given some limited information about them.

## Competition
This lab is based on a famous Kaggle competition. In that spirit, this lab will also be a competition!

#### Competition Rules: 
- No cheating! It would be against the spirit of the competition to search for the test data on Kaggle (or elsewhere), for instance! Searching the names of Titanic survivors would also be considered cheating!
- The winner will be the entry with the highest score submitted before the end of the lab on the unseen test data.
- To optimise hyperparameters, you must perform a grid search, random search, Bayesian optimisation, or some other optimisation technique. Specifically, entries where you have stumbled upon good hyperparameters by chance will be considered invalid.
- No parallelisation, please! Certain algorithms can run on multiple CPU cores. We are running on a shared server with limited resources, so please do not attempt to parallelise your training. For example, do not set `n_jobs`; leave it at its default of `None`, which means a single core.
- Entries that "luck out" and achieve a very high score against the test data despite having a modest score on the validation data will be considered invalid.
  
#### Prize: 
- Pride!
- A small prize (please do not get too excited; it really will be nominal!)


## MATLAB vs. Python
I understand you have been trained in MATLAB, but I want to put you out of your comfort zone a little and encourage you to use Python for this lab. Firstly, it'll be good for your CV to pick up a few Python skills. Secondly, the entire machine-learning community is centred around Python, so it would be remiss of us not to use it in AERO40041 too.

If you struggle, please do feel free to explore the dataset in MATLAB instead, but please do try the Python code first. We're also here to help you and have several GTAs on hand who know Python very well should you need it. 

# Exploring the Data

Pandas is used to open the csv file, explore its contents, and start to perform some feature cleanup. In Python, it is common to import a package while giving the package a shortened name. For example, in the cell below, we write `import pandas as pd` which sets `pd` as an alias for `pandas` so we can write `pd.read_csv(<filename>)` rather than `pandas.read_csv(<filename>)`.

In [12]:
!pip install seaborn
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np



Collecting seaborn
  Using cached seaborn-0.13.2-py3-none-any.whl.metadata (5.4 kB)
Using cached seaborn-0.13.2-py3-none-any.whl (294 kB)
Installing collected packages: seaborn
Successfully installed seaborn-0.13.2


Now we use Pandas to load the training data as a Pandas dataframe called `df`. The `head` method of the `df` object displays the first rows of the dataframe (five by default; 100 here). Study the data format and familiarise yourself with the type of data we have available.

In [85]:
df = pd.read_csv('TitanicTrainVal.csv')
df.head(100)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
95,96,0,3,"Shorney, Mr. Charles Joseph",male,,0,0,374910,8.0500,,S
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
97,98,1,1,"Greenfield, Mr. William Bertram",male,23.0,0,1,PC 17759,63.3583,D10 D12,C
98,99,1,2,"Doling, Mrs. John T (Ada Julia Bone)",female,34.0,0,1,231919,23.0000,,S


In [86]:
df['Ticket'] = df['Ticket'].str.extract(r'(\d{1})', expand=False)
print(df[df['Ticket'].isnull()])
df = df.dropna(subset=['Ticket'])
df.info()

     PassengerId  Survived  Pclass                             Name   Sex  \
179          180         0       3              Leonard, Mr. Lionel  male   
271          272         1       3     Tornquist, Mr. William Henry  male   
302          303         0       3  Johnson, Mr. William Cahoone Jr  male   
597          598         0       3              Johnson, Mr. Alfred  male   

      Age  SibSp  Parch Ticket  Fare Cabin Embarked  
179  36.0      0      0    NaN   0.0   NaN        S  
271  25.0      0      0    NaN   0.0   NaN        S  
302  19.0      0      0    NaN   0.0   NaN        S  
597  49.0      0      0    NaN   0.0   NaN        S  
<class 'pandas.core.frame.DataFrame'>
Index: 887 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  887 non-null    int64  
 1   Survived     887 non-null    int64  
 2   Pclass       887 non-null    int64  
 3   Name         887 non-null    

Here is a more detailed description about the meaning of each attribute and its values:

* PassengerId - A column added by Kaggle to uniquely identify each row and make submissions easier
* Survived - Whether the passenger survived or not (0=No, 1=Yes) __(*)__
* Pclass - The class of the ticket the passenger purchased (1=1st, 2=2nd, 3=3rd).
* Name - The passenger's name and title.
* Sex - The passenger's sex
* Age - The passenger's age in years
* SibSp - The number of siblings or spouses the passenger had aboard the Titanic
* Parch - The number of parents or children the passenger had aboard the Titanic
* Ticket - The passenger's ticket number
* Fare - The fare the passenger paid in £
* Cabin - The passenger's cabin number
* Embarked - The port where the passenger embarked (C=Cherbourg, Q=Queenstown, S=Southampton)

---

__(*)__ This is what we aim to predict. This is training (or validation) data, so has this label. The test data will have the label hidden.

---

## Exploring the data

Let's explore the data. The dataset dimensions are:

In [87]:
print(df.shape)

(887, 12)


This means there are 887 samples (the four rows whose ticket contained no digit were dropped in the cell above) and 12 columns (PassengerId, Survived, etc.).

Now we check the presence of missing values:

In [88]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 887 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  887 non-null    int64  
 1   Survived     887 non-null    int64  
 2   Pclass       887 non-null    int64  
 3   Name         887 non-null    object 
 4   Sex          887 non-null    object 
 5   Age          710 non-null    float64
 6   SibSp        887 non-null    int64  
 7   Parch        887 non-null    int64  
 8   Ticket       887 non-null    object 
 9   Fare         887 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     885 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 90.1+ KB


Most columns have 887 entries, but Age, Cabin and Embarked all have some missing entries. We can compute the percentage of missing values for each potential feature as follows:

In [89]:
pd.DataFrame(df.isnull().sum()/df.shape[0]*100).T

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0.0,0.0,0.0,0.0,0.0,19.954904,0.0,0.0,0.0,0.0,77.001127,0.225479


~20% of `Age` entries, ~77% of `Cabin` entries and ~0.2% of `Embarked` entries are missing. We will need to decide what to do about this shortly.

We can also compute statistics for each column as follows:

In [90]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,887.0,887.0,887.0,710.0,887.0,887.0,887.0
mean,446.485908,0.384442,2.305524,29.684746,0.525366,0.383315,32.349436
std,257.617131,0.486738,0.836662,14.540819,1.104669,0.807466,49.758238
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.925
50%,447.0,0.0,3.0,28.0,0.0,0.0,14.4583
75%,669.5,1.0,3.0,38.0,1.0,0.0,31.1375
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


---
## Prepare the Data

Now we will perform some data cleaning and preparation. You can consider this as pre-processing the data before any machine learning is performed to get it ready for learning. What you do here is up to you! The score you achieve against the held-out test data is likely to be strongly dependent on how well you prepare the data.

---
__Suggestions__

1. Data cleaning:
    * Fix or remove outliers 
    * Fill in missing values (e.g., with zero, mean, median...) or drop their rows (or columns).
2. Feature selection:
    * Drop the attributes that provide no useful information for the task.
3. Feature engineering, where appropriate:
    * Discretize continuous features.
    * Decompose features (e.g., categorical, date/time, etc.).
    * Add promising transformations of features (e.g., log(x), sqrt(x), x^2, etc.).
    * Aggregate features into promising new features.

4. Feature scaling: standardize or normalize features.
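A few of the suggestions above can be sketched on a toy dataframe. This is only an illustration: the rows are made up, and the new column names (`FamilySize`, `LogFare`, `AgeBand`, `AgeStd`) and the age bands are hypothetical choices, not part of the lab's baseline.

```python
import numpy as np
import pandas as pd

# Toy rows using the same column names as the Titanic data (values are made up).
toy = pd.DataFrame({
    "Age":   [22.0, 38.0, 4.0, 61.0],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 2, 0],
    "Fare":  [7.25, 71.28, 0.0, 26.55],
})

# Aggregate: family size = siblings/spouses + parents/children + the passenger.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Transform: log(1 + x) compresses the long right tail of Fare and copes with zero fares.
toy["LogFare"] = np.log1p(toy["Fare"])

# Discretize: bin Age into coarse, hand-picked (hypothetical) bands.
toy["AgeBand"] = pd.cut(toy["Age"], bins=[0, 12, 18, 60, 120],
                        labels=["child", "teen", "adult", "senior"])

# Scale: standardize a numeric column to zero mean and unit variance.
toy["AgeStd"] = (toy["Age"] - toy["Age"].mean()) / toy["Age"].std(ddof=0)

print(toy)
```

When scaling real data, the mean and standard deviation would normally be computed from the training split only and then reused for the validation and test data.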


In the cells below, we perform some basic data preparation. __It is likely you can do better than this by following some of the suggestions above.__ Our approach below is a simple one.

---

> Note: In the cells below, we will be dropping selected rows, columns, generating new columns, etc. These edits will be performed on the original dataframe. If you want to go back to the original to try something new, you can simply re-read the CSV file by running `df = pd.read_csv('TitanicTrainVal.csv')`

---


 __Dealing with ``Cabin`` and ``Embarked``__

Since ``Cabin`` has 77% of missing values, we will drop the entire column.

Analogously we can drop the few NaN instances corresponding to the ``Embarked`` column.

In [91]:
df = df.drop('Cabin', axis=1)
df = df.dropna(subset=['Embarked'], axis=0)
df.shape

(885, 11)

Note we now have 885 samples (because the few without `Embarked` have been dropped). We also now only have 11 columns (because `Cabin` was dropped).

__Dealing with ``Sex`` and ``Embarked``__

These columns are categorical, so it is better to represent each of them as a one-hot-encoded vector. The cell below also one-hot encodes the leading ticket digit extracted earlier.

In [92]:
df = pd.get_dummies(df, columns=['Sex', 'Embarked','Ticket'])
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Fare,Sex_female,Sex_male,...,Embarked_S,Ticket_1,Ticket_2,Ticket_3,Ticket_4,Ticket_5,Ticket_6,Ticket_7,Ticket_8,Ticket_9
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,7.25,False,True,...,True,False,False,False,False,True,False,False,False,False
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,71.2833,True,False,...,False,True,False,False,False,False,False,False,False,False
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,7.925,True,False,...,True,False,True,False,False,False,False,False,False,False
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,53.1,True,False,...,True,True,False,False,False,False,False,False,False,False
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,8.05,False,True,...,True,False,False,True,False,False,False,False,False,False


__Dealing with ``Ticket``__

The ticket column is unlikely to be useful as there is no obvious link between the ticket number and survival rate (e.g. `A/5 21171` and `PC 17599` are selected values). We will drop it for now (you can decide what to do with it if you want; maybe there is some subtle information in there that is useful). 

In [None]:
import matplotlib

In [74]:
df.head()

Unnamed: 0,Ticket_1,Ticket_2,Ticket_3,Ticket_4,Ticket_5,Ticket_6,Ticket_7,Ticket_8,Ticket_9
0,False,False,False,False,True,False,False,False,False
1,True,False,False,False,False,False,False,False,False
2,False,True,False,False,False,False,False,False,False
3,True,False,False,False,False,False,False,False,False
4,False,False,True,False,False,False,False,False,False


The ``Name`` column may bring some interesting information. For example, the titles Mr., Miss, Mrs., Master, etc. may have some correlation with the rate of survival. This might help improve the model, but for now, we drop it for simplicity as it is not a trivial column to use. The title is also likely to be highly correlated with the `Sex` column, and somewhat correlated with the `Age` column.
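If you do want to experiment with titles, one possible approach is a regular expression that captures the text between the comma and the first full stop. The sketch below uses three names from the table above; the pattern and the variable name `titles` are illustrative assumptions, and unusual name formats may need extra handling.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# Capture the text between the comma and the first full stop, e.g. "Mr".
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(titles.tolist())  # ['Mr', 'Miss', 'Mrs']
```

The resulting column could then be one-hot encoded in the same way as `Sex` and `Embarked`, perhaps after grouping rare titles together.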

In [72]:
df = df.drop('Name', axis=1)

KeyError: "['Name'] not found in axis"

We will also drop PassengerId since this is just a label Kaggle added to uniquely identify the passenger, and has no correlation with survival rate. 

In [21]:
df = df.drop('PassengerId', axis=1)

### Dealing with missing values

The ``Age`` column has some missing values. We can try to fill them with the average age. This is not necessarily the best option, but it is a reasonable starting point; it is left to you to investigate alternatives.

In [22]:
mean_age = df['Age'].mean()

df['Age'] = df['Age'].fillna(mean_age)
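As a small illustration of why the choice of fill value matters, the sketch below compares mean and median imputation on a made-up series of ages containing one extreme value. The numbers are invented for the example and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Made-up ages with two gaps and one extreme value.
ages = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan, 80.0], name="Age")

mean_filled = ages.fillna(ages.mean())      # the mean is pulled upwards by the 80-year-old
median_filled = ages.fillna(ages.median())  # the median is more robust to extreme values

print(ages.mean(), ages.median())  # 40.75 30.5
```

Whichever statistic you choose, it is usually computed from the training data and then reused unchanged when preparing the validation and test data.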

Now let's check we have no missing values:

In [23]:
print("Shape: ", df.shape)
df.info()

Shape:  (889, 11)
<class 'pandas.core.frame.DataFrame'>
Index: 889 entries, 0 to 890
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Survived    889 non-null    int64  
 1   Pclass      889 non-null    int64  
 2   Age         889 non-null    float64
 3   SibSp       889 non-null    int64  
 4   Parch       889 non-null    int64  
 5   Fare        889 non-null    float64
 6   Sex_female  889 non-null    bool   
 7   Sex_male    889 non-null    bool   
 8   Embarked_C  889 non-null    bool   
 9   Embarked_Q  889 non-null    bool   
 10  Embarked_S  889 non-null    bool   
dtypes: bool(5), float64(2), int64(4)
memory usage: 53.0 KB


### Separating the features and labels
Now we have cleaned the data, we are ready to generate our features (X) and label (y).

In [24]:
y = df['Survived'].copy() 
X = df.loc[:, df.columns != 'Survived'].copy()  

## Running Models

We will use the machine learning package `scikit-learn` to perform the machine learning. 

We split our data into training data (which we use to optimise the parameters) and validation data (which we use to optimise the hyperparameters). The function `train_test_split` does this with a random selection. Despite the function's name, the split data should be considered as training data and validation data (not test data). We also have some held-out test data with hidden labels, which you will use to get your final score. The argument `test_size=0.2` means we select a random 20% of the original data as our validation data. You may change this if you wish.

In [25]:
from sklearn.model_selection import train_test_split
X_train, X_val,  y_train, y_val = train_test_split(X, y, random_state=123, test_size=0.2)
print(X_train.shape, y_train.shape, X_val.shape, y_val.shape)

(711, 10) (711,) (178, 10) (178,)


In the following sections, we will try several classification algorithms.

### __Logistic Regression__

In [26]:
from sklearn.linear_model import SGDClassifier

You can read the documentation of the `SGDClassifier` class by running the cell below. Pay attention to the arguments the class can take. Most of them have default values, but the defaults will not necessarily work well for every problem. It is very likely you will need to perform a hyperparameter search to achieve good performance. If you do not understand certain parts of the documentation, it is likely because we have not covered them in AERO40041. You may ask a GTA if you are curious or just stick to the algorithms we have covered.

---

> Note: Running the cell below produces a lot of text. Depending on your version of Jupyter, you may want to click to the left of the cell after running it to collapse the output into a smaller scrollable window (this may be the default).

---

In [27]:
SGDClassifier??

Init signature:
SGDClassifier(
    loss='hinge',
    *,
    penalty='l2',
    alpha=0.0001,
    l1_ratio=0.15,
    fit_intercept=True,
    max_iter=1000,
    tol=0.001,
    shuffle=True,
    verbose=0,
    epsilon=0.1,
    n_jobs=None,
    random_state=None,
    learning_rate=

The class `SGDClassifier` uses stochastic gradient descent to train various linear models. To use logistic regression, we need to set `loss='log_loss'` (otherwise, we are using a different machine learning algorithm from logistic regression). The argument `random_state=42` is useful to ensure repeatable results. The specific number 42 is arbitrary (42 is the answer to life, the universe, and everything according to Douglas Adams' comedic science fiction series, "The Hitchhiker's Guide to the Galaxy"). Finally, we set `learning_rate='constant'` to use a fixed learning rate throughout the training. This is chosen for simplicity. Other options include `optimal` and `adaptive`, which dynamically set the learning rate at each update of the parameters according to the schedules described in the documentation in the cell above. The specific value we use is set by the argument `eta0=0.1`. (Note that scikit-learn uses `eta` to mean learning rate, whereas we used `alpha` in the lectures. Both are common nomenclature.) To further add to the confusion, the regularisation hyperparameter ($\lambda$ in the lectures) is referred to as `alpha` in scikit-learn!

In [None]:
LogReg = SGDClassifier(loss='log_loss', learning_rate='constant', eta0=0.1, alpha=0.0001, random_state=42)
LogReg.fit(X_train, y_train)

We have trained a logistic regression model! Let's see how it did __on the validation data__ (not test data yet as we have not optimised hyperparameters):

In [29]:
from sklearn.metrics import accuracy_score

# Predict on the validation set
y_pred = LogReg.predict(X_val)

# Compute accuracy
accuracy = accuracy_score(y_val, y_pred)

print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.41


0.41 means we only got 41% of predictions correct! This is pretty terrible (in fact, we might on average have expected to do better by just guessing!)
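To make "just guessing" concrete, a common sanity check is a baseline that always predicts the most frequent class. The sketch below uses made-up labels with roughly the class balance reported by `df.describe()` above (about 38% survived); the names `X_toy`, `y_toy` and `baseline` are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels: 62 passengers who did not survive and 38 who did.
y_toy = np.array([0] * 62 + [1] * 38)
X_toy = np.zeros((len(y_toy), 1))  # the features are ignored by this baseline

# Always predict the most frequent class seen during fitting.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_accuracy = accuracy_score(y_toy, baseline.predict(X_toy))
print(f"Baseline accuracy: {baseline_accuracy:.2f}")  # Baseline accuracy: 0.62
```

In the lab itself, the equivalent check would be to fit the baseline on `X_train, y_train` and score it on `X_val, y_val`; any useful model should beat that number.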

#### Parameter Tuning
In the cells below, we will perform a basic hyperparameter search to get you started. This is not exhaustive, but may do better than 41% accuracy (we hope!). Scikit-learn has a few options for hyperparameter searches built in. These use k-fold cross-validation. For more info, see [here](https://scikit-learn.org/1.5/api/sklearn.model_selection.html#hyper-parameter-optimizers). However, since we have an explicit validation set created above, we will implement our own simple grid search rather than use the k-fold cross-validation. A random search may also be useful instead of a grid search (see [ParameterSampler](https://scikit-learn.org/1.5/modules/generated/sklearn.model_selection.ParameterSampler.html#sklearn.model_selection.ParameterSampler) to get started on implementing a random search). 

We will attempt to optimise the regularisation type and the regularisation parameter $\lambda$ (i.e. `alpha` in scikit-learn). You can search over other hyperparameters.

---
> Note: Please do not run excessive hyperparameter searches that take more than a few minutes to run, as the server may not be able to handle the load. If your cell takes more than 5 minutes to complete, please interrupt the calculation by clicking Kernel -> Interrupt and retry a more modest hyperparameter search. The number of individual models trained equals the number of options for hyperparameter 1 multiplied by the number of options for hyperparameter 2, and so on (9*3 = 27 models in the cell below). Limit the number of models to fewer than 100. You can start with a coarse hyperparameter search, and then fine-tune it once you have narrowed down the range.
---
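Before launching a search, the number of models it will train can be checked directly, since `ParameterGrid` supports `len`. The sketch below mirrors the grid used in the next cell; `n_models` is just an illustrative variable name.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {
    "penalty": ["l1", "l2", None],
    "alpha": np.logspace(-7, 1, 9),  # 9 values from 1e-7 to 1e1
}

# The grid size is the product of the number of options per hyperparameter.
n_models = len(ParameterGrid(param_grid))
print(n_models)  # 27 (3 penalties x 9 alphas)
```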

In [30]:
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {
    'penalty': ['l1', 'l2', None],
    'alpha' : np.logspace(-7, 1, 9)
}

best_accuracy = 0
for params in ParameterGrid(param_grid):
    LogReg = SGDClassifier(**params, loss='log_loss', learning_rate='constant', eta0=0.1, random_state=42)
    LogReg.fit(X_train, y_train)
    
    # Predict on the validation set
    y_pred = LogReg.predict(X_val)

    # Compute accuracy
    accuracy = accuracy_score(y_val, y_pred)
    
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_model = LogReg
        best_params = params
        
print("Best Param:", best_params, f"validation accuracy: {best_accuracy:.2f}")


Best Param: {'alpha': 0.0001, 'penalty': 'l2'} validation accuracy: 0.70


This improvement in accuracy is a step in the right direction, but you can do better!

## Checking on the test data (hidden labels) and submitting scores.
To make predictions on the test data, you will need to provide a function that prepares the test data in exactly the same way as you did for the training/validation data. The cell below will therefore need to change to reflect the pre-processing you actually did (if you did something different to the simple preprocessing we proposed above). It is vital that you perform exactly the same steps as you did to prepare your data for training. 

In [None]:
def preprocess(df):
    # Mirror the training-time preparation, step by step
    df = df.drop('Cabin', axis=1)
    df = df.dropna(subset=['Embarked'], axis=0)
    df = pd.get_dummies(df, columns=['Sex', 'Embarked'])
    df = df.drop('Ticket', axis=1)
    df = df.drop('Name', axis=1)
    df = df.drop('PassengerId', axis=1)
    # Reuse the mean age computed earlier from the training/validation data
    df['Age'] = df['Age'].fillna(mean_age)
    return df
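One subtle way for the two preparations to diverge is one-hot encoding: `pd.get_dummies` only creates columns for categories present in the dataframe it is given. Whether this happens with the hidden test data is not known, so the sketch below is only a precaution you might consider. It uses made-up toy frames, and the names `train_encoded` and `test_aligned` are hypothetical.

```python
import pandas as pd

# Toy data: the "test" frame has no males and contains a port unseen in "training".
train_raw = pd.DataFrame({"Sex": ["male", "female"], "Embarked": ["S", "C"]})
test_raw = pd.DataFrame({"Sex": ["female", "female"], "Embarked": ["Q", "S"]})

train_encoded = pd.get_dummies(train_raw, columns=["Sex", "Embarked"])
test_encoded = pd.get_dummies(test_raw, columns=["Sex", "Embarked"])

# Force the test frame onto the training columns: categories missing from the test
# data become all-False columns, and categories never seen in training are dropped.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=False)

print(list(train_encoded.columns))
print(list(test_aligned.columns))
```

Inside `preprocess`, the same idea would amount to reindexing onto the columns of the training features, for example `X.columns`.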

You pass the model and the preprocessing function to our testing function via `AERO40041.testTitanicModel(model, preprocessing_function)`. The testing function is hosted on PyPI so can be installed using `pip install AERO40041` if you are not using the MaSC-portal. We are providing this test function rather than the test data directly to remove the risk of data leakage (where you accidentally train your model on the test data due to a subtle bug). Some method of complete segregation/isolation between the test set and training set is generally recommended. 

In [None]:
import AERO40041

accuracy = AERO40041.testTitanicModel(best_model, preprocess)

print(f"Test Accuracy: {accuracy:.3f}")

# Things to try:
- Scikit-learn provides other classification models with the same call sequence of `model.fit` and `model.predict`. This makes it very easy to try other models. You may try a neural network (`from sklearn.neural_network import MLPClassifier`) or a decision tree (`from sklearn.tree import DecisionTreeClassifier`), for instance, or something else!
- Each model will have its own set of hyperparameters. You can learn more about them by running a cell like this `MLPClassifier??` or `DecisionTreeClassifier??`.
- Different choices of feature cleanup or feature engineering are likely to yield significant gains. Play around with this.
- Once you have your best score, share it with a GTA to enter the competition!
- How does your own neural network developed in Lab 1 perform?
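
As a starting point for the first suggestion, the sketch below swaps in a decision tree using the same `fit` and `predict` calls. It runs on synthetic stand-in data so that it is self-contained; `max_depth=3` is an arbitrary illustrative value that should really come from a hyperparameter search, and the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; in the lab, use X_train, X_val, y_train and y_val instead.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=123)

# Same call sequence as the logistic regression model: construct, fit, predict.
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_tr, y_tr)

tree_accuracy = accuracy_score(y_va, tree.predict(X_va))
print(f"Validation accuracy: {tree_accuracy:.2f}")
```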