## Predicting Survival on the Titanic

### History
Perhaps one of the most infamous shipwrecks in history, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 people on board. By analysing the probability of survival based on a few attributes such as gender, age, and social status, we can predict fairly accurately which passengers survived. Some groups, such as women, children, and the upper class, were more likely to survive than others, which tells us something about society's priorities and privileges at the time.

### Assignment:

Build a machine learning pipeline to engineer the features in the data set and predict who is more likely to survive the catastrophe.

Follow the Jupyter notebook below and complete the missing bits of code to achieve each of the pipeline steps.

In [1]:
import re

import feature_engine.encoding
import feature_engine.imputation

import numpy as np

import pandas as pd

import sklearn.base
import sklearn.linear_model
import sklearn.metrics
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing

## Prepare the data set

In [2]:
data = pd.read_csv("https://www.openml.org/data/get_csv/16826755/phpMYEkMl")

In [3]:
data = data.replace("?", np.nan)  # replace question marks with NaN values

In [4]:
def get_first_cabin(row):
    """Retain only the first cabin if more than 1 are available."""
    try:
        return row.split()[0]
    except AttributeError:
        return np.nan
    
data["cabin"] = data["cabin"].apply(get_first_cabin)
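The `try`/`except` above is needed because missing cabins arrive as float NaN, which has no `.split` method. A minimal standalone sketch of that behaviour, using `math.nan` as a stand-in for `np.nan` and illustrative sample values:

```python
import math


def get_first_cabin(row):
    """Retain only the first cabin if more than one is listed."""
    try:
        return row.split()[0]
    except AttributeError:
        # Missing values are floats (NaN), which have no .split method.
        return math.nan  # stand-in for np.nan in this sketch


# Illustrative inputs: a multi-cabin string, a single cabin, a missing value.
samples = ["C22 C26", "E12", math.nan]
first_cabins = [get_first_cabin(value) for value in samples]
print(first_cabins)
```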

In [5]:
def get_title(passenger):
    """Extract the title (Mr, Mrs, etc.) from the name variable."""
    line = passenger
    # check "Mrs" before "Mr": the pattern "Mr" also matches "Mrs"
    if re.search("Mrs", line):
        return "Mrs"
    elif re.search("Mr", line):
        return "Mr"
    elif re.search("Miss", line):
        return "Miss"
    elif re.search("Master", line):
        return "Master"
    else:
        return "Other"
    
data["title"] = data["name"].apply(get_title)
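The order of the checks in `get_title` matters, because the pattern "Mr" is a substring of "Mrs". The sketch below repeats the function and adds a hypothetical variant, `get_title_wrong_order`, that tests "Mr" first, to show the mislabelling this would cause. The sample names are illustrative and follow the "Surname, Title. Given names" format of the `name` column.

```python
import re


def get_title(passenger):
    """Same logic as the notebook cell: 'Mrs' is tested before 'Mr'."""
    if re.search("Mrs", passenger):
        return "Mrs"
    elif re.search("Mr", passenger):
        return "Mr"
    elif re.search("Miss", passenger):
        return "Miss"
    elif re.search("Master", passenger):
        return "Master"
    return "Other"


def get_title_wrong_order(passenger):
    """Hypothetical variant that tests 'Mr' first (for illustration only)."""
    if re.search("Mr", passenger):
        return "Mr"
    elif re.search("Mrs", passenger):
        return "Mrs"  # never reached: 'Mr' already matched
    return "Other"


names = [
    "Allen, Miss. Elisabeth Walton",
    "Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",
    "Allison, Master. Hudson Trevor",
    "Braund, Mr. Owen Harris",
    "Uruchurtu, Don. Manuel E",
]

titles = [get_title(name) for name in names]
print(titles)

# With the checks reversed, a married woman is labelled "Mr".
mislabelled = get_title_wrong_order(names[1])
print(mislabelled)
```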

In [6]:
data["fare"] = data["fare"].astype("float")
data["age"] = data["age"].astype("float")

In [7]:
data.drop(labels=["name", "ticket", "boat", "body", "home.dest"], axis=1, inplace=True)

## Begin Assignment

### Configuration

In [8]:
target = "survived"

In [9]:
NUMERICAL_VARIABLES = ["age", "fare"]

CATEGORICAL_VARIABLES = ["sex", "cabin", "embarked", "title"]

CABIN = ["cabin"]

## Separate data into train and test

Use the code below for reproducibility. Don't change it.

In [10]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    data.drop(target, axis=1), data[target], test_size=0.2, random_state=0
)

X_train.shape, X_test.shape

((1047, 9), (262, 9))

## Preprocessors

### Class to extract the letter from the variable Cabin

In [11]:
class ExtractLetterTransformer(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    """Extract first letter of variable."""
    def __init__(self, variables):
        if not isinstance(variables, list):
            raise ValueError("variables should be a list")
        self.variables = variables
           
    def fit(self, X, y=None):
        """Not used here."""
        return self

    def transform(self, X):
        X = X.copy()
        
        for var in self.variables:
            X[var] = X[var].str[0]
        
        return X

## Pipeline

+ Impute categorical variables with the string "missing"
+ Add a binary missing indicator to numerical variables with missing data
+ Fill NA in the original numerical variables with the median
+ Extract the first letter from cabin
+ Group rare categories
+ Perform one-hot encoding
+ Scale features with a standard scaler
+ Fit a logistic regression
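To make the feature engineering steps above concrete, here is a plain-Python sketch of what the missing indicator, median imputation, rare-label grouping, and k-1 one-hot encoding do to a column. It is an illustration of the ideas only, not the `feature_engine` implementation. The toy values, the `tol = 0.2` threshold (chosen to suit six rows), and the alphabetical category order are assumptions of this sketch.

```python
from collections import Counter
from statistics import median

# Hypothetical toy columns; None marks a missing value.
ages = [22.0, None, 38.0, 26.0, None, 35.0]
cabins = ["C", "Missing", "Missing", "E", "Missing", "C"]

# 1) Binary missing indicator: 1 where the value was missing.
age_na = [1 if a is None else 0 for a in ages]

# 2) Median imputation: fill the gaps with the median of the observed values.
age_median = median([a for a in ages if a is not None])
ages_imputed = [age_median if a is None else a for a in ages]

# 3) Rare-label grouping: labels seen in fewer than `tol` of the rows
#    are replaced by the single label "Rare".
tol = 0.2
freq = Counter(cabins)
cabins_grouped = [c if freq[c] / len(cabins) >= tol else "Rare" for c in cabins]

# 4) One-hot encoding into k-1 columns: drop the last category, which is
#    then represented by a row of zeros.
categories = sorted(set(cabins_grouped))
kept = categories[:-1]
one_hot = [[1 if c == k else 0 for k in kept] for c in cabins_grouped]

print(age_na)
print(ages_imputed)
print(cabins_grouped)
print(kept, one_hot)
```

In the pipeline these steps are learned on the training set in `fit` and then applied unchanged to the test set, so the median and the set of frequent categories come from the training data only.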

In [12]:
# set up the pipeline
titanic_pipe = sklearn.pipeline.Pipeline([

    # ===== IMPUTATION =====
    # impute categorical variables with string 'missing'
    (
        "categorical_imputation",
        feature_engine.imputation.CategoricalImputer(
            imputation_method="missing", variables=CATEGORICAL_VARIABLES
        )
    ),

    # add missing indicator to numerical variables
    (
        "missing_indicator", 
        feature_engine.imputation.AddMissingIndicator(
            variables=NUMERICAL_VARIABLES
        )
    ),

    # impute numerical variables with the median
    (
        "median_imputation", 
        feature_engine.imputation.MeanMedianImputer(
            imputation_method="median", variables=NUMERICAL_VARIABLES
        )
    ),


    # Extract first letter from cabin
    ("extract_letter", ExtractLetterTransformer(CABIN)),


    # == CATEGORICAL ENCODING ======
    # group categories present in less than 5% of the observations (tol=0.05)
    # into a single category called 'Rare'
    (
        "rare_label_encoder", 
        feature_engine.encoding.RareLabelEncoder(
            tol=0.05, n_categories=1, variables=CATEGORICAL_VARIABLES
        )
    ),


    # encode categorical variables using one hot encoding into k-1 variables
    (
        "categorical_encoder", 
        feature_engine.encoding.OneHotEncoder(
            drop_last=True, variables=CATEGORICAL_VARIABLES
        )
    ),

    # scale using standardization
    ("scaler", sklearn.preprocessing.StandardScaler()),

    # logistic regression (use C=0.0005 and random_state=0)
    ("Logit", sklearn.linear_model.LogisticRegression(C=0.0005, random_state=0)),
])

In [13]:
titanic_pipe.fit(X_train, y_train)

## Make predictions and evaluate model performance

Determine:
- roc-auc
- accuracy

**Important: to determine the accuracy you need the predicted class (0 or 1, i.e. did not survive or survived), whereas to determine the roc-auc you need the predicted probability of survival.**
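The difference between the two metrics can be seen in a small pure-Python sketch. The helper functions `accuracy` and `roc_auc` below are hypothetical, written for illustration rather than taken from scikit-learn, and the labels and probabilities are made up. Accuracy thresholds the probabilities at 0.5 first, while ROC-AUC only looks at how well the probabilities rank positives above negatives.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Share of correct class predictions after thresholding the probabilities."""
    y_hat = [1 if p >= threshold else 0 for p in y_prob]
    return sum(int(t == h) for t, h in zip(y_true, y_hat)) / len(y_true)


def roc_auc(y_true, y_prob):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: every probability is below 0.5, so every predicted class
# is 0, yet the positives are mostly scored higher than the negatives.
y_true = [0, 0, 1, 1]
y_prob = [0.10, 0.40, 0.35, 0.45]

acc = accuracy(y_true, y_prob)
auc = roc_auc(y_true, y_prob)
print("accuracy:", acc)
print("roc-auc:", auc)
```

Here the accuracy is 0.5 while the ROC-AUC is 0.75, which shows that a model can rank passengers well and still lose accuracy at a fixed 0.5 cut-off.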

In [14]:
# train set
y_pred = titanic_pipe.predict_proba(X_train)[:,1]

print("train roc-auc: {}".format(sklearn.metrics.roc_auc_score(y_train, y_pred)))
print("train accuracy: {}".format(sklearn.metrics.accuracy_score(y_train, np.round(y_pred))))
print()

# test set
y_pred_test = titanic_pipe.predict_proba(X_test)[:,1]

print("test roc-auc: {}".format(sklearn.metrics.roc_auc_score(y_test, y_pred_test)))
print("test accuracy: {}".format(sklearn.metrics.accuracy_score(y_test, np.round(y_pred_test))))
print()

train roc-auc: 0.8450386398763523
train accuracy: 0.7220630372492837

test roc-auc: 0.8354629629629629
test accuracy: 0.7137404580152672



That's it! Well done!

**Keep this code safe: we will use this notebook in our next assignment to build production code!**