## Predicting Survival on the Titanic

### History
Perhaps one of the most infamous shipwrecks in history, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 people on board. By analysing the probability of survival based on a few attributes such as gender, age, and social status, we can predict fairly accurately which passengers survived. Some groups, such as women, children, and the upper class, were more likely to survive than others, which tells us something about society's priorities and privileges at the time.

### Assignment:

Build a machine learning pipeline to engineer the features in the data set and predict who is more likely to survive the catastrophe.

Follow the Jupyter notebook below and complete the missing bits of code to achieve each of the pipeline steps.

In [2]:
!pip install --force-reinstall -r ../../my-assignment-section-05/requirements/requirements.txt

Collecting numpy<2.0.0,>=1.21.0 (from -r ../../my-assignment-section-05/requirements/requirements.txt (line 4))
  Using cached numpy-1.26.4-cp311-cp311-macosx_10_9_x86_64.whl.metadata (61 kB)
Collecting pandas<2.0.0,>=1.3.5 (from -r ../../my-assignment-section-05/requirements/requirements.txt (line 5))
  Downloading pandas-1.5.3-cp311-cp311-macosx_10_9_x86_64.whl.metadata (11 kB)
Collecting pydantic<2.0.0,>=1.8.1 (from -r ../../my-assignment-section-05/requirements/requirements.txt (line 6))
  Downloading pydantic-1.10.18-cp311-cp311-macosx_10_9_x86_64.whl.metadata (152 kB)
Collecting scikit-learn<2.0.0,>=1.1.3 (from -r ../../my-assignment-section-05/requirements/requirements.txt (line 7))
  Downloading scikit_learn-1.5.1-cp311-cp311-macosx_10_9_x86_64.whl.metadata (12 kB)
Collecting strictyaml<2.0.0,>=1.3.2 (from -r ../../my-assignment-section-05/requirements/requirements.txt (line 8))
  Downloading strictyaml-1.7.3-py3-none-any.whl.metadata (11 kB)
Collecting ruamel.yaml<1.0.0,>=0.16

In [3]:
# !pip install feature_engine
!pip install --upgrade pandas

Collecting pandas
  Using cached pandas-1.3.5-cp37-cp37m-macosx_10_9_x86_64.whl (11.0 MB)
ERROR: osmnx 1.1.2 has requirement matplotlib>=3.4, but you'll have matplotlib 3.3.4 which is incompatible.
ERROR: osmnx 1.1.2 has requirement numpy>=1.21, but you'll have numpy 1.20.1 which is incompatible.
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 1.2.2
    Uninstalling pandas-1.2.2:
      Successfully uninstalled pandas-1.2.2
Successfully installed pandas-1.3.5


In [3]:
import re

# to handle datasets
import pandas as pd
import numpy as np

# for visualization
import matplotlib.pyplot as plt

# to divide train and test set
from sklearn.model_selection import train_test_split

# feature scaling
from sklearn.preprocessing import StandardScaler

# to build the models
from sklearn.linear_model import LogisticRegression

# to evaluate the models
from sklearn.metrics import accuracy_score, roc_auc_score

# to persist the model and the scaler
import joblib

from sklearn.pipeline import Pipeline

# ========== NEW IMPORTS ========
# (relative to notebook 02-Predicting-Survival-Titanic-Solution)

from sklearn.base import TransformerMixin, BaseEstimator

# from feature-engine
from feature_engine.imputation import (
    AddMissingIndicator,
    MeanMedianImputer,
    CategoricalImputer,
)

from feature_engine.encoding import (
    RareLabelEncoder,
    OrdinalEncoder, 
    OneHotEncoder,
)

from feature_engine.transformation import (
    LogTransformer,
    YeoJohnsonTransformer, 
)

from feature_engine.selection import DropFeatures
from feature_engine.wrappers import SklearnTransformerWrapper


## Prepare the data set

In [8]:
# load the data - it is available open source and online

data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')

# display data
data.head()



Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,?,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,?,135,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"


In [9]:
# replace question marks with NaN values

data = data.replace('?', np.nan)
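To make the effect of this replacement concrete, here is a minimal sketch on a toy frame (the `toy` data is invented for illustration and is not part of the Titanic file). It assumes, as in the raw export above, that `'?'` marks a missing value and that the affected columns are therefore read in as strings.

```python
import numpy as np
import pandas as pd

# hypothetical toy frame mimicking the raw export, where '?' marks a missing value
toy = pd.DataFrame({
    "age": ["29", "?", "2"],
    "cabin": ["B5", "?", "C22 C26"],
})

# swap the placeholder for a real missing value
toy = toy.replace("?", np.nan)

# pandas can now count the missing entries per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# the numeric column still holds strings, so cast it before doing arithmetic
toy["age"] = toy["age"].astype("float")
print(toy["age"].mean())
```

This is also why the notebook casts `fare` and `age` to float a few cells later: the `'?'` placeholder forces those columns to be read as strings.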

In [10]:
# retain only the first cabin if more than
# one is listed for a passenger

def get_first_cabin(row):
    try:
        return row.split()[0]
    except (AttributeError, IndexError):
        # missing values are floats (no .split) and blank strings split to []
        return np.nan

data['cabin'] = data['cabin'].apply(get_first_cabin)
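As a possible alternative, the same result can usually be obtained with pandas' vectorised string methods. This is only a sketch on an invented `cabins` series, not the approach the notebook uses:

```python
import numpy as np
import pandas as pd

# hypothetical cabin values: two cabins, one cabin, and a missing entry
cabins = pd.Series(["C22 C26", "B5", np.nan])

# split on whitespace and keep the first token; missing values stay missing
first_cabin = cabins.str.split().str[0]
print(first_cabin)
```

The `.str` accessor propagates missing values on its own, so no `try`/`except` is needed here.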

In [11]:
# extract the title (Mrs, Mr, Miss, Master, or Other) from the name variable

def get_title(passenger):
    line = passenger
    if re.search('Mrs', line):
        return 'Mrs'
    elif re.search('Mr', line):
        return 'Mr'
    elif re.search('Miss', line):
        return 'Miss'
    elif re.search('Master', line):
        return 'Master'
    else:
        return 'Other'
    
data['title'] = data['name'].apply(get_title)
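The order of the checks in `get_title` matters: the pattern `'Mr'` also matches inside `'Mrs'`, so `'Mrs'` has to be tested first. The sketch below is a compact restatement of the same logic (the loop form is my own rewrite, not the notebook's code), applied to names taken from the data preview above.

```python
import re

def get_title(name):
    # 'Mrs' must come before 'Mr' because 'Mr' is a substring of 'Mrs'
    for title in ("Mrs", "Mr", "Miss", "Master"):
        if re.search(title, name):
            return title
    return "Other"

names = [
    "Allen, Miss. Elisabeth Walton",
    "Allison, Master. Hudson Trevor",
    "Allison, Mr. Hudson Joshua Creighton",
    "Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",
]

titles = [get_title(n) for n in names]
print(titles)

# the plain 'Mr' pattern also matches a 'Mrs' name, hence the ordering
print(re.search("Mr", names[3]) is not None)
```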

In [12]:
# cast numerical variables as floats

data['fare'] = data['fare'].astype('float')
data['age'] = data['age'].astype('float')

In [13]:
# drop unnecessary variables

data.drop(labels=['name','ticket', 'boat', 'body','home.dest'], axis=1, inplace=True)

# display data
data.head()



Unnamed: 0,pclass,survived,sex,age,sibsp,parch,fare,cabin,embarked,title
0,1,1,female,29.0,0,0,211.3375,B5,S,Miss
1,1,1,male,0.9167,1,2,151.55,C22,S,Master
2,1,0,female,2.0,1,2,151.55,C22,S,Miss
3,1,0,male,30.0,1,2,151.55,C22,S,Mr
4,1,0,female,25.0,1,2,151.55,C22,S,Mrs


In [8]:
# # save the data set

# data.to_csv('titanic.csv', index=False)

# Begin Assignment

## Configuration

In [14]:
# list of variables to be used in the pipeline's transformers

NUMERICAL_VARIABLES = ['pclass', 'age', 'sibsp', 'parch', 'fare']

CATEGORICAL_VARIABLES = ['sex', 'cabin', 'embarked', 'title']

YEOJOHNSON_VARIABLES=["age", "fare"]

CABIN = ["cabin"]

## Separate data into train and test

In [15]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('survived', axis=1),  # predictors
    data['survived'],  # target
    test_size=0.2,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility

X_train.shape, X_test.shape

((1047, 9), (262, 9))
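The role of `test_size` and `random_state` can be checked on a small example. The sketch below uses an invented 10-row frame and a hypothetical helper `split`. With `test_size=0.2` it yields 8 training and 2 test rows, and repeating the call with the same seed returns the same rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical toy data: one predictor and a binary target
toy = pd.DataFrame({"x": np.arange(10), "survived": [0, 1] * 5})

def split(frame, seed):
    # same call pattern as in the notebook, applied to the toy frame
    return train_test_split(
        frame.drop("survived", axis=1),
        frame["survived"],
        test_size=0.2,
        random_state=seed,
    )

X_tr_a, X_te_a, y_tr_a, y_te_a = split(toy, seed=0)
X_tr_b, X_te_b, y_tr_b, y_te_b = split(toy, seed=0)

print(X_tr_a.shape, X_te_a.shape)
# identical seeds give identical test rows
print(X_te_a.index.equals(X_te_b.index))
```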

## Preprocessors

### Class to extract the letter from the variable Cabin

In [16]:
class ExtractLetterTransformer(BaseEstimator, TransformerMixin):
    # Extract the first letter of the cabin variable

    def __init__(self, cabin_col, fill_value="Missing"):
        self.cabin_col = cabin_col
        self.fill_value = fill_value

    def fit(self, X, y=None):
        # nothing to learn; present for scikit-learn compatibility
        return self

    def transform(self, X):
        X = X.copy()
        # keep the fill value intact; otherwise keep only the first character
        X[self.cabin_col] = X[self.cabin_col].apply(
            lambda x: self.fill_value
            if pd.isna(x) or x == self.fill_value
            else x[:1]
        )
        return X
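To see what this transformation does to a column, here is a vectorised sketch on an invented `cabin` series. It assumes the missing values have already been replaced by the string `"Missing"`, which is the job of the categorical imputer earlier in the pipeline.

```python
import pandas as pd

# hypothetical cabin column after imputation with the string "Missing"
cabin = pd.Series(["B5", "C22", "Missing", "E12"])

# keep "Missing" as is; everywhere else keep only the first character (the deck letter)
letters = cabin.where(cabin == "Missing", cabin.str[0])
print(letters.tolist())
```

Reducing the cabin to its letter cuts the number of distinct categories sharply, which keeps the later one-hot encoding manageable.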


## Pipeline

- Impute categorical variables with the string "Missing"
- Add a binary missing indicator to numerical variables with missing data
- Fill NA in the original numerical variables with the median
- Extract the first letter from cabin
- Group rare categories
- Perform one-hot encoding
- Apply the Yeo-Johnson transformation to age and fare
- Scale age and fare with the standard scaler
- Fit a logistic regression
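The "group rare categories" step can be approximated in plain pandas, as sketched below. The helper `group_rare_labels` and the toy `titles` series are hypothetical. Unlike the encoder in the pipeline, which learns the frequent categories on the training set and reuses them at transform time, this helper computes the frequencies on whatever series it is given.

```python
import pandas as pd

def group_rare_labels(series, tol=0.05, rare_label="Rare"):
    # relative frequency of each category
    freq = series.value_counts(normalize=True)
    # categories that reach the tolerance are kept as they are
    frequent = freq[freq >= tol].index
    # everything else is collapsed into a single label
    return series.where(series.isin(frequent), rare_label)

# hypothetical title column with two infrequent categories
titles = pd.Series(
    ["Mr"] * 60 + ["Miss"] * 25 + ["Mrs"] * 12 + ["Master"] * 2 + ["Other"] * 1
)

grouped = group_rare_labels(titles, tol=0.05)
print(grouped.value_counts())
```

Here `Master` (2%) and `Other` (1%) fall below the 5% tolerance and are merged into `Rare`.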

In [17]:
# set up the pipeline
titanic_pipe = Pipeline([

    # ===== IMPUTATION =====
    # impute categorical variables with string 'missing'
    ('categorical_imputation', CategoricalImputer(imputation_method="missing",
                                                  fill_value="Missing",
                                                  variables=CATEGORICAL_VARIABLES)),

    # add missing indicator to numerical variables
    ('missing_indicator', AddMissingIndicator(variables=NUMERICAL_VARIABLES)),

    # impute numerical variables with the median
    ('median_imputation', MeanMedianImputer(imputation_method="median",
                                            variables=NUMERICAL_VARIABLES)),


    # Extract first letter from cabin
    ('extract_letter', ExtractLetterTransformer(CABIN[0],fill_value="Missing")),


    # ===== CATEGORICAL ENCODING =====
    # group categories present in less than 5% of the observations (0.05)
    # into one category called 'Rare'
    ('rare_label_encoder', RareLabelEncoder(tol=0.05,
                                            n_categories=1,
                                            variables=CATEGORICAL_VARIABLES)),


    # encode categorical variables using one hot encoding into k-1 variables
    ('categorical_encoder', OneHotEncoder(drop_last=True,
                                          variables=CATEGORICAL_VARIABLES)),

    # ===== VARIABLE TRANSFORMATION =====
    # apply the Yeo-Johnson transformation to age and fare
    ('yeojohnson', YeoJohnsonTransformer(variables=YEOJOHNSON_VARIABLES)),

    # scale the transformed variables using standardization
    ('scaler', SklearnTransformerWrapper(transformer=StandardScaler(),
                                         variables=YEOJOHNSON_VARIABLES)),

    # logistic regression with default settings
    # (the assignment suggests C=0.0005 and random_state=0)
    ('Logit', LogisticRegression()),
])
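The two numerical imputation steps can be illustrated in plain pandas. This is a rough sketch, not the feature-engine implementation. The helper `add_indicator_and_impute`, the `_na` column suffix, and the toy frames are illustrative choices. The point it tries to show is that the median is learned on the training data and then reused on the test data.

```python
import numpy as np
import pandas as pd

# hypothetical train and test frames with missing ages
train = pd.DataFrame({"age": [22.0, np.nan, 30.0, 40.0]})
test = pd.DataFrame({"age": [np.nan, 50.0]})

# learn the median on the training set only (missing values are skipped)
median_age = train["age"].median()

def add_indicator_and_impute(frame, column, fill):
    out = frame.copy()
    # flag the rows that were missing before imputation
    out[column + "_na"] = out[column].isna().astype(int)
    # fill the gaps with the value learned on the training set
    out[column] = out[column].fillna(fill)
    return out

train_out = add_indicator_and_impute(train, "age", median_age)
test_out = add_indicator_and_impute(test, "age", median_age)
print(test_out)
```

Adding the indicator before imputing matters: once the gaps are filled with the median, the information that a value was missing would otherwise be lost.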

In [18]:
# train the pipeline

titanic_pipe.fit(X_train,y_train)

## Make predictions and evaluate model performance

Determine:
- roc-auc
- accuracy

**Important: to determine the accuracy you need the predicted class (0 or 1, i.e. did not survive or survived), whereas to determine the roc-auc you need the predicted probability of survival.**
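The difference can be seen on a tiny hand-made example (the labels and probabilities below are invented for illustration). Accuracy is computed from hard class labels, while roc-auc measures how well the probabilities rank positives above negatives, so feeding it hard labels throws away information.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# hypothetical true outcomes and predicted probabilities of class 1
y_true = [0, 0, 1, 1]
proba = [0.1, 0.6, 0.4, 0.9]

# hard class labels obtained with a 0.5 threshold
labels = [int(p >= 0.5) for p in proba]

# accuracy needs the class labels
accuracy = accuracy_score(y_true, labels)

# roc-auc needs the probabilities (a ranking score)
auc_from_proba = roc_auc_score(y_true, proba)

# passing hard labels collapses the ranking and changes the result
auc_from_labels = roc_auc_score(y_true, labels)

print(accuracy, auc_from_proba, auc_from_labels)
```

In this example three of the four positive/negative pairs are ranked correctly by the probabilities (roc-auc 0.75), whereas the thresholded labels give only 0.5.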

In [13]:
# make predictions for train set
class_ = titanic_pipe.predict(X_train)
pred = titanic_pipe.predict_proba(X_train)[:,1]

# determine roc-auc and accuracy
print('train roc-auc: {}'.format(roc_auc_score(y_train, pred)))
print('train accuracy: {}'.format(accuracy_score(y_train, class_)))
print()

# make predictions for test set
class_ = titanic_pipe.predict(X_test)
pred = titanic_pipe.predict_proba(X_test)[:,1]


# determine roc-auc and accuracy
print('test roc-auc: {}'.format(roc_auc_score(y_test, pred)))
print('test accuracy: {}'.format(accuracy_score(y_test, class_)))
print()

train roc-auc: 0.8597681607418856
train accuracy: 0.8147086914995224

test roc-auc: 0.8586111111111111
test accuracy: 0.8015267175572519



That's it! Well done.

**Keep this code safe, as we will use this notebook later on to build production code in our next assignment!**

In [None]:
Pipeline(titanic_pipe.steps[:-1]).transform(X_train)[["embarked_S","embarked_C","embarked_Q"]]
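The last cell rebuilds a pipeline from every step except the final estimator, so that the transformed features can be inspected. Below is a minimal sketch of the same idea on invented toy data, using only scikit-learn and its pipeline slicing (`pipe[:-1]`) instead of the `Pipeline(steps[:-1])` form used in the cell.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# hypothetical toy data: one feature, binary target
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logit", LogisticRegression()),
])
pipe.fit(X, y)

# every step except the final estimator; the steps are already fitted
preprocessing = pipe[:-1]

# the features exactly as the logistic regression receives them
X_scaled = preprocessing.transform(X)
print(X_scaled.ravel())
```

Looking at the preprocessed features this way is a quick check that each transformer produced the columns and scales you expect before the model sees them.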