# Introduction

### Competition: [Titanic Kaggle](https://www.kaggle.com/c/titanic/overview)

This notebook contains a simple data science project framework, intended for learning and portfolio building.

# Libs

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
from sklearn.pipeline import Pipeline

import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras.layers import Input, Dense, BatchNormalization, Dropout, Embedding,  Flatten
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping
from tensorflow.keras.optimizers import RMSprop

from tensorflow.data import Dataset
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import QuantileTransformer,  KBinsDiscretizer, StandardScaler
from tensorflow import keras
from sklearn import metrics
from sklearn.impute import SimpleImputer

from sklearn.model_selection import GridSearchCV, cross_val_score

# Load Dataset

In this step we simply load the data into our working environment. Because we are not dealing with live data, plain `pd.read_csv` is enough.

In [2]:
%%time

train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

Wall time: 14 ms


# Preprocessing

In [3]:
%%time

train['Survived'] = train['Survived'].astype(str)

train['n_missing'] = train.isna().sum(axis=1)
test['n_missing'] = test.isna().sum(axis=1)

# Treat Pclass as a categorical label in both sets.
train['Pclass'] = train['Pclass'].astype(str)
test['Pclass'] = test['Pclass'].astype(str)

features = [col for col in train.columns if col not in ['Survived', 'PassengerId']]

Wall time: 5.99 ms


### *Name* Column 

In [4]:
print(len(train['Name'].unique()))
print(train['Name'].unique()[0:5])

891
['Braund, Mr. Owen Harris'
 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)'
 'Heikkinen, Miss. Laina' 'Futrelle, Mrs. Jacques Heath (Lily May Peel)'
 'Allen, Mr. William Henry']


**As it stands, the *Name* column can't be used in our models.** Every passenger has a unique name, so the raw value carries no generalizable information about our variable of interest (*Survived*).

One thing we can see in this column is the presence of titles. **It is plausible that survival rates differ between titles.**
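As a quick way to check that hypothesis, the sketch below extracts the title with a regular expression and computes the survival rate per title. The toy rows and their `Survived` values are made up for illustration; on the real data the same two lines would run against `train` before the *Name* column is overwritten.

```python
import pandas as pd

# Toy rows in the same "Surname, Title. Given names" format as the Name column.
# The Survived values are invented purely for illustration.
toy = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
        "Allen, Mr. William Henry",
    ],
    "Survived": [0, 1, 1, 0],
})

# Capture the text between the first ", " and the following "."
toy["Title"] = toy["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# The mean of a 0/1 column per group is the survival rate for that title.
rate_by_title = toy.groupby("Title")["Survived"].mean()
print(rate_by_title)
```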

In [5]:
name_and_title = [name.split(", ")[1] for name in train['Name']]
title = [name.split(".")[0] for name in name_and_title]
print(len(title))

891


In [6]:
print(len(np.unique(title)))
np.unique(title)

17


array(['Capt', 'Col', 'Don', 'Dr', 'Jonkheer', 'Lady', 'Major', 'Master',
       'Miss', 'Mlle', 'Mme', 'Mr', 'Mrs', 'Ms', 'Rev', 'Sir',
       'the Countess'], dtype='<U12')

In [7]:
train['Name'] = title
test['Name'] = [name.split(".")[0] for name in [name.split(", ")[1] for name in test['Name']]] 

In [8]:
train = pd.concat([train, pd.get_dummies(train['Name']).filter(['Miss', 'Mr', 'Mrs', 'Ms'])], axis = 1)
train.drop('Name', axis = 1, inplace = True)

test = pd.concat([test, pd.get_dummies(test['Name']).filter(['Miss', 'Mr', 'Mrs', 'Ms'])], axis = 1)
test.drop('Name', axis = 1, inplace = True)

### Dealing with the Ticket feature

In [9]:
train['Ticket'][0:5]

0           A/5 21171
1            PC 17599
2    STON/O2. 3101282
3              113803
4              373450
Name: Ticket, dtype: object

**One hypothesis we can make** is that the ticket number carries no relevant information, while the prefix may.

In [10]:
ticket_prefixes = [ticket.split()[0] for ticket in train['Ticket']]
ticket_prefixes[0:5]

['A/5', 'PC', 'STON/O2.', '113803', '373450']

In [11]:
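for i in range(len(ticket_prefixes)):
    try: 
        # Tickets without a prefix start with the number itself.
        int(ticket_prefixes[i])
        ticket_prefixes[i] = "number_only"
    
    except ValueError:
        # Not a number, so keep the prefix as is.
        pass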

In [12]:
ticket_prefixes[0:5]

['A/5', 'PC', 'STON/O2.', 'number_only', 'number_only']

In [13]:
print(len(np.unique(ticket_prefixes)))
np.unique(ticket_prefixes)

44


array(['A./5.', 'A.5.', 'A/4', 'A/4.', 'A/5', 'A/5.', 'A/S', 'A4.', 'C',
       'C.A.', 'C.A./SOTON', 'CA', 'CA.', 'F.C.', 'F.C.C.', 'Fa', 'LINE',
       'P/PP', 'PC', 'PP', 'S.C./A.4.', 'S.C./PARIS', 'S.O./P.P.',
       'S.O.C.', 'S.O.P.', 'S.P.', 'S.W./PP', 'SC', 'SC/AH', 'SC/PARIS',
       'SC/Paris', 'SCO/W', 'SO/C', 'SOTON/O.Q.', 'SOTON/O2', 'SOTON/OQ',
       'STON/O', 'STON/O2.', 'SW/PP', 'W./C.', 'W.E.P.', 'W/C', 'WE/P',
       'number_only'], dtype='<U11')

In [14]:
ticket_prefixes = [s.replace(".", "") for s in ticket_prefixes]
ticket_prefixes = [s.replace(",", "") for s in ticket_prefixes]
ticket_prefixes = [s.upper() for s in ticket_prefixes]

In [15]:
print(len(np.unique(ticket_prefixes)))
np.unique(ticket_prefixes)

34


array(['A/4', 'A/5', 'A/S', 'A4', 'A5', 'C', 'CA', 'CA/SOTON', 'FA', 'FC',
       'FCC', 'LINE', 'NUMBER_ONLY', 'P/PP', 'PC', 'PP', 'SC', 'SC/A4',
       'SC/AH', 'SC/PARIS', 'SCO/W', 'SO/C', 'SO/PP', 'SOC', 'SOP',
       'SOTON/O2', 'SOTON/OQ', 'SP', 'STON/O', 'STON/O2', 'SW/PP', 'W/C',
       'WE/P', 'WEP'], dtype='<U11')

In [16]:
test_ticket_prefixes = [ticket.split()[0] for ticket in test['Ticket']]
for i in range(len(test_ticket_prefixes)):
    try: 
        # Same rule as for train: purely numeric tickets get a common label.
        int(test_ticket_prefixes[i])
        test_ticket_prefixes[i] = "number_only"
    
    except ValueError:
        pass

test_ticket_prefixes = [s.replace(".", "") for s in test_ticket_prefixes]
test_ticket_prefixes = [s.replace(",", "") for s in test_ticket_prefixes]
test_ticket_prefixes = [s.upper() for s in test_ticket_prefixes]

In [17]:
train['Ticket'] = ticket_prefixes
test['Ticket'] = test_ticket_prefixes

In [18]:
train = pd.concat([train, pd.get_dummies(train['Ticket']).filter(['PC', 'CA', 'NUMBER_ONLY'])], axis = 1)
train.drop('Ticket', axis = 1, inplace = True)

test = pd.concat([test, pd.get_dummies(test['Ticket']).filter(['PC', 'CA', 'NUMBER_ONLY'])], axis = 1)
test.drop('Ticket', axis = 1, inplace = True)
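The prefix cleaning above is written twice, once for train and once for test. A possible way to keep both in sync is a single helper applied with `Series.map`. This is only a sketch: the function name `ticket_prefix` is my own, and it uses `str.isdigit()` instead of the `int()` check, which behaves the same for plain ticket numbers like the ones shown here.

```python
import pandas as pd

def ticket_prefix(ticket: str) -> str:
    """Return a normalized ticket prefix, or NUMBER_ONLY for purely numeric tickets."""
    first = ticket.split()[0]
    if first.isdigit():
        return "NUMBER_ONLY"
    # Drop punctuation and unify case so that 'SC/Paris' and 'SC/PARIS' collapse.
    return first.replace(".", "").replace(",", "").upper()

tickets = pd.Series(["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "SC/Paris 2123"])
prefixes = tickets.map(ticket_prefix)
print(prefixes.tolist())
```

The same call could then be used for both frames, for example `train['Ticket'].map(ticket_prefix)` and `test['Ticket'].map(ticket_prefix)`.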

### Dealing with the Cabin feature

As with the *Ticket* feature, I will assume that the cabin number carries no relevant information and keep only the leading letter.

In [19]:
train['Cabin'].unique()

array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C62 C64',

In [20]:
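cabin_prefix = []
for i in range(len(train['Cabin'])):
    try:
        # Keep only the first character of the cabin string.
        cabin_prefix.append(train['Cabin'][i][0])
    
    except TypeError:
        # Missing cabins are float NaN and cannot be indexed; keep them as NaN.
        cabin_prefix.append(train['Cabin'][i])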

In [21]:
np.unique(cabin_prefix)

array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'T', 'nan'], dtype='<U32')

In [22]:
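cabin_test_prefix = []
for i in range(len(test['Cabin'])):
    try:
        # Keep only the first character of the cabin string.
        cabin_test_prefix.append(test['Cabin'][i][0])
    
    except TypeError:
        # Missing cabins are float NaN and cannot be indexed; keep them as NaN.
        cabin_test_prefix.append(test['Cabin'][i])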

In [23]:
train['Cabin'] = cabin_prefix
test['Cabin'] = cabin_test_prefix

In [24]:
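# Note: get_dummies skips missing values by default, so no 'NaN' column exists and
# filter keeps only 'B' and 'C' (the same applies to test below).
train = pd.concat([train, pd.get_dummies(train['Cabin']).filter(['NaN', 'B', 'C'])], axis = 1)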
train.drop('Cabin', axis = 1, inplace = True)

test = pd.concat([test, pd.get_dummies(test['Cabin']).filter(['NaN', 'B', 'C'])], axis = 1)
test.drop('Cabin', axis = 1, inplace = True)
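If a "cabin is missing" indicator is wanted as a feature, one option is to give missing values an explicit label before encoding. The sketch below shows the difference on toy data; the label `"Missing"` is my own choice and not something used elsewhere in this notebook.

```python
import pandas as pd

cabins = pd.Series(["C85", None, "B57 B59 B63 B66", "F G73", None])

# .str[0] keeps the first character and leaves missing values as NaN.
deck = cabins.str[0]

# By default get_dummies drops missing values, so no NaN column is produced.
default_dummies = pd.get_dummies(deck)

# Filling missing values with an explicit label keeps them as their own category.
labeled_dummies = pd.get_dummies(deck.fillna("Missing"))

print(list(default_dummies.columns))
print(list(labeled_dummies.columns))
```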

## Pclass, Sex and Embarked variables

In [25]:
train = pd.get_dummies(train, columns = ['Pclass', 'Sex', 'Embarked'])
test = pd.get_dummies(test, columns = ['Pclass', 'Sex', 'Embarked'])
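Because train and test are encoded separately, a category that appears in only one of them would produce mismatched columns. A small guard, sketched here on toy frames (the names `train_toy` and `test_toy` are illustrative), is to reindex the test frame to the training columns.

```python
import pandas as pd

train_toy = pd.DataFrame({"Embarked": ["S", "C", "Q"]})
test_toy = pd.DataFrame({"Embarked": ["S", "S", "C"]})  # no "Q" in this toy test set

train_d = pd.get_dummies(train_toy, columns=["Embarked"])
test_d = pd.get_dummies(test_toy, columns=["Embarked"])

# Align test to the training columns; categories absent from test become all-zero columns.
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(test_d.columns))
```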

## Imputer and Scaler

In [26]:
%%time

features = [col for col in train.columns if col not in ['Survived', 'PassengerId']]
numerical_features = [col for col in features if col in ['Age', 'SibSp', 'Parch', 'Fare', 'n_missing']]

pipe = Pipeline([
        ('imputer', SimpleImputer(strategy='mean',missing_values=np.nan)),
        ("scaler", StandardScaler())
        ])

train[numerical_features] = pipe.fit_transform(train[numerical_features])
test[numerical_features] = pipe.transform(test[numerical_features])

Wall time: 12 ms
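The pipeline is fitted on train only and then reused on test, so the imputation mean and the scaling statistics both come from the training data. A toy sketch of that behavior, with made-up `Age` values:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

train_toy = pd.DataFrame({"Age": [20.0, np.nan, 40.0]})
test_toy = pd.DataFrame({"Age": [np.nan, 50.0]})

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# fit_transform learns the mean (30.0) and the scale from train only.
train_scaled = pipe.fit_transform(train_toy)

# transform reuses those statistics: the missing test value is filled with
# the train mean and therefore maps to 0 after scaling.
test_scaled = pipe.transform(test_toy)

print(train_scaled.ravel())
print(test_scaled.ravel())
```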


# Base Models

## Light GBM

In [101]:
import lightgbm as lgb
import optuna

def objective(trial):
    
    param = {
        'objective': 'binary',
        'metric': 'binary_logloss',
        'verbosity': -1,
        'boosting_type': 'gbdt',
        'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 10.0, log=True),
        'lambda_l2': trial.suggest_float('lambda_l2', 1e-8, 10.0, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 2, 32),
        'feature_fraction': trial.suggest_float('feature_fraction', 0.4, 1.0),
        'bagging_fraction': trial.suggest_float('bagging_fraction', 0.4, 1.0),
        'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
    }
    
    # 5-fold cross-validation; the fixed seed keeps the folds identical across trials.
    kf = KFold(5, shuffle = True, random_state = 0)
    y = train['Survived'].astype(int)  # make sure the labels are numeric 0/1
    
    accuracy_scores = []
    
    for train_ix, test_ix in kf.split(train):
        dtrain = lgb.Dataset(train[features].iloc[train_ix,:], label = y.iloc[train_ix])
        
        gbm = lgb.train(param, dtrain)
        # predict() returns probabilities for the binary objective; round to 0/1.
        preds = np.rint(gbm.predict(train[features].iloc[test_ix]))
        
        accuracy_scores.append(metrics.accuracy_score(y.iloc[test_ix], preds))
        
    return np.mean(accuracy_scores)
    

# Create a study object and maximize the mean cross-validated accuracy.
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=1000)

[I 2021-12-20 20:56:35,533] A new study created in memory with name: no-name-e443085c-99de-4852-abc2-66bf999e67cb
[I 2021-12-20 20:56:35,704] Trial 0 finished with value: 0.8047078023978408 and parameters: {'lambda_l1': 3.1701460116821144e-06, 'lambda_l2': 0.00010970626361252692, 'num_leaves': 27, 'feature_fraction': 0.4896625492328683, 'bagging_fraction': 0.46106148942292696, 'bagging_freq': 2, 'min_child_samples': 57}. Best is trial 0 with value: 0.8047078023978408.
[I 2021-12-20 20:56:35,904] Trial 1 finished with value: 0.819295712761283 and parameters: {'lambda_l1': 1.99301573747084e-07, 'lambda_l2': 2.652645114970662e-05, 'num_leaves': 27, 'feature_fraction': 0.4685845858904837, 'bagging_fraction': 0.6878932252441168, 'bagging_freq': 6, 'min_child_samples': 49}. Best is trial 1 with value: 0.819295712761283.
[I 2021-12-20 20:56:36,075] Trial 2 finished with value: 0.811461929571276 and parameters: {'lambda_l1': 4.983195558115078, 'l

[I 2021-12-20 20:56:41,411] Trial 23 finished with value: 0.8350009415604795 and parameters: {'lambda_l1': 4.234077050498874e-05, 'lambda_l2': 9.203183679390227, 'num_leaves': 19, 'feature_fraction': 0.9508529578575722, 'bagging_fraction': 0.9228883373368187, 'bagging_freq': 3, 'min_child_samples': 25}. Best is trial 22 with value: 0.8350134957002071.
[I 2021-12-20 20:56:41,779] Trial 24 finished with value: 0.8181595631159375 and parameters: {'lambda_l1': 9.645508239046713e-05, 'lambda_l2': 0.13283834142165957, 'num_leaves': 19, 'feature_fraction': 0.9340310804541807, 'bagging_fraction': 0.7353266273016937, 'bagging_freq': 2, 'min_child_samples': 16}. Best is trial 22 with value: 0.8350134957002071.
[I 2021-12-20 20:56:42,094] Trial 25 finished with value: 0.8226539451384095 and parameters: {'lambda_l1': 0.020383766019438067, 'lambda_l2': 0.019323407426536206, 'num_leaves': 17, 'feature_fraction': 0.9263238810117319, 'bagging_fraction': 0.85226190748

[I 2021-12-20 20:56:48,449] Trial 46 finished with value: 0.8260121775155358 and parameters: {'lambda_l1': 0.4508959965760269, 'lambda_l2': 3.0150690618847142, 'num_leaves': 20, 'feature_fraction': 0.8646363810786181, 'bagging_fraction': 0.9635717381521471, 'bagging_freq': 2, 'min_child_samples': 19}. Best is trial 45 with value: 0.8361433682756889.
[I 2021-12-20 20:56:48,726] Trial 47 finished with value: 0.8294017952419811 and parameters: {'lambda_l1': 7.324830701872487e-06, 'lambda_l2': 0.10881148617801857, 'num_leaves': 18, 'feature_fraction': 0.8249939578380058, 'bagging_fraction': 0.5844907348184092, 'bagging_freq': 1, 'min_child_samples': 36}. Best is trial 45 with value: 0.8361433682756889.
[I 2021-12-20 20:56:49,087] Trial 48 finished with value: 0.8282593685267716 and parameters: {'lambda_l1': 0.00032242167040169063, 'lambda_l2': 1.0417332826635384, 'num_leaves': 16, 'feature_fraction': 0.9037573393433203, 'bagging_fraction': 0.9272864371029

[I 2021-12-20 20:56:55,418] Trial 69 finished with value: 0.8293955181721172 and parameters: {'lambda_l1': 0.015426443618766794, 'lambda_l2': 0.7466164595121573, 'num_leaves': 26, 'feature_fraction': 0.8492295371346668, 'bagging_fraction': 0.9602580337188541, 'bagging_freq': 2, 'min_child_samples': 31}. Best is trial 45 with value: 0.8361433682756889.
[I 2021-12-20 20:56:55,630] Trial 70 finished with value: 0.8081162513338773 and parameters: {'lambda_l1': 9.860673523493556, 'lambda_l2': 0.00021543440086082462, 'num_leaves': 21, 'feature_fraction': 0.8757399124654687, 'bagging_fraction': 0.8271246377960115, 'bagging_freq': 3, 'min_child_samples': 9}. Best is trial 45 with value: 0.8361433682756889.
[I 2021-12-20 20:56:55,918] Trial 71 finished with value: 0.8327600276191074 and parameters: {'lambda_l1': 1.3118229031165072, 'lambda_l2': 6.01387849279297e-05, 'num_leaves': 23, 'feature_fraction': 0.9194403660778248, 'bagging_fraction': 0.922632559222382

[I 2021-12-20 20:57:01,903] Trial 92 finished with value: 0.8338836231247255 and parameters: {'lambda_l1': 0.006667839201038783, 'lambda_l2': 1.3228896638444816, 'num_leaves': 7, 'feature_fraction': 0.8644664129706086, 'bagging_fraction': 0.774274128494688, 'bagging_freq': 2, 'min_child_samples': 31}. Best is trial 45 with value: 0.8361433682756889.
[I 2021-12-20 20:57:02,155] Trial 93 finished with value: 0.827154604230745 and parameters: {'lambda_l1': 0.003031513459399014, 'lambda_l2': 3.1361448242382823, 'num_leaves': 7, 'feature_fraction': 0.868599084891322, 'bagging_fraction': 0.7799593608058623, 'bagging_freq': 6, 'min_child_samples': 28}. Best is trial 45 with value: 0.8361433682756889.
[I 2021-12-20 20:57:02,390] Trial 94 finished with value: 0.8294017952419811 and parameters: {'lambda_l1': 0.012093489008145272, 'lambda_l2': 2.0620804646086617, 'num_leaves': 6, 'feature_fraction': 0.8147070384449225, 'bagging_fraction': 0.6848694351829142, 'ba

[I 2021-12-20 20:57:07,718] Trial 115 finished with value: 0.8282656455966355 and parameters: {'lambda_l1': 5.6467296226804035e-06, 'lambda_l2': 0.8517671418756679, 'num_leaves': 9, 'feature_fraction': 0.8352612023360108, 'bagging_fraction': 0.9872847612592144, 'bagging_freq': 2, 'min_child_samples': 35}. Best is trial 45 with value: 0.8361433682756889.
[I 2021-12-20 20:57:07,988] Trial 116 finished with value: 0.8316364321134895 and parameters: {'lambda_l1': 2.0053443400612985e-05, 'lambda_l2': 1.9535268873387677, 'num_leaves': 8, 'feature_fraction': 0.9629478122492122, 'bagging_fraction': 0.7462059428768427, 'bagging_freq': 2, 'min_child_samples': 34}. Best is trial 45 with value: 0.8361433682756889.
[I 2021-12-20 20:57:08,185] Trial 117 finished with value: 0.8137028435126483 and parameters: {'lambda_l1': 0.01870466476286103, 'lambda_l2': 7.218409383924657, 'num_leaves': 3, 'feature_fraction': 0.908353780938965, 'bagging_fraction': 0.97566576434641

KeyboardInterrupt: 

In [94]:
study.best_params
'''
{'lambda_l1': 5.5693205859882666e-08,
 'lambda_l2': 0.0029379573632802307,
 'num_leaves': 218,
 'feature_fraction': 0.4449630393801182,
 'bagging_fraction': 0.6190711470746258,
 'bagging_freq': 1,
 'min_child_samples': 24}
'''

"\n{'lambda_l1': 5.5693205859882666e-08,\n 'lambda_l2': 0.0029379573632802307,\n 'num_leaves': 218,\n 'feature_fraction': 0.4449630393801182,\n 'bagging_fraction': 0.6190711470746258,\n 'bagging_freq': 1,\n 'min_child_samples': 24}\n"

In [95]:
gbm = lgb.train(study.best_params, lgb.Dataset(train[features], label=train['Survived']))
preds = gbm.predict(test[features])

You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 251
[LightGBM] [Info] Number of data points in the train set: 891, number of used features: 21
[LightGBM] [Info] Start training from score 0.383838
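One thing to keep in mind here: `study.best_params` contains only the values suggested through `trial.suggest_*`, so fixed entries from the objective function such as `'objective': 'binary'` are not part of it. A possible way to train the final model with the same settings as during tuning is to merge both dictionaries first. The sketch below shows only the merge and the thresholding step; `tuned_params` and `probs` are illustrative stand-ins, not results from this notebook.

```python
import numpy as np

# Fixed settings copied from the objective function above.
fixed_params = {
    "objective": "binary",
    "metric": "binary_logloss",
    "verbosity": -1,
    "boosting_type": "gbdt",
}

# Stand-in for study.best_params (values are illustrative, not from a real run).
tuned_params = {"num_leaves": 19, "bagging_freq": 3}

# Later keys win on conflict, so tuned values override fixed ones if both exist.
final_params = {**fixed_params, **tuned_params}

# With a binary objective, predict() returns probabilities in [0, 1];
# these numbers are hypothetical.
probs = np.array([0.12, 0.81, 0.50, 0.49])
labels = (probs >= 0.5).astype(int)

print(final_params)
print(labels)
```

`final_params` could then be passed to `lgb.train` in place of `study.best_params`.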


In [96]:
submission = pd.read_csv('data/submission.csv')

In [97]:
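# Round predictions to 0/1 and store them as integers, as expected in the submission file.
submission['Survived'] = np.abs(np.rint(preds)).astype(int)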

In [98]:
submission.to_csv("data/submission.csv", index = False)