# Introduction

### Competition: [Titanic Kaggle](https://www.kaggle.com/c/titanic/overview)

This notebook contains a simple data science project framework, built for learning and portfolio purposes.

# Libs

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
from sklearn.pipeline import Pipeline

import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras.layers import Input, Dense, BatchNormalization, Dropout, Embedding,  Flatten
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping
from tensorflow.keras.optimizers import RMSprop

from tensorflow.data import Dataset
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import QuantileTransformer,  KBinsDiscretizer, StandardScaler
from tensorflow import keras
from sklearn import metrics
from sklearn.impute import SimpleImputer

from sklearn.model_selection import GridSearchCV, cross_val_score

# Load Dataset

In this step we simply load the data into our working environment. Because we are not dealing with live data, plain pandas is enough.

In [2]:
%%time

train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

Wall time: 14 ms


# Preprocessing

In [3]:
%%time

train['Survived'] = train['Survived'].astype(str)

train['n_missing'] = train.isna().sum(axis=1)
test['n_missing'] = test.isna().sum(axis=1)

train['Pclass'] = train['Pclass'].astype(str)
test['Pclass'] = test['Pclass'].astype(str)

features = [col for col in train.columns if col not in ['Survived', 'PassengerId']]

Wall time: 5.99 ms


### *Name* Column 

In [4]:
print(len(train['Name'].unique()))
print(train['Name'].unique()[0:5])

891
['Braund, Mr. Owen Harris'
 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)'
 'Heikkinen, Miss. Laina' 'Futrelle, Mrs. Jacques Heath (Lily May Peel)'
 'Allen, Mr. William Henry']


**As it stands, the *Name* column can't be used in our models.** Every passenger has a unique name (891 distinct values in 891 rows), so the raw string carries no information that generalizes to our variable of interest (*Survived*).

The names do, however, contain titles. **It is reasonable to expect survival rates to differ between titles.**

In [5]:
name_and_title = [name.split(", ")[1] for name in train['Name']]
title = [name.split(".")[0] for name in name_and_title]
print(len(title))

891


In [6]:
print(len(np.unique(title)))
np.unique(title)

17


array(['Capt', 'Col', 'Don', 'Dr', 'Jonkheer', 'Lady', 'Major', 'Master',
       'Miss', 'Mlle', 'Mme', 'Mr', 'Mrs', 'Ms', 'Rev', 'Sir',
       'the Countess'], dtype='<U12')
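
The hypothesis that titles carry signal can be checked directly by comparing survival rates per title. The following is a minimal sketch on a four-row toy frame (the frame `toy` and the column `Title` are illustrative names, not part of the notebook); on the real data the same grouping would be applied to `train` before the *Name* column is overwritten.

```python
import pandas as pd

# Toy stand-in for the training frame (assumption: illustrative rows only).
toy = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
        "Allen, Mr. William Henry",
    ],
    "Survived": [0, 1, 1, 0],
})

# Title = text between ", " and the first "." (same rule as the split-based cells).
toy["Title"] = toy["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# The mean of a 0/1 target per group is the survival rate; count shows group size.
title_stats = toy.groupby("Title")["Survived"].agg(["mean", "count"])
print(title_stats)
```

Titles with very few passengers give unreliable rates, which is one reason to keep only the most frequent ones.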

In [7]:
train['Name'] = title
test['Name'] = [name.split(".")[0] for name in [name.split(", ")[1] for name in test['Name']]] 

In [8]:
train = pd.concat([train, pd.get_dummies(train['Name']).filter(['Miss', 'Mr', 'Mrs', 'Ms'])], axis = 1)
train.drop('Name', axis = 1, inplace = True)

test = pd.concat([test, pd.get_dummies(test['Name']).filter(['Miss', 'Mr', 'Mrs', 'Ms'])], axis = 1)
test.drop('Name', axis = 1, inplace = True)

### Dealing with the Ticket feature

In [9]:
train['Ticket'][0:5]

0           A/5 21171
1            PC 17599
2    STON/O2. 3101282
3              113803
4              373450
Name: Ticket, dtype: object

**One hypothesis** is that the ticket number itself carries no relevant information, while the prefix might.

In [10]:
ticket_prefixes = [ticket.split()[0] for ticket in train['Ticket']]
ticket_prefixes[0:5]

['A/5', 'PC', 'STON/O2.', '113803', '373450']

In [11]:
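for i in range(len(ticket_prefixes)):
    try:
        # A prefix that parses as an integer means the ticket has no prefix at all.
        int(ticket_prefixes[i])
        ticket_prefixes[i] = "number_only"
    except ValueError:
        # Not a number: keep the prefix as-is.
        pass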

In [12]:
ticket_prefixes[0:5]

['A/5', 'PC', 'STON/O2.', 'number_only', 'number_only']

In [13]:
print(len(np.unique(ticket_prefixes)))
np.unique(ticket_prefixes)

44


array(['A./5.', 'A.5.', 'A/4', 'A/4.', 'A/5', 'A/5.', 'A/S', 'A4.', 'C',
       'C.A.', 'C.A./SOTON', 'CA', 'CA.', 'F.C.', 'F.C.C.', 'Fa', 'LINE',
       'P/PP', 'PC', 'PP', 'S.C./A.4.', 'S.C./PARIS', 'S.O./P.P.',
       'S.O.C.', 'S.O.P.', 'S.P.', 'S.W./PP', 'SC', 'SC/AH', 'SC/PARIS',
       'SC/Paris', 'SCO/W', 'SO/C', 'SOTON/O.Q.', 'SOTON/O2', 'SOTON/OQ',
       'STON/O', 'STON/O2.', 'SW/PP', 'W./C.', 'W.E.P.', 'W/C', 'WE/P',
       'number_only'], dtype='<U11')

In [14]:
ticket_prefixes = [s.replace(".", "") for s in ticket_prefixes]
ticket_prefixes = [s.replace(",", "") for s in ticket_prefixes]
ticket_prefixes = [s.upper() for s in ticket_prefixes]

In [15]:
print(len(np.unique(ticket_prefixes)))
np.unique(ticket_prefixes)

34


array(['A/4', 'A/5', 'A/S', 'A4', 'A5', 'C', 'CA', 'CA/SOTON', 'FA', 'FC',
       'FCC', 'LINE', 'NUMBER_ONLY', 'P/PP', 'PC', 'PP', 'SC', 'SC/A4',
       'SC/AH', 'SC/PARIS', 'SCO/W', 'SO/C', 'SO/PP', 'SOC', 'SOP',
       'SOTON/O2', 'SOTON/OQ', 'SP', 'STON/O', 'STON/O2', 'SW/PP', 'W/C',
       'WE/P', 'WEP'], dtype='<U11')

In [16]:
test_ticket_prefixes = [ticket.split()[0] for ticket in test['Ticket']]
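for i in range(len(test_ticket_prefixes)):
    try:
        # Same rule as for the training set: numeric-only tickets get one shared label.
        int(test_ticket_prefixes[i])
        test_ticket_prefixes[i] = "number_only"
    except ValueError:
        # Not a number: keep the prefix as-is.
        pass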

test_ticket_prefixes = [s.replace(".", "") for s in test_ticket_prefixes]
test_ticket_prefixes = [s.replace(",", "") for s in test_ticket_prefixes]
test_ticket_prefixes = [s.upper() for s in test_ticket_prefixes]

In [17]:
train['Ticket'] = ticket_prefixes
test['Ticket'] = test_ticket_prefixes

In [18]:
train = pd.concat([train, pd.get_dummies(train['Ticket']).filter(['PC', 'CA', 'NUMBER_ONLY'])], axis = 1)
train.drop('Ticket', axis = 1, inplace = True)

test = pd.concat([test, pd.get_dummies(test['Ticket']).filter(['PC', 'CA', 'NUMBER_ONLY'])], axis = 1)
test.drop('Ticket', axis = 1, inplace = True)
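
The train and test cells above repeat the same prefix logic. As a sketch, the steps could be collected into one function and applied to both frames; `ticket_prefix` is a hypothetical helper name, not something defined in the notebook.

```python
def ticket_prefix(ticket: str) -> str:
    """Return the normalized prefix of a ticket string (hypothetical helper)."""
    first = ticket.split()[0]
    try:
        int(first)  # purely numeric ticket: there is no prefix
        return "NUMBER_ONLY"
    except ValueError:
        pass
    # Drop punctuation and unify case so 'A/5.' and 'a/5' collapse together.
    return first.replace(".", "").replace(",", "").upper()


examples = ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803"]
prefixes = [ticket_prefix(t) for t in examples]
print(prefixes)
```

With a helper like this, each frame needs a single call such as `train['Ticket'].map(ticket_prefix)`, which keeps the train and test transformations from drifting apart.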

### Dealing with the Cabin feature

As with the *Ticket* feature, I will assume that the cabin number carries no relevant information and keep only the leading letter.

In [19]:
train['Cabin'].unique()

array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C62 C64',

In [20]:
cabin_prefix = []
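for i in range(len(train['Cabin'])):
    try:
        # Keep only the first character (the cabin letter).
        cabin_prefix.append(train['Cabin'][i][0:1])
    except TypeError:
        # Missing cabins are float NaN and cannot be sliced; keep them as NaN.
        cabin_prefix.append(train['Cabin'][i])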

In [21]:
np.unique(cabin_prefix)

array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'T', 'nan'], dtype='<U32')

In [22]:
cabin_test_prefix = []
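for i in range(len(test['Cabin'])):
    try:
        # Keep only the first character (the cabin letter).
        cabin_test_prefix.append(test['Cabin'][i][0:1])
    except TypeError:
        # Missing cabins are float NaN and cannot be sliced; keep them as NaN.
        cabin_test_prefix.append(test['Cabin'][i])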

In [23]:
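# Label missing cabins with the string 'NaN' so that the 'NaN' dummy column
# selected in the next cell exists (pd.get_dummies skips missing values by default).
train['Cabin'] = pd.Series(cabin_prefix, index=train.index).fillna('NaN')
test['Cabin'] = pd.Series(cabin_test_prefix, index=test.index).fillna('NaN')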

In [24]:
train = pd.concat([train, pd.get_dummies(train['Cabin']).filter(['NaN', 'B', 'C'])], axis = 1)
train.drop('Cabin', axis = 1, inplace = True)

test = pd.concat([test, pd.get_dummies(test['Cabin']).filter(['NaN', 'B', 'C'])], axis = 1)
test.drop('Cabin', axis = 1, inplace = True)
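
One detail worth checking in this step: `pd.get_dummies` ignores missing values by default, so a `'NaN'` indicator column only appears when missing cabins are labeled with an actual string. A small sketch on a toy Series (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Toy cabin letters with two missing entries.
cabin = pd.Series(["C", np.nan, "B", np.nan])

# Default behavior: missing values produce no column at all.
default_cols = list(pd.get_dummies(cabin).columns)

# Labeling missing values first yields an explicit 'NaN' indicator column.
labeled_cols = list(pd.get_dummies(cabin.fillna("NaN")).columns)

print(default_cols)
print(labeled_cols)
```

Because `DataFrame.filter` silently skips names that do not exist, a missing `'NaN'` column would go unnoticed rather than raise an error.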

### Pclass, Sex and Embarked variables

In [25]:
train = pd.get_dummies(train, columns = ['Pclass', 'Sex', 'Embarked'])
test = pd.get_dummies(test, columns = ['Pclass', 'Sex', 'Embarked'])
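
Encoding `train` and `test` separately works only if both frames contain the same categories. As a precaution, the test columns can be aligned to the training columns after encoding. A minimal sketch on toy frames (all names here are illustrative, not from the notebook):

```python
import pandas as pd

toy_train = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Fare": [7.25, 71.28, 8.05]})
toy_test = pd.DataFrame({"Embarked": ["S", "S"], "Fare": [7.9, 13.0]})

train_enc = pd.get_dummies(toy_train, columns=["Embarked"])
test_enc = pd.get_dummies(toy_test, columns=["Embarked"])

# Give the test frame exactly the training columns, in the same order;
# dummy columns that never occur in the test data are filled with 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

In the notebook itself, the column list would be the feature list rather than all training columns, since `Survived` exists only in `train`.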

### Imputer and Scaler

In [26]:
%%time

features = [col for col in train.columns if col not in ['Survived', 'PassengerId']]
numerical_features = [col for col in features if col in ['Age', 'SibSp', 'Parch', 'Fare', 'n_missing']]

pipe = Pipeline([
        ('imputer', SimpleImputer(strategy='mean',missing_values=np.nan)),
        ("scaler", StandardScaler())
        ])

train[numerical_features] = pipe.fit_transform(train[numerical_features])
test[numerical_features] = pipe.transform(test[numerical_features])

Wall time: 12 ms


# Base Models

## LightGBM

In [64]:
import lightgbm as lgb
import optuna

def objective(trial):
    
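    # 1. Hold out 20% of the training data for validation.
    #    'Survived' was cast to str during preprocessing; cast it back to int
    #    so the labels and the rounded predictions share one type.
    train_x, valid_x, train_y, valid_y = train_test_split(
        train[features], train['Survived'].astype(int), test_size=0.20)
    dtrain = lgb.Dataset(train_x, label=train_y)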

    # 2. Suggest values of the hyperparameters using a trial object.
    param = {
        'objective': 'binary',
        'metric': 'binary_logloss',
        'verbosity': -1,
        'boosting_type': 'gbdt',
        'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 10.0, log=True),
        'lambda_l2': trial.suggest_float('lambda_l2', 1e-8, 10.0, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 2, 256),
        'feature_fraction': trial.suggest_float('feature_fraction', 0.4, 1.0),
        'bagging_fraction': trial.suggest_float('bagging_fraction', 0.4, 1.0),
        'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
    }

    gbm = lgb.train(param, dtrain)
    preds = np.rint(gbm.predict(valid_x))
    
    accuracy = metrics.accuracy_score(valid_y, preds)
    return accuracy
    

# 3. Create a study object and optimize the objective function.
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=1000)

[32m[I 2021-12-20 20:01:47,356][0m A new study created in memory with name: no-name-ea31ee86-b719-48ba-8aa9-aaa3a685a0ce[0m
[32m[I 2021-12-20 20:01:47,432][0m Trial 0 finished with value: 0.8212290502793296 and parameters: {'lambda_l1': 4.014087766874642e-08, 'lambda_l2': 0.00015744713645363923, 'num_leaves': 27, 'feature_fraction': 0.9039739089558969, 'bagging_fraction': 0.5226995774038986, 'bagging_freq': 2, 'min_child_samples': 14}. Best is trial 0 with value: 0.8212290502793296.[0m
[32m[I 2021-12-20 20:01:47,483][0m Trial 1 finished with value: 0.8268156424581006 and parameters: {'lambda_l1': 0.1672344251180262, 'lambda_l2': 0.7843167242876422, 'num_leaves': 53, 'feature_fraction': 0.8166852384968912, 'bagging_fraction': 0.7811594127210835, 'bagging_freq': 3, 'min_child_samples': 41}. Best is trial 1 with value: 0.8268156424581006.[0m
[32m[I 2021-12-20 20:01:47,526][0m Trial 2 finished with value: 0.8156424581005587 and parameters: {'lambda_l1': 0.00017708841586685798, '

[32m[I 2021-12-20 20:01:49,070][0m Trial 22 finished with value: 0.8156424581005587 and parameters: {'lambda_l1': 4.456210459282396e-05, 'lambda_l2': 9.01589611112316e-06, 'num_leaves': 144, 'feature_fraction': 0.5900388157058891, 'bagging_fraction': 0.5909910965605213, 'bagging_freq': 7, 'min_child_samples': 19}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:01:49,154][0m Trial 23 finished with value: 0.8491620111731844 and parameters: {'lambda_l1': 1.3278365624841614e-06, 'lambda_l2': 9.66978833334972e-07, 'num_leaves': 93, 'feature_fraction': 0.65611927475749, 'bagging_fraction': 0.4859082547661398, 'bagging_freq': 6, 'min_child_samples': 46}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:01:49,235][0m Trial 24 finished with value: 0.7821229050279329 and parameters: {'lambda_l1': 1.5118636760227041e-06, 'lambda_l2': 4.741791220877294e-07, 'num_leaves': 91, 'feature_fraction': 0.6830975755832005, 'bagging_fraction': 0.477272

[32m[I 2021-12-20 20:01:51,196][0m Trial 45 finished with value: 0.8268156424581006 and parameters: {'lambda_l1': 3.211467322693401e-08, 'lambda_l2': 3.2897077953322542e-06, 'num_leaves': 151, 'feature_fraction': 0.7430258322057448, 'bagging_fraction': 0.9087981081638211, 'bagging_freq': 3, 'min_child_samples': 41}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:01:51,286][0m Trial 46 finished with value: 0.8659217877094972 and parameters: {'lambda_l1': 1.3034820165067492e-07, 'lambda_l2': 0.006216751799674592, 'num_leaves': 221, 'feature_fraction': 0.887900669052977, 'bagging_fraction': 0.8822779254045133, 'bagging_freq': 4, 'min_child_samples': 52}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:01:51,370][0m Trial 47 finished with value: 0.8379888268156425 and parameters: {'lambda_l1': 2.635809084276726e-07, 'lambda_l2': 0.008879952194204354, 'num_leaves': 220, 'feature_fraction': 0.8865378310045782, 'bagging_fraction': 0.884

[32m[I 2021-12-20 20:01:53,408][0m Trial 68 finished with value: 0.8491620111731844 and parameters: {'lambda_l1': 1.8140344430762188e-06, 'lambda_l2': 0.0027438866043373407, 'num_leaves': 160, 'feature_fraction': 0.773979048213384, 'bagging_fraction': 0.6695430555998325, 'bagging_freq': 6, 'min_child_samples': 32}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:01:53,508][0m Trial 69 finished with value: 0.8491620111731844 and parameters: {'lambda_l1': 0.0021718118431612032, 'lambda_l2': 1.1529553248243765, 'num_leaves': 115, 'feature_fraction': 0.5501104762130927, 'bagging_fraction': 0.9697494729677064, 'bagging_freq': 2, 'min_child_samples': 38}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:01:53,614][0m Trial 70 finished with value: 0.8100558659217877 and parameters: {'lambda_l1': 4.471531116534123e-08, 'lambda_l2': 3.2801735253452864e-07, 'num_leaves': 97, 'feature_fraction': 0.5140462204030041, 'bagging_fraction': 0.57918

[32m[I 2021-12-20 20:01:55,555][0m Trial 91 finished with value: 0.8212290502793296 and parameters: {'lambda_l1': 1.6548445650932262e-05, 'lambda_l2': 0.000719858742598174, 'num_leaves': 179, 'feature_fraction': 0.883737826676938, 'bagging_fraction': 0.44651305979878936, 'bagging_freq': 1, 'min_child_samples': 29}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:01:55,649][0m Trial 92 finished with value: 0.8156424581005587 and parameters: {'lambda_l1': 4.20733353753976e-06, 'lambda_l2': 0.0012452764271294675, 'num_leaves': 189, 'feature_fraction': 0.8968371219295614, 'bagging_fraction': 0.4633428993270486, 'bagging_freq': 3, 'min_child_samples': 35}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:01:55,752][0m Trial 93 finished with value: 0.7821229050279329 and parameters: {'lambda_l1': 2.052634006531029e-06, 'lambda_l2': 0.0004672383009780988, 'num_leaves': 174, 'feature_fraction': 0.8536946487972983, 'bagging_fraction': 0.424

[32m[I 2021-12-20 20:01:57,790][0m Trial 114 finished with value: 0.770949720670391 and parameters: {'lambda_l1': 1.0827188657268897e-05, 'lambda_l2': 0.06763844839500822, 'num_leaves': 95, 'feature_fraction': 0.6273084998466621, 'bagging_fraction': 0.43509931860873424, 'bagging_freq': 7, 'min_child_samples': 47}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:01:57,950][0m Trial 115 finished with value: 0.7653631284916201 and parameters: {'lambda_l1': 1.1872210362180367e-07, 'lambda_l2': 1.3148943280413306e-07, 'num_leaves': 43, 'feature_fraction': 0.916895322959463, 'bagging_fraction': 0.6920415881781502, 'bagging_freq': 7, 'min_child_samples': 7}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:01:58,045][0m Trial 116 finished with value: 0.7988826815642458 and parameters: {'lambda_l1': 6.541745235453318e-08, 'lambda_l2': 0.010169896749688025, 'num_leaves': 36, 'feature_fraction': 0.6080851741323076, 'bagging_fraction': 0.7688

[32m[I 2021-12-20 20:02:00,147][0m Trial 137 finished with value: 0.8379888268156425 and parameters: {'lambda_l1': 8.616893806441807e-06, 'lambda_l2': 0.02250511424500533, 'num_leaves': 208, 'feature_fraction': 0.46571125886927767, 'bagging_fraction': 0.5071234224116237, 'bagging_freq': 1, 'min_child_samples': 27}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:00,242][0m Trial 138 finished with value: 0.8491620111731844 and parameters: {'lambda_l1': 1.7966025403976405e-05, 'lambda_l2': 0.011396255958116093, 'num_leaves': 137, 'feature_fraction': 0.8313256634472986, 'bagging_fraction': 0.48747686879525515, 'bagging_freq': 3, 'min_child_samples': 54}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:00,345][0m Trial 139 finished with value: 0.8435754189944135 and parameters: {'lambda_l1': 2.6102423689727785e-08, 'lambda_l2': 0.0023562190231088366, 'num_leaves': 161, 'feature_fraction': 0.572001379520687, 'bagging_fraction': 0

[32m[I 2021-12-20 20:02:02,646][0m Trial 160 finished with value: 0.8659217877094972 and parameters: {'lambda_l1': 2.3383787796429072e-07, 'lambda_l2': 0.00020174422695104442, 'num_leaves': 75, 'feature_fraction': 0.8150862610972748, 'bagging_fraction': 0.9509485925145715, 'bagging_freq': 4, 'min_child_samples': 18}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:02,783][0m Trial 161 finished with value: 0.8212290502793296 and parameters: {'lambda_l1': 1.8697643603689944e-07, 'lambda_l2': 0.00019490875272118726, 'num_leaves': 74, 'feature_fraction': 0.8479775101333833, 'bagging_fraction': 0.9521732252284827, 'bagging_freq': 4, 'min_child_samples': 16}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:02,943][0m Trial 162 finished with value: 0.8156424581005587 and parameters: {'lambda_l1': 4.634873693756842e-07, 'lambda_l2': 9.38916257610086e-05, 'num_leaves': 65, 'feature_fraction': 0.8155580457417242, 'bagging_fraction': 0

[32m[I 2021-12-20 20:02:05,528][0m Trial 183 finished with value: 0.8491620111731844 and parameters: {'lambda_l1': 2.807117124230363, 'lambda_l2': 0.0003076135990395594, 'num_leaves': 82, 'feature_fraction': 0.7795218877439037, 'bagging_fraction': 0.9033806728950946, 'bagging_freq': 3, 'min_child_samples': 10}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:05,623][0m Trial 184 finished with value: 0.8212290502793296 and parameters: {'lambda_l1': 6.890695300984491, 'lambda_l2': 2.493260014804215e-06, 'num_leaves': 86, 'feature_fraction': 0.78947691609367, 'bagging_fraction': 0.9238158972850415, 'bagging_freq': 3, 'min_child_samples': 12}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:05,728][0m Trial 185 finished with value: 0.8324022346368715 and parameters: {'lambda_l1': 4.225425212724655e-07, 'lambda_l2': 4.623192946587538e-06, 'num_leaves': 108, 'feature_fraction': 0.7490748397664575, 'bagging_fraction': 0.91320043572

[32m[I 2021-12-20 20:02:08,340][0m Trial 206 finished with value: 0.8156424581005587 and parameters: {'lambda_l1': 2.2618113603572385, 'lambda_l2': 0.00021408355769354903, 'num_leaves': 85, 'feature_fraction': 0.7967170526317546, 'bagging_fraction': 0.917936188011434, 'bagging_freq': 3, 'min_child_samples': 12}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:08,446][0m Trial 207 finished with value: 0.8324022346368715 and parameters: {'lambda_l1': 0.00025905317673608874, 'lambda_l2': 0.0008569056457886786, 'num_leaves': 152, 'feature_fraction': 0.9421983881442128, 'bagging_fraction': 0.4405938800918814, 'bagging_freq': 3, 'min_child_samples': 28}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:08,548][0m Trial 208 finished with value: 0.8435754189944135 and parameters: {'lambda_l1': 2.0240281748149682e-07, 'lambda_l2': 5.345281193277869e-07, 'num_leaves': 72, 'feature_fraction': 0.8361321467572659, 'bagging_fraction': 0.92

[32m[I 2021-12-20 20:02:10,890][0m Trial 229 finished with value: 0.8324022346368715 and parameters: {'lambda_l1': 1.2565269662214664e-06, 'lambda_l2': 0.00010503692633764399, 'num_leaves': 171, 'feature_fraction': 0.7791260381126968, 'bagging_fraction': 0.42924948203843843, 'bagging_freq': 1, 'min_child_samples': 26}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:10,996][0m Trial 230 finished with value: 0.8547486033519553 and parameters: {'lambda_l1': 3.8767546742634986e-07, 'lambda_l2': 0.00038512397175815546, 'num_leaves': 164, 'feature_fraction': 0.7865006005925195, 'bagging_fraction': 0.4106943757723982, 'bagging_freq': 1, 'min_child_samples': 28}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:11,126][0m Trial 231 finished with value: 0.8603351955307262 and parameters: {'lambda_l1': 3.108020810316539e-07, 'lambda_l2': 0.0004434175909190514, 'num_leaves': 160, 'feature_fraction': 0.7875123645088219, 'bagging_fractio

[32m[I 2021-12-20 20:02:13,778][0m Trial 252 finished with value: 0.8547486033519553 and parameters: {'lambda_l1': 0.0878472778764071, 'lambda_l2': 0.00010933801094395398, 'num_leaves': 201, 'feature_fraction': 0.8747486478473349, 'bagging_fraction': 0.5185793754471086, 'bagging_freq': 3, 'min_child_samples': 15}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:13,904][0m Trial 253 finished with value: 0.8324022346368715 and parameters: {'lambda_l1': 0.014978892016786224, 'lambda_l2': 8.560986768526262e-05, 'num_leaves': 204, 'feature_fraction': 0.8788207321794057, 'bagging_fraction': 0.5116258881361417, 'bagging_freq': 3, 'min_child_samples': 16}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:14,017][0m Trial 254 finished with value: 0.8379888268156425 and parameters: {'lambda_l1': 0.04207552405627836, 'lambda_l2': 4.213735863932491e-06, 'num_leaves': 184, 'feature_fraction': 0.9013447705665808, 'bagging_fraction': 0.5404

[32m[I 2021-12-20 20:02:16,730][0m Trial 275 finished with value: 0.8268156424581006 and parameters: {'lambda_l1': 4.0754459294982414e-07, 'lambda_l2': 0.0003577373683047347, 'num_leaves': 226, 'feature_fraction': 0.7463751832760616, 'bagging_fraction': 0.40816936269801896, 'bagging_freq': 1, 'min_child_samples': 26}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:16,856][0m Trial 276 finished with value: 0.8379888268156425 and parameters: {'lambda_l1': 1.9563142029403e-08, 'lambda_l2': 0.00019197750666905976, 'num_leaves': 195, 'feature_fraction': 0.7991136006685112, 'bagging_fraction': 0.8997402273409542, 'bagging_freq': 3, 'min_child_samples': 33}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:16,963][0m Trial 277 finished with value: 0.8324022346368715 and parameters: {'lambda_l1': 0.008337075988204386, 'lambda_l2': 9.232330734215654e-05, 'num_leaves': 115, 'feature_fraction': 0.9084321690235874, 'bagging_fraction': 0

[32m[I 2021-12-20 20:02:19,311][0m Trial 298 finished with value: 0.8268156424581006 and parameters: {'lambda_l1': 0.0144085297347161, 'lambda_l2': 0.0001368078172290207, 'num_leaves': 203, 'feature_fraction': 0.8100090971062411, 'bagging_fraction': 0.8921052326471085, 'bagging_freq': 4, 'min_child_samples': 52}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:19,425][0m Trial 299 finished with value: 0.8491620111731844 and parameters: {'lambda_l1': 0.023695819341750392, 'lambda_l2': 0.0004852479075751373, 'num_leaves': 215, 'feature_fraction': 0.6553293686962113, 'bagging_fraction': 0.8639380781982879, 'bagging_freq': 4, 'min_child_samples': 48}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:19,543][0m Trial 300 finished with value: 0.8379888268156425 and parameters: {'lambda_l1': 3.2947801781223064e-08, 'lambda_l2': 0.0006012904518042184, 'num_leaves': 222, 'feature_fraction': 0.6154395965996212, 'bagging_fraction': 0.87

[32m[I 2021-12-20 20:02:22,008][0m Trial 321 finished with value: 0.8156424581005587 and parameters: {'lambda_l1': 7.069451531862665e-07, 'lambda_l2': 1.4950914478666751e-05, 'num_leaves': 78, 'feature_fraction': 0.6186838427414684, 'bagging_fraction': 0.8984525877529356, 'bagging_freq': 4, 'min_child_samples': 49}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:22,126][0m Trial 322 finished with value: 0.8268156424581006 and parameters: {'lambda_l1': 2.2072786583361383e-07, 'lambda_l2': 6.080477195117049e-06, 'num_leaves': 69, 'feature_fraction': 0.626155109104582, 'bagging_fraction': 0.906133287703949, 'bagging_freq': 4, 'min_child_samples': 46}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:22,242][0m Trial 323 finished with value: 0.8603351955307262 and parameters: {'lambda_l1': 2.0947423247817685e-08, 'lambda_l2': 0.0021311628778020453, 'num_leaves': 153, 'feature_fraction': 0.8127772447605668, 'bagging_fraction': 0.

[32m[I 2021-12-20 20:02:24,853][0m Trial 344 finished with value: 0.7988826815642458 and parameters: {'lambda_l1': 2.5582160109376073e-07, 'lambda_l2': 0.00021316708552383964, 'num_leaves': 102, 'feature_fraction': 0.7911313680489592, 'bagging_fraction': 0.9872955375052178, 'bagging_freq': 3, 'min_child_samples': 24}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:24,977][0m Trial 345 finished with value: 0.8435754189944135 and parameters: {'lambda_l1': 1.5436429407454824e-08, 'lambda_l2': 0.006609460105934079, 'num_leaves': 193, 'feature_fraction': 0.7138622778274231, 'bagging_fraction': 0.8686230520303919, 'bagging_freq': 4, 'min_child_samples': 50}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:25,096][0m Trial 346 finished with value: 0.8435754189944135 and parameters: {'lambda_l1': 1.0655637199276658e-07, 'lambda_l2': 1.0446846036627286e-05, 'num_leaves': 163, 'feature_fraction': 0.7257755599510444, 'bagging_fraction

[32m[I 2021-12-20 20:02:27,688][0m Trial 367 finished with value: 0.8435754189944135 and parameters: {'lambda_l1': 4.617361853825114e-07, 'lambda_l2': 8.908766775636066e-06, 'num_leaves': 78, 'feature_fraction': 0.6597809312687959, 'bagging_fraction': 0.933916151111768, 'bagging_freq': 4, 'min_child_samples': 47}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:27,805][0m Trial 368 finished with value: 0.8324022346368715 and parameters: {'lambda_l1': 8.998550921968518e-08, 'lambda_l2': 2.7724769249364757e-06, 'num_leaves': 68, 'feature_fraction': 0.7885902444205152, 'bagging_fraction': 0.9443511913551947, 'bagging_freq': 3, 'min_child_samples': 60}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:27,935][0m Trial 369 finished with value: 0.8379888268156425 and parameters: {'lambda_l1': 3.1722543588660915e-08, 'lambda_l2': 2.383302988568021e-06, 'num_leaves': 256, 'feature_fraction': 0.8046937412081788, 'bagging_fraction': 0.

[32m[I 2021-12-20 20:02:30,663][0m Trial 390 finished with value: 0.8268156424581006 and parameters: {'lambda_l1': 2.2300542189297986e-08, 'lambda_l2': 0.00216413071597672, 'num_leaves': 182, 'feature_fraction': 0.7106651964256331, 'bagging_fraction': 0.7664877437093628, 'bagging_freq': 4, 'min_child_samples': 54}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:30,798][0m Trial 391 finished with value: 0.8715083798882681 and parameters: {'lambda_l1': 0.36636781452475886, 'lambda_l2': 0.0002999351287089939, 'num_leaves': 78, 'feature_fraction': 0.8844156399246916, 'bagging_fraction': 0.550525041894337, 'bagging_freq': 3, 'min_child_samples': 17}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:30,937][0m Trial 392 finished with value: 0.8324022346368715 and parameters: {'lambda_l1': 0.09121083757194702, 'lambda_l2': 0.0007420304882530056, 'num_leaves': 77, 'feature_fraction': 0.8832951106158187, 'bagging_fraction': 0.5644987

[32m[I 2021-12-20 20:02:33,565][0m Trial 413 finished with value: 0.8435754189944135 and parameters: {'lambda_l1': 4.36742940321036e-07, 'lambda_l2': 6.665365794372538e-06, 'num_leaves': 73, 'feature_fraction': 0.9019313544779276, 'bagging_fraction': 0.4772319349815054, 'bagging_freq': 4, 'min_child_samples': 55}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:33,679][0m Trial 414 finished with value: 0.8603351955307262 and parameters: {'lambda_l1': 9.048570757454666, 'lambda_l2': 0.0003955802445810544, 'num_leaves': 76, 'feature_fraction': 0.7242652519256954, 'bagging_fraction': 0.929405114278819, 'bagging_freq': 4, 'min_child_samples': 35}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:33,822][0m Trial 415 finished with value: 0.7988826815642458 and parameters: {'lambda_l1': 2.5886602136788844e-07, 'lambda_l2': 6.91267266562612e-05, 'num_leaves': 183, 'feature_fraction': 0.7223036072004969, 'bagging_fraction': 0.8987390

[32m[I 2021-12-20 20:02:36,577][0m Trial 436 finished with value: 0.8435754189944135 and parameters: {'lambda_l1': 0.000605884798371405, 'lambda_l2': 2.6496406072321705e-06, 'num_leaves': 66, 'feature_fraction': 0.8124223446750438, 'bagging_fraction': 0.9124097758072496, 'bagging_freq': 6, 'min_child_samples': 25}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:36,701][0m Trial 437 finished with value: 0.8379888268156425 and parameters: {'lambda_l1': 0.002153458187473171, 'lambda_l2': 5.115359260700204e-06, 'num_leaves': 243, 'feature_fraction': 0.7479499961700993, 'bagging_fraction': 0.46652896785425385, 'bagging_freq': 6, 'min_child_samples': 27}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:36,842][0m Trial 438 finished with value: 0.8100558659217877 and parameters: {'lambda_l1': 3.0725901037491115e-07, 'lambda_l2': 2.6954269651937525e-07, 'num_leaves': 96, 'feature_fraction': 0.7754684872599384, 'bagging_fraction': 0

[32m[I 2021-12-20 20:02:39,606][0m Trial 459 finished with value: 0.8379888268156425 and parameters: {'lambda_l1': 0.00010171014928404067, 'lambda_l2': 1.1360845569899953e-05, 'num_leaves': 86, 'feature_fraction': 0.7456549720164745, 'bagging_fraction': 0.4288776820729634, 'bagging_freq': 3, 'min_child_samples': 21}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:39,732][0m Trial 460 finished with value: 0.8100558659217877 and parameters: {'lambda_l1': 1.02900334551704, 'lambda_l2': 0.06864412016981603, 'num_leaves': 135, 'feature_fraction': 0.8234304557773252, 'bagging_fraction': 0.888822171459236, 'bagging_freq': 6, 'min_child_samples': 57}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:39,869][0m Trial 461 finished with value: 0.8435754189944135 and parameters: {'lambda_l1': 8.810843234797962e-07, 'lambda_l2': 6.951382860404754e-06, 'num_leaves': 180, 'feature_fraction': 0.7552201956012419, 'bagging_fraction': 0.874205

[32m[I 2021-12-20 20:02:42,708][0m Trial 482 finished with value: 0.8156424581005587 and parameters: {'lambda_l1': 3.1189732078490737e-07, 'lambda_l2': 0.0015030420869063855, 'num_leaves': 81, 'feature_fraction': 0.8688658164729243, 'bagging_fraction': 0.9250451080531774, 'bagging_freq': 4, 'min_child_samples': 28}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:42,834][0m Trial 483 finished with value: 0.8491620111731844 and parameters: {'lambda_l1': 3.1021793810486415e-08, 'lambda_l2': 0.0009118851829098199, 'num_leaves': 238, 'feature_fraction': 0.8015126901197417, 'bagging_fraction': 0.43530039053267156, 'bagging_freq': 7, 'min_child_samples': 35}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:42,986][0m Trial 484 finished with value: 0.7988826815642458 and parameters: {'lambda_l1': 2.2580317098261802e-06, 'lambda_l2': 0.00013266793981667953, 'num_leaves': 79, 'feature_fraction': 0.8124932376278422, 'bagging_fraction'

[32m[I 2021-12-20 20:02:45,864][0m Trial 505 finished with value: 0.8268156424581006 and parameters: {'lambda_l1': 1.838135425575861e-07, 'lambda_l2': 8.72209535196639e-05, 'num_leaves': 72, 'feature_fraction': 0.9104134383167017, 'bagging_fraction': 0.6460856628483781, 'bagging_freq': 4, 'min_child_samples': 15}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:46,020][0m Trial 506 finished with value: 0.8379888268156425 and parameters: {'lambda_l1': 2.612635024756055e-07, 'lambda_l2': 1.1164112957271331e-06, 'num_leaves': 23, 'feature_fraction': 0.761102276961514, 'bagging_fraction': 0.5730517566809026, 'bagging_freq': 3, 'min_child_samples': 19}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:46,146][0m Trial 507 finished with value: 0.8491620111731844 and parameters: {'lambda_l1': 5.200874652410686e-08, 'lambda_l2': 3.3963388511605996e-06, 'num_leaves': 167, 'feature_fraction': 0.7242132912479756, 'bagging_fraction': 0.5

[32m[I 2021-12-20 20:02:49,024][0m Trial 528 finished with value: 0.8324022346368715 and parameters: {'lambda_l1': 9.149259347306299e-07, 'lambda_l2': 0.001303334373619517, 'num_leaves': 212, 'feature_fraction': 0.7787284403034134, 'bagging_fraction': 0.6186133429822125, 'bagging_freq': 5, 'min_child_samples': 54}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:49,196][0m Trial 529 finished with value: 0.888268156424581 and parameters: {'lambda_l1': 3.4988782738814165e-07, 'lambda_l2': 0.0004953248786849387, 'num_leaves': 95, 'feature_fraction': 0.8190293017672494, 'bagging_fraction': 0.9158457197415198, 'bagging_freq': 4, 'min_child_samples': 26}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:49,357][0m Trial 530 finished with value: 0.8212290502793296 and parameters: {'lambda_l1': 3.866615773546991e-07, 'lambda_l2': 0.0005196433932657937, 'num_leaves': 93, 'feature_fraction': 0.8182014501363424, 'bagging_fraction': 0.91

[32m[I 2021-12-20 20:02:52,385][0m Trial 551 finished with value: 0.8268156424581006 and parameters: {'lambda_l1': 7.89972132819538e-07, 'lambda_l2': 0.0001824782946048921, 'num_leaves': 84, 'feature_fraction': 0.8342275552399349, 'bagging_fraction': 0.9500906665696811, 'bagging_freq': 2, 'min_child_samples': 29}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:52,539][0m Trial 552 finished with value: 0.8212290502793296 and parameters: {'lambda_l1': 1.2957076006236917e-06, 'lambda_l2': 0.0005020574577405717, 'num_leaves': 176, 'feature_fraction': 0.7989221003952587, 'bagging_fraction': 0.9186766979630936, 'bagging_freq': 6, 'min_child_samples': 26}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:52,692][0m Trial 553 finished with value: 0.8491620111731844 and parameters: {'lambda_l1': 1.376892621709395e-06, 'lambda_l2': 0.0009799309569309112, 'num_leaves': 182, 'feature_fraction': 0.8538891545177719, 'bagging_fraction': 0.

[32m[I 2021-12-20 20:02:55,741][0m Trial 574 finished with value: 0.8659217877094972 and parameters: {'lambda_l1': 5.292516122708859e-07, 'lambda_l2': 0.00036817929459730743, 'num_leaves': 105, 'feature_fraction': 0.9143238069826031, 'bagging_fraction': 0.5438835666982459, 'bagging_freq': 1, 'min_child_samples': 16}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:55,896][0m Trial 575 finished with value: 0.7877094972067039 and parameters: {'lambda_l1': 0.00029360449698829095, 'lambda_l2': 7.303575270239708e-05, 'num_leaves': 187, 'feature_fraction': 0.8545095477981236, 'bagging_fraction': 0.9246157226881273, 'bagging_freq': 3, 'min_child_samples': 28}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:56,035][0m Trial 576 finished with value: 0.8379888268156425 and parameters: {'lambda_l1': 1.776718965322025e-06, 'lambda_l2': 0.0004787068354558333, 'num_leaves': 127, 'feature_fraction': 0.7849233137336655, 'bagging_fraction':

[32m[I 2021-12-20 20:02:59,204][0m Trial 597 finished with value: 0.8324022346368715 and parameters: {'lambda_l1': 1.0184777292013777e-06, 'lambda_l2': 0.00019082636241778657, 'num_leaves': 175, 'feature_fraction': 0.8141299323710994, 'bagging_fraction': 0.9561683219990078, 'bagging_freq': 6, 'min_child_samples': 32}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:59,363][0m Trial 598 finished with value: 0.7877094972067039 and parameters: {'lambda_l1': 3.92352392679636e-07, 'lambda_l2': 8.021984946318054e-05, 'num_leaves': 176, 'feature_fraction': 0.81814634631841, 'bagging_fraction': 0.951087250588142, 'bagging_freq': 6, 'min_child_samples': 30}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:02:59,524][0m Trial 599 finished with value: 0.7932960893854749 and parameters: {'lambda_l1': 6.150865863298833e-07, 'lambda_l2': 0.00021178847784382896, 'num_leaves': 183, 'feature_fraction': 0.8267086512989252, 'bagging_fraction': 0.

[32m[I 2021-12-20 20:03:02,675][0m Trial 620 finished with value: 0.8324022346368715 and parameters: {'lambda_l1': 1.6611474657557683e-08, 'lambda_l2': 0.00017456967080261814, 'num_leaves': 185, 'feature_fraction': 0.815983953621671, 'bagging_fraction': 0.9049872632274748, 'bagging_freq': 3, 'min_child_samples': 24}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:03:02,829][0m Trial 621 finished with value: 0.8268156424581006 and parameters: {'lambda_l1': 2.2678011080428005e-06, 'lambda_l2': 0.0008613857610496648, 'num_leaves': 157, 'feature_fraction': 0.780634679992948, 'bagging_fraction': 0.9902175856815888, 'bagging_freq': 3, 'min_child_samples': 35}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:03:02,967][0m Trial 622 finished with value: 0.8212290502793296 and parameters: {'lambda_l1': 0.00013773671580261137, 'lambda_l2': 2.7318741504101625e-05, 'num_leaves': 170, 'feature_fraction': 0.8009383574523543, 'bagging_fraction'

[32m[I 2021-12-20 20:03:06,376][0m Trial 643 finished with value: 0.8324022346368715 and parameters: {'lambda_l1': 1.5731606477543507e-06, 'lambda_l2': 0.0001397615954639186, 'num_leaves': 196, 'feature_fraction': 0.7698816912001291, 'bagging_fraction': 0.9379832240085735, 'bagging_freq': 6, 'min_child_samples': 28}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:03:06,532][0m Trial 644 finished with value: 0.7988826815642458 and parameters: {'lambda_l1': 5.559556565474793e-06, 'lambda_l2': 0.0009810503013903623, 'num_leaves': 179, 'feature_fraction': 0.7581243057109236, 'bagging_fraction': 0.912257818169366, 'bagging_freq': 6, 'min_child_samples': 34}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:03:06,671][0m Trial 645 finished with value: 0.8659217877094972 and parameters: {'lambda_l1': 1.6466303371265252e-06, 'lambda_l2': 0.001333548462684235, 'num_leaves': 167, 'feature_fraction': 0.7434868049971503, 'bagging_fraction': 0

[32m[I 2021-12-20 20:03:09,597][0m Trial 666 finished with value: 0.7988826815642458 and parameters: {'lambda_l1': 0.0025106231140512992, 'lambda_l2': 0.00017102770234658088, 'num_leaves': 254, 'feature_fraction': 0.6998399226278074, 'bagging_fraction': 0.415641265807247, 'bagging_freq': 4, 'min_child_samples': 28}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:03:09,749][0m Trial 667 finished with value: 0.8156424581005587 and parameters: {'lambda_l1': 9.732220960394173e-07, 'lambda_l2': 0.00017299132141003284, 'num_leaves': 154, 'feature_fraction': 0.7697616189579733, 'bagging_fraction': 0.9100061221758773, 'bagging_freq': 4, 'min_child_samples': 35}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:03:09,884][0m Trial 668 finished with value: 0.8212290502793296 and parameters: {'lambda_l1': 4.716293333797337, 'lambda_l2': 2.291215605980328e-06, 'num_leaves': 178, 'feature_fraction': 0.7167588686757163, 'bagging_fraction': 0.42

[32m[I 2021-12-20 20:03:13,053][0m Trial 689 finished with value: 0.8491620111731844 and parameters: {'lambda_l1': 0.020799416015397473, 'lambda_l2': 5.157070218374756e-05, 'num_leaves': 211, 'feature_fraction': 0.76973704089703, 'bagging_fraction': 0.8890906044362666, 'bagging_freq': 3, 'min_child_samples': 68}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:03:13,190][0m Trial 690 finished with value: 0.8044692737430168 and parameters: {'lambda_l1': 0.03412192239804559, 'lambda_l2': 7.44021906048891e-06, 'num_leaves': 232, 'feature_fraction': 0.7475718388653618, 'bagging_fraction': 0.8997394012253414, 'bagging_freq': 3, 'min_child_samples': 89}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:03:13,321][0m Trial 691 finished with value: 0.8156424581005587 and parameters: {'lambda_l1': 4.009621260455306e-07, 'lambda_l2': 5.595364818600536e-05, 'num_leaves': 218, 'feature_fraction': 0.7183095792012401, 'bagging_fraction': 0.87997

[32m[I 2021-12-20 20:03:16,355][0m Trial 712 finished with value: 0.8715083798882681 and parameters: {'lambda_l1': 6.217879492935922e-07, 'lambda_l2': 0.0001302233860573964, 'num_leaves': 187, 'feature_fraction': 0.8228472456437513, 'bagging_fraction': 0.45100734909710505, 'bagging_freq': 4, 'min_child_samples': 28}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:03:16,492][0m Trial 713 finished with value: 0.8379888268156425 and parameters: {'lambda_l1': 6.375794673197627e-07, 'lambda_l2': 0.00012372703315271732, 'num_leaves': 190, 'feature_fraction': 0.8374593842378362, 'bagging_fraction': 0.4175479409752697, 'bagging_freq': 4, 'min_child_samples': 28}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:03:16,649][0m Trial 714 finished with value: 0.8770949720670391 and parameters: {'lambda_l1': 1.5679295097602886e-06, 'lambda_l2': 0.00011644559405284222, 'num_leaves': 188, 'feature_fraction': 0.8250189986267471, 'bagging_fraction

[32m[I 2021-12-20 20:03:19,803][0m Trial 735 finished with value: 0.8547486033519553 and parameters: {'lambda_l1': 4.2864901533955475e-07, 'lambda_l2': 6.33618123607764e-05, 'num_leaves': 197, 'feature_fraction': 0.7261642869910953, 'bagging_fraction': 0.9495822739549904, 'bagging_freq': 5, 'min_child_samples': 33}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:03:19,956][0m Trial 736 finished with value: 0.7932960893854749 and parameters: {'lambda_l1': 0.5135856753845883, 'lambda_l2': 1.8455268335852526e-05, 'num_leaves': 233, 'feature_fraction': 0.6959843177947953, 'bagging_fraction': 0.9031115070160227, 'bagging_freq': 4, 'min_child_samples': 35}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:03:20,086][0m Trial 737 finished with value: 0.7821229050279329 and parameters: {'lambda_l1': 2.7889385431688205, 'lambda_l2': 1.0151428001733137e-05, 'num_leaves': 78, 'feature_fraction': 0.7140199564606234, 'bagging_fraction': 0.4980

[32m[I 2021-12-20 20:03:23,118][0m Trial 758 finished with value: 0.8435754189944135 and parameters: {'lambda_l1': 1.4498349823439125e-07, 'lambda_l2': 1.7930636821074288e-06, 'num_leaves': 203, 'feature_fraction': 0.7995151240230673, 'bagging_fraction': 0.9444190025836106, 'bagging_freq': 6, 'min_child_samples': 35}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:03:23,246][0m Trial 759 finished with value: 0.8044692737430168 and parameters: {'lambda_l1': 4.091701882845227e-07, 'lambda_l2': 0.00015495101224653136, 'num_leaves': 2, 'feature_fraction': 0.7439035677720646, 'bagging_fraction': 0.44579340851486504, 'bagging_freq': 2, 'min_child_samples': 90}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:03:23,376][0m Trial 760 finished with value: 0.8770949720670391 and parameters: {'lambda_l1': 5.801729165621436e-08, 'lambda_l2': 4.3853614244888075e-05, 'num_leaves': 223, 'feature_fraction': 0.8557905149872571, 'bagging_fraction'

[32m[I 2021-12-20 20:03:26,178][0m Trial 780 finished with value: 0.8212290502793296 and parameters: {'lambda_l1': 1.0960745590374481e-07, 'lambda_l2': 2.886434925313738e-06, 'num_leaves': 235, 'feature_fraction': 0.5723298386432318, 'bagging_fraction': 0.5053215286670786, 'bagging_freq': 4, 'min_child_samples': 45}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:03:26,333][0m Trial 781 finished with value: 0.8435754189944135 and parameters: {'lambda_l1': 1.663329530931226e-07, 'lambda_l2': 7.037956825087668e-06, 'num_leaves': 224, 'feature_fraction': 0.6031562097765147, 'bagging_fraction': 0.8858399288118838, 'bagging_freq': 4, 'min_child_samples': 45}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:03:26,486][0m Trial 782 finished with value: 0.7821229050279329 and parameters: {'lambda_l1': 2.5077117435861295e-08, 'lambda_l2': 4.795228330932654e-05, 'num_leaves': 213, 'feature_fraction': 0.8436455724292284, 'bagging_fraction':

[32m[I 2021-12-20 20:03:29,534][0m Trial 802 finished with value: 0.8379888268156425 and parameters: {'lambda_l1': 1.4058658174836947e-06, 'lambda_l2': 4.721043148635314e-07, 'num_leaves': 213, 'feature_fraction': 0.8950999330092374, 'bagging_fraction': 0.9090430908043996, 'bagging_freq': 1, 'min_child_samples': 31}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:03:29,679][0m Trial 803 finished with value: 0.8212290502793296 and parameters: {'lambda_l1': 2.7354644244262596e-06, 'lambda_l2': 1.5749135579805837e-05, 'num_leaves': 256, 'feature_fraction': 0.8856351515811226, 'bagging_fraction': 0.5084606724497005, 'bagging_freq': 7, 'min_child_samples': 34}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:03:29,815][0m Trial 804 finished with value: 0.7932960893854749 and parameters: {'lambda_l1': 7.038893025130004e-07, 'lambda_l2': 0.00011548636202965152, 'num_leaves': 243, 'feature_fraction': 0.8511334534918941, 'bagging_fraction

[32m[I 2021-12-20 20:03:32,892][0m Trial 824 finished with value: 0.8156424581005587 and parameters: {'lambda_l1': 2.7097445907942323e-07, 'lambda_l2': 1.2134660084940493e-05, 'num_leaves': 213, 'feature_fraction': 0.8731615297371365, 'bagging_fraction': 0.9027007337042836, 'bagging_freq': 4, 'min_child_samples': 61}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:03:33,041][0m Trial 825 finished with value: 0.8770949720670391 and parameters: {'lambda_l1': 1.889079870391847e-06, 'lambda_l2': 8.786130553923067e-06, 'num_leaves': 207, 'feature_fraction': 0.8586778355998831, 'bagging_fraction': 0.9261383526705818, 'bagging_freq': 4, 'min_child_samples': 66}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:03:33,185][0m Trial 826 finished with value: 0.8324022346368715 and parameters: {'lambda_l1': 1.96193516272407e-06, 'lambda_l2': 7.567139147567202e-06, 'num_leaves': 206, 'feature_fraction': 0.8585147458967406, 'bagging_fraction': 

[32m[I 2021-12-20 20:03:36,313][0m Trial 847 finished with value: 0.7877094972067039 and parameters: {'lambda_l1': 2.0373769164325974e-06, 'lambda_l2': 1.8748071947373973e-06, 'num_leaves': 207, 'feature_fraction': 0.866338900691063, 'bagging_fraction': 0.9539792908341072, 'bagging_freq': 4, 'min_child_samples': 63}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:03:36,455][0m Trial 848 finished with value: 0.8547486033519553 and parameters: {'lambda_l1': 4.929981520773489e-07, 'lambda_l2': 2.9208320817977272e-05, 'num_leaves': 219, 'feature_fraction': 0.5860959182780284, 'bagging_fraction': 0.8814967768764509, 'bagging_freq': 6, 'min_child_samples': 80}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:03:36,711][0m Trial 849 finished with value: 0.7821229050279329 and parameters: {'lambda_l1': 4.7189696162977405e-06, 'lambda_l2': 7.567228564571558e-06, 'num_leaves': 212, 'feature_fraction': 0.8575503582156777, 'bagging_fraction'

[32m[I 2021-12-20 20:03:39,718][0m Trial 869 finished with value: 0.8379888268156425 and parameters: {'lambda_l1': 3.615983463686707e-07, 'lambda_l2': 7.7696774220524e-05, 'num_leaves': 182, 'feature_fraction': 0.5949894013852483, 'bagging_fraction': 0.8736636781238596, 'bagging_freq': 4, 'min_child_samples': 29}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:03:39,858][0m Trial 870 finished with value: 0.8044692737430168 and parameters: {'lambda_l1': 1.1334684698040813e-07, 'lambda_l2': 1.243674119645036e-06, 'num_leaves': 211, 'feature_fraction': 0.8536352648247265, 'bagging_fraction': 0.7032356523085468, 'bagging_freq': 4, 'min_child_samples': 68}. Best is trial 17 with value: 0.888268156424581.[0m
[32m[I 2021-12-20 20:03:40,012][0m Trial 871 finished with value: 0.8379888268156425 and parameters: {'lambda_l1': 3.555460463247673e-06, 'lambda_l2': 2.2518524859624645e-05, 'num_leaves': 213, 'feature_fraction': 0.8643768312069708, 'bagging_fraction': 0

[32m[I 2021-12-20 20:03:43,127][0m Trial 891 finished with value: 0.8100558659217877 and parameters: {'lambda_l1': 3.5985684143891178e-06, 'lambda_l2': 6.515414082308973e-05, 'num_leaves': 200, 'feature_fraction': 0.8551400035141034, 'bagging_fraction': 0.4803739858936273, 'bagging_freq': 2, 'min_child_samples': 32}. Best is trial 883 with value: 0.8938547486033519.[0m
[32m[I 2021-12-20 20:03:43,304][0m Trial 892 finished with value: 0.8212290502793296 and parameters: {'lambda_l1': 5.416305646999922e-07, 'lambda_l2': 4.8539932224633144e-05, 'num_leaves': 250, 'feature_fraction': 0.8733646167810598, 'bagging_fraction': 0.9380005477838985, 'bagging_freq': 4, 'min_child_samples': 30}. Best is trial 883 with value: 0.8938547486033519.[0m
[32m[I 2021-12-20 20:03:43,450][0m Trial 893 finished with value: 0.8547486033519553 and parameters: {'lambda_l1': 4.1228632744642873e-07, 'lambda_l2': 0.00013071868291082424, 'num_leaves': 193, 'feature_fraction': 0.5583626992410219, 'bagging_frac

[32m[I 2021-12-20 20:03:46,604][0m Trial 913 finished with value: 0.8491620111731844 and parameters: {'lambda_l1': 8.684105478390072e-07, 'lambda_l2': 2.3412755242383778e-05, 'num_leaves': 220, 'feature_fraction': 0.911881412640447, 'bagging_fraction': 0.8904839525844483, 'bagging_freq': 6, 'min_child_samples': 34}. Best is trial 883 with value: 0.8938547486033519.[0m
[32m[I 2021-12-20 20:03:46,742][0m Trial 914 finished with value: 0.8268156424581006 and parameters: {'lambda_l1': 1.7582327557269384e-06, 'lambda_l2': 9.10266497630376e-06, 'num_leaves': 167, 'feature_fraction': 0.6139083157592625, 'bagging_fraction': 0.42258275093971376, 'bagging_freq': 4, 'min_child_samples': 65}. Best is trial 883 with value: 0.8938547486033519.[0m
[32m[I 2021-12-20 20:03:46,911][0m Trial 915 finished with value: 0.8435754189944135 and parameters: {'lambda_l1': 7.535574529659975e-07, 'lambda_l2': 0.00021633345455412815, 'num_leaves': 186, 'feature_fraction': 0.8219478436279973, 'bagging_fracti

[32m[I 2021-12-20 20:03:50,005][0m Trial 935 finished with value: 0.8324022346368715 and parameters: {'lambda_l1': 2.564392732996956e-08, 'lambda_l2': 2.442753515993745e-05, 'num_leaves': 247, 'feature_fraction': 0.8404488939505308, 'bagging_fraction': 0.9126365961546246, 'bagging_freq': 4, 'min_child_samples': 21}. Best is trial 883 with value: 0.8938547486033519.[0m
[32m[I 2021-12-20 20:03:50,171][0m Trial 936 finished with value: 0.8324022346368715 and parameters: {'lambda_l1': 9.918905796420264e-07, 'lambda_l2': 5.871022940038232e-07, 'num_leaves': 181, 'feature_fraction': 0.8032378475147829, 'bagging_fraction': 0.9050027621834802, 'bagging_freq': 6, 'min_child_samples': 32}. Best is trial 883 with value: 0.8938547486033519.[0m
[32m[I 2021-12-20 20:03:50,317][0m Trial 937 finished with value: 0.8659217877094972 and parameters: {'lambda_l1': 2.171943608919896e-06, 'lambda_l2': 7.730804418663933e-05, 'num_leaves': 199, 'feature_fraction': 0.8464795292438796, 'bagging_fraction

[32m[I 2021-12-20 20:03:53,406][0m Trial 957 finished with value: 0.8435754189944135 and parameters: {'lambda_l1': 6.256497688051129e-08, 'lambda_l2': 3.75010784781064e-05, 'num_leaves': 229, 'feature_fraction': 0.8785852712800358, 'bagging_fraction': 0.8726518106407952, 'bagging_freq': 6, 'min_child_samples': 28}. Best is trial 883 with value: 0.8938547486033519.[0m
[32m[I 2021-12-20 20:03:53,559][0m Trial 958 finished with value: 0.8659217877094972 and parameters: {'lambda_l1': 4.313059001082937e-06, 'lambda_l2': 1.9724625305171913e-05, 'num_leaves': 213, 'feature_fraction': 0.6377344275925888, 'bagging_fraction': 0.8308658283073886, 'bagging_freq': 6, 'min_child_samples': 39}. Best is trial 883 with value: 0.8938547486033519.[0m
[32m[I 2021-12-20 20:03:53,699][0m Trial 959 finished with value: 0.8379888268156425 and parameters: {'lambda_l1': 1.332678661531028e-07, 'lambda_l2': 2.151144223231594e-05, 'num_leaves': 216, 'feature_fraction': 0.8753359660824006, 'bagging_fraction

[32m[I 2021-12-20 20:03:57,009][0m Trial 979 finished with value: 0.7821229050279329 and parameters: {'lambda_l1': 8.70302559694915e-07, 'lambda_l2': 0.0001924808872388547, 'num_leaves': 192, 'feature_fraction': 0.8342658179565876, 'bagging_fraction': 0.9339260566620124, 'bagging_freq': 6, 'min_child_samples': 29}. Best is trial 883 with value: 0.8938547486033519.[0m
[32m[I 2021-12-20 20:03:57,155][0m Trial 980 finished with value: 0.8100558659217877 and parameters: {'lambda_l1': 0.002304336643965012, 'lambda_l2': 3.475482964081446e-05, 'num_leaves': 159, 'feature_fraction': 0.712315287834934, 'bagging_fraction': 0.8860390831427523, 'bagging_freq': 4, 'min_child_samples': 87}. Best is trial 883 with value: 0.8938547486033519.[0m
[32m[I 2021-12-20 20:03:57,320][0m Trial 981 finished with value: 0.8547486033519553 and parameters: {'lambda_l1': 1.9661305469648983e-08, 'lambda_l2': 1.0690472946949061e-05, 'num_leaves': 224, 'feature_fraction': 0.8028828859617084, 'bagging_fraction'

In [63]:
# Cast to int: astype(bool) on the strings '0'/'1' would turn every label into True
train['Survived'] = train['Survived'].astype(int)

In [38]:
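# study.best_params holds only the tuned hyperparameters and no 'objective', so LightGBM
# falls back to its default regression objective; predictions can fall outside [0, 1]
gbm = lgb.train(study.best_params, lgb.Dataset(train[features], label=train['Survived']))
preds = gbm.predict(test[features])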

You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 251
[LightGBM] [Info] Number of data points in the train set: 891, number of used features: 21
[LightGBM] [Info] Start training from score 0.383838


In [58]:
preds

array([ 2.21659333e-02,  3.82187672e-01,  6.42803378e-02,  5.07251034e-01,
        5.79591610e-01,  1.18345910e-01,  3.36342599e-01,  1.22575728e-01,
        9.89043988e-01,  2.43881707e-02,  2.35722943e-04,  3.07501256e-02,
        8.81203844e-01, -3.16674225e-02,  8.44469255e-01,  9.01964762e-01,
        1.40891336e-02,  3.96641495e-01,  5.91768275e-01,  6.55683457e-01,
        9.76641670e-02,  6.42143931e-01,  1.07855649e+00,  2.70761765e-01,
        1.11442121e+00, -1.07561684e-01,  1.13918835e+00,  4.33530760e-01,
        7.23131821e-01,  3.57533320e-01,  1.41931785e-02,  2.85584602e-01,
        5.04265215e-01,  2.43546911e-01,  5.24281226e-01,  2.35121588e-01,
        8.96427546e-02,  1.54611515e-01, -1.59076991e-02,  5.87113762e-01,
        5.61627558e-02,  6.64026419e-01,  2.68481435e-01,  1.14920994e+00,
        9.87404625e-01,  1.47149225e-01,  2.69159497e-01,  1.58082097e-01,
        1.03412977e+00,  7.42811739e-01,  4.11175118e-01,  7.65222233e-02,
        9.10101990e-01,  

In [46]:
submission = pd.read_csv('data/submission.csv')

In [48]:
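# Round the raw scores, clip them to the valid labels and store them as integers
submission['Survived'] = np.clip(np.rint(preds), 0, 1).astype(int)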

In [50]:
submission.to_csv("data/submission.csv", index = False)

In [None]:
train.describe()

#### Here we already have some important information:
- Small dataset: **891** observations (some models are not appropriate for small samples).
- Low dimensionality: 12 columns (11 without the index).
  - This is before even checking for redundant columns.
  - There is not much information to work with.
- Many qualitative features.
- The numerical features need scaling.

# Data Cleaning and Transformation
In this step we transform our current dataframe into a usable one. Here we look for *missing*, *corrupted*, or *imprecise* information. Later we create new variables to improve model performance.

## Transforming categorical data

**We extracted every title with those two list comprehensions.**
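A minimal sketch of title extraction, assuming every name follows the `Surname, Title. Given names` pattern seen in the sample printed earlier. It is not necessarily the exact code used in this notebook; the helper `extract_title` and the toy `names` series are hypothetical.

```python
import re
import pandas as pd

def extract_title(name):
    """Return the text between the comma and the first period, e.g. 'Mr'."""
    match = re.search(r",\s*([^.]+)\.", name)
    return match.group(1).strip() if match else "Unknown"

# Toy sample taken from the names printed earlier in the notebook
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# One list comprehension is enough to map every name to its title
titles = [extract_title(name) for name in names]
print(titles)
```

Unlike the full name, the title repeats across passengers, so it can be encoded like any other categorical feature.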

## Encoding

In [None]:
train['Pclass'], test['Pclass'] = train['Pclass'].astype(str), test['Pclass'].astype(str)
features = [col for col in train.columns if col not in ['PassengerId', 'Survived']]
categorical_features = [col for col in train.columns if col in ['Pclass', 'Name','Sex','Ticket','Cabin','Embarked']]
numerical_features = [col for col in train.columns if col in ['Age','SibSp','Parch', 'Fare']]

In [None]:
from category_encoders import TargetEncoder
encoder = TargetEncoder()

train_target_encoding = encoder.fit_transform(train[categorical_features], train['Survived'])
test_target_encoding = encoder.transform(test[categorical_features])

In [None]:
new_train = pd.concat([train['Survived'], train[numerical_features], train_target_encoding], axis = 1)
new_test = pd.concat([test[numerical_features], test_target_encoding], axis = 1)

## Imputer

In [None]:
from sklearn.impute import KNNImputer
knni = KNNImputer()
features = [col for col in new_train.columns if col not in ['PassengerId', 'Survived']]

new_train[features] = knni.fit_transform(new_train[features])
new_test[features] = knni.transform(new_test[features])

## Scaler

In [None]:
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()

new_train[features] = std_scaler.fit_transform(new_train[features])
new_test[features] = std_scaler.transform(new_test[features])

# Model Building

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate

gb_classif = GradientBoostingClassifier()

In [None]:
gb_classif.fit(new_train[features], new_train['Survived'])
cv = cross_validate(gb_classif, new_train[features], new_train['Survived'], cv = 5)

In [None]:
cv['test_score'].mean()

In [None]:
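submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': gb_classif.predict(new_test[features])})
submission.to_csv("submission.csv", index = False)  # Kaggle expects PassengerId and Survived, without the index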

## Categorical features encoding, dealing with missing values and scaling variables
Most models can't work with categorical features directly. Here I will test two encodings: **One-Hot Encoding** and **Target Encoding**.

For replacing missing numerical values, I will test imputing the **mean** and **median** values, as well as a **linear regression** algorithm.

For replacing missing categorical values, I will test imputing the **mode**, as well as a **kNN** algorithm.

For scaling the variables, I will test both **MinMaxScaler** and **StandardScaler**.

**A class will be used so I can easily try different data preparation tools.**
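One way to swap data preparation tools without rewriting code is scikit-learn's `ColumnTransformer`. The sketch below is an assumption about how such a comparison could be organised, not the class used in this notebook; the helper `make_prep` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

def make_prep(num_cols, cat_cols, num_strategy, scaler):
    """Build a preprocessing step from interchangeable parts (hypothetical helper)."""
    numeric = Pipeline([
        ("imputer", SimpleImputer(strategy=num_strategy)),
        ("scaler", scaler),
    ])
    categorical = Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer([
        ("num", numeric, num_cols),
        ("cat", categorical, cat_cols),
    ])

# Toy data standing in for the Titanic columns
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, np.nan],
    "Sex": ["male", "female", "female", np.nan],
})

# Try every imputer/scaler combination and record the resulting shape
shapes = {}
for strategy in ("mean", "median"):
    for scaler in (StandardScaler(), MinMaxScaler()):
        prep = make_prep(["Age", "Fare"], ["Sex"], strategy, scaler)
        X = prep.fit_transform(toy)
        shapes[(strategy, type(scaler).__name__)] = X.shape
print(shapes)
```

Each combination can then be scored with the same cross-validation call, so the only thing that changes between runs is the preparation step.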

## Encoding

In [None]:




# get_dummies on a single Series ignores the `columns` argument, so it is omitted here
train_name_dummies = pd.get_dummies(train['Name'], dummy_na = True).filter(['Mr','Mrs','Miss','Master'])
test_name_dummies = pd.get_dummies(test['Name'], dummy_na = True).filter(['Mr','Mrs','Miss','Master'])

train_pclass_dummies = pd.get_dummies(train['Pclass'], dummy_na = True)
test_pclass_dummies = pd.get_dummies(test['Pclass'], dummy_na = True)

train_sex_dummies = pd.get_dummies(train['Sex'], dummy_na = True)
test_sex_dummies = pd.get_dummies(test['Sex'], dummy_na = True)

train_ticket_dummies = pd.get_dummies(train['Ticket'], dummy_na = True).filter(['NUMBER_ONLY'])
test_ticket_dummies = pd.get_dummies(test['Ticket'], dummy_na = True).filter(['NUMBER_ONLY'])

# The column added by dummy_na=True is labelled np.nan, so this filter only matches
# if missing cabins were recoded to the string 'NaN' beforehand
train_cabin_dummies = pd.get_dummies(train['Cabin'], dummy_na = True).filter(['NaN'])
test_cabin_dummies = pd.get_dummies(test['Cabin'], dummy_na = True).filter(['NaN'])

train_embarked_dummies = pd.get_dummies(train['Embarked'], dummy_na = True).filter(['Q','S','C'])
test_embarked_dummies = pd.get_dummies(test['Embarked'], dummy_na = True).filter(['Q','S','C'])

## Imputer and Scaler

In [None]:
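from sklearn.preprocessing import OneHotEncoder
isinstance(OneHotEncoder(), OneHotEncoder)  # an instance never equals its class, so `==` is always False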

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer
from category_encoders import TargetEncoder


features = [col for col in train.columns if col not in ['PassengerId', 'Survived']]
categorical_features = [col for col in train.columns if col in ['Pclass', 'Name','Sex','Ticket','Cabin','Embarked']]
numerical_features = [col for col in train.columns if col in ['Age','SibSp','Parch', 'Fare']]

class DataPrep():
    """Impute and scale the numerical features and encode the categorical ones."""

    def __init__(self, imputer, scaler, encoder):
        self.imputer = imputer
        self.scaler = scaler
        self.encoder = encoder

        # Numerical features are imputed first, then scaled
        self.pipe = Pipeline([
            ('numeric_imputer', self.imputer),
            ('scaler', self.scaler)
        ])

    def fit(self, X, y=None):
        # y is only needed by supervised encoders such as TargetEncoder
        self.pipe.fit(X[numerical_features])
        self.encoder.fit(X[categorical_features], y)
        return self

    def transform(self, X):
        new_num_X = pd.DataFrame(self.pipe.transform(X[numerical_features]), columns=numerical_features)
        new_cat_X = self.encoder.transform(X[categorical_features])
        if hasattr(new_cat_X, 'toarray'):
            # OneHotEncoder returns a sparse matrix by default
            new_cat_X = new_cat_X.toarray()
        new_cat_X = pd.DataFrame(new_cat_X)

        return pd.concat([new_cat_X.reset_index(drop=True), new_num_X.reset_index(drop=True)], axis=1)

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

In [None]:
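onehot = OneHotEncoder(handle_unknown = 'ignore')  # ignore categories not seen during fit

dataprep = DataPrep(
    imputer = SimpleImputer(strategy = 'mean'),
    scaler = StandardScaler(),
    encoder = onehot
)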

In [None]:
onehot.fit_transform(train[categorical_features]).shape
onehot.categories_

In [None]:
new_train = dataprep.fit_transform(train)
new_test = dataprep.transform(test)

In [None]:
new_train = pd.concat([train[numerical_features].reset_index(drop=True), new_train_cat_features], axis=1)
new_test = pd.concat([test[numerical_features].reset_index(drop=True), new_test_cat_features], axis=1)

In [None]:
new_train.columns

In [None]:
# Options for handling missing values:
# - Drop observations with missing values when the dataset is large and only a few records are affected
# - Drop the variable if it is not significant
# - Build a model to predict the missing values
# - Treat missing data as just another category

## Dealing with *missings*

In [None]:
train.isna().sum()
test.isna().sum()

In [None]:
import missingno as msno
msno.matrix(new_train);

### Missings

In [None]:
for col in train.columns:
    print(f"Feature: {col} {train[col].isna().sum()}")

We will have to deal with **2** main problems here: Age and Cabin. We will decide what to do with the *Embarked* missing values later.
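One common way to handle these two columns is sketched below on toy data. It is not necessarily the approach taken later in the notebook. *Cabin* becomes a flag that records whether a cabin was present, and *Age* is filled with the median while a second flag remembers which rows were imputed. The column names `HasCabin` and `AgeMissing` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Age and Cabin columns
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
})

# Cabin: keep only whether a cabin was recorded
toy["HasCabin"] = toy["Cabin"].notna().astype(int)

# Age: remember which rows were missing, then fill them with the median
toy["AgeMissing"] = toy["Age"].isna().astype(int)
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

print(toy[["Age", "AgeMissing", "HasCabin"]])
```

On the real data, the median should be computed on the training set only and then reused for the test set, so that no information leaks from the test data.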

In [None]:
from sklearn.pipeline import Pipeline

pipe = Pipeline([
        ('imputer', SimpleImputer(strategy='mean', missing_values=np.nan)),
        ("scaler", QuantileTransformer(n_quantiles=64,output_distribution='uniform')),
        ('bin', KBinsDiscretizer(n_bins=64, encode='ordinal',strategy='uniform'))
        ])