# Titanic - Machine Learning from Disaster

### 👋🛳️ Ahoy, welcome to Kaggle! You’re in the right place.

This is the legendary Titanic ML competition – the best, first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works.

The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

Read on or watch the video below to explore more details. Once you’re ready to start competing, click on the ["Join Competition" button](https://www.kaggle.com/account/login?returnUrl=%2Fc%2Ftitanic) to create an account and gain access to the [competition data](https://www.kaggle.com/c/titanic/data). Then check out [Alexis Cook’s Titanic Tutorial](https://www.kaggle.com/alexisbcook/titanic-tutorial), which walks you through making your first submission step by step!

[![img](https://storage.googleapis.com/kaggle-media/welcome/video_thumbnail.jpg)](https://www.youtube.com/watch?v=8yZMXCaFshs&feature=youtu.be)

### Data Description

The data has been split into two groups:

- training set (train.csv)
- test set (test.csv)

**The training set** should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use [feature engineering](https://triangleinequality.wordpress.com/2013/09/08/basic-feature-engineering-with-the-titanic-data/) to create new features.

**The test set** should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include **gender_submission.csv**, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
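To make the rule behind gender_submission.csv concrete, here is a minimal sketch that builds the same kind of predictions. The three-row frame is made-up stand-in data rather than rows read from test.csv, and the sketch assumes the raw `Sex` column holds the strings `male` and `female`.

```python
import pandas as pd

# Made-up stand-in for test.csv; the real file has more columns and rows.
test = pd.DataFrame(
    {"PassengerId": [892, 893, 894], "Sex": ["male", "female", "male"]}
)

# Baseline rule: predict survival for female passengers only.
baseline = pd.DataFrame(
    {
        "PassengerId": test["PassengerId"],
        "Survived": (test["Sex"] == "female").astype(int),
    }
)
print(baseline.to_csv(index=False))
```

A baseline like this is a useful reference point: any trained model should be compared against it before being trusted.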

### Data Dictionary

| **Variable** | **Definition**                             | **Key**                                        |
| :----------- | :----------------------------------------- | :--------------------------------------------- |
| survival     | Survival                                   | 0 = No, 1 = Yes                                |
| pclass       | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| sex          | Sex                                        |                                                |
| age          | Age in years                               |                                                |
| sibsp        | # of siblings / spouses aboard the Titanic |                                                |
| parch        | # of parents / children aboard the Titanic |                                                |
| ticket       | Ticket number                              |                                                |
| fare         | Passenger fare                             |                                                |
| cabin        | Cabin number                               |                                                |
| embarked     | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |

### Variable Notes

**pclass**: A proxy for socio-economic status (SES)

- 1st = Upper
- 2nd = Middle
- 3rd = Lower

**age**: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.
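The age convention can be turned into two boolean flags. This is only a sketch on made-up values: it treats any age below 1 as an infant age and any age of 1 or more ending in .5 as an estimate, which is one reading of the note above.

```python
import pandas as pd

# Made-up ages, including a missing value.
ages = pd.Series([0.42, 0.75, 22.0, 28.5, 40.5, None], name="Age")

# Below 1 the fraction is a real fraction of a year;
# from 1 upward, a trailing .5 is read as an estimated age.
is_infant = ages < 1
is_estimated = (ages >= 1) & ((ages % 1) == 0.5)

print(pd.DataFrame({"Age": ages, "infant": is_infant, "estimated": is_estimated}))
```

Missing ages compare as False in both flags, so they need to be handled separately.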

**sibsp**: The dataset defines family relations in this way:

- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)

**parch**: The dataset defines family relations in this way:

- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.
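The prepared data loaded further down contains `FamilySize` and `isAlone` columns, but the notebook does not show how they were computed. One common construction, consistent with the rows displayed there, is sketched below; treat the formulas as an assumption, and the three passengers as made-up data.

```python
import pandas as pd

# Made-up passengers: one of a couple, a solo traveller, a child with two parents.
df = pd.DataFrame({"SibSp": [1, 0, 0], "Parch": [0, 0, 2]})

# Assumed definitions (not confirmed by the source):
# family size counts the passenger plus all relatives aboard.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["isAlone"] = (df["FamilySize"] == 1).astype(int)
print(df)
```

Note that a child travelling only with a nanny would be flagged as alone under this definition, because parch=0 for them.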

Link: https://www.kaggle.com/c/titanic

In [1]:
import jupyter_black
import lightgbm as lgb
import numpy as np
import pandas as pd
from flaml import AutoML
from sklearn.model_selection import train_test_split



In [2]:
jupyter_black.load()

# Prepare

In [3]:
df = pd.read_csv("./datasets/prepared.csv", index_col="PassengerId")
df

PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,isTest,Name_FirstName,Name_Title,Name_LastName,FamilySize,isAlone
1,0.0,3,1,22.0,1,0,0,-0.503595,,2,0,29,6,150,2,0
2,1.0,1,0,38.0,1,0,12,0.734503,C85,0,0,61,7,104,2,0
3,1.0,3,0,26.0,0,0,21,-0.490544,,2,0,175,4,149,1,1
4,1.0,1,0,35.0,1,0,26,0.382925,C123,2,0,88,7,96,2,0
5,0.0,3,1,35.0,0,0,26,-0.488127,,2,0,5,6,182,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1305,,3,1,,0,0,2,-0.488127,,2,1,175,6,149,1,1
1306,,1,0,39.0,0,0,12,1.461829,C105,0,1,175,9,149,1,1
1307,,3,1,38.5,0,0,19,-0.503595,,2,1,175,6,149,1,1
1308,,3,1,,0,0,26,-0.488127,,2,1,220,6,72,1,1


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1309 entries, 1 to 1309
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Survived        891 non-null    float64
 1   Pclass          1309 non-null   int64  
 2   Sex             1309 non-null   int64  
 3   Age             1046 non-null   float64
 4   SibSp           1309 non-null   int64  
 5   Parch           1309 non-null   int64  
 6   Ticket          1309 non-null   int64  
 7   Fare            1309 non-null   float64
 8   Cabin           295 non-null    object 
 9   Embarked        1309 non-null   int64  
 10  isTest          1309 non-null   int64  
 11  Name_FirstName  1309 non-null   int64  
 12  Name_Title      1309 non-null   int64  
 13  Name_LastName   1309 non-null   int64  
 14  FamilySize      1309 non-null   int64  
 15  isAlone         1309 non-null   int64  
dtypes: float64(3), int64(12), object(1)
memory usage: 173.9+ KB


In [5]:
df_age = pd.read_csv("./datasets/prepared_age.csv", index_col="PassengerId")
df_age

PassengerId,Age,isAgePreds
1,22.000000,0
2,38.000000,0
3,26.000000,0
4,35.000000,0
5,35.000000,0
...,...,...
1305,39.934307,1
1306,39.000000,0
1307,38.500000,0
1308,29.646640,1
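The `isAgePreds` flag appears to mark rows whose missing age was filled in with a predicted value. How those predictions were produced is not shown here, so the sketch below swaps in a plain median fill as a stand-in for the model; the four rows are made up.

```python
import pandas as pd

ages = pd.DataFrame(
    {"Age": [22.0, None, 38.5, None]},
    index=pd.Index([1, 2, 3, 4], name="PassengerId"),
)

# Flag the rows that need a filled-in value before overwriting them.
ages["isAgePreds"] = ages["Age"].isna().astype(int)

# Stand-in for model predictions: the median of the known ages.
ages["Age"] = ages["Age"].fillna(ages["Age"].median())
print(ages)
```

Keeping the flag lets a later model distinguish observed ages from filled-in ones, even if the flag is dropped before training as it is in this notebook.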


In [6]:
df_cabin = pd.read_csv("./datasets/prepared_cabin.csv", index_col="PassengerId")
df_cabin

PassengerId,Cabin,isCabinPreds
1,5,1
2,2,0
3,4,1
4,2,0
5,5,1
...,...,...
1305,5,1
1306,2,0
1307,5,1
1308,5,1


In [7]:
df1 = (
    df.drop(["Age", "Cabin"], axis=1)
    .join([df_age, df_cabin])
    .drop(["isAgePreds", "isCabinPreds"], axis=1)
    .copy()
)
df1

PassengerId,Survived,Pclass,Sex,SibSp,Parch,Ticket,Fare,Embarked,isTest,Name_FirstName,Name_Title,Name_LastName,FamilySize,isAlone,Age,Cabin
1,0.0,3,1,1,0,0,-0.503595,2,0,29,6,150,2,0,22.000000,5
2,1.0,1,0,1,0,12,0.734503,0,0,61,7,104,2,0,38.000000,2
3,1.0,3,0,0,0,21,-0.490544,2,0,175,4,149,1,1,26.000000,4
4,1.0,1,0,1,0,26,0.382925,2,0,88,7,96,2,0,35.000000,2
5,0.0,3,1,0,0,26,-0.488127,2,0,5,6,182,1,1,35.000000,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1305,,3,1,0,0,2,-0.488127,2,1,175,6,149,1,1,39.934307,5
1306,,1,0,0,0,12,1.461829,0,1,175,9,149,1,1,39.000000,2
1307,,3,1,0,0,19,-0.503595,2,1,175,6,149,1,1,38.500000,5
1308,,3,1,0,0,26,-0.488127,2,1,220,6,72,1,1,29.646640,5
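The join in the cell above can be reproduced on a small scale. The sketch below uses three made-up passengers and the same sequence of steps: drop the incomplete column, join the completed columns on the shared `PassengerId` index, then drop the flag columns.

```python
import pandas as pd

idx = pd.Index([1, 2, 3], name="PassengerId")
base = pd.DataFrame({"Pclass": [3, 1, 3], "Age": [22.0, None, 26.0]}, index=idx)
age = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "isAgePreds": [0, 1, 0]}, index=idx)
cabin = pd.DataFrame({"Cabin": [5, 2, 4], "isCabinPreds": [1, 0, 1]}, index=idx)

# Drop the incomplete column first so the joined frames do not clash on names,
# then align everything on the shared PassengerId index.
merged = base.drop(columns=["Age"]).join([age, cabin])
merged = merged.drop(columns=["isAgePreds", "isCabinPreds"])
print(merged)
```

Dropping `Age` before the join matters: joining a list of frames does not accept suffixes, so overlapping column names would have to be resolved by hand.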


In [8]:
y_train = df1[df1["isTest"] == 0][["Survived"]].astype(int)
y_train

PassengerId,Survived
1,0
2,1
3,1
4,1
5,0
...,...
887,0
888,1
889,0
890,1


In [9]:
X = df1.drop(["Survived", "Age", "Cabin"], axis=1)
X

PassengerId,Pclass,Sex,SibSp,Parch,Ticket,Fare,Embarked,isTest,Name_FirstName,Name_Title,Name_LastName,FamilySize,isAlone
1,3,1,1,0,0,-0.503595,2,0,29,6,150,2,0
2,1,0,1,0,12,0.734503,0,0,61,7,104,2,0
3,3,0,0,0,21,-0.490544,2,0,175,4,149,1,1
4,1,0,1,0,26,0.382925,2,0,88,7,96,2,0
5,3,1,0,0,26,-0.488127,2,0,5,6,182,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1305,3,1,0,0,2,-0.488127,2,1,175,6,149,1,1
1306,1,0,0,0,12,1.461829,0,1,175,9,149,1,1
1307,3,1,0,0,19,-0.503595,2,1,175,6,149,1,1
1308,3,1,0,0,26,-0.488127,2,1,220,6,72,1,1


In [10]:
X_train = X[X["isTest"] == 0].drop("isTest", axis=1)
X_test = X[X["isTest"] == 1].drop("isTest", axis=1)
X_train.shape, X_test.shape

((891, 12), (418, 12))
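The `isTest` flag suggests that the training and test files were stacked into one frame during preparation and are separated again here. A minimal sketch of that round trip, on made-up rows, might look like this; the stacking step is an assumption, since the preparation code is not part of this notebook.

```python
import pandas as pd

train = pd.DataFrame({"Pclass": [3, 1], "Survived": [0, 1]}, index=[1, 2])
test = pd.DataFrame({"Pclass": [2]}, index=[892])

# Stack both sets so preprocessing sees every passenger, remembering the origin.
full = pd.concat([train.assign(isTest=0), test.assign(isTest=1)])

# Split back on the flag; the flag itself is not a feature, so drop it.
X = full.drop(columns=["Survived"])
X_train = X[X["isTest"] == 0].drop(columns=["isTest"])
X_test = X[X["isTest"] == 1].drop(columns=["isTest"])
print(X_train.shape, X_test.shape)
```

Stacking this way leaves `Survived` missing for the test rows, which would also explain why that column is stored as float64 with only 891 non-null values.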

In [11]:
# Optional extra hold-out split, disabled in this run:
# X_train, X_true, y_train, y_true = train_test_split(
#     X_train, y_train, test_size=0.1, random_state=42
# )
# X_train.shape, X_true.shape, y_train.shape, y_true.shape

# Train

In [12]:
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train, y_train, test_size=0.1, random_state=42
)
X_train.shape, X_valid.shape, y_train.shape, y_valid.shape

((801, 12), (90, 12), (801, 1), (90, 1))
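Holding out 10% of 891 rows gives the 801/90 split shown above. As a variation that the notebook does not use, the sketch below adds `stratify` so that both parts keep the same class ratio; the arrays are synthetic stand-ins with the same number of rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the training set.
X_demo = np.arange(891 * 2).reshape(891, 2)
y_demo = np.array([0, 1, 1] * 297)  # made-up labels, two thirds positive

# Variation on the cell above: stratify keeps the class ratio in both parts.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42, stratify=y_demo
)
print(X_tr.shape, X_va.shape, y_tr.mean(), y_va.mean())
```

With only 90 validation rows, an unstratified split can shift the survival rate noticeably, which makes early-stopping decisions noisier.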

In [13]:
train_data = lgb.Dataset(X_train, y_train)
valid_data = train_data.create_valid(X_valid, y_valid)

train_data, valid_data

(<lightgbm.basic.Dataset at 0x7f0950700090>,
 <lightgbm.basic.Dataset at 0x7f0950a7d450>)

## FLAML

In [14]:
automl = AutoML()
automl.fit(X_train, y_train["Survived"], task="classification", estimator_list=["lgbm"])

[flaml.automl.logger: 12-17 19:36:01] {1679} INFO - task = classification
[flaml.automl.logger: 12-17 19:36:01] {1690} INFO - Evaluation method: cv
[flaml.automl.logger: 12-17 19:36:01] {1788} INFO - Minimizing error metric: 1-roc_auc


INFO:flaml.default.suggest:metafeature distance: 0.06161343708140018


[flaml.automl.logger: 12-17 19:36:01] {1900} INFO - List of ML learners in AutoML Run: ['lgbm']
[flaml.automl.logger: 12-17 19:36:01] {2627} INFO - retrain lgbm for 0.3s
[flaml.automl.logger: 12-17 19:36:01] {2630} INFO - retrained model: LGBMClassifier(colsample_bytree=0.37915528071680865,
               learning_rate=0.02070742242160566, max_bin=15,
               min_child_samples=8, n_estimators=1, n_jobs=-1, num_leaves=1208,
               reg_alpha=0.002982599447751338, reg_lambda=1.136605174453919,
               verbose=-1)
[flaml.automl.logger: 12-17 19:36:01] {1930} INFO - fit succeeded
[flaml.automl.logger: 12-17 19:36:01] {1931} INFO - Time taken to find the best model: 0


In [15]:
automl.model.params

{'n_estimators': 362,
 'num_leaves': 1208,
 'min_child_samples': 8,
 'learning_rate': 0.02070742242160566,
 'colsample_bytree': 0.37915528071680865,
 'reg_alpha': 0.002982599447751338,
 'reg_lambda': 1.136605174453919,
 'n_jobs': -1,
 'max_bin': 15,
 'verbose': -1}

## LightGBM

In [16]:
params = {"objective": "binary"} | automl.model.params
params

{'objective': 'binary',
 'n_estimators': 362,
 'num_leaves': 1208,
 'min_child_samples': 8,
 'learning_rate': 0.02070742242160566,
 'colsample_bytree': 0.37915528071680865,
 'reg_alpha': 0.002982599447751338,
 'reg_lambda': 1.136605174453919,
 'n_jobs': -1,
 'max_bin': 15,
 'verbose': -1}
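The `|` operator used above merges two dictionaries, and on a key conflict the right-hand operand wins. The sketch below shows this with hypothetical values standing in for `automl.model.params`.

```python
# Hypothetical tuned parameters standing in for automl.model.params.
tuned = {"n_estimators": 362, "learning_rate": 0.02}

# With `|`, keys from the right operand win on conflict (Python 3.9+).
params = {"objective": "binary", "learning_rate": 0.1} | tuned
print(params)

# Equivalent spelling that also works on older Python versions.
params_compat = {**{"objective": "binary", "learning_rate": 0.1}, **tuned}
```

Putting the defaults on the left therefore lets the tuned values override them, while keys such as `objective` that the tuned dictionary lacks are kept.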

In [17]:
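# https://lightgbm.readthedocs.io/en/stable/Python-Intro.html
# https://www.kdnuggets.com/2023/07/lgbmclassifier-gettingstarted-guide.html
# Note: params contains "n_estimators" (362), an alias of num_iterations, which
# takes precedence over num_boost_round=50; hence a best iteration above 50.
bst = lgb.train(
    params,
    train_data,
    num_boost_round=50,
    valid_sets=[valid_data],
    callbacks=[lgb.early_stopping(stopping_rounds=5)],
)
bst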

Training until validation scores don't improve for 5 rounds
Early stopping, best iteration is:
[176]	valid_0's binary_logloss: 0.403917




<lightgbm.basic.Booster at 0x7f0950700b10>
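The log above reports a best iteration chosen by early stopping. To make the patience rule concrete, here is a pure-Python sketch of the general idea. It is not LightGBM's implementation, the helper name `best_iteration_with_patience` is hypothetical, and the loss values are made up.

```python
def best_iteration_with_patience(losses, patience):
    """Return (best_round, stopped_round), both 1-based, for a loss sequence."""
    best_loss = float("inf")
    best_round = 0
    for round_no, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss, best_round = loss, round_no
        elif round_no - best_round >= patience:
            # No improvement for `patience` consecutive rounds: stop here.
            return best_round, round_no
    return best_round, len(losses)


# Made-up validation losses: improvement stalls after round 3.
losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.46, 0.48, 0.49, 0.40]
print(best_iteration_with_patience(losses, patience=5))
```

In this example training stops at round 8 and never sees the lower loss at round 9, which is the usual trade-off of a small patience value such as 5.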

# Predict

In [18]:
# y_preds = np.round(bst.predict(X_true))
# y_preds

In [19]:
# (y_true["Survived"] == y_preds).sum() / len(y_true)
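The commented-out cells round predicted probabilities and compute accuracy as the share of matching labels. The sketch below reproduces both steps on made-up numbers and shows one subtlety: `np.round` rounds halves to the nearest even value, so a probability of exactly 0.5 becomes 0.

```python
import numpy as np

# Made-up predicted probabilities and true labels.
proba = np.array([0.91, 0.12, 0.50, 0.67, 0.08])
y_true = np.array([1, 0, 1, 1, 0])

# np.round rounds halves to the nearest even value, so 0.5 becomes 0.
rounded = np.round(proba).astype(int)

# An explicit threshold makes the tie-breaking choice visible.
thresholded = (proba >= 0.5).astype(int)

accuracy = (thresholded == y_true).mean()
print(rounded, thresholded, accuracy)
```

Exact ties are rare in practice, but an explicit threshold also makes it easy to try cut-offs other than 0.5.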

In [20]:
# y_preds_1 = np.round(bst.predict(X_test))
y_preds_1 = automl.predict(X_test)
y_preds_1

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,
       0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0,

# Submission

In [21]:
submission = pd.DataFrame(
    {"PassengerId": X_test.index, "Survived": y_preds_1.astype(int)}
).set_index("PassengerId")
submission

PassengerId,Survived
892,0
893,1
894,0
895,0
896,1
...,...
1305,0
1306,1
1307,0
1308,0


In [22]:
submission.to_csv("./datasets/submission.csv")
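Before uploading, it can help to sanity-check the file. The sketch below is one possible check, with a hypothetical helper named `check_submission`; the expected row count of 418 for the real file is taken from the shape of `X_test` shown earlier.

```python
import io

import pandas as pd


def check_submission(csv_text, expected_rows):
    """Raise AssertionError if the CSV does not look like a valid submission."""
    sub = pd.read_csv(io.StringIO(csv_text))
    assert list(sub.columns) == ["PassengerId", "Survived"], "unexpected columns"
    assert len(sub) == expected_rows, "unexpected number of rows"
    assert sub["PassengerId"].is_unique, "duplicate passenger ids"
    assert set(sub["Survived"]) <= {0, 1}, "predictions must be 0 or 1"
    return sub


# Tiny made-up example; for the real file, pass its text and expected_rows=418.
demo = "PassengerId,Survived\n892,0\n893,1\n894,0\n"
print(check_submission(demo, expected_rows=3).shape)
```

The check for integer 0/1 values is worth keeping, since predictions cast from floats can otherwise end up written as `0.0` and `1.0`.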