# Titanic Kaggle challenge as seen by Xingu

The objective is to create a training pipeline using the Xingu framework. First we'll do a small exploratory data analysis, then use its conclusions to write a DataProvider, and finally train a model and use it for batch predictions.

## EDA

In [None]:
import numpy
import pandas
import sklearn.preprocessing

In [None]:
df=pandas.read_csv('data/train.csv')
df

In [None]:
for c in df.columns:
    print(c)

In [None]:
df.Age.plot.hist()

In [None]:
age_ranges     = [0,    15,      50,      100,      numpy.inf]
age_categories = [   1,       2,       3,       4            ]
# 1=child, 2=adult, 3=elder, 4=unknown
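To see how these bins behave before they are applied to the real data, here is a minimal sketch on a handful of toy ages (the values are made up for illustration and are not taken from `train.csv`). It assumes missing ages are replaced by `numpy.inf`, so they land in the last interval and get category 4.

```python
import numpy
import pandas

age_ranges     = [0, 15, 50, 100, numpy.inf]
age_categories = [1, 2, 3, 4]

# Toy ages for illustration only; None stands for a missing value
ages = pandas.Series([4, 15, 16, 50, 71, None], dtype=float)

age_encoded = pandas.cut(
    x=ages.fillna(numpy.inf),  # missing ages become +inf
    bins=age_ranges,           # right-closed intervals: (0, 15], (15, 50], ...
    labels=age_categories
)

print(age_encoded.tolist())
# → [1, 1, 2, 2, 3, 4]
```

Because the intervals are right-closed, ages 15 and 50 stay in the lower category. An age of exactly 0 would fall outside the first interval unless `include_lowest=True` is passed to `pandas.cut`.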

In [None]:
encode=['Sex','Embarked']
encoders={
    col: sklearn.preprocessing.OrdinalEncoder().fit(df[[col]].dropna())
    for col in encode
}

(
    df
    .set_index('PassengerId')
    .dropna(subset=encode)
    .assign(
        Sex_encoded      = lambda table: encoders['Sex'].transform(table[['Sex']]).astype(int),
        Embarked_encoded = lambda table: encoders['Embarked'].transform(table[['Embarked']]).astype(int),
        Age_encoded      = lambda table: pandas.cut(
            x=table.Age.fillna(numpy.inf),
            bins=age_ranges,
            labels=age_categories
        )
    )
)

In [None]:
x_features="""
    Pclass
    Sex
    Age
    SibSp
    Parch
    Fare
    Embarked
"""

x_estimator_features="""
    Pclass
    Age_encoded
    SibSp
    Parch
    Fare
    Sex_encoded
    Embarked_encoded
"""


In [None]:
df.Embarked.value_counts(dropna=False)

In [None]:
(
    df
    .pipe(
        lambda table: table.join(table.sample(frac=0.2,random_state=42).assign(split='test').split)
    )
    .assign(
        split=lambda table: table.split.fillna('train')
    )
    .split.value_counts()
)

## Train with Xingu

We used the conclusions above to write a `xingu.DataProvider` in the class `DPTitanicSurvivor`. We've also implemented a fairly advanced `xingu.Estimator` that uses scikit-learn's `StratifiedKFold` together with Optuna, and its genetic algorithms, to optimize an ensemble of 3 XGBoost classifiers.

Let's start by installing and configuring Xingu.
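The actual Estimator is not shown in this notebook, so the following is only a rough sketch of one way a 3-member ensemble can be built on stratified folds. Everything in it is an assumption made for illustration: the data is synthetic, scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so that the sketch depends only on scikit-learn, the members are combined by averaging probabilities, and the Optuna search is left out.

```python
import numpy
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold

# Synthetic binary classification data, only for illustration
X, y = make_classification(n_samples=300, n_features=7, random_state=42)

# One ensemble member per stratified fold (3 folds, so 3 members)
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
members = []
fold_scores = []
for train_idx, valid_idx in folds.split(X, y):
    # Stand-in for an XGBoost classifier (assumption, see text above)
    member = GradientBoostingClassifier(n_estimators=50, random_state=42)
    member.fit(X[train_idx], y[train_idx])
    members.append(member)
    # Accuracy on the fold that this member did not see
    fold_scores.append(member.score(X[valid_idx], y[valid_idx]))

# Soft voting: average the positive-class probability across members
proba = numpy.mean([m.predict_proba(X)[:, 1] for m in members], axis=0)
prediction = (proba >= 0.5).astype(int)

print(len(members), numpy.round(fold_scores, 3))
```

In a setup like this, the per-fold validation scores are the kind of quantity a hyperparameter optimizer would try to improve.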

In [None]:
!pip install -U xingu xgboost optuna

In [None]:
# Remove artifacts from previous runs; -f ignores files that do not exist
!rm -f models/* plots/* xingu.db

In [None]:
import os
import logging
import xingu

# Configure logging for Xingu
logger=logging.getLogger('xingu')
FORMATTER = logging.Formatter("%(asctime)s|%(levelname)s|%(name)s|%(message)s")
HANDLER = logging.StreamHandler()
HANDLER.setFormatter(FORMATTER)
logger.addHandler(HANDLER)
logger.setLevel(logging.DEBUG)

os.environ.update(
    dict(
        HYPEROPT_STRATEGY     = 'dp',
        BATCH_PREDICT         = 'true',
        BATCH_PREDICT_SAVE_ESTIMATIONS = 'true',
        DATAPROVIDER_FOLDER   = 'dataproviders',
        TRAINED_MODELS_PATH   = 'models',
        SAVE_SETS             = 'true',
        PLOTS_PATH            = 'plots',
        PLOTS_FORMAT          = 'png,svg',
        XINGU_DB_URL          = "sqlite:///xingu.db?check_same_thread=False",
        # DATASOURCE_CACHE_PATH = 's3://pan-dl-prd-sdbx-user-modcredito-sens/xingu-datasource-cache/',
        # DATABASES="dl-modcredito-sens|awsathena+rest://athena.us-east-1.amazonaws.com:443/db_pan_dl_sdbx_user_modcredito_sens?work_group=wg_dl_sdbx_user_modcredito_sens&compression=snappy",
        POST_PROCESS          = 'true',
        DEBUG                 = 'true',
    )
)

Train the model.

In [None]:
coach=xingu.Coach(xingu.DataProviderFactory())

In [None]:
coach.team_train()

Get the parameters of a specific optimization trial, here trial 933.

In [None]:
import optuna

storage="sqlite:///models/optimizer.db"

optuna.study.get_all_study_names(storage)

In [None]:
lovely_trial=933

optuna.load_study(
    study_name='titanic • warm-gatekeeper',
    storage=storage
).trials[lovely_trial]

The resulting parameters should now be added to the DataProvider's `estimator_hyperparams`.
Then retrain the model with `HYPEROPT_STRATEGY="dp"`.

## Predict in Batch
Now get the trained model and reuse the DataProvider's methods for data cleaning and feature engineering.

In [None]:
model=coach.trained['titanic']

# Following line is here just to force use of cached parquet, if available
model.context='batch_predict'

# Get the DP's batch predict SQL queries
dict_with_queries     = model.dp.get_dataset_sources_for_batch_predict()

# Use queries to get multiple DataFrames
dict_with_dataframes  = model.data_sources_to_data(dict_with_queries)

# Integrate into one DataFrame and apply logic to clean data
df                    = model.dp.clean_data_for_batch_predict(dict_with_dataframes)

# Feature engineering
df                    = model.dp.feature_engineering_for_batch_predict(df)

# Resulting DataFrame used for batch predict
df

Now predict and prepare for submission to Kaggle.

Instead of the `.predict()` method, you can also use `.predict_proba()`, which returns the probability of each class.

In [None]:
(
    model
    .predict(df)
    .rename(columns=dict(estimation=model.dp.get_target()))
    .to_csv('data/submission.csv')
)
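Kaggle's Titanic competition expects a CSV with a `PassengerId` column and a `Survived` column holding 0 or 1. The sketch below checks that shape on toy predictions. It assumes that the prediction DataFrame is indexed by `PassengerId`, carries an `estimation` column as in the cell above, and that the target name is `Survived`. The values are made up and are not model output.

```python
import io
import pandas

# Toy predictions indexed by PassengerId (made-up values, not model output)
predictions = pandas.DataFrame(
    dict(estimation=[0, 1, 0]),
    index=pandas.Index([892, 893, 894], name='PassengerId')
)

# Rename the prediction column to the target name and force integer labels
submission = (
    predictions
    .rename(columns=dict(estimation='Survived'))
    .astype(int)
)

# Write to an in-memory buffer instead of data/submission.csv
buffer = io.StringIO()
submission.to_csv(buffer)

print(buffer.getvalue())
```

The first line of the output is the header `PassengerId,Survived`, which is worth checking in the real file before uploading it.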