## [Setup](#Setup_) 
## [Imports](#Imports_)
## [Catboost](#Catboost_)
## [Autosklearn](#Autosklearn_)

## Todo:
- [ ] Try auto-sklearn
- [ ] Try AutoGluon

## Setup <span id=Setup_></span>

In [10]:
COMPETITION = "titanic"

import glob
import os
import subprocess

%load_ext nb_black


def mkdir(path, error_if_exists=False):
    !mkdir {"-p" if not error_if_exists else ""} {path}


def kaggle_competitions_search(search_term):
    !kaggle competitions list -s {search_term}


def kaggle_competitions_files(competition):
    !kaggle competitions files {competition}


def kaggle_competitions_download(competition, save_path="data", filename=None):
    mkdir(save_path)
    !kaggle competitions download -p {save_path} {"-f " + filename if filename else ""} {competition}


def kaggle_competitions_submit(competition, filename, message="submit"):
    !kaggle competitions submit -f {filename} -m {message} {competition}


def kaggle_competitions_submissions(competition):
    !kaggle competitions submissions {competition}


def unzip(zip_path, save_path=None, delete_zip=False):
    !unzip {zip_path} {"-d "+ save_path if save_path else ""}
    if delete_zip:
        for path in glob.glob(zip_path):
            if path.endswith(".zip"):
                !trash {path}


def make_new_markdown_section_with_link(section, header="##"):
    section_id = section.replace(" ", "_") + "_"
    section_link = f"{header} [{section}](#{section_id})"
    section_header = f"{header} {section} <span id={section_id}></span>"
    section = section_link + "\n" + section_header
    print(section)

The nb_black extension is already loaded. To reload it, use:
  %reload_ext nb_black
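
The helpers above interpolate their arguments straight into `!` shell commands, so a submission message containing spaces is split into several arguments unless it is quoted by hand. A minimal sketch of one alternative, assuming the same `kaggle competitions submit -f ... -m ...` invocation used in `kaggle_competitions_submit`: build the command as a list of arguments and hand it to `subprocess.run`, which bypasses the shell. The names `build_submit_command` and `submit_with_subprocess` are hypothetical and not part of this notebook.

```python
import shlex
import subprocess


def build_submit_command(competition, filename, message="submit"):
    """Return the kaggle CLI submit command as a list of arguments."""
    # Each list element reaches the CLI as exactly one argument,
    # so the message may contain spaces without extra quoting.
    return ["kaggle", "competitions", "submit", "-f", filename, "-m", message, competition]


def submit_with_subprocess(competition, filename, message="submit"):
    """Run the submit command without a shell (hypothetical helper)."""
    command = build_submit_command(competition, filename, message)
    # check=True raises CalledProcessError if the CLI exits with a non-zero status.
    return subprocess.run(command, check=True)


command = build_submit_command("titanic", "predictions.csv", "catboost with text features")
# shlex.join shows the equivalent, safely quoted shell command line.
print(shlex.join(command))
```

Only `build_submit_command` runs here; calling `submit_with_subprocess` would perform a real submission, so it is left uncalled.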



In [112]:
kaggle_competitions_search(COMPETITION)

ref                                                    deadline             category            reward  teamCount  userHasEntered  
-----------------------------------------------------  -------------------  ---------------  ---------  ---------  --------------  
https://www.kaggle.com/competitions/titanic            2030-01-01 00:00:00  Getting Started  Knowledge      15322           False  
https://www.kaggle.com/competitions/spaceship-titanic  2030-01-01 00:00:00  Getting Started  Knowledge       2369           False  



In [108]:
kaggle_competitions_download(COMPETITION)

Downloading titanic.zip to data
  0%|                                               | 0.00/34.1k [00:00<?, ?B/s]
100%|███████████████████████████████████████| 34.1k/34.1k [00:00<00:00, 744kB/s]



In [109]:
unzip("data/*", "data", delete_zip=True)

Archive:  data/titanic.zip
  inflating: data/gender_submission.csv  
  inflating: data/test.csv           
  inflating: data/train.csv          



## Imports <span id=Imports_></span>

In [1]:
from autosklearn import classification
import catboost
import functools
import numpy as np
import pandas as pd
import kaggle
import torch

In [None]:
@functools.cache
def get_df_train():
    return pd.read_csv("data/train.csv")


@functools.cache
def get_df_test():
    return pd.read_csv("data/test.csv")


def get_X_train_y_train():
    return get_df_train().drop("Survived", axis="columns"), get_df_train()[["Survived"]]


def select_numeric_dtypes(df):
    return df.select_dtypes(include=np.number)

def build_predictions_df(passenger_ids, predictions):
    return pd.DataFrame(
        dict(Survived=predictions), index=pd.Index(passenger_ids, name="PassengerId")
    )
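
Because `get_df_train` and `get_df_test` are wrapped in `functools.cache`, every call returns the same object, so any in-place modification persists for the rest of the session. The sketch below illustrates this with a plain list of dicts standing in for a DataFrame; `get_rows` is a hypothetical stand-in, not a function from this notebook. For a real DataFrame, calling `df.copy()` before modifying it plays the role of the copy shown at the end.

```python
import functools


@functools.cache
def get_rows():
    """Stand-in for get_df_train(): builds the data once, then returns the cached object."""
    return [{"PassengerId": 1, "Pclass": 3}]


first = get_rows()
second = get_rows()
# Both names refer to the same cached object.
print(first is second)

# An in-place change through one reference is visible on every later call.
first[0]["Pclass"] = "3"
print(get_rows()[0]["Pclass"] == "3")

# Copying before modifying leaves the cached data untouched.
safe = [dict(row) for row in get_rows()]
safe[0]["PassengerId"] = "1"
print(get_rows()[0]["PassengerId"])
```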

## Catboost <span id=Catboost_></span>

In [2]:
def catboost_preprocess_dataframe(df, cat_features):
    # Work on a copy so the cached DataFrames are not mutated between calls.
    df = df.copy()
    # CatBoost expects categorical features as integers or strings.
    df[cat_features] = df[cat_features].astype(str)
    return df


def catboost_predict(verbose=100):
    catboost_cls = catboost.CatBoostClassifier()
    X_train, y_train = get_X_train_y_train()
    cat_features = list(X_train.columns.drop(["Age", "Fare", "Name"]))
    text_features = ["Name"]

    catboost_cls.fit(
        catboost_preprocess_dataframe(X_train, cat_features),
        y_train,
        cat_features=cat_features,
        text_features=text_features,
        verbose=verbose,
    )

    df_test = get_df_test()

    predictions = catboost_cls.predict(
        catboost_preprocess_dataframe(df_test, cat_features)
    )

    return build_predictions_df(df_test["PassengerId"], predictions)


def catboost_submit():
    catboost_predict().to_csv("predictions.csv")
    kaggle_competitions_submit(COMPETITION, "predictions.csv", message="catboost")
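
`build_predictions_df(...).to_csv("predictions.csv")` writes the `PassengerId` index as the first column and `Survived` as the second, the same layout as `data/gender_submission.csv`. A minimal sketch of a check that could run before uploading, using only the standard library: it verifies the header and that every label is `0` or `1`. The function name `check_submission_csv` and the sample rows are hypothetical.

```python
import csv
import io


def check_submission_csv(text):
    """Return a list of problems found in submission CSV text (empty if none)."""
    rows = list(csv.reader(io.StringIO(text)))
    problems = []
    if not rows or rows[0] != ["PassengerId", "Survived"]:
        problems.append("header must be PassengerId,Survived")
    for line_number, row in enumerate(rows[1:], start=2):
        # Each data row needs exactly two fields and an integer 0/1 label.
        if len(row) != 2 or row[1] not in {"0", "1"}:
            problems.append(f"line {line_number}: expected an id and a 0/1 label")
    return problems


good = "PassengerId,Survived\n892,0\n893,1\n"
bad = "PassengerId,Survived\n892,0.0\n"  # float labels are flagged
print(check_submission_csv(good))
print(check_submission_csv(bad))
```

To check a real file, pass its contents, for example `check_submission_csv(open("predictions.csv").read())`.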

## Autosklearn <span id=Autosklearn_></span>

In [3]:
def autosklearn_submit():
    """todo: add cat, text features"""
    X_train, y_train = get_X_train_y_train()
    auto_sklearn_cls = classification.AutoSklearnClassifier(memory_limit=20_000)
    auto_sklearn_cls.fit(
        select_numeric_dtypes(X_train).values,
        select_numeric_dtypes(y_train).values,
    )

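    # df_test was never defined in this function; load it from the cached helper.
    df_test = get_df_test()
    predictions = auto_sklearn_cls.predict(select_numeric_dtypes(df_test).values)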

    build_predictions_df(df_test["PassengerId"], predictions).to_csv("predictions.csv")
    kaggle_competitions_submit(
        COMPETITION, "predictions.csv", message="'autosklearn only numeric features'"
    )

In [47]:
catboost_submit()

Learning rate set to 0.009807
0:	learn: 0.6868800	total: 104ms	remaining: 1m 44s
100:	learn: 0.4530608	total: 1.05s	remaining: 9.39s
200:	learn: 0.3909539	total: 2.04s	remaining: 8.1s
300:	learn: 0.3646291	total: 2.96s	remaining: 6.87s
400:	learn: 0.3501591	total: 3.78s	remaining: 5.64s
500:	learn: 0.3381096	total: 4.59s	remaining: 4.58s
600:	learn: 0.3283102	total: 5.46s	remaining: 3.62s
700:	learn: 0.3170069	total: 6.43s	remaining: 2.74s
800:	learn: 0.3080718	total: 7.35s	remaining: 1.82s
900:	learn: 0.2980580	total: 8.38s	remaining: 921ms
999:	learn: 0.2887910	total: 9.43s	remaining: 0us



In [49]:
autosklearn_submit()

[ERROR] [2023-03-28 15:26:24,088:Client-AutoML(1):e5569b63-cd7c-11ed-90d9-33a86f954b57] (" Dummy prediction failed with run state StatusType.MEMOUT and additional output: {'error': 'Memout (used more than 20000 MB).', 'configuration_origin': 'DUMMY'}.",)
Traceback (most recent call last):
  File "/home/dkkoshman/YSDA/python3.10/lib/python3.10/site-packages/autosklearn/automl.py", line 765, in fit
    self._do_dummy_prediction()
  File "/home/dkkoshman/YSDA/python3.10/lib/python3.10/site-packages/autosklearn/automl.py", line 489, in _do_dummy_prediction
    raise ValueError(msg)
ValueError: (" Dummy prediction failed with run state StatusType.MEMOUT and additional output: {'error': 'Memout (used more than 20000 MB).', 'configuration_ori

ValueError: (" Dummy prediction failed with run state StatusType.MEMOUT and additional output: {'error': 'Memout (used more than 20000 MB).', 'configuration_origin': 'DUMMY'}.",)


In [8]:
kaggle_competitions_submissions(COMPETITION)

fileName         date                 description                        status    publicScore  privateScore  
---------------  -------------------  ---------------------------------  --------  -----------  ------------  
predictions.csv  2023-03-28 15:18:17  autosklearn only numeric features  complete  0.67703                    
predictions.csv  2023-03-28 14:59:39  catboost                           complete  0.79186                    
predictions.csv  2023-03-28 14:48:47  catboost                           complete  0.79186                    
predictions.csv  2023-03-28 14:42:26  submit                             complete  0.78947                    

