# Notebook History

2021-10-13:
- Added stacking with sklearn and mlxtend
- Moved Watermark at the end of the notebook

2021-10-11:
- Initial version.

# Open tasks
- TODO: Add LogisticRegression classifier
- TODO: Add better function to submit best baseline results (stacked or non-stacked)
- TODO: Write function to write scores into a dataframe/csv-file for better documentation and tracking
- TODO: Refactor write_scores_to_json() or replace it with df/csv-file approach

# Purpose

The objective of **stage two** of my machine learning process is to calculate the **baseline score of un-optimized ML algorithms** as a baseline for future optimization, including feature selection/reduction, feature normalization/transformation, and hyper-parameters tuning. The workflow will use the optimized dataframes (reduced storage and memory usage). For this competition, we are focusing on the **ROC AUC score**.

# Setup Environment

## Import Basic Modules

In [1]:
# The rest of the modules are loaded when required.
# To ensure a standalone character (for easier reusability).
import joblib
import pandas as pd


## Define Parameters

In [41]:
import os
import configparser

# Load external config file
config = configparser.ConfigParser()
config.read("../resources/config.ini")

PATH_DATA_RAW = config["PATHS"]["PATH_DATA_RAW"]
PATH_DATA_INT = config["PATHS"]["PATH_DATA_INT"]
PATH_DATA_PRO = config["PATHS"]["PATH_DATA_PRO"]
PATH_REPORTS = config["PATHS"]["PATH_REPORTS"]
PATH_MODELS = config["PATHS"]["PATH_MODELS"]
PATH_SUB = config["PATHS"]["PATH_SUB"]

# Telegram Bot
token = config["TELEGRAM"]["token"]
chat_id = config["TELEGRAM"]["chat_id"]
FILENAME_NB = "02_baseline_models" # for Telegram messages

# Set global randome state
rnd_state = 42

# Define available cpu cores
n_cpu = os.cpu_count()
print("Number of CPUs used:", n_cpu)


Number of CPUs used: 8


## Global Functions

In [3]:
import urllib, requests #for Telegram notifications

def send_telegram_message(message):
    """Sending messages to Telegram bot via requests.get()."""
    
    message = f"{FILENAME_NB}:\n{message}"

    # Using "try and except" to ensure that the notebook execution will not be stopped only because of problems with the bot.
    # Example: No network connection.
    # ISSUE: Be careful, an error messages will leak your Telegram Bot Token when uploaded to GitHub.
    try:
        url = 'https://api.telegram.org/bot%s/sendMessage?chat_id=%s&text=%s'%(token, chat_id, urllib.parse.quote_plus(message))
        _ = requests.get(url, timeout=10)
    
    except Exception as e:
        print('\n\nSending message to Telegram Bot was not successful.\n\n')
        print(e)
        
    return None
    

In [4]:
def calculate_train_scores(model, X_train, y_train, y_train_pred):
    # Training set performance
    train_accuracy = accuracy_score(y_train, y_train_pred)  # Calculate Accuracy
    train_mcc = matthews_corrcoef(y_train, y_train_pred)  # Calculate MCC
    train_f1 = f1_score(y_train, y_train_pred, average="weighted")  # Calculate F1-score
    train_rocauc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    
    return train_accuracy, train_mcc, train_f1, train_rocauc


In [5]:
def calculate_valid_scores(model, X_valid, y_valid, y_valid_pred):
    # Validation set performance
    valid_accuracy = accuracy_score(y_valid, y_valid_pred)  # Calculate Accuracy
    valid_mcc = matthews_corrcoef(y_valid, y_valid_pred)  # Calculate MCC
    valid_f1 = f1_score(y_valid, y_valid_pred, average="weighted")  # Calculate F1-score
    valid_rocauc = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])

    return valid_accuracy, valid_mcc, valid_f1, valid_rocauc
    

In [6]:
def print_scores():
    print("Model performance for Training set")
    print("- Accuracy: %s" % train_accuracy)
    print("- MCC: %s" % train_mcc)
    print("- F1 score: %s" % train_f1)
    print("- ROC AUC score: %s" % train_rocauc)
    print("----------------------------------")
    print("Model performance for Validation set")
    print("- Accuracy: %s" % valid_accuracy)
    print("- MCC: %s" % valid_mcc)
    print("- F1 score: %s" % valid_f1)
    print("- ROC AUC score: %s" % valid_rocauc)

    return None
    

In [7]:
import json
from datetime import datetime

def write_scores_to_json(filename):
    dummy_scores_dict = {}
    dummy_scores_list = []

    dummy_scores_dict['Accuracy Train'] = train_accuracy
    dummy_scores_dict['MCC Train'] = train_mcc
    dummy_scores_dict['F1 Train'] = train_f1
    dummy_scores_dict['ROC AUC Train'] = train_rocauc
    dummy_scores_list.append(dummy_scores_dict)

    dummy_scores_dict = {}
    dummy_scores_dict['Accuracy Valid'] = valid_accuracy
    dummy_scores_dict['MCC Valid'] = valid_mcc
    dummy_scores_dict['F1 Valid'] = valid_f1
    dummy_scores_dict['ROC AUC Valid'] = valid_rocauc
    dummy_scores_list.append(dummy_scores_dict)

    # datetime object containing current date and time
    now = datetime.now()
    now = now.strftime("%Y-%m-%d")

    # Serializing and write json file
    json_object = json.dumps(dummy_scores_list, indent = 4) 
    filename =  now+'_'+filename
    with open(PATH_REPORTS + filename, "w") as outfile: 
        outfile.write(json_object)

    return None
    

# Load Data

In [71]:
# train_df = pd.read_pickle(PATH_DATA_INT + "train.pkl")
# train_df = pd.read_pickle(PATH_DATA_INT + "train_features.pkl")
train_df = pd.read_pickle(PATH_DATA_INT + "train-opt.pkl")  


In [72]:
train_df.shape

(1000000, 287)

In [73]:
# Reducing sample size
# sample_size = 500000
# X = train_df[:sample_size]
# y = train_df[:sample_size]['target']
# assert y.index.tolist() == X.index.tolist()
# X = X.drop(['id','target'], axis=1)

# Using full dataset
#X = train_df.drop(["id", "target"], axis=1).to_numpy()
#y = train_df["target"].to_numpy()


# Using numpy arrays: https://vitalflux.com/pandas-dataframe-vs-numpy-array-what-to-use/
X = train_df.drop(["id", "target"], axis=1).values # using numpy array
y = train_df["target"].values # using numpy array


In [74]:
X.shape, y.shape

((1000000, 285), (1000000,))

# Run Baseline Classifiers

In [12]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import matthews_corrcoef
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score


In [13]:
# Enable / disable baseline classifiers
# Do not forget to add/remove classifiers in the stacking section, accordingly
dummy_enabled = "yes"
xgbc_enabled = "yes"
lgbc_enabled = "yes"
ctbc_enabled = "yes"
rfc_enabled = "yes" # Carefurl: tree grows big: .pkl file is around 1GB
dtc_enabled = "yes"
knnc_enabled = "no" # disable when sample size > 100.000
mlpc_enabled = "yes"

# Evaluation Metric
eval_metric = "AUC"


In [75]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.33, random_state=rnd_state, stratify=y)


In [76]:
X_train.shape, X_valid.shape

((670000, 285), (330000, 285))

In [16]:
# Sending for first bot message
message = "------------START------------"
send_telegram_message(message)


## Dummy Classifier

In [17]:
from sklearn.dummy import DummyClassifier

# Define model
duc = DummyClassifier(strategy="stratified")

if dummy_enabled == "yes":
    try:
        # Train model
        duc.fit(X_train, y_train)

        # Make predictions
        y_train_pred = duc.predict(X_train)
        y_valid_pred = duc.predict(X_valid)

        # Training set performance
        train_accuracy, train_mcc, train_f1, train_rocauc = calculate_train_scores(
            duc, X_train, y_train, y_train_pred
        )

        # Validation set performance
        valid_accuracy, valid_mcc, valid_f1, valid_rocauc = calculate_valid_scores(
            duc, X_valid, y_valid, y_valid_pred
        )

        print_scores()
        filename = "dummy_baseline_scores.json"
        write_scores_to_json(filename)

        message = f"DummyClassifier finished. Validation AUC ROC Score: {valid_rocauc}"
        send_telegram_message(message)

    except Exception as e:
        message = f"\nFitting of {duc} failed: {e}\n"
        send_telegram_message(message)
        print(f"\n{e}\n")


Model performance for Training set
- Accuracy: 0.4998268656716418
- MCC: -0.0003463354277785677
- F1 score: 0.49982698277168613
- ROC AUC score: 0.49949268018636084
----------------------------------
Model performance for Validation set
- Accuracy: 0.5001787878787879
- MCC: 0.00035527903417098533
- F1 score: 0.5001782036921849
- ROC AUC score: 0.5014336197135231


## XGBoost

In [18]:
import xgboost as xgb

if xgbc_enabled == "yes":
    try:
        # Define model
        xgbc = xgb.XGBClassifier(
            random_state=rnd_state,
            n_jobs=n_cpu,
            eval_metric=eval_metric.lower()
        )

        # Train model
        xgbc.fit(X_train, y_train)

        # Make predictions
        y_train_pred = xgbc.predict(X_train)
        y_valid_pred = xgbc.predict(X_valid)

        # Training set performance
        train_accuracy, train_mcc, train_f1, train_rocauc = calculate_train_scores(
            xgbc, X_train, y_train, y_train_pred
        )

        # Validation set performance
        valid_accuracy, valid_mcc, valid_f1, valid_rocauc = calculate_valid_scores(
            xgbc, X_valid, y_valid, y_valid_pred
        )

        # Print and write scores
        print_scores()
        filename = "xgbc_baseline_scores.json"
        write_scores_to_json(filename)

        # Store base model
        filename = "xgbc_baseline_model.pkl"
        joblib.dump(xgbc, PATH_MODELS + filename)

        # Send messages
        message = f"XGBClassifier finished. Validation AUC ROC Score: {valid_rocauc}"
        send_telegram_message(message)

    except Exception as e:
        message = f"\nFitting of {xgbc} failed: {e}\n"
        send_telegram_message(message)
        print(f"\n{e}\n")


Model performance for Training set
- Accuracy: 0.7885149253731343
- MCC: 0.5783763916849866
- F1 score: 0.7882806509204838
- ROC AUC score: 0.8778781226320046
----------------------------------
Model performance for Validation set
- Accuracy: 0.7651151515151515
- MCC: 0.5315687193750236
- F1 score: 0.7648343751609188
- ROC AUC score: 0.8506886352480831


## LightGBM

In [19]:
import lightgbm as lgb

if lgbc_enabled == "yes":
    try:
        # Define model
        lgbc = lgb.LGBMClassifier(
            random_state=rnd_state,
            n_jobs=n_cpu,
            eval_metric=eval_metric
        )

        # Train model
        lgbc.fit(X_train, y_train)

        # Make predictions
        y_train_pred = lgbc.predict(X_train)
        y_valid_pred = lgbc.predict(X_valid)

        # Training set performance
        train_accuracy, train_mcc, train_f1, train_rocauc = calculate_train_scores(lgbc,
            X_train, y_train, y_train_pred
        )

        # Validation set performance
        valid_accuracy, valid_mcc, valid_f1, valid_rocauc = calculate_valid_scores(lgbc,
            X_valid, y_valid, y_valid_pred
        )

        # Print and write scores
        print_scores()
        filename = "lgbc_baseline_scores.json"
        write_scores_to_json(filename)

        # Store base model
        filename = "lgbc_baseline_model.pkl"
        joblib.dump(lgbc, PATH_MODELS+filename)

        # Send messages
        message = f"LGBClassifier finished. Validation AUC ROC Score: {valid_rocauc}"
        send_telegram_message(message)

    except Exception as e:
        message = f"\nFitting of {lgbc} failed: {e}\n"
        send_telegram_message(message)
        print(f'\n{e}\n')


Model performance for Training set
- Accuracy: 0.7692074626865671
- MCC: 0.5407198263980905
- F1 score: 0.7687344285456974
- ROC AUC score: 0.854384504730348
----------------------------------
Model performance for Validation set
- Accuracy: 0.7650333333333333
- MCC: 0.5323016233554332
- F1 score: 0.7645593989397934
- ROC AUC score: 0.8488486880634127


## Catboost

In [77]:
import catboost as ctb

def run_catboost_calssifier():
    if ctbc_enabled == "yes":
        try:
            # Define model
            ctbc = ctb.CatBoostClassifier(
                random_state=rnd_state,
                verbose=0,
                eval_metric=eval_metric,
                task_type="GPU"
            )

            # Train model
            ctbc.fit(X_train, y_train)

            # Make predictions
            y_train_pred = ctbc.predict(X_train)
            y_valid_pred = ctbc.predict(X_valid)

            # Training set performance
            train_accuracy, train_mcc, train_f1, train_rocauc = calculate_train_scores(ctbc,
                X_train, y_train, y_train_pred
            )

            # Validation set performance
            valid_accuracy, valid_mcc, valid_f1, valid_rocauc = calculate_valid_scores(ctbc,
                X_valid, y_valid, y_valid_pred
            )

            # Print and write scores
            print_scores()
            filename = "ctbc_baseline_scores.json"
            write_scores_to_json(filename)

            # Store base model
            filename = "ctbc_baseline_model.pkl"
            joblib.dump(ctbc, PATH_MODELS+filename)

            # Send messages
            message = (f"CatBoostClassifier finished. Validation AUC ROC Score: {valid_rocauc}")
            send_telegram_message(message)

        except Exception as e:
            message = f"\nFitting of {ctbc} failed: {e}\n"
            send_telegram_message(message)
            print(f'\n{e}\n')

    return None
    
run_catboost_calssifier()


Model performance for Training set
- Accuracy: 0.9036059701492537
- MCC: 0.8073152161203432
- F1 score: 0.903600707651649
- ROC AUC score: 0.8583491905558767
----------------------------------
Model performance for Validation set
- Accuracy: 0.7678545454545455
- MCC: 0.5373560296446442
- F1 score: 0.7675149302948873
- ROC AUC score: 0.8528049101122387


## Random Forest

In [21]:
from sklearn.ensemble import RandomForestClassifier

if rfc_enabled == "yes":
    try:
        # Define model
        rfc = RandomForestClassifier(random_state=rnd_state, n_jobs=n_cpu)

        # Train model
        rfc.fit(X_train, y_train)

        # Make predictions
        y_train_pred = rfc.predict(X_train)
        y_valid_pred = rfc.predict(X_valid)

        # Training set performance
        train_accuracy, train_mcc, train_f1, train_rocauc = calculate_train_scores(rfc,
            X_train, y_train, y_train_pred
        )

        # Validation set performance
        valid_accuracy, valid_mcc, valid_f1, valid_rocauc = calculate_valid_scores(rfc,
            X_valid, y_valid, y_valid_pred
        )

        # Print and write scores
        print_scores()
        filename = "rfc_baseline_scores.json"
        write_scores_to_json(filename)

        # Store base model
        filename = "rfc_baseline_model.pkl"
        joblib.dump(rfc, PATH_MODELS+filename)

        # Send messages
        message = f"RandomForestClassifier finished. Validation AUC ROC Score: {valid_rocauc}"
        send_telegram_message(message)

    except Exception as e:
        message = f"\nFitting of {rfc} failed: {e}\n"
        send_telegram_message(message)
        print(f'\n{e}\n')
		

Model performance for Training set
- Accuracy: 1.0
- MCC: 1.0
- F1 score: 1.0
- ROC AUC score: 1.0
----------------------------------
Model performance for Validation set
- Accuracy: 0.7576484848484848
- MCC: 0.5178946298618594
- F1 score: 0.7570631117247801
- ROC AUC score: 0.8309505821426962


##  Decision Tree

In [22]:
from sklearn.tree import DecisionTreeClassifier

if dtc_enabled == "yes":
    try:
        # Define model
        dtc = DecisionTreeClassifier(max_depth=5, random_state=rnd_state)

        # Train model
        dtc.fit(X_train, y_train)

        # Make predictions
        y_train_pred = dtc.predict(X_train)
        y_valid_pred = dtc.predict(X_valid)

        # Training set performance
        train_accuracy, train_mcc, train_f1, train_rocauc = calculate_train_scores(dtc,
            X_train, y_train, y_train_pred
        )

        # Validation set performance
        valid_accuracy, valid_mcc, valid_f1, valid_rocauc = calculate_valid_scores(dtc,
            X_valid, y_valid, y_valid_pred
        )

        # Print and write scores
        print_scores()
        filename = "dtc_baseline_scores.json"
        write_scores_to_json(filename)

        # Store base model
        filename = "dtc_baseline_model.pkl"
        joblib.dump(dtc, PATH_MODELS+filename)

        # Send messages
        message = f"DecisionTreeClassifier finished. Validation AUC ROC Score: {valid_rocauc}"
        send_telegram_message(message)

    except Exception as e:
        message = f"\nFitting of {dtc} failed: {e}\n"
        send_telegram_message(message)
        print(f'\n{e}\n')

Model performance for Training set
- Accuracy: 0.7572373134328358
- MCC: 0.5195936906271718
- F1 score: 0.7560729947797653
- ROC AUC score: 0.8203079886926381
----------------------------------
Model performance for Validation set
- Accuracy: 0.7559848484848485
- MCC: 0.5169942047093701
- F1 score: 0.7548307158115667
- ROC AUC score: 0.8191484611827587


## KNN

In [23]:
from sklearn.neighbors import KNeighborsClassifier

if knnc_enabled == "yes":
    try:
        # Define model
        knnc = KNeighborsClassifier(n_neighbors=3, n_jobs=n_cpu)

        # Train model
        knnc.fit(X_train, y_train)

        # Make predictions
        y_train_pred = knnc.predict(X_train)
        y_valid_pred = knnc.predict(X_valid)

        # Training set performance
        train_accuracy, train_mcc, train_f1, train_rocauc = calculate_train_scores(knnc,
            X_train, y_train, y_train_pred
        )

        # Validation set performance
        valid_accuracy, valid_mcc, valid_f1, valid_rocauc = calculate_valid_scores(knnc,
            X_valid, y_valid, y_valid_pred
        )

        # Print and write scores
        print_scores()
        filename = "knnc_baseline_scores.json"
        write_scores_to_json(filename)

        # Store base model
        filename = "knnc_baseline_model.pkl"
        joblib.dump(knnc, PATH_MODELS+filename)

        # Send messages
        message = f"KNNClassifier finished. Validation AUC ROC Score: {valid_rocauc}"
        send_telegram_message(message)

    except Exception as e:
        message = f"\nFitting of {knnc} failed: {e}\n"
        send_telegram_message(message)
        print(f'\n{e}\n')

## Neural Network

In [24]:
from sklearn.neural_network import MLPClassifier

if mlpc_enabled == "yes":
    try:
        # Define model
        mlpc = MLPClassifier(alpha=1, max_iter=200, random_state=rnd_state)

        # Train model
        mlpc.fit(X_train, y_train)

        # Make predictions
        y_train_pred = mlpc.predict(X_train)
        y_valid_pred = mlpc.predict(X_valid)

        # Training set performance
        train_accuracy, train_mcc, train_f1, train_rocauc = calculate_train_scores(mlpc,
            X_train, y_train, y_train_pred
        )

        # Validation set performance
        valid_accuracy, valid_mcc, valid_f1, valid_rocauc = calculate_valid_scores(mlpc,
            X_valid, y_valid, y_valid_pred
        )

        # Print and write scores
        print_scores()
        filename = "mlpc_baseline_scores.json"
        write_scores_to_json(filename)

        # Store base model
        filename = "mlpc_baseline_model.pkl"
        joblib.dump(mlpc, PATH_MODELS+filename)

        # Send messages
        message = f"MLPClassifier finished. Validation AUC ROC Score: {valid_rocauc}"
        send_telegram_message(message)

    except Exception as e:
        message = f"\nFitting of {mlpc} failed: {e}\n"
        send_telegram_message(message)
        print(f'\n{e}\n')

Model performance for Training set
- Accuracy: 0.7634343283582089
- MCC: 0.5281439309986606
- F1 score: 0.7631635345656835
- ROC AUC score: 0.8453865350764336
----------------------------------
Model performance for Validation set
- Accuracy: 0.7625575757575758
- MCC: 0.5264326747700689
- F1 score: 0.762275633123674
- ROC AUC score: 0.8438438724482327


# Stacking

## StackingClassifier (sklearn)

- https://towardsdatascience.com/stacking-made-easy-with-sklearn-e27a0793c92b
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html

In [25]:
from sklearn.ensemble import StackingClassifier # only works with estimators from sklearn
from sklearn.linear_model import LogisticRegression

estimators = [
    ("rfc", rfc),
    ("dtc", dtc),
    # ("knnc", knnc),
    ("mlpc", mlpc),
]

final_estimator = LogisticRegression(random_state=rnd_state)

# Build stack model
stack_skl_model = StackingClassifier(estimators=estimators, final_estimator=final_estimator) 

try:
    # Train stacked model
    stack_skl_model.fit(X_train, y_train)

    # Make predictions
    y_train_pred = stack_skl_model.predict(X_train)
    y_valid_pred = stack_skl_model.predict(X_valid)

    # Training set performance
    train_accuracy, train_mcc, train_f1, train_rocauc = calculate_train_scores(
        stack_skl_model, X_train, y_train, y_train_pred
    )

    # Validation set performance
    valid_accuracy, valid_mcc, valid_f1, valid_rocauc = calculate_valid_scores(
        stack_skl_model, X_valid, y_valid, y_valid_pred
    )

    # Print and write scores
    print_scores()
    filename = "stack_sklearn_baseline_scores.json"
    write_scores_to_json(filename)

    # Store base model
    filename = "stack_sklearn_baseline_model.pkl"
    joblib.dump(stack_skl_model, PATH_MODELS+filename)

    # Send messages
    message = f"Stacking (sklearn) finished. Validation AUC ROC Score: {valid_rocauc}"
    send_telegram_message(message)

except Exception as e:
        message = f"\nFitting of {stack_skl_model} failed: {e}\n"
        send_telegram_message(message)
        print(f'\n{e}\n')


Model performance for Training set
- Accuracy: 0.9036059701492537
- MCC: 0.8073152161203432
- F1 score: 0.903600707651649
- ROC AUC score: 0.9798443734467536
----------------------------------
Model performance for Validation set
- Accuracy: 0.7640212121212121
- MCC: 0.5294831336086517
- F1 score: 0.7637158266869781
- ROC AUC score: 0.8463956069688718


## StackingCVClassifier (mlxtend)

- https://developer.ibm.com/articles/stack-machine-learning-models-get-better-results/
- http://rasbt.github.io/mlxtend/user_guide/classifier/StackingCVClassifier/

In [None]:
from sklearn.linear_model import LogisticRegression
from mlxtend.classifier import StackingCVClassifier

lr = LogisticRegression(random_state=rnd_state)

stack_mlx_model = StackingCVClassifier(
    classifiers=[xgbc, lgbc, ctbc],
    meta_classifier=lr
    cv=5,
    use_features_in_secondary=True,
    store_train_meta_features=True,
    shuffle=True,
    random_state=rnd_state,
    verbose=1,
    n_jobs=n_cpu
)

try:
    stack_mlx_model.fit(X_train, y_train)
    y_valid_pred = stack_mlx_model.predict(X_valid)

    # Training set performance
    train_accuracy, train_mcc, train_f1, train_rocauc = calculate_train_scores(
        stack_mlx_model, X_train, y_train, y_train_pred
    )

    # Validation set performance
    valid_accuracy, valid_mcc, valid_f1, valid_rocauc = calculate_valid_scores(
        stack_mlx_model, X_valid, y_valid, y_valid_pred
    )

    # Print and write scores
    print_scores()
    filename = "stack_sklearn_baseline_scores.json"
    write_scores_to_json(filename)

    # Store base model
    filename = "stack_mlxtend_baseline_model.pkl"
    joblib.dump(stack_mlx_model, PATH_MODELS+filename)

    # Send messages
    message = f"Stacking (mlxtend) finished. Validation AUC ROC Score: {valid_rocauc}"
    send_telegram_message(message)

except Exception as e:
        message = f"\nFitting of {stack_mlx_model} failed: {e}\n"
        send_telegram_message(message)
        print(f'\n{e}\n')


Fitting 3 classifiers...
Fitting classifier1: xgbclassifier (1/3)


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   2 out of   5 | elapsed: 21.3min remaining: 32.0min
[Parallel(n_jobs=8)]: Done   5 out of   5 | elapsed: 21.4min finished


Fitting classifier2: lgbmclassifier (2/3)


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   2 out of   5 | elapsed:  1.8min remaining:  2.7min
[Parallel(n_jobs=8)]: Done   5 out of   5 | elapsed:  1.8min finished


Fitting classifier3: catboostclassifier (3/3)


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   2 out of   5 | elapsed:  3.3min remaining:  5.0min
[Parallel(n_jobs=8)]: Done   5 out of   5 | elapsed:  3.7min finished


Model performance for Training set
- Accuracy: 0.9036059701492537
- MCC: 0.8073152161203432
- F1 score: 0.903600707651649
- ROC AUC score: 0.8583491905558767
----------------------------------
Model performance for Validation set
- Accuracy: 0.7678545454545455
- MCC: 0.5373560296446442
- F1 score: 0.7675149302948873
- ROC AUC score: 0.8528049101122387


# Make final predications

In [78]:
test_df = pd.read_pickle(PATH_DATA_INT + "test-opt.pkl")
X_test = test_df.drop("id", axis=1).values
X_test.shape


(500000, 285)

In [None]:
# Train (best) model on "full" data set
# _ = stack_skl_model.fit(X, y)

In [87]:
# Make predictions
y_test_pred = stack_skl_model.predict(X_test)


# Submit (Best) Baseline Results

This section needs to be finalized.

In [84]:
from datetime import datetime

# datetime object containing current date and time
now = datetime.now()
now = now.strftime("%Y-%m-%d")

objective = "stack_mlx_model-baseline"

curr_submission_fn = f"{now}_submission_{objective}.csv"

my_submission = pd.DataFrame({"id": test_df["id"], "target": y_test_pred})
my_submission.to_csv(PATH_SUB + curr_submission_fn, index=False)

print(curr_submission_fn)


2021-10-14_submission_stacked-sklearn-baseline.csv


In [85]:
# !kaggle competitions submit tabular-playground-series-oct-2021 -f {PATH_SUB+curr_submission_fn} -m {curr_submission_fn}

Successfully submitted to Tabular Playground Series - Oct 2021



  0%|          | 0.00/5.25M [00:00<?, ?B/s]
  0%|          | 8.00k/5.25M [00:00<01:38, 55.9kB/s]
  2%|▏         | 112k/5.25M [00:00<00:10, 536kB/s]  
  4%|▍         | 208k/5.25M [00:00<00:07, 693kB/s]
  5%|▌         | 288k/5.25M [00:00<00:10, 506kB/s]
  7%|▋         | 352k/5.25M [00:00<00:11, 455kB/s]
  8%|▊         | 408k/5.25M [00:00<00:12, 419kB/s]
  8%|▊         | 456k/5.25M [00:01<00:12, 408kB/s]
  9%|▉         | 504k/5.25M [00:01<00:12, 400kB/s]
 10%|█         | 552k/5.25M [00:01<00:12, 393kB/s]
 11%|█         | 592k/5.25M [00:01<00:12, 379kB/s]
 12%|█▏        | 632k/5.25M [00:01<00:12, 385kB/s]
 13%|█▎        | 672k/5.25M [00:01<00:12, 378kB/s]
 13%|█▎        | 712k/5.25M [00:01<00:12, 371kB/s]
 14%|█▍        | 752k/5.25M [00:01<00:12, 381kB/s]
 15%|█▍        | 792k/5.25M [00:01<00:12, 373kB/s]
 15%|█▌        | 832k/5.25M [00:02<00:12, 373kB/s]
 16%|█▌        | 872k/5.25M [00:02<00:12, 369kB/s]
 19%|█▉        | 1.02M/5.25M [00:02<00:06, 731kB/s]
 21%|██        | 1.09M/5.25M [00

# Watermark

In [27]:
%load_ext watermark

In [28]:
%watermark

Last updated: 2021-10-14T20:08:15.118277+02:00

Python implementation: CPython
Python version       : 3.8.8
IPython version      : 7.28.0

Compiler    : MSC v.1916 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
CPU cores   : 8
Architecture: 64bit



In [29]:
%watermark --iversions

lightgbm: 2.3.1
pandas  : 1.0.5
requests: 2.26.0
sys     : 3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]
joblib  : 1.0.0
catboost: 1.0.0
xgboost : 1.1.1
json    : 2.0.9

