# Catboost Tuning
## Summary
In this notebook I will be tuning a Catboost model using [Optuna](https://optuna.readthedocs.io/en/stable/index.html)


### Importing Data and Required Packages

In [8]:
# Imports
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from catboost import CatBoostClassifier, metrics, cv
from sklearn.metrics import roc_auc_score, f1_score, recall_score, accuracy_score
import optuna

from functions import metrics as custom_metric

In [11]:
# Training Data
X_train = pd.read_csv('../Data/train/X_train.csv', index_col=0)
y_train = pd.read_csv('../Data/train/y_train.csv', index_col=0)

# Testing Data
X_test = pd.read_csv('../Data/test/X_test.csv', index_col=0)
y_test = pd.read_csv('../Data/test/y_test.csv', index_col=0)

In [12]:
# Initiate Over sampler
ros = RandomOverSampler(random_state=15)

# Applying ONLY to training set to prevent data leakage.
X_train_os, y_train_os = ros.fit_resample(X_train, y_train)

## Optimizing with the Optuna
For the hyperparameter tuning process, I'll be using the [Optuna](https://optuna.readthedocs.io/en/stable/index.html) library.

In [13]:
# Optuna requires us to define the "objective function". Which will be called upon each iteration during our "trials"
def objective(trial):
    # Dict of Parameters to check
    param = {
        # Metric used for model optimization
        'loss_function':trial.suggest_categorical('loss_function', ['Logloss', 'CrossEntropy']),

        # The maximum number of trees that can be built.
        'iterations':trial.suggest_categorical('iterations', [100,200,300,500,1000]),

        # learning rate for gradient descent calculations.
        'learning_rate':trial.suggest_float("learning_rate", 0.001, 0.3),

        # Coefficient at the L2 regularization term of the cost function.
        'l2_leaf_reg': trial.suggest_loguniform("l2_leaf_reg", 1e-8, 100),

        # Affects the speed and regularization of tree
        'bootstrap_type': trial.suggest_categorical('bootstrap_type', ['Bayesian', 'Bernoulli', 'MVS']),

        # The amount of randomness to use for scoring splits.
        'random_strength':trial.suggest_int("random_strength", 1,10),

        # The number of splits for numerical features.
        'max_bin':trial.suggest_categorical('max_bin', [4,5,6,8,10,20,30]),

        # Allowed depth of tree.
        "depth": trial.suggest_int("max_depth", 2,16),

        # Defines how to perform greedy tree construction.
        'grow_policy':trial.suggest_categorical('grow_policy', ['SymmetricTree', 'Depthwise', 'Lossguide']),

        # The minimum number of training samples in a leaf.
        'min_data_in_leaf':trial.suggest_int("min_data_in_leaf", 1,10),

        # Only OHE encodes features if the number of unique values will be <= the parameter vale.
        'one_hot_max_size':trial.suggest_categorical('one_hot_max_size', [5,10,12,100]),
    }

    # Certain parameters are "subparameters" and can only be set if their parent parameter has a certain value.

    # Bootstrap types
    if param['bootstrap_type'] == "Bayesian":

        # Use Baysesian bootstrapping to assign random weights to objects.
        param['bagging_temperature'] = trial.suggest_float('bagging_temperature', 0, 10)

    elif param['bootstrap_type'] in ['Bernoulli', 'MVS']:
        # Sample rate for bagging using Bernoulli/MVS type
        param['subsample'] = trial.suggest_float('subsample', 0.1, 1)

    # Grow policy params
    if param['grow_policy'] != 'SymmetricTree':

        # The minimum number of training samples in a leaf.
        param['min_data_in_leaf'] = trial.suggest_int('min_data_in_leaf', 1, 10)

        if param['grow_policy'] == 'LossGuide':

            # The maximum number of leafs in the tree.
            param['max_leaves'] = trial.suggest_int('max_leaves', 16, 64)


    # Creates the trial model with parameters specified above.
    trial_model = CatBoostClassifier(**param)

    # Fit the training model on training data
    trial_model.fit(X_train_os,
                    y_train_os,
                    eval_set=[(X_test, y_test)],
                    verbose=0, # Stops Catboost from printing training results.
                    early_stopping_rounds=10 # Specify rounds of no improvement needed before stopping
                    )

    # Create predictions for test set
    preds = trial_model.predict(X_test)

    # Calculate recall score
    roc_auc = roc_auc_score(y_test, preds)

    return roc_auc

In [14]:
# Instantiate a "trial" object and specify we want to MAXIMIZE the value being returned by the obj function
study = optuna.create_study(direction="maximize")

# Running 100 trials, setting a timeout value of 15 minutes to prevent my computer from exploding.
study.optimize(objective, n_trials=100, timeout=900)

print("Number of finished trials: {}".format(len(study.trials)))
trial = study.best_trial

# "Prettify" our trial results
print("Best trial:")

# Print metric value achieved from best trial
print("  Value: {}".format(trial.value))

# Print all parameters from best trial
print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))

[32m[I 2021-12-01 13:15:20,262][0m A new study created in memory with name: no-name-89e1714a-9d70-4350-8f0a-4f6333145aec[0m
Custom logger is already specified. Specify more than one logger at same time is not thread safe.[32m[I 2021-12-01 13:16:11,912][0m Trial 0 finished with value: 0.8137827975469601 and parameters: {'loss_function': 'Logloss', 'iterations': 1000, 'learning_rate': 0.27853655059569205, 'l2_leaf_reg': 0.06366426493797718, 'bootstrap_type': 'Bernoulli', 'random_strength': 4, 'max_bin': 4, 'max_depth': 9, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 3, 'one_hot_max_size': 12, 'subsample': 0.7397134390129195}. Best is trial 0 with value: 0.8137827975469601.[0m
[32m[I 2021-12-01 13:20:32,637][0m Trial 1 finished with value: 0.8458848061596351 and parameters: {'loss_function': 'CrossEntropy', 'iterations': 200, 'learning_rate': 0.06788936356049051, 'l2_leaf_reg': 19.178848802339562, 'bootstrap_type': 'Bayesian', 'random_strength': 7, 'max_bin': 20, 'max_depth': 1

Number of finished trials: 6
Best trial:
  Value: 0.8704508964409955
  Params: 
    loss_function: Logloss
    iterations: 500
    learning_rate: 0.20789147289515256
    l2_leaf_reg: 1.5182207444827891e-05
    bootstrap_type: Bayesian
    random_strength: 1
    max_bin: 6
    max_depth: 2
    grow_policy: Lossguide
    min_data_in_leaf: 7
    one_hot_max_size: 12
    bagging_temperature: 1.9649020176256882


In [15]:
# Create a model with the parameters from our best trial
final_model = CatBoostClassifier(verbose=False, **trial.params)
final_model.fit(X_train_os, y_train_os)

<catboost.core.CatBoostClassifier at 0x12708fd90>

In [16]:
# Show custom metrics
final_results = custom_metric(y_test, final_model.predict(X_test))

Model Results
Accuracy: 0.89
Precision: 0.47
Recall: 0.83
F1: 0.60
ROC AUC: 0.86


## Analysis
I chose to optimize on ROC score here, since I'm planning on using the predicted probabilities that a person has undiagnosed ADHD. We gained recall and AUC, however we lost more precision after tuning. We can see that the test set only missed **133 kids** that were diagnosed with ADHD.

I would say this is the prefered scenario. If a child is identified as having undiagnosed ADHD, and they don't, then the worst that would happen is they'd see a doctor and have their worries put to rest. However, if the model says a child doesn't have ADHD, and later they **do** have ADHD, then that means more time, frustration, and confusion before they get diagnosed. Not even to mention the frustration of being told that you don't have ADHD, only to find out later that you do.