# CatBoost Tuning with Weighted Samples Instead of Oversampling
## Summary
In this notebook I tune a CatBoost model with [Optuna](https://optuna.readthedocs.io/en/stable/index.html), handling class imbalance through CatBoost's automatic class weights rather than oversampling.
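
To make "weighted samples" concrete, here is a minimal sketch of the two weighting schemes tuned later through `auto_class_weights`. It follows the formulas in the CatBoost documentation as I understand them (`Balanced`: size of the largest class divided by the size of each class; `SqrtBalanced`: the square root of that ratio) and assumes every sample starts with a weight of 1. The helper name `auto_class_weights_sketch` and the toy labels are hypothetical, not part of this notebook's data or of CatBoost's API.

```python
from collections import Counter
from math import sqrt


def auto_class_weights_sketch(labels, mode="Balanced"):
    """Approximate the documented auto_class_weights formulas for unweighted samples."""
    counts = Counter(labels)
    largest = max(counts.values())
    # Ratio of the largest class size to each class size.
    ratios = {cls: largest / n for cls, n in counts.items()}
    if mode == "Balanced":
        return ratios
    if mode == "SqrtBalanced":
        return {cls: sqrt(ratio) for cls, ratio in ratios.items()}
    raise ValueError(f"Unknown mode: {mode}")


# Hypothetical toy labels: 90 negatives and 10 positives (not the notebook's data).
toy_labels = [0] * 90 + [1] * 10

balanced = auto_class_weights_sketch(toy_labels, "Balanced")
sqrt_balanced = auto_class_weights_sketch(toy_labels, "SqrtBalanced")

print(balanced)       # {0: 1.0, 1: 9.0}
print(sqrt_balanced)  # {0: 1.0, 1: 3.0}
```

With `Balanced`, each minority sample counts as much as nine majority samples in this toy case, which plays a similar role to duplicating minority rows through oversampling but without enlarging the training set. `SqrtBalanced` is a milder correction.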


### Importing Data and Required Packages

In [1]:
# Imports
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from catboost import CatBoostClassifier, metrics, cv
from sklearn.metrics import roc_auc_score, f1_score, recall_score, accuracy_score
import optuna

from functions import metrics as custom_metric

In [6]:
# Training Data
X_train = pd.read_csv('../Data/train/X_train_alt.csv', index_col=0)
y_train = pd.read_csv('../Data/train/y_train_alt.csv', index_col=0)

# Testing Data
X_test = pd.read_csv('../Data/test/X_test_alt.csv', index_col=0)
y_test = pd.read_csv('../Data/test/y_test_alt.csv', index_col=0)

## Optimizing with Optuna
For the hyperparameter tuning process, I'll be using the [Optuna](https://optuna.readthedocs.io/en/stable/index.html) library.
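
The search space below draws `l2_leaf_reg` log-uniformly between 1e-8 and 100. As a rough illustration of what that means, the sketch below samples values whose logarithm is uniform over the range, using only the standard library. It is a conceptual stand-in, not Optuna's sampler (Optuna's default sampler adapts its proposals as trials accumulate); the function name `sample_log_uniform` is hypothetical.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_log_uniform(1e-8, 100, rng) for _ in range(10_000)]

# The range spans 10 orders of magnitude and 8 of them lie below 1.0,
# so roughly 80% of the draws should be smaller than 1.0.
share_below_one = sum(s < 1.0 for s in samples) / len(samples)
print(f"share of samples below 1.0: {share_below_one:.2f}")
```

Sampling on a log scale gives every order of magnitude equal attention, which suits a regularization coefficient whose useful values may differ by several powers of ten. A plain uniform draw over the same range would almost never propose values below 1.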

In [17]:
# Optuna requires us to define an "objective function", which is called once per trial.
def objective(trial):
    # Dict of Parameters to check
    param = {
        # Metric used for model optimization
        # 'loss_function':trial.suggest_categorical('loss_function', ['Logloss', 'CrossEntropy']),

        # The maximum number of trees that can be built.
        'iterations':trial.suggest_categorical('iterations', [100,200,300,500,1000]),

        # Learning rate (step size) for the gradient boosting updates.
        'learning_rate':trial.suggest_float("learning_rate", 0.001, 0.3),

        # Coefficient at the L2 regularization term of the cost function (sampled on a log scale).
        # suggest_float(..., log=True) replaces the deprecated suggest_loguniform.
        'l2_leaf_reg': trial.suggest_float("l2_leaf_reg", 1e-8, 100, log=True),

        # Affects the speed and regularization of tree
        'bootstrap_type': trial.suggest_categorical('bootstrap_type', ['Bayesian', 'Bernoulli', 'MVS']),

        # The amount of randomness to use for scoring splits.
        'random_strength':trial.suggest_int("random_strength", 1,10),

        # The number of splits for numerical features.
        'max_bin':trial.suggest_categorical('max_bin', [4,5,6,8,10,20,30]),

        # Allowed depth of tree.
        "depth": trial.suggest_int("max_depth", 2,16),

        # Defines how to perform greedy tree construction.
        'grow_policy':trial.suggest_categorical('grow_policy', ['SymmetricTree', 'Depthwise', 'Lossguide']),

        # The minimum number of training samples in a leaf.
        'min_data_in_leaf':trial.suggest_int("min_data_in_leaf", 1,10),

        # One-hot encodes a feature only if its number of unique values is <= this parameter value.
        'one_hot_max_size':trial.suggest_categorical('one_hot_max_size', [5,10,12,100]),

        # Class weights setting
        'auto_class_weights': trial.suggest_categorical('auto_class_weights', ['Balanced', 'SqrtBalanced'])
    }

    # Certain parameters are "subparameters" and can only be set if their parent parameter has a certain value.

    # Bootstrap types
    if param['bootstrap_type'] == "Bayesian":

        # Use Bayesian bootstrapping to assign random weights to objects.
        param['bagging_temperature'] = trial.suggest_float('bagging_temperature', 0, 10)

    elif param['bootstrap_type'] in ['Bernoulli', 'MVS']:
        # Sample rate for bagging using Bernoulli/MVS type
        param['subsample'] = trial.suggest_float('subsample', 0.1, 1)

    # Grow policy params
    if param['grow_policy'] != 'SymmetricTree':

        # The minimum number of training samples in a leaf.
        param['min_data_in_leaf'] = trial.suggest_int('min_data_in_leaf', 1, 10)

        # Must match the 'Lossguide' spelling used in the grow_policy choices above.
        if param['grow_policy'] == 'Lossguide':

            # The maximum number of leaves in the tree.
            param['max_leaves'] = trial.suggest_int('max_leaves', 16, 64)


    # Creates the trial model with parameters specified above.
    trial_model = CatBoostClassifier(**param)

    # Fit the training model on training data
    trial_model.fit(X_train,
                    y_train,
                    eval_set=[(X_test, y_test)],
                    verbose=0, # Stops Catboost from printing training results.
                    early_stopping_rounds=10 # Specify rounds of no improvement needed before stopping
                    )

    # Create predictions for test set
    preds = trial_model.predict(X_test)

    # Calculate recall score
    recall = recall_score(y_test, preds)

    return recall
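
The `early_stopping_rounds=10` argument in the `fit` call above stops training once the evaluation metric has gone 10 rounds without improving. The sketch below illustrates that patience rule in plain Python on a made-up loss curve. It is a simplified illustration, not CatBoost's overfitting detector, and the function name `best_iteration_with_patience` and the toy losses are hypothetical.

```python
def best_iteration_with_patience(eval_losses, patience):
    """Return (best_index, stop_index) for a loss that should decrease.

    Training stops at the first iteration that is `patience` rounds past
    the best loss seen so far, or at the last iteration if that never happens.
    """
    best_index = 0
    best_loss = float("inf")
    for i, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_loss = loss
            best_index = i
        elif i - best_index >= patience:
            return best_index, i
    return best_index, len(eval_losses) - 1


# Hypothetical evaluation losses per boosting round (not from this notebook).
toy_losses = [0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.59]
best, stopped = best_iteration_with_patience(toy_losses, patience=3)
print(best, stopped)  # 2 5
```

Two things are worth keeping in mind when reading the objective above. First, as far as I know, early stopping monitors CatBoost's evaluation metric (by default the training loss), not the recall that the objective returns. Second, the evaluation set used for stopping here is the same test set used to score each trial, so the tuned recall is likely to be somewhat optimistic.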

In [18]:
# Create a "study" object and specify that we want to MAXIMIZE the value returned by the objective function
study = optuna.create_study(direction="maximize")

# Run up to 100 trials, with a 15-minute timeout to prevent my computer from exploding.
study.optimize(objective, n_trials=100, timeout=900)

print("Number of finished trials: {}".format(len(study.trials)))
trial = study.best_trial

# "Prettify" our trial results
print("Best trial:")

# Print metric value achieved from best trial
print("  Value: {}".format(trial.value))

# Print all parameters from best trial
print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))

[I 2021-12-02 14:48:29,094] A new study created in memory with name: no-name-e6ef0dc4-547b-4d91-85eb-77466f046926
[I 2021-12-02 14:48:34,784] Trial 0 finished with value: 0.7178082191780822 and parameters: {'iterations': 100, 'learning_rate': 0.1232319791499092, 'l2_leaf_reg': 8.710220289772402e-08, 'bootstrap_type': 'MVS', 'random_strength': 6, 'max_bin': 10, 'max_depth': 3, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 7, 'one_hot_max_size': 100, 'auto_class_weights': 'SqrtBalanced', 'subsample': 0.8689688383326206}. Best is trial 0 with value: 0.7178082191780822.
[I 2021-12-02 14:48:43,427] Trial 1 finished with value: 0.836986301369863 and parameters: {'iterations': 300, 'learning_rate': 0.26016856539483757, 'l2_leaf_reg': 32.971469910104446, 'bootstrap_type': 'Bernoulli', 'random_strength': 10, 'max_bin': 4, 'max_depth': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 3, 'one_hot_max_size': 5, 'auto_class_weights': 'Balanced', 'subs

Number of finished trials: 68
Best trial:
  Value: 0.8767123287671232
  Params: 
    iterations: 300
    learning_rate: 0.04922586124412725
    l2_leaf_reg: 1.7549284900699275e-05
    bootstrap_type: MVS
    random_strength: 4
    max_bin: 8
    max_depth: 3
    grow_policy: Depthwise
    min_data_in_leaf: 8
    one_hot_max_size: 10
    auto_class_weights: Balanced
    subsample: 0.6141662871526945


In [19]:
# Create a model with the parameters from our best trial
final_model = CatBoostClassifier(verbose=False, **trial.params)
final_model.fit(X_train, y_train)

<catboost.core.CatBoostClassifier at 0x12a9f1310>

In [20]:
# Show custom metrics
final_results = custom_metric(X_test, y_test, final_model)

Model Results
Accuracy: 0.89
Precision: 0.44
Recall: 0.88
F1: 0.59
ROC AUC: 0.95


In [21]:
X_val = pd.read_csv('../Data/val/X_val_alt.csv', index_col=0)
y_val = pd.read_csv('../Data/val/y_val_alt.csv', index_col=0)

In [22]:
custom_metric(X_val, y_val, final_model)

Model Results
Accuracy: 0.88
Precision: 0.43
Recall: 0.89
F1: 0.58
ROC AUC: 0.95


{'Accuracy': 0.8840688107703815,
 'Precision': 0.43286573146292584,
 'Recall': 0.8888888888888888,
 'F1': 0.5822102425876011,
 'ROCAUC': 0.9540164507484769}
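
As a quick sanity check on the validation metrics above, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. The sketch below does that with the numbers from the dictionary; the helper name `f1_from_precision_recall` is hypothetical and is not the `custom_metric` function used in this notebook.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the validation results above.
val_precision = 0.43286573146292584
val_recall = 0.8888888888888888

val_f1 = f1_from_precision_recall(val_precision, val_recall)
print(round(val_f1, 4))  # 0.5822
```

The harmonic mean sits closer to the lower of the two inputs, which is why a recall of 0.89 paired with a precision of 0.43 yields an F1 of only about 0.58.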

## Analysis
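The class-weighted model performs almost the same as the oversampled model, losing about 1 percentage point of recall. Test and validation results are nearly identical (recall 0.88 vs. 0.89, ROC AUC 0.95 on both), while precision stays low at roughly 0.43-0.44.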