# Fitting GBT candidate(s)

This notebook demonstrates the core functions used to fit the GBT candidates. The code behind these functions is in the [lgbm.py](https://github.com/abdumaa/master_thesis_churn_modelling/blob/main/churn_modelling/modelling/lgbm.py) file.
To execute a cell, click on it and press "Shift+Enter".

## Adjust path!

Please make sure to execute the cell below. It adds the repository root to the Python import path so that the `churn_modelling` package can be imported. In addition, "SCRIPT_DIR" is used later as the folder in which the [fits](https://github.com/abdumaa/master_thesis_churn_modelling/tree/main/churn_modelling/playground/lgbm_fits) and [results](https://github.com/abdumaa/master_thesis_churn_modelling/tree/main/churn_modelling/playground/lgbm_results) are cached.
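
Why the string "__file__" works inside a notebook can be seen in the minimal sketch below. `os.path.abspath` resolves a relative name against the current working directory, so taking the directory part returns that working directory. The variable name `resolved_dir` is only illustrative.

```python
import os

# "__file__" is passed as a plain string here, not as the module attribute,
# so abspath simply joins it onto the current working directory.
resolved_dir = os.path.dirname(os.path.abspath("__file__"))

# The result is therefore the folder the notebook kernel is running in.
print(resolved_dir == os.getcwd())
```

This also suggests that the notebook should be run from the playground folder: the slice `SCRIPT_DIR[:-11]` in the cell below appears to strip exactly the 11 characters of "/playground".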

In [None]:
import sys
import os

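# In a notebook, "__file__" is only a string, so this resolves to the current working directory
SCRIPT_DIR = os.path.dirname(os.path.abspath("__file__"))
# Drop the trailing "/playground" (11 characters), then go one level up to the repository root
sys.path.append(os.path.dirname(SCRIPT_DIR[:-11]))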

## 1. Execute one M-HPTL and get and save the best fit

The following cells walk you through one Model-Hyperparameter-Tuning-Loop (M-HPTL).

### 1.1 Import the LGBM class

In [None]:
from churn_modelling.modelling.lgbm import LGBM

### 1.2 Call the class

The class automatically loads all datasets from the [data](https://github.com/abdumaa/master_thesis_churn_modelling/tree/main/churn_modelling/data) folder as [attributes](https://github.com/abdumaa/master_thesis_churn_modelling/blob/14077d754b48e962f494e3eb4b335f6b88945f6f/churn_modelling/modelling/lgbm.py#L39).

In [None]:
gbt_modelling = LGBM()

### 1.3 Separate training data from validation data
[code](https://github.com/abdumaa/master_thesis_churn_modelling/blob/14077d754b48e962f494e3eb4b335f6b88945f6f/churn_modelling/modelling/lgbm.py#L70)

In [None]:
df_train, df_val = gbt_modelling.create_train_val()
print(f"Length of training set: {len(df_train)}\nLength of validation set: {len(df_val)}")

### 1.4 Sample data

Supported values for "sampling": "up", "down", "smote" \
Supported values for "frac": a number greater than 0 \
[code](https://github.com/abdumaa/master_thesis_churn_modelling/blob/14077d754b48e962f494e3eb4b335f6b88945f6f/churn_modelling/modelling/lgbm.py#L84)
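
To illustrate what random up- and down-sampling do to the class balance, here is a minimal sketch with pandas. It is not the implementation in lgbm.py: the helper names `upsample_minority` and `downsample_majority` are hypothetical, and the interpretation of `frac` (a multiplier on the minority class for "up", the kept share of the majority class for "down") is an assumption. SMOTE is not covered.

```python
import pandas as pd


def upsample_minority(df, target="churn", frac=10, seed=0):
    """Resample the minority class with replacement to `frac` times its size (assumed meaning of frac)."""
    minority_label = df[target].value_counts().idxmin()
    minority = df[df[target] == minority_label]
    majority = df[df[target] != minority_label]
    # replace=True is required because frac > 1 draws more rows than exist
    minority_up = minority.sample(frac=frac, replace=True, random_state=seed)
    return pd.concat([majority, minority_up], ignore_index=True)


def downsample_majority(df, target="churn", frac=0.5, seed=0):
    """Keep only a share `frac` of the majority class (assumed meaning of frac)."""
    majority_label = df[target].value_counts().idxmax()
    majority = df[df[target] == majority_label]
    minority = df[df[target] != majority_label]
    majority_down = majority.sample(frac=frac, random_state=seed)
    return pd.concat([majority_down, minority], ignore_index=True)


# Toy data: 95 non-churners and 5 churners
toy = pd.DataFrame({"x": range(100), "churn": [0] * 95 + [1] * 5})
toy_up = upsample_minority(toy, frac=10)
toy_down = downsample_majority(toy, frac=0.2)
print(toy_up["churn"].value_counts().to_dict())
print(toy_down["churn"].value_counts().to_dict())
```

Under these assumptions, up-sampling leaves the majority class untouched and only duplicates minority rows, while down-sampling discards majority rows.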

In [None]:
df_train_sampled = gbt_modelling.create_sampling(
    df_to_sample=df_train,
    sampling="up",
    frac=10,
)
print(df_train_sampled["churn"].value_counts())

### 1.5 Detect the best set of features

This function returns the best set of features. It performs [MRMR](https://github.com/abdumaa/master_thesis_churn_modelling/blob/main/churn_modelling/preprocessing/mrmr.py) (minimum redundancy, maximum relevance) on the quotation variables to remove redundant variables. \
[code](https://github.com/abdumaa/master_thesis_churn_modelling/blob/main/churn_modelling/modelling/lgbm.py#L113)
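
The idea behind MRMR can be sketched as a greedy selection that rewards relevance to the target and penalizes redundancy with features that are already selected. The sketch below is an illustration only: the helper `mrmr_select` is hypothetical, and using the absolute Pearson correlation for both relevance and redundancy is an assumption. The implementation in mrmr.py may use other scores and additionally relies on cross-validation (see the `cv` argument below).

```python
import numpy as np
import pandas as pd


def mrmr_select(X, y, k):
    """Greedy mRMR: prefer features relevant to y and not redundant with already selected ones."""
    relevance = X.apply(lambda col: abs(np.corrcoef(col, y)[0, 1]))
    corr = X.corr().abs()
    selected = [relevance.idxmax()]  # start with the most relevant feature
    while len(selected) < k:
        remaining = [c for c in X.columns if c not in selected]
        # score = relevance minus mean absolute correlation with the selected features
        scores = {c: relevance[c] - corr.loc[c, selected].mean() for c in remaining}
        selected.append(max(scores, key=scores.get))
    return selected


rng = np.random.default_rng(0)
signal = rng.normal(size=500)
other = rng.normal(size=500)
X = pd.DataFrame({
    "a": signal + 0.1 * rng.normal(size=500),
    "a_copy": signal + 0.1 * rng.normal(size=500),  # nearly redundant with "a"
    "b": other,
})
y = signal + other
print(mrmr_select(X, y, k=2))
```

Because "a" and "a_copy" carry almost the same information, the selection keeps only one of them and adds "b" instead.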

In [None]:
best_feats = gbt_modelling.get_best_quot_features(
    df_to_dimreduce=df_train_sampled,
    cv=5,
    return_fix_features=True,
    return_target=True
)
print(f"Best set of features: {best_feats}")

### 1.6 Define Hyperparameter dictionaries

These dictionaries will be used for fitting.
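
How such a tuning dictionary turns into candidate configurations can be sketched with scikit-learn's `ParameterSampler`, which draws from scipy distributions and picks uniformly from lists. This assumes that the tuning step is a randomized search in the style of `RandomizedSearchCV`, as the name "rscv_params" suggests. The variable `demo_space` is illustrative and contains only a subset of the parameters.

```python
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform
from sklearn.model_selection import ParameterSampler

demo_space = {
    "num_leaves": sp_randint(6, 50),              # integers from 6 to 49 (upper bound exclusive)
    "subsample": sp_uniform(loc=0.4, scale=0.6),  # floats in [loc, loc + scale] = [0.4, 1.0]
    "reg_alpha": [0, 1, 5, 10, 100],              # lists are sampled uniformly
}

# Draw three candidate configurations; random_state makes the draw repeatable
candidates = list(ParameterSampler(demo_space, n_iter=3, random_state=0))
for params in candidates:
    print(params)
```

Note that the second argument of `sp_uniform` is the width of the interval, not its upper bound.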

In [None]:
import lightgbm as lgb
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform

# Define fixed hyperparameters passed to lightgbm.LGBMClassifier
hp_fix_dict = {
    "objective": "binary",
    "max_depth": -1,
    "n_estimators": 1000,
    "importance_type": "split",
}
# Define the hyperparameters to be tuned and their search spaces for GBT
hp_tune_dict = {
    "num_leaves": sp_randint(6, 50),
    "min_child_weight": [1e-5, 1e-2, 1e-1, 1, 1e1, 1e4],
    "min_child_samples": sp_randint(100, 500),
    "subsample": sp_uniform(loc=0.4, scale=0.6),
    "colsample_bytree": sp_uniform(loc=0.6, scale=0.4),
    "reg_alpha": [0, 1, 5, 10, 100],
    "reg_lambda": [0, 1, 5, 10, 100],
}
# Define parameters for evaluation and early stopping
# Note: 'eval_metric' is automatically overwritten when a custom loss is used
hp_eval_dict = {
    "eval_metric": "logloss",
    "callbacks": [lgb.log_evaluation(100), lgb.early_stopping(30)],
}
# Define parameters regarding the tuning process
rscv_params = {
    "n_iter": 100,
    "n_jobs": -1,
    "cv": 3,
    "verbose": 100,
}

### 1.7 Fit models

By setting "cl_alpha" and "cl_gamma" to values other than "None", [Weighted Loss](https://github.com/abdumaa/master_thesis_churn_modelling/blob/main/churn_modelling/modelling/custom_loss.py#L86) or [Focal Loss](https://github.com/abdumaa/master_thesis_churn_modelling/blob/main/churn_modelling/modelling/custom_loss.py#L6) is used as both the objective and the evaluation function. \
The model is saved under the name given in "cache_model_name" in [lgbm_fits](https://github.com/abdumaa/master_thesis_churn_modelling/tree/main/churn_modelling/playground/lgbm_fits). \
[code](https://github.com/abdumaa/master_thesis_churn_modelling/blob/main/churn_modelling/modelling/lgbm.py#L164)
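
To give an intuition for the two custom losses, the sketch below evaluates a commonly used binary focal loss formulation with NumPy. It is a sketch under assumptions, not the code in custom_loss.py: the function name `focal_loss` and the exact parameterization are illustrative, and a LightGBM custom objective additionally has to return the gradient and Hessian, which are omitted here. In this formulation, `gamma=0` reduces the focal loss to an alpha-weighted log loss.

```python
import numpy as np


def focal_loss(y_true, p, alpha=0.6, gamma=2.0, eps=1e-12):
    """Mean binary focal loss; with gamma=0 it reduces to an alpha-weighted log loss."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    pos = -alpha * y_true * (1 - p) ** gamma * np.log(p)
    neg = -(1 - alpha) * (1 - y_true) * p ** gamma * np.log(1 - p)
    return float(np.mean(pos + neg))


y = np.array([1, 1, 0, 0])
p = np.array([0.9, 0.3, 0.2, 0.6])  # made-up predicted churn probabilities

weighted = focal_loss(y, p, alpha=0.6, gamma=0.0)  # weighted log loss
focal = focal_loss(y, p, alpha=0.6, gamma=2.0)     # down-weights well-classified rows
print(weighted, focal)
```

Here `alpha` shifts weight towards the positive (churn) class, while `gamma` shrinks the contribution of rows that are already predicted confidently and correctly.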

In [None]:
gbt_fit = gbt_modelling.fit_lgbm(
    df_train=df_train_sampled,
    df_val=df_val,
    hp_fix_dict=hp_fix_dict,
    hp_tune_dict=hp_tune_dict,
    hp_eval_dict=hp_eval_dict,
    rscv_params=rscv_params,
    feature_set=best_feats,
    cl_alpha=None,
    cl_gamma=None,
    save_model=True,
    cache_model_name="lgbm_fit_gbt_up1_best_quot_aNone_gNone",
    path_to_folder=SCRIPT_DIR,
)

### 1.8 Predict on out-of-sample (OOS) data using the fit

[code](https://github.com/abdumaa/master_thesis_churn_modelling/blob/main/churn_modelling/modelling/lgbm.py#L307)

In [None]:
preds, preds_proba = gbt_modelling.predict(
    gbt_modelling.df_oos,
    predict_from_cached_fit=False,
    fit=gbt_fit,
)

### 1.9 Evaluate predictions
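
The metrics in the next cell take two kinds of input: accuracy, precision, recall and F1 are computed from hard class labels, while AUROC and AUPRC are computed from predicted probabilities. A minimal sketch on made-up toy values illustrates the difference. The 0.5 threshold is an assumption for illustration and not necessarily the one used by `predict`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1])
proba = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.9])  # made-up churn probabilities
labels = (proba >= 0.5).astype(int)                # hard labels at an assumed 0.5 threshold

# Label-based metrics depend on the chosen threshold
print("Accuracy:", round(accuracy_score(y_true, labels), 4))
print("Precision:", precision_score(y_true, labels))
print("Recall:", recall_score(y_true, labels))
# Ranking-based metrics use the probabilities directly
print("AUROC:", roc_auc_score(y_true, proba))
```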

In [None]:
from sklearn.metrics import (
    precision_score,
    recall_score,
    f1_score,
    accuracy_score,
    roc_auc_score,
    average_precision_score,
)
# True labels of the out-of-sample set
y_oos = gbt_modelling.df_oos["churn"]
print(
    'Accuracy_OOS:', round(accuracy_score(y_oos, preds), 4),
    '\nPrecision_OOS:', round(precision_score(y_oos, preds), 4),
    '\nRecall_OOS:', round(recall_score(y_oos, preds), 4),
    '\nF1_Score_OOS:', round(f1_score(y_oos, preds), 4),
    '\nAUROC_OOS:', round(roc_auc_score(y_oos, preds_proba), 4),
    '\nAUPRC_OOS:', round(average_precision_score(y_oos, preds_proba), 4),
)

## 2. Execute the entire S-HPTL and get and save the best fits

The following cells perform the Structural-Hyperparameter-Tuning-Loop (S-HPTL). Note that M-HPTL is nested in S-HPTL, so each iteration of S-HPTL executes the entire M-HPTL. This increases the computation time drastically.
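
A rough count makes the cost of this nesting concrete. The sketch below copies the structural values and the search budget from section 2.1 into illustrative variables (`struct_space`, `search_budget`) and assumes that S-HPTL evaluates the full Cartesian product of the structural options. Refits of the best model and early stopping are ignored.

```python
from itertools import product

# Values copied from section 2.1; variable names are illustrative
struct_space = {
    "sampling": {"down1": 0.1, "down2": 0.5},
    "dr_method": ["no_quot", "best_quot"],
    "cl_alpha": [None, 0.6],
    "cl_gamma": [None],
}
search_budget = {"n_iter": 100, "cv": 3}

# One structural candidate per combination (assumption: full Cartesian product)
candidates = list(product(
    struct_space["sampling"].items(),
    struct_space["dr_method"],
    struct_space["cl_alpha"],
    struct_space["cl_gamma"],
))

# Each candidate triggers one randomized search: n_iter configurations times cv folds
fits_per_candidate = search_budget["n_iter"] * search_budget["cv"]
total_fits = len(candidates) * fits_per_candidate
print(len(candidates), fits_per_candidate, total_fits)
```

Under these assumptions, adding a single extra value to any structural option adds at least several hundred model fits.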

### 2.1 Define Hyperparameter dictionaries

These dictionaries will be used for fitting.

In [None]:
import lightgbm as lgb
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform

# Define fixed hyperparameters passed to lightgbm.LGBMClassifier
hp_fix_dict = {
    "objective": "binary",
    "max_depth": -1,
    "n_estimators": 1000,
    "importance_type": "split",
}
# Define the hyperparameters to be tuned and their search spaces for GBT
hp_tune_dict = {
    "num_leaves": sp_randint(6, 50),
    "min_child_weight": [1e-5, 1e-2, 1e-1, 1, 1e1, 1e4],
    "min_child_samples": sp_randint(100, 500),
    "subsample": sp_uniform(loc=0.4, scale=0.6),
    "colsample_bytree": sp_uniform(loc=0.6, scale=0.4),
    "reg_alpha": [0, 1, 5, 10, 100],
    "reg_lambda": [0, 1, 5, 10, 100],
}
# Define parameters for evaluation and early stopping
# Note: 'eval_metric' is automatically overwritten when a custom loss is used
hp_eval_dict = {
    "eval_metric": "logloss",
    "callbacks": [lgb.log_evaluation(100), lgb.early_stopping(30)],
}
# Define parameters regarding the tuning process
rscv_params = {
    "n_iter": 100,
    "n_jobs": -1,
    "cv": 3,
    "verbose": 100,
}
# Define hyperparameter spaces for S-HPTL
# The function expects exactly the structure shown here
# Note that adding more values increases the computation time drastically
hp_struct_dict = {
    'sampling': {
        'down1': 0.1,
        'down2': 0.5,
        # 'down3': x,
        # 'up1': y,
        # 'smote': True,
    },
    'dr_method': ['no_quot', 'best_quot'],
    'cl_alpha': [None, 0.6],
    'cl_gamma': [None],
}

### 2.2 Fit and Evaluate models

The best models of each S-HPTL iteration are saved in [lgbm_fits](https://github.com/abdumaa/master_thesis_churn_modelling/tree/main/churn_modelling/playground/lgbm_fits). \
[code](https://github.com/abdumaa/master_thesis_churn_modelling/blob/main/churn_modelling/modelling/lgbm.py#L380)

In [None]:
results_df = gbt_modelling.fit_and_eval_lgbm_candidates(
    hp_struct_dict=hp_struct_dict,
    hp_fix_dict=hp_fix_dict,
    hp_tune_dict=hp_tune_dict,
    hp_eval_dict=hp_eval_dict,
    rscv_params=rscv_params,
    path_to_folder=SCRIPT_DIR,
    feature_set_from_last_fits=False,
)

### 2.3 Review results

This table is also saved in [lgbm_results](https://github.com/abdumaa/master_thesis_churn_modelling/tree/main/churn_modelling/playground/lgbm_results).

In [None]:
results_df