# Fitting EBM candidate(s)

This notebook contains core functions used for fitting the EBM candidates. The code behind these functions is in the [ebm.py](https://github.com/abdumaa/master_thesis_churn_modelling/blob/main/churn_modelling/modelling/ebm.py) file.
To execute a cell, click on it and press "shift+enter".

## Adjust path!

Please make sure to execute the cell below. It appends the project root to "sys.path" so that the "churn_modelling" package can be imported. Additionally, "SCRIPT_DIR" is later used as the folder in which the [fits](https://github.com/abdumaa/master_thesis_churn_modelling/tree/main/churn_modelling/playground/ebm_fits) and [results](https://github.com/abdumaa/master_thesis_churn_modelling/tree/main/churn_modelling/playground/ebm_results) are cached.

In [None]:
import sys
import os

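# "__file__" is quoted on purpose: notebooks do not define __file__,
# so this resolves to the current working directory of the notebook.
SCRIPT_DIR = os.path.dirname(os.path.abspath("__file__"))
# Cut off the trailing "/playground" (11 characters) and go one level up,
# which puts the repository root on the import path.
sys.path.append(os.path.dirname(SCRIPT_DIR[:-11]))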
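
The slice "[:-11]" only works while the notebook lives in a folder whose name has exactly ten characters, such as "playground". As a minimal sketch, the same root can be derived by walking up two directory levels instead. The path below is hypothetical and only serves to compare both approaches:

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical notebook folder; only the last two components matter here.
script_dir = "/home/user/master_thesis_churn_modelling/churn_modelling/playground"

# What the cell above does: drop the 11 characters of "/playground",
# then take the parent directory.
root_by_slicing = posixpath.dirname(script_dir[:-11])

# Alternative that does not depend on the folder name's length:
# go up two levels from the notebook folder.
root_by_parents = str(PurePosixPath(script_dir).parent.parent)

print(root_by_slicing)
print(root_by_parents)
```

Both variants point at the repository root, which is the folder that contains the "churn_modelling" package.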

## 1. Execute one M-HPTL, then get and save the best fit

The following cells walk you through one Model-Hyperparameter-Tuning-Loop (M-HPTL).

### 1.1 Import the EBM class

In [None]:
from churn_modelling.modelling.ebm import EBM

### 1.2 Call the class

The class automatically loads all datasets from the [data](https://github.com/abdumaa/master_thesis_churn_modelling/tree/main/churn_modelling/data) folder as [attributes](https://github.com/abdumaa/master_thesis_churn_modelling/blob/main/churn_modelling/modelling/ebm.py#L37).

In [None]:
ebm_modelling = EBM()

### 1.3 Sample data

Supported values for "sampling": "up", "down", "smote" \
Supported values for "frac": any float larger than 0 \
[code](https://github.com/abdumaa/master_thesis_churn_modelling/blob/main/churn_modelling/modelling/ebm.py#L68)

In [None]:
df_train_sampled = ebm_modelling.create_sampling(
    df_to_sample=ebm_modelling.df,
    sampling="down",
    frac=0.5,
)
print(df_train_sampled["churn"].value_counts())
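
To illustrate what down-sampling does to the class counts, here is a minimal sketch on toy data. The helper "downsample" is hypothetical, and it assumes that "frac" is the desired ratio of minority to majority rows. The actual "create_sampling" method may interpret "frac" differently, so check the linked code before relying on this reading.

```python
import pandas as pd


def downsample(df, target="churn", frac=0.5, seed=0):
    """Hypothetical helper: shrink the majority class until
    n_minority / n_majority is roughly equal to frac."""
    counts = df[target].value_counts()
    minority, majority = counts.idxmin(), counts.idxmax()
    df_min = df[df[target] == minority]
    df_maj = df[df[target] == majority]
    # Never ask for more majority rows than exist.
    n_keep = min(len(df_maj), int(round(len(df_min) / frac)))
    df_maj_down = df_maj.sample(n=n_keep, random_state=seed)
    # Shuffle so that the classes are not stored in blocks.
    out = pd.concat([df_min, df_maj_down]).sample(frac=1.0, random_state=seed)
    return out.reset_index(drop=True)


# Toy data: 10 churners and 90 non-churners.
toy = pd.DataFrame({"x": range(100), "churn": [1] * 10 + [0] * 90})
sampled = downsample(toy, frac=0.5)
print(sampled["churn"].value_counts())
```

With these toy numbers, all 10 churners are kept and the non-churners are reduced from 90 to 20.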

### 1.4 Detect the best set of features

This function returns the best set of features. It performs [MRMR](https://github.com/abdumaa/master_thesis_churn_modelling/blob/main/churn_modelling/preprocessing/mrmr.py) (minimum redundancy, maximum relevance) on the quotation variables to remove redundant variables. \
[code](https://github.com/abdumaa/master_thesis_churn_modelling/blob/main/churn_modelling/modelling/ebm.py#L97)

In [None]:
best_feats = ebm_modelling.get_best_quot_features(
    df_to_dimreduce=df_train_sampled,
    cv=5,
    return_fix_features=True,
    return_target=True
)
print(f"Best set of features: {best_feats}")
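
The idea behind MRMR can be sketched as a greedy loop: repeatedly pick the feature that is most relevant to the target while being least redundant with the features already chosen. The version below is a simplified illustration that uses absolute Pearson correlation for both relevance and redundancy. It is not the implementation in mrmr.py, and the function name "mrmr_select" and the toy data are made up for this example.

```python
import numpy as np
import pandas as pd


def mrmr_select(X, y, k):
    """Greedy MRMR sketch: score = relevance - mean redundancy."""
    # Relevance: absolute correlation of each feature with the target.
    relevance = X.apply(lambda col: abs(np.corrcoef(col, y)[0, 1]))
    # Redundancy: absolute correlation between features.
    corr = X.corr().abs()
    selected = [relevance.idxmax()]
    while len(selected) < k:
        remaining = [c for c in X.columns if c not in selected]
        redundancy = corr.loc[remaining, selected].mean(axis=1)
        score = relevance[remaining] - redundancy
        selected.append(score.idxmax())
    return selected


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
y = (a + b > 0).astype(int)
X = pd.DataFrame({
    "a": a,
    "a_copy": a + 0.01 * rng.normal(size=500),  # nearly identical to "a"
    "b": b,
})
selected = mrmr_select(X, y, k=2)
print(selected)
```

Because "a" and "a_copy" carry almost the same information, the loop keeps only one of them and uses the second slot for "b".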

### 1.5 Define Hyperparameter dictionaries

These dictionaries will be used for fitting.

In [None]:
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform

# Define fixed hyperparameters passed into the EBM classifier
hp_fix_dict = {
    "validation_size": 0.1111, # 11.11% of the remaining 90% gives an 80/10/10 split
    "early_stopping_rounds": 30,
    "early_stopping_tolerance": 1e-4,
    "max_rounds": 5000,
}
# Define the hyperparameters to be tuned and their value spaces for EBM
hp_tune_dict = {
    "interactions": sp_randint(5, 10),
    "outer_bags": sp_randint(10, 20), # computationally very costly
    "inner_bags": sp_randint(0, 10), # computationally very costly
    "learning_rate": sp_uniform(loc=0.009, scale=0.006),
    "min_samples_leaf": sp_randint(2, 5),
    "max_leaves": sp_randint(2, 5),
}
# Define parameters regarding the tuning process
rscv_params = {
    "n_iter": 10,
    "n_jobs": -1,
    "cv": 3,
    "verbose": 100,
}
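
The value spaces above are frozen scipy distributions, and their bounds are easy to misread: "sp_randint(low, high)" draws integers from low up to and including high - 1, while "sp_uniform(loc, scale)" draws floats from the interval [loc, loc + scale]. The short check below draws samples to make the ranges visible. The sample size and seed are arbitrary choices for this illustration.

```python
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform

# "interactions": sp_randint(5, 10) can yield 5, 6, 7, 8 or 9, never 10.
ints = sp_randint(5, 10).rvs(size=1000, random_state=0)

# "learning_rate": sp_uniform(loc=0.009, scale=0.006) covers 0.009 to 0.015.
floats = sp_uniform(loc=0.009, scale=0.006).rvs(size=1000, random_state=0)

print("interactions range:", ints.min(), "to", ints.max())
print("learning_rate range:", floats.min(), "to", floats.max())
```

By the same rule, "sp_randint(0, 10)" for "inner_bags" includes 0, and "sp_randint(2, 5)" covers the values 2, 3 and 4.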

### 1.6 Fit models

If "cl_alpha" and "cl_gamma" are set to values other than "None", [Weighted Loss](https://github.com/abdumaa/master_thesis_churn_modelling/blob/main/churn_modelling/modelling/custom_loss.py#L86) or [Focal Loss](https://github.com/abdumaa/master_thesis_churn_modelling/blob/main/churn_modelling/modelling/custom_loss.py#L6) is used as the objective loss and evaluation function. \
The model is saved as "cache_model_name" in [ebm_fits](https://github.com/abdumaa/master_thesis_churn_modelling/tree/main/churn_modelling/playground/ebm_fits). \
[code](https://github.com/abdumaa/master_thesis_churn_modelling/blob/main/churn_modelling/modelling/ebm.py#L148)
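
For orientation, the textbook form of the binary focal loss is sketched below in NumPy. It is meant to show how the two parameters interact and is not a copy of custom_loss.py, whose exact formulation may differ. The argument names "alpha" and "gamma" stand in for "cl_alpha" and "cl_gamma", which is an assumption.

```python
import numpy as np


def focal_loss(y_true, p_pred, alpha=0.25, gamma=2.0, eps=1e-12):
    """Mean binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    # p_t is the predicted probability of the true class.
    p_t = np.where(y_true == 1, p, 1 - p)
    # alpha weights the positive class, (1 - alpha) the negative class.
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))


y = [1, 0, 1, 0]
p = [0.9, 0.2, 0.3, 0.6]

weighted = focal_loss(y, p, alpha=0.5, gamma=0.0)  # gamma=0: weighted log loss
focal = focal_loss(y, p, alpha=0.5, gamma=2.0)     # gamma>0: easy cases count less
print(weighted, focal)
```

With gamma set to 0 the expression reduces to a class-weighted log loss. Larger gamma values shrink the contribution of observations that are already classified well.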

In [None]:
ebm_fit = ebm_modelling.fit_ebm(
    df_train=df_train_sampled,
    hp_fix_dict=hp_fix_dict,
    hp_tune_dict=hp_tune_dict,
    rscv_params=rscv_params,
    feature_set=best_feats,
    save_model=True,
    cache_model_name="test",
    path_to_folder=SCRIPT_DIR,
)
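
The keys in "rscv_params" match the arguments of scikit-learn's "RandomizedSearchCV", so the tuning step presumably resembles the sketch below. To keep the example runnable without extra packages, a "GradientBoostingClassifier" stands in for the EBM classifier, and the data, the parameter names of the stand-in model and the scoring metric are assumptions made for illustration only.

```python
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in for the sampled training data.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Fixed hyperparameters go straight into the estimator ...
fixed = {"n_estimators": 50, "random_state": 0}
# ... while tuned hyperparameters are given as distributions.
tuned = {
    "learning_rate": sp_uniform(loc=0.009, scale=0.006),
    "min_samples_leaf": sp_randint(2, 5),
    "max_leaf_nodes": sp_randint(2, 5),
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(**fixed),
    param_distributions=tuned,
    n_iter=5,                      # number of sampled configurations
    cv=3,                          # folds per configuration
    n_jobs=1,
    scoring="average_precision",   # assumed metric, suited to imbalance
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_fit = search.best_estimator_  # refitted on all rows of X
print(best_params)
```

Each of the "n_iter" sampled configurations is fitted once per fold, so the search above trains 15 models before refitting the best one.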

### 1.7 Predict on out-of-sample (OOS) data using the fit

[code](https://github.com/abdumaa/master_thesis_churn_modelling/blob/main/churn_modelling/modelling/ebm.py#L231)

In [None]:
preds, preds_proba = ebm_modelling.predict(
    ebm_modelling.df_oos,
    predict_from_cached_fit=False,
    fit=ebm_fit,
)

### 1.8 Evaluate predictions

In [None]:
from sklearn.metrics import (
    precision_score,
    recall_score,
    f1_score,
    accuracy_score,
    roc_auc_score,
    average_precision_score,
)
# True labels of the out-of-sample data
y_oos = ebm_modelling.df_oos["churn"]
print(
    "Accuracy_OOS:", round(accuracy_score(y_oos, preds), 4),
    "\nPrecision_OOS:", round(precision_score(y_oos, preds), 4),
    "\nRecall_OOS:", round(recall_score(y_oos, preds), 4),
    "\nF1_Score_OOS:", round(f1_score(y_oos, preds), 4),
    "\nAUROC_OOS:", round(roc_auc_score(y_oos, preds_proba), 4),
    "\nAUPRC_OOS:", round(average_precision_score(y_oos, preds_proba), 4),
)
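
Note that the metrics above fall into two groups: accuracy, precision, recall and F1 are computed from hard class labels, whereas AUROC and AUPRC are computed from predicted probabilities. The toy example below makes the difference concrete. The four observations and the 0.5 threshold are made up for illustration.

```python
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

y_true = [0, 0, 1, 1]
proba = [0.1, 0.4, 0.35, 0.8]                 # predicted churn probabilities
labels = [int(p >= 0.5) for p in proba]       # hard labels: [0, 0, 0, 1]

# Label-based metrics depend on the chosen threshold.
acc = accuracy_score(y_true, labels)
prec = precision_score(y_true, labels)
rec = recall_score(y_true, labels)
f1 = f1_score(y_true, labels)

# Ranking-based metrics use the probabilities directly.
auroc = roc_auc_score(y_true, proba)
auprc = average_precision_score(y_true, proba)

print(acc, prec, rec, round(f1, 4), auroc, round(auprc, 4))
```

Here one churner is missed at the 0.5 threshold, which lowers recall to 0.5 even though precision is 1.0. AUROC and AUPRC would not change if a different threshold were chosen.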

## 2. Execute the entire S-HPTL, then get and save the best fits

The following cells perform the Structural-Hyperparameter-Tuning-Loop (S-HPTL). Note that the M-HPTL is nested in the S-HPTL, so every S-HPTL iteration executes the entire M-HPTL. This increases the computation time drastically.

### 2.1 Define Hyperparameter dictionaries

These dictionaries will be used for fitting.

In [None]:
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform

# Define fixed hyperparameters passed into the EBM classifier
hp_fix_dict = {
    "validation_size": 0.1111,
    "early_stopping_rounds": 30,
    "early_stopping_tolerance": 1e-4,
    "max_rounds": 5000,
}
# Define the hyperparameters to be tuned and their value spaces for EBM
hp_tune_dict = {
    "interactions": sp_randint(5, 10),
    "outer_bags": sp_randint(10, 20), # computationally very costly
    "inner_bags": sp_randint(0, 10), # computationally very costly
    "learning_rate": sp_uniform(loc=0.009, scale=0.006),
    "min_samples_leaf": sp_randint(2, 5),
    "max_leaves": sp_randint(2, 5),
}
# Define parameters regarding the tuning process
rscv_params = {
    "n_iter": 10,
    "n_jobs": -1,
    "cv": 3,
    "verbose": 100,
}
# Define hyperparameter spaces for S-HPTL
# The function expects exactly the structure shown here
# Note that every additional value increases the computation time drastically
hp_struct_dict = {
    'sampling': {
        'down1': 0.1,
        'down2': 0.5,
        # 'down3': x,
        # 'up1': y,
        # 'smote': True,
    },
    'dr_method': ['no_quot', 'best_quot'],
}
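
To get a feeling for the cost of the nested loops, the sketch below enumerates the structural combinations and multiplies them by the fits of one random search. It assumes that every combination of sampling setting and "dr_method" triggers exactly one M-HPTL, and it ignores the final refit as well as the feature selection, so the real number of model fits is somewhat higher.

```python
from itertools import product

hp_struct_dict = {
    "sampling": {"down1": 0.1, "down2": 0.5},
    "dr_method": ["no_quot", "best_quot"],
}
rscv_params = {"n_iter": 10, "cv": 3}

# Outer loop (S-HPTL): every sampling setting paired with every dr_method.
candidates = list(
    product(hp_struct_dict["sampling"].items(), hp_struct_dict["dr_method"])
)

# Inner loop (M-HPTL): n_iter sampled configurations, each fitted on cv folds.
fits_per_candidate = rscv_params["n_iter"] * rscv_params["cv"]
total_fits = len(candidates) * fits_per_candidate

for (name, frac), dr_method in candidates:
    print(f"sampling={name} (frac={frac}), dr_method={dr_method}")
print("cross-validation fits in total:", total_fits)
```

With the values above there are 4 structural candidates and 30 cross-validation fits each. Adding a single extra sampling entry would already raise the total from 120 to 180.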

### 2.2 Fit and Evaluate models

The best models of each S-HPTL iteration are saved in [ebm_fits](https://github.com/abdumaa/master_thesis_churn_modelling/tree/main/churn_modelling/playground/ebm_fits). \
[code](https://github.com/abdumaa/master_thesis_churn_modelling/blob/main/churn_modelling/modelling/ebm.py#L296)

In [None]:
results_df = ebm_modelling.fit_and_eval_ebm_candidates(
    hp_struct_dict=hp_struct_dict,
    hp_fix_dict=hp_fix_dict,
    hp_tune_dict=hp_tune_dict,
    rscv_params=rscv_params,
    path_to_folder=SCRIPT_DIR,
    feature_set_from_last_fits=False,
)

### 2.3 Review results

This table is also saved in [ebm_results](https://github.com/abdumaa/master_thesis_churn_modelling/tree/main/churn_modelling/modelling/ebm_results).

In [None]:
results_df