## Variable Descriptions Guide

- **encd_df**: This is the one-hot encoded dataframe used for model training.
- **val_set**: The validation set used to validate the performance of the models during training.
- **train_set_splitted**: The remaining part of the training set after splitting out the validation set.
- **train_set**: The final training set used for training the models.
- **test_set**: The final test set used to evaluate the performance of the trained models.
- **X_train_smoted, y_train_smoted**: The training sets after applying SMOTE to handle class imbalance.


### Transformed Sets
- **transformed_train_set, transformed_test_set**: The transformed training and validation sets without feature engineering, i.e., the transformation applied directly to the one-hot encoded dataframe.
- **transformed_featured_train_set, transformed_featured_val_set**: These are the transformed training and validation sets after feature engineering and transformation.
- **transformed_featured_final_train_set, transformed_featured_test_set**: The transformed training set (without splitting) and the test set.
- **transformed_featured_smoted_train_set, transformed_featured_smoted_test_set**: The transformed SMOTEd training and test sets.

### Optimization

- **featured_lgb_study, featured_xgb_study, featured_cat_study, featured_nn_study**: The Optuna optimization studies for each model on the feature-engineered sets.
- **org_lgb_study, org_xgb_study, org_cat_study, org_nn_study**: The Optuna optimization studies for each model on the original sets (i.e., without feature engineering).


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:

import sys
import os

# Add the src directory to the Python path
sys.path.append(os.path.abspath(os.path.join('..', 'src')))


In [7]:
# Import the libraries and project modules used in this notebook
import numpy as np
import pandas as pd
import tensorflow as tf
import optuna
import warnings
from pathlib import Path
from dotenv import load_dotenv

from features.generate_and_transform_features import FeatureTransformer
from optimization.model_optimizer import ModelOptimizer, save_results, load_results

In [8]:
# Suppress warnings and Optuna logs for cleaner output
warnings.filterwarnings("ignore")
optuna.logging.set_verbosity(optuna.logging.CRITICAL)

# Seed NumPy and TensorFlow for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

env_path = Path('.env')
load_dotenv(env_path)

root_dir = Path(os.getenv('ROOT_DIRECTORY'))
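
`os.getenv` returns `None` when a variable is missing, and `Path(None)` raises a `TypeError`, so the cell above fails with an unhelpful message if `ROOT_DIRECTORY` is not defined in `.env`. A minimal sketch of a guard is shown here; the helper name `resolve_root_dir` is hypothetical and not part of this project.

```python
import os
from pathlib import Path


def resolve_root_dir(env_var: str = "ROOT_DIRECTORY") -> Path:
    """Return the project root from an environment variable, or fail loudly.

    `resolve_root_dir` is a hypothetical helper, not part of this repository.
    """
    value = os.getenv(env_var)
    if not value:
        # Covers both a missing variable and an empty string
        raise RuntimeError(
            f"{env_var} is not set; define it in .env or export it in the shell"
        )
    return Path(value)
```

Calling `root_dir = resolve_root_dir()` after `load_dotenv(env_path)` would then behave like the original line whenever the variable is present.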

# Training and Optimization

Let's optimize the models on both the feature-engineered dataset and the original dataset (without feature engineering).
The code below is not run here because the studies have already been optimized and saved. To rerun the trials, set the variables **run_featured_trials** and **run_org_trials** to ***True***.

### ***Warning***: The trials may take a few hours to run.

In [5]:
run_featured_trials = False
run_org_trials = False

In [9]:
transformed_featured_train_set = pd.read_csv(root_dir/'data'/'processed'/"transformed_featured_train_set.csv")
transformed_featured_test_set = pd.read_csv(root_dir/'data'/'processed'/"transformed_featured_test_set.csv")
transformed_featured_final_train_set = pd.read_csv(root_dir/'data'/'processed'/"transformed_featured_final_train_set.csv")
transformed_featured_smoted_train_set = pd.read_csv(root_dir/'data'/'processed'/"transformed_featured_smoted_train_set.csv")
transformed_featured_val_set = pd.read_csv(root_dir/'data'/'processed'/"transformed_featured_val_set.csv")
val_set = pd.read_csv(root_dir/'data'/'interim'/"val_set.csv")
train_set_splitted = pd.read_csv(root_dir/'data'/'interim'/"train_set_splitted.csv")

In [10]:
# Report files for the studies run on the feature-engineered sets (LightGBM, XGBoost, CatBoost, NN)
featured_report_paths = [
    r"../reports/optemization-study-reports/lgb_featured_study.csv",
    r"../reports/optemization-study-reports/xgb_featured_study.csv",
    r"../reports/optemization-study-reports/catboost_featured_study.csv",
    r"../reports/optemization-study-reports/nn_featured_study.csv",
]

if run_featured_trials:
    optimizer = ModelOptimizer(transformed_featured_train_set, transformed_featured_val_set)
    featured_xgb_study = optimizer.optimize_xgb()
    featured_cat_study = optimizer.optimize_catboost()
    featured_lgb_study = optimizer.optimize_lgb()
    featured_nn_study = optimizer.optimize_nn()

    # Save the studies to the reports directory
    save_results([featured_lgb_study, featured_xgb_study, featured_cat_study, featured_nn_study], featured_report_paths)
else:
    # Load the studies previously run on the feature-engineered dataset
    featured_lgb_study, featured_xgb_study, featured_cat_study, featured_nn_study = load_results(featured_report_paths)

The `ModelOptimizer` class is designed for optimizing machine learning models using various algorithms like CatBoost, LightGBM, XGBoost, and neural networks. The optimization is performed using the Optuna framework, focusing on maximizing a custom metric called weighted recall. This metric is a combination of recall and F1 score, providing a balanced evaluation of model performance. The class also logs additional evaluation metrics like accuracy, precision, recall, F1 score, and ROC AUC to give a comprehensive view of model effectiveness.
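
The exact weights are not stated here, but the `value`, `user_attrs_recall`, and `user_attrs_f1` entries in the study outputs in this notebook are consistent with `0.65 * recall + 0.35 * F1`. The sketch below uses those weights as an assumption inferred from the logged numbers, not taken from the `ModelOptimizer` source; the function name `weighted_recall` is illustrative.

```python
def weighted_recall(recall: float, f1: float, recall_weight: float = 0.65) -> float:
    """Blend recall and F1 into one score.

    The default weight is an assumption inferred from this notebook's outputs.
    """
    if not 0.0 <= recall_weight <= 1.0:
        raise ValueError("recall_weight must be between 0 and 1")
    return recall_weight * recall + (1.0 - recall_weight) * f1


# Recall and F1 of the LightGBM trial reported in this notebook
score = weighted_recall(recall=0.651786, f1=0.621277)
print(round(score, 6))  # 0.641108, the logged objective value for that trial
```

Weighting recall more heavily than F1 favors catching positives while still penalizing models whose precision collapses.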


For more detailed information, refer to the [documentation](../docs/ModelOptimizer.md) or check out the [source code](../src/optimization/model_optimizer.py).

In [11]:
# Report files for the studies run on the original sets (LightGBM, XGBoost, CatBoost, NN)
org_report_paths = [
    r"../reports/optemization-study-reports/lgb_org_study.csv",
    r"../reports/optemization-study-reports/xgb_org_study.csv",
    r"../reports/optemization-study-reports/catboost_org_study.csv",
    r"../reports/optemization-study-reports/nn_org_study.csv",
]

if run_org_trials:
    # Transform the split sets without feature engineering
    transfm = FeatureTransformer(train_set_splitted, val_set)
    transformed_train_set, transformed_test_set = transfm.transform()
    # Optimize on the original (non-feature-engineered) sets, not the featured ones
    org_optimizer = ModelOptimizer(transformed_train_set, transformed_test_set)
    org_xgb_study = org_optimizer.optimize_xgb()
    org_cat_study = org_optimizer.optimize_catboost()
    org_lgb_study = org_optimizer.optimize_lgb()
    org_nn_study = org_optimizer.optimize_nn()

    # Save the studies to the reports directory
    save_results([org_lgb_study, org_xgb_study, org_cat_study, org_nn_study], org_report_paths)
else:
    # Load the studies previously run on the original dataset
    org_lgb_study, org_xgb_study, org_cat_study, org_nn_study = load_results(org_report_paths)

In [12]:
featured_lgb_study.loc[featured_lgb_study['user_attrs_recall'].idxmax()]

number                                             71
value                                        0.641108
datetime_start             2024-08-03 10:41:01.338554
datetime_complete          2024-08-03 10:41:02.153274
duration                       0 days 00:00:00.814720
params_bagging_fraction                      0.582108
params_bagging_freq                                 4
params_feature_fraction                      0.607216
params_lambda_l1                             0.050764
params_lambda_l2                                  0.0
params_learning_rate                         0.059274
params_max_depth                                    4
params_min_data_in_leaf                            97
params_num_leaves                                 108
user_attrs_accuracy                            0.7891
user_attrs_f1                                0.621277
user_attrs_recall                            0.651786
user_attrs_roc                               0.838148
state                       

In [13]:
featured_nn_study.loc[featured_nn_study['user_attrs_recall'].idxmax()]

number                                           71
value                                      0.672981
datetime_start           2024-08-03 15:20:05.386512
datetime_complete        2024-08-03 15:20:16.565764
duration                     0 days 00:00:11.179252
params_batch_size                               114
params_dropout_layer1                      0.429005
params_dropout_layer2                      0.263354
params_learning_rate                       0.000634
params_units_layer1                             283
params_units_layer2                              77
user_attrs_accuracy                        0.753555
user_attrs_f1                              0.604563
user_attrs_precision                        0.52649
user_attrs_recall                          0.709821
user_attrs_roc                             0.837637
state                                      COMPLETE
Name: 71, dtype: object

In [14]:
featured_cat_study.loc[featured_cat_study['user_attrs_recall'].idxmax()]

number                                                10
value                                           0.796816
datetime_start                2024-08-03 14:48:10.632449
datetime_complete             2024-08-03 14:48:12.696147
duration                          0 days 00:00:02.063698
params_bagging_temperature                      0.040039
params_border_count                                  172
params_depth                                           3
params_l2_leaf_reg                              1.177218
params_learning_rate                            0.000102
params_scale_pos_weight                         0.999552
user_attrs_accuracy                             0.265403
user_attrs_f1                                   0.419476
user_attrs_precision                            0.265403
user_attrs_recall                                    1.0
user_attrs_roc                                  0.824842
state                                           COMPLETE
Name: 10, dtype: object
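
In the CatBoost trial above, recall is 1.0 while accuracy and precision are both 0.265403, which is what results when every sample is predicted as the positive class. Selecting purely on maximum recall can therefore surface a degenerate trial. One possible alternative, sketched below on a small made-up table, is to require a minimum precision before ranking by recall; the helper `best_trial_by_recall` and the 0.4 threshold are illustrative assumptions, not part of this project.

```python
import pandas as pd


def best_trial_by_recall(study_df: pd.DataFrame, min_precision: float = 0.4) -> pd.Series:
    """Return the highest-recall trial among those meeting a precision floor."""
    eligible = study_df[study_df["user_attrs_precision"] >= min_precision]
    if eligible.empty:
        raise ValueError("no trial meets the precision floor")
    # idxmax returns an index label, so look the row up with .loc
    return eligible.loc[eligible["user_attrs_recall"].idxmax()]


# Made-up values for illustration only; not results from this project
toy_study = pd.DataFrame(
    {
        "number": [0, 1, 2],
        "user_attrs_precision": [0.27, 0.55, 0.60],
        "user_attrs_recall": [1.00, 0.70, 0.62],
    }
)
print(best_trial_by_recall(toy_study)["number"])
```

With the floor in place, trial 0 in the toy table is excluded despite its perfect recall, and the next-best recall among the remaining trials is returned.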

In [15]:
featured_xgb_study.loc[featured_xgb_study['user_attrs_recall'].idxmax()]

number                                             77
value                                        0.562901
datetime_start             2024-08-03 14:38:10.007691
datetime_complete          2024-08-03 14:38:10.566995
duration                       0 days 00:00:00.559304
params_alpha                                 0.000006
params_colsample_bytree                       0.79048
params_gamma                                      0.0
params_lambda                                0.000001
params_learning_rate                         0.099155
params_max_depth                                   10
params_min_child_weight                             5
params_subsample                             0.791583
user_attrs_accuracy                          0.796209
user_attrs_f1                                0.588517
user_attrs_precision                         0.634021
user_attrs_recall                            0.549107
user_attrs_roc                               0.819452
state                       

**Results are improved.** Feel free to explore the studies further.