# (phase 1) 01 Class imbalance modelling

The phase 1 population is highly imbalanced (roughly 31 negative tests per positive one), so this notebook implements the textbook approaches for class imbalance:
- Resampling techniques (four different techniques)
- Various model classes: logistic regression, random forest, LightGBM and XGBoost
- Cost-sensitive learning and a moving decision threshold, applied in every case

## Import packages and load the data for phase 1

In [95]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split, KFold
from sklearn.metrics import f1_score, make_scorer, confusion_matrix, recall_score, roc_auc_score, roc_curve, average_precision_score
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import importlib
import xgboost as xgb
import lightgbm as lgb

# import from custom package
from auxFuns.EDA import *
from auxFuns.modelling import *

In [82]:
import auxFuns.modelling 
importlib.reload(auxFuns.modelling)

import auxFuns.EDA 
importlib.reload(auxFuns.EDA)

<module 'auxFuns.EDA' from 'c:\\Users\\angel\\Documents\\VSCode\\rsv_modelling_transfer_learning\\auxFuns\\EDA.py'>

In [5]:
# Load phase 1 data
raw_datasets_path = os.getcwd() + '/datasets/raw'
processed_datasets_path = os.getcwd() + '/datasets/processed'

# Phase 1 data
rsv_predictors_df_v2 = pd.read_csv(processed_datasets_path + '/rsv_predictors_phase1_daysDedup_seasons_prevTest_v2.csv',low_memory=False)
rsv_predictors_phase1_df = make_it_categorical_v2(rsv_predictors_df_v2)


In [9]:
# Following the model selection step, these are the final features taken for modelling
selected_features_v2 = ['n_tests_that_day', 'sine','cosine', 'previous_test_daydiff',
                     'Bronchitis', 'CCI',
                     'Acute_upper_respiratory_infection', 'n_immunodeficiencies', 'n_symptoms',
                     'healthcare_seeking', 
                     'General_symptoms_and_signs', 'prev_positive_rsv', 'Influenza',
                     'key_comorbidities','Pneumonia',
                     'season','month_of_the_test','multiple_tests',
                     'BPA','BPAI']
selected_features_v2.append('RSV_test_result')

df_modelling_phase1 = rsv_predictors_phase1_df[selected_features_v2]

df_modelling_phase1.shape

(86058, 21)

## 1. Resampling of the data

Four resampling techniques are compared against a baseline with no resampling:
- Random undersampling: removes some of the majority-class instances. This balances the classes, but it can discard useful information and hurt the model's ability to generalize.
- Random oversampling: duplicates some of the minority-class instances. This balances the classes, but it can lead to overfitting because the same instances are repeated.
- SMOTE-NC: SMOTE (Synthetic Minority Oversampling Technique) adapted for mixed data (continuous and categorical variables). It generates synthetic minority-class instances. This adds diversity and reduces overfitting to some extent, but it can also introduce noise.
- Downsampling and upweighting: undersamples the majority class and then assigns a larger weight to the majority instances that remain, so that the model still sees the imbalance of the original data. This is a compromise, but it can still lose information from the majority class or introduce bias.
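The index arithmetic behind three of these techniques can be sketched with NumPy alone. This is a minimal illustration on made-up labels, not the implementation inside `preprocess_and_resample_rsv`; the toy class sizes and the downsampling `factor` are assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy labels (made up): 1 = minority (Positive), 0 = majority (Negative)
y = np.array([0] * 90 + [1] * 10)
maj_idx = np.flatnonzero(y == 0)
min_idx = np.flatnonzero(y == 1)

# Random undersampling: keep a random subset of the majority class
under_idx = np.concatenate(
    [rng.choice(maj_idx, size=len(min_idx), replace=False), min_idx]
)

# Random oversampling: draw minority rows with replacement until classes match
over_idx = np.concatenate(
    [maj_idx, rng.choice(min_idx, size=len(maj_idx), replace=True)]
)

# Downsample and upweight: keep 1 in `factor` majority rows and give each
# kept majority row a weight of `factor`, so the total majority weight is
# the same as before downsampling
factor = 5  # hypothetical value for illustration
kept_maj = rng.choice(maj_idx, size=len(maj_idx) // factor, replace=False)
du_idx = np.concatenate([kept_maj, min_idx])
du_weights = np.where(y[du_idx] == 0, float(factor), 1.0)

print(np.bincount(y[under_idx]), np.bincount(y[over_idx]), du_weights.sum())
```

SMOTE-NC is left out because it needs a nearest-neighbour search; the imbalanced-learn package provides it as `imblearn.over_sampling.SMOTENC`, which takes the positions of the categorical columns through its `categorical_features` argument.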

In [84]:
resampling_techniques = ['None', 'over', 'under', 'smotenc', 'downsample_upweight']
input_test_size = 0.2
random_seed = 42

resampled_data = {'None': {},
                  'over': {},
                  'under': {},
                  'smotenc': {},
                  'downsample_upweight':{}}

for r in resampling_techniques:
    print('\n----')
    print(f'Resampling using {r}')

    X_train, y_train, X_test, y_test, sample_weights, preprocessor_rsv = preprocess_and_resample_rsv(
        df_modelling_phase1, input_test_size = input_test_size, random_seed = random_seed, resampling_technique = r)
    
    resampled_data[r]['X_train'] = X_train
    resampled_data[r]['y_train'] = y_train
    resampled_data[r]['X_test'] = X_test
    resampled_data[r]['y_test'] = y_test
    resampled_data[r]['sample_weights'] = sample_weights
    resampled_data[r]['preprocessor_rsv'] = preprocessor_rsv
    
    IR_train = y_train.value_counts()['Negative'] / y_train.value_counts()['Positive']
    IR_test = y_test.value_counts()['Negative'] / y_test.value_counts()['Positive']

    print(f'y_TRAIN imbalance ratio: {IR_train}')
    print(f'y_TEST imbalance ratio: {IR_test}')


----
Resampling using None
Resampling method chosen:

None
y_TRAIN imbalance ratio: 31.428638718794158
y_TEST imbalance ratio: 31.41431261770245

----
Resampling using over
Resampling method chosen:

Oversampling
y_TRAIN imbalance ratio: 1.2500093671550077
y_TEST imbalance ratio: 31.41431261770245

----
Resampling using under
Resampling method chosen:

Undersampling
y_TRAIN imbalance ratio: 1.249646726330664
y_TEST imbalance ratio: 31.41431261770245

----
Resampling using smotenc
Resampling method chosen:

SMOTE-sampling
y_TRAIN imbalance ratio: 1.2500093671550077
y_TEST imbalance ratio: 31.41431261770245

----
Resampling using downsample_upweight
Resampling method chosen:

Downsampling and Upweighting
y_TRAIN imbalance ratio: 0.7998115873763542
y_TEST imbalance ratio: 31.41431261770245
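
In the log above the training ratio changes with each technique while the test ratio stays at about 31.4, which is what one expects when the split is made first and only the training part is resampled. A minimal sketch of that ordering, on made-up labels and with a hand-written stratified split (the helper `imbalance_ratio` is hypothetical, not part of `auxFuns`):

```python
import numpy as np

def imbalance_ratio(y):
    """Number of negatives (0) per positive (1)."""
    y = np.asarray(y)
    return (y == 0).sum() / (y == 1).sum()

rng = np.random.default_rng(42)
y = np.array([0] * 930 + [1] * 30)  # toy data: 31 negatives per positive

# Stratified split by hand: shuffle each class and hold out one fifth of it
test_idx = np.concatenate([
    rng.permutation(np.flatnonzero(y == c))[: (y == c).sum() // 5]
    for c in (0, 1)
])
train_mask = np.ones(len(y), dtype=bool)
train_mask[test_idx] = False
y_train, y_test = y[train_mask], y[test_idx]

# Undersample the TRAINING labels only; the test labels are never touched
train_neg = np.flatnonzero(y_train == 0)
train_pos = np.flatnonzero(y_train == 1)
keep = np.concatenate(
    [rng.choice(train_neg, size=len(train_pos), replace=False), train_pos]
)
y_train_res = y_train[keep]

print("train after resampling:", imbalance_ratio(y_train_res))
print("test (original ratio): ", imbalance_ratio(y_test))
```

Keeping the test split at the original class ratio means the reported metrics reflect the prevalence the model would face in practice.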


## 2. Fitting of all ML models and performance evaluation
- All with cost-sensitive learning
- All with a moving threshold
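
Both ideas can be sketched in a few lines of NumPy. Cost-sensitive learning is shown as a class-weighted log loss (one common formulation; the exact normalisation differs between libraries), and the moving threshold as a grid search for the cut-off that maximises F1. The function names and toy data are assumptions for illustration and are not the `auxFuns` helpers. Note that in this notebook `find_optimal_moving_threshold` receives the test split; the sketch tunes on a separate validation split instead, which is a more conservative variant because the test metrics are then not optimised on the data they are reported on.

```python
import numpy as np

def weighted_log_loss(y_true, proba, w_neg=1.0, w_pos=10.0):
    """Log loss where errors on positives cost w_pos / w_neg times more."""
    w = np.where(y_true == 1, w_pos, w_neg)
    p = np.clip(proba, 1e-12, 1 - 1e-12)
    losses = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return float(np.mean(w * losses))

def f1_at_threshold(y_true, proba, threshold):
    """F1 of the positive class when predicting 1 if proba >= threshold."""
    pred = proba >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def best_threshold(y_true, proba, grid=None):
    """Return (threshold, f1) maximising F1 over a grid of thresholds."""
    if grid is None:
        grid = np.round(np.arange(0.01, 1.00, 0.01), 2)
    scores = [f1_at_threshold(y_true, proba, t) for t in grid]
    best = int(np.argmax(scores))  # first maximum wins on ties
    return float(grid[best]), float(scores[best])

# Toy validation labels and predicted probabilities (made up)
y_val = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
p_val = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.70, 0.90])

thr, f1 = best_threshold(y_val, p_val)
print(f"Optimal threshold: {thr}, F1 at that threshold: {f1:.3f}")
```

With a 1:10 class weighting the fitted probabilities are pushed towards the positive class, which is consistent with the optimal thresholds in the logs below sitting well above the default 0.5 for the resampled runs.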

### 2.1. Logistic regression

In [98]:
model_class = LogisticRegression(random_state= random_seed, 
                                class_weight= {'Negative':1, 'Positive': 10})
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'max_iter': [20, 50],
    'solver':['liblinear']
}

target_scorer = make_scorer(f1_score, average='binary', pos_label = 'Positive')
n_cv_folds = 5


for r in resampling_techniques:
    print('\n-------------')
    print(f'Model fitted using Logistic regression and resampling: {r}')

    X_train = resampled_data[r]['X_train']
    y_train = resampled_data[r]['y_train']
    X_test = resampled_data[r]['X_test']
    y_test = resampled_data[r]['y_test']
    sample_weights = resampled_data[r]['sample_weights'] 

    model1 = train_model_rsv(model = model_class, param_grid = param_grid, target_scorer = target_scorer, n_cv_folds = n_cv_folds,
                        X_train = X_train, y_train = y_train)
    optimal_threshold = find_optimal_moving_threshold(trained_model = model1, X_test = X_test, y_test = y_test)


    __,__,__,__,__,__,__,__ = calculate_performance_metrics_rsv(trained_model = model1, X_test = X_test, y_test = y_test,
                                                            threshold = optimal_threshold)


-------------
Model fitted using Logistic regression and resampling: None
Training model ... LogisticRegression(class_weight={'Negative': 1, 'Positive': 10},
                   random_state=42)




Best training parameters:  {'C': 0.01, 'max_iter': 20, 'penalty': 'l2', 'solver': 'liblinear'}
Best training f1-score:  0.33188139088157315
Optimal threshold: 0.58
Optimal f1: 0.36569987389659525


AUC Score: 0.7912467594253123
Precision / Positive predictive value: 0.5534351145038168
Specificity: 0.9929860320124693
Recall / sensitivity: 0.2730696798493409
Negative predictive value: 0.9772271386430679
Accuracy: 0.9707762026493144
F-1: 0.36569987389659525
Precision-Recall AUC: 0.34566211546801523

-------------
Model fitted using Logistic regression and resampling: over
Training model ... LogisticRegression(class_weight={'Negative': 1, 'Positive': 10},
                   random_state=42)




Best training parameters:  {'C': 10, 'max_iter': 20, 'penalty': 'l2', 'solver': 'liblinear'}
Best training f1-score:  0.6313159827750442
Optimal threshold: 0.99
Optimal f1: 0.3490136570561457


AUC Score: 0.7891314034901736
Precision / Positive predictive value: 0.8984375
Specificity: 0.9992206702236077
Recall / sensitivity: 0.21657250470809794
Negative predictive value: 0.975649730742215
Accuracy: 0.9750755287009063
F-1: 0.3490136570561457
Precision-Recall AUC: 0.3387452372786378

-------------
Model fitted using Logistic regression and resampling: under
Training model ... LogisticRegression(class_weight={'Negative': 1, 'Positive': 10},
                   random_state=42)




Best training parameters:  {'C': 1, 'max_iter': 20, 'penalty': 'l2', 'solver': 'liblinear'}
Best training f1-score:  0.6312970449812979
Optimal threshold: 0.99
Optimal f1: 0.3538461538461539


AUC Score: 0.788229241496381
Precision / Positive predictive value: 0.9663865546218487
Specificity: 0.9997602062226485
Recall / sensitivity: 0.21657250470809794
Negative predictive value: 0.9756625519218394
Accuracy: 0.9755984197071811
F-1: 0.3538461538461539
Precision-Recall AUC: 0.336536343481789

-------------
Model fitted using Logistic regression and resampling: smotenc
Training model ... LogisticRegression(class_weight={'Negative': 1, 'Positive': 10},
                   random_state=42)




Best training parameters:  {'C': 10, 'max_iter': 50, 'penalty': 'l1', 'solver': 'liblinear'}
Best training f1-score:  0.6580577533397802
Optimal threshold: 0.99
Optimal f1: 0.35082458770614694


AUC Score: 0.7873677225156985
Precision / Positive predictive value: 0.8602941176470589
Specificity: 0.9988609795575805
Recall / sensitivity: 0.22033898305084745
Negative predictive value: 0.9757554462403373
Accuracy: 0.9748431326981176
F-1: 0.35082458770614694
Precision-Recall AUC: 0.3423451242729777

-------------
Model fitted using Logistic regression and resampling: downsample_upweight
Training model ... LogisticRegression(class_weight={'Negative': 1, 'Positive': 10},
                   random_state=42)




Best training parameters:  {'C': 10, 'max_iter': 20, 'penalty': 'l1', 'solver': 'liblinear'}
Best training f1-score:  0.7217919072810398
Optimal threshold: 0.99
Optimal f1: 0.33568406205923834


AUC Score: 0.7858928327288249
Precision / Positive predictive value: 0.6685393258426966
Specificity: 0.9964630417840658
Recall / sensitivity: 0.224105461393597
Negative predictive value: 0.9758130797229071
Accuracy: 0.9726353706716244
F-1: 0.33568406205923834
Precision-Recall AUC: 0.3299868333571074
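
Apart from the two AUC values, every metric printed above is a function of the four confusion-matrix counts at the chosen threshold. The sketch below makes those definitions explicit; `confusion_metrics` is a hypothetical helper, not the `auxFuns` function. The counts in the usage example are not printed by the notebook: they were back-calculated from the reported figures of the no-resampling logistic regression run and reproduce them.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Threshold-dependent metrics from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),            # positive predictive value
        "specificity": tn / (tn + fp),
        "recall": tp / (tp + fn),               # sensitivity
        "npv": tn / (tn + fn),                  # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

# Counts inferred from the reported metrics (resampling: None, threshold 0.58)
metrics = confusion_metrics(tp=145, fp=117, tn=16564, fn=386)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Read this way, the high accuracy and specificity mostly reflect the large number of true negatives, while recall shows that fewer than a third of the positive tests are detected.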


### 2.2. Random forest

In [94]:
cost_sensitive = True
if cost_sensitive:
    weight_dict = {'Negative': 1, 'Positive': 10}
    model_class = RandomForestClassifier(class_weight= weight_dict, random_state= random_seed)
else:
    model_class = RandomForestClassifier(class_weight= None, random_state= random_seed)
    
param_grid = {
    'n_estimators': [7, 14],
    'max_depth': [10, 20],
    'min_samples_split': [5, 10],
    'min_samples_leaf': [1, 4]
}


target_scorer = make_scorer(f1_score, average='binary', pos_label = 'Positive')
n_cv_folds = 5


for r in resampling_techniques:
    print('\n-------------')
    print(f'Model fitted using Random Forest and resampling: {r}')

    X_train = resampled_data[r]['X_train']
    y_train = resampled_data[r]['y_train']
    X_test = resampled_data[r]['X_test']
    y_test = resampled_data[r]['y_test']
    sample_weights = resampled_data[r]['sample_weights'] 

    model1 = train_model_rsv(model = model_class, param_grid = param_grid, target_scorer = target_scorer, n_cv_folds = n_cv_folds,
                        X_train = X_train, y_train = y_train, sample_weights = sample_weights)
    optimal_threshold = find_optimal_moving_threshold(trained_model = model1, X_test = X_test, y_test = y_test)


    __,__,__,__,__,__,__,__ = calculate_performance_metrics_rsv(trained_model = model1, X_test = X_test, y_test = y_test,
                                                            threshold = optimal_threshold)


-------------
Model fitted using Logistic regression and resampling: None
Training model ... RandomForestClassifier(class_weight={'Negative': 1, 'Positive': 10},
                       random_state=42)
Best training parameters:  {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 5, 'n_estimators': 14}
Best training f1-score:  0.36654162066416374
Optimal threshold: 0.46
Optimal f1: 0.35469107551487417


AUC Score: 0.789289516100899
Precision / Positive predictive value: 0.4518950437317784
Specificity: 0.9887296924644805
Recall / sensitivity: 0.2919020715630885
Negative predictive value: 0.9777105933961705
Accuracy: 0.967232163606786
F-1: 0.35469107551487417
Precision-Recall AUC: 0.3366943872200664

-------------
Model fitted using Logistic regression and resampling: over
Training model ... RandomForestClassifier(class_weight={'Negative': 1, 'Positive': 10},
                       random_state=42)
Best training parameters:  {'max_depth': 20, 'min_samples_leaf': 1, 'min_sample

### 2.3. LightGBM Classifier

In [97]:
cost_sensitive = True
random_seed = 42  # same seed as in the resampling cell

if cost_sensitive:
    weight_dict = {'Negative': 1, 'Positive': 10}
    model_class = lgb.LGBMClassifier(class_weight=weight_dict, random_state=random_seed, objective='binary')
else:
    model_class = lgb.LGBMClassifier(random_state=random_seed, objective='binary')

# LightGBM-specific parameters
param_grid = {
    'n_estimators': [10, 15, 25],
    'max_depth': [3, 6, 9],
    'learning_rate': [0.05, 0.1],
    'subsample': [0.7, 0.9, 1]
}

target_scorer = make_scorer(f1_score, average='binary', pos_label='Positive')
n_cv_folds = 5

# resampled_data comes from section 1; the helper functions come from auxFuns.modelling
for r in resampling_techniques:
    print('\n-------------')
    print(f'Model fitted using LightGBM and resampling: {r}')

    X_train = resampled_data[r]['X_train']
    y_train = resampled_data[r]['y_train']
    X_test = resampled_data[r]['X_test']
    y_test = resampled_data[r]['y_test']
    sample_weights = resampled_data[r]['sample_weights']

    model1 = train_model_rsv(model=model_class, param_grid=param_grid, target_scorer=target_scorer, n_cv_folds=n_cv_folds,
                             X_train=X_train, y_train=y_train)
    optimal_threshold = find_optimal_moving_threshold(trained_model=model1, X_test=X_test, y_test=y_test)

    __, __, __, __, __, __, __, __ = calculate_performance_metrics_rsv(trained_model=model1, X_test=X_test, y_test=y_test,
                                                                       threshold=optimal_threshold)



-------------
Model fitted using LightGBM and resampling: None
Training model ... LGBMClassifier(class_weight={'Negative': 1, 'Positive': 10}, objective='binary',
               random_state=42)
Best training parameters:  {'learning_rate': 0.05, 'max_depth': 9, 'n_estimators': 15, 'subsample': 0.7}
Best training f1-score:  0.37566572907957885
Optimal threshold: 0.45
Optimal f1: 0.35857805255023184


AUC Score: 0.7882138874692058
Precision / Positive predictive value: 1.0
Specificity: 1.0
Recall / sensitivity: 0.2184557438794727
Negative predictive value: 0.9757253158633599
Accuracy: 0.975888914710667
F-1: 0.35857805255023184
Precision-Recall AUC: 0.34406210133077086

-------------
Model fitted using LightGBM and resampling: over
Training model ... LGBMClassifier(class_weight={'Negative': 1, 'Positive': 10}, objective='binary',
               random_state=42)
Best training parameters:  {'learning_rate': 0.1, 'max_depth': 9, 'n_estimators': 25, 'subsample': 0.7}
Best training f1-score: 

### 2.4. XGBoost

In [93]:
cost_sensitive = True

if cost_sensitive:
    weight_scale = 10  
    model_class = xgb.XGBClassifier(scale_pos_weight=weight_scale, random_state=random_seed, objective='binary:logistic')
else:
    model_class = xgb.XGBClassifier(random_state=random_seed, objective='binary:logistic')

# XGBoost-specific parameters
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [3, 6, 9],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.7, 0.9, 1],
    'colsample_bytree': [0.7, 0.9, 1]
}

target_scorer = make_scorer(f1_score, average='binary', pos_label= 1)
n_cv_folds = 5

# resampled_data comes from section 1; the helper functions come from auxFuns.modelling
for r in resampling_techniques:
    print('\n-------------')
    print(f'Model fitted using XGBoost and resampling: {r}')

    X_train = resampled_data[r]['X_train']
    y_train = [1 if label=='Positive' else 0 for label in resampled_data[r]['y_train']] # needed for XGBoost to be fitted
    X_test = resampled_data[r]['X_test']
    y_test = [1 if label=='Positive' else 0 for label in resampled_data[r]['y_test']]
    sample_weights = resampled_data[r]['sample_weights']

    model1 = train_model_rsv(model=model_class, param_grid=param_grid, target_scorer=target_scorer, n_cv_folds=n_cv_folds,
                             X_train=X_train, y_train=y_train, sample_weights = sample_weights)
    optimal_threshold = find_optimal_moving_threshold(trained_model=model1, X_test=X_test, y_test=y_test)

    __, __, __, __, __, __, __, __ = calculate_performance_metrics_rsv(trained_model=model1, X_test=X_test, y_test=y_test,
                                                                       threshold=optimal_threshold)



-------------
Model fitted using XGBoost and resampling: None
Training model ... XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=42, ...)


KeyboardInterrupt: 