# 3. Run ML Models
This notebook runs Logistic Regression and XGBoost models on the full feature set and on a reduced set of the top two features ('start_glc' and 'duration'), so that performance can be compared across the two scenarios. The results are then visualized, with a focus on ROC curves, and the relevant performance metrics are stored for later reference and analysis.

### Objective:
To execute various machine learning models using the dataset provided and assess their performance on predicting the given outcome.

### Data Overview:

* Source: The data is read from `df.csv` and `X.csv` in the `../../data/tidy_data/final_df/` directory defined in the notebook.
* Features: The feature matrix `X` has 414 columns, a mix of numerical and categorical features. Key features include 'start_glc' and 'duration'.
* Target Variable: The prediction target is a binary outcome with a positive rate of about 9.4%. It is referred to as 'y_3', although the code below reads it from the 'y' column of `df.csv`.

### Sections:

1. Setup: Importing necessary libraries and defining paths.
2. Data Loading: Reading the required datasets from their respective directories.
3. Data Preparation: Setting up dataframes to store results and setting up predictor variables and target variable.
4. Model Execution:
    * All Features:
        * Logistic Regression: Execution of logistic regression using all features, hyperparameter tuning, and storing of results.
        * XGBoost: Execution of XGBoost using all features and storing of results.
    * Top Two Features:
        * Logistic Regression: Execution of logistic regression using only the top two features, 'start_glc' and 'duration', and storing of results.
        * XGBoost: Execution of XGBoost using only the top two features and storing of results.
5. Results Compilation: Storing of model results, calculation of mean results, and appending of results to dataframes.
6. Data Saving: Storing results in specified directories.

## 3.0. Packages and data

In [1]:
import pickle
import random
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import clear_output
from sklearn.metrics import roc_curve, auc

import ml_helper as ml_help

# Fix the random seeds for reproducibility
np.random.seed(42)
random.seed(42)

# Add the diametrics directory to the import path
path = "../../diametrics"
sys.path.append(path)


In [2]:
# Directories
directory = '../../data/tidy_data/final_df/'
probability_results_directory = '../../results/probability_results/'
k_fold_results_directory = '../../results/k_fold_results/'
threshold_results_directory = '../../results/threshold_results/'
mean_results_directory = '../../results/mean_results/'
dict_directory = '../../results/dict_results/'
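
The save cells later in the notebook write into these directories and into `../../results/hyperparameters/`, and neither `DataFrame.to_csv` nor `open(..., "wb")` creates missing parent folders. A minimal sketch, assuming the folders may not exist yet, of creating them up front. The temporary root used here is only a stand-in so the example runs anywhere:

```python
import tempfile
from pathlib import Path

# Stand-in root for illustration; the notebook itself uses relative paths ('../../results/...').
root = Path(tempfile.mkdtemp())

result_dirs = [
    root / "results" / "probability_results",
    root / "results" / "k_fold_results",
    root / "results" / "threshold_results",
    root / "results" / "mean_results",
    root / "results" / "dict_results",
    root / "results" / "hyperparameters",  # used by the save cells but not defined above
]

for d in result_dirs:
    # parents=True creates intermediate folders; exist_ok=True makes the call idempotent.
    d.mkdir(parents=True, exist_ok=True)

created = all(d.is_dir() for d in result_dirs)
print(created)
```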

In [3]:
# Read data
df = pd.read_csv(directory + 'df.csv')
strat = df['stratify'] 
X = pd.read_csv(directory + 'X.csv')
y = df['y'] 

In [4]:
y.mean()

0.09394914122716513
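
For a 0/1 target the mean is the fraction of positive cases, so the value above corresponds to roughly 9.4% positives, assuming `y` is binary. A small sketch with made-up labels (not the study data) showing the equivalence, plus a quick check that only two classes are present:

```python
import pandas as pd

# Made-up labels for illustration only; the real target comes from df['y'].
y_demo = pd.Series([0, 0, 0, 1, 0, 0, 0, 0, 1, 0])

# For a 0/1 series, the mean equals the share of rows equal to 1.
prevalence = y_demo.mean()
share_of_ones = (y_demo == 1).sum() / len(y_demo)

# Confirm the target really is binary before interpreting the mean this way.
is_binary = set(y_demo.unique()) <= {0, 1}

print(prevalence, share_of_ones, is_binary)
```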

In [6]:
X.drop_duplicates().shape

(16477, 414)

In [7]:
# Set up dfs for results
df_with_probas = df.copy()
mean_results = pd.DataFrame()

## 3.1. All features

### 3.1.1. Logistic regression

In [None]:
# Run the LR model, uses hyperopt for HP tuning, get accuracy, indices and probabilities for each fold
lr_all_k_fold_results, lr_all_test_sets_index, lr_all_predicted_probas, lr_all_observed, lr_all_shap_values, lr_all_coeffs, lr_all_hps = ml_help.k_fold_accuracies(X, y, strat, True, 'all')
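
`ml_help.k_fold_accuracies` is defined in `ml_helper` and is not shown in this notebook. As a rough illustration only, a stratified k-fold loop that collects per-fold AUC, test indices, and test-set probabilities could look like the sketch below. The function name `k_fold_sketch`, the synthetic data, the five folds, and the absence of hyperparameter tuning are all assumptions for the example, not the helper's actual behavior.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def k_fold_sketch(X, y, strat, n_splits=5, seed=42):
    """Hypothetical stand-in: returns per-fold AUCs, test indices and probabilities."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    aucs, test_indices, probas = [], [], []
    # Folds are stratified on `strat`, which may differ from the target itself.
    for train_idx, test_idx in skf.split(X, strat):
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        p = model.predict_proba(X[test_idx])[:, 1]  # probability of the positive class
        aucs.append(roc_auc_score(y[test_idx], p))
        test_indices.append(test_idx)
        probas.append(p)
    return aucs, test_indices, probas


# Synthetic, imbalanced toy data for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=500) > 1.2).astype(int)

aucs, test_indices, probas = k_fold_sketch(X_demo, y_demo, strat=y_demo)
print(np.mean(aucs))
```

Keeping the test indices alongside the probabilities is what later allows each prediction to be matched back to its row in the full dataset.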

In [None]:
# Store results in dictionary
lr_all_results = {'X':X,
              'probas':lr_all_predicted_probas, 
              'observed':lr_all_observed, 
              'shap':lr_all_shap_values, 
              'coeffs':lr_all_coeffs}

with open(dict_directory+"lr_results_all", "wb") as fp:   
    #Pickling 
    pickle.dump(lr_all_results, fp) 
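
To read a saved results dictionary back in a later notebook, the file is opened in binary read mode and passed to `pickle.load`. A minimal round-trip sketch with a toy dictionary and a temporary file (both are placeholders, not the real results):

```python
import os
import pickle
import tempfile

# Toy dictionary standing in for the real results; keys mirror those used above.
toy_results = {"probas": [[0.1, 0.9], [0.3]], "observed": [[0, 1], [0]]}

path = os.path.join(tempfile.mkdtemp(), "toy_results")

with open(path, "wb") as fp:  # write in binary mode
    pickle.dump(toy_results, fp)

with open(path, "rb") as fp:  # read in binary mode
    loaded = pickle.load(fp)

print(loaded == toy_results)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.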

In [None]:
# Add the mean accuracy to a table for easy perusal
mean_results = ml_help.add_mean_to_df(mean_results, lr_all_k_fold_results, 'lr', 'all')
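
`ml_help.add_mean_to_df` is not shown here. Judging from its arguments and from the `mean_results` table printed later, it appears to average the per-fold metrics and label the resulting row with the model and feature set. A hedged sketch of that idea, where `add_mean_sketch` and the toy fold values are hypothetical:

```python
import pandas as pd


def add_mean_sketch(mean_df, k_fold_df, model, features):
    """Hypothetical helper: append the mean of each numeric metric as one labelled row."""
    row = k_fold_df.mean(numeric_only=True).to_frame().T
    row.index = ["mean"]
    row["model"] = model
    row["features"] = features
    # Skip concat on the first call to avoid combining with an empty frame.
    return row if mean_df.empty else pd.concat([mean_df, row])


# Toy per-fold metrics for illustration only.
folds = pd.DataFrame({"roc_auc": [0.80, 0.84, 0.82], "brier": [0.07, 0.06, 0.065]})

summary = pd.DataFrame()
summary = add_mean_sketch(summary, folds, "lr", "two")
summary = add_mean_sketch(summary, folds, "xgb", "two")
print(summary)
```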

In [None]:
# Add each row's test-fold predicted probability as a new column of the full dataset
df_with_probas = ml_help.add_proba_col(df_with_probas, lr_all_test_sets_index, lr_all_predicted_probas, 'probas_lr_all')
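
`ml_help.add_proba_col` is also defined outside this notebook. Assuming a standard k-fold split in which each row appears in exactly one test fold, the per-fold probabilities can be written back to the full dataframe by row position, giving one held-out prediction per row. A sketch of that mechanism, with `add_proba_col_sketch` and the toy data as assumptions:

```python
import numpy as np
import pandas as pd


def add_proba_col_sketch(df, test_sets_index, predicted_probas, col_name):
    """Hypothetical helper: scatter per-fold probabilities into one column by row position."""
    out = df.copy()
    values = np.full(len(out), np.nan)
    for idx, probas in zip(test_sets_index, predicted_probas):
        values[np.asarray(idx)] = probas  # each fold fills only its own test rows
    out[col_name] = values
    return out


# Toy data: six rows split into two test folds.
df_demo = pd.DataFrame({"y": [0, 1, 0, 0, 1, 0]})
fold_indices = [np.array([0, 2, 4]), np.array([1, 3, 5])]
fold_probas = [np.array([0.1, 0.2, 0.9]), np.array([0.8, 0.3, 0.05])]

df_demo = add_proba_col_sketch(df_demo, fold_indices, fold_probas, "probas_demo")
print(df_demo)
```

Starting from a column of NaN makes any row that was never placed in a test fold easy to spot afterwards.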

In [None]:
# Save k-fold results
lr_all_k_fold_results.to_csv(k_fold_results_directory+'lr_all.csv', index=False)

In [None]:
# Save hyperparameters
pd.DataFrame(lr_all_hps).to_csv('../../results/hyperparameters/lr_all.csv', index=False)

### 3.1.2. XGBoost

In [32]:
# Run the XGBoost model, uses optuna for HP tuning, get accuracy, indices and probabilities for each fold
xgb_ts_k_fold_results, xgb_ts_test_sets_index, xgb_ts_predicted_probas, xgb_ts_observed, xgb_ts_shap, _, xgb_ts_hps = ml_help.k_fold_accuracies(X, y, strat, False, 'all')

[I 2023-10-25 15:31:35,438] A new study created in memory with name: no-name-3e9b4530-1fe4-43e6-8228-81da8aae9d7f

KeyboardInterrupt: 

In [None]:
# Store results in dictionary
xgb_ts_results = {'X': X,
                  'probas': xgb_ts_predicted_probas,
                  'observed': xgb_ts_observed,
                  'shap': xgb_ts_shap}

with open(dict_directory+"xgb_ts", "wb") as fp:   
    #Pickling 
    pickle.dump(xgb_ts_results, fp)

In [None]:
# Add the mean accuracy to a table for easy perusal
mean_results = ml_help.add_mean_to_df(mean_results, xgb_ts_k_fold_results, 'xgb', 'all')

In [None]:
# Add each row's test-fold predicted probability as a new column of the full dataset
df_with_probas = ml_help.add_proba_col(df_with_probas, xgb_ts_test_sets_index, xgb_ts_predicted_probas , 'probas_xgb_ts')

In [None]:
# Save k-fold results
xgb_ts_k_fold_results.to_csv(k_fold_results_directory+'xgb_ts.csv', index=False)

In [None]:
# Save hyperparameters
pd.DataFrame(xgb_ts_hps).to_csv('../../results/hyperparameters/xgb_ts.csv', index=False)

## 3.2. Two features
The feature selection process identified two features as the most important: starting glucose (`start_glc`) and duration of the exercise bout (`duration`).

In [8]:
# Select the two features from feature selection
X_two = X[['start_glc','duration']]

### 3.2.1. Logistic regression

In [None]:
# Run the LR model, uses hyperopt for HP tuning, get accuracy, indices and probabilities for each fold
lr_two_k_fold_results, lr_two_test_sets_index, lr_two_predicted_probas, lr_two_observed, lr_two_shap_values, lr_two_coeffs, lr_two_hps = ml_help.k_fold_accuracies(X_two, y, strat, True, 'two')

100%|██████████| 60/60 [00:03<00:00, 19.39trial/s, best loss: -0.828102589572673] 
100%|██████████| 60/60 [00:01<00:00, 39.14trial/s, best loss: -0.8214115328602904]
100%|██████████| 60/60 [00:01<00:00, 32.46trial/s, best loss: -0.8245795345759641]
100%|██████████| 60/60 [00:01<00:00, 31.50trial/s, best loss: -0.8253699257912561]
100%|██████████| 60/60 [00:01<00:00, 35.68trial/s, best loss: -0.8236815065259309]
100%|██████████| 60/60 [00:02<00:00, 27.62trial/s, best loss: -0.8257703790871129]
100%|██████████| 60/60 [00:02<00:00, 27.51trial/s, best loss: -0.8275854856569197]
100%|██████████| 60/60 [00:02<00:00, 28.64trial/s, best loss: -0.8282459525585395]
100%|██████████| 60/60 [00:02<00:00, 29.69trial/s, best loss: -0.8226750204137498]
100%|██████████| 60/60 [00:02<00:00, 24.88trial/s, best loss: -0.8263983203357084]
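
Hyperopt minimizes its objective, so a score that should be maximized is usually returned negated. The negative `best loss` values above are therefore presumably negated scores from the tuning step (the metric itself is defined in `ml_helper`, not here). A short sketch that flips the sign and summarizes the ten values copied from the log, rounded to six decimals:

```python
import statistics

# Best losses copied from the progress bars above, rounded to six decimals.
best_losses = [
    -0.828103, -0.821412, -0.824580, -0.825370, -0.823682,
    -0.825770, -0.827585, -0.828246, -0.822675, -0.826398,
]

# Assumption: the loss is a negated score, so flipping the sign recovers the score.
best_scores = [-loss for loss in best_losses]

mean_score = statistics.mean(best_scores)
spread = max(best_scores) - min(best_scores)
print(round(mean_score, 4), round(spread, 4))
```

A small spread across folds suggests the tuned score is stable with respect to the choice of training split.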


In [11]:
# Store results in dictionary
lr_two_results = {'X':X_two,
              'probas':lr_two_predicted_probas, 
              'observed':lr_two_observed, 
              'shap':lr_two_shap_values, 
              'coeffs':lr_two_coeffs}

with open(dict_directory+"lr_two", "wb") as fp:   
    #Pickling 
    pickle.dump(lr_two_results, fp) 

In [12]:
# Add the mean accuracy to a table for easy perusal
mean_results = ml_help.add_mean_to_df(mean_results, lr_two_k_fold_results, 'lr', 'two')

In [13]:
# Add each row's test-fold predicted probability as a new column of the full dataset
df_with_probas = ml_help.add_proba_col(df_with_probas, lr_two_test_sets_index, lr_two_predicted_probas, 'probas_lr_two')

In [14]:
# Save k-fold results
lr_two_k_fold_results.to_csv(k_fold_results_directory+'lr_two.csv', index=False)

In [15]:
# Save hyperparameters
pd.DataFrame(lr_two_hps).to_csv('../../results/hyperparameters/lr_two.csv', index=False)

### 3.2.2. XGBoost

In [9]:
# Run the XGBoost model, uses optuna for HP tuning, get accuracy, indices and probabilities for each fold
xgb_two_k_fold_results, xgb_two_test_sets_index, xgb_two_predicted_probas, xgb_two_observed, xgb_two_shap, _, xgb_two_hps = ml_help.k_fold_accuracies(X_two, y, strat, False, 'two')

[I 2023-10-26 13:49:51,935] A new study created in memory with name: no-name-8d3db598-f09d-4ecc-a4fb-8d6abec35267
[I 2023-10-26 13:49:59,753] Trial 0 finished with value: 0.8442211000000001 and parameters: {'n_estimators': 435, 'max_depth': 5, 'min_child_weight': 10, 'subsample': 0.5861685334492945, 'colsample_bytree': 0.5813641858449137, 'eta': 0.14828502813262978, 'learning_rate': 0.36330793003393125, 'reg_alpha': 5, 'reg_lambda': 2, 'gamma': 1}. Best is trial 0 with value: 0.8442211000000001.
[I 2023-10-26 13:50:05,773] Trial 1 finished with value: 0.8406878000000001 and parameters: {'n_estimators': 40, 'max_depth': 7, 'min_child_weight': 7, 'subsample': 0.6602619082639372, 'colsample_bytree': 0.9243931970038819, 'eta': 0.2942641546509594, 'learning_rate': 0.020584057548178984, 'reg_alpha': 3, 'reg_lambda': 2, 'gamma': 5}. Best is trial 0 with value: 0.8442211000000001.
[I 2023-10-26 13:50:11,421] Trial 2 finished with value: 0.8442734

In [10]:
# Store results in dictionary
xgb_two_results = {'X':X_two, 
                'probas':xgb_two_predicted_probas, 
                'observed':xgb_two_observed, 
                'shap':xgb_two_shap
                }

with open(dict_directory+"xgb_results_two", "wb") as fp:   
    #Pickling 
    pickle.dump(xgb_two_results, fp)

In [11]:
# Add the mean accuracy to a table for easy perusal
mean_results = ml_help.add_mean_to_df(mean_results, xgb_two_k_fold_results, 'xgb', 'two')

In [12]:
mean_results

| | roc_auc | mae | logloss | brier | threshold | accuracy | precision | recall | f1 | predicted_positive_rate | observed_positive_rate | tpr | fpr | specificity | balanced_accuracy | model | features |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | 0.849698 | 0.127742 | 0.227546 | 0.063908 | 0.084 | 0.755355 | 0.249717 | 0.773999 | 0.375254 | 0.295988 | 0.093949 | 0.773999 | 0.246469 | 0.753531 | 0.763765 | xgb | two |
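
Some of the columns in this table are tied together by definition: `tpr` is the same quantity as `recall`, `fpr` is `1 - specificity`, and `balanced_accuracy` is the average of recall and specificity. Because these relations are linear they also hold for fold means, which makes them a cheap consistency check. (F1 is not linear in precision and recall, so the mean F1 need not equal the F1 computed from mean precision and mean recall.) A sketch using the values reported above:

```python
# Values copied from the mean_results row above.
recall = 0.773999
tpr = 0.773999
fpr = 0.246469
specificity = 0.753531
balanced_accuracy = 0.763765

tol = 1e-6  # the table is rounded to six decimals

checks = {
    "tpr equals recall": abs(tpr - recall) < tol,
    "fpr equals 1 - specificity": abs(fpr - (1 - specificity)) < tol,
    "balanced accuracy is mean of recall and specificity":
        abs(balanced_accuracy - (recall + specificity) / 2) < tol,
}

for name, ok in checks.items():
    print(f"{name}: {ok}")
```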


In [13]:
# Add each row's test-fold predicted probability as a new column of the full dataset
df_with_probas = ml_help.add_proba_col(df_with_probas, xgb_two_test_sets_index, xgb_two_predicted_probas , 'probas_xgb_two')

In [14]:
# Save k-fold results
xgb_two_k_fold_results.to_csv(k_fold_results_directory+'xgb_two.csv', index=False)

In [15]:
# Save hyperparameters
pd.DataFrame(xgb_two_hps).to_csv('../../results/hyperparameters/xgb_two.csv', index=False)

In [16]:
# Save dataframe with all predicted probas
df_with_probas.to_csv(probability_results_directory + 'xgb_two.csv')