<a href="https://colab.research.google.com/github/cm-int/machine-learning-fundamentals/blob/main/module_3/Labs/Lab3_2_Handling_Imbalanced_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 3.2: Handling Imbalanced Data

In this lab, you'll perform the following tasks:

1. Build a Random Forest model to classify an imbalanced dataset without making any modifications.
1. Examine the results and evaluate the performance using appropriate metrics.
1. Use sampling to balance the dataset and rebuild and retest the model.
1. Use bagging with sampling and rebuild and retest the model.
1. Use boosting with sampling and rebuild and retest the model.
1. Calibrate the model and retest it.
1. Build another model using the original imbalanced dataset, then calibrate and evaluate the model.
1. Combine models using a VotingClassifier and evaluate the results.

## Scenario

Diabetes is among the most prevalent chronic diseases in the United States, impacting millions of Americans each year and exerting a significant financial burden on the economy.

Complications such as heart disease, vision loss, lower-limb amputation, and kidney disease are associated with chronically high blood sugar levels in people with diabetes. While there is no cure for diabetes, strategies like losing weight, eating healthily, being active, and receiving medical treatments can mitigate the harms of this disease in many patients. Early diagnosis can lead to lifestyle changes and more effective treatment, making predictive models for diabetes risk important tools for the public and for public health officials.

The Behavioral Risk Factor Surveillance System (BRFSS) is a system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.

The dataset contains the following columns:

Input variables:
* HighBP. 0=no high BP, 1=high BP
* HighChol. 0=no high cholesterol, 1=high cholesterol
* CholCheck. Has the patient had a cholesterol check in the last 5 years? 0=no, 1=yes
* BMI. Body Mass Index
* Smoker. Has the patient smoked at least 100 cigarettes in their entire life? [Note: 5 packs = 100 cigarettes] 0=no, 1=yes
* Stroke. Has the patient ever had a stroke? 0=no, 1=yes
* HeartDiseaseorAttack. Does the patient have coronary heart disease (CHD) or myocardial infarction (MI)? 0=no, 1=yes
* PhysActivity. Has the patient performed any physical activity in past 30 days, not including job? 0=no, 1=yes
* Fruits. Does the patient consume fruit 1 or more times per day? 0=no, 1=yes
* Veggies. Does the patient consume vegetables 1 or more times per day? 0=no, 1=yes
* HvyAlcoholConsump. Is the patient a heavy drinker (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week)? 0=no, 1=yes
* AnyHealthcare. Does the patient have any kind of health care coverage, including health insurance, prepaid plans such as HMO, etc? 0=no, 1=yes
* NoDocbcCost. Was there a time in the past 12 months when the patient needed to see a doctor but could not because of cost? 0=no, 1=yes
* GenHlth. Would the patient say that in general their health is: scale 1-5 1=excellent, 2=very good, 3=good, 4=fair, 5=poor
* MentHlth. Including stress, depression, and problems with emotions, for how many days during the past 30 days was the patient's mental health not good? scale 1-30 days
* PhysHlth. Including physical illness and injury, for how many days during the past 30 days was the patient's physical health not good? scale 1-30 days
* DiffWalk. Does the patient have serious difficulty walking or climbing stairs? 0=no, 1=yes
* Sex. 0=female, 1=male
* Age. 13-level age category (_AGEG5YR see codebook) 1=18-24, 9=60-64, 13=80 or older
* Education. Education level (EDUCA see codebook) scale 1-6 1=Never attended school or only kindergarten, 2=Grades 1 through 8 (Elementary), 3=Grades 9 through 11 (Some high school), 4=Grade 12 or GED (High school graduate), 5=College 1 year to 3 years (Some college or technical school), 6=College 4 years or more (College graduate)
* Income. Income scale (INCOME2 see codebook) scale 1-8. 1=less than \$10,000, 5=less than \$35,000, 8=\$75,000 or more

Output variable:
* Diabetes (0=No Risk, 1=At Risk)

## Requirements
The aim of this lab is to construct a machine learning classification model that can detect whether a patient is at risk of diabetes. The model must minimize the number of false negatives.
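
To make the requirement concrete, the following minimal sketch shows how false negatives relate to recall in a confusion matrix. The label arrays are hypothetical toy values, not taken from the lab dataset, and the variable names are illustrative only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 1 = at risk, 0 = no risk (not taken from the lab data)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 1])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# The false negative rate is the share of at-risk patients the model missed
false_negative_rate = fn / (fn + tp)

# Recall (sensitivity) is the complement of the false negative rate
recall = recall_score(y_true, y_pred)

print(f"False negatives: {fn}, FNR: {false_negative_rate:.2f}, recall: {recall:.2f}")
```

Because recall equals 1 minus the false negative rate, minimizing false negatives amounts to maximizing recall for the positive class, usually at the cost of more false positives.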

## Acknowledgements:
This dataset was released by the CDC.

The solution code for this lab is available <a href="https://colab.research.google.com/github/cm-int/machine-learning-fundamentals/blob/main/module_3/Labs/Lab3_2_Handling_Imbalanced_Data_Solution.ipynb" target="_parent">here</a>

In [None]:
# Install the imbalanced-learn library

!pip install -U imbalanced-learn

In [None]:
# Download the diabetes_data.csv file from GitHub
# This step is complete

!wget 'https://raw.githubusercontent.com/cm-int/machine-learning-fundamentals/main/module_3/Labs/diabetes_data.csv'

In [None]:
# Load the data and create the diabetes_data DataFrame
# This step is complete
import numpy as np
import pandas as pd

diabetes_data = pd.read_csv('diabetes_data.csv')
diabetes_data

In [None]:
# Remove any observations with missing data


In [None]:
# Examine the structure of the data using the info() method


In [None]:
# Look at the statistics for the DataFrame with the describe() method


In [None]:
# Extract the class ('Diabetes') and calculate the amount of imbalance between the positive and negative class labels

has_diabetes = diabetes_data['Diabetes']


In [None]:
# Remove the class from the DataFrame


In [None]:
# Scale the data

from sklearn.preprocessing import MinMaxScaler 


In [None]:
# Split the data into train and test datasets named features_train, features_test, predictions_train and predictions_test

from sklearn.model_selection import train_test_split


# Create and fit an initial model using a Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import ConfusionMatrixDisplay

# Create the model. Name it forest_model

# Make predictions with the test data and examine the confusion matrix


**What do these results indicate?**


In [None]:
# Plot the calibration curve (create 20 bins)

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve 


In [None]:
# Calculate the accuracy of the model 


**What does the calibration curve imply for this model?**
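
As a reference point when reading a calibration curve, the sketch below builds one for synthetic probabilities that are well calibrated by construction. The data is randomly generated (an assumption for illustration, unrelated to the diabetes dataset), so the binned points should lie close to the diagonal.

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)

# Synthetic predicted probabilities, and labels drawn so that
# P(label = 1) equals the predicted probability (well calibrated by design)
y_prob = rng.uniform(0, 1, size=5000)
y_true = (rng.uniform(0, 1, size=5000) < y_prob).astype(int)

# prob_true: observed fraction of positives in each bin
# prob_pred: mean predicted probability in each bin
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=20)

# Largest vertical distance between the curve and the diagonal
max_gap = np.max(np.abs(prob_true - prob_pred))
print(f"Bins used: {len(prob_true)}, largest gap from the diagonal: {max_gap:.3f}")
```

Plotting `prob_pred` against `prob_true` alongside the line y = x gives the usual picture: points below the diagonal indicate over-confident probabilities, and points above it indicate under-confident ones.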


In [None]:
# Make predictions and calculate the G-Mean 

from imblearn.metrics import geometric_mean_score 


In [None]:
# Calculate the F0.5, F1, and F2 scores

from sklearn.metrics import fbeta_score


In [None]:
# Calculate the Brier score 

from sklearn.metrics import brier_score_loss 



**What do these metrics indicate?**
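
The sketch below shows how these metrics are computed on a small set of hypothetical labels and probabilities (toy values, not lab results). The G-Mean is calculated by hand from sensitivity and specificity, which should agree with `geometric_mean_score` for a binary problem.

```python
import numpy as np
from sklearn.metrics import recall_score, fbeta_score, brier_score_loss

# Hypothetical labels and predicted probabilities for the positive class
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.7, 0.4, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6])
y_pred = (y_prob >= 0.5).astype(int)

# G-Mean: geometric mean of recall on the positive and negative classes
sensitivity = recall_score(y_true, y_pred, pos_label=1)
specificity = recall_score(y_true, y_pred, pos_label=0)
g_mean = np.sqrt(sensitivity * specificity)

# F-beta: beta < 1 weights precision more heavily, beta > 1 weights recall more heavily
f_scores = {beta: fbeta_score(y_true, y_pred, beta=beta) for beta in (0.5, 1, 2)}

# Brier score: mean squared error of the predicted probabilities (lower is better)
brier = brier_score_loss(y_true, y_prob)

print(f"G-Mean: {g_mean:.3f}")
for beta, score in f_scores.items():
    print(f"F{beta}: {score:.3f}")
print(f"Brier score: {brier:.3f}")
```

In this toy example precision is higher than recall, so the F0.5 score is the highest and the F2 score the lowest. When false negatives are the main concern, the F2 score is the most relevant of the three.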


# Try sampling to balance the class labels

In [None]:
# Create and fit a BalancedRandomForestClassifier estimator
from imblearn.ensemble import BalancedRandomForestClassifier

# Name the model ensemble_model


In [None]:
# Examine the confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay


**How do the false positive and false negative rates of this model compare with those of the previous one?**


In [None]:
# Plot the calibration curve

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve 


**What does this curve show?**


In [None]:
# Calculate the accuracy of the model 


In [None]:
# Calculate the G-Mean 

from imblearn.metrics import geometric_mean_score 


In [None]:
# Calculate the F0.5, F1, and F2 scores

from sklearn.metrics import fbeta_score


In [None]:
# Calculate the Brier score 

from sklearn.metrics import brier_score_loss 


In [None]:
# Compare the skill level of this model to the Random Forest model


**What do these metrics tell you?**
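
One way to compare the skill of two models, assuming that "skill level" here refers to the Brier skill score, is to express one model's Brier score relative to a reference model's. The helper `brier_skill_score` and the probability arrays below are hypothetical and for illustration only.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

def brier_skill_score(y_true, prob_model, prob_reference):
    """Return 1 - BS_model / BS_reference (hypothetical helper).

    Positive values mean the model's probabilities are better than the
    reference's, 0 means no improvement, negative values mean worse.
    """
    bs_model = brier_score_loss(y_true, prob_model)
    bs_reference = brier_score_loss(y_true, prob_reference)
    return 1.0 - bs_model / bs_reference

# Toy labels and positive-class probabilities from two imaginary models
y_true = np.array([1, 1, 0, 0, 0])
prob_reference = np.array([0.6, 0.4, 0.4, 0.3, 0.2])
prob_candidate = np.array([0.8, 0.7, 0.2, 0.1, 0.1])

skill = brier_skill_score(y_true, prob_candidate, prob_reference)
print(f"Brier skill score of the candidate relative to the reference: {skill:.3f}")
```

The reference can be any baseline, such as the original Random Forest model. A model scored against itself has a skill of 0.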


# Compare sampling to bagging

In [None]:
# Reuse the Random Forest classifier created earlier 
from imblearn.ensemble import BalancedBaggingClassifier

# Name the model bag_model


In [None]:
# Examine the confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay



**What does this confusion matrix show?**


In [None]:
# Plot the calibration curve

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve 


In [None]:
# Calculate the accuracy of the model 


In [None]:
# Calculate the G-Mean 

from imblearn.metrics import geometric_mean_score 


In [None]:
# Calculate the F0.5, F1, and F2 scores

from sklearn.metrics import fbeta_score


In [None]:
# Calculate the Brier score 

from sklearn.metrics import brier_score_loss 


In [None]:
# Compare the skill level of this model to the Random Forest model


**What do these metrics show?**


# Try sampling with a different classifier - the Random Undersampler with AdaBoost (RUSBoostClassifier)

In [None]:
# Again, reuse the Random Forest classifier created earlier
from imblearn.ensemble import RUSBoostClassifier

# Name the model rus_model

In [None]:
# Examine the confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay


In [None]:
# Plot the calibration curve

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve 


**What can you tell about this model?**


In [None]:
# Calculate the accuracy of the model 


In [None]:
# Calculate the G-Mean 

from imblearn.metrics import geometric_mean_score 


In [None]:
# Calculate the F0.5, F1, and F2 scores

from sklearn.metrics import fbeta_score


In [None]:
# Calculate the Brier score 

from sklearn.metrics import brier_score_loss 


In [None]:
# Compare the skill level of this model to the previous models


**What do these metrics show?**



# Tune the threshold for the BalancedRandomForestClassifier

The BalancedRandomForestClassifier model had the lowest false negative rate of the models seen so far.
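
Threshold tuning with Youden's J statistic can be sketched as follows. The scores are synthetic (drawn from two Beta distributions as an assumption for illustration), not output from the lab's model, and the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(42)

# Synthetic imbalanced problem: 800 negatives, 200 positives,
# with positives tending to receive higher scores
y_true = np.concatenate([np.zeros(800, dtype=int), np.ones(200, dtype=int)])
y_prob = np.concatenate([rng.beta(2, 5, size=800), rng.beta(5, 2, size=200)])

# roc_curve returns one (FPR, TPR) point per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# Youden's J = TPR - FPR; its maximum is the ROC point farthest above the diagonal
j_scores = tpr - fpr
best = np.argmax(j_scores)
optimal_threshold = thresholds[best]

# Classify as positive when the predicted probability reaches the threshold
adjusted_predictions = (y_prob >= optimal_threshold).astype(int)

print(f"Optimal threshold: {optimal_threshold:.3f}, J: {j_scores[best]:.3f}")
```

Youden's J weights sensitivity and specificity equally. When false negatives are more costly than false positives, a threshold lower than the J-optimal one may be preferable.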

In [None]:
from sklearn.metrics import roc_curve

# Find the FPR, TPR, and thresholds


In [None]:
# Calculate Youden's J Statistic

# Find the threshold at this point. Name it optimal_threshold


In [None]:
import matplotlib.pyplot as plt

# Plot the curve of FPR versus TPR and highlight the threshold


In [None]:
# Create a new predictions test set named adjusted_predictions_test.
# Set the predicted values to 1 for all predictions in this new test dataset with a predicted probability >= the optimal threshold

# Calculate and display how many predictions have changed


In [None]:
# Find the Precision, Recall, F1 Score, AUC, and Accuracy for the model when using the adjusted threshold
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, accuracy_score


In [None]:
# Plot the ROC curve

from sklearn import metrics


**How has this adjustment changed the false negative rate of the model?**


In [None]:
# Plot the calibration curve for the predictions made using the adjusted probability threshold

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve 



**What does this curve show?**


In [None]:
# Calculate the G-Mean 

from imblearn.metrics import geometric_mean_score 


In [None]:
# Calculate the F0.5, F1, and F2 scores

from sklearn.metrics import fbeta_score


In [None]:
# Calculate the Brier score 

from sklearn.metrics import brier_score_loss 


In [None]:
# Compare the skill level of this model to the Random Forest model


**What do these metrics show?**


# Try the same strategy with a different algorithm

Bagging with Logistic Regression. This is just for comparison.

In [None]:
# Create and fit a Balanced Bagging classifier based on a Logistic Regression classifier with the Newton CG solver (lbfgs tends not to converge with this dataset)
from sklearn.linear_model import LogisticRegression

# Name the model lg_model


In [None]:
# Examine the confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay


In [None]:
# Plot the calibration curve

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve 


In [None]:
from sklearn.metrics import roc_curve

# Find the FPR, TPR, and thresholds for this model


In [None]:
# Calculate Youden's J Statistic

# Find the threshold at this point


In [None]:
import matplotlib.pyplot as plt

# Plot the FPR versus TPR and highlight the threshold


In [None]:
# Create a new predictions test set named adjusted_predictions_test.
# Set the predicted values to 1 in this dataset for all predictions with a predicted probability >= the optimal threshold

# Calculate the number of predictions that have been changed

In [None]:
# Find the Precision, Recall, F1 Score, AUC, and Accuracy for the model when using the adjusted threshold
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, accuracy_score


**What do these results show?**


# Combine the original Random Forest and Logistic Regression models with a Voting Classifier

This is for comparison with the other models. This model aims to reduce any variance that might be caused by overfitting.
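
The sketch below illustrates soft voting on a synthetic dataset from `make_classification`. It uses freshly created estimators rather than the lab's `forest_model` and `lg_model`, and all names and hyperparameters are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 85% negative, 15% positive)
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Each entry is a (name, estimator) pair; the names are arbitrary labels
estimators = [
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("logistic", LogisticRegression(solver="newton-cg")),
]

# Soft voting averages the class probabilities predicted by each estimator
voting_model = VotingClassifier(estimators=estimators, voting="soft")
voting_model.fit(X_train, y_train)
voting_prob = voting_model.predict_proba(X_test)

# Reproduce the average by hand from the fitted sub-estimators
manual_prob = np.mean(
    [est.predict_proba(X_test) for est in voting_model.estimators_], axis=0)
print("Matches manual average:", np.allclose(voting_prob, manual_prob))
```

Because soft voting averages probabilities, it requires every estimator to implement `predict_proba`, and poorly calibrated members can skew the combined result.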

In [None]:
from sklearn.ensemble import VotingClassifier

# Create an array containing the forest_model and lg_model estimators


# Create and fit a voting classifier with soft voting using the array of estimators

In [None]:
# Examine the confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay


In [None]:
# Plot the calibration curve

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve 


In [None]:
# Calculate the G-Mean 

from imblearn.metrics import geometric_mean_score 


In [None]:
# Calculate the F0.5, F1, and F2 scores

from sklearn.metrics import fbeta_score


In [None]:
# Calculate the Brier score 

from sklearn.metrics import brier_score_loss 


In [None]:
# Compare the skill level of this model to the original Random Forest model

# Generate the Brier Score for the Logistic Regression model 
# and compare the skill level of the Voting Classifier model to the Logistic Regression model



**How does this model compare to those that used sampling?**


## If time allows

Try creating a voting model combining classifiers for Gaussian Naive Bayes and K-Nearest Neighbors with the Random Forest model.