<a href="https://colab.research.google.com/github/cm-int/machine-learning-fundamentals/blob/main/module_3/Labs/Lab3_2_Handling_Imbalanced_Data_Solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 3.2: Handling Imbalanced Data

In this lab, you'll perform the following tasks:

1. Build a Random Forest model to classify an imbalanced dataset without making any modifications.
1. Examine the results and evaluate the performance using appropriate metrics.
1. Use sampling to balance the dataset and rebuild and retest the model.
1. Use bagging with sampling and rebuild and retest the model.
1. Use boosting with sampling and rebuild and retest the model.
1. Calibrate the model and retest it.
1. Build another model using the original imbalanced dataset, then calibrate and evaluate the model.
1. Combine models using a VotingClassifier and evaluate the results.

## Scenario

Diabetes is among the most prevalent chronic diseases in the United States, impacting millions of Americans each year and exerting a significant financial burden on the economy.

Complications like heart disease, vision loss, lower-limb amputation, and kidney disease are associated with chronically high levels of sugar remaining in the bloodstream for those with diabetes. While there is no cure for diabetes, strategies like losing weight, eating healthily, being active, and receiving medical treatments can mitigate the harms of this disease in many patients. Early diagnosis can lead to lifestyle changes and more effective treatment, making predictive models for diabetes risk important tools for the public and public health officials.

The Behavioral Risk Factor Surveillance System (BRFSS) is a system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.

The dataset contains the following columns:

Input variables:
* HighBP. 0=no high BP, 1=high BP
* HighChol. 0=no high cholesterol, 1=high cholesterol
* CholCheck. Has the patient had a cholesterol check in the last 5 years? 0=no, 1=yes
* BMI. Body Mass Index
* Smoker. Has the patient smoked at least 100 cigarettes in their entire life? [Note: 5 packs = 100 cigarettes] 0=no, 1=yes
* Stroke. Has the patient ever had a stroke? 0=no, 1=yes
* HeartDiseaseorAttack. Does the patient have coronary heart disease (CHD) or myocardial infarction (MI)? 0=no, 1=yes
* PhysActivity. Has the patient performed any physical activity in past 30 days, not including job? 0=no, 1=yes
* Fruits. Does the patient consume fruit 1 or more times per day? 0=no, 1=yes
* Veggies. Does the patient consume vegetables 1 or more times per day? 0=no, 1=yes
* HvyAlcoholConsump. Is the patient a heavy drinker (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week)? 0=no, 1=yes
* AnyHealthcare. Does the patient have any kind of health care coverage, including health insurance, prepaid plans such as HMO, etc? 0=no, 1=yes
* NoDocbcCost. Was there a time in the past 12 months when the patient needed to see a doctor but could not because of cost? 0=no, 1=yes
* GenHlth. Would the patient say that in general their health is: scale 1-5 1=excellent, 2=very good, 3=good, 4=fair, 5=poor
* MentHlth. Including stress, depression, and problems with emotions, for how many days during the past 30 days was the patient's mental health not good? scale 1-30 days
* PhysHlth. Including physical illness and injury, for how many days during the past 30 days was the patient's physical health not good? scale 1-30 days
* DiffWalk. Does the patient have serious difficulty walking or climbing stairs? 0=no, 1=yes
* Sex. 0=female, 1=male
* Age. 13-level age category (_AGEG5YR see codebook) 1=18-24, 9=60-64, 13=80 or older
* Education. Education level (EDUCA see codebook) scale 1-6 1=Never attended school or only kindergarten, 2=Grades 1 through 8 (Elementary), 3=Grades 9 through 11 (Some high school), 4=Grade 12 or GED (High school graduate), 5=College 1 year to 3 years (Some college or technical school), 6=College 4 years or more (College graduate)
* Income. Income scale (INCOME2 see codebook) scale 1-8. 1=less than \$10,000, 5=less than \$35,000, 8=\$75,000 or more

Output variable:
* Diabetes (0=No Risk, 1=At Risk)

## Requirements
The aim of this lab is to construct a machine learning classification model that can detect whether a patient is at risk of diabetes. The model must minimize the number of false negatives.

## Acknowledgements:
This dataset was released by the CDC.

In [None]:
# Install the imbalanced-learn library

!pip install -U imbalanced-learn

In [None]:
# Download the diabetes_data.csv file from GitHub

!wget 'https://raw.githubusercontent.com/cm-int/machine-learning-fundamentals/main/module_3/Labs/diabetes_data.csv'

In [None]:
# Load the data and create the diabetes_data DataFrame
import numpy as np
import pandas as pd

diabetes_data = pd.read_csv('diabetes_data.csv')
diabetes_data

In [None]:
# Remove any observations with missing data

diabetes_data = diabetes_data.dropna()

In [None]:
# Examine the structure of the data

diabetes_data.info()

In [None]:
# Look at the statistics for the DataFrame

diabetes_data.describe()

In [None]:
# Extract the class ('Diabetes') and calculate the amount of imbalance

has_diabetes = diabetes_data['Diabetes']
values = has_diabetes.value_counts()
positive = values[1]
negative = values[0]
print(f'Positive labels: {positive}\nNegative labels: {negative}\nRatio: {round(negative/positive)}:1')

In [None]:
# Remove the class from the DataFrame

diabetes_data = diabetes_data.drop(['Diabetes'], axis=1)

In [None]:
# Scale the data

from sklearn.preprocessing import MinMaxScaler 

scaler = MinMaxScaler(feature_range=(0, 5))
column_names = diabetes_data.columns
diabetes_data = pd.DataFrame(scaler.fit_transform(diabetes_data), columns=column_names)
diabetes_data

In [None]:
# Split the data into train and test datasets

from sklearn.model_selection import train_test_split

features_train, features_test, predictions_train, predictions_test = train_test_split(diabetes_data, has_diabetes, test_size=0.33, random_state=13)

# Create and fit an initial model using a Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import ConfusionMatrixDisplay

# Create the model
forest_model = RandomForestClassifier(n_estimators=100).fit(features_train, predictions_train)

# Examine the confusion matrix
test_results = forest_model.predict(features_test)
_ = ConfusionMatrixDisplay.from_predictions(predictions_test, test_results, display_labels=["Negative", "Positive"])

**What do these results indicate?**

The model has a significant bias towards making negative predictions. The number of false negatives is high. The model is missing a lot of patients who might have diabetes.
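To put a number on "high", you can turn the four cells of the confusion matrix into error rates. The following is a minimal sketch using hypothetical counts (they are not the results of this lab) and a helper named `error_rates` that is introduced here only for illustration. For a binary problem, scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`.

```python
def error_rates(tn, fp, fn, tp):
    """Return (false_negative_rate, false_positive_rate) from confusion-matrix counts."""
    # False negative rate: share of actual positives that the model missed
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    # False positive rate: share of actual negatives that the model flagged
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return fnr, fpr

# Hypothetical counts for illustration only, NOT the output of the lab
tn, fp, fn, tp = 900, 20, 60, 20

fnr, fpr = error_rates(tn, fp, fn, tp)
print(f'False negative rate: {fnr:.2f}')
print(f'False positive rate: {fpr:.3f}')
```

With these made-up counts, three out of four patients who are at risk would be missed even though very few healthy patients are flagged.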

In [None]:
# Plot the calibration curve

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve 

probs = forest_model.predict_proba(features_test)[:,1] 
p, m = calibration_curve(predictions_test, probs, n_bins=20) 

plt.plot([0, 1], [0, 1], linestyle='--') 
plt.plot(m, p, marker='.', c='red') 
plt.xlabel('Mean') 
plt.ylabel('Proportion') 
plt.show() 

In [None]:
# Test the accuracy of the model 

print(f'Accuracy score: {forest_model.score(features_test, predictions_test)}')

**What does the calibration curve imply for this model?**

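The calibration curve lies below the diagonal for predicted probabilities up to about 0.9. In those bins, the observed proportion of positive cases is lower than the mean predicted probability, so the model overestimates the probability of the positive class. The curve stays close to the diagonal, however, so the probabilities are reasonably well calibrated. The accuracy score is high, but accuracy is not the best metric for this scenario because it is dominated by the majority (negative) class. To reduce the false negative rate, you need high recall rather than high overall accuracy.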
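A quick way to see why accuracy can mislead on imbalanced data is to score a "model" that always predicts the negative class. The sketch below uses a hypothetical set of 1,000 labels with 140 positives (an assumed split, not the ratio in this dataset) and two small helper functions written for this illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Fraction of actual positives that were predicted as positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(1 for t in y_true if t == 1)
    return true_pos / actual_pos if actual_pos else 0.0

# Hypothetical labels: 140 positive, 860 negative (assumed for illustration)
y_true = [1] * 140 + [0] * 860

# A trivial baseline that always predicts the negative class
y_pred = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)
baseline_recall = recall(y_true, y_pred)
print(f'Accuracy: {baseline_accuracy:.2f}')  # 0.86
print(f'Recall: {baseline_recall:.2f}')      # 0.00
```

The baseline scores 86% accuracy without finding a single at-risk patient, which is why the rest of the lab relies on recall-sensitive metrics.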

In [None]:
# Make predictions and calculate the G-Mean 

from imblearn.metrics import geometric_mean_score 

test_results = forest_model.predict(features_test)

forest_model_gscore = geometric_mean_score(predictions_test, test_results)
print(f'G-Mean: {forest_model_gscore}') 

In [None]:
# Calculate the F0.5, F1, and F2 scores

from sklearn.metrics import fbeta_score

for beta in (0.5, 1, 2):
  print(f'F{beta} score: {fbeta_score(predictions_test, test_results, beta=beta)}') 

In [None]:
# Calculate the Brier score 

from sklearn.metrics import brier_score_loss 

probs = forest_model.predict_proba(features_test) 
probs = probs[:, 1] # Take the probabilities for the positive class label 

forest_model_bscore = brier_score_loss(predictions_test, probs)
print(f'Brier score: {forest_model_bscore}')

**What do these metrics indicate?**

The geometric mean (G-Mean) combines recall for the positive class (sensitivity) with recall for the negative class (specificity). A low value indicates that the model performs poorly on at least one class; here, the weak point is recall for the positive class. The F0.5 and F2 scores show that precision is better than recall. Ideally for this scenario you want recall to be high, even if precision is reduced. Taken by itself, the Brier score doesn't convey much information, but it will become useful for comparison with other models later in the lab.
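The relationship between these scores is easier to see from the formulas. For a binary problem, the G-Mean is the square root of sensitivity times specificity, and the F-beta score is (1 + beta²) × precision × recall / (beta² × precision + recall). The sketch below applies them to hypothetical confusion-matrix counts (not the results of this lab); the helper names `g_mean` and `f_beta` are introduced here for illustration.

```python
import math

def g_mean(tn, fp, fn, tp):
    """Geometric mean of sensitivity (recall) and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts for illustration only
tn, fp, fn, tp = 900, 20, 60, 20
precision = tp / (tp + fp)  # 0.5
recall = tp / (tp + fn)     # 0.25

print(f'G-Mean: {g_mean(tn, fp, fn, tp):.3f}')
scores = {beta: f_beta(precision, recall, beta) for beta in (0.5, 1, 2)}
for beta, score in scores.items():
    print(f'F{beta} score: {score:.3f}')
```

Because precision is higher than recall in this example, the scores fall as beta rises (F0.5 > F1 > F2). The opposite ordering would indicate that recall is the stronger of the two.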

# Try sampling to balance the class labels

In [None]:
# Create and fit a BalancedRandomForestClassifier estimator
from imblearn.ensemble import BalancedRandomForestClassifier

ensemble_model = BalancedRandomForestClassifier(n_estimators=100) 
_ = ensemble_model.fit(features_train, predictions_train)

In [None]:
# Examine the confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay

test_results = ensemble_model.predict(features_test)
_ = ConfusionMatrixDisplay.from_predictions(predictions_test, test_results, display_labels=["Negative", "Positive"])

**How does the false positive and false negative rate of this model compare to the previous one?**

The number of false positives has increased significantly and the number of false negatives has decreased, although the false negative rate is still too high. The number of true positives is much higher than before.
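The shift towards positive predictions follows from the way a balanced random forest trains its trees: each tree sees a sample in which the majority class has been randomly under-sampled to match the minority class. The following is a simplified, pure-Python sketch of random under-sampling on hypothetical labels. It illustrates the idea only and is not the imbalanced-learn implementation; the function name `random_undersample` is an assumption made for this example.

```python
import random

def random_undersample(labels, seed=0):
    """Return sorted indices of a class-balanced subset of labels."""
    rng = random.Random(seed)

    # Group the observation indices by class label
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)

    # Keep as many observations of every class as the smallest class has
    n_min = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, n_min))
    return sorted(kept)

# Hypothetical imbalanced labels: 90 negative, 10 positive
labels = [0] * 90 + [1] * 10

kept = random_undersample(labels)
counts = {c: sum(1 for i in kept if labels[i] == c) for c in (0, 1)}
print(counts)  # {0: 10, 1: 10}
```

A tree trained on such a sample sees positives far more often than they occur in the full dataset, so it predicts the positive class more readily. That helps recall but costs precision.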

In [None]:
# Plot the calibration curve

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve 

probs = ensemble_model.predict_proba(features_test)[:,1] 
p, m = calibration_curve(predictions_test, probs, n_bins=20) 

plt.plot([0, 1], [0, 1], linestyle='--') 
plt.plot(m, p, marker='.', c='red') 
plt.xlabel('Mean')
plt.ylabel('Proportion') 
plt.show()

**What does this curve show?**

The model has a bias towards the negative class label.
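To read a calibration curve, it helps to know how its points are built: the predicted probabilities are grouped into bins, and for each bin the mean predicted probability (x-axis) is compared with the observed fraction of positives (y-axis). The sketch below reproduces that idea with equal-width bins on a handful of hypothetical values. It is a simplified illustration rather than scikit-learn's implementation, and `calibration_points` is a name made up for this example.

```python
def calibration_points(y_true, y_prob, n_bins=5):
    """Return (mean predicted probability, fraction of positives) per non-empty bin."""
    prob_sums = [0.0] * n_bins
    positives = [0] * n_bins
    counts = [0] * n_bins
    for label, prob in zip(y_true, y_prob):
        # A probability of exactly 1.0 is placed in the last bin
        idx = min(int(prob * n_bins), n_bins - 1)
        prob_sums[idx] += prob
        positives[idx] += label
        counts[idx] += 1
    return [(prob_sums[i] / counts[i], positives[i] / counts[i])
            for i in range(n_bins) if counts[i] > 0]

# Hypothetical labels and predicted probabilities, for illustration only
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_prob = [0.1, 0.15, 0.3, 0.35, 0.7, 0.75, 0.9, 0.95]

points = calibration_points(y_true, y_prob)
for mean_pred, frac_pos in points:
    position = 'above' if frac_pos > mean_pred else 'below'
    print(f'mean predicted={mean_pred:.3f}  fraction positive={frac_pos:.2f}  ({position} the diagonal)')
```

A point above the diagonal means that positives occur more often than the model predicts, so the model underestimates the positive class in that bin. A point below the diagonal means that the model overestimates it.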

In [None]:
# Test the accuracy of the model 

print(f'Accuracy score: {ensemble_model.score(features_test, predictions_test)}')

In [None]:
# Calculate the G-Mean 

from imblearn.metrics import geometric_mean_score 

ensemble_model_gscore = geometric_mean_score(predictions_test, test_results)
print(f'G-Mean: {ensemble_model_gscore}') 

In [None]:
# Calculate the F0.5, F1, and F2 scores

from sklearn.metrics import fbeta_score

for beta in (0.5, 1, 2):
  print(f'F{beta} score: {fbeta_score(predictions_test, test_results, beta=beta)}') 

In [None]:
# Calculate the Brier score 

from sklearn.metrics import brier_score_loss 

probs = ensemble_model.predict_proba(features_test) 
probs = probs[:, 1] # Take the probabilities for the positive class label 

ensemble_model_bscore = brier_score_loss(predictions_test, probs)
print(f'Brier score: {ensemble_model_bscore}')

In [None]:
# Compare the skill level of this model to the Random Forest model

skill = 1-(ensemble_model_bscore/forest_model_bscore)
print(f'Brier Skill score: {skill}')

**What do these metrics tell you?**

The G-Mean and F2 scores are both better, indicating that the false negative rate has decreased. The F0.5 score has dropped slightly meaning that the false positive rate has gone up. The Brier Skill score implies that overall this model is not as good as the previous one, but this is mainly due to the large increase in false positives so it may not be critical for this model.
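The Brier score is the mean squared difference between the predicted probability and the actual outcome (0 or 1), so lower is better. The skill calculation in this lab uses the original Random Forest as the reference model: a positive value means the new model has a lower Brier score than the reference, and a negative value means it has a higher one. The sketch below works through the arithmetic with hypothetical probabilities; the helper names are assumptions made for this illustration.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def brier_skill(score, reference_score):
    """Skill of a model relative to a reference model (1 is perfect, 0 is no change)."""
    return 1 - score / reference_score

# Hypothetical outcomes and probabilities, for illustration only
y_true = [0, 0, 0, 1]
reference_probs = [0.1, 0.2, 0.1, 0.7]  # stands in for the reference model
candidate_probs = [0.4, 0.5, 0.3, 0.8]  # stands in for the model being compared

reference_score = brier_score(y_true, reference_probs)
candidate_score = brier_score(y_true, candidate_probs)
skill = brier_skill(candidate_score, reference_score)

print(f'Reference Brier score: {reference_score:.4f}')
print(f'Candidate Brier score: {candidate_score:.4f}')
print(f'Brier Skill score: {skill:.2f}')
```

In this example, the candidate assigns higher probabilities to the negative cases, which inflates its Brier score and produces a negative skill score. A model trained on balanced samples tends to behave the same way.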

# Compare sampling to bagging

In [None]:
# Use the Random Forest classifier created earlier as the base estimator (bagging fits fresh clones of it)
from imblearn.ensemble import BalancedBaggingClassifier

bag_model = BalancedBaggingClassifier(estimator=forest_model)
_ = bag_model.fit(features_train, predictions_train)

In [None]:
# Examine the confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay

test_results = bag_model.predict(features_test)
_ = ConfusionMatrixDisplay.from_predictions(predictions_test, test_results, display_labels=["Negative", "Positive"])

**What does this confusion matrix show?**

The false negative rate has increased and the false positive rate has dropped a little. The model has moved in the wrong direction!

In [None]:
# Plot the calibration curve

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve 

probs = bag_model.predict_proba(features_test)[:,1] 
p, m = calibration_curve(predictions_test, probs, n_bins=20) 

plt.plot([0, 1], [0, 1], linestyle='--') 
plt.plot(m, p, marker='.', c='red') 
plt.xlabel('Mean') 
plt.ylabel('Proportion') 
plt.show() 

In [None]:
# Test the accuracy of the model 

print(f'Accuracy score: {bag_model.score(features_test, predictions_test)}')

In [None]:
# Calculate the G-Mean 

from imblearn.metrics import geometric_mean_score 

bag_model_gscore = geometric_mean_score(predictions_test, test_results)
print(f'G-Mean: {bag_model_gscore}') 

In [None]:
# Calculate the F0.5, F1, and F2 scores

from sklearn.metrics import fbeta_score

for beta in (0.5, 1, 2):
  print(f'F{beta} score: {fbeta_score(predictions_test, test_results, beta=beta)}') 

In [None]:
# Calculate the Brier score 

from sklearn.metrics import brier_score_loss 

probs = bag_model.predict_proba(features_test) 
probs = probs[:, 1] # Take the probabilities for the positive class label 

bag_model_bscore = brier_score_loss(predictions_test, probs)
print(f'Brier score: {bag_model_bscore}')

In [None]:
# Compare the skill level of this model to the Random Forest model

skill = 1-(bag_model_bscore/forest_model_bscore)
print(f'Brier Skill score: {skill}')

**What do these metrics show?**

The F0.5 score has increased because the number of false positives has dropped, leading to slightly better precision. Recall has diminished. The Brier Skill score implies that this model is better than the sampling model: the skill is less negative when compared to the Random Forest model. However, bear in mind that the false negative rate has increased, so although the combination of precision and recall has improved, recall itself has dropped.

# Try sampling with a different classifier - the Random Undersampler with AdaBoost (RUSBoostClassifier)

In [None]:
# Again, reuse the Random Forest classifier created earlier
from imblearn.ensemble import RUSBoostClassifier

rus_model = RUSBoostClassifier(estimator=forest_model)
_ = rus_model.fit(features_train, predictions_train)

In [None]:
# Examine the confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay

test_results = rus_model.predict(features_test)
_ = ConfusionMatrixDisplay.from_predictions(predictions_test, test_results, display_labels=["Negative", "Positive"])

In [None]:
# Plot the calibration curve

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve 

probs = rus_model.predict_proba(features_test)[:,1] 
p, m = calibration_curve(predictions_test, probs, n_bins=20) 

plt.plot([0, 1], [0, 1], linestyle='--') 
plt.plot(m, p, marker='.', c='red') 
plt.xlabel('Mean') 
plt.ylabel('Proportion') 
plt.show() 

**What can you tell about this model?**

The number of false positives has dropped, as has the number of true positives. Observations that were previously classified as positive are now being wrongly classified as negative.

In [None]:
# Test the accuracy of the model 

print(f'Accuracy score: {rus_model.score(features_test, predictions_test)}')

In [None]:
# Calculate the G-Mean 

from imblearn.metrics import geometric_mean_score 

rus_model_gscore = geometric_mean_score(predictions_test, test_results)
print(f'G-Mean: {rus_model_gscore}') 

In [None]:
# Calculate the F0.5, F1, and F2 scores

from sklearn.metrics import fbeta_score

for beta in (0.5, 1, 2):
  print(f'F{beta} score: {fbeta_score(predictions_test, test_results, beta=beta)}') 

In [None]:
# Calculate the Brier score 

from sklearn.metrics import brier_score_loss 

probs = rus_model.predict_proba(features_test) 
probs = probs[:, 1] # Take the probabilities for the positive class label 

rus_model_bscore = brier_score_loss(predictions_test, probs)
print(f'Brier score: {rus_model_bscore}')

In [None]:
# Compare the skill level of this model to the previous models

skill = 1-(rus_model_bscore/forest_model_bscore)
print(f'Brier Skill score: {skill}')

**What do these metrics show?**

The F0.5 and F2 scores have both dropped. The G-Mean is also lower than that of the previous model. However, the Brier Skill score shows an improvement, although you should read this score as being *less bad* rather than *good*: it is still negative.

# Tune the threshold for the BalancedRandomForestClassifier

The BalancedRandomForestClassifier model had the lowest false negative rate of the models seen so far.
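A classifier normally predicts the positive class when the predicted probability is at least 0.5, but you can choose a different cut-off. One common choice is the threshold that maximizes Youden's J statistic, J = TPR - FPR. The following pure-Python sketch shows the idea on a few hypothetical values and then scores the re-thresholded predictions against the true labels. The function name `youden_threshold` and the data are assumptions made for this illustration.

```python
def youden_threshold(y_true, y_prob):
    """Return (threshold, J) for the threshold that maximizes J = TPR - FPR."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_j, best_threshold = -1.0, None
    # Try every distinct predicted probability as a threshold, highest first
    for threshold in sorted(set(y_prob), reverse=True):
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_j, best_threshold = j, threshold
    return best_threshold, best_j

def false_negatives(y_true, y_pred):
    """Count the actual positives that were predicted as negative."""
    return sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)

# Hypothetical true labels and predicted probabilities, for illustration only
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.9]

threshold, j = youden_threshold(y_true, y_prob)
default_pred = [int(p >= 0.5) for p in y_prob]
adjusted_pred = [int(p >= threshold) for p in y_prob]

print(f'Optimal threshold: {threshold}, J: {j:.2f}')
print(f'False negatives at 0.5: {false_negatives(y_true, default_pred)}')
print(f'False negatives at the optimal threshold: {false_negatives(y_true, adjusted_pred)}')
```

In this example, the selected threshold is below 0.5, so one more observation is classified as positive and the false negative disappears. Note that the adjusted predictions are always evaluated against the true labels, not against the default predictions.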

In [None]:
from sklearn.metrics import roc_curve

# Find the FPR, TPR, and thresholds
probs = ensemble_model.predict_proba(features_test)
fpr, tpr, thresholds = roc_curve(predictions_test, probs[:,1])

In [None]:
# Calculate Youden's J Statistic
J = tpr - fpr

# Find the threshold at this point
idx = np.argmax(J)
optimal_threshold = thresholds[idx]

In [None]:
import matplotlib.pyplot as plt

# Plot the results and highlight the threshold
plt.plot(fpr, tpr, c='blue')
plt.scatter(fpr[idx], tpr[idx], c='red', s=200)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

print(f'Optimal Threshold is {optimal_threshold}')

In [None]:
# Classify an observation as positive when its predicted probability is >= the optimal threshold
adjusted_predictions_test = (probs[:,1] >= optimal_threshold).astype('int')

# Count the predictions that differ from those made with the default threshold of 0.5
default_predictions_test = ensemble_model.predict(features_test)
print(f'Number of test predictions affected: {np.sum(adjusted_predictions_test != default_predictions_test)}')

In [None]:
# Find the Precision, Recall, F1 Score, AUC, and Accuracy for the model when using the adjusted threshold
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, accuracy_score

# Evaluate the adjusted predictions against the true labels (the true labels are the first argument)
_ = ConfusionMatrixDisplay.from_predictions(predictions_test, adjusted_predictions_test, display_labels=["Negative", "Positive"])

print(f'Precision: {precision_score(predictions_test, adjusted_predictions_test, average="macro", zero_division=0)}\n')
print(f'Recall: {recall_score(predictions_test, adjusted_predictions_test, average="macro", zero_division=0)}\n')
print(f'F1 Score: {f1_score(predictions_test, adjusted_predictions_test, average="macro", zero_division=0)}\n')
# The AUC is calculated from the predicted probabilities, so it doesn't depend on the threshold
print(f'AUC: {roc_auc_score(predictions_test, probs[:,1])}\n')
print(f'Accuracy: {accuracy_score(predictions_test, adjusted_predictions_test)}\n')

In [None]:
# Plot the ROC curve

from sklearn import metrics

# The ROC curve is built from the true labels and the predicted probabilities
_ = metrics.RocCurveDisplay.from_predictions(predictions_test, probs[:,1])

**How has this adjustment changed the false negative rate of the model?**

Compare this confusion matrix with the one produced at the default threshold. If the optimal threshold is lower than 0.5, more observations are classified as positive, so the number of false negatives falls and the number of false positives rises. If it is higher than 0.5, the opposite happens. Expect a trade-off between the two error types rather than the elimination of both: moving the threshold can't drive the false negative and false positive rates to zero together unless the model separates the classes perfectly. Always measure the adjusted predictions against the true labels. Comparing them with the model's default predictions only shows how many predictions changed.

In [None]:
# Plot the calibration curve for the model

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve 

probs = ensemble_model.predict_proba(features_test)[:,1] 
p, m = calibration_curve(predictions_test, probs, n_bins=20) 

plt.plot([0, 1], [0, 1], linestyle='--') 
plt.plot(m, p, marker='.', c='red') 
plt.xlabel('Mean') 
plt.ylabel('Proportion') 
plt.show() 

**What does this curve show?**

The curve is the same as the one plotted for this model earlier. Changing the decision threshold changes which probabilities are mapped to the positive class label, but it doesn't change the probabilities themselves, so the calibration of the model is unaffected.

In [None]:
# Calculate the G-Mean 

from imblearn.metrics import geometric_mean_score 

ensemble_model_gscore = geometric_mean_score(predictions_test, adjusted_predictions_test)
print(f'G-Mean: {ensemble_model_gscore}') 

In [None]:
# Calculate the F0.5, F1, and F2 scores

from sklearn.metrics import fbeta_score

for beta in (0.5, 1, 2):
  print(f'F{beta} score: {fbeta_score(predictions_test, adjusted_predictions_test, beta=beta)}') 

In [None]:
# Calculate the Brier score (it uses the predicted probabilities, so the threshold doesn't affect it)

from sklearn.metrics import brier_score_loss 

probs = ensemble_model.predict_proba(features_test) 
probs = probs[:, 1] # Take the probabilities for the positive class label 

ensemble_model_bscore = brier_score_loss(predictions_test, probs)
print(f'Brier score: {ensemble_model_bscore}')

In [None]:
# Compare the skill level of this model to the Random Forest model

skill = 1-(ensemble_model_bscore/forest_model_bscore)
print(f'Brier Skill score: {skill}')

**What do these metrics show?**

The G-Mean and F scores reflect the new balance between false negatives and false positives. Compare them with the values calculated for this model at the default threshold: an increase in the F2 score indicates that recall has improved. The Brier score and the Brier Skill score are the same as before because they are calculated from the predicted probabilities, which the threshold doesn't affect.

The threshold was selected using the test data, so these scores are likely to be optimistic. You must be cautious and perform further testing with data that the model hasn't seen to ensure that the model and the threshold have not been overfitted to the data.

# Try the same strategy with a different algorithm

Bagging with Logistic Regression. This is just for comparison.

In [None]:
# Create and fit a Logistic Regression classifier with the Newton CG solver (lbfgs tends not to converge with this dataset)
from sklearn.linear_model import LogisticRegression

lg_model = BalancedBaggingClassifier(estimator=LogisticRegression(solver='newton-cg'))
_ = lg_model.fit(features_train, predictions_train)

In [None]:
# Examine the confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay

test_results = lg_model.predict(features_test)
_ = ConfusionMatrixDisplay.from_predictions(predictions_test, test_results, display_labels=["Negative", "Positive"])

In [None]:
# Plot the calibration curve

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve 

probs = lg_model.predict_proba(features_test)[:,1] 
p, m = calibration_curve(predictions_test, probs, n_bins=20) 

plt.plot([0, 1], [0, 1], linestyle='--') 
plt.plot(m, p, marker='.', c='red') 
plt.xlabel('Mean') 
plt.ylabel('Proportion') 
plt.show() 

In [None]:
from sklearn.metrics import roc_curve

# Find the FPR, TPR, and thresholds for this model
probs = lg_model.predict_proba(features_test)
fpr, tpr, thresholds = roc_curve(predictions_test, probs[:,1])

In [None]:
# Calculate Youden's J Statistic
J = tpr - fpr

# Find the threshold at this point
idx = np.argmax(J)
optimal_threshold = thresholds[idx]

In [None]:
import matplotlib.pyplot as plt

# Plot the results and highlight the threshold
plt.plot(fpr, tpr, c='blue')
plt.scatter(fpr[idx], tpr[idx], c='red', s=200)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

print(f'Optimal Threshold is {optimal_threshold}')

In [None]:
# Classify an observation as positive when its predicted probability is >= the optimal threshold
adjusted_predictions_test = (probs[:,1] >= optimal_threshold).astype('int')

# Count the predictions that differ from those made with the default threshold of 0.5
test_results = lg_model.predict(features_test)
print(f'Number of test predictions affected: {np.sum(adjusted_predictions_test != test_results)}')

In [None]:
# Find the Precision, Recall, F1 Score, AUC, and Accuracy for the model when using the adjusted threshold
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, accuracy_score

# Evaluate the adjusted predictions against the true labels (the true labels are the first argument)
_ = ConfusionMatrixDisplay.from_predictions(predictions_test, adjusted_predictions_test, display_labels=["Negative", "Positive"])

print(f'Precision: {precision_score(predictions_test, adjusted_predictions_test, average="macro", zero_division=0)}\n')
print(f'Recall: {recall_score(predictions_test, adjusted_predictions_test, average="macro", zero_division=0)}\n')
print(f'F1 Score: {f1_score(predictions_test, adjusted_predictions_test, average="macro", zero_division=0)}\n')
# The AUC is calculated from the predicted probabilities, so it doesn't depend on the threshold
print(f'AUC: {roc_auc_score(predictions_test, probs[:,1])}\n')
print(f'Accuracy: {accuracy_score(predictions_test, adjusted_predictions_test)}\n')

**What do these results show?**

Check the optimal threshold reported above. A threshold below 0.5 classifies more observations as positive, which reduces the number of false negatives at the cost of more false positives. A threshold above 0.5 classifies more observations as negative and has the opposite effect. Compare the number of false negatives in this confusion matrix with that of the threshold-adjusted BalancedRandomForestClassifier to see which model better meets the requirement to minimize false negatives.

# Combine the original Random Forest and Logistic Regression models with a Voting Classifier

This is for comparison with the other models. This model aims to reduce any variance that might be caused by overfitting.
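With soft voting, the ensemble averages the class probabilities predicted by its member models and picks the class with the highest average. For a binary problem, that amounts to predicting the positive class when the averaged positive-class probability exceeds 0.5. The sketch below shows the arithmetic for two models with equal weights. The probabilities are hypothetical, and `soft_vote` is a name made up for this illustration.

```python
def soft_vote(prob_lists):
    """Average positive-class probabilities across models and derive class labels."""
    n_models = len(prob_lists)
    averaged = [sum(probs) / n_models for probs in zip(*prob_lists)]
    # The positive class wins only if its averaged probability is above 0.5
    labels = [int(p > 0.5) for p in averaged]
    return averaged, labels

# Hypothetical positive-class probabilities from two models for three patients
rf_probs = [0.2, 0.4, 0.9]  # stands in for the Random Forest
lr_probs = [0.4, 0.7, 0.6]  # stands in for the Logistic Regression model

averaged, labels = soft_vote([rf_probs, lr_probs])
for avg, label in zip(averaged, labels):
    print(f'averaged probability={avg:.2f}  predicted class={label}')
```

For the second patient, the first model on its own would predict the negative class (0.4), but the more confident second model pulls the average above 0.5. Soft voting lets a confident model outweigh an uncertain one, which hard (majority) voting can't do.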

In [None]:
from sklearn.ensemble import VotingClassifier

# Create an array containing the forest_model and lg_model estimator
estimators = [('RF', forest_model), ('LG', lg_model)]

# Create and fit a voting classifier with soft voting using the array of estimators
vote_soft_model = VotingClassifier(estimators=estimators, voting='soft')
_ = vote_soft_model.fit(features_train, predictions_train)

In [None]:
# Examine the confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay

test_results = vote_soft_model.predict(features_test)
_ = ConfusionMatrixDisplay.from_predictions(predictions_test, test_results, display_labels=["Negative", "Positive"])

In [None]:
# Plot the calibration curve

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve 

probs = vote_soft_model.predict_proba(features_test)[:,1] 
p, m = calibration_curve(predictions_test, probs, n_bins=20) 

plt.plot([0, 1], [0, 1], linestyle='--') 
plt.plot(m, p, marker='.', c='red') 
plt.xlabel('Mean') 
plt.ylabel('Proportion') 
plt.show() 

In [None]:
# Calculate the G-Mean 

from imblearn.metrics import geometric_mean_score 

vote_soft_model_gscore = geometric_mean_score(predictions_test, test_results)
print(f'G-Mean: {vote_soft_model_gscore}') 

In [None]:
# Calculate the F0.5, F1, and F2 scores

from sklearn.metrics import fbeta_score

for beta in (0.5, 1, 2):
  print(f'F{beta} score: {fbeta_score(predictions_test, test_results, beta=beta)}') 

In [None]:
# Calculate the Brier score 

from sklearn.metrics import brier_score_loss 

probs = vote_soft_model.predict_proba(features_test) 
probs = probs[:, 1] # Take the probabilities for the positive class label 

vote_soft_model_bscore = brier_score_loss(predictions_test, probs)
print(f'Brier score: {vote_soft_model_bscore}')

In [None]:
# Compare the skill level of this model to the original Random Forest model
skill = 1-(vote_soft_model_bscore/forest_model_bscore)
print(f'Brier Skill score compared to Random Forest: {skill}')

# Generate the Brier Score for the Logistic Regression model 
# and compare the skill level of the Voting Classifier model to the Logistic Regression model

probs = lg_model.predict_proba(features_test) 
probs = probs[:, 1]

lg_model_bscore = brier_score_loss(predictions_test, probs)
skill = 1-(vote_soft_model_bscore/lg_model_bscore)
print(f'Brier Skill score compared to Logistic Regression: {skill}')

**How does this model compare to those that used sampling?**

In this example, the Logistic Regression model has dragged the false positive and false negative rates up.


## If time allows

Try creating a voting model combining classifiers for Gaussian Naive Bayes and K-Nearest Neighbors with the Random Forest model.