In [None]:
import imblearn
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from imblearn.over_sampling import SMOTE, ADASYN

from sklearn import metrics, svm
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [None]:
full_df = pd.read_csv("../data/zonal-means-aggregate-200910-201912.csv")
full_df

In [None]:
class_counts = full_df["outbreak"].value_counts()
print("Class Distribution:\n", class_counts)

The SMOTE algorithm for treating imbalanced datasets cannot handle missing values (NaNs) in feature columns, so we need to either impute or remove the missing data. Following the methodology used by Campbell et al. (2020), we drop incomplete rows and keep only those districts and months that have data for all of the environmental parameters. 

In [None]:
cleaned_df = full_df.dropna()
cleaned_df

By removing these rows, we now have fewer outbreak and non-outbreak months overall.

In [None]:
class_counts = cleaned_df["outbreak"].value_counts()
print("Class Distribution:\n", class_counts)

In [None]:
# dropping unnecessary columns for analysis and moving "outbreak" to "y"
# variable as it is the feature we are trying to predict
X_cln = cleaned_df.drop(
    columns=["outbreak", "location_period_id", "month", "year"]
)  # all other columns are our feature (predicting) variables

y = cleaned_df["outbreak"]  # our predicted variable

In [None]:
X_cln.describe()

## Removal of correlated variables

A simple model is preferable. Correlated variables will not always make your model worse, but they will not always improve it either. In general, it is good practice to remove correlated features because doing so: 
* makes the algorithm learn faster
* decreases harmful bias (i.e., variables that are correlated with each other but not with the predicted target [e.g., outbreaks] may confound other interactions)
* improves the interpretability of the model

Random forest models (like the one we'll explore below) can be good at detecting interactions between different features, but highly correlated features can mask these interactions. Explore for yourself and compare results with the full dataset vs. reduced variables and see how model performance changes.


In the cells below, we'll explore the correlation between all of the environmental parameters we are using. By doing so, we might be able to reduce this feature space. 

In [None]:
spearman = X_cln.corr(method="spearman")
spearman.style.background_gradient(cmap="coolwarm")

We observe correlation between precipitation and soil moisture values. This makes sense as one (precip) certainly has an impact on the other (soil moisture levels). We will want to consider this in our model development, as we can perhaps reduce the number of features considered. 

In [None]:
(spearman > 0.8)

Now we'll drop one variable from each pair with a Spearman correlation above 0.8: `sm_1`, `sm_2` and `sm_3` (keeping only `sm_0` for soil moisture), along with `precip_1`.

In [None]:
X = X_cln.drop(["sm_1", "sm_2", "sm_3", "precip_1"], axis=1)
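
The columns dropped above were chosen by reading the correlation table. As a rough sketch of how that selection could be automated, the hypothetical helper `correlated_features_to_drop` below scans the upper triangle of the correlation matrix and flags every column that exceeds the threshold against an earlier column. The data frame is synthetic stand-in data, not the cholera dataset.

```python
import numpy as np
import pandas as pd


def correlated_features_to_drop(df, threshold=0.8, method="spearman"):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.corr(method=method).abs()
    # keep only the upper triangle (k=1 excludes the diagonal) so each pair is seen once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# synthetic stand-in data (hypothetical values, for illustration only)
rng = np.random.default_rng(42)
base = rng.normal(size=200)
demo = pd.DataFrame(
    {
        "sm_0": base,
        "sm_1": base + rng.normal(scale=0.05, size=200),  # nearly a copy of sm_0
        "lst_0": rng.normal(size=200),  # independent of the others
    }
)

to_drop = correlated_features_to_drop(demo)
print("Columns flagged for removal:", to_drop)
```

Because the helper keeps the first column of each correlated pair, column order decides which variable survives; reorder the data frame first if a particular variable should be retained.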

In [None]:
spearman = X.corr(method="spearman")
spearman.style.background_gradient(cmap="coolwarm")

We have used Spearman rank correlation to reduce the feature space here, but there are other options as well. Correlation analysis is used extensively for variable selection because it quantifies the degree of association between variables. Alternatively, methods like Principal Component Analysis (PCA) can reduce the feature space by combining the original variables into a smaller number of components that capture most of their variance. 
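
As a rough illustration of the PCA alternative, the sketch below computes the share of variance captured by each principal component using only NumPy. The function name `pca_explained_variance_ratio` and the synthetic data are assumptions made for this example; in practice `sklearn.decomposition.PCA` would be the more usual tool.

```python
import numpy as np


def pca_explained_variance_ratio(X):
    """Share of total variance captured by each principal component."""
    X = np.asarray(X, dtype=float)
    # standardise so that features on different scales contribute equally
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # singular values of the standardised data give the component variances
    _, singular_values, _ = np.linalg.svd(X_std, full_matrices=False)
    variances = singular_values**2
    return variances / variances.sum()


# synthetic stand-in data: two nearly identical columns plus one independent column
rng = np.random.default_rng(0)
a = rng.normal(size=300)
X_demo = np.column_stack(
    [a, a + rng.normal(scale=0.05, size=300), rng.normal(size=300)]
)

ratios = pca_explained_variance_ratio(X_demo)
print("Explained variance ratio per component:", np.round(ratios, 3))
```

When two features carry almost the same information, the last component captures close to zero variance, which is the PCA counterpart of dropping one of the correlated columns.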

## Creation of training/testing datasets

Normally we would want to split the entire dataset into train, validation and test datasets: 
* Training set - the portion of the dataset used to fit the model. This is what the model "sees" and "learns from". It should be large enough to generate meaningful results and be representative of the dataset as a whole. Overfitting occurs when the model becomes so specialized on the training dataset that it is unable to generalize and make correct predictions on new data (i.e., the testing dataset).
* Validation set - used to evaluate and fine-tune the machine learning model during training, to assess the model's performance and make any adjustments. 
* Testing set - the set of data used to evaluate the final performance of the trained model. This subset has been hidden from the model and so allows for the evaluation of the model's performance on unseen data.

Due to the limited number of cholera outbreaks (outbreaks=1) in the dataset, a decision was made to use only training/testing datasets with a split of 70:30 respectively. This follows similar literature methodologies used for cholera outbreak analysis (Campbell et al. 2020). 

The creation of the train/test dataset splits follows a random sampling approach. However, stratified dataset splitting could (and should) be explored, especially as we are dealing with a highly imbalanced dataset. By enabling stratified splitting, we would preserve the relative proportions of each class (outbreak=1, outbreak=0) across splits. See documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) and [here](https://scikit-learn.org/stable/modules/cross_validation.html#stratification) for enabling stratification in the train/test split. 
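
To make the difference concrete, here is a small sketch comparing a random split with a stratified split on synthetic labels (2% positives, chosen only for illustration). The variable names are hypothetical and the data is not the cholera dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic labels with 20 positives out of 1000 rows (hypothetical)
y_demo = np.array([1] * 20 + [0] * 980)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# random split: the number of positives in the test set depends on the shuffle
_, _, _, y_test_random = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# stratified split: class proportions are preserved in both partitions
_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("Positives in random test split:    ", int(y_test_random.sum()))
print("Positives in stratified test split:", int(y_test_strat.sum()))
```

With so few positive examples, a purely random split can leave the test set with noticeably more or fewer outbreaks than expected, which makes metrics such as F1 less stable between runs.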


In [None]:
# split the dataset into train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train)
print(y_train)

## Accounting for an imbalanced dataset

Below is a useful reference for different techniques used to address class imbalance in machine learning datasets: 
https://www.analyticsvidhya.com/blog/2020/07/10-techniques-to-deal-with-class-imbalance-in-machine-learning/. Additionally, here is another great resource for understanding the different methodologies used for treating imbalanced datasets: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/.

For the purposes of this investigation, we'll start with SMOTE, followed by the ADASYN SMOTE technique, as both have been suggested in the literature for similar work with imbalanced outbreak datasets. 

### SMOTE

In [None]:
# apply SMOTE to the training data with a 1:10 ratio as used by Campbell et al 2020
smote = SMOTE(
    sampling_strategy=0.1, random_state=42
)  # worked but still reflected only outbreak = 0 category

# apply SMOTE at 1:2 ratio - accuracy is more reflective of minority category, but not biologically relevant
# smote = SMOTE(sampling_strategy=0.5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

Applying SMOTE
- SMOTE behaves similarly to a data transformation object in that it must be defined and configured, fit on a dataset, then applied to create a new transformed version of the dataset.
- In the code above, we define a SMOTE instance with `sampling_strategy=0.1` and then fit and apply it in one step to create a transformed version of our training dataset. The `sampling_strategy=0.1` setting means we `oversample` the minority class (outbreak=1) until it has 10 percent as many examples as the majority class (i.e., a 1:10 ratio of outbreaks to non-outbreaks).
- Once transformed, we expect the class distribution of the new dataset to reflect that 1:10 ratio through the creation of many new synthetic examples in the minority (i.e., outbreak=1) class.
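
To show where those synthetic examples come from, here is a minimal NumPy sketch of the interpolation idea behind SMOTE. It is not the `imblearn` implementation: the function `smote_like_samples`, the counts, and the random data are all assumptions made for illustration.

```python
import numpy as np


def smote_like_samples(X_minority, n_new, k=5, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)

    # pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # pick a random minority point and one of its k nearest neighbours
    base_idx = rng.integers(0, n, size=n_new)
    nb_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]

    # place the new point at a random position on the segment between them
    gap = rng.random((n_new, 1))
    return X_minority[base_idx] + gap * (X_minority[nb_idx] - X_minority[base_idx])


# hypothetical counts: 1000 non-outbreak rows and 10 outbreak rows
n_majority, n_minority, ratio = 1000, 10, 0.1
n_new = int(ratio * n_majority) - n_minority  # rows needed to reach a 1:10 ratio

X_min = np.random.default_rng(0).normal(size=(n_minority, 3))
X_syn = smote_like_samples(X_min, n_new)
print("Synthetic minority rows created:", X_syn.shape[0])
```

Every synthetic row lies on a line segment between two real minority rows, so oversampling adds no information beyond what the original outbreak months already contain.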

### ADASYN SMOTE

In [None]:
# oversample = ADASYN(sampling_strategy=0.1, random_state=42)
# X_resampled, y_resampled = oversample.fit_resample(X_train, y_train)

In [None]:
# check the new class distribution after SMOTE
resampled_class_counts = pd.Series(y_resampled).value_counts()
print("\nClass Distribution after SMOTE:\n", resampled_class_counts)

## Sensitivity analysis of SMOTE vs ADASYN SMOTE treatments

Sensitivity analysis of SMOTE imbalance treatments (oversampling of the minority class [outbreaks = 1]). This was performed by running the Random Forest model and evaluating on the test dataset, while changing only the sampling strategy parameter value. 

| SMOTE Sampling strategy parameter | Accuracy | F1 Score | ROC AUC Score | 
|-----------------------------|----------|----------|--------------|
| 0.1 | 0.989 | 0.014 | 0.503|
| 0.2 | 0.986 | 0.044 | 0.512|
| 0.3 | 0.984 | 0.063 | 0.524| 
| 0.4 | 0.981 | 0.056 | 0.522|
| 0.5 | 0.979 | 0.069 | 0.533| 

With the lowest (0.1) sampling strategy we get our highest accuracy score, but that reflects correct prediction of the majority class (outbreaks = 0) only. A 1:2 sampling strategy (0.5) gives the highest F1 and ROC AUC scores in the table while maintaining high accuracy, although both scores remain low. It is also not biologically relevant in the real world (i.e., there isn't a 1:2 ratio of outbreak to non-outbreak months). 

| ADASYN SMOTE sampling strategy parameter | Accuracy | F1 Score | ROC AUC Score | 
|-----------------------------|----------|----------|--------------|
| 0.1 | 0.989 | 0.042 | 0.511|
| 0.2 | 0.987 | 0.067 | 0.521|
| 0.3 | 0.983 | 0.062 | 0.523| 
| 0.4 | 0.981 | 0.056 | 0.522|
| 0.5 | 0.979 | 0.069 | 0.533| 

We also observe that the SMOTE and ADASYN SMOTE techniques produce very similar model results, with nearly negligible differences. 

## Evaluation metrics

**Accuracy**: Represents the total number of correctly classified data instances over the total number of data instances. This metric alone is not a good measure if the dataset is highly imbalanced (as we have seen in the example above, the high accuracy scores are a reflection of the majority class only). 

**F1 Score**: The F1 score gives a single metric that balances both precision and recall. It can be used, in a complementary manner with ROC AUC scores, to assess the effectiveness of an ML model. An F1 score is high (close to 1) only when both the precision and recall are high. 

**ROC AUC Score**: Tells us how well the model separates the two classes. The higher the AUC, the better the model's performance at distinguishing between positive and negative classes. A score of 1 means the classifier can perfectly distinguish between positive and negative classes. An AUC value of 0 means the classifier predicts all negatives as positives and vice versa. A ROC AUC of 0.5 means the classifier performs no better than random guessing. An AUC value above 0.5 means the classifier ranks a randomly chosen positive instance above a randomly chosen negative instance more often than not. 

For more information on evaluation metrics, please see this [helpful resource](https://medium.com/analytics-vidhya/confusion-matrix-accuracy-precision-recall-f1-score-ade299cf63cd). 
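
The definitions above can be worked through by hand from the four cells of a confusion matrix. The sketch below uses hypothetical counts (not results from this notebook) and a hypothetical helper, `binary_metrics`, to show how accuracy can stay high while F1 collapses on an imbalanced dataset.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute common classification metrics from confusion matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    # with hard 0/1 predictions the ROC curve has a single operating point,
    # so the area under it reduces to the mean of recall and specificity
    roc_auc_hard = (recall + specificity) / 2
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "roc_auc_hard": roc_auc_hard,
    }


# hypothetical test set: 990 non-outbreak months and 10 outbreak months,
# with a model that predicts "no outbreak" almost every time
m = binary_metrics(tn=985, fp=5, fn=9, tp=1)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

In this made-up case, accuracy is 0.986 even though only one of ten outbreaks is detected, while F1 is 0.125 and the label-based ROC AUC sits just above 0.5.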

## Creation of stratified training/testing datasets

In [None]:
# split the dataset into train and test splits with stratify set
(
    stratified_X_train,
    stratified_X_test,
    stratified_y_train,
    stratified_y_test,
) = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
print(stratified_X_train)

### Application of SMOTE on stratified split dataset

In [None]:
# apply SMOTE to the training data with a 1:10 ratio as used by Campbell et al 2020
smote = SMOTE(
    sampling_strategy=0.1, random_state=42
)  # worked but still reflected only outbreak = 0 category

# apply SMOTE at 1:2 ratio - accuracy is more reflective of minority category, but not biologically relevant
# smote = SMOTE(sampling_strategy=0.5, random_state=42)
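# resample the stratified training split; separate names keep the
# non-stratified X_resampled / y_resampled from above intact
stratified_X_resampled, stratified_y_resampled = smote.fit_resample(
    stratified_X_train, stratified_y_train
)

### Application of ADASYN SMOTE

In [None]:
# oversample = ADASYN(sampling_strategy=0.1, random_state=42)
# stratified_X_resampled, stratified_y_resampled = oversample.fit_resample(
#     stratified_X_train, stratified_y_train
# )

In [None]:
# check the new class distribution after SMOTE
resampled_class_counts = pd.Series(stratified_y_resampled).value_counts()
print("\nClass Distribution after SMOTE:\n", resampled_class_counts)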

### See also (TOMEK LINKS Exploration) 
Tomek Links undersampling was also explored, but it didn't work as successfully as either of the approaches above. It is saved here as archived methodology; otherwise ignore. 

In [None]:
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import TomekLinks

# Tomek Links undersampling: removes majority-class samples that form
# Tomek links with minority-class samples
tl = TomekLinks()

# tl = RandomOverSampler(sampling_strategy=0.2, random_state=42)

# fit predictor and target variable; separate names keep the SMOTE output
# (X_resampled, y_resampled) intact for the models below
X_tl, y_tl = tl.fit_resample(X_train, y_train)

# check the new class distribution after Tomek Links
print(f"TomekLinks Resampled dataset shape {Counter(y_tl)}")

# TomekLinks Resampled dataset shape Counter({0: 29585, 1: 358})
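
For reference, a Tomek link is a pair of samples from different classes that are each other's nearest neighbour. The sketch below is a small NumPy illustration of that definition on toy data. The function `find_tomek_links` is hypothetical and is not how `imblearn` implements it internally.

```python
import numpy as np


def find_tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbours with different labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # ignore self-distances
    nearest = dists.argmin(axis=1)

    links = []
    for i, j in enumerate(nearest):
        # i < j avoids reporting the same pair twice
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links


# toy one-dimensional data: the points at 1.0 (class 0) and 1.05 (class 1)
# are mutual nearest neighbours from opposite classes
X_toy = np.array([[0.0], [0.1], [1.0], [1.05], [3.0]])
y_toy = np.array([0, 0, 0, 1, 0])

links = find_tomek_links(X_toy, y_toy)
print("Tomek links:", links)
```

Removing the majority-class member of each link cleans the class boundary but deletes very few rows, which is consistent with the class counts recorded above remaining heavily imbalanced.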

### SMOTEENN Exploration
SMOTEENN (SMOTE combined with Edited Nearest Neighbours cleaning) was also explored, but it didn't work as successfully as either of the approaches above. It is saved here as archived methodology; otherwise ignore. 

In [None]:
# from imblearn.combine import SMOTEENN
# tl = SMOTEENN(sampling_strategy=0.2, random_state=42)

# Resampled dataset shape Counter({0: 29758, 1: 5951})
# tl = RandomOverSampler(sampling_strategy=0.2, random_state=42)

# fit predictor and target variable
# X_resampled, y_resampled = tl.fit_resample(X_train, y_train)

# check the new class distribution after SMOTEENN
# resampled_class_counts = pd.Series(y_resampled).value_counts()

# print(f'SMOTEENN Resampled dataset shape {Counter(y_resampled)}')
# SMOTEENN Resampled dataset shape Counter({0: 26028, 1: 5031})

### SMOTETomek Exploration

In [None]:
from imblearn.combine import SMOTETomek

# tl = SMOTETomek(sampling_strategy=0.2, random_state=42)

# Resampled dataset shape Counter({0: 29758, 1: 5951})
# tl = RandomOverSampler(sampling_strategy=0.2, random_state=42)

# fit predictor and target variable
# X_resampled, y_resampled = tl.fit_resample(X_train, y_train)

# check the new class distribution after SMOTETomek
# resampled_class_counts = pd.Series(y_resampled).value_counts()

# print(f'SMOTETomek Resampled dataset shape {Counter(y_resampled)}')
# SMOTETomek Resampled dataset shape Counter({0: 29601, 1: 5794})

## Model Exploration

### Random Forest

In [None]:
# train your machine learning model on the balanced dataset
clf_cln = RandomForestClassifier(random_state=42)
clf_cln.fit(X_resampled, y_resampled)

In [None]:
# evaluate your model
accuracy = clf_cln.score(X_test, y_test)
# (prev) Model Accuracy on Test Set: 0.9890756953591074
print("\nModel Accuracy on Test Set:", accuracy)

In [None]:
# run prediction
y_rf_pred = clf_cln.predict(X_test)

In [None]:
# create confusion matrix
cnf_matrix_rf = metrics.confusion_matrix(y_test, y_rf_pred)
cnf_matrix_rf

### Confusion matrix

Interpretation of the confusion matrix above (rows are the actual classes and columns are the predicted classes): 
The first quadrant shows 12,766 correctly classified non-outbreak months. The second and third quadrants together hold 140 incorrectly classified months (127 + 13), i.e., non-outbreak months predicted as outbreaks and outbreak months that were missed. The final quadrant shows that we have correctly classified only 1 outbreak month. Our high accuracy is due to the underlying make-up of the data and its imbalanced nature (i.e., the model classifies non-outbreak months well because they dominate the dataset). So we will want to revisit how we account for this imbalance in the data. 

In [None]:
print("ROC AUC score:", metrics.roc_auc_score(y_test, y_rf_pred))
print("Accuracy score:", metrics.accuracy_score(y_test, y_rf_pred))
print("F1 score:", metrics.f1_score(y_test, y_rf_pred))
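
The cell above passes hard class labels to `metrics.roc_auc_score`. The function also accepts probability estimates for the positive class, which lets the score reflect how well the model ranks cases rather than a single decision threshold. Here is a minimal sketch of the difference on synthetic data, assuming a `make_classification` dataset with roughly 5% positives; none of the names or numbers below come from the cholera analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic imbalanced data (hypothetical stand-in for the outbreak dataset)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.95, 0.05], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# AUC from hard 0/1 predictions (single operating point)
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))
# AUC from predicted probabilities of the positive class (full ranking)
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print("ROC AUC from labels:       ", round(auc_from_labels, 3))
print("ROC AUC from probabilities:", round(auc_from_scores, 3))
```

If the two values differ substantially, the model may be ranking outbreak months reasonably well even though the default 0.5 threshold rarely predicts an outbreak.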

In [None]:
# train your machine learning model on the balanced dataset (already done in the previous code)

# get feature importances from the trained RandomForestClassifier
feature_importances = clf_cln.feature_importances_

# create a DataFrame to display feature names and their corresponding importances
feature_importance_df = pd.DataFrame(
    {"Feature": X_resampled.columns, "Importance": feature_importances}
)

# sort the DataFrame by importance in descending order
feature_importance_df = feature_importance_df.sort_values(
    by="Importance", ascending=False
)

# print the top N most influential features (adjust N as needed)
top_n_features = 10  # Change this to the number of top features you want to display
print(f"Top {top_n_features} Most Influential Features:")
print(feature_importance_df.head(top_n_features))

### Support Vector Machines
Support Vector Machines were also explored, but they were not as successful as Random Forest. The methodology is saved below for archive and comparison.

In [None]:
svm_clean = svm.SVC(random_state=42)
svm_clean.fit(X_resampled, y_resampled)

In [None]:
accuracy_svm = svm_clean.score(X_test, y_test)
print("\nModel Accuracy on Test Set:", accuracy_svm)

In [None]:
# run prediction
y_svm_pred = svm_clean.predict(X_test)

In [None]:
cnf_matrix_svm = metrics.confusion_matrix(y_test, y_svm_pred)
cnf_matrix_svm

In [None]:
print("ROC AUC score:", metrics.roc_auc_score(y_test, y_svm_pred))
print("Accuracy score:", metrics.accuracy_score(y_test, y_svm_pred))
print("F1 score:", metrics.f1_score(y_test, y_svm_pred))

Exploring the same dataset using `svm`, we find results similar to those from Random Forest, only worse. We will need to revisit how we handle the imbalanced nature of this dataset. 

### Logistic regression 

Here we will look at the simplest classification (logistic regression) using only the most important feature identified by the Random Forest model - to see if we can explain all outbreak months by precip in the current month alone. We will explore this as a siml

In [None]:
X_precip = X.drop(
    columns=["lst_3", "lst_2", "lst_1", "lst_0", "precip_3", "precip_2", "sm_0"]
)  # keep only "precip_0"

In [None]:
# split the dataset into train and test splits
Xp_train, Xp_test, yp_train, yp_test = train_test_split(
    X_precip, y, test_size=0.3, random_state=42, stratify=y
)

In [None]:
Xp_resampled, yp_resampled = smote.fit_resample(Xp_train, yp_train)

In [None]:
from sklearn.linear_model import LogisticRegression

# create an instance of the model
logreg = LogisticRegression(solver="lbfgs", max_iter=400)

# train the model
logreg.fit(Xp_resampled, yp_resampled)

# run prediction
y_pred = logreg.predict(Xp_test)

In [None]:
# create confusion matrix
from sklearn import metrics

cnf_matrix = metrics.confusion_matrix(yp_test, y_pred)
cnf_matrix

In [None]:
print("ROC AUC score:", metrics.roc_auc_score(yp_test, y_pred))
print("Accuracy score:", metrics.accuracy_score(yp_test, y_pred))
print("F1 score:", metrics.f1_score(yp_test, y_pred))

## Summary Findings

* The input cholera dataset was highly imbalanced. As such, the model accuracy largely reflects the underlying class distribution.
* Dataset imbalance has been identified as a key challenge for ML approaches, including in [cholera outbreak analysis](https://www.mdpi.com/1660-4601/17/24/9378)
* Use of the Synthetic Minority Oversampling Technique (SMOTE) during the pre-processing stage allows for the generation of new examples of the minority class (outbreak=1)
* Different imbalance ratios were explored during a sensitivity analysis, with more balanced datasets producing slightly lower accuracy but generally higher F1 and ROC AUC scores
* However, assuming a 1:1 ratio of outbreaks vs. non-outbreaks is unrealistic in real-world data
* Thus far, a SMOTE sampling strategy of 0.3 has provided a reasonable balance between ML requirements and realistic applications to cholera outbreak analysis, with an accuracy score of 0.984, F1 score of 0.063, and ROC AUC of 0.524.  
* Further exploration of treatments for imbalanced datasets and reduction of environmental parameters within the random forest model should be pursued. 