# Supervised Machine Learning 

This training module was developed by Alexis Payton, MS, Oyemwenosa N. Avenbuan, and Dr. Julia E. Rager

Fall 2023

## Introduction to Machine Learning, including Supervised Machine Learning
Machine learning is a field that has been around for decades but has exploded in popularity and utility in recent years due to the proliferation of big data. By building models, machine learning can sift through and learn from large volumes of data and use that knowledge to solve problems. The challenges of big and high-dimensional data as they pertain to environmental health, and how machine learning can mitigate some of those challenges, are discussed further in [Payton et al.](https://www.frontiersin.org/articles/10.3389/ftox.2023.1171175/full).

## Types of Machine Learning
Within the field of machine learning, there are many different approaches that can be used to address environmental health research questions. The two categories most commonly used in environmental health research are: (1) supervised machine learning and (2) unsupervised machine learning.

**Supervised machine learning** involves training a model using a labeled dataset, where each set of independent (predictor) variables is associated with a dependent variable that has a known outcome. This allows the model to learn how to predict the labeled outcome on data it hasn't "seen" before based on the patterns and relationships it previously identified in the data. For example, supervised machine learning has been used for cancer prediction and prognosis based on variables like tumor size, stage, and age ([Lynch et al.](https://www.sciencedirect.com/science/article/abs/pii/S1386505617302368?via%3Dihub), [Asadi et al.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7416093/)).

Supervised machine learning includes: 
+ Classification: Using algorithms to classify a categorical outcome (e.g., plant species or disease status)
+ Regression: Using algorithms to predict a continuous outcome (e.g., temperature or chemical concentration)
<img src="https://github.com/UNC-CEMALB/P1006_Data-Organization-for-High-Dimensional-Analyses-in-Environmental-Health/assets/69641855/3c22a8fc-ada5-4199-9967-77f504d99d2a" width="684" />

([Soni, 2018](https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d))


**Unsupervised machine learning**, on the other hand, involves using models to find patterns or associations between variables in datasets that lack a known or labeled outcome. For example, unsupervised machine learning has been used to find associations of co-expressed genes within various biological pathways ([Botía et al.](https://bmcsystbiol.biomedcentral.com/articles/10.1186/s12918-017-0420-6), [Pagnuco et al.](https://www.sciencedirect.com/science/article/pii/S0888754317300575?via%3Dihub)).

<img src="https://github.com/UNC-CEMALB/P1006_Data-Organization-for-High-Dimensional-Analyses-in-Environmental-Health/assets/69641855/4f78adef-97b6-425f-9a51-43315b0fb7b2" width="684" />

([Langs et al.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6244522/))

Overall, the distinction between supervised and unsupervised learning is an important concept in machine learning, as it can inform the choice of algorithms and techniques used to analyze and make predictions from data. It is worth noting that there are also other types of machine learning, such as semi-supervised learning, reinforcement learning and deep learning.

## Training Supervised Machine Learning Models

In supervised machine learning, algorithms need to be trained before they can be used to make predictions. This involves selecting a portion of the data, known as training data, from which the model learns how to predict the outcome as accurately as possible. The process of training an algorithm is essential for enabling it to learn and improve over time, allowing it to make more accurate predictions and better adapt to new and changing circumstances. Ultimately, the effectiveness of a machine learning model depends on the quality and relevance of its training data.

Let's imagine you're interested in predicting an animal's species (either a cat or a dog) based on a dataset that contains variables regarding weight, height, coat color, ear shape, etc. These data can be divided into a training set and a test set. The **training set** is a subset of the data that the model will learn from to make associations between the predictor variables (e.g., height and weight) and the outcome (i.e., cat or dog). The **test set** is used to evaluate what the model has learned from the training set.

It is common to split the entire dataset into the training set that contains 60% of the data and the test set that contains 40% of the data:

<img src="https://github.com/UNC-CEMALB/P1006_Data-Organization-for-High-Dimensional-Analyses-in-Environmental-Health/assets/69641855/a8c723ec-e50c-4cb9-aff5-4230d0d6fc2c" width="684"/>

*Created with BioRender.com*

Other common splits include 70% training / 30% test and 80% training / 20% test.
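
As a minimal sketch of the 60% / 40% split described above, the rows of a dataset can be randomly assigned using base R's `sample()`. The `animals` data frame and its columns are hypothetical stand-ins simulated for illustration; they are not part of this module's data.

```r
# Hypothetical stand-in data: 100 animals with two predictors and a species label
set.seed(12)
animals <- data.frame(
  weight  = runif(100, min = 2, max = 40),
  height  = runif(100, min = 20, max = 70),
  species = factor(sample(c("cat", "dog"), 100, replace = TRUE))
)

# Randomly choose 60% of the row numbers for the training set
train_rows <- sample(seq_len(nrow(animals)), size = round(0.6 * nrow(animals)))

# The remaining 40% of the rows form the test set
train_set <- animals[train_rows, ]
test_set  <- animals[-train_rows, ]

nrow(train_set)
nrow(test_set)
```

Changing `0.6` to `0.7` or `0.8` would give the other splits mentioned above.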

The process of developing a model often involves dividing the data into three main sets:

1. **Training Set:** a subset of the data that the model uses to learn from the data by identifying patterns.

2. **Validation Set:** a subset of the training set data that is used to evaluate the model's fit while fine-tuning its parameters and optimizing its performance. This is akin to pop quizzes that help students improve their understanding and performance.

3. **Test Set:** a subset of data that is used to evaluate the fit of the final model developed using the training and validation sets. This is akin to the model's final exam, as it provides an objective assessment of the model's ability to generalize to new, unseen data.

It is important to note that the test set should only be used once, after the model has been trained and tuned using the training and validation sets. Using the test set multiple times during the development process can lead to overfitting, where the model performs well on the test data but poorly on new, unseen data. The ideal algorithm is flexible enough to capture the patterns in the training data yet generalizable enough to accurately predict unseen data. Balancing these two goals is known as the bias-variance tradeoff. For further information on the bias-variance tradeoff, see [Understanding the Bias-Variance Tradeoff](https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229).
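
One way to picture the three sets is to shuffle the row numbers once and cut the shuffled order into three pieces. This is only a sketch: the sample size and the 60 / 20 / 20 proportions are assumptions chosen for illustration, not values prescribed by this module.

```r
set.seed(12)
n <- 100  # hypothetical number of samples

# Shuffle the row numbers once, then cut the shuffled order into three pieces
shuffled <- sample(seq_len(n))
train_rows      <- shuffled[1:60]    # used to fit the model
validation_rows <- shuffled[61:80]   # used to tune parameters
test_rows       <- shuffled[81:100]  # held back for a single final evaluation

# Number of rows in each set
lengths(list(train = train_rows, validation = validation_rows, test = test_rows))
```

Because each row number appears exactly once in `shuffled`, no sample can end up in more than one set.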

### Cross Validation

The last topic we should mention in this section is **cross validation** or ***k*-fold cross validation**. If the dataset is divided using the 60:40 split mentioned earlier, the model's accuracy will likely be influenced by *where* the split occurs in the dataset. A single split can therefore give a biased estimate of performance and reduce the algorithm's ability to predict accurately. Cross validation (CV) is implemented to fine-tune a model's parameters and ensure that the model is exposed to more patterns in the dataset, which reduces bias and improves prediction accuracy.

It works by splitting the samples in the dataset into *k* folds, or groups, of roughly equal size. For example, if 5-fold CV were run, the dataset would be divided into 5 groups, with 4 used to train the model and fine-tune its parameters and 1 held out for testing. Across the five iterations, each fold has a chance to be the test set, as seen in the figure below. To measure prediction error in Random Forest, out of bag (OOB) errors are calculated each time parameters are tuned. The lower the OOB error, the better the model performance.

<img src="https://github.com/UNC-CEMALB/P1006_Data-Organization-for-High-Dimensional-Analyses-in-Environmental-Health/assets/69641855/ffe9f044-3c13-4b56-80e1-109380f4a6d9" width="684"/>

*Created with BioRender.com*


Confusion matrix metrics would be calculated after each iteration and averaged for the final results. Check out these resources for additional information on [Cross Validation in Machine Learning](https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f) and [Cross Validation Pros & Cons](https://www.geeksforgeeks.org/cross-validation-machine-learning/).
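
The fold assignment described in this section can be sketched in base R as follows. The sample size is hypothetical, and this simple version assigns folds purely at random; functions such as `createFolds()` from the `caret` package additionally balance the outcome classes across folds when the outcome is a factor.

```r
set.seed(12)
n <- 20  # hypothetical number of samples
k <- 5   # number of folds

# Randomly assign each sample to one of k equally sized folds
fold_id <- sample(rep(seq_len(k), length.out = n))

for (i in seq_len(k)) {
  test_rows  <- which(fold_id == i)   # fold i is held out for testing
  train_rows <- which(fold_id != i)   # the other k - 1 folds are used for training
  cat("Iteration", i, "- train:", length(train_rows), "test:", length(test_rows), "\n")
}
```

Each sample is used for testing in exactly one iteration and for training in the remaining four.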

## Assessing Model Performance  
Metrics from a confusion matrix are used to measure model performance for classification-based supervised machine learning models. A confusion matrix consists of a table that displays how often the algorithm correctly or incorrectly predicted the outcome.

Going back to our animal classification example, let's say the confusion matrix below is a result of how well the model was able to predict whether an animal was considered to be a cat or a dog.

<img src="https://github.com/UNC-CEMALB/P1006_Data-Organization-for-High-Dimensional-Analyses-in-Environmental-Health/assets/69641855/4c7a90a0-725b-48b8-ada4-77a5907131a0" width="684" />

*Created with BioRender.com*

Some of the metrics that can be obtained from a confusion matrix are listed below:

+ **Balanced Accuracy:** the average of sensitivity and specificity (both defined below), typically used to assess overall model performance. Plain accuracy, the ratio of correct predictions (TP + TN) to the total number of predictions (TP + TN + FP + FN), is prone to skew for imbalanced data because a model can score well simply by favoring the larger class; balanced accuracy weights both classes equally. For example, if the animal dataset had 11 cats and 74 dogs, the data would be considered imbalanced.

+ **Sensitivity or Recall:** evaluates how well the model was able to predict the "positive" class. It's the ratio of correctly classified positives (TP) to the total number of actual positives (TP + FN).

+ **Specificity:** evaluates how well the model was able to predict the "negative" class. It's the ratio of correctly classified negatives (TN) to the total number of actual negatives (TN + FP).

+ **Positive Predictive Value (PPV) or Precision:** evaluates how often a "positive" prediction is correct. It's the ratio of correctly classified positives (TP) to the total number of predicted positives (TP + FP).

+ **Negative Predictive Value (NPV):** evaluates how often a "negative" prediction is correct. It's the ratio of correctly classified negatives (TN) to the total number of predicted negatives (TN + FN).

For all metrics, values fall between 0 and 1, where 0 indicates that the model was not able to classify any data points correctly and 1 indicates that the model was able to classify all test data correctly. Although subjective, a balanced accuracy of at least 0.7 is considered respectable ([Barkved, 2022](https://www.obviously.ai/post/machine-learning-model-performance#:~:text=Good%20accuracy%20in%20machine%20learning,also%20consistent%20with%20industry%20standards.)).
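
To make these ratios concrete, here is a short worked example in R. The four counts are hypothetical (they are not read from the figure above), and balanced accuracy is computed using its standard definition as the mean of sensitivity and specificity.

```r
# Hypothetical confusion matrix counts
TP <- 40  # positives correctly predicted as positive
FN <- 10  # positives incorrectly predicted as negative
TN <- 30  # negatives correctly predicted as negative
FP <- 20  # negatives incorrectly predicted as positive

sensitivity <- TP / (TP + FN)                     # 40 / 50
specificity <- TN / (TN + FP)                     # 30 / 50
ppv         <- TP / (TP + FP)                     # 40 / 60
npv         <- TN / (TN + FN)                     # 30 / 40
accuracy    <- (TP + TN) / (TP + TN + FP + FN)    # 70 / 100
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(sensitivity = sensitivity, specificity = specificity, ppv = ppv, npv = npv,
        accuracy = accuracy, balanced_accuracy = balanced_accuracy), 2)
```

In this example the two classes are the same size (50 each), so accuracy and balanced accuracy agree; they diverge as the classes become more imbalanced.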


**Note**: For multi-class classification (more than two labeled outcomes to be predicted), the same metrics are used, but are obtained in a slightly different way. Regression-based supervised machine learning models use loss functions to evaluate model performance. For more information regarding confusion matrices and loss functions for regression-based models, see:

 + [Additional Confusion Matrix Metrics](https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5)
 + [Precision vs. Recall or Specificity vs. Sensitivity](https://towardsdatascience.com/should-i-look-at-precision-recall-or-specificity-sensitivity-3946158aace1)
 + [Loss Functions for Machine Learning Regression](https://towardsdatascience.com/understanding-the-3-most-common-loss-functions-for-machine-learning-regression-23e0ef3e14d3)
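
As a brief illustration of loss functions for a continuous outcome, the sketch below computes three widely used ones: mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE). The observed and predicted values are hypothetical, and this choice of loss functions is an assumption for illustration rather than something specified in this module.

```r
# Hypothetical observed and predicted values for a continuous outcome
observed  <- c(2.0, 3.5, 5.0, 7.5)
predicted <- c(2.5, 3.0, 5.0, 6.5)

# Prediction errors: -0.5, 0.5, 0.0, 1.0
errors <- observed - predicted

mse  <- mean(errors^2)      # mean squared error: penalizes large errors more heavily
rmse <- sqrt(mse)           # root mean squared error: same units as the outcome
mae  <- mean(abs(errors))   # mean absolute error: average size of the errors

c(MSE = mse, RMSE = rmse, MAE = mae)
```

For all three, lower values indicate predictions that are closer to the observed data.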

## Types of Supervised Machine Learning Algorithms

Although this module's coding example below focuses on a random forest model, other commonly used algorithms for supervised machine learning include:

+ **K-Nearest Neighbors (KNN):** Uses distance (commonly Euclidean distance) to classify a data point in the test set based upon the most common class of neighboring data points. For more information on KNN, see [K-Nearest Neighbor](https://www.ibm.com/topics/knn).
<img src="https://user-images.githubusercontent.com/96756991/232493057-1e7ce98b-6985-44cd-98a9-3cfea5994659.png" width="684" />

*Created with BioRender.com*
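
The KNN idea described above can be sketched in a few lines of base R. The training data, the new animal, and the choice of `k = 3` are all hypothetical; in practice, predictors are usually scaled first so that no single variable dominates the distance.

```r
# Hypothetical labeled training data: weight (kg) and height (cm)
train <- data.frame(
  weight  = c(4, 5, 3.5, 25, 30, 20),
  height  = c(24, 25, 23, 55, 60, 50),
  species = c("cat", "cat", "cat", "dog", "dog", "dog")
)

# A new, unlabeled animal to classify
new_animal <- c(weight = 6, height = 27)

# Euclidean distance from the new animal to every training point
distances <- sqrt((train$weight - new_animal["weight"])^2 +
                  (train$height - new_animal["height"])^2)

# Take the k closest neighbors and predict the most common class among them
k <- 3
nearest <- order(distances)[1:k]
predicted_species <- names(which.max(table(train$species[nearest])))
predicted_species
```

Here the three nearest neighbors are all cats, so the new animal is classified as a cat.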

+ **Support Vector Machine (SVM):** Creates a decision boundary (a hyperplane) in n-dimensional space to separate the data into classes so that new data can be easily categorized. For more information on SVM, see [Support Vector Machine](https://www.javatpoint.com/machine-learning-support-vector-machine-algorithm).
<img src="https://user-images.githubusercontent.com/96756991/233735220-08682ea4-fe13-41c8-8ac5-9ddda7859328.png" width="684" />

*Created with BioRender.com* 

+ **Random Forest (RF):** Uses a multitude of decision trees, each trained on a different subset of samples from the training set. The resulting classification of a data point in the test set is aggregated from all the decision trees. A **decision tree** is a hierarchical model that selects the best predictors to classify the data, with each node representing a test on a predictor, each branch representing the outcome of that test, and each leaf node representing a class label. For more information on RF and decision trees, check out [Random Forest](https://www.ibm.com/in-en/topics/random-forest) and
[Decision Trees](https://www.ibm.com/topics/decision-trees#:~:text=A%20decision%20tree%20is%20a,internal%20nodes%20and%20leaf%20nodes.).

Here is an illustration of the terminology used to describe the underlying structure of a decision tree:

<img src="https://github.com/UNC-CEMALB/P1011_Emission-Mixtures/assets/69641855/f4d15d7c-e2ee-42d0-b58c-4c8d3240f6be" width="684" />


Here is a published example decision tree with potential variables and decisions informing low vs. high risk of having a heart attack:

<img src="https://github.com/UNC-CEMALB/P1011_Emission-Mixtures/assets/69641855/690d8e90-059e-47d0-98c0-bd0fc6fbebc3" width="684" />

([Navlani, 2023](https://www.datacamp.com/tutorial/decision-tree-classification-python))
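
To connect the sampling step in random forest with the out of bag (OOB) error mentioned earlier, the sketch below draws the training sample for a single tree. It assumes the common approach of bootstrap sampling (drawing rows with replacement), and the sample size is hypothetical.

```r
set.seed(12)
n <- 1000  # hypothetical number of training samples

# Each tree is trained on a bootstrap sample: n rows drawn with replacement
in_bag <- sample(seq_len(n), size = n, replace = TRUE)

# Rows never drawn are "out of bag" for this tree and can be used to test it
out_of_bag <- setdiff(seq_len(n), in_bag)

# Fraction of samples left out of this tree's training data
length(out_of_bag) / n
```

Roughly a third of the samples are typically left out of each tree, which is what allows an error estimate to be calculated without a separate validation set.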

## Introduction to Training Module
In this activity, we will analyze an example dataset to see whether we can use environmental monitoring information to predict areas of contamination through random forest modeling. Specifically, RF will leverage a dataset of well water variables that span geospatial location, sampling date, and well water attributes, with the goal of predicting whether detectable levels of inorganic arsenic (iAs) are present. This dataset was obtained through the sampling of 713 private wells across North Carolina using a method that was capable of detecting levels of iAs. After the algorithm has been trained and tested, model performance is assessed using the aforementioned confusion matrix metrics.


## Training Module's Environmental Health Questions

This training module was specifically developed to answer the following environmental health questions:

1. Which well water predictor variables significantly differ between samples containing detectable levels of iAs vs samples that have non-detect levels of iAs?

2. How can we build and evaluate the performance of this RF model?

## Script Preparations

### Cleaning the global environment

In [None]:
rm(list=ls())

### Installing required R packages
If you already have these packages installed, you can skip this step, or you can run the code below, which checks installation status for you.

In [None]:
# Install any required packages that are not already available
required_packages <- c("readxl", "lubridate", "tidyverse", "gtsummary",
                       "flextable", "caret", "randomForest")

for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

### Loading R packages required for this session

In [1]:
library(readxl);
library(lubridate);
library(tidyverse);
library(gtsummary);
library(flextable);
library(caret);
library(randomForest);


Attaching package: ‘lubridate’


The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union


── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr   1.1.3     ✔ readr   2.1.4
✔ forcats 1.0.0     ✔ stringr 1.5.0
✔ ggplot2 3.4.3     ✔ tibble  3.2.1
✔ purrr   1.0.2     ✔ tidyr   1.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: ‘flextable’


The following objects are masked from ‘package:gtsummary’:

    as_

### Set your working directory

In [None]:
setwd("/filepath to where your input files are")


### Importing example dataset

In [2]:
# Load the data
arsenic_data <- data.frame(read_excel("Module5/Module5_Arsenic_Data.xlsx"))

# View the top of the dataset
head(arsenic_data) 

| | Tax_ID | Water_Sample_Date | Casing_Depth | Well_Depth | Static_Water_Depth | Flow_Rate | pH | Detect_Concentration |
|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` |
| 1 | 1006004 | 9/24/12 | 52 | 165 | 41 | 60.0 | 7.7 | ND |
| 2 | 1024009 | 12/17/15 | 40 | 445 | 42 | 2.0 | 7.3 | ND |
| 3 | 1054019 | 2/2/15 | 45 | 160 | 40 | 40.0 | 7.4 | ND |
| 4 | 1057017 | 10/22/12 | 42 | 440 | 57 | 1.5 | 8.0 | D |
| 5 | 1060006 | 1/3/11 | 48 | 120 | 42 | 25.0 | 7.1 | ND |
| 6 | 1066006 | 12/15/15 | 60 | 280 | 32 | 10.0 | 8.2 | D |


The columns in this dataset are described below:
+ `Tax_ID`: Tax ID for the property
+ `Water_Sample_Date`: Date that the well was sampled
+ `Casing_Depth`: Depth of the casing of the well (ft)
+ `Well_Depth`: Depth of the well (ft)
+ `Static_Water_Depth`: Static water depth in the well (ft)
+ `Flow_Rate`: Well flow rate (gallons per minute)
+ `pH`: pH of water sample
+ `Detect_Concentration`: Binary identifier (either non-detect (ND) or detect (D)) if iAs concentration detected in water sample 

### Changing Data Types 
First, `Detect_Concentration` needs to be converted from a character to a factor so that Random Forest knows that non-detect (binarized as 0) data is considered the baseline. `Water_Sample_Date` needs to be converted from a character to a date type so that Random Forest understands that this column contains dates.

In [3]:
arsenic_data = arsenic_data %>%
    # Converting `Detect_Concentration` from a character to a factor
    mutate(Detect_Concentration = relevel(factor(ifelse(Detect_Concentration == "D", 1, 0)), ref = "0"),
        # converting water sample date from a character to a date type 
        Water_Sample_Date = mdy(Water_Sample_Date)) %>%
    # Removing tax id and only keeping the predictor and outcome variables in the dataset
    select(-Tax_ID)

head(arsenic_data)

| | Water_Sample_Date | Casing_Depth | Well_Depth | Static_Water_Depth | Flow_Rate | pH | Detect_Concentration |
|---|---|---|---|---|---|---|---|
| | `<date>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 1 | 2012-09-24 | 52 | 165 | 41 | 60.0 | 7.7 | 0 |
| 2 | 2015-12-17 | 40 | 445 | 42 | 2.0 | 7.3 | 0 |
| 3 | 2015-02-02 | 45 | 160 | 40 | 40.0 | 7.4 | 0 |
| 4 | 2012-10-22 | 42 | 440 | 57 | 1.5 | 8.0 | 1 |
| 5 | 2011-01-03 | 48 | 120 | 42 | 25.0 | 7.1 | 0 |
| 6 | 2015-12-15 | 60 | 280 | 32 | 10.0 | 8.2 | 1 |


### Testing for differences in predictor variables across the outcome classes

It is useful to run summary statistics on the variables that will be used as predictors in the algorithm to see if their distributions differ between the outcome classes (either non-detect or detect in this case). Typically, a variable that differs more significantly between classes is more predictive, since the model is better able to separate the classes. We'll use the `tbl_summary()` function from the `gtsummary` package.

For more information on the `tbl_summary()` function, check out this helpful [Tutorial](https://www.danieldsjoberg.com/gtsummary/articles/tbl_summary.html).

In [4]:
arsenic_data %>%
    tbl_summary(by = Detect_Concentration,
    # Displaying the mean and standard deviation in parentheses for all continuous variables
                statistic = list(all_continuous() ~ "{mean} ({sd})")) %>%
    # Adding a column that displays the total number of samples for each variable
    # This will be 713 for all variables since we have no missing data
    add_n() %>% 
    # Adding a column that displays the p value from anova
    add_p(test = list(all_continuous() ~ "aov")) %>% 
    # as_flex_table() %>%
    # bold(bold = TRUE, part = "header")
    as_tibble()

| **Characteristic** | **N** | **0**, N = 515 | **1**, N = 198 | **p-value** |
|---|---|---|---|---|
| `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| Water_Sample_Date | 713 | 2013-06-05 (979.174260670888) | 2013-03-05 (957.843005291701) | 0.3 |
| Casing_Depth | 713 | 74 (33) | 55 (23) | <0.001 |
| Well_Depth | 713 | 301 (144) | 334 (128) | 0.005 |
| Static_Water_Depth | 713 | 35 (12) | 36 (13) | 0.4 |
| Flow_Rate | 713 | 25 (33) | 14 (16) | <0.001 |
| pH | 713 | 7.45 (0.55) | 7.82 (0.40) | <0.001 |


All of the variables differ significantly between detect and non-detect iAs samples, with the exception of the sample date and the static water depth. This suggests that the predictors carry information that can help the model separate the two classes.

## Setting up Cross Validation

As mentioned above, cross validation is done so that the model is trained and tested on different portions of the entire dataset. We'll use 5-fold cross validation.

In [5]:
# Since the splits in the dataset are random, a seed is set for reproducibility to ensure the splits occur
# in the same locations each time the code is run
set.seed(12)

# 5-fold cross validation
# Saving the indices (row numbers) of the samples in each of the 5 folds
# These indices will be iterated through using a loop to create each training and testing dataset
arsenic_index = createFolds(arsenic_data$Detect_Concentration, k = 5) 

Within CV, different parameters will be tested, including the number of trees "grown" by RF (`ntree_values`) and the number of predictors used in those trees (`mtry_values`).

In [6]:
ntree_values = c(50, 250, 500) # number of trees 
p = dim(arsenic_data)[2] - 1 # number of predictor variables in the dataset
mtry_values = c(sqrt(p), p/2, p) # number of predictors to be used in the model
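
To see which parameter combinations will be tested, the candidate values can be laid out as a grid. This is a minimal sketch: `p <- 6` is an assumption based on the six predictor columns remaining after `Tax_ID` is removed, and `tuning_grid` is a hypothetical name used only for illustration. The `mtry` values are rounded here because the parameter counts whole variables.

```r
ntree_values <- c(50, 250, 500)   # number of trees, as in the module
p <- 6                            # assumed number of predictors after dropping Tax_ID

# mtry counts whole variables, so the candidate values are rounded for display
mtry_values <- round(c(sqrt(p), p / 2, p))

# Every ntree/mtry combination, with mtry varying fastest
tuning_grid <- expand.grid(mtry = mtry_values, ntree = ntree_values)
nrow(tuning_grid)
tuning_grid
```

With three values for each parameter, nine random forest models are fit within each cross validation fold before the best combination is chosen.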

## Predicting iAs Detection with a Random Forest (RF) Model

In [7]:
# Setting the seed again so the predictions are consistent
set.seed(12)

# Creating an empty dataframe to save the metrics
metrics_df = data.frame()

# Iterating through the cross validation folds
for (i in 1:length(arsenic_index)){
    # Training data
    data_train = arsenic_data[-arsenic_index[[i]],]
    
    # Test data
    data_test = arsenic_data[arsenic_index[[i]],]
    
    # Creating empty lists and dataframes to store errors 
    reg_rf_pred_tune = list()
    rf_OOB_errors = list()
    rf_error_df = data.frame()
    
    # Tuning parameters: using ntree and mtry values to determine which combination yields the smallest OOB error 
    # on the training data (each tree is evaluated on the samples left out of its own training sample)
    for (j in 1:length(ntree_values)){
        for (k in 1:length(mtry_values)){
            
            # Running RF to tune parameters
            reg_rf_pred_tune[[k]] = randomForest(Detect_Concentration ~ ., data = data_train, 
                                                 ntree = ntree_values[j], mtry = mtry_values[k])
            # Obtaining the OOB error
            rf_OOB_errors[[k]] = data.frame("Tree Number" = ntree_values[j], "Variable Number" = mtry_values[k], 
                                   "OOB_errors" = reg_rf_pred_tune[[k]]$err.rate[ntree_values[j],1])
            
            # Storing the values in a dataframe
            rf_error_df = rbind(rf_error_df, rf_OOB_errors[[k]])
        }
    }
    
    # Finding the lowest OOB error using best number of predictors at split
    best_oob_errors <- which(rf_error_df$OOB_errors == min(rf_error_df$OOB_errors))

    # Now running RF on the entire training set with the tuned parameters
    reg_rf <- randomForest(Detect_Concentration ~ ., data = data_train,
                               ntree = rf_error_df$Tree.Number[min(best_oob_errors)],
                               mtry = rf_error_df$Variable.Number[min(best_oob_errors)])

    # Predicting on test set and adding the predicted values as an additional column to the test data
    data_test$Pred_Detect_Concentration = predict(reg_rf, newdata = data_test, type = "response")

    # Obtaining the confusion matrix
    matrix = confusionMatrix(data = data_test$Pred_Detect_Concentration, 
                             reference = data_test$Detect_Concentration, positive = "1")
    
    # Extracting balanced accuracy, sensitivity, specificity, and PPV
    matrix_values = data.frame(t(c(matrix$byClass[11])), t(c(matrix$byClass[1:3])))
    
    # Adding values to df to be averaged across the 5 splits from CV
    metrics_df = rbind(metrics_df, matrix_values)
}

# Taking average
metrics_df = metrics_df %>%
    summarise(`Balanced Accuracy` = mean(Balanced.Accuracy), Sensitivity = mean(Sensitivity), 
          Specificity = mean(Specificity), PPV = mean(Pos.Pred.Value))

# Viewing the model's performance metrics
metrics_df

| Balanced Accuracy | Sensitivity | Specificity | PPV |
|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 0.6200317 | 0.3934615 | 0.8466019 | 0.516777 |


Takeaways from these confusion matrix metrics:

+ Overall, the model did a moderate job of predicting whether iAs would be detected, based on a balanced accuracy of ~0.62
+ RF did a poor job of predicting detect data, with a sensitivity of ~0.39 and a PPV of ~0.52
+ The model was considerably better at predicting non-detect data, based on a specificity of ~0.85
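
As a quick check on how these metrics relate, balanced accuracy is the mean of sensitivity and specificity, so it can be recomputed from the averaged values reported in the table above:

```r
# Averaged metrics reported in the table above
sensitivity <- 0.3934615
specificity <- 0.8466019

# Balanced accuracy is the mean of sensitivity and specificity
balanced_accuracy <- (sensitivity + specificity) / 2
round(balanced_accuracy, 7)
```

The result, 0.6200317, matches the reported balanced accuracy and shows that the low sensitivity is what pulls the overall score down.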


## Additional Resources
To learn more, check out the following resources:

+ [Machine Learning Mastery](https://machinelearningmastery.com/machine-learning-in-r-step-by-step/)
+ [Master in Data Science](https://www.mastersindatascience.org/learning/machine-learning-algorithms/decision-tree/)
+ [IBM - What is Machine Learning](https://www.ibm.com/topics/machine-learning)
+ Mueller, J. P. (2021). *Machine Learning for Dummies*. John Wiley & Sons.

## Concluding Remarks

In conclusion, this training module has provided an informative introduction to supervised machine learning using classification techniques in R. Machine learning is a powerful tool that can help researchers gain new insights and improve models to analyze complex datasets faster and in a more comprehensive way. The example we've explored demonstrates the utility of supervised machine learning models on datasets with a plethora of features.

## Test Your Knowledge 

1. Using the "Manganese_Data", use RF to determine if well water data can be accurate predictors of manganese detection. The data is structured similarly to the "Arsenic_Data" used in this module; however, it now includes 4 additional predictor variables:

+ `Longtitude`: Longitude of address (decimal degrees)
+ `Latitude`: Latitude of address (decimal degrees)
+ `Stream_Distance`: Euclidean distance to the nearest stream (feet)
+ `Elevation`: Surface elevation of the sample location