# The Receiver Operating Characteristic Curve in R

In this notebook, we will train a logistic regression model on an extended dataset.  The extended dataset includes variations of the original dataset, as well as some randomized records to ensure that our model does not end up perfect.  Then, we will build a confusion matrix and see how it aligns with the Receiver Operating Characteristic (ROC) curve.

In addition to the `tidyverse` package, we will also use two more packages: `caret`, a general-purpose package which helps data scientists with common tasks; and `mice`, a library for imputing missing data.

In [None]:
library(tidyverse)
library(caret)
library(mice)

## Train the Model

Now that we have loaded the packages, let's quickly train a model.

In [None]:
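# Load the data and shuffle the rows reproducibly
attack_data <- read_csv("../data/1553_dos_cm_R.csv")
set.seed(184856)
rand_attack_data <- attack_data[sample(nrow(attack_data)), ]
# Impute missing values with predictive mean matching, then keep the first imputed dataset
imputed_data <- mice(rand_attack_data, m=5, maxit=50, meth='pmm', seed=103409)
completed_data <- complete(imputed_data, action=1)
# Fill any values that are still missing with fixed defaults
completed_data$sa[is.na(completed_data$sa)] <- 0
completed_data$ssa[is.na(completed_data$ssa)] <- 20
# Split 70% / 30% into training and test sets based on the outcome column
trainIndex <- caret::createDataPartition(completed_data$malicious, p = 0.7, list  = FALSE, times = 1)
train_data <- completed_data[trainIndex,]
test_data <- completed_data[-trainIndex,]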

In [None]:
nrow(train_data)

In [None]:
nrow(test_data)

In [None]:
model <- glm(malicious ~ dw0 + msgTime + rxSts + sa + gap + dsa + ssa + txSts + da + wc, data=train_data, family=binomial)
model

In [None]:
model_pred <- predict(model, test_data, type="response")
pred_malicious <- case_when(model_pred >= 0.5 ~ TRUE, is.na(model_pred) ~ NA, .default=FALSE)
outcomes <- cbind(as.data.frame(pred_malicious), test_data)

## Using the Confusion Matrix

The `caret` package provides a built-in confusion matrix, showing us the results for sensitivity and specificity, as well as positive and negative predictive value.

In this case, our accuracy is 85.6% but our specificity (if the actual result was FALSE, did we predict FALSE?) is only 67.8% because we correctly predicted 116 of the 171 actual FALSE records.  We were very good with sensitivity (if the actual result was TRUE, did we predict TRUE?).

In order to get the results to show up in the right order, we're going to label TRUE as "Malicious" and FALSE as "Not Malicious" because `confusionMatrix()` displays results in alphabetical order.  We will also set the positive indicator to "Malicious" to indicate that this is the outcome we want to detect:  we want to capture malicious databus traffic.

In [None]:
outcomes$pred_malicious_label <- case_when(outcomes$pred_malicious == TRUE ~ "Malicious", .default = "Not Malicious")
outcomes$malicious_label <- case_when(outcomes$malicious == TRUE ~ "Malicious", .default = "Not Malicious")
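
The alphabetical ordering mentioned above comes from how R assigns factor levels.  The following is a small base R check, using made-up label vectors rather than the notebook's data; the variable names are ours and purely illustrative.

```r
# Made-up labels for illustration only (not taken from the dataset)
logical_labels <- c(TRUE, FALSE, TRUE)
text_labels <- ifelse(logical_labels, "Malicious", "Not Malicious")

# as.factor() sorts levels, so FALSE comes before TRUE...
logical_levels <- levels(as.factor(logical_labels))

# ...while "Malicious" comes before "Not Malicious"
text_levels <- levels(as.factor(text_labels))

print(logical_levels)
print(text_levels)
```

With the text labels, the class we care about ends up first, which is the order we want to read in the confusion matrix.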

In [None]:
caret::confusionMatrix(as.factor(outcomes$pred_malicious_label), as.factor(outcomes$malicious_label), positive='Malicious')
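
To see where sensitivity, specificity, and accuracy come from, here is a minimal base R sketch that computes them by hand.  The `actual` and `predicted` vectors are toy values we made up for illustration, so the numbers will not match the output above.

```r
# Toy outcomes for illustration only (not taken from the dataset)
actual    <- c(TRUE, TRUE, TRUE,  TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)
predicted <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE,  FALSE, TRUE, FALSE, TRUE)

# The four cells of the confusion matrix
tp <- sum(predicted & actual)     # predicted TRUE, actually TRUE
tn <- sum(!predicted & !actual)   # predicted FALSE, actually FALSE
fp <- sum(predicted & !actual)    # predicted TRUE, actually FALSE
fn <- sum(!predicted & actual)    # predicted FALSE, actually TRUE

# If the result was TRUE, did we predict TRUE?
sensitivity <- tp / (tp + fn)
# If the result was FALSE, did we predict FALSE?
specificity <- tn / (tn + fp)
# Overall share of correct predictions
accuracy <- (tp + tn) / length(actual)

print(c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy))
```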

## Plotting the ROC Curve

Now that we have our data, we can build a Receiver Operating Characteristic curve.  To do this, we will load the `pROC` library and use the `roc()` function to generate the ROC curve.  We will also use `ggroc()` to interact with the `ggplot2` library (installed as part of the `tidyverse` library) to generate a quick and easy drawing for us.

In [None]:
library(pROC)

In [None]:
prob_malicious <- case_when(outcomes$malicious == TRUE ~ 1.0, .default = 0.0)
rocobj <- roc(prob_malicious, model_pred)
rocobj

In [None]:
options(repr.plot.width=5, repr.plot.height=5)
ggroc(rocobj) +
    geom_abline(slope=1,intercept=1,color="#999999") +
    theme_minimal()
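
Conceptually, each point on the curve is a sensitivity and specificity pair at one cutoff.  The sketch below rebuilds those points by hand in base R, using made-up `scores` and `labels` rather than our model's predictions.  It is meant to illustrate the idea and is not a description of how `pROC` computes the curve internally.

```r
# Toy predicted probabilities and true labels (illustration only)
scores <- c(0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10)
labels <- c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)

# Try every observed score as a cutoff, plus Inf so that nothing is predicted TRUE
thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))

roc_points <- as.data.frame(t(sapply(thresholds, function(th) {
  pred <- scores >= th
  c(threshold   = th,
    sensitivity = sum(pred & labels) / sum(labels),
    specificity = sum(!pred & !labels) / sum(!labels))
})))

print(roc_points)
```

Lowering the threshold moves us along the curve: sensitivity can only go up, and specificity can only go down.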

### Defining Area Under the Curve

We saw the calculation for area under the curve in the `rocobj` results, but we can also calculate it ourselves using the `auc()` function.

In [None]:
# Store the result under a name that does not shadow the auc() function
auc_value <- round(auc(prob_malicious, model_pred),4)
auc_value
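
One way to interpret the area under the curve is as the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one, with ties counted as one half.  The base R sketch below computes that directly on made-up `scores` and `labels`; it illustrates the interpretation and is not necessarily the method `pROC` uses.

```r
# Toy predicted probabilities and true labels (illustration only)
scores <- c(0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10)
labels <- c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)

pos <- scores[labels]    # scores of the actual positives
neg <- scores[!labels]   # scores of the actual negatives

# Compare every positive against every negative: 1 for a win, 0.5 for a tie
wins <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
auc_manual <- mean(wins)

print(auc_manual)
```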

## Finding the Best Fit

The ROC curve shows the range of performance we can achieve with our current model.  In other words, it shows the tradeoff between sensitivity and specificity as we vary the classification threshold:  to improve specificity, we necessarily decrease sensitivity, and vice versa.  There are a few different ways to define the 'best' tradeoff.  The most popular one is Youden's J statistic, which picks the point on the curve with the greatest vertical distance from the diagonal line.  This is the same as maximizing sensitivity + specificity - 1.

Note that this also matches up with the results we see in the confusion matrix!

In [None]:
coords(rocobj, "best", transpose = FALSE, best.method="youden")
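
To make Youden's J concrete, the following base R sketch computes J = sensitivity + specificity - 1 at each candidate threshold and picks the largest.  The `scores` and `labels` are made up for illustration.  Note also that `pROC` may report a somewhat different threshold value on the same data, since we believe it places thresholds between observed scores rather than on them.

```r
# Toy predicted probabilities and true labels (illustration only)
scores <- c(0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10)
labels <- c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)

# Use each observed score as a candidate cutoff (predict TRUE when score >= cutoff)
thresholds <- sort(unique(scores), decreasing = TRUE)
sens <- sapply(thresholds, function(th) sum(scores >= th & labels) / sum(labels))
spec <- sapply(thresholds, function(th) sum(scores < th & !labels) / sum(!labels))

# Youden's J at every cutoff, and the cutoff that maximizes it
youden_j <- sens + spec - 1
best <- which.max(youden_j)
best_threshold <- thresholds[best]
best_j <- youden_j[best]

print(data.frame(threshold = thresholds, sensitivity = sens, specificity = spec, J = youden_j))
print(c(best_threshold = best_threshold, best_j = best_j))
```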