# Building a Confusion Matrix in R

In this notebook, we will train a logistic regression model on an extended dataset. The extended dataset includes variations of the original records, as well as some randomized records to ensure that our model does not classify everything perfectly.

In addition to the `tidyverse` package, we will also use two more packages: `caret`, a package for training and evaluating classification and regression models; and `mice`, a library for imputing missing data.

In [None]:
library(tidyverse)
library(caret)
library(mice)

## Train the Model

Now that we have loaded the packages, let's quickly train a model.

In [None]:
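# Load the extended dataset.
attack_data <- read_csv("../data/1553_dos_cm_R.csv")

# Shuffle the rows.
set.seed(184856)
rand_attack_data <- attack_data[sample(nrow(attack_data)), ]

# Impute missing values with predictive mean matching and keep the first imputed dataset.
imputed_data <- mice(rand_attack_data, m = 5, maxit = 50, method = "pmm", seed = 103409)
completed_data <- complete(imputed_data, action = 1)

# Fill any values that are still missing with fixed defaults.
completed_data$sa[is.na(completed_data$sa)] <- 0
completed_data$ssa[is.na(completed_data$ssa)] <- 20

# Use 70% of the records for training and hold out the rest for testing.
trainIndex <- caret::createDataPartition(completed_data$malicious, p = 0.7, list = FALSE, times = 1)
train_data <- completed_data[trainIndex, ]
test_data <- completed_data[-trainIndex, ]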

In [None]:
nrow(train_data)

In [None]:
nrow(test_data)

In [None]:
model <- glm(
  malicious ~ dw0 + msgTime + rxSts + sa + gap + dsa + ssa + txSts + da + wc,
  data = train_data,
  family = binomial
)
model

In [None]:
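# Predicted probability that each test record is malicious.
model_pred <- predict(model, test_data, type = "response")
# Classify as malicious at a 0.5 threshold, keeping NA where no prediction was possible.
pred_malicious <- case_when(model_pred >= 0.5 ~ TRUE, is.na(model_pred) ~ NA, .default = FALSE)
# Put the predictions next to the test data, including the actual outcome.
outcomes <- cbind(as.data.frame(pred_malicious), test_data)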

## Using the Confusion Matrix

The `caret` package provides a built-in `confusionMatrix()` function, which reports sensitivity and specificity as well as positive and negative predictive value.

In this case, our accuracy is 85.6%, but our specificity (if the result was FALSE, did we predict FALSE?) is only 67.8%: we got 116 of the 171 FALSE records correct. Sensitivity (if the result was TRUE, did we predict TRUE?) was considerably better.
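
As a minimal sketch of how these measures are calculated, the cell below counts the four outcomes by hand. The `actual` and `predicted` vectors are invented toy values, not records from the attack dataset; only the final line uses numbers quoted above.

```r
# Toy labels, invented for illustration (not from the attack dataset).
actual    <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
predicted <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)

# Count the four cells of the confusion matrix.
tp <- sum(predicted & actual)    # predicted TRUE, actually TRUE
fn <- sum(!predicted & actual)   # predicted FALSE, actually TRUE
tn <- sum(!predicted & !actual)  # predicted FALSE, actually FALSE
fp <- sum(predicted & !actual)   # predicted TRUE, actually FALSE

sensitivity <- tp / (tp + fn)                 # 3 / 4 = 0.75
specificity <- tn / (tn + fp)                 # 2 / 4 = 0.5
accuracy    <- (tp + tn) / length(actual)     # 5 / 8 = 0.625

# The specificity quoted above follows the same formula.
quoted_specificity <- round(116 / 171, 3)     # 0.678
```

Accuracy is a weighted average of sensitivity and specificity, which is why a low specificity can sit alongside a fairly high overall accuracy.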

In order to get the results to show up in the right order, we're going to label TRUE as "Malicious" and FALSE as "Not Malicious". `as.factor()` sorts its levels alphabetically and `confusionMatrix()` lays out its table in level order, so "Malicious" will come first, whereas FALSE would come before TRUE. We will also set the `positive` argument to "Malicious" to indicate that this is the outcome we want to detect: we want to capture malicious databus traffic.
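
The ordering issue can be seen with a few toy values. This is a small sketch using base R only; the vectors are invented for illustration.

```r
# Logical values become factor levels in sorted order, so FALSE comes first.
logical_levels <- levels(as.factor(c(TRUE, FALSE, TRUE)))
logical_levels  # "FALSE" "TRUE"

# With text labels, "Malicious" sorts ahead of "Not Malicious".
label_levels <- levels(as.factor(c("Not Malicious", "Malicious", "Not Malicious")))
label_levels    # "Malicious" "Not Malicious"

# An alternative is to fix the order explicitly instead of relying on sorting.
explicit <- factor(c("Not Malicious", "Malicious"),
                   levels = c("Malicious", "Not Malicious"))
levels(explicit)  # "Malicious" "Not Malicious"
```

Setting the levels explicitly also keeps both classes in the table if one of them happens not to appear in the predictions.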

In [None]:
outcomes$pred_malicious_label <- case_when(outcomes$pred_malicious == TRUE ~ "Malicious", .default = "Not Malicious")
outcomes$malicious_label <- case_when(outcomes$malicious == TRUE ~ "Malicious", .default = "Not Malicious")

In [None]:
caret::confusionMatrix(
  as.factor(outcomes$pred_malicious_label),
  as.factor(outcomes$malicious_label),
  positive = "Malicious"
)
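
If we want individual statistics rather than the printed summary, we can store the result and pull values out of it. The sketch below uses invented toy labels (`toy_pred` and `toy_actual` are hypothetical names, not part of this notebook's data) so that it runs on its own; the same `$table`, `$overall`, and `$byClass` elements should be available on the result of the call above.

```r
library(caret)

# Toy predictions and outcomes, invented for illustration.
lvls <- c("Malicious", "Not Malicious")
toy_pred <- factor(
  c("Malicious", "Malicious", "Not Malicious", "Not Malicious", "Malicious", "Not Malicious"),
  levels = lvls
)
toy_actual <- factor(
  c("Malicious", "Not Malicious", "Not Malicious", "Not Malicious", "Malicious", "Malicious"),
  levels = lvls
)

toy_cm <- confusionMatrix(toy_pred, toy_actual, positive = "Malicious")

# The raw counts, with predictions in rows and actual outcomes in columns.
toy_cm$table

# Overall accuracy, plus the per-class statistics discussed above.
toy_cm$overall["Accuracy"]
toy_cm$byClass[c("Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value")]
```

With these toy labels there are two true positives, two true negatives, one false positive, and one false negative, so accuracy, sensitivity, and specificity all come out to two-thirds.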