# Model a Logistic Regression in R

This notebook will perform logistic regression on our sample data.  The total number of sample records is small, though because most of our input features have very few unique values, this is not as limiting as it would first appear to be.

In addition to the `tidyverse` package, we will also use two more packages: `caret`, a general-purpose package which helps data scientists with common tasks; and `mice`, a library for imputing missing data.

In [None]:
library(tidyverse)
library(caret)
library(mice)

## Prepare the Data

This is the cleaned-up attack data from our prior notebook.

In [None]:
attack_data <- read_csv("../1553_dos_attack1_R_clean.csv")

The next step is to shuffle the order of the data.  This way, our training and test choices are randomized within the dataset, which reduces the risk that the test data contains something the training side never saw.

In [None]:
set.seed(184856)
rand_attack_data <- attack_data[sample(nrow(attack_data)), ]
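
As a small illustration of what the shuffle does, the sketch below applies the same pattern to a made-up five-row data frame (`toy` and its columns are hypothetical and not part of the attack data).  Calling `sample()` on a row count returns a random permutation of the row indices, so every row is kept exactly once and only the order changes.

```r
# Hypothetical toy data, used only to illustrate the shuffle pattern.
toy <- data.frame(id = 1:5, value = c("a", "b", "c", "d", "e"))

set.seed(42)  # arbitrary seed so the shuffle is repeatable
shuffled <- toy[sample(nrow(toy)), ]

# Same rows, possibly in a different order.
nrow(shuffled) == nrow(toy)
setequal(shuffled$id, toy$id)
```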

## Impute Missing Data

Next, we'll impute missing data.  Our dataset has gaps, and those can throw off our logistic regression.  R's `glm()` function tolerates missing values (by default, it drops any row containing an `NA`), but the results won't be as effective as if we fill in the gaps.  First, let's look at the pattern of missing data.

In [None]:
md.pattern(rand_attack_data)

After reviewing the pattern of missing data, we'll perform imputation, using the default of 5 imputed datasets (`m=5`), setting the max number of iterations to 50 (`maxit=50`), and the method to predictive mean matching (`meth='pmm'`).  These are fairly standard settings, and we could tweak them if needed.

In [None]:
imputed_data <- mice(rand_attack_data, m=5, maxit=50, meth='pmm', seed=103409)

Once we've generated the imputed datasets, we can fill in our missing data by retrieving the first completed dataset from MICE.

In [None]:
completed_data <- complete(imputed_data, action=1)
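
For a self-contained look at the same workflow, here is a minimal sketch using `nhanes`, a small example dataset that ships with `mice`.  It stands in for our attack data purely for illustration, and the seed, iteration count, and `toy_` names are arbitrary choices for this sketch.

```r
library(mice)

# nhanes is a 25-row example dataset bundled with mice; it is a stand-in here.
sum(is.na(nhanes))            # number of missing cells before imputation

# Five imputed datasets via predictive mean matching; printFlag = FALSE
# suppresses the per-iteration log.
toy_imputed <- mice(nhanes, m = 5, maxit = 5, method = "pmm",
                    seed = 1, printFlag = FALSE)

# Retrieve the first of the five completed datasets.
toy_completed <- mice::complete(toy_imputed, action = 1)
anyNA(toy_completed)          # FALSE when every gap was filled
```

Writing `mice::complete()` with the namespace prefix avoids any ambiguity with `tidyr::complete()`, which is also attached when `tidyverse` is loaded.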

After doing this, we still have two columns with missing values:  `sa` and `ssa`.  We'll set missing values to 0 for `sa` and 20 for `ssa`.

In [None]:
completed_data$sa[is.na(completed_data$sa)] <- 0
completed_data$ssa[is.na(completed_data$ssa)] <- 20

## Partitioning the Data

The `createDataPartition()` function allows us to split data on some variable.  Typically, this would be a categorical input variable, to increase the likelihood that we get coverage of its potential values in both the training and test data, as a new category on the test side can cause prediction errors.

In this case, I'll split on the label because there is some imbalance in the two classes.

In [None]:
trainIndex <- caret::createDataPartition(completed_data$malicious, p = 0.7, list = FALSE, times = 1)

The data partition gives us back an index.  We can use that index to split our completed attack data into training and test datasets.

In [None]:
train_data <- completed_data[trainIndex,]
test_data <- completed_data[-trainIndex,]

I want to see approximately 70% in the training dataset and approximately 30% in the test dataset.  That's the `p = 0.7` parameter in the prior call.

In [None]:
nrow(train_data)

In [None]:
nrow(test_data)
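
To see what splitting on the label buys us, the sketch below partitions a made-up, imbalanced label (`toy_label` is hypothetical, with an assumed 80/20 class mix) and compares class proportions across the full data, the training rows, and the test rows.  When the outcome is a factor, `createDataPartition()` samples within each class, so the proportions should stay close to the original mix.

```r
library(caret)

# Hypothetical imbalanced label: 80 benign records, 20 malicious.
toy_label <- factor(c(rep("benign", 80), rep("malicious", 20)))

set.seed(1)  # arbitrary seed for a repeatable partition
toy_index <- createDataPartition(toy_label, p = 0.7, list = FALSE, times = 1)
toy_rows <- toy_index[, 1]  # flatten the one-column index matrix to a vector

# Class proportions in the full data, the training rows, and the test rows.
prop.table(table(toy_label))
prop.table(table(toy_label[toy_rows]))
prop.table(table(toy_label[-toy_rows]))
```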

## Training a Model

Training a model is very easy to do with R.  The `glm()` (Generalized Linear Model) function allows us to create (among others) linear, logistic, and Poisson regressions using the same common syntax.  For logistic regression, the family is `binomial`, meaning that our label takes on one of two values.

In [None]:
model <- glm(malicious ~ dw0 + msgTime + rxSts + sa + gap + dsa + ssa + txSts + da + wc, data=train_data, family=binomial)

Now that we have trained the model, we can review the outputs.  Note that, if we did not impute missing values, several of our variables would return `NA` for the coefficient.  Because we imputed missing values, we get weights for each input variable.

In [None]:
model

## Making Predictions

We have a left-over test data set we can use to generate predictions.  This will give us a good idea of how well we did in our logistic regression exercise.  Note that we need to include `type="response"` to get back probabilities scaled between 0 and 1; otherwise, we will get back values on the scale of the linear predictor, which for logistic regression are log-odds.

In [None]:
model_pred <- predict(model, test_data, type="response")

In [None]:
head(model_pred)
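
The effect of `type="response"` can be checked on a small simulated example.  In this sketch, the data frame `toy`, its columns, and the coefficients used to simulate the label are all made up for illustration.  For a binomial `glm`, `type="link"` returns log-odds, and passing those through the inverse logit (`plogis()` in base R) should reproduce the `type="response"` probabilities.

```r
# Hypothetical toy data: one numeric predictor and a binary label.
set.seed(7)
toy <- data.frame(x = rnorm(200))
toy$y <- rbinom(200, size = 1, prob = plogis(0.5 + 1.2 * toy$x))

toy_model <- glm(y ~ x, data = toy, family = binomial)

link_scale <- predict(toy_model, toy, type = "link")      # log-odds
prob_scale <- predict(toy_model, toy, type = "response")  # probabilities

# The inverse logit (plogis) maps the log-odds onto the probability scale.
all.equal(plogis(link_scale), prob_scale)
range(prob_scale)
```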

The responses are a bit tough to read, but they are close to 1 or 0.  What we'll do is convert these into logical `TRUE` and `FALSE` values based on whether the prediction is greater than or equal to 0.5.

In [None]:
pred_malicious <- case_when(model_pred >= 0.5 ~ TRUE, is.na(model_pred) ~ NA, .default=FALSE)
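
For this particular rule, a plain comparison in base R should give the same result, because comparing `NA` to a number yields `NA`.  The sketch below uses a made-up vector of probabilities (`toy_pred` is hypothetical) to show how the threshold and the missing value behave.

```r
# Hypothetical predicted probabilities, including one missing value.
toy_pred <- c(0.93, 0.08, 0.50, NA, 0.49)

# A plain comparison keeps NA as NA and treats exactly 0.5 as TRUE.
toy_flag <- toy_pred >= 0.5
toy_flag
```

The `case_when()` form is more verbose but makes the handling of missing predictions explicit, which is a reasonable trade-off in a teaching notebook.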

The `pred_malicious` vector gives us our predicted values.  We can add this on to our test data so we can see in one go the input data, our prediction of whether that traffic was malicious, and whether the traffic actually was malicious.

In [None]:
outcomes <- cbind(as.data.frame(pred_malicious), test_data)

In [None]:
outcomes

After getting the number of rows, we next want to find the number of correct predictions, which is cases where the value of `pred_malicious` is the same as `malicious`.

In [None]:
num_rows <- nrow(outcomes)
correct_predictions <- sum(outcomes$pred_malicious == outcomes$malicious)

Finally, let's show these results to see how many we got correct, how many predictions there were, and our **accuracy**, which is the number of correct predictions divided by the total number of predictions.

In [None]:
c(correct_predictions, num_rows, correct_predictions / num_rows)
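
As a worked illustration of the accuracy calculation, the sketch below uses eight made-up predictions and labels (both vectors are hypothetical).  Taking the mean of a logical comparison is equivalent to dividing the number of correct predictions by the total, and a cross-tabulation shows where the misses fall.

```r
# Hypothetical predictions and true labels, for illustration only.
toy_predicted <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)
toy_actual    <- c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE)

# Accuracy: the share of predictions that match the true label.
toy_accuracy <- mean(toy_predicted == toy_actual)
toy_accuracy

# Cross-tabulation of predicted versus actual labels.
table(predicted = toy_predicted, actual = toy_actual)
```

Here six of the eight predictions match, so the accuracy is 0.75.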