# Model a Multinomial Logistic Regression in R

This notebook performs multinomial logistic regression on our sample data.  We have a decent amount of data, but it is imbalanced, and we'll have to watch out for that:  two of our classes are under-represented in the dataset.

In prior analyses, we used `tidyverse`, `caret`, and `mice`.  Now we'll add one more library, `nnet`, which trains neural networks in R and also provides the `multinom()` function we'll use for multinomial logistic regression.

In [None]:
library(tidyverse)
library(caret)
library(mice)
library(nnet)

## Data Preparation

The extended attack data already has the features we need, roughly in the format we need them.

In [None]:
attack_data <- read_csv("../data/ExtendedAttackData.csv")

In [None]:
head(attack_data)

The `malicious` column is not necessary here, as we can instead use the `AttackType` category to discern exactly which attack type corresponds to each databus transmission.

In [None]:
set.seed(106842)
attack_data$malicious <- NULL
attack_data <- type.convert(attack_data, as.is = TRUE)

Just as before, we can perform a cleanup of the data.  This time around, `mice()` doesn't quite capture `sa` or the new `modeCodeVal` column, so we'll set those to 0 if they are missing.

In [None]:
rand_attack_data <- attack_data[sample(nrow(attack_data)), ]
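# Spell out `method` in full rather than relying on partial argument matching.
imputed_data <- mice(rand_attack_data, m = 5, maxit = 50, method = "pmm", seed = 88109)
# Namespace the call: tidyr (loaded by tidyverse) also exports complete().
completed_data <- mice::complete(imputed_data, action = 1)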
completed_data$sa[is.na(completed_data$sa)] <- 0
completed_data$modeCodeVal[is.na(completed_data$modeCodeVal)] <- 0
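
The zero-fill step above can be sketched on a toy data frame.  The values below are made up for illustration; only the column names `sa` and `modeCodeVal` come from this notebook.  Counting missing values before and after is a cheap way to confirm the fill did what we expect.

```r
# Toy stand-in for the imputed data; the values are hypothetical.
toy <- data.frame(sa = c(1, NA, 3), modeCodeVal = c(NA, 2, NA))

# Count missing values per column before the fill.
missing_before <- colSums(is.na(toy))

# Replace any remaining NA in each listed column with 0.
for (col in c("sa", "modeCodeVal")) {
  toy[[col]][is.na(toy[[col]])] <- 0
}

# Count again: every column should now report zero missing values.
missing_after <- colSums(is.na(toy))
print(missing_before)
print(missing_after)
```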

We want to turn `AttackType` into a categorical variable, which in R is called a factor.  Then, we want to relevel the factor to specify "None" as the default for multinomial logistic regression.

In [None]:
completed_data$AttackType <- relevel(as.factor(completed_data$AttackType), ref = "None")
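
As a small illustration of what releveling does, here is a sketch on a toy vector.  The "DoS" label is a hypothetical stand-in for the attack types in our data.  By default, R orders factor levels alphabetically; `relevel()` moves the chosen reference level to the front, and the first level is the one `multinom()` treats as the baseline.

```r
# Hypothetical labels; only "None" is taken from the notebook.
labels <- factor(c("DoS", "None", "DoS", "None"))

# Alphabetical order puts "DoS" first.
levels_before <- levels(labels)

# Move "None" to the front so that it becomes the baseline class.
releveled <- relevel(labels, ref = "None")
levels_after <- levels(releveled)

print(levels_before)
print(levels_after)
```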

After releveling our label, we'll split the data into training and test subsets.

In [None]:
trainIndex <- caret::createDataPartition(completed_data$AttackType, p = 0.7, list  = FALSE, times = 1)
train_data <- completed_data[trainIndex,]
test_data <- completed_data[-trainIndex,]
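
Because `createDataPartition()` samples within each class, the class proportions in the training set should stay close to those of the full dataset, which matters when some classes are rare.  The following is a base-R sketch of that idea on made-up labels.  It is not caret's implementation, and the class names and counts are hypothetical.

```r
set.seed(42)

# Hypothetical imbalanced labels: 70 / 20 / 10 rows per class.
labels <- factor(rep(c("None", "DoS", "Rare"), times = c(70, 20, 10)))

# Sample 70% of the row indices separately within each class.
train_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

# Compare class proportions in the full data and in the training subset.
full_props <- prop.table(table(labels))
train_props <- prop.table(table(labels[train_idx]))
print(full_props)
print(train_props)
```

Running the same comparison on `completed_data$AttackType` and `train_data$AttackType` is a quick check that the under-represented classes made it into both subsets.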

## Modeling

The `multinom()` function performs the multinomial logistic regression itself.  Note that we don't need to do anything special:  the softmax function, the cross-entropy loss function, and the fitting of the weights all happen inside `multinom()`.

In [None]:
model <- multinom(AttackType ~ dw0 + msgTime + rxSts + sa + gap + dsa + ssa + txSts + da + wc + modeCodeVal, data=train_data)
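
To make the softmax and cross-entropy steps less of a black box, here is a sketch that reproduces a fitted model's probabilities by hand.  It uses the built-in `iris` data rather than our attack data, so the formula and variable names are assumptions for illustration only.

```r
library(nnet)

# Fit a small multinomial model on the built-in iris data (illustrative only).
fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris, trace = FALSE)

# Design matrix with the same columns the model used: intercept plus predictors.
X <- model.matrix(~ Sepal.Length + Sepal.Width, data = iris)

# Linear scores: the baseline class (first factor level) is fixed at 0,
# and each remaining class gets one row of coefficients.
scores <- cbind(0, X %*% t(coef(fit)))

# Numerically stable softmax: subtract each row's maximum before exponentiating.
shifted <- scores - apply(scores, 1, max)
manual_probs <- exp(shifted) / rowSums(exp(shifted))

# Cross-entropy loss: the mean negative log-probability of the true class.
true_class <- cbind(seq_len(nrow(iris)), as.integer(iris$Species))
cross_entropy <- -mean(log(manual_probs[true_class]))

# The hand-computed probabilities should match what the model reports.
max_diff <- max(abs(manual_probs - fitted(fit)))
print(max_diff)
print(cross_entropy)
```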


Now that we have a model, we can see the coefficients and standard errors for each class.  The "None" class is the baseline, so each coefficient is the change in the log-odds of that class relative to "None" for a one-unit increase in the variable.  These weights don't necessarily make a lot of sense to us as-is, but we can convert them into more human-useful results in a bit.

In [None]:
summary(model)

We can exponentiate the coefficients to get the relative risk ratio of each variable for each class:  the factor by which the odds of landing on that class rather than "None" are multiplied when the variable increases by one unit.

In [None]:
exp(coef(model))
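
One way to convince ourselves of this interpretation is to compare two predictions that differ by a single unit in one variable.  The sketch below does this on the built-in `iris` data, with "versicolor" standing in for our "None" baseline; the data, formula, and the two example rows are assumptions for illustration.

```r
library(nnet)

# Illustrative data: iris with "versicolor" chosen as the baseline class.
flowers <- iris
flowers$Species <- relevel(flowers$Species, ref = "versicolor")
fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = flowers, trace = FALSE)

# Two hypothetical flowers that differ only by one unit of Sepal.Length.
newdata <- data.frame(Sepal.Length = c(6, 7), Sepal.Width = c(3, 3))
probs <- predict(fit, newdata, type = "probs")

# Odds of "virginica" relative to the baseline, before and after the increase.
odds <- probs[, "virginica"] / probs[, "versicolor"]
observed_ratio <- odds[[2]] / odds[[1]]

# The same number read directly from the exponentiated coefficient.
expected_ratio <- exp(coef(fit))["virginica", "Sepal.Length"]

print(c(observed = observed_ratio, expected = expected_ratio))
```

The two values agree because the model is linear in the log-odds, so a one-unit change multiplies the odds by the same factor no matter where we start.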

The easiest way to see what the results look like is to use the `fitted()` function on our model, which returns the predicted class probabilities for the training data.  Here, we can see that we were able to differentiate between these five classes rather easily.  Even after adding in the new datasets, it turns out that there's enough variation in the dataset to nearly guarantee a single result.

In [None]:
head(fitted(model))

## Evaluation against Test Data

Nonetheless, we still want to test against unseen data.  Just because we did extremely well on the training data doesn't mean we'll nail the test dataset.  There are two prediction types we can request:  "probs," which returns the probability of each class; and "class," which simply gives us the most likely class.  Let's perform each in turn.

In [None]:
model_pred <- predict(model, test_data, type="probs")
model_class <- predict(model, test_data, type="class")

The probabilities come back as a matrix with one row per observation and one column per class, and each row sums to 1.  If we didn't have such an easy task of differentiating results, we would likely see meaningful probabilities for multiple candidate classes in each row.

In [None]:
head(model_pred)

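The class output returns the most likely class for each row.  Note that the Levels line lists every class the model knows about, whether or not it was actually predicted, so a count such as `table(model_class)` is the reliable way to confirm that each class appears among the predictions, something we have to be concerned about when dealing with imbalanced data.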

In [None]:
head(model_class)
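
The relationship between the two prediction types can be checked directly:  the predicted class should be the column with the largest probability in each row.  This sketch again uses the built-in `iris` data as a stand-in, so the model and formula are assumptions for illustration.

```r
library(nnet)

fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris, trace = FALSE)
probs <- predict(fit, iris, type = "probs")
classes <- predict(fit, iris, type = "class")

# Each row of the probability matrix should sum to 1.
row_totals <- rowSums(probs)

# Pick the highest-probability column per row and compare with type = "class".
top_class <- colnames(probs)[max.col(probs, ties.method = "first")]
agreement <- mean(top_class == as.character(classes))
print(agreement)

# Count predictions per class; a class that is never predicted shows up as 0
# here even though it still appears in levels(classes).
print(table(classes))
```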

Let's now combine the predicted class with the actual attack type in our test dataset.

In [None]:
outcomes <- cbind(as.data.frame(model_class), test_data)
head(outcomes, 15)

Once we've done that, we can use the confusion matrix to see how we did.  We see that our `multinom()` model does a terrible job of separating regular denial of service attacks from broadcast denial of service attacks.  Because of this, we get every one of the classic DoS predictions wrong.  It does a great job of getting everything else correct, however.

In [None]:
caret::confusionMatrix(as.factor(outcomes$model_class), as.factor(outcomes$AttackType))
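
To see how a single confused class shows up in these numbers, here is a base-R sketch on made-up predictions.  The labels and counts are hypothetical and are not our actual results.  Rows are predictions and columns are actual classes, matching the layout `confusionMatrix()` prints.

```r
# Hypothetical class labels and predictions, for illustration only.
lv <- c("None", "DoS", "BroadcastDoS")
actual <- factor(c("None", "None", "None", "DoS", "DoS",
                   "BroadcastDoS", "BroadcastDoS", "BroadcastDoS"), levels = lv)
predicted <- factor(c("None", "None", "None", "BroadcastDoS", "BroadcastDoS",
                      "BroadcastDoS", "BroadcastDoS", "BroadcastDoS"), levels = lv)

# Confusion matrix: rows are predictions, columns are the true classes.
cm <- table(Predicted = predicted, Actual = actual)

# Overall accuracy: the share of observations on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)

# Per-class recall: correct predictions divided by the true count of each class.
recall <- diag(cm) / colSums(cm)

print(cm)
print(accuracy)
print(recall)
```

In this toy case the overall accuracy still looks respectable while recall for one class is zero, which is why the per-class statistics are worth reading alongside the headline accuracy when classes are imbalanced.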