# Analyze Sample Attack Data

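This notebook analyzes a sample attack data file to help us understand its shape: how many rows it has, which values each column takes, and which columns relate to our label.

We will load the `tidyverse` package, which allows us to perform data operations with ease. The `rpart` package implements Recursive Partitioning and Regression Trees, which we will use to calculate a rough measure of correlation. We also call a few `data.table` functions through the `data.table::` prefix, so that package must be installed even though we never attach it.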

In [None]:
library(tidyverse)
library(rpart)

## Basic Analysis

The dataset we will read is `1553_dos_attack1.csv`.

In [None]:
sample <- read_csv("../data/1553_dos_attack1.csv")

`nrow()` tells us how many rows there are in a dataframe.

In [None]:
nrow(sample)

Review the top few rows in the dataframe.

In [None]:
head(sample)

What is the set of unique values for one of these columns?  I'll choose `dw31` as an example.

In [None]:
unique(sample$dw31)

The `rapply()` function recursively applies a function to every element of a list. A dataframe is a list of columns, so here it executes the function once per **column**. I want to see the cardinality of each feature.

In [None]:
rapply(sample, function(x) { length(unique(x)) })
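
As a small illustration of the same idea, here is a sketch that uses `sapply()` on a toy dataframe. The toy values are made up for the example and are not taken from the attack file; only the column names echo the ones used in this notebook.

```r
# Toy dataframe; the values are invented for illustration only.
toy <- data.frame(
  connType  = c("A", "B", "A", "A"),
  dw0       = c("0x01", "0x01", "0x02", "0x03"),
  malicious = c(FALSE, FALSE, TRUE, TRUE)
)

# sapply() visits each column and returns a named integer vector
# holding the number of distinct values in that column.
cardinality <- sapply(toy, function(x) length(unique(x)))
cardinality
```

Columns with a cardinality of 1 carry no information about the label, and columns whose cardinality is close to the row count behave more like identifiers than features.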

Let's drill into the possible values and how many times each shows up.  We'll do that for `connType` (as an example) and `malicious` (our label).

In [None]:
data.table::setDT(sample)[, .N, keyby=connType]

In [None]:
data.table::setDT(sample)[, .N, keyby=malicious]
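
Since `tidyverse` is already loaded, the same frequency table can also be produced with `dplyr::count()`. This is an alternative sketch on a made-up toy tibble, not the approach this notebook uses.

```r
library(dplyr)

# Toy data; the values are invented for illustration only.
toy <- tibble(
  connType  = c("A", "B", "A", "A"),
  malicious = c(FALSE, FALSE, TRUE, TRUE)
)

# count() groups by the given column and adds the group size as `n`,
# which is the same information as data.table's `.N` with `keyby`.
counts <- toy %>% count(connType)
counts
```

One practical difference is that `data.table::setDT()` converts the dataframe to a data.table by reference, while `count()` leaves its input unchanged.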

Let's take a look at some of the rows which are marked as malicious.

In [None]:
sample %>%
    filter(malicious == TRUE) %>%
    head(10)

## "Correlation" Analysis

Many columns have the string "N/A" instead of an R-friendly `NA`.  This code will fix that.  We'll do this again when we perform the actual data cleanup, but for now, it makes the next operations more effective.

In [None]:
sample[sample == 'N/A'] <- NA
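
Another option, sketched here rather than used in this notebook, is to declare the "N/A" marker when the file is read, through the `na` argument of `read_csv()`. The temporary file and its contents below are made up so that the example is self-contained.

```r
library(readr)

# Write a tiny CSV to a temporary file; the contents are invented.
path <- tempfile(fileext = ".csv")
writeLines(c("sa,dw0,malicious",
             "1,N/A,FALSE",
             "2,0x10,TRUE"), path)

# Treat "N/A" (plus the readr defaults "" and "NA") as missing.
toy <- read_csv(path, na = c("", "NA", "N/A"), show_col_types = FALSE)
is.na(toy$dw0)
```

Handling the marker at read time also lets `read_csv()` guess column types without the "N/A" strings forcing an otherwise numeric column to character.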

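This block of code gives us a rough idea of how various features 'correlate' with our label. I put 'correlation' in quotation marks because, technically, a correlation coefficient requires numeric features and most of these are strings. Instead, we fit a small decision tree that predicts one column from another and measure how much it reduces the error of a naive baseline.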

In [None]:
# https://rviews.rstudio.com/2021/04/15/an-alternative-to-the-correlation-coefficient-that-works-for-numeric-and-categorical-variables/
calc_mae_reduction <- function(y_hat, y_actual) {
  model_error <- mean(abs(y_hat - y_actual))
  baseline <- mean(y_actual, na.rm = TRUE)
  baseline_error <-  mean(abs(baseline - y_actual))
  result <- 1 - model_error/baseline_error
  # cat("MAE - baseline:", baseline_error, "\n")
  # cat("MAE - model:", model_error, "\n")
  # cat("MAE - before cleaning up:", result, "\n")
  result <- max(0.0, min(result, 1.0))
  round(100*result, 2)
}

calc_misclass_reduction <- function(y_hat, y_actual) {
  tab <- table(y_hat, y_actual)
  model_error <- 1 - sum(diag(tab))/sum(tab)
  majority_class <- names(which.max(table(y_actual)))
  baseline.preds <- rep(majority_class, length(y_actual))
  baseline_error <- mean(baseline.preds != y_actual)
  result <- 1 - model_error/baseline_error
  # cat("MISCLASS - baseline:", baseline_error, "\n")
  # cat("MISCLASS - model:", model_error, "\n")
  # cat("MISCLASS - before cleaning up:", result, "\n")
  result <- max(0.0, min(result, 1.0))
  round(100*result, 2)
}

x2y_inner <- function(x, y) {
  
  if (length(unique(x)) == 1 ||
      length(unique(y)) == 1) {
    return(NA)
  } 
  # if y is continuous
  if (is.numeric(y)) {
    preds <- predict(rpart(y ~ x, method = "anova"), type = 'vector')
    calc_mae_reduction(preds, y)
  }
  # if y is categorical
  else {
    preds <- predict(rpart(y ~ x, method = "class"), type = 'class')
    calc_misclass_reduction(preds, y)
  }
}


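simple_boot <- function(x, y) {
  # The call resolves to base::sample() even though this notebook also holds a
  # dataframe named `sample`; the prefix just makes that explicit.
  ids <- base::sample(length(x), replace = TRUE)
  x2y_inner(x[ids], y[ids])
}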

x2y <- function(x, y, confidence = FALSE) {
  results <- list()
  
  missing <-  is.na(x) | is.na(y)
  results$perc_of_obs <- round(100 * (1 - sum(missing) / length(x)), 2)
  
  x <- x[!missing]
  y <- y[!missing]
  
  results$x2y <- x2y_inner(x, y)
  
  if (confidence) {
    results$CI_95_Lower = NA
    results$CI_95_Upper = NA
    if (!is.na(results$x2y) && results$x2y > 0) {
      draws <- replicate(1000, simple_boot(x, y))
      errors <- draws - results$x2y
      results$CI_95_Lower <- results$x2y - round(quantile(errors,
                                                          probs = 0.975,
                                                          na.rm = TRUE), 2)
      results$CI_95_Upper <- results$x2y - round(quantile(errors,
                                                          probs = 0.025,
                                                          na.rm = TRUE), 2)
    }
  }
  results
}

dx2y <- function(d,
                 target = NA,
                 confidence = FALSE) {
  if (is.na(target)) {
    pairs <- combn(ncol(d), 2)
    pairs <- cbind(pairs, pairs[2:1, ])
  }
  else {
    n <- 1:ncol(d)
    idx <- which(target == names(d))
    n <- n[n != idx]
    pairs <- cbind(rbind(n, idx), rbind(idx, n))
  }
  
  n <- dim(pairs)[2]
  
  results <- data.frame(x = names(d)[pairs[1,]],
                        y = names(d)[pairs[2,]],
                        perc_of_obs = rep(0.00, n),
                        x2y = rep(0.00, n),
                        CI_95_Lower = rep(NA, n),
                        CI_95_Upper = rep(NA, n))
  
  for (i in 1:n) {
    x <- d %>% pull(pairs[1, i])
    y <- d %>% pull(pairs[2, i])
    if (confidence) {
      results[i, 3:6] <- x2y(x, y, confidence = TRUE)
    }
    else {
      results[i, 3:4] <- x2y(x, y)
    }
  }
  
  if (!confidence) {
    results$CI_95_Lower <- NULL
    results$CI_95_Upper <- NULL
  }
  
  results <- results %>% arrange(desc(x2y), desc(perc_of_obs))
  
  results
}
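
To make the categorical branch of the metric concrete, here is a minimal sketch on made-up data. It repeats the misclassification-reduction arithmetic by hand for two toy features: `x_good`, which determines the label exactly, and `x_noise`, which is independent of it by construction. The names `toy`, `x_good`, `x_noise`, and `reduction_for()` are hypothetical and exist only for this example.

```r
library(rpart)

# 100 toy rows with a balanced label; the values are invented.
toy <- data.frame(
  y       = factor(rep(c("attack", "benign"), each = 50)),
  x_good  = factor(rep(c("a", "b"), each = 50)),   # mirrors y exactly
  x_noise = factor(rep(c("u", "v"), times = 50))   # split 25/25 inside each class
)

reduction_for <- function(x, y) {
  # Predict y from x alone with a single-feature classification tree.
  preds <- predict(rpart(y ~ x, method = "class"), type = "class")
  model_error <- mean(as.character(preds) != as.character(y))
  # Baseline: always predict the most frequent class.
  majority_class <- names(which.max(table(y)))
  baseline_error <- mean(majority_class != as.character(y))
  # Percentage reduction in error, clipped to the 0-100 range.
  100 * max(0, min(1 - model_error / baseline_error, 1))
}

good_score  <- reduction_for(toy$x_good, toy$y)
noise_score <- reduction_for(toy$x_noise, toy$y)
c(good = good_score, noise = noise_score)
```

The perfectly informative feature removes all of the baseline error and scores 100, while the independent feature cannot beat the majority-class guess and scores 0. Real features land somewhere in between.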

In [None]:
dx2y(sample, target = "malicious", confidence = FALSE) %>%
    filter(y == 'malicious') %>%
    filter(x2y > 0)
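
We left `confidence = FALSE` above because the bootstrap refits the tree 1,000 times per column pair. When it is turned on, `x2y()` subtracts the quantiles of the bootstrap errors from the point estimate. The following sketch shows that interval arithmetic on a much cheaper statistic, the mean of some simulated numbers; the data and variable names are assumptions made for this example.

```r
set.seed(1)

# Simulated data; in x2y() the statistic is the error reduction instead.
x <- rnorm(200, mean = 5)
estimate <- mean(x)

# Resample with replacement and recompute the statistic each time.
draws <- replicate(1000, mean(base::sample(x, replace = TRUE)))

# Bootstrap errors relative to the original estimate.
errors <- draws - estimate

# Subtract the upper error quantile for the lower bound and the lower
# error quantile for the upper bound, as the x2y() code does.
ci_lower <- estimate - quantile(errors, probs = 0.975, names = FALSE)
ci_upper <- estimate - quantile(errors, probs = 0.025, names = FALSE)
c(lower = ci_lower, estimate = estimate, upper = ci_upper)
```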

## Quick Analysis

The output above lists the columns that appear to drive our label. Let's cross-tabulate a few of them against `malicious` and see if we can learn something from them.

In [None]:
data.table::setDT(sample)[, .N, keyby=c("malicious", "sa")]

In [None]:
data.table::setDT(sample)[, .N, keyby=c("malicious", "dw0")]

In [None]:
data.table::setDT(sample)[, .N, keyby=c("malicious", "msgTime")]

In [None]:
data.table::setDT(sample)[, .N, keyby=c("malicious", "gap")]
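
For high-cardinality columns, the raw counts can be hard to scan. One possible follow-up, sketched here on made-up data, is to compute the share of malicious rows per value and sort by it. The toy values and the `malicious_rate` column name are assumptions for this example.

```r
library(dplyr)

# Toy data; the values are invented for illustration only.
toy <- tibble(
  sa        = c(1, 1, 2, 2, 2, 3),
  malicious = c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

# For each value of `sa`, count the rows and take the mean of the logical
# label, which is the fraction of rows marked malicious.
rates <- toy %>%
  group_by(sa) %>%
  summarise(n = n(), malicious_rate = mean(malicious), .groups = "drop") %>%
  arrange(desc(malicious_rate))
rates
```

Values with a high rate and a reasonable row count are the ones worth a closer look; a high rate on one or two rows may be noise.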