# Clean Up Sample Attack Data

This notebook cleans up a sample attack data file and prepares it for logistic regression.

We start by loading the `tidyverse` package, which provides `read_csv()`, `select()`, `write_csv()`, and the `%>%` pipe used below.

In [None]:
library(tidyverse)

In [None]:
sample <- read_csv("../data/1553_dos_attack1.csv")

## Data Operations

The first thing we will do is replace any occurrence of the string "N/A" with R's `NA`.  This will help us understand when data is not available for a given feature.

In [None]:
sample[sample == 'N/A'] <- NA
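To make the mechanics of that one-liner concrete, here is a minimal sketch on a toy data frame. The column names are borrowed from our data, but the values are invented, and a plain base R `data.frame` stands in for the tibble so the cell runs without any packages.

```r
# Toy stand-in for the real file; the values are made up for illustration.
toy <- data.frame(
  gap = c("12", "N/A", "7"),
  modeCode = c("N/A", "N/A", "2"),
  stringsAsFactors = FALSE
)

# `toy == "N/A"` is a logical matrix with one cell per value.
# Indexing with it assigns NA to every matching cell at once.
toy[toy == "N/A"] <- NA

# Count the missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)
```

An alternative worth knowing is the `na` argument of `read_csv()`, which can treat "N/A" as missing at load time. We keep the explicit replacement here so the step stays visible.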

In the prior notebook, we learned that only certain features seem to influence our `malicious` label.  Therefore, for the sake of simplicity, we will only include those features.

Note that we also remove a few features like `timestamp`, which do correlate with `malicious` but have zero predictive value.

In [None]:
sample_clean <- sample %>%
  select(malicious, dw0, msgTime, rxSts, sa, gap, dsa, connType, ssa, txSts, da, wc, modeCode)

The data types that `read_csv()` gave us are a little strange, but we can use `type.convert()` to let R guess the best type for each column.  This works much better now that we've replaced the "N/A" strings with `NA`, because R no longer has to take those values into consideration when inferring types.

In [None]:
sample_clean <- type.convert(sample_clean, as.is = TRUE)
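To see why the order of these two steps matters, here is a small sketch on a toy data frame. The values are invented and a base R `data.frame` stands in for our tibble. It runs `type.convert()` once with the "N/A" strings still present and once after replacing them.

```r
# Toy data; `wc` is numeric apart from an "N/A" placeholder.
toy <- data.frame(
  wc = c("4", "N/A", "12"),
  connType = c("a", "b", "N/A"),
  stringsAsFactors = FALSE
)

# With "N/A" still present, `wc` cannot be parsed as a number
# and therefore stays a character column.
before <- type.convert(toy, as.is = TRUE)

# After the replacement, the remaining values in `wc` parse as integers.
toy[toy == "N/A"] <- NA
after <- type.convert(toy, as.is = TRUE)

print(sapply(before, class))
print(sapply(after, class))
```

Here `as.is = TRUE` keeps text columns such as `connType` as character vectors instead of converting them to factors.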

Let's take a quick look at our resulting dataframe to see if everything looks okay.

In [None]:
head(sample_clean)
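Beyond eyeballing the first rows, a per-column summary of types and missing values can catch problems that `head()` hides. The sketch below is one way to do that, not a required step. `toy_clean` is a hypothetical stand-in for `sample_clean`, with invented values.

```r
# Hypothetical stand-in for sample_clean; the values are invented.
toy_clean <- data.frame(
  malicious = c(0L, 1L, 0L),
  gap = c(12.5, NA, 7.0),
  connType = c("a", "b", NA),
  stringsAsFactors = FALSE
)

# First class of each column, as a named character vector.
col_types <- vapply(toy_clean, function(col) class(col)[1], character(1))

# Share of missing values in each column.
na_share <- colMeans(is.na(toy_clean))

print(data.frame(type = col_types, na_share = na_share))
```

Running the same two summaries on `sample_clean` would show at a glance whether any column is still character when we expected a number, and how much data each feature is missing.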

## Write Out Results

Now that we're satisfied with the results, we can write them out and pick them back up in the next notebook.

In [None]:
write_csv(sample_clean, "../1553_dos_attack1_R_clean.csv")