# Clean Up Sample Attack Data

This notebook cleans up a sample attack data file and prepares it for logistic regression.

We will use the `pandas` and `NumPy` libraries, two useful Python libraries for data science.

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv("../data/1553_dos_attack1.csv")

## Data Operations

Let's take a quick look at the data. There are many instances of `NaN` in our dataset.

In [None]:
df

We'll replace NumPy's `NaN` with Python's `None` to make our next data cleanup easier. This substitution only matters within this notebook: when we re-load the data in the next notebook, every missing value will be read in as `NaN` again.

In [None]:
df = df.replace({np.nan:None})
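As a minimal sketch of why this substitution is safe for the later cleanup, the toy frame below (hypothetical values, not taken from the attack data) shows that pandas counts both `NaN` and `None` as missing. The exact dtype that `replace` produces may differ between pandas versions, so the sketch only compares missing-value counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({"dw0": ["0x0010", np.nan], "wc": [1.0, np.nan]})

# Same replacement as in the cell above.
toy_none = toy.replace({np.nan: None})

# isnull() treats NaN and None alike, so the count of missing cells is unchanged.
missing_before = int(toy.isnull().sum().sum())
missing_after = int(toy_none.isnull().sum().sum())
print(missing_before, missing_after)
```

Because `pd.isnull` recognizes both markers, any later check written against it behaves the same before and after the replacement.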

In the prior notebook, we learned that only certain features seem to influence our `malicious` label.  Therefore, for the sake of simplicity, we will only include those features.

Note that we also remove a few features like `timestamp`, which do correlate with `malicious` but have zero predictive value.

In [None]:
# Take an explicit copy so later column assignments do not trigger SettingWithCopyWarning.
df_selective = df[['malicious', 'dw0', 'msgTime', 'rxSts', 'sa', 'gap', 'dsa', 'connType', 'ssa', 'txSts', 'da', 'wc', 'modeCode']].copy()

In [None]:
df_selective.head(5)

The R notebook auto-converted our hex-encoded data elements to integers, but Python kept them as strings. Let's convert `0x0010` and the like to their appropriate integer values.

In [None]:
# Parse hex strings such as "0x0010" as base-16 integers; missing values stay None.
for col in ['dw0', 'rxSts', 'txSts']:
    df_selective[col] = df_selective[col].transform(lambda d: None if pd.isnull(d) else int(d, 16))
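To make the conversion concrete, here is a small sketch of the same logic applied to a plain list. The helper name `hex_to_int` and the sample values are hypothetical, introduced only for illustration. Python's built-in `int` accepts the `0x` prefix when the base is 16.

```python
import pandas as pd

def hex_to_int(value):
    """Return the integer for a hex string such as '0x0010', or None if missing."""
    return None if pd.isnull(value) else int(value, 16)

# Illustrative inputs, not rows from the attack data.
samples = ["0x0010", "0x00FF", None]
converted = [hex_to_int(s) for s in samples]
print(converted)  # [16, 255, None]
```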

Let's take a quick look to make sure that our results look okay.

In [None]:
df_selective.head()

## Write Out Results

Now that we're satisfied with the results, we can write them out and pick them back up in the next notebook.

In [None]:
df_selective.to_csv("../1553_dos_attack1_Py_clean.csv", index=False)
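The earlier remark that missing values come back as `NaN` on re-load can be checked with a quick round trip. The sketch below uses an in-memory buffer and a hypothetical two-row frame instead of the real output file, so it does not touch anything on disk.

```python
import io
import pandas as pd

# Hypothetical frame with one missing value stored as None.
toy = pd.DataFrame({"dw0": [16, None], "sa": [3, 7]}, dtype=object)

# Write to an in-memory CSV and read it straight back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The None is written as an empty field, which read_csv parses as NaN.
print(reloaded["dw0"].isnull().tolist())
```

With default `read_csv` settings, a numeric column that contains a gap is typically loaded as floating point, so the next notebook should expect values like `16.0` rather than `16` in such columns.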