# Data Preprocessing

#### Table of Contents  
[GDN Data Preprocessing](#example)  
[NAB Data Preprocessing](#example)  
[Save Results](#example)  

In [1]:
import numpy as np
import pandas as pd
import re
from sklearn.preprocessing import MinMaxScaler

## GDN Data Preprocessing

GDN requires the following files:
1. **list.txt**: the feature names, one feature per line
2. **train.csv**: training data that models normal behavior; according to the paper, it contains no anomalies
3. **test.csv**: test data. It must include a column named "attack" holding the ground-truth label for each row (0: normal, 1: attacked)
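
A quick sanity check of these inputs can catch layout problems before training. The sketch below is one possible way to validate them and is not part of GDN itself; `check_gdn_inputs` and its parameters are hypothetical names chosen for illustration.

```python
import pandas as pd


def check_gdn_inputs(features, train, test, label_col="attack"):
    """Return a list of problems found in the input layout (empty if none)."""
    problems = []
    # Every feature named in list.txt should be a column in both tables.
    for name in features:
        if name not in train.columns:
            problems.append(f"train is missing feature {name!r}")
        if name not in test.columns:
            problems.append(f"test is missing feature {name!r}")
    # test.csv needs the ground-truth column, holding 0/1 values only.
    if label_col not in test.columns:
        problems.append(f"test is missing the {label_col!r} column")
    elif not test[label_col].isin([0, 1]).all():
        problems.append(f"{label_col!r} contains values other than 0 and 1")
    return problems
```

An empty list means the three inputs agree with the layout described above.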

In [2]:
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

In [3]:
train_df.head(n=5)

Unnamed: 0,EnvironmentID,Year,Month,Day,Hour,UserCount,SessionCount,Duration,InputOctet,OutputOctet,InputPacket,OutputPacket
0,0,2017,12,10,9,10,12,43,9357.0,6310.0,46,30
1,0,2017,12,10,9,6,7,38,6011.0,5163.0,27,27
2,0,2017,12,10,9,6,6,50,7973.0,7375.0,37,37
3,0,2017,12,10,9,5,5,28,1435.0,1313.296296,7,7
4,0,2017,12,10,10,4,4,42,2222.0,1011.217778,12,12


In [4]:
test_df.head(n=5)

Unnamed: 0,EnvironmentID,Year,Month,Day,Hour,UserCount,SessionCount,Duration,InputOctet,OutputOctet,InputPacket,OutputPacket
0,0,2017,12,10,9,10,13,39,7195.0,4435.166667,34,25
1,0,2017,12,10,9,14,15,104,20991.0,10631.666667,88,74
2,0,2017,12,10,9,8,8,81,15266.0,11901.217391,75,74
3,0,2017,12,10,9,6,6,42,9135.0,6474.139752,43,43
4,0,2017,12,10,10,6,6,43,8297.0,6401.6,42,41


The wifi data contains data for each environment at 15-minute increments throughout the day. To represent it as time series data, we sort by Year, Month, Day, and Hour, and randomize the order of the samples that fall within the same hour.

In [None]:
random_data = np.random.randint(1,100000,size=len(train_df))
train_df['random_numbers'] = random_data


print(train_df)

In [None]:
random_data = np.random.randint(1,100000,size=len(test_df))
test_df['random_numbers'] = random_data

print(test_df)

We use Net2: AP Shutdown/Halt from the wifi dataset as our test data. All of its anomalous data is in environment 3, so let's add the attack column to the test data.

In [None]:
def fill_attack(row):
    if row["EnvironmentID"] == 3:
        return 1
    else:
        return 0

In [None]:
test_df['attack'] = test_df.apply(fill_attack, axis=1)
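
The row-wise `apply` above can also be written as a single vectorized comparison, which is usually faster on large tables. A minimal sketch on a made-up frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for test_df; the values are made up for illustration.
toy = pd.DataFrame({"EnvironmentID": [0, 3, 1, 3], "UserCount": [10, 2, 7, 1]})

# Compare the whole column at once, then cast True/False to 1/0.
toy["attack"] = (toy["EnvironmentID"] == 3).astype(int)

print(toy["attack"].tolist())  # [0, 1, 0, 1]
```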

In [None]:
test_df.loc[test_df['EnvironmentID'] == 3]

In [None]:
train_df = train_df.sort_values(by = ['Year', 'Month', "Day", "Hour", "random_numbers"])
test_df = test_df.sort_values(by = ['Year', 'Month', "Day", "Hour", "random_numbers"])

train_df.head(n=5)
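
The sort-and-shuffle step can be wrapped in one function so that both tables go through the same logic. This is a sketch under two assumptions that are not in the notebook: a fixed `seed` for repeatable runs, and a float key instead of random integers so that ties within an hour are unlikely. `shuffle_within_hour` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def shuffle_within_hour(df, seed=0):
    """Order rows by time, with a random order among rows sharing the same hour."""
    rng = np.random.default_rng(seed)  # seed is an assumption, for repeatable runs
    # Temporary random key used only as the last sort column.
    keyed = df.assign(_key=rng.random(len(df)))
    ordered = keyed.sort_values(by=["Year", "Month", "Day", "Hour", "_key"])
    return ordered.drop(columns="_key")
```

With this helper, `train_df = shuffle_within_hour(train_df)` would replace the separate random-column and sort cells.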

In [27]:
train_df = train_df.drop(columns=['Year', 'Month', "Day", "Hour", "EnvironmentID", "random_numbers"])
test_df = test_df.drop(columns=['Year', 'Month', "Day", "Hour", "EnvironmentID", "random_numbers"])

In [28]:
train_df.head()

Unnamed: 0,UserCount,SessionCount,Duration,InputOctet,OutputOctet,InputPacket,OutputPacket
601,9,9,47,5167.0,2177.133333,20,12
3,5,5,28,1435.0,1313.296296,7,7
240,12,19,60,0.0,0.0,0,0
0,10,12,43,9357.0,6310.0,46,30
123,26,27,298,61246.0,88346.939782,291,347


These are the column names to be used in list.txt

In [None]:
train_df.columns
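
Since list.txt holds one feature name per line, the column names can be written out directly. A minimal sketch; `write_feature_list` is a hypothetical helper, and skipping the "attack" label column is an assumption based on the file description above.

```python
from pathlib import Path


def write_feature_list(columns, path="list.txt", label_col="attack"):
    """Write one feature name per line, skipping the label column."""
    names = [str(c).strip() for c in columns if str(c).strip() != label_col]
    Path(path).write_text("\n".join(names) + "\n")
    return names
```

Calling `write_feature_list(train_df.columns)` would then produce the file from the preprocessed training frame.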

In [None]:
train_df.to_csv("train_preprocessed.csv")
test_df.to_csv("test_preprocessed.csv")

Train and test data must be normalized for better results. The scaler is fit on the training data only and then applied to both sets.

In [None]:
# Min-max scaling to the [0, 1] range
def norm(train, test):

    # Fit the scaler on the training data only, then apply it to both sets
    normalizer = MinMaxScaler(feature_range=(0, 1)).fit(train)
    train_ret = normalizer.transform(train)
    test_ret = normalizer.transform(test)

    return train_ret, test_ret
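
For reference, min-max scaling maps each column with x' = (x - min) / (max - min), where min and max come from the training data. The NumPy sketch below reproduces that arithmetic on made-up numbers to show that test values outside the training range land outside [0, 1]. It is an illustration only, not a replacement for `MinMaxScaler`.

```python
import numpy as np

# Made-up values: one feature column whose training range is 10..50.
train = np.array([[10.0], [30.0], [50.0]])
test = np.array([[20.0], [60.0]])

# Column-wise statistics taken from the training data only.
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min

train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range

print(train_scaled.ravel().tolist())  # [0.0, 0.5, 1.0]
print(test_scaled.ravel().tolist())   # [0.25, 1.25]
```

A training column with a constant value would make the range zero; this sketch does not guard against that case.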

In [None]:
test = pd.read_csv('data/wifi/test.csv', index_col=0)
train = pd.read_csv('data/wifi/train.csv', index_col=0)


test = test.iloc[:, 1:]
train = train.iloc[:, 1:]

train = train.fillna(train.mean())
test = test.fillna(test.mean())
train = train.fillna(0)
test = test.fillna(0)

# trim column names
train = train.rename(columns=lambda x: x.strip())
test = test.rename(columns=lambda x: x.strip())

# keep the trimmed column names for rebuilding the DataFrames later
train_columns = train.columns
test_columns = test.columns

print(len(test.columns),test.columns)
print(len(train.columns),train.columns)


# train_labels = train.attack
test_labels = test.attack

# train = train.drop(columns=['attack'])
test = test.drop(columns=['attack'])


# scale the features; x_train and x_test are used in the next cell
x_train, x_test = norm(train.values, test.values)

In [None]:
train_df = pd.DataFrame(x_train, columns = train_columns)
test_df = pd.DataFrame(x_test, columns = test_columns[:-1])
train_df.head()

In [None]:
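# assign by position: test_labels keeps the index read from the CSV,
# while test_df has a fresh 0..n-1 index
test_df['attack'] = test_labels.values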
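
Column assignment in pandas aligns a Series on index labels rather than on row position. That matters here because `test_labels` keeps the index read from the CSV, while the rebuilt frame has a fresh 0..n-1 index. A small sketch with made-up values (`labels` and `frame` are illustrative names):

```python
import pandas as pd

# Labels carrying a non-default index, as after sorting and reloading with index_col=0.
labels = pd.Series([1, 0, 0], index=[2, 0, 1])

# A rebuilt frame with the default 0..n-1 index.
frame = pd.DataFrame({"UserCount": [5, 6, 7]})

frame["by_label"] = labels             # aligned on index labels
frame["by_position"] = labels.values   # assigned in row order

print(frame["by_label"].tolist())     # [0, 0, 1]
print(frame["by_position"].tolist())  # [1, 0, 0]
```

Because the scaled rows keep the order in which they were read, the positional form is the one that keeps each label with its row.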

In [None]:
train_df.to_csv('data/wifi/train.csv')
test_df.to_csv('data/wifi/test.csv')