<a href="https://colab.research.google.com/github/day02/AnomalyDetection/blob/default/AnomalyDetector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Autoencoder

An autoencoder is a neural network with the same number of output neurons as input neurons. Its hidden layers have fewer neurons than the input and output layers, so the autoencoder must learn to encode the input into a smaller representation. The predictors (x) and the targets (y) are exactly the same, which is why we consider autoencoders to be unsupervised. Figure 14.AUTO shows an autoencoder.

**Figure 14.AUTO: Simple Auto Encoder**
![Simple Auto Encoder](https://raw.githubusercontent.com/jeffheaton/t81_558_deep_learning/master/images/class_13_auto_encode.png "Simple Auto Encoder")
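
To make the bottleneck idea concrete, here is a minimal pure-Python sketch. It is not the model used in this notebook: it is a toy linear autoencoder with two inputs, one hidden value, and two outputs, trained by plain gradient descent on made-up points lying on the line x2 = 2 * x1. The function names, learning rate, step count, and data are all illustrative assumptions.

```python
# Toy linear autoencoder: 2 inputs -> 1 hidden value -> 2 outputs.
# Illustrative only; data, names, and hyperparameters are assumptions.

def reconstruct(point, enc, dec):
    """Encode a 2-D point to one number, then decode it back to 2-D."""
    hidden = enc[0] * point[0] + enc[1] * point[1]
    return hidden, (dec[0] * hidden, dec[1] * hidden)

def reconstruction_loss(data, enc, dec):
    """Average squared reconstruction error over all points."""
    total = 0.0
    for point in data:
        _, out = reconstruct(point, enc, dec)
        total += (out[0] - point[0]) ** 2 + (out[1] - point[1]) ** 2
    return total / len(data)

def train_tiny_autoencoder(data, steps=2000, lr=0.01):
    enc = [0.5, 0.5]  # encoder weights (inputs -> hidden value)
    dec = [0.5, 0.5]  # decoder weights (hidden value -> outputs)
    n = len(data)
    for _ in range(steps):
        g_enc = [0.0, 0.0]
        g_dec = [0.0, 0.0]
        for point in data:
            hidden, out = reconstruct(point, enc, dec)
            err = (out[0] - point[0], out[1] - point[1])
            # Gradient of the squared error with respect to the decoder weights
            g_dec[0] += 2 * err[0] * hidden
            g_dec[1] += 2 * err[1] * hidden
            # Error signal passed back through the hidden value to the encoder
            back = 2 * (err[0] * dec[0] + err[1] * dec[1])
            g_enc[0] += back * point[0]
            g_enc[1] += back * point[1]
        enc = [enc[i] - lr * g_enc[i] / n for i in range(2)]
        dec = [dec[i] - lr * g_dec[i] / n for i in range(2)]
    return enc, dec

# The target is the input itself: x and y are the same data.
data = [(x, 2 * x) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
initial_loss = reconstruction_loss(data, [0.5, 0.5], [0.5, 0.5])
enc, dec = train_tiny_autoencoder(data)
final_loss = reconstruction_loss(data, enc, dec)
print(f"loss before training: {initial_loss:.4f}, after: {final_loss:.6f}")
```

Because the toy points lie on a single line, one hidden value is enough to reconstruct them; data with more independent structure than the bottleneck can hold would leave a larger residual error.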

The following program trains a very simple autoencoder on captured device traffic. The fewer hidden neurons the autoencoder has, the harder it is for it to learn to reconstruct its input.

### Read Device Traffic CSV
Load the CSV data file from Google Drive.

In [85]:
import pandas as pd
from google.colab import drive

drive.mount('/content/gdrive')

pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)

# Load the Device Traffic CSV file
df = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/DeviceTraffic.csv')
print("Read {} elements.".format(len(df)))
df

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
Read 5955 elements.


Unnamed: 0,No.,Time,Source,Destination,Protocol,Length,Info
0,1,0.000000,74.125.138.95,10.0.0.169,TLSv1.2,111,Application Data
1,2,0.007024,10.0.0.169,74.125.138.95,TCP,66,58031 > 443 [ACK] Seq=1 Ack=46 Win=406 Len=0...
2,3,0.291968,10.0.0.169,74.125.196.99,TCP,74,32864 > 443 [SYN] Seq=0 Win=14600 Len=0 MSS=...
3,4,0.321301,74.125.196.99,10.0.0.169,TCP,74,"443 > 32864 [SYN, ACK] Seq=0 Ack=1 Win=65535..."
4,5,0.326030,10.0.0.169,74.125.196.99,TCP,66,32864 > 443 [ACK] Seq=1 Ack=1 Win=14656 Len=...
...,...,...,...,...,...,...,...
5950,5951,215.461207,10.0.0.169,74.125.196.104,TCP,66,52002 > 443 [ACK] Seq=1132688 Ack=52560 Win=...
5951,5952,215.461471,10.0.0.169,74.125.196.104,TCP,66,52002 > 443 [ACK] Seq=1132688 Ack=52599 Win=...
5952,5953,215.461663,10.0.0.169,74.125.196.104,TCP,66,52002 > 443 [ACK] Seq=1132688 Ack=52630 Win=...
5953,5954,215.477882,10.0.0.169,74.125.196.104,TLSv1.3,105,Application Data


### Info stream is parsed into categorical representation

Packets whose Info field contains "TCP Out-Of-Order" are labeled as anomalies; the rest of the device traffic is treated as normal.

In [86]:
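# Add a boolean feature that flags "TCP Out-Of-Order" packets.
# Work on a copy so the raw DataFrame stays unchanged.
appended_df = df.copy()
appended_df['Out-Of-Order'] = appended_df['Info'].str.contains("TCP Out-Of-Order", regex=False, na=False)
appended_df[116:118]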

Unnamed: 0,No.,Time,Source,Destination,Protocol,Length,Info,Out-Of-Order
116,117,0.559796,74.125.196.99,10.0.0.169,TCP,86,[TCP Window Update] 443 > 32864 [ACK] Seq=13...,False
117,118,0.561547,10.0.0.169,74.125.196.99,TCP,1484,[TCP Out-Of-Order] 32864 > 443 [ACK] Seq=544...,True
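
The labeling rule above is a plain substring test on the Info text. As a dependency-free sketch of the same idea, the snippet below applies it to a few shortened Info strings modeled on the capture; the helper name `is_out_of_order` and the sample strings are illustrative assumptions, not part of the notebook.

```python
# Shortened, hypothetical Info strings modeled on the capture shown above.
info_values = [
    "[TCP Window Update] 443 > 32864 [ACK]",
    "[TCP Out-Of-Order] 32864 > 443 [ACK]",
    "Application Data",
]

def is_out_of_order(info):
    """Return True when the Info text carries the out-of-order marker."""
    return "TCP Out-Of-Order" in info

labels = [is_out_of_order(text) for text in info_values]
print(labels)
```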


## Pre-processing

Before we can feed the device traffic data into the neural network, we must preprocess it. Two helper functions do the work: the first converts a numeric column into z-scores, and the second replaces a categorical column with dummy variables. A third function applies them column by column: `Time` and `Length` become z-scores, while `Source`, `Destination`, `Protocol`, and `Info` become dummy variables. Once the data is preprocessed, we display the result.


In [87]:
# Encode a numeric column as zscores
def encode_numeric_zscore(df, name, mean=None, sd=None):
    if mean is None:
        mean = df[name].mean()
    if sd is None:
        sd = df[name].std()
    df[name] = (df[name] - mean) / sd
    
# Encode text values to dummy variables
def encode_text_dummy(df, name):
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = f"{name}-{x}"
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)

# Encode the feature vector
def encode_feature_vector(df):
    encode_numeric_zscore(df, 'Time')
    encode_text_dummy(df, 'Source')
    encode_text_dummy(df, 'Destination')
    encode_text_dummy(df, 'Protocol')
    encode_numeric_zscore(df, 'Length')
    encode_text_dummy(df, 'Info')

encode_feature_vector(appended_df)
appended_df

Unnamed: 0,No.,Time,Length,Out-Of-Order,Source-10.0.0.169,...,Info-[TCP Window Update] 443 > 52002 [ACK] Seq=7238 Ack=95907 Win=266240 Len=0 TSval=1287395879 TSecr=554205 SLE=107251 SRE=108669,Info-[TCP Window Update] 443 > 52002 [ACK] Seq=7238 Ack=95907 Win=269056 Len=0 TSval=1287395885 TSecr=554205 SLE=125671 SRE=127089 SLE=107251 SRE=108669,Info-[TCP Window Update] 443 > 52002 [ACK] Seq=7238 Ack=95907 Win=271104 Len=0 TSval=1287395889 TSecr=554205 SLE=131343 SRE=132761 SLE=125671 SRE=127089 SLE=107251 SRE=108669,Info-[TCP Window Update] 80 > 54054 [ACK] Seq=1 Ack=1 Win=62464 Len=0 TSval=3854677989 TSecr=552763,Info-[TCP Window Update] 80 > 54059 [ACK] Seq=1 Ack=1 Win=62464 Len=0 TSval=4080375805 TSecr=553470
0,1,-1.530995,-0.715536,False,0,...,0,0,0,0,0
1,2,-1.530903,-0.785123,False,1,...,0,0,0,0,0
2,3,-1.527177,-0.772752,False,1,...,0,0,0,0,0
3,4,-1.526794,-0.772752,False,0,...,0,0,0,0,0
4,5,-1.526732,-0.785123,False,1,...,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
5950,5951,1.286169,-0.785123,False,1,...,0,0,0,0,0
5951,5952,1.286172,-0.785123,False,1,...,0,0,0,0,0
5952,5953,1.286175,-0.785123,False,1,...,0,0,0,0,0
5953,5954,1.286387,-0.724814,False,1,...,0,0,0,0,0
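
The two encodings can be checked by hand on a tiny example. The sketch below re-implements them with the standard library only; the function names `zscore` and `one_hot` and the sample values are assumptions chosen for illustration. It uses the sample standard deviation, which is also the default of pandas `Series.std()`.

```python
import statistics

def zscore(values, mean=None, sd=None):
    """Standardize values; pass mean and sd to reuse statistics from other data."""
    if mean is None:
        mean = statistics.mean(values)
    if sd is None:
        sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]

def one_hot(values, name):
    """Build one 0/1 column per distinct category, named '<name>-<category>'."""
    categories = sorted(set(values))
    return {f"{name}-{c}": [1 if v == c else 0 for v in values] for c in categories}

lengths = [60, 70, 80]                    # hypothetical packet lengths
z = zscore(lengths)                       # mean 70, sample sd 10
z_new = zscore([90], mean=70, sd=10)      # new data scaled with the earlier statistics
protocols = one_hot(["TCP", "TLSv1.2", "TCP"], "Protocol")
print(z, z_new, protocols)
```

The optional `mean` and `sd` arguments mirror those of `encode_numeric_zscore`: they allow later data to be scaled with statistics computed earlier, so both sets share one scale.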


### Create Data Partition Set for Training and Inference
To perform anomaly detection, we divide the data into two sets: a "normal" set and an "Out-Of-Order TCP" anomaly set. The normal set is then split into training and test partitions.

In [88]:
from sklearn.model_selection import train_test_split

# Build boolean masks for anomalous and normal packets
anomaly_mask = appended_df['Out-Of-Order']
normal_mask = ~anomaly_mask

# Drop the label column, then separate the normal and anomaly rows
appended_df.drop('Out-Of-Order', axis=1, inplace=True)
normal_df = appended_df[normal_mask]
anomaly_df = appended_df[anomaly_mask]

print(f"Normal Count: {len(normal_df)}")
print(f"Out-Of-Order Anomaly Count: {len(anomaly_df)}")

# Numeric feature matrices fed to the neural network.
# Cast to float32 so that mixed numeric/boolean columns do not yield an object array.
x_normal = normal_df.values.astype('float32')
x_anomaly = anomaly_df.values.astype('float32')

# Partition the Normal Data Set to Training and Inference
x_normal_train, x_normal_test = train_test_split(x_normal, test_size=0.25, random_state=42)
print(f"Normal train count: {len(x_normal_train)}")
print(f"Normal test count: {len(x_normal_test)}")

Normal Count: 5912
Out-Of-Order Anomaly Count: 43
Normal train count: 4434
Normal test count: 1478
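
The partitioning logic can be sketched without pandas or scikit-learn. This is a simplified stand-in under stated assumptions, not a re-implementation of `train_test_split`: the helper names, the toy rows, and the choice to round the test size up are illustrative.

```python
import math
import random

def split_by_label(rows, is_anomaly):
    """Separate rows into (normal, anomaly) lists using parallel boolean flags."""
    normal = [r for r, flag in zip(rows, is_anomaly) if not flag]
    anomaly = [r for r, flag in zip(rows, is_anomaly) if flag]
    return normal, anomaly

def partition(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of rows, then hold out a fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20))                  # 20 hypothetical packets
flags = [r % 10 == 0 for r in rows]     # packets 0 and 10 flagged as anomalies
normal, anomaly = split_by_label(rows, flags)
train, test = partition(normal)
print(len(normal), len(anomaly), len(train), len(test))
```

Only normal rows enter the training partition, so the autoencoder never sees an anomaly while learning. With the 5912 normal rows above, a 25% hold-out gives 1478 test rows and 4434 training rows, which matches the printed counts.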


### AutoEncoder
We are now ready to train the autoencoder on the normal training data. The autoencoder learns to compress each row to a vector of just three numbers and should also be able to decompress it with reasonable accuracy. As is typical for autoencoders, we are merely training the neural network to produce the same output values as were fed to the input layer.


In [89]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
# Encoder: input features -> 50 -> 3 (the bottleneck)
model.add(Dense(50, input_dim=x_normal.shape[1], activation='relu'))
model.add(Dense(3, activation='relu'))
# Decoder: 3 -> 50 -> one linear output per input feature
model.add(Dense(50, activation='relu'))
model.add(Dense(x_normal.shape[1]))
model.compile(loss='mean_squared_error', optimizer='adam')
# Input and target are the same array: the network learns to reconstruct its input
model.fit(x_normal_train, x_normal_train, verbose=1, epochs=100)

Epoch 1/100
...

<tensorflow.python.keras.callbacks.History at 0x7f6e467aa7b8>
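
To get a feel for the size of this network, the sketch below counts the trainable parameters of the four fully connected layers, each of which has one weight per input-output pair plus one bias per output. The notebook does not print the actual feature count (`x_normal.shape[1]`), so the value 100 used here is purely a placeholder assumption, and the helper names are hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def autoencoder_params(n_features, hidden=50, bottleneck=3):
    """Total parameters of an n_features -> 50 -> 3 -> 50 -> n_features network."""
    sizes = [n_features, hidden, bottleneck, hidden, n_features]
    return sum(dense_params(a, b) for a, b in zip(sizes, sizes[1:]))

n_features = 100  # placeholder; the real value is x_normal.shape[1]
total = autoencoder_params(n_features)
print(total)
```

Almost all parameters sit in the first and last layers, so the count grows roughly linearly with the number of input features, while the three-value bottleneck stays fixed.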


### Inferencing an Anomaly

To check whether a data set matches the normal data or is an anomaly, we compute the autoencoder's reconstruction RMSE on it and compare that value, as a percentage difference, with the RMSE measured on the held-out normal test data.

In [90]:
import numpy as np
from sklearn import metrics

# Baseline: reconstruction RMSE on the held-out normal test split
rmse_normal_training = np.sqrt(metrics.mean_squared_error(model.predict(x_normal_test), x_normal_test))
# Reconstruction RMSE on all normal rows and on the anomalous rows
rmse_normal = np.sqrt(metrics.mean_squared_error(model.predict(x_normal), x_normal))
rmse_anomaly = np.sqrt(metrics.mean_squared_error(model.predict(x_anomaly), x_anomaly))

# Report each RMSE as a percentage difference from the baseline
print(f"RMSE between Normal Training and Testing : {(abs(rmse_normal - rmse_normal_training) / rmse_normal_training) * 100}")
print(f"RMSE between Normal Training and Anomaly : {(abs(rmse_anomaly - rmse_normal_training) / rmse_normal_training) * 100}")

RMSE between Normal Training and Testing : 0.5316444167868574
RMSE between Normal Training and Anomaly : 8.196324285679337
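
The comparison above can be reproduced by hand. The sketch below computes RMSE over all elements of two equally shaped lists of rows, then the percentage difference from a baseline. The decision rule at the end is an addition for illustration only: the 5% threshold is an assumed value that happens to separate the two figures printed above, not something the notebook defines.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error over every element of two equally shaped row lists."""
    squared = [(a - p) ** 2
               for row_a, row_p in zip(actual, predicted)
               for a, p in zip(row_a, row_p)]
    return math.sqrt(sum(squared) / len(squared))

def percent_difference(value, baseline):
    """Absolute difference from the baseline, as a percentage of the baseline."""
    return abs(value - baseline) / baseline * 100

def is_anomalous(percent, threshold=5.0):
    """Hypothetical rule: flag a data set whose RMSE deviates by more than threshold %."""
    return percent > threshold

inputs = [[0.0, 0.0], [0.0, 0.0]]            # toy inputs
close_output = [[1.0, 1.0], [1.0, 1.0]]      # toy reconstruction, error 1 per element
far_output = [[2.0, 2.0], [2.0, 2.0]]        # toy reconstruction, error 2 per element
baseline = rmse(inputs, close_output)
deviation = percent_difference(rmse(inputs, far_output), baseline)
print(baseline, deviation, is_anomalous(deviation))
```

In practice a threshold like this would have to be chosen from validation data rather than from a single run.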
