# Network Anomaly Detection using Autoencoders

- Ananth Sankar, Solutions Architect at NVIDIA.
- Eric Harper, Solutions Architect, Global Telecoms at NVIDIA.

Welcome to the second lab of this series!

In the previous lab, we used XGBoost, a powerful and efficient tree-based algorithm, to classify anomalies. We were able to identify almost perfectly both the anomalous data in the KDD99 dataset and the type of anomaly that occurred. However, in the real world, *labeled* data can be expensive and hard to come by. In network security especially, zero-day attacks can be both the most challenging and the most important attacks to detect. Since, by definition, these attacks are happening for the first time, no labels exist for them.

So how do we approach *this* problem?

For starters, we could have security analysts investigate the network packets and label the anomalous ones. But that solution doesn't scale, and models trained on those labels might still have difficulty identifying attacks that haven't occurred before.

Our solution *needs to use* "unsupervised learning." Unsupervised learning is the class of machine learning and deep learning algorithms that enable us to draw inferences from our dataset without labels.



In this lab, we will use autoencoders (AEs) to detect anomalies in the KDD99 dataset. Autoencoders offer several advantages for anomaly detection. One main advantage is that AEs can learn non-linear relationships in the data.
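To make the idea concrete, here is a minimal sketch, assuming anomalies are scored by how poorly a model trained on normal data reconstructs them (a common scheme for autoencoder-based detection). Everything in it is illustrative: the synthetic data, the helper names `fit_linear_codec` and `reconstruction_error`, and the 99th-percentile threshold are assumptions, not part of the lab. It uses a linear encoder/decoder built from an SVD as a stand-in, so it does not capture the non-linear relationships a real AE can learn.

```python
import numpy as np

def fit_linear_codec(X_normal, n_components):
    """Fit a linear encoder/decoder (PCA via SVD) on normal samples only."""
    mean = X_normal.mean(axis=0)
    # Rows of vt are the principal directions of the centered data
    _, _, vt = np.linalg.svd(X_normal - mean, full_matrices=False)
    return mean, vt[:n_components]

def reconstruction_error(X, mean, components):
    """Per-sample mean squared error between input and reconstruction."""
    centered = X - mean
    latent = centered @ components.T   # encode into the low-dimensional space
    restored = latent @ components     # decode back to feature space
    return np.mean((centered - restored) ** 2, axis=1)

rng = np.random.default_rng(123)

# Hypothetical "normal" data: points near a 2-D plane inside a 5-D feature space
basis = rng.normal(size=(2, 5))
normal = rng.normal(size=(500, 2)) @ basis + 0.05 * rng.normal(size=(500, 5))

# Hypothetical anomalies: points scattered in all 5 dimensions, off that plane
anomalies = rng.normal(scale=3.0, size=(20, 5))

mean, components = fit_linear_codec(normal, n_components=2)
normal_err = reconstruction_error(normal, mean, components)

# Flag anything reconstructed worse than 99% of the normal samples
threshold = np.percentile(normal_err, 99)
flags = reconstruction_error(anomalies, mean, components) > threshold

print(f"threshold: {threshold:.5f}")
print(f"anomalies flagged: {flags.sum()} of {len(flags)}")
```

The threshold is a design choice: a lower percentile catches more anomalies at the cost of more false alarms on normal traffic.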

While we will not use the labels in the KDD99 dataset for model training, we will use them to evaluate how well our model detects anomalies. We will also use the labels to see whether the AE embeds anomalies in latent space according to their type.
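As a rough illustration of label-based evaluation, the sketch below computes the area under the ROC curve for a set of anomaly scores. This is one plausible metric rather than a statement of how the lab scores its model. The helper name `roc_auc` and the toy labels and scores are hypothetical. The pairwise formula used here is equivalent to the area that scikit-learn's `roc_curve` and `auc` produce; it is written in plain NumPy only to keep the sketch small.

```python
import numpy as np

def roc_auc(labels, scores):
    """Probability that a random anomaly scores higher than a random normal sample.

    labels: 1 for anomaly, 0 for normal. scores: higher means more anomalous.
    Ties between an anomaly and a normal sample count as half a win.
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Compare every anomaly score against every normal score
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

# Toy example: labels are used only here, for evaluation, never for training
toy_labels = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])
print(roc_auc(toy_labels, toy_scores))  # → 0.75
```

An AUC of 0.5 corresponds to scores that carry no information about the labels, and 1.0 to scores that rank every anomaly above every normal sample.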

Note that we will be using Keras as the deep learning framework for this lab. Keras is an open source neural network library written in Python and it is designed to enable fast experimentation with deep neural networks.

In [3]:
# Import libraries that will be needed for the lab
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Image
import os, datetime

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import roc_curve, auc, confusion_matrix
from sklearn.cluster import KMeans

import tensorflow as tf
from tensorflow.keras import optimizers
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.utils import plot_model
%load_ext tensorboard

import pickle

import random
# Seed Python's built-in RNG only; NumPy and TensorFlow keep separate random state
random.seed(123)

