# DNN Autoencoder example

This tutorial demonstrates how to use safekit's multivariate DNN autoencoder for anomaly detection.

In [None]:
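import os
import sys

# Notebooks do not define __file__; with the name quoted, this resolves to
# the current working directory, which is then added to the import path.
current_dir = os.path.abspath(os.path.dirname('__file__'))
print(current_dir)
sys.path.append(current_dir)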

C:\Users\cgy\Desktop\safekit\safekit


In [35]:
import tensorflow as tf
import numpy as np
import json

import sys

from safekit.batch import OnlineBatcher, split_batch

from safekit.graph_training_utils import ModelRunner, EarlyStop
from safekit.tf_ops import join_multivariate_inputs, dnn, multivariate_loss, eyed_mvn_loss
from safekit.util import make_feature_spec, make_loss_spec

# Fix the random seeds so that runs are reproducible.
tf.set_random_seed(408)
np.random.seed(408)

In [None]:
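layer_list = [100, 50, 25, 50, 100]  # hidden units per layer
lr = 5e-3                            # learning rate
# Embedding-size parameters passed to join_multivariate_inputs
embed_ratio = 0.75
min_embed = 2
max_embed = 1000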

Next, we load the JSON file describing the specifications for the data.

The JSON file holds a dictionary that specifies the number of features in the input data; the categories those features belong to; whether each category is metadata, input, or output; and the indices that map each category to specific features. This dictionary simplifies handling the data later, when it is fed to TensorFlow.

`datastart_index` specifies where the event counts begin in a single row of features; this is used by the minibatcher to ensure that it doesn't include metadata in the minibatches it produces.
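
As a rough illustration, a specification of this kind might look like the dictionary below. The category names `time`, `user`, `redteam`, and `counts` appear elsewhere in this tutorial, but the flag names (`meta`, `feature`, `target`), the `num_features` key, and all values are assumptions made for this sketch, not the contents of the actual safekit file.

```python
import json

# Hypothetical spec: the flag names ('meta', 'feature', 'target') and
# 'num_features' are assumptions for illustration, not safekit's schema.
example_spec_json = """
{
  "num_features": 7,
  "time":    {"index": [0], "meta": 1, "feature": 0, "target": 0},
  "user":    {"index": [1], "meta": 1, "feature": 0, "target": 0},
  "redteam": {"index": [2], "meta": 1, "feature": 0, "target": 0},
  "counts":  {"index": [3, 4, 5, 6], "meta": 0, "feature": 1, "target": 1}
}
"""

example_specs = json.loads(example_spec_json)

# The first index of the 'counts' category marks where event counts begin.
example_start = example_specs['counts']['index'][0]

# Columns before example_start hold metadata; the rest are event counts.
row = [12, 4051, 0, 3.0, 0.0, 1.0, 8.0]
metadata, counts = row[:example_start], row[example_start:]
print(example_start, metadata, counts)
```

Slicing a row at the start index separates the metadata columns from the count columns, which is the distinction the minibatcher relies on.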

In [4]:
# Load the feature spec JSON for the LANL dataset, then look up the index at
# which the event-count features start in the feature vector (3).
with open('../safekit/features/specs/agg/lanl_count_in_count_out_agg.json', 'r') as f:
    dataspecs = json.load(f)
datastart_index = dataspecs['counts']['index'][0]

Now that the data specifications are loaded, we instantiate a batcher to divide the data into smaller portions. Because the dataset is large, we feed it to the model in small batches to avoid exhausting memory. Tuning the minibatch size may also improve the model's performance. Here we use a batch size of 256 data points.

In [4]:
data = OnlineBatcher('/home/hutch_research/data/lanl/agg_feats/begin_no_weekends2.txt', 256, skipheader=True)
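
To make the idea of streaming minibatches concrete, here is a minimal sketch of reading a file a few rows at a time. The helper `iter_minibatches` is hypothetical and is not safekit's `OnlineBatcher`; the whitespace-delimited format and the handling of the final partial batch are assumptions.

```python
import io

def iter_minibatches(stream, batch_size, skipheader=False):
    """Yield lists of parsed rows, at most batch_size rows each.

    Hypothetical helper for illustration; not safekit's OnlineBatcher.
    """
    if skipheader:
        stream.readline()  # discard the header line
    batch = []
    for line in stream:
        line = line.strip()
        if not line:
            continue
        batch.append([float(v) for v in line.split()])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch

# Toy stand-in for a feature file: one header line, then five rows.
toy_file = io.StringIO(
    "time user redteam count\n"
    + "".join("%d %d 0 %d\n" % (i, i + 100, i * 2) for i in range(5))
)
sizes = [len(b) for b in iter_minibatches(toy_file, 2, skipheader=True)]
print(sizes)
```

Only one batch is held in memory at a time, which is the property that matters for a large input file.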

To put the data into a form that TensorFlow can process efficiently, we use `join_multivariate_inputs`. This function creates placeholders for the input data and defines graph operations that look up learned embeddings of the categorical features and concatenate them with the continuous features. The result is the input to the DNN autoencoder.

In [5]:
feature_spec = make_feature_spec(dataspecs)
x, ph_dict = join_multivariate_inputs(feature_spec, dataspecs, embed_ratio, min_embed, max_embed)
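
The following plain-Python sketch illustrates the general idea of looking up an embedding for a categorical feature and concatenating it with continuous features. The sizing rule in `embedding_size` (scale the class count by `embed_ratio`, then clamp between `min_embed` and `max_embed`) is an assumption about how such parameters are commonly used, and both helper functions are hypothetical rather than safekit code.

```python
import random

def embedding_size(num_classes, embed_ratio, min_embed, max_embed):
    """Assumed sizing rule: scale the class count, then clamp to [min_embed, max_embed]."""
    return int(max(min_embed, min(max_embed, num_classes * embed_ratio)))

def join_inputs(category_ids, continuous_rows, table):
    """Look up each row's embedding and append its continuous features."""
    return [table[c] + cont for c, cont in zip(category_ids, continuous_rows)]

random.seed(408)
num_classes = 4
size = embedding_size(num_classes, embed_ratio=0.75, min_embed=2, max_embed=1000)

# One embedding vector per class; in a real model these values are learned.
table = [[random.uniform(-1, 1) for _ in range(size)] for _ in range(num_classes)]

category_ids = [0, 3, 1]  # one categorical feature per row
continuous_rows = [[5.0, 1.0], [0.0, 2.0], [7.0, 7.0]]
x_rows = join_inputs(category_ids, continuous_rows, table)
print(size, len(x_rows[0]))
```

Each joined row has the embedding width plus the number of continuous features, so the network sees a single dense vector per data point.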

Now we instantiate the graph operations comprising the DNN autoencoder with a single call to the `dnn` function, which returns a TensorFlow tensor corresponding to the last layer of the DNN. This tensor will then be used to define the model's loss. We pass the previously defined input as the model's input, along with a list giving the number of hidden nodes at each hidden layer in the network.

Beyond model depth and width, this function exposes further hyperparameters that can be tuned, including the activation function, weight initialization range, and dropout factor.

In [6]:
h = dnn(x, layer_list)
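
The `layer_list` used here narrows to 25 units and then widens again. The sketch below works out the weight shapes such a stack of fully connected layers would have. The helper `layer_shapes` is hypothetical, the input width of 20 is an arbitrary value chosen for the example, and the parameter count assumes one bias per output unit; it covers only the hidden layers listed in `layer_list`.

```python
def layer_shapes(input_dim, layer_list):
    """Return the (fan_in, fan_out) weight shape of each fully connected layer."""
    shapes = []
    fan_in = input_dim
    for fan_out in layer_list:
        shapes.append((fan_in, fan_out))
        fan_in = fan_out
    return shapes

# input_dim=20 is an arbitrary value chosen for this example.
shapes = layer_shapes(20, [100, 50, 25, 50, 100])

# Weights plus one bias per output unit (an assumption about the layers).
n_params = sum(i * o + o for i, o in shapes)
print(shapes)
print(n_params)
```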

To determine how to compute losses, we use `make_loss_spec` to generate a specification mapping loss functions to features in the input. Then `multivariate_loss` interprets this specification and defines operations that compute feature-wise losses according to the data specifications. Since the inputs can be a mixture of continuous and categorical features, their losses need to be defined accordingly. This function supports three loss functions: `eyed_mvn_loss`, `diag_mvn_loss`, and `full_mvn_loss`. Here we use `eyed_mvn_loss` to compute the squared error for predictions. The latter two use diagonal and full covariance matrices, respectively, to compute the Mahalanobis distance between true values and predictions.

Once we define the graph operations that compute the squared error loss, we sum the losses over all features and average these losses over all data points in the minibatch. This is the scalar loss we will attempt to minimize using gradient descent.

In [7]:
loss_spec = make_loss_spec(dataspecs, eyed_mvn_loss)
loss_matrix = multivariate_loss(h, loss_spec, ph_dict)

loss_vector = tf.reduce_sum(loss_matrix, reduction_indices=1)
loss = tf.reduce_mean(loss_vector)
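
The two reductions can be checked by hand on a tiny example. This plain-Python sketch mirrors the shapes involved (a batch-by-feature matrix, a per-point vector, and a scalar); it uses an unscaled squared error and ignores any constant factor `eyed_mvn_loss` may apply, which is an assumption. The `_demo` names are made up for the illustration.

```python
def squared_error_matrix(truth, pred):
    """Per-feature squared error for each data point (rows) and feature (columns)."""
    return [[(t - p) ** 2 for t, p in zip(t_row, p_row)]
            for t_row, p_row in zip(truth, pred)]

truth = [[1.0, 2.0], [3.0, 4.0]]
pred = [[0.0, 2.0], [1.0, 1.0]]

loss_matrix_demo = squared_error_matrix(truth, pred)        # batch x features
loss_vector_demo = [sum(row) for row in loss_matrix_demo]   # one loss per data point
loss_demo = sum(loss_vector_demo) / len(loss_vector_demo)   # scalar to minimize
print(loss_matrix_demo, loss_vector_demo, loss_demo)
```

The per-point vector is what gets written out later as each data point's anomaly score, while the scalar is what the optimizer minimizes.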

To make it easy to map losses back to the input features, we next define a function, called during the training loop, that writes the metadata and loss for each data point in the current minibatch.

In [8]:
def write_results(data_dict, feat_loss, outfile):
    for d, u, t, l in zip(data_dict['time'].flatten().tolist(),
                          data_dict['user'].tolist(),
                          data_dict['redteam'].flatten().tolist(),
                          feat_loss.flatten().tolist()):
        outfile.write('%s %s %s %s\n' % (int(d), u, int(t), l))
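
Since a high reconstruction loss marks a data point the model found hard to reproduce, one way to use the output file is to rank records by loss. The sketch below assumes each line holds four whitespace-separated fields in the order written by `write_results` (time, user, redteam flag, loss); the helper `top_anomalies` and the sample rows are made up for illustration.

```python
import io

def top_anomalies(lines, k):
    """Parse 'time user redteam loss' lines and return the k highest-loss records."""
    records = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        time, user, redteam, loss = parts
        records.append((int(time), user, int(redteam), float(loss)))
    return sorted(records, key=lambda r: r[3], reverse=True)[:k]

# Made-up rows in the assumed output format.
sample = io.StringIO("1 u10 0 0.5\n1 u11 1 42.0\n2 u10 0 3.25\n")
for rec in top_anomalies(sample, 2):
    print(rec)
```

The user field is kept as a string because its exact printed form depends on the shape of the array in `data_dict['user']`.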

Now we instantiate a `ModelRunner` object, which provides a simple interface for interacting with the TensorFlow session. Instantiating this object defines the optimizer TensorFlow will use for gradient descent and initializes all of the variables in the TensorFlow graph. We can then use the object's `train_step` method to perform an optimization step, or its `eval` method to retrieve the values of arbitrary tensors in the graph.

The `loss_feats` variable specifies the names of the features over which we are computing losses. In this case it is a single category, the counts of the categorical features, though with a mixture of categorical and continuous features more than one category could be represented here.

In order to record the losses for all of the features, we define a list `eval_tensors` that contains tensors whose values we want to retrieve during training. We'll provide this list to the `ModelRunner`'s `eval` method during the training loop to compute these tensors, then record their values with the `write_results` function defined previously.

In [9]:
# other args incl. learning rate, optimizer, decay rate...
model = ModelRunner(loss, ph_dict, learnrate=lr, opt='adam')

loss_feats = [triple[0] for triple in loss_spec]

# list of tensors we want to retrieve at each training step; can also add loss_matrix to this
eval_tensors = [loss, loss_vector]

Now we begin training our model. We start by defining a stopping criterion for training using the `EarlyStop` object; if our model's performance doesn't improve after 20 training steps, the `check_error` function we instantiate will return `False`, and training will be discontinued.

Inside the training loop, `split_batch` is first used to construct a dictionary for TensorFlow that maps features to the placeholders used to feed data into the computational graph during training. Since our targets are defined separately from the inputs provided to our batcher, we add these to the dictionary.

We retrieve the losses for the current batch, then run a training step, which performs gradient descent over a single batch of inputs. This process repeats until the batcher has reached the end of the input file, the stopping criterion has been met, or the model's error has diverged to infinity.
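
The three stopping conditions can be sketched with a small patience-based checker. `PatienceStop` is a hypothetical stand-in, not safekit's `EarlyStop` implementation; in particular, signalling the end of the data with a `None` batch and measuring improvement against the best loss seen so far are assumptions.

```python
import math

class PatienceStop:
    """Hypothetical patience-based check; not safekit's EarlyStop implementation."""

    def __init__(self, patience):
        self.patience = patience
        self.best = math.inf
        self.stale_steps = 0

    def __call__(self, batch, loss):
        if batch is None:  # assumed end-of-stream signal
            return False
        if math.isnan(loss) or math.isinf(loss):  # loss diverged
            return False
        if loss < self.best:
            self.best, self.stale_steps = loss, 0
        else:
            self.stale_steps += 1
        return self.stale_steps < self.patience

# Toy loop: these losses stand in for values returned by a model.
losses = iter([9.0, 4.0, 5.0, 6.0, 7.0, 1.0])
check = PatienceStop(3)
steps = 0
cur = next(losses)
while check([0], cur):
    steps += 1
    cur = next(losses)
print(steps)
```

The loop has the same shape as the training loop in this tutorial: check first, do the work for the current batch, then fetch the next batch and check again.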

In [10]:
check_error = EarlyStop(20)

cur_loss = sys.float_info.max # starting with largest loss possible
raw_batch = data.next_batch()
training = check_error(raw_batch, cur_loss)

outfile = open('results', 'w')

while training:
    data_dict = split_batch(raw_batch, dataspecs)
    targets = {'target_' + name : data_dict[name] for name in loss_feats}
    data_dict.update(targets)
    cur_loss, feat_loss = model.eval(data_dict, eval_tensors)
    model.train_step(data_dict)
    
    write_results(data_dict, feat_loss, outfile)
    print('index: %s, loss: %.4f' % (data.index, cur_loss))
    raw_batch = data.next_batch()
    training = check_error(raw_batch, cur_loss)
    
outfile.close()

index: 256, loss: 7235519488.0000
index: 512, loss: 27242684.0000
index: 768, loss: 7783619.0000
index: 1024, loss: 10031691.0000
index: 1280, loss: 5930339.5000
index: 1536, loss: 3970240.2500
index: 1792, loss: 5967686.0000
index: 2048, loss: 4960282.0000
index: 2304, loss: 4517553.5000
index: 2560, loss: 7036947.0000
index: 2816, loss: 2647112.0000
index: 3072, loss: 10882352.0000
index: 3328, loss: 2459881.0000
index: 3584, loss: 1709068.6250
index: 3840, loss: 5032989.0000
index: 4096, loss: 1138042.7500
index: 4352, loss: 1072580.0000
index: 4608, loss: 1090286.5000
index: 4864, loss: 1288823.0000
index: 5120, loss: 977030.6250
index: 5376, loss: 1519262.2500
index: 5632, loss: 19444328.0000
index: 5888, loss: 730013.1875
index: 6144, loss: 1931345.5000
index: 6400, loss: 616478.6250
index: 6656, loss: 537939.9375
index: 6912, loss: 446257.3125
index: 7168, loss: 458408.2500
index: 7424, loss: 228352.2344
index: 7680, loss: 231614.3125
index: 7936, loss: 1510352128.0000
...

index: 64512, loss: 12476050.0000
index: 64768, loss: 20791440.0000
index: 65024, loss: 31349966.0000
index: 65280, loss: 10900652.0000
index: 65536, loss: 3343473.0000
index: 65792, loss: 9055377.0000
index: 66048, loss: 7649208.5000
index: 66304, loss: 1584254.1250
index: 66560, loss: 2929770.2500
index: 66816, loss: 8568593.0000
index: 67072, loss: 3307513.2500
index: 67328, loss: 2161881.2500
index: 67584, loss: 5490057.0000
index: 67840, loss: 2327615.0000
index: 68096, loss: 1260092.2500
index: 68352, loss: 723937024.0000
index: 68608, loss: 645145856.0000
index: 68864, loss: 13266762.0000
index: 69120, loss: 11260218.0000
index: 69376, loss: 7383053.5000
index: 69632, loss: 9246822.0000
index: 69888, loss: 16765551.0000
index: 70144, loss: 7445397.0000
index: 70400, loss: 17689336.0000
index: 70656, loss: 11074226.0000
index: 70912, loss: 9646866.0000
index: 71168, loss: 4960465.5000
index: 71424, loss: 6416443.0000
index: 71680, loss: 7880465.5000
index: 71936, loss: 9655702.00
...

index: 129280, loss: 5345947.5000
index: 129536, loss: 3072081.0000
index: 129792, loss: 4227489.0000
index: 130048, loss: 4295295.0000
index: 130304, loss: 4332012.5000
index: 130560, loss: 4636956.0000
index: 130816, loss: 5481211.0000
index: 131072, loss: 2071885.0000
index: 131328, loss: 4298899.5000
index: 131584, loss: 4098910.5000
index: 131840, loss: 3905417.5000
index: 132096, loss: 5066809.0000
index: 132352, loss: 2797831.2500
index: 132608, loss: 3887858.5000
index: 132864, loss: 4821723.5000
index: 133120, loss: 3663502.7500
index: 133376, loss: 2259491.7500
index: 133632, loss: 50285808.0000
index: 133888, loss: 3832247.5000
index: 134144, loss: 2560074.0000
index: 134400, loss: 2301788.5000
index: 134656, loss: 6069030.0000
index: 134912, loss: 15496918.0000
index: 135168, loss: 1949879.2500
index: 135424, loss: 2788536.5000
index: 135680, loss: 4276271.5000
index: 135936, loss: 2872138.0000
index: 136192, loss: 13347507.0000
index: 136448, loss: 2096684.7500
...

Done Training. End of data stream.