Competitors
======

This notebook contains implementations of two state-of-the-art methods:

**Deeplog, Loganomaly**

Our implementations are mainly based on three GitHub repositories:


https://github.com/wuyifan18/DeepLog (200+ stars)

https://github.com/donglee-afar/logdeep (170+ stars)

https://github.com/HelenGuohx/logbert (accompanies a peer-reviewed paper)

Target of this notebook
=====
Anyone should be able to reproduce our experiments without missing information. Everything needed for reproduction is described in our thesis.

Our contribution
=====
* Improve **prediction speed** on GPU by about **60x**

    * originally: **more than 60 minutes**

    * ours: **only 1 minute**

* Simplify the scripts from those repositories and adapt them for use in a notebook

Seed
=====

```python
# Use seed 1 to ensure reproducibility:
import os
import random

import numpy as np
import torch

seed = 1

os.environ["PL_GLOBAL_SEED"] = str(seed)
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
``` 

# Data
Both of the following repositories ship the same parsed HDFS data:
* https://github.com/donglee-afar/logdeep
* https://github.com/wuyifan18/DeepLog

This data has exactly the expected number of log templates; we believe it was generated by source code analysis (**the most accurate way**). For the HDFS dataset we therefore skip log parsing and use the data from those GitHub repositories directly.

    - num_classes = 28
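
To make the data format concrete, the sketch below turns one parsed session (a whitespace-separated line of log-key IDs) into (window, next key) pairs. The helper name `session_to_windows` and the sample line are hypothetical; the shift by one, the padding to `window_size + 1` keys, and the padding value 28 mirror what the `generate` function in this notebook does.

```python
def session_to_windows(line, window_size=10, pad_key=28):
    """Split one parsed session into (input window, next key) pairs."""
    # Shift every log key down by one, as generate() does.
    keys = [int(tok) - 1 for tok in line.split()]
    # Pad short sessions so that at least one full window plus a label exists.
    keys += [pad_key] * (window_size + 1 - len(keys))
    return [(keys[i:i + window_size], keys[i + window_size])
            for i in range(len(keys) - window_size)]


# Hypothetical session with 12 log keys -> 2 windows of 10 keys each.
sample = "5 5 5 22 11 9 11 9 11 9 26 26"
pairs = session_to_windows(sample)
for window, label in pairs:
    print(window, "->", label)
```

A session with fewer than 11 keys still produces exactly one window, filled up with the padding key.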

# Deeplog

* Model Structure
    - num_classes = 28
    - input_size = 1 (input size of LSTM layers)
    - num_layers = 2 (number of LSTM layers)
    - hidden_size = 64 (Hidden size of LSTM layers)
    - initial hidden state and memory (cell) state of the LSTM = all zeros (torch.zeros)

* Training parameters
    - num_epochs = 300
    - batch_size = 2048
    - optimizer = Adam
    - learning_rate = 0.001 (pytorch Adam default learning rate) 

* Data Format
    - window_size = 10 (use 10 continuous log keys to predict next log key)

* Outlier Definition
    - if at least one log key in a session cannot be predicted correctly, the whole session is an outlier
 


*   The network architecture is the same for every log system, but the embedding size differs. A separate network object is trained for each log system.
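
The outlier definition above can be sketched in plain Python: a window counts as correctly predicted when the true next key is among the `top` highest-scoring candidates, and a session is flagged as soon as one of its windows fails. The scores, session IDs, and function name below are made up for illustration; the notebook's own `predict_deeplog` applies the same rule to batched tensors.

```python
def flag_outlier_sessions(scores, labels, session_ids, top=2):
    """Return the IDs of sessions with at least one mispredicted window."""
    outliers = set()
    for row, label, sid in zip(scores, labels, session_ids):
        # Indices of the `top` largest scores = candidate next log keys.
        candidates = sorted(range(len(row)), key=row.__getitem__)[-top:]
        if label not in candidates:
            outliers.add(sid)
    return outliers


# Toy example: 4 log keys, 3 windows from 2 sessions (all values hypothetical).
scores = [
    [0.1, 0.6, 0.2, 0.1],  # session 1: candidates are keys 1 and 2
    [0.7, 0.1, 0.1, 0.1],  # session 1: candidates include key 0
    [0.1, 0.2, 0.3, 0.4],  # session 2: candidates are keys 2 and 3
]
labels = [1, 0, 0]
session_ids = [1, 1, 2]
print(flag_outlier_sessions(scores, labels, session_ids))
```

Only session 2 is flagged here, because its true next key 0 is not among its two best candidates.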

# Training hyperparameters

# Outlier detection








In [None]:
! nvidia-smi

Mon Aug 16 18:47:20 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
#!pip install pytorch-lightning

# Preparing


*   Fix a bug in the PyTorch transformer framework (by replacing `functional.py`)
*   Connect to Google Drive



In [None]:
from google.colab import drive
drive.mount('/content/drive')
!cp /content/drive/MyDrive/openstack/functional.py /usr/local/lib/python3.7/dist-packages/torch/nn/functional.py
# After the first copy, restart the notebook so the replaced file is loaded
from tqdm.notebook import tqdm
import time
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.tensorboard import SummaryWriter
from torch.utils.data import TensorDataset, DataLoader
import argparse
import os
from collections import Counter

# Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")



import random
import numpy as np
import torch.nn.functional as F

# Source: https://gist.github.com/KirillVladimirov/005ec7f762293d2321385580d3dbe335
def seed_everything(seed=1234):
    """Seed the Python, NumPy and PyTorch random number generators."""
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)





Mounted at /content/drive


In [None]:
# Build sliding-window datasets for training and prediction.
# Relative paths are resolved against '/content/drive/MyDrive/DeepLog/data/'.
def generate(path, window_size = 10,
             quantitative_features = False,
             vocab_size = 28,
             need_sessions_number = False):
    num_sessions = 0
    inputs = []
    quantitative = []
    semantic = []
    outputs = []
    session_id = []
    # By default, load HDFS data via a relative path; other datasets must provide a full path
    if path[0]!='/':
        path = '/content/drive/MyDrive/DeepLog/data/' + path
    name = path.split('/')[-1]
    with open(path, 'r') as f:
        for line in f.readlines():
            num_sessions += 1
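            # Shift every log key down by one
            line = list(map(lambda n: n-1, map(int, line.strip().split())))
            # Pad short sessions with key 28 so every session yields at least one window
            line = line + [28] * (window_size + 1 - len(line))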

            for i in range(len(line) - window_size):
                ##########
                # key generation of each log sequence
                ##########
                inputs.append(line[i:i + window_size])

                if quantitative_features:
                    Quantitative_pattern = [0] * vocab_size
                    log_counter = Counter(line[i:i + window_size])
                    for key in log_counter:
                        if key>vocab_size-1:
                            continue
                        Quantitative_pattern[key] = log_counter[key]
                    quantitative.append(Quantitative_pattern)

                outputs.append(line[i + window_size])
                session_id.append(num_sessions)
    print('Number of sessions({}): {}'.format(name, num_sessions))
    print('Number of seqs({}): {}'.format(name, len(inputs)))
    if quantitative_features:
        dataset = TensorDataset(torch.tensor(inputs, dtype=torch.float),
                                torch.tensor(quantitative, dtype=torch.float), 
                                torch.tensor(outputs), 
                                torch.tensor(session_id))
    else:
        dataset = TensorDataset(torch.tensor(inputs, dtype=torch.float), torch.tensor(outputs), torch.tensor(session_id))
    if need_sessions_number:
        return dataset,num_sessions
    else:
        return dataset
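
The quantitative features built inside `generate` are simply per-window counts of each log key. A minimal standalone sketch of that step, with a hypothetical helper name and a toy vocabulary of 5 keys, could look like this:

```python
from collections import Counter


def count_vector(window, vocab_size):
    """Count how often each log key (0 .. vocab_size-1) occurs in a window."""
    vector = [0] * vocab_size
    for key, count in Counter(window).items():
        # Keys outside the vocabulary (for example a padding key) are skipped.
        if key < vocab_size:
            vector[key] = count
    return vector


# Toy window of 10 keys; key 7 stands in for an out-of-vocabulary value.
window = [0, 1, 1, 3, 3, 3, 4, 7, 7, 7]
print(count_vector(window, vocab_size=5))  # [1, 2, 0, 3, 1]
```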

In [None]:
generate('/content/drive/MyDrive/hadoop/hadoop_train.txt')[1]

Number of sessions(hadoop_train.txt): 6
Number of seqs(hadoop_train.txt): 8903


(tensor([ 3., 33., 34., 35., 36., 37., 37., 37., 37., 37.]),
 tensor(37),
 tensor(1))

# Deeplog definition

In [None]:
# Deeplog
class Deeplog(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_keys):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_keys)

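    def forward(self, x):
        # Zero-initialised hidden and cell states, created on the same device as the input
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size, device=x.device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size, device=x.device)
        out, _ = self.lstm(x, (h0, c0))
        # Classify from the LSTM output at the last time step
        out = self.fc(out[:, -1, :])
        return out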

def train_deeplog(model, dataloader, model_dir,learning_rate):
    #device
    device = next(model.parameters()).device

    # Hyperparameters for training
    num_epochs = 300
    input_size = 1
    window_size = 10


    # Loss and optimizer
    criterion = nn.CrossEntropyLoss(ignore_index=28)
    optimizer = optim.Adam(model.parameters(),lr = learning_rate)

    # Training loop
    start_time = time.time()
    total_step = len(dataloader)
    with tqdm(range(num_epochs)) as t:
        for epoch in t:  # Loop over the dataset multiple times
            train_loss = 0
            for step, (seq, label,_) in enumerate(dataloader):
                # Forward pass
                seq = seq.clone().detach().view(-1, window_size, input_size).to(device)
                output = model(seq)
                loss = criterion(output, label.to(device))

                # Backward and optimize
                optimizer.zero_grad()
                loss.backward()
                train_loss += loss.item()
                optimizer.step()
                #scheduler.step()

            t.set_postfix({'_': 'Epoch [{}/{}], train_loss: {:.4f}'.format(epoch + 1, num_epochs, train_loss / total_step)})

    # Summary of training
    elapsed_time = time.time() - start_time
    print('elapsed_time: {:.3f}s'.format(elapsed_time))
    if not os.path.isdir(model_dir):
        os.makedirs(model_dir)
    torch.save(model.state_dict(), model_dir+'.pt')
    print('Finished Training')


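def predict_deeplog(model, dataloader, window_size = 10, input_size = 1, top = 9):
    """Return the IDs of sessions with at least one window whose true next key
    is not among the `top` highest-scoring predictions."""
    result = set()
    model.eval()
    device = next(model.parameters()).device

    with torch.no_grad():  # inference only, no gradients needed
        for seq, label, session_id in tqdm(dataloader):
            # scores for every candidate log key
            scores = model(seq.clone().detach().view(-1, window_size, input_size).to(device))

            # a window is predicted correctly if its label is among the top candidates
            top_keys = torch.argsort(scores, 1)[:, -top:]
            predict = (top_keys == label.reshape(-1, 1).to(device)).any(-1).tolist()

            # collect sessions that contain at least one mispredicted window
            for sid, ok in zip(session_id.tolist(), predict):
                if not ok:
                    result.add(sid)
    model.train()
    return result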

# Deeplog HDFS

In [None]:
#seed everything
#seed_everything(1234)

###############
# model define
###############
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
#device = torch.device('cpu')
# Hyperparameters
num_classes = 300
num_epochs = 300
batch_size = 2048
input_size = 1
num_layers = 2
hidden_size = 128
window_size = 10

seed_everything(1234)
deeplog = Deeplog(input_size, hidden_size, num_layers, num_classes).to(device)

#########
# data  #
#########
#train data:
seq_dataset = generate('/content/drive/MyDrive/hadoop/hadoop_train.txt')
train_dataloader = DataLoader(seq_dataset, batch_size=batch_size, shuffle=True)
saving_path = '/content/drive/MyDrive/1'

#predict data:
#normal
normal_data = generate('/content/drive/MyDrive/hadoop/hadoop_test_normal.txt')
test_normal_dataloader = DataLoader(normal_data,2048)
#abnormal
abnormal,length = generate('/content/drive/MyDrive/hadoop/hadoop_test_abnormal.txt',need_sessions_number=True)
test_abnormal_dataloader = DataLoader(abnormal,2048)

##############
# training   #
##############
train_deeplog(deeplog, train_dataloader, saving_path, learning_rate= 0.001)

################
# predicting   #
################
# predict normal
result1 = predict_deeplog(deeplog,test_normal_dataloader,
                          window_size= window_size, 
                          input_size = input_size, top = 9)
# predict abnormal
result2 = predict_deeplog(deeplog,test_abnormal_dataloader,
                          window_size= window_size, 
                          input_size = input_size, top = 9)

################
# print result #
################
FP = len(result1)
TP = len(result2)
FN = length - TP
P = 100 * TP / (TP + FP)
R = 100 * TP / (TP + FN)
F1 = 2 * P * R / (P + R)
print('false positive (FP): {}, false negative (FN): {}, Precision: {:.3f}%, Recall: {:.3f}%, F1-measure: {:.3f}%'.format(FP, FN, P, R, F1))
print('Finished Predicting')

Number of sessions(hadoop_train.txt): 6
Number of seqs(hadoop_train.txt): 8903
Number of sessions(hadoop_test_normal.txt): 5
Number of seqs(hadoop_test_normal.txt): 5834
Number of sessions(hadoop_test_abnormal.txt): 43
Number of seqs(hadoop_test_abnormal.txt): 109400


  0%|          | 0/300 [00:00<?, ?it/s]

elapsed_time: 41.624s
Finished Training


  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/54 [00:00<?, ?it/s]

false positive (FP): 5, false negative (FN): 0, Precision: 89.583%, Recall: 100.000%, F1-measure: 94.505%
Finished Predicting
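
As a sanity check on the numbers printed above, the metrics can be recomputed by hand. In this evaluation, every normal session flagged as an outlier is a false positive and every abnormal session flagged is a true positive; with FP = 5, TP = 43 and FN = 0 this reproduces the reported precision, recall, and F1. The function name is hypothetical.

```python
def session_metrics(fp, tp, fn):
    """Precision, recall and F1 in percent, using the notebook's formulas."""
    precision = 100 * tp / (tp + fp)
    recall = 100 * tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f1 = session_metrics(fp=5, tp=43, fn=0)
print('Precision: {:.3f}%, Recall: {:.3f}%, F1-measure: {:.3f}%'.format(p, r, f1))
# Precision: 89.583%, Recall: 100.000%, F1-measure: 94.505%
```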


# Loganomaly

### Model

In [None]:
# Loganomaly
class loganomaly(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_keys):
        super(loganomaly, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm0 = nn.LSTM(input_size,
                             hidden_size,
                             num_layers,
                             batch_first=True)
        self.lstm1 = nn.LSTM(input_size,
                             hidden_size,
                             num_layers,
                             batch_first=True)
        self.fc = nn.Linear(2 * hidden_size, num_keys)

    # Note: the logdeep implementation defines an attention_net,
    # but never uses it anywhere in this class.

    def forward(self, features, device):
        input0, input1 = features[0], features[1]

        h0_0 = torch.zeros(self.num_layers, input0.size(0),
                           self.hidden_size).to(device)
        c0_0 = torch.zeros(self.num_layers, input0.size(0),
                           self.hidden_size).to(device)

        out0, _ = self.lstm0(input0, (h0_0, c0_0))

        h0_1 = torch.zeros(self.num_layers, input1.size(0),
                           self.hidden_size).to(device)
        c0_1 = torch.zeros(self.num_layers, input1.size(0),
                           self.hidden_size).to(device)

        out1, _ = self.lstm1(input1, (h0_1, c0_1))
        multi_out = torch.cat((out0[:, -1, :], out1[:, -1, :]), -1)
        out = self.fc(multi_out)
        return out


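def predict(model, test_dataset):
    # Relies on the globals window_size, input_size, num_candidates and device
    # defined in the parameter cell below.
    result = set()
    model.eval()
    with torch.no_grad():  # inference only, no gradients needed
        for seq, quantitative, label, session_id in tqdm(DataLoader(test_dataset, 2048*4)):
            # input parsing
            seq = seq.clone().detach().view(-1, window_size, input_size).to(device)
            quantitative = quantitative.clone().detach().view(-1, quantitative.size(1), input_size).to(device)
            y = label.reshape(-1, 1).to(device)

            # model prediction
            scores = model((seq, quantitative), device)

            # a window is correct if its label is among the top num_candidates keys
            correct = (torch.argsort(scores, 1)[:, -num_candidates:] == y).any(-1).tolist()

            # collect sessions that contain at least one mispredicted window
            for sid, ok in zip(session_id.tolist(), correct):
                if not ok:
                    result.add(sid)
    return result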

### Parameter and model init


In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hyperparameters
num_classes = 300 # hadoop
num_epochs = 370
batch_size = 2048
input_size = 1
model_dir = 'model'
log = 'Adam_batch_size={}_epoch={}'.format(str(batch_size), str(num_epochs))
num_layers = 2
hidden_size = 64
window_size = 10

# Optimizer and learning-rate schedule
learning_rate = 0.001
accumulation_step = 1
lr_step = (300,350)
lr_decay_ratio =0.1

num_candidates = 1

from torch.optim.lr_scheduler import MultiStepLR

model = loganomaly(input_size, hidden_size, num_layers, num_classes).to(device)
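
With `learning_rate = 0.001`, `lr_step = (300, 350)` and `lr_decay_ratio = 0.1`, a multi-step schedule multiplies the learning rate by 0.1 at each milestone. The closed form below is a plain-Python sketch of that rule; the function name is hypothetical, and it assumes the scheduler is stepped once per epoch, as in the training loop of this notebook.

```python
from bisect import bisect_right


def multistep_lr(epoch, base_lr=0.001, milestones=(300, 350), gamma=0.1):
    """Learning rate used during a given 0-based epoch."""
    # Each milestone already reached multiplies the rate by gamma once.
    return base_lr * gamma ** bisect_right(milestones, epoch)


for epoch in (0, 299, 300, 349, 350, 369):
    print(epoch, multistep_lr(epoch))
```

Under these assumptions the first 300 epochs run at 0.001, the next 50 at roughly 0.0001, and the final 20 at roughly 0.00001.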

### Training

In [None]:
device = next(model.parameters()).device

#training data
seq_dataset = generate('/content/drive/MyDrive/hadoop/hadoop_train.txt',vocab_size= num_classes,quantitative_features = True)
dataloader = DataLoader(seq_dataset, batch_size=batch_size, shuffle=True, pin_memory=True)

# Loss and optimizer and scheduler
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(),lr = learning_rate)
scheduler = MultiStepLR(optimizer, milestones=lr_step, gamma=lr_decay_ratio)

# Training loop
start_time = time.time()
total_step = len(dataloader)
with tqdm(range(num_epochs)) as t:
    for epoch in t:  # Loop over the dataset multiple times
        train_loss = 0
        for step, (seq, quantitative ,label,_) in enumerate(dataloader):
            # Forward pass
            seq = seq.clone().detach().view(-1, window_size, input_size).to(device)
            quantitative = quantitative.clone().detach().view(-1, num_classes, input_size).to(device)
            output = model((seq, quantitative),device)
            loss = criterion(output, label.to(device))

            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            train_loss += loss.item()
            optimizer.step()

        scheduler.step()

        t.set_postfix({'_': 'Epoch [{}/{}], train_loss: {:.4f}'.format(epoch + 1, num_epochs, train_loss / total_step),
                      'learning_rate': str(scheduler.get_last_lr())})

# Summary of training
elapsed_time = time.time() - start_time
print('elapsed_time: {:.3f}s'.format(elapsed_time))
if not os.path.isdir(model_dir):
    os.makedirs(model_dir)
torch.save(model.state_dict(), model_dir + '/' + log + '.pt')
print('Finished Training')

Number of sessions(hadoop_train.txt): 6
Number of seqs(hadoop_train.txt): 8903


  0%|          | 0/370 [00:00<?, ?it/s]

elapsed_time: 188.000s
Finished Training


In [None]:
test_normal_loader,_ = generate('/content/drive/MyDrive/hadoop/hadoop_test_normal.txt',vocab_size= num_classes,
                                quantitative_features= True,need_sessions_number=True)

result1 = predict(model,test_normal_loader)
FP = len(result1)


test_abnormal_loader,length = generate('/content/drive/MyDrive/hadoop/hadoop_test_abnormal.txt',vocab_size= num_classes,
                                       need_sessions_number=True,quantitative_features = True)

result2 = predict(model,test_abnormal_loader)

TP = len(result2)
print(TP)
FN = length - TP
P = 100 * TP / (TP + FP)
R = 100 * TP / (TP + FN)
F1 = 2 * P * R / (P + R)
print('false positive (FP): {}, false negative (FN): {}, Precision: {:.3f}%, Recall: {:.3f}%, F1-measure: {:.3f}%'.format(FP, FN, P, R, F1))
print('Finished Predicting')

Number of sessions(hadoop_test_normal.txt): 5
Number of seqs(hadoop_test_normal.txt): 5834


  0%|          | 0/1 [00:00<?, ?it/s]

Number of sessions(hadoop_test_abnormal.txt): 43
Number of seqs(hadoop_test_abnormal.txt): 109400


  0%|          | 0/14 [00:00<?, ?it/s]

43
false positive (FP): 5, false negative (FN): 0, Precision: 89.583%, Recall: 100.000%, F1-measure: 94.505%
Finished Predicting
