# ADR (Anomaly Detection by workflow Relations)

ADR mines numerical relations from log data and uses the relations for anomaly detection.
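
To make "numerical relation" concrete, here is a minimal sketch, assuming a relation is a linear equation over event counts that holds for every normal session (for example, every "open" event is matched by a "close" event). Such equations can be recovered as the null space of an event count matrix. The event names, the toy matrix, and the helpers `mine_relations` and `violates` are hypothetical illustrations, not part of the ADR package API.

```python
import numpy as np

def mine_relations(ecm, tol=1e-8):
    """Return a basis of vectors w with ecm @ w ~ 0 (null space via SVD)."""
    _, s, vt = np.linalg.svd(ecm, full_matrices=True)
    rank = int((s > tol * s.max()).sum())
    return vt[rank:].T  # shape: (n_events, n_relations)

def violates(relations, counts, tol=1e-6):
    """Flag each count vector that breaks at least one mined relation."""
    counts = np.atleast_2d(counts)
    if relations.shape[1] == 0:  # nothing was mined, so nothing can be violated
        return np.zeros(counts.shape[0], dtype=bool)
    return np.abs(counts @ relations).max(axis=1) > tol

# Toy sessions; columns are hypothetical events: open, close, read
normal = np.array([
    [1, 1, 3],
    [2, 2, 0],
    [3, 3, 5],
    [1, 1, 1],
])
relations = mine_relations(normal)  # one relation: open - close = 0 (up to scale)

test = np.array([
    [4, 4, 2],  # every open has a matching close
    [2, 1, 7],  # one open was never closed
])
flags = violates(relations, test)
print(flags)
```

In this toy case the only mined relation ties the open and close counts together, so the second test session is the one that gets flagged.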

The following sections use the BGL logs as an example to show what ADR can do.

## Parse raw logs into log events

The example raw logs are in "_data/BGL_2k.log_".

For ease of presentation, the raw logs have already been parsed into structured log events by Drain <sup>[1]</sup>, and the parsed results are in the "_data/Drain_result/bgl_" folder. The file "_BGL_2k.log_structured.csv_" contains the parsed structured logs, and the file "_BGL_2k.log_templates.csv_" contains the log templates (events).

## Build the event count matrix

In [5]:
from ADR import preprocess
import pandas as pd


log_path = 'data/Drain_result/bgl/BGL_2k.log_structured.csv'
template_path = 'data/Drain_result/bgl/BGL_2k.log_templates.csv'

df_log = pd.read_csv(log_path, sep=',', header=0)
eventID_list = pd.read_csv(template_path, sep=',', header=0)['EventId'].tolist()
# In BGL, a Label of "-" marks a normal line; any other value marks an alert
df_log["bLabel"] = df_log["Label"] != "-"

seq_df, seq_ecm_df = preprocess.event_sequence_by_identifier(df_log, col_identifier='Node', col_EventId='EventId', col_bLabel='bLabel')

seq_ecm_df = seq_ecm_df.fillna(0).astype(int)

In [12]:
# print the session sequences and the events of each session
seq_df

| | bLabel | seq_EventId | seq_LineId | seq_bLabel |
|---|---|---|---|---|
| R00-M0-N0-C:J10-U01 | 0 | [E7] | [1873] | [False] |
| R00-M0-N0-C:J13-U11 | 0 | [E76] | [199] | [False] |
| R00-M0-N1-C:J07-U01 | 0 | [E67] | [730] | [False] |
| R00-M0-N1-C:J09-U11 | 0 | [E67] | [412] | [False] |
| R00-M0-N2 | 0 | [E91] | [1203] | [False] |
| ... | ... | ... | ... | ... |
| R77-M0-NC-I:J18-U11 | 0 | [E37] | [1411] | [False] |
| R77-M1-N7-C:J08-U11 | 0 | [E12] | [1893] | [False] |
| R77-M1-NC-I:J18-U01 | 0 | [E40] | [1987] | [False] |
| R77-M1-NF-C:J02-U11 | 0 | [E3] | [1332] | [False] |


In [13]:
# print the event count matrix
seq_ecm_df

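| | E7 | E76 | E67 | E91 | E101 | E2 | E4 | E109 | E12 | E15 | ... | E86 | E27 | E65 | E35 | E36 | E40 | E87 | E11 | E73 | E88 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| R00-M0-N0-C:J10-U01 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| R00-M0-N0-C:J13-U11 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| R00-M0-N1-C:J07-U01 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| R00-M0-N1-C:J09-U11 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| R00-M0-N2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| R77-M0-NC-I:J18-U11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| R77-M1-N7-C:J08-U11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| R77-M1-NC-I:J18-U01 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| R77-M1-NF-C:J02-U11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |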
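
The event count matrix has one row per identifier (here, the `Node` column) and one column per `EventId`, and each cell holds how often that event occurred for that identifier. A minimal sketch of the same idea in plain pandas follows. It assumes only the `Node` and `EventId` column names from the structured log; the toy rows are made up, and this is not necessarily how `preprocess.event_sequence_by_identifier` builds the matrix internally.

```python
import pandas as pd

# Hypothetical structured log lines (one row per log message)
toy_log = pd.DataFrame({
    "Node":    ["n1", "n1", "n2", "n1", "n2"],
    "EventId": ["E1", "E2", "E1", "E1", "E3"],
})

# Count how often each EventId occurs per Node; missing pairs become 0
toy_ecm = pd.crosstab(toy_log["Node"], toy_log["EventId"])
print(toy_ecm)
```

`pd.crosstab` fills absent combinations with 0 directly, so no separate `fillna(0)` step is needed in this sketch.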


## Load datasets

The 2k-line example log is too small for anomaly detection, so the demo uses the event count matrix of the whole BGL dataset.

In [2]:
import numpy as np

log_paths = {
    'bgl': 'data/Drain_result/bgl/x_y_xColumns.npz',
    }

log_datasets = {}
for name, log_path in log_paths.items():
    log_datasets[name] = np.load(log_path, allow_pickle=True)
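
The later cells read the loaded archive through the keys `x`, `y`, and `xColumns`. As a rough sketch, a file with that layout could be written and read back with NumPy as shown here. The toy arrays, the interpretation of each key (`x` as the event count matrix, `y` as per-row labels, `xColumns` as the event ID of each column), and the temporary path are assumptions for illustration.

```python
import os
import tempfile

import numpy as np

x = np.array([[1, 0, 2], [0, 1, 0]])                     # event counts, one row per session
y = np.array([0, 1])                                     # assumed: 1 marks an anomalous session
x_columns = np.array(["E1", "E2", "E3"], dtype=object)   # assumed: event ID of each column

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "x_y_xColumns.npz")
    np.savez(path, x=x, y=y, xColumns=x_columns)

    # Object arrays are pickled inside the archive, hence allow_pickle=True
    with np.load(path, allow_pickle=True) as data:
        keys = sorted(data.files)
        x_loaded = data["x"]
        cols_loaded = data["xColumns"].tolist()

print(keys, x_loaded.shape, cols_loaded)
```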

## sADR (semi-supervised, needs normal logs for training)

In [3]:
from ADR import preprocess
from ADR import sADR

train_numbers = [100, 150, 200, 250, 300, 350, 400, 450, 500]

for log_name, x_y_xColumns in log_datasets.items():
    print('='*30)
    print(log_name)
    x, y, xColumns = x_y_xColumns['x'], x_y_xColumns['y'], x_y_xColumns['xColumns']

    for i in range(len(train_numbers)):
        train_number = train_numbers[i]
        print(f'-----train number:{train_number}-----')
        if i == 0:
            x_train, y_train, x_test, y_test = preprocess.split_to_train_test_by_num(x, y, num_train=train_number)
        else:
            x_train_adding, y_train_adding, x_test, y_test = preprocess.split_to_train_test_by_num(x, y, num_train=train_numbers[i]-train_numbers[i-1])
            x_train = np.concatenate((x_train, x_train_adding), axis=0)
            y_train = np.concatenate((y_train, y_train_adding), axis=0)

        # print(np.arange(x_train.shape[0]))
        model = sADR.sADR()
        model.fit(x_train, y_train)
        precision, recall, f1 = model.evaluate(x_train, y_train)
        print('Accuracy on training set:')
        print(f"precision, recall, f1: {[precision, recall, f1]}")

        precision, recall, f1 = model.evaluate(x_test, y_test)
        print('Accuracy on testing set:')
        print(f"precision, recall, f1: {[precision, recall, f1]}")

bgl
-----train number:100-----
Accuracy on training set:
precision, recall, f1: [0.9492, 1.0, 0.9739]
Accuracy on testing set:
precision, recall, f1: [0.8084, 1.0, 0.8941]
-----train number:150-----
Accuracy on training set:
precision, recall, f1: [0.9302, 1.0, 0.9639]
Accuracy on testing set:
precision, recall, f1: [0.8232, 1.0, 0.9031]
-----train number:200-----
Accuracy on training set:
precision, recall, f1: [0.9626, 1.0, 0.981]
Accuracy on testing set:
precision, recall, f1: [0.8294, 1.0, 0.9068]
-----train number:250-----
Accuracy on training set:
precision, recall, f1: [0.9407, 1.0, 0.9695]
Accuracy on testing set:
precision, recall, f1: [0.8224, 1.0, 0.9026]
-----train number:300-----
Accuracy on training set:
precision, recall, f1: [0.9506, 1.0, 0.9747]
Accuracy on testing set:
precision, recall, f1: [0.8434, 1.0, 0.915]
-----train number:350-----
Accuracy on training set:
precision, recall, f1: [0.9358, 1.0, 0.9669]
Accuracy on testing set:
precision, recall, f1: [0.8372, 1.0
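
The three numbers printed for each run are precision, recall, and F1 score. A minimal sketch of how they follow from binary labels and predictions is given here, assuming 1 marks an anomaly. The helper `precision_recall_f1` is hypothetical, and `model.evaluate` may compute or round its values differently.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (truthy = anomaly)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = int(np.sum(y_true & y_pred))    # anomalies correctly flagged
    fp = int(np.sum(~y_true & y_pred))   # normal sessions flagged by mistake
    fn = int(np.sum(y_true & ~y_pred))   # anomalies that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Two real anomalies, both caught, plus one false alarm
p, r, f1 = precision_recall_f1([1, 1, 0, 0, 0], [1, 1, 1, 0, 0])
print(round(p, 4), round(r, 4), round(f1, 4))
```

A recall of 1.0 combined with a precision below 1.0, as in the runs above, means every anomaly was flagged at the cost of some false alarms.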

## uADR (unsupervised, does not need labelled logs for training)

In [4]:
from ADR import preprocess

u_log_datasets_train_test = {}

u_train_ratios = {
    'bgl': 0.8
    }
for name, x_y_xColumns in log_datasets.items():
    if name in ['hdfs', 'bgl']:
        print("========")
        print(name)
        x, y, xColumns = x_y_xColumns['x'], x_y_xColumns['y'], x_y_xColumns['xColumns']
        print(y.sum()/y.size)
        print(f'x shape: {x.shape}')
        x_train, y_train, x_test, y_test = preprocess.split_to_train_test_by_ratio(x, y, train_ratio=u_train_ratios[name])
        u_log_datasets_train_test[name] = [x_train, y_train, x_test, y_test]
        print(f'x_train shape:{x_train.shape}')
        print(f'x_test shape:{x_test.shape}')

bgl
0.4530555074221683
x shape: (69252, 384)
x_train shape:(55401, 384)
x_test shape:(13851, 384)
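
The shapes printed above are consistent with a training set of `int(n_rows * train_ratio)` rows and a test set holding the remainder. A minimal sketch under that assumption follows. The helper `split_by_ratio` is hypothetical, it simply takes the leading rows, and whether `preprocess.split_to_train_test_by_ratio` shuffles or stratifies the data is not shown in this document.

```python
import numpy as np

def split_by_ratio(x, y, train_ratio):
    """Put the first int(n * train_ratio) rows in the training set, the rest in the test set."""
    n_train = int(x.shape[0] * train_ratio)
    return x[:n_train], y[:n_train], x[n_train:], y[n_train:]

# Placeholder arrays with the same shape as the BGL event count matrix
x = np.zeros((69252, 384), dtype=np.int8)
y = np.zeros(69252, dtype=np.int8)

x_train, y_train, x_test, y_test = split_by_ratio(x, y, 0.8)
print(x_train.shape, x_test.shape)  # (55401, 384) (13851, 384)
```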


In [18]:
from ADR import uADR

log_name = 'bgl'
estimated_pN = 0.5

print('='*30)
print(log_name)
print(f'estimated_pN: {estimated_pN}')
x_train, y_train, x_test, y_test = u_log_datasets_train_test[log_name]

model = uADR.uADR(AN_ratio=1-estimated_pN, nrows_per_sample=10, nrounds=100)
model.fit(x_train)
precision, recall, f1 = model.evaluate(x_train, y_train)
print('Accuracy on training set:')
print(f"precision, recall, f1: {[precision, recall, f1]}")

precision, recall, f1 = model.evaluate(x_test, y_test)
print('Accuracy on testing set:')
print(f"precision, recall, f1: {[precision, recall, f1]}")

=====bgl=====
-----0.5-----
Accuracy on training set:
precision, recall, f1: [0.6581, 1.0, 0.7938]
Accuracy on testing set:
precision, recall, f1: [0.6625, 1.0, 0.797]


## References

[1] P. He, J. Zhu, Z. Zheng, and M. R. Lyu, “Drain: An Online Log Parsing Approach with Fixed Depth Tree,” in 2017 IEEE International Conference on Web Services (ICWS), Jun. 2017, pp. 33–40, doi: 10.1109/ICWS.2017.13.