# ADR (Anomaly Detection by workflow Relations)

ADR mines numerical relations from log data and uses the relations for anomaly detection.
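
To make "numerical relations" concrete, here is a minimal, hypothetical sketch. It looks for pairs of events whose counts are always equal in normal sessions (for example, an "open" event matched by a "close" event) and flags a session when such a relation is broken. The event names, the helpers `mine_equal_count_relations` and `violates`, and the restriction to pairwise equality are illustrative assumptions; this is not ADR's actual mining algorithm.

```python
from itertools import combinations


def mine_equal_count_relations(rows, columns):
    """Toy miner: return column pairs whose counts are equal in every row."""
    relations = []
    for i, j in combinations(range(len(columns)), 2):
        if all(row[i] == row[j] for row in rows):
            relations.append((columns[i], columns[j]))
    return relations


def violates(row, columns, relations):
    """Return True if the row breaks at least one mined relation."""
    index = {name: k for k, name in enumerate(columns)}
    return any(row[index[a]] != row[index[b]] for a, b in relations)


# Hypothetical event-count rows from normal sessions (one column per event).
columns = ["open", "close", "write"]
normal_rows = [
    [1, 1, 3],
    [2, 2, 0],
    [4, 4, 7],
]

relations = mine_equal_count_relations(normal_rows, columns)
print(relations)                                # [('open', 'close')]
print(violates([3, 3, 5], columns, relations))  # False: relation holds
print(violates([3, 2, 5], columns, relations))  # True: one "open" was never closed
```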

The following sections use the BGL logs as an example to demonstrate what ADR can do.

## parse raw logs to log events

For ease of presentation, the raw logs have already been parsed into structured log events by Drain <sup>[1]</sup>, and the resulting event-count matrices have been computed and saved in "_data.zip_".

## load datasets

Please unzip "_data.zip_" into the ADR folder before running the demo code.
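
As a rough illustration of what an event-count matrix holds, the sketch below counts how often each event template occurs in each log sequence. The event IDs (`E1`, `E2`, `E3`), the grouping into sequences, and the helper `build_event_count_matrix` are hypothetical; the matrices shipped in "_data.zip_" were produced by the authors' own pipeline, which may differ.

```python
from collections import Counter


def build_event_count_matrix(sequences):
    """Turn sequences of event IDs into a count matrix with one column per event."""
    columns = sorted({event for seq in sequences for event in seq})
    matrix = []
    for seq in sequences:
        counts = Counter(seq)  # missing events count as 0
        matrix.append([counts[c] for c in columns])
    return matrix, columns


# Hypothetical parsed log sequences, each a list of event IDs.
sequences = [
    ["E1", "E2", "E1"],
    ["E3"],
    ["E2", "E2", "E3"],
]

x, x_columns = build_event_count_matrix(sequences)
print(x_columns)  # ['E1', 'E2', 'E3']
print(x)          # [[2, 1, 0], [0, 0, 1], [0, 2, 1]]
```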

In [2]:
import numpy as np

log_paths = {
    'bgl': 'data/Drain_result/bgl/x_y_xColumns.npz',
    }

log_datasets = {}
for name, log_path in log_paths.items():
    log_datasets[name] = np.load(log_path, allow_pickle=True)
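
The loop above expects each `.npz` archive to expose the arrays `x`, `y` and `xColumns`. The sketch below writes and reads back a tiny archive with the same keys to show how the loaded object is accessed. The array contents and the temporary file name are made up, and the reading of `xColumns` as one event template per column of `x` is inferred from its name rather than confirmed by the source.

```python
import os
import tempfile

import numpy as np

# Made-up stand-ins for the real arrays.
x = np.array([[2, 1, 0], [0, 0, 1]])                     # event-count matrix
y = np.array([0, 1])                                     # one label per row
x_columns = np.array(["E1", "E2", "E3"], dtype=object)   # assumed column names

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_x_y_xColumns.npz")
    np.savez(path, x=x, y=y, xColumns=x_columns)

    # Object arrays are pickled, so allow_pickle=True is required to read them.
    with np.load(path, allow_pickle=True) as data:
        keys = sorted(data.files)
        x_loaded = data["x"]
        columns_loaded = data["xColumns"]

print(keys)                  # ['x', 'xColumns', 'y']
print(x_loaded.shape)        # (2, 3)
print(list(columns_loaded))  # ['E1', 'E2', 'E3']
```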

## sADR (semi-supervised, needs normal logs for training)

In [3]:
from ADR import preprocess
from ADR import sADR

train_numbers = [100, 150, 200, 250, 300, 350, 400, 450, 500]

for log_name, x_y_xColumns in log_datasets.items():
    print('='*30)
    print(log_name)
    x, y, xColumns = x_y_xColumns['x'], x_y_xColumns['y'], x_y_xColumns['xColumns']

    for i in range(len(train_numbers)):
        train_number = train_numbers[i]
        print(f'-----train number:{train_number}-----')
        if i == 0:
            # First round: take the initial training samples.
            x_train, y_train, x_test, y_test = preprocess.split_to_train_test_by_num(x, y, num_train=train_number)
        else:
            # Later rounds: draw only the additional samples and append them to the training set.
            x_train_adding, y_train_adding, x_test, y_test = preprocess.split_to_train_test_by_num(x, y, num_train=train_numbers[i]-train_numbers[i-1])
            x_train = np.concatenate((x_train, x_train_adding), axis=0)
            y_train = np.concatenate((y_train, y_train_adding), axis=0)

        # print(np.arange(x_train.shape[0]))
        model = sADR.sADR()
        model.fit(x_train, y_train)
        precision, recall, f1 = model.evaluate(x_train, y_train)
        print('Accuracy on training set:')
        print(f"precision, recall, f1: {[precision, recall, f1]}")

        precision, recall, f1 = model.evaluate(x_test, y_test)
        print('Accuracy on testing set:')
        print(f"precision, recall, f1: {[precision, recall, f1]}")

bgl
-----train number:100-----
Accuracy on training set:
precision, recall, f1: [0.9492, 1.0, 0.9739]
Accuracy on testing set:
precision, recall, f1: [0.8084, 1.0, 0.8941]
-----train number:150-----
Accuracy on training set:
precision, recall, f1: [0.9302, 1.0, 0.9639]
Accuracy on testing set:
precision, recall, f1: [0.8232, 1.0, 0.9031]
-----train number:200-----
Accuracy on training set:
precision, recall, f1: [0.9626, 1.0, 0.981]
Accuracy on testing set:
precision, recall, f1: [0.8294, 1.0, 0.9068]
-----train number:250-----
Accuracy on training set:
precision, recall, f1: [0.9407, 1.0, 0.9695]
Accuracy on testing set:
precision, recall, f1: [0.8224, 1.0, 0.9026]
-----train number:300-----
Accuracy on training set:
precision, recall, f1: [0.9506, 1.0, 0.9747]
Accuracy on testing set:
precision, recall, f1: [0.8434, 1.0, 0.915]
-----train number:350-----
Accuracy on training set:
precision, recall, f1: [0.9358, 1.0, 0.9669]
Accuracy on testing set:
precision, recall, f1: [0.8372, 1.0

## uADR (unsupervised, does not need labelled logs for training)

In [4]:
from ADR import preprocess

u_log_datasets_train_test = {}

u_train_ratios = {
    'bgl': 0.8
    }
for name, x_y_xColumns in log_datasets.items():
    if name in ['hdfs', 'bgl']:
        print("========")
        print(name)
        x, y, xColumns = x_y_xColumns['x'], x_y_xColumns['y'], x_y_xColumns['xColumns']
        print(y.sum()/y.size)
        print(f'x shape: {x.shape}')
        x_train, y_train, x_test, y_test = preprocess.split_to_train_test_by_ratio(x, y, train_ratio=u_train_ratios[name])
        u_log_datasets_train_test[name] = [x_train, y_train, x_test, y_test]
        print(f'x_train shape:{x_train.shape}')
        print(f'x_test shape:{x_test.shape}')

bgl
0.4530555074221683
x shape: (69252, 384)
x_train shape:(55401, 384)
x_test shape:(13851, 384)
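
The shapes printed above are consistent with a split that keeps `floor(0.8 * 69252) = 55401` rows for training and the remaining 13851 rows for testing. A minimal sketch of such a ratio-based split follows. `split_by_ratio` is a hypothetical stand-in; the real `preprocess.split_to_train_test_by_ratio` may shuffle or select rows differently.

```python
def split_by_ratio(x, y, train_ratio):
    """Keep the first floor(len(x) * train_ratio) rows for training, the rest for testing."""
    num_train = int(len(x) * train_ratio)
    return x[:num_train], y[:num_train], x[num_train:], y[num_train:]


# Dummy data with the same number of rows as the BGL matrix shown above.
x = list(range(69252))
y = [0] * 69252

x_train, y_train, x_test, y_test = split_by_ratio(x, y, train_ratio=0.8)
print(len(x_train), len(x_test))  # 55401 13851
```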


In [18]:
from ADR import uADR

log_name = 'bgl'
estimated_pN = 0.5

print('='*30)
print(log_name)
print(f'estimated_pN: {estimated_pN}')
x_train, y_train, x_test, y_test = u_log_datasets_train_test[log_name]

model = uADR.uADR(AN_ratio=1-estimated_pN, nrows_per_sample=10, nrounds=100)
model.fit(x_train)
precision, recall, f1 = model.evaluate(x_train, y_train)
print('Accuracy on training set:')
print(f"precision, recall, f1: {[precision, recall, f1]}")

precision, recall, f1 = model.evaluate(x_test, y_test)
print('Accuracy on testing set:')
print(f"precision, recall, f1: {[precision, recall, f1]}")

=====bgl=====
-----0.5-----
Accuracy on training set:
precision, recall, f1: [0.6581, 1.0, 0.7938]
Accuracy on testing set:
precision, recall, f1: [0.6625, 1.0, 0.797]
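
For reference, precision, recall and F1 follow the standard definitions over true positives (TP), false positives (FP) and false negatives (FN). The sketch below is a generic implementation, not the code behind `model.evaluate`; the label vectors are made up, and treating label `1` as "anomaly" is an assumption.

```python
def precision_recall_f1(y_true, y_pred):
    """Standard binary metrics, with label 1 assumed to mean 'anomaly'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels: both anomalies are caught, one normal sample is flagged by mistake.
y_true = [1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.6667 1.0 0.8
```

A recall of 1.0 with lower precision, as in the results above, means that no anomaly was missed but some normal samples were flagged. Plugging the reported test-set precision 0.6625 and recall 1.0 into the F1 formula gives about 0.797, which matches the printed value.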


## References

[1] P. He, J. Zhu, Z. Zheng, and M. R. Lyu, “Drain: An Online Log Parsing Approach with Fixed Depth Tree,” in 2017 IEEE International Conference on Web Services (ICWS), Jun. 2017, pp. 33–40, doi: 10.1109/ICWS.2017.13.