# Anomaly Detection in OPA Decision Logs

[Open Policy Agent](https://www.openpolicyagent.org/) (OPA) can periodically upload its decision logs to a remote server. This mechanism is described under the OPA [Decision Log Service API](https://www.openpolicyagent.org/docs/latest/management/#decision-logs). Decisions are batched and submitted to the log receiver as a gzip-compressed JSON array, in which each element corresponds to one decision.
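
As a rough illustration of the payload format described above, the following sketch builds a gzip-compressed JSON array and decodes it the way a log receiver might. The `decode_decision_batch` helper and the sample decisions are hypothetical and are not part of OPA; only the gzip-plus-JSON-array framing comes from the description above.

```python
import gzip
import json


def decode_decision_batch(body: bytes) -> list:
    """Decompress a gzip-compressed request body and parse it as a JSON array."""
    decisions = json.loads(gzip.decompress(body))
    if not isinstance(decisions, list):
        raise ValueError("expected a JSON array of decisions")
    return decisions


# Hypothetical batch of two decisions, compressed the way a sender might.
sample_batch = [
    {"decision_id": "example-1", "path": "http/example/authz/allow", "result": True},
    {"decision_id": "example-2", "path": "http/example/authz/allow", "result": False},
]
payload = gzip.compress(json.dumps(sample_batch).encode("utf-8"))

decoded = decode_decision_batch(payload)
print(len(decoded), [d["result"] for d in decoded])
```

Each element of the decoded list is one decision, so a batch can be appended directly to the file that the parsing step reads.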

Which features are extracted, and how anomalies are detected, depends on the format of the input payload and of the decision results. The simple example below detects access violations for an HTTP API whose policy returns a true/false authorization decision.

The anomaly detection model itself is taken from the [loglizer](https://github.com/logpai/loglizer) log-analysis framework.

## Log Parsing and Structuring

The first step is to parse the incoming logs and to prepare a structured CSV:

In [15]:
import json
import pandas as pd

# Read the decision array from the JSON file
with open('data/opa-decision-logs.json') as log_file:
    decisions = json.load(log_file)

# Flatten the nested decisions into a pandas DataFrame
df = pd.json_normalize(decisions)

# Drop unnecessary columns
df = df.drop(['labels.id', 'bundles.authz.revision'], axis=1)

df.head()

| | decision_id | path | result | requested_by | timestamp | labels.app | labels.version | input.method | input.path |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 4ca636c1-55e4-417a-b1d8-4aceb67960d1 | http/example/authz/allow | True | [::1]:59943 | 2020-01-01T09:00:00.000000Z | my-example-app | latest | GET | /salary/bob |
| 1 | 4ca123c1-55e4-417a-b1d8-4aceb67960d1 | http/example/authz/allow | True | [::1]:59943 | 2020-01-01T09:01:00.000000Z | my-example-app | latest | GET | /salary/karen |
| 2 | 6ba123c1-22e4-817c-f1d8-4aceb67960d1 | http/example/authz/allow | True | [::1]:59943 | 2020-01-01T09:01:01.000000Z | my-example-app | latest | GET | /salary/karen |
| 3 | 9bd456c1-55e4-417a-b1d8-4aceb67960d1 | http/example/authz/allow | False | [::1]:59943 | 2020-01-01T09:01:02.000000Z | my-example-app | latest | GET | /salary/paul |
| 4 | 4ca123c1-55e4-417a-b1d8-4aceb67960d1 | http/example/authz/allow | True | [::1]:59943 | 2020-01-01T09:01:03.000000Z | my-example-app | latest | GET | /salary/bob |
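
The dotted column names above (`labels.app`, `input.path`) come from flattening nested JSON objects. The sketch below shows that flattening on a single record. The nested layout is inferred from the column names, and the values given for `labels.id` and `bundles.authz.revision` are placeholders, since the real values are not shown in this document.

```python
import pandas as pd

# Hypothetical decision record: the nested layout and the values of
# labels.id and bundles.authz.revision are assumptions for illustration.
decision = {
    "labels": {"id": "example-instance-id", "app": "my-example-app", "version": "latest"},
    "decision_id": "4ca636c1-55e4-417a-b1d8-4aceb67960d1",
    "bundles": {"authz": {"revision": "example-revision"}},
    "path": "http/example/authz/allow",
    "input": {"method": "GET", "path": "/salary/bob"},
    "result": True,
    "requested_by": "[::1]:59943",
    "timestamp": "2020-01-01T09:00:00.000000Z",
}

# Nested keys are joined with "." to form flat column names.
flat = pd.json_normalize([decision])
print(sorted(flat.columns))
```

Columns such as `labels.id` exist only after this flattening, which is why they can be dropped by their dotted names.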


With the dataframe now structured, it can be written to a CSV file:

In [21]:
# Convert dataframe to CSV
df.to_csv('opa-decision-logs-structured.csv')

## Labelling Anomalies

A separate label file records which decisions are anomalies. This file is now read in and used to label the dataframe:

In [18]:
# Read in the anomaly labels, keyed by decision ID
label_data = pd.read_csv('data/anomalies.csv', engine='c', na_filter=False, memory_map=True)
label_data = label_data.set_index('decision_id')
label_dict = label_data['label'].to_dict()

# Label each decision: 1 for 'Anomaly', 0 otherwise (including IDs missing from the label file)
df['label'] = df['decision_id'].apply(lambda x: 1 if label_dict.get(x) == 'Anomaly' else 0)

df.head()

| | decision_id | path | result | requested_by | timestamp | labels.app | labels.version | input.method | input.path | label |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4ca636c1-55e4-417a-b1d8-4aceb67960d1 | http/example/authz/allow | True | [::1]:59943 | 2020-01-01T09:00:00.000000Z | my-example-app | latest | GET | /salary/bob | 0 |
| 1 | 4ca123c1-55e4-417a-b1d8-4aceb67960d1 | http/example/authz/allow | True | [::1]:59943 | 2020-01-01T09:01:00.000000Z | my-example-app | latest | GET | /salary/karen | 0 |
| 2 | 6ba123c1-22e4-817c-f1d8-4aceb67960d1 | http/example/authz/allow | True | [::1]:59943 | 2020-01-01T09:01:01.000000Z | my-example-app | latest | GET | /salary/karen | 0 |
| 3 | 9bd456c1-55e4-417a-b1d8-4aceb67960d1 | http/example/authz/allow | False | [::1]:59943 | 2020-01-01T09:01:02.000000Z | my-example-app | latest | GET | /salary/paul | 1 |
| 4 | 4ca123c1-55e4-417a-b1d8-4aceb67960d1 | http/example/authz/allow | True | [::1]:59943 | 2020-01-01T09:01:03.000000Z | my-example-app | latest | GET | /salary/bob | 0 |
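
The labelling step above can also be written with `Series.map`, which avoids a per-row lambda. This is a minimal sketch on made-up IDs and labels. It assumes that a decision missing from the label file should count as normal, which may not suit every dataset.

```python
import pandas as pd

# Hypothetical decisions and label table; the IDs and labels are made up.
decisions = pd.DataFrame({"decision_id": ["a", "b", "c"]})
labels = pd.DataFrame({"decision_id": ["a", "b"], "label": ["Normal", "Anomaly"]})

# Map each decision ID to its textual label; unlisted IDs become NaN.
label_by_id = labels.set_index("decision_id")["label"]

# Only an exact 'Anomaly' match is labelled 1; NaN compares unequal and becomes 0.
decisions["label"] = (decisions["decision_id"].map(label_by_id) == "Anomaly").astype(int)

print(decisions["label"].tolist())
```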


## Splitting Training and Test Data

The dataframe is then split into training and test sets in preparation for fitting the loglizer model:

In [19]:
from loglizer import dataloader

# Split train and test data
(x_train, y_train), (x_test, y_test) = dataloader._split_data(df['result'].values,
                                                              df['label'].values, 0.5, 'uniform')

print(y_train.sum(), y_test.sum())

1 2


The dataset can now be summarized:

In [20]:
num_train = x_train.shape[0]
num_test = x_test.shape[0]
num_total = num_train + num_test
num_train_pos = sum(y_train)
num_test_pos = sum(y_test)
num_pos = num_train_pos + num_test_pos

print(f'Total: {num_total} instances, {num_pos} anomaly, {num_total - num_pos} normal')
print(f'Train: {num_train} instances, {num_train_pos} anomaly, {num_train - num_train_pos} normal')
print(f'Test: {num_test} instances, {num_test_pos} anomaly, {num_test - num_test_pos} normal\n')

Total: 11 instances, 3 anomaly, 8 normal
Train: 5 instances, 1 anomaly, 4 normal
Test: 6 instances, 2 anomaly, 4 normal
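
The counts above are consistent with a split that takes the same fraction of each class for training. The sketch below reproduces that arithmetic in plain Python. It is not loglizer's implementation: the `uniform_split` helper is hypothetical, it assumes each class is truncated to `int(ratio * n)` items, and it omits any shuffling.

```python
def uniform_split(x, y, train_ratio):
    """Split x and y so each class contributes train_ratio of its items to training."""
    pos = [i for i, label in enumerate(y) if label > 0]
    neg = [i for i, label in enumerate(y) if label <= 0]
    # Truncate toward zero, so 3 anomalies at ratio 0.5 yield 1 training anomaly.
    n_pos = int(train_ratio * len(pos))
    n_neg = int(train_ratio * len(neg))
    train_idx = pos[:n_pos] + neg[:n_neg]
    test_idx = pos[n_pos:] + neg[n_neg:]
    train = ([x[i] for i in train_idx], [y[i] for i in train_idx])
    test = ([x[i] for i in test_idx], [y[i] for i in test_idx])
    return train, test


# Hypothetical data with the same class balance as above: 3 anomalies, 8 normal.
ys = [1, 1, 1] + [0] * 8
xs = [label == 0 for label in ys]  # stand-in for the allow/deny results
(xs_train, ys_train), (xs_test, ys_test) = uniform_split(xs, ys, 0.5)

print(len(ys_train), sum(ys_train))  # 5 1
print(len(ys_test), sum(ys_test))    # 6 2
```

With 3 anomalies and 8 normal decisions at a ratio of 0.5, the training set receives `int(1.5) = 1` anomaly and `int(4.0) = 4` normal decisions, and the remaining 6 decisions form the test set.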



## SVM Model Training and Validation

With the data split, we can now fit the loglizer SVM model to the training data and validate it against the test split:

In [22]:
from loglizer import preprocessing
from loglizer.models import SVM

feature_extractor = preprocessing.FeatureExtractor()
x_train = feature_extractor.fit_transform(x_train, term_weighting='tf-idf')
x_test = feature_extractor.transform(x_test)

model = SVM()
model.fit(x_train, y_train)

print('Train validation:')
model.evaluate(x_train, y_train)

print('Test validation:')
model.evaluate(x_test, y_test)

Train data shape: 5-by-8

Test data shape: 6-by-8

Train validation:
Precision: 1.000, recall: 1.000, F1-measure: 1.000

Test validation:
Precision: 1.000, recall: 1.000, F1-measure: 1.000



(1.0, 1.0, 1.0)
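
For reference, the precision, recall and F1-measure reported above follow the standard definitions based on true positives, false positives and false negatives. The sketch below computes them in plain Python. The `precision_recall_f1` helper and the sample labels are hypothetical, and how loglizer computes the figures internally is not shown in this document.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive (anomaly) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: two anomalies, one detected, plus one false alarm.
y_true = [0, 0, 1, 1, 0, 0]
y_pred = [0, 1, 1, 0, 0, 0]
print(precision_recall_f1(y_true, y_pred))  # (0.5, 0.5, 0.5)

# A perfect prediction gives 1.0 for all three measures.
print(precision_recall_f1(y_true, y_true))  # (1.0, 1.0, 1.0)
```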

## References

+ Shilin He, Jieming Zhu, Pinjia He, Michael R. Lyu. [Experience Report: System Log Analysis for Anomaly Detection](https://jiemingzhu.github.io/pub/slhe_issre2016.pdf), *IEEE International Symposium on Software Reliability Engineering (ISSRE)*, 2016. [[Bibtex](https://dblp.org/rec/bibtex/conf/issre/HeZHL16)] [[Chinese version](https://github.com/AmateurEvents/article/issues/2)]