### Log Anomaly Detection on BGL dataset using LogBERT model
This is a running example of an end-to-end Log Anomaly Detection workflow on the public BGL dataset using the LogBERT model.

There are similar workflows on the BGL dataset using other anomaly detectors (such as the LSTM-based one in `bgl_lstm_unsupervised_parsed_sequential.ipynb`).

The workflow script is identical in these cases, except that in the LogBERT case we skip the log-parsing step. This simply follows past literature; the LogAI library itself imposes no such restriction.


Also check out the other config files in this directory, which cater to other datasets (HDFS) or to other experimental settings (parsing or non-parsing, sliding-window or session-window log partitioning, sequential or semantic log feature representations, supervised or unsupervised training, LSTM/CNN/Transformer/BERT models).

To use these experimental configs, you only need to point to the corresponding config file; the same workflow code should work for them.

Only when changing the dataset (e.g. from BGL to HDFS) do you need to change more than the config.yaml file: you must also use the HDFSPreprocessor in the preprocessing step. Note that each custom dataset that is added should have its own Preprocessor class (which should inherit from logai.preprocess.preprocessor.Preprocessor).
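
The dataset-to-preprocessor pairing described above can be sketched as a small registry. This is only an illustration: the `*Stub` classes, `PREPROCESSORS`, and `make_preprocessor` are hypothetical names that stand in for the LogAI classes so the snippet runs without the library installed. In the real workflow you would register `BGLPreprocessor`, `HDFSPreprocessor`, or your own subclass of the LogAI base class instead.

```python
# Stand-ins so the sketch runs without LogAI installed. In the real workflow the
# base class is logai.preprocess.preprocessor.Preprocessor; these stubs are
# hypothetical and do not reproduce its actual interface in full.
class BasePreprocessorStub:
    def __init__(self, config):
        self.config = config

    def clean_log(self, logrecord):
        # A real preprocessor would clean and normalize the log lines here.
        return logrecord


class BGLPreprocessorStub(BasePreprocessorStub):
    pass


class HDFSPreprocessorStub(BasePreprocessorStub):
    pass


# Hypothetical registry: one preprocessor class per dataset name.
PREPROCESSORS = {"bgl": BGLPreprocessorStub, "hdfs": HDFSPreprocessorStub}


def make_preprocessor(dataset_name, config):
    """Return an instance of the preprocessor registered for dataset_name."""
    key = dataset_name.lower()
    if key not in PREPROCESSORS:
        raise ValueError(f"no preprocessor registered for dataset {dataset_name!r}")
    return PREPROCESSORS[key](config)


preprocessor = make_preprocessor("HDFS", config={})
print(type(preprocessor).__name__)
```

With a registry like this, switching datasets means changing the config file path and the dataset name, while the rest of the workflow code stays the same.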

For a more complete explanation of each step of the workflow, see the `hdfs_lstm_unsupervised_parsed_sequential.ipynb` notebook.


In [1]:
import os 
from logai.applications.openset.anomaly_detection.openset_anomaly_detection_workflow import OpenSetADWorkflowConfig, validate_config_dict
from logai.utils.file_utils import read_file
from logai.utils.dataset_utils import split_train_dev_test_for_anomaly_detection
import logging 
from logai.dataloader.data_loader import FileDataLoader
from logai.preprocess.bgl_preprocessor import BGLPreprocessor
from logai.information_extraction.log_parser import LogParser
from logai.preprocess.openset_partitioner import OpenSetPartitioner
from logai.analysis.nn_anomaly_detector import NNAnomalyDetector
from logai.information_extraction.log_vectorizer import LogVectorizer
from logai.utils import constants

In [2]:
# Load the YAML config and build the workflow config object from its
# "workflow_config" section.
config_path = "configs/bgl_logbert_config.yaml"
config_parsed = read_file(config_path)
config_dict = config_parsed["workflow_config"]
config = OpenSetADWorkflowConfig.from_dict(config_dict)

In [3]:
dataloader = FileDataLoader(config.data_loader_config)
logrecord = dataloader.load_data()
print (logrecord.body[constants.LOGLINE_NAME])

0         RAS KERNEL INFO instruction cache parity error...
1         RAS KERNEL INFO instruction cache parity error...
2         RAS KERNEL INFO instruction cache parity error...
3         RAS KERNEL INFO instruction cache parity error...
4         RAS KERNEL INFO instruction cache parity error...
                                ...                        
358455    RAS KERNEL FATAL idoproxy communication failur...
358456    RAS KERNEL FATAL idoproxy communication failur...
358457    RAS KERNEL FATAL idoproxy communication failur...
358458    RAS KERNEL FATAL idoproxy communication failur...
358459    RAS KERNEL FATAL idoproxy communication failur...
Name: logline, Length: 358460, dtype: object


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  selected[constants.LOG_TIMESTAMPS] = pd.to_datetime(


In [4]:
preprocessor = BGLPreprocessor(config.preprocessor_config)
preprocessed_filepath = os.path.join(config.output_dir, 'BGL_11k_processed.csv')            
logrecord = preprocessor.clean_log(logrecord)
logrecord.save_to_csv(preprocessed_filepath)
print (logrecord.body[constants.LOGLINE_NAME])

0         RAS KERNEL INFO instruction cache parity error...
1         RAS KERNEL INFO instruction cache parity error...
2         RAS KERNEL INFO instruction cache parity error...
3         RAS KERNEL INFO instruction cache parity error...
4         RAS KERNEL INFO instruction cache parity error...
                                ...                        
358455    RAS KERNEL FATAL idoproxy communication failur...
358456    RAS KERNEL FATAL idoproxy communication failur...
358457    RAS KERNEL FATAL idoproxy communication failur...
358458    RAS KERNEL FATAL idoproxy communication failur...
358459    RAS KERNEL FATAL idoproxy communication failur...
Name: logline, Length: 358460, dtype: object


In [5]:
partitioner = OpenSetPartitioner(config.open_set_partitioner_config)
partitioned_filepath = os.path.join(config.output_dir, 'BGL_11k_nonparsed_session.csv')
logrecord = partitioner.partition(logrecord)
logrecord.save_to_csv(partitioned_filepath)
print (logrecord.body[constants.LOGLINE_NAME])

0       RAS KERNEL INFO instruction cache parity error...
1       RAS KERNEL INFO instruction cache parity error...
2       RAS KERNEL INFO instruction cache parity error...
3       RAS KERNEL INFO instruction cache parity error...
4       RAS KERNEL INFO instruction cache parity error...
                              ...                        
1848    RAS APP FATAL ciod Error reading message prefi...
1849    RAS KERNEL FATAL Lustre mount FAILED ALPHANUM ...
1850    RAS KERNEL FATAL Lustre mount FAILED ALPHANUM ...
1851    RAS KERNEL FATAL Lustre mount FAILED ALPHANUM ...
1852    RAS KERNEL FATAL idoproxy communication failur...
Name: logline, Length: 1853, dtype: object


In [6]:
train_filepath = os.path.join(config.output_dir, 'BGL_11k_nonparsed_session_unsupervised_train.csv')
dev_filepath = os.path.join(config.output_dir, 'BGL_11k_nonparsed_session_unsupervised_dev.csv')
test_filepath = os.path.join(config.output_dir, 'BGL_11k_nonparsed_session_unsupervised_test.csv')

(train_data, dev_data, test_data) = split_train_dev_test_for_anomaly_detection(
    logrecord,
    training_type=config.training_type,
    test_data_frac_neg_class=config.test_data_frac_neg,
    test_data_frac_pos_class=config.test_data_frac_pos,
    shuffle=config.train_test_shuffle,
)

train_data.save_to_csv(train_filepath)
dev_data.save_to_csv(dev_filepath)
test_data.save_to_csv(test_filepath)
print ('Train/Dev/Test Anomalous', len(train_data.labels[train_data.labels[constants.LABELS]==1]), 
                                   len(dev_data.labels[dev_data.labels[constants.LABELS]==1]), 
                                   len(test_data.labels[test_data.labels[constants.LABELS]==1]))
print ('Train/Dev/Test Normal', len(train_data.labels[train_data.labels[constants.LABELS]==0]), 
                                   len(dev_data.labels[dev_data.labels[constants.LABELS]==0]), 
                                   len(test_data.labels[test_data.labels[constants.LABELS]==0]))

indices_train/dev/test:  32 4 1817
Train/Dev/Test Anomalous 0 0 1808
Train/Dev/Test Normal 32 4 9
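
The counts above show the shape of an unsupervised split: train and dev contain only normal sessions, while every anomalous session goes to the test set. A minimal sketch of one way such a split could be done is given below. It is not the LogAI implementation of `split_train_dev_test_for_anomaly_detection`; the function name `split_unsupervised` and the fractions `test_frac_neg=0.2` and `dev_frac=0.1` are assumptions, chosen here only because they reproduce the counts printed above.

```python
import random


def split_unsupervised(labels, test_frac_neg=0.2, dev_frac=0.1, seed=0):
    """Split indices so that train/dev hold only normal (label 0) samples.

    All anomalous (label 1) samples, plus a held-out fraction of the normal
    ones, go to the test set. The fractions are illustrative assumptions.
    """
    rng = random.Random(seed)
    normal = [i for i, y in enumerate(labels) if y == 0]
    anomalous = [i for i, y in enumerate(labels) if y == 1]
    rng.shuffle(normal)

    # Hold out a fraction of the normal samples for testing.
    n_test = int(round(len(normal) * test_frac_neg))
    test_normal, rest = normal[:n_test], normal[n_test:]

    # Split the remaining normal samples into dev and train.
    n_dev = int(round(len(rest) * dev_frac))
    dev, train = rest[:n_dev], rest[n_dev:]

    test = test_normal + anomalous
    return train, dev, test


# 45 normal and 1808 anomalous sessions, as in the output above.
labels = [0] * 45 + [1] * 1808
train, dev, test = split_unsupervised(labels)
print(len(train), len(dev), len(test))  # 32 4 1817
```

Because the model never sees an anomalous session during training, it can only learn what normal log sequences look like; anomalies are then flagged at test time by how poorly the model reconstructs them.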


In [7]:
vectorizer = LogVectorizer(config.log_vectorizer_config)
vectorizer.fit(train_data)
train_features = vectorizer.transform(train_data)
dev_features = vectorizer.transform(dev_data)
test_features = vectorizer.transform(test_data)
print (train_features)






Map (num_proc=4):   0%|          | 0/32 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/4 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/1817 [00:00<?, ? examples/s]

Dataset({
    features: ['labels', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 32
})


In [8]:
anomaly_detector = NNAnomalyDetector(config=config.nn_anomaly_detection_config)
anomaly_detector.fit(train_features, dev_features)

initialized data collator


The following columns in the training set don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: __index_level_0__. If __index_level_0__ are not expected by `BertForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 32
  Num Epochs = 10
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 80


  0%|          | 0/80 [00:00<?, ?it/s]

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'loss': 3.9481, 'learning_rate': 8.75e-05, 'epoch': 1.25}
{'loss': 3.5514, 'learning_rate': 7.500000000000001e-05, 'epoch': 2.5}
{'loss': 3.139, 'learning_rate': 6.25e-05, 'epoch': 3.75}
{'loss': 2.9058, 'learning_rate': 5e-05, 'epoch': 5.0}


The following columns in the evaluation set don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: __index_level_0__. If __index_level_0__ are not expected by `BertForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 4
  Batch size = 256


{'loss': 2.7939, 'learning_rate': 3.7500000000000003e-05, 'epoch': 6.25}


  0%|          | 0/1 [00:00<?, ?it/s]

Saving model checkpoint to temp_output/BGL_11k_parsed_session_supervised_AD/bert-base-cased/checkpoint-50
Configuration saved in temp_output/BGL_11k_parsed_session_supervised_AD/bert-base-cased/checkpoint-50/config.json


{'eval_loss': 5.910862445831299, 'eval_runtime': 0.1782, 'eval_samples_per_second': 22.441, 'eval_steps_per_second': 5.61, 'epoch': 6.25}


Model weights saved in temp_output/BGL_11k_parsed_session_supervised_AD/bert-base-cased/checkpoint-50/pytorch_model.bin


{'loss': 2.2386, 'learning_rate': 2.5e-05, 'epoch': 7.5}
{'loss': 1.7941, 'learning_rate': 1.25e-05, 'epoch': 8.75}




Training completed. Do not forget to share your model on huggingface.co/models =)




{'loss': 2.1435, 'learning_rate': 0.0, 'epoch': 10.0}
{'train_runtime': 42.8749, 'train_samples_per_second': 7.464, 'train_steps_per_second': 1.866, 'train_loss': 2.8143080949783323, 'epoch': 10.0}


In [9]:
predict_results = anomaly_detector.predict(test_features)
print (predict_results)

INFO:root:Loading model from /Users/alexander.huang/Workspace/logai/examples/jupyter_notebook/nn_ad_benchmarking/temp_output/BGL_11k_parsed_session_supervised_AD/bert-base-cased/checkpoint-50
loading configuration file /Users/alexander.huang/Workspace/logai/examples/jupyter_notebook/nn_ad_benchmarking/temp_output/BGL_11k_parsed_session_supervised_AD/bert-base-cased/checkpoint-50/config.json
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.23.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 3

Map:   0%|          | 0/1817 [00:00<?, ? examples/s]

***** Running Prediction *****
  Num examples = 1867
  Batch size = 256
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


  0%|          | 0/8 [00:00<?, ?it/s]

INFO:root:test_loss: 6.928493022918701 test_runtime: 53.9503 test_samples/s: 34.606
INFO:root:number of original test instances 1481
INFO:root:loss_mean Pos scores:  mean: 6.956572309020314, std: 0.6080354536980397
INFO:root:loss_mean Neg scores: mean: 3.0824834387871176, std: 1.6453940559119908
INFO:root:AUC of loss_mean: 0.9982179226069247
INFO:root:

INFO:root:loss_max Pos scores:  mean: 8.256480937253352, std: 0.35247602095338415
INFO:root:loss_max Neg scores: mean: 4.885785311460495, std: 2.3674712402659304
INFO:root:AUC of loss_max: 0.9978784792939579
INFO:root:

INFO:root:loss_top6_mean Pos scores:  mean: 7.644149760391735, std: 0.3221544536587862
INFO:root:loss_top6_mean Neg scores: mean: 3.615651013329625, std: 1.9612775925279653
INFO:root:AUC of loss_top6_mean: 0.9996605566870332
INFO:root:

INFO:root:scores_top6_max_prob Pos scores:  mean: 0.843062021000963, std: 0.0006618899932966818
INFO:root:scores_top6_max_prob Neg scores: mean: 0.7798267786080639, std: 0.084854502809618

  0%|          | 0/8 [00:00<?, ?it/s]

INFO:root:test_loss: 6.927598476409912 test_runtime: 53.3697 test_samples/s: 34.982
INFO:root:number of original test instances 1543
***** Running Prediction *****
  Num examples = 1867
  Batch size = 256


  0%|          | 0/8 [00:00<?, ?it/s]

INFO:root:test_loss: 6.944589138031006 test_runtime: 52.7862 test_samples/s: 35.369
INFO:root:number of original test instances 1586
INFO:root:loss_mean Pos scores:  mean: 6.960877492884767, std: 0.3125803434389781
INFO:root:loss_mean Neg scores: mean: 3.116841279288272, std: 1.619121324586434
INFO:root:AUC of loss_mean: 1.0
INFO:root:

INFO:root:loss_max Pos scores:  mean: 8.568647076723996, std: 0.3393042668344223
INFO:root:loss_max Neg scores: mean: 5.053523451089859, std: 2.2668106463743003
INFO:root:AUC of loss_max: 1.0
INFO:root:

INFO:root:loss_top6_mean Pos scores:  mean: 7.650102574703828, std: 0.18209936760463388
INFO:root:loss_top6_mean Neg scores: mean: 3.632692766893241, std: 1.959059970462125
INFO:root:AUC of loss_top6_mean: 1.0
INFO:root:

INFO:root:scores_top6_max_prob Pos scores:  mean: 0.8430962857866755, std: 0.00039696159249113367
INFO:root:scores_top6_max_prob Neg scores: mean: 0.7852794508863654, std: 0.07493434039226053
INFO:root:AUC of scores_top6_max_prob: 0.68

  0%|          | 0/8 [00:00<?, ?it/s]

INFO:root:test_loss: 6.970725059509277 test_runtime: 53.6928 test_samples/s: 34.772
INFO:root:number of original test instances 1642
***** Running Prediction *****
  Num examples = 1867
  Batch size = 256


  0%|          | 0/8 [00:00<?, ?it/s]

INFO:root:test_loss: 6.930238723754883 test_runtime: 53.205 test_samples/s: 35.091
INFO:root:number of original test instances 1694
INFO:root:loss_mean Pos scores:  mean: 6.973155573297613, std: 0.27860316832100634
INFO:root:loss_mean Neg scores: mean: 3.0994350669005373, std: 1.5537590110223798
INFO:root:AUC of loss_mean: 1.0
INFO:root:

INFO:root:loss_max Pos scores:  mean: 8.619907401506554, std: 0.3353480344539885
INFO:root:loss_max Neg scores: mean: 4.948211563958062, std: 2.151513058115627
INFO:root:AUC of loss_max: 1.0
INFO:root:

INFO:root:loss_top6_mean Pos scores:  mean: 7.674706078963118, std: 0.17001428601839375
INFO:root:loss_top6_mean Neg scores: mean: 3.6250014551627783, std: 1.8839540175865472
INFO:root:AUC of loss_top6_mean: 1.0
INFO:root:

INFO:root:scores_top6_max_prob Pos scores:  mean: 0.8431832878216684, std: 0.00034071575384181786
INFO:root:scores_top6_max_prob Neg scores: mean: 0.7914862497445242, std: 0.07365872379765771
INFO:root:AUC of scores_top6_max_prob: 0

  0%|          | 0/8 [00:00<?, ?it/s]

INFO:root:test_loss: 6.954885482788086 test_runtime: 54.2889 test_samples/s: 34.39
INFO:root:number of original test instances 1747
***** Running Prediction *****
  Num examples = 1867
  Batch size = 256


  0%|          | 0/8 [00:00<?, ?it/s]

INFO:root:test_loss: 6.891811370849609 test_runtime: 54.86 test_samples/s: 34.032
INFO:root:number of original test instances 1772
INFO:root:loss_mean Pos scores:  mean: 6.975145703049782, std: 0.2607007075216614
INFO:root:loss_mean Neg scores: mean: 3.1087315424699193, std: 1.551414920839871
INFO:root:AUC of loss_mean: 1.0
INFO:root:

INFO:root:loss_max Pos scores:  mean: 8.648724076156919, std: 0.3322655258187494
INFO:root:loss_max Neg scores: mean: 4.97110030386183, std: 2.1251133063303382
INFO:root:AUC of loss_max: 1.0
INFO:root:

INFO:root:loss_top6_mean Pos scores:  mean: 7.727370514145509, std: 0.14610120517657663
INFO:root:loss_top6_mean Neg scores: mean: 3.692090356239566, std: 1.883234391886213
INFO:root:AUC of loss_top6_mean: 1.0
INFO:root:

INFO:root:scores_top6_max_prob Pos scores:  mean: 0.8433170856274405, std: 0.0002551948758332728
INFO:root:scores_top6_max_prob Neg scores: mean: 0.7955843684390004, std: 0.0677194366331644
INFO:root:AUC of scores_top6_max_prob: 0.658032

  0%|          | 0/8 [00:00<?, ?it/s]

INFO:root:test_loss: 6.974289417266846 test_runtime: 55.0723 test_samples/s: 33.883
INFO:root:number of original test instances 1801
***** Running Prediction *****
  Num examples = 1866
  Batch size = 256


  0%|          | 0/8 [00:00<?, ?it/s]

INFO:root:test_loss: 6.930877685546875 test_runtime: 347.9693 test_samples/s: 5.363
INFO:root:number of original test instances 1813
INFO:root:loss_mean Pos scores:  mean: 6.981966105960047, std: 0.24387351536445512
INFO:root:loss_mean Neg scores: mean: 3.1068283653285667, std: 1.5511973716441778
INFO:root:AUC of loss_mean: 1.0
INFO:root:

INFO:root:loss_max Pos scores:  mean: 8.668997799477927, std: 0.3331098104219032
INFO:root:loss_max Neg scores: mean: 4.980885664621989, std: 2.1147806726719143
INFO:root:AUC of loss_max: 1.0
INFO:root:

INFO:root:loss_top6_mean Pos scores:  mean: 7.776402157923218, std: 0.11800260476669419
INFO:root:loss_top6_mean Neg scores: mean: 3.706846517545206, std: 1.8740946036341497
INFO:root:AUC of loss_top6_mean: 1.0
INFO:root:

INFO:root:scores_top6_max_prob Pos scores:  mean: 0.8433658191870901, std: 0.0002228806073072557
INFO:root:scores_top6_max_prob Neg scores: mean: 0.7968413198803678, std: 0.06583845451870429
INFO:root:AUC of scores_top6_max_prob: 0

  0%|          | 0/8 [00:00<?, ?it/s]

INFO:root:test_loss: 6.982187271118164 test_runtime: 54.1155 test_samples/s: 34.482
INFO:root:number of original test instances 1817
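
The log lines above report several score types (`loss_mean`, `loss_max`, `loss_top6_mean`) and an AUC for each. As a rough illustration of what these names suggest, the sketch below aggregates a list of per-token masked-LM losses for one sequence and computes AUC from positive and negative scores by pairwise comparison. The exact aggregation inside LogAI is not shown in this output, so `summarize_losses` and `auc` are hypothetical helpers and the input numbers are made up.

```python
def summarize_losses(token_losses, k=6):
    """Aggregate per-token losses of one sequence into candidate anomaly scores."""
    top_k = sorted(token_losses, reverse=True)[:k]
    return {
        "loss_max": max(token_losses),
        "loss_mean": sum(token_losses) / len(token_losses),
        f"loss_top{k}_mean": sum(top_k) / len(top_k),
    }


def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outscores a negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up per-token losses for a single sequence.
print(summarize_losses([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]))

# Made-up sequence-level scores: anomalous (positive) vs normal (negative).
print(auc(pos_scores=[6.9, 7.1, 7.4], neg_scores=[2.1, 3.0, 4.8]))  # 1.0
```

An AUC of 1.0 means every anomalous sequence received a higher score than every normal one, which is consistent with the wide gap between the Pos and Neg score means in the log.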


      indices  max_loss   sum_loss num_loss  \
0           0  1.788414   8.479586        9   
1           0   1.91644   9.006147        8   
2           1  2.114819   9.054378        8   
3           2  2.682362  10.225121        8   
4           3  7.217087  47.702454        9   
...       ...       ...        ...      ...   
18662    1813  8.931251   63.58242        8   
18663    1814  8.526834  69.759872        9   
18664    1814  8.944598  64.604271        8   
18665    1815  8.536341   61.96146        8   
18666    1816  8.256463  60.937569        8   

                                               top6_loss  \
0      [1.788413643836975, 1.7165228128433228, 1.0566...   
1      [1.9164402484893799, 1.7673128843307495, 1.688...   
2      [2.114818572998047, 1.5756577253341675, 1.5471...   
3      [2.6823620796203613, 1.5462324619293213, 1.351...   
4      [7.2170867919921875, 6.036893844604492, 5.8234...   
...                                                  ...   
18662  [8.93125