### Log Anomaly Detection on BGL Dataset using LSTM-based models
This is a running example of an end-to-end workflow for Log Anomaly Detection on the public BGL dataset, using LSTM-based neural anomaly detectors.

The workflow is very similar to the one for the HDFS dataset, so we only point out the dataset-specific portions, i.e. the parts that differ between the two datasets.

For a more complete walkthrough of the full workflow, please refer to the `hdfs_lstm_unsupervised_parsed_sequential.ipynb` notebook.

In [11]:
import os 
from logai.applications.openset.anomaly_detection.openset_anomaly_detection_workflow import OpenSetADWorkflowConfig, validate_config_dict
from logai.utils.file_utils import read_file
from logai.utils.dataset_utils import split_train_dev_test_for_anomaly_detection
import logging 
from logai.dataloader.data_loader import FileDataLoader
from logai.preprocess.bgl_preprocessor import BGLPreprocessor
from logai.information_extraction.log_parser import LogParser
from logai.preprocess.openset_partitioner import OpenSetPartitioner
from logai.analysis.nn_anomaly_detector import NNAnomalyDetector
from logai.information_extraction.log_vectorizer import LogVectorizer
from logai.utils import constants

### Loading config from yaml
Loading the config from a yaml file works the same way for all datasets. The dataset-specific details (e.g. regex patterns, or the mapping of column names to the LogRecordObject attributes) are specified in the yaml file itself.

In [12]:
config_path = "configs/bgl_lstm_unsupervised_parsed_sequential_config.yaml"
config_parsed = read_file(config_path)
config_dict = config_parsed["workflow_config"]
validate_config_dict(config_dict)
config = OpenSetADWorkflowConfig.from_dict(config_dict)
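
To illustrate the kind of dataset-specific setting such a yaml file carries, the sketch below applies a list of regex replacements to a raw logline using only the standard library. The patterns, the sample logline and the `apply_replacements` helper are hypothetical; they are not taken from the BGL config file or from LogAI's API. Placeholder tokens of this kind (for example `ALPHANUM`) do appear in the parsed output further down.

```python
import re

# Hypothetical (pattern, replacement) pairs; the real ones live in the yaml config.
REPLACE_LIST = [
    (r"0x[0-9a-fA-F]+", " HEX "),  # hexadecimal literals
    (r"\d+", " INT "),             # remaining runs of digits
]

def apply_replacements(logline, replace_list=REPLACE_LIST):
    """Apply each regex replacement in order, then collapse repeated whitespace."""
    for pattern, token in replace_list:
        logline = re.sub(pattern, token, logline)
    return " ".join(logline.split())

print(apply_replacements("instruction cache parity error at 0x1f3a corrected 3 times"))
# instruction cache parity error at HEX corrected INT times
```

The order of the list matters in this sketch: if the digit pattern ran first, it would break up the hexadecimal literal before the `HEX` pattern could match it.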

In [13]:
# Load the raw log file into a LogRecordObject, as specified in data_loader_config
dataloader = FileDataLoader(config.data_loader_config)
logrecord = dataloader.load_data()
print(logrecord.body[constants.LOGLINE_NAME])

0         RAS KERNEL INFO instruction cache parity error...
1         RAS KERNEL INFO instruction cache parity error...
2         RAS KERNEL INFO instruction cache parity error...
3         RAS KERNEL INFO instruction cache parity error...
4         RAS KERNEL INFO instruction cache parity error...
                                ...                        
358455    RAS KERNEL FATAL idoproxy communication failur...
358456    RAS KERNEL FATAL idoproxy communication failur...
358457    RAS KERNEL FATAL idoproxy communication failur...
358458    RAS KERNEL FATAL idoproxy communication failur...
358459    RAS KERNEL FATAL idoproxy communication failur...
Name: logline, Length: 358460, dtype: object


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  selected[constants.LOG_TIMESTAMPS] = pd.to_datetime(


### Preprocessing loaded data using BGLPreprocessor 
This is the only part of the workflow that differs between datasets. Each dataset must have its own Preprocessor class, whose main job is to process the raw log data and extract the standard fields of the LogRecordObject (e.g. body, labels, timestamps, span_ids, attributes).

For fields whose extraction is generic (such as timestamps), the DataLoader class already handles it automatically.

For more dataset-specific fields (e.g. span_ids or labels), the custom extraction code has to be implemented in the dataset's Preprocessor class. For example, the raw BGL dataset does not associate any id with its loglines, but most of the existing log anomaly detection literature partitions the logs into fixed time windows and uses the partition indices as ids of the log segments.

If you want a different time-partitioning, or a different scheme for assigning ids to the loglines in the BGL dataset, you have to write your own custom Preprocessor for BGL.
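
The fixed time-partitioning idea described above can be sketched with plain pandas. This is a minimal illustration, not the code inside `BGLPreprocessor`: the `assign_time_partition_ids` helper, the `span_id` name and the one-minute window are all assumptions made for the example.

```python
import pandas as pd

def assign_time_partition_ids(timestamps, window="1min"):
    """Return one integer id per logline; lines in the same fixed time window share an id."""
    ts = pd.to_datetime(timestamps)
    buckets = ts.dt.floor(window)   # start of the window each line falls into
    ids, _ = pd.factorize(buckets)  # consecutive ids in order of first appearance
    return pd.Series(ids, index=timestamps.index, name="span_id")

# Hypothetical timestamps, used only to show the behaviour.
timestamps = pd.Series([
    "2005-06-03 15:42:50",
    "2005-06-03 15:42:59",
    "2005-06-03 15:43:10",
    "2005-06-03 15:45:00",
])
print(assign_time_partition_ids(timestamps).tolist())
# [0, 0, 1, 2]
```

A custom Preprocessor with a different scheme would only need to change how the buckets are computed, for instance by passing a different `window`.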

In [14]:
preprocessor = BGLPreprocessor(config.preprocessor_config)
preprocessed_filepath = os.path.join(config.output_dir, 'BGL_11k_processed.csv')
# Clean the loglines and extract the dataset-specific fields, then save the result
logrecord = preprocessor.clean_log(logrecord)
logrecord.save_to_csv(preprocessed_filepath)
print(logrecord.body[constants.LOGLINE_NAME])

0         RAS KERNEL INFO instruction cache parity error...
1         RAS KERNEL INFO instruction cache parity error...
2         RAS KERNEL INFO instruction cache parity error...
3         RAS KERNEL INFO instruction cache parity error...
4         RAS KERNEL INFO instruction cache parity error...
                                ...                        
358455    RAS KERNEL FATAL idoproxy communication failur...
358456    RAS KERNEL FATAL idoproxy communication failur...
358457    RAS KERNEL FATAL idoproxy communication failur...
358458    RAS KERNEL FATAL idoproxy communication failur...
358459    RAS KERNEL FATAL idoproxy communication failur...
Name: logline, Length: 358460, dtype: object


In [15]:
parser = LogParser(config.log_parser_config)
parsed_result = parser.parse(logrecord.body[constants.LOGLINE_NAME])
# Overwrite the raw loglines with their parsed form before saving
logrecord.body[constants.LOGLINE_NAME] = parsed_result[constants.PARSED_LOGLINE_NAME]
parsed_filepath = os.path.join(config.output_dir, 'BGL_11k_parsed.csv')
logrecord.save_to_csv(parsed_filepath)
print(logrecord.body[constants.LOGLINE_NAME])

0         RAS KERNEL INFO instruction cache parity error...
1         RAS KERNEL INFO instruction cache parity error...
2         RAS KERNEL INFO instruction cache parity error...
3         RAS KERNEL INFO instruction cache parity error...
4         RAS KERNEL INFO instruction cache parity error...
                                ...                        
358455    RAS KERNEL FATAL idoproxy communication failur...
358456    RAS KERNEL FATAL idoproxy communication failur...
358457    RAS KERNEL FATAL idoproxy communication failur...
358458    RAS KERNEL FATAL idoproxy communication failur...
358459    RAS KERNEL FATAL idoproxy communication failur...
Name: logline, Length: 358460, dtype: object


In [16]:
partitioner = OpenSetPartitioner(config.open_set_partitioner_config)
partitioned_filepath = os.path.join(config.output_dir, 'BGL_11k_parsed_sliding10.csv')
# Partition the loglines as specified in open_set_partitioner_config, then save the result
logrecord = partitioner.partition(logrecord)
logrecord.save_to_csv(partitioned_filepath)
print(logrecord.body[constants.LOGLINE_NAME])

0         RAS KERNEL INFO instruction cache parity error...
1         RAS KERNEL INFO instruction cache parity error...
2         RAS KERNEL INFO instruction cache parity error...
3         RAS KERNEL INFO instruction cache parity error...
4         RAS KERNEL INFO instruction cache parity error...
                                ...                        
346567    RAS KERNEL FATAL Lustre mount FAILED ALPHANUM ...
346568    RAS KERNEL FATAL Lustre mount FAILED ALPHANUM ...
346569    RAS KERNEL FATAL Lustre mount FAILED ALPHANUM ...
346570    RAS KERNEL FATAL Lustre mount FAILED ALPHANUM ...
346571    RAS KERNEL FATAL idoproxy communication failur...
Name: logline, Length: 346572, dtype: object
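
The output filename above (`..._sliding10.csv`) suggests that the loglines are grouped into sliding windows of length 10. The sketch below shows the general sliding-window technique on a plain Python list. It is not the `OpenSetPartitioner` implementation, and the `sliding_windows` helper, the stride of 1 and the sample events are assumptions made for illustration.

```python
def sliding_windows(loglines, window_size=10, stride=1):
    """Return a list of overlapping windows, each a list of consecutive loglines."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    last_start = len(loglines) - window_size
    # No window is produced when the sequence is shorter than window_size.
    return [loglines[i:i + window_size] for i in range(0, last_start + 1, stride)]

events = [f"E{i}" for i in range(12)]   # hypothetical parsed loglines
windows = sliding_windows(events, window_size=10)
print(len(windows))                      # 3
print(windows[0][0], windows[-1][-1])    # E0 E11
```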


In [17]:
train_filepath = os.path.join(config.output_dir, 'BGL_11k_parsed_sliding10_unsupervised_train.csv')
dev_filepath = os.path.join(config.output_dir, 'BGL_11k_parsed_sliding10_unsupervised_dev.csv')
test_filepath = os.path.join(config.output_dir, 'BGL_11k_parsed_sliding10_unsupervised_test.csv')

train_data, dev_data, test_data = split_train_dev_test_for_anomaly_detection(
    logrecord,
    training_type=config.training_type,
    test_data_frac_neg_class=config.test_data_frac_neg,
    test_data_frac_pos_class=config.test_data_frac_pos,
    shuffle=config.train_test_shuffle,
)

train_data.save_to_csv(train_filepath)
dev_data.save_to_csv(dev_filepath)
test_data.save_to_csv(test_filepath)
print('Train/Dev/Test Anomalous',
      len(train_data.labels[train_data.labels[constants.LABELS] == 1]),
      len(dev_data.labels[dev_data.labels[constants.LABELS] == 1]),
      len(test_data.labels[test_data.labels[constants.LABELS] == 1]))
print('Train/Dev/Test Normal',
      len(train_data.labels[train_data.labels[constants.LABELS] == 0]),
      len(dev_data.labels[dev_data.labels[constants.LABELS] == 0]),
      len(test_data.labels[test_data.labels[constants.LABELS] == 0]))

indices_train/dev/test:  9503 1941 336941
Train/Dev/Test Anomalous 0 0 336941
Train/Dev/Test Normal 9503 1941 0
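
In the counts above, the train and dev splits contain only normal samples, which is what an unsupervised detector is trained on. The sketch below shows one way such a split can be built with pandas. It is not the code of `split_train_dev_test_for_anomaly_detection`: the `split_unsupervised` helper, the `labels` column name, the `dev_frac` parameter and the default fractions are hypothetical.

```python
import pandas as pd

def split_unsupervised(df, label_col="labels", test_frac_neg=0.2,
                       test_frac_pos=1.0, dev_frac=0.1, seed=0):
    """Split so that train and dev hold only normal rows (label 0)."""
    # Shuffle each class separately so the held-out rows are a random subset.
    normal = df[df[label_col] == 0].sample(frac=1.0, random_state=seed)
    anomalous = df[df[label_col] == 1].sample(frac=1.0, random_state=seed)

    n_test_neg = int(len(normal) * test_frac_neg)
    n_test_pos = int(len(anomalous) * test_frac_pos)
    test = pd.concat([normal.iloc[:n_test_neg], anomalous.iloc[:n_test_pos]])

    # The remaining normal rows are divided between dev and train.
    rest = normal.iloc[n_test_neg:]
    n_dev = int(len(rest) * dev_frac)
    return rest.iloc[n_dev:], rest.iloc[:n_dev], test

# Toy data: 100 normal rows and 10 anomalous rows.
toy = pd.DataFrame({"labels": [0] * 100 + [1] * 10})
train, dev, test = split_unsupervised(toy)
print(len(train), len(dev), len(test))   # 72 8 30
```

Setting `test_frac_neg=0.0` in this sketch yields a test split made up only of anomalous rows, the same shape as the counts printed above.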


In [18]:
vectorizer = LogVectorizer(config.log_vectorizer_config)
# Fit the vectorizer on the training data only, then transform all three splits
vectorizer.fit(train_data)
train_features = vectorizer.transform(train_data)
dev_features = vectorizer.transform(dev_data)
test_features = vectorizer.transform(test_data)

In [19]:
anomaly_detector = NNAnomalyDetector(config=config.nn_anomaly_detection_config)
anomaly_detector.fit(train_features, dev_features)

INFO:root:Start training on 2376 batches with cpu.
INFO:root:Batch 100, training loss : 1.086510915160179
INFO:root:Batch 200, training loss : 0.5830365583673119
INFO:root:Batch 300, training loss : 0.39607727030913037
INFO:root:Batch 400, training loss : 0.30436887553427366
INFO:root:Batch 500, training loss : 0.24499564726836978
INFO:root:Batch 600, training loss : 0.205032207845555
INFO:root:Batch 700, training loss : 0.2037280415514085
INFO:root:Batch 800, training loss : 0.17889216940820915
INFO:root:Batch 900, training loss : 0.15944159963349294
INFO:root:Batch 1000, training loss : 0.14380327356606723
INFO:root:Batch 1100, training loss : 0.1309589400227097
INFO:root:Batch 1200, training loss : 0.17312054718941605
INFO:root:Batch 1300, training loss : 0.2145836355474491
INFO:root:Batch 1400, training loss : 0.20053650532466624
INFO:root:Batch 1500, training loss : 0.18776037297133977
INFO:root:Batch 1600, training loss : 0.17638718325804803
INFO:root:Batch 1700, training loss : 

In [20]:
predict_results = anomaly_detector.predict(test_features)
print(predict_results)

INFO:root:Evaluating test data.
INFO:root:Finish inference. [46.899296045303345s]
INFO:root:Calculating acc sum.
INFO:root:Finish generating store_df.
INFO:root:Finish counting [2.489457845687866s]
INFO:root:Best result: f1: 1.0 rc: 1.0 pc: 1.0


{'f1': 1.0, 'rc': 1.0, 'pc': 1.0, 'pred': 0       1
1       1
2       1
3       1
4       1
       ..
1803    1
1804    1
1805    1
1806    1
1807    1
Name: window_pred_anomaly_8, Length: 1808, dtype: int64, 'true': 0       1
1       1
2       1
3       1
4       1
       ..
1803    1
1804    1
1805    1
1806    1
1807    1
Name: window_anomalies, Length: 1808, dtype: int64}
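
The `f1`, `rc` and `pc` keys in the result presumably stand for F1 score, recall and precision (an assumption based on the names). As a minimal sketch of how these metrics follow from the `pred` and `true` series, the hypothetical `precision_recall_f1` helper below computes them by hand. It is not LogAI's evaluation code.

```python
import pandas as pd

def precision_recall_f1(true, pred):
    """Compute precision, recall and F1 for binary labels (1 = anomalous)."""
    tp = int(((pred == 1) & (true == 1)).sum())
    fp = int(((pred == 1) & (true == 0)).sum())
    fn = int(((pred == 0) & (true == 1)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Every window is anomalous and every window is flagged.
all_true = pd.Series([1, 1, 1, 1])
all_pred = pd.Series([1, 1, 1, 1])
print(precision_recall_f1(all_true, all_pred))   # (1.0, 1.0, 1.0)

# A mixed toy case for comparison.
print(precision_recall_f1(pd.Series([1, 0, 1, 0]), pd.Series([1, 1, 0, 0])))   # (0.5, 0.5, 0.5)
```

When a test split contains no normal samples, as the split counts earlier in this notebook indicate, a detector that flags every window scores 1.0 on all three metrics, so such a score says little about the false-positive rate.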
