# DGX Kernel Root Cause Analysis Acceleration & Predictive Maintenance using NLP

# Table of Contents 
* Introduction
* Dataset
* Required Libraries
* Reading the files
* Initialize/Load the BERT model
* Training
* Evaluation of the model
* Conclusion
* References

# Introduction

Like any other Linux-based machine, a DGX system generates a vast amount of logs. Analysts spend hours trying to identify the root cause of each failure, and the set of possible root causes is open-ended. Known patterns can help narrow the search; however, regular expressions can only identify previously known patterns, and they add the manual task of maintaining a search script.

In this notebook, we show how GPUs can accelerate the analysis of this enormous amount of logs using machine learning. Another benefit of a probabilistic analysis is that it can pin down previously undetected root causes. To achieve this, we will fine-tune a pre-trained BERT* model with a classification layer using the HuggingFace library.

Once the model is capable of identifying even new root causes, it could also be deployed as a process running on the machines to predict failures before they happen.

*BERT stands for Bidirectional Encoder Representations from Transformers. The paper can be found [here.](https://arxiv.org/pdf/1810.04805.pdf)

## Dataset
* DGX Linux Kernel logs

The dataset comprises `kern.log` files from multiple DGX systems. Each line has been labelled as either `0` for `ordinary` or `1` for `root cause` by a script that uses some known patterns. We will be especially interested in lines that are marked as ordinary in the test set but predicted as a root cause, as they may be new types of root causes of failures.

More information on Linux log types can be found [here.](https://help.ubuntu.com/community/LinuxLogFiles)
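
The pattern-based labelling described above can be sketched as follows. The actual labelling script is not part of this notebook, so the patterns, the helper name `label_line`, and the sample log lines are all made up for illustration.

```python
import re

# Hypothetical patterns, for illustration only; the real labelling script is not shown.
KNOWN_ROOT_CAUSE_PATTERNS = [
    re.compile(r"NVRM: Xid", re.IGNORECASE),
    re.compile(r"Out of memory: Kill(ed)? process", re.IGNORECASE),
    re.compile(r"Hardware Error", re.IGNORECASE),
]


def label_line(line: str) -> int:
    """Return 1 (root cause) if any known pattern matches, else 0 (ordinary)."""
    return int(any(p.search(line) for p in KNOWN_ROOT_CAUSE_PATTERNS))


# Made-up sample lines in the general style of kern.log entries.
sample_lines = [
    "kernel: example usb device attached",
    "kernel: NVRM: Xid example GPU error",
    "kernel: example memory read error not covered by any pattern",
]
labels = [label_line(line) for line in sample_lines]
print(labels)
```

The third line is labelled ordinary simply because no pattern covers it. This is the kind of line we hope the fine-tuned model will still flag.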

### Required Libraries

In [1]:
import cudf
from binary_sequence_classifier import BinarySequenceClassifier
from os import path
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import time
import torch

from morpheus.utils.seed import manual_seed

## Reading the files

In [2]:
random_seed=42

In [3]:
dflogs = pd.read_csv("../../datasets/training-data/root-cause-training-data.csv",header=None,names=["label","log"])

Each row in the `log` column has a line from the `kern.log` file, and the label column has information on whether it is an ordinary log or a root cause.

Previously unseen error types are read into a new dataframe to be appended to the test set later.

In [4]:
dfnewerror=pd.read_csv("../../datasets/training-data/root-cause-unseen-errors.csv", header=None, names=['label', 'log'])

In [5]:
X_train, X_test, y_train, y_test = train_test_split(dflogs, dflogs.label,random_state=random_seed)

In [6]:
X_train.reset_index(drop=True,inplace=True)

In [7]:
y_train.reset_index(drop=True,inplace=True)

In [8]:
X_test=pd.concat([X_test,dfnewerror])

In [9]:
y_test=pd.concat([y_test,dfnewerror["label"]])

In [10]:
X_test.reset_index(drop=True,inplace=True)

In [11]:
#X_train.to_csv("Rootcause-training-data.csv",index=False)

In [12]:
#X_test.to_csv("Rootcause-validation-data.csv",index=False)

In [13]:
X_train.to_json("Rootcause-training-data.jsonlines", orient='records',lines=True)

In [14]:
X_test.to_json("Rootcause-validation-data.jsonlines",orient='records',lines=True)
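
To make the output format concrete, the sketch below shows what `orient='records', lines=True` produces: one JSON object per line, keyed by column name. The toy dataframe and its log text are made up, and the text is kept in memory instead of being written to a file.

```python
import json
from io import StringIO

import pandas as pd

# Toy stand-in for X_train; the log text is made up for illustration.
toy = pd.DataFrame({
    "label": [0, 1],
    "log": ["kernel: example ordinary line", "kernel: example root cause line"],
})

# With no path argument, to_json returns the serialized text instead of writing a file.
text = toy.to_json(orient="records", lines=True)

# Each non-empty line is a standalone JSON object.
records = [json.loads(line) for line in text.splitlines() if line]
print(records[0])

# Reading the text back with lines=True restores the rows.
restored = pd.read_json(StringIO(text), lines=True)
print(restored.shape)
```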

In [15]:
y_test.reset_index(drop=True,inplace=True)

In [16]:
dfnewerror=cudf.DataFrame.from_pandas(dfnewerror)

In [17]:
dflogs=cudf.DataFrame.from_pandas(dflogs)

## Initialize/Load the BERT model
We will initialize the sequence classifier module with a pre-trained BERT model. The pre-trained model we use is located at https://huggingface.co/bert-base-uncased. For more information on the model, please see the paper at https://arxiv.org/pdf/2005.01634.pdf.


In [18]:
seq_classifier = BinarySequenceClassifier()
seq_classifier.init_model("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

## Training 

Part of the dataset is used for fine-tuning the model. The rest is held out as the test set to evaluate whether the model is useful. With the default settings of `train_test_split`, 75% of the dataset becomes the training set.
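
The split sizes can be checked on toy data. The sketch below uses a made-up dataframe of 100 lines and also shows the `stratify` argument of `train_test_split`, an option this notebook does not use but which keeps the label proportions equal in both parts when the classes are imbalanced.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 80 ordinary lines (label 0) and 20 root-cause lines (label 1).
toy = pd.DataFrame({
    "label": [0] * 80 + [1] * 20,
    "log": [f"example line {i}" for i in range(100)],
})

# Default split: when neither test_size nor train_size is given, 25% is held out.
train_df, test_df = train_test_split(toy, random_state=42)
print(len(train_df), len(test_df))

# Stratified variant (an alternative, not what the notebook does).
strat_train, strat_test = train_test_split(
    toy, random_state=42, stratify=toy["label"]
)
print(int(strat_test["label"].sum()), "root-cause lines in the stratified test set")
```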

The number of epochs should be adjusted for each dataset.

In [19]:
manual_seed(random_seed)
seq_classifier.train_model(X_train["log"], y_train,batch_size=128, epochs=1,learning_rate=3.6e-4)

Epoch: 100%|██████████| 1/1 [00:08<00:00,  8.64s/it]

Train loss: 0.6523606370795857





In [20]:
timestr = time.strftime("%Y%m%d-%H%M%S")

seq_classifier.save_model(timestr)

## Evaluation of the model

`evaluate_model` returns the accuracy on the test set.

We will check whether any of the error lines we appended are predicted as root cause.

In [21]:
seq_classifier.evaluate_model(X_test["log"], y_test)

0.9601666666666666

We get the predictions from the model.

In [22]:
test_preds = seq_classifier.predict(X_test["log"], batch_size=128, threshold=0.5)
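
The internals of `predict` are not shown in this notebook. As a rough sketch of what a `threshold` argument typically does, assume the model yields one root-cause probability per line and that a line is labelled `1` when that probability reaches the threshold. The helper name `apply_threshold`, the non-strict comparison, and the probabilities are assumptions for illustration.

```python
import numpy as np


def apply_threshold(probabilities, threshold=0.5):
    """Hypothetical helper: map root-cause probabilities to 0/1 labels."""
    probabilities = np.asarray(probabilities, dtype=float)
    # Assumption: a probability equal to the threshold counts as positive.
    return (probabilities >= threshold).astype(int)


probs = np.array([0.05, 0.48, 0.52, 0.97])  # made-up probabilities
print(apply_threshold(probs, threshold=0.5))
print(apply_threshold(probs, threshold=0.4))
```

Under these assumptions, lowering the threshold flags more lines as root cause, trading extra false positives for fewer missed root causes.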

In [23]:
tests = test_preds[0].to_numpy()
true_labels = X_test["label"]

We calculate the F1 score because the dataset is not balanced.

In [24]:
f1_score(true_labels, tests)

0.9765765765765766

The F1 score (about 0.977) is higher than the accuracy reported above (about 0.960). Because the distribution of the labels is not balanced, accuracy alone might be less indicative of performance.

We can use a confusion matrix to check how many lines of each true label received each predicted label.

In [25]:
confusion_matrix(true_labels, tests)

array([[189,  13],
       [  0, 271]])
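
In scikit-learn's convention, the rows of the confusion matrix are the true labels and the columns are the predicted labels. Here, 13 lines labelled ordinary were predicted as root cause, and no labelled root-cause line was missed. The sketch below recomputes the usual metrics by hand from those four counts.

```python
# Counts copied from the confusion matrix above
# (rows are true labels, columns are predicted labels).
tn, fp = 189, 13
fn, tp = 0, 271

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The F1 value agrees with the `f1_score` output. The accuracy derived from these counts (about 0.9725) is not identical to the number returned by `evaluate_model`; how that method aggregates accuracy is not shown in this notebook.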

In [26]:
testpredseries=(test_preds[0]).to_pandas()

In [27]:
(np.where(testpredseries == 1))

(array([  0,   1,   7,   9,  11,  12,  13,  14,  16,  18,  19,  20,  22,
         23,  24,  25,  29,  31,  34,  42,  44,  45,  46,  47,  49,  50,
         51,  52,  53,  55,  56,  57,  58,  59,  61,  62,  63,  64,  65,
         66,  67,  68,  69,  70,  71,  73,  74,  75,  80,  81,  82,  85,
         86,  88,  89,  90,  92,  93,  94,  95,  97,  98,  99, 101, 103,
        104, 105, 106, 107, 109, 110, 112, 113, 114, 115, 117, 119, 120,
        123, 124, 125, 126, 128, 129, 130, 131, 134, 136, 137, 138, 139,
        140, 141, 142, 143, 144, 145, 146, 148, 149, 150, 153, 154, 156,
        159, 160, 161, 162, 163, 164, 165, 166, 170, 172, 173, 174, 175,
        176, 178, 181, 182, 183, 187, 188, 191, 194, 196, 200, 201, 202,
        205, 206, 207, 209, 210, 214, 215, 216, 217, 218, 221, 222, 223,
        224, 226, 227, 228, 229, 231, 234, 235, 236, 237, 240, 244, 245,
        246, 250, 251, 254, 255, 256, 257, 262, 265, 266, 267, 268, 269,
        271, 274, 275, 276, 278, 284, 285, 287, 289
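
Rather than reading raw indices, the lines of interest (labelled ordinary but predicted as root cause) can be selected with a boolean mask. The sketch below uses made-up stand-ins for `X_test` and the predictions; it assumes both share the same positional index, which in this notebook would rely on the earlier `reset_index` calls.

```python
import pandas as pd

# Toy stand-ins for X_test and the model predictions; all values are made up.
toy_test = pd.DataFrame({
    "label": [0, 0, 1, 0],
    "log": [
        "kernel: example ordinary line",
        "kernel: example line describing an unseen hardware fault",
        "kernel: example line matching a known root-cause pattern",
        "kernel: another example ordinary line",
    ],
})
toy_preds = pd.Series([0, 1, 1, 0], name="prediction")

# Keep rows labelled ordinary (0) that the model predicted as root cause (1).
candidates = toy_test[(toy_test["label"] == 0) & (toy_preds == 1)]
print(candidates["log"].tolist())
```

These candidate lines are the ones worth reviewing manually as possible new root causes.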

# Conclusion

Although we didn't include hardware errors in the training set, the model still marks some of them as root cause lines, which can be seen from the indices of the positive predictions above. The lines identified by this model may indicate a problem hours before the failure or outage. This approach could be deployed on the machines to warn users well before problems occur, so that corrective action can be taken.

# References
* HuggingFace Transformers examples https://github.com/huggingface/transformers/tree/master/examples#
* BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding https://arxiv.org/pdf/1810.04805.pdf