# Tutorial: Log Anomaly Detection Using LogAI

This tutorial shows how to use LogAI to run log anomaly detection on a sample dataset.

## Load Data

You can use `OpenSetDataLoader` to load a sample open log dataset. Here we use the HealthApp dataset from
[LogHub](https://zenodo.org/record/3227177#.Y1M3LezML0o) as an example.


In [1]:
import os

from logai.dataloader.openset_data_loader import OpenSetDataLoader, OpenSetDataLoaderConfig

# File configuration
filepath = os.path.join("..", "datasets", "HealthApp_2000.log") # Point to the target HealthApp.log dataset

dataset_name = "HealthApp"
data_loader = OpenSetDataLoader(
    OpenSetDataLoaderConfig(
        dataset_name=dataset_name,
        filepath=filepath)
)

logrecord = data_loader.load_data()

logrecord.to_dataframe().head(5)



| | logline | timestamp | Action | ID |
|---|---|---|---|---|
| 0 | onExtend:1514038530000 14 0 4 | 2017-12-23 22:15:29.615 | Step_LSC | 30002312 |
| 1 | onReceive action: android.intent.action.SCREEN_ON | 2017-12-23 22:15:29.633 | Step_StandReportReceiver | 30002312 |
| 2 | processHandleBroadcastAction action:android.in... | 2017-12-23 22:15:29.635 | Step_LSC | 30002312 |
| 3 | flush sensor data | 2017-12-23 22:15:29.635 | Step_StandStepCounter | 30002312 |
| 4 | getTodayTotalDetailSteps = 1514038440000##699... | 2017-12-23 22:15:29.635 | Step_SPUtils | 30002312 |
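
To make the loaded columns concrete, here is a minimal sketch that splits one raw line into the same four fields. It assumes the raw HealthApp file is pipe-delimited as `time|component|pid|content` with timestamps such as `20171223-22:15:29:615`. That layout and the helper name `parse_healthapp_line` are assumptions for illustration, not part of LogAI's API.

```python
from datetime import datetime

def parse_healthapp_line(line):
    """Split one raw line into timestamp, Action, ID, and logline (assumed layout)."""
    raw_time, action, pid, content = line.rstrip("\n").split("|", 3)
    # Assumed timestamp layout: YYYYMMDD-HH:MM:SS:milliseconds
    timestamp = datetime.strptime(raw_time, "%Y%m%d-%H:%M:%S:%f")
    return {"timestamp": timestamp, "Action": action, "ID": pid, "logline": content}

# Raw line reconstructed from row 0 of the output above (format is an assumption).
record = parse_healthapp_line(
    "20171223-22:15:29:615|Step_LSC|30002312|onExtend:1514038530000 14 0 4"
)
print(record["timestamp"], record["Action"], record["ID"], record["logline"])
```

Splitting with a maximum of three splits keeps any `|` characters inside the message body intact.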


## Preprocess

In the preprocessing step, you can match and replace arbitrary regex patterns to clean the raw loglines. This
helps information extraction from the unstructured part of the logs
and lets you generate more structured attributes from domain knowledge.

In this example, we use the regex below to find IP addresses and replace them with an `<IP>` tag.

In [2]:
from logai.preprocess.preprocessor import PreprocessorConfig, Preprocessor
from logai.utils import constants

loglines = logrecord.body[constants.LOGLINE_NAME]
attributes = logrecord.attributes

preprocessor_config = PreprocessorConfig(
    custom_replace_list=[
        [r"\d+\.\d+\.\d+\.\d+", "<IP>"],   # retrieve all IP addresses and replace with <IP> tag in the original string.
    ]
)

preprocessor = Preprocessor(preprocessor_config)

clean_logs, custom_patterns = preprocessor.clean_log(
    loglines
)
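
As a rough illustration of what this replacement rule does to a single line, the sketch below applies the same pattern with Python's standard `re` module. It is not LogAI's implementation; the helper `replace_ips` and the sample line are made up for this example.

```python
import re

# Same pattern as in the custom_replace_list above.
IP_PATTERN = re.compile(r"\d+\.\d+\.\d+\.\d+")

def replace_ips(line):
    """Return the cleaned line and the list of substrings that were replaced."""
    found = IP_PATTERN.findall(line)
    cleaned = IP_PATTERN.sub("<IP>", line)
    return cleaned, found

# Hypothetical logline, not taken from the HealthApp dataset.
sample = "connect from 10.0.0.12 to 192.168.1.1 failed"
cleaned, found = replace_ips(sample)
print(cleaned)
print(found)
```

Note that the pattern does not bound each number to 0-255, so it also matches other dotted four-part numbers such as a version string `1.2.3.4`.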

## Parsing

After preprocessing, we call an automatic log parsing algorithm (Drain in this example) to extract templates from the cleaned logs.


In [3]:
from logai.information_extraction.log_parser import LogParser, LogParserConfig
from logai.algorithms.parsing_algo.drain import DrainParams

# parsing
parsing_algo_params = DrainParams(
    sim_th=0.5, depth=5
)

log_parser_config = LogParserConfig(
    parsing_algorithm="drain",
    parsing_algo_params=parsing_algo_params
)

parser = LogParser(log_parser_config)
parsed_result = parser.parse(clean_logs)

parsed_loglines = parsed_result['parsed_logline']
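
To give some intuition for the `sim_th` parameter, the sketch below shows the token-level similarity idea used by Drain-style parsers: a message joins an existing template when enough token positions agree, and the positions that differ become wildcards. This is a simplified illustration only (no parse tree, no `depth` handling), the function names are hypothetical, and the second message is made up.

```python
def token_similarity(template, message):
    """Fraction of positions where the two token lists agree (0.0 if lengths differ)."""
    if not template or len(template) != len(message):
        return 0.0
    matches = sum(1 for t, m in zip(template, message) if t == m)
    return matches / len(template)

def merge_into_template(template, message):
    """Keep agreeing tokens and replace differing positions with a wildcard."""
    return [t if t == m else "*" for t, m in zip(template, message)]

SIM_TH = 0.5  # same value as sim_th in the cell above

template = "flush sensor data".split()
message = "flush step data".split()  # hypothetical logline

similarity = token_similarity(template, message)
if similarity >= SIM_TH:
    template = merge_into_template(template, message)

print(round(similarity, 3), " ".join(template))
```

A lower threshold merges more aggressively and yields fewer, more generic templates; a higher one keeps templates more specific.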

## Time-series Anomaly Detection

Here we show an example to conduct time-series anomaly detection with parsed logs.

### Feature Extraction

After parsing the logs and obtaining the log templates, we can extract time-series features by converting
these parsed loglines into counter vectors.

In [4]:
from logai.information_extraction.feature_extractor import FeatureExtractorConfig, FeatureExtractor

config = FeatureExtractorConfig(
    group_by_time="15min",
    group_by_category=['parsed_logline', 'Action', 'ID'],
)

feature_extractor = FeatureExtractor(config)

timestamps = logrecord.timestamp['timestamp']
parsed_loglines = parsed_result['parsed_logline']
counter_vector = feature_extractor.convert_to_counter_vector(
    log_pattern=parsed_loglines,
    attributes=attributes,
    timestamps=timestamps
)

counter_vector.head(5)


| | parsed_logline | Action | ID | timestamp | event_index | counts |
|---|---|---|---|---|---|---|
| 0 | `* * 0 4` | Step_LSC | 30002312 | 2017-12-23 22:15:00 | [0, 10, 13, 20, 27, 34, 41, 48, 55, 62, 65, 83... | 111 |
| 1 | `* * 0 4` | Step_LSC | 30002312 | 2017-12-23 23:00:00 | [1354, 1361, 1368, 1375, 1382, 1389, 1396, 140... | 11 |
| 2 | `* 0 0 *` | Step_LSC | 30002312 | 2017-12-23 22:15:00 | [317, 400, 589, 673, 895, 901, 910, 911, 912, ... | 15 |
| 3 | `* 0 0 *` | Step_LSC | 30002312 | 2017-12-23 22:30:00 | [955, 956, 957, 958, 959, 960, 971, 978, 988, ... | 19 |
| 4 | `* 0 0 *` | Step_LSC | 30002312 | 2017-12-23 22:45:00 | [1079, 1085, 1097, 1104, 1115, 1121, 1127, 113... | 22 |
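
The counting behind a counter vector can be sketched with the standard library alone: floor each timestamp to the start of its 15-minute interval, then count events per (template, interval) group. This is an illustration of the idea under simplified assumptions (a single grouping key, made-up events), not LogAI's `FeatureExtractor` code.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def floor_to_interval(ts, minutes=15):
    """Round a timestamp down to the start of its N-minute interval."""
    return ts - timedelta(
        minutes=ts.minute % minutes,
        seconds=ts.second,
        microseconds=ts.microsecond,
    )

# Made-up (timestamp, template) events for illustration.
events = [
    (datetime(2017, 12, 23, 22, 15, 29), "* * 0 4"),
    (datetime(2017, 12, 23, 22, 20, 5), "* * 0 4"),
    (datetime(2017, 12, 23, 22, 29, 59), "flush sensor data"),
    (datetime(2017, 12, 23, 22, 31, 0), "* * 0 4"),
]

# Collect the positions of events that share a template and an interval.
groups = defaultdict(list)
for index, (ts, template) in enumerate(events):
    groups[(template, floor_to_interval(ts))].append(index)

rows = [
    {"parsed_logline": tpl, "timestamp": start, "event_index": idx, "counts": len(idx)}
    for (tpl, start), idx in sorted(groups.items())
]
for row in rows:
    print(row)
```

The flooring helper only behaves as intended when `minutes` divides 60 evenly, which holds for the 15-minute interval used here.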


### Anomaly Detection

With the generated `counter_vector`, you can use `AnomalyDetector` to detect time-series anomalies.
Here we use an algorithm from the Merlion library called `DynamicBaseline`.

In [5]:
from logai.analysis.anomaly_detector import AnomalyDetector, AnomalyDetectionConfig
from sklearn.model_selection import train_test_split
import pandas as pd

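# Build one "attribute" key per row by joining every grouping column
# (all columns except counts, timestamps, and event indices) with "-".
counter_vector["attribute"] = counter_vector.drop(
    [
        constants.LOG_COUNTS,
        constants.LOG_TIMESTAMPS,
        constants.EVENT_INDEX
    ],
    axis=1
).apply(
    lambda x: "-".join(x.astype(str)), axis=1
)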

attr_list = counter_vector["attribute"].unique()

anomaly_detection_config = AnomalyDetectionConfig(
    algo_name='dbl'
)

res = pd.DataFrame()
for attr in attr_list:
    temp_df = counter_vector[counter_vector["attribute"] == attr]
    if temp_df.shape[0] >= constants.MIN_TS_LENGTH:
        train, test = train_test_split(
            temp_df[[constants.LOG_TIMESTAMPS, constants.LOG_COUNTS]],
            shuffle=False,
            train_size=0.3
        )
        anomaly_detector = AnomalyDetector(anomaly_detection_config)
        anomaly_detector.fit(train)
        anom_score = anomaly_detector.predict(test)
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
        res = pd.concat([res, anom_score])



In [6]:
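# Get anomalous datapoints. Masking a DataFrame with res[res > 0] keeps every row
# (non-matching cells become NaN), so select rows with a row-wise boolean mask instead.
anomalies = counter_vector.iloc[res[(res > 0).any(axis=1)].index]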
anomalies.head(5)

| | parsed_logline | Action | ID | timestamp | event_index | counts | attribute |
|---|---|---|---|---|---|---|---|
| 120 | `processHandleBroadcastAction *` | Step_LSC | 30002312 | 2017-12-23 23:00:00 | [1242, 1243, 1244, 1245, 1246, 1261, 1271, 129... | 14 | `processHandleBroadcastAction *-Step_LSC-30002312` |
| 121 | `processHandleBroadcastAction *` | Step_LSC | 30002312 | 2017-12-23 23:15:00 | [1436, 1437, 1449, 1464, 1469, 1483, 1484, 149... | 17 | `processHandleBroadcastAction *-Step_LSC-30002312` |
| 122 | `processHandleBroadcastAction *` | Step_LSC | 30002312 | 2017-12-23 23:30:00 | [1531, 1532, 1533, 1544, 1545, 1546, 1547, 154... | 15 | `processHandleBroadcastAction *-Step_LSC-30002312` |
| 123 | `processHandleBroadcastAction *` | Step_LSC | 30002312 | 2017-12-23 23:45:00 | [1586, 1587, 1600, 1601, 1602, 1603, 1610, 161... | 15 | `processHandleBroadcastAction *-Step_LSC-30002312` |
| 124 | `processHandleBroadcastAction *` | Step_LSC | 30002312 | 2017-12-24 00:00:00 | [1775, 1820, 1821, 1822, 1823, 1824, 1831, 183... | 15 | `processHandleBroadcastAction *-Step_LSC-30002312` |
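
The train-then-score pattern used above can be illustrated without any dependencies. The sketch below splits a count series chronologically (no shuffling, first 30% for training) and flags test points that sit far from the training mean. It is a deliberately simple z-score stand-in, not Merlion's `DynamicBaseline`, and the counts and the threshold of 3 are made-up values.

```python
from statistics import mean, pstdev

def chronological_split(series, train_fraction):
    """Split a sequence in time order: the first part trains, the rest is scored."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]

def zscore_flags(train, test, threshold=3.0):
    """Flag test values more than `threshold` standard deviations from the training mean."""
    mu = mean(train)
    sigma = pstdev(train) or 1.0  # avoid dividing by zero on a constant series
    return [abs(value - mu) / sigma > threshold for value in test]

# Made-up event counts per interval for one attribute group.
counts = [14, 15, 16, 15, 14, 15, 17, 15, 60, 15]

train, test = chronological_split(counts, train_fraction=0.3)
flags = zscore_flags(train, test)

# Map flagged test positions back to positions in the full series.
anomalous_positions = [len(train) + i for i, flagged in enumerate(flags) if flagged]
print(anomalous_positions)
```

Keeping the split chronological matters for time series: shuffling would let the detector train on data from after the points it is asked to score.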


## Semantic Anomaly Detection

We can also use the log templates for semantic-based anomaly detection. In this approach, we extract
semantic features from the logs in two parts: vectorizing the unstructured log templates
and encoding the structured log attributes.

### Vectorization for unstructured loglines

Here we use `word2vec` to vectorize the unstructured part of the logs. The output is a list of
numeric vectors that represent the semantic features of these log templates.

In [7]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/alexander.huang/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [8]:
from logai.information_extraction.log_vectorizer import VectorizerConfig, LogVectorizer

vectorizer_config = VectorizerConfig(
    algo_name = "word2vec"
)

vectorizer = LogVectorizer(
    vectorizer_config
)

# Train vectorizer
vectorizer.fit(parsed_loglines)

# Transform the loglines into features
log_vectors = vectorizer.transform(parsed_loglines)
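
To show how a tokenized template can become a fixed-length numeric vector, the sketch below averages per-token vectors from a tiny hand-written embedding table. Averaging is one common way to pool token vectors; this does not claim to be how `LogVectorizer` combines them, and the two-dimensional `TOY_EMBEDDINGS` values are invented rather than trained.

```python
# Invented 2-dimensional token vectors; a trained word2vec model would supply these.
TOY_EMBEDDINGS = {
    "flush": [1.0, 0.0],
    "sensor": [0.0, 1.0],
    "data": [1.0, 1.0],
}

def average_vector(tokens, embeddings, dim=2):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]

template_vector = average_vector("flush sensor data".split(), TOY_EMBEDDINGS)
print(template_vector)
```

Templates that share many tokens end up with nearby vectors, which is what lets a downstream detector treat rare or unusual templates as outliers.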

### Categorical Encoding for log attributes

We also do categorical encoding for log attributes to convert the strings into numerical representations.

In [9]:
from logai.information_extraction.categorical_encoder import CategoricalEncoderConfig, CategoricalEncoder

encoder_config = CategoricalEncoderConfig(name="label_encoder")

encoder = CategoricalEncoder(encoder_config)

attributes_encoded = encoder.fit_transform(attributes)
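
Label encoding itself is a small idea: assign each distinct string an integer and replace the strings with those integers. The sketch below shows it with the `Action` values from the first five loaded rows. The sorted ordering of classes is an assumption borrowed from how scikit-learn's `LabelEncoder` behaves; the helper name is hypothetical and this is not LogAI's encoder.

```python
def fit_label_mapping(values):
    """Map each distinct value to an integer, in sorted order of the values."""
    return {value: code for code, value in enumerate(sorted(set(values)))}

# Action values from the first five rows of the loaded HealthApp data.
actions = [
    "Step_LSC",
    "Step_StandReportReceiver",
    "Step_LSC",
    "Step_StandStepCounter",
    "Step_SPUtils",
]

mapping = fit_label_mapping(actions)
encoded = [mapping[action] for action in actions]
print(mapping)
print(encoded)
```

The integers carry no order or distance meaning; they only make the attribute usable as a numeric feature.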

### Feature Extraction

Then we extract and concatenate the semantic features for both the unstructured and structured parts of the logs.


In [10]:
from logai.information_extraction.feature_extractor import FeatureExtractorConfig, FeatureExtractor

timestamps = logrecord.timestamp['timestamp']

config = FeatureExtractorConfig(
    max_feature_len=100
)

feature_extractor = FeatureExtractor(config)

_, feature_vector = feature_extractor.convert_to_feature_vector(log_vectors, attributes_encoded, timestamps)
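
Conceptually, each row of the final feature set combines the log vector with the encoded attributes. The sketch below fixes the log vector to a set length by truncating or zero-padding it and then appends the attribute codes. Treating a length cap such as `max_feature_len` this way is an assumed interpretation for illustration; the helper names are hypothetical and the numbers are made up.

```python
def fit_to_length(vector, length, pad_value=0.0):
    """Truncate a vector to `length`, or pad it with `pad_value` up to `length`."""
    clipped = list(vector[:length])
    return clipped + [pad_value] * (length - len(clipped))

def build_feature_row(log_vector, encoded_attributes, max_feature_len):
    """Fixed-length log vector followed by the encoded attribute values."""
    return fit_to_length(log_vector, max_feature_len) + list(encoded_attributes)

# Made-up inputs: a short log vector and two encoded attributes.
feature_row = build_feature_row([0.2, 0.7, 0.1], [0, 3], max_feature_len=5)
print(feature_row)
```

Giving every row the same width is what allows the rows to be stacked into one matrix for the anomaly detector.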




### Anomaly Detection

With the extracted log semantic feature set, we can perform anomaly detection to find the abnormal
logs. Here we use `isolation_forest` as an example.

In [13]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(feature_vector, train_size=0.7, test_size=0.3)

from logai.algorithms.anomaly_detection_algo.isolation_forest import IsolationForestParams
from logai.analysis.anomaly_detector import AnomalyDetectionConfig, AnomalyDetector

algo_params = IsolationForestParams(
    n_estimators=10,
    max_features=100,
    warm_start=False
)
config = AnomalyDetectionConfig(
    algo_name='isolation_forest',
    algo_params=algo_params
)

anomaly_detector = AnomalyDetector(config)
anomaly_detector.fit(train)
res = anomaly_detector.predict(test)
# obtain the anomalous datapoints
anomalies = res[res==1]

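When this notebook was run, the cell above stopped with the following error from scikit-learn's parameter validation. The message indicates that `warm_start` reached `IsolationForest` as the integer `0` rather than as a `bool`, and the two cells below were left unexecuted:

InvalidParameterError: The 'warm_start' parameter of IsolationForest must be an instance of 'bool' or an instance of 'numpy.bool_'. Got 0 instead.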

#### Check the corresponding loglines

In [None]:
loglines.iloc[anomalies.index].head(5)

#### Check the corresponding attributes

In [None]:
attributes.iloc[anomalies.index].head(5)
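
The lookup in the last two cells amounts to filtering predictions by label and using the surviving row indices to fetch the original records. The sketch below shows that step with plain Python. It uses scikit-learn's convention for `IsolationForest.predict`, where `-1` marks an outlier and `1` an inlier; whether LogAI's `AnomalyDetector` passes these labels through unchanged is not confirmed here, so check which label your installed version emits before filtering. The predictions are made up.

```python
# Made-up predictions keyed by original row index (-1 = outlier, 1 = inlier).
predictions = {0: 1, 1: 1, 2: -1, 3: 1, 4: -1}

# Loglines from the first five rows of the loaded data, in row order.
loglines = [
    "onExtend:1514038530000 14 0 4",
    "onReceive action: android.intent.action.SCREEN_ON",
    "processHandleBroadcastAction action:android.in...",
    "flush sensor data",
    "getTodayTotalDetailSteps = 1514038440000##699...",
]

def rows_with_label(predictions, label):
    """Return the row indices whose prediction equals `label`, in ascending order."""
    return sorted(index for index, value in predictions.items() if value == label)

outlier_indices = rows_with_label(predictions, -1)
outlier_loglines = [loglines[index] for index in outlier_indices]
print(outlier_indices)
print(outlier_loglines)
```

Because the test set was drawn from `feature_vector` with its index preserved, the same indices can be used to pull both the loglines and the attributes for each flagged row.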