# Feature Engineering

The initial concept and some of the code borrow heavily from [Loglizer](https://github.com/logpai/loglizer).  
Input data for the `data_processor.py` file is created by `parse/project_parser.py` and is stored in `parse/project_parsed`.

This demo walks through the process of converting the semi-structured log data created by the `parse` files into `time images` that can be fed into a CNN.

In [36]:
from collections import Counter, OrderedDict

import numpy as np
import pandas as pd
import regex as re  # third-party `regex` package; the pattern used below also works with the standard `re` module

from sliding_window_processor import FeatureExtractor, sequence_padder, windower

First, the train set is loaded so we can look at its format.

In [14]:
input_data = pd.read_csv("../../project_processed_data/HDFS_100k.log_structured.csv")

Now load the labels (the y data) and take a small subset of x for easy demonstration.

In [15]:
y = pd.read_csv("../../project_processed_data/anomaly_label.csv")
x_train = input_data[:100]

In [16]:
x_train.shape

(100, 9)

First, the events are collected into one list per block id:

In [17]:
def collect_event_ids(data_frame, regex_pattern, column_names):
    """
    turns input data_frame into a 2 columned dataframe
    with columns: BlockId, EventSequence
    where EventSequence is a list of the events that happened to the block
    """
    data_dict = OrderedDict()
    for _, row in data_frame.iterrows():
        blk_id_list = re.findall(regex_pattern, row["Content"])
        blk_id_set = set(blk_id_list)
        for blk_id in blk_id_set:
            if blk_id not in data_dict:
                data_dict[blk_id] = []
            data_dict[blk_id].append(row["EventId"])
    data_df = pd.DataFrame(list(data_dict.items()), columns=column_names)
    return data_df

In [18]:
re_pat = r"(blk_-?\d+)"
col_names = ["BlockId", "EventSequence"]
events_df = collect_event_ids(x_train, re_pat, col_names) # taking a subset for demonstrative purposes

This produced a dataframe with a unique identifier (BlockId) and the list of events in EventSequence  
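
To make the grouping step concrete, here is a minimal sketch on made-up data. The log lines, block ids, and the helper name `group_events_by_block` are hypothetical and only mirror the logic of `collect_event_ids` above; they are not taken from the HDFS data set.

```python
from collections import OrderedDict
import re

import pandas as pd

# Toy log lines; the block ids and event ids are invented for illustration.
toy_logs = pd.DataFrame(
    {
        "Content": [
            "Receiving block blk_1 src: /10.0.0.1",
            "Receiving block blk_-2 src: /10.0.0.2",
            "PacketResponder 1 for block blk_1 terminating",
            "Deleting blocks blk_1 blk_-2",
        ],
        "EventId": ["E5", "E5", "E11", "E9"],
    }
)


def group_events_by_block(data_frame, regex_pattern=r"(blk_-?\d+)"):
    """Map each block id to the ordered list of EventIds whose Content mentions it."""
    grouped = OrderedDict()
    for content, event_id in zip(data_frame["Content"], data_frame["EventId"]):
        # set() so a block mentioned twice in one line is only counted once
        for blk_id in set(re.findall(regex_pattern, content)):
            grouped.setdefault(blk_id, []).append(event_id)
    return pd.DataFrame(list(grouped.items()), columns=["BlockId", "EventSequence"])


toy_events = group_events_by_block(toy_logs)
print(toy_events)
```

A log line that mentions several blocks (the last toy line) contributes its event to every one of those blocks, which is why a single line can appear in more than one event sequence.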

Next, join with the y data so that each block carries its label and the labels can later be split into train and test sets together with the features.

In [19]:
events_df = events_df.merge(y, on="BlockId")
display(events_df.head())

|   | BlockId | EventSequence | Label |
|---|---|---|---|
| 0 | blk_-1608999687919862906 | [E5, E22, E5, E5, E11, E11, E9, E9, E11, E9, E... | Normal |
| 1 | blk_7503483334202473044 | [E5, E5, E22, E5, E11, E9, E11, E9, E11, E9, E... | Normal |
| 2 | blk_-3544583377289625738 | [E5, E22, E5, E5, E11, E9, E11, E9, E11, E9, E... | Anomaly |
| 3 | blk_-9073992586687739851 | [E5, E22, E5, E5, E11, E9, E11, E9, E11, E9, E... | Normal |


The EventSequence column is then passed to the feature extractor's `fit_transform_subblocks()` method.  
To demonstrate what happens inside the class, the code is dissected and shown here step by step.

Note: `data_processor.py` also contains `fit_transform()` and `transform()`. Use these functions when you don't want to create time images (more on this to come).

In [26]:
# The way the code is called normally
events_values = events_df["EventSequence"].values
fe = FeatureExtractor()
subblocks_train = fe.fit_transform(events_values, term_weighting="tf-idf", length_percentile=100, window_size=10)

final shape will be  36 10
train data shape:  (4, 36, 10)


Here we will define the parameters of the method call

In [34]:
window_size = 10
term_weighting = "tf-idf"
length_percentile=100

X_seq = events_values

max_seq_length = max(np.array(list(map(len, X_seq))))

num_rows = max_seq_length - window_size + 1

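# collect the unique event ids across all sequences
unique_events = set()
for seq in X_seq:
    unique_events.update(seq)
# a sorted list gives a stable column order; a set is unordered and
# newer pandas versions may refuse it as `columns`
events = sorted(unique_events)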

Next, we will be turning each event sequence into a time image.  
This is done by applying a sliding window to the event sequences
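
As a rough illustration of the sliding-window idea, the sketch below builds one small time image from a single toy sequence. The internals of `sequence_padder` and `windower` are not shown in this demo, so `pad_sequence`, `sliding_windows`, the `_PAD` token, and right-padding are all assumptions made for this example, not the project's implementation.

```python
from collections import Counter

import numpy as np

PAD = "_PAD"  # hypothetical padding token


def pad_sequence(seq, length, pad_token=PAD):
    """Right-pad seq with pad_token up to the given length (assumed behaviour)."""
    return list(seq) + [pad_token] * (length - len(seq))


def sliding_windows(seq, window_size):
    """Return every contiguous window of window_size items, moving one step at a time."""
    return [seq[i:i + window_size] for i in range(len(seq) - window_size + 1)]


sequence = ["E5", "E22", "E5", "E11", "E9"]
vocabulary = sorted(set(sequence))    # ['E11', 'E22', 'E5', 'E9']
padded = pad_sequence(sequence, 6)    # one padding token is appended
windows = sliding_windows(padded, 3)  # 6 - 3 + 1 = 4 windows

# One row per window, one column per event id; the padding token is not counted.
time_image = np.array(
    [[Counter(window)[event] for event in vocabulary] for window in windows],
    dtype=float,
)
print(time_image)
# [[0. 1. 2. 0.]
#  [1. 1. 1. 0.]
#  [1. 0. 1. 1.]
#  [1. 0. 0. 1.]]
```

Each row is the event-count histogram of one window, so reading the rows top to bottom shows how the mix of events changes over the life of the block.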

In [37]:
# loop over each sequence to create the time image
time_images = []
for block in X_seq:
    padded_block = sequence_padder(block, max_seq_length)
    time_image = windower(padded_block, window_size)
    time_image_counts = []
    for time_row in time_image:
        row_count = Counter(time_row)
        time_image_counts.append(row_count)

    time_image_df = pd.DataFrame(time_image_counts, columns=events)
    time_image_df = time_image_df.reindex(sorted(time_image_df.columns), axis=1)
    time_image_df = time_image_df.fillna(0)
    time_image_np = time_image_df.to_numpy()

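    # resize if too large (not expected here, since every block is padded to max_seq_length)
    if len(time_image_np) > num_rows:
        # NOTE: resize_time_image is not defined or imported in this notebook
        time_image_np = resize_time_image(
            time_image_np, (num_rows, len(events)),
        )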

    time_images.append(time_image_np)

# stack all the blocks
X = np.stack(time_images)

Looking at the first time image from the `X` numpy array:

In [38]:
print(X.shape)
X[0]

(4, 36, 10)


array([[3., 0., 0., 1., 0., 0., 0., 3., 0., 3.],
       [3., 0., 0., 1., 0., 1., 0., 2., 0., 3.],
       [3., 0., 0., 0., 0., 2., 0., 2., 0., 3.],
       [3., 0., 0., 0., 0., 3., 0., 1., 0., 3.],
       [3., 0., 0., 0., 0., 3., 0., 0., 1., 3.],
       [2., 0., 0., 0., 0., 3., 0., 1., 1., 3.],
       [1., 1., 0., 0., 0., 3., 0., 1., 1., 3.],
       [1., 1., 0., 0., 0., 3., 0., 1., 2., 2.],
       [1., 1., 0., 0., 0., 3., 0., 2., 2., 1.],
       [0., 1., 1., 0., 0., 3., 0., 2., 2., 1.],
       [0., 1., 1., 0., 1., 3., 0., 2., 2., 0.],
       [0., 1., 1., 0., 1., 3., 0., 2., 2., 0.],
       [0., 1., 1., 0., 1., 3., 0., 2., 2., 0.],
       [0., 1., 1., 0., 1., 2., 1., 2., 2., 0.],
       [0., 1., 1., 0., 2., 2., 1., 2., 1., 0.],
       [0., 1., 1., 0., 2., 2., 1., 1., 2., 0.],
       [0., 0., 1., 0., 2., 2., 1., 1., 3., 0.],
       [0., 0., 1., 0., 2., 2., 1., 2., 2., 0.],
       [0., 0., 1., 0., 2., 2., 1., 2., 2., 0.],
       [0., 1., 0., 0., 2., 2., 1., 2., 2., 0.],
       [0., 1., 1., 

The shape (4, 36, 10) means there are 4 time images, each with 36 rows and 10 columns.  
The 36 rows are the `num_rows = max_seq_length - window_size + 1` window positions, and the 10 columns correspond to the 10 unique events in this data subset.

Next, if `fit_transform` is called with `term_weighting="tf-idf"`, the following transformation is applied.

Since the data is 3-dimensional (an array of time images), it is first reshaped to 2 dimensions so that tf-idf can be applied, and afterwards reshaped back to the original 3 dimensions.

In [39]:
# apply tf-idf if requested
if term_weighting == "tf-idf":

    # Set up sizing
    num_instance, _, _ = X.shape
    dim1, dim2, dim3 = X.shape
    X = X.reshape(-1, dim3)

    # apply tf-idf
    df_vec = np.sum(X > 0, axis=0)
    idf_vec = np.log(num_instance / (df_vec + 1e-8))
    idf_tile = np.tile(idf_vec, (num_instance * dim2, 1))
    idf_matrix = X * idf_tile
    X = idf_matrix

    # reshape to original dimensions
    X = X.reshape(dim1, dim2, dim3)

x_train = X
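
Written out, the weighting computed by the cell above is the following, where the symbols are introduced here only for illustration: $N$ is `num_instance` (the number of time images), $T$ is the number of rows per time image, and $x_{b,t,j}$ is the count of event $j$ in window $t$ of block $b$.

$$
\mathrm{df}_j = \sum_{b=1}^{N}\sum_{t=1}^{T} \mathbf{1}\left[x_{b,t,j} > 0\right], \qquad
\mathrm{idf}_j = \log\frac{N}{\mathrm{df}_j + 10^{-8}}, \qquad
x'_{b,t,j} = x_{b,t,j}\,\mathrm{idf}_j
$$

Note that $\mathrm{df}_j$ is counted over window rows while $N$ counts time images, so an event that appears in more than $N$ rows receives a negative weight.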

In [40]:
x_train.shape

(4, 36, 10)

Once `fit_transform` has been applied, the columns, tf-idf information, and other processing parameters are saved so they can be applied to the test set with `transform`.

In [41]:
y_train = events_df["Label"]

In [43]:
print(x_train.shape)
print(y_train.shape)


(4, 36, 10)
(4,)


So now the train data set is complete.  
To create the test data set, the same process is run, but with `transform()` instead of `fit_transform()`.
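
The fit/transform split can be sketched as follows. `TinyTfIdfWeighter` is a hypothetical stand-in written for this example, not the project's `FeatureExtractor`, and it only covers the idf weights; the real class also stores the event columns and other parameters. The random arrays stand in for already-built time images.

```python
import numpy as np


class TinyTfIdfWeighter:
    """Hypothetical illustration of learning weights on train and reusing them on test."""

    def fit_transform(self, X):
        # X has shape (blocks, windows, events)
        num_instance, _, num_events = X.shape
        flat = X.reshape(-1, num_events)
        df_vec = np.sum(flat > 0, axis=0)
        # learned from the train set only, then stored for later reuse
        self.idf_vec = np.log(num_instance / (df_vec + 1e-8))
        return X * self.idf_vec  # broadcasts over the last (event) axis

    def transform(self, X):
        # reuse the stored weights; nothing is re-estimated from the test data
        return X * self.idf_vec


rng = np.random.default_rng(0)
X_train_demo = rng.integers(0, 3, size=(4, 6, 5)).astype(float)
X_test_demo = rng.integers(0, 3, size=(2, 6, 5)).astype(float)

weighter = TinyTfIdfWeighter()
train_weighted = weighter.fit_transform(X_train_demo)
test_weighted = weighter.transform(X_test_demo)
print(train_weighted.shape, test_weighted.shape)
```

Keeping the weights fixed after fitting means the test set is scaled exactly like the train set, and no statistics from the test data influence the features.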