# Feature Engineering

The initial concept and some of the code borrow heavily from [Loglizer](https://github.com/logpai/loglizer).  
Input data for `data_processor.py` is created by `parse/project_parser.py` and stored in `parse/project_parsed`.

This demo walks you through converting the semi-structured log data created by the `parse` files into `time images` that can be fed into a CNN.

In [39]:
import pandas as pd
import numpy as np
from collections import OrderedDict
import regex as re
from data_processor import FeatureExtractor
from collections import Counter

The training set is loaded first so we can look at its format.

In [58]:
input_data = pd.read_csv("../parse/project_parsed/HDFS_train.log_structured.csv")

Now load the y data (the anomaly labels) and take a subset of x for easy demonstration.

In [89]:
y = pd.read_csv("../parse/project_parsed/anomaly_label.csv")
x_train = input_data.loc[:100]

In [90]:
input_data.head()

Unnamed: 0,LineId,Date,Time,Pid,Level,Component,Content,EventId,EventTemplate,ParameterList
0,1,81109,203518,143,INFO,dfs.DataNode$DataXceiver,Receiving block blk_-1608999687919862906 src: ...,09a53393,Receiving block <*> src: <*> dest: <*>,"['blk_-1608999687919862906', '/10.250.19.102:5..."
1,2,81109,203518,35,INFO,dfs.FSNamesystem,BLOCK* NameSystem.allocateBlock: /mnt/hadoop/m...,3d91fa85,BLOCK* NameSystem.allocateBlock: <*> <*>,['/mnt/hadoop/mapred/system/job_200811092030_0...
2,3,81109,203519,143,INFO,dfs.DataNode$DataXceiver,Receiving block blk_-1608999687919862906 src: ...,09a53393,Receiving block <*> src: <*> dest: <*>,"['blk_-1608999687919862906', '/10.250.10.6:405..."
3,4,81109,203519,145,INFO,dfs.DataNode$DataXceiver,Receiving block blk_-1608999687919862906 src: ...,09a53393,Receiving block <*> src: <*> dest: <*>,"['blk_-1608999687919862906', '/10.250.14.224:4..."
4,5,81109,203519,145,INFO,dfs.DataNode$PacketResponder,PacketResponder 1 for block blk_-1608999687919...,d38aa58d,PacketResponder <*> for block <*> <*>,"['1', 'blk_-1608999687919862906 terminating']"


First, each event is collected into a list for each block id with:

In [91]:
def collect_event_ids(data_frame, regex_pattern, column_names):
    """
    turns input data_frame into a 2 columned dataframe
    with columns: BlockId, EventSequence
    where EventSequence is a list of the events that happened to the block
    """
    data_dict = OrderedDict()
    for _, row in data_frame.iterrows():
        blk_id_list = re.findall(regex_pattern, row["Content"])
        blk_id_set = set(blk_id_list)
        for blk_id in blk_id_set:
            if blk_id not in data_dict:
                data_dict[blk_id] = []
            data_dict[blk_id].append(row["EventId"])
    data_df = pd.DataFrame(list(data_dict.items()), columns=column_names)
    return data_df

In [95]:
re_pat = r"(blk_-?\d+)"
col_names = ["BlockId", "EventSequence"]
events_df = collect_event_ids(x_train, re_pat, col_names) # taking a subset for demonstrative purposes

This produces a dataframe with a unique identifier (BlockId) and the list of events for that block in EventSequence.  
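
To make the grouping concrete, here is a minimal sketch of the same idea on a few made-up log lines. The log contents and shortened block ids are hypothetical; only the regular expression is taken from the cell above, and the standard-library `re` module is used in place of `regex`.

```python
import re
from collections import OrderedDict

# Hypothetical (content, event id) pairs; real rows come from the parsed CSV.
rows = [
    ("Receiving block blk_-1608 src: /10.0.0.1 dest: /10.0.0.2", "09a53393"),
    ("BLOCK* NameSystem.allocateBlock: /tmp/part-0 blk_-1608", "3d91fa85"),
    ("Receiving block blk_7503 src: /10.0.0.3 dest: /10.0.0.4", "09a53393"),
    ("PacketResponder 1 for block blk_-1608 terminating", "d38aa58d"),
]

block_pattern = re.compile(r"(blk_-?\d+)")

sequences = OrderedDict()
for content, event_id in rows:
    # set() records an event once per block even if the id repeats within a line
    for blk_id in set(block_pattern.findall(content)):
        sequences.setdefault(blk_id, []).append(event_id)

print(dict(sequences))
```

Each block id ends up with its events in the order the log lines were read, which is what the later time-image step relies on.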

Now join with the y data, so that the labels stay aligned with the event sequences when the data is split into train and test sets.

In [96]:
events_df = events_df.merge(y, on="BlockId")
display(events_df.head())

Unnamed: 0,BlockId,EventSequence,Label
0,blk_-1608999687919862906,"[09a53393, 3d91fa85, 09a53393, 09a53393, d38aa...",Normal
1,blk_7503483334202473044,"[09a53393, 09a53393, 3d91fa85, 09a53393, d38aa...",Normal
2,blk_-3544583377289625738,"[09a53393, 3d91fa85, 09a53393, 09a53393, d38aa...",Anomaly
3,blk_-9073992586687739851,"[09a53393, 3d91fa85, 09a53393, 09a53393, d38aa...",Normal
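
One thing worth knowing about the join: `DataFrame.merge` defaults to an inner join, so any block without a matching label is silently dropped. The sketch below uses small made-up frames (all ids and labels are hypothetical) to show how a left join can surface unlabeled blocks.

```python
import pandas as pd

# Hypothetical event sequences and labels; blk_3 has no label, blk_4 has no events.
events = pd.DataFrame({
    "BlockId": ["blk_1", "blk_2", "blk_3"],
    "EventSequence": [["a", "b"], ["a"], ["b", "c"]],
})
labels = pd.DataFrame({
    "BlockId": ["blk_1", "blk_2", "blk_4"],
    "Label": ["Normal", "Anomaly", "Normal"],
})

inner = events.merge(labels, on="BlockId")              # default how="inner"
left = events.merge(labels, on="BlockId", how="left")   # keeps unlabeled blocks

# Blocks that would be lost by the inner join
unlabeled = left[left["Label"].isna()]["BlockId"].tolist()
print(len(inner), len(left), unlabeled)
```

Comparing the two row counts is a cheap sanity check before training.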


The EventSequence column is then passed to the feature extractor's `fit_transform_subblocks()` method.  
To demonstrate what happens inside the class, the code is dissected and shown here step by step.

Note: `data_processor.py` also contains `fit_transform()` and `transform()`; use these functions when you don't want to create time images (more on this to come).

In [97]:
# The way the code is normally called
events_values = events_df["EventSequence"].values
fe = FeatureExtractor()
subblocks_train = fe.fit_transform_subblocks(
    events_values, term_weighting="tf-idf", rolling=True
)


Train data shape:  (4, 19, 10)


Here we will define the parameters of the method call

In [98]:
rolling = True
term_weighting = "tf-idf"

X_seq = events_values
unique_events = set()
for i in X_seq:
    unique_events.update(i)
events = unique_events

Next, each event sequence is turned into a time image.  
The sequence is split into 20 partitions of 5% each: every row of the time image is one 5% increment in time, and every column is an event ID.  
Additionally, when the rolling parameter is `True`, a rolling-window sum with a window size of 2 is applied over the rows, which gives each row of the time image more temporal context.

In [99]:
# Convert into bag of words
all_blocks_count = []
for block in X_seq:
    # multiply block by 20 for 5% partitions
    block_rep = np.repeat(block, 20)
    # now split into 5% partitions
    block_split = np.split(block_rep, 20)
    block_counts = []
    for sub_block in block_split:
        # count each sub_block
        subset_count = Counter(sub_block)
        block_counts.append(subset_count)
    # put into dataframe to add NaNs for missing events
    # sorted() fixes the column order; a set is unordered and newer pandas versions reject it
    # divide by 20 as the original operation multiplied by 20
    block_df = pd.DataFrame(block_counts, columns=sorted(events)) / 20
    block_df = block_df.reindex(sorted(block_df.columns), axis=1)
    block_df = block_df.fillna(0)
    if rolling:
        block_df = block_df.rolling(window=2).sum()
        block_df = block_df.dropna()

    block_np = block_df.to_numpy()
    all_blocks_count.append(block_np)
X = np.stack(all_blocks_count)
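
The repeat-then-split trick in the cell above is easier to follow on a tiny input. The sketch below uses a hypothetical five-event sequence and 4 partitions (25% each) instead of 20; the event names are made up.

```python
from collections import Counter

import numpy as np
import pandas as pd

block = ["A", "B", "A", "C", "C"]  # hypothetical event sequence of length 5
n_parts = 4                        # 25% partitions here; the notebook uses 20

# Repeating every event n_parts times makes the length divisible by n_parts,
# so np.split can cut it into equal pieces for any sequence length.
block_rep = np.repeat(block, n_parts)
parts = np.split(block_rep, n_parts)

# Count events per partition, then undo the repetition by dividing by n_parts
counts = [Counter(part.tolist()) for part in parts]
image = pd.DataFrame(counts, columns=["A", "B", "C"]).fillna(0) / n_parts

# Rolling sum over pairs of adjacent rows; the first row has no predecessor
rolled = image.rolling(window=2).sum().dropna()

print(image)
print(rolled)
```

Summing `image` down its rows recovers the original event counts, so the fractional values are just each event's share of a time slice. The rolling sum leaves one row fewer than the number of partitions.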

Looking at the first time image from the `X` numpy array:

In [100]:
print(X.shape)
X[0]

(4, 19, 10)


array([[3.  , 1.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.5 , 0.  , 0.  ],
       [1.75, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 2.  , 0.  , 0.75],
       [0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 2.5 , 0.  , 2.  ],
       [0.  , 0.  , 0.  , 1.25, 0.  , 0.  , 0.  , 1.  , 0.  , 2.25],
       [0.  , 0.  , 0.  , 3.  , 0.  , 0.  , 0.  , 0.  , 0.5 , 1.  ],
       [1.  , 0.  , 0.  , 1.75, 0.  , 0.75, 0.  , 0.  , 1.  , 0.  ],
       [2.  , 0.  , 0.  , 0.  , 0.  , 1.  , 0.  , 0.  , 1.5 , 0.  ],
       [1.  , 0.  , 1.  , 0.25, 0.  , 0.25, 1.  , 0.  , 1.  , 0.  ],
       [0.  , 0.  , 1.  , 2.  , 0.5 , 0.  , 1.  , 0.  , 0.  , 0.  ],
       [0.  , 0.  , 1.  , 1.75, 1.  , 0.  , 0.  , 0.  , 0.75, 0.  ],
       [1.  , 0.  , 1.  , 0.  , 0.5 , 0.  , 0.  , 0.  , 2.  , 0.  ],
       [2.  , 0.  , 0.  , 0.  , 0.  , 1.  , 0.25, 0.  , 1.25, 0.  ],
       [1.  , 0.  , 0.  , 1.5 , 0.  , 1.  , 1.  , 0.  , 0.  , 0.  ],
       [1.  , 0.  , 0.  , 2.  , 0.  , 0.  , 0.75, 0.  , 0.75, 0.  ],
       [2.  , 0.  , 0.  , 0.5 , 0.

From the shape of (4, 19, 10), there are 4 time images of size 19 rows and 10 columns.  
The original time image has 5% increments, so 20 rows, but one is lost when the rolling sum function is applied.  
10 columns means that there are 10 unique events in this data subset.

Next, if `fit_transform_subblocks()` is called with `term_weighting="tf-idf"`, the following transformation is applied.

Since the data is 3-dimensional (an array of time images), the array is reshaped to 2 dimensions to apply tf-idf and is then reshaped back to its original 3 dimensions.

In [101]:
# applies tf-idf if the term_weighting parameter requests it
if term_weighting == "tf-idf":

    # Set up sizing
    num_instance, _, _ = X.shape
    dim1, dim2, dim3 = X.shape
    X = X.reshape(-1, dim3)

    # apply tf-idf
    df_vec = np.sum(X > 0, axis=0)
    idf_vec = np.log(num_instance / (df_vec + 1e-8))
    idf_tile = np.tile(idf_vec, (num_instance * dim2, 1))
    idf_matrix = X * idf_tile
    X = idf_matrix

    # reshape to original dimensions
    X = X.reshape(dim1, dim2, dim3)

x_train = X

In [104]:
x_train.shape

(4, 19, 10)
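
As a check on the reshape round trip, the sketch below runs the same weighting on a tiny hypothetical array of 2 time images with 3 rows and 2 event columns. It uses NumPy broadcasting instead of `np.tile`, which should give the same result; the values are made up.

```python
import numpy as np

# Hypothetical stack of 2 time images, 3 rows each, 2 event columns
X_toy = np.array([
    [[1.0, 0.0], [0.5, 0.0], [0.0, 0.0]],
    [[2.0, 1.0], [0.0, 0.0], [1.0, 0.0]],
])
n_images, n_rows, n_events = X_toy.shape

flat = X_toy.reshape(-1, n_events)            # shape (6, 2): one row per time slice
df_vec = np.sum(flat > 0, axis=0)             # number of rows in which each event appears
idf_vec = np.log(n_images / (df_vec + 1e-8))  # same formula as in the cell above
weighted = (flat * idf_vec).reshape(n_images, n_rows, n_events)

print(idf_vec)
print(weighted.shape)
```

Because the document frequency is counted over rows while the numerator is the number of images, an event that appears in more rows than there are images gets a negative weight, as the first column does here.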

Then, once the fit-transform step has been applied, the columns, tf-idf information, and other processing parameters are saved so that they can be applied to the test set with `fit_transform`.
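
The idea of reusing saved parameters can be sketched as follows. This is not the `FeatureExtractor` implementation; the saved column order, the idf weights, and the test image are all hypothetical, and the sketch only illustrates aligning test columns to the training columns before applying the stored weights.

```python
import numpy as np
import pandas as pd

# Hypothetical parameters saved while fitting on the training data
train_columns = ["A", "B", "C"]
saved_idf_vec = np.array([0.1, 0.7, 1.2])

# Hypothetical test time image: event "B" never occurs, and "Z" was unseen in training
test_image = pd.DataFrame({"A": [1.0, 0.5], "C": [0.0, 1.0], "Z": [2.0, 0.0]})

# Drop unseen events and add missing ones as zeros, in the training column order
aligned = test_image.reindex(columns=train_columns, fill_value=0.0)

# Apply the training-time weights instead of recomputing them on the test data
weighted_test = aligned.to_numpy() * saved_idf_vec
print(weighted_test)
```

Keeping the column order and weights fixed means train and test images share the same layout, which a CNN needs.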

In [105]:
y_train = events_df["Label"]

In [82]:
print(x_train.shape)


(4, 19, 10)