
Question about feature extraction on bgl dataset #10

Open
cherishwsx opened this issue Jun 12, 2020 · 3 comments

Comments

@cherishwsx

Hi, it's me again. :)

I'm trying to run the Deeplog model on the bgl dataset. So far, I have been able to follow the logic and generate the event sequences from the structured bgl log using the sample_bgl.py you provided (many thanks!!).

It basically slides a 30-min window with a 12-min step size over the structured bgl log. As a result, we end up with event sequences that contain either a huge number of events (e.g. I found one sequence with 12514 events in it...) or only one or zero events (since no events happened during that window's time period).
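For context, this is roughly how I understand the windowing step (just a sketch of my reading of sample_bgl.py, not the script itself; the column names and the CSV path are my assumptions):

```python
import pandas as pd

WINDOW_SIZE = 30 * 60  # 30-minute window, in seconds
STEP_SIZE = 12 * 60    # 12-minute step size, in seconds

def time_sliding_windows(df, time_col="Timestamp", event_col="EventId"):
    """Slide a fixed time window over the structured bgl log and collect the
    event IDs that fall inside each window position."""
    start, end = df[time_col].min(), df[time_col].max()
    sequences = []
    while start <= end:
        in_window = (df[time_col] >= start) & (df[time_col] < start + WINDOW_SIZE)
        sequences.append(df.loc[in_window, event_col].tolist())
        start += STEP_SIZE
    return sequences

# structured = pd.read_csv("BGL.log_structured.csv")  # path is just an example
# sequences = time_sliding_windows(structured)
```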

After generating the event sequences, I deleted the empty ones and ended up with a file of 65 non-empty event sequences. I then randomly picked 60 of them as training sequences, and the remaining 5 will be validation data.

This is where my questions come in.

  1. When generating the sequential features for the training dataset, should I do the same thing as for the hdfs dataset, i.e. slide a window of size 10 (or some other size) over each event sequence, with the event immediately following the current window serving as that window's label? If so, how should I deal with event sequences that contain only 1 event?

  2. I also remember you mentioned in another post that, for the bgl dataset, the event sequences can be used directly as sequential vectors since they are already generated with a sliding window. In that case, my understanding is that each event sequence (minus its last event) directly becomes a sequential vector, and the label for that vector is the last event in the sequence? Then again, what about event sequences with only 1 event?

Looking forward to your valuable feedback!! And thank you for answering all of my questions!!!

@d0ng1ee
Owner

d0ng1ee commented Jun 12, 2020

If you use machine learning methods, you can directly use the sequences obtained from the sliding window to extract features. I have tried the bgl dataset with https://github.com/logpai/loglizer, setting window_size=1h and step_size=0.5h, which gave a better result.
I have not continued experimenting with the bgl dataset on the lstm model :(
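Roughly, what I mean by extracting features directly from the windowed sequences is an event count vector per window, something like this sketch (just the idea, not loglizer's actual API):

```python
from collections import Counter

def event_count_matrix(sequences, event_vocab):
    """Turn each windowed event sequence into a count vector over the event
    templates; these vectors can feed a classic ML model (LR / SVM / ...)."""
    matrix = []
    for seq in sequences:
        counts = Counter(seq)
        matrix.append([counts.get(event, 0) for event in event_vocab])
    return matrix

# event_count_matrix([["E1", "E3", "E1"], ["E2"]], ["E1", "E2", "E3"])
# -> [[2, 0, 1], [0, 1, 0]]
```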

The lstm method, however, requires a consistent input length, so for unsupervised learning you need to set a fixed window over each sequence, just like for hdfs.
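In other words, something like this sketch of a fixed window over each event sequence (window size 10 here is just an example, same idea as the hdfs preprocessing):

```python
def make_samples(sequence, window_size=10):
    """Slide a fixed-size window over one event sequence; the event right
    after each window is that window's label, as in the hdfs preprocessing."""
    samples = []
    for i in range(len(sequence) - window_size):
        window = sequence[i:i + window_size]
        label = sequence[i + window_size]
        samples.append((window, label))
    return samples

# A sequence of 12 events gives 2 (window, label) pairs with window_size=10:
# make_samples(list(range(1, 13))) -> [([1..10], 11), ([2..11], 12)]
```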

I think event sequences with only 1 event can be ignored as noise during training (i.e. not included in training) and simply padded during testing. Of course, this is just my rough understanding...
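By padding at test time I mean something like this (just a sketch; using 0 as the padding token is an arbitrary choice):

```python
def pad_for_testing(sequence, window_size=10, pad_token=0):
    """Left-pad a short sequence so it is long enough to form at least one
    fixed-size window plus a label during testing."""
    needed = window_size + 1 - len(sequence)
    if needed > 0:
        sequence = [pad_token] * needed + sequence
    return sequence

# pad_for_testing([7], window_size=10) -> [0]*10 + [7]
```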

@cherishwsx
Author

Thank you for the reply!

I think I will try tuning window_size and step_size when generating the event sequences for the bgl data so that the sequence lengths are more evenly distributed (ideally not ranging all the way from 1, 2, 3 up to 12541...). Then I can set a better fixed window over the event sequences, just like we do for hdfs, to generate the sequential vectors and fit the lstm model. :)

@wuqinglaojieshen

> Thank you for the reply!
>
> I think I will try tuning window_size and step_size when generating the event sequences for the bgl data so that the sequence lengths are more evenly distributed (ideally not ranging all the way from 1, 2, 3 up to 12541...). Then I can set a better fixed window over the event sequences, just like we do for hdfs, to generate the sequential vectors and fit the lstm model. :)

Hi, what window_size and step_size did you end up using? I have run into the same problem.
Thanks for your reply!
