
Question about feature extraction on bgl dataset #10

Open
cherishwsx opened this issue Jun 12, 2020 · 3 comments

Comments

@cherishwsx

Hi, it's me again. :)

I'm trying to run the Deeplog model on the bgl dataset. So far, I have been able to follow the logic and generate the event sequences from the structured bgl log using the sample_bgl.py you provided (many thanks!!).

It basically slides a 30-min window with a 12-min step size over the structured bgl log. As a result, we end up with event sequences that contain either a huge number of events (e.g. I found one sequence with 12514 events in it...) or only one or zero events (since no events happened during that window's time period).
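For context, this is roughly how I understand the windowing step (just a sketch of my reading of sample_bgl.py, not the script itself; the column names and the CSV path are my assumptions):

```python
import pandas as pd

WINDOW_SIZE = 30 * 60  # 30-minute window, in seconds
STEP_SIZE = 12 * 60    # 12-minute step size, in seconds

def time_sliding_windows(df, time_col="Timestamp", event_col="EventId"):
    """Slide a fixed time window over the structured bgl log and collect the
    event IDs that fall inside each window position."""
    start, end = df[time_col].min(), df[time_col].max()
    sequences = []
    while start <= end:
        in_window = (df[time_col] >= start) & (df[time_col] < start + WINDOW_SIZE)
        sequences.append(df.loc[in_window, event_col].tolist())
        start += STEP_SIZE
    return sequences

# structured = pd.read_csv("BGL.log_structured.csv")  # path is just an example
# sequences = time_sliding_windows(structured)
```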

After generating the event sequences, I deleted the empty ones and ended up with a file of 65 non-empty event sequences. I then randomly picked 60 of them as training sequences, and the remaining 5 will be validation data.

This is where my questions come in.

  1. When generating the sequential features for the training dataset, should I do the same thing as for the hdfs dataset, i.e. slide a window of size 10 (or some other size) over each event sequence, with the event immediately following the current window serving as that window's label? If so, how should I deal with event sequences that contain only 1 event?

  2. I also remember you mentioned in another post that, for the bgl dataset, the event sequences can be used directly as sequential vectors since they are already generated with a sliding window. In that case, my understanding is that each event sequence (minus its last event) directly becomes a sequential vector, and the label for that vector is the last event in the sequence? Then again, what about event sequences with only 1 event?

Looking forward to your valuable feedback!! And thank you for answering all of my questions!!!

@d0ng1ee
Owner

d0ng1ee commented Jun 12, 2020

If you use machine learning methods, you can directly use the sequences obtained from the sliding window to extract features. I have tried the bgl dataset with https://github.com/logpai/loglizer, setting window_size=1h and step_size=0.5h, which gave a better result.
I have not continued experimenting with the bgl dataset on the lstm model :(
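Roughly, what I mean by extracting features directly from the windowed sequences is an event count vector per window, something like this sketch (just the idea, not loglizer's actual API):

```python
from collections import Counter

def event_count_matrix(sequences, event_vocab):
    """Turn each windowed event sequence into a count vector over the event
    templates; these vectors can feed a classic ML model (LR / SVM / ...)."""
    matrix = []
    for seq in sequences:
        counts = Counter(seq)
        matrix.append([counts.get(event, 0) for event in event_vocab])
    return matrix

# event_count_matrix([["E1", "E3", "E1"], ["E2"]], ["E1", "E2", "E3"])
# -> [[2, 0, 1], [0, 1, 0]]
```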

The lstm method, however, requires a consistent input length, so for unsupervised learning you need to set a fixed window over each sequence, just like for hdfs.
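In other words, something like this sketch of a fixed window over each event sequence (window size 10 here is just an example, same idea as the hdfs preprocessing):

```python
def make_samples(sequence, window_size=10):
    """Slide a fixed-size window over one event sequence; the event right
    after each window is that window's label, as in the hdfs preprocessing."""
    samples = []
    for i in range(len(sequence) - window_size):
        window = sequence[i:i + window_size]
        label = sequence[i + window_size]
        samples.append((window, label))
    return samples

# A sequence of 12 events gives 2 (window, label) pairs with window_size=10:
# make_samples(list(range(1, 13))) -> [([1..10], 11), ([2..11], 12)]
```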

I think event sequences with only 1 event can be ignored as noise during training (i.e. not included in training) and simply padded during testing. Of course, this is just my rough understanding...
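By padding at test time I mean something like this (just a sketch; using 0 as the padding token is an arbitrary choice):

```python
def pad_for_testing(sequence, window_size=10, pad_token=0):
    """Left-pad a short sequence so it is long enough to form at least one
    fixed-size window plus a label during testing."""
    needed = window_size + 1 - len(sequence)
    if needed > 0:
        sequence = [pad_token] * needed + sequence
    return sequence

# pad_for_testing([7], window_size=10) -> [0]*10 + [7]
```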

@cherishwsx
Author

Thank you for the reply!

I think I will try tuning window_size and step_size when generating the event sequences for the bgl data so that the sequence lengths are more evenly distributed (ideally not ranging all the way from 1, 2, 3 up to 12541...). Then I can set a better fixed window over the event sequences, just like we do for hdfs, to generate the sequential vectors and fit the lstm model. :)

@wuqinglaojieshen

> Thank you for the reply!
>
> I think I will try tuning window_size and step_size when generating the event sequences for the bgl data so that the sequence lengths are more evenly distributed (ideally not ranging all the way from 1, 2, 3 up to 12541...). Then I can set a better fixed window over the event sequences, just like we do for hdfs, to generate the sequential vectors and fit the lstm model. :)

Hi, what window_size and step_size did you end up using? I have run into the same problem.
Thanks for your reply!
