This update adds group_by support for multinomial, numeric and combiner fields
Description
Added an optional field (group_by) that can be specified for MultinomialField, MultinomialFieldCombiner and NumericalField, which changes the behaviour of OSAS to build the statistical models around mini-groups of data. This enables better statistical modeling.
Related Issue
This PR is based on an internal change request
Motivation and Context
Previously, OSAS had issues modeling and tagging anomalies for under-represented classes. For instance, if you tried to build a model for login anomalies based on username and origin country (MultinomialField), or for average CPU/memory usage based on host (NumericalField), the model would cope poorly with users that have a small number of events compared to the other users. Take a dataset with 99 users that each have 5000 events and one user with only 10 events. Even if all of that user's logins originate from the same country, they will always be tagged as anomalies, because they are under-represented in the overall dataset. With the group_by option, you can simply group the login country by username, and the statistical models become relative to each user, thus modeling anomalies better.
How Has This Been Tested?
This change has been validated by our TH team internally, using real datasets. We checked that the statistical models are correctly built and that the tags are assigned as expected.
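For illustration only, the effect can also be reproduced on the toy dataset from the motivation (99 users with 5000 events each, one user with 10 events). The sketch below is not OSAS code and not the internal validation: the function name relative_frequency, the THRESHOLD value and the country codes are assumptions made for the example. It only shows how the relative frequency of a value changes when it is computed per user instead of over the whole dataset.

```python
from collections import Counter

THRESHOLD = 0.01  # hypothetical cut-off: values rarer than this count as anomalous


def relative_frequency(events, group_by_user=False):
    """Map each distinct (user, country) pair to the relative frequency of its country.

    Ungrouped: frequency of the country in the whole dataset.
    Grouped: frequency of the country among that user's events only.
    """
    value_counts = Counter()
    group_totals = Counter()
    for user, country in events:
        group = user if group_by_user else None
        value_counts[(group, country)] += 1
        group_totals[group] += 1

    freqs = {}
    for user, country in set(events):
        group = user if group_by_user else None
        freqs[(user, country)] = value_counts[(group, country)] / group_totals[group]
    return freqs


# Toy dataset: 99 users with 5000 events each, plus one user with 10 events.
# The country codes are placeholders chosen for the example.
events = [(f"user{i}", "US") for i in range(99) for _ in range(5000)]
events += [("rare_user", "RO")] * 10

global_freq = relative_frequency(events)
grouped_freq = relative_frequency(events, group_by_user=True)

rare = ("rare_user", "RO")
print(global_freq[rare] < THRESHOLD)   # True: flagged when modeled globally
print(grouped_freq[rare] < THRESHOLD)  # False: not flagged when grouped by user
```

Without grouping, the rare user's country accounts for 10 out of 495010 events and falls below any reasonable frequency threshold. Grouped by username, the same country makes up all of that user's events, so it is no longer rare.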
Screenshots (if appropriate):
Types of changes
Checklist: