[ML] Anomaly detection for multiple bucket features #175

tveasey · 2018-07-30T14:23:59Z

Currently, we perform anomaly detection only on a single bucket feature, typically derived from some aggregation applied to all the documents which fall in that bucket's time interval. However, in many cases interesting events span multiple buckets.

The most important case is where no single bucket is especially unusual, given the variation w.r.t. our predictions we've historically seen, but a collection of contiguous buckets is jointly unusual.
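As a rough worked example of why this happens, assume (purely for illustration; this PR does not state it) that the per-bucket prediction errors $e_i$ are independent, zero-mean and have standard deviation $\sigma$. The mean error over $n$ contiguous buckets then has a smaller spread than any single error, so a run of modest errors with the same sign becomes very unlikely:

```latex
\bar{e}_n = \frac{1}{n}\sum_{i=1}^{n} e_i, \qquad
\operatorname{sd}(\bar{e}_n) = \frac{\sigma}{\sqrt{n}}, \qquad
z_n = \frac{\bar{e}_n}{\sigma/\sqrt{n}} = 1.5\sqrt{n}
\quad \text{when every } e_i = 1.5\,\sigma .
```

With $n = 9$ this gives $z_9 = 4.5$: each bucket is only 1.5 standard deviations from the prediction, but the window as a whole is 4.5 standard deviations from what independent errors would produce.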

Such cases can be detected by modelling collective properties of a number of contiguous buckets, i.e. features derived from multiple buckets, and detecting unusual values of those features; this PR uses the cumulative prediction error in a sliding window ending at the current bucket. This has proved effective for picking up the sort of events we want to detect, i.e. extended periods where our model is systematically in error.
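A minimal sketch of such a sliding window feature, to make the idea concrete. This is not the code in this PR: the class name `SlidingWindowErrorSketch`, its methods, the geometric `decay` weighting and the independence assumption behind `zScore` are all hypothetical choices made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <deque>

// Hypothetical illustration only; not the ml-cpp implementation.
// Keeps the prediction errors of the most recent `length` buckets and
// summarises them with a geometrically weighted mean.
class SlidingWindowErrorSketch {
public:
    SlidingWindowErrorSketch(std::size_t length, double decay)
        : m_Length{length}, m_Decay{decay} {}

    // Add the prediction error (actual - predicted) of the latest bucket,
    // dropping the oldest error once the window is full.
    void add(double error) {
        m_Errors.push_front(error);
        if (m_Errors.size() > m_Length) {
            m_Errors.pop_back();
        }
    }

    // Weighted mean error: the latest bucket has weight 1, the one before
    // it has weight decay, then decay^2 and so on.
    double mean() const {
        double weight{1.0};
        double sum{0.0};
        double norm{0.0};
        for (double error : m_Errors) {
            sum += weight * error;
            norm += weight;
            weight *= m_Decay;
        }
        return norm > 0.0 ? sum / norm : 0.0;
    }

    // Z-score of the weighted mean, ASSUMING independent zero-mean errors
    // with standard deviation sigma, for which
    // sd(mean) = sigma * sqrt(sum w^2) / sum w.
    double zScore(double sigma) const {
        double weight{1.0};
        double norm{0.0};
        double norm2{0.0};
        for (std::size_t i = 0; i < m_Errors.size(); ++i) {
            norm += weight;
            norm2 += weight * weight;
            weight *= m_Decay;
        }
        if (norm == 0.0 || sigma <= 0.0) {
            return 0.0;
        }
        return this->mean() * norm / (sigma * std::sqrt(norm2));
    }

private:
    std::size_t m_Length;
    double m_Decay;
    std::deque<double> m_Errors; // front = most recent bucket
};
```

Feeding a window of length 9 with `decay` set to 1.0 nine errors of 1.5 (in units where `sigma` is 1) gives a mean of 1.5 and a z-score of 4.5. A `decay` below 1 would put more emphasis on recent buckets at the cost of a smaller effective window.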

This constitutes a reasonably significant change to results: it increases the relative importance we'll report for longer-lasting anomalies. However, it gives consistently better area under the P-R curve against our internal test suite. In addition, we pay some extra memory costs (both to compute the feature and for the extra model). These have been more than recouped by other preparatory changes, e.g. #127 and #146.


hendrikmuhs · 2018-07-31T07:49:28Z

bin/autodetect/CCmdLineParser.cc

@@ -118,7 +117,7 @@ bool CCmdLineParser::parse(int argc,
            ("multivariateByFields",
                        "Optional flag to enable multi-variate analysis of correlated by fields")
            ("multipleBucketspans",  boost::program_options::value<std::string>(),
-                        "Optional comma-separated list of additional bucketspans - must be direct multiples of the main bucketspan")
+                        "Deprecated - ignored")


what's the reason to keep it? Are there clients using this?

I'm happy to completely remove this, although it does then commit us to removing the corresponding options on the Java side. @dimitris-athanasiou said he would do this. The functionality to silently drop any settings related to this functionality (which was never documented and not fully tested) will then live in the Java code.

I think it's fine to drop everything from the c++ side. I'll prepare the java side and we will merge them both in 6.5.

I raised elastic/elasticsearch#32496. @tveasey, this means you can completely remove the multipleBucketspans param in this PR.

Ok great. I'll tidy.

hendrikmuhs · 2018-07-31T08:00:21Z

include/maths/CTimeSeriesBulkFeatures.h

+#include <boost/bind.hpp>
+#include <boost/circular_buffer.hpp>
+#include <boost/iterator/counting_iterator.hpp>
+


does not seem to be used in this place

I think this was left over from earlier versions, I'll tidy.

hendrikmuhs · 2018-07-31T08:18:42Z

lib/maths/CTimeSeriesModel.cc


    if (m_Correlations != nullptr) {
        m_Correlations->addSamples(m_Id, params, samples, multiplier);
    }

    if (randomSample) {
-        m_SlidingWindow.push_back({randomSample->first, randomSample->second});
+        m_RecentSamples.push_back({randomSample->first, randomSample->second});


could be an emplace_back?

This is a boost circular buffer which doesn't support emplace_back unfortunately.

hendrikmuhs · 2018-07-31T08:50:05Z

I had a first scan over it, looks good.

One remark with respect to naming: I find multibucket more descriptive than bulk. I think it would be easier to read if the code also used multibucket.

dimitris-athanasiou · 2018-07-31T13:16:08Z

include/maths/CTimeSeriesBulkFeatures.h

+    //! Univariate implementation returns zero.
+    template<typename T>
+    static double conformable(const T& /*x*/, double value) {
+        return value;


The comment says this should be returning 0.

Indeed, comment is out of date.


tveasey · 2018-08-08T16:41:10Z

@hendrikmuhs and @dimitris-athanasiou I addressed all your review comments. I also refined aggregation to take account of correlation between the multi-bucket and bucket features, in this commit. Can you take another look please?

hendrikmuhs · 2018-08-09T06:58:17Z

include/maths/CTimeSeriesMultibucketFeatures.h

+//! time series from a user's perspective.
+class CTimeSeriesMultibucketFeatures {
+public:
+    //! The geometric weight applied the window.


nit: "... applied to the window" ?

hendrikmuhs · 2018-08-09T07:11:25Z

include/model/CAnomalyDetectorModelConfig.h

@@ -162,6 +162,10 @@ class MODEL_EXPORT CAnomalyDetectorModelConfig {
    //! The default maximum time to test for a change point in a time series.
    static const core_t::TTime DEFAULT_MAXIMUM_TIME_TO_TEST_FOR_CHANGE;

+    //! The default number of time buckets used to generate multibucket features
+    //! for anomaly detection.
+    static const std::size_t MULTIBUCKET_FEATURE_WINDOW_LENGTH;


nit: this is MULTIBUCKET_FEATURE_WINDOW_LENGTH here, but in other places it is MULTIBUCKET_FEATURES_WINDOW_LENGTH (feature -> features)

hendrikmuhs · 2018-08-09T07:16:34Z

lib/maths/CTimeSeriesModel.cc

-const std::string MEAN_ERROR_TAG{"a"};
-const std::string ANOMALIES_TAG{"b"};
-const std::string ANOMALY_FEATURE_MODEL_TAG{"d"};
+// Version 6.4


hendrikmuhs · 2018-08-09T07:26:58Z

include/maths/CTimeSeriesMultibucketFeatures.h

+//! anomaly detection. Specifically, unusual values of certain properties
+//! of extended time ranges are often the most interesting events in a
+//! time series from a user's perspective.
+class CTimeSeriesMultibucketFeatures {


nit: final?

hendrikmuhs · 2018-08-09T07:57:25Z

lib/maths/unittest/CTimeSeriesMultibucketFeaturesTest.cc

+void CTimeSeriesMultibucketFeaturesTest::testMean() {
+    // Test we get the value and weight we expect.
+
+    LOG_DEBUG(<< "Univariate");


nit: why not make 2 tests?

hendrikmuhs · 2018-08-09T07:58:15Z

lib/maths/unittest/CTimeSeriesMultibucketFeaturesTest.h

+class CTimeSeriesMultibucketFeaturesTest : public CppUnit::TestFixture {
+public:
+    void testMean();
+    void testContrast();


hendrikmuhs · 2018-08-09T08:28:46Z

I had another pass, only nitpicks except the version tags which I think should be bumped to 6.5.

dimitris-athanasiou

LGTM (Let's merge this in!)


tveasey added 30 commits June 18, 2018 10:17

Initial work on features spanning multiple buckets

e9c6cd7

Remove partially implemented multi-bucket data gatherering support: i…

7de5096

…t has been superseded

Given we now have explicit multi-bucket features, it's better to meas…

f7ef031

…ure anomalousness for anomaly modelling using the base model probabilities

Bug fixes and compiler warnings

6412977

Support disabling multibucket feature modelling

874fb1d

More work

2dd2357

Multivariate bulk features

8545b01

Unit test bulk features. Improve weight calculation for contrast. For…

73cac7f

…matting fixes

Towards fixing model tests

a8512fe

Finish up fixing tests + bug fixes

6c2d266

Merge master

88a453b

Merge master

1cae145

We can't upgrade the anomaly model because the features have changed

56f019c

Update test thresholds

37a04b6

Merge branch 'master' into feature/multiple-bucket-detection

b441515

Merge branch 'master' into feature/multiple-bucket-detection

c6210c3

The contrast feature wasn't helping enough in the average case. Also …

9f798d9

…fix some bugs

Bug fix

5dcd1ab

It is a good idea to compute weighted means since outliers otherwise …

3b602ce

…introduce significant noise

Improve function for combining feature probabilities

8f611a6

Towards fixing unit tests

092e803

Update test expected result

d880e4d

Fix linux compilation

93f41b1

Another linux fix

91e49cc

Another linux fix

db8ba71

Formatting fixes

85a2db9

Merge branch 'master' into feature/multiple-bucket-detection

a379e11

Fix unit tests and some bug fixes to correlation models

a2a9532

Fix unit test

af25b54

Tweak to feature probability aggregation

fda494a

tveasey added the v6.5.0 label Jul 30, 2018

hendrikmuhs reviewed Jul 31, 2018


dimitris-athanasiou reviewed Jul 31, 2018


tveasey added 4 commits August 1, 2018 11:25

Review comments and documentation

67b7205

Merge branch 'master' into feature/multiple-bucket-detection

f2b01c1

Support correlation between multi-bucket and bucket feature when aggr…

3937f40

…egating

Formatting fixes

f4962df

hendrikmuhs reviewed Aug 9, 2018


Review comments

c32772a

dimitris-athanasiou approved these changes Aug 16, 2018


Rework multi-bucket features to better encapsulate functionality and …

20df95c

…support additional features in future

tveasey force-pushed the feature/multiple-bucket-detection branch from a35419c to 20df95c Compare August 17, 2018 09:12

Merge branch 'master' into feature/multiple-bucket-detection

878b038

tveasey merged commit 548222b into elastic:master Aug 17, 2018

tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request Aug 17, 2018

[ML] Anomaly detection for multiple bucket features (elastic#175)

df7105d

tveasey mentioned this pull request Aug 17, 2018

[6.5][ML] Anomaly detection for multiple bucket features #185

Merged

tveasey added a commit that referenced this pull request Aug 17, 2018

[6.5][ML] Anomaly detection for multiple bucket features (#185)

1fb820e

Backport of #175.

peteharverson mentioned this pull request Sep 20, 2018

[ML] Enhancements to improve clarity of results for multi bucket features elastic/kibana#23365

Open

7 tasks

peteharverson mentioned this pull request Oct 3, 2018

[ML] Indicate multi-bucket anomalies in results dashboards elastic/kibana#23746

Merged

tveasey deleted the feature/multiple-bucket-detection branch May 1, 2019 14:57
