Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Anomaly detection for multiple bucket features #175

Merged
merged 40 commits into from
Aug 17, 2018

Conversation

tveasey
Copy link
Contributor

@tveasey tveasey commented Jul 30, 2018

Currently, we perform anomaly detection only on a single bucket feature, typically derived from some aggregation applied to all the documents which fall in that bucket's time interval. However, In many cases interesting events span multiple buckets.

The most important case is where any given bucket is not especially unusual, given the variation w.r.t. our predictions we've historically seen, but a collection of contiguous buckets are jointly unusual.

Such cases can be detected by modelling and detecting unusual events w.r.t. collective properties of a number of contiguous buckets, i.e. features derived from multiple buckets; this PR uses the cumulative prediction error in a sliding window ending at the current bucket. This has proved effective for picking up the sort of events we want to detect, i.e. extended periods where are model is systematically in error.

This constitutes a reasonably significant change to results: increasing the relative importance we'll report for longer lasting anomalies. However, it is giving consistently better area under the P-R curve against our internal test suit. In addition, we pay some extra memory costs (both to compute the feature and for the extra model). This has been more than recouped in other preparatory changes, i.e. #127, #146, etc.

…ure anomalousness for anomaly modelling using the base model probabilities
@tveasey tveasey added the v6.5.0 label Jul 30, 2018
@@ -118,7 +117,7 @@ bool CCmdLineParser::parse(int argc,
("multivariateByFields",
"Optional flag to enable multi-variate analysis of correlated by fields")
("multipleBucketspans", boost::program_options::value<std::string>(),
"Optional comma-separated list of additional bucketspans - must be direct multiples of the main bucketspan")
"Deprecated - ignored")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the reason to keep it? Are there clients using this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to completely remove this, although it does then commit us to removing the corresponding options on the Java side. @dimitris-athanasiou said he would do this. The functionality to silently drop any settings related to this functionality (which was never documented and not fully tested) will then live in the Java code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to drop everything from the c++ side. I'll prepare the java side and we will merge them both in 6.5.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I raised elastic/elasticsearch#32496. @tveasey, this means you can completely remove the multipleBucketspans param in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok great. I'll tidy.

#include <boost/bind.hpp>
#include <boost/circular_buffer.hpp>
#include <boost/iterator/counting_iterator.hpp>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does not seem to be used in this place

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was left over from earlier versions, I'll tidy.


if (m_Correlations != nullptr) {
m_Correlations->addSamples(m_Id, params, samples, multiplier);
}

if (randomSample) {
m_SlidingWindow.push_back({randomSample->first, randomSample->second});
m_RecentSamples.push_back({randomSample->first, randomSample->second});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be an emplace_back?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a boost circular buffer which doesn't support emplace_back unfortunately.

@hendrikmuhs
Copy link
Contributor

hendrikmuhs commented Jul 31, 2018

I had a first scan over it, looks good.

One remark with respect to naming: I find multibucket more descriptive than bulk. I think it would be easier to read if the code would also use multibucket.

//! Univariate implementation returns zero.
template<typename T>
static double conformable(const T& /*x*/, double value) {
return value;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says this should be returning 0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, comment is out of date.

@tveasey
Copy link
Contributor Author

tveasey commented Aug 8, 2018

@hendrikmuhs and @dimitris-athanasiou I addressed all your review comments. I also refined aggregation to take account of correlation between the multi-bucket and bucket features, in this commit. Can you take another look please?

//! time series from a user's perspective.
class CTimeSeriesMultibucketFeatures {
public:
//! The geometric weight applied the window.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "... applied to the window" ?

@@ -162,6 +162,10 @@ class MODEL_EXPORT CAnomalyDetectorModelConfig {
//! The default maximum time to test for a change point in a time series.
static const core_t::TTime DEFAULT_MAXIMUM_TIME_TO_TEST_FOR_CHANGE;

//! The default number of time buckets used to generate multibucket features
//! for anomaly detection.
static const std::size_t MULTIBUCKET_FEATURE_WINDOW_LENGTH;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: MULTIBUCKET_FEATURE_WINDOW_LENGTH in other places it is MULTIBUCKET_FEATURES_WINDOW_LENGTH, feature - > features

const std::string MEAN_ERROR_TAG{"a"};
const std::string ANOMALIES_TAG{"b"};
const std::string ANOMALY_FEATURE_MODEL_TAG{"d"};
// Version 6.4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

up to 6.5?

//! anomaly detection. Specifically, unusual values of certain properties
//! of extended time ranges are often the most interesting events in a
//! time series from a user's perspective.
class CTimeSeriesMultibucketFeatures {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: final?

void CTimeSeriesMultibucketFeaturesTest::testMean() {
// Test we get the value and weight we expect.

LOG_DEBUG(<< "Univariate");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why not make 2 tests?

class CTimeSeriesMultibucketFeaturesTest : public CppUnit::TestFixture {
public:
void testMean();
void testContrast();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not used

@hendrikmuhs
Copy link
Contributor

I had another pass, only nitpicks except the version tags which I think should be bumped to 6.5.

Copy link
Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (Let's merge this in!)

@tveasey tveasey force-pushed the feature/multiple-bucket-detection branch from a35419c to 20df95c Compare August 17, 2018 09:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants