
[ML] Improve adaptation of the modelling of cyclic components to very localised features #134

Conversation

tveasey (Contributor) commented Jun 28, 2018

This change primarily targets fixing an edge case where we can fail to learn periodic spikes and so generate repeated anomalies for predictable behaviour. Whilst investigating the resulting changes in the unit tests I made two further minor enhancements: 1) an improvement to the trend component confidence interval calculation for forecasting, and 2) an improvement to avoid a source of undesirable step changes in our predictions. I decided not to factor these out into separate commits because they are small and in a related area. Finally, this change also pushes up the common interface from the derived classes to maths::CAdaptiveBucketing, since there is now enough shared interface that I think it makes sense to inherit publicly from this class.

By way of a little extra background on the motivating problem and fix, we currently use a family of regression models, parameterised by offset in the period, to model cyclic components (calendar and seasonal). We optimise the "placement" of these models by minimising an estimate of the mean error introduced by interpolating them. Over time this learns an effective placement based on significant features, such as spikes, steps, etc. However, at short bucket lengths there is an edge case due to the interaction with the approach we use to robustify our predictions. We can effectively miss important features and as a result generate repeated false positives. For example:
[Screenshot, 2018-06-28: repeated anomalies generated for a predictable periodic spike]
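By way of illustration of the placement objective described above, here is a minimal sketch. It is a hypothetical simplification, not the actual ml-cpp code: the function name `placementError` and the piecewise-constant bucket model are assumptions for illustration. Each bucket is summarised by its mean value, and a candidate set of endpoints is scored by the mean absolute error this summary introduces; endpoints which isolate a spike score lower.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch: model each bucket by its mean value and score a
// candidate set of endpoints by the mean absolute error this introduces.
// Moving an endpoint to isolate a spike lowers the score, which is how
// the placement comes to reflect significant features over time.
double placementError(const std::vector<double>& values,
                      const std::vector<std::size_t>& endpoints) {
    double error = 0.0;
    for (std::size_t b = 0; b + 1 < endpoints.size(); ++b) {
        // Mean of the values falling in bucket [endpoints[b], endpoints[b+1]).
        double mean = 0.0;
        std::size_t n = endpoints[b + 1] - endpoints[b];
        for (std::size_t i = endpoints[b]; i < endpoints[b + 1]; ++i) {
            mean += values[i];
        }
        mean /= static_cast<double>(n);
        // Accumulate the absolute error of predicting each value by the mean.
        for (std::size_t i = endpoints[b]; i < endpoints[b + 1]; ++i) {
            error += std::fabs(values[i] - mean);
        }
    }
    return error / static_cast<double>(values.size());
}
```

For a signal with a two-point spike, endpoints aligned to the spike give zero error while misaligned endpoints smear the spike into neighbouring buckets, which is the failure mode this change targets at short bucket lengths.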

The main part of this change is to maintain additional information to detect this problem (by looking for a statistically significant non-uniform distribution of large errors in the period) and to add additional models where necessary. With this change, the result on the same data set is:
[Screenshot, 2018-06-28: the same data set with the periodic spikes correctly modelled and no repeated anomalies]
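The detection step can be sketched as follows. This is a hypothetical simplification (the names `binomialTailApprox` and `significantBuckets` are assumptions, and it uses a normal approximation to the binomial tail, whereas the real code evaluates the binomial distribution via CTools::safeCdfComplement): if large errors were spread uniformly over the period, each bucket's large-error count would be binomial, so a bucket whose count has a very small upper tail probability indicates a localised feature the current placement is missing.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch: P(X >= k) for X ~ Binomial(n, p), using a normal
// approximation with continuity correction (the real code uses the exact
// binomial distribution via CTools::safeCdfComplement).
double binomialTailApprox(double n, double p, double k) {
    double mean = n * p;
    double sd = std::sqrt(n * p * (1.0 - p));
    double z = (k - 0.5 - mean) / sd;
    return 0.5 * std::erfc(z / std::sqrt(2.0));
}

// Returns the indices of buckets whose large-error counts are significantly
// higher than a uniform spread of large errors over the period would imply.
std::vector<std::size_t> significantBuckets(const std::vector<double>& largeErrorCounts,
                                            double significance) {
    std::vector<std::size_t> result;
    if (largeErrorCounts.size() < 2) {
        return result;
    }
    double total = 0.0;
    for (double count : largeErrorCounts) {
        total += count;
    }
    if (total == 0.0) {
        return result;
    }
    // Under the uniform hypothesis each bucket receives a 1/m share.
    double p = 1.0 / static_cast<double>(largeErrorCounts.size());
    for (std::size_t i = 0; i < largeErrorCounts.size(); ++i) {
        if (binomialTailApprox(total, p, largeErrorCounts[i]) < significance) {
            result.push_back(i);
        }
    }
    return result;
}
```

A bucket flagged this way is where an additional model would be added under this change.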


CAdaptiveBucketing::CAdaptiveBucketing(double decayRate, double minimumBucketLength)
: m_DecayRate{std::max(decayRate, MINIMUM_DECAY_RATE)}, m_MinimumBucketLength{minimumBucketLength} {
Contributor: some of the members are not initialized, is that ok?

tveasey (Author) replied Jun 29, 2018: I think they are, no? The only non-class types are the literals m_TargetSize, m_LastLargeErrorBucket and m_LastLargeErrorPeriod, which all have default member initialisers. Everything else will be default constructed.
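The point about default member initialisers can be illustrated with a minimal sketch. This is a hypothetical class (`CBucketingExample` and its members are assumptions for illustration, not the real CAdaptiveBucketing): scalar members carry default member initialisers, so a constructor whose initialiser list only mentions some members still leaves every member well-defined, and class-type members are default constructed.

```cpp
#include <cstddef>

// Hypothetical illustration, not the real CAdaptiveBucketing: scalar
// members carry default member initialisers, so a constructor which only
// initialises some members still leaves every member well-defined.
class CBucketingExample {
public:
    CBucketingExample(double decayRate, double minimumBucketLength)
        : m_DecayRate{decayRate}, m_MinimumBucketLength{minimumBucketLength} {}

    std::size_t targetSize() const { return m_TargetSize; }
    int lastLargeErrorBucket() const { return m_LastLargeErrorBucket; }

private:
    double m_DecayRate{0.0};
    double m_MinimumBucketLength{0.0};
    // Default member initialisers: these are never left uninitialised even
    // though the constructor's initialiser list doesn't mention them.
    std::size_t m_TargetSize = 0;
    int m_LastLargeErrorBucket = -1;
};
```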

CTools::safeCdfComplement(binomial, m_LargeErrorCounts[i - 1])};
m_LargeErrorCountSignificances.add({oneMinusCdf, i - 1});
} catch (const std::exception& e) {
LOG_ERROR(<< "Failed to calculate splitting significance: " << e.what());
Contributor: in case we hit this, does it make sense to continue the for-loop?

tveasey (Author): It's somewhat moot, i.e. this really shouldn't happen, but I think it is ok to continue: we'll find the highest significance for which there was no error.

count += CBasicStatistics::count(value);
}
count /= (endpoints[m] - endpoints[0]);
count /= (oldEndpoints[m] - oldEndpoints[0]);
Contributor: any possibility this can become 0?

tveasey (Author): In practice no, this is the full period being modelled, which in this case is always 1 day in seconds. We also rely on the fact that we force the individual endpoints to be at least the data bucketing interval apart.

Arguably, we could make all this code safe: if things went wrong we'd end up in 0/0 situations, which we'd want to treat as one. On balance, though, I'd prefer to keep the code simpler and simply assume this invariant holds.
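The invariant being relied on here can be sketched as follows. This is a hypothetical illustration (the name `spreadEndpoints` is an assumption, not the actual ml-cpp code): if consecutive endpoints are forced to be at least the data bucketing interval apart, then the endpoint range is bounded below by (number of buckets) × interval, so normalising by the range can never divide by zero.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the invariant discussed above: force consecutive
// endpoints to be at least `interval` apart. After this,
// endpoints.back() - endpoints.front() >= (endpoints.size() - 1) * interval,
// so dividing by the endpoint range is safe.
void spreadEndpoints(std::vector<double>& endpoints, double interval) {
    for (std::size_t i = 1; i < endpoints.size(); ++i) {
        endpoints[i] = std::max(endpoints[i], endpoints[i - 1] + interval);
    }
}
```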

@hendrikmuhs (Contributor) left a review: LGTM

@tveasey tveasey merged commit e16816e into elastic:master Jun 29, 2018
tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request Jul 2, 2018
tveasey added a commit that referenced this pull request Jul 2, 2018
@tveasey tveasey deleted the enhancement/trend-model-split-merge-cyclic-components branch May 1, 2019 14:57