[ML] Improve adaption of the modelling of cyclic components to very localised features #134
Conversation
…based on large error statistics
```cpp
CAdaptiveBucketing::CAdaptiveBucketing(double decayRate, double minimumBucketLength)
    : m_DecayRate{std::max(decayRate, MINIMUM_DECAY_RATE)},
      m_MinimumBucketLength{minimumBucketLength} {
}
```
some of the members are not initialized, is that ok?
I think they are, no? The only non-class types are the literals `m_TargetSize`, `m_LastLargeErrorBucket` and `m_LastLargeErrorPeriod`, which all have default member initialisers. Everything else will be default constructed.
```cpp
                                CTools::safeCdfComplement(binomial, m_LargeErrorCounts[i - 1])};
        m_LargeErrorCountSignificances.add({oneMinusCdf, i - 1});
    } catch (const std::exception& e) {
        LOG_ERROR(<< "Failed to calculate splitting significance: " << e.what());
```
in case we hit this, does it make sense to continue the for-loop?
It's somewhat moot, i.e. this really shouldn't happen, but I think it is ok to continue: we'll find the highest significance for which there was no error.
```diff
         count += CBasicStatistics::count(value);
     }
-    count /= (endpoints[m] - endpoints[0]);
+    count /= (oldEndpoints[m] - oldEndpoints[0]);
```
any possibility this can become 0?
In practice, no: this is the full period being modelled, which in this case is always 1 day in seconds. We also rely on the fact that we force the individual endpoints to be at least the data bucketing interval apart.
Arguably, we could make all this code safe, i.e. if things went wrong we'd end up in 0/0 situations which we'd want to treat as one. On balance, though, I'd prefer to keep the code simpler and simply assume this invariant holds.
LGTM
…ocalised features (elastic#134)
This change primarily targets fixing an edge case where we can fail to learn periodic spikes and so generate repeated anomalies for predictable behaviour. Whilst investigating the resulting changes in the unit tests I made two further minor enhancements: 1) an improvement to the trend component confidence interval calculation for forecasting, and 2) an improvement to avoid a source of undesirable step changes in our predictions. I decided not to factor these out into separate commits because they are small and in a related area. Finally, this change also pushes up common interface from the derived classes to `maths::CAdaptiveBucketing`, since there was enough that I think it makes sense to inherit publicly from this class instead.

By way of a little extra background on the motivating problem and fix: we currently use a family of regression models, parameterised by offset in the period, to model cyclic components (calendar and seasonal). We optimise the "placement" of these models by minimising an estimate of the mean error introduced by interpolating them. Over time this learns an effective placement based on significant features, such as spikes, steps, etc. However, at short bucket lengths there is an edge case due to the interaction with the approach we use to robustify our predictions: we can effectively miss important features and as a result generate repeated false positives. For example:
![screen shot 2018-06-28 at 15 33 34](https://user-images.githubusercontent.com/7591487/42042697-ac1eb088-7aec-11e8-971b-a6cee1e2a668.png)
The main part of this change is to maintain additional information to detect this problem (by looking for a statistically significant non-uniform distribution of large errors in the period) and to add additional models where necessary. With this change, the result on the same data set is:
![screen shot 2018-06-28 at 15 32 56](https://user-images.githubusercontent.com/7591487/42042723-b6ac8f16-7aec-11e8-998f-ad63cb069a83.png)