[ML] hard_limit memory error fix #243
Conversation
Changed the calculation to decrease the byte limit margin to account for bucket spans greater than 1 day
Good catch, Ed.
As per my comment, I think we should avoid the big change in behaviour by having a different policy for buckets <= 1 day and > 1 day.
Also, it would be good to add some test coverage for long bucket lengths in `CAnomalyJobLimitTest::testModelledEntityCountForFixedMemoryLimit`.
lib/model/CResourceMonitor.cc
Outdated
```cpp
if (elapsedTime <= core::constants::DAY) {
    double scale{1.0 - static_cast<double>(elapsedTime) /
                           static_cast<double>(core::constants::DAY)};
    m_ByteLimitMargin = 1.0 - scale * (1.0 - m_ByteLimitMargin);
```
Good catch!
I think a cleaner fix is just to upper bound `elapsedTime`, i.e. something like `static_cast<double>(std::min(elapsedTime, x)) / static_cast<double>(core::constants::DAY)`. This way the rate at which `m_ByteLimitMargin` is increased is a smooth function of the bucket length. As it is now, there is a very big discontinuity when the bucket length is changed from 1 day to 1 day + eps.
I'd recommend choosing a value for `x` around 2 hours. For longer bucket lengths the common source of extra memory will be weekly periodicity, which takes longer to detect. We want `m_ByteLimitMargin` to approach 1 on this time scale for long bucket lengths, and this limit would achieve that.
Also, I think the comment could be clarified a bit. We aim for `m_ByteLimitMargin` * target memory. So whilst we do want to increase `m_ByteLimitMargin`, by doing this we're effectively decreasing the safety margin we're applying to the process memory. The function name refers to this safety margin whilst the comment refers to `m_ByteLimitMargin`.
jenkins test this please
The CI failure is this:
Since it relates to memory usage I suspect it is a result of this change rather than some spurious timing thing. If it doesn't reproduce on Mac it must be that we're on the borderline of what the test is asserting, and the slight differences in object sizes between platforms are making it work on Mac/Windows but not Linux. If you want to debug it on a Linux box you can use
The parameter value looks off to me. Also, I have a slight question mark around the test runtime.
lib/model/CResourceMonitor.cc
Outdated
```diff
@@ -24,6 +24,7 @@ namespace model {
 const core_t::TTime CResourceMonitor::MINIMUM_PRUNE_FREQUENCY(60 * 60);
 const std::size_t CResourceMonitor::DEFAULT_MEMORY_LIMIT_MB(4096);
 const double CResourceMonitor::DEFAULT_BYTE_LIMIT_MARGIN(0.7);
+const core_t::TTime CResourceMonitor::MAXIMUM_BYTE_LIMIT_MARGIN_PERIOD(172800); // 2 hours
```
This is 2 days. I think this should actually be 2 hours, no?
Also, maybe use the `HOUR` constant from core/Constants.h.
```cpp
{3600, 550, 5800, 300, 30, 25, 20},
{7200, 550, 5000, 300, 30, 26, 10},
{172800, 150, 850, 120, 6, 6, 3},
{604800, 150, 850, 120, 6, 6, 3}};
```
This is great, but I wonder how long this unit test runs for now? It was already pretty long; we may want to cut some of the cases, or only run some of them optionally, if the total runtime is very high.
I've trimmed a few of the cases - runtime is now ~35s (was originally ~30s).
That sounds good enough to me.
LGTM
Changed the calculation to decrease the byte limit margin to take into account bucket spans greater than 1 day Backports elastic#243