[ML] hard_limit memory error fix #243
Conversation
Changed the calculation to decrease the byte limit margin to account for bucket spans greater than 1 day
Good catch, Ed.
As per my comment, I think we should avoid the big change in behaviour by having a different policy for buckets <= 1 day and > 1 day.
Also, it would be good to add some test coverage for long bucket lengths in `CAnomalyJobLimitTest::testModelledEntityCountForFixedMemoryLimit`.
lib/model/CResourceMonitor.cc
Outdated
```cpp
if (elapsedTime <= core::constants::DAY) {
    double scale{1.0 - static_cast<double>(elapsedTime) /
                           static_cast<double>(core::constants::DAY)};
    m_ByteLimitMargin = 1.0 - scale * (1.0 - m_ByteLimitMargin);
```
Good catch!
I think a cleaner fix is just to upper bound `elapsedTime`, i.e. something like `static_cast<double>(std::min(elapsedTime, x)) / static_cast<double>(core::constants::DAY)`. This way the rate at which `m_ByteLimitMargin` is increased is a smooth function of the bucket length. As it is now, there is a very big discontinuity when the bucket length is changed from 1 day to 1 day + eps.
I'd recommend choosing a value for `x` around 2 hours. For longer bucket lengths the common source of extra memory will be weekly periodicity, which takes longer to detect. We want `m_ByteLimitMargin` to approach 1 on this time scale for long bucket lengths, and this limit would achieve that.
Also, I think the comment could be clarified a bit. We aim for `m_ByteLimitMargin` * target memory. So whilst we do want to increase `m_ByteLimitMargin`, by doing this we're effectively decreasing the safety margin we're applying to the process memory. The function name refers to this safety margin whilst the comment refers to `m_ByteLimitMargin`.
jenkins test this please
The CI failure is this:
Since it relates to memory usage I suspect it is a result of this change rather than some spurious timing thing. If it doesn't reproduce on Mac it must be that we're on the borderline of what the test is asserting, and the slight differences in object sizes between platforms are making it work on Mac/Windows but not Linux. If you want to debug it on a Linux box you can use
The parameter value looks off to me. Also, I have a slight question mark around the test runtime.
lib/model/CResourceMonitor.cc
Outdated
```diff
@@ -24,6 +24,7 @@ namespace model {
 const core_t::TTime CResourceMonitor::MINIMUM_PRUNE_FREQUENCY(60 * 60);
 const std::size_t CResourceMonitor::DEFAULT_MEMORY_LIMIT_MB(4096);
 const double CResourceMonitor::DEFAULT_BYTE_LIMIT_MARGIN(0.7);
+const core_t::TTime CResourceMonitor::MAXIMUM_BYTE_LIMIT_MARGIN_PERIOD(172800); // 2 hours
```
This is 2 days. I think this should actually be 2 hours, no?
Also, maybe use the `HOUR` constant from core/Constants.h.
```cpp
{3600, 550, 5800, 300, 30, 25, 20},
{7200, 550, 5000, 300, 30, 26, 10},
{172800, 150, 850, 120, 6, 6, 3},
{604800, 150, 850, 120, 6, 6, 3}};
```
This is great, but I wonder how long this unit test runs for now? It was already pretty long; we may want to cut some of the cases, or only run some of them optionally, if the total runtime is very high.
I've trimmed a few of the cases - runtime is now ~35s (was originally ~30s).
That sounds good enough to me.
LGTM
Changed the calculation to decrease the byte limit margin to take into account bucket spans greater than 1 day Backports elastic#243