Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] hard_limit memory error fix #243

Merged
merged 4 commits into from
Oct 17, 2018

Conversation

edsavage
Copy link
Contributor

@edsavage edsavage commented Oct 5, 2018

Changed the calculation to decrease the byte limit margin to account
for bucket spans greater than 1 day

Changed the calculation to decrease the byte limit margin to account
for bucket spans greater than 1 day
Copy link
Contributor

@tveasey tveasey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch Ed.

As per my comment I think we should avoid the big change in behaviour by having a different policy for buckets <= 1 day and > 1 day.

Also, it would be good to add some test coverage for long bucket lengths in CAnomalyJobLimitTest::testModelledEntityCountForFixedMemoryLimit.

m_ByteLimitMargin = 1.0 - scale * (1.0 - m_ByteLimitMargin);
if (elapsedTime <= core::constants::DAY) {
double scale{1.0 - static_cast<double>(elapsedTime) /
static_cast<double>(core::constants::DAY)};
Copy link
Contributor

@tveasey tveasey Oct 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

I think a cleaner fix is just to upper bound elapsedTime, i.e. something like static_cast<double>(std::min(elapsedTime, x)) / static_cast<double>(core::constants::DAY). This way the rate at which m_ByteLimitMargin is increased is a smooth function of the bucket length. As it is now there is a very big discontinuity when the bucket length is changed from 1 day to 1 day + eps.

I'd recommended choosing a value for x around 2 hours. For longer bucket lengths the common source of extra memory will be weekly periodicity. This takes longer to detect. We wantm_ByteLimitMargin to approach 1 on this time scale for long bucket lengths and this limit would achieves this.

Also, I think the comment could be clarified a bit. We'll aim for m_ByteLimitMargin * target memory. So whilst we do want to increase m_ByteLimitMargin by doing this we're effectively decreasing the safety margin we're applying to the process memory. The function name refers to this safety margin whilst the comment refers to m_ByteLimitMargin.

@droberts195
Copy link
Contributor

jenkins test this please

@droberts195
Copy link
Contributor

The CI failure is this:

!!!FAILURES!!!
Test Results:
Run:  178   Failures: 1   Errors: 0


1) test: CAnomalyJobLimitTest::testModelledEntityCountForFixedMemoryLimit (F) line: 510 CAnomalyJobLimitTest.cc
assertion failed
- Expression: used.s_OverFields > testParam.s_ExpectedOverFields && used.s_OverFields < 7000

Since it relates to memory usage I suspect it is a result of this change rather than some spurious timing thing.

If it doesn't reproduce on Mac it must be that we're on the borderline of what the test is asserting and the slight differences in object sizes between platforms are making it work on Mac/Windows but not Linux. If you want to debug it on a Linux box you can use marple.

Copy link
Contributor

@tveasey tveasey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter value looks off to me. Also, just a question mark around test time.

@@ -24,6 +24,7 @@ namespace model {
const core_t::TTime CResourceMonitor::MINIMUM_PRUNE_FREQUENCY(60 * 60);
const std::size_t CResourceMonitor::DEFAULT_MEMORY_LIMIT_MB(4096);
const double CResourceMonitor::DEFAULT_BYTE_LIMIT_MARGIN(0.7);
const core_t::TTime CResourceMonitor::MAXIMUM_BYTE_LIMIT_MARGIN_PERIOD(172800); // 2 hours
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is 2 days. I think this should actually be 2 hours no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, maybe use core/Constants.h HOUR variable.

{3600, 550, 5800, 300, 30, 25, 20},
{7200, 550, 5000, 300, 30, 26, 10},
{172800, 150, 850, 120, 6, 6, 3},
{604800, 150, 850, 120, 6, 6, 3}};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, but I wonder how long this unit test runs for now? This was already pretty long, it may be we want to cut the cases, or only optionally run some of them, if the total runtime is very high.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've trimmed a few of the cases - runtime is now ~35s (was originally ~30s).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good enough to me.

Copy link
Contributor

@tveasey tveasey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@edsavage edsavage merged commit 8581185 into elastic:master Oct 17, 2018
edsavage added a commit to edsavage/ml-cpp that referenced this pull request Oct 17, 2018
Changed the calculation to decrease the byte limit margin to take into account
bucket spans greater than 1 day

Backports elastic#243
edsavage added a commit that referenced this pull request Oct 17, 2018
Changed the calculation to decrease the byte limit margin to take into account
bucket spans greater than 1 day

Backports #243
edsavage added a commit to edsavage/ml-cpp that referenced this pull request Oct 17, 2018
Changed the calculation to decrease the byte limit margin to take into account
bucket spans greater than 1 day

Backports elastic#243
edsavage added a commit that referenced this pull request Oct 17, 2018
Changed the calculation to decrease the byte limit margin to take into account
bucket spans greater than 1 day

Backports #243
@edsavage edsavage deleted the hard_limit_error_inv branch October 17, 2018 15:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants