Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] process can hang if large forecast job fails to delete temporary storage #350

Closed
hendrikmuhs opened this issue Dec 21, 2018 · 13 comments · Fixed by #354
Closed

[ML] process can hang if large forecast job fails to delete temporary storage #350

hendrikmuhs opened this issue Dec 21, 2018 · 13 comments · Fixed by #354

Comments

@hendrikmuhs
Copy link
Contributor

found using integration tests:

if - for unknown reason - forecasting fails to delete tmp storage[*], the worker thread returns, potentially leaving open jobs in the queue, which causes the process to hang on job close.

[*] Note, this is a bug in itself, but this problem should still not lead to the issue above

@javanna
Copy link
Member

javanna commented Dec 21, 2018

Thanks for opening this issue! I am available to dig into the "why" of the initial error when cleaning up the temporary folder. I get "permission denied" at this line which causes ForecastIT#testOverflowToDisk to hang. This is the output I see, nothing more, the test is killed by the suite timeout after 20 minutes:

[2018-12-21T16:49:13,128][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-2] [forecast-it-test-overflow-to-disk] [autodetect/17799] [CForecastRunner.cc@347] Forecast of large model requested (requires 40 MB), using disk.
[2018-12-21T16:49:19,006][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-2] [forecast-it-test-overflow-to-disk] [autodetect/17799] [CForecastRunner.cc@124] Start forecasting from 2018-12-21T13:00:00+0000 to 2018-12-21T14:00:00+0000
[2018-12-21T16:49:25,725][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-2] [forecast-it-test-overflow-to-disk] [autodetect/17799] [CForecastRunner.cc@213] Finished forecasting, wrote 8000 records
[2018-12-21T16:49:25,726][WARN ][o.e.x.m.p.l.CppLogMessageHandler] [node-2] [forecast-it-test-overflow-to-disk] [autodetect/17799] [CForecastRunner.cc@227] Failed to cleanup temporary data from: /home/javanna/elasticsearch/x-pack/plugin/ml/qa/native-multi-node-tests/build/cluster/integTestCluster node2/elasticsearch-7.0.0-SNAPSHOT/data/ml-local-data/tmp/forecast-it-test-overflow-to-disk/W2h00WcBd-rMI0jD0QIa error Permission denied
[2018-12-21T16:49:31,958][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-2] [forecast-it-test-overflow-to-disk] [autodetect/17799] [CForecastRunner.cc@347] Forecast of large model requested (requires 40 MB), using disk.
[2018-12-21T16:50:10,567][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-2] Closing job [forecast-it-test-overflow-to-disk], because [close job (api)]
[2018-12-21T16:50:10,568][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-2] [forecast-it-test-overflow-to-disk] [autodetect/17799] [CCmdSkeleton.cc@45] Handled 120000 records
[2018-12-21T16:50:10,580][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-2] [forecast-it-test-overflow-to-disk] [autodetect/17799] [CAnomalyJob.cc@1356] Pruning all models

Disclaimer: this happens on my linux box that I use to run CI, which I have just reinstalled. This test fails consistently the same way from within jenkins as well as when manually run. I am not getting what may be causing the permission denied error. I tried disabling selinux but that does not help. I would expect the cause to be some subtle misconfiguration but I haven't figured it out yet. The operating system is Fedora 29. I have Fedora 28 on my laptop where the test runs fine.

@hendrikmuhs
Copy link
Contributor Author

One possibility of the deletion problem: We install a system call filter on linux:

https://github.com/elastic/ml-cpp/blob/master/lib/seccomp/CSystemCallFilter_Linux.cc

It might be that forecast overflow requires additional whitelisting in this filter on Fedora 29.

Unfortunately it is not easy to find out, seccomp can not be easily disabled, you need a custom build of either ml-cpp or the linux kernel.

We could consider a special environment variable or cmdline parameter to disable seccomp if my assumption is true and if we get more reports like this.

@droberts195
Copy link
Contributor

We could consider a special environment variable or cmdline parameter to disable seccomp if my assumption is true and if we get more reports like this.

I don't think we should do this. If we did then any malware that wanted to use an exploit that seccomp is currently blocking would just switch it off, so we might as well just remove seccomp entirely. See also elastic/elasticsearch#27645 (comment)

Unfortunately it is not easy to find out

I think we need to put effort into finding out. @davidkyle isn't there a way to get seccomp to report which system call it's blocking? I seem to remember you used something like that during initial development. Then we just need to hire a Fedora 29 VM in EC2 or GCP and copy over a test program that uses boost::filesystem::remove_all() to that VM.

This is not desperately urgent as Fedora 29 is not a supported platform. But whatever changes are in its glibc will eventually be included in some supported platform, so we need to head off the problem before an affected platform is added to the support matrix.

@javanna
Copy link
Member

javanna commented Jan 3, 2019

I am here to help, as I said I can reproduce this 100% of the times. I would love if this could be given some kind of priority so I can run the full suite of tests on my server again, either the fix that Hendrik mentioned for the bug that's causing the test to hang, or digging deeper to find out what the problem is with boost::filesystem::remove_all if somebody has time.

@droberts195
Copy link
Contributor

I would love if this could be given some kind of priority

I believe the fix to the symptoms of the problem in #352 will allow the tests to pass again, and it should be possible to merge that the day @hendrikmuhs is back at work.

@davidkyle
Copy link
Member

The only way I found to reliably track which syscalls are blocked across all linux's is to change the seccomp filter to return SECCOMP_RET_KILL instead of SECCOMP_RET_ERRNO. This kills the calling thread, which should be the main thread killing the process and logs a message with the offending syscall number which then must be looked up in syscall.h.

I think this is worth investigating even if fedora isn't officially supported as a blocked syscall silently fails and it very hard to debug. It's a pretty easy change to make, I would have to go back to look at exactly what I did before and create a unit test for it.

boost::filesystem::remove_all() is called from the main thread isn't it? Kernal 4.14 and later has much better support for seccomp including the SECCOMP_RET_LOG and SECCOMP_RET_KILL_PROCESS options.

@javanna
Copy link
Member

javanna commented Jan 3, 2019

great @droberts195 I had not seen that PR, thank you for pointing me to that. I can help testing it out if needed, not sure how though :) ping me if I can do anything.

@hendrikmuhs
Copy link
Contributor Author

hendrikmuhs commented Jan 3, 2019

My PR only fixes the hang but not the issue that on Fedora 29 we very likely block a system call as part of the seccomp filter which we should not. I am pretty sure it is the seccomp filter because "eaccess" is what we get reported.

Killing the process would be quite dramatic as we sacrifice the process for a failed forecast. IMHO to much to make this the default.

I suggest that we do this only on snaphot builds but keep the current behavior on release builds. It requires some makefile and #ifdef's, should not take more than half a day to implement.

Note that the unit tests are not executed if you use the binary blops. we get more coverage with the above.

LBNL we have to check that the crashhandler reports the root cause (blocked syscall) correctly.

@droberts195
Copy link
Contributor

I suggest that we do this only on snaphot build

I wasn't suggesting we change this for any official build, just as a one-off debug build someone does on their laptop to find the offending syscall number. This problem only affects us when one of the OS libraries we use is changed to use different kernel calls to implement an existing user land OS function. I don't imagine that happens very often.

@davidkyle
Copy link
Member

I wasn't suggesting we change the response to kill the process only that this was the hack I made to find which syscalls were being called as I was developing the PR. More recent kernels (after 4.14) have better debugging for BPF but I needed to test this on all the versions we support.

There is a unit test for CSystemCallFilter_Linux.cc that asserts which operations are allowed/disallowed after installing the seccomp filter. This test is by no means exhaustive but the next step should be to add calls to boost::filesystem::remove_all() to the test then run the test on Fedora 29.

@droberts195
Copy link
Contributor

I'm pretty sure the missing syscall is sys_unlinkat. The Linux implementation of rmdir() in glibc uses it: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/generic/rmdir.c;h=8ef4bbe8da0a59aecc7e6cd6d36562f7c5365f3f;hb=HEAD

In fact, given that code it's surprising that this worked on any version of Linux from the last decade or so. I guess older glibc's must have had some sort of workaround that I can't find that fell back to sys_rmdir if sys_unlinkat didn't work for some reason.

@davidkyle
Copy link
Member

I guess older glibc's must have had some sort of workaround that I can't find that fell back to...

I noticed this a couple of times where if a syscall was blocked another was tried. For example @hendrikmuhs found a bug where the high resolution timer was blocked so it fell back to the 1 second timer.

Regarding the unit test for boost::filesystem::remove_all() one already exists:
https://github.com/elastic/ml-cpp/blob/master/lib/seccomp/unittest/CSystemCallFilterTest.cc#L233

We need to run the unit tests on Fedora 29 hopefully it will fail and hopefully adding sys_unlinkat will fix it.

droberts195 added a commit to droberts195/ml-cpp that referenced this issue Jan 4, 2019
Fedora 29 uses different system calls to platforms
we've previously tested on, and hence suffers from
certain functionality failing due to the seccomp
filter.

This commit permits 3 additional system calls:

1. __NR_gettimeofday
2. __NR_unlinkat
3. __NR_getdents64

(It is likely that other Linux distributions using
modern glibc would also hit one or more of these
system calls.  Non-fatal problems probably got
progressively worse in the lead up to the fatal
problem that surfaced in Fedora 29.)

Fixes elastic#350
droberts195 added a commit that referenced this issue Jan 4, 2019
Fedora 29 uses different system calls to platforms
we've previously tested on, and hence suffers from
certain functionality failing due to the seccomp
filter.

This commit permits 3 additional system calls:

1. __NR_gettimeofday
2. __NR_unlinkat
3. __NR_getdents64

(It is likely that other Linux distributions using
modern glibc would also hit one or more of these
system calls.  Non-fatal problems probably got
progressively worse in the lead up to the fatal
problem that surfaced in Fedora 29.)

Fixes #350
droberts195 added a commit to droberts195/ml-cpp that referenced this issue Jan 4, 2019
Fedora 29 uses different system calls to platforms
we've previously tested on, and hence suffers from
certain functionality failing due to the seccomp
filter.

This commit permits 3 additional system calls:

1. __NR_gettimeofday
2. __NR_unlinkat
3. __NR_getdents64

(It is likely that other Linux distributions using
modern glibc would also hit one or more of these
system calls.  Non-fatal problems probably got
progressively worse in the lead up to the fatal
problem that surfaced in Fedora 29.)

Fixes elastic#350
Backport of elastic#354
droberts195 added a commit that referenced this issue Jan 4, 2019
Fedora 29 uses different system calls to platforms
we've previously tested on, and hence suffers from
certain functionality failing due to the seccomp
filter.

This commit permits 3 additional system calls:

1. __NR_gettimeofday
2. __NR_unlinkat
3. __NR_getdents64

(It is likely that other Linux distributions using
modern glibc would also hit one or more of these
system calls.  Non-fatal problems probably got
progressively worse in the lead up to the fatal
problem that surfaced in Fedora 29.)

Fixes #350
Backport of #354
@javanna
Copy link
Member

javanna commented Jan 7, 2019

thanks a lot everybody involved!

hendrikmuhs pushed a commit that referenced this issue Jan 8, 2019
fix a race condition if a forecast job requires overflowing to disk but cleanup of temporary storage fails.
This can cause the autodetect process to hang on exit, if more forecast requests are in the queue

relates to #350
hendrikmuhs pushed a commit that referenced this issue Jan 8, 2019
…353)

fix a race condition if a forecast job requires overflowing to disk but cleanup of temporary storage fails.
This can cause the autodetect process to hang on exit, if more forecast requests are in the queue

relates to #350
hendrikmuhs pushed a commit that referenced this issue Jan 8, 2019
…357)

fix a race condition if a forecast job requires overflowing to disk but cleanup of temporary storage fails.
This can cause the autodetect process to hang on exit, if more forecast requests are in the queue

relates to #350
hendrikmuhs pushed a commit that referenced this issue Jan 8, 2019
…356)

fix a race condition if a forecast job requires overflowing to disk but cleanup of temporary storage fails.
This can cause the autodetect process to hang on exit, if more forecast requests are in the queue

relates to #350
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants