[ML] process can hang if large forecast job fails to delete temporary storage #350

hendrikmuhs · 2018-12-21T16:44:34Z

found using integration tests:

if - for unknown reason - forecasting fails to delete tmp storage[*], the worker thread returns, potentially leaving open jobs in the queue, which causes the process to hang on job close.

[*] Note, this is a bug in itself, but this problem should still not lead to the issue above

javanna · 2018-12-21T19:53:54Z

Thanks for opening this issue! I am available to dig into the "why" of the initial error when cleaning up the temporary folder. I get "permission denied" at this line which causes ForecastIT#testOverflowToDisk to hang. This is the output I see, nothing more, the test is killed by the suite timeout after 20 minutes:

[2018-12-21T16:49:13,128][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-2] [forecast-it-test-overflow-to-disk] [autodetect/17799] [CForecastRunner.cc@347] Forecast of large model requested (requires 40 MB), using disk.
[2018-12-21T16:49:19,006][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-2] [forecast-it-test-overflow-to-disk] [autodetect/17799] [CForecastRunner.cc@124] Start forecasting from 2018-12-21T13:00:00+0000 to 2018-12-21T14:00:00+0000
[2018-12-21T16:49:25,725][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-2] [forecast-it-test-overflow-to-disk] [autodetect/17799] [CForecastRunner.cc@213] Finished forecasting, wrote 8000 records
[2018-12-21T16:49:25,726][WARN ][o.e.x.m.p.l.CppLogMessageHandler] [node-2] [forecast-it-test-overflow-to-disk] [autodetect/17799] [CForecastRunner.cc@227] Failed to cleanup temporary data from: /home/javanna/elasticsearch/x-pack/plugin/ml/qa/native-multi-node-tests/build/cluster/integTestCluster node2/elasticsearch-7.0.0-SNAPSHOT/data/ml-local-data/tmp/forecast-it-test-overflow-to-disk/W2h00WcBd-rMI0jD0QIa error Permission denied
[2018-12-21T16:49:31,958][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-2] [forecast-it-test-overflow-to-disk] [autodetect/17799] [CForecastRunner.cc@347] Forecast of large model requested (requires 40 MB), using disk.
[2018-12-21T16:50:10,567][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-2] Closing job [forecast-it-test-overflow-to-disk], because [close job (api)]
[2018-12-21T16:50:10,568][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-2] [forecast-it-test-overflow-to-disk] [autodetect/17799] [CCmdSkeleton.cc@45] Handled 120000 records
[2018-12-21T16:50:10,580][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-2] [forecast-it-test-overflow-to-disk] [autodetect/17799] [CAnomalyJob.cc@1356] Pruning all models

Disclaimer: this happens on my linux box that I use to run CI, which I have just reinstalled. This test fails consistently the same way from within jenkins as well as when manually run. I am not getting what may be causing the permission denied error. I tried disabling selinux but that does not help. I would expect the cause to be some subtle misconfiguration but I haven't figured it out yet. The operating system is Fedora 29. I have Fedora 28 on my laptop where the test runs fine.

hendrikmuhs · 2018-12-28T11:07:51Z

One possibility of the deletion problem: We install a system call filter on linux:

https://github.com/elastic/ml-cpp/blob/master/lib/seccomp/CSystemCallFilter_Linux.cc

It might be that forecast overflow requires additional whitelisting in this filter on Fedora 29.

Unfortunately it is not easy to find out, seccomp can not be easily disabled, you need a custom build of either ml-cpp or the linux kernel.

We could consider a special environment variable or cmdline parameter to disable seccomp if my assumption is true and if we get more reports like this.

droberts195 · 2019-01-03T10:20:38Z

We could consider a special environment variable or cmdline parameter to disable seccomp if my assumption is true and if we get more reports like this.

I don't think we should do this. If we did then any malware that wanted to use an exploit that seccomp is currently blocking would just switch it off, so we might as well just remove seccomp entirely. See also elastic/elasticsearch#27645 (comment)

Unfortunately it is not easy to find out

I think we need to put effort into finding out. @davidkyle isn't there a way to get seccomp to report which system call it's blocking? I seem to remember you used something like that during initial development. Then we just need to hire a Fedora 29 VM in EC2 or GCP and copy over a test program that uses boost::filesystem::remove_all() to that VM.

This is not desperately urgent as Fedora 29 is not a supported platform. But whatever changes are in its glibc will eventually be included in some supported platform, so we need to head off the problem before an affected platform is added to the support matrix.

javanna · 2019-01-03T10:31:11Z

I am here to help, as I said I can reproduce this 100% of the times. I would love if this could be given some kind of priority so I can run the full suite of tests on my server again, either the fix that Hendrik mentioned for the bug that's causing the test to hang, or digging deeper to find out what the problem is with boost::filesystem::remove_all if somebody has time.

droberts195 · 2019-01-03T10:47:41Z

I would love if this could be given some kind of priority

I believe the fix to the symptoms of the problem in #352 will allow the tests to pass again, and it should be possible to merge that the day @hendrikmuhs is back at work.

davidkyle · 2019-01-03T10:47:59Z

The only way I found to reliably track which syscalls are blocked across all linux's is to change the seccomp filter to return SECCOMP_RET_KILL instead of SECCOMP_RET_ERRNO. This kills the calling thread, which should be the main thread killing the process and logs a message with the offending syscall number which then must be looked up in syscall.h.

I think this is worth investigating even if fedora isn't officially supported as a blocked syscall silently fails and it very hard to debug. It's a pretty easy change to make, I would have to go back to look at exactly what I did before and create a unit test for it.

boost::filesystem::remove_all() is called from the main thread isn't it? Kernal 4.14 and later has much better support for seccomp including the SECCOMP_RET_LOG and SECCOMP_RET_KILL_PROCESS options.

javanna · 2019-01-03T10:50:21Z

great @droberts195 I had not seen that PR, thank you for pointing me to that. I can help testing it out if needed, not sure how though :) ping me if I can do anything.

hendrikmuhs · 2019-01-03T12:28:23Z

My PR only fixes the hang but not the issue that on Fedora 29 we very likely block a system call as part of the seccomp filter which we should not. I am pretty sure it is the seccomp filter because "eaccess" is what we get reported.

Killing the process would be quite dramatic as we sacrifice the process for a failed forecast. IMHO to much to make this the default.

I suggest that we do this only on snaphot builds but keep the current behavior on release builds. It requires some makefile and #ifdef's, should not take more than half a day to implement.

Note that the unit tests are not executed if you use the binary blops. we get more coverage with the above.

LBNL we have to check that the crashhandler reports the root cause (blocked syscall) correctly.

droberts195 · 2019-01-03T12:55:13Z

I suggest that we do this only on snaphot build

I wasn't suggesting we change this for any official build, just as a one-off debug build someone does on their laptop to find the offending syscall number. This problem only affects us when one of the OS libraries we use is changed to use different kernel calls to implement an existing user land OS function. I don't imagine that happens very often.

davidkyle · 2019-01-03T14:11:06Z

I wasn't suggesting we change the response to kill the process only that this was the hack I made to find which syscalls were being called as I was developing the PR. More recent kernels (after 4.14) have better debugging for BPF but I needed to test this on all the versions we support.

There is a unit test for CSystemCallFilter_Linux.cc that asserts which operations are allowed/disallowed after installing the seccomp filter. This test is by no means exhaustive but the next step should be to add calls to boost::filesystem::remove_all() to the test then run the test on Fedora 29.

droberts195 · 2019-01-04T10:03:44Z

I'm pretty sure the missing syscall is sys_unlinkat. The Linux implementation of rmdir() in glibc uses it: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/generic/rmdir.c;h=8ef4bbe8da0a59aecc7e6cd6d36562f7c5365f3f;hb=HEAD

In fact, given that code it's surprising that this worked on any version of Linux from the last decade or so. I guess older glibc's must have had some sort of workaround that I can't find that fell back to sys_rmdir if sys_unlinkat didn't work for some reason.

davidkyle · 2019-01-04T12:17:08Z

I guess older glibc's must have had some sort of workaround that I can't find that fell back to...

I noticed this a couple of times where if a syscall was blocked another was tried. For example @hendrikmuhs found a bug where the high resolution timer was blocked so it fell back to the 1 second timer.

Regarding the unit test for boost::filesystem::remove_all() one already exists:
https://github.com/elastic/ml-cpp/blob/master/lib/seccomp/unittest/CSystemCallFilterTest.cc#L233

We need to run the unit tests on Fedora 29 hopefully it will fail and hopefully adding sys_unlinkat will fix it.

Fedora 29 uses different system calls to platforms we've previously tested on, and hence suffers from certain functionality failing due to the seccomp filter. This commit permits 3 additional system calls: 1. __NR_gettimeofday 2. __NR_unlinkat 3. __NR_getdents64 (It is likely that other Linux distributions using modern glibc would also hit one or more of these system calls. Non-fatal problems probably got progressively worse in the lead up to the fatal problem that surfaced in Fedora 29.) Fixes elastic#350

Fedora 29 uses different system calls to platforms we've previously tested on, and hence suffers from certain functionality failing due to the seccomp filter. This commit permits 3 additional system calls: 1. __NR_gettimeofday 2. __NR_unlinkat 3. __NR_getdents64 (It is likely that other Linux distributions using modern glibc would also hit one or more of these system calls. Non-fatal problems probably got progressively worse in the lead up to the fatal problem that surfaced in Fedora 29.) Fixes #350

Fedora 29 uses different system calls to platforms we've previously tested on, and hence suffers from certain functionality failing due to the seccomp filter. This commit permits 3 additional system calls: 1. __NR_gettimeofday 2. __NR_unlinkat 3. __NR_getdents64 (It is likely that other Linux distributions using modern glibc would also hit one or more of these system calls. Non-fatal problems probably got progressively worse in the lead up to the fatal problem that surfaced in Fedora 29.) Fixes elastic#350 Backport of elastic#354

Fedora 29 uses different system calls to platforms we've previously tested on, and hence suffers from certain functionality failing due to the seccomp filter. This commit permits 3 additional system calls: 1. __NR_gettimeofday 2. __NR_unlinkat 3. __NR_getdents64 (It is likely that other Linux distributions using modern glibc would also hit one or more of these system calls. Non-fatal problems probably got progressively worse in the lead up to the fatal problem that surfaced in Fedora 29.) Fixes #350 Backport of #354

javanna · 2019-01-07T10:47:40Z

thanks a lot everybody involved!

fix a race condition if a forecast job requires overflowing to disk but cleanup of temporary storage fails. This can cause the autodetect process to hang on exit, if more forecast requests are in the queue relates to #350

…353) fix a race condition if a forecast job requires overflowing to disk but cleanup of temporary storage fails. This can cause the autodetect process to hang on exit, if more forecast requests are in the queue relates to #350

…357) fix a race condition if a forecast job requires overflowing to disk but cleanup of temporary storage fails. This can cause the autodetect process to hang on exit, if more forecast requests are in the queue relates to #350

…356) fix a race condition if a forecast job requires overflowing to disk but cleanup of temporary storage fails. This can cause the autodetect process to hang on exit, if more forecast requests are in the queue relates to #350

hendrikmuhs added >bug v7.0.0 :ml labels Dec 21, 2018

hendrikmuhs mentioned this issue Dec 28, 2018

[ML] do not exit the worker after warning about failed cleanup #352

Merged

droberts195 mentioned this issue Jan 4, 2019

[ML] Update seccomp filter for Fedora 29 #354

Merged

droberts195 closed this as completed in #354 Jan 4, 2019

droberts195 mentioned this issue Jan 4, 2019

[ML] Update seccomp filter for Fedora 29 #355

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ML] process can hang if large forecast job fails to delete temporary storage #350

[ML] process can hang if large forecast job fails to delete temporary storage #350

hendrikmuhs commented Dec 21, 2018

javanna commented Dec 21, 2018

hendrikmuhs commented Dec 28, 2018

droberts195 commented Jan 3, 2019

javanna commented Jan 3, 2019

droberts195 commented Jan 3, 2019

davidkyle commented Jan 3, 2019

javanna commented Jan 3, 2019

hendrikmuhs commented Jan 3, 2019 •

edited

droberts195 commented Jan 3, 2019

davidkyle commented Jan 3, 2019

droberts195 commented Jan 4, 2019

davidkyle commented Jan 4, 2019

javanna commented Jan 7, 2019

[ML] process can hang if large forecast job fails to delete temporary storage #350

[ML] process can hang if large forecast job fails to delete temporary storage #350

Comments

hendrikmuhs commented Dec 21, 2018

javanna commented Dec 21, 2018

hendrikmuhs commented Dec 28, 2018

droberts195 commented Jan 3, 2019

javanna commented Jan 3, 2019

droberts195 commented Jan 3, 2019

davidkyle commented Jan 3, 2019

javanna commented Jan 3, 2019

hendrikmuhs commented Jan 3, 2019 • edited

droberts195 commented Jan 3, 2019

davidkyle commented Jan 3, 2019

droberts195 commented Jan 4, 2019

davidkyle commented Jan 4, 2019

javanna commented Jan 7, 2019

hendrikmuhs commented Jan 3, 2019 •

edited