
[ML] do not exit the worker after warning about failed cleanup #352

Merged
merged 1 commit into elastic:master on Jan 8, 2019

Conversation

@hendrikmuhs (Contributor) commented Dec 28, 2018

Fix a race condition where a forecast job requires overflowing to disk but cleanup of the temporary storage fails. This can cause the autodetect process to hang on exit if more forecast requests are in the queue.

relates to #350

Backports: 6.x #353, 6.6.0 #357, 6.5.5 #356

@hendrikmuhs (Contributor, Author) commented Dec 28, 2018

Because of the backport to 6.x, I made no changelog entry for master.

I think we need to revisit the changelog-status CI check.

@@ -225,7 +225,6 @@ void CForecastRunner::forecastWorker() {
     LOG_WARN(<< "Failed to cleanup temporary data from: "
              << forecastJob.s_TemporaryFolder << " error "
              << errorCode.message());
-    return;
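
The diff removes the `return;` that followed the warning, so the worker keeps servicing the queue. Below is a minimal sketch of the resulting loop shape, assuming placeholder names (`ForecastJob`, `popPendingJob`, `runForecast`); it is not the real CForecastRunner code.

```cpp
// Simplified illustration only; not the actual CForecastRunner implementation.
#include <atomic>
#include <filesystem>
#include <iostream>
#include <optional>
#include <string>
#include <system_error>

struct ForecastJob {
    std::string temporaryFolder; // on-disk overflow location, if the forecast spilled to disk
};

std::atomic<bool> shutdownRequested{false};

// Placeholder: the real worker blocks on a queue guarded by a mutex and condition variable.
std::optional<ForecastJob> popPendingJob() {
    return std::nullopt;
}

// Placeholder: the real worker runs the forecast, possibly overflowing to disk.
void runForecast(const ForecastJob& /*job*/) {}

void forecastWorkerSketch() {
    while (!shutdownRequested) {
        std::optional<ForecastJob> job = popPendingJob();
        if (!job) {
            continue;
        }
        runForecast(*job);

        std::error_code errorCode;
        std::filesystem::remove_all(job->temporaryFolder, errorCode);
        if (errorCode) {
            // Before this PR a `return;` here ended the worker thread, so any
            // forecasts still queued were never processed and the autodetect
            // process could hang on exit. Now the failure is only warned about
            // and the loop carries on.
            std::cerr << "Failed to cleanup temporary data from: "
                      << job->temporaryFolder << " error "
                      << errorCode.message() << '\n';
        }
    }
}
```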
Contributor:

I wonder whether it is really a good idea to continue the loop in this case. Are there definitely no side effects from having stale directories at the point we run subsequent forecasts?

If there are definitely no side effects from having failed to clean up, then I guess this is ok. Alternatively, since you delete all pending forecasts at the end of this function, another option might be to set the m_Shutdown flag here, break out of the loop, and check that the worker hasn't shut down in the code which schedules forecasts.

Contributor:

I don't think continuing the loop is that bad. Remember that a new process is only started for forecasting if the job is not currently open. So exiting the loop would mean it was not possible to do a forecast for a real-time job without closing and reopening the job, and that could be a major pain for production use cases.

Maybe I'm missing something, though. If shutting down the loop is determined to be the best way forward, then the flag needs to be set to true with the mutex locked in this method and checked with the mutex locked in the push method; otherwise there would be potential for a race if a forecast is being pushed at the same time as cleanup of another forecast is failing.
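
To make the alternative being discussed concrete, here is a minimal sketch of a shutdown flag written and read under the same mutex that guards the pending queue. The names (`ForecastSchedulerSketch`, `pushForecastJob`, `requestShutdown`) are made up for illustration, and this is not the approach that was merged.

```cpp
// Illustration of the alternative discussed above, not the merged change.
// The flag must be set and checked under the same mutex as the queue;
// otherwise a forecast pushed while another forecast's cleanup is failing
// could be queued for a worker that has already decided to stop.
#include <deque>
#include <mutex>
#include <string>

class ForecastSchedulerSketch {
public:
    // Called by the code that schedules forecasts.
    bool pushForecastJob(const std::string& job) {
        std::lock_guard<std::mutex> lock(m_Mutex);
        if (m_Shutdown) {
            // The worker has stopped; refuse the job rather than queueing it forever.
            return false;
        }
        m_PendingJobs.push_back(job);
        return true;
    }

    // Called from the worker when it decides it cannot safely continue.
    void requestShutdown() {
        std::lock_guard<std::mutex> lock(m_Mutex);
        m_Shutdown = true;
        m_PendingJobs.clear(); // pending forecasts are dropped, as at the end of forecastWorker()
    }

private:
    std::mutex m_Mutex;
    bool m_Shutdown = false;
    std::deque<std::string> m_PendingJobs;
};
```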

Contributor:

Ok, that's a good point. I'd thought the thread was spawned/joined each time a forecast was requested, but it does indeed look like the forecast thread is started only once when the CAnomalyJob object is created. In that case ever exiting this loop - except on process shutdown - seems undesirable.

I just wanted to raise this, since it is quite a significant change in behaviour, and I wanted to be sure we'd thought about possible side effects. A couple of things that occurred to me, but that I hadn't checked, are: 1) what do we do about filling up the disk on the write side? 2) can there be any side effects from having stale directories for subsequent forecasts, i.e. do names definitely not get recycled?

Contributor (Author):

It was never intended to shut down the loop; this is a bug. Even the code documentation says:

// not an error: there is also cleanup code on the Java side

and the log level is warning.

The name of a temporary file is also not hardcoded but random: https://github.com/elastic/ml-cpp/blob/master/lib/model/CForecastModelPersist.cc#L34
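
As a rough illustration of why randomized names avoid recycling (the linked CForecastModelPersist.cc is the authoritative code; the exact mechanism there may differ), a temporary file path could be generated along these lines:

```cpp
// Rough illustration only; see CForecastModelPersist.cc for the real code.
// A randomized file name means a stale directory left behind by a failed
// cleanup cannot collide with the files written by a later forecast.
#include <filesystem>
#include <random>
#include <sstream>
#include <string>

std::filesystem::path makeRandomTempFile(const std::filesystem::path& tempFolder) {
    std::random_device rd;
    std::mt19937_64 gen(rd());
    std::uniform_int_distribution<unsigned long long> dist;

    std::ostringstream name;
    name << "forecast-model-" << std::hex << dist(gen) << ".bin";
    return tempFolder / name.str();
}
```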

Contributor:

I see that the writing of forecast documents also checks that the disk isn't full. I just wanted to double-check.

@tveasey (Contributor) left a comment:

LGTM

@droberts195 (Contributor) commented:
Since the effect of this is to completely hang threads in the ES JVM and the fix is so simple I think it would be good to backport this to 6.6.0 and 6.5.5.

@hendrikmuhs (Contributor, Author) commented Jan 7, 2019

@droberts195 If we backport this fix to 6.6 and 6.5 - which is a good idea - I think we should backport #354, too. Either this or #354 alone fixes the severe problem of hanging processes, but it is a bit more complete to have both fixes in, as they conceptually belong together.

@droberts195 (Contributor) commented:
The reason I suggested backporting this to 6.5.5 but not #354 is that this fix stops the program malfunctioning if the forecast storage directory cannot be deleted for any reason, which might not necessarily be related to seccomp, whereas #354 is effectively an enhancement to add support for glibc 2.28. I doubt any customer will run 6.5 or 6.6 on a Linux distribution that uses glibc 2.28. Someone probably will run 6.7 on such a distro, as 6.7 will still be in use two years from now.

@droberts195 (Contributor) left a comment:

LGTM

@hendrikmuhs merged commit 1af9181 into elastic:master on Jan 8, 2019