
[ML] do not exit the worker after warning about failed cleanup #352

Merged
merged 1 commit into elastic:master on Jan 8, 2019

Conversation

@hendrikmuhs (Contributor) commented Dec 28, 2018

Fix a race condition where a forecast job requires overflowing to disk but cleanup of the temporary storage fails. This can cause the autodetect process to hang on exit if more forecast requests are in the queue.

relates to #350

Backports: 6.x #353, 6.6.0 #357, 6.5.5 #356

@hendrikmuhs (Contributor, Author) commented Dec 28, 2018

Because of the backport to 6.x, I made no changelog entry for master.

I think we need to revisit the changelog-status CI check.

@@ -225,7 +225,6 @@ void CForecastRunner::forecastWorker() {
     LOG_WARN(<< "Failed to cleanup temporary data from: "
              << forecastJob.s_TemporaryFolder << " error "
              << errorCode.message());
-    return;
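
The diff removes the `return;` that followed the warning, so the worker keeps servicing the queue. Below is a minimal sketch of the resulting loop shape, assuming placeholder names (`ForecastJob`, `popPendingJob`, `runForecast`); it is not the real CForecastRunner code.

```cpp
// Simplified illustration only; not the actual CForecastRunner implementation.
#include <atomic>
#include <filesystem>
#include <iostream>
#include <optional>
#include <string>
#include <system_error>

struct ForecastJob {
    std::string temporaryFolder; // on-disk overflow location, if the forecast spilled to disk
};

std::atomic<bool> shutdownRequested{false};

// Placeholder: the real worker blocks on a queue guarded by a mutex and condition variable.
std::optional<ForecastJob> popPendingJob() {
    return std::nullopt;
}

// Placeholder: the real worker runs the forecast, possibly overflowing to disk.
void runForecast(const ForecastJob& /*job*/) {}

void forecastWorkerSketch() {
    while (!shutdownRequested) {
        std::optional<ForecastJob> job = popPendingJob();
        if (!job) {
            continue;
        }
        runForecast(*job);

        std::error_code errorCode;
        std::filesystem::remove_all(job->temporaryFolder, errorCode);
        if (errorCode) {
            // Before this PR a `return;` here ended the worker thread, so any
            // forecasts still queued were never processed and the autodetect
            // process could hang on exit. Now the failure is only warned about
            // and the loop carries on.
            std::cerr << "Failed to cleanup temporary data from: "
                      << job->temporaryFolder << " error "
                      << errorCode.message() << '\n';
        }
    }
}
```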
Contributor:

I wonder whether it is really a good idea to continue the loop in this case. Are there definitely no side effects from having stale directories at the point we run subsequent forecasts?

If there are definitely no side effects from having failed to clean up, then I guess this is ok. Alternatively, since you delete all pending forecasts at the end of this function, another option might be to set the m_Shutdown flag here, break out of the loop, and check that the worker hasn't shut down in the code which schedules forecasts.

Contributor:

I don't think continuing the loop is that bad. Remember that a new process is only started for forecasting if the job is not currently open. So exiting the loop would mean it was not possible to do a forecast for a real-time job without closing and reopening the job, and that could be a major pain for production use cases.

Maybe I'm missing something, though. If shutting down the loop is determined to be the best way forward, then the flag needs to be set to true with the mutex locked in this method and checked with the mutex locked in the push method; otherwise there would be potential for a race if a forecast is being pushed at the same time as cleanup of another forecast is failing.
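
To make the alternative being discussed concrete, here is a minimal sketch of a shutdown flag written and read under the same mutex that guards the pending queue. The names (`ForecastSchedulerSketch`, `pushForecastJob`, `requestShutdown`) are made up for illustration, and this is not the approach that was merged.

```cpp
// Illustration of the alternative discussed above, not the merged change.
// The flag must be set and checked under the same mutex as the queue;
// otherwise a forecast pushed while another forecast's cleanup is failing
// could be queued for a worker that has already decided to stop.
#include <deque>
#include <mutex>
#include <string>

class ForecastSchedulerSketch {
public:
    // Called by the code that schedules forecasts.
    bool pushForecastJob(const std::string& job) {
        std::lock_guard<std::mutex> lock(m_Mutex);
        if (m_Shutdown) {
            // The worker has stopped; refuse the job rather than queueing it forever.
            return false;
        }
        m_PendingJobs.push_back(job);
        return true;
    }

    // Called from the worker when it decides it cannot safely continue.
    void requestShutdown() {
        std::lock_guard<std::mutex> lock(m_Mutex);
        m_Shutdown = true;
        m_PendingJobs.clear(); // pending forecasts are dropped, as at the end of forecastWorker()
    }

private:
    std::mutex m_Mutex;
    bool m_Shutdown = false;
    std::deque<std::string> m_PendingJobs;
};
```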

Contributor:

Ok, that's a good point. I'd thought the thread was spawned/joined each time a forecast was requested, but it does indeed look like the forecast thread is started only once when the CAnomalyJob object is created. In that case ever exiting this loop - except on process shutdown - seems undesirable.

I just wanted to raise this, since it is quite a significant change in behaviour, and I wanted to be sure we'd thought about possible side effects. A couple of things that occurred to me, but that I hadn't checked, are: 1) what do we do about filling up the disk on the write side? 2) can there be any side effects from having stale directories for subsequent forecasts, i.e. do names definitely not get recycled?

Contributor (Author):

It was never intended to shut down the loop; this is a bug. Even the code documentation says:

// not an error: there is also cleanup code on the Java side

and the log level is warning.

The name of a temporary file is also not hardcoded but random: https://github.com/elastic/ml-cpp/blob/master/lib/model/CForecastModelPersist.cc#L34
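
As a rough illustration of why randomized names avoid recycling (the linked CForecastModelPersist.cc is the authoritative code; the exact mechanism there may differ), a temporary file path could be generated along these lines:

```cpp
// Rough illustration only; see CForecastModelPersist.cc for the real code.
// A randomized file name means a stale directory left behind by a failed
// cleanup cannot collide with the files written by a later forecast.
#include <filesystem>
#include <random>
#include <sstream>
#include <string>

std::filesystem::path makeRandomTempFile(const std::filesystem::path& tempFolder) {
    std::random_device rd;
    std::mt19937_64 gen(rd());
    std::uniform_int_distribution<unsigned long long> dist;

    std::ostringstream name;
    name << "forecast-model-" << std::hex << dist(gen) << ".bin";
    return tempFolder / name.str();
}
```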

Contributor:

I see that the writing of forecast documents also checks that the disk isn't full. I just wanted to double-check.

@tveasey (Contributor) left a comment:

LGTM

@droberts195 (Contributor) commented:
Since the effect of this is to completely hang threads in the ES JVM and the fix is so simple I think it would be good to backport this to 6.6.0 and 6.5.5.

@hendrikmuhs (Contributor, Author) commented Jan 7, 2019

@droberts195 If we backport this fix to 6.6 and 6.5 - which is a good idea - I think we should backport #354, too. Either this or #354 alone fixes the severe problem of hanging processes, but it is a bit more complete to have both fixes in, as they conceptually belong together.

@droberts195 (Contributor) commented:
The reason I suggested backporting this to 6.5.5 but not #354 is that this fix stops the program malfunctioning if the forecast storage directory cannot be deleted for any reason, which might not necessarily be related to seccomp, whereas #354 is effectively an enhancement to add support for glibc 2.28. I doubt any customer will run 6.5 or 6.6 on a Linux distribution that uses glibc 2.28. Someone probably will run 6.7 on such a distro, as 6.7 will still be in use two years from now.

@droberts195 (Contributor) left a comment:

LGTM

@hendrikmuhs merged commit 1af9181 into elastic:master on Jan 8, 2019