[ML] Fix possible race condition when closing an opening job #42506
Conversation
This change fixes a race condition that would result in an in-memory data structure becoming out of sync with the persistent tasks in cluster state.

If repeated often enough, this could make it impossible to open any ML jobs on the affected node: the master node would think the node had capacity to open another job, but the chosen node would error during the open sequence because its in-memory data structure was full.

The race could be triggered by opening a job and then closing it a tiny fraction of a second later. It is unlikely that a user of the UI could open and close a job that fast, but a script or program calling the REST API could.

The nasty thing is that, from the externally observable states and stats, everything would appear to be fine: the fast open-then-close sequence would appear to leave the job in the closed state. It is only later that the leftovers in the in-memory data structure might build up and cause a problem.
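To make the failure mode concrete, here is a minimal, hypothetical sketch (the class, field, and limit names are illustrative, not the actual Elasticsearch code) of how stale entries in a node-local map can make opens fail even though cluster state says the node has capacity:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class NodeCapacitySketch {
    // Illustrative per-node limit; in reality this would be a node setting.
    private static final int MAX_OPEN_JOBS = 2;

    // Stand-in for the per-job in-memory map that can leak entries.
    private final Map<String, Object> processByAllocation = new ConcurrentHashMap<>();

    boolean tryOpen(String jobId) {
        if (processByAllocation.size() >= MAX_OPEN_JOBS) {
            // The master picked this node believing it had capacity (based on
            // cluster state), but the local map still holds stale entries from
            // earlier open-then-close races, so the open fails here.
            return false;
        }
        processByAllocation.put(jobId, new Object());
        return true;
    }

    // A close path that returns early without calling this leaves a stale entry behind.
    void removeEntry(String jobId) {
        processByAllocation.remove(jobId);
    }
}
```

In this sketch, every open-then-close race that skips removeEntry leaks one map slot, so after enough iterations tryOpen always returns false even though the node has no running jobs.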
Pinging @elastic/ml-core
@@ -401,16 +401,12 @@ protected void doRun() {
logger.debug("Aborted opening job [{}] as it has been closed", jobId);
return;
}
if (processContext.getState() != ProcessContext.ProcessStateName.NOT_RUNNING) {
This check needs to be done under the lock to prevent a close from sneaking in after the check but before the open actions are performed.
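A minimal sketch of that idea, assuming a hypothetical ProcessContextSketch class rather than the real ProcessContext (the state names and lock are simplified): both the open-side check and the close path take the same lock, so the state cannot change between the check and the open actions.

```java
import java.util.concurrent.locks.ReentrantLock;

class ProcessContextSketch {
    enum State { NOT_RUNNING, RUNNING, DYING }

    private final ReentrantLock lock = new ReentrantLock();
    private State state = State.NOT_RUNNING;

    // Racy version: a close can change the state between the check and the open actions.
    void openRacy() {
        if (state != State.NOT_RUNNING) {
            return;
        }
        // ... a concurrent close may run here ...
        startProcess();
    }

    // Fixed version: the check and the open actions happen atomically under the
    // same lock that the close path takes, so a close cannot sneak in between them.
    void openUnderLock() {
        lock.lock();
        try {
            if (state != State.NOT_RUNNING) {
                return;
            }
            state = State.RUNNING;
            startProcess();
        } finally {
            lock.unlock();
        }
    }

    void close() {
        lock.lock();
        try {
            state = State.DYING;
            stopProcess();
        } finally {
            lock.unlock();
        }
    }

    private void startProcess() { /* placeholder for the real open actions */ }
    private void stopProcess() { /* placeholder for the real close actions */ }
}
```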
@@ -605,10 +613,10 @@ public void closeJob(JobTask jobTask, boolean restart, String reason) {
if (communicator == null) {
logger.debug("Job [{}] is being closed before its process is started", jobId);
jobTask.markAsCompleted();
return;
Returning here without removing the entry from processByAllocation is what causes processByAllocation to eventually fill up to capacity. It's nasty because jobTask.markAsCompleted() makes all the externally visible effects of closing the job look as expected. It's only the internal data structure that's left with a spurious entry.
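A minimal sketch of the cleanup described here (the method signature and types are simplified and hypothetical, not the real AutodetectProcessManager code): the early-return branch removes the map entry before completing the task, so no spurious entry is left behind.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class CloseBeforeStartSketch {
    private final Map<Long, Object> processByAllocation = new ConcurrentHashMap<>();

    void closeJob(long allocationId, Object communicator, Runnable markAsCompleted) {
        if (communicator == null) {
            // The native process never started, so there is nothing to shut down,
            // but the map entry still has to be removed or it will linger forever.
            processByAllocation.remove(allocationId);
            markAsCompleted.run();
            return;
        }
        // ... normal close path for a running process ...
    }
}
```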
LGTM
LGTM. That was tricky to find! Thanks for fixing this!
Jenkins run elasticsearch-ci/1