[ML] Fix possible race condition when closing an opening job #42506
Conversation
This change fixes a race condition that would result in an in-memory data structure becoming out of sync with the persistent tasks in cluster state.

If repeated often enough, this could make it impossible to open any ML jobs on the affected node: the master node would think the node had capacity to open another job, but the chosen node would error during the open sequence because its in-memory data structure was full.

The race could be triggered by opening a job and then closing it a tiny fraction of a second later. It is unlikely that a user of the UI could open and close a job that fast, but a script or program calling the REST API could.

The nasty thing is that, from the externally observable states and stats, everything would appear to be fine: the fast open-then-close sequence would appear to leave the job in the closed state. It is only later that the leftovers in the in-memory data structure might build up and cause a problem.
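To make the failure mode concrete, here is a minimal, hypothetical sketch (the class, field, and limit names are illustrative, not the actual Elasticsearch code) of how stale entries in a node-local map can make opens fail even though cluster state says the node has capacity:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class NodeCapacitySketch {
    // Illustrative per-node limit; in reality this would be a node setting.
    private static final int MAX_OPEN_JOBS = 2;

    // Stand-in for the per-job in-memory map that can leak entries.
    private final Map<String, Object> processByAllocation = new ConcurrentHashMap<>();

    boolean tryOpen(String jobId) {
        if (processByAllocation.size() >= MAX_OPEN_JOBS) {
            // The master picked this node believing it had capacity (based on
            // cluster state), but the local map still holds stale entries from
            // earlier open-then-close races, so the open fails here.
            return false;
        }
        processByAllocation.put(jobId, new Object());
        return true;
    }

    // A close path that returns early without calling this leaves a stale entry behind.
    void removeEntry(String jobId) {
        processByAllocation.remove(jobId);
    }
}
```

In this sketch, every open-then-close race that skips removeEntry leaks one map slot, so after enough iterations tryOpen always returns false even though the node has no running jobs.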
Pinging @elastic/ml-core
@@ -401,16 +401,12 @@ protected void doRun() {
logger.debug("Aborted opening job [{}] as it has been closed", jobId);
return;
}
if (processContext.getState() != ProcessContext.ProcessStateName.NOT_RUNNING) {
This check needs to be done under the lock to prevent a close from sneaking in after the check but before the open actions are performed.
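A minimal sketch of that idea, assuming a hypothetical ProcessContextSketch class rather than the real ProcessContext (the state names and lock are simplified): both the open-side check and the close path take the same lock, so the state cannot change between the check and the open actions.

```java
import java.util.concurrent.locks.ReentrantLock;

class ProcessContextSketch {
    enum State { NOT_RUNNING, RUNNING, DYING }

    private final ReentrantLock lock = new ReentrantLock();
    private State state = State.NOT_RUNNING;

    // Racy version: a close can change the state between the check and the open actions.
    void openRacy() {
        if (state != State.NOT_RUNNING) {
            return;
        }
        // ... a concurrent close may run here ...
        startProcess();
    }

    // Fixed version: the check and the open actions happen atomically under the
    // same lock that the close path takes, so a close cannot sneak in between them.
    void openUnderLock() {
        lock.lock();
        try {
            if (state != State.NOT_RUNNING) {
                return;
            }
            state = State.RUNNING;
            startProcess();
        } finally {
            lock.unlock();
        }
    }

    void close() {
        lock.lock();
        try {
            state = State.DYING;
            stopProcess();
        } finally {
            lock.unlock();
        }
    }

    private void startProcess() { /* placeholder for the real open actions */ }
    private void stopProcess() { /* placeholder for the real close actions */ }
}
```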
@@ -605,10 +613,10 @@ public void closeJob(JobTask jobTask, boolean restart, String reason) {
if (communicator == null) {
logger.debug("Job [{}] is being closed before its process is started", jobId);
jobTask.markAsCompleted();
return;
Returning here without removing the entry from processByAllocation is what causes processByAllocation to eventually fill up to capacity. It's nasty because jobTask.markAsCompleted() makes all the externally visible effects of closing the job look as expected. It's only the internal data structure that's left with a spurious entry.
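A minimal sketch of the cleanup described here (the method signature and types are simplified and hypothetical, not the real AutodetectProcessManager code): the early-return branch removes the map entry before completing the task, so no spurious entry is left behind.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class CloseBeforeStartSketch {
    private final Map<Long, Object> processByAllocation = new ConcurrentHashMap<>();

    void closeJob(long allocationId, Object communicator, Runnable markAsCompleted) {
        if (communicator == null) {
            // The native process never started, so there is nothing to shut down,
            // but the map entry still has to be removed or it will linger forever.
            processByAllocation.remove(allocationId);
            markAsCompleted.run();
            return;
        }
        // ... normal close path for a running process ...
    }
}
```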
LGTM
LGTM. That was tricky to find! Thanks for fixing this!
Jenkins run elasticsearch-ci/1