Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Fix possible race condition when closing an opening job #42506

Merged
merged 2 commits into from May 24, 2019

Conversation

@droberts195
Copy link
Contributor

droberts195 commented May 24, 2019

This change fixes a race condition that would result in an
in-memory data structure becoming out-of-sync with persistent
tasks in cluster state.

If repeated often enough this could result in it being
impossible to open any ML jobs on the affected node, as the
master node would think the node had capacity to open another
job but the chosen node would error during the open sequence
due to its in-memory data structure being full.

The race could be triggered by opening a job and then closing
it a tiny fraction of a second later. It is unlikely a user
of the UI could open and close the job that fast, but a script
or program calling the REST API could.

The nasty thing is, from the externally observable states and
stats everything would appear to be fine - the fast open then
close sequence would appear to leave the job in the closed
state. It's only later that the leftovers in the in-memory
data structure might build up and cause a problem.

This change fixes a race condition that would result in an
in-memory data structure becoming out-of-sync with persistent
tasks in cluster state.

If repeated often enough this could result in it being
impossible to open any ML jobs on the affected node, as the
master node would think the node had capacity to open another
job but the chosen node would error during the open sequence
due to its in-memory data structure being full.

The race could be triggered by opening a job and then closing
it a tiny fraction of a second later.  It is unlikely a user
of the UI could open and close the job that fast, but a script
or program calling the REST API could.

The nasty thing is, from the externally observable states and
stats everything would appear to be fine - the fast open then
close sequence would appear to leave the job in the closed
state.  It's only later that the leftovers in the in-memory
data structure might build up and cause a problem.
@elasticmachine

This comment has been minimized.

Copy link
Collaborator

elasticmachine commented May 24, 2019

@@ -401,16 +401,12 @@ protected void doRun() {
logger.debug("Aborted opening job [{}] as it has been closed", jobId);
return;
}
if (processContext.getState() != ProcessContext.ProcessStateName.NOT_RUNNING) {

This comment has been minimized.

Copy link
@droberts195

droberts195 May 24, 2019

Author Contributor

This check needs to be done under lock to prevent a close sneaking in after this check but before the open actions are performed.

@@ -605,10 +613,10 @@ public void closeJob(JobTask jobTask, boolean restart, String reason) {
if (communicator == null) {
logger.debug("Job [{}] is being closed before its process is started", jobId);
jobTask.markAsCompleted();
return;

This comment has been minimized.

Copy link
@droberts195

droberts195 May 24, 2019

Author Contributor

Returning here without removing the entry from processByAllocation is the cause of the eventual filling up to capacity of processByAllocation. It's nasty because jobTask.markAsCompleted() causes the externally visible effects of closing the job to all look as expected. It's only the internal data structure that's left with a spurious entry.

Copy link
Member

davidkyle left a comment

LGTM

Copy link
Contributor

dimitris-athanasiou left a comment

LGTM That was tricky to find! Thanks for fixing this!

@droberts195

This comment has been minimized.

Copy link
Contributor Author

droberts195 commented May 24, 2019

Jenkins run elasticsearch-ci/1

3 similar comments
@droberts195

This comment has been minimized.

Copy link
Contributor Author

droberts195 commented May 24, 2019

Jenkins run elasticsearch-ci/1

@droberts195

This comment has been minimized.

Copy link
Contributor Author

droberts195 commented May 24, 2019

Jenkins run elasticsearch-ci/1

@droberts195

This comment has been minimized.

Copy link
Contributor Author

droberts195 commented May 24, 2019

Jenkins run elasticsearch-ci/1

@droberts195 droberts195 merged commit 5eb38ec into elastic:master May 24, 2019
8 checks passed
8 checks passed
CLA All commits in pull request signed
Details
elasticsearch-ci/1 Build finished.
Details
elasticsearch-ci/2 Build finished.
Details
elasticsearch-ci/bwc Build finished.
Details
elasticsearch-ci/default-distro Build finished.
Details
elasticsearch-ci/docbldesx Build finished.
Details
elasticsearch-ci/oss-distro-docs Build finished.
Details
elasticsearch-ci/packaging-sample Build finished.
Details
@droberts195 droberts195 deleted the droberts195:fix_race_on_close_during_open branch May 24, 2019
droberts195 added a commit that referenced this pull request May 24, 2019
This change fixes a race condition that would result in an
in-memory data structure becoming out-of-sync with persistent
tasks in cluster state.

If repeated often enough this could result in it being
impossible to open any ML jobs on the affected node, as the
master node would think the node had capacity to open another
job but the chosen node would error during the open sequence
due to its in-memory data structure being full.

The race could be triggered by opening a job and then closing
it a tiny fraction of a second later.  It is unlikely a user
of the UI could open and close the job that fast, but a script
or program calling the REST API could.

The nasty thing is, from the externally observable states and
stats everything would appear to be fine - the fast open then
close sequence would appear to leave the job in the closed
state.  It's only later that the leftovers in the in-memory
data structure might build up and cause a problem.
droberts195 added a commit that referenced this pull request May 24, 2019
This change fixes a race condition that would result in an
in-memory data structure becoming out-of-sync with persistent
tasks in cluster state.

If repeated often enough this could result in it being
impossible to open any ML jobs on the affected node, as the
master node would think the node had capacity to open another
job but the chosen node would error during the open sequence
due to its in-memory data structure being full.

The race could be triggered by opening a job and then closing
it a tiny fraction of a second later.  It is unlikely a user
of the UI could open and close the job that fast, but a script
or program calling the REST API could.

The nasty thing is, from the externally observable states and
stats everything would appear to be fine - the fast open then
close sequence would appear to leave the job in the closed
state.  It's only later that the leftovers in the in-memory
data structure might build up and cause a problem.
droberts195 added a commit that referenced this pull request May 24, 2019
This change fixes a race condition that would result in an
in-memory data structure becoming out-of-sync with persistent
tasks in cluster state.

If repeated often enough this could result in it being
impossible to open any ML jobs on the affected node, as the
master node would think the node had capacity to open another
job but the chosen node would error during the open sequence
due to its in-memory data structure being full.

The race could be triggered by opening a job and then closing
it a tiny fraction of a second later.  It is unlikely a user
of the UI could open and close the job that fast, but a script
or program calling the REST API could.

The nasty thing is, from the externally observable states and
stats everything would appear to be fine - the fast open then
close sequence would appear to leave the job in the closed
state.  It's only later that the leftovers in the in-memory
data structure might build up and cause a problem.
droberts195 added a commit that referenced this pull request May 24, 2019
This change fixes a race condition that would result in an
in-memory data structure becoming out-of-sync with persistent
tasks in cluster state.

If repeated often enough this could result in it being
impossible to open any ML jobs on the affected node, as the
master node would think the node had capacity to open another
job but the chosen node would error during the open sequence
due to its in-memory data structure being full.

The race could be triggered by opening a job and then closing
it a tiny fraction of a second later.  It is unlikely a user
of the UI could open and close the job that fast, but a script
or program calling the REST API could.

The nasty thing is, from the externally observable states and
stats everything would appear to be fine - the fast open then
close sequence would appear to leave the job in the closed
state.  It's only later that the leftovers in the in-memory
data structure might build up and cause a problem.
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
…#42506)

This change fixes a race condition that would result in an
in-memory data structure becoming out-of-sync with persistent
tasks in cluster state.

If repeated often enough this could result in it being
impossible to open any ML jobs on the affected node, as the
master node would think the node had capacity to open another
job but the chosen node would error during the open sequence
due to its in-memory data structure being full.

The race could be triggered by opening a job and then closing
it a tiny fraction of a second later.  It is unlikely a user
of the UI could open and close the job that fast, but a script
or program calling the REST API could.

The nasty thing is, from the externally observable states and
stats everything would appear to be fine - the fast open then
close sequence would appear to leave the job in the closed
state.  It's only later that the leftovers in the in-memory
data structure might build up and cause a problem.
@droberts195 droberts195 added the v8.0.0 label Jun 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.