[ML][Data Frame] moves failure state transition for MT safety #45676

benwtrent · 2019-08-16T20:45:54Z

With how failures are handled now, if we have to transition into a FAILED state on the task side, there is a chance that the indexer will get triggered again while we are still processing the failure.

To prevent this, onFailure is now called BEFORE the indexer is transitioned between its states and before doSaveState is called. This fixes two bugs:

* The race condition where another trigger could fire while we are still in the middle of failing.
* `doSaveState` being called with incorrect statistics, because the current position at the time of failure will have to be started over again.
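
The ordering described above can be sketched roughly as follows. This is a simplified, single-class model and not the real `AsyncTwoPhaseIndexer`: the class name, the `taskFailed` flag, `maybeTrigger`, `finishWithFailure`, and the `events` list are all hypothetical names introduced for illustration. It only shows that when `onFailure` runs before the state transition, the indexer is still INDEXING while the failure is handled, so a later trigger finds the task already failed.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of the failure path; not the real AsyncTwoPhaseIndexer.
class FailureOrderingSketch {
    enum IndexerState { STARTED, INDEXING }

    private IndexerState indexerState = IndexerState.STARTED;
    private boolean taskFailed = false;          // task-side FAILED flag (assumed name)
    final List<String> events = new ArrayList<>();

    // A trigger starts work only if the task has not failed and the indexer is idle.
    synchronized boolean maybeTrigger() {
        if (taskFailed || indexerState != IndexerState.STARTED) {
            events.add("trigger:rejected");
            return false;
        }
        indexerState = IndexerState.INDEXING;
        events.add("trigger:accepted");
        return true;
    }

    // Failure path in the order the description gives:
    // onFailure first, then the state transition, then doSaveState.
    synchronized void finishWithFailure(Exception e) {
        onFailure(e);
        indexerState = IndexerState.STARTED;     // stands in for finishAndSetState()
        events.add("state:" + indexerState);
        doSaveState(indexerState);
    }

    private void onFailure(Exception e) {
        // The indexer is still INDEXING here, so no new trigger can start work yet.
        events.add("onFailure:" + indexerState);
        taskFailed = true;
    }

    private void doSaveState(IndexerState state) {
        events.add("doSaveState:" + state);
    }
}
```

Calling `maybeTrigger()`, then `finishWithFailure(...)`, then `maybeTrigger()` again records the failure handler running while the state is still INDEXING, and the second trigger is rejected.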

Looking through how rollups use `onFailure`, only a log message is written, so moving when that message is written out should not change their behavior.

closes #45664

elasticmachine · 2019-08-16T20:45:56Z

Pinging @elastic/ml-core

davidkyle

LGTM

…45627) (#45656)

* [ML][Data frame] fixing failure state transitions and race condition (#45627)

There is a small window for a race condition while we are flagging a task as failed.

Here are the steps where the race condition occurs:

1. A failure occurs.
2. Before `AsyncTwoPhaseIndexer` calls the `onFailure` handler, it does the following:
   a. `finishAndSetState()`, which sets the IndexerState to STARTED
   b. `doSaveState(...)`, which attempts to save the current state of the indexer
3. Another trigger is fired BEFORE `onFailure` can fire, but AFTER `finishAndSetState()` occurs.

The trick here is that we will eventually set the indexer to failed, but possibly not before another trigger had the opportunity to fire. This could cause some weird state interactions. To combat this, I have put in some predicates to verify the state before taking actions. This is so that, if the state is indeed marked failed, the "second trigger" stops ASAP.

Additionally, I move the task state checks INTO the `start` and `stop` methods, which will now require a `force` parameter. `start`, `stop`, `trigger` and `markAsFailed` are all `synchronized`. This should give us some guarantees that one will not switch states out from underneath another.

I also flag the task as `failed` BEFORE we successfully write it to cluster state; this is to allow us to make the task fail more quickly. But this does add the behavior where the task is "failed" but the cluster state does not indicate as much. Adding the checks in `start` and `stop` will handle this "real state vs cluster state" race condition. This has always been a problem for `_stop`, as it is not a master node action and doesn't always have the latest cluster state.

closes #45609

Relates to #45562

* [ML][Data Frame] moves failure state transition for MT safety (#45676)
* [ML][Data Frame] moves failure state transition for MT safety
* removing unused imports
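
A minimal sketch of the guarded, `synchronized` methods that commit message describes might look like the following. The class, enum, and field names are hypothetical, and the exact meaning of `force` here (allowing `start` or `stop` on a task that is already failed) is an assumption made for illustration, not a statement of what the real task class does.

```java
// Hypothetical, simplified task; names and the semantics of `force` are assumptions.
class GuardedTaskSketch {
    enum TaskState { STOPPED, STARTED, FAILED }

    private TaskState taskState = TaskState.STOPPED;
    private String failureReason;

    // State check lives inside start(): a failed task only starts when forced.
    synchronized void start(boolean force) {
        if (taskState == TaskState.FAILED && !force) {
            throw new IllegalStateException("task is failed: " + failureReason);
        }
        failureReason = null;
        taskState = TaskState.STARTED;
    }

    // Same check inside stop(), independent of what cluster state currently says.
    synchronized void stop(boolean force) {
        if (taskState == TaskState.FAILED && !force) {
            throw new IllegalStateException("task is failed: " + failureReason);
        }
        failureReason = null;
        taskState = TaskState.STOPPED;
    }

    // Predicate before acting: a trigger on a failed or stopped task does nothing.
    synchronized boolean trigger() {
        return taskState == TaskState.STARTED;
    }

    // Flag the task as failed locally right away, before any cluster state update.
    synchronized void markAsFailed(String reason) {
        if (taskState == TaskState.FAILED) {
            return; // already failed; keep the first reason
        }
        taskState = TaskState.FAILED;
        failureReason = reason;
    }

    synchronized TaskState getState() {
        return taskState;
    }
}
```

Because all four methods lock on the same object, a `trigger` cannot observe a half-applied transition made by `markAsFailed`, which is the guarantee the commit message is after.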

After PR #45676, `onFailure` is now called before the indexer state has transitioned out of indexing. To fix these tests, I added a new check to make sure that we don't mark it as failed until AFTER `doSaveState` is called with a STARTED indexer.

[ML][Data Frame] moves failure state transition for MT safety

6b91ff5

benwtrent added >bug v8.0.0 :ml/Transform Transform v7.4.0 labels Aug 16, 2019

removing unused imports

1d3c348

davidkyle approved these changes Aug 19, 2019

benwtrent merged commit 27fab09 into elastic:master Aug 19, 2019

benwtrent deleted the feature/ml-df-fix-force-start-failed-transform-test branch August 19, 2019 11:43

benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Aug 19, 2019

[ML][Data Frame] moves failure state transition for MT safety (elasti…

27405af

…c#45676) * [ML][Data Frame] moves failure state transition for MT safety * removing unused imports

benwtrent mentioned this pull request Aug 19, 2019

[ML][Data frame] fixing failure state transitions and race condition (#45627) #45656

Merged

benwtrent mentioned this pull request Aug 21, 2019

Fixing rollup state tests after onFailure ordering change #45784

Merged

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
