Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML][Data Frame] moves failure state transition for MT safety #45676

Conversation

benwtrent
Copy link
Member

@benwtrent benwtrent commented Aug 16, 2019

With how failures are handled now, if we have to transition into a FAILED state on the task side, there is a chance that the indexer will get triggered again while we are still processing the failure.

To prevent this, onFailure is now called BEFORE the indexer is transitioned between its states and before doSaveState is called. This fixes two bugs:

  • The race condition where another trigger could fire while we are still in the middle of failing
  • Having doSaveState be called with incorrect statistics as the current position at the time of failure will have to be started over again.

Looking through how rollups utilizes onFailure just a log message is made, so moving when that message is written out should not change their behavior.

closes #45664

@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

Copy link
Member

@davidkyle davidkyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@benwtrent benwtrent merged commit 27fab09 into elastic:master Aug 19, 2019
@benwtrent benwtrent deleted the feature/ml-df-fix-force-start-failed-transform-test branch August 19, 2019 11:43
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Aug 19, 2019
…c#45676)

* [ML][Data Frame] moves failure state transition for MT safety

* removing unused imports
benwtrent added a commit that referenced this pull request Aug 20, 2019
…45627) (#45656)

* [ML][Data frame] fixing failure state transitions and race condition (#45627)

There is a small window for a race condition while we are flagging a task as failed.

Here are the steps where the race condition occurs:
1. A failure occurs
2. Before `AsyncTwoPhaseIndexer` calls the `onFailure` handler it does the following:
   a. `finishAndSetState()` which sets the IndexerState to STARTED
   b. `doSaveState(...)` which attempts to save the current state of the indexer
3. Another trigger is fired BEFORE `onFailure` can fire, but AFTER `finishAndSetState()` occurs.

The trick here is that we will eventually set the indexer to failed, but possibly not before another trigger had the opportunity to fire. This could obviously cause some weird state interactions. To combat this, I have put in some predicates to verify the state before taking actions. This is so if state is indeed marked failed, the "second trigger" stops ASAP.

Additionally, I move the task state checks INTO the `start` and `stop` methods, which will now require a `force` parameter. `start`, `stop`, `trigger` and `markAsFailed` are all `synchronized`. This should gives us some guarantees that one will not switch states out from underneath another.

I also flag the task as `failed` BEFORE we successfully write it to cluster state, this is to allow us to make the task fail more quickly. But, this does add the behavior where the task is "failed" but the cluster state does not indicate as much. Adding the checks in `start` and `stop` will handle this "real state vs cluster state" race condition. This has always been a problem for `_stop` as it is not a master node action and doesn’t always have the latest cluster state.

closes #45609

Relates to #45562

* [ML][Data Frame] moves failure state transition for MT safety (#45676)

* [ML][Data Frame] moves failure state transition for MT safety

* removing unused imports
benwtrent added a commit that referenced this pull request Aug 21, 2019
After the PR #45676 onFailure is now called before the indexer state has transitioned out of indexing.

To fix these tests, I added a new check to make sure that we don't mark it as failed until AFTER doSaveState is called with a STARTED indexer.
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Aug 21, 2019
)

After the PR elastic#45676 onFailure is now called before the indexer state has transitioned out of indexing.

To fix these tests, I added a new check to make sure that we don't mark it as failed until AFTER doSaveState is called with a STARTED indexer.
benwtrent added a commit that referenced this pull request Aug 21, 2019
…45814)

After the PR #45676 onFailure is now called before the indexer state has transitioned out of indexing.

To fix these tests, I added a new check to make sure that we don't mark it as failed until AFTER doSaveState is called with a STARTED indexer.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

testForceStartFailedTransform fails with STARTED expected FAILED
4 participants