-
Notifications
You must be signed in to change notification settings - Fork 24.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML][Data Frame] moves failure state transition for MT safety #45676
Merged
benwtrent
merged 2 commits into
elastic:master
from
benwtrent:feature/ml-df-fix-force-start-failed-transform-test
Aug 19, 2019
Merged
[ML][Data Frame] moves failure state transition for MT safety #45676
benwtrent
merged 2 commits into
elastic:master
from
benwtrent:feature/ml-df-fix-force-start-failed-transform-test
Aug 19, 2019
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Pinging @elastic/ml-core |
davidkyle
approved these changes
Aug 19, 2019
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
benwtrent
added a commit
to benwtrent/elasticsearch
that referenced
this pull request
Aug 19, 2019
…c#45676) * [ML][Data Frame] moves failure state transition for MT safety * removing unused imports
benwtrent
added a commit
that referenced
this pull request
Aug 20, 2019
…45627) (#45656) * [ML][Data frame] fixing failure state transitions and race condition (#45627) There is a small window for a race condition while we are flagging a task as failed. Here are the steps where the race condition occurs: 1. A failure occurs 2. Before `AsyncTwoPhaseIndexer` calls the `onFailure` handler it does the following: a. `finishAndSetState()` which sets the IndexerState to STARTED b. `doSaveState(...)` which attempts to save the current state of the indexer 3. Another trigger is fired BEFORE `onFailure` can fire, but AFTER `finishAndSetState()` occurs. The trick here is that we will eventually set the indexer to failed, but possibly not before another trigger had the opportunity to fire. This could obviously cause some weird state interactions. To combat this, I have put in some predicates to verify the state before taking actions. This is so if state is indeed marked failed, the "second trigger" stops ASAP. Additionally, I move the task state checks INTO the `start` and `stop` methods, which will now require a `force` parameter. `start`, `stop`, `trigger` and `markAsFailed` are all `synchronized`. This should gives us some guarantees that one will not switch states out from underneath another. I also flag the task as `failed` BEFORE we successfully write it to cluster state, this is to allow us to make the task fail more quickly. But, this does add the behavior where the task is "failed" but the cluster state does not indicate as much. Adding the checks in `start` and `stop` will handle this "real state vs cluster state" race condition. This has always been a problem for `_stop` as it is not a master node action and doesn’t always have the latest cluster state. closes #45609 Relates to #45562 * [ML][Data Frame] moves failure state transition for MT safety (#45676) * [ML][Data Frame] moves failure state transition for MT safety * removing unused imports
benwtrent
added a commit
that referenced
this pull request
Aug 21, 2019
After the PR #45676 onFailure is now called before the indexer state has transitioned out of indexing. To fix these tests, I added a new check to make sure that we don't mark it as failed until AFTER doSaveState is called with a STARTED indexer.
benwtrent
added a commit
to benwtrent/elasticsearch
that referenced
this pull request
Aug 21, 2019
) After the PR elastic#45676 onFailure is now called before the indexer state has transitioned out of indexing. To fix these tests, I added a new check to make sure that we don't mark it as failed until AFTER doSaveState is called with a STARTED indexer.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
With how failures are handled now, if we have to transition into a
FAILED
state on the task side, there is a chance that the indexer will get triggered again while we are still processing the failure.To prevent this,
onFailure
is now called BEFORE the indexer is transitioned between its states and beforedoSaveState
is called. This fixes two bugs:doSaveState
be called with incorrect statistics as the current position at the time of failure will have to be started over again.Looking through how rollups utilizes
onFailure
just a log message is made, so moving when that message is written out should not change their behavior.closes #45664