Race condition in RolloverAction causes failures #35313
Comments
Pinging @elastic/es-core-infra
I intend to work on this, I assigned everyone on ILM for visibility.
I'm on board with this plan! Personally, I think it would be safe for this timeout to be rather small since we should just be waiting on the time it takes two cluster-state updates to complete. Since there is no reason to risk things, setting the timeout to 10 minutes sounds like a reasonably safe thing to do.
Fixed by #35524
heya @gwbrown I still see similar failures once in a while on my CI, could it be the same problem as before? I can't repro with the same seed unfortunately, it looks like a timing issue.
@javanna That wouldn't be this issue, this would cause failures while in the *: There's a couple IOExceptions in late march, but those are errors creating an index so I doubt it's related.
There are still scenarios in which `RolloverAction` can incorrectly cause the index to be moved to the Error Step, particularly at low polling intervals, such as those we use in our tests. During a rollover, there is a gap between each of these steps:

1. […]
2. […]
3. `RolloverInfo` being added to the original index.

#35168 prevented errors in the case where two `RolloverStep`s fired simultaneously for the same index, and one completed step 1 above before the other. However, an error can still occur if two `RolloverStep`s (R1 and R2) fire simultaneously in this case:

- […] `ResourceAlreadyExistsException` and moves to the next step.
- […] `UpdateRolloverLifecycleDateStep`. R2 fails because R1 has not yet completed step 3 (i.e. has not yet attached the `RolloverInfo`) and the index moves to the error step.

This is the root cause of #35244
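
To make the interleaving concrete, here is a small deterministic model of the race in Python. It is a sketch, not Elasticsearch code: the `Cluster` class, the function names, and the index names `logs-000001` / `logs-000002` are all hypothetical. It also assumes that the first rollover phase is the creation of the new index, which the `ResourceAlreadyExistsException` suggests but the text above does not spell out.

```python
class ResourceAlreadyExists(Exception):
    """Stand-in for ResourceAlreadyExistsException (hypothetical model)."""


class Cluster:
    """Toy cluster state: one original index that the alias points to."""

    def __init__(self):
        self.indices = {"logs-000001": {"rollover_info": None}}
        self.alias_target = "logs-000001"

    def create_index(self, name):
        if name in self.indices:
            raise ResourceAlreadyExists(name)
        self.indices[name] = {"rollover_info": None}


def begin_rollover(cluster, new_index):
    """Assumed first phase: create the new index.

    Returns True if this caller created it, False if it already existed.
    An already-exists failure is tolerated and the caller moves on.
    """
    try:
        cluster.create_index(new_index)
        return True
    except ResourceAlreadyExists:
        return False


def update_rollover_lifecycle_date(cluster, original):
    """Model of the follow-up step: it needs RolloverInfo on the original index."""
    if cluster.indices[original]["rollover_info"] is None:
        return "ERROR"
    return "OK"


def simulate_race():
    cluster = Cluster()
    r1_created = begin_rollover(cluster, "logs-000002")  # R1 creates the index
    r2_created = begin_rollover(cluster, "logs-000002")  # R2: already exists, moves on
    # R2 reaches the next step before R1 has attached the RolloverInfo.
    r2_result = update_rollover_lifecycle_date(cluster, "logs-000001")
    # R1 now finishes: the alias moves and RolloverInfo is attached.
    cluster.alias_target = "logs-000002"
    cluster.indices["logs-000001"]["rollover_info"] = {"alias": "logs"}
    late_result = update_rollover_lifecycle_date(cluster, "logs-000001")
    return r1_created, r2_created, r2_result, late_result
```

In this model the same check that fails for R2 succeeds once R1 has finished, which is why the failure only shows up as a timing issue.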
Proposed Solution
@colings86, @talevy and I discussed this briefly. The solution we came up with is to add a step (say, `VerifyRolloverStep`), which checks that the alias no longer points to the original index and that the `RolloverInfo` has been attached to the original index before moving on to `UpdateRolloverLifecycleDateStep`.

However, there is no way to detect the difference between "There is a Rollover request in flight that has not yet completed, so we should wait" and "Someone manually created a subsequent index, and we should move to the error step" - therefore, if someone manually creates a subsequent index, the policy would simply wait at `VerifyRolloverStep` forever. To resolve this, `VerifyRolloverStep` should implement a timeout, based on `step_time`, to wait some length of time significantly longer than we believe a rollover should take to complete (10 minutes?) which will allow us to notify the user that something has, in fact, gone wrong.
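
The decision logic of the proposed verification step could be sketched roughly as follows. This is a Python model rather than the actual ILM implementation: `IndexState`, `verify_rollover`, and the `"NEXT"` / `"WAIT"` / `"ERROR"` outcomes are hypothetical names, and the 10 minute value is only the figure floated above.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed timeout: the "10 minutes?" suggested in the proposal.
TIMEOUT_SECONDS = 10 * 60


@dataclass
class IndexState:
    name: str                      # the original index
    alias_target: str              # index the rollover alias currently points to
    rollover_info: Optional[dict]  # None until RolloverInfo is attached
    step_time: float               # when the index entered this step (epoch seconds)


def verify_rollover(state: IndexState, now: float,
                    timeout: float = TIMEOUT_SECONDS) -> str:
    """Return "NEXT" when the rollover is complete, "WAIT" while it may
    still be in flight, and "ERROR" once the timeout has elapsed."""
    alias_moved = state.alias_target != state.name
    info_attached = state.rollover_info is not None
    if alias_moved and info_attached:
        return "NEXT"
    # An in-flight rollover and a manually created index look the same here,
    # so the only way out of waiting is the step_time based timeout.
    if now - state.step_time > timeout:
        return "ERROR"
    return "WAIT"
```

Both conditions are checked together on purpose: the alias moving without the `RolloverInfo` being attached is exactly the intermediate state the step is meant to wait out.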