Race condition in RolloverAction causes failures #35313

gwbrown · 2018-11-06T19:17:24Z

There are still scenarios in which RolloverAction can incorrectly cause the index to be moved to the Error Step, particularly at low polling intervals, such as those we use in our tests. During a rollover, there is a gap between each of these steps:

The creation of the new index.
The alias being swapped to the new index.
The RolloverInfo being added to the original index.

#35168 prevented errors in the case where two RolloverSteps fired simultaneously for the same index, and one completed step 1 above before the other. However, an error can still occur if two RolloverSteps (R1 and R2) fire simultaneously in this case:

R1 completes step 1.
R2 issues a rollover request, fails with ResourceAlreadyExistsException and moves to the next step.
R2 finishes, and moves on to running UpdateRolloverLifecycleDateStep. R2 fails because R1 has not yet completed step 3 (i.e. has not yet attached the RolloverInfo) and the index moves to the error step.
R1 completes steps 2 and 3.

This is the root cause of #35244

Proposed Solution

@colings86, @talevy and I discussed this briefly. The solution we came up with is to add a step (say, VerifyRolloverStep), which checks that the alias no longer points to the original index and that the RolloverInfo has been attached to the original index before moving on to UpdateRolloverLifecycleDateStep.

However, there is no way to detect the difference between "There is a Rollover request in flight that has not yet completed, so we should wait" and "Someone manually created a subsequent index, and we should move to the error step" - therefore, if someone manually creates a subsequent index, the policy would simply wait at VerifyRolloverStep forever. To resolve this, VerifyRolloverStep should implement a timeout, based on step_time, to wait some length of time significantly longer than we believe a rollover should take to complete (10 minutes?) which will allow us to notify the user that something has, in fact, gone wrong.

The text was updated successfully, but these errors were encountered:

elasticmachine · 2018-11-06T19:17:26Z

Pinging @elastic/es-core-infra

gwbrown · 2018-11-06T19:17:45Z

I intend to work on this, I assigned everyone on ILM for visibility.

talevy · 2018-11-06T19:42:39Z

I'm on board with this plan! Personally, I think it would be safe for this timeout to be rather small since we should just be waiting on the time it takes two cluster-state updates to complete. Since there is no reason to risk things, setting the timeout to 10 minutes sounds like a reasonably safe thing to do.

gwbrown · 2018-11-16T17:48:01Z

Fixed by #35524

javanna · 2019-05-15T15:31:56Z

heya @gwbrown I still see similar failures once in a while on my CI, could it be the same problem as before? I can't repro with the same seed unfortunately, it looks like a timing issue.

REPRODUCE WITH: ./gradlew :x-pack:plugin:ilm:qa:multi-node:integTestRunner --tests "org.elasticsearch.xpack.indexlifecycle.ChangePolicyforIndexIT.testChangePolicyForIndex" -Dtests.seed=1B8BDEDB443ABCA2 -Dtests.security.manager=true -Dtests.locale=en-DK -Dtests.timezone=Africa/Dakar -Dcompiler.java=12 -Druntime.java=12
02:06:22   2> org.junit.ComparisonFailure: expected:<[completed]> but was:<[warm]>
02:06:22         at __randomizedtesting.SeedInfo.seed([1B8BDEDB443ABCA2:FA9A8A9A897F6F52]:0)
02:06:22         at org.junit.Assert.assertEquals(Assert.java:115)
02:06:22         at org.junit.Assert.assertEquals(Assert.java:144)
02:06:22         at org.elasticsearch.xpack.indexlifecycle.ChangePolicyforIndexIT.assertStep(ChangePolicyforIndexIT.java:140)
02:06:22         at org.elasticsearch.xpack.indexlifecycle.ChangePolicyforIndexIT.lambda$testChangePolicyForIndex$2(ChangePolicyforIndexIT.java:115)
02:06:22         at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:862)
02:06:22         at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:836)
02:06:22         at org.elasticsearch.xpack.indexlifecycle.ChangePolicyforIndexIT.testChangePolicyForIndex(ChangePolicyforIndexIT.java:115)

gwbrown · 2019-05-15T17:25:59Z

@javanna That wouldn't be this issue, this would cause failures while in the hot phase, while the policy has progressed to warm in your case. I don't see any failures* for this test on build-stats for this, I suspect it's just the assertBusy timing out - that can happen on slower systems for the ILM tests generally. I'm not sure what to recommend for you, unfortunately.

*: There's a couple IOExceptions in late march, but those are errors creating an index so I doubt it's related.

gwbrown added >bug blocker :Data Management/ILM+SLM Index and Snapshot lifecycle management labels Nov 6, 2018

gwbrown assigned dakrone, colings86, talevy and gwbrown Nov 6, 2018

This was referenced Nov 8, 2018

Introduce step to check rollover happened properly #35391

Closed

Split RolloverStep into Wait and Action steps #35524

Merged

gwbrown closed this as completed Nov 16, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Race condition in RolloverAction causes failures #35313

Race condition in RolloverAction causes failures #35313

gwbrown commented Nov 6, 2018

elasticmachine commented Nov 6, 2018

gwbrown commented Nov 6, 2018 •

edited

talevy commented Nov 6, 2018

gwbrown commented Nov 16, 2018

javanna commented May 15, 2019

gwbrown commented May 15, 2019

Race condition in RolloverAction causes failures #35313

Race condition in RolloverAction causes failures #35313

Comments

gwbrown commented Nov 6, 2018

Proposed Solution

elasticmachine commented Nov 6, 2018

gwbrown commented Nov 6, 2018 • edited

talevy commented Nov 6, 2018

gwbrown commented Nov 16, 2018

javanna commented May 15, 2019

gwbrown commented May 15, 2019

gwbrown commented Nov 6, 2018 •

edited