Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] SnapshotLifecycleServiceTests testPolicyCRUD failing #107506

Closed
kkrik-es opened this issue Apr 16, 2024 · 4 comments · Fixed by #107736
Closed

[CI] SnapshotLifecycleServiceTests testPolicyCRUD failing #107506

kkrik-es opened this issue Apr 16, 2024 · 4 comments · Fixed by #107736
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management low-risk An open issue or test failure that is a low risk to future releases Team:Data Management Meta label for data/management team >test-failure Triaged test failures from CI

Comments

@kkrik-es
Copy link
Contributor

Build scan:
https://gradle-enterprise.elastic.co/s/pkx7io37rscbo/tests/:x-pack:plugin:slm:test/org.elasticsearch.xpack.slm.SnapshotLifecycleServiceTests/testPolicyCRUD

Reproduction line:

./gradlew ':x-pack:plugin:slm:test' --tests "org.elasticsearch.xpack.slm.SnapshotLifecycleServiceTests.testPolicyCRUD" -Dtests.seed=1C3639A446730B8B -Dtests.locale=es-PA -Dtests.timezone=America/Lower_Princes -Druntime.java=21

Applicable branches:
main

Reproduces locally?:
Didn't try

Failure history:
Failure dashboard for org.elasticsearch.xpack.slm.SnapshotLifecycleServiceTests#testPolicyCRUD

Failure excerpt:

java.lang.AssertionError: 
Expected: <2>
     but: was <3>

  at __randomizedtesting.SeedInfo.seed([1C3639A446730B8B:902A253AEE0CCCAA]:0)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
  at org.elasticsearch.test.ESTestCase.assertThat(ESTestCase.java:2151)
  at org.elasticsearch.xpack.slm.SnapshotLifecycleServiceTests.lambda$testPolicyCRUD$13(SnapshotLifecycleServiceTests.java:282)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1278)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1251)
  at org.elasticsearch.xpack.slm.SnapshotLifecycleServiceTests.testPolicyCRUD(SnapshotLifecycleServiceTests.java:282)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  at java.lang.reflect.Method.invoke(Method.java:580)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1583)

@kkrik-es kkrik-es added :Data Management/ILM+SLM Index and Snapshot lifecycle management >test-failure Triaged test failures from CI Team:Data Management Meta label for data/management team labels Apr 16, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@masseyke
Copy link
Member

It looks like assertBusy didn't fix #44997.

@masseyke
Copy link
Member

I can reproduce this 100% of the time in the debugger by just putting a breakpoint on the final int currentCount2 = triggerCount.get(); line. Immediately after that line, triggerCount is incremented in this block:

            trigger.set(e -> {
                triggeredJobs.add(e.getJobName());
                triggerCount.incrementAndGet();
            });

@masseyke
Copy link
Member

I think we have a race condition, and I think the failing assertion is just invalid because we don't want to fix this race condition. Here's the relevant code, with comments added:

            final int currentCount2 = triggerCount.get();   // At this point we get the number of jobs. Say it's 2.
            previousState = state;
            // Create a state simulating the policy being deleted
            state = createState(
                new SnapshotLifecycleMetadata(Collections.emptyMap(), OperationMode.RUNNING, new SnapshotLifecycleStats()),
                true
            );
            event = new ClusterChangedEvent("5", state, previousState);
            sls.clusterChanged(event);    // At this point, we cancel any future jobs. But it is entirely possible that a job has run between the first line above and this one. So now triggerCount could be 3 (or even more)
            clock.fastForwardSeconds(2);

            // The existing job should be cancelled and no longer trigger
            assertBusy(() -> assertThat(triggerCount.get(), equalTo(currentCount2))); // If the race condition happened, this will never be true

I don't think we want any additional locking, or to change SchedulerEngine$ActiveSchedule.cancel() to set a flag that listeners need to check, because it's not worth the complexity or risk. I think we just need to accept that sometimes triggerCount might be greater than currentCount2.

@dakrone dakrone added low-risk An open issue or test failure that is a low risk to future releases and removed blocker labels Apr 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management low-risk An open issue or test failure that is a low risk to future releases Team:Data Management Meta label for data/management team >test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants