[CI] SnapshotLifecycleServiceTests testPolicyCRUD failing #107506

kkrik-es · 2024-04-16T06:55:42Z

Build scan:
https://gradle-enterprise.elastic.co/s/pkx7io37rscbo/tests/:x-pack:plugin:slm:test/org.elasticsearch.xpack.slm.SnapshotLifecycleServiceTests/testPolicyCRUD

Reproduction line:

./gradlew ':x-pack:plugin:slm:test' --tests "org.elasticsearch.xpack.slm.SnapshotLifecycleServiceTests.testPolicyCRUD" -Dtests.seed=1C3639A446730B8B -Dtests.locale=es-PA -Dtests.timezone=America/Lower_Princes -Druntime.java=21

Applicable branches:
main

Reproduces locally?:
Didn't try

Failure history:
Failure dashboard for org.elasticsearch.xpack.slm.SnapshotLifecycleServiceTests#testPolicyCRUD

Failure excerpt:

java.lang.AssertionError: 
Expected: <2>
     but: was <3>

  at __randomizedtesting.SeedInfo.seed([1C3639A446730B8B:902A253AEE0CCCAA]:0)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
  at org.elasticsearch.test.ESTestCase.assertThat(ESTestCase.java:2151)
  at org.elasticsearch.xpack.slm.SnapshotLifecycleServiceTests.lambda$testPolicyCRUD$13(SnapshotLifecycleServiceTests.java:282)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1278)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1251)
  at org.elasticsearch.xpack.slm.SnapshotLifecycleServiceTests.testPolicyCRUD(SnapshotLifecycleServiceTests.java:282)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  at java.lang.reflect.Method.invoke(Method.java:580)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1583)

The text was updated successfully, but these errors were encountered:

elasticsearchmachine · 2024-04-16T06:56:06Z

Pinging @elastic/es-data-management (Team:Data Management)

masseyke · 2024-04-22T20:34:41Z

It looks like assertBusy didn't fix #44997.

masseyke · 2024-04-22T20:58:32Z

I can reproduce this 100% of the time in the debugger by just putting a breakpoint on the final int currentCount2 = triggerCount.get(); line. Immediately after that line, triggerCount is incremented in this block:

            trigger.set(e -> {
                triggeredJobs.add(e.getJobName());
                triggerCount.incrementAndGet();
            });

masseyke · 2024-04-22T21:26:19Z

I think we have a race condition, and I think the failing assertion is just invalid because we don't want to fix this race condition. Here's the relevant code, with comments added:

            final int currentCount2 = triggerCount.get();   // At this point we get the number of jobs. Say it's 2.
            previousState = state;
            // Create a state simulating the policy being deleted
            state = createState(
                new SnapshotLifecycleMetadata(Collections.emptyMap(), OperationMode.RUNNING, new SnapshotLifecycleStats()),
                true
            );
            event = new ClusterChangedEvent("5", state, previousState);
            sls.clusterChanged(event);    // At this point, we cancel any future jobs. But it is entirely possible that a job has run between the first line above and this one. So now triggerCount could be 3 (or even more)
            clock.fastForwardSeconds(2);

            // The existing job should be cancelled and no longer trigger
            assertBusy(() -> assertThat(triggerCount.get(), equalTo(currentCount2))); // If the race condition happened, this will never be true

I don't think we want any additional locking, or to change SchedulerEngine$ActiveSchedule.cancel() to set a flag that listeners need to check, because it's not worth the complexity or risk. I think we just need to accept that sometimes triggerCount might be greater than currentCount2.

kkrik-es added :Data Management/ILM+SLM Index and Snapshot lifecycle management >test-failure Triaged test failures from CI Team:Data Management Meta label for data/management team labels Apr 16, 2024

elasticsearchmachine added the blocker label Apr 16, 2024

masseyke mentioned this issue Apr 22, 2024

Removing invalid assertion from SnapshotLifecycleServiceTests #107736

Merged

dakrone added low-risk An open issue or test failure that is a low risk to future releases and removed blocker labels Apr 23, 2024

masseyke closed this as completed in #107736 Apr 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CI] SnapshotLifecycleServiceTests testPolicyCRUD failing #107506

[CI] SnapshotLifecycleServiceTests testPolicyCRUD failing #107506

kkrik-es commented Apr 16, 2024

elasticsearchmachine commented Apr 16, 2024

masseyke commented Apr 22, 2024

masseyke commented Apr 22, 2024

masseyke commented Apr 22, 2024

[CI] SnapshotLifecycleServiceTests testPolicyCRUD failing #107506

[CI] SnapshotLifecycleServiceTests testPolicyCRUD failing #107506

Comments

kkrik-es commented Apr 16, 2024

elasticsearchmachine commented Apr 16, 2024

masseyke commented Apr 22, 2024

masseyke commented Apr 22, 2024

masseyke commented Apr 22, 2024