[CI] DatafeedJobsRestID: Cancelled recovery not cleaning up #100589

Closed
DaveCTurner opened this issue Oct 10, 2023 · 1 comment · Fixed by #100610
Labels
blocker :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. Team:Distributed Meta label for distributed team >test-failure Triaged test failures from CI

Comments

@DaveCTurner

Although it's an ML test suite, the problems all look to be because an index was deleted while a recovery was ongoing, but the recovery task never goes away.

Build scan:
https://gradle-enterprise.elastic.co/s/e3zjigozbae3y/tests/:x-pack:plugin:ml:qa:native-multi-node-tests:javaRestTest/org.elasticsearch.xpack.ml.integration.PyTorchModelIT/testInferWithMultipleDocs
Reproduction line:

gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:javaRestTest' --tests "org.elasticsearch.xpack.ml.integration.PyTorchModelIT.testInferWithMultipleDocs" -Dtests.seed=E0D151F0DED2B412 -Dtests.locale=ca-ES -Dtests.timezone=America/Montreal -Druntime.java=21

Applicable branches:
main

Reproduces locally?:
Didn't try

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.ml.integration.PyTorchModelIT&tests.test=testInferWithMultipleDocs
Failure excerpt:

java.lang.AssertionError: 2 active tasks found:
internal:index/shard/recovery/start_recovery YDcumwItTqWvln7Kvk5LkA:17655 -                            transport  1696930099813 09:28:19 44.7m       127.0.0.1 javaRestTest-0 recovery of [low_memory_analysis_source_index][7] to {javaRestTest-1}{6XF9PkkYR8GpW47PmXaydA}{czoYWR-xROq8xLJ8x2kjHQ}{javaRestTest-1}{127.0.0.1}{127.0.0.1:63790}{dilm}{8.12.0}{7000099-8500003} [recoveryId=86, targetAllocationId=4vEsDrkWS6qFol7KVThGwA, clusterStateVersion=2460, startingSeqNo=-2, primaryRelocation=false, canDownloadSnapshotFiles=true]
internal:index/shard/recovery/start_recovery qCRYNw1uTX-Vd_NKudrooQ:12297 -                            transport  1696930100318 09:28:20 44.7m       127.0.0.1 javaRestTest-2 recovery of [low_memory_analysis_source_index][6] to {javaRestTest-0}{YDcumwItTqWvln7Kvk5LkA}{PvyMHe6XS6uh4Wag1jfuLA}{javaRestTest-0}{127.0.0.1}{127.0.0.1:63717}{dim}{8.12.0}{7000099-8500003} [recoveryId=84, targetAllocationId=0lxnkFTcScutcVJdpRzG0g, clusterStateVersion=2463, startingSeqNo=-2, primaryRelocation=false, canDownloadSnapshotFiles=true]
 expected:<0> but was:<2>

  at __randomizedtesting.SeedInfo.seed([E0D151F0DED2B412:92CC60B79BA0B63F]:0)
  at org.junit.Assert.fail(Assert.java:88)
  at org.junit.Assert.failNotEquals(Assert.java:834)
  at org.junit.Assert.assertEquals(Assert.java:645)
  at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForPendingTasks$2(ESRestTestCase.java:506)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1196)
  at org.elasticsearch.test.rest.ESRestTestCase.waitForPendingTasks(ESRestTestCase.java:477)
  at org.elasticsearch.test.rest.ESRestTestCase.waitForPendingTasks(ESRestTestCase.java:463)
  at org.elasticsearch.xpack.ml.integration.PyTorchModelRestTestCase.cleanup(PyTorchModelRestTestCase.java:86)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  at java.lang.reflect.Method.invoke(Method.java:580)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:1004)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1583)
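
For readers unfamiliar with the check that failed here, the following is a minimal sketch of a retry-until-timeout assertion on the list of active tasks. It is not the actual `ESRestTestCase.waitForPendingTasks` or `ESTestCase.assertBusy` code: the class `PendingTasksSketch`, the `Supplier`-based task source, and the backoff values are all assumptions made for illustration.

```java
import java.util.List;
import java.util.function.Supplier;

public class PendingTasksSketch {

    /** Retries the check with growing sleeps until it passes or the timeout elapses. */
    static void assertBusy(Runnable check, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        long sleepMillis = 1;
        while (true) {
            try {
                check.run();
                return; // the check passed
            } catch (AssertionError e) {
                if (System.currentTimeMillis() >= deadline) {
                    throw e; // give up and report the last failure
                }
            }
            Thread.sleep(sleepMillis);
            sleepMillis = Math.min(sleepMillis * 2, 100);
        }
    }

    /** Fails if the (hypothetical) task source still reports tasks after the timeout. */
    static void waitForNoActiveTasks(Supplier<List<String>> activeTasks, long timeoutMillis)
            throws InterruptedException {
        assertBusy(() -> {
            List<String> tasks = activeTasks.get();
            if (!tasks.isEmpty()) {
                throw new AssertionError(
                    tasks.size() + " active tasks found:\n" + String.join("\n", tasks));
            }
        }, timeoutMillis);
    }

    public static void main(String[] args) throws InterruptedException {
        // A task that never completes, like the stuck start_recovery tasks above.
        Supplier<List<String>> stuck =
            () -> List.of("internal:index/shard/recovery/start_recovery");
        try {
            waitForNoActiveTasks(stuck, 200);
        } catch (AssertionError e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A task that is cancelled but never unregisters keeps a check like this failing until the timeout, which matches the shape of the failure above.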

@DaveCTurner DaveCTurner added :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >test-failure Triaged test failures from CI blocker labels Oct 10, 2023
@elasticsearchmachine

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team label Oct 10, 2023
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Oct 10, 2023
`IndexShard#markAllocationIdAsInSync` is interruptible because it may
block the thread on a monitor waiting for the local checkpoint to
advance, but we lost the ability to interrupt it on a recovery
cancellation in elastic#95270.

Closes elastic#96578
Closes elastic#100589
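
To illustrate the mechanism the commit message describes, here is a minimal, self-contained sketch of a thread blocked on a monitor that can only be released early by an interrupt. It is not the Elasticsearch implementation: `CheckpointTracker`, `awaitCheckpoint`, and `runCancelledWait` are hypothetical names, and the sketch assumes cancellation is delivered as a plain `Thread.interrupt()`.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

public class InterruptibleWaitSketch {

    /** Hypothetical stand-in for a checkpoint tracker; not the Elasticsearch class. */
    static final class CheckpointTracker {
        private long localCheckpoint = -1;

        /** Blocks on this object's monitor until the checkpoint reaches the target. */
        synchronized void awaitCheckpoint(long target) throws InterruptedException {
            while (localCheckpoint < target) {
                wait(); // throws InterruptedException if the thread is interrupted
            }
        }

        /** Advances the checkpoint and wakes any waiting threads. */
        synchronized void advanceTo(long checkpoint) {
            localCheckpoint = Math.max(localCheckpoint, checkpoint);
            notifyAll();
        }
    }

    /** Starts a waiter, cancels it via interrupt, and returns how the waiter ended. */
    static String runCancelledWait() throws InterruptedException {
        CheckpointTracker tracker = new CheckpointTracker();
        AtomicReference<String> outcome = new AtomicReference<>("still blocked");
        CountDownLatch started = new CountDownLatch(1);

        Thread waiter = new Thread(() -> {
            started.countDown();
            try {
                tracker.awaitCheckpoint(10); // the checkpoint never advances in this sketch
                outcome.set("completed");
            } catch (InterruptedException e) {
                outcome.set("cancelled"); // cleanup would happen here
            }
        });
        waiter.start();
        started.await();

        // Cancellation: without this interrupt the waiter would block indefinitely.
        waiter.interrupt();
        waiter.join(5000);
        return outcome.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runCancelledWait());
    }
}
```

If the cancellation path stops interrupting the waiting thread, the waiter stays parked in `wait()` and whatever task owns it never completes, which is consistent with the leaked recovery tasks reported in this issue.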
DaveCTurner added a commit that referenced this issue Oct 11, 2023
`IndexShard#markAllocationIdAsInSync` is interruptible because it may
block the thread on a monitor waiting for the local checkpoint to
advance, but we lost the ability to interrupt it on a recovery
cancellation in #95270.

Closes #96578
Closes #100589
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Oct 11, 2023
`IndexShard#markAllocationIdAsInSync` is interruptible because it may
block the thread on a monitor waiting for the local checkpoint to
advance, but we lost the ability to interrupt it on a recovery
cancellation in elastic#95270.

Closes elastic#96578
Closes elastic#100589
elasticsearchmachine pushed a commit that referenced this issue Oct 11, 2023
`IndexShard#markAllocationIdAsInSync` is interruptible because it may
block the thread on a monitor waiting for the local checkpoint to
advance, but we lost the ability to interrupt it on a recovery
cancellation in #95270.

Closes #96578
Closes #100589