[CI] AzureRepositoriesMeteringIT testStatsAreStoredIntoANewCounterInstanceAfterRepoConfigUpdate failing #108939

Closed
pxsalehi opened this issue May 23, 2024 · 3 comments
Assignees
Labels
:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs medium-risk An open issue or test failure that is a medium risk to future releases Team:Distributed Meta label for distributed team >test-failure Triaged test failures from CI

Comments

@pxsalehi
Member

Build scan:
https://gradle-enterprise.elastic.co/s/ltoyff5l4wmos/tests/:x-pack:plugin:repositories-metering-api:qa:azure:javaRestTest/org.elasticsearch.xpack.repositories.metering.azure.AzureRepositoriesMeteringIT/testStatsAreStoredIntoANewCounterInstanceAfterRepoConfigUpdate

Reproduction line:

./gradlew ':x-pack:plugin:repositories-metering-api:qa:azure:javaRestTest' --tests "org.elasticsearch.xpack.repositories.metering.azure.AzureRepositoriesMeteringIT.testStatsAreStoredIntoANewCounterInstanceAfterRepoConfigUpdate" -Dtests.seed=11EC2144999959D6 -Dtests.locale=ar-LB -Dtests.timezone=Etc/GMT+4 -Druntime.java=21

Applicable branches:
main

Reproduces locally?:
No

Failure history:
Failure dashboard for org.elasticsearch.xpack.repositories.metering.azure.AzureRepositoriesMeteringIT#testStatsAreStoredIntoANewCounterInstanceAfterRepoConfigUpdate

Failure excerpt:

org.elasticsearch.client.ResponseException: method [POST], host [http://[::1]:46351], URI [_snapshot/zrtgekrruw/oqgimbogwx/_restore?wait_for_completion=true], status line [HTTP/1.1 500 Internal Server Error]
{"error":{"root_cause":[{"type":"snapshot_restore_exception","reason":"[zrtgekrruw:oqgimbogwx/iTR0pYKfReefFpjYSDHHGw] cannot restore index [ygfmurexlq] because an open index with same name already exists in the cluster. Either close or delete the existing index or restore the index under a different name by providing a rename pattern and replacement name"}],"type":"snapshot_restore_exception","reason":"[zrtgekrruw:oqgimbogwx/iTR0pYKfReefFpjYSDHHGw] cannot restore index [ygfmurexlq] because an open index with same name already exists in the cluster. Either close or delete the existing index or restore the index under a different name by providing a rename pattern and replacement name"},"status":500}

  at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:351)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:317)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:292)
  at org.elasticsearch.test.rest.ESRestTestCase.restoreSnapshot(ESRestTestCase.java:2049)
  at org.elasticsearch.xpack.repositories.metering.AbstractRepositoriesMeteringAPIRestTestCase.snapshotAndRestoreIndex(AbstractRepositoriesMeteringAPIRestTestCase.java:252)
  at org.elasticsearch.xpack.repositories.metering.AbstractRepositoriesMeteringAPIRestTestCase.testStatsAreStoredIntoANewCounterInstanceAfterRepoConfigUpdate(AbstractRepositoriesMeteringAPIRestTestCase.java:161)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  at java.lang.reflect.Method.invoke(Method.java:580)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.elasticsearch.test.cluster.local.DefaultLocalElasticsearchCluster$1.evaluate(DefaultLocalElasticsearchCluster.java:47)
  at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1583)

@pxsalehi pxsalehi added :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI labels May 23, 2024
@pxsalehi pxsalehi self-assigned this May 23, 2024
@pxsalehi pxsalehi added the medium-risk An open issue or test failure that is a medium risk to future releases label May 23, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team label May 23, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@pxsalehi
Member Author

This is the only true failure from #105864.

@pxsalehi
Member Author

These are REST tests that run against a cluster set up by Gradle, and that cluster is shared by all tests in this suite. I think the other test in the same run, which timed out with java.net.SocketTimeoutException: 60,000 milliseconds timeout on connection http-outgoing-1 [ACTIVE], has impacted this one. The timeout happened during snapshotting, which means that test never got to removing its index. So when this test ran, it snapshotted the old index too and could not restore it because the index had not been removed. There is an @After in the base class that should clean up the cluster, but like the createSnapshot step it relies on a REST call to wipe all indices, so it could also have failed, leaving the index in the cluster for the next test to fail on. (cluster logs)

This seems to be a side effect of the rare infra issue. We could snapshot and restore only the specific index in each test to avoid such collisions. I am not sure that is necessary for now. If there are no other ideas, I'll close this.
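
For reference, a minimal sketch of what scoping the snapshot and restore to a single index could look like. Both the create-snapshot and restore APIs accept an `indices` field in the request body. The class and method names below are hypothetical and are not part of the existing test base class; the sketch only builds the request paths and bodies and does not show how the test currently issues them.

```java
// Hypothetical helper, not part of the Elasticsearch test framework.
// It only builds request paths and JSON bodies as plain strings.
final class ScopedSnapshotRequests {

    private ScopedSnapshotRequests() {}

    /** Path for the create-snapshot request (HTTP PUT), waiting for completion. */
    static String createSnapshotPath(String repository, String snapshot) {
        return "/_snapshot/" + repository + "/" + snapshot + "?wait_for_completion=true";
    }

    /** Path for the restore request (HTTP POST), same shape as the URI in the failure excerpt. */
    static String restoreSnapshotPath(String repository, String snapshot) {
        return "/_snapshot/" + repository + "/" + snapshot + "/_restore?wait_for_completion=true";
    }

    /**
     * JSON body that limits the operation to one index.
     * Assumes the index name needs no JSON escaping, which holds for valid
     * index names because they cannot contain quotes or backslashes.
     */
    static String singleIndexBody(String indexName) {
        return "{\"indices\":\"" + indexName + "\"}";
    }
}
```

With the low-level REST client, the strings would be passed to a `Request` via `new Request("PUT", path)` and `setJsonEntity(body)`. An index left behind by an earlier test would then be excluded from both the snapshot and the restore, so it could not trigger the "open index with same name already exists" error.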
