Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] MultiNodesStatsTests testMultipleNodes failing #96374

Closed
pxsalehi opened this issue May 26, 2023 · 4 comments
Closed

[CI] MultiNodesStatsTests testMultipleNodes failing #96374

pxsalehi opened this issue May 26, 2023 · 4 comments
Assignees
Labels
:Data Management/Monitoring medium-risk An open issue or test failure that is a medium risk to future releases Team:Data Management Meta label for data/management team >test-failure Triaged test failures from CI

Comments

@pxsalehi
Copy link
Member

Build scan:
https://gradle-enterprise.elastic.co/s/yuj7ojmrescem/tests/:x-pack:plugin:monitoring:test/org.elasticsearch.xpack.monitoring.MultiNodesStatsTests/testMultipleNodes

Reproduction line:

./gradlew ':x-pack:plugin:monitoring:test' --tests "org.elasticsearch.xpack.monitoring.MultiNodesStatsTests.testMultipleNodes" -Dtests.seed=46451D8F4246CA13 -Dtests.locale=es-EC -Dtests.timezone=SystemV/CST6CDT -Druntime.java=20

Applicable branches:
main

Reproduces locally?:
No

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.monitoring.MultiNodesStatsTests&tests.test=testMultipleNodes

Failure excerpt:

java.lang.Exception: Test abandoned because suite timeout was reached.

  at __randomizedtesting.SeedInfo.seed([46451D8F4246CA13]:0)

@pxsalehi pxsalehi added :Data Management/Monitoring >test-failure Triaged test failures from CI labels May 26, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label May 26, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@dakrone dakrone added the medium-risk An open issue or test failure that is a medium risk to future releases label Oct 12, 2023
@masseyke
Copy link
Member

There are so many weird things in the log for this one. Here are some notes so far:
(1) The .monitoring-es-mb template fails to load because Failed to parse mapping: No handler for type [wildcard] declared on field [name]. I have no idea why wildcard would be an unknown type. If I artificially force this though, it doesn't have any impact on the test.
(2) We see org.elasticsearch.index.IndexNotFoundException: no such index [.monitoring-es-7-2023.05.25]. I'm guessing that this happens because indexing runs asynchronously, and the index has been deleting when these happen. If I artificially force this for all monitoring docs, the test fails but quickly and in a different way.
(3) For some reason node_t2 just doesn't shut down. All 5 other nodes shut down, leaving node_t2 looking for its peers for the next 20 minutes. This seems to be the thing that fails the test. I don't know why node_t2 stays up, and I don't know why it is waiting for a replication that will never happen.

@masseyke
Copy link
Member

I'm thinking that this is a race condition for @elastic/es-perf. But this failure was a year ago and the line numbers are all fairly different now. And for some reason I cant' find anything in failure history to see if it always looks like this. So I'm going to re-enable this test. If it comes back we might need to consult with es-perf to figure it out.

Here's the attempt to stop node_t2, waiting on node_t2 to finish up ongoing recoveries:

"TEST-MultiNodesStatsTests.testMultipleNodes-seed#[46451D8F4246CA13]" ID=55 WAITING on org.elasticsearch.action.support.PlainActionFuture$Sync@21084823
	at java.base@20.0.1/jdk.internal.misc.Unsafe.park(Native Method)
	- waiting on org.elasticsearch.action.support.PlainActionFuture$Sync@21084823
	at java.base@20.0.1/java.util.concurrent.locks.LockSupport.park(LockSupport.java:221)
	at java.base@20.0.1/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
	at java.base@20.0.1/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1099)
	at app//org.elasticsearch.action.support.PlainActionFuture$Sync.get(PlainActionFuture.java:326)
	at app//org.elasticsearch.action.support.PlainActionFuture.get(PlainActionFuture.java:95)
	at app//org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:45)
	at app//org.elasticsearch.indices.recovery.PeerRecoverySourceService$OngoingRecoveries.awaitEmpty(PeerRecoverySourceService.java:283)
	at app//org.elasticsearch.indices.recovery.PeerRecoverySourceService.doStop(PeerRecoverySourceService.java:110)
	at app//org.elasticsearch.common.component.AbstractLifecycleComponent.stop(AbstractLifecycleComponent.java:63)
	- locked org.elasticsearch.common.component.Lifecycle@384ff200
	at app//org.elasticsearch.common.component.AbstractLifecycleComponent.close(AbstractLifecycleComponent.java:76)
	- locked org.elasticsearch.common.component.Lifecycle@384ff200
	at app//org.elasticsearch.core.IOUtils.close(IOUtils.java:71)
	at app//org.elasticsearch.core.IOUtils.close(IOUtils.java:119)
	at app//org.elasticsearch.node.Node.close(Node.java:1689)
	- locked org.elasticsearch.node.MockNode@44a3728c
	at app//org.elasticsearch.test.InternalTestCluster$NodeAndClient.close(InternalTestCluster.java:1057)
	at app//org.elasticsearch.core.IOUtils.close(IOUtils.java:71)
	at app//org.elasticsearch.core.IOUtils.close(IOUtils.java:119)
	at app//org.elasticsearch.test.InternalTestCluster.close(InternalTestCluster.java:907)
	- locked org.elasticsearch.test.InternalTestCluster@7a962c2
	at app//org.elasticsearch.test.ESIntegTestCase.afterInternal(ESIntegTestCase.java:584)
	at app//org.elasticsearch.test.ESIntegTestCase.cleanUpCluster(ESIntegTestCase.java:2293)
	at java.base@20.0.1/java.lang.invoke.DirectMethodHandle$Holder.invokeSpecial(DirectMethodHandle$Holder)
	at java.base@20.0.1/java.lang.invoke.LambdaForm$MH/0x000000080109c800.invoke(LambdaForm$MH)
	at java.base@20.0.1/java.lang.invoke.Invokers$Holder.invokeExact_MT(Invokers$Holder)
	at java.base@20.0.1/jdk.internal.reflect.DirectMethodHandleAccessor.invokeImpl(DirectMethodHandleAccessor.java:154)
	at java.base@20.0.1/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base@20.0.1/java.lang.reflect.Method.invoke(Method.java:578)
	at app//com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
	at app//com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:1004)
	at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at app//org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
	at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at app//org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
	at app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
	at app//com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
	at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
	at app//com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
	at app//com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
	at app//com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
	at app//com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
	at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at app//org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
	at app//com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at app//com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at app//org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at app//org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
	at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
	at app//com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
	at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$$Lambda$514/0x0000000801308cd0.run(Unknown Source)
	at java.base@20.0.1/java.lang.Thread.runWith(Thread.java:1636)
	at java.base@20.0.1/java.lang.Thread.run(Thread.java:1623)

And here's where node_t2 is blocked, and there's nothing running that will every call updateFromMaster() or updateLocalCheckpoint() to wake it up:


"elasticsearch[node_t2][generic][T#4]" ID=157 WAITING on org.elasticsearch.index.seqno.ReplicationTracker@66ee47a9
	at java.base@20.0.1/java.lang.Object.wait0(Native Method)
	- waiting on org.elasticsearch.index.seqno.ReplicationTracker@66ee47a9
	at java.base@20.0.1/java.lang.Object.wait(Object.java:366)
	at java.base@20.0.1/java.lang.Object.wait(Object.java:339)
	at app//org.elasticsearch.index.seqno.ReplicationTracker.waitForLocalCheckpointToAdvance(ReplicationTracker.java:1567)
	at app//org.elasticsearch.index.seqno.ReplicationTracker.markAllocationIdAsInSync(ReplicationTracker.java:1274)
	at app//org.elasticsearch.index.shard.IndexShard.markAllocationIdAsInSync(IndexShard.java:2687)
	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$finalizeRecovery$48(RecoverySourceHandler.java:1253)
	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler$$Lambda$5844/0x0000000801cfecb0.run(Unknown Source)
	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$runUnderPrimaryPermit$21(RecoverySourceHandler.java:430)
	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler$$Lambda$5709/0x0000000801cc3b40.get(Unknown Source)
	at app//org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:302)
	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$runUnderPrimaryPermit$22(RecoverySourceHandler.java:429)
	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler$$Lambda$5708/0x0000000801cc3918.accept(Unknown Source)
	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$runUnderPrimaryPermit$19(RecoverySourceHandler.java:419)
	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler$$Lambda$5399/0x0000000801c599b0.accept(Unknown Source)
	at app//org.elasticsearch.action.ActionListener.run(ActionListener.java:357)
	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$runUnderPrimaryPermit$20(RecoverySourceHandler.java:397)
	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler$$Lambda$5398/0x0000000801c59528.accept(Unknown Source)
	at app//org.elasticsearch.action.ActionListenerImplementations$DelegatingFailureActionListener.onResponse(ActionListenerImplementations.java:151)
	at app//org.elasticsearch.index.shard.IndexShard.lambda$wrapPrimaryOperationPermitListener$26(IndexShard.java:3335)
	at app//org.elasticsearch.index.shard.IndexShard$$Lambda$5256/0x0000000801c1f378.accept(Unknown Source)
	at app//org.elasticsearch.action.ActionListenerImplementations$DelegatingFailureActionListener.onResponse(ActionListenerImplementations.java:151)
	at app//org.elasticsearch.action.ActionListener$5.onResponse(ActionListener.java:333)
	at app//org.elasticsearch.index.shard.IndexShardOperationPermits.innerAcquire(IndexShardOperationPermits.java:239)
	at app//org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:191)
	at app//org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:3311)
	at app//org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:3305)
	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler.runUnderPrimaryPermit(RecoverySourceHandler.java:397)
	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler.runUnderPrimaryPermit(RecoverySourceHandler.java:429)
	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler.finalizeRecovery(RecoverySourceHandler.java:1252)
	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$recoverToTarget$17(RecoverySourceHandler.java:350)
	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler$$Lambda$5568/0x0000000801c8b768.accept(Unknown Source)
	at app//org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:158)
	at app//org.elasticsearch.action.support.SubscribableListener$SuccessResult.complete(SubscribableListener.java:270)
	at app//org.elasticsearch.action.support.SubscribableListener.tryComplete(SubscribableListener.java:190)
	at app//org.elasticsearch.action.support.SubscribableListener.setResult(SubscribableListener.java:219)
	at app//org.elasticsearch.action.support.SubscribableListener.onResponse(SubscribableListener.java:133)
	at app//org.elasticsearch.action.StepListener.onResponse(StepListener.java:53)
	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$phase2$47(RecoverySourceHandler.java:1137)
	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler$$Lambda$5771/0x0000000801ce1da8.accept(Unknown Source)
	at app//org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:158)
	at app//org.elasticsearch.action.support.SubscribableListener$SuccessResult.complete(SubscribableListener.java:270)
	at app//org.elasticsearch.action.support.SubscribableListener.tryComplete(SubscribableListener.java:190)
	at app//org.elasticsearch.action.support.SubscribableListener.setResult(SubscribableListener.java:219)
	at app//org.elasticsearch.action.support.SubscribableListener.onResponse(SubscribableListener.java:133)
	at app//org.elasticsearch.action.StepListener.onResponse(StepListener.java:53)
	at app//org.elasticsearch.indices.recovery.MultiChunkTransfer.onCompleted(MultiChunkTransfer.java:149)
	at app//org.elasticsearch.indices.recovery.MultiChunkTransfer.handleItems(MultiChunkTransfer.java:113)
	at app//org.elasticsearch.indices.recovery.MultiChunkTransfer$1.write(MultiChunkTransfer.java:72)
	at app//org.elasticsearch.common.util.concurrent.AsyncIOProcessor.processList(AsyncIOProcessor.java:97)
	at app//org.elasticsearch.common.util.concurrent.AsyncIOProcessor.drainAndProcessAndRelease(AsyncIOProcessor.java:85)
	at app//org.elasticsearch.common.util.concurrent.AsyncIOProcessor.put(AsyncIOProcessor.java:73)
	at app//org.elasticsearch.indices.recovery.MultiChunkTransfer.addItem(MultiChunkTransfer.java:83)
	at app//org.elasticsearch.indices.recovery.MultiChunkTransfer.lambda$handleItems$4(MultiChunkTransfer.java:120)
	at app//org.elasticsearch.indices.recovery.MultiChunkTransfer$$Lambda$5546/0x0000000801c8ddc8.accept(Unknown Source)
	at app//org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:158)
	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler$OperationBatchSender.lambda$executeChunkRequest$0(RecoverySourceHandler.java:1222)
	at app//org.elasticsearch.indices.recovery.RecoverySourceHandler$OperationBatchSender$$Lambda$5792/0x0000000801ceddf0.accept(Unknown Source)
	at app//org.elasticsearch.action.ActionListenerImplementations$DelegatingFailureActionListener.onResponse(ActionListenerImplementations.java:151)
	at app//org.elasticsearch.action.ActionListenerImplementations$MappedActionListener.onResponse(ActionListenerImplementations.java:93)
	at app//org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onResponse(ActionListenerImplementations.java:210)
	at app//org.elasticsearch.action.ActionListener$5.onResponse(ActionListener.java:333)
	at app//org.elasticsearch.action.ActionListener$5.onResponse(ActionListener.java:333)
	at app//org.elasticsearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:169)
	at app//org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onResponse(ActionListenerImplementations.java:210)
	at app//org.elasticsearch.action.ActionListener$5.onResponse(ActionListener.java:333)
	at app//org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:43)
	at app//org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1409)
	at app//org.elasticsearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:419)
	at app//org.elasticsearch.transport.InboundHandler$2.doRun(InboundHandler.java:376)
	at app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983)
	at app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.base@20.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base@20.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base@20.0.1/java.lang.Thread.runWith(Thread.java:1636)
	at java.base@20.0.1/java.lang.Thread.run(Thread.java:1623)
	Locked synchronizers:
	- java.util.concurrent.ThreadPoolExecutor$Worker@67be7694

@nielsbauman
Copy link
Contributor

Closing this as we haven't seen any occurrences after Keith's unmute PR. Will re-open if we see this coming back.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Monitoring medium-risk An open issue or test failure that is a medium risk to future releases Team:Data Management Meta label for data/management team >test-failure Triaged test failures from CI
Projects
None yet
Development

No branches or pull requests

5 participants