Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] SuggestStatsIT testSimpleStats failing #104651

Closed
DaveCTurner opened this issue Jan 23, 2024 · 8 comments · Fixed by #105293
Closed

[CI] SuggestStatsIT testSimpleStats failing #104651

DaveCTurner opened this issue Jan 23, 2024 · 8 comments · Fixed by #105293
Assignees
Labels
:Distributed/Network Http and internode communication implementations low-risk An open issue or test failure that is a low risk to future releases Team:Distributed Meta label for distributed team >test-failure Triaged test failures from CI

Comments

@DaveCTurner
Copy link
Contributor

Looks like a genuine leak to me, perhaps we only just started to see this because of #104588?

Build scan:
https://gradle-enterprise.elastic.co/s/xszrggp3wtbyk/tests/:server:internalClusterTest/org.elasticsearch.index.suggest.stats.SuggestStatsIT/testSimpleStats

Reproduction line:

./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.index.suggest.stats.SuggestStatsIT.testSimpleStats" -Dtests.seed=A612D72765A35276 -Dtests.locale=hr-HR -Dtests.timezone=Asia/Irkutsk -Druntime.java=21

Applicable branches:
main

Reproduces locally?:
Didn't try

Failure history:
Failure dashboard for org.elasticsearch.index.suggest.stats.SuggestStatsIT#testSimpleStats

Failure excerpt:

java.lang.AssertionError: 
Expected: an empty collection
     but: <[LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 
Created at:
	io.netty.buffer.PooledByteBufAllocator.newHeapBuffer(PooledByteBufAllocator.java:390)
	io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:169)
	org.elasticsearch.transport.netty4.NettyAllocator$NoDirectBuffers.heapBuffer(NettyAllocator.java:287)
	org.elasticsearch.transport.netty4.NettyAllocator$1.obtain(NettyAllocator.java:135)
	org.elasticsearch.common.io.stream.RecyclerBytesStreamOutput.ensureCapacityFromPosition(RecyclerBytesStreamOutput.java:235)
	org.elasticsearch.common.io.stream.RecyclerBytesStreamOutput.ensureCapacity(RecyclerBytesStreamOutput.java:224)
	org.elasticsearch.common.io.stream.RecyclerBytesStreamOutput.writeBytes(RecyclerBytesStreamOutput.java:75)
	org.elasticsearch.common.io.Streams$FlushOnCloseOutputStream.writeBytes(Streams.java:220)
	org.elasticsearch.common.io.stream.StreamOutput.write(StreamOutput.java:491)
	com.fasterxml.jackson.core.json.UTF8JsonGenerator._flushBuffer(UTF8JsonGenerator.java:2203)
	com.fasterxml.jackson.core.json.UTF8JsonGenerator.close(UTF8JsonGenerator.java:1227)
	org.elasticsearch.xcontent.provider.json.JsonXContentGenerator.close(JsonXContentGenerator.java:592)
	org.elasticsearch.xcontent.XContentBuilder.close(XContentBuilder.java:1259)
	org.elasticsearch.common.bytes.BytesReference.bytes(BytesReference.java:36)
	org.elasticsearch.rest.RestResponse.<init>(RestResponse.java:138)
	org.elasticsearch.rest.RestResponse.<init>(RestResponse.java:108)
	org.elasticsearch.rest.action.RestActionListener.onFailure(RestActionListener.java:55)
	org.elasticsearch.rest.action.RestActionListener.onResponse(RestActionListener.java:40)
	org.elasticsearch.rest.action.RestCancellableNodeClient$1.onResponse(RestCancellableNodeClient.java:87)
	org.elasticsearch.rest.action.RestCancellableNodeClient$1.onResponse(RestCancellableNodeClient.java:81)
	org.elasticsearch.action.ActionListener$3.onResponse(ActionListener.java:314)
	org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:202)
	org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:196)
	org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onResponse(ActionListenerImplementations.java:307)
	org.elasticsearch.action.ActionListener$3.onResponse(ActionListener.java:314)
	org.elasticsearch.action.ActionListenerImplementations$MappedActionListener.onResponse(ActionListenerImplementations.java:95)
	org.elasticsearch.health.HealthService.validateResultsAndNotifyListener(HealthService.java:196)
	org.elasticsearch.health.HealthService$1.lambda$calculateFilteredIndicatorsRunnable$1(HealthService.java:156)
	org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:95)
	org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917)
	java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	java.base/java.lang.Thread.run(Thread.java:1583)]>

  at __randomizedtesting.SeedInfo.seed([A612D72765A35276:E3D6AE0D9C1FC786]:0)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
  at org.elasticsearch.test.ESTestCase.assertThat(ESTestCase.java:2119)
  at org.elasticsearch.test.ESTestCase.checkStaticState(ESTestCase.java:729)
  at org.elasticsearch.test.ESTestCase.after(ESTestCase.java:520)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  at java.lang.reflect.Method.invoke(Method.java:580)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:1004)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1583)

@DaveCTurner DaveCTurner added :Distributed/Network Http and internode communication implementations >test-failure Triaged test failures from CI labels Jan 23, 2024
@elasticsearchmachine elasticsearchmachine added blocker Team:Distributed Meta label for distributed team labels Jan 23, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@arteam arteam self-assigned this Feb 1, 2024
@arteam arteam added medium-risk An open issue or test failure that is a medium risk to future releases low-risk An open issue or test failure that is a low risk to future releases and removed blocker medium-risk An open issue or test failure that is a medium risk to future releases labels Feb 2, 2024
@arteam
Copy link
Contributor

arteam commented Feb 2, 2024

Setting it to low-risk since the failure happened only once on CI and is not reproducible locally.

@DaveCTurner
Copy link
Contributor Author

Hmm I disagree, I think this is a genuine memory leak somewhere in the REST layer so it seems like it could affect many (any) other REST APIs. The stack trace suggests to me that the leak is actually somewhere in org.elasticsearch.health.GetHealthCancellationIT, it's just that we don't run the leak detector until the next GC.

@DaveCTurner DaveCTurner added blocker and removed low-risk An open issue or test failure that is a low risk to future releases labels Feb 5, 2024
@arteam
Copy link
Contributor

arteam commented Feb 6, 2024

@DaveCTurner How did you figure out that the leak is in GetHealthCancellationIT? Is it due to the call from org.elasticsearch.health.HealthService in the stack trace?

@DaveCTurner
Copy link
Contributor Author

DaveCTurner commented Feb 6, 2024

Yeah, I looked at the other tests that ran earlier in the same JVM and GetHealthCancellationIT seems like the most likely culprit. There wasn't another one that would obviously result in these stack frames, which indicate that the HTTP channel was closed by the time the response arrived:

	org.elasticsearch.rest.action.RestActionListener.onFailure(RestActionListener.java:55)
	org.elasticsearch.rest.action.RestActionListener.onResponse(RestActionListener.java:40)
...
	org.elasticsearch.health.HealthService.validateResultsAndNotifyListener(HealthService.java:196)

@DaveCTurner
Copy link
Contributor Author

DaveCTurner commented Feb 8, 2024

It looks like this has been around a while, I see the same issue in an 8.10.4 cluster in production:

LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 
Created at:
    io.netty.buffer@4.1.94.Final/io.netty.buffer.PooledByteBufAllocator.newHeapBuffer(PooledByteBufAllocator.java:390)
    io.netty.buffer@4.1.94.Final/io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:169)
    org.elasticsearch.transport.netty4@8.10.4/org.elasticsearch.transport.netty4.NettyAllocator$NoDirectBuffers.heapBuffer(NettyAllocator.java:287)
    org.elasticsearch.transport.netty4@8.10.4/org.elasticsearch.transport.netty4.NettyAllocator$1.obtain(NettyAllocator.java:135)
    org.elasticsearch.server@8.10.4/org.elasticsearch.common.io.stream.RecyclerBytesStreamOutput.ensureCapacityFromPosition(RecyclerBytesStreamOutput.java:235)
    org.elasticsearch.server@8.10.4/org.elasticsearch.common.io.stream.RecyclerBytesStreamOutput.ensureCapacity(RecyclerBytesStreamOutput.java:224)
    org.elasticsearch.server@8.10.4/org.elasticsearch.common.io.stream.RecyclerBytesStreamOutput.writeBytes(RecyclerBytesStreamOutput.java:75)
    org.elasticsearch.server@8.10.4/org.elasticsearch.common.io.Streams$FlushOnCloseOutputStream.writeBytes(Streams.java:220)
    org.elasticsearch.server@8.10.4/org.elasticsearch.common.io.stream.StreamOutput.write(StreamOutput.java:488)
    com.fasterxml.jackson.core@2.15.0/com.fasterxml.jackson.core.json.UTF8JsonGenerator._flushBuffer(UTF8JsonGenerator.java:2203)
    com.fasterxml.jackson.core@2.15.0/com.fasterxml.jackson.core.json.UTF8JsonGenerator.close(UTF8JsonGenerator.java:1227)
    org.elasticsearch.xcontent.impl@8.10.4/org.elasticsearch.xcontent.provider.json.JsonXContentGenerator.close(JsonXContentGenerator.java:567)
    org.elasticsearch.xcontent@8.10.4/org.elasticsearch.xcontent.XContentBuilder.close(XContentBuilder.java:1252)
    org.elasticsearch.server@8.10.4/org.elasticsearch.common.bytes.BytesReference.bytes(BytesReference.java:36)
    org.elasticsearch.server@8.10.4/org.elasticsearch.rest.RestResponse.<init>(RestResponse.java:125)
    org.elasticsearch.server@8.10.4/org.elasticsearch.rest.RestResponse.<init>(RestResponse.java:103)
    org.elasticsearch.server@8.10.4/org.elasticsearch.rest.action.RestActionListener.onFailure(RestActionListener.java:55)
    org.elasticsearch.server@8.10.4/org.elasticsearch.rest.action.RestCancellableNodeClient$1.onFailure(RestCancellableNodeClient.java:96)
    org.elasticsearch.server@8.10.4/org.elasticsearch.client.internal.node.NodeClient$SafelyWrappedActionListener.onFailure(NodeClient.java:171)
    org.elasticsearch.server@8.10.4/org.elasticsearch.tasks.TaskManager$1.onFailure(TaskManager.java:217)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:27)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.support.ContextPreservingActionListener.onFailure(ContextPreservingActionListener.java:39)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:27)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:27)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onFailure(ActionListenerImplementations.java:308)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.lambda$doStart$2(TransportMasterNodeAction.java:232)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener.acceptException(ActionListenerImplementations.java:186)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener.onFailure(ActionListenerImplementations.java:191)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:27)
    org.elasticsearch.server@8.10.4/org.elasticsearch.client.internal.node.NodeClient$SafelyWrappedActionListener.onFailure(NodeClient.java:171)
    org.elasticsearch.server@8.10.4/org.elasticsearch.tasks.TaskManager$1.onFailure(TaskManager.java:217)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:27)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.support.ContextPreservingActionListener.onFailure(ContextPreservingActionListener.java:39)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:27)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:27)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onFailure(ActionListenerImplementations.java:308)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.lambda$doStart$2(TransportMasterNodeAction.java:232)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener.acceptException(ActionListenerImplementations.java:186)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener.onFailure(ActionListenerImplementations.java:191)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:185)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:185)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:185)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:27)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.support.ContextPreservingActionListener.onFailure(ContextPreservingActionListener.java:39)
    org.elasticsearch.xcore@8.10.4/org.elasticsearch.xpack.core.security.ScrollHelper$1.onFailure(ScrollHelper.java:123)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:27)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.support.ContextPreservingActionListener.onFailure(ContextPreservingActionListener.java:39)
    org.elasticsearch.server@8.10.4/org.elasticsearch.client.internal.node.NodeClient$SafelyWrappedActionListener.onFailure(NodeClient.java:171)
    org.elasticsearch.server@8.10.4/org.elasticsearch.tasks.TaskManager$1.onFailure(TaskManager.java:217)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:27)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.support.ContextPreservingActionListener.onFailure(ContextPreservingActionListener.java:39)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:27)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:27)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations$RunAfterActionListener.onFailure(ActionListenerImplementations.java:269)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.search.AbstractSearchAsyncAction.raisePhaseFailure(AbstractSearchAsyncAction.java:752)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:729)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:419)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:761)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:513)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:350)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:27)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:54)
    org.elasticsearch.server@8.10.4/org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:630)
    org.elasticsearch.server@8.10.4/org.elasticsearch.transport.TransportService$UnregisterChildTransportResponseHandler.handleException(TransportService.java:1707)
    org.elasticsearch.server@8.10.4/org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1424)
    org.elasticsearch.server@8.10.4/org.elasticsearch.transport.InboundHandler.doHandleException(InboundHandler.java:440)
    org.elasticsearch.server@8.10.4/org.elasticsearch.transport.InboundHandler.handleException(InboundHandler.java:427)
    org.elasticsearch.server@8.10.4/org.elasticsearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:418)
    org.elasticsearch.server@8.10.4/org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:141)
    org.elasticsearch.server@8.10.4/org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:96)
    org.elasticsearch.server@8.10.4/org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:831)
    org.elasticsearch.server@8.10.4/org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:124)
    org.elasticsearch.server@8.10.4/org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:96)
    org.elasticsearch.server@8.10.4/org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:61)
    org.elasticsearch.transport.netty4@8.10.4/org.elasticsearch.transport.netty4.Netty4MessageInboundHandler.channelRead(Netty4MessageInboundHandler.java:48)
    io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
    io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
    io.netty.codec@4.1.94.Final/io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
    io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
    io.netty.handler@4.1.94.Final/io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1383)
    io.netty.handler@4.1.94.Final/io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1246)
    io.netty.handler@4.1.94.Final/io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1295)
    io.netty.codec@4.1.94.Final/io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529)
    io.netty.codec@4.1.94.Final/io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468)
    io.netty.codec@4.1.94.Final/io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290)
    io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
    io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
    io.netty.transport@4.1.94.Final/io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
    io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
    io.netty.transport@4.1.94.Final/io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    io.netty.transport@4.1.94.Final/io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
    io.netty.transport@4.1.94.Final/io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
    io.netty.transport@4.1.94.Final/io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
    io.netty.transport@4.1.94.Final/io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689)
    io.netty.transport@4.1.94.Final/io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652)
    io.netty.transport@4.1.94.Final/io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
    io.netty.common@4.1.94.Final/io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    io.netty.common@4.1.94.Final/io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    java.base/java.lang.Thread.run(Thread.java:1583)

Still a bad bug but I don't think we can call it a blocker if it's already been released.

@DaveCTurner DaveCTurner added medium-risk An open issue or test failure that is a medium risk to future releases and removed blocker labels Feb 8, 2024
@DaveCTurner
Copy link
Contributor Author

In the failure in the OP I see also this in the logs for GetHealthCancellationIT:

[2024-01-23T22:51:48,950][ERROR][i.n.u.c.D.rejectedExecution] [node_t0] Failed to submit a listener notification task. Event loop shut down?
java.util.concurrent.RejectedExecutionException: event executor terminated
        at io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:934) ~[netty-common-4.1.94.Final.jar:4.1.94.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:351) ~[netty-common-4.1.94.Final.jar:4.1.94.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:344) ~[netty-common-4.1.94.Final.jar:4.1.94.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:836) ~[netty-common-4.1.94.Final.jar:4.1.94.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.execute0(SingleThreadEventExecutor.java:827) ~[netty-common-4.1.94.Final.jar:4.1.94.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:817) ~[netty-common-4.1.94.Final.jar:4.1.94.Final]
        at io.netty.util.concurrent.DefaultPromise.safeExecute(DefaultPromise.java:862) ~[netty-common-4.1.94.Final.jar:4.1.94.Final]
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:500) ~[netty-common-4.1.94.Final.jar:4.1.94.Final]
        at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636) ~[netty-common-4.1.94.Final.jar:4.1.94.Final]
        at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629) ~[netty-common-4.1.94.Final.jar:4.1.94.Final]
        at io.netty.util.concurrent.DefaultPromise.setFailure(DefaultPromise.java:110) ~[netty-common-4.1.94.Final.jar:4.1.94.Final]
        at io.netty.channel.DefaultChannelPromise.setFailure(DefaultChannelPromise.java:89) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
        at io.netty.channel.AbstractChannelHandlerContext.safeExecute(AbstractChannelHandlerContext.java:1174) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
        at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:972) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
        at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:934) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
        at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1020) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
        at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:311) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
        at org.elasticsearch.http.netty4.Netty4HttpChannel.sendResponse(Netty4HttpChannel.java:34) ~[transport-netty4-8.13.0-SNAPSHOT.jar:8.13.0-SNAPSHOT]
        at org.elasticsearch.http.AbstractHttpServerTransport$RequestTrackingHttpChannel.sendResponse(AbstractHttpServerTransport.java:646) ~[main/:?]
        at org.elasticsearch.http.DefaultRestChannel.sendResponse(DefaultRestChannel.java:189) ~[main/:?]
        at org.elasticsearch.rest.RestController$ResourceHandlingHttpChannel.sendResponse(RestController.java:832) ~[main/:?]
        at org.elasticsearch.rest.action.RestActionListener.onFailure(RestActionListener.java:55) ~[main/:?]
        at org.elasticsearch.rest.action.RestActionListener.onResponse(RestActionListener.java:40) ~[main/:?]
        at org.elasticsearch.rest.action.RestCancellableNodeClient$1.onResponse(RestCancellableNodeClient.java:87) ~[main/:?]
        at org.elasticsearch.rest.action.RestCancellableNodeClient$1.onResponse(RestCancellableNodeClient.java:81) ~[main/:?]
        at org.elasticsearch.action.ActionListener$3.onResponse(ActionListener.java:314) ~[main/:?]
        at org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:202) ~[main/:?]
        at org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:196) ~[main/:?]
        at org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onResponse(ActionListenerImplementations.java:307) ~[main/:?]
        at org.elasticsearch.action.ActionListener$3.onResponse(ActionListener.java:314) ~[main/:?]
        at org.elasticsearch.action.ActionListenerImplementations$MappedActionListener.onResponse(ActionListenerImplementations.java:95) ~[main/:?]
        at org.elasticsearch.health.HealthService.validateResultsAndNotifyListener(HealthService.java:196) ~[main/:?]
        at org.elasticsearch.health.HealthService$1.lambda$calculateFilteredIndicatorsRunnable$1(HealthService.java:156) ~[main/:?]
        at org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:95) ~[main/:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[main/:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917) ~[main/:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1583) ~[?:?]

I think that'd explain it, we already shut down the network event loop so we can't do any more writes or even complete any more channel promises.

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Feb 8, 2024
Today a `HttpResponse` is always released via a `ChannelPromise` which
means the release happens on a network thread. However, it's possible we
try and send a `HttpResponse` after the node has got far enough through
shutdown that it doesn't have any running network threads left, which
means the response just leaks.

This is no big deal in production, it becomes irrelevant when the
process exits, but in tests we start and stop many nodes within the same
process so mustn't leak anything.

At this point in shutdown, all HTTP channels are now closed, so it's
sufficient to check whether the channel is open first, and to fail the
listener on the calling thread if not. That's what this commit does.

Closes elastic#104651
@DaveCTurner
Copy link
Contributor Author

Given that this test failure was a leak caused by a slightly unexpected situation at shutdown, it's not really a big deal in production, so now I agree that this is low-risk 😁

@DaveCTurner DaveCTurner added low-risk An open issue or test failure that is a low risk to future releases and removed medium-risk An open issue or test failure that is a medium risk to future releases labels Feb 8, 2024
@DaveCTurner DaveCTurner assigned DaveCTurner and unassigned arteam Feb 8, 2024
elasticsearchmachine pushed a commit that referenced this issue Feb 8, 2024
Today a `HttpResponse` is always released via a `ChannelPromise` which
means the release happens on a network thread. However, it's possible we
try and send a `HttpResponse` after the node has got far enough through
shutdown that it doesn't have any running network threads left, which
means the response just leaks.

This is no big deal in production, it becomes irrelevant when the
process exits, but in tests we start and stop many nodes within the same
process so mustn't leak anything.

At this point in shutdown, all HTTP channels are now closed, so it's
sufficient to check whether the channel is open first, and to fail the
listener on the calling thread if not. That's what this commit does.

Closes #104651
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Feb 8, 2024
Today a `HttpResponse` is always released via a `ChannelPromise` which
means the release happens on a network thread. However, it's possible we
try and send a `HttpResponse` after the node has got far enough through
shutdown that it doesn't have any running network threads left, which
means the response just leaks.

This is no big deal in production, it becomes irrelevant when the
process exits, but in tests we start and stop many nodes within the same
process so mustn't leak anything.

At this point in shutdown, all HTTP channels are now closed, so it's
sufficient to check whether the channel is open first, and to fail the
listener on the calling thread if not. That's what this commit does.

Closes elastic#104651
elasticsearchmachine pushed a commit that referenced this issue Feb 8, 2024
* Fix leaked HTTP response sent after close (#105293)

Today a `HttpResponse` is always released via a `ChannelPromise` which
means the release happens on a network thread. However, it's possible we
try and send a `HttpResponse` after the node has got far enough through
shutdown that it doesn't have any running network threads left, which
means the response just leaks.

This is no big deal in production, it becomes irrelevant when the
process exits, but in tests we start and stop many nodes within the same
process so mustn't leak anything.

At this point in shutdown, all HTTP channels are now closed, so it's
sufficient to check whether the channel is open first, and to fail the
listener on the calling thread if not. That's what this commit does.

Closes #104651

* Fix backport
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Network Http and internode communication implementations low-risk An open issue or test failure that is a low risk to future releases Team:Distributed Meta label for distributed team >test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants