
Uncaught exception stops KinesisAsyncClient #3335

Closed
MediaMarco opened this issue Aug 2, 2022 · 12 comments

Labels: bug (This issue is a bug.) · closed-for-staleness · p2 (This is a standard priority issue)

Comments

MediaMarco commented Aug 2, 2022

Describe the bug

We use the KinesisAsyncClient to read messages from Kinesis like this:

GetRecordsResponse response = kinesisClient.getRecords(GetRecordsRequest.builder().shardIterator(id).build()).join();

This worked fine for ages, but since SDK version 2.17.115 the reader sometimes stalls and the join() blocks forever.
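
For context, here is a slightly fuller sketch of that usage pattern; the client construction and the shard-iterator lookup are simplified illustrations (region and helper method are assumptions, not the actual production code):

import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.kinesis.KinesisAsyncClient;
import software.amazon.awssdk.services.kinesis.model.GetRecordsRequest;
import software.amazon.awssdk.services.kinesis.model.GetRecordsResponse;

// Build the async client (region chosen here only for illustration).
KinesisAsyncClient kinesisClient = KinesisAsyncClient.builder()
        .region(Region.EU_WEST_1)
        .build();

// obtainShardIterator() stands in for a GetShardIterator call; it is a hypothetical helper.
String id = obtainShardIterator();

// This is the call that sometimes never completes.
GetRecordsResponse response = kinesisClient
        .getRecords(GetRecordsRequest.builder().shardIterator(id).build())
        .join();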

It seems to have something to do with this change:

2.17.115 2022-01-20

Netty NIO HTTP Client

  • Bugfixes

  • Verify that the content-length header matches the content returned by the service.

Expected Behavior

The CompletableFuture returned by getRecords should complete exceptionally when the Netty I/O error occurs, so that it can be handled gracefully in application code.

Current Behavior

kinesisClient.getRecords(GetRecordsRequest.builder().shardIterator(id).build()).join();

blocks forever.

When the blocking occurs, we see the following exception in our logs and nothing else:

Message: An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.

Logger: io.netty.channel.DefaultChannelPipeline

Thread: aws-java-sdk-NettyEventLoop-3-7

Stack Trace:
software.amazon.awssdk.http.nio.netty.internal.FutureCancelledException: java.io.IOException: Response had content-length of 25 bytes, but only received 0 bytes before the connection was closed.
	at software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.lambda$null$3(NettyRequestExecutor.java:136)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.PromiseTask.run(PromiseTask.java:106)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:503)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.io.IOException: Response had content-length of 25 bytes, but only received 0 bytes before the connection was closed.
	at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.validateResponseContentLength(ResponseHandler.java:163)
	at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.access$700(ResponseHandler.java:75)
	at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler$PublisherAdapter$1.onComplete(ResponseHandler.java:369)
	at software.amazon.awssdk.http.nio.netty.internal.nrs.HandlerPublisher.complete(HandlerPublisher.java:447)
	at software.amazon.awssdk.http.nio.netty.internal.nrs.HandlerPublisher.channelInactive(HandlerPublisher.java:430)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
	at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
	at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
	at io.netty.handler.codec.http2.AbstractHttp2StreamChannel$Http2ChannelUnsafe$2.run(AbstractHttp2StreamChannel.java:737)
	... 7 common frames omitted

Reproduction Steps

Read a large number of records from Kinesis using the KinesisAsyncClient's getRecords() method.

It's hard to reproduce; the issue occurs only when the content-length check fails.

Possible Solution

Let the CompletableFuture complete exceptionally, as it does when other I/O errors occur.
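
If the future completed exceptionally as suggested, application code could react along these lines. This is only a sketch: it assumes the IOException (possibly wrapped by the SDK in another exception) surfaces in the cause chain of the CompletionException thrown by join(), and the retry handling is illustrative.

import java.io.IOException;
import java.util.concurrent.CompletionException;

try {
    GetRecordsResponse response = kinesisClient
            .getRecords(GetRecordsRequest.builder().shardIterator(id).build())
            .join();
    // ... normal record processing
} catch (CompletionException e) {
    // Walk the cause chain to find the underlying I/O failure.
    Throwable cause = e.getCause();
    while (cause != null && !(cause instanceof IOException)) {
        cause = cause.getCause();
    }
    if (cause != null) {
        // Content-length mismatch or other I/O error: back off and retry with the same shard iterator.
    } else {
        throw e;
    }
}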

Additional Information/Context

No response

AWS Java SDK version used

2.17.233

JDK version used

openjdk version "18.0.1" 2022-04-19
OpenJDK Runtime Environment Temurin-18.0.1+10 (build 18.0.1+10)
OpenJDK 64-Bit Server VM Temurin-18.0.1+10 (build 18.0.1+10, mixed mode, sharing)

Operating System and version

Ubuntu 22.04.1 LTS

MediaMarco added the bug (This issue is a bug.) and needs-triage (This issue or PR still needs to be triaged.) labels on Aug 2, 2022
@debora-ito (Member)

Hi @MediaMarco, apologies for taking this long to respond.

We have other reports of this issue and some leads on what the cause might be, but as you said, it's not easy to reproduce. We'll investigate.

How often do you see the error? Does it happen intermittently, and at what rate?

debora-ito removed the needs-triage (This issue or PR still needs to be triaged.) label on Jan 25, 2023
debora-ito self-assigned this on Jan 25, 2023
@MediaMarco (Author)

> Hi @MediaMarco, apologies for taking this long to respond.
>
> We have other reports of this issue and some leads on what the cause might be, but as you said, it's not easy to reproduce. We'll investigate.
>
> How often do you see the error? Does it happen intermittently, and at what rate?

Hi @debora-ito, it's been some time. I'll have to check the logs. We've built a workaround with a timeout around the whole call that retries if it stalls.
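
For reference, a minimal sketch of that kind of workaround; the 30-second timeout and three attempts are illustrative values, not the actual configuration:

import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

GetRecordsRequest request = GetRecordsRequest.builder().shardIterator(id).build();
GetRecordsResponse response = null;
for (int attempt = 0; attempt < 3 && response == null; attempt++) {
    try {
        // Bound the wait instead of calling join(), which can block forever here.
        response = kinesisClient.getRecords(request).get(30, TimeUnit.SECONDS);
    } catch (TimeoutException e) {
        // The future never completed in time; issue the request again.
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RuntimeException(e);
    } catch (ExecutionException e) {
        throw new RuntimeException(e);
    }
}
// response may still be null if every attempt timed out; handle as appropriate.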

debora-ito added the p2 (This is a standard priority issue) label on Mar 7, 2023
@Maxranderson-itv

Hello folks! I'm having a similar issue, but with S3AsyncClient. When we do a get operation, the process stalls when we receive that message. Would you like me to create a separate issue? I'm using version 2.17.204.

@debora-ito (Member)

@Maxranderson-itv can you reproduce it reliably? If so, can you share repro code and the stack trace with the error?

@Maxranderson-itv

Here is the stack trace. Let me see how I can reproduce it; I'll follow up as soon as possible.

software.amazon.awssdk.http.nio.netty.internal.FutureCancelledException: java.io.IOException: Response had content-length of 36373931584 bytes, but only received 34154305346 bytes before the connection was closed.
	at software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.lambda$null$3(NettyRequestExecutor.java:136)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.PromiseTask.run(PromiseTask.java:106)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:995)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Response had content-length of 36373931584 bytes, but only received 34154305346 bytes before the connection was closed.
	at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.validateResponseContentLength(ResponseHandler.java:163)
	at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.access$700(ResponseHandler.java:75)
	at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler$PublisherAdapter$1.onComplete(ResponseHandler.java:369)
	at software.amazon.awssdk.http.nio.netty.internal.nrs.HandlerPublisher.complete(HandlerPublisher.java:447)
	at software.amazon.awssdk.http.nio.netty.internal.nrs.HandlerPublisher.channelInactive(HandlerPublisher.java:430)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.handler.logging.LoggingHandler.channelInactive(LoggingHandler.java:206)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
	at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelInactive(CombinedChannelDuplexHandler.java:418)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:392)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:357)
	at io.netty.handler.codec.http.HttpClientCodec$Decoder.channelInactive(HttpClientCodec.java:326)
	at io.netty.channel.CombinedChannelDuplexHandler.channelInactive(CombinedChannelDuplexHandler.java:221)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:392)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:357)
	at io.netty.handler.ssl.SslHandler.channelInactive(SslHandler.java:1074)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
	at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
	at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
	at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:813)
	... 7 common frames omitted

dagnir (Contributor) commented Mar 22, 2023

Hi @MediaMarco and @Maxranderson-itv, can you share how you're configuring your SDK clients?

@Maxranderson-itv does your .join() call also block indefinitely, or does it eventually time out and throw an exception?

@Maxranderson-itv

Sure, the config is below. As for the .join(), it blocks indefinitely, forcing us to restart the service.

S3AsyncClient
  .builder()
  .region(Region.EU_WEST_1)
  .credentialsProvider(credentials)
  .httpClient(
    NettyNioAsyncHttpClient
      .builder()
      .readTimeout(Duration.ofSeconds(3600))
      .tcpKeepAlive(true)
      .maxConcurrency(100)
      .build()
  )
  .build()

@Maxranderson-itv

@debora-ito, sorry for not giving you feedback earlier. Unfortunately, I still haven't had time to reproduce the issue, and it isn't frequent in our production environment. We took the same approach to work around it, wrapping the call in a timeout.

dagnir (Contributor) commented Mar 24, 2023

@MediaMarco @Maxranderson-itv I was not able to precisely reproduce this bug (I could not get .join() to hang indefinitely), but I was able to reproduce the error you're seeing with the uncaught FutureCancelledException after the validation, and observed that it leads to dangling connections in the pool. Perhaps this is a different symptom of the same issue.

A fix for it was released in 2.20.32. Can you try this version and see if it resolves your issue?

@Maxranderson-itv

Cool @dagnir! Thank you very much! I'll try to update the version and check if the issue disappears.

@debora-ito (Member)

Marking this to auto-close soon. Let us know if you have any comments.

@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue, feel free to do so.
