CloudWatch putMetricData sometimes produces a java.io.IOException: Server failed to send complete response #1380

Closed
oripwk opened this issue Aug 11, 2019 · 34 comments
Labels
investigating This issue is being investigated and/or work is in progress to resolve the issue. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days.

Comments

oripwk commented Aug 11, 2019

This issue is exactly like #452, but it happens with CloudWatch instead of S3.

Expected Behavior

putMetricData(…) should not throw an exception.

Current Behavior

java.io.IOException: Server failed to send complete response is thrown several times a minute. Most of the requests seem to succeed.

java.util.concurrent.CompletionException: software.amazon.awssdk.core.exception.SdkClientException
	at software.amazon.awssdk.utils.CompletableFutureUtils.errorAsCompletionException(CompletableFutureUtils.java:61)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncExecutionFailureExceptionReportingStage.lambda$execute$0(AsyncExecutionFailureExceptionReportingStage.java:51)
	at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
	at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at software.amazon.awssdk.utils.CompletableFutureUtils.lambda$forwardExceptionTo$0(CompletableFutureUtils.java:75)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryExecutor.retryErrorIfNeeded(AsyncRetryableStage.java:175)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryExecutor.retryIfNeeded(AsyncRetryableStage.java:126)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryExecutor.lambda$execute$0(AsyncRetryableStage.java:107)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage$ResponseHandler.onError(MakeAsyncHttpRequestStage.java:249)
	at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.lambda$notifyIfResponseNotCompleted$2(ResponseHandler.java:397)
	at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.runAndLogError(ResponseHandler.java:180)
	at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.notifyIfResponseNotCompleted(ResponseHandler.java:397)
	at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.channelInactive(ResponseHandler.java:148)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
	at io.netty.handler.logging.LoggingHandler.channelInactive(LoggingHandler.java:167)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
	at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelInactive(CombinedChannelDuplexHandler.java:420)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:390)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:355)
	at io.netty.handler.codec.http.HttpClientCodec$Decoder.channelInactive(HttpClientCodec.java:282)
	at io.netty.channel.CombinedChannelDuplexHandler.channelInactive(CombinedChannelDuplexHandler.java:223)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:390)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:355)
	at io.netty.handler.ssl.SslHandler.channelInactive(SslHandler.java:1054)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1403)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
	at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:912)
	at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:826)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:495)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:905)
	at java.lang.Thread.run(Thread.java:748)
Caused by: software.amazon.awssdk.core.exception.SdkClientException: null
	at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:97)
	at software.amazon.awssdk.core.internal.util.ThrowableUtils.asSdkException(ThrowableUtils.java:98)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryExecutor.retryIfNeeded(AsyncRetryableStage.java:125)
	... 63 common frames omitted
Caused by: java.io.IOException: Server failed to send complete response
	at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.notifyIfResponseNotCompleted(ResponseHandler.java:396)
	... 54 common frames omitted

Possible Solution

I have no idea.

Steps to Reproduce (for bugs)

Create a CloudWatch client that sends putMetricData requests about 100 times a minute, where each request has 20 metrics, and each metric contains 150 values.
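
For reference, a minimal sketch of such a workload, using a hypothetical namespace ("ReproTest"), hypothetical metric names, and the default Netty-based async client, could look like this:

    import java.time.Instant;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import software.amazon.awssdk.services.cloudwatch.CloudWatchAsyncClient;
    import software.amazon.awssdk.services.cloudwatch.model.MetricDatum;

    // Hypothetical reproduction: ~100 putMetricData calls per minute,
    // each request carrying 20 metrics with 150 values per metric.
    public class PutMetricDataRepro {
        public static void main(String[] args) {
            CloudWatchAsyncClient cloudWatch = CloudWatchAsyncClient.create();
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(() -> {
                List<MetricDatum> metrics = new ArrayList<>();
                for (int m = 0; m < 20; m++) {
                    List<Double> values = new ArrayList<>();
                    List<Double> counts = new ArrayList<>();
                    for (int v = 0; v < 150; v++) {
                        values.add(Math.random() * 100);
                        counts.add(1.0);
                    }
                    metrics.add(MetricDatum.builder()
                                           .metricName("latency-" + m) // hypothetical metric name
                                           .timestamp(Instant.now())
                                           .values(values)
                                           .counts(counts)
                                           .build());
                }
                cloudWatch.putMetricData(b -> b.namespace("ReproTest").metricData(metrics))
                          .whenComplete((response, error) -> {
                              if (error != null) {
                                  error.printStackTrace(); // surfaces the IOException described above
                              }
                          });
            }, 0, 600, TimeUnit.MILLISECONDS); // ~100 requests per minute
        }
    }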

Context

Monitoring real-time streaming application.

Your Environment

  • AWS Java SDK version used: 2.7.15
  • JDK version used: 1.8
  • Operating System and version: Linux 4.14.62-65.117.amzn1.x86_64 x86_64 GNU/Linux
@sschepens

Observed the same issue using getMetricData.

@varunnvs92 (Contributor)

  1. How often are you seeing this issue?
  2. Do you have test code to consistently repro the issue?

As discussed in #452, we made some fixes that reduced the occurrence of this error a lot, but there is still a small chance of seeing it. Code to repro the issue would help us debug it further. I would also suggest trying the latest version to see if there is any improvement.

@varunnvs92 varunnvs92 added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. needs-response and removed response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. labels Aug 29, 2019
@slingala-scout24

We have the same issue as well. We upgraded the AWS SDK to the latest version, 2.7.32, in the hope that the issue would be fixed, but unfortunately it didn't help. The issue happens when the application runs for a long time without a restart, but as soon as the application restarts, we don't see it anymore until the next prolonged run. It usually happens about 24 hours after the application starts. Unfortunately I do not have reproducible code; we just use the default metrics publishing configuration.

oripwk (Author) commented Sep 3, 2019

@varunnvs92

  1. As I said, this issue happens regularly, almost once a second. It never stops throwing that exception.
  2. The issue happens during production workload. In order to write code that reproduces this, I'll have to artificially create the same conditions. It might take a while. I'll keep you posted.

I think that because it's reproducible for other people and is very similar to #452, it deserves more attention.

@trautonen

We've found that the only solution for now to this issue and #452 is to use the sync clients, which use the Apache HTTP client instead of Netty.
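
For reference, a minimal sketch of that workaround, assuming the software.amazon.awssdk:apache-client module is on the classpath, looks roughly like this:

    import software.amazon.awssdk.http.apache.ApacheHttpClient;
    import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;

    // Synchronous CloudWatch client backed by the Apache HTTP client
    // instead of the Netty-based async client.
    public class SyncCloudWatchFactory {
        public static CloudWatchClient create() {
            return CloudWatchClient.builder()
                                   .httpClientBuilder(ApacheHttpClient.builder())
                                   .build();
        }
    }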

@zoewangg (Contributor)

Apologies for the delayed response. Unfortunately, I'm not able to reproduce the issue. Can you provide more details on how to reproduce it, for example the APIs you are using and the number of requests per second?

I can see Server failed to send complete response errors with retries disabled, because it's possible that the connection gets closed remotely mid-request; normally the next retry should pick up a new connection and succeed. Is there any chance you disabled retries on the SDK client? Can you provide a code snippet of how you create the client?
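
For anyone checking this on their own client, retries are configured through the client override configuration; a minimal sketch (the numRetries value here is illustrative, not a recommendation) of a client with retries explicitly enabled:

    import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
    import software.amazon.awssdk.core.retry.RetryPolicy;
    import software.amazon.awssdk.services.cloudwatch.CloudWatchAsyncClient;

    // Retries are on by default; a client built with RetryPolicy.none() would
    // surface transient "Server failed to send complete response" errors directly.
    public class RetryingCloudWatchFactory {
        public static CloudWatchAsyncClient create() {
            RetryPolicy retryPolicy = RetryPolicy.builder().numRetries(3).build();
            return CloudWatchAsyncClient.builder()
                    .overrideConfiguration(ClientOverrideConfiguration.builder()
                            .retryPolicy(retryPolicy)
                            .build())
                    .build();
        }
    }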

@trautonen

We are using the default config and default connection parameters.

CloudWatchAsyncClient.builder().build()

And it's not single or random requests that fail. Like @oripwk said, it's a constant failure; all requests fail. We are not triggering it very heavily either, just a burst of about 5 requests per second once a minute. And it's surely related to the HTTP client, as there are zero problems with the Apache HTTP client.

@zoewangg (Contributor)

@trautonen Thank you for sharing the info. Unfortunately, still no luck reproducing the issue.

I tried sending 5 concurrent requests every minute for a total of 20 minutes and all requests succeeded; I did notice that some requests failed at the first attempt but succeeded in the next retry.

Is there any way you can provide debug logs? I just want to check whether all retry attempts failed for the same reason in your use case.

        StabilityTestRunner.newRunner()
                           .testName("CloudWatchAsyncStabilityTest." + testName)
                           .futureFactory(cloudWatchAsyncClient.putMetricData(b -> b.namespace(namespace)
                                                                                    .metricData(metrics)))
                           .totalRuns(20)
                           .requestCountPerRun(5)
                           .delaysBetweenEachRun(Duration.ofMinutes(1))
                           .run();

Could you try setting connectionMaxIdleTime to 5 seconds? In my tests, reducing the max idle timeout largely reduced the chances of an idle connection getting closed mid-request on the service side.

        CloudWatchAsyncClient.builder()
                             .httpClientBuilder(NettyNioAsyncHttpClient.builder()
                                                                       .connectionMaxIdleTime(Duration.ofSeconds(5)))
                             .build();

zoewangg (Contributor) commented Oct 4, 2019

We have pushed out a change to update the default connectionMaxIdleTime for the CloudWatch client via #1450. The change has been released as part of 2.9.14.

@zoewangg zoewangg added the closing-soon This issue will close in 4 days unless further comments are made. label Oct 4, 2019
sycyhy commented Oct 7, 2019

I have a similar issue (exactly the same exception) and I wonder whether it will also be fixed by this.

I create a CloudWatch client like this:

    private static final AwsCredentialsProvider CREDENTIALS_PROVIDER = AwsCredentialsProviderChain.of(
            ProfileCredentialsProvider.create(),
            EnvironmentVariableCredentialsProvider.create(),
            ContainerCredentialsProvider.builder().asyncCredentialUpdateEnabled(true).build(),
            InstanceProfileCredentialsProvider.create()
    );

    private static CloudWatchAsyncClient initializeCloudWatchClient() {
        return CloudWatchAsyncClient.builder()
                .credentialsProvider(CREDENTIALS_PROVIDER)
                .build();
    }

My metrics are sent properly for a few days, and then suddenly, in the middle of the night, most of the requests start failing with the same exception the author mentioned. The traffic levels are actually lower during the night, so it seems to be a slightly different scenario.

I will try the new version either way, but if you see anything obvious that I'm missing let me know.

zoewangg (Contributor) commented Oct 9, 2019

Hi @sycyhy, have you tried the new version? Are you still experiencing the issue?

sycyhy commented Oct 9, 2019

Hi @zoewangg - yep, it's in production right now, but I will be able to confirm after the weekend, because during the week there are a number of deployments and the issue only pops up when the app is left without any restarts for a few days :)

@kaffepanna

Seeing the same issue when using it with the KCL under no particular load.

@zoewangg (Contributor)

@kaffepanna Are you still experiencing the issue with the latest version?

@kaffepanna

@zoewangg using version 2.9.20

sycyhy commented Oct 16, 2019

Yep, my issue is still present; updating the library to 2.9.14 sadly did not help.

@zoewangg zoewangg added investigating This issue is being investigated and/or work is in progress to resolve the issue. and removed closing-soon This issue will close in 4 days unless further comments are made. labels Oct 16, 2019
@zoewangg (Contributor)

@sycyhy Thank you for reporting back. It looks like the issue occurs after letting the app run for a while. Did it ever recover after the errors, or did all subsequent requests fail?

I'll continue investigating.

sycyhy commented Oct 16, 2019

@zoewangg Nope, it does not recover; the app needs to be restarted. I also just noticed that the IOException changed slightly with the new version:

Caused by: s.a.a.c.e.SdkClientException: null
at s.a.a.c.e.SdkClientException$BuilderImpl.build(SdkClientException.java:97)
at s.a.a.c.i.u.ThrowableUtils.asSdkException(ThrowableUtils.java:98)
at s.a.a.c.i.h.p.s.AsyncRetryableStage$RetryExecutor.retryIfNeeded(AsyncRetryableStage.java:125)
... 35 common frames omitted
Caused by: java.io.IOException: Channel was closed before it could be written to.
... 28 common frames omitted

@zoewangg (Contributor)

Were all errors "IOException: Channel was closed before it could be written to" or just some of them?

They were caused by the same underlying reason: the server closed the connection remotely in the middle of the request. For this particular error, though, it seems the connection was closed before the SDK was able to write the request.

sycyhy commented Oct 16, 2019

Yep, you are right, only some of them (actually only 2 of those for hundreds of the Server failed to send complete response errors).

@zoewangg (Contributor)

Still not able to reproduce the exact all-requests-failing issue, but I found one issue in the SDK. In certain cases when Connection: close is present in the response header, the SDK does not close the connection as expected and will try to reuse that connection for other requests. Thus, if the next request picks up that bad connection right before the service closes it, an IOException is thrown once the connection is closed remotely.
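
For context, a minimal illustrative Netty handler showing the general idea (this is not the SDK's actual fix, just a sketch of honouring Connection: close before a channel can be reused):

    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;
    import io.netty.handler.codec.http.HttpResponse;
    import io.netty.handler.codec.http.HttpUtil;
    import io.netty.handler.codec.http.LastHttpContent;

    // Remembers whether the server sent "Connection: close" and closes the
    // channel once the response body ends, so a non-reusable connection is
    // never handed back to the connection pool.
    public class CloseOnConnectionCloseHandler extends ChannelInboundHandlerAdapter {
        private boolean keepAlive = true;

        @Override
        public void channelRead(ChannelHandlerContext ctx, Object msg) {
            if (msg instanceof HttpResponse) {
                keepAlive = HttpUtil.isKeepAlive((HttpResponse) msg);
            }
            boolean lastContent = msg instanceof LastHttpContent;
            ctx.fireChannelRead(msg);
            if (lastContent && !keepAlive) {
                ctx.close();
            }
        }
    }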

zoewangg added a commit that referenced this issue Oct 17, 2019
…f a channel before picking it up in the pool. See #1380
zoewangg added a commit that referenced this issue Oct 18, 2019
…f a channel before picking it up in the pool. See #1380
@zoewangg (Contributor)

#1476 has been merged to address the issue of non-reusable connections getting reused, and the change will be included in the next release.

@robinspollak

@sycyhy, thanks for raising this; we're seeing a very similar issue, and thanks to all who are working to resolve it. Lukasz, I'm wondering whether you see any associated CPU spikes when this problem arises?

@zoewangg (Contributor)

Hi @sycyhy @robinspollak @kaffepanna,

The fix has been released as part of 2.9.24. Could you try with the latest version and see if the issue is resolved for your use case? I was not able to reproduce the exact issues you were seeing, so it would be great if you could help verify the fix.

robinspollak commented Oct 19, 2019

@zoewangg, 2.9.24 is in production now and I'll report back! Thanks very much

robinspollak commented Oct 19, 2019

@zoewangg, unfortunately we're still seeing the following error:

java.lang.RuntimeException: software.amazon.awssdk.core.exception.SdkClientException
	at software.amazon.kinesis.retrieval.AWSExceptionManager.apply(AWSExceptionManager.java:65)
	at software.amazon.kinesis.metrics.CloudWatchMetricsPublisher.blockingExecute(CloudWatchMetricsPublisher.java:89)
	at software.amazon.kinesis.metrics.CloudWatchMetricsPublisher.publishMetrics(CloudWatchMetricsPublisher.java:74)
	at software.amazon.kinesis.metrics.CloudWatchPublisherRunnable.runOnce(CloudWatchPublisherRunnable.java:138)
	at software.amazon.kinesis.metrics.CloudWatchPublisherRunnable.run(CloudWatchPublisherRunnable.java:84)
	at java.lang.Thread.run(Thread.java:748)
Caused by: software.amazon.awssdk.core.exception.SdkClientException: null
	at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:97)
	at software.amazon.awssdk.core.internal.util.ThrowableUtils.asSdkException(ThrowableUtils.java:98)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryExecutor.retryIfNeeded(AsyncRetryableStage.java:125)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryExecutor.lambda$execute$0(AsyncRetryableStage.java:107)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage$ResponseHandler.onError(MakeAsyncHttpRequestStage.java:249)
	at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.lambda$notifyIfResponseNotCompleted$2(ResponseHandler.java:391)
	at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.runAndLogError(ResponseHandler.java:174)
	at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.notifyIfResponseNotCompleted(ResponseHandler.java:391)
	at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.channelInactive(ResponseHandler.java:142)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
	...

It is the same as the one we were seeing before the new release. Let me know if a full stack trace or a sample of how we are setting up our Kinesis client would help.

@zoewangg (Contributor)

@robinspollak Thanks for reporting back. Yes, it'd be helpful if you could share sample code of how you set up the CloudWatch client. Did all requests fail or just some? Are you running it on an EC2 instance? If so, could you share what type that instance is?

robinspollak commented Oct 21, 2019

@zoewangg We set up our (Scala) KCL integration as follows:

class KinesisConfigurationProvider @Inject()(
    configuration: Configuration,
    recordProcessorFactory: ShardRecordProcessorFactory
) {

  def apply(streamName: String): ConfigsBuilder = {
    val workerId = InetAddress.getLocalHost.getCanonicalHostName + ":" + UUID
      .randomUUID()
    val kinesisAsyncClient = KinesisAsyncClient.builder().region(Region.EU_WEST_1).build()
    val dynamoDBClient = DynamoDbAsyncClient.builder().region(Region.EU_WEST_1).build()
    val cloudWatchClient = CloudWatchAsyncClient.builder().region(Region.EU_WEST_1).build()
    new ConfigsBuilder(
      streamName,
      s"app-$streamName-processor",
      kinesisAsyncClient,
      dynamoDBClient,
      cloudWatchClient,
      workerId,
      recordProcessorFactory
    )
  }
}

It seems that only some requests fail. Upon deploy or reboot we don't see the log lines for about 3 hours. They then escalate in frequency for around 24 hours and, eventually, the server stops responding and needs to be rebooted. Our servers run on m4.xlarge EC2 instances. Hope this helps, and I'm happy to provide more information.

@zoewangg (Contributor)

@robinspollak Thanks for the info. A few follow-up questions: did you see any other non-CloudWatch errors before the server stopped responding? When the server stopped responding, did all requests fail with "Server failed to send complete response", or did it just get stuck somehow? What KCL version are you using?

@sycyhy Are you also using the KCL, or just the CloudWatch async client?

@robinspollak

Hi @zoewangg, sorry for the delayed response. We don't really see any other non-CloudWatch errors, but we're also trying our best to determine with certainty that CloudWatch is causing the problems. Thus far, we haven't been able to rule out another source of the problem.

In terms of the server behavior when it stops responding: the entire server just ceases to respond to any requests, so there are no responses and thus no errors in response to requests. We're using KCL 2.2.4.

Thanks again for all your help.

sycyhy commented Oct 24, 2019

@zoewangg I use it via Micrometer, not the KCL, which is a layer of abstraction over the async client; I pass the client manually when I configure the reporter.

By the way, I needed to revert to the legacy client due to this (I could not wake up more people during the night, sadly, haha).

@zoewangg (Contributor)

Hi @robinspollak, for the server-not-responding issue, my guess is that it might not be related to the CloudWatch errors. I'd recommend taking a thread dump or heap dump to see if there's a blocked thread or increased resource usage causing the server to stop responding.
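
As a quick way to capture a thread dump from inside the JVM (jstack <pid> or kill -3 <pid> on the host works just as well), a small sketch using the standard ThreadMXBean API:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Prints every thread's stack trace plus lock information, which is usually
    // enough to spot a blocked Netty event loop or a stuck metrics publisher thread.
    public class ThreadDump {
        public static void main(String[] args) {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
                System.out.print(info);
            }
        }
    }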

We have pushed out a couple of fixes, #1481 and #1479 (available in 2.10.0), for issues that could cause the Server failed to send complete response errors. The KCL has also released a new version, 2.2.5. Could you try the new versions?

@sycyhy Sorry to hear that you had to revert the client code 😞 We are currently working on setting up more test scenarios that cover more real-world use cases to catch these issues.

@zoewangg (Contributor)

Closing the issue due to lack of activity. Please create a new issue if you continue to see this error with the latest SDK.

cgallom commented May 2, 2024

@oripwk Hi, I have read this post and others like it, and I notice that you were never able to check whether the updates to aws-sdk-java-v2 fixed your bug. I am currently using version 2.10.41 and the error persists in this version. Did you find out how to fix this bug?
