Client blocks indefinitely on high volume #1185

Closed

steamstreetjon opened this issue Jan 21, 2024 · 4 comments
Labels
bug, closed-for-staleness, response-requested

Comments

@steamstreetjon

Describe the bug

Calls with the client (in this case to DynamoDB) periodically block indefinitely.

Expected behavior

Calls shouldn't block and should return in a reasonable amount of time.

Current behavior

In my Lambda, which is invoked about 500,000 times per hour, there are periodic failures where the SDK blocks indefinitely. This happens in only 5-10 invocations per hour, but that's enough to make the SDK unusable.

Steps to Reproduce

This is difficult to reproduce, as it only happens in high-volume situations and very infrequently, but with no fix available the failure rate is still too high to use the SDK. The problem occurs when using either the OkHttp client or the CRT client.

Possible Solution

There is no workaround. I have been unable to diagnose the problem, as the blocking occurs deep in the SDK.

Context

I can't use the Kotlin AWS SDK and would advise others to avoid it as well for any high volume application in Lambda.

AWS Kotlin SDK version used

1.0.37

Platform (JVM/JS/Native)

JVM

Operating System and version

Lambda Java 17

steamstreetjon added the bug and needs-triage labels on Jan 21, 2024
@ianbotsf
Contributor

Hi @steamstreetjon, thank you for the bug report. Without more information it's a little difficult to know how to proceed. Could you please provide a bit more detail to help us diagnose the issue?

  • How do you know the SDK is blocking? Are you seeing exceptions/timeouts?
  • You mentioned ~500k requests/hr; how is that split across your function instances? Does each request get a separate instance, or does a single instance handle multiple simultaneous requests?
  • How are you wrapping the call to the SDK's suspend operation to get a coroutine context? (See the sketch after this list for the kind of wrapper I mean.)
  • Can you share any relevant code?
  • Have you examined the logs at info/debug/trace levels for pertinent messages? Can you share any relevant log lines?
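
For reference on the coroutine-context question, one common way to bridge a Lambda handler into the SDK's suspend API looks roughly like the sketch below (hypothetical handler, table, and key names; not code from this report):

    import aws.sdk.kotlin.services.dynamodb.DynamoDbClient
    import aws.sdk.kotlin.services.dynamodb.model.AttributeValue
    import com.amazonaws.services.lambda.runtime.Context
    import com.amazonaws.services.lambda.runtime.RequestHandler
    import kotlinx.coroutines.runBlocking

    // Hypothetical handler: the client is built once per container and reused across
    // invocations, and each invocation bridges into coroutines with runBlocking.
    class ExampleHandler : RequestHandler<Map<String, Any>, String> {
        private val ddb = runBlocking { DynamoDbClient.fromEnvironment() }

        override fun handleRequest(input: Map<String, Any>, context: Context): String = runBlocking {
            val response = ddb.getItem {
                tableName = "example-table"                      // placeholder
                key = mapOf("pk" to AttributeValue.S("example")) // placeholder
            }
            response.item?.toString() ?: "not found"
        }
    }

Whether the real handler uses runBlocking like this, a shared dispatcher, or something else entirely is exactly the kind of detail that would help narrow this down.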

ianbotsf added the response-requested label and removed the needs-triage label on Jan 22, 2024
@steamstreetjon commented Jan 22, 2024

I'll answer a few of these, but I'd like some help in understanding the best way to give you what you need.

The 500k/hr is a single Lambda that is provisioned as normal. There are certainly spikes in activity, but the Lambdas are pretty standard, triggered by EventBridge events.

The initial tell was that the Lambda was timing out. As I dug in further, I could see that it was happening at a specific point: I see the log for the request, but then nothing until the Lambda times out. Since it's EventBridge, the event is retried shortly after, and the exact same details work fine.

Here's a screenshot of the logs. I can share more logging details (I just didn't want to be too public about it), but you can see it gets to the GetItem POST and then hangs until the Lambda invocation ends.

[Screenshot: request logs showing the GetItem POST with no further output until the Lambda times out]

For reference, when it runs the second time a minute later, the 'healthy' version gets a response and carries on:

[Screenshot: logs from the successful retry, where the same request gets a response]

I tried wrapping the code in a withTimeout coroutine, and that is not firing, which makes me think it's genuinely blocked, like there's some synchronization deadlock in multi-threaded code (otherwise the coroutine should be able to handle the timeout).
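
Roughly, the wrapping looks like the following (simplified sketch with placeholder table and key names):

    import aws.sdk.kotlin.services.dynamodb.DynamoDbClient
    import aws.sdk.kotlin.services.dynamodb.model.AttributeValue
    import aws.sdk.kotlin.services.dynamodb.model.GetItemResponse
    import kotlinx.coroutines.withTimeout
    import kotlin.time.Duration.Companion.seconds

    // Simplified sketch: the real call site has the same shape, just different names.
    // If the underlying call is hard-blocked rather than suspended, this timeout never fires.
    suspend fun getItemWithTimeout(ddb: DynamoDbClient): GetItemResponse =
        withTimeout(10.seconds) {
            ddb.getItem {
                tableName = "example-table"                         // placeholder
                key = mapOf("pk" to AttributeValue.S("example-id")) // placeholder
            }
        }

When the hang occurs, no TimeoutCancellationException ever surfaces from this wrapper.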

I'd really like to help track this down with you, so let me know if you'd like to do some debugging or if there's logging I can enable to help you understand it better.

Also, here's the configuration of the client right now (I added the timeouts to see if I could get the client to fail):

    DynamoDbClient.fromEnvironment {
        logMode = LogMode.LogRequestWithBody + LogMode.LogResponseWithBody
        httpClient {
            socketReadTimeout = 5.seconds
            connectTimeout = 5.seconds
        }
    }

github-actions bot removed the response-requested label on Jan 23, 2024
@ianbotsf
Contributor

Thanks for the additional info. It's particularly curious to me that withTimeout around the DDB call isn't having an effect. As you mention, that may be an indication of a hard block somewhere. Interestingly, though, you mentioned this happens with both OkHttp and CRT engines, both of which use wildly different parallelism models. That points to a potential problem outside of the HTTP engines.

Unfortunately, there's still not enough info present to assist in root-causing this. We have various tests of the SDK in our release pipeline, including some which exercise high concurrency and high throughput, and they have not exhibited this issue before. To dive deeper on this, would you be willing to provide repro code and full trace-level logs for a succeeding and a failing request?

If you have an AWS Support plan, you can submit a support case through the AWS Console with your logs/code which will make it available for us internally. If not, you can attach or paste your logs/code in this issue after redacting sensitive info like access keys, internal endpoints, etc.

ianbotsf added the response-requested label on Jan 23, 2024
@github-actions

It looks like this issue has not been active for more than 5 days. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please add a comment to prevent automatic closure, or if the issue is already closed please feel free to reopen it.

github-actions bot added the closing-soon and closed-for-staleness labels and removed the closing-soon label on Jan 28, 2024