Client blocks indefinitely on high volume #1185

Closed

steamstreetjon opened this issue Jan 21, 2024 · 4 comments
Labels
bug, closed-for-staleness, response-requested

Comments

@steamstreetjon

Describe the bug

Calls with the client (in this case to DynamoDB) periodically block indefinitely.

Expected behavior

Calls shouldn't block and should return in a reasonable amount of time.

Current behavior

In my Lambda, which is invoked about 500,000 times per hour, there are periodic failures where the SDK blocks indefinitely. This happens in only 5-10 invocations per hour, but that's enough to make the SDK unusable.

Steps to Reproduce

This is difficult to reproduce, as it only happens in high-volume situations and very infrequently, but with no fix available the failure rate is still too high to use the SDK. The problem occurs when using either the OkHttp client or the CRT client.

Possible Solution

There is no workaround. I have been unable to diagnose the problem, as the blocking occurs deep in the SDK.

Context

I can't use the Kotlin AWS SDK and would advise others to avoid it as well for any high volume application in Lambda.

AWS Kotlin SDK version used

1.0.37

Platform (JVM/JS/Native)

JVM

Operating System and version

Lambda Java 17

steamstreetjon added the bug and needs-triage labels on Jan 21, 2024
@ianbotsf
Contributor

Hi @steamstreetjon, thank you for the bug report. Without more information it's a little difficult to know how to proceed. Could you please provide a bit more detail to help us diagnose the issue?

  • How do you know the SDK is blocking? Are you seeing exceptions/timeouts?
  • You mentioned ~500k requests/hr; how is that split across your function instances? Does each request get a separate instance, or does a single instance handle multiple simultaneous requests?
  • How are you wrapping the call to the SDK's suspend operation to get a coroutine context? (See the sketch after this list for the kind of wrapper I mean.)
  • Can you share any relevant code?
  • Have you examined the logs at info/debug/trace levels for pertinent messages? Can you share any relevant log lines?
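
For reference on the coroutine-context question, one common way to bridge a Lambda handler into the SDK's suspend API looks roughly like the sketch below (hypothetical handler, table, and key names; not code from this report):

    import aws.sdk.kotlin.services.dynamodb.DynamoDbClient
    import aws.sdk.kotlin.services.dynamodb.model.AttributeValue
    import com.amazonaws.services.lambda.runtime.Context
    import com.amazonaws.services.lambda.runtime.RequestHandler
    import kotlinx.coroutines.runBlocking

    // Hypothetical handler: the client is built once per container and reused across
    // invocations, and each invocation bridges into coroutines with runBlocking.
    class ExampleHandler : RequestHandler<Map<String, Any>, String> {
        private val ddb = runBlocking { DynamoDbClient.fromEnvironment() }

        override fun handleRequest(input: Map<String, Any>, context: Context): String = runBlocking {
            val response = ddb.getItem {
                tableName = "example-table"                      // placeholder
                key = mapOf("pk" to AttributeValue.S("example")) // placeholder
            }
            response.item?.toString() ?: "not found"
        }
    }

Whether the real handler uses runBlocking like this, a shared dispatcher, or something else entirely is exactly the kind of detail that would help narrow this down.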

ianbotsf added the response-requested label and removed the needs-triage label on Jan 22, 2024
@steamstreetjon commented Jan 22, 2024

I'll answer a few of these, but I'd like some help in understanding the best way to give you what you need.

The 500k/hr is a single Lambda that is provisioned as normal. There are certainly spikes in activity, but the Lambdas are pretty standard, triggered by EventBridge events.

The initial tell was that the Lambda was timing out. As I dug in further, I could see that it was happening at a specific point: I see the log for the request, but then nothing until the Lambda times out. Since it's EventBridge, the event is retried shortly after, and the exact same details work fine.

Here's a screenshot of the logs. I can share more logging details (I just didn't want to be too public about it), but you can see it gets to the GetItem POST and then hangs until the Lambda invocation ends.

[Screenshot: request logs showing the GetItem POST with no further output until the Lambda times out]

For reference, when it runs the second time a minute later, the 'healthy' version gets a response and carries on:

[Screenshot: logs from the successful retry, where the same request gets a response]

I tried wrapping the code in a withTimeout coroutine, and that is not firing, which makes me think it's genuinely blocked, like there's some synchronization deadlock in multi-threaded code (otherwise the coroutine should be able to handle the timeout).
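
Roughly, the wrapping looks like the following (simplified sketch with placeholder table and key names):

    import aws.sdk.kotlin.services.dynamodb.DynamoDbClient
    import aws.sdk.kotlin.services.dynamodb.model.AttributeValue
    import aws.sdk.kotlin.services.dynamodb.model.GetItemResponse
    import kotlinx.coroutines.withTimeout
    import kotlin.time.Duration.Companion.seconds

    // Simplified sketch: the real call site has the same shape, just different names.
    // If the underlying call is hard-blocked rather than suspended, this timeout never fires.
    suspend fun getItemWithTimeout(ddb: DynamoDbClient): GetItemResponse =
        withTimeout(10.seconds) {
            ddb.getItem {
                tableName = "example-table"                         // placeholder
                key = mapOf("pk" to AttributeValue.S("example-id")) // placeholder
            }
        }

When the hang occurs, no TimeoutCancellationException ever surfaces from this wrapper.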

I'd really like to help track this down with you, so let me know if you'd like to do some debugging or if there's logging I can enable to help you understand it better.

Also, here's the configuration of the client right now (I added the timeouts to see if I could get the client to fail):

    DynamoDbClient.fromEnvironment {
        logMode = LogMode.LogRequestWithBody + LogMode.LogResponseWithBody
        httpClient {
            socketReadTimeout = 5.seconds
            connectTimeout = 5.seconds
        }
    }

github-actions bot removed the response-requested label on Jan 23, 2024
@ianbotsf
Contributor

Thanks for the additional info. It's particularly curious to me that withTimeout around the DDB call isn't having an effect. As you mention, that may be an indication of a hard block somewhere. Interestingly, though, you mentioned this happens with both OkHttp and CRT engines, both of which use wildly different parallelism models. That points to a potential problem outside of the HTTP engines.

Unfortunately, there's still not enough info present to assist in root-causing this. We have various tests of the SDK in our release pipeline, including some which exercise high concurrency and high throughput, and they have not exhibited this issue before. To dive deeper on this, would you be willing to provide repro code and full trace-level logs for a succeeding and a failing request?

If you have an AWS Support plan, you can submit a support case through the AWS Console with your logs/code which will make it available for us internally. If not, you can attach or paste your logs/code in this issue after redacting sensitive info like access keys, internal endpoints, etc.

ianbotsf added the response-requested label on Jan 23, 2024
@github-actions

It looks like this issue has not been active for more than 5 days. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please add a comment to prevent automatic closure, or if the issue is already closed please feel free to reopen it.

github-actions bot added the closing-soon and closed-for-staleness labels and removed the closing-soon label on Jan 28, 2024