Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dapr GRPC proxy connection leak #6734

Closed
jsheetzmt opened this issue Jul 27, 2023 · 29 comments · Fixed by #6774
Closed

Dapr GRPC proxy connection leak #6734

jsheetzmt opened this issue Jul 27, 2023 · 29 comments · Fixed by #6774
Assignees
Labels
kind/bug Something isn't working P1
Milestone

Comments

@jsheetzmt
Copy link

jsheetzmt commented Jul 27, 2023

In what area(s)?

/area runtime

What version of Dapr?

1.1.x

Expected Behavior

Proxying GRPC calls through dapr sidecar doesn't open many connections to destination app during load

Actual Behavior

During load, dapr sidecar is opening many connections to the destination app causing high TCP usage and memory leak in our .NET Core application. When I remove the dapr sidecar and make a direct connection between the server and client, I only have a single connection open on port 8081 (HTTP2 traffic).

dotnet-monitor
image

ss
image

Steps to Reproduce the Problem

Proxy GRPC calls through dapr sidecar with high load

Possible related issue

#4937

@jsheetzmt jsheetzmt added the kind/bug Something isn't working label Jul 27, 2023
@yaron2
Copy link
Member

yaron2 commented Jul 27, 2023

So to be clear, in the following call chain: App 1 --> Dapr 1 -------> Dapr 2 --> App 2

You see connections being opened and not re-used in the path of Dapr 2 --> App 2?

@yaron2 yaron2 added this to the v1.12 milestone Jul 27, 2023
@jsheetzmt
Copy link
Author

Yep, that is exactly right @yaron2 but there is some connection reuse between Dapr 2 --> App 2. If I am doing a single request at a time in something like Postman, connection count stays at 1. As soon as I throw JMeter with 10 threads at it, connection count grows steadily, but it doesn't increase by 1 for every request.

@yaron2
Copy link
Member

yaron2 commented Jul 27, 2023

Does the connection count continue growing after you've finished opening the 10 threads with JMeter?

@jsheetzmt
Copy link
Author

It stops growing but the connections remain open.

image

@yaron2
Copy link
Member

yaron2 commented Jul 27, 2023

Ok thanks a lot of the info, will check this out. Do you have any special connection max age/timeout settings on App 2's server?

@jsheetzmt
Copy link
Author

jsheetzmt commented Jul 27, 2023

I have this configured to keep alive the HTTP2 connection so it doesn't die while idle. I tried without this logic and still saw the same behavior with dapr sidecar --> App 2. This logic exists on App 1 (client).

private static readonly SocketsHttpHandler _handler = new SocketsHttpHandler
{
    PooledConnectionIdleTimeout = Timeout.InfiniteTimeSpan,
    KeepAlivePingDelay = TimeSpan.FromSeconds(60),
    KeepAlivePingTimeout = TimeSpan.FromSeconds(30),
    EnableMultipleHttp2Connections = true
};

@yaron2
Copy link
Member

yaron2 commented Aug 1, 2023

I believe i'm able to reproduce this locally.. tracking this for 1.12 @jsheetzmt.

@jsheetzmt
Copy link
Author

Awesome, thanks for the update @yaron2!

@yaron2
Copy link
Member

yaron2 commented Aug 2, 2023

@jsheetzmt I've made some progress and am able to correlate new connection creation to the number of parallel requests coming in to the app. Can you provide more details on the concurrency parameters for your JMeter test? You mentioned 10 threads, but how many requests are sent in total?

@jsheetzmt
Copy link
Author

With the 10 threads, I was just letting it run indefinitely and would see a new connection every 100 requests or so. Each thread sends one request to the app, gets the response and then repeats the test.

@yaron2
Copy link
Member

yaron2 commented Aug 2, 2023

And how long would it take to for 100 requests to come back?

@jsheetzmt
Copy link
Author

Just ran a fresh test. One thing I forgot is that App 1 makes two requests to App 2.

  1. Request comes in to App 1
  2. App 1 makes call to App 2
  3. Based on that response, App 1 makes second request to App 2
  4. App 1 returns response to client

If I run the test with 10 threads and loop count of 10, this is what I see in dotnet-counters:
image

In total it takes about 5 seconds to run the JMeter test.
image

@yaron2
Copy link
Member

yaron2 commented Aug 2, 2023

@jsheetzmt it looks like you're seeing a new connection opening up due to this setting:

const grpcMaxConcurrentStreams = 100
.

Essentially Dapr would open a new connection if the current one holds 100 streams in a Dapr to App connection.

@jsheetzmt
Copy link
Author

Interesting. Do you know why the sidecar is opening a new stream and not reusing an existing one?

@jsheetzmt
Copy link
Author

When I change App 1's URL to go to App 2 directly (not dapr sidecar), there's only a single connection open on App 2.

@yaron2
Copy link
Member

yaron2 commented Aug 2, 2023

It looks like we're treating every request as a stream when incrementing the counter. cc @ItalyPaleAle to chime in on that.

@jsheetzmt
Copy link
Author

@yaron2 Would you expect connections to close after a certain period of a time? Right now, it seems like connections ramp up and never ramp down. Would always like a min connection of 1 to keep the connection warm

@yaron2
Copy link
Member

yaron2 commented Aug 2, 2023

@yaron2 Would you expect connections to close after a certain period of a time? Right now, it seems like connections ramp up and never ramp down. Would always like a min connection of 1 to keep the connection warm

We have a max connection age of 3 minutes. If there's keep alive enabled on your app server, it might mess with that.

@jsheetzmt
Copy link
Author

Okay, I might have to play around with that some more. We have keep alive on our clients but not the server. Even when I disable the client's keep alive, I am not seeing total connections drop back down.

@ItalyPaleAle
Copy link
Contributor

It looks like we're treating every request as a stream when incrementing the counter. cc @ItalyPaleAle to chime in on that.

In the context of grpcMaxConcurrentStreams, "stream" refers to HTTP/2 streams, not to gRPC streaming RPC. Each gRPC RPC uses a HTTP/2 stream underlying (even unary RPCs use a stream in HTTP/2)

@yaron2
Copy link
Member

yaron2 commented Aug 2, 2023

So the next question is why does Dapr dial a new connection to talk to the app every for 100 RPCs per connection? When a Dapr SDK talks to Dapr via gRPC, or when users open native gRPC clients, or even when users use gRPC without Dapr, this logic isn't present.

@ItalyPaleAle
Copy link
Contributor

ItalyPaleAle commented Aug 2, 2023

@yaron2

It's been a while so I don't recall 100% of details but from memory, and from some refreshing Google searches, it was probably related to:

First. The need to ensure better load balancing across multiple apps. Before we were closing connections right after they were used, so each RPC was creating a new connection so it was connecting to (possibly) a different sidecar. If we keep connections open, the worry was that it may not load balance. However, in hindsight I wonder if this was some sort of premature optimization and maybe the gRPC SDK does that already?

Second. From the gRPC docs

image

Related issue: grpc/grpc#21386

As for why 100, that comes from RFC7540 (the HTTP/2 specs) directly: https://httpwg.org/specs/rfc7540.html#SETTINGS_MAX_CONCURRENT_STREAMS

@yaron2
Copy link
Member

yaron2 commented Aug 2, 2023

However, in hindsight I wonder if this was some sort of premature optimization and maybe the gRPC SDK does that already?

This should be done by gRPC based on our dns:// schema in combination with the max connection age (which should most probably be reduced from 3m to a much lower value).

The 2nd point of the gRPC docs is relevant, however I can confirm (and also @jsheetzmt) that if you open a client/server dial, you do not see additional TCP connection beyond 100 requests.

My conclusion is this: if load balancing isn't affected, we should default to remove this logic by default and make it configurable to allow users to handle special cases that may suffer due to queuing.

@ItalyPaleAle
Copy link
Contributor

This should be done by gRPC based on our dns:// schema in combination with the max connection age (which should most probably be reduced from 3m to a much lower value).

Yes, using dns:// causes gRPC to round-robin when establishing a connection. I am just not sure what happens if the connection is not closed every time (like it used to in the past): will it still send each request to a different app?

Some testing may help figuring out if the additional pooling we're doing is needed or not.

@yaron2
Copy link
Member

yaron2 commented Aug 2, 2023

Some testing may help figuring out if the additional pooling we're doing is needed or not.

Indeed, but for Dapr to app channel where there's expected to be a single endpoint for the app (most of the time, anyway), this should likely skip this logic by default and be made configurable. Testing is definitely needed to see if this is also the case for Dapr to Dapr communication.

@jsheetzmt
Copy link
Author

jsheetzmt commented Aug 2, 2023

We are using Dapr in Azure Container Apps, and we will have scaling so that we spin up additional replicas during load and ideally that traffic is load balanced across the replicas. I would expect each replica to have a connection open to Dapr sidecar, and the side car is still doing round-robin

@ItalyPaleAle
Copy link
Contributor

Indeed, but for Dapr to app channel where there's expected to be a single endpoint for the app (most of the time, anyway), this should likely skip this logic by default and be made configurable.

Oh, yes. I was primarily fixated on sidecar-to-sidecar. sidecar-to-app can be different

@yaron2
Copy link
Member

yaron2 commented Aug 2, 2023

We are using Dapr in Azure Container Apps, and we will have scaling so that we spin up additional replicas during load and ideally that traffic is load balanced across the replicas. I would expect each replica to have a connection open to Dapr sidecar, and the side car is still doing round-robin

That is a valid expectation

@jsheetzmt
Copy link
Author

@yaron2 Pulled down the nightly and things are looking much, much better now. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something isn't working P1
Projects
No open projects
Development

Successfully merging a pull request may close this issue.

4 participants