Investigate worker connection accept balance #4602

Closed · mattklein123 opened this issue Oct 4, 2018 · 25 comments · Fixed by #8422

@mattklein123
Member

See relevant envoy-dev thread: https://groups.google.com/forum/#!topic/envoy-dev/33QvlXyBinw

Some potential work here:

  • Profile the performance of different application loads against Envoy running on different numbers of cores.
  • Per-worker CPU pinning.
  • Adding EPOLLEXCLUSIVE and EPOLLROUNDROBIN support to libevent and consuming them in Envoy (a minimal epoll sketch follows at the end of this comment).
  • Adding per-worker stats for connection load, accept rate, etc.
  • Investigate Envoy built-in connection rebalancing between workers (e.g., allow a configurable skew between workers, but if some worker falls too far behind the most loaded worker, send new connections to the less loaded worker).

Note that I've tried variants of the above over the years and it's extremely difficult to beat the kernel at its own game while generically improving all workloads. With that said, this is a very interesting area to investigate, albeit an incredibly time-consuming one.
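
A minimal, illustrative epoll sketch for the EPOLLEXCLUSIVE item above (not Envoy or libevent code): each worker registers the shared listening socket in its own epoll instance with EPOLLEXCLUSIVE (Linux 4.5+) so that an incoming connection wakes at most one worker instead of all of them. EPOLLROUNDROBIN was only ever proposed and is not assumed here; the port number is arbitrary.

```cpp
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
  // Shared listening socket (in Envoy this would be the listener's socket).
  int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = htons(8080);  // arbitrary example port
  bind(listen_fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  listen(listen_fd, 128);

  // Each worker would run this registration plus its own event loop.
  int ep = epoll_create1(0);
  epoll_event ev{};
  ev.events = EPOLLIN | EPOLLEXCLUSIVE;  // wake at most one waiting worker
  ev.data.fd = listen_fd;
  epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);

  epoll_event ready[16];
  int n = epoll_wait(ep, ready, 16, 1000 /* ms */);
  for (int i = 0; i < n; ++i) {
    int conn = accept(ready[i].data.fd, nullptr, nullptr);
    if (conn >= 0) {
      close(conn);  // a real worker would hand the connection to its dispatcher
    }
  }
  close(ep);
  close(listen_fd);
  return 0;
}
```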

@mattklein123
Member Author

cc @tonya11en

@tonya11en
Member

This can be broken out into a bunch of different tasks, so we can probably have multiple assignees here. I'll volunteer myself to be the first.

An important first step will be to get per-worker stats, so that's where I'll begin working. From there we'll have a basis to begin a more thorough investigation into the performance impact of the items above.

@htuch
Member

htuch commented Nov 2, 2018

@tonya11en I'm currently working on per-worker stats for event loop duration. Have you started on any of the others? I can take some of these on as well.

@tonya11en
Member

@htuch I haven't started work on any of the others yet. Happy to help review.

@mattklein123
Member Author

FYI for those watching: I'm chasing an internal performance problem that may be related to this, so I will be working on it.

@mattklein123
Member Author

For anyone watching that cares about this issue, I have hacked up code that adds per-worker connection and watchdog miss stats and also re-balances active TCP connections across listeners. I plan on cleaning this up and upstreaming it, but I'm not sure exactly when I will get to it. If anyone urgently wants to work on this let me know. My WIP branch is here: https://github.com/envoyproxy/envoy/tree/cx_rebalance

@ramaraochavali
Contributor

@mattklein123 would you be able to share the symptoms of the internal performance problem that you think are related to this, if it is something that can be shared? The reason I am asking is that we are doing some perf tests of a low-latency gRPC service and we see that p99 latencies increase as the req/min increases, and at the same time we also see the p99 connect_time_ms to the upstream cluster increase. I was wondering whether it could be due to a high number of worker threads (we left the concurrency arg at its default) or whether it could be related to this specific issue? We haven't enabled the dispatcher stats yet. Any thoughts?

@mattklein123
Member Author

@ramaraochavali we are chasing a very similar issue. We have a gRPC service which has many downstream callers (so a large number of incoming mesh connections) and a single upstream callee. At P99 the latency to the upstream service is almost double what the upstream service thinks its downstream latency is.

I've been looking at this for days and I can't find any issue here in terms of blocking or anything other than straight-up load, but the delta is pretty large, like 30-40ms in some cases. The only theory I really have right now is that this is a straight-up load issue where Envoy and the service are context switching back and forth, Envoy has to deal with a large number of individual connections, and at P99 there are interleavings that just go bad. I'm still looking though.

@mattklein123
Member Author

I forgot to add that balancing the connections and increasing the workers does not seem to improve upstream P99 latency, which is why I think there may be scheduling contention with the service.

@ramaraochavali
Contributor

@mattklein123 Thanks a lot for the details. This is very useful information. I will share here if we find anything different from what you have found based on our tests/tweaks to the config.

@mattklein123
Member Author

@oschaaf see the above thread if you have time. Do we have any NH simulations in which we try to model this type of service mesh use case? The characteristics would be:

  • gRPC API with a very small request/response payload, so headers/trailers and header compression dominate the workload.
  • Sidecar pattern with a local service both handling the incoming API as well as making a chained outgoing egress API call.
  • Lots of incoming callers with persistent connections, but each connection is very low rate, so we get no real benefit from transport-level batching.

My suspicion here is that this is a pathological case in which we get a lot of context switching, have to switch/handle a lot of OS connections, etc. I think it's conceivable that there are event loop efficiency issues also but I'm not sure (e.g., the way that we defer writes by activating the write ready event vs. doing them inline).

cc @jmarantz our other perf guru

@jmarantz
Contributor

jmarantz commented Sep 2, 2019

I have some naive questions :)

I'm inferring from the discussion that the Envoy is sharing a host with its single upstream service process. That right?

How compute-intensive vs. I/O-bound is the service? How multi-threaded is the service? I know that Envoy by default will create ~1 worker per HW thread; is that how you have it set? If the service had (say) 5 threads, could you play games like subtracting 5, or have an Envoy option to use Max(ConfiguredMinimumThreadCount, NumHardwareThreads - ConfiguredCarveOut) or something like that?

What happens to perf if you give the upstream a dedicated machine on the same rack as the Envoy?
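
A tiny sketch of the carve-out idea from the question above; ConfiguredMinimumThreadCount and ConfiguredCarveOut are hypothetical names from the comment, not existing Envoy options.

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>

// Hypothetical worker-count policy: leave `carve_out` hardware threads for the
// co-located service, but never drop below a configured minimum.
uint32_t pickWorkerCount(uint32_t configured_minimum, uint32_t carve_out) {
  const uint32_t hw_threads = std::max(1u, std::thread::hardware_concurrency());
  const uint32_t available = hw_threads > carve_out ? hw_threads - carve_out : 1u;
  return std::max(configured_minimum, available);
}
```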

@mattklein123
Member Author

I'm inferring from the discussion that the Envoy is sharing a host with its single upstream service process. That right?

Envoy is a sidecar to the service (service mesh ingress/egress):

<Many downstreams> -> <Many downstream egress Envoy> -> <Ingress Envoy> -> <Service> -> <Egress Envoy (same as ingress Envoy)> -> Upstream Service

Running Envoy with more workers does not seem to appreciably improve the situation, and neither does listener connection balancing. I think there are lots of things to try, for example pinning the service to 2 cores and pinning Envoy to 2 cores, etc., but I haven't gone there yet. I've mainly been trying to figure out if anything in Envoy is blocking or taking a really long time, and I can't find anything.

@htuch
Member

htuch commented Sep 3, 2019

@mattklein123 another approach to tackling this kind of hot spotting is to just add a per-TCP connection timeout and do a graceful GOAWAY after this, e.g. every 10 minutes. This will cause the gRPC client to treat this TCP connection as drained and try another one. The new TCP connection will provide better stochastic balancing behavior.
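
A generic sketch of this max-connection-age idea (not Envoy's actual API; sendGoAway is a hypothetical hook): track when each HTTP/2 connection was accepted and, once it exceeds the configured age, start a graceful drain so the client opens a fresh connection that can land on a different worker.

```cpp
#include <chrono>

struct Http2ConnectionState {
  std::chrono::steady_clock::time_point accepted_at;
  bool draining = false;
};

// Called periodically from the owning worker's event loop.
template <typename SendGoAwayFn>
void maybeDrain(Http2ConnectionState& cx, std::chrono::minutes max_age,
                SendGoAwayFn send_goaway /* hypothetical GOAWAY hook */) {
  const auto age = std::chrono::steady_clock::now() - cx.accepted_at;
  if (!cx.draining && age >= max_age) {
    cx.draining = true;
    send_goaway();  // client treats the connection as drained and opens a new one
  }
}
```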

@mattklein123
Member Author

@mattklein123 another approach to tackling this kind of hot spotting is to just add a per-TCP connection timeout and do a graceful GOAWAY after this, e.g. every 10 minutes.

I agree this is a good approach in some cases, but not in every case, particularly the sidecar case in which the local service may only make a very small number of HTTP/2 connections to the sidecar. This is why I hacked up "perfect" CX balancing, though it didn't make much of a difference to our workload AFAICT.

The only thing that I still want to spend some time looking at (probably not until next week) is whether how we handle write events is somehow contributing to the problem we are seeing. The current code never writes inline. It always activates an event which then handles the writes. I wonder if there are some pathological cases in which a bunch of write events get delayed to a single event loop iteration and then take a large amount of time. I don't see anything obvious here but it's worth a look.

I will also work on upstreaming the new stats I added as well as the CX balancing code which I do think is useful for certain cases.
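
For readers unfamiliar with the write deferral being described, here is a minimal libevent sketch (not Envoy code) contrasting an inline write with a write that is deferred by manually activating a write event, so it runs on a later event-loop iteration, possibly batched with other activated events.

```cpp
#include <event2/event.h>
#include <unistd.h>
#include <cstring>

// Runs inside the event loop when the activated write event is serviced.
static void write_cb(evutil_socket_t fd, short /*events*/, void* /*arg*/) {
  const char msg[] = "deferred write\n";
  write(fd, msg, strlen(msg));
}

int main() {
  event_base* base = event_base_new();

  // Strategy 1: write inline, right where the data became ready.
  const char inline_msg[] = "inline write\n";
  write(STDOUT_FILENO, inline_msg, strlen(inline_msg));

  // Strategy 2: defer by activating a write event; the write itself happens
  // when the loop services the activated event.
  event* wev = event_new(base, STDOUT_FILENO, EV_WRITE, write_cb, nullptr);
  event_active(wev, EV_WRITE, 0);
  event_base_loop(base, EVLOOP_ONCE);

  event_free(wev);
  event_base_free(base);
  return 0;
}
```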

@davidkilliansc

re-balances active TCP connections across listeners

In our case, I doubt we'd need any rebalancing of active connections if the mapping of brand new connections to Envoy threads was done round-robin or least-conns. In our cases, the many incoming connections have even request load across them, yet most of our Envoy threads appear to have virtually zero connections assigned to them. Rebalancing active connections would reactively detect and fix this form of imbalance, but there'd still be the initial period of large imbalance we'd have to work around (e.g. by over-scaling or eating bad tail latencies during deployments or scale ups). In a sense, auto re-balancing further hides this initial connection imbalance. Thoughts on starting with just better predictability/control over the mapping of new connections to worker threads?

@mattklein123
Member Author

Thoughts on starting with just better predictability/control over the mapping of new connections to worker threads?

Sorry, this is what I actually implemented, not rebalancing once accepted. Envoy will likely never support the latter given the architecture.
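
For intuition, a rough sketch (not the actual #8422 implementation) of balancing at accept time: hand each newly accepted connection to the worker with the fewest active connections.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

class LeastConnsBalancer {
public:
  explicit LeastConnsBalancer(size_t num_workers) : active_(num_workers, 0) {}

  // Called when a new connection is accepted: pick the least-loaded worker.
  size_t pickWorker() {
    std::lock_guard<std::mutex> lock(mu_);
    size_t best = 0;
    for (size_t i = 1; i < active_.size(); ++i) {
      if (active_[i] < active_[best]) {
        best = i;
      }
    }
    ++active_[best];
    return best;
  }

  // Called when a worker tears a connection down.
  void onConnectionClosed(size_t worker) {
    std::lock_guard<std::mutex> lock(mu_);
    --active_[worker];
  }

private:
  std::mutex mu_;
  std::vector<uint64_t> active_;
};
```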

@davidkilliansc

I think it's conceivable that there are event loop efficiency issues also but I'm not sure (e.g., the way that we defer writes by activating the write ready event vs. doing them inline).

@mattklein123 Is there visibility into event loop queue depth and/or time spent in queue? Could the hidden latency be time spent sitting in the event loop queue, e.g. if the processing on that thread was backed up? Not sure I'm describing this in a way that even makes sense w/ my limited understanding of Envoy :)

@mattklein123
Member Author

@mattklein123 Is there visibility into event loop queue depth and/or time spent in queue?

Not currently, though this would be interesting to track potentially. cc @mergeconflict @jmarantz @oschaaf
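
One rough way to approximate "time spent in queue" (a sketch, not Envoy's dispatcher stats): timestamp when work is posted to the loop and measure the delay until the loop services it.

```cpp
#include <event2/event.h>
#include <chrono>
#include <cstdio>

using Clock = std::chrono::steady_clock;

struct Posted {
  Clock::time_point posted_at;
};

static void serviced_cb(evutil_socket_t /*fd*/, short /*events*/, void* arg) {
  const auto* p = static_cast<const Posted*>(arg);
  const auto waited = std::chrono::duration_cast<std::chrono::microseconds>(
      Clock::now() - p->posted_at);
  std::printf("posted work waited %lld us before being serviced\n",
              static_cast<long long>(waited.count()));
}

int main() {
  event_base* base = event_base_new();
  Posted posted{Clock::now()};
  event* ev = event_new(base, -1, 0, serviced_cb, &posted);
  event_active(ev, 0, 0);              // "enqueue" the work
  event_base_loop(base, EVLOOP_ONCE);  // delay until the callback runs ~= queue time
  event_free(ev);
  event_base_free(base);
  return 0;
}
```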

@ahmedelbazsc

Thanks @mattklein123 for the details here. A few points to clarify:

Do you expect the bottleneck originates in the egress path [service -> egress Envoy -> upstream service], or do we see the same behavior with a direct response from the service even when no upstream is involved? If the former, I'm curious whether there is a single persistent connection between the upstream callee (service) and its local egress Envoy, or a pool of connections that are getting rebalanced across workers with your patch?

In our similar example we needed to leverage a pool of connections which is regularly refreshed every few minutes (as opposed to a single persistent HTTP/2 connection). With your connection (re-)balancing change in place we should no longer have to perform regular connection recycling, though we would still need to maintain a pool of connections from the service -> Envoy to leverage multiple worker threads.

In line with the point raised by @htuch, I think it would be valuable for certain scenarios to expose Envoy cluster options for connection age/time-to-live. In addition to the scenario mentioned above, we have also considered it as a way to rebalance Envoy's persistent bidi streams to the xDS control plane in the mesh, especially on xDS server scale-outs.

mattklein123 added a commit that referenced this issue Sep 17, 2019
This PR does a few things:
1) Adds per-worker listener stats, useful for viewing worker
   connection imbalance.
2) Adds per-worker watchdog miss stats, useful for viewing per
   worker event loop latency.
3) Misc connection handling cleanups.

Part of #4602

Signed-off-by: Matt Klein <mklein@lyft.com>
mattklein123 added a commit that referenced this issue Sep 29, 2019
This commit introduces optional connection rebalancing
for TCP listeners, targeted as cases where there are a
small number of long lived connections such as service
mesh HTTP2/gRPC egress.

Part of this change involved tracking connection counts
at the per-listener level, which made it clear that we
have quite a bit of tech debt in some of our interfaces
in this area. I did various cleanups in service of this
change which leave the connection handler / accept path
in a cleaner state.

Fixes #4602

Signed-off-by: Matt Klein <mklein@lyft.com>
@mattklein123
Member Author

For those watching this issue, I have a PR up (#8422) which adds configurable connection balancing for TCP listeners. Please try it out.

@rgs1
Member

rgs1 commented Oct 8, 2019

FWIW, we started running #8263 in prod and things look pretty imbalanced:

listener.0.0.0.0_443.downstream_cx_active: 5711
listener.0.0.0.0_443.downstream_cx_destroy: 895161
listener.0.0.0.0_443.downstream_cx_total: 895699
listener.0.0.0.0_443.downstream_pre_cx_active: 0
listener.0.0.0.0_443.downstream_pre_cx_timeout: 0
listener.0.0.0.0_443.worker_0.downstream_cx_active: 1
listener.0.0.0.0_443.worker_0.downstream_cx_total: 221
listener.0.0.0.0_443.worker_1.downstream_cx_active: 1
listener.0.0.0.0_443.worker_1.downstream_cx_total: 213
listener.0.0.0.0_443.worker_10.downstream_cx_active: 5
listener.0.0.0.0_443.worker_10.downstream_cx_total: 215
listener.0.0.0.0_443.worker_11.downstream_cx_active: 4
listener.0.0.0.0_443.worker_11.downstream_cx_total: 237
listener.0.0.0.0_443.worker_12.downstream_cx_active: 2
listener.0.0.0.0_443.worker_12.downstream_cx_total: 208
listener.0.0.0.0_443.worker_13.downstream_cx_active: 6
listener.0.0.0.0_443.worker_13.downstream_cx_total: 229
listener.0.0.0.0_443.worker_14.downstream_cx_active: 2
listener.0.0.0.0_443.worker_14.downstream_cx_total: 212
listener.0.0.0.0_443.worker_15.downstream_cx_active: 2
listener.0.0.0.0_443.worker_15.downstream_cx_total: 243
listener.0.0.0.0_443.worker_16.downstream_cx_active: 2
listener.0.0.0.0_443.worker_16.downstream_cx_total: 181
listener.0.0.0.0_443.worker_17.downstream_cx_active: 2
listener.0.0.0.0_443.worker_17.downstream_cx_total: 216
listener.0.0.0.0_443.worker_18.downstream_cx_active: 4
listener.0.0.0.0_443.worker_18.downstream_cx_total: 185
listener.0.0.0.0_443.worker_19.downstream_cx_active: 7
listener.0.0.0.0_443.worker_19.downstream_cx_total: 173
listener.0.0.0.0_443.worker_2.downstream_cx_active: 1
listener.0.0.0.0_443.worker_2.downstream_cx_total: 235
listener.0.0.0.0_443.worker_20.downstream_cx_active: 14
listener.0.0.0.0_443.worker_20.downstream_cx_total: 1105
listener.0.0.0.0_443.worker_21.downstream_cx_active: 4
listener.0.0.0.0_443.worker_21.downstream_cx_total: 189
listener.0.0.0.0_443.worker_22.downstream_cx_active: 4
listener.0.0.0.0_443.worker_22.downstream_cx_total: 185
listener.0.0.0.0_443.worker_23.downstream_cx_active: 2
listener.0.0.0.0_443.worker_23.downstream_cx_total: 166
listener.0.0.0.0_443.worker_24.downstream_cx_active: 1
listener.0.0.0.0_443.worker_24.downstream_cx_total: 235
listener.0.0.0.0_443.worker_25.downstream_cx_active: 49
listener.0.0.0.0_443.worker_25.downstream_cx_total: 4443
listener.0.0.0.0_443.worker_26.downstream_cx_active: 6
listener.0.0.0.0_443.worker_26.downstream_cx_total: 404
listener.0.0.0.0_443.worker_27.downstream_cx_active: 140
listener.0.0.0.0_443.worker_27.downstream_cx_total: 12694
listener.0.0.0.0_443.worker_28.downstream_cx_active: 438
listener.0.0.0.0_443.worker_28.downstream_cx_total: 60326
listener.0.0.0.0_443.worker_29.downstream_cx_active: 259
listener.0.0.0.0_443.worker_29.downstream_cx_total: 33042
listener.0.0.0.0_443.worker_3.downstream_cx_active: 1
listener.0.0.0.0_443.worker_3.downstream_cx_total: 212
listener.0.0.0.0_443.worker_30.downstream_cx_active: 931
listener.0.0.0.0_443.worker_30.downstream_cx_total: 157107
listener.0.0.0.0_443.worker_31.downstream_cx_active: 573
listener.0.0.0.0_443.worker_31.downstream_cx_total: 89898
listener.0.0.0.0_443.worker_32.downstream_cx_active: 723
listener.0.0.0.0_443.worker_32.downstream_cx_total: 115034
listener.0.0.0.0_443.worker_33.downstream_cx_active: 796
listener.0.0.0.0_443.worker_33.downstream_cx_total: 133812
listener.0.0.0.0_443.worker_34.downstream_cx_active: 913
listener.0.0.0.0_443.worker_34.downstream_cx_total: 151900
listener.0.0.0.0_443.worker_35.downstream_cx_active: 798
listener.0.0.0.0_443.worker_35.downstream_cx_total: 130669
listener.0.0.0.0_443.worker_4.downstream_cx_active: 2
listener.0.0.0.0_443.worker_4.downstream_cx_total: 308
listener.0.0.0.0_443.worker_5.downstream_cx_active: 8
listener.0.0.0.0_443.worker_5.downstream_cx_total: 238
listener.0.0.0.0_443.worker_6.downstream_cx_active: 1
listener.0.0.0.0_443.worker_6.downstream_cx_total: 250
listener.0.0.0.0_443.worker_7.downstream_cx_active: 0
listener.0.0.0.0_443.worker_7.downstream_cx_total: 197
listener.0.0.0.0_443.worker_8.downstream_cx_active: 6
listener.0.0.0.0_443.worker_8.downstream_cx_total: 289
listener.0.0.0.0_443.worker_9.downstream_cx_active: 3
listener.0.0.0.0_443.worker_9.downstream_cx_total: 228

Are others seeing similar natural imbalances?

@mattklein123
Member Author

Are others seeing similar natural imbalances?

This doesn't surprise me at all. The kernel will generally keep things on the same thread if it thinks that it can avoid context switches, etc. Over the years I have come to realize that in almost all cases the kernel knows what it is doing, but for those small cases where that does not hold true, please try #8422. :)

@antoniovicente
Contributor

Have we considered using SO_REUSEPORT on listeners in order to distribute incoming connections to accept queues via hashing? Granted, this socket option's behavior may be specific to Linux.

@mattklein123
Member Author

Have we considered use of SO_REUSEPORT on listeners in order to distribute incoming connections to accept queues via hashing? Granted, this socket option behavior may be specific to Linux.

Some people already configure SO_REUSEPORT, at least across processes, and there has been discussion about allowing this within the process as well so that each worker has its own FD. cc @euroelessar. With that said, from previous experience, this still won't fix the case that I just fixed where you have a tiny number of connections and want to make sure they get spread evenly.
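
For reference, a Linux-specific sketch of the SO_REUSEPORT approach discussed above (not Envoy code): each worker binds its own listening socket on the same port, and the kernel hashes incoming connections across the per-worker accept queues.

```cpp
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>

// Each worker would call this and then accept() on its own fd.
int makeWorkerListener(uint16_t port) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  int on = 1;
  // Set before bind(); all workers can then bind the same address/port and
  // each gets its own accept queue.
  setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on));

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = htons(port);
  bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  listen(fd, 128);
  return fd;
}
```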
