
THRIFT-5240: Do connectivity check in Go server #2190

Merged
fishy merged 1 commit into apache:master from the go-server-connectivity-check branch on Jul 1, 2020

Conversation

@fishy (Member) commented Jun 27, 2020

Client: go

In compiler-generated TProcessorFunction implementations, add a goroutine after reading the request to do a connectivity check on the input transport. If the transport is no longer open, cancel the context object passed into the handler implementation.

Also define the ErrAbandonRequest error, to help TSimpleServer close client connections that are already closed on the other end.
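
To illustrate the idea, here is a rough sketch of what such a check could look like. This is not the actual compiler-generated code; the helper name watchConnectivity and the IsOpen-based check are simplifications of what the generated TProcessorFunction does:

```go
package example

import (
	"context"
	"time"

	"github.com/apache/thrift/lib/go/thrift"
)

// watchConnectivity polls the input transport after the request has been
// read and cancels the handler's context once the client has gone away.
// A zero or negative interval disables the check.
func watchConnectivity(ctx context.Context, cancel context.CancelFunc, in thrift.TTransport, interval time.Duration) {
	if interval <= 0 {
		return
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			// Handler finished (or the context was cancelled elsewhere).
			return
		case <-ticker.C:
			if !in.IsOpen() {
				// Client disconnected: cancel the context passed to the
				// handler so it can abandon the request early.
				cancel()
				return
			}
		}
	}
}
```

The generated code would start something like this in its own goroutine (go watchConnectivity(...)) right after reading the request arguments and before invoking the handler, which is where the extra goroutine per request comes from.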

  • Did you create an Apache Jira ticket? (not required for trivial changes)
  • If a ticket exists: Does your pull request title follow the pattern "THRIFT-NNNN: describe my issue"?
  • Did you squash your changes to a single commit? (not required, but preferred)
  • Did you do your best to avoid breaking changes? If one was needed, did you label the Jira ticket with "Breaking-Change"?
  • If your change does not involve any code, include [skip ci] anywhere in the commit message to free up build resources.

@fishy fishy force-pushed the go-server-connectivity-check branch 6 times, most recently from 1f2cbb0 to 8b4e17a, on June 28, 2020 at 18:29
@fishy fishy marked this pull request as ready for review June 28, 2020 19:38
@fishy fishy requested a review from dcelasun June 28, 2020 19:39
@dcelasun (Member) commented Jul 1, 2020

Not so sure about this one:

  • We are spawning multiple additional goroutines per incoming request (the one you added and the one internal to the ticker).
  • The ticker is set to 1ms, which seems like it would cause a lot of scheduling overhead at runtime. I vaguely remember Go having some optimization for short timers, but I'm not sure.

Did you test this on a production workload by any chance? I'd be especially interested to see its effects on latency.

@fishy (Member, Author) commented Jul 1, 2020

@dcelasun Currently our prod is still running pre-bcae3bb (we plan to deploy bcae3bb to prod next week). I've been running the prod version (old) vs. bcae3bb + this PR (new) side by side in our staging environment for a while now. We have two Go services, A & B, where A is also a client of B (and of another Java service). Here are some graphs for the last 12 hours (in all the graphs, cyan is new and green is old):

A's server side latency p99:
Screenshot 2020-07-01 at 07 47 13

A's server side latency p50:
Screenshot 2020-07-01 at 07 48 34

B's server side latency p99:
Screenshot 2020-07-01 at 07 46 18

B's server side latency p50:
Screenshot 2020-07-01 at 07 47 53

A's client side latency p99 for A->B requests:
Screenshot 2020-07-01 at 07 49 29

A's client side latency p50 for A->B requests:
Screenshot 2020-07-01 at 07 49 56

A's goroutines (this is a 10s ticker reading runtime.NumGoroutine() and reporting it as a gauge):
Screenshot 2020-07-01 at 07 43 52

B's goroutines:
Screenshot 2020-07-01 at 07 44 39

(in the p50 charts, the unit of the y-axis is ms)
So yes, this adds some overhead in goroutine count (more obvious on B, as it has fewer goroutines in general) and in latency (more obvious on B's server-side p50, where the gap is consistently ~0.5ms; less obvious in all the p99 charts). I think those overheads are acceptable?

I'll run bcae3bb vs. this PR now and post results a few hours later.

@dcelasun (Member) commented Jul 1, 2020

Thanks for the graphs! Eyeballing the p50 latency, it seems like a ~15% increase? That's a bit worse than I expected. If it's not too much to ask, can you also test the effect of different GOMAXPROCS values? I wonder if reduced concurrency would help here.

Also, what's your Go version? I've found the ticker optimization I was talking about, and it should be available with 1.14+.

@fishy (Member, Author) commented Jul 1, 2020

We are running Go 1.14.4.

I think it's more like a constant ~0.5ms overhead than 15% overhead :) If someone is running a really low-latency Go Thrift server, they can also disable this feature by setting thrift.ServerConnectivityCheckInterval = 0 early in their main function.
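
For example, a minimal sketch of disabling the check; the server setup itself is elided, and only the package-level variable assignment is the relevant part:

```go
package main

import (
	"github.com/apache/thrift/lib/go/thrift"
)

func main() {
	// 0 disables the per-request connectivity check entirely;
	// a positive value is used as the polling interval.
	thrift.ServerConnectivityCheckInterval = 0

	// ... set up the processor, transports, and TSimpleServer as usual ...
}
```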

@fishy (Member, Author) commented Jul 1, 2020

Also, the charts' y-axes don't start from 0 (I know that's bad, but that's just the default behavior of the monitoring service we use).

@dcelasun (Member) commented Jul 1, 2020

I think it's more like a constant ~0.5ms overhead than 15% overhead :)

Ha, that's a better interpretation :)

also the charts' y-axis are not starting from 0

I know. I just picked a few random points from the p50 charts and saw a ~15% increase, but if it's actually a constant 0.5ms then that's no problem at all.

@fishy (Member, Author) commented Jul 1, 2020

Also, GOMAXPROCS is 7 on service A and 1 on service B.

@fishy (Member, Author) commented Jul 1, 2020

bcae3bb (old) vs. this PR (new) has now been running for 2+ hours; here are the 2-hour charts (as before, cyan is new and green is old).

Service A goroutines:
Screenshot 2020-07-01 at 10 15 01

Service B goroutines:
Screenshot 2020-07-01 at 10 15 29

Service A server side latency p50:
Screenshot 2020-07-01 at 10 16 37

Service A server side latency p99:
Screenshot 2020-07-01 at 10 17 25

Service B server side latency p50:
Screenshot 2020-07-01 at 10 18 09

Service B server side latency p99 (this one is weird: the old version actually hits the 40ms timeout a lot, probably just bad luck?):
Screenshot 2020-07-01 at 10 19 09

A->B client side latency p50:
Screenshot 2020-07-01 at 10 20 02

A->B client side latency p99 (note that it's not always old A hitting old B and new A hitting new B; both versions of A and B are in their respective pools, so any A->B request can hit either version of B with roughly equal chance):
Screenshot 2020-07-01 at 10 20 23

The most notable gap is in B's server-side p50 latency, which can be as large as 1ms (that could be read as a ~50% increase, depending on how you calculate it). I think we can just document that (probably in lib/go/README.md?), and also document how to disable this feature in case that 1ms matters to someone?

@dcelasun (Member) commented Jul 1, 2020

OK this looks reasonable I think.

I think we can just document that (probably in lib/go/README.md?)

Yeah, just a note about how this feature will slightly increase latency and goroutine count, but can be disabled with ServerConnectivityCheckInterval=0.

@fishy fishy force-pushed the go-server-connectivity-check branch from 8b4e17a to 980c7d7 on July 1, 2020 at 17:35
@fishy (Member, Author) commented Jul 1, 2020

README updated.

@fishy fishy force-pushed the go-server-connectivity-check branch from 980c7d7 to 07dc2fe on July 1, 2020 at 17:36
@fishy fishy merged commit 4db7a0a into apache:master Jul 1, 2020
@fishy fishy deleted the go-server-connectivity-check branch July 1, 2020 18:50
@zerosnake0 commented Nov 5, 2020

Although this is a much-requested feature in my POV, I think there are some problems with this PR:

  1. For backward compatibility, this feature should not be enabled by default.
  2. The ticker-loop design consumes too much CPU, which is not good (as also mentioned by @dcelasun).
  3. The functionality should be encapsulated inside the transport interface (this would also address point 1).

I will open a new PR for this in a few days.

@fishy (Member, Author) commented Nov 5, 2020

@zerosnake0 The default interval was changed from 1ms to 5ms in #2256. In that PR we also discussed disabling it by default vs. using 5ms as the default. Did you try that version? Is it still using too much CPU in your case?
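
As a sketch, raising the interval instead of disabling it is also an option; the 20ms value below is just an illustrative trade-off between ticker wakeups and how quickly an abandoned connection is detected, not a recommendation:

```go
package main

import (
	"time"

	"github.com/apache/thrift/lib/go/thrift"
)

func main() {
	// Larger interval = fewer ticker wakeups per in-flight request,
	// but closed client connections are detected later.
	thrift.ServerConnectivityCheckInterval = 20 * time.Millisecond

	// ... set up the processor, transports, and server as usual ...
}
```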
