
THRIFT-5240: Do connectivity check in Go server #2190

Merged
fishy merged 1 commit into apache:master from the go-server-connectivity-check branch on Jul 1, 2020

Conversation

@fishy (Member) commented Jun 27, 2020

Client: go

In compiler-generated TProcessorFunction implementations, add a goroutine after reading the request to do a connectivity check on the input transport. If the transport is no longer open, cancel the context object passed into the handler implementation.

Also define the ErrAbandonRequest error, to help TSimpleServer close client connections that are already closed on the other end.
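
To illustrate the idea, here is a rough sketch of what such a check could look like. This is not the actual compiler-generated code; the helper name watchConnectivity and the IsOpen-based check are simplifications of what the generated TProcessorFunction does:

```go
package example

import (
	"context"
	"time"

	"github.com/apache/thrift/lib/go/thrift"
)

// watchConnectivity polls the input transport after the request has been
// read and cancels the handler's context once the client has gone away.
// A zero or negative interval disables the check.
func watchConnectivity(ctx context.Context, cancel context.CancelFunc, in thrift.TTransport, interval time.Duration) {
	if interval <= 0 {
		return
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			// Handler finished (or the context was cancelled elsewhere).
			return
		case <-ticker.C:
			if !in.IsOpen() {
				// Client disconnected: cancel the context passed to the
				// handler so it can abandon the request early.
				cancel()
				return
			}
		}
	}
}
```

The generated code would start something like this in its own goroutine (go watchConnectivity(...)) right after reading the request arguments and before invoking the handler, which is where the extra goroutine per request comes from.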

  • Did you create an Apache Jira ticket? (not required for trivial changes)
  • If a ticket exists: Does your pull request title follow the pattern "THRIFT-NNNN: describe my issue"?
  • Did you squash your changes to a single commit? (not required, but preferred)
  • Did you do your best to avoid breaking changes? If one was needed, did you label the Jira ticket with "Breaking-Change"?
  • If your change does not involve any code, include [skip ci] anywhere in the commit message to free up build resources.

@fishy fishy force-pushed the go-server-connectivity-check branch 6 times, most recently from 1f2cbb0 to 8b4e17a, on June 28, 2020 at 18:29
@fishy fishy marked this pull request as ready for review June 28, 2020 19:38
@fishy fishy requested a review from dcelasun June 28, 2020 19:39
@dcelasun (Member) commented Jul 1, 2020

Not so sure about this one:

  • We are spawning multiple additional goroutines per incoming request (the one you added and the one internal to the ticker).
  • The ticker is set to 1ms, which seems like it would cause a lot of scheduling overhead at runtime. I vaguely remember Go having some optimization for short timers, but I'm not sure.

Did you test this on a production workload by any chance? I'd be especially interested to see its effects on latency.

@fishy (Member, Author) commented Jul 1, 2020

@dcelasun Currently our prod is still running pre-bcae3bb (we plan to deploy bcae3bb to prod next week). I've been running the prod version (old) vs. bcae3bb + this PR (new) side by side in our staging environment for a while now. We have two Go services, A & B, where A is also a client of B (and of another Java service). Here are some graphs for the last 12 hours (in all the graphs, cyan is new and green is old):

A's server side latency p99:
Screenshot 2020-07-01 at 07 47 13

A's server side latency p50:
Screenshot 2020-07-01 at 07 48 34

B's server side latency p99:
Screenshot 2020-07-01 at 07 46 18

B's server side latency p50:
Screenshot 2020-07-01 at 07 47 53

A's client side latency p99 for A->B requests:
Screenshot 2020-07-01 at 07 49 29

A's client side latency p50 for A->B requests:
Screenshot 2020-07-01 at 07 49 56

A's goroutines (this is a 10s ticker reading runtime.NumGoroutine() and reporting it as a gauge):
Screenshot 2020-07-01 at 07 43 52

B's goroutines:
Screenshot 2020-07-01 at 07 44 39

(in the p50 charts, the unit of the y-axis is ms)
So yes, this adds some overhead in goroutine count (more obvious on B, as it has fewer goroutines in general) and in latency (more obvious on B's server-side p50, where the gap is consistently ~0.5ms; less obvious in all the p99 charts). I think those overheads are acceptable?

I'll run bcae3bb vs. this PR now and post results a few hours later.

@dcelasun (Member) commented Jul 1, 2020

Thanks for the graphs! Eyeballing the p50 latency, it seems like a ~15% increase? That's a bit worse than I expected. If it's not too much to ask, can you also test the effect of different GOMAXPROCS values? I wonder if reduced concurrency would help here.

Also, what's your Go version? I've found the ticker optimization I was talking about, and it should be available with 1.14+.

@fishy (Member, Author) commented Jul 1, 2020

We are running Go 1.14.4.

I think it's more like a constant ~0.5ms overhead than 15% overhead :) If someone is running a really low-latency Go Thrift server, they can also disable this feature by setting thrift.ServerConnectivityCheckInterval = 0 early in their main function.
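
For example, a minimal sketch of disabling the check; the server setup itself is elided, and only the package-level variable assignment is the relevant part:

```go
package main

import (
	"github.com/apache/thrift/lib/go/thrift"
)

func main() {
	// 0 disables the per-request connectivity check entirely;
	// a positive value is used as the polling interval.
	thrift.ServerConnectivityCheckInterval = 0

	// ... set up the processor, transports, and TSimpleServer as usual ...
}
```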

@fishy (Member, Author) commented Jul 1, 2020

Also, the charts' y-axes don't start from 0 (I know that's bad, but that's just the default behavior of the monitoring service we use).

@dcelasun (Member) commented Jul 1, 2020

I think it's more like a constant ~0.5ms overhead than 15% overhead :)

Ha, that's a better interpretation :)

also the charts' y-axis are not starting from 0

I know. I just picked a few random points from the p50 charts and saw a ~15% increase, but if it's actually a constant 0.5ms then that's no problem at all.

@fishy (Member, Author) commented Jul 1, 2020

Also, GOMAXPROCS is 7 on service A and 1 on service B.

@fishy (Member, Author) commented Jul 1, 2020

bcae3bb (old) vs. this PR (new) has now been running for 2+ hours; here are the 2-hour charts (as before, cyan is new and green is old).

Service A goroutines:
Screenshot 2020-07-01 at 10 15 01

Service B goroutines:
Screenshot 2020-07-01 at 10 15 29

Service A server side latency p50:
Screenshot 2020-07-01 at 10 16 37

Service A server side latency p99:
Screenshot 2020-07-01 at 10 17 25

Service B server side latency p50:
Screenshot 2020-07-01 at 10 18 09

Service B server side latency p99 (this one is weird: the old version actually hits the 40ms timeout a lot, probably just bad luck?):
Screenshot 2020-07-01 at 10 19 09

A->B client side latency p50:
Screenshot 2020-07-01 at 10 20 02

A->B client side latency p99 (note that it's not always old A hitting old B and new A hitting new B; both versions of A and B are in their respective pools, so any A->B request can hit either version of B with roughly equal chance):
Screenshot 2020-07-01 at 10 20 23

The most notable gap is in B's server-side p50 latency, which can be as large as 1ms (that could be read as a ~50% increase, depending on how you calculate it). I think we can just document that (probably in lib/go/README.md?), and also document how to disable this feature in case that 1ms matters to someone?

@dcelasun (Member) commented Jul 1, 2020

OK this looks reasonable I think.

I think we can just document that (probably in lib/go/README.md?)

Yeah, just a note about how this feature will slightly increase latency and goroutine count, but can be disabled with ServerConnectivityCheckInterval=0.

@fishy fishy force-pushed the go-server-connectivity-check branch from 8b4e17a to 980c7d7 on July 1, 2020 at 17:35
@fishy (Member, Author) commented Jul 1, 2020

README updated.

@fishy fishy force-pushed the go-server-connectivity-check branch from 980c7d7 to 07dc2fe on July 1, 2020 at 17:36
@fishy fishy merged commit 4db7a0a into apache:master Jul 1, 2020
@fishy fishy deleted the go-server-connectivity-check branch July 1, 2020 18:50
@zerosnake0 commented Nov 5, 2020

Although this is a much-requested feature in my POV, I think there are some problems with this PR:

  1. For backward compatibility, this feature should not be enabled by default.
  2. The ticker-loop design consumes too much CPU, which is not good (as also mentioned by @dcelasun).
  3. The functionality should be encapsulated inside the transport interface (this would also address point 1).

I will open a new PR for this in a few days.

@fishy (Member, Author) commented Nov 5, 2020

@zerosnake0 The default interval was changed from 1ms to 5ms in #2256. In that PR we also discussed disabling it by default vs. using 5ms as the default. Did you try that version? Is it still using too much CPU in your case?
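
As a sketch, raising the interval instead of disabling it is also an option; the 20ms value below is just an illustrative trade-off between ticker wakeups and how quickly an abandoned connection is detected, not a recommendation:

```go
package main

import (
	"time"

	"github.com/apache/thrift/lib/go/thrift"
)

func main() {
	// Larger interval = fewer ticker wakeups per in-flight request,
	// but closed client connections are detected later.
	thrift.ServerConnectivityCheckInterval = 20 * time.Millisecond

	// ... set up the processor, transports, and server as usual ...
}
```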
