tpccbench: use HAProxy when running chaos benchmarks by nvb · Pull Request #26075 · cockroachdb/cockroach

nvb · 2018-05-25T01:13:13Z

This change adjusts tpccbench to run an HAProxy server on the load
generator when running chaos benchmarking. HAProxy is then configured
to point at the cluster and the load generator is pointed at HAProxy.
This allows the load generator to fail-over when nodes go down without
any extra client-side logic.

It also has the nice effect of being the first roachtest to use HAProxy,
a tool that we recommend our users use.

This still needs some tweaking. For instance, I still need to tune the
haproxy.cfg to support a larger number of concurrent requests and to
fail over faster.

Release note: None

cockroach-teamcity · 2018-05-25T01:13:19Z

This change is

petermattis · 2018-05-25T13:49:04Z

though as I mentioned in person yesterday, I'll be surprised if this is significantly better than workload's builtin round-robin driver.

Review status: 0 of 2 files reviewed at latest revision, all discussions resolved, some commit checks failed.

Comments from Reviewable

nvb · 2018-05-25T21:48:12Z

though as I mentioned in person yesterday, I'll be surprised if this is significantly better than workload's builtin round-robin driver

Loadgen's tpcc implementation pins workers to individual driver connections, and it looks like workload's tpcc does the same. Is there a round-robin policy I'm missing? If so, I would expect it to look more like kv.

And regardless of whether there is or not, shouldn't a load balancer be able to make better decisions than a simple round-robin policy because it can remember which nodes are down and avoid sending them each 1/nth of the traffic only to have all of that traffic fail? My intuition is that this should result in fewer failed operations.

petermattis · 2018-05-26T13:50:01Z

Ah, I keep forgetting that tpcc pins workers to connections (surprising given that I added that code). If we removed that pinning (or made it optional), the builtin round-robin driver should be essentially the same as using haproxy. Why use one over the other? Convenience. The round-robin driver was initially added to avoid having to setup haproxy when doing perf experiments.

And regardless of whether there is or not, shouldn't a load balancer be able to make better decisions than a simple round-robin policy because it can remember which nodes are down and avoid sending them each 1/nth of the traffic only to have all of that traffic fail? My intuition is that this should result in fewer failed operations.

I'm pretty sure we configure haproxy with a round-robin load balancing policy. We can certainly investigate more complex load balancing, though I'm not sure if we'll be able to achieve it with haproxy given that it doesn't know about the pgwire protocol. Regardless, this PR is fine to merge given how small it is. I'm mostly talking to spread my concern about placing too much faith in haproxy (especially the out of the box configuration). We might have to put effort into either improving that configuration or providing or own load balancing. To give you a concrete example of a concern: workers open connections to haproxy which in turn opens a long lived connection to a cockroach node. If a new cockroach node comes on line (or restarts after being killed), haproxy will not utilize this node until a worker opens a new connection.

bdarnell · 2018-05-28T14:38:31Z

We might have to put effort into either improving that configuration or providing or own load balancing.

Or switching from haproxy to something that does understand the postgres protocol (like pgbouncer or pgpool).

This change adjusts tpccbench to run an HAProxy server on the load generator when running chaos benchmarking. HAProxy is then configured to point at the cluster and the load generator is pointed at HAProxy. This allows the load generator to fail-over when nodes go down without any extra client-side logic. It also has the nice effect of being the first roachtest to use HAProxy, a tool that we recommend our users use. This still needs some tweaking. For instance, I still need to tune the haproxy.cfg to support a larger number of concurrent requests and to fail over faster. Release note: None

nvb · 2018-05-29T22:49:15Z

To give you a concrete example of a concern: workers open connections to haproxy which in turn opens a long lived connection to a cockroach node. If a new cockroach node comes on line (or restarts after being killed), haproxy will not utilize this node until a worker opens a new connection.

That's a great point. HAProxy has a leastconn balancing policy which is an alternative to roundrobin for protocols that create long-lived connections. It would be interesting to see whether switching algorithms has a noticeable effect, especially while performing chaos testing.

Or switching from haproxy to something that does understand the postgres protocol (like pgbouncer or pgpool).

Have we ever tried running either of these against Cockroach?

bdarnell · 2018-05-29T23:11:18Z

HAProxy has a leastconn balancing policy which is an alternative to roundrobin for protocols that create long-lived connections.

leastconn isn't quite enough for this - clients that have enough connections may just hold on to them indefinitely and never need to open up a new connection that could be assigned to the new node. You need some kind of maximum lifespan on the connection to force them to be recycled occasionally. This is ideally a client-side feature (such as sqlalchemy's pool_recycle option). MySQL has a server-side feature to kill idle connections and it's handled ungracefully by many clients.

Have we ever tried running either of these against Cockroach?

I ran some trivial tests with pgbouncer a while back and it seemed to work. @asubiotto has tried pgpool; it looks like there were issues in older versions of pgpool but they may be fixed now (#13009 (comment))

nvb · 2018-05-29T23:34:43Z

bors r+

26075: tpccbench: use HAProxy when running chaos benchmarks r=nvanbenschoten a=nvanbenschoten This change adjusts tpccbench to run an HAProxy server on the load generator when running chaos benchmarking. HAProxy is then configured to point at the cluster and the load generator is pointed at HAProxy. This allows the load generator to fail-over when nodes go down without any extra client-side logic. It also has the nice effect of being the first roachtest to use HAProxy, a tool that we recommend our users use. This still needs some tweaking. For instance, I still need to tune the haproxy.cfg to support a larger number of concurrent requests and to fail over faster. First commit is #26073. Release note: None Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>

craig · 2018-05-29T23:50:15Z

Build succeeded

GitHub CI (Cockroach)

tbg · 2018-05-30T09:45:34Z

Whatever we decide here, I already like the dogfooding aspect of this. We have ./cockroach gen haproxy and I assume this was put in because someone thought it was something production users may actually want to use. Now it looks like that's not true, and that there should be a ./cockroach gen pgbouncer (or something like it -- whichever we want to favor) instead.

nvb requested a review from m-schneider May 25, 2018 01:13

nvb force-pushed the nvanbenschoten/tpccbenchHaproxy branch from ee9e337 to 2f6a3a9 Compare May 29, 2018 22:43

nvb changed the title ~~[DNM] tpccbench: use HAProxy when running chaos benchmarks~~ tpccbench: use HAProxy when running chaos benchmarks May 29, 2018

craig Bot merged commit 2f6a3a9 into cockroachdb:master May 29, 2018

nvb deleted the nvanbenschoten/tpccbenchHaproxy branch May 29, 2018 23:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tpccbench: use HAProxy when running chaos benchmarks#26075

tpccbench: use HAProxy when running chaos benchmarks#26075
craig[bot] merged 1 commit intocockroachdb:masterfrom
nvb:nvanbenschoten/tpccbenchHaproxy

nvb commented May 25, 2018 •

edited

Loading

Uh oh!

cockroach-teamcity commented May 25, 2018

Uh oh!

petermattis commented May 25, 2018

Uh oh!

nvb commented May 25, 2018

Uh oh!

petermattis commented May 26, 2018

Uh oh!

bdarnell commented May 28, 2018

Uh oh!

nvb commented May 29, 2018

Uh oh!

bdarnell commented May 29, 2018

Uh oh!

nvb commented May 29, 2018

Uh oh!

craig Bot commented May 29, 2018

Uh oh!

tbg commented May 30, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

nvb commented May 25, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented May 25, 2018

Uh oh!

petermattis commented May 25, 2018

Uh oh!

nvb commented May 25, 2018

Uh oh!

petermattis commented May 26, 2018

Uh oh!

bdarnell commented May 28, 2018

Uh oh!

nvb commented May 29, 2018

Uh oh!

bdarnell commented May 29, 2018

Uh oh!

nvb commented May 29, 2018

Uh oh!

craig Bot commented May 29, 2018

Build succeeded

Uh oh!

tbg commented May 30, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

nvb commented May 25, 2018 •

edited

Loading