Skip to content

tpccbench: use HAProxy when running chaos benchmarks#26075

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
nvb:nvanbenschoten/tpccbenchHaproxy
May 29, 2018
Merged

tpccbench: use HAProxy when running chaos benchmarks#26075
craig[bot] merged 1 commit intocockroachdb:masterfrom
nvb:nvanbenschoten/tpccbenchHaproxy

Conversation

@nvb
Copy link
Copy Markdown
Contributor

@nvb nvb commented May 25, 2018

This change adjusts tpccbench to run an HAProxy server on the load
generator when running chaos benchmarking. HAProxy is then configured
to point at the cluster and the load generator is pointed at HAProxy.
This allows the load generator to fail-over when nodes go down without
any extra client-side logic.

It also has the nice effect of being the first roachtest to use HAProxy,
a tool that we recommend our users use.

This still needs some tweaking. For instance, I still need to tune the
haproxy.cfg to support a larger number of concurrent requests and to
fail over faster.

Release note: None

@nvb nvb requested a review from m-schneider May 25, 2018 01:13
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@petermattis
Copy link
Copy Markdown
Collaborator

:lgtm: though as I mentioned in person yesterday, I'll be surprised if this is significantly better than workload's builtin round-robin driver.


Review status: 0 of 2 files reviewed at latest revision, all discussions resolved, some commit checks failed.


Comments from Reviewable

@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented May 25, 2018

though as I mentioned in person yesterday, I'll be surprised if this is significantly better than workload's builtin round-robin driver

Loadgen's tpcc implementation pins workers to individual driver connections, and it looks like workload's tpcc does the same. Is there a round-robin policy I'm missing? If so, I would expect it to look more like kv.

And regardless of whether there is or not, shouldn't a load balancer be able to make better decisions than a simple round-robin policy because it can remember which nodes are down and avoid sending them each 1/nth of the traffic only to have all of that traffic fail? My intuition is that this should result in fewer failed operations.

@petermattis
Copy link
Copy Markdown
Collaborator

Ah, I keep forgetting that tpcc pins workers to connections (surprising given that I added that code). If we removed that pinning (or made it optional), the builtin round-robin driver should be essentially the same as using haproxy. Why use one over the other? Convenience. The round-robin driver was initially added to avoid having to setup haproxy when doing perf experiments.

And regardless of whether there is or not, shouldn't a load balancer be able to make better decisions than a simple round-robin policy because it can remember which nodes are down and avoid sending them each 1/nth of the traffic only to have all of that traffic fail? My intuition is that this should result in fewer failed operations.

I'm pretty sure we configure haproxy with a round-robin load balancing policy. We can certainly investigate more complex load balancing, though I'm not sure if we'll be able to achieve it with haproxy given that it doesn't know about the pgwire protocol. Regardless, this PR is fine to merge given how small it is. I'm mostly talking to spread my concern about placing too much faith in haproxy (especially the out of the box configuration). We might have to put effort into either improving that configuration or providing or own load balancing. To give you a concrete example of a concern: workers open connections to haproxy which in turn opens a long lived connection to a cockroach node. If a new cockroach node comes on line (or restarts after being killed), haproxy will not utilize this node until a worker opens a new connection.

@bdarnell
Copy link
Copy Markdown
Contributor

We might have to put effort into either improving that configuration or providing or own load balancing.

Or switching from haproxy to something that does understand the postgres protocol (like pgbouncer or pgpool).

This change adjusts tpccbench to run an HAProxy server on the load
generator when running chaos benchmarking. HAProxy is then configured
to point at the cluster and the load generator is pointed at HAProxy.
This allows the load generator to fail-over when nodes go down without
any extra client-side logic.

It also has the nice effect of being the first roachtest to use HAProxy,
a tool that we recommend our users use.

This still needs some tweaking. For instance, I still need to tune the
haproxy.cfg to support a larger number of concurrent requests and to
fail over faster.

Release note: None
@nvb nvb force-pushed the nvanbenschoten/tpccbenchHaproxy branch from ee9e337 to 2f6a3a9 Compare May 29, 2018 22:43
@nvb nvb changed the title [DNM] tpccbench: use HAProxy when running chaos benchmarks tpccbench: use HAProxy when running chaos benchmarks May 29, 2018
@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented May 29, 2018

To give you a concrete example of a concern: workers open connections to haproxy which in turn opens a long lived connection to a cockroach node. If a new cockroach node comes on line (or restarts after being killed), haproxy will not utilize this node until a worker opens a new connection.

That's a great point. HAProxy has a leastconn balancing policy which is an alternative to roundrobin for protocols that create long-lived connections. It would be interesting to see whether switching algorithms has a noticeable effect, especially while performing chaos testing.

Or switching from haproxy to something that does understand the postgres protocol (like pgbouncer or pgpool).

Have we ever tried running either of these against Cockroach?

@bdarnell
Copy link
Copy Markdown
Contributor

HAProxy has a leastconn balancing policy which is an alternative to roundrobin for protocols that create long-lived connections.

leastconn isn't quite enough for this - clients that have enough connections may just hold on to them indefinitely and never need to open up a new connection that could be assigned to the new node. You need some kind of maximum lifespan on the connection to force them to be recycled occasionally. This is ideally a client-side feature (such as sqlalchemy's pool_recycle option). MySQL has a server-side feature to kill idle connections and it's handled ungracefully by many clients.

Have we ever tried running either of these against Cockroach?

I ran some trivial tests with pgbouncer a while back and it seemed to work. @asubiotto has tried pgpool; it looks like there were issues in older versions of pgpool but they may be fixed now (#13009 (comment))

@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented May 29, 2018

bors r+

craig Bot pushed a commit that referenced this pull request May 29, 2018
26075: tpccbench: use HAProxy when running chaos benchmarks r=nvanbenschoten a=nvanbenschoten

This change adjusts tpccbench to run an HAProxy server on the load
generator when running chaos benchmarking. HAProxy is then configured
to point at the cluster and the load generator is pointed at HAProxy.
This allows the load generator to fail-over when nodes go down without
any extra client-side logic.

It also has the nice effect of being the first roachtest to use HAProxy,
a tool that we recommend our users use.

This still needs some tweaking. For instance, I still need to tune the
haproxy.cfg to support a larger number of concurrent requests and to
fail over faster.

First commit is #26073.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig
Copy link
Copy Markdown
Contributor

craig Bot commented May 29, 2018

Build succeeded

@craig craig Bot merged commit 2f6a3a9 into cockroachdb:master May 29, 2018
@nvb nvb deleted the nvanbenschoten/tpccbenchHaproxy branch May 29, 2018 23:51
@tbg
Copy link
Copy Markdown
Member

tbg commented May 30, 2018

Whatever we decide here, I already like the dogfooding aspect of this. We have ./cockroach gen haproxy and I assume this was put in because someone thought it was something production users may actually want to use. Now it looks like that's not true, and that there should be a ./cockroach gen pgbouncer (or something like it -- whichever we want to favor) instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants