
CockroachDB logs " ... write: broken pipe" [HAProxy + gRPC] #16084

Closed
TomArbuckle opened this issue May 23, 2017 · 10 comments


@TomArbuckle commented May 23, 2017

BUG REPORT

  1. Please supply the header (i.e. the first few lines) of your most recent
    log file for each node in your cluster.

$ fgrep -F '[config]' node?/logs/cockroach.log
node1/logs/cockroach.log:I170523 10:10:51.370540 1 util/log/clog.go:1039 [config] file created at: 2017/05/23 10:10:51
node1/logs/cockroach.log:I170523 10:10:51.370540 1 util/log/clog.go:1039 [config] running on machine: Toms-MBP
node1/logs/cockroach.log:I170523 10:10:51.370540 1 util/log/clog.go:1039 [config] binary: CockroachDB CCL 4a9f417 (darwin amd64, built 2017/05/23 08:45:23, go1.8.1)
node1/logs/cockroach.log:I170523 10:10:51.370540 1 util/log/clog.go:1039 [config] arguments: [cockroach start --insecure --host=localhost --store=node1 --max-sql-memory=3G --port=26258 --http-port=8081]
node2/logs/cockroach.log:I170523 10:10:52.674976 1 util/log/clog.go:1039 [config] file created at: 2017/05/23 10:10:52
node2/logs/cockroach.log:I170523 10:10:52.674976 1 util/log/clog.go:1039 [config] running on machine: Toms-MBP
node2/logs/cockroach.log:I170523 10:10:52.674976 1 util/log/clog.go:1039 [config] binary: CockroachDB CCL 4a9f417 (darwin amd64, built 2017/05/23 08:45:23, go1.8.1)
node2/logs/cockroach.log:I170523 10:10:52.674976 1 util/log/clog.go:1039 [config] arguments: [cockroach start --insecure --host=localhost --store=node2 --max-sql-memory=3G --port=26259 --http-port=8082 --join=localhost:26258]
node3/logs/cockroach.log:I170523 10:10:52.896888 1 util/log/clog.go:1039 [config] file created at: 2017/05/23 10:10:52
node3/logs/cockroach.log:I170523 10:10:52.896888 1 util/log/clog.go:1039 [config] running on machine: Toms-MBP
node3/logs/cockroach.log:I170523 10:10:52.896888 1 util/log/clog.go:1039 [config] binary: CockroachDB CCL 4a9f417 (darwin amd64, built 2017/05/23 08:45:23, go1.8.1)
node3/logs/cockroach.log:I170523 10:10:52.896888 1 util/log/clog.go:1039 [config] arguments: [cockroach start --insecure --host=localhost --store=node3 --max-sql-memory=3G --port=26260 --http-port=8083 --join=localhost:26258]
node4/logs/cockroach.log:I170523 10:15:35.611564 1 util/log/clog.go:1039 [config] file created at: 2017/05/23 10:15:35
node4/logs/cockroach.log:I170523 10:15:35.611564 1 util/log/clog.go:1039 [config] running on machine: Toms-MBP
node4/logs/cockroach.log:I170523 10:15:35.611564 1 util/log/clog.go:1039 [config] binary: CockroachDB CCL 4a9f417 (darwin amd64, built 2017/05/23 08:45:23, go1.8.1)
node4/logs/cockroach.log:I170523 10:15:35.611564 1 util/log/clog.go:1039 [config] arguments: [cockroach start --insecure --host=localhost --store=node4 --max-sql-memory=3G --port=26261 --http-port=8084 --join=localhost:26258]
node5/logs/cockroach.log:I170523 10:15:47.449602 1 util/log/clog.go:1039 [config] file created at: 2017/05/23 10:15:47
node5/logs/cockroach.log:I170523 10:15:47.449602 1 util/log/clog.go:1039 [config] running on machine: Toms-MBP
node5/logs/cockroach.log:I170523 10:15:47.449602 1 util/log/clog.go:1039 [config] binary: CockroachDB CCL 4a9f417 (darwin amd64, built 2017/05/23 08:45:23, go1.8.1)
node5/logs/cockroach.log:I170523 10:15:47.449602 1 util/log/clog.go:1039 [config] arguments: [cockroach start --insecure --host=localhost --store=node5 --max-sql-memory=3G --port=26262 --http-port=8085 --join=localhost:26258]

This is a 5-node cluster fronted by HAProxy, with the following config file:

global
    maxconn 4096
    daemon

defaults
    mode tcp
    timeout connect 10s
    timeout client 1m
    timeout server 1m

listen psql
    bind :26257
    mode tcp
    balance leastconn
    server cockroach1 localhost:26258 check
    server cockroach2 localhost:26259 check
    server cockroach3 localhost:26260 check
    server cockroach4 localhost:26261 check
    server cockroach5 localhost:26262 check

  2. Please describe the issue you observed:

An example of the problematic log output:

$ fgrep pipe node?/logs/cockroach.log
...
node5/logs/cockroach.log:I170523 14:27:58.173516 201803 vendor/google.golang.org/grpc/server.go:519 grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: write tcp 127.0.0.1:26262->127.0.0.1:53701: write: broken pipe"
node5/logs/cockroach.log:I170523 14:28:00.175008 201861 vendor/google.golang.org/grpc/server.go:519 grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: write tcp 127.0.0.1:26262->127.0.0.1:53708: write: broken pipe"
node5/logs/cockroach.log:I170523 14:28:02.177234 201816 vendor/google.golang.org/grpc/server.go:519 grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: write tcp 127.0.0.1:26262->127.0.0.1:53717: write: broken pipe"
...

  • What did you do?

Some searching revealed similar issues:

grpc/grpc-go#1142

Removing the check annotation from the haproxy.cfg file stopped the "broken pipe" error reports.

Verified that HAProxy's liveness checks are indeed the cause by restoring the check annotation and adding option pgsql-check to the haproxy.cfg file.
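The relevant listen section then looks roughly like this (a sketch reusing the config above; only the option pgsql-check line is new):

listen psql
    bind :26257
    mode tcp
    balance leastconn
    option pgsql-check
    server cockroach1 localhost:26258 check
    server cockroach2 localhost:26259 check
    server cockroach3 localhost:26260 check
    server cockroach4 localhost:26261 check
    server cockroach5 localhost:26262 check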

The error logged by Cockroach then becomes:
...
E170523 12:09:28.163867 41626 server/server.go:628 [n5,client=127.0.0.1:57197] read tcp 127.0.0.1:26262->127.0.0.1:57197: read: connection reset by peer
E170523 12:09:30.166172 41632 server/server.go:628 [n5,client=127.0.0.1:57204] read tcp 127.0.0.1:26262->127.0.0.1:57204: read: connection reset by peer
E170523 12:09:32.168811 41672 server/server.go:628 [n5,client=127.0.0.1:57213] read tcp 127.0.0.1:26262->127.0.0.1:57213: read: connection reset by peer
E170523 12:09:34.170868 41698 server/server.go:628 [n5,client=127.0.0.1:57220] read tcp 127.0.0.1:26262->127.0.0.1:57220: read: connection reset by peer
E170523 12:09:36.173829 41677 server/server.go:628 [n5,client=127.0.0.1:57225] read tcp 127.0.0.1:26262->127.0.0.1:57225: read: connection reset by peer
...

The "check" annotation is emitted for

cockroach gen haproxy .

@andreimatei (Member) commented May 23, 2017

Hi @TomArbuckle. So you're saying that HAProxy checks fail (with and without option pgsql-check), right?
Is the cluster working otherwise? Can you connect to it from the HAProxy machine?

@TomArbuckle (Author) commented May 23, 2017

Hi @andreimatei.

The cluster is working normally, as far as I can tell, and haproxy is distributing requests. (The cluster is running on one machine.)

I find the log messages disconcerting and wanted to know what was causing them.

The point I'm trying to make is that these messages will show up in people's logs whenever their haproxy.cfg files contain "check" annotations, and they may well wonder what is going on.

@mberhault (Contributor) commented May 23, 2017

Trying to reproduce this, I only see it when option pgsql-check is set. The generated haproxy config does not set that. The config you pasted also does not have the option set.
Did you see it happen without that option (but with check still present)?

@andreimatei (Member) commented May 23, 2017

Oh, so the checks are not failing, as far as HAProxy is concerned. It's just the log spam that's the issue.

For the read tcp 127.0.0.1:26262->127.0.0.1:57197: read: connection reset by peer message, perhaps our filter for "expected" errors is missing this:

func IsClosedConnection(err error) bool {
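(For illustration only: a minimal sketch of what such an "expected error" filter typically does, assuming a simple string match on the usual disconnect messages; this is not CockroachDB's actual implementation.)

package netcheck // hypothetical package name, for illustration only

import (
    "io"
    "strings"
)

// isClosedConnection reports whether err looks like an ordinary
// peer-disconnect error (e.g. a load balancer dropping its health-check
// connection) that could be logged at low verbosity.
func isClosedConnection(err error) bool {
    if err == nil {
        return false
    }
    if err == io.EOF {
        return true
    }
    msg := err.Error()
    for _, benign := range []string{
        "broken pipe",
        "connection reset by peer",
        "use of closed network connection",
    } {
        if strings.Contains(msg, benign) {
            return true
        }
    }
    return false
}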

@mberhault perhaps the appearance of the grpc: Server.Serve failed to create ServerTransport error is just OS X-specific. I know we've installed some magic to make grpc errors silent. Is that right, @tamird?

@mberhault (Contributor) commented May 23, 2017

Ah, that might explain it.
On Linux, I have no problems using checks without pgsql-check. The logs show nothing, and connections through haproxy work fine.

However, when enabling pgsql-check, health checks fail with:

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result FAILED
Total: 3 (2 usable), will use epoll.
Using epoll() as the polling mechanism.
[WARNING] 142/130314 (14872) : Server psql/cockroach1 is DOWN, reason: Layer7 timeout, check duration: 2001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 142/130314 (14872) : proxy 'psql' has no server available!

And the logs dump a bunch of:

I170523 13:03:22.539230 738 vendor/google.golang.org/grpc/server.go:519  grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: write tcp 127.0.1.1:26257->127.0.0.1:45280: write: broken pipe"
I170523 13:03:26.543618 561 vendor/google.golang.org/grpc/server.go:519  grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: write tcp 127.0.1.1:26257->127.0.0.1:45286: write: broken pipe"

A TCP capture on :26257 shows no data being sent, just SYN, SYN/ACK, ACK, and then an RST sent by haproxy.

@mberhault (Contributor) commented May 23, 2017

If I change the option to option pgsql-check user root, the checks now pass, but we get connection reset by peer instead of the broken pipe errors. I guess grpc doesn't like a plain RST.
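(That is, roughly:

option pgsql-check user root

in the listen section, with check still present on the server lines.)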

@TomArbuckle (Author) commented May 24, 2017

So, in summary:

  • I noticed these "broken pipe" warnings because I was running the code on a Macintosh;
  • option pgsql-check user root also gives the "reset by peer" warnings on non-Macintosh;
  • HAProxy's SYN, SYN/ACK, ACK, RST sequence may be lightweight, but it can leave the backend server it fronts confused about the state of the liveness-checking connection.

@bdarnell (Member) commented May 24, 2017

It's annoying that haproxy RSTs all its health check connections instead of shutting them down cleanly (maybe RST is appropriate for the TCP health checks but with pgsql-check it should be able to do an orderly shutdown).

I think that in most cases certain networking errors should be downgraded to V(1) (in both pgwire and grpc). The list I use in Tornado for this is ECONNRESET, ECONNABORTED, ETIMEDOUT, EPIPE, and EPROTOTYPE (the last occurs only on darwin).
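(A rough sketch of such a downgrade check, written against modern Go's errors.Is unwrapping and the syscall constants listed above; this is an illustration, not the actual cockroach or Tornado code.)

package netcheck // hypothetical package name, for illustration only

import (
    "errors"
    "syscall"
)

// isBenignNetworkError reports whether err unwraps to one of the errnos
// listed above; such errors could be logged at V(1) instead of as errors.
func isBenignNetworkError(err error) bool {
    for _, errno := range []syscall.Errno{
        syscall.ECONNRESET,
        syscall.ECONNABORTED,
        syscall.ETIMEDOUT,
        syscall.EPIPE,
        syscall.EPROTOTYPE, // in practice seen only on darwin
    } {
        if errors.Is(err, errno) {
            return true
        }
    }
    return false
}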

@mberhault (Contributor) commented Sep 25, 2017

Is there anything else to do here? The configs warn that timeouts should be set to fit your usage. Beyond that, there's not a whole lot we can do.

@mberhault (Contributor) commented Oct 3, 2017

Closing.

@mberhault mberhault closed this Oct 3, 2017
