Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

util: reduce log spam during network breakdown #51937

Merged
merged 1 commit into from
Jul 29, 2020

Conversation

irfansharif
Copy link
Contributor

Touches #32581. We currently spam a lot whenever there is any issues
with network connectivity. Seems like we used to dedup this log spam
away before, but the strings we were matching against have gone stale
(there's no more addConn.resetTransport in grpc code). Also, we used
to dedup log spam away by looking at goroutine thread ID, instead of the
message itself, which this patch does instead. This reduces spam in
situations where we liberally construct one-off connections to
downed/partitioned nodes (for e.g., during start up).

W200727 19:02:29.108962 392
vendor/google.golang.org/grpc/internal/channelz/logging.go:73  grpc:
addrConn.createTransport failed to connect to {127.0.0.1:26257  <nil> 0
<nil>}. Err: connection error: desc = "transport: Error while dialing
dial tcp 127.0.0.1:26257: connect: connection refused". Reconnecting...

W200727 19:02:30.111670 392
vendor/google.golang.org/grpc/internal/channelz/logging.go:73  grpc:
addrConn.createTransport failed to connect to {127.0.0.1:26257  <nil> 0
<nil>}. Err: connection error: desc = "transport: Error while dialing
cannot reuse client connection". Reconnecting...

Release note (general change): Previously CRDB would spam its logs
when undergoing network connectivity issues. This patch reduces the
frequency of said spam.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@irfansharif
Copy link
Contributor Author

By the way, the frequency (30s) is something I pulled out of thin air. I can revive the goid based duplication if asked, or even elevate the specific warnings to errors now that they're a bit more infrequent. The PR was basically reviving what we already had in place with this shouldPrint stuff, the check just happened to move away from Warningf to Warning because of internal grpc code/wiring changes since.

@knz
Copy link
Contributor

knz commented Jul 29, 2020

LGTM - no objection on maintaining this technical debt

I wonder though if it's possible to identify the error by looking at the call stack instead: if one of the caller is the connect function, don't log (or something like that).

You'll need to rebase to address the CI failure (or merge as-is: TestComposeGSS is now green on master)

@irfansharif
Copy link
Contributor Author

TFTR, I've rebased. I thought about trying extract a more specific error and convinced myself that the callstack would be as fragile (ish) as what we've done here. Was also being lazy.

bors r+

@lunevalex lunevalex removed this from In progress in KV 20.2 Jul 29, 2020
@irfansharif
Copy link
Contributor Author

Flaked on #52057, retrying.

Touches cockroachdb#32581. We currently spam a lot whenever there is any issues
with network connectivity. Seems like we used to dedup this log spam
away before, but the strings we were matching against have gone stale
(there's no more `addConn.resetTransport` in grpc code). Also, we used
to dedup log spam away by looking at goroutine thread ID, instead of the
message itself, which this patch does instead. This reduces spam in
situations where we liberally construct one-off connections to
downed/partitioned nodes (for e.g., during start up).

```
W200727 19:02:29.108962 392
vendor/google.golang.org/grpc/internal/channelz/logging.go:73  grpc:
addrConn.createTransport failed to connect to {127.0.0.1:26257  <nil> 0
<nil>}. Err: connection error: desc = "transport: Error while dialing
dial tcp 127.0.0.1:26257: connect: connection refused". Reconnecting...

W200727 19:02:30.111670 392
vendor/google.golang.org/grpc/internal/channelz/logging.go:73  grpc:
addrConn.createTransport failed to connect to {127.0.0.1:26257  <nil> 0
<nil>}. Err: connection error: desc = "transport: Error while dialing
cannot reuse client connection". Reconnecting...
```

Release note (general change): Previously CRDB would spam its logs
when undergoing network connectivity issues. This patch reduces the
frequency of said spam.
@irfansharif
Copy link
Contributor Author

bors r+

@irfansharif
Copy link
Contributor Author

Flaked on #52026.

@craig
Copy link
Contributor

craig bot commented Jul 29, 2020

Build succeeded:

@craig craig bot merged commit 8dff8da into cockroachdb:master Jul 29, 2020
@irfansharif irfansharif deleted the 200727.log-spam branch July 29, 2020 22:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants