Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hubble/relay: print better error message on peer conn dial failure #13484

Merged
merged 1 commit into from
Oct 12, 2020

Conversation

rolinh
Copy link
Member

@rolinh rolinh commented Oct 9, 2020

To connect to peers via gRPC, Hubble Relay uses grpc.DialContext()
with a dial timeout. When a connection attempt to a peer fails, the
following error message is returned: context deadline exceeded.
This provides no value when trying to debug the issue. Is the peer
unreachable? Is this a TLS configuration issue? etc.

By using the dial option grpc.FailOnNonTempDialError(true), we get
better error messages on non-temporary dial errors, such as:

connection error: desc = "transport: error while dialing: dial tcp 172.18.0.4:4244: connect: connection refused"

Using the option grpc.WithReturnConnectionError(), we would get a
message that contains both the last connection error that occurred and
the context.DeadlineExceeded error. This greatly improves the
situation for errors that are deemed transient. Example:

context deadline exceeded: connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for *.hubble-relay.cilium.io, not localhost"

However, this is only available starting with grpc-go v1.30.0 and due to
the issue described here, we are currently stuck with v1.29.1. Thus,
this commit adds the option but comments it with a TODO note so that it
can be caught up and uncommented when upgrading grpc-go becomes
possible.

Hubble Relay: improve error log message on peer connection failure.

To connect to peers via gRPC, Hubble Relay uses `grpc.DialContext()`
with a dial timeout. When a connection attempt to a peer fails, the
following error message is returned: `context deadline exceeded`.
This provides no value when trying to debug the issue. Is the peer
unreachable? Is this a TLS configuration issue? etc.

By using the dial option `grpc.FailOnNonTempDialError(true)`, we get
better error messages on non-temporary dial errors, such as:

    connection error: desc = "transport: error while dialing: dial tcp 172.18.0.4:4244: connect: connection refused"

Using the option `grpc.WithReturnConnectionError()`, we would get a
message that contains both the last connection error that occurred and
the `context.DeadlineExceeded` error. This greatly improves the
situation for errors that are deemed transient. Example:

    context deadline exceeded: connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for *.hubble-relay.cilium.io, not localhost"

However, this is only available starting with grpc-go v1.30.0 and due to
the issue described here[0], we are currently stuck with v1.29.1. Thus,
this commit adds the option but comments it with a TODO note so that it
can be caught up and uncommented when upgrading grpc-go becomes
possible.

[0]: #13405 (comment)

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
@rolinh rolinh added kind/cleanup This includes no functional changes. release-note/misc This PR makes changes that have no direct user impact. sig/hubble Impacts hubble server or relay labels Oct 9, 2020
@rolinh rolinh requested a review from a team October 9, 2020 12:35
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.8.5 Oct 9, 2020
@rolinh
Copy link
Member Author

rolinh commented Oct 9, 2020

test-me-please

@rolinh
Copy link
Member Author

rolinh commented Oct 9, 2020

test-gke

1 similar comment
@rolinh
Copy link
Member Author

rolinh commented Oct 9, 2020

test-gke

@rolinh
Copy link
Member Author

rolinh commented Oct 9, 2020

retest-4.9

@rolinh
Copy link
Member Author

rolinh commented Oct 12, 2020

test-gke

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 12, 2020
@gandro gandro merged commit e2e3661 into master Oct 12, 2020
@gandro gandro deleted the pr/rolinh/hubble-relay-improved-conn-error-logs branch October 12, 2020 10:06
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.5 Oct 12, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.5 Oct 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/cleanup This includes no functional changes. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact. sig/hubble Impacts hubble server or relay
Projects
No open projects
1.8.5
Backport done to v1.8
Development

Successfully merging this pull request may close these issues.

None yet

6 participants