Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log when probe succeeds but full connection fails #51304

Conversation

DaveCTurner
Copy link
Contributor

It is permitted for nodes to accept transport connections at addresses other
than their publish address, which allows a good deal of flexibility when
configuring discovery. However, it is not unusual for users to misconfigure
nodes to pick a publish address which is inaccessible to other nodes. We see
this happen a lot if the nodes are on different networks separated by a proxy,
or if the nodes are running in Docker with the wrong kind of network config.

In this case we offer no useful feedback to the user unless they enable
TRACE-level logs. It's particularly tricky to diagnose because if we test
connectivity between the nodes (using their discovery addresses) then all will
appear well.

This commit adds a WARN-level log if this kind of misconfiguration is detected:
the probe connection has succeeded (to indicate that we are really talking to a
healthy Elasticsearch node) but the followup connection attempt fails.

It also tidies up some loose ends in HandshakingTransportAddressConnector,
removing some TODOs that need not be completed, and registering its
accidentally-unregistered timeout settings.

It is permitted for nodes to accept transport connections at addresses other
than their publish address, which allows a good deal of flexibility when
configuring discovery. However, it is not unusual for users to misconfigure
nodes to pick a publish address which is inaccessible to other nodes. We see
this happen a lot if the nodes are on different networks separated by a proxy,
or if the nodes are running in Docker with the wrong kind of network config.

In this case we offer no useful feedback to the user unless they enable
TRACE-level logs. It's particularly tricky to diagnose because if we test
connectivity between the nodes (using their discovery addresses) then all will
appear well.

This commit adds a WARN-level log if this kind of misconfiguration is detected:
the probe connection has succeeded (to indicate that we are really talking to a
healthy Elasticsearch node) but the followup connection attempt fails.

It also tidies up some loose ends in `HandshakingTransportAddressConnector`,
removing some TODOs that need not be completed, and registering its
accidentally-unregistered timeout settings.
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/Network)

Copy link
Member

@original-brownbear original-brownbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment/question, looks fine in general :)

logger.trace("[{}] full connection successful: {}", thisConnectionAttempt, remoteNode);
listener.onResponse(remoteNode);
}));
transportService.connectToNode(remoteNode, ActionListener.wrap(ignored -> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why move to wrap here? We (mostly Henning :)) are currently trying to remove the number of instances of passing broken listeners to transport APIs that don't handle their own exceptions and this seems like a step in the wrong direction. Can we fix the listener to handle its exception instead?

Copy link
Member

@original-brownbear original-brownbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@DaveCTurner DaveCTurner merged commit ff6f509 into elastic:master Jan 23, 2020
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jan 23, 2020
It is permitted for nodes to accept transport connections at addresses other
than their publish address, which allows a good deal of flexibility when
configuring discovery. However, it is not unusual for users to misconfigure
nodes to pick a publish address which is inaccessible to other nodes. We see
this happen a lot if the nodes are on different networks separated by a proxy,
or if the nodes are running in Docker with the wrong kind of network config.

In this case we offer no useful feedback to the user unless they enable
TRACE-level logs. It's particularly tricky to diagnose because if we test
connectivity between the nodes (using their discovery addresses) then all will
appear well.

This commit adds a WARN-level log if this kind of misconfiguration is detected:
the probe connection has succeeded (to indicate that we are really talking to a
healthy Elasticsearch node) but the followup connection attempt fails.

It also tidies up some loose ends in `HandshakingTransportAddressConnector`,
removing some TODOs that need not be completed, and registering its
accidentally-unregistered timeout settings.
@DaveCTurner DaveCTurner deleted the 2020-01-22-HandshakingTransportAddressConnector-fixes branch January 23, 2020 16:02
DaveCTurner added a commit that referenced this pull request Jan 23, 2020
It is permitted for nodes to accept transport connections at addresses other
than their publish address, which allows a good deal of flexibility when
configuring discovery. However, it is not unusual for users to misconfigure
nodes to pick a publish address which is inaccessible to other nodes. We see
this happen a lot if the nodes are on different networks separated by a proxy,
or if the nodes are running in Docker with the wrong kind of network config.

In this case we offer no useful feedback to the user unless they enable
TRACE-level logs. It's particularly tricky to diagnose because if we test
connectivity between the nodes (using their discovery addresses) then all will
appear well.

This commit adds a WARN-level log if this kind of misconfiguration is detected:
the probe connection has succeeded (to indicate that we are really talking to a
healthy Elasticsearch node) but the followup connection attempt fails.

It also tidies up some loose ends in `HandshakingTransportAddressConnector`,
removing some TODOs that need not be completed, and registering its
accidentally-unregistered timeout settings.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jan 28, 2021
The following settings are not exposed to users in 7.6 and earlier:

- `discovery.probe.connect_timeout`
- `discovery.probe.handshake_timeout`

This was addressed in 7.7 (elastic#51304) but the docs for older versions
suggest incorrectly that these settings are available. This commit
removes the docs for these settings in the affected versions to avoid
confusion.
DaveCTurner added a commit that referenced this pull request Jan 28, 2021
The following settings are not exposed to users in 7.6 and earlier:

- `discovery.probe.connect_timeout`
- `discovery.probe.handshake_timeout`

This was addressed in 7.7 (#51304) but the docs for older versions
suggest incorrectly that these settings are available. This commit
removes the docs for these settings in the affected versions to avoid
confusion.
DaveCTurner added a commit that referenced this pull request Jan 28, 2021
The following settings are not exposed to users in 7.6 and earlier:

- `discovery.probe.connect_timeout`
- `discovery.probe.handshake_timeout`

This was addressed in 7.7 (#51304) but the docs for older versions
suggest incorrectly that these settings are available. This commit
removes the docs for these settings in the affected versions to avoid
confusion.
DaveCTurner added a commit that referenced this pull request Jan 28, 2021
The following settings are not exposed to users in 7.6 and earlier:

- `discovery.probe.connect_timeout`
- `discovery.probe.handshake_timeout`

This was addressed in 7.7 (#51304) but the docs for older versions
suggest incorrectly that these settings are available. This commit
removes the docs for these settings in the affected versions to avoid
confusion.
DaveCTurner added a commit that referenced this pull request Jan 28, 2021
The following settings are not exposed to users in 7.6 and earlier:

- `discovery.probe.connect_timeout`
- `discovery.probe.handshake_timeout`

This was addressed in 7.7 (#51304) but the docs for older versions
suggest incorrectly that these settings are available. This commit
removes the docs for these settings in the affected versions to avoid
confusion.
DaveCTurner added a commit that referenced this pull request Jan 28, 2021
The following settings are not exposed to users in 7.6 and earlier:

- `discovery.probe.connect_timeout`
- `discovery.probe.handshake_timeout`

This was addressed in 7.7 (#51304) but the docs for older versions
suggest incorrectly that these settings are available. This commit
removes the docs for these settings in the affected versions to avoid
confusion.
DaveCTurner added a commit that referenced this pull request Jan 28, 2021
The following settings are not exposed to users in 7.6 and earlier:

- `discovery.probe.connect_timeout`
- `discovery.probe.handshake_timeout`

This was addressed in 7.7 (#51304) but the docs for older versions
suggest incorrectly that these settings are available. This commit
removes the docs for these settings in the affected versions to avoid
confusion.
DaveCTurner added a commit that referenced this pull request Jan 28, 2021
The following settings are not exposed to users in 7.6 and earlier:

- `discovery.probe.connect_timeout`
- `discovery.probe.handshake_timeout`

This was addressed in 7.7 (#51304) but the docs for older versions
suggest incorrectly that these settings are available. This commit
removes the docs for these settings in the affected versions to avoid
confusion.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Network Http and internode communication implementations >enhancement v7.7.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants