Log when probe succeeds but full connection fails #51304

DaveCTurner · 2020-01-22T12:27:37Z

It is permitted for nodes to accept transport connections at addresses other
than their publish address, which allows a good deal of flexibility when
configuring discovery. However, it is not unusual for users to misconfigure
nodes to pick a publish address which is inaccessible to other nodes. We see
this happen a lot if the nodes are on different networks separated by a proxy,
or if the nodes are running in Docker with the wrong kind of network config.

In this case we offer no useful feedback to the user unless they enable
TRACE-level logs. It's particularly tricky to diagnose because if we test
connectivity between the nodes (using their discovery addresses) then all will
appear well.

This commit adds a WARN-level log if this kind of misconfiguration is detected:
the probe connection has succeeded (to indicate that we are really talking to a
healthy Elasticsearch node) but the followup connection attempt fails.

It also tidies up some loose ends in HandshakingTransportAddressConnector,
removing some TODOs that need not be completed, and registering its
accidentally-unregistered timeout settings.
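
To make the failure mode concrete, here is a minimal sketch of the probe-then-connect sequence. It is not the code in `HandshakingTransportAddressConnector`: the class, the `Connector` interface, the addresses and the log text are all hypothetical, and log lines are collected in a list rather than sent to a real logger.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of the two-step connection described above.
// None of these names come from Elasticsearch itself.
class ProbeThenConnectSketch {

    /** Stand-in for opening a transport connection; throws if the address is unreachable. */
    interface Connector {
        void connect(String address) throws Exception;
    }

    /** Collected log lines, so the sketch needs no logging framework. */
    static final List<String> LOG = new ArrayList<>();

    /**
     * Probes the discovery address first, then opens the full connection to the
     * publish address that the remote node reported during the probe.
     */
    static boolean connectToRemoteNode(String discoveryAddress, String publishAddress, Connector connector) {
        try {
            connector.connect(discoveryAddress); // probe: are we talking to a healthy node?
        } catch (Exception e) {
            return false; // probe failures are outside the scope of this sketch
        }
        try {
            connector.connect(publishAddress); // follow-up full connection
            return true;
        } catch (Exception e) {
            // The probe worked but the publish address did not: likely a misconfiguration.
            LOG.add("WARN probe of [" + discoveryAddress + "] succeeded but full connection to ["
                + publishAddress + "] failed: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical network in which only the discovery address is reachable.
        Connector onlyDiscoveryReachable = address -> {
            if (!address.equals("10.0.0.5:9300")) {
                throw new java.io.IOException("connection refused");
            }
        };
        connectToRemoteNode("10.0.0.5:9300", "172.17.0.2:9300", onlyDiscoveryReachable);
        LOG.forEach(System.out::println);
    }
}
```

The point of the second `catch` is that it can only be reached after a successful probe, which is what makes a WARN-level message appropriate there.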

elasticmachine · 2020-01-22T12:27:40Z

Pinging @elastic/es-distributed (:Distributed/Network)

original-brownbear · 2020-01-22T13:12:40Z

@DaveCTurner FYI new test failed in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-1/14307/testReport/junit/org.elasticsearch.discovery/HandshakingTransportAddressConnectorTests/testLogsFullConnectionFailureAfterSuccessfulHandshake/

original-brownbear

One comment/question, looks fine in general :)

original-brownbear · 2020-01-22T15:42:00Z

server/src/main/java/org/elasticsearch/discovery/HandshakingTransportAddressConnector.java

-                                                logger.trace("[{}] full connection successful: {}", thisConnectionAttempt, remoteNode);
-                                                listener.onResponse(remoteNode);
-                                            }));
+                                        transportService.connectToNode(remoteNode, ActionListener.wrap(ignored -> {


Why move to wrap here? We (mostly Henning :)) are currently trying to reduce the number of instances of passing broken listeners to transport APIs that don't handle their own exceptions, and this seems like a step in the wrong direction. Can we fix the listener to handle its exception instead?
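
A rough illustration of the concern, assuming that a `wrap`-style helper routes any exception thrown by the response handler into the failure handler. The `Listener` interface and `wrap` method below are simplified stand-ins written for this sketch, not the Elasticsearch `ActionListener` API, and the log text is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-ins for illustration only; not the Elasticsearch classes.
class WrappedListenerSketch {

    interface Listener<T> {
        void onResponse(T response);
        void onFailure(Exception e);
    }

    /** Collected log lines, so the sketch needs no logging framework. */
    static final List<String> LOG = new ArrayList<>();

    /** Builds a listener from two handlers; response-handler exceptions go to the failure handler. */
    static <T> Listener<T> wrap(Consumer<T> responseHandler, Consumer<Exception> failureHandler) {
        return new Listener<T>() {
            @Override
            public void onResponse(T response) {
                try {
                    responseHandler.accept(response);
                } catch (Exception e) {
                    failureHandler.accept(e); // a bug in the success path now looks like a failure
                }
            }

            @Override
            public void onFailure(Exception e) {
                failureHandler.accept(e);
            }
        };
    }

    /** A listener whose response handler has a bug and whose failure handler logs a warning. */
    static Listener<String> wrappedListener() {
        return wrap(
            (String node) -> { throw new IllegalStateException("bug in response handler"); },
            e -> LOG.add("WARN full connection failed: " + e.getMessage()));
    }

    public static void main(String[] args) {
        // The connection "succeeded", yet the failure handler runs and logs a warning.
        wrappedListener().onResponse("node-1");
        LOG.forEach(System.out::println);
    }
}
```

In this sketch the connection succeeds, but the warning is logged anyway because the exception from the response handler is redirected. A listener that implements both methods directly leaves that choice explicit.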

original-brownbear

LGTM

The following settings are not exposed to users in 7.6 and earlier:

- `discovery.probe.connect_timeout`
- `discovery.probe.handshake_timeout`

This was addressed in 7.7 (elastic#51304) but the docs for older versions suggest incorrectly that these settings are available. This commit removes the docs for these settings in the affected versions to avoid confusion.

DaveCTurner added the >enhancement, :Distributed/Network (Http and internode communication implementations), v8.0.0 and v7.7.0 labels Jan 22, 2020

DaveCTurner requested review from ywelsch and original-brownbear January 22, 2020 12:27

DaveCTurner added 3 commits January 22, 2020 13:15

Meh let's always log the stack trace

b10b474

Fix level and message

1bd1c8e

😖

e787138

original-brownbear reviewed Jan 22, 2020

Raw ActionListener, no trappy exception handling

bd87bcb

original-brownbear approved these changes Jan 22, 2020

DaveCTurner merged commit ff6f509 into elastic:master Jan 23, 2020

DaveCTurner mentioned this pull request Jan 23, 2020

Log when probe succeeds but full connection fails #51357

Merged

DaveCTurner added the backport pending label Jan 23, 2020

DaveCTurner deleted the 2020-01-22-HandshakingTransportAddressConnector-fixes branch January 23, 2020 16:02

DaveCTurner removed the backport pending label Jan 23, 2020

codebrain mentioned this pull request Apr 1, 2020

7.7.0 meta ticket (Part 2) elastic/elasticsearch-net#4533

Closed

DaveCTurner mentioned this pull request Jan 28, 2021

Remove unregistered discovery settings from old docs #68093

Merged

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
