hubble-relay: Add node status message #11589

gandro · 2020-05-18T19:19:46Z

This adds a status message to the observer service implementation
of hubble-relay. The status message is added as a new variant of
the GetFlowsResponse, therefore older clients will silently ignore
it.

The new node status event is defined as follows:

// NodeStatusEvent is a message sent by hubble-relay to inform clients about
// the state of a particular node.
message NodeStatusEvent {
    // state_change contains the new node state
    NodeState state_change = 1;
    // node_names is the list of nodes to which the above state change applies
    repeated string node_names = 2;
    // message is an optional message attached to the state change (e.g. an
    // error message). The message applies to all nodes in node_names.
    string message = 3;
}

enum NodeState {
    // UNKNOWN_NODE_STATE indicates that the state of this node is unknown.
    UNKNOWN_NODE_STATE = 0;
    // NODE_CONNECTED indicates that we have established a connection
    // to this node. The client can expect to observe flows from this node.
    NODE_CONNECTED = 1;
    // NODE_UNAVAILABLE indicates that the connection to this
    // node is currently unavailable. The client can expect to not see any
    // flows from this node until either the connection is re-established or
    // the node is gone.
    NODE_UNAVAILABLE = 2;
    // NODE_GONE indicates that a node has been removed from the
    // cluster. No reconnection attempts will be made.
    NODE_GONE = 3;
    // NODE_ERROR indicates that a node has reported an error while processing
    // the request. No reconnection attempts will be made.
    NODE_ERROR = 4;
}
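
As an illustration of how a client might react to these events, here is a minimal Go sketch. The types are hand-written stand-ins that mirror the proto definition above; they are assumptions for the example and not the identifiers produced by the protobuf code generator. The `describe` helper is likewise hypothetical.

```go
package main

import "fmt"

// NodeState mirrors the NodeState enum above. Hand-written for illustration;
// the generated Go code uses different identifiers.
type NodeState int

const (
	UnknownNodeState NodeState = iota // UNKNOWN_NODE_STATE = 0
	NodeConnected                     // NODE_CONNECTED = 1
	NodeUnavailable                   // NODE_UNAVAILABLE = 2
	NodeGone                          // NODE_GONE = 3
	NodeError                         // NODE_ERROR = 4
)

// NodeStatusEvent mirrors the message above (hypothetical Go shape).
type NodeStatusEvent struct {
	StateChange NodeState
	NodeNames   []string
	Message     string
}

// describe turns a status event into a human-readable line.
func describe(ev NodeStatusEvent) string {
	switch ev.StateChange {
	case NodeConnected:
		return fmt.Sprintf("connected: %v", ev.NodeNames)
	case NodeUnavailable:
		return fmt.Sprintf("unavailable: %v", ev.NodeNames)
	case NodeGone:
		return fmt.Sprintf("gone: %v", ev.NodeNames)
	case NodeError:
		return fmt.Sprintf("error on %v: %s", ev.NodeNames, ev.Message)
	default:
		// Unknown or future states are reported rather than dropped.
		return fmt.Sprintf("unknown state %d: %v", ev.StateChange, ev.NodeNames)
	}
}

func main() {
	fmt.Println(describe(NodeStatusEvent{
		StateChange: NodeConnected,
		NodeNames:   []string{"node-a", "node-b"},
	}))
	fmt.Println(describe(NodeStatusEvent{
		StateChange: NodeError,
		NodeNames:   []string{"node-c"},
		Message:     "invalid filter",
	}))
}
```

The `default` branch is there so that a client built against this version of the enum still prints something useful if a later version adds more states.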

The implementation makes sure to send down the initial node status
for all nodes participating in the request first, i.e. before any flows.

The node state GONE is not used in this initial version, as the
current implementation of the peer management in hubble-relay does not
inform the running GetFlows request about the removal of a node.
We mark any failed node as UNAVAILABLE for now.
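
The "status first, flows later" ordering described above could be sketched roughly as follows. This is not the code from this PR: the `peer` type, its `Ready` field, and the `initialStatus` function are hypothetical names chosen for the example, and the event types are simplified stand-ins for the proto messages.

```go
package main

import "fmt"

// NodeState is a simplified stand-in for the proto enum.
type NodeState int

const (
	NodeConnected   NodeState = iota + 1 // matches NODE_CONNECTED = 1
	NodeUnavailable                      // matches NODE_UNAVAILABLE = 2
)

// NodeStatusEvent is a simplified stand-in for the proto message.
type NodeStatusEvent struct {
	StateChange NodeState
	NodeNames   []string
}

// peer is a hypothetical stand-in for hubble-relay's peer bookkeeping.
type peer struct {
	Name  string
	Ready bool
}

// initialStatus groups peers by readiness and returns at most one event per
// state. The caller is expected to send these before forwarding any flow.
func initialStatus(peers []peer) []NodeStatusEvent {
	var connected, unavailable []string
	for _, p := range peers {
		if p.Ready {
			connected = append(connected, p.Name)
		} else {
			unavailable = append(unavailable, p.Name)
		}
	}
	var events []NodeStatusEvent
	if len(connected) > 0 {
		events = append(events, NodeStatusEvent{NodeConnected, connected})
	}
	if len(unavailable) > 0 {
		events = append(events, NodeStatusEvent{NodeUnavailable, unavailable})
	}
	return events
}

func main() {
	peers := []peer{{"node-a", true}, {"node-b", false}, {"node-c", true}}
	for _, ev := range initialStatus(peers) {
		fmt.Println(ev.StateChange, ev.NodeNames)
	}
}
```

Grouping node names per state keeps the number of initial messages small, which is what the repeated `node_names` field in the proto definition allows for.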

Review per commit as this PR also performs a bit of cleanup.

Fixes: #11360

coveralls · 2020-05-18T19:50:55Z

Coverage increased (+0.009%) to 36.905% when pulling dd46c31 on pr/gandro/hubble-relay-status-response into 5a8501a on master.

rolinh

Very nice improvement overall! I'm a little worried about the complexity of the GetFlows implementation. Eventually, it should be refactored to make it more readable, testable, and debuggable (work for a follow-up PR, though).

api/v1/observer/observer.proto

rolinh · 2020-05-19T09:14:19Z

pkg/hubble/relay/observer.go

+func relayStatusResponse(numPeers int, failedPeers []string) *observerpb.GetFlowsResponse {
+	return &observerpb.GetFlowsResponse{
+		Time:     ptypes.TimestampNow(),
+		NodeName: node.GetName(),


Does the node name make sense in the context of hubble-relay? Well, I guess we have to fill this anyway and I see no better alternative.

pkg/hubble/relay/observer.go

rolinh

I think the changes you made to the API proposal are good. However, I need to experiment/play a bit more with the API in order to understand what could be missing or improved before giving a final approval.

rolinh · 2020-05-20T11:16:18Z

api/v1/relay/relay.proto

+
+enum NodeState {
+    // UNKNOWN_NODE_STATE indicates that the state of this node is unknown.
+    UNKNOWN_NODE_STATE = 0;


Since the enum itself is called NodeState, I don't think it's necessary to repeat NODE_STATE in each state name.
They could just be UNKNOWN, CONNECTED, UNAVAILABLE and GONE.

gandro · 2020-05-20

UNKNOWN needs to be prefixed because protobuf does not namespace enum values: it would collide with any other enum that also has an UNKNOWN variant. But I guess for the others, we can do that.

gandro · 2020-05-20

I have shortened the variants to NODE_xxx. I think a prefix is still useful, since relay.proto will gain quite a few other enums that may also want to use variants like ERROR.

pkg/hubble/relay/observer.go

gandro · 2020-05-20T17:08:23Z

I have added a NODE_ERROR status to the API. It is used to report errors which may occur during a request.

Because certain errors occur on multiple nodes simultaneously (e.g. invalid request parameters), I have also added a function to coalesce duplicate errors (within a predefined time window).

gandro · 2020-05-20T19:41:10Z

test-me-please

rolinh

I like the NODE_ERROR change, very welcome! As a follow-up to this PR, I wonder if we should not also record the last error for every peer along with the corresponding timestamp so that we could then serve this information via a corresponding node status gRPC service.

pkg/hubble/relay/relayoption/defaults.go

api/v1/relay/relay.proto

This commit contains no functional change. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

This introduces a new node status message sent by hubble-relay. The message is used to inform clients about the connectivity of nodes participating in a request. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

This adds support for NodeStatusEvents in the GetFlows RPC call. The intent of this event is to inform downstream consumers about the state of the nodes which are participating in the current request. The node state `NODE_GONE` is not used in this initial version, as the current implementation of the peer management in hubble-relay does not inform the running GetFlows request about the removal of a node. If a peer is not ready when the request is started, we mark it as `NODE_UNAVAILABLE`. If a peer errors out during a request, we indicate this via `NODE_ERROR` and propagate the received error message. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

This commit adds a new stage to the hubble-relay processing that suppresses duplicate errors occurring on multiple nodes within a certain time window. The list of nodes is merged, such that the reported error contains each node on which the error occurred. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

gandro · 2020-05-25T08:29:36Z

test-me-please

pkg/hubble/relay/observer.go

gandro · 2020-05-25T09:48:41Z

retest-4.9

gandro · 2020-05-25T09:48:50Z

retest-4.19

rolinh

I've tested this quite extensively and I'm pretty happy with this new API change and implementation. Well done!

gandro · 2020-05-25T12:10:28Z

retest-runtime

gandro · 2020-05-25T13:06:10Z

retest-4.19

gandro · 2020-05-25T15:09:30Z

The runtime failure is a known flake: #10838.

Considering this PR does not affect any components in the runtime tests, I will not restart it.

gandro added the release-note/misc and sig/hubble labels May 18, 2020

maintainer-s-little-helper bot added this to In progress in 1.8.0 May 18, 2020

rolinh reviewed May 19, 2020

gandro force-pushed the pr/gandro/hubble-relay-status-response branch from 2e6b01b to 30b1fed Compare May 20, 2020 11:05

gandro marked this pull request as ready for review May 20, 2020 11:11

gandro requested review from a team as code owners May 20, 2020 11:11

gandro requested a review from a team May 20, 2020 11:11

gandro changed the title ~~hubble-relay: Add relay status message~~ hubble-relay: Add node status message May 20, 2020

rolinh reviewed May 20, 2020

gandro force-pushed the pr/gandro/hubble-relay-status-response branch 2 times, most recently from fb05d51 to 814f17d Compare May 20, 2020 15:41

rolinh reviewed May 22, 2020

gandro mentioned this pull request May 22, 2020

printer: Add support for NodeStatusEvent cilium/hubble#260

Merged

rolinh reviewed May 22, 2020

hubble-relay: Split out GetFlows logic into functions

ded796a

gandro force-pushed the pr/gandro/hubble-relay-status-response branch from 65aa5e3 to 6b74ace Compare May 25, 2020 08:26

gandro added 3 commits May 25, 2020 10:28

api: Add node status message for hubble-relay

fd278c2

gandro force-pushed the pr/gandro/hubble-relay-status-response branch from 6b74ace to dd46c31 Compare May 25, 2020 08:28

rolinh reviewed May 25, 2020

rolinh approved these changes May 25, 2020

aanm approved these changes May 25, 2020

aanm merged commit 1ea1960 into master May 25, 2020

1.8.0 automation moved this from In progress to Merged May 25, 2020

aanm deleted the pr/gandro/hubble-relay-status-response branch May 25, 2020 15:49
