hubble-relay: Add node status message #11589

Merged
aanm merged 4 commits into master from pr/gandro/hubble-relay-status-response on May 25, 2020

Member

@gandro gandro commented May 18, 2020

This adds a status message to the observer service implementation
of hubble-relay. The status message is added as a new variant of
the GetFlowsResponse, so older clients will silently ignore it.

The new node status event is defined as follows:

// NodeStatusEvent is a message sent by hubble-relay to inform clients about
// the state of a particular node.
message NodeStatusEvent {
    // state_change contains the new node state
    NodeState state_change = 1;
    // node_names is the list of nodes for which the above state change applies
    repeated string node_names = 2;
    // message is an optional message attached to the state change (e.g. an
    // error message). The message applies to all nodes in node_names.
    string message = 3;
}

enum NodeState {
    // UNKNOWN_NODE_STATE indicates that the state of this node is unknown.
    UNKNOWN_NODE_STATE = 0;
    // NODE_CONNECTED indicates that we have established a connection
    // to this node. The client can expect to observe flows from this node.
    NODE_CONNECTED = 1;
    // NODE_UNAVAILABLE indicates that the connection to this
    // node is currently unavailable. The client can expect to not see any
    // flows from this node until either the connection is re-established or
    // the node is gone.
    NODE_UNAVAILABLE = 2;
    // NODE_GONE indicates that a node has been removed from the
    // cluster. No reconnection attempts will be made.
    NODE_GONE = 3;
    // NODE_ERROR indicates that a node has reported an error while processing
    // the request. No reconnection attempts will be made.
    NODE_ERROR = 4;
}
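
As a rough illustration of how a client might interpret these states, here is a minimal Go sketch. The `NodeState` constants and the `NodeStatusEvent` struct are hand-written stand-ins that mirror the proto definition above; they are not the generated Go bindings, and `describe` is a hypothetical helper whose wording is taken from the enum comments.

```go
package main

import (
	"fmt"
	"strings"
)

// NodeState mirrors the NodeState enum from the PR description.
// Assumption: a hand-written stand-in, not code generated from the .proto file.
type NodeState int32

const (
	UnknownNodeState NodeState = 0
	NodeConnected    NodeState = 1
	NodeUnavailable  NodeState = 2
	NodeGone         NodeState = 3
	NodeError        NodeState = 4
)

// NodeStatusEvent mirrors the NodeStatusEvent message (hypothetical type).
type NodeStatusEvent struct {
	StateChange NodeState
	NodeNames   []string
	Message     string
}

// describe renders one event as a single line, following the semantics
// documented in the enum comments (hypothetical helper).
func describe(ev NodeStatusEvent) string {
	var state string
	switch ev.StateChange {
	case NodeConnected:
		state = "connected, flows expected"
	case NodeUnavailable:
		state = "unavailable, no flows until reconnected or gone"
	case NodeGone:
		state = "removed from cluster, no reconnection"
	case NodeError:
		state = "error, no reconnection"
	default:
		state = "state unknown"
	}
	line := "[" + strings.Join(ev.NodeNames, ", ") + "] " + state
	if ev.Message != "" {
		// The optional message applies to every node listed in the event.
		line += " (" + ev.Message + ")"
	}
	return line
}

func main() {
	fmt.Println(describe(NodeStatusEvent{
		StateChange: NodeError,
		NodeNames:   []string{"node-a", "node-b"},
		Message:     "invalid filter",
	}))
}
```

Because one event can name several nodes, the sketch treats the message as shared by all of them, as the field comment states.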

The implementation makes sure to send down the initial node status
for all nodes participating in the request first, i.e. before any flows.

The node state GONE is not used in this initial version, as the
current implementation of the peer management in hubble-relay does not
inform the running GetFlows request about the removal of a node.
We mark any failed node as UNAVAILABLE for now.
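
To make the ordering and the UNAVAILABLE fallback concrete, here is a minimal Go sketch under stated assumptions. `initialStatus` and its `ready` map are hypothetical and not taken from the hubble-relay code. Grouping nodes into one event per state is also an assumption, suggested only by the repeated `node_names` field.

```go
package main

import (
	"fmt"
	"sort"
)

// Hand-written stand-ins for the proto types (assumption, not generated code).
type NodeState int32

const (
	NodeConnected   NodeState = 1
	NodeUnavailable NodeState = 2
)

type NodeStatusEvent struct {
	StateChange NodeState
	NodeNames   []string
}

// initialStatus builds the status events a relay could send before any
// flows. ready maps a node name to whether its connection is usable when
// the request starts (hypothetical input shape).
func initialStatus(ready map[string]bool) []NodeStatusEvent {
	var connected, unavailable []string
	for name, ok := range ready {
		if ok {
			connected = append(connected, name)
		} else {
			// NODE_GONE is not reported in this version, so every failed
			// node is marked as unavailable.
			unavailable = append(unavailable, name)
		}
	}
	// Map iteration order is random in Go; sort for stable output.
	sort.Strings(connected)
	sort.Strings(unavailable)

	var events []NodeStatusEvent
	if len(connected) > 0 {
		events = append(events, NodeStatusEvent{StateChange: NodeConnected, NodeNames: connected})
	}
	if len(unavailable) > 0 {
		events = append(events, NodeStatusEvent{StateChange: NodeUnavailable, NodeNames: unavailable})
	}
	return events
}

func main() {
	// A caller would send these events first, then start streaming flows.
	for _, ev := range initialStatus(map[string]bool{"node-a": true, "node-b": false}) {
		fmt.Println(ev.StateChange, ev.NodeNames)
	}
}
```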

Review per commit as this PR also performs a bit of cleanup.

Fixes: #11360

@gandro gandro added the release-note/misc (This PR makes changes that have no direct user impact.) and sig/hubble (Impacts hubble server or relay) labels May 18, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.8.0 May 18, 2020
@coveralls

coveralls commented May 18, 2020

Coverage Status

Coverage increased (+0.009%) to 36.905% when pulling dd46c31 on pr/gandro/hubble-relay-status-response into 5a8501a on master.

Member

@rolinh rolinh left a comment

Very nice improvement overall! I'm a little worried about the complexity of the GetFlows implementation. Eventually, it should be refactored so that it could be more readable/testable/debuggable (work for a follow-up PR though).

func relayStatusResponse(numPeers int, failedPeers []string) *observerpb.GetFlowsResponse {
	return &observerpb.GetFlowsResponse{
		Time:     ptypes.TimestampNow(),
		NodeName: node.GetName(),
Member

Does the node name make sense in the context of hubble-relay? Well, I guess we have to fill this anyway and I see no better alternative.

@gandro gandro force-pushed the pr/gandro/hubble-relay-status-response branch from 2e6b01b to 30b1fed Compare May 20, 2020 11:05
@gandro gandro marked this pull request as ready for review May 20, 2020 11:11
@gandro gandro requested review from a team as code owners May 20, 2020 11:11
@gandro gandro requested a review from a team May 20, 2020 11:11
@gandro gandro changed the title hubble-relay: Add relay status message hubble-relay: Add node status message May 20, 2020
Member

@rolinh rolinh left a comment

I think the changes you made to the API proposal are good. However, I need to experiment/play a bit more with the API in order to understand what could be missing or improved before giving a final approval.


enum NodeState {
    // UNKNOWN_NODE_STATE indicates that the state of this node is unknown.
    UNKNOWN_NODE_STATE = 0;
Member

Since the enum itself is called NodeState, I don't think it's necessary to repeat NODE_STATE in each state name.
They could just be UNKNOWN, CONNECTED, UNAVAILABLE and GONE.

Member Author

UNKNOWN needs to be prefixed: protobuf enum values are not namespaced by their enum, so it would collide with any other enum in the same scope that has an UNKNOWN variant. But I guess for the others, we can do that.

Member Author

I have shortened the variants to NODE_xxx. I think a prefix is still useful, since relay.proto will gain quite a few other enums that maybe also want to make use of variants like ERROR.

@gandro gandro force-pushed the pr/gandro/hubble-relay-status-response branch 2 times, most recently from fb05d51 to 814f17d Compare May 20, 2020 15:41
@gandro
Member Author

gandro commented May 20, 2020

I have added a NODE_ERROR status to the API. It is used to report errors which may occur during a request.

Because certain errors occur on multiple nodes simultaneously (e.g. invalid request parameters), I have also added a function to coalesce duplicate errors (within a predefined time window).

@gandro
Member Author

gandro commented May 20, 2020

test-me-please

Member

@rolinh rolinh left a comment

I like the NODE_ERROR change, very welcome! As a follow-up to this PR, I wonder whether we should also record the last error for every peer, along with the corresponding timestamp, so that we could then serve this information via a corresponding node status gRPC service.

This commit contains no functional change.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/gandro/hubble-relay-status-response branch from 65aa5e3 to 6b74ace Compare May 25, 2020 08:26
This introduces a new node status message sent by hubble-relay. The
message is used to inform clients about the connectivity of nodes
participating in a request.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This adds support for NodeStatusEvents in the GetFlows RPC call.
The intent of this event is to inform downstream consumers about the
state of the nodes which are participating in the current request.

The node state `NODE_GONE` is not used in this initial version, as the
current implementation of the peer management in hubble-relay does not
inform the running GetFlows request about the removal of a node.

If a peer is not ready when the request is started, we mark it as
`NODE_UNAVAILABLE`. If a peer errors out during a request, we indicate
this via `NODE_ERROR` and propagate the received error message.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit adds a new stage to the hubble-relay processing which
suppresses duplicate errors which may occur on multiple nodes within a
certain time window.

The list of nodes is merged, such that the reported error contains
each node on which the error occurred.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/gandro/hubble-relay-status-response branch from 6b74ace to dd46c31 Compare May 25, 2020 08:28
@gandro
Member Author

gandro commented May 25, 2020

test-me-please

@gandro
Member Author

gandro commented May 25, 2020

retest-4.9

@gandro
Member Author

gandro commented May 25, 2020

retest-4.19

Member

@rolinh rolinh left a comment

I've tested this quite extensively and I'm pretty happy with this new API change and implementation. Well done!

@gandro
Member Author

gandro commented May 25, 2020

retest-runtime

@gandro
Member Author

gandro commented May 25, 2020

retest-4.19

@gandro
Member Author

gandro commented May 25, 2020

The runtime failure is a known flake: #10838.

Considering this PR does not affect any components in the runtime tests, I will not restart it.

@aanm aanm merged commit 1ea1960 into master May 25, 2020
1.8.0 automation moved this from In progress to Merged May 25, 2020
@aanm aanm deleted the pr/gandro/hubble-relay-status-response branch May 25, 2020 15:49
Labels
release-note/misc This PR makes changes that have no direct user impact. sig/hubble Impacts hubble server or relay
Projects
No open projects
1.8.0
  
Merged
Development

Successfully merging this pull request may close these issues.

api/observer: include the number of peers that answered the request in GetFlows() response
4 participants