New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
hubble-relay: Add node status message #11589
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice improvement overall! I'm a little worried about the complexity of the GetFlows
implementation. Eventually, it should be refactored so that it could be more readable/testable/debuggable (work for a follow-up PR though).
pkg/hubble/relay/observer.go
Outdated
func relayStatusResponse(numPeers int, failedPeers []string) *observerpb.GetFlowsResponse { | ||
return &observerpb.GetFlowsResponse{ | ||
Time: ptypes.TimestampNow(), | ||
NodeName: node.GetName(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the node name makes sense in the context of hubble-relay? Well, I guess we have to fill this anyway and I see no better alternative.
2e6b01b
to
30b1fed
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the changes you made to the API proposal are good. However, I need to experiment/play a bit more with the API in order to understand what could be missing or improved before giving a final approval.
|
||
enum NodeState { | ||
// UNKNOWN_NODE_STATE indicates that the state of this node is unknown. | ||
UNKNOWN_NODE_STATE = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the enum itself is called NodeState
, I don't think it's necessary to repeat NODE_STATE
in each state name.
They could just be UNKNOWN
, CONNECTED
, UNAVAILABLE
and GONE
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UNKNOWN needs to be prefixed, because protobuf does not have enum namespaces and will therefore collide with any other enum which will have the UNKNOWN variant. But I guess for the others, we can do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have shortened the variants to NODE_xxx
. I think a prefix is still useful, since relay.proto
will gain quite a few other enums that maybe also want to make use of variants like ERROR
.
fb05d51
to
814f17d
Compare
I have added a Because certain errors occur on multiple nodes simultaneously (e.g. invalid request parameters), I have also added a function to coalesce duplicate errors (within a predefined time window). |
test-me-please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the NODE_ERROR
change, very welcome! As a follow-up to this PR, I wonder if we should not also record the last error for every peer along with the corresponding timestamp so that we could then serve this information via a corresponding note status grpc service.
This commit contians no functional change. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
65aa5e3
to
6b74ace
Compare
This introduces a new node status message sent hubble-relay. The message is used to inform clients about the connectivity of nodes participating in a request. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This adds support for NodeStatusEvents in the GetFlows RPC call. The intent of this event is to inform downstream consumers about the state of the nodes which are participating in the current request. The node state `NODE_GONE` is not used in this initial version, as the current implementation of the peer management in hubble-relay does not inform the running GetFlows request about the removal of a node. If a peer is not ready when the request is started, we mark it as `NODE_UNAVAILABLE`. If a peer errors out during a request, we indicate this via `NODE_ERROR` and propagate the received error message. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit adds a new stage to the hubble-relay processing which supresses duplicate errors which may occur on multiple nodes within a certain time window. The list of nodes is merged, such that the reported error contains each node on which the error occured. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
6b74ace
to
dd46c31
Compare
test-me-please |
retest-4.9 |
retest-4.19 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tested this quite extensively and I'm pretty happy with this new API change and implementation. Well done!
retest-runtime |
retest-4.19 |
The runtime failure is a known flake: #10838. Considering this PR does not affect any components in the runtime tests, I will not restart it. |
This adds a status message to the observer service implementation
of hubble-relay. The status message is added as a new variant of
the GetFlowsResponse, therefore older clients will silently ignore
it.
The new node status event is defined as follow:
The implementation makes sure to send down the initial node status
for all nodes participating in the request first, i.e. before any flows.
The node state
GONE
is not used in this initial version, as thecurrent implementation of the peer management in hubble-relay does not
inform the running GetFlows request about the removal of a node.
We mark any failed node as
UNAVAILABLE
for now.Review per commit as this PR also performs a bit of cleanup.
Fixes: #11360