New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
node-neigh: Reduce arping related log msg's level #15261
Conversation
82ea199
to
6aaa80e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine reducing the log msg to level "Info" but if we are printing a "Info" message with the word "Fail" in it it might be confusing to users. If they are not that important maybe considering setting them as debug or add a new metric?
@aanm I still want to see what exactly failed, so keeping them as log msg in info lvl instead of adding a metric. |
@aanm Recently we had some big changes in the arp handling code. So I'd like to observe any possible discrepancies at least for a while. Later on, we can mute the messages, and expose errors via metrics. |
Then let's prefix |
6aaa80e
to
5240b52
Compare
@aanm I've changed the logging subsystem to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, could change "Failed" -> "Unable" then those messages would read less severe, e.g. "Failed to remove neighbor entry" becomes "Unable to remove neighbor entry".
pkg/datapath/linux/node.go
Outdated
@@ -695,7 +695,7 @@ func (n *linuxNodeHandler) insertNeighbor(ctx context.Context, newNode *nodeType | |||
logfields.IPAddr: neigh.IP, | |||
logfields.HardwareAddr: neigh.HardwareAddr, | |||
logfields.LinkIndex: neigh.LinkIndex, | |||
}).WithError(err).Warn("Failed to remove neighbor entry") | |||
}).WithError(err).Info("Failed to remove neighbor entry") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why would this happen and not be an error? Is it possible for the entry to be removed async to this operation so it does not exist or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about the case when a network operator removes an entry manually.
Beyond the scope of this patch, but I took look at the callers of insertNeighbor and shouldn't we return an error here and kick the retry logic? Otherwise we are waiting for the refresh logic to kick in from neighbor-table-refresh? I would expect a failed arp could be retried almost immediately and then backed off from there if it keeps failing. |
@jrfastab I've added a retry logic into both arping libraries we use. They will retry 3 times. If all fail, then it is up the periodic refresh to fix any outstanding issues. |
204fe6c
to
eca1bed
Compare
It is not the end of the world if any arping related operation fails (e.g. frequent connections between nodes ensure the presence of relevant L2 entries in the neigh table). So, decrease the log level of the log msgs.