Improves the error logs during the bpf maps updating #16034
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -15,6 +15,7 @@ | |
package lbmap | ||
|
||
import ( | ||
"errors" | ||
"fmt" | ||
"net" | ||
"sort" | ||
|
@@ -30,6 +31,7 @@ import ( | |
"github.com/cilium/cilium/pkg/u8proto" | ||
|
||
"github.com/sirupsen/logrus" | ||
"golang.org/x/sys/unix" | ||
) | ||
|
||
var log = logging.DefaultLogger.WithField(logfields.LogSubsys, "map-lb") | ||
|
@@ -117,8 +119,15 @@ func (lbmap *LBBPFMap) UpsertService(p *UpsertServiceParams) error { | |
svcVal.SetRevNat(int(p.ID)) | ||
svcKey.SetBackendSlot(slot) | ||
if err := updateServiceEndpoint(svcKey, svcVal); err != nil { | ||
return fmt.Errorf("Unable to update service entry %+v => %+v: %s", | ||
svcKey, svcVal, err) | ||
if errors.Is(err, unix.E2BIG) { | ||
return fmt.Errorf("Unable to update service entry %+v => %+v: "+ | ||
"Unable to update element for LB bpf map: "+ | ||
"You can resize it with the flag \"--%s\". "+ | ||
"The resizing might break existing connections to services", | ||
svcKey, svcVal, option.LBMapEntriesName) | ||
} | ||
Comment on lines +123 to +128
I feel like we're mixing two different things here, on one hand there's the error message and on the other hand there's a hint about how to handle that error message. What I'd expect is that we have structured logging messages somewhere that say something like:
Is there a reason we can't do this?
||
|
||
return fmt.Errorf("Unable to update service entry %+v => %+v: %w", svcKey, svcVal, err) | ||
} | ||
slot++ | ||
} | ||
|
I feel a bit uneasy about the integration of dynamic content into the error message here, something seems a bit broken with the abstraction since I would otherwise expect the actual log message to include the map name as a dedicated field in structured logging, ie:
So I feel like if we actually studied the various locations where these error messages end up getting logged, we could probably slim down the actual error here to the crucial details. The goal of the PR being to introduce the "the map is full" / "specified key already exists" / "key does not exist" information, which the underlying errno messages do not properly convey to the user.
In my mind, the "Unable to update element" bit should be obvious from the higher layer logging message, but for safety that part seems fine to keep here.
The map name I'd expect to be properly handled by the higher layers; if you have example error messages where this is unclear then I'd be curious to see them and understand why we can't improve those error messages closer to the `log....` call rather than deep in the stack.
While we're reviewing these messages, I wonder whether we've used the file descriptor info before or if there's a reasonable way to actually use this information. A file descriptor is just a number; if it were invalid then the kernel would give us back `EBADF`. But I don't think there's any way to figure out which map the file descriptor corresponds to, so does it really provide any more information that could be useful while debugging an issue? Maybe we can just drop it.
One other nit: wrapping error messages is usually done with a colon before the wrap format bit rather than a full stop, i.e. please use `consider resizing it: %w` rather than `consider resizing it. %w`. Same for the other logs below.
The error message before included the fd, and I think that's mainly because we haven't revisited this code since we started pushing more on structured logging. In my proposal above, I even suggest maybe we shouldn't be logging the fd at all.
This PR currently proposes to take this even further by propagating the map names down through the layers to embed into the errors here. That's the piece that seems not quite right to me. I think that it's a good idea to have the map name in the logs, but I think we should do it in a structured manner, ie log the map name directly from the log statements rather than passing it down to this layer and embedding the map name in the error.
What I anticipate is that if we accept this particular aspect as-is in the PR, then in future we still need to remove the map name from the error again so that the dynamic content is properly structured for better integration with loggers (whether it's just someone grepping through the logs to find similar logs with the same `mapName` or full on log processing engines that correlate logs together based on the fields we log).
Looking back over the review, it seems I missed the original time that this was brought up (#16034 (comment)) cc @brb.