Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improves the error logs during the bpf maps updating #16034

Merged
merged 1 commit into from
Jul 19, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
17 changes: 13 additions & 4 deletions pkg/bpf/bpf_linux.go
Original file line number Diff line number Diff line change
Expand Up @@ -96,7 +96,7 @@ type bpfAttrMapOpElem struct {
}

// UpdateElementFromPointers updates the map in fd with the given value in the given key.
func UpdateElementFromPointers(fd int, structPtr unsafe.Pointer, sizeOfStruct uintptr) error {
func UpdateElementFromPointers(fd int, mapName string, structPtr unsafe.Pointer, sizeOfStruct uintptr) error {
var duration *spanstat.SpanStat
if option.Config.MetricsConfig.BPFSyscallDurationEnabled {
duration = spanstat.Start()
Expand All @@ -113,7 +113,16 @@ func UpdateElementFromPointers(fd int, structPtr unsafe.Pointer, sizeOfStruct ui
}

if ret != 0 || err != 0 {
return fmt.Errorf("Unable to update element for map with file descriptor %d: %s", fd, err)
switch err {
case unix.E2BIG:
return fmt.Errorf("Unable to update element for %s map with file descriptor %d: the map is full, please consider resizing it. %w", mapName, fd, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel a bit uneasy about the integration of dynamic content into the error message here, something seems a bit broken with the abstraction since I would otherwise expect the actual log message to include the map name as a dedicated field in structured logging, ie:

log.WithFields(logrus.Fields{
    logfields.MapName: "lb",
    ...
}.WithError(err).Warning("Cannot update service map")

So I feel like if we actually studied the various locations where these error messages end up getting logged, we could probably slim down the actual error here to the crucial details. The goal of the PR being to introduce the "the map is full" / "specified key already exists" / "key does not exist" information, which the underlying errno messages do not properly convey to the user.

In my mind, the "Unable to update element" bit should be obvious from the higher layer logging message, but for safety that part seems fine to keep here.

The map name I'd expect to be properly handled by the higher layers; if you have example error messages where this is unclear then I'd be curious to see them and understand why we can't improve those error messages closer to the log.... call rather than deep in the stack.

While we're reviewing these messages, I wonder whether we've used the file descriptor info before or if there's a reasonable way to actually use this information. File descriptor is just a number, if it were invalid then the kernel will give us back EBADFD. But I don't think there's any way to figure out which map the file descriptor corresponds to so does it really provide any more information that could be useful while debugging an issue? Maybe we can just drop it.

One other nit, wrapping error messages usually is done with a colon before the wrap format bit rather than full stop, ie please use consider resizing it: %w rather than consider resizing it. %w. Same for the other logs below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beyond that I have a bit of unease about the way that these error messages are integrating dynamic content, thereby moving away from structured logging. We've almost completely switched Cilium over to structured logs to help logging systems do a better job of indexing and aggregating logs based on the field types of the logs. Any move away from that approach feels like it introduces more work for us to improve it again in future. Structured logging also helps users by clearly separating different details of the message into fields like level=error msg="Cannot update service map" map="lb" error="unable to update element: map is full" hint="You can change the size of the LB map with the flag "bpf-lb-map-max". Resizing the map might break existing connections to services."

You lost me. I output the error message following the way they were done in the code before me. Are you suggesting to completely change it in this PR?

The error message before included the fd, and I think that's mainly because we haven't revisited this code since we started pushing more on structured logging. In my proposal above, I even suggest maybe we shouldn't be logging the fd at all.

This PR currently proposes to take this even further by propagating the map names down through the layers to embed into the errors here. That's the piece that seems not quite right to me. I think that it's a good idea to have the map name in the logs, but I think we should do it in a structured manner, ie log the map name directly from the log statements rather than passing it down to this layer and embedding the map name in the error.

What I anticipate is that if we accept this particular aspect as-is in the PR, then in future we still need to remove the map name from the error again so that the dynamic content is properly structured for better integration with loggers (whether it's just someone grepping through the logs to find similar logs with the same mapName or full on log processing engines that correlate logs together based on the fields we log).

Looking back over the review, it seems I missed the original time that this was brought up (#16034 (comment)) cc @brb .

case unix.EEXIST:
return fmt.Errorf("Unable to update element for %s map with file descriptor %d: specified key already exists. %w", mapName, fd, err)
case unix.ENOENT:
return fmt.Errorf("Unable to update element for %s map with file descriptor %d: key does not exist. %w", mapName, fd, err)
default:
return fmt.Errorf("Unable to update element for %s map with file descriptor %d: %w", mapName, fd, err)
}
}

return nil
Expand All @@ -125,15 +134,15 @@ func UpdateElementFromPointers(fd int, structPtr unsafe.Pointer, sizeOfStruct ui
// bpf.BPF_NOEXIST to create new element if it didn't exist;
// bpf.BPF_EXIST to update existing element.
// Deprecated, use UpdateElementFromPointers
func UpdateElement(fd int, key, value unsafe.Pointer, flags uint64) error {
func UpdateElement(fd int, mapName string, key, value unsafe.Pointer, flags uint64) error {
uba := bpfAttrMapOpElem{
mapFd: uint32(fd),
key: uint64(uintptr(key)),
value: uint64(uintptr(value)),
flags: uint64(flags),
}

ret := UpdateElementFromPointers(fd, unsafe.Pointer(&uba), unsafe.Sizeof(uba))
ret := UpdateElementFromPointers(fd, mapName, unsafe.Pointer(&uba), unsafe.Sizeof(uba))
runtime.KeepAlive(key)
runtime.KeepAlive(value)
return ret
Expand Down
4 changes: 2 additions & 2 deletions pkg/bpf/map_linux.go
Original file line number Diff line number Diff line change
Expand Up @@ -921,7 +921,7 @@ func (m *Map) Update(key MapKey, value MapValue) error {
return err
}

err = UpdateElement(m.fd, key.GetKeyPtr(), value.GetValuePtr(), 0)
err = UpdateElement(m.fd, m.name, key.GetKeyPtr(), value.GetValuePtr(), 0)
if option.Config.MetricsConfig.BPFMapOps {
metrics.BPFMapOps.WithLabelValues(m.commonName(), metricOpUpdate, metrics.Error2Outcome(err)).Inc()
}
Expand Down Expand Up @@ -1162,7 +1162,7 @@ func (m *Map) resolveErrors(ctx context.Context) error {
switch e.DesiredAction {
case OK:
case Insert:
err := UpdateElement(m.fd, e.Key.GetKeyPtr(), e.Value.GetValuePtr(), 0)
err := UpdateElement(m.fd, m.name, e.Key.GetKeyPtr(), e.Value.GetValuePtr(), 0)
if option.Config.MetricsConfig.BPFMapOps {
metrics.BPFMapOps.WithLabelValues(m.commonName(), metricOpUpdate, metrics.Error2Outcome(err)).Inc()
}
Expand Down
2 changes: 1 addition & 1 deletion pkg/maps/cidrmap/cidrmap.go
Original file line number Diff line number Diff line change
Expand Up @@ -91,7 +91,7 @@ func (cm *CIDRMap) InsertCIDR(cidr net.IPNet) error {
return err
}
log.WithField(logfields.Path, cm.path).Debugf("Inserting CIDR entry %s", cidr.String())
return bpf.UpdateElement(cm.Fd, unsafe.Pointer(&key), unsafe.Pointer(&entry), 0)
return bpf.UpdateElement(cm.Fd, cm.path, unsafe.Pointer(&key), unsafe.Pointer(&entry), 0)
}

// DeleteCIDR deletes an entry from 'cm' with key 'cidr'.
Expand Down
18 changes: 9 additions & 9 deletions pkg/maps/ctmap/ctmap_privileged_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -84,7 +84,7 @@ func (k *CTMapTestSuite) Benchmark_MapUpdate(c *C) {
for i := 0; i < c.N; i++ {
key.DestPort = uint16(i % 0xFFFF)
key.SourcePort = uint16(i / 0xFFFF)
err := bpf.UpdateElement(m.Map.GetFd(), unsafe.Pointer(key), unsafe.Pointer(value), 0)
err := bpf.UpdateElement(m.Map.GetFd(), m.Map.Name(), unsafe.Pointer(key), unsafe.Pointer(value), 0)
c.Assert(err, IsNil)
}

Expand Down Expand Up @@ -147,7 +147,7 @@ func (k *CTMapTestSuite) TestCtGcIcmp(c *C) {
TxBytes: 216,
Lifetime: 37459,
}
err = bpf.UpdateElement(ctMap.Map.GetFd(), unsafe.Pointer(ctKey),
err = bpf.UpdateElement(ctMap.Map.GetFd(), ctMap.Map.Name(), unsafe.Pointer(ctKey),
unsafe.Pointer(ctVal), 0)
c.Assert(err, IsNil)

Expand All @@ -169,7 +169,7 @@ func (k *CTMapTestSuite) TestCtGcIcmp(c *C) {
Addr: types.IPv4{192, 168, 34, 11},
Port: 0x3195,
}
err = bpf.UpdateElement(natMap.Map.GetFd(), unsafe.Pointer(natKey),
err = bpf.UpdateElement(natMap.Map.GetFd(), natMap.Map.Name(), unsafe.Pointer(natKey),
unsafe.Pointer(natVal), 0)
c.Assert(err, IsNil)
natKey = &nat.NatKey4{
Expand All @@ -190,7 +190,7 @@ func (k *CTMapTestSuite) TestCtGcIcmp(c *C) {
Addr: types.IPv4{192, 168, 34, 11},
Port: 0x3195,
}
err = bpf.UpdateElement(natMap.Map.GetFd(), unsafe.Pointer(natKey),
err = bpf.UpdateElement(natMap.Map.GetFd(), natMap.Map.Name(), unsafe.Pointer(natKey),
unsafe.Pointer(natVal), 0)
c.Assert(err, IsNil)

Expand Down Expand Up @@ -283,7 +283,7 @@ func (k *CTMapTestSuite) TestOrphanNatGC(c *C) {
TxBytes: 216,
Lifetime: 37459,
}
err = bpf.UpdateElement(ctMapAny.Map.GetFd(), unsafe.Pointer(ctKey),
err = bpf.UpdateElement(ctMapAny.Map.GetFd(), ctMapAny.Map.Name(), unsafe.Pointer(ctKey),
unsafe.Pointer(ctVal), 0)
c.Assert(err, IsNil)

Expand All @@ -305,7 +305,7 @@ func (k *CTMapTestSuite) TestOrphanNatGC(c *C) {
Addr: types.IPv4{10, 23, 32, 45},
Port: 0x51d6,
}
err = bpf.UpdateElement(natMap.Map.GetFd(), unsafe.Pointer(natKey),
err = bpf.UpdateElement(natMap.Map.GetFd(), natMap.Map.Name(), unsafe.Pointer(natKey),
unsafe.Pointer(natVal), 0)
c.Assert(err, IsNil)
natKey = &nat.NatKey4{
Expand All @@ -326,7 +326,7 @@ func (k *CTMapTestSuite) TestOrphanNatGC(c *C) {
Addr: types.IPv4{10, 23, 32, 45},
Port: 0x50d6,
}
err = bpf.UpdateElement(natMap.Map.GetFd(), unsafe.Pointer(natKey),
err = bpf.UpdateElement(natMap.Map.GetFd(), natMap.Map.Name(), unsafe.Pointer(natKey),
unsafe.Pointer(natVal), 0)
c.Assert(err, IsNil)

Expand Down Expand Up @@ -354,7 +354,7 @@ func (k *CTMapTestSuite) TestOrphanNatGC(c *C) {
c.Assert(len(buf), Equals, 0)

// Create only CT_INGRESS NAT entry which should be removed
err = bpf.UpdateElement(natMap.Map.GetFd(), unsafe.Pointer(natKey),
err = bpf.UpdateElement(natMap.Map.GetFd(), natMap.Map.Name(), unsafe.Pointer(natKey),
unsafe.Pointer(natVal), 0)
c.Assert(err, IsNil)

Expand Down Expand Up @@ -409,7 +409,7 @@ func (k *CTMapTestSuite) TestOrphanNatGC(c *C) {
Addr: types.IPv6{2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2},
Port: 0x51d6,
}
err = bpf.UpdateElement(natMapV6.Map.GetFd(), unsafe.Pointer(natKeyV6),
err = bpf.UpdateElement(natMapV6.Map.GetFd(), natMapV6.Map.Name(), unsafe.Pointer(natKeyV6),
unsafe.Pointer(natValV6), 0)
c.Assert(err, IsNil)

Expand Down
13 changes: 11 additions & 2 deletions pkg/maps/lbmap/lbmap.go
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@
package lbmap

import (
"errors"
"fmt"
"net"
"sort"
Expand All @@ -30,6 +31,7 @@ import (
"github.com/cilium/cilium/pkg/u8proto"

"github.com/sirupsen/logrus"
"golang.org/x/sys/unix"
)

var log = logging.DefaultLogger.WithField(logfields.LogSubsys, "map-lb")
Expand Down Expand Up @@ -117,8 +119,15 @@ func (lbmap *LBBPFMap) UpsertService(p *UpsertServiceParams) error {
svcVal.SetRevNat(int(p.ID))
svcKey.SetBackendSlot(slot)
if err := updateServiceEndpoint(svcKey, svcVal); err != nil {
return fmt.Errorf("Unable to update service entry %+v => %+v: %s",
svcKey, svcVal, err)
if errors.Is(err, unix.E2BIG) {
return fmt.Errorf("Unable to update service entry %+v => %+v: "+
"Unable to update element for LB bpf map: "+
"You can resize it with the flag \"--%s\". "+
"The resizing might break existing connections to services",
svcKey, svcVal, option.LBMapEntriesName)
}
Comment on lines +123 to +128
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we're mixing two different things here, on one hand there's the error message and on the other hand there's a hint about how to handle that error message.

What I'd expect is that we have structured logging messages somewhere that say something like:

scopedLog := log
if errors.Is(err, unix.E2BIG) {
    scopedLog = scopedLog.WithField("hint", fmt.Sprintf("You can resize the loadbalancer map size with the flag \"--%s\. Resizing the map might break existing connections to services.", option.LBMapEntriesName))
}
scopedLog.WithError(err).Warn("Cannot update service")

Is there a reason we can't do this?


return fmt.Errorf("Unable to update service entry %+v => %+v: %w", svcKey, svcVal, err)
}
slot++
}
Expand Down
2 changes: 1 addition & 1 deletion pkg/probe/probe.go
Original file line number Diff line number Diff line change
Expand Up @@ -77,7 +77,7 @@ func HaveFullLPM() bool {
if err != nil {
return
}
err = bpf.UpdateElement(m.GetFd(), unsafe.Pointer(&probeKey{}),
err = bpf.UpdateElement(m.GetFd(), m.Name(), unsafe.Pointer(&probeKey{}),
unsafe.Pointer(&probeValue{}), bpf.BPF_ANY)
if err != nil {
return
Expand Down