datapath: preserve tail call maps in between replacing interface program pairs #17744
Conversation
/test
LGTM.
I'm not 100% sure about not garbage collecting the files, but if this only happens when the map size is changed (which, I guess, does not happen that often) I cannot think of a strong argument against it.
That being said, could we do it an easy version of GC where we keep only one backup directory? That is, after the backup is done we look for older backups for the same map and delete them.
LGTM. Thanks!
Did you already test this by adding a random tail call somewhere and executing K8sUpdates?
> I'm not 100% sure about not garbage collecting the files, but if this only happens when the map size is changed (which, I guess, does not happen that often) I cannot think of a strong argument against it.
If it becomes an issue, we could maybe rely on the file creation timestamps in bpffs to determine which backups have been made a very long time ago and are safe to delete. But I really doubt Kubernetes nodes have a lifespan long enough for this to become an issue.
> That being said, could we do it an easy version of GC where we keep only one backup directory? That is, after the backup is done we look for older backups for the same map and delete them.
I don't think we can even assume datapath logic n-1 is fully loaded when we start loading datapath logic n. So there could still be bpf_lxc programs using tail call maps from the n-2 version... or even before :-/
I've tested this by creating a prog array of a different size manually and running the migration commands. Not sure how to work with
We can't really assume all k8s nodes are short-lived, as baremetal deployments are a thing as well. However, we only need to resize these maps once in a blue moon and they're rather small. If we do resize them, we should take care to grow them sufficiently (e.g. x1.5) so that adding future tail calls does not require a map migration.
Marked as ready, only known unrelated flakes are hit. This should suffice for 1.11, followups can happen in case any issues become apparent with this approach.
(force-pushed c900ed6 to 39cf95d)
Is this superseding #13032?
Yep, I think it does!
I think we need to come to an agreement on the path to tackle #17744 (comment) before we can consider this as 'ready-to-merge'. I'll be happy to be proven wrong about the concern, but let's follow where that discussion goes before rushing this into the tree.
```diff
@@ -70,7 +70,7 @@ func replaceDatapath(ctx context.Context, ifName, objPath, progSec, progDirectio

 	// Temporarily rename bpffs pins of maps whose definitions have changed in
 	// a new version of a datapath ELF.
-	if err := bpf.StartBPFFSMigration(bpf.GetMapRoot(), objPath); err != nil {
+	if err := bpf.StartBPFFSMigration(bpf.MapPrefixPath(), objPath); err != nil {
```
I think it's rather critical given it can deadlock datapath regeneration (or even prevent agent start?) when maps get changed.
Can you describe this in a bit more detail, also include it in the commit message? It's not immediately obvious to me which conditions would cause this kind of situation or even that the first patch is resolving such a serious bug.
On second thought, I think this is good to go. I don't see this as a particularly invasive or risky change, it just defers cleanup to be slightly later.
➕ Assuming we resolve the other feedback, this should be fairly low risk.
```diff
 	if ep.RequireEgressProg() {
-		if err := replaceDatapath(ctx, ep.InterfaceName(), objPath, symbolToEndpoint, dirEgress, false, ""); err != nil {
+		finalize, err := replaceDatapath(ctx, ep.InterfaceName(), objPath, symbolToEndpoint, dirEgress, false, "")
```
If I follow right, the approach here is to ensure that both the ingress+egress programs for a given endpoint are completely reloaded before unpinning the old references. That makes some sense to me.
Similarly for the `reloadHostDatapath` case above, the programs for all interfaces are loaded first before unpinning the old map references. That also makes sense. 👍
I recall that some of the discussion previously was around what happens when we reload base programs, and how that could impact upgrade. Looking through, it seems like we've managed to sidestep that question for now while resolving the immediate concerns around upgrade. Is that right? The approach here seems like a good incremental improvement and allows us to defer dealing with the `init.sh` conversion and this other potential datapath disruption question until the Go migration work matures.
(no changes necessary, this is more just me checking if I am following correctly)
Yup, I think we're on the same page here. Could you specify what you mean by 'base programs'?
Sure thing. Base programs are the "global" programs that apply once for the node, like bpf_host, bpf_overlay, etc. Structurally in the code, when Cilium wants to update these programs, it calls into the datapath Reinitialize() function. It's typically pretty rare to reinitialize the datapath like this during normal Cilium operations.
This is in contrast to the per-endpoint `bpf_lxc` programs, which are cloned per endpoint and are dynamically configured based on the creation of workloads on the node. These endpoint BPF programs are installed via the endpoint-specific code in `pkg/endpoint/bpf.go`. The actual program reload for endpoints goes via one of:
- `CompileAndLoad()`: unusual; typically only triggered by the `cilium endpoint regenerate` CLI,
- `CompileOrLoad()`: typical case on startup, or
- `ReloadDatapath()`: used when updating maps for individual endpoints; does not trigger recompilation of ELFs and does not reload the global programs.
@joestringer Thanks, I thought these could only regenerate on agent startup, TIL `endpoint regenerate`.
This patch fixes bpffs migrations meant to occur at runtime. The tc/globals/ part of the path was missing, and the bpffs root was being scanned instead, causing migrations not to run. This would lead to endpoints continuously failing to regenerate at runtime when map definitions in the ELF differ from their pinned counterparts, e.g. during an agent upgrade.

For example, when resizing CALLS_MAP and trying to replace a program, the old map will be loaded from bpffs, while the program expects to see a map of a different capacity:

```
~ tc filter replace dev cilium-probe ingress bpf da obj bpf/bpf_lxc.o sec from-container
libbpf: couldn't reuse pinned map at '/sys/fs/bpf/tc//globals/test_cilium_calls_65535': parameter mismatch
libbpf: map 'test_cilium_calls_65535': error reusing pinned map
libbpf: map 'test_cilium_calls_65535': failed to create: Invalid argument(-22)
libbpf: failed to load object 'bpf/bpf_lxc.o'
Unable to load program
```

The CLI equivalent executed as part of init.sh does work as expected, as the full path including tc/globals/ is specified there.

Signed-off-by: Timo Beckers <timo@isovalent.com>
To prevent unpinning tail call maps after replacing only one half of an interface's programs, e.g. from-netdev and to-netdev, allow the caller to defer finalizer operations until a batch of interfaces has been replaced. This avoids short windows of packet drops due to missed tail calls.

Fixes cilium#16561

Signed-off-by: Timo Beckers <timo@isovalent.com>
(force-pushed 43393e8 to 01e2385)
@joestringer Addressed your feedback, please take another look. I've specified the following in the first commit:
/test
LGTM, thanks again for the fix.
```diff
 	}

-	return nil
+	finalize := func() {
+		l := log.WithField("device", ifName).WithField("objPath", objPath)
```
FYI there's also `log.WithFields(logrus.Fields{ ... })` to specify multiple fields at once with one call. This is a more common style elsewhere in the codebase, but this doesn't make any meaningful difference, so don't worry about changing this one.
All tests are passing and reviews are in. Marking ready to merge.
To prevent unpinning tail call maps after replacing only one half of an interface's programs, e.g. `from-netdev` and `to-netdev`, allow the caller to defer finalizer operations until a batch of interfaces has been replaced. This avoids short windows of packet drops due to missed tail calls.
Fixes #16561
Also fixed a bug where we specified the wrong bpffs path to the migration code. This causes migrations not to be executed in the currently released 1.11, meaning resizing maps causes endpoint regeneration failures.