
datapath: preserve tail call maps in between replacing interface program pairs #17744

Merged: 2 commits merged into cilium:master from the tb/prog-array-unpin branch on Feb 11, 2022

Conversation

@ti-mo (Contributor) commented Oct 29, 2021:

To prevent unpinning tail call maps after replacing only one half of an interface's programs (e.g. from-netdev and to-netdev), allow the caller to defer finalizer operations until a whole batch of interfaces has been replaced.

This avoids short windows of packet drops due to missed tail calls.

Fixes #16561

Also fixed a bug where we specified the wrong bpffs path to the migration code. This causes migrations to not be executed in the currently-released 1.11, meaning that resizing maps causes endpoint regeneration failures.

Release note: Preserve tail call maps during resize to prevent drops during agent upgrade
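
The mechanism can be sketched roughly as follows. This is a minimal illustration of the deferral pattern only, assuming a heavily simplified replaceDatapath signature; the names and arguments here are illustrative and not Cilium's actual API.

```go
// Sketch only: replaceDatapath is reduced to two arguments and the real
// attach/migration logic is elided; the point is the finalizer flow.
package loadersketch

import "context"

// replaceDatapath attaches the new program for one direction and returns a
// finalizer that unpins the old (renamed) tail call maps once it is safe.
func replaceDatapath(ctx context.Context, ifName, direction string) (func(), error) {
	// ... run the bpffs migration and attach the new program ...
	finalize := func() {
		// ... remove the backup pins of the previous maps ...
	}
	return finalize, nil
}

// reloadInterface replaces both halves of an interface's programs before
// running any finalizer, so tail calls keep resolving throughout the swap.
func reloadInterface(ctx context.Context, ifName string) error {
	var finalizers []func()
	for _, dir := range []string{"ingress", "egress"} {
		finalize, err := replaceDatapath(ctx, ifName, dir)
		if err != nil {
			return err // nothing has been unpinned yet; the old maps stay usable
		}
		finalizers = append(finalizers, finalize)
	}
	// Only after every replacement succeeded are the old map pins dropped.
	for _, f := range finalizers {
		f()
	}
	return nil
}
```

This mirrors the diffs further down in the review, where replaceDatapath now returns a finalize callback instead of unpinning the old maps immediately.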

@ti-mo requested review from a team and @kkourt on October 29, 2021 12:54
@maintainer-s-little-helper added the dont-merge/needs-release-note-label label (The author needs to describe the release impact of these changes) on Oct 29, 2021
@ti-mo added the release-note/bug label (This PR fixes an issue in a previous release of Cilium) on Oct 29, 2021
@maintainer-s-little-helper removed the dont-merge/needs-release-note-label label on Oct 29, 2021
@ti-mo requested a review from @brb on October 29, 2021 12:56
@pchaigno self-requested a review on October 29, 2021 13:16
@ti-mo (author) commented Oct 29, 2021:

/test

@kkourt (Contributor) left a comment:

LGTM.

I'm not 100% sure about not garbage collecting the files, but if this only happens when the map size is changed (which, I guess, does not happen that often) I cannot think of a strong argument against it.

That being said, could we do an easy version of GC where we keep only one backup directory? That is, after the backup is done, we look for older backups of the same map and delete them.
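
A rough sketch of that idea, assuming a hypothetical layout where backups of a map are pinned next to the original with a timestamp suffix; neither the naming scheme nor the helper below comes from Cilium's migration code.

```go
// Hypothetical layout: backups of <map> are pinned as <map>_<timestamp> next
// to the original pin. Keep only the newest one and delete the rest.
package bpfgc

import (
	"os"
	"path/filepath"
	"sort"
	"time"
)

func pruneOldBackups(bpffsDir, mapName string) error {
	matches, err := filepath.Glob(filepath.Join(bpffsDir, mapName+"_*"))
	if err != nil {
		return err
	}

	type backup struct {
		path string
		mod  time.Time
	}
	var backups []backup
	for _, m := range matches {
		fi, err := os.Stat(m)
		if err != nil {
			continue // pin may have been removed concurrently
		}
		backups = append(backups, backup{m, fi.ModTime()})
	}
	if len(backups) <= 1 {
		return nil // at most one backup, nothing to prune
	}

	// Newest first; everything after index 0 is an older backup.
	sort.Slice(backups, func(i, j int) bool {
		return backups[i].mod.After(backups[j].mod)
	})
	for _, b := range backups[1:] {
		if err := os.Remove(b.path); err != nil {
			return err
		}
	}
	return nil
}
```

Run right after a new backup is created, this keeps exactly one fallback per map, which is the lightweight GC suggested above.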

@pchaigno (Member) left a comment:

LGTM. Thanks!

Did you already test this by adding a random tail call somewhere and executing K8sUpdates?

> I'm not 100% sure about not garbage collecting the files, but if this only happens when the map size is changed (which, I guess, does not happen that often) I cannot think of a strong argument against it.

If it becomes an issue, we could maybe rely on the file creation timestamps in bpffs to determine which backups have been made a very long time ago and are safe to delete. But I really doubt Kubernetes nodes have a lifespan long enough for this to become an issue.

> That being said, could we do an easy version of GC where we keep only one backup directory? That is, after the backup is done, we look for older backups of the same map and delete them.

I don't think we can even assume datapath logic n-1 is fully loaded when we start loading datapath logic n. So there could still be bpf_lxc programs using tail call maps from the n-2 version... or even before :-/

@pchaigno added the sig/loader label (Impacts the loading of BPF programs into the kernel) on Nov 1, 2021
@ti-mo (author) commented Nov 2, 2021:

> Did you already test this by adding a random tail call somewhere and executing K8sUpdates?

I've tested this by creating a prog array of a different size manually and running the migration commands. Not sure how to work with K8sUpdates.

> But I really doubt Kubernetes nodes have a lifespan long enough for this to become an issue.

We can't really assume all k8s nodes are short-lived, as baremetal deployments are a thing as well. However, we only need to resize these maps once in a blue moon and they're rather small. If we do resize them, we should take care to grow them sufficiently (e.g. x1.5) so that adding future tail calls does not require a map migration.

@ti-mo added the ready-to-merge label (This PR has passed all tests and received consensus from code owners to merge) on Nov 2, 2021
@ti-mo (author) commented Nov 2, 2021:

Marked as ready; only known, unrelated flakes are hit. This should suffice for 1.11, and follow-ups can happen in case any issues become apparent with this approach.

@jrajahalme mentioned this pull request on Nov 3, 2021
@jrajahalme (Member) commented:
Is this superseding #13032?

@ti-mo (author) commented Nov 3, 2021:

> Is this superseding #13032?

Yep, I think it does!

@joestringer removed the ready-to-merge label on Nov 3, 2021
@joestringer (Member) commented:

I think we need to come to an agreement on the path to tackle #17744 (comment) before we can consider this as 'ready-to-merge'. I'll be happy to be proven wrong about the concern but let's follow where that discussion goes before rushing this into the tree.

@pchaigno (Member) commented Jan 31, 2022:

The Jenkins test failures are caused by #15474 (66967dd recently extended the test). The other required tests are passing and team review requests are covered, so I think this is ready to merge.

@pchaigno added the ready-to-merge label on Jan 31, 2022
@pchaigno removed their assignment on Jan 31, 2022
@joestringer removed the ready-to-merge label on Jan 31, 2022
pkg/datapath/loader/netlink.go (outdated, resolved)
```diff
@@ -70,7 +70,7 @@ func replaceDatapath(ctx context.Context, ifName, objPath, progSec, progDirectio

 	// Temporarily rename bpffs pins of maps whose definitions have changed in
 	// a new version of a datapath ELF.
-	if err := bpf.StartBPFFSMigration(bpf.GetMapRoot(), objPath); err != nil {
+	if err := bpf.StartBPFFSMigration(bpf.MapPrefixPath(), objPath); err != nil {
```
Review thread on this hunk:

I think it's rather critical given it can deadlock datapath regeneration (or even prevent agent start?) when maps get changed.

Can you describe this in a bit more detail and also include it in the commit message? It's not immediately obvious to me which conditions would cause this kind of situation, or even that the first patch is resolving such a serious bug.

On second thought, I think this is good to go. I don't see this as a particularly invasive or risky change, it just defers cleanup to be slightly later.

➕ Assuming we resolve the other feedback, this should be fairly low risk.
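
For context on the hunk above, here is a rough illustration of the two paths, assuming the default bpffs mount point; the constants and helper below are stand-ins and not Cilium's actual pkg/bpf code.

```go
package bpffspaths

import "path/filepath"

const (
	mapRoot   = "/sys/fs/bpf" // bpffs mount point, roughly what GetMapRoot() yields
	mapPrefix = "tc/globals"  // subdirectory where the datapath maps are pinned
)

// Scanning mapRoot alone finds no pinned cilium_* maps, so the migration was a
// no-op; the corrected call scans the directory that actually holds the pins.
func mapPrefixPath() string {
	return filepath.Join(mapRoot, mapPrefix) // "/sys/fs/bpf/tc/globals"
}
```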


```diff
 	if ep.RequireEgressProg() {
-		if err := replaceDatapath(ctx, ep.InterfaceName(), objPath, symbolToEndpoint, dirEgress, false, ""); err != nil {
+		finalize, err := replaceDatapath(ctx, ep.InterfaceName(), objPath, symbolToEndpoint, dirEgress, false, "")
```
@joestringer (Member) commented Jan 31, 2022:

If I follow right, the approach here is to ensure that both the ingress+egress programs for a given endpoint are completely reloaded before unpinning the old references. That makes some sense to me.

Similarly for the reloadHostDatapath case above, the programs for all interfaces are loaded first before unpinning the old map references. That also makes sense. 👍

I recall that some of the discussion previously was around what happens when we reload base programs, and how that could impact upgrade. Looking through, it seems like we've managed to sidestep that question for now while resolving the immediate concerns around upgrade. Is that right? The approach here seems like a good incremental improvement and allows us to defer dealing with the init.sh conversion and this other potential datapath disruption question until the go migration work matures.

(no changes necessary, this is more just me checking if I am following correctly)

@ti-mo (author) replied:

Yup, I think we're on the same page here. Could you specify what you mean by 'base programs'?

@joestringer (Member) replied Feb 10, 2022:

Sure thing. Base programs are the "global" programs that apply once for the node, like bpf_host, bpf_overlay, etc. Structurally in the code, when Cilium wants to update these programs, it calls into the datapath Reinitialize() function. It's typically pretty rare to reinitialize the datapath like this during normal Cilium operations.

This is in contrast to the per-endpoint bpf_lxc programs, which are cloned per endpoint and dynamically configured based on the creation of workloads on the node. These endpoint BPF programs are installed via the endpoint-specific code in pkg/endpoint/bpf.go. The actual program reload for endpoints goes via one of the following (a rough model is sketched after this list):

  • CompileAndLoad(): unusual; typically only triggered by the cilium endpoint regenerate CLI,
  • CompileOrLoad(): the typical case on startup, or
  • ReloadDatapath(): used when updating maps for individual endpoints; does not trigger recompilation of ELFs and does not reload the global programs.
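
As a rough, hypothetical model of those three paths (signatures simplified; this is not Cilium's real loader interface):

```go
package loadermodel

import "context"

// Endpoint is a stand-in for Cilium's endpoint type in this sketch.
type Endpoint interface{}

type Loader interface {
	// CompileAndLoad recompiles the endpoint's ELF and then reloads it.
	CompileAndLoad(ctx context.Context, ep Endpoint) error
	// CompileOrLoad compiles only if no usable object exists, then loads it.
	CompileOrLoad(ctx context.Context, ep Endpoint) error
	// ReloadDatapath reloads the existing object without recompiling and
	// without touching the node-wide ("base") programs.
	ReloadDatapath(ctx context.Context, ep Endpoint) error
}
```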

@ti-mo (author) replied:

@joestringer Thanks, I thought these could only regenerate on agent startup; TIL about cilium endpoint regenerate.

pkg/bpf/bpffs_migrate.go (outdated, resolved)

First commit message:

This patch fixes bpffs migrations meant to occur at runtime. The tc/globals/
part of the path was missing, and the bpffs root was being scanned instead,
causing migrations not to run. This would lead to endpoints continuously
failing to regenerate at runtime when map definitions in the ELF differ from
their pinned counterparts, e.g. during an agent upgrade.

For example, when resizing CALLS_MAP and trying to replace a program, the old
map will be loaded from bpffs, while the program expects to see a map of a
different capacity:

```
~ tc filter replace dev cilium-probe ingress bpf da obj bpf/bpf_lxc.o sec from-container
libbpf: couldn't reuse pinned map at '/sys/fs/bpf/tc//globals/test_cilium_calls_65535': parameter mismatch
libbpf: map 'test_cilium_calls_65535': error reusing pinned map
libbpf: map 'test_cilium_calls_65535': failed to create: Invalid argument(-22)
libbpf: failed to load object 'bpf/bpf_lxc.o'
Unable to load program
```

The CLI equivalent executed as part of init.sh does work as expected, as the
full path including tc/globals/ is specified there.

Signed-off-by: Timo Beckers <timo@isovalent.com>

Second commit message:

To prevent unpinning tail call maps after replacing only one half of an
interface's programs, e.g. from-netdev and to-netdev, allow the caller to
defer finalizer operations until a batch of interfaces have been replaced.

This avoids short windows of packet drops due to missed tail calls.

Fixes cilium#16561

Signed-off-by: Timo Beckers <timo@isovalent.com>
@ti-mo (author) commented Feb 11, 2022:

@joestringer Addressed your feedback, please take another look.

I've specified the following in the first commit:

> This would lead to endpoints continuously failing to regenerate at runtime when map definitions in the ELF differ from their pinned counterparts, e.g. during an agent upgrade.

> For example, when resizing CALLS_MAP and trying to replace a program, the old map will be loaded from bpffs, while the program expects to see a map of a different capacity:

```
~ tc filter replace dev cilium-probe ingress bpf da obj bpf/bpf_lxc.o sec from-container
libbpf: couldn't reuse pinned map at '/sys/fs/bpf/tc//globals/test_cilium_calls_65535': parameter mismatch
libbpf: map 'test_cilium_calls_65535': error reusing pinned map
libbpf: map 'test_cilium_calls_65535': failed to create: Invalid argument(-22)
libbpf: failed to load object 'bpf/bpf_lxc.o'
Unable to load program
```

@ti-mo (author) commented Feb 11, 2022:

/test

@joestringer (Member) left a comment:

LGTM, thanks again for the fix.

```diff
 	}

-	return nil
+	finalize := func() {
+		l := log.WithField("device", ifName).WithField("objPath", objPath)
```
Reviewer comment (Member):

FYI there's also log.WithFields(logrus.Fields{ ... }) to specify multiple fields at once with one call. This is a more common style elsewhere in the codebase, but also this doesn't make any meaningful difference so don't worry about changing this one.
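
For reference, a minimal sketch of that multi-field form; the package-level logger here is an assumption made for self-containment, not the loader's actual logging setup.

```go
package loaderlog

import "github.com/sirupsen/logrus"

var log = logrus.New()

// finalizeLogger builds the same two-field logger with a single WithFields
// call instead of chaining WithField twice.
func finalizeLogger(ifName, objPath string) *logrus.Entry {
	return log.WithFields(logrus.Fields{
		"device":  ifName,
		"objPath": objPath,
	})
}
```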

@pchaigno (Member) commented:
All tests are passing and reviews are in. Marking ready to merge.

@pchaigno added the ready-to-merge label on Feb 11, 2022
@joestringer merged commit 9fb1668 into cilium:master on Feb 11, 2022
@ti-mo deleted the tb/prog-array-unpin branch on February 11, 2022 21:17
@maintainer-s-little-helper moved this from "Needs backport from master" to "Backport pending to v1.10" in 1.11.2 on Feb 14, 2022
@maintainer-s-little-helper removed the ready-to-merge label on Feb 14, 2022
@jibi added the backport-done/1.11 label and removed backport-pending/1.11 on Feb 16, 2022
@maintainer-s-little-helper moved this from "Backport pending to v1.10" to "Backport done to v1.11" in 1.11.2 on Feb 16, 2022
Closes: bpf: cilium-map-migrate might unpin in-use tailcall map (#16561)