Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid tunnel map sync controller errors #18247

Merged
merged 5 commits into from
Dec 16, 2021

Conversation

tklauser
Copy link
Member

Several users reported issues with bpf-map-sync-cilium_tunnel_map controller in an error state even though tunneling is disabled in their setup.

While the exact condition causing these error states could not be reproduced, there are several counter-measures that can be taken to avoid these issues: most importantly disabling the bpf-map-sync-cilium_tunnel_map altogether in case tunneling (or the egress gateway, which also makes use of the tunnel map) is not enabled.

See commit messages for more details on the individual changes.

Fixes: #16488

Fix an issue where the tunnel map sync controller causes errors even though tunneling is disabled.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
Create the tunnel map as non-persistent instead of marking it as such in
the package level init func.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
Currently, when deleteTunnelMapping(oldCIDR *cidr.CIDR, quietMode bool)
is called with quietMode = true for an inexistent entry - as could be the
case on initial node addition - we would get ENOENT on the underlying
map operation in pkg/bpf/(*map).deleteMapEntry. This would lead to the
bpf-map-sync-cilium_tunnel_map controller being started to reconcile the
error. However, in some reported cases (see cilium#16488), the controller
seems to stay in an error state failing to delete the non-existent
entry.

Avoid this situation entirely by ignoring any map delete error in case
deleteTunnelMapping is called with quietMode = true.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
The tunnel map and thus its user space cache are only needed if either
tunneling or the IPv4 egress gateway is enabled. Currently, the user
space cache of the map is created unconditionally of whether it is
actually used, leading to e.g. the bpf-map-sync-cilium_tunnel_map being
spawend unnecessarily. This controller will e.g. show up in `cilium
status --all-controllers` and might lead to confusion in setups where
tunneling and the egress gateway feature are disabled. In some reported
cases (see cilium#16488) with tunneling disabled that controller also ended up
in an errors state, not being able to recoiver. Thus, only create the
user space cache map and thus the sync controller in case it is actually
needed to avoid this class of errors.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
@tklauser tklauser added area/daemon Impacts operation of the Cilium daemon. release-note/bug This PR fixes an issue in a previous release of Cilium. labels Dec 13, 2021
@tklauser tklauser requested review from a team December 13, 2021 11:19
@tklauser tklauser requested a review from a team as a code owner December 13, 2021 11:19
@tklauser tklauser requested a review from jibi December 13, 2021 11:19
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.10.7 Dec 13, 2021
@tklauser
Copy link
Member Author

/test

tklauser added a commit to cilium/image-tools that referenced this pull request Dec 13, 2021
Currently, the commit index reported as part of the commit list is off
by one, e.g. in an example from
cilium/cilium#18247:

```
Retrieved 5 commits from PR #18247

{[0/5] option: correct godoc for (*DaemonConfig).TunnelingEnabled}
{[1/5] maps/tunnel: mark TunnelMap as NonPersistent on creation}
{[2/5] datapath, tunnel: correctly ignore ENOENT on tunnel mapping delete}
{[3/5] daemon: open or create tunnel map cache only if needed}
{[4/5] datapath/linux: don't attempt to delete old tunnel map entries on node addition}
```

See https://github.com/cilium/cilium/runs/4505300593?check_suite_focus=true

Fix this by passing the 1-based index, not the 0-based index to function
`check_commit`.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
The datapath.LocalNodeConfig.EnableEncapsulation field is immutable at
runtime [1], i.e. it is defined once at agent startup. The tunnel map is
created as non-persistent, meaning that any potentially pinned map would
be deleted on startup [2], [3]. In combination this means that there
cannot possibly be a case that there are left over old tunnel map
entries in the tunnel map if encapsulation is disabled.

[1] https://github.com/cilium/cilium/blob/6c169f63ec254de7777483b6f01c261215f9ec9c/pkg/datapath/node.go#L59-L64
[2] https://github.com/cilium/cilium/blob/6c169f63ec254de7777483b6f01c261215f9ec9c/pkg/maps/tunnel/tunnel.go#L48
[3] https://github.com/cilium/cilium/blob/6c169f63ec254de7777483b6f01c261215f9ec9c/pkg/bpf/map_linux.go#L104-L106

Signed-off-by: Tobias Klauser <tobias@cilium.io>
@tklauser
Copy link
Member Author

/test

qmonnet pushed a commit to cilium/image-tools that referenced this pull request Dec 13, 2021
Currently, the commit index reported as part of the commit list is off
by one, e.g. in an example from
cilium/cilium#18247:

```
Retrieved 5 commits from PR #18247

{[0/5] option: correct godoc for (*DaemonConfig).TunnelingEnabled}
{[1/5] maps/tunnel: mark TunnelMap as NonPersistent on creation}
{[2/5] datapath, tunnel: correctly ignore ENOENT on tunnel mapping delete}
{[3/5] daemon: open or create tunnel map cache only if needed}
{[4/5] datapath/linux: don't attempt to delete old tunnel map entries on node addition}
```

See https://github.com/cilium/cilium/runs/4505300593?check_suite_focus=true

Fix this by passing the 1-based index, not the 0-based index to function
`check_commit`.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
@tklauser tklauser changed the title Avoid tunnel map sync errors Avoid tunnel map sync controller errors Dec 13, 2021
pkg/datapath/linux/node.go Show resolved Hide resolved
Copy link
Contributor

@michi-covalent michi-covalent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Dec 16, 2021
@tklauser tklauser merged commit 386029b into cilium:master Dec 16, 2021
@tklauser tklauser deleted the tunnel-map-sync-fix branch December 16, 2021 10:10
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.10.7 Dec 16, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.10.7 Dec 16, 2021
@tklauser tklauser added backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. backport-done/1.10 and removed backport-pending/1.11 labels Dec 17, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.10 in 1.10.7 Dec 21, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.10 in 1.10.7 Dec 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/daemon Impacts operation of the Cilium daemon. backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
1.10.7
Backport done to v1.10
Development

Successfully merging this pull request may close these issues.

cilium-health test fails. bpf-map-sync-cilium_tunnel_map map sync errors
3 participants