-
Notifications
You must be signed in to change notification settings - Fork 2.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid tunnel map sync controller errors #18247
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Signed-off-by: Tobias Klauser <tobias@cilium.io>
Create the tunnel map as non-persistent instead of marking it as such in the package level init func. Signed-off-by: Tobias Klauser <tobias@cilium.io>
Currently, when deleteTunnelMapping(oldCIDR *cidr.CIDR, quietMode bool) is called with quietMode = true for an inexistent entry - as could be the case on initial node addition - we would get ENOENT on the underlying map operation in pkg/bpf/(*map).deleteMapEntry. This would lead to the bpf-map-sync-cilium_tunnel_map controller being started to reconcile the error. However, in some reported cases (see cilium#16488), the controller seems to stay in an error state failing to delete the non-existent entry. Avoid this situation entirely by ignoring any map delete error in case deleteTunnelMapping is called with quietMode = true. Signed-off-by: Tobias Klauser <tobias@cilium.io>
The tunnel map and thus its user space cache are only needed if either tunneling or the IPv4 egress gateway is enabled. Currently, the user space cache of the map is created unconditionally of whether it is actually used, leading to e.g. the bpf-map-sync-cilium_tunnel_map being spawend unnecessarily. This controller will e.g. show up in `cilium status --all-controllers` and might lead to confusion in setups where tunneling and the egress gateway feature are disabled. In some reported cases (see cilium#16488) with tunneling disabled that controller also ended up in an errors state, not being able to recoiver. Thus, only create the user space cache map and thus the sync controller in case it is actually needed to avoid this class of errors. Signed-off-by: Tobias Klauser <tobias@cilium.io>
tklauser
added
area/daemon
Impacts operation of the Cilium daemon.
release-note/bug
This PR fixes an issue in a previous release of Cilium.
labels
Dec 13, 2021
/test |
tklauser
added a commit
to cilium/image-tools
that referenced
this pull request
Dec 13, 2021
Currently, the commit index reported as part of the commit list is off by one, e.g. in an example from cilium/cilium#18247: ``` Retrieved 5 commits from PR #18247 {[0/5] option: correct godoc for (*DaemonConfig).TunnelingEnabled} {[1/5] maps/tunnel: mark TunnelMap as NonPersistent on creation} {[2/5] datapath, tunnel: correctly ignore ENOENT on tunnel mapping delete} {[3/5] daemon: open or create tunnel map cache only if needed} {[4/5] datapath/linux: don't attempt to delete old tunnel map entries on node addition} ``` See https://github.com/cilium/cilium/runs/4505300593?check_suite_focus=true Fix this by passing the 1-based index, not the 0-based index to function `check_commit`. Signed-off-by: Tobias Klauser <tobias@cilium.io>
The datapath.LocalNodeConfig.EnableEncapsulation field is immutable at runtime [1], i.e. it is defined once at agent startup. The tunnel map is created as non-persistent, meaning that any potentially pinned map would be deleted on startup [2], [3]. In combination this means that there cannot possibly be a case that there are left over old tunnel map entries in the tunnel map if encapsulation is disabled. [1] https://github.com/cilium/cilium/blob/6c169f63ec254de7777483b6f01c261215f9ec9c/pkg/datapath/node.go#L59-L64 [2] https://github.com/cilium/cilium/blob/6c169f63ec254de7777483b6f01c261215f9ec9c/pkg/maps/tunnel/tunnel.go#L48 [3] https://github.com/cilium/cilium/blob/6c169f63ec254de7777483b6f01c261215f9ec9c/pkg/bpf/map_linux.go#L104-L106 Signed-off-by: Tobias Klauser <tobias@cilium.io>
tklauser
force-pushed
the
tunnel-map-sync-fix
branch
from
December 13, 2021 11:39
c49f1d1
to
4b9e0c3
Compare
/test |
qmonnet
pushed a commit
to cilium/image-tools
that referenced
this pull request
Dec 13, 2021
Currently, the commit index reported as part of the commit list is off by one, e.g. in an example from cilium/cilium#18247: ``` Retrieved 5 commits from PR #18247 {[0/5] option: correct godoc for (*DaemonConfig).TunnelingEnabled} {[1/5] maps/tunnel: mark TunnelMap as NonPersistent on creation} {[2/5] datapath, tunnel: correctly ignore ENOENT on tunnel mapping delete} {[3/5] daemon: open or create tunnel map cache only if needed} {[4/5] datapath/linux: don't attempt to delete old tunnel map entries on node addition} ``` See https://github.com/cilium/cilium/runs/4505300593?check_suite_focus=true Fix this by passing the 1-based index, not the 0-based index to function `check_commit`. Signed-off-by: Tobias Klauser <tobias@cilium.io>
tklauser
changed the title
Avoid tunnel map sync errors
Avoid tunnel map sync controller errors
Dec 13, 2021
jibi
approved these changes
Dec 14, 2021
michi-covalent
approved these changes
Dec 16, 2021
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚢
maintainer-s-little-helper
bot
added
the
ready-to-merge
This PR has passed all tests and received consensus from code owners to merge.
label
Dec 16, 2021
maintainer-s-little-helper
bot
moved this from Needs backport from master
to Backport pending to v1.10
in 1.10.7
Dec 16, 2021
maintainer-s-little-helper
bot
moved this from Needs backport from master
to Backport pending to v1.10
in 1.10.7
Dec 16, 2021
tklauser
added
backport-done/1.11
The backport for Cilium 1.11.x for this PR is done.
backport-done/1.10
and removed
backport-pending/1.11
labels
Dec 17, 2021
maintainer-s-little-helper
bot
moved this from Backport pending to v1.10
to Backport done to v1.10
in 1.10.7
Dec 21, 2021
maintainer-s-little-helper
bot
moved this from Backport pending to v1.10
to Backport done to v1.10
in 1.10.7
Dec 21, 2021
This was referenced Jan 18, 2022
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
area/daemon
Impacts operation of the Cilium daemon.
backport-done/1.11
The backport for Cilium 1.11.x for this PR is done.
ready-to-merge
This PR has passed all tests and received consensus from code owners to merge.
release-note/bug
This PR fixes an issue in a previous release of Cilium.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Several users reported issues with
bpf-map-sync-cilium_tunnel_map
controller in an error state even though tunneling is disabled in their setup.While the exact condition causing these error states could not be reproduced, there are several counter-measures that can be taken to avoid these issues: most importantly disabling the
bpf-map-sync-cilium_tunnel_map
altogether in case tunneling (or the egress gateway, which also makes use of the tunnel map) is not enabled.See commit messages for more details on the individual changes.
Fixes: #16488