ipcache: fix flapping labels in SelectorCache when reserved:host identity has multiple IPs #28332
d5b7717 to 59e7f3b
Additional CI for "pod-to-node-policycidr" test: #28334.
/test
I'm having trouble understanding this and without understanding, it's difficult to review the PR. Could you go into more detail and also provide an example?
This test keeps failing, but when I run it locally, it passes. I'm going to rerun it a few more times, because the error is simply one of the test pods not coming up.
Core change LGTM from a high level; I walked through the logic along with @christarazi. He may have some other minor feedback.
@schlosna updated based on feedback; PTAL.
59e7f3b to ae66197
Thanks for addressing my comments, this looks good to me.
The commit check is unhappy with a duplicated word `labels` in the commit message and references to commits b099ba6 & f5d9532 that were part of #19765
{[1/1] ipcache: aggregate labels from all IPs with local host identity}
Running on ae661971b3e961b64b9f7fc28a6e162eb72ed58e
Warning: WARNING:REPEATED_WORD: Possible repeated word: 'labels'
#20:
with labels labels reserved:host, reserved:kube-apsierver
Warning: WARNING:UNKNOWN_COMMIT_ID: Unknown commit id 'b099ba614', maybe rebased or not pulled?
#41:
Fixes: b099ba6
Warning: WARNING:UNKNOWN_COMMIT_ID: Unknown commit id 'f5d95325c', maybe rebased or not pulled?
#42:
Fixes: f5d9532
ae66197 to 6bcc7b6
6bcc7b6 to f73015e
f73015e to ee72e73
FYI @joestringer I was chatting with @christarazi, then I realized a few hours later that I had missed a case (albeit unlikely): if an IP loses |
The `reserved:host` identity is special: the numeric identity is fixed and the set of labels is mutable. (The datapath requires this.) So, we need to determine all prefixes that have the `reserved:host` label and capture their labels. Then, we must aggregate *all* labels from all IPs and insert them as the `reserved:host` identity labels.

However, the code as written has a race condition whenever the local node has more than one IP address. This can happen when, for example, vxlan or ipv6 is enabled. The basic sequence is this:

1. Insert IP A as `reserved:host` into the ipcache. ID 1 now has labels `reserved:host`.
2. Insert IP A as `reserved:kube-apiserver` into the ipcache. ID 1 is updated with labels `reserved:host, reserved:kube-apiserver`.
3. Insert IP B as `reserved:host` into the ipcache. ID 1 is updated with labels `reserved:host`.

And now policies that select `reserved:kube-apiserver` are broken.

Likewise, we need to always update the SelectorCache; we cannot short-circuit if the ipcache already has that identity. Again, this is needed because the identity is mutable. So this bug can take another form:

1. Insert IP A as `reserved:host` into the ipcache. Because IP A is not known to the ipcache, treat ID 1 as a new identity and update the selector cache.
2. Insert IP A as `reserved:kube-apiserver`. Mutate the labels of ID 1. But, because IP A already has ID 1, short-circuit the update to the selector cache (if the Source is the same, which it _may_ be).
3. Now the selector cache has incorrect labels for ID 1.

Without this fix, when there are multiple IPs with the host label, the identity may flap and the SelectorCache may be missing updates.

Fixes: cilium#28259
Fixes: e0d403a
Fixes: 308c142

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
ee72e73 to 3887bc5
/test
The `reserved:host` identity is special: the numeric identity is fixed and the set of labels is mutable. (The datapath requires this.) So, we need to determine all prefixes that have the `reserved:host` label and capture their labels. Then, we must aggregate all labels from all IPs and insert them as the `reserved:host` identity labels.

However, the code as written has a race condition whenever the local node has more than one IP address. This can happen when, for example, vxlan or ipv6 is enabled. The basic sequence is this:

1. Insert IP A as `reserved:host` into the ipcache. ID 1 now has labels `reserved:host`.
2. Insert IP A as `reserved:kube-apiserver` into the ipcache. ID 1 is updated with labels `reserved:host, reserved:kube-apiserver`.
3. Insert IP B as `reserved:host` into the ipcache. ID 1 is updated with labels `reserved:host`.

And now policies are broken.
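
To make the sequence above concrete, here is a minimal, self-contained sketch in Go. It is a toy model, not Cilium's ipcache: `toyCache`, `upsert`, `replayLabels`, and the two example prefixes (`10.0.0.1/32` for IP A, `fd00::1/128` for IP B) are all hypothetical names chosen for illustration. It replays the three steps twice: once letting the most recently touched prefix overwrite the labels of ID 1, and once aggregating labels across every prefix that carries `reserved:host`.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

const (
	hostID       = 1 // stands in for the fixed numeric identity of reserved:host
	labelHost    = "reserved:host"
	labelKubeAPI = "reserved:kube-apiserver"
)

// labelSet is a simplified, hypothetical stand-in for a set of labels.
type labelSet map[string]struct{}

// String renders the labels sorted and comma-separated so output is stable.
func (s labelSet) String() string {
	out := make([]string, 0, len(s))
	for l := range s {
		out = append(out, l)
	}
	sort.Strings(out)
	return strings.Join(out, ",")
}

// toyCache is a toy model (an assumption, not Cilium's real ipcache).
type toyCache struct {
	byPrefix map[string]labelSet // labels attached to each prefix
	byID     map[int]labelSet    // labels currently published for each identity
}

func newToyCache() *toyCache {
	return &toyCache{byPrefix: map[string]labelSet{}, byID: map[int]labelSet{}}
}

// upsert adds a label to a prefix and then recomputes the labels of hostID.
// With aggregate=false only the prefix just touched contributes (last writer
// wins). With aggregate=true every prefix that has reserved:host contributes.
func (c *toyCache) upsert(prefix, label string, aggregate bool) {
	if c.byPrefix[prefix] == nil {
		c.byPrefix[prefix] = labelSet{}
	}
	c.byPrefix[prefix][label] = struct{}{}

	merged := labelSet{}
	for p, ls := range c.byPrefix {
		if _, isHost := ls[labelHost]; !isHost {
			continue
		}
		if !aggregate && p != prefix {
			continue // buggy variant: ignore all other host prefixes
		}
		for l := range ls {
			merged[l] = struct{}{}
		}
	}
	c.byID[hostID] = merged
}

// replayLabels runs the three steps and returns the final labels of ID 1.
func replayLabels(aggregate bool) string {
	c := newToyCache()
	c.upsert("10.0.0.1/32", labelHost, aggregate)    // step 1: IP A
	c.upsert("10.0.0.1/32", labelKubeAPI, aggregate) // step 2: IP A
	c.upsert("fd00::1/128", labelHost, aggregate)    // step 3: IP B
	return c.byID[hostID].String()
}

func main() {
	fmt.Println("last writer wins:", replayLabels(false))
	fmt.Println("aggregated:      ", replayLabels(true))
}
```

In the last-writer-wins run, ID 1 ends with only `reserved:host`, which is the flapping described above; in the aggregated run it keeps both labels.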
Likewise, we need to always update the SelectorCache; we cannot short-circuit if the ipcache already has that identity. Again, this is needed because the identity is mutable. So this bug can take another form:

1. Insert IP A as `reserved:host` into the ipcache. Because IP A is not known to the ipcache, treat ID 1 as a new identity and update the selector cache.
2. Insert IP A as `reserved:kube-apiserver`. Mutate the labels of ID 1. But, because IP A already has ID 1, short-circuit the update to the selector cache (if the Source is the same, which it may be).
3. Now the selector cache has incorrect labels for ID 1.

Without this fix, when there are multiple IPs with the host label, the identity may flap and the SelectorCache may be missing updates.
Fixes: #28259
Fixes: b099ba6
Fixes: f5d9532
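
The second form of the bug described above (the short-circuited SelectorCache update) can be sketched the same way. Again this is a toy model under stated assumptions: `toyIPCache`, `toySelectorCache`, `upsert`, the `alwaysNotify` flag, and the prefix `10.0.0.1/32` are hypothetical and do not correspond to Cilium's actual types or function signatures. The sketch only shows why skipping the notification when a prefix already maps to the same numeric identity leaves the selector cache stale once that identity's labels have been mutated.

```go
package main

import "fmt"

const hostID = 1 // stands in for the fixed numeric identity of reserved:host

// toySelectorCache records the labels it was last told about per identity.
type toySelectorCache struct {
	labelsByID map[int]string
}

// toyIPCache is a toy model (an assumption, not Cilium's real ipcache).
type toyIPCache struct {
	idByPrefix map[string]int // prefix -> numeric identity
	labelsByID map[int]string // identity -> current (mutable) labels
	selectors  *toySelectorCache
}

func newToyIPCache() *toyIPCache {
	return &toyIPCache{
		idByPrefix: map[string]int{},
		labelsByID: map[int]string{},
		selectors:  &toySelectorCache{labelsByID: map[int]string{}},
	}
}

// upsert maps prefix to hostID and mutates the labels of that identity.
// With alwaysNotify=false the selector cache is only updated when the prefix
// was not already mapped to hostID, which models the short-circuit.
func (c *toyIPCache) upsert(prefix, labels string, alwaysNotify bool) {
	prevID, known := c.idByPrefix[prefix]
	c.idByPrefix[prefix] = hostID
	c.labelsByID[hostID] = labels // the identity's labels change in place

	if known && prevID == hostID && !alwaysNotify {
		return // short-circuit: numeric identity unchanged, update skipped
	}
	c.selectors.labelsByID[hostID] = labels
}

// replaySelectorCache runs the two inserts and returns the labels of ID 1 as
// seen by the ipcache and by the selector cache.
func replaySelectorCache(alwaysNotify bool) (ipcacheView, selectorView string) {
	c := newToyIPCache()
	c.upsert("10.0.0.1/32", "reserved:host", alwaysNotify)
	c.upsert("10.0.0.1/32", "reserved:host,reserved:kube-apiserver", alwaysNotify)
	return c.labelsByID[hostID], c.selectors.labelsByID[hostID]
}

func main() {
	for _, always := range []bool{false, true} {
		ipc, sel := replaySelectorCache(always)
		fmt.Printf("alwaysNotify=%v ipcache=%q selectorcache=%q stale=%v\n",
			always, ipc, sel, ipc != sel)
	}
}
```

With the short-circuit in place the selector cache still holds only `reserved:host` after the second insert; notifying unconditionally keeps the two views in agreement.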