Pkg/controller UpdateController may skip update #29015
Comments
Honestly, I really dislike the expectation that Anyways, this is totally a problem; kicking this over to our friendly local Foundations team.
(assigned to @joamaki, feel free to delegate)
The update basically functions as an update + trigger anyway, so we really should just separate those concerns. What's worrying is that the expectation people had when they were writing calls to UpdateController could be either a) there's no guarantee this will run (despite the comments 😄) or b) this will block on the first run (my initial assumption). It might be hard to change to making controllers immutable, since that would require rethinking a lot of uses (i.e. update returns an error if the controller already exists, and would it require draining before allowing an update?). In the future the intention is probably to use hive.Job where we can (although I'm not sure it's meant as an exact replacement for pkg/controller), but with how much we rely on pkg/controller we should probably find some reasonable way to fix this.
It seems that #25579 introduced part of the regression mentioned here. Previously, we retrieved the parameters after the update was detected, hence ensuring that we always pulled in the latest version (although there could still be race conditions if either the trigger or interval channels unblocked at the same time). Now, instead, we will miss newer updates given that the parameters are propagated through the update channel, which may be full after the first update. One fix could be to always retrieve the parameters before a new iteration of the controller loop. /cc @jrajahalme
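The suggestion above (re-read the parameters at the start of each loop iteration instead of passing them through the channel) could look roughly like the following. This is only a sketch of the idea, not the pkg/controller code: the `params` and `controller` types and the `update`/`current` methods are hypothetical names made up for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// params is a hypothetical stand-in for ControllerParams.
type params struct{ name string }

// controller is a hypothetical type: parameters live behind a mutex,
// and the channel carries only a "something changed" signal.
type controller struct {
	mu     sync.Mutex
	latest params
	signal chan struct{} // capacity 1, so pending signals coalesce
}

func newController() *controller {
	return &controller{signal: make(chan struct{}, 1)}
}

// update always stores the newest parameters, then signals without
// blocking. Dropping the signal is harmless here: a run is already
// pending and will read the newest parameters when it starts.
func (c *controller) update(p params) {
	c.mu.Lock()
	c.latest = p
	c.mu.Unlock()
	select {
	case c.signal <- struct{}{}:
	default:
	}
}

// current is what the controller loop would call before each iteration.
func (c *controller) current() params {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.latest
}

func main() {
	c := newController()
	c.update(params{"p1"})
	c.update(params{"p2"}) // signal is dropped, parameters are kept
	<-c.signal             // one coalesced wake-up
	fmt.Println(c.current().name) // prints "p2"
}
```

With this shape, a dropped signal no longer means dropped parameters, although it still does not guarantee one DoFunc run per update call.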
Previously, we would lazily start the label injection controller on first update. This is unnecessary and wasteful, since it does some synchronized bookkeeping long after the agent has started. This is also a bug, since relying on UpdateController just to trigger a controller may actually drop the update (see cilium#29015). So, use TriggerController() to, well, trigger the controller, and explicitly create the controller in a new ipcache.Start() method. Signed-off-by: Casey Callendrello <cdc@isovalent.com>
I think there are different issues to be addressed. The test above will always fail, since it is checking the The second issue is that the

func TestController(t *testing.T) {
	mngr := NewManager()
	var foo, bar int
	p1 := ControllerParams{
		Group:  NewGroup("foo"),
		DoFunc: func(ctx context.Context) error { foo++; return nil },
	}
	for i := 0; i < 5; i++ {
		mngr.UpdateController("test1", p1)
	}
	p2 := ControllerParams{
		Group:  NewGroup("bar"),
		DoFunc: func(ctx context.Context) error { bar++; return nil },
	}
	for i := 0; i < 5; i++ {
		mngr.UpdateController("test1", p2)
	}
	// since the signal from "stop" is prioritized over the one from "update",
	// give the controller some additional time to complete all the expected
	// DoFunc runs.
	time.Sleep(50 * time.Millisecond)
	mngr.RemoveAllAndWait()
	assert.Equal(t, 5, foo)
	assert.Equal(t, 5, bar)
}

This will almost certainly fail if, as @tommyp1ckles suggested, we don't make the
The endpoint runIdentityResolver should always aim to try resolving with the matching identity revision of the identity labels used. Otherwise the endpoint state can end up in a weird status due to cilium#29015. Signed-off-by: Ovidiu Tirla <otirla@google.com>
What happened?
Despite the comments on UpdateController, this function never blocks on the update channel, meaning that if the specified controller's channel is busy, it will skip the update/invocation of the controller's DoFunc.
Although the controller probably isn't intended to have queuing behaviour, the fact that updating parameters and triggering are handled by the same channel means that in some circumstances a controller will ignore parameter changes in UpdateController.
This is further complicated by the comments on UpdateController:
cilium/pkg/controller/manager.go
Lines 49 to 50 in 80d99a6
This can be illustrated by adding the following test case to pkg/controller/controller_test.go:
The likelihood of an update being skipped increases as the time spent in a controller's DoFunc increases, so controllers that handle things such as apiserver client requests may be especially prone to this.
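The skipping behaviour described above can be reproduced in isolation. The following is a minimal sketch and not the pkg/controller code: `params` and `tryUpdate` are hypothetical names, and the channel capacity of 1 is an assumption made for illustration.

```go
package main

import "fmt"

// params is a hypothetical stand-in for ControllerParams.
type params struct{ name string }

// tryUpdate mimics a send guarded by select/default: if the channel
// cannot accept the value right now, the update is silently dropped.
func tryUpdate(ch chan params, p params) bool {
	select {
	case ch <- p:
		return true
	default:
		return false // channel busy: the new parameters are lost
	}
}

func main() {
	// Assumed capacity of 1; nothing is receiving yet, as if the
	// controller were still busy inside a long-running DoFunc.
	update := make(chan params, 1)

	fmt.Println(tryUpdate(update, params{"p1"})) // prints "true"
	fmt.Println(tryUpdate(update, params{"p2"})) // prints "false"

	// The controller eventually wakes up and sees only the stale p1.
	fmt.Println((<-update).name) // prints "p1"
}
```

The longer the receiver stays busy, the wider the window in which a second update hits the default branch.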
At the moment I'm not aware of any specific bugs resulting from this behaviour.
This could be fixed simply by removing the default case from the select.
That would be risky, however, because it would broadly change behavior: UpdateController would block for up to as long as any DoFunc can run. It could also create deadlocks when a caller of UpdateController holds a mutex that is also used in the ControllerParams DoFunc.
Cilium Version
main branch, all current releases.
Kernel Version
n/a
Kubernetes Version
n/a