Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pkg/controller UpdateController may skip update #29015

Open
2 tasks done
tommyp1ckles opened this issue Nov 7, 2023 · 5 comments
Open
2 tasks done

Pkg/controller UpdateController may skip update #29015

tommyp1ckles opened this issue Nov 7, 2023 · 5 comments
Assignees
Labels
kind/bug This is a bug in the Cilium logic. needs/triage This issue requires triaging to establish severity and next steps. sig/agent Cilium agent related.

Comments

@tommyp1ckles
Copy link
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Despite the comments on UpdateController, this function is works in such a way that it never blocks on the update channel, meaning that if the specified controllers channel is busy it will skip the update/invocation of the controller DoFunc.

Although controller probably isn't intended to have queuing behaviour, the fact that updating parameters and triggering are handled by the same channel means that in some circumstances a controller will ignore parameter changes in UpdateController.

This is also complicated more by the comments on UpdateController:

// Updating a controller will cause the DoFunc to be run immediately regardless
// of any previous conditions. It will also cause any statistics to be reset.

This can be illustrated by adding the following test case to pkg/controller/controller_test.go:

func TestController(t *testing.T) {
	mngr := NewManager()
	g := NewGroup("foo")
	g2 := NewGroup("bar")
	noop := func(ctx context.Context) error { return nil }
	p1 := ControllerParams{
		Group:  g,
		DoFunc: noop,
	}
	p2 := ControllerParams{
		Group:  g2,
		DoFunc: noop,
	}
	for i := 0; i < 10; i++ {
		if i > 5 {
			mngr.UpdateController("test1", p2)
		} else {
			mngr.updateController("test1", p1)
		}
	}
	// Last five updates should change group to "bar".
	assert.Equal(t, "bar", mngr.controllers["test1"].group.Name) // This will almost certainly fail!!!
}

The likely-hood of a update being skipped increases as the time spent in DoFunc for a controller increases. So controllers handling things such as apiserver client requests etc may be especially prone to this.

At the moment I'm not aware of any specific bugs resulting from this behaviour.

Fixing this issue would be quite simple by simply removing the select default case.

This would be risky however, as there would be a broad change of behavior where UpdateController would now become blocking for up to as long as any DoFunc can run, this could also create potential deadlocks stemming from calls to UpdateController holding mutexes that are also used in the ControllerParams DoFunc function.

Cilium Version

main branch, all current releases.

Kernel Version

n/a

Kubernetes Version

n/a

Sysdump

No response

Relevant log output

No response

Anything else?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@tommyp1ckles tommyp1ckles added kind/bug This is a bug in the Cilium logic. needs/triage This issue requires triaging to establish severity and next steps. labels Nov 7, 2023
@squeed
Copy link
Contributor

squeed commented Nov 10, 2023

Honestly, I really dislike the expectation that UpdateController() is also used to trigger updates. I would personally rather see the parameters be immutable after creation, and only TriggerController() be allowed. Maybe changing timing parameters would be OK.

Anyways, this is totally a problem; kicking this over to our friendly local Foundations team.

@squeed squeed added the sig/agent Cilium agent related. label Nov 10, 2023
@squeed
Copy link
Contributor

squeed commented Nov 10, 2023

(assigned to @joamaki, feel free to delegate)

@tommyp1ckles
Copy link
Contributor Author

Honestly, I really dislike the expectation that UpdateController() is also used to trigger updates. I would personally rather see the parameters be immutable after creation, and only TriggerController() be allowed.

The update basically functions as a update + trigger anyway so we really should just separate those concerns. What's worrying is the expectation that people had when they where writing the UpdateController could be either a) there's no guarantee this will run (despite the comments 😄 ) or b) this will block on the first run (my initial assumption).

Might be hard to change to making controllers immutable since that would require rethinking a lot of uses (i.e. update returns an error if it already exists, and would require a draining before allowing update?).

In the future the intention is probably to use hive.Job where we can (although I'm not sure it's meant as a exact replacement for pkg/controller), but with how much we rely on pkg/controller we should probably find some reasonable way to fix this.

@giorio94
Copy link
Member

It seems that #25579 introduced part of the regression mentioned here. Previously, we retrieved the parameters after the update was detected, hence ensuring that we always pulled in the latest version (although there could still be race conditions if either the trigger or interval channels unblocked at the same time). Now, instead, we will miss newer updates given that the parameters are propagated through the update channel, which may be full after the first update. One fix could be to always retrieve the parameters before a new iteration of the controller loop.

/cc @jrajahalme

squeed added a commit to squeed/cilium that referenced this issue Dec 1, 2023
Previously, we would lazily start the label injection controller on
first update. This is unnecessary and wasteful, since it does some
synchronized bookkeeping long after the agent has started.

This is also a bug, since relying on UpdateController just to trigger a
controller may actually drop the update (see cilium#29015).

So, use TriggerController() to, well, trigger the controller, and
explicitly create the controller in a new ipcache.Start() method.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
@pippolo84
Copy link
Member

I think there are different issues to be addressed.

The test above will always fail, since it is checking the group.Name field in the embedded controller, not the params.Group.Name into the managedController.
So, the first issue is the duplication of the Group in both controller and managedController. The first one is copied in createControllerLocked and never updated when UpdateController is called. This leads to the usage of the old Group in recordSuccess and recordError after UpdateController has been called and potentially changed it.

The second issue is that the update channel is buffered and it is signaled in updateController with a select and a default case. This means that if an update is still "in flight", the next ones won't be seen at all by the controller.
To prove this, we can use a modified version of the above test:

func TestController(t *testing.T) {
	mngr := NewManager()

	var foo, bar int

	p1 := ControllerParams{
		Group:  NewGroup("foo"),
		DoFunc: func(ctx context.Context) error { foo++; return nil },
	}
	for i := 0; i < 5; i++ {
		mngr.UpdateController("test1", p1)
	}

	p2 := ControllerParams{
		Group:  NewGroup("bar"),
		DoFunc: func(ctx context.Context) error { bar++; return nil },
	}
	for i := 0; i < 5; i++ {
		mngr.UpdateController("test1", p2)
	}

	// since the signal from "stop" is prioritized over the one from "update",
	// give the controller some additional time to complete all the expected
	// DoFunc runs.
	time.Sleep(50 * time.Millisecond)

	mngr.RemoveAllAndWait()

	assert.Equal(t, 5, foo)
	assert.Equal(t, 5, bar)
}

This will almost certainly fail if, as @tommyp1ckles suggested, we don't make the update channel unbuffered and we remove the default case in the select.
But this is actually a change to the UpdateController semantic if we consider the implementation before #25579: in that case, we considered only the latest update, while making the channel unbuffered forces the consumer to always wait for the current operation to complete.
If we want to keep the ability to "overwrite" the updates that have not been consumed by the controller yet, we should revert to the logic previous to #25579, as @giorio94 suggested.

ovidiutirla added a commit to ovidiutirla/cilium that referenced this issue May 31, 2024
The endpoint runIdentityResolver should always aim to try resolving with
the matching identity revision of the identity labels used.

Otherwise the endpoint state can endup in a weird status due to cilium#29015.

Signed-off-by: Ovidiu Tirla <otirla@google.com>
ovidiutirla added a commit to ovidiutirla/cilium that referenced this issue May 31, 2024
The endpoint runIdentityResolver should always aim to try resolving with
the matching identity revision of the identity labels used.

Otherwise the endpoint state can endup in a weird status due to cilium#29015.

Signed-off-by: Ovidiu Tirla <otirla@google.com>
github-merge-queue bot pushed a commit that referenced this issue Jun 3, 2024
The endpoint runIdentityResolver should always aim to try resolving with
the matching identity revision of the identity labels used.

Otherwise the endpoint state can endup in a weird status due to #29015.

Signed-off-by: Ovidiu Tirla <otirla@google.com>
sayboras pushed a commit that referenced this issue Jun 4, 2024
The endpoint runIdentityResolver should always aim to try resolving with
the matching identity revision of the identity labels used.

Otherwise the endpoint state can endup in a weird status due to #29015.

Signed-off-by: Ovidiu Tirla <otirla@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug This is a bug in the Cilium logic. needs/triage This issue requires triaging to establish severity and next steps. sig/agent Cilium agent related.
Projects
None yet
Development

No branches or pull requests

5 participants