daemon, config: regenerate endpoint datapath on agent config change #13971
Conversation
Thanks for the PR. Overall, I think it looks good. A few items to address below.
force-pushed from 7725130 to db1968b
LGTM, just a few minor nits to address. Doesn't require another look from me. Can you rebase your PR on the latest master to pick up some CI changes? Currently, we have some Docker rate limiting issues, and a rebase would pick up the new mitigations so that we can run tests against this PR.
force-pushed from db1968b to 66fba81
test-me-please
2 similar comments
test-me-please
test-me-please
force-pushed from 66fba81 to a5f4601
test-me-please
Changes LGTM, but there are a few tests in the …
test-gke
force-pushed from a5f4601 to 366ea93
Thanks for the comments! Added a comment on reusing …
test-me-please
Runtime-4.9 failed tests, not sure if related to this PR: …
I don't think this is a flake in the CI as your branch was rebased recently and the …
Just one small nit
force-pushed from 3115ee4 to 86fbab3
Thanks for following up and investigating the concerns I brought up. I thought a bit longer about the solution here and I think we need to go further to provide strong guarantees and reduce the potential for long blocking mutexes in the agent. More detail below.
daemon/cmd/config.go (Outdated)

@@ -51,29 +51,30 @@ func (h *patchConfig) Handle(params PatchConfigParams) middleware.Responder {

	// Serialize configuration updates to the daemon.
	option.Config.ConfigPatchMutex.Lock()
	defer option.Config.ConfigPatchMutex.Unlock()
These locking changes make me nervous. It looks very easy to misunderstand the locking patterns and introduce a deadlock. I've looked for about half an hour now and I can't convince myself it's correct (it may be; but even if it is, it looks easy to break in the future).
I believe this is attempting to address the comments here:
#13971 (comment)
Furthermore, I think it is trying to address a case like:
1. API request comes in to configure some setting A
   i. Previous configuration is stored
   ii. Configuration is applied
   iii. Lock is released
   iv. Reinitialization is triggered
2. API request comes in to configure some setting B
   i. Configuration is applied concurrently with reinitialization
   ii. Lock is released
   iii. Reinitialization is triggered
3. Step (1.iv) fails, so we need to undo the attempt at reconfiguration: overwrite the configuration with the original state, which undoes the setting B modifications
4. Step (2.iii) succeeds, but (3) already reverted the changes
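To make the interleaving above concrete, here is a minimal, hypothetical Go sketch of the snapshot-and-revert pattern that exhibits the problem. The names (patchConfig, reinit, the plain map and mutex) are illustrative only and are not the actual Cilium code:

```go
package config

import "sync"

var mu sync.Mutex

// patchConfig applies the requested changes, releases the lock, and then runs
// a (potentially slow) reinitialization. If reinitialization fails, it
// restores the snapshot taken at the start, even though another request may
// have modified the configuration in the meantime.
func patchConfig(cfg, changes map[string]string, reinit func() error) error {
	mu.Lock()
	orig := make(map[string]string, len(cfg))
	for k, v := range cfg {
		orig[k] = v // snapshot the entire configuration (step 1.i)
	}
	for k, v := range changes {
		cfg[k] = v // apply setting A (step 1.ii)
	}
	mu.Unlock() // step 1.iii: a concurrent request for setting B can now run

	if err := reinit(); err != nil { // step 1.iv fails
		mu.Lock()
		for k, v := range orig {
			cfg[k] = v // step 3: restoring the snapshot also undoes setting B
		}
		mu.Unlock()
		return err
	}
	return nil
}
```

If reinit() for setting A is still running when setting B is applied, the revert at the end restores the pre-A snapshot and silently discards B.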
Clearly this is not the behaviour we want the API to have so I agree we should address it 👍
That said, I don't think it's a good idea to extend the locking over the entire Reinitialize() and try to enforce various places underneath to grab the right locks at the right time. This extends the locking over a long period, and also forces various locations to have more understanding of the locking because the locks aren't following a simple, obvious "lock, read/change, unlock" pattern. In fact, the Reinitialize() path is shelling out to an external script which runs other processes like a compiler and so on, so we have very few guarantees about how long that will take. It could take seconds or even minutes in a CPU-starved environment (not typical, but possible).
The general way we solve this problem in other cases, like policyAdd, is to have an eventqueue where the events are serialized, and the reconfiguration is handled from there. If the caller wants to wait for the results, there is a channel-based mechanism to listen for the success or failure of the changes. See putPolicy.Handle and its callee Daemon.PolicyAdd() for examples.

Why is this different from just grabbing the lock? It allows serialization of the API requests without forcing the underlying data structures to be locked for a long period. The locking of the underlying structures, if necessary, can be handled briefly from various goroutines without introducing arbitrarily long delays. I believe this is already the case today for data like IPv4NativeRoutingCIDR, which currently has quite granular locking (which this PR is currently proposing to extend to a long-held lock).
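As a rough, generic sketch of that pattern (using plain channels rather than Cilium's actual pkg/eventqueue types, whose API is not reproduced here), the serialization plus channel-based wait could look something like this:

```go
package main

import "fmt"

// configEvent is one queued configuration change: the work to perform and a
// channel on which the caller receives the outcome.
type configEvent struct {
	apply  func() error
	result chan error
}

// eventQueue serializes events through a single consumer goroutine, so no
// two configuration changes are ever processed concurrently.
type eventQueue struct {
	events chan configEvent
}

func newEventQueue() *eventQueue {
	q := &eventQueue{events: make(chan configEvent, 16)}
	go func() {
		for ev := range q.events {
			ev.result <- ev.apply() // strictly one event at a time
		}
	}()
	return q
}

// enqueue submits work and blocks until it has been processed, which is what
// an API handler would do before building its response.
func (q *eventQueue) enqueue(apply func() error) error {
	ev := configEvent{apply: apply, result: make(chan error, 1)}
	q.events <- ev
	return <-ev.result
}

func main() {
	q := newEventQueue()
	err := q.enqueue(func() error {
		fmt.Println("applying configuration change")
		return nil // a failure here is reported back to the caller
	})
	fmt.Println("result:", err)
}
```

Because a single goroutine drains the queue, two configuration requests can never interleave their apply/revert steps, and the caller still gets a synchronous success or failure answer.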
With this approach in place, the core logic of a new eventqueue's ConfigModifyEvent struct's Handle() function could store the changes, lock, modify, unlock, Reinitialize(), then choose to either Lock+revert+Unlock or return success depending on the result of the reinitialization.
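Purely as an illustration of that shape, and assuming the queue guarantees that no two events run concurrently, the handler's core could be sketched as below (options, optionsLock, and reinitialize are placeholder names, not Cilium's real identifiers):

```go
package config

import "sync"

var (
	optionsLock sync.Mutex
	options     = map[string]string{}
)

// configModifyEvent carries the requested option changes; it is assumed to be
// processed one at a time by the serialized event queue.
type configModifyEvent struct {
	changes map[string]string
}

func (e *configModifyEvent) handle(reinitialize func() error) error {
	optionsLock.Lock()
	orig := make(map[string]string, len(e.changes))
	for k := range e.changes {
		orig[k] = options[k] // remember only the values this event touches
	}
	for k, v := range e.changes {
		options[k] = v
	}
	optionsLock.Unlock() // the lock is held only for the brief map update

	// The expensive reinitialization (recompiling and reloading datapath
	// programs) runs without holding the lock.
	if err := reinitialize(); err != nil {
		optionsLock.Lock()
		for k, v := range orig {
			options[k] = v // revert on failure; no other event can interleave
		}
		optionsLock.Unlock()
		return err
	}
	return nil
}
```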
Originally I thought that maybe there's an alternative path: undoing the changes in the event of a failure without holding the lock the whole time. For simple cases it might be doable to check that the settings are still configured the way we expect before reverting them in the error handling, but overall the most robust option is the EventQueue approach.
Thanks for the detailed write-up! I don't like the locking changes in various places either. Sorry for not realizing the potentially long execution time of Reinitialize and the benefit of using an event queue. I will try the suggested approach.
force-pushed from c72059d to b9682b5
PolicyAuditMode can be changed at runtime by cilium CLI/API, so we should respect the mutable config option.

Fixes: cilium#13902

Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
Currently, if the patch agent config API fails to recompile the base programs, the API responds with an error, but agent options may already have been configured without taking effect. These changes may then take effect unexpectedly at the next regeneration. This patch reverts the agent configuration in that situation. To serialize configuration changes and reversions, requests are now handled in the newly introduced event queue configModifyQueue.

Related: cilium#13902

Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
Right now changing some agent options like PolicyAuditMode from the cilium CLI succeeds but doesn't affect the datapath, e.g.:

    cilium config PolicyAuditMode=true

This patch fixes this by applying agent config changes to the endpoint header file and triggering endpoint datapath regeneration.

Fixes: cilium#13902

Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
force-pushed from b9682b5 to 428afc2
test-me-please
Nice work!
The GKE CI job failed, but it's unclear why due to the age of the run. I'll rekick it.
retest-gke
I had originally nominated this for v1.9 backport assuming it was a fairly small change and given the proximity to the v1.9.0 release. At this point, I think the changes are structurally a bit more significant, and this goes a bit outside the usual bugfixes for the latest release. Unless I hear a strong argument for why this change is safe to backport to v1.9, I will opt to drop that backport proposal and instead we'll include it in the upcoming v1.10 release.
Right now changing some agent options like PolicyAuditMode from the cilium CLI succeeds but doesn't affect the datapath, e.g.:

    cilium config PolicyAuditMode=true

This patch fixes this by applying agent config changes to the endpoint header file and triggering endpoint datapath regeneration.
Fixes: #13902
Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>