daemon, config: regenerate endpoint datapath on agent config change #13971

Merged

Conversation

jaffcheng
Contributor

@jaffcheng jaffcheng commented Nov 10, 2020

Right now changing some agent options like PolicyAuditMode from
cilium CLI succeeds but doesn't affect datapath, e.g.:

cilium config PolicyAuditMode=true

This patch fixes this by applying agent config changes to endpoint
header file and triggering endpoint datapath regeneration.

Fixes: #13902
Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>

@jaffcheng jaffcheng requested review from a team November 10, 2020 12:52
@jaffcheng jaffcheng requested a review from a team as a code owner November 10, 2020 12:52
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 10, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.10.0 Nov 10, 2020
@joestringer joestringer added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Nov 10, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 10, 2020
Member

@christarazi christarazi left a comment

Thanks for the PR. Overall, I think it looks good. A few items to address below.

daemon/cmd/daemon.go Outdated Show resolved Hide resolved
daemon/cmd/daemon.go Outdated Show resolved Hide resolved
pkg/datapath/linux/config/config.go Outdated Show resolved Hide resolved
pkg/endpoint/endpoint.go Outdated Show resolved Hide resolved
pkg/endpointmanager/manager.go Outdated Show resolved Hide resolved
@jaffcheng jaffcheng force-pushed the regen-datapath-on-agent-config-change branch from 7725130 to db1968b Compare November 12, 2020 05:04
Member

@christarazi christarazi left a comment

LGTM, just a few minor nits to address. Doesn't require another look from me. Can you rebase your PR on the latest master to pick up some CI changes? We currently have some Docker rate-limiting issues, and a rebase would pick up the new mitigations so that we can run tests against this PR.

pkg/endpointmanager/manager.go Outdated Show resolved Hide resolved
daemon/cmd/daemon.go Outdated Show resolved Hide resolved
@jaffcheng jaffcheng force-pushed the regen-datapath-on-agent-config-change branch from db1968b to 66fba81 Compare November 13, 2020 06:48
@christarazi
Member

test-me-please

2 similar comments
@sayboras
Member

test-me-please

@jibi
Member

jibi commented Nov 16, 2020

test-me-please

daemon/cmd/daemon.go Show resolved Hide resolved
pkg/datapath/linux/config/config.go Show resolved Hide resolved
@jaffcheng jaffcheng force-pushed the regen-datapath-on-agent-config-change branch from 66fba81 to a5f4601 Compare November 17, 2020 07:51
@jibi
Member

jibi commented Nov 17, 2020

test-me-please

@jibi
Member

jibi commented Nov 17, 2020

Changes LGTM, but there are a few tests in the Runtime-4.9 suite failing

@jibi
Member

jibi commented Nov 17, 2020

test-gke

@joestringer joestringer added this to Needs backport from master in 1.9.1 Nov 17, 2020
@jaffcheng jaffcheng force-pushed the regen-datapath-on-agent-config-change branch from a5f4601 to 366ea93 Compare November 18, 2020 04:29
@jaffcheng
Contributor Author

Thanks for the comments! Added a comment on reusing policyTriggerMetrics and split the change of using Config.Opts.Opts[option.PolicyAuditMode] to a separate commit. Please take another look

@jibi
Member

jibi commented Nov 18, 2020

test-me-please

@jaffcheng
Contributor Author

test-me-please

Runtime-4.9 failed tests, not sure if related to this PR:
Suite-runtime.RuntimePolicies Test egress with L7 policy to outside cluster
Suite-runtime.RuntimePolicies Extended HTTP Methods tests
Suite-runtime.RuntimePolicies CIDR L3 Policy validates toCIDR

@jibi
Member

jibi commented Nov 18, 2020

Runtime-4.9 failed tests, not sure if related to this PR:

I don't think this is a flake in the CI as your branch was rebased recently and the Runtime-4.9 test suite is green on master

Member

@aanm aanm left a comment

Just one small nit

daemon/cmd/config.go Outdated Show resolved Hide resolved
@jaffcheng jaffcheng force-pushed the regen-datapath-on-agent-config-change branch from 3115ee4 to 86fbab3 Compare January 14, 2021 11:07
Member

@joestringer joestringer left a comment

Thanks for following up and investigating the concerns I brought up. I thought a bit longer about the solution here and I think we need to go further to provide strong guarantees and reduce the potential for long blocking mutexes in the agent. More detail below.

@@ -51,29 +51,30 @@ func (h *patchConfig) Handle(params PatchConfigParams) middleware.Responder {

// Serialize configuration updates to the daemon.
option.Config.ConfigPatchMutex.Lock()
defer option.Config.ConfigPatchMutex.Unlock()
Member

These locking changes make me nervous. It looks very easy to misunderstand the locking patterns and introduce a deadlock. I've looked for about half an hour now and I can't convince myself it's correct (it may be; but even if it is, it looks easy to break in future).

I believe this is attempting to address the comments here:
#13971 (comment)

Furthermore, I think it is trying to address a case like:

  1. API request comes in to configure some setting A
    i. Previous configuration is stored
    ii. Configuration is applied
    iii. Lock is released
    iv. Reinitialization is triggered
  2. API request comes in to configure some setting B
    i. Configuration is applied concurrently with reinitialization
    ii. Lock is released
    iii. Reinitialize is triggered
  3. The step (1.iv) fails, so we need to undo the attempt at reconfiguration
    i. Overwrite the configuration with the original state, which undoes setting B modifications
  4. The step (2.iii) succeeds, but (3) already reverted the changes

Clearly this is not the behaviour we want the API to have so I agree we should address it 👍
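
To make the interleaving above concrete, here is a minimal sketch that replays those steps sequentially in the problematic order. Everything in it (`Config`, `snapshot`, `replayInterleaving`, the option names `A` and `B`) is hypothetical and only illustrates the lost update; it is not the agent's actual code.

```go
package main

import "fmt"

// Config is a hypothetical stand-in for the agent's mutable options.
type Config map[string]bool

// snapshot copies the config so that a later revert can restore it.
func snapshot(c Config) Config {
	s := make(Config, len(c))
	for k, v := range c {
		s[k] = v
	}
	return s
}

// replayInterleaving replays the steps from the list above one after
// another, in the order that triggers the problem, and returns the
// final configuration.
func replayInterleaving() Config {
	cfg := Config{"A": false, "B": false}

	// Request 1: store the previous configuration, then apply setting A.
	prev := snapshot(cfg)
	cfg["A"] = true

	// Request 2: applied while request 1's reinitialization is still running.
	cfg["B"] = true

	// Request 1's reinitialization fails, so the whole snapshot is restored.
	// This also discards request 2's change to B.
	cfg = prev

	// Request 2's reinitialization succeeds, but B has already been reverted.
	return cfg
}

func main() {
	fmt.Println(replayInterleaving())
}
```

Request 2 is reported as successful, yet `B` ends up unset, which is exactly the API behaviour described above.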

That said, I don't think it's a good idea to extend the locking over the entire Reinitialize() and try to enforce various places underneath to grab the right locks at the right time. This extends the locking over a long period, and also forces various locations to have more understanding of the locking because the locks aren't following a simple obvious "lock, read/change, unlock" pattern. In fact, the Reinitialize() path is shelling out to an external script which runs other processes like a compiler and so on, so we have very few guarantees about how long that will take. It could take seconds or even minutes in a CPU-starved environment (not typical, but possible).

The general way we solve this problem in other cases like policyAdd is to have an eventqueue where the events are serialized, then the reconfiguration is handled from there. If the caller wants to wait for the results, there is a channel-based mechanism to listen for the success or failure of the changes. See putPolicy.Handle and its callee Daemon.PolicyAdd() for examples.
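
As a rough illustration of that pattern, the sketch below uses only the Go standard library. It is not Cilium's eventqueue package; `queue`, `event`, `enqueue`, and `runDemo` are hypothetical names. A single worker goroutine drains the events one at a time, and each caller gets a channel on which the result of its own event is reported.

```go
package main

import (
	"errors"
	"fmt"
)

// event pairs a unit of work with a channel on which its result is reported.
type event struct {
	work   func() error
	result chan error
}

// queue is a hypothetical, minimal serialized event queue.
type queue struct {
	events chan event
}

// newQueue starts the single worker goroutine that drains the queue.
func newQueue(buffer int) *queue {
	q := &queue{events: make(chan event, buffer)}
	go q.run()
	return q
}

// run handles events strictly one at a time, so handlers never overlap.
func (q *queue) run() {
	for ev := range q.events {
		ev.result <- ev.work()
	}
}

// enqueue submits work and returns a channel the caller may wait on.
// The result channel is buffered so the worker never blocks on a caller
// that does not wait for the outcome.
func (q *queue) enqueue(work func() error) <-chan error {
	res := make(chan error, 1)
	q.events <- event{work: work, result: res}
	return res
}

// runDemo enqueues two events and waits for both results.
func runDemo() (error, error, []string) {
	q := newQueue(8)
	defer close(q.events)

	var order []string
	r1 := q.enqueue(func() error {
		order = append(order, "A")
		return errors.New("reinitialization failed")
	})
	r2 := q.enqueue(func() error {
		order = append(order, "B")
		return nil
	})
	e1, e2 := <-r1, <-r2
	return e1, e2, order
}

func main() {
	e1, e2, order := runDemo()
	fmt.Println(e1, e2, order)
}
```

Because only the worker goroutine runs the handlers, request B cannot start until request A has finished, including any revert that A performs.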

Why is this different from just grabbing the lock? It allows serialization of the API requests without forcing the underlying data structures to be locked for a long period. The locking of the underlying structures, if necessary, can be handled briefly from various goroutines without introducing arbitrarily long delays. I believe this is already the case today for data like IPv4NativeRoutingCIDR, which currently has quite granular locking (which this PR is currently proposing to extend to a long-held lock).

With this approach in place, the core logic of the Handle() function of a new eventqueue event struct, ConfigModifyEvent, could store the changes, lock, modify, unlock, call Reinitialize(), then either Lock+revert+Unlock or return success depending on the result of the reinitialization.
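
A minimal sketch of that handler logic could look like the following, assuming the queue runs one event at a time. The types `agentConfig` and `configModifyEvent`, the `reinitialize` hook, and `errReinit` are hypothetical stand-ins, not the code that was eventually merged.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errReinit is a hypothetical error used to simulate a failed reinitialization.
var errReinit = errors.New("reinitialization failed")

// agentConfig is a hypothetical stand-in for the agent's mutable options.
type agentConfig struct {
	mu   sync.Mutex
	opts map[string]bool
}

// configModifyEvent carries the requested changes and a hook standing in
// for the long-running Reinitialize() call.
type configModifyEvent struct {
	cfg          *agentConfig
	changes      map[string]bool
	reinitialize func() error
}

// previous records an option's value before the event modified it.
type previous struct {
	value   bool
	existed bool
}

// handle stores the old values, applies the changes under a short lock,
// reinitializes without holding the lock, and reverts under another short
// lock if reinitialization fails. It assumes the event queue never runs
// two handlers at the same time.
func (e *configModifyEvent) handle() error {
	e.cfg.mu.Lock()
	old := make(map[string]previous, len(e.changes))
	for k, v := range e.changes {
		cur, ok := e.cfg.opts[k]
		old[k] = previous{value: cur, existed: ok}
		e.cfg.opts[k] = v
	}
	e.cfg.mu.Unlock()

	err := e.reinitialize()
	if err == nil {
		return nil
	}

	e.cfg.mu.Lock()
	for k, p := range old {
		if p.existed {
			e.cfg.opts[k] = p.value
		} else {
			delete(e.cfg.opts, k)
		}
	}
	e.cfg.mu.Unlock()
	return fmt.Errorf("configuration reverted: %w", err)
}

func main() {
	cfg := &agentConfig{opts: map[string]bool{"PolicyAuditMode": false}}
	ev := &configModifyEvent{
		cfg:          cfg,
		changes:      map[string]bool{"PolicyAuditMode": true},
		reinitialize: func() error { return errReinit },
	}
	err := ev.handle()
	fmt.Println(err, cfg.opts["PolicyAuditMode"])
}
```

The lock is only held while the map is read or written, so other goroutines that need the options are never blocked for the duration of the reinitialization.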

Originally I thought that maybe there's an alternative path: undoing the changes in the event of a failure without holding the lock the whole time. For simple cases it might be doable to check that the settings are still configured how we expect before reverting them in the error handling, but overall the most robust approach would be to follow the EventQueue approach.
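
For the simple case, such a check-before-revert could be sketched as below. The `options` type and `revertIfUnchanged` are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// options is a hypothetical stand-in for the agent's mutable options.
type options struct {
	mu   sync.Mutex
	vals map[string]bool
}

// revertIfUnchanged restores key to old only if it still holds the value
// that this request wrote. It reports whether the revert happened.
func (o *options) revertIfUnchanged(key string, written, old bool) bool {
	o.mu.Lock()
	defer o.mu.Unlock()
	if cur, ok := o.vals[key]; !ok || cur != written {
		// The option was removed or overwritten in the meantime; leave it alone.
		return false
	}
	o.vals[key] = old
	return true
}

func main() {
	o := &options{vals: map[string]bool{"PolicyAuditMode": true}}
	reverted := o.revertIfUnchanged("PolicyAuditMode", true, false)
	fmt.Println(reverted, o.vals["PolicyAuditMode"])
}
```

A value comparison like this cannot tell whether a concurrent request wrote the same value in the meantime, which is one reason it only works for simple cases.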

Contributor Author

Thanks for the detailed write-up! I don't like the locking changes in various places either. Sorry for not realizing how long Reinitialize could take and the benefit of using an event queue. I will try the suggested approach.

@aanm aanm added this to Needs backport from master in 1.9.3 Jan 20, 2021
@aanm aanm removed this from Needs backport from master in 1.9.2 Jan 20, 2021
@aanm aanm added this to Needs backport from master in 1.9.4 Jan 22, 2021
@aanm aanm removed this from Needs backport from master in 1.9.3 Jan 22, 2021
@jaffcheng jaffcheng force-pushed the regen-datapath-on-agent-config-change branch 2 times, most recently from c72059d to b9682b5 Compare January 27, 2021 13:07
@jaffcheng jaffcheng requested review from joestringer and removed request for a team January 27, 2021 13:17
daemon/cmd/config.go Outdated Show resolved Hide resolved
daemon/cmd/daemon.go Outdated Show resolved Hide resolved
PolicyAuditMode can be changed at runtime by cilium CLI/API, so we
should respect the mutable config option.

Fixes: cilium#13902
Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
Currently, if the patch agent config API fails to recompile the base programs,
the API responds with an error, but the agent options may already have been
changed without taking effect. These changes may then take effect unexpectedly
at the next regeneration.

This patch reverts the agent configuration in that situation.
To serialize the changes and their reversion, requests are now
handled in the newly introduced event queue configModifyQueue.

Related: cilium#13902
Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
Right now changing some agent options like `PolicyAuditMode` from
cilium CLI succeeds but doesn't affect datapath, e.g.:
  cilium config PolicyAuditMode=true
This patch fixes this by applying agent config changes to endpoint
header file and triggering endpoint datapath regeneration.

Fixes: cilium#13902
Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
@jaffcheng jaffcheng force-pushed the regen-datapath-on-agent-config-change branch from b9682b5 to 428afc2 Compare January 28, 2021 09:28
@aanm
Member

aanm commented Jan 30, 2021

test-me-please

@christarazi christarazi added this to Needs backport from master in 1.9.5 Feb 3, 2021
@christarazi christarazi removed this from Needs backport from master in 1.9.4 Feb 3, 2021
Member

@joestringer joestringer left a comment

Nice work!

@joestringer
Member

GKE CI job failed, but unclear why due to age. I'll rekick it.

@joestringer
Member

retest-gke

@joestringer
Member

I had originally nominated this for v1.9 backport assuming it is a fairly small change and given the proximity to the v1.9.0 release. At this point, I think the changes are structurally a bit more significant and this goes a bit outside the usual bugfixes for the latest release. Unless I hear a strong argument for why this change is safe to backport to v1.9, I will opt to drop that backport proposal and instead we'll include it in the upcoming v1.10 release.

@joestringer joestringer removed this from Needs backport from master in 1.9.5 Feb 10, 2021
@joestringer joestringer added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Feb 10, 2021
@borkmann borkmann merged commit f54f9d7 into cilium:master Feb 10, 2021
@jaffcheng jaffcheng deleted the regen-datapath-on-agent-config-change branch February 11, 2021 00:17
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Development

Successfully merging this pull request may close these issues.

change PolicyAuditMode config option through CLI doesn't affect datapath
10 participants