daemon, config: regenerate endpoint datapath on agent config change #13971
Conversation
Thanks for the PR. Overall, I think it looks good. A few items to address below.
force-pushed from 7725130 to db1968b
LGTM, just a few minor nits to address. Doesn't require another look from me. Can you rebase your PR on the latest master to pick up some CI changes? Currently, we have some Docker rate limiting issues, and a rebase would pick up the new mitigations so that we can run tests against this PR.
force-pushed from db1968b to 66fba81
test-me-please
2 similar comments
test-me-please
test-me-please
force-pushed from 66fba81 to a5f4601
test-me-please
Changes LGTM, but there are a few tests in the …
test-gke
force-pushed from a5f4601 to 366ea93
Thanks for the comments! Added a comment on reusing …
test-me-please
Runtime-4.9 failed tests, not sure if related to this PR: …
I don't think this is a flake in the CI as your branch was rebased recently and the …
Just one small nit
force-pushed from 3115ee4 to 86fbab3
Thanks for following up and investigating the concerns I brought up. I thought a bit longer about the solution here and I think we need to go further to provide strong guarantees and reduce the potential for long blocking mutexes in the agent. More detail below.
daemon/cmd/config.go (Outdated)

@@ -51,29 +51,30 @@ func (h *patchConfig) Handle(params PatchConfigParams) middleware.Responder {

	// Serialize configuration updates to the daemon.
	option.Config.ConfigPatchMutex.Lock()
	defer option.Config.ConfigPatchMutex.Unlock()
These locking changes make me nervous. It looks very easy to misunderstand the locking patterns and introduce a deadlock. I've looked for about half an hour now and I can't convince myself it's correct (it may be; but even if it is, it looks easy to break in the future).
I believe this is attempting to address the comments here:
#13971 (comment)
Furthermore, I think it is trying to address a case like:
1. API request comes in to configure some setting A
   i. Previous configuration is stored
   ii. Configuration is applied
   iii. Lock is released
   iv. Reinitialization is triggered
2. API request comes in to configure some setting B
   i. Configuration is applied concurrently with reinitialization
   ii. Lock is released
   iii. Reinitialization is triggered
3. Step (1.iv) fails, so we need to undo the attempt at reconfiguration: overwrite the configuration with the original state, which undoes the setting B modifications
4. Step (2.iii) succeeds, but (3) already reverted the changes
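To make the interleaving above concrete, here is a minimal, hypothetical Go sketch of the snapshot-and-revert pattern that exhibits the problem. The names (patchConfig, reinit, the plain map and mutex) are illustrative only and are not the actual Cilium code:

```go
package config

import "sync"

var mu sync.Mutex

// patchConfig applies the requested changes, releases the lock, and then runs
// a (potentially slow) reinitialization. If reinitialization fails, it
// restores the snapshot taken at the start, even though another request may
// have modified the configuration in the meantime.
func patchConfig(cfg, changes map[string]string, reinit func() error) error {
	mu.Lock()
	orig := make(map[string]string, len(cfg))
	for k, v := range cfg {
		orig[k] = v // snapshot the entire configuration (step 1.i)
	}
	for k, v := range changes {
		cfg[k] = v // apply setting A (step 1.ii)
	}
	mu.Unlock() // step 1.iii: a concurrent request for setting B can now run

	if err := reinit(); err != nil { // step 1.iv fails
		mu.Lock()
		for k, v := range orig {
			cfg[k] = v // step 3: restoring the snapshot also undoes setting B
		}
		mu.Unlock()
		return err
	}
	return nil
}
```

If reinit() for setting A is still running when setting B is applied, the revert at the end restores the pre-A snapshot and silently discards B.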
Clearly this is not the behaviour we want the API to have so I agree we should address it 👍
That said, I don't think it's a good idea to extend the locking over the entire Reinitialize() and try to enforce various places underneath to grab the right locks at the right time. This extends the locking over a long period, and also forces various locations to have more understanding of the locking because the locks aren't following a simple, obvious "lock, read/change, unlock" pattern. In fact, the Reinitialize() path is shelling out to an external script which runs other processes like a compiler and so on, so we have very few guarantees about how long that will take. It could take seconds or even minutes in a CPU-starved environment (not typical, but possible).
The general way we solve this problem in other cases, like policyAdd, is to have an eventqueue where the events are serialized, and the reconfiguration is handled from there. If the caller wants to wait for the results, there is a channel-based mechanism to listen for the success or failure of the changes. See putPolicy.Handle and its callee Daemon.PolicyAdd() for examples.

Why is this different from just grabbing the lock? It allows serialization of the API requests without forcing the underlying data structures to be locked for a long period. The locking of the underlying structures, if necessary, can be handled briefly from various goroutines without introducing arbitrarily long delays. I believe this is already the case today for data like IPv4NativeRoutingCIDR, which currently has quite granular locking (which this PR is currently proposing to extend to a long-held lock).
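As a rough, generic sketch of that pattern (using plain channels rather than Cilium's actual pkg/eventqueue types, whose API is not reproduced here), the serialization plus channel-based wait could look something like this:

```go
package main

import "fmt"

// configEvent is one queued configuration change: the work to perform and a
// channel on which the caller receives the outcome.
type configEvent struct {
	apply  func() error
	result chan error
}

// eventQueue serializes events through a single consumer goroutine, so no
// two configuration changes are ever processed concurrently.
type eventQueue struct {
	events chan configEvent
}

func newEventQueue() *eventQueue {
	q := &eventQueue{events: make(chan configEvent, 16)}
	go func() {
		for ev := range q.events {
			ev.result <- ev.apply() // strictly one event at a time
		}
	}()
	return q
}

// enqueue submits work and blocks until it has been processed, which is what
// an API handler would do before building its response.
func (q *eventQueue) enqueue(apply func() error) error {
	ev := configEvent{apply: apply, result: make(chan error, 1)}
	q.events <- ev
	return <-ev.result
}

func main() {
	q := newEventQueue()
	err := q.enqueue(func() error {
		fmt.Println("applying configuration change")
		return nil // a failure here is reported back to the caller
	})
	fmt.Println("result:", err)
}
```

Because a single goroutine drains the queue, two configuration requests can never interleave their apply/revert steps, and the caller still gets a synchronous success or failure answer.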
With this approach in place, the core logic of a new eventqueue's ConfigModifyEvent struct's Handle() function could store the changes, lock, modify, unlock, Reinitialize(), then choose to either Lock+revert+Unlock or return success depending on the result of the reinitialization.
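Purely as an illustration of that shape, and assuming the queue guarantees that no two events run concurrently, the handler's core could be sketched as below (options, optionsLock, and reinitialize are placeholder names, not Cilium's real identifiers):

```go
package config

import "sync"

var (
	optionsLock sync.Mutex
	options     = map[string]string{}
)

// configModifyEvent carries the requested option changes; it is assumed to be
// processed one at a time by the serialized event queue.
type configModifyEvent struct {
	changes map[string]string
}

func (e *configModifyEvent) handle(reinitialize func() error) error {
	optionsLock.Lock()
	orig := make(map[string]string, len(e.changes))
	for k := range e.changes {
		orig[k] = options[k] // remember only the values this event touches
	}
	for k, v := range e.changes {
		options[k] = v
	}
	optionsLock.Unlock() // the lock is held only for the brief map update

	// The expensive reinitialization (recompiling and reloading datapath
	// programs) runs without holding the lock.
	if err := reinitialize(); err != nil {
		optionsLock.Lock()
		for k, v := range orig {
			options[k] = v // revert on failure; no other event can interleave
		}
		optionsLock.Unlock()
		return err
	}
	return nil
}
```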
Originally I thought that maybe there's an alternative path: undoing the changes in the event of a failure without holding the lock the whole time. For simple cases it might be doable to check that the settings are still configured the way we expect before reverting them in the error handling, but overall the most robust option is the EventQueue approach.
Thanks for the detailed write-up! I don't like the locking changes in various places either. Sorry for not realizing the potentially long execution time of Reinitialize and the benefit of using an event queue. I will try the suggested approach.
force-pushed from c72059d to b9682b5
PolicyAuditMode can be changed at runtime by cilium CLI/API, so we should respect the mutable config option.

Fixes: cilium#13902

Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
Currently, if the patch agent config API fails to recompile the base programs, the API responds with an error, but agent options may already have been configured without taking effect. These changes may then take effect unexpectedly at the next regeneration. This patch reverts the agent configuration in that situation. To serialize configuration changes and reversions, requests are now handled in the newly introduced event queue configModifyQueue.

Related: cilium#13902

Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
Right now changing some agent options like PolicyAuditMode from the cilium CLI succeeds but doesn't affect the datapath, e.g.:

    cilium config PolicyAuditMode=true

This patch fixes this by applying agent config changes to the endpoint header file and triggering endpoint datapath regeneration.

Fixes: cilium#13902

Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
force-pushed from b9682b5 to 428afc2
test-me-please
Nice work!
The GKE CI job failed, but it's unclear why due to the age of the run. I'll rekick it.
retest-gke
I had originally nominated this for v1.9 backport assuming it was a fairly small change and given the proximity to the v1.9.0 release. At this point, I think the changes are structurally a bit more significant, and this goes a bit outside the usual bugfixes for the latest release. Unless I hear a strong argument for why this change is safe to backport to v1.9, I will opt to drop that backport proposal and instead we'll include it in the upcoming v1.10 release.
Right now changing some agent options like PolicyAuditMode from the cilium CLI succeeds but doesn't affect the datapath, e.g.:

    cilium config PolicyAuditMode=true

This patch fixes this by applying agent config changes to the endpoint header file and triggering endpoint datapath regeneration.
Fixes: #13902
Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>