Operator: Improve removing stale CEPs from CESs entries on start. #24596
Conversation
Force-pushed from a175f56 to 28ae7d4.
ces.removedCEPs[GetCEPNameFromCCEP(&ep, ces.ces.Namespace)] = struct{}{}
ces.ces.Endpoints =
	append(ces.ces.Endpoints[:i],
		ces.ces.Endpoints[i+1:]...)
do we expect more than a few endpoints to pass this check and be removed? if yes, this is a good candidate for optimization
This section is moved from the RemoveCEPFromCache function.
Before this PR it would always remove exactly one endpoint.
With this PR it is possible (and tested) that it can remove more than one in case something has gone really wrong, though I don't think that can happen now except through bugs.
So the point of this PR is to make sure the state is OK after an operator restart, regardless of anything else.
In practice I don't think the current code can produce a duplicated CEP in the same CES.
However, it is possible that different CEPs would be removed from the same CES.
Performance-wise, it should be insignificant, unless the duplicate list is big, but then we have another big problem.
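If this ever does need optimizing, a minimal sketch of the single-pass, filter-in-place alternative could look like the snippet below; shouldRemove is an illustrative placeholder for whatever staleness check drives the removals here, not a function from this PR.

// Compact ces.ces.Endpoints in one pass instead of splicing the slice
// with append for each removed endpoint, recording every removed CEP.
kept := ces.ces.Endpoints[:0]
for _, ep := range ces.ces.Endpoints {
	if shouldRemove(&ep) { // placeholder for the existing staleness check
		ces.removedCEPs[GetCEPNameFromCCEP(&ep, ces.ces.Namespace)] = struct{}{}
		continue
	}
	kept = append(kept, ep)
}
ces.ces.Endpoints = kept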
func (c *cesMgr) getAllCESs() []cesOperations {
	var cess []cesOperations
	for _, ces := range c.desiredCESs.getAllCESs() {
Please consider preallocating cess or using copy (I am not sure of the type).
Preallocated the slice
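For reference, the preallocation amounts to something like the following; the element type returned by desiredCESs.getAllCESs() and the exact loop body are not visible in the diff, so the append here is schematic.

func (c *cesMgr) getAllCESs() []cesOperations {
	all := c.desiredCESs.getAllCESs()
	// Size the slice up front so the loop never has to reallocate.
	cess := make([]cesOperations, 0, len(all))
	for _, ces := range all {
		cess = append(cess, ces) // schematic: the real code may wrap ces first
	}
	return cess
}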
Force-pushed from 28ae7d4 to c5b8f47.
found := false
cep := storeCep.(*cilium_api_v2.CiliumEndpoint)
// Skip first element for now
for _, mapping := range mappings[1:] {
Is it maybe simpler if it goes through the mappings twice? First to find the CEP with the matching ID, and then to remove all but the selected one (the first CEP, or the one with the matching ID).
There may be multiple CEPs with a matching ID, or no CEPs with a matching ID.
The code would be something like:
matchingIdIndex := 0
for loop {
	if matches id {
		matchingIdIndex = currentIndex
	}
}
for loop {
	if currentIndex == matchingIdIndex {
		keep
	} else {
		remove
	}
}
I don't think it would be much simpler.
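Spelled out a bit more concretely, the two-pass variant under discussion would look roughly like this in Go; matchesIdentity and removeDuplicate are illustrative placeholders, not functions from this PR.

// Pass 1: find the index of the mapping whose identity matches the live CEP;
// default to 0 so the first entry is kept when nothing matches.
keep := 0
for i, mapping := range mappings {
	if matchesIdentity(mapping, cep) { // placeholder identity check
		keep = i
	}
}
// Pass 2: drop every duplicate except the one selected above.
for i, mapping := range mappings {
	if i != keep {
		removeDuplicate(mapping) // placeholder removal
	}
}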
Force-pushed from 12c955e to 378989d.
Force-pushed from de3d1ae to 29cc185.
Force-pushed from 29cc185 to 2edde46.
What are the next steps?
@alan-kut I wrote a brief release note, please take a look. I'll trigger the last CI jobs now, and if those go green, it will be picked up for merge.
/test
Job 'Cilium-PR-K8s-1.16-kernel-4.19' failed (Test Name and Failure Output collapsed).
If it is a flake and a GitHub issue doesn't already exist to track it, comment
Looks good. I didn't know how to spot release notes from other PRs; I'll try to include them in the future.
Yeah, it's a judgement call whether or not to write a release note. I default towards yes; it's always easier to filter them at release time. If a PR doesn't need a release note, set the label
I tried to analyze the failures but failed.
The GKE failure seems like the annoying case where it gets stuck on the metrics finalizer. Jenkins failure #24697
Those are known issues, marking as ready-to-merge.
Thanks for the PR.
From the title / commit, I did not get the impression that this was resolving a bug with CES, but rather with CEPs in general. For the latter, we have logic inside the Agent itself to clean up CEPs, but not in the context of CESs. Could you clarify that in the commit please?
You are right, I didn't notice it. Fixed.
Thanks! There's a small typo "instnaces" in the commit, but otherwise LGTM
Fixed the typo.
Handle situation when the CEP is duplicated in CESs. Keep only one CEP if the CEP still exists or delete all the instances if it doesn't exist anymore. Prefer keeping CEP with the correct identity if such exists. Signed-off-by: Alan Kutniewski <kutniewski@google.com>
Force-pushed from 0cb2473 to ac0ce66.
/test
Job 'Cilium-PR-K8s-1.16-kernel-4.19' failed (Test Name and Failure Output collapsed).
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-kernel-4.19/865/
If it is a flake and a GitHub issue doesn't already exist to track it, comment
/mlh new-flake Cilium-PR-K8s-1.16-kernel-4.19
I think it is a flake and I didn't find it reported in the issues.
Agreed, failure looks like a flake; marking as ready-to-merge.
/mlh new-flake Cilium-PR-K8s-1.16-kernel-4.19
Handle the situation when the CEP is duplicated in CESs.
Keep only one CEP if the CEP still exists, or
delete all the instances if it doesn't exist anymore.
Prefer keeping the CEP with the correct identity if such exists.
It's related to #24581 but doesn't fully fix it;
it only fixes the reconciliation at startup.
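As a self-contained sketch of the rule described above (all names are illustrative, not the PR's actual types or functions): keep exactly one entry when the CEP still exists, preferring the one with the matching identity, and drop every entry when it does not.

type cesEntry struct {
	name     string
	identity int64
}

// reconcileDuplicates returns the entries that should remain in the CES for a
// single CEP name. liveIdentity is nil when the CEP no longer exists.
func reconcileDuplicates(entries []cesEntry, liveIdentity *int64) []cesEntry {
	if liveIdentity == nil || len(entries) == 0 {
		return nil // CEP is gone (or nothing to keep): remove all instances
	}
	keep := 0
	for i := range entries {
		if entries[i].identity == *liveIdentity {
			keep = i // prefer the entry with the correct identity
		}
	}
	return entries[keep : keep+1]
}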