Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Operator: Improve removing stale CEPs from CESs entries on start. #24596

Merged
merged 1 commit into from Apr 11, 2023

Conversation

alan-kut
Copy link
Contributor

@alan-kut alan-kut commented Mar 28, 2023

Handle situation when the CEP is duplicated in CESs.
Keep only one CEP if the CEP still exists or
delete all the instnaces if it doesn't exist anymore.
Prefer keeping CEP with the correct identity if such exists.

It's related to #24581 but it doesn't fully fix it.
It only fixes the reconciliation at the startup.

The operator now reconciles duplicate entries in a CiliumEndpointSlice on startup.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Mar 28, 2023
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Mar 28, 2023
operator/pkg/ciliumendpointslice/endpointslice.go Outdated Show resolved Hide resolved
operator/pkg/ciliumendpointslice/endpointslice.go Outdated Show resolved Hide resolved
operator/pkg/ciliumendpointslice/endpointslice_test.go Outdated Show resolved Hide resolved
operator/pkg/ciliumendpointslice/endpointslice_test.go Outdated Show resolved Hide resolved
ces.removedCEPs[GetCEPNameFromCCEP(&ep, ces.ces.Namespace)] = struct{}{}
ces.ces.Endpoints =
append(ces.ces.Endpoints[:i],
ces.ces.Endpoints[i+1:]...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we expect more than a few endpoints to pass this check and be removed? if yes, this is a good candidate for optimization

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section is moved from a RemoveCEPFromCache function.

Before this PR it would always remove 1 endpoint.
With this PR it is possible (and tested) that it can remove more than one in case something really wrong happened but I don't think it is possible now except for some bugs.

So - the point of this PR is to make sure the state is OK after operator restart regardless of anything.
In practice I don't think in the current code it is possible to have duplicated CEP in the same CES.
However it is possible that different CEPs would be removed from the same CES.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance-wise, it should be insignificant, unless the duplicate list is big, but then we have another big problem.


func (c *cesMgr) getAllCESs() []cesOperations {
var cess []cesOperations
for _, ces := range c.desiredCESs.getAllCESs() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider preallocating cess or using copy (I am not sure of the type)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preallocated the slice

found := false
cep := storeCep.(*cilium_api_v2.CiliumEndpoint)
// Skip first element for now
for _, mapping := range mappings[1:] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it maybe simpler if it goes through the mappings twice? First to find the CEP with the matching ID, and then to remove all but selected (the first CEP or with matching ID).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be multiple CEPs with matching ID or no CEPs with matching ID.
The code would be something like:
matchingIdIndex := 0
for loop {
if matches id {
matchingIdIndex = currentIndex
}
for loop {
if currentIndex == matchingIdIndex {
keep
} else {
remove
}
}

I don't think it would be much simpler

ces.removedCEPs[GetCEPNameFromCCEP(&ep, ces.ces.Namespace)] = struct{}{}
ces.ces.Endpoints =
append(ces.ces.Endpoints[:i],
ces.ces.Endpoints[i+1:]...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance-wise, it should be insignificant, unless the duplicate list is big, but then we have another big problem.

@alan-kut alan-kut force-pushed the fix-stale-ceps branch 2 times, most recently from 12c955e to 378989d Compare March 28, 2023 10:37
@alan-kut alan-kut requested review from dlapcevic and removed request for AwesomePatrol March 28, 2023 10:37
@alan-kut alan-kut marked this pull request as ready for review March 28, 2023 10:37
@alan-kut alan-kut requested a review from a team as a code owner March 28, 2023 10:37
@alan-kut alan-kut requested review from squeed and removed request for dlapcevic March 28, 2023 10:37
@alan-kut alan-kut force-pushed the fix-stale-ceps branch 2 times, most recently from de3d1ae to 29cc185 Compare March 29, 2023 08:57
@alan-kut
Copy link
Contributor Author

What are the next steps?

@squeed squeed added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Mar 31, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Mar 31, 2023
@squeed
Copy link
Contributor

squeed commented Mar 31, 2023

@alan-kut I wrote a brief release note, please take a look. I'll trigger the last CI jobs now, and if those go green, it will be picked up for merge.

@squeed
Copy link
Contributor

squeed commented Mar 31, 2023

/test

Job 'Cilium-PR-K8s-1.16-kernel-4.19' failed:

Click to show.

Test Name

K8sDatapathConfig Host firewall With VXLAN

Failure Output

FAIL: Error deleting resource /home/jenkins/workspace/Cilium-PR-K8s-1.16-kernel-4.19/src/github.com/cilium/cilium/test/k8s/manifests/host-policies.yaml: Cannot retrieve "cilium-l7rgf"'s policy revision: cannot get policy revision: ""

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.16-kernel-4.19 so I can create one.

@alan-kut
Copy link
Contributor Author

Looks good.

I didn't know how to spot realease notes from other PRs, I'll try to include them for the future.

@squeed
Copy link
Contributor

squeed commented Mar 31, 2023

Yeah, it's a judgement call whether or not to write a release note. I default towards yes; it's always easier to filter them at release time. If a PR doesn't need a release note, set the label release-note/misc.

@alan-kut
Copy link
Contributor Author

alan-kut commented Apr 5, 2023

I tried to analyze the failures but failed.

@squeed
Copy link
Contributor

squeed commented Apr 5, 2023

The GKE failure seems like the annoying case where it gets stuck on the metrics finalizer. Jenkins failure #24697

@squeed
Copy link
Contributor

squeed commented Apr 5, 2023

Those are known issues, marking as ready-to-merge.

@squeed squeed added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 5, 2023
@christarazi christarazi added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. area/operator Impacts the cilium-operator component and removed release-note/minor This PR changes functionality that users may find relevant to operating Cilium. labels Apr 5, 2023
Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR.

From the title / commit, I did not get the impression that this was resolving a bug with CES, but rather with CEPs in general. For the latter, we have logic inside the Agent itself to cleanup CEPs, but not in the context of CESs. Could you clarify that in the commit please?

@christarazi christarazi removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 5, 2023
@alan-kut alan-kut changed the title Operator: Improve removing stale CEP entries on start. Operator: Improve removing stale CEPs from CESs entries on start. Apr 6, 2023
@alan-kut
Copy link
Contributor Author

alan-kut commented Apr 6, 2023

You are right, I didn't notice it. Fixed.

Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! There's a small typo "instnaces" in the commit, but otherwise LGTM

@alan-kut
Copy link
Contributor Author

Fixed the typo

Handle situation when the CEP is duplicated in CESs.
Keep only one CEP if the CEP still exists or
delete all the instances if it doesn't exist anymore.
Prefer keeping CEP with the correct identity if such exists.

Signed-off-by: Alan Kutniewski <kutniewski@google.com>
@christarazi
Copy link
Member

christarazi commented Apr 11, 2023

/test

Job 'Cilium-PR-K8s-1.16-kernel-4.19' failed:

Click to show.

Test Name

K8sDatapathConfig MonitorAggregation Checks that monitor aggregation restricts notifications

Failure Output


Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-kernel-4.19/865/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.16-kernel-4.19 so I can create one.

@alan-kut
Copy link
Contributor Author

/mlh new-flake Cilium-PR-K8s-1.16-kernel-4.19

I think it is flake and I didn't find it reported on the issues.

@squeed
Copy link
Contributor

squeed commented Apr 11, 2023

Agreed, failure looks like a flake; marking as ready-to-merge.

@squeed squeed added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 11, 2023
@dylandreimerink
Copy link
Member

/mlh new-flake Cilium-PR-K8s-1.16-kernel-4.19

@dylandreimerink
Copy link
Member

The ci-eks hits #24774 (already fixed in master). ci-datapath should be fixed in master by (#24809).

@dylandreimerink dylandreimerink merged commit 80c6ebd into cilium:master Apr 11, 2023
54 of 57 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/operator Impacts the cilium-operator component kind/bug This is a bug in the Cilium logic. kind/community-contribution This was a contribution made by a community member. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants