operator: Handle conflicts in CES update. #26455

Merged
merged 1 commit into from
Jul 19, 2023

alan-kut
Contributor

@alan-kut alan-kut commented Jun 23, 2023

If the operator holds an outdated version of a CES, it cannot update that CES and never recovers from this state.

This can happen not only when some other client updates the CES, but also when the operator's update succeeds and the api-server still doesn't return OK (it can fail after updating etcd).

Signed-off-by: Alan Kutniewski kutniewski@google.com

Fix the operator entering a broken state when it has an outdated version of a CES in its cache.

@alan-kut alan-kut requested review from a team as code owners June 23, 2023 09:57
@alan-kut alan-kut requested review from sayboras and squeed June 23, 2023 09:57
@maintainer-s-little-helper

Commit 9c7119712346a4d25c1ebebab955ef8b230ca140 does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Jun 23, 2023
@alan-kut
Contributor Author

/assign dlapcevic

@alan-kut
Contributor Author

\assign dlapcevic

@sayboras sayboras added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Jun 23, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 23, 2023
@sayboras
Member

Can you also help to correct the sign-off statement in the commit message?

Signed-off-by: Alan Kutniewski <kutniewski@google.com>

@dlapcevic
Contributor

/assign @dlapcevic

@sayboras sayboras requested a review from dlapcevic June 23, 2023 10:04
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Jun 23, 2023
Contributor

@dlapcevic dlapcevic left a comment

/lgtm

if !exists {
    return err
}
ces.ObjectMeta = obj.(*cilium_v2.CiliumEndpointSlice).ObjectMeta
Contributor

This looks a little questionable. Might want to see if you can update the obj.Spec/obj.Status instead, otherwise you might overwrite values added by another update.

Contributor

I think it's fine because the operator doesn't want to modify CES ObjectMeta.
The operator creates CESs and then only modifies the Endpoints field.

The only remaining concern is that updates will still fail if another source (other than operator) modifies the Endpoints field, but the aim of this PR is to fix the issue of operator getting stuck on its own, just as a part of the k8s system.

Contributor

I'm not worried about the ObjectMeta data, I'm worried that it's overwriting the entire rest of the object without concern for its previous values.

Contributor Author

I don't understand it. I intended to overwrite ObjectMeta only and I thought this is what I did.
This is the same as updateCESInCache code called later:

func (c *cesMgr) updateCESInCache(srcCES *cilium_v2.CiliumEndpointSlice, isDeepCopy bool) {
    if ces, ok := c.desiredCESs.getCESTracker(srcCES.GetName()); ok {
        ces.backendMutex.Lock()
        defer ces.backendMutex.Unlock()
        if !isDeepCopy {
            ces.ces.ObjectMeta = srcCES.ObjectMeta
        } else {
            ces.ces = srcCES
            for _, cep := range ces.ces.Endpoints {
                // Update the desiredCESs, to reflect all CEPs are packed in a CES
                c.desiredCESs.insertCEP(GetCEPNameFromCCEP(&cep, ces.ces.Namespace), srcCES.GetName())
            }
        }
    } else {
        log.WithFields(logrus.Fields{
            logfields.CESName: srcCES.GetName(),
        }).Debug("Attempted to updateCESInCache non-existent, skipping.")
    }
}

Contributor Author

Ah, you are worried that in the apiserver we ignore all other possible changes.

What is the case in which we shouldn't do this?
Operator should be the only controller for CES.

If some client changes k8s.Endpoints etc., the controllers owning them will revert such changes. The same would happen here (to some extent) with the CiliumEndpointSlice.

Long term I think the operator should always reconcile the full state of CiliumEndpointSlice and make sure they are exactly as operator expects them to be.
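
The two positions in this thread can be made concrete with a toy example. It is a sketch, not the operator's code: `meta`, `endpointSlice` and `refreshMeta` are hypothetical stand-ins, and the label and endpoint values are made up. Copying only the metadata from the server copy keeps the operator's desired endpoint list, and the update that follows would replace whatever endpoint list another client had written.

```go
package main

import "fmt"

// meta is a hypothetical stand-in for the ObjectMeta of a CES.
type meta struct {
	Name            string
	ResourceVersion string
	Labels          map[string]string
}

// endpointSlice is a hypothetical stand-in for a CiliumEndpointSlice.
type endpointSlice struct {
	Meta      meta
	Endpoints []string
}

// refreshMeta copies only the metadata from the server copy into the desired
// object. The assignment is shallow, so the Labels map is shared.
func refreshMeta(desired *endpointSlice, fromServer endpointSlice) {
	desired.Meta = fromServer.Meta
}

func main() {
	// What the operator wants to write, built against an old resourceVersion.
	desired := endpointSlice{
		Meta:      meta{Name: "ces-1", ResourceVersion: "5"},
		Endpoints: []string{"cep-a", "cep-b"},
	}
	// What the server holds after another writer touched the object.
	onServer := endpointSlice{
		Meta: meta{
			Name:            "ces-1",
			ResourceVersion: "7",
			Labels:          map[string]string{"added-by": "other-client"},
		},
		Endpoints: []string{"cep-a", "cep-x"},
	}

	refreshMeta(&desired, onServer)

	// Metadata now matches the server, so the next update is not rejected.
	fmt.Println("resourceVersion:", desired.Meta.ResourceVersion)
	fmt.Println("labels:", desired.Meta.Labels)
	// The endpoint list is still the operator's, so the update would drop "cep-x".
	fmt.Println("endpoints to write:", desired.Endpoints)
}
```

Whether dropping the foreign endpoint entry is acceptable depends on the assumption stated in the comment above, namely that the operator is the only controller for CES objects.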

Contributor Author

I don't think it would work better.

It would eventually fix the problem with conflicts, but it would introduce more problems.
It would always use metadata from the watcher cache so the metadata would be outdated after update before watch gets the event.

This is happening right now with creates. We determine create vs update based on the watch cache, we create if CES is not present in the watch cache. We have observed this: we create the CES, the operator wants to update it before the watcher observes the event, so it tries to create again and fails.

Contributor Author

I could move my change to the handleErr - I think it makes more sense, will update the PR soon

Contributor

It would always use metadata from the watcher cache so the metadata would be outdated after update before watch gets the event.

that is how kubernetes controllers work, there is no pass through cache and you have to operate with the state of the informer, it will eventually be consistent
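
The alternative argued for here, operating only on informer state and retrying until it catches up, can be sketched as a requeue loop. The names `apiState`, `informer` and `processQueue` are hypothetical, and delivering the watch event between attempts is a simplification standing in for requeue with backoff.

```go
package main

import (
	"errors"
	"fmt"
)

var errStale = errors.New("conflict: stale resourceVersion")

// apiState is a toy api-server holding the authoritative version.
type apiState struct{ version int }

// update succeeds only when the caller saw the current version.
func (a *apiState) update(seenVersion int) error {
	if seenVersion != a.version {
		return errStale
	}
	a.version++
	return nil
}

// informer is a toy watch cache that catches up only when deliver is called.
type informer struct{ version int }

func (i *informer) deliver(a *apiState) { i.version = a.version }

// processQueue retries the same key, always reading the version from the
// informer. It returns the number of attempts made and the last error.
func processQueue(a *apiState, inf *informer, maxAttempts int) (int, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = a.update(inf.version); lastErr == nil {
			return attempt, nil
		}
		// Requeue: by the next attempt the watch event has been delivered.
		inf.deliver(a)
	}
	return maxAttempts, lastErr
}

func main() {
	api := &apiState{version: 3}
	inf := &informer{version: 2} // the informer is one event behind
	attempts, err := processQueue(api, inf, 5)
	fmt.Println("attempts:", attempts, "err:", err)
}
```

The trade-off discussed in the thread is visible in the loop: the first attempt is wasted on stale state, and correctness relies on the informer eventually receiving the event.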

Contributor

This is happening right now with creates. We determine create vs update based on the watch cache, we create if CES is not present in the watch cache

that is the part I don't understand, why are those two different operations?

Contributor Author

PTAL

I moved the handling of conflict errors into the error handling in endpointslices.

@alan-kut alan-kut force-pushed the op-409 branch 2 times, most recently from 48f8fb2 to b679e13 Compare June 28, 2023 13:03
@alan-kut
Contributor Author

/test

@brb
Member

brb commented Jun 29, 2023

Please rebase against the latest main branch to fix the ci-e2e failures.

@alan-kut alan-kut force-pushed the op-409 branch 2 times, most recently from 5e7fda2 to 097fe10 Compare June 29, 2023 09:21
@alan-kut
Contributor Author

All tests passed except flaky K8sUpstreamNetConformance / kubernetes-e2e-net-conformance (ipv4)

@alan-kut alan-kut requested a review from aojea June 29, 2023 12:02
@aojea
Contributor

aojea commented Jun 29, 2023

/lgtm

K8sUpstreamNetConformance / kubernetes-e2e-net-conformance (ipv4) (pull_request

This job is flaky, not required, and does not seem related. It is on my backlog and I hope I can get to it, as it is causing a lot of trouble.

@squeed squeed removed request for a team and sayboras June 30, 2023 08:07
@nbusseneau nbusseneau added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Jul 25, 2023
@gentoo-root gentoo-root added this to Needs backport from main in 1.11.20 Jul 26, 2023
@gentoo-root gentoo-root removed this from Needs backport from main in 1.11.19 Jul 26, 2023
@gentoo-root gentoo-root added this to Needs backport from main in 1.12.13 Jul 26, 2023
@gentoo-root gentoo-root removed this from Needs backport from main in 1.12.12 Jul 26, 2023
@gentoo-root gentoo-root added this to Needs backport from main in 1.13.6 Jul 26, 2023
@gentoo-root gentoo-root removed this from Needs backport from main in 1.13.5 Jul 26, 2023
@nbusseneau nbusseneau removed needs-backport/1.11 needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels Jul 26, 2023
@nbusseneau
Member

Considering the above, I have unmarked this PR for backport to 1.11 / 1.12 / 1.13. Please add them back should we revise our stance.

@nbusseneau nbusseneau removed affects/v1.11 This issue affects v1.11 branch affects/v1.12 This issue affects v1.12 branch affects/v1.13 This issue affects v1.13 branch labels Jul 26, 2023
@nebril nebril added this to Needs backport from main in 1.13.7 Aug 10, 2023
@nebril nebril removed this from Needs backport from main in 1.13.6 Aug 10, 2023
@nebril nebril added this to Needs backport from main in 1.11.21 Aug 10, 2023
@nebril nebril removed this from Needs backport from main in 1.11.20 Aug 10, 2023
@asauber asauber added this to Needs backport from main in 1.12.14 Aug 13, 2023
@asauber asauber removed this from Needs backport from main in 1.12.13 Aug 13, 2023
@michi-covalent michi-covalent added this to Needs backport from main in 1.13.8 Sep 9, 2023
@michi-covalent michi-covalent removed this from Needs backport from main in 1.13.7 Sep 9, 2023
@michi-covalent michi-covalent added this to Needs backport from main in 1.12.15 Sep 9, 2023
@michi-covalent michi-covalent removed this from Needs backport from main in 1.12.14 Sep 9, 2023
@jrajahalme jrajahalme added this to Needs backport from main in 1.13.9 Oct 17, 2023
@jrajahalme jrajahalme removed this from Needs backport from main in 1.13.8 Oct 17, 2023
@jrajahalme jrajahalme removed this from Needs backport from main in 1.13.9 Oct 17, 2023
@jrajahalme jrajahalme removed this from Needs backport from main in 1.12.15 Oct 17, 2023
gandro pushed a commit to gandro/cilium that referenced this pull request Dec 7, 2023
This commit fixes the import of EndpointSlice

Fixes: cilium#26455

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
Labels
affects/v1.14 This issue affects v1.14 branch backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
1.11.21
Needs backport from main
9 participants