-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
operator: Handle conflicts in CES update. #26455
Conversation
Commit 9c7119712346a4d25c1ebebab955ef8b230ca140 does not contain "Signed-off-by". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin |
/assign dlapcevic |
\assign dlapcevic |
Can you also help to correct the sign-off statement in the commit message?
|
/assign @dlapcevic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
if !exists { | ||
return err | ||
} | ||
ces.ObjectMeta = obj.(*cilium_v2.CiliumEndpointSlice).ObjectMeta |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks a little questionable. Might want to see if you can update the obj.Spec/obj.Status instead, otherwise you might overwrite values added by another update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine because the operator doesn't want to modify CES ObjectMeta.
The operator creates CESs and then only modifies the Endpoints field.
The only remaining concern is that updates will still fail if another source (other than operator) modifies the Endpoints field, but the aim of this PR is to fix the issue of operator getting stuck on its own, just as a part of the k8s system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not worried about the ObjectMeta data, I'm worried that it's overwriting the entire rest of the object without concern for it's previous values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand it. I intended to overwrite ObjectMeta only and I thought this is what I did.
This is the same as updateCESInCache code called later:
func (c *cesMgr) updateCESInCache(srcCES *cilium_v2.CiliumEndpointSlice, isDeepCopy bool) {
if ces, ok := c.desiredCESs.getCESTracker(srcCES.GetName()); ok {
ces.backendMutex.Lock()
defer ces.backendMutex.Unlock()
if !isDeepCopy {
ces.ces.ObjectMeta = srcCES.ObjectMeta
} else {
ces.ces = srcCES
for _, cep := range ces.ces.Endpoints {
// Update the desiredCESs, to reflect all CEPs are packed in a CES
c.desiredCESs.insertCEP(GetCEPNameFromCCEP(&cep, ces.ces.Namespace), srcCES.GetName())
}
}
} else {
log.WithFields(logrus.Fields{
logfields.CESName: srcCES.GetName(),
}).Debug("Attempted to updateCESInCache non-existent, skipping.")
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you are worried that in the apisever we ignore all other possible changes.
What is the case in which we shouldn't do this?
Operator should be the only controller for CES.
If some client changes k8s.Endpoints etc the controllers owning them will revert such changes. The same would happen here (to some extent) with the CiliumEndpointSlice.
Long term I think the operator should always reconcile the full state of CiliumEndpointSlice and make sure they are exactly as operator expects them to be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it would work better.
It would fix the problem with conflict eventually but it would introduce more problem.
It would always use metadata from the watcher cache so the metadata would be outdated after update before watch gets the event.
This is happening right now with creates. We determine create vs update based on the watch cache, we create if CES is not present in the watch cache. So we observed it happening that we create the CES and operator wants to update it before the watcher obeserves the event so it tries to create again and fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could move my change to the handleErr - I think it makes more sense, will update the PR soon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would always use metadata from the watcher cache so the metadata would be outdated after update before watch gets the event.
that is how kubernetes controllers work, there is no pass through cache and you have to operate with the state of the informer, it eventually be consistent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is happening right now with creates. We determine create vs update based on the watch cache, we create if CES is not present in the watch cache
that is the part I don't understand, why are those two different operations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL
I changed the place for handling conflict errors to errorhandling in endpointslices.
48f8fb2
to
b679e13
Compare
/test |
Please rebase against the latest |
5e7fda2
to
097fe10
Compare
All tests passed except flaky K8sUpstreamNetConformance / kubernetes-e2e-net-conformance (ipv4) |
/lgtm
this job is flaky and not required and does not seem related, is on my backlog and I hope I can get to it as it is causing a lot of troubles |
Considering the above, I have unmarked this PR for backport to 1.11 / 1.12 / 1.13. Please add them back should we revise our stance. |
This commit fixes the import of EndpointSlice Fixes: cilium#26455 Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
If for some reason operator has outdated version of CES it will not be able to update such CES and it will never recover from such state.
This can happen not only due to some othe client updating CES but also when the update from operator succeeds but for some reason api-server does't return OK (it can fail after updating etcd).
Signed-off-by: Alan Kutniewski kutniewski@google.com