
Conflict errors on certificaterequest updates with kubernetes 1.25 #5982

Closed
lorenzoiuri opened this issue Apr 24, 2023 · 7 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. triage/support Indicates an issue that is a support question.

Comments

@lorenzoiuri

lorenzoiuri commented Apr 24, 2023

We have developed an external issuer for cert-manager for managing our custom certificates, following the example provided.

Since upgrading our clusters to kubernetes version 1.25, we noticed that the RetryOnConflict function (in the defer block of the custom certificaterequest_controller, see the code below) is triggered and, after 5 retries (the default), it fails.
In kubernetes 1.24 this behaviour was not present.

We noticed that between one retry and the next, the managedFields.time field changes along with metadata.resourceVersion, which prevents RetryOnConflict from succeeding; it fails with the error below.

Operation cannot be fulfilled on certificaterequests.cert-manager.io \"sd-amk8s-dom1-tls-rg56z\": the object has been modified; please apply your changes to the latest version and try again

We also noticed a bug fix in kubernetes 1.25 that may be the cause of our issue: ManagedFields time is correctly updated when the value of a managed field is modified.

We would like to understand whether cert-manager is responsible for the frequent modifications to the certificaterequest resource we are trying to manage.
If so, we would like to find a way to prevent the behaviour described above.
Can you please investigate whether this is a bug in cert-manager?

Thank you.

	// Always update the resource and its status at the end of the reconcile loop
	defer func() {

		// update the annotations in any case
		annotationsFromRequestStatus(&reqStatus)
		receivederr := err

		if receivederr != nil {
			setReadyCondition(cmmeta.ConditionFalse, cmapi.CertificateRequestReasonPending, receivederr.Error())
			log.Error(receivederr, "Error managing CertificateRequest")
			// these errors still require the annotations to be updated,
			// so do not return yet
			if receivederr != webra.ErrCallingWebRA {
				return
			}
		}

		err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
			// need to refetch the resource on every try, since
			// if we get a conflict on the last update attempt then we need to get
			// the current version before updating the certificaterequest.
			var newCertReq cmapi.CertificateRequest
			log.Info("Starting RetryOnConflict, requestID is " + strconv.Itoa(int(reqStatus.RequestID)))
			if err := r.Get(ctx, req.NamespacedName, &newCertReq); err != nil {
				log.Error(err, "Couldn't fetch certificateRequest resource")
				return err
			}

			// Copy the spec, the annotations and the status from the resource that went through the reconcile loop
			// Copy the annotations in any case, the spec and status only if not error
			// Copying the annotations is necessary to avoid losing the RequestID
			newCertReq.SetAnnotations(certificateRequest.Annotations)
			if receivederr != webra.ErrCallingWebRA {
				newCertReq.Spec = certificateRequest.Spec
				newCertReq.Status = certificateRequest.Status
			}

			// Try to update
			err := r.Status().Update(ctx, newCertReq.DeepCopy())
			if err != nil {
				return err
			}
			updateErr := r.Update(ctx, newCertReq.DeepCopy())
			// sleep is required to allow the update to sync to the apiserver;
			// without it, conflicts occasionally happened
			time.Sleep(1 * time.Second)
			return updateErr
		})
		if err != nil {
			// May be conflict if max retries were hit, or may be something unrelated
			// like permissions or a network error
			log.Error(err, "Couldn't update certificateRequest resource", "CertificateRequest", certificateRequest)
		}
	}()

Steps to reproduce:

  • Use a kubernetes cluster with version 1.25 (we tried both AKS and kind)
  • Use cert-manager 1.11.0
  • Use the sample external issuer code with the RetryOnConflict function shown above

Environment details:

  • Kubernetes 1.25 and 1.24 (AKS, kind)
  • Cert-manager 1.11.0
  • go.mod file
...
require (
	github.com/cert-manager/cert-manager v1.11.0
	github.com/go-logr/logr v1.2.4
	github.com/hooklift/gowsdl v0.5.0
	k8s.io/api v0.26.3
	k8s.io/apimachinery v0.26.3
	k8s.io/client-go v0.26.3
	k8s.io/utils v0.0.0-20230313181309-38a27ef9d749
	sigs.k8s.io/controller-runtime v0.14.1
)
...

/kind bug

@jetstack-bot jetstack-bot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 24, 2023
@kkendzia

kkendzia commented May 8, 2023

I have the same issue in EKS v1.23

@irbekrm
Contributor

irbekrm commented May 19, 2023

it fails

Does it actually fail to issue certificates? Resource conflicts happen during normal cert-manager operations as multiple control loops modify the same resources; see e.g. #5060 (comment). You can enable ServerSideApply if you'd like to avoid those issues.

We do not directly modify managed fields.

@irbekrm irbekrm added triage/support Indicates an issue that is a support question. and removed kind/bug Categorizes issue or PR as related to a bug. labels May 19, 2023
@kkendzia

From my side it was a configuration error: aws-privateca-issuer couldn't access the secretmanager properly. This resulted in this error, which wasn't clear at first sight.

@lorenzoiuri
Author

lorenzoiuri commented Jun 27, 2023

Sorry for the delay, we thought we fixed the issue with a workaround but it happened again multiple times recently.

Does it actually fail to issue certificates? Resource conflicts happen during normal cert-manager operations as multiple control loops modify the same resources; see e.g. #5060 (comment). You can enable ServerSideApply if you'd like to avoid those issues.

Let me provide some more facts:

  • Our custom issuer, installed on kubernetes versions <= 1.24, occasionally triggered multiple (usually 2) requests to the API contacted for certificate issuing.
  • After many tests, we concluded the issue was caused by some concurrency conflicts, and it was fixed by introducing a delay in the retryOnConflict function (the time.Sleep(1 * time.Second) line in the first post). No further duplicate requests were observed after this change.
  • The same code installed on kubernetes 1.25 triggers the retryOnConflict function until it reaches its limit (5 times by default). When it fails, the error is Couldn't update certificateRequest resource, Operation cannot be fulfilled on certificaterequests.cert-manager.io \"sd-amk8s-dom1-tls-prmdv\": the object has been modified; please apply your changes to the latest version and try again, and the certificate is not issued.
  • Removing the delay solves the retryOnConflict error (it no longer retries until failure), but the problem with the duplicated requests reappears.

From what I understand, the duplicate requests are created because the certificaterequest resource is modified in some way by the control loops.

Does this behaviour suggest any identifiable or known issues to you? (such as with the operator framework used by the sample custom issuer)

Can you please explain this sentence? @irbekrm

multiple control loops modify the same resources

We will look into ServerSideApply, but it doesn't seem to be a trivial change.
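The "multiple control loops" remark comes down to the apiserver's optimistic concurrency: every write bumps resourceVersion, and any update carrying a stale version is rejected with exactly the "object has been modified" error, which is why RetryOnConflict must re-fetch before every attempt. The stdlib-only sketch below simulates that contract with an in-memory store (a simplified stand-in for the apiserver, not real client-go machinery), to show how a competing writer forces a retry that then succeeds:

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("the object has been modified; please apply your changes to the latest version and try again")

// store simulates the apiserver's optimistic concurrency: every successful
// write bumps resourceVersion, and writes with a stale version are rejected.
type store struct {
	resourceVersion int
	annotations     map[string]string
}

func (s *store) get() (int, map[string]string) {
	copied := map[string]string{}
	for k, v := range s.annotations {
		copied[k] = v
	}
	return s.resourceVersion, copied
}

func (s *store) update(rv int, annotations map[string]string) error {
	if rv != s.resourceVersion {
		return errConflict // another control loop wrote in between
	}
	s.resourceVersion++
	s.annotations = annotations
	return nil
}

// retryOnConflict mimics retry.RetryOnConflict: re-fetch, re-apply the
// mutation, and try again, up to maxRetries times.
func retryOnConflict(maxRetries int, s *store, mutate func(map[string]string)) error {
	var err error
	for i := 0; i < maxRetries; i++ {
		rv, ann := s.get() // always re-fetch before mutating
		mutate(ann)
		if err = s.update(rv, ann); !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

func main() {
	s := &store{annotations: map[string]string{}}
	// Simulate a competing control loop racing with our first attempt.
	first := true
	err := retryOnConflict(5, s, func(ann map[string]string) {
		if first {
			first = false
			s.update(s.resourceVersion, map[string]string{"other": "loop"})
		}
		ann["request-id"] = "42"
	})
	fmt.Println("final error:", err)
	fmt.Println("annotations:", s.annotations)
}
```

The first attempt fails because the competing writer bumped resourceVersion after the fetch; the second attempt re-fetches (picking up the other loop's annotation) and succeeds, which is the behaviour the real RetryOnConflict relies on.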

@jetstack-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 26, 2023
@jetstack-bot
Contributor

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle rotten
/remove-lifecycle stale

@jetstack-bot jetstack-bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 26, 2023
@lorenzoiuri
Author

We rewrote the retryOnConflict function and since then we have not observed the bug again.

err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
	// Refetch the resource on every try: if the previous attempt hit a
	// conflict, we need the current version before updating the
	// certificaterequest.
	var newCertReq cmapi.CertificateRequest
	log.Info("Starting RetryOnConflict, requestID is " + strconv.Itoa(int(reqStatus.RequestID)))
	if err := r.Get(ctx, req.NamespacedName, &newCertReq); err != nil {
		log.Error(err, "Couldn't fetch certificateRequest resource")
		return err
	}

	// Copy the annotations in any case; copy the spec and status only on success.
	// Copying the annotations is necessary to avoid losing the RequestID.
	newCertReq.SetAnnotations(certificateRequest.Annotations)
	if receivederr != webra.ErrCallingWebRA {
		newCertReq.Spec = certificateRequest.Spec
		newCertReq.Status = certificateRequest.Status
	}

	// Update the status using an UPDATE API call
	if err := r.Status().Update(ctx, newCertReq.DeepCopy()); err != nil {
		log.Error(err, "Couldn't update certificateRequest status", "CertificateRequest", certificateRequest)
		return err
	}

	// Convert certificateRequest.Spec to a JSON string
	specJSON, err := json.Marshal(newCertReq.Spec)
	if err != nil {
		log.Error(err, "Failed to marshal certificateRequest.Spec to JSON")
		return err
	}

	// Convert certificateRequest.Annotations to a JSON string
	annotationsJSON, err := json.Marshal(newCertReq.Annotations)
	if err != nil {
		log.Error(err, "Failed to marshal certificateRequest.Annotations to JSON")
		return err
	}

	// Build the JSON patch array including both spec and annotations
	resourcePatch := []byte(`[
		{"op": "replace", "path": "/spec", "value": ` + string(specJSON) + `},
		{"op": "replace", "path": "/metadata/annotations", "value": ` + string(annotationsJSON) + `}
	]`)

	// Update the resource using a PATCH API call; return the error so
	// that a conflict here is retried as well
	if err := r.Patch(ctx, newCertReq.DeepCopy(), client.RawPatch(types.JSONPatchType, resourcePatch)); err != nil {
		log.Error(err, "Couldn't patch certificateRequest resource", "CertificateRequest", certificateRequest)
		return err
	}

	time.Sleep(10 * time.Second)
	return nil
})
