
Conflict errors on certificaterequest updates with kubernetes 1.25 #5982

Closed
lorenzoiuri opened this issue Apr 24, 2023 · 7 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. triage/support Indicates an issue that is a support question.

Comments

@lorenzoiuri

lorenzoiuri commented Apr 24, 2023

We have developed an external issuer for cert-manager for managing our custom certificates, following the example provided.

Since upgrading our clusters to kubernetes version 1.25, we noticed that the RetryOnConflict function (in the defer block of the custom certificaterequest_controller, see the code below) is triggered and, after 5 retries (the default), it fails.
In kubernetes 1.24 this behaviour was not present.

We noticed that between one retry and the next, the managedFields.time field changes along with metadata.resourceVersion, which prevents RetryOnConflict from succeeding; it fails with the error below.

Operation cannot be fulfilled on certificaterequests.cert-manager.io \"sd-amk8s-dom1-tls-rg56z\": the object has been modified; please apply your changes to the latest version and try again

We also noticed a bug fix in kubernetes 1.25 that may be the cause of our issue: ManagedFields time is correctly updated when the value of a managed field is modified.

We would like to understand whether cert-manager is responsible for the frequent modifications to the certificaterequest resource we are trying to manage.
If so, we would like to find a way to prevent the behaviour described above.
Can you please investigate whether this is a bug in cert-manager?

Thank you.

	// Always update the resource and its status at the end of the reconcile loop
	defer func() {

		// update the annotations in any case
		annotationsFromRequestStatus(&reqStatus)
		receivederr := err

		if receivederr != nil {
			setReadyCondition(cmmeta.ConditionFalse, cmapi.CertificateRequestReasonPending, receivederr.Error())
			log.Error(receivederr, "Error managing CertificateRequest")
			// these errors still require the annotations to be updated,
			// so do not return yet
			if receivederr != webra.ErrCallingWebRA {
				return
			}
		}

		err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
			// need to refetch the resource on every try, since
			// if we get a conflict on the last update attempt then we need to get
			// the current version before updating the certificaterequest.
			var newCertReq cmapi.CertificateRequest
			log.Info("Starting RetryOnConflict, requestID is " + strconv.Itoa(int(reqStatus.RequestID)))
			if err := r.Get(ctx, req.NamespacedName, &newCertReq); err != nil {
				log.Error(err, "Couldn't fetch certificateRequest resource")
				return err
			}

			// Copy the spec, the annotations and the status from the resource that went through the reconcile loop
			// Copy the annotations in any case, the spec and status only if not error
			// Copying the annotations is necessary to avoid losing the RequestID
			newCertReq.SetAnnotations(certificateRequest.Annotations)
			if receivederr != webra.ErrCallingWebRA {
				newCertReq.Spec = certificateRequest.Spec
				newCertReq.Status = certificateRequest.Status
			}

			// Try to update
			err := r.Status().Update(ctx, newCertReq.DeepCopy())
			if err != nil {
				return err
			}
			updateErr := r.Update(ctx, newCertReq.DeepCopy())
			// sleep is required to allow the update to sync to the apiserver;
			// without it, conflicts occasionally happened
			time.Sleep(1 * time.Second)
			return updateErr
		})
		if err != nil {
			// May be conflict if max retries were hit, or may be something unrelated
			// like permissions or a network error
			log.Error(err, "Couldn't update certificateRequest resource", "CertificateRequest", certificateRequest)
		}
	}()

Steps to reproduce:

  • Use a kubernetes cluster with version 1.25 (we tried both AKS and kind)
  • Use cert-manager 1.11.0
  • Use the sample external issuer code with the RetryOnConflict function shown above

Environment details:

  • Kubernetes 1.25 and 1.24 (AKS, kind)
  • Cert-manager 1.11.0
  • go.mod file
...
require (
	github.com/cert-manager/cert-manager v1.11.0
	github.com/go-logr/logr v1.2.4
	github.com/hooklift/gowsdl v0.5.0
	k8s.io/api v0.26.3
	k8s.io/apimachinery v0.26.3
	k8s.io/client-go v0.26.3
	k8s.io/utils v0.0.0-20230313181309-38a27ef9d749
	sigs.k8s.io/controller-runtime v0.14.1
)
...

/kind bug

@jetstack-bot jetstack-bot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 24, 2023
@kkendzia

kkendzia commented May 8, 2023

I have the same issue in EKS v1.23

@irbekrm
Contributor

irbekrm commented May 19, 2023

it fails

Does it actually fail to issue certificates? Resource conflicts happen during normal cert-manager operations as multiple control loops modify the same resources; see e.g. #5060 (comment). You can enable ServerSideApply if you'd like to avoid those issues.

We do not directly modify managed fields.

@irbekrm irbekrm added triage/support Indicates an issue that is a support question. and removed kind/bug Categorizes issue or PR as related to a bug. labels May 19, 2023
@kkendzia

From my side it was a configuration error: aws-privateca-issuer couldn't access the secretmanager properly. This resulted in this error, which wasn't clear at first sight.

@lorenzoiuri
Author

lorenzoiuri commented Jun 27, 2023

Sorry for the delay, we thought we fixed the issue with a workaround but it happened again multiple times recently.

Does it actually fail to issue certificates? Resource conflicts happen during normal cert-manager operations as multiple control loops modify the same resources; see e.g. #5060 (comment). You can enable ServerSideApply if you'd like to avoid those issues.

Let me provide some more facts:

  • Our custom issuer, installed on kubernetes versions <= 1.24, occasionally triggered multiple (usually 2) requests to the API contacted for certificate issuing.
  • After many tests, we concluded the issue was caused by some concurrency conflicts, and it was fixed by introducing a delay in the retryOnConflict function (the time.Sleep(1 * time.Second) line in the first post). No further duplicate requests were observed after this change.
  • The same code installed on kubernetes 1.25 triggers the retryOnConflict function until it reaches its limit (5 times by default). When it fails, the error is Couldn't update certificateRequest resource, Operation cannot be fulfilled on certificaterequests.cert-manager.io \"sd-amk8s-dom1-tls-prmdv\": the object has been modified; please apply your changes to the latest version and try again, and the certificate is not issued.
  • Removing the delay solves the retryOnConflict error (it no longer retries until failure), but the problem with the duplicated requests reappears.

From what I understand, the duplicate requests are created because the certificaterequest resource is modified in some way by the control loops.

Does this behaviour suggest any identifiable or known issues to you? (such as with the operator framework used by the sample custom issuer)

Can you please explain this sentence? @irbekrm

multiple control loops modify the same resources

We will look into ServerSideApply, but it doesn't seem to be a trivial change.
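The "multiple control loops" remark comes down to the apiserver's optimistic concurrency: every write bumps resourceVersion, and any update carrying a stale version is rejected with exactly the "object has been modified" error, which is why RetryOnConflict must re-fetch before every attempt. The stdlib-only sketch below simulates that contract with an in-memory store (a simplified stand-in for the apiserver, not real client-go machinery), to show how a competing writer forces a retry that then succeeds:

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("the object has been modified; please apply your changes to the latest version and try again")

// store simulates the apiserver's optimistic concurrency: every successful
// write bumps resourceVersion, and writes with a stale version are rejected.
type store struct {
	resourceVersion int
	annotations     map[string]string
}

func (s *store) get() (int, map[string]string) {
	copied := map[string]string{}
	for k, v := range s.annotations {
		copied[k] = v
	}
	return s.resourceVersion, copied
}

func (s *store) update(rv int, annotations map[string]string) error {
	if rv != s.resourceVersion {
		return errConflict // another control loop wrote in between
	}
	s.resourceVersion++
	s.annotations = annotations
	return nil
}

// retryOnConflict mimics retry.RetryOnConflict: re-fetch, re-apply the
// mutation, and try again, up to maxRetries times.
func retryOnConflict(maxRetries int, s *store, mutate func(map[string]string)) error {
	var err error
	for i := 0; i < maxRetries; i++ {
		rv, ann := s.get() // always re-fetch before mutating
		mutate(ann)
		if err = s.update(rv, ann); !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

func main() {
	s := &store{annotations: map[string]string{}}
	// Simulate a competing control loop racing with our first attempt.
	first := true
	err := retryOnConflict(5, s, func(ann map[string]string) {
		if first {
			first = false
			s.update(s.resourceVersion, map[string]string{"other": "loop"})
		}
		ann["request-id"] = "42"
	})
	fmt.Println("final error:", err)
	fmt.Println("annotations:", s.annotations)
}
```

The first attempt fails because the competing writer bumped resourceVersion after the fetch; the second attempt re-fetches (picking up the other loop's annotation) and succeeds, which is the behaviour the real RetryOnConflict relies on.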

@jetstack-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 26, 2023
@jetstack-bot
Contributor

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle rotten
/remove-lifecycle stale

@jetstack-bot jetstack-bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 26, 2023
@lorenzoiuri
Author

We rewrote the retryOnConflict function and since then we have not observed the bug again.

err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
	// Refetch the resource on every try: if the previous attempt hit a
	// conflict, we need the current version before updating the
	// certificaterequest.
	var newCertReq cmapi.CertificateRequest
	log.Info("Starting RetryOnConflict, requestID is " + strconv.Itoa(int(reqStatus.RequestID)))
	if err := r.Get(ctx, req.NamespacedName, &newCertReq); err != nil {
		log.Error(err, "Couldn't fetch certificateRequest resource")
		return err
	}

	// Copy the annotations in any case; copy the spec and status only on success.
	// Copying the annotations is necessary to avoid losing the RequestID.
	newCertReq.SetAnnotations(certificateRequest.Annotations)
	if receivederr != webra.ErrCallingWebRA {
		newCertReq.Spec = certificateRequest.Spec
		newCertReq.Status = certificateRequest.Status
	}

	// Update the status using an UPDATE API call
	if err := r.Status().Update(ctx, newCertReq.DeepCopy()); err != nil {
		log.Error(err, "Couldn't update certificateRequest status", "CertificateRequest", certificateRequest)
		return err
	}

	// Convert certificateRequest.Spec to a JSON string
	specJSON, err := json.Marshal(newCertReq.Spec)
	if err != nil {
		log.Error(err, "Failed to marshal certificateRequest.Spec to JSON")
		return err
	}

	// Convert certificateRequest.Annotations to a JSON string
	annotationsJSON, err := json.Marshal(newCertReq.Annotations)
	if err != nil {
		log.Error(err, "Failed to marshal certificateRequest.Annotations to JSON")
		return err
	}

	// Build the JSON patch array including both spec and annotations
	resourcePatch := []byte(`[
		{"op": "replace", "path": "/spec", "value": ` + string(specJSON) + `},
		{"op": "replace", "path": "/metadata/annotations", "value": ` + string(annotationsJSON) + `}
	]`)

	// Update the resource using a PATCH API call; return the error so
	// that a conflict here is retried as well
	if err := r.Patch(ctx, newCertReq.DeepCopy(), client.RawPatch(types.JSONPatchType, resourcePatch)); err != nil {
		log.Error(err, "Couldn't patch certificateRequest resource", "CertificateRequest", certificateRequest)
		return err
	}

	time.Sleep(10 * time.Second)
	return nil
})
