Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes a bug where a previous failed CertificateRequest was picked up during next issuance #4688

Merged
merged 3 commits into from Jan 4, 2022

Conversation

irbekrm
Copy link
Collaborator

@irbekrm irbekrm commented Dec 22, 2021

What this PR does / why we need it:

This PR fixes a bug where after a certificate request fails and the next one succeeds, the Certificate's 'Ready' condition remains false and the contents of the Secret are not updated.

See #4642 for detailed bug description and instructions how to reproduce it.

Which issue this PR fixes
fixes #4642

Special notes for your reviewer:

This bug was caused by a race condition where during an issuance where the previous issuance failed, the failed CertificateRequest gets picked up by the issuing controller and used to update Certificate's status.
This issue is fixed in ff67b2a by making the issuing controller compare timestamp on the Issuing condition on the Certificate and the failure time on the CertificateRequest and ignore the CertificateRequest if it has failed before the Issuing condition was set on the Certificate as this means that it belongs to previous issuance.

Additionally, this PR also changes the way how certificates-request-manager controller determines if it should delete a failed CertificateRequest for this revision (see #4130 for context). Previously there was a check that the failure happened at least 1 hour (default backoff period) ago. This would mean that if a user runs cmctl renew command earlier, the failed CertificateRequest won't get deleted so I've changed this to check that the CertificateRequest has failed before the Issuing condition was applied to the Certificate.

Looking at the timestamps is not the most reliable way how to determine if an action should be performed and resource statuses can be lost i.e after a backup and restore. However, in this case, the first step would always be to apply 'Issuing' condition to the Certificate in the trigger controller which is done after looking at the contents of the Secret. I cannot think of other edge cases.

Verify the fix (with Venafi TPP issuer):

  1. Create a TPP zone with policy that does not allow private key reuse
  2. Create a TPP issuer and a Certificate with spec.privateKey.rotationPolicy set to 'Never' (default)
  3. Observe successful issuance (Certificate becomes ready)
  4. Force renewal i.e by changing DNS names on Certificate
  5. Observe that cert's 'Ready' condition becomes false and there is a failed CertificateRequest
  6. Modify TPP zone to allow private key reuse
  7. Force renewal using cmctl renew
  8. Observe a new successfull CertificateRequest and Certificate's Ready condition set to true

(See here for how to reproduce the bug)

Alternative approaches:

Same alternative approaches as those in the description of #4130 and same reasons as to why they weren't chosen

Related PRs:
#4130

Release note:

Fixes a bug where a previous failed CertificateRequest was picked up during the next issuance. Thanks to @MattiasGees for raising the issue and help with debugging!

/kind bug

…oller

Issuing controller should only look at 'current' CertificateRequests

Signed-off-by: irbekrm <irbekrm@gmail.com>
…are re-issuing for the same revision

Signed-off-by: irbekrm <irbekrm@gmail.com>
@jetstack-bot jetstack-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Dec 22, 2021
@irbekrm
Copy link
Collaborator Author

irbekrm commented Dec 22, 2021

/test pull-cert-manager-venafi
/test pull-cert-manager-e2e-v1-18
/test pull-cert-manager-e2e-v1-19
/test pull-cert-manager-e2e-v1-20
/test pull-cert-manager-e2e-v1-23
/test pull-cert-manager-e2e-v1-21

@irbekrm
Copy link
Collaborator Author

irbekrm commented Dec 22, 2021

/test pull-cert-manager-issuers-venafi-tpp

@irbekrm
Copy link
Collaborator Author

irbekrm commented Dec 22, 2021

/milestone v1.7

@jetstack-bot jetstack-bot added this to the v1.7 milestone Dec 22, 2021
Signed-off-by: irbekrm <irbekrm@gmail.com>
Copy link
Member

@jakexks jakexks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this, looks like a gap in the state machine!
/lgtm

@jetstack-bot jetstack-bot added the lgtm Indicates that a PR is ready to be merged. label Jan 4, 2022
@jetstack-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: irbekrm, jakexks

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jetstack-bot jetstack-bot merged commit 019d64e into cert-manager:master Jan 4, 2022
@irbekrm irbekrm deleted the renew_failed branch January 4, 2022 15:12
@munnerz
Copy link
Member

munnerz commented Aug 4, 2022

Hm, I am digging into why a failed certificate request is re-created with the same private key as the previous issuance attempt despite rotationPolicy: Always being set and I believe this PR may be related, though I'm not sure yet 👀 do you know if anything changed here with this?

// revision. Leave it to the certificate-requests controller to delete the
// CertificateRequest and create a new one.
if req.Status.FailureTime != nil &&
req.Status.FailureTime.Before(certIssuingCond.LastTransitionTime) && crReadyCond.Reason == cmapi.CertificateRequestReasonFailed {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty nervous about this time comparison.. I don't think these fields are robust enough to use as part of the state machine. Why can't we inspect the actual revision on the failed CertificateRequest (stored as an annotation) to determine whether it's from a previous issuance?

As an example of how this could trip up, if two requests fail in a row then the LastTransitionTime won't be updated as the condition's state has not transitioned.

Would utilising the actual revision annotation work for this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race condition when renewing certs that previously failed issuing
5 participants