Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

2686 fixes stuck Orders issue #3805

Merged
merged 6 commits into from
May 19, 2021

Conversation

irbekrm
Copy link
Contributor

@irbekrm irbekrm commented Mar 25, 2021

What this PR does / why we need it:

See #2868 for context.

Once an ACME server has validated authorizations for an order, it should set the status of that order to 'ready' (see spec)
The order can then be finalized, after which its status becomes 'valid'.
However, it is possible that a misbehaving ACME server would, at a point in time, have validated the authorizations, but not have yet set the status of the order to 'ready'.
Currently in this scenario cert-manager orders controller would not keep polling the ACME server for changes in order's status- instead the Order CR would be stuck in a pending state.

Happy path:

  1. All authorizations are valid, so the Challenge resources for the Order have also been set to 'valid'. ACME server has also set the state of the ACME order to 'ready'.
  2. Orders controller re-syncs the Order, observes the valid Challenges and, retrieves the status of the ACME order and updates the status of the Order CR accordingly (pending -> ready)
  3. Order controller re-syncs the Order again (since its status was just changed), observes that the status is now 'ready' and finalizes the order

Sad path:

  1. All authorizations are valid, so the Challenge resources for the Order have also been set to 'valid'. ACME server has not (yet) set the state of the ACME order to 'ready' <- this goes against ACME spec.
  2. Orders controller re-syncs the Order, observes the valid Challenges and, retrieves the status of the ACME order which is still 'pending'. It tries to update the state of the Order CR (pending -> pending)- this update is probably not applied as there is no actual status change and no further re-syncs would be triggered - the Order remains pending.

Which issue this PR fixes *

Fixes #2868

Special notes for your reviewer:

I fixed this by re-queing the Order if all the Challenges are valid, but the state of the ACME order is still pending, in the same way as we do to, for example, re-queue a Certificate that needs to be renewed at a specific time.

We could have also:

A. Polled the ACME server continuosly instead of re-queing- but we don't know for how long the ACME order might remain in this state, so would we allow the worker to do this potentially 'forever' or introduce a random timeout?

B. Done something with resync intervals- however this seems to be a rarely encountered edge case, so maybe we don't want to modify order contoller's resync period just for this.

I have tested this only via the integration test for orders controller that is added with this PR.

Release note:

Fixes stuck Orders in case of a misbehaving ACME server

/kind bug

@jetstack-bot jetstack-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. labels Mar 25, 2021
@jetstack-bot jetstack-bot added area/acme Indicates a PR directly modifies the ACME Issuer code area/testing Issues relating to testing approved Indicates a PR has been approved by an approver from all required OWNERS files. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 25, 2021
@irbekrm irbekrm changed the title 2686 stuck orders 2686 fixes stuck Orders issue Mar 25, 2021
@irbekrm
Copy link
Contributor Author

irbekrm commented Mar 25, 2021

/retest

@irbekrm
Copy link
Contributor Author

irbekrm commented Mar 25, 2021

A lot of setup failures lately, it seems.

@irbekrm irbekrm added the kind/bug Categorizes issue or PR as related to a bug. label Mar 25, 2021
@jetstack-bot jetstack-bot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Mar 25, 2021
@jetstack-bot jetstack-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 2, 2021
@maelvls
Copy link
Member

maelvls commented Apr 7, 2021

/milestone v1.4

@jetstack-bot jetstack-bot added this to the v1.4 milestone Apr 7, 2021
@jetstack-bot jetstack-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 14, 2021
@jetstack-bot jetstack-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 29, 2021
@jetstack-bot jetstack-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 7, 2021
@jetstack-bot jetstack-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 10, 2021
@wallrj wallrj self-assigned this May 12, 2021
Copy link
Member

@wallrj wallrj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @irbekrm

Really nice refactoring to slot in the scheduled work queue.
And thanks for adding that integration test.

I left one or two comments and suggestions below which you can answer or address as you like.

pkg/controller/acmeorders/checks.go Show resolved Hide resolved
pkg/controller/acmeorders/checks.go Show resolved Hide resolved
pkg/controller/acmeorders/sync.go Outdated Show resolved Hide resolved
pkg/controller/acmeorders/controller.go Show resolved Hide resolved
pkg/controller/acmeorders/sync.go Outdated Show resolved Hide resolved
pkg/controller/acmeorders/sync_test.go Outdated Show resolved Hide resolved
test/integration/acme/orders_controller_test.go Outdated Show resolved Hide resolved
test/integration/acme/orders_controller_test.go Outdated Show resolved Hide resolved
pkg/controller/acmeorders/controller.go Show resolved Hide resolved
pkg/controller/acmeorders/controller.go Show resolved Hide resolved
@wallrj wallrj removed their assignment May 12, 2021
@irbekrm
Copy link
Contributor Author

irbekrm commented May 17, 2021

Thank you for the code review and good suggestions @wallrj !

I think I have addressed all and created new issues for some, please take a look.

pkg/controller/acmeorders/sync.go Outdated Show resolved Hide resolved
pkg/controller/acmeorders/sync.go Outdated Show resolved Hide resolved
test/integration/acme/orders_controller_test.go Outdated Show resolved Hide resolved
Copy link
Member

@wallrj wallrj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me @irbekrm

Just a typo and a suggestion which you can ignore and /unhold if you prefer.

/lgtm
/hold

@jetstack-bot jetstack-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 18, 2021
@jetstack-bot jetstack-bot added the lgtm Indicates that a PR is ready to be merged. label May 18, 2021
@jetstack-bot jetstack-bot removed the lgtm Indicates that a PR is ready to be merged. label May 19, 2021
@irbekrm
Copy link
Contributor Author

irbekrm commented May 19, 2021

Thanks for the review @wallrj !
I've tried to address the comments, please take a look.

Copy link
Member

@wallrj wallrj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

/lgtm

@jetstack-bot jetstack-bot added the lgtm Indicates that a PR is ready to be merged. label May 19, 2021
@irbekrm
Copy link
Contributor Author

irbekrm commented May 19, 2021

/hold cancel

@jetstack-bot jetstack-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 19, 2021
To avoid stuck Orders in case of a misbehaving ACME server

Signed-off-by: irbekrm <irbekrm@gmail.com>
To allow for testing whether an item gets re-queued in unit tests

Signed-off-by: irbekrm <irbekrm@gmail.com>
Signed-off-by: irbekrm <irbekrm@gmail.com>
So that it easier used with the existing test framework and also is more similar to how most other controllers are created

Signed-off-by: irbekrm <irbekrm@gmail.com>
Signed-off-by: irbekrm <irbekrm@gmail.com>
Signed-off-by: irbekrm <irbekrm@gmail.com>
@jetstack-bot jetstack-bot removed the lgtm Indicates that a PR is ready to be merged. label May 19, 2021
Copy link
Member

@wallrj wallrj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@jetstack-bot jetstack-bot added the lgtm Indicates that a PR is ready to be merged. label May 19, 2021
@jetstack-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: irbekrm, wallrj

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jetstack-bot jetstack-bot merged commit 9f5daa7 into cert-manager:master May 19, 2021
@irbekrm irbekrm deleted the 2686_stuck_orders branch May 19, 2021 14:32
@@ -144,10 +152,25 @@ func (c *controller) Sync(ctx context.Context, o *cmacme.Order) (err error) {
return err
}

acmeOrder, err := getACMEOrder(ctx, cl, o)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just stumbled across this - so one thing to note is that this controller (and all other acme controllers) have always made a point to avoid calling the ACME API unless it is strictly necessary. This is because, in instances where unexpected errors can crop up (or a bug in our code), it can lead to large amounts of requests being sent to Let's Encrypt from many different cert-manager installations. That is also why you'll notice very few calls using the ACME client throughout these controllers.

If there is no other way to achieve this without calling out to the ACME API, then that's fine - but this needs a lot of care to ensure there's no cases where we can enter into loops. It's also worth viewing it as a last resort and an expensive call to make, so gating this kind of call behind as many local checks as possible would make sense (eg given the 'happy path' today works without making this extra call, is there an error case that we could add this call into rather than having it on the happy path).

By adding it on the happy path of all syncs of orders, we have increased the number of api calls made in this controller for everyone by 1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/acme Indicates a PR directly modifies the ACME Issuer code area/testing Issues relating to testing dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

acme: checking Order status periodically when Challenges are 'valid' to avoid stuck Orders
5 participants