Address new ACME v2 rate limit ("too many new orders recently") #217

Open · gohai wants to merge 5 commits into master
Conversation

@gohai (Contributor) commented Jan 22, 2020

The ACME v2 API has a new rate limit of 300 new orders per account per 3 hours, which easily results in the "too many new orders recently" error for new or renewing domains. (see #213)

@gohai changed the title from "Throttle the rate of renewal requests (ACME v2 rate limit)" to "Address new ACME v2 rate limit ("too many new orders recently")" Jan 22, 2020

As a first step to mitigate this new rate limit, throttle the rate at which renewal requests are issued, to a default of 60 per hour (which leaves 40 per hour for on-demand issuance).
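For reference, a minimal configuration sketch (assuming the `renewals_per_hour` setting referenced later in this thread, set via the library's usual `auto_ssl:set(...)` convention):

```lua
-- in the init_by_lua_block, after auto_ssl is created:
-- throttle renewals to 60 new orders per hour, leaving roughly 40/h of
-- the 300-orders-per-3-hours account limit for on-demand issuance
auto_ssl:set("renewals_per_hour", 60)
```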
With throttling, renewal jobs can take significant time to complete (e.g. when a bunch of domains renew around the same time). To spread the renewals as evenly as possible throughout the period, and to keep the renewal checks on different workers in sync, the duration of the actual job execution is taken into account when re-scheduling the check.
This increases the chances that environments using multiple servers won't have all instances renewing at the same time, which might tie up all orders available to the account and thus prevent certificates for new domains from being issued.
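A minimal sketch of the duration-aware rescheduling idea (hypothetical names such as `do_renewals`; not the PR's exact code):

```lua
-- schedule the next check relative to when the current one started,
-- so long-running renewal jobs don't drift the schedule
local check_interval = 86400 -- e.g. once per day

local function renewal_job(premature)
  if premature then return end
  local started = ngx.now()

  do_renewals() -- stands in for the actual throttled renewal loop

  -- subtract the job's own duration from the next delay
  local delay = math.max(check_interval - (ngx.now() - started), 0)
  local ok, err = ngx.timer.at(delay, renewal_job)
  if not ok then
    ngx.log(ngx.ERR, "auto-ssl: failed to schedule renewal job: ", err)
  end
end
```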
This implements the minimal functionality for considering a domain's past certificate failures inside allow_domain. We believe something like this is necessary to be able to comply with the new ACME v2 limit of 300 new orders/account/3h.

A different approach which might be worth considering is saving the statistical information to (permanent) storage, as we do with certificates. This patch instead uses only a shm-based dictionary, meaning the information might be stale in a multi-server setup. (Dictionary entries are created with an expiration date, and we're running the following patch to look for a certificate in storage before calling allow_domain, so this shouldn't be a problem in practice: b1f9715)
Previously, renewals were attempted 30 days before a domain expired, and then every day going forward, until either the renewal succeeded or the existing certificate expired. This poses an issue with the new ACME v2 rate limit ("too many new orders recently"), because large numbers of domains might be expiring (and thus renewing) at the same time, e.g. because they were all moved onto lua-resty-auto-ssl at the same time initially.

To spread the renewal dates out more evenly, this patch rolls the dice on whether a domain should be renewed, with increasing odds that it will be, between 30 and 15 days before the domain expires. Renewals for domains expiring in less than 15 days are always attempted.
@adityapatadia commented

@GUI This is becoming a problem. Can you merge this soon?

@davidmerfield commented

I'm running into this issue as well; it would be great to get this merged.

@GUI (Collaborator) commented Jun 25, 2020

Thanks all. I'll try to review and release this weekend. Sorry for the delay!

@ErisDS commented Jul 7, 2020

Sorry to add another nagging voice to the pile, but we're also struggling with this problem.

Is there anything anyone else can do to test/review this and help build confidence for the merge?

@gohai (Contributor, Author) commented Jul 7, 2020

(Committer here.) To provide a bit of additional context: we've been running these changes in production since January. They are not a guaranteed "solution" to the new rate limit introduced with ACME v2, but they change the existing code to issue renewal requests as steadily as possible, so that renewals won't unnecessarily push you over the limit. This is happening on a "best effort" basis, short of more radical changes, such as having lua-resty-auto-ssl rotate between multiple Let's Encrypt accounts.

In addition, 6e1ea73 adds a generic mechanism for setting a policy on how frequently domains for which we previously weren't able to get a certificate are retried. We've been using this with an exponential backoff, where we ultimately waste only a single order per day on problematic domains. But the desired behavior probably differs among auto-ssl users, so this merely gives you the tools to implement it.

(The one slightly ambiguous part IMHO: we're using 6e1ea73 together with b1f9715, a change to check whether we already have a cert in permanent storage before calling allow_domain, which was previously rejected during review. Without the latter patch, you might have a situation in a multi-server setup where one server serves a certificate for a domain while another still returns the fallback certificate because of per-domain throttling. With the latter patch, auto-ssl will always return a certificate once any of the servers has been able to get one.)

@GUI (Collaborator) left a comment

@gohai: There's lots of good stuff here, a big thanks for helping figure this out.

Everyone: Sorry for the delays in getting this merged. If you want to use @gohai's branch, this all looks solid to me; I just want to mull over some different approaches before merging and releasing.

@gohai: I left some inline comments with some very initial thoughts, but I probably need to spend a bit more time giving all of this more thought. Let me know if any of my initial questions didn't make sense, or if you have any other thoughts. Happy to collaborate on this. Many thanks!

Comment on lines +138 to +145
```lua
elseif now + (15 * 24 * 60 * 60) < cert["expiry"] then
  local rand_value = math.random(cert["expiry"] - (30 * 24 * 60 * 60), cert["expiry"] - (15 * 24 * 60 * 60))
  local rand_renewal_threshold = now
  if rand_value < rand_renewal_threshold then
    ngx.log(ngx.NOTICE, "auto-ssl: expiry date is more than 15 days out, randomly not picked for renewal: ", domain)
    renew_check_cert_unlock(domain, storage, local_lock, distributed_lock_value)
    return
  end
```
@GUI (Collaborator) commented:

Is this increasing-likelihood approach needed given the other throttle mechanism provided by renewals_per_hour (60/hour)? I'm just trying to think through the different scenarios and how these two approaches interact.

If you have a big batch of domains that need to be renewed, wouldn't the renewals_per_hour logic eventually get through the renewals, just at a slower pace? Or is the logic here an attempt to prioritize the renewals, so that ones expiring sooner have their renewal attempted sooner? Would this mainly be relevant if you expect your renewals to take more than 30 days at the throttled 60/hour rate?

If this is mainly intended to prioritize renewals (but you still want to saturate the 60/hour rate), then I'm wondering if a slightly simpler approach would be to sort the renewals by expiration (instead of the current approach of randomizing the renewal order), as sketched below. That said, there may be performance implications to fetching all of those expiration times before renewals occur, so this may not actually be feasible; I'd need to investigate further.
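For illustration, a rough sketch of that sorting alternative (hypothetical helpers such as `fetch_expiries`; not part of this PR):

```lua
-- renew soonest-expiring domains first instead of randomizing the order
local domains = storage:all_cert_domains() -- assumed available to the renewal job
local expiries = fetch_expiries(domains)   -- hypothetical bulk expiry lookup
table.sort(domains, function(a, b)
  -- domains with unknown expiry sort last
  return (expiries[a] or math.huge) < (expiries[b] or math.huge)
end)
```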

@gohai (Contributor, Author) commented Jul 20, 2020

Great question about how this interacts with the throttle mechanism!

I implemented this after realizing that we had a bulk of many thousands of certificates attempting to renew on the same day, since that was the date we had initially migrated them onto lua-resty-auto-ssl.

While this went on, at some point we exceeded the rate limit, which caused an issue: we weren't able to generate certificates for newly registered customer domains. This likely wouldn't have happened if we had set our renewals_per_hour more conservatively (we must have used more than 40/h for ad-hoc certificate requests).

So the aim of this patch is really just to smooth out the renewals over many days, anticipating that this will make it less likely that we exceed the rate limit. (Example: say I start using lua-resty-auto-ssl today and migrate 10k domains onto the system. Before this patch, all of these domains would be attempted for renewal simultaneously 60 days from now, a job which would take 7 days to finish at 60/h. This could be just fine. This patch changes the behavior so that the renewals instead get spread over the next 14 invocations, each of which should only take about 12 hours to finish.)

In the former case, we'd be "maxing out" the 60/h for a week straight, a process which would recur every two months. In the latter case, we'd be at 60/h for only half of each day, and during the rest we'd have a bit more room for ad-hoc certificate requests to exceed 40/h. And over time the renewals would inevitably spread even more evenly, so that ultimately we'd be attempting renewal of roughly the same number of domains each day.
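Spelled out, the back-of-envelope arithmetic behind that example:

```lua
-- without spreading: 10000 domains / 60 per hour ≈ 167 hours ≈ 7 days
--   of fully saturated renewals, recurring every ~60 days
-- with spreading over the 30-to-15-day window (~14 daily checks):
--   10000 / 14 ≈ 714 domains per day; 714 / 60 per hour ≈ 12 hours per day
```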

Does this make more sense to you after reading this?

Comment on lines +164 to +218
#### `get_failures`

The optional `get_failures` function accepts a domain name argument and can be used to retrieve statistics about failed certificate requests for that domain. The function returns a table with fields `first` (timestamp of the first failure encountered), `last` (timestamp of the most recent failure), and `num` (number of failures). It returns `nil` if no failure has been encountered.

Note: the statistics are only kept for as long as the nginx instance is running. No sharing across multiple servers (as in a load-balanced environment) is implemented.

To make use of the `get_failures` function, add the following to the `http` configuration block:

```nginx
lua_shared_dict auto_ssl_failures 1m;
```

When this shm-based dictionary exists, `lua-resty-auto-ssl` will use it to update the record it keeps for a domain whenever a Let's Encrypt certificate request fails (for both new and renewing domains). When a certificate request succeeds, `lua-resty-auto-ssl` deletes the record for the domain, so that future invocations return `nil`.

The `get_failures` function can be used inside `allow_domain` to implement per-domain rate limiting and similar rule sets.

*Example:*

```lua
auto_ssl:set("allow_domain", function(domain, auto_ssl, ssl_options, renewal)
  local failures = auto_ssl:get_failures(domain)
  -- only attempt one certificate request per hour
  if not failures or 3600 < ngx.now() - failures["last"] then
    return true
  else
    return false
  end
end)
```

#### `track_failure`

The optional `track_failure` function accepts a domain name argument and records a failure for this domain. This can be used to avoid repeated lookups of a domain in `allow_domain`.

*Example:*

```lua
auto_ssl:set("allow_domain", function(domain, auto_ssl, ssl_options, renewal)
  local failures = auto_ssl:get_failures(domain)
  -- only attempt one lookup or certificate request per hour
  if failures and ngx.now() - failures["last"] <= 3600 then
    return false
  end

  local allow
  -- (external lookup to check the domain, e.g. via HTTP, sets allow)
  if not allow then
    auto_ssl:track_failure(domain)
    return false
  else
    return true
  end
end)
```

@GUI (Collaborator) commented:

This get_failures and track_failure logic to help throttle initial registrations is definitely a flexible solution that allows for some nifty logic in the allow_domain callback. I think I see how it all fits together (nice!), but I'm still mulling over other options to deal with the same issue. A couple of things come to mind.

  1. You commented:

    The one slightly ambiguous part IMHO: we're using 6e1ea73 together with b1f9715, a change to check whether we already have a cert in permanent storage before calling allow_domain, which was previously rejected during review. Without the latter patch, you might have a situation in a multi-server setup where one server serves a certificate for a domain while another still returns the fallback certificate because of per-domain throttling. With the latter patch, auto-ssl will always return a certificate once any of the servers has been able to get one.

    The ordering issues around this do make me wonder whether allow_domain is the right place to implement this logic, because in a way this seems like a slightly different concept than what allow_domain was originally intended to do. Perhaps we could introduce a different callback (eg, throttle_domain) that would get called after the storage call (eg, slide it into where you had proposed allow_domain get moved in b1f9715)? That would keep allow_domain being called before the storage call, but allow this logic to be handled separately.

  2. I'm also wondering if we could integrate this logic in a more built-in fashion, so that users don't have to implement custom allow_domain (or other) callbacks to handle this. Similar to how renewals_per_hour is just a simple global setting, I'm wondering if we could simplify this in some fashion and integrate the logic directly into the codebase, rather than relying on the callback. Or do your use cases need more custom, per-domain logic for this type of thing?

I'd need to spend some more time investigating the feasibility of these ideas (and I realize you might have different use-cases), but these are just some initial thoughts.

@gohai (Contributor, Author) commented Jul 20, 2020

Thank you for your review and thoughts, @GUI.

To your points:

  1. Separate callbacks would definitely work for us. We don't mind the Redis lookup to see if a certificate is in storage or not, but to accommodate users who do, having two callbacks makes a lot of sense to me. (Perhaps _early and _late?)

Part of my thinking behind adding this to the implementer's "toolbox" in allow_domain was the realization that one could also use the track_failure(domain) mechanism for caching any "regular" lookup that allow_domain does, if desired. (Example: we have to make an HTTP call to an internal web API to decide whether we should allow a new (sub-)domain.)

  2. I do see the beauty of having a simple setting, and having everything handled behind the scenes by lua-resty-auto-ssl.

To be specific, we're currently using the following exponential backoff algorithm in our setup. I'm unsure whether something like this could be generalized to work for other users (without a callback?):

```lua
local failures = auto_ssl:get_failures(domain)
if failures then
  -- 1 failure:  next attempt after 3.75 min
  -- 2 failures: next attempt after 7.5 min
  -- 3 failures: next attempt after 15 min
  -- 4 failures: next attempt after 30 min
  -- 5 failures: next attempt after 60 min
  -- ...
  -- 10+ failures: next attempt after 24 hours
  local backoff = math.min(225 * 2 ^ (failures["num"] - 1), 86400)
  local elapsed = ngx.now() - failures["last"]
  if elapsed < backoff then
    ngx.log(ngx.NOTICE, domain .. ": " .. failures["num"] .. " cert failures, last attempt " .. math.floor(elapsed) .. "s ago, want to wait " .. backoff .. "s before trying again")
    return false
  end
end
```

Conceptually, the simplest mechanism to me would be one that (a) caches the results of allow_domain, (b) equally takes certificate failures into consideration, and (c) uses permanent storage so that the tracking is automatically shared across multiple servers. (My patch only does (b), and gives the implementer the tools to implement (a).)

@gohai (Contributor, Author) commented:

Another approach that I didn't attempt, but which would also make sense to me: track each certificate request (success and failure), have one tunable setting for the maximum number of certificate requests per three hours (defaulting to LE's 300), and have algorithms adaptively throttle both renewals and (failing) ad-hoc certificate requests as needed. With additional tunables as needed, and using permanent storage (i.e. aggregated across multiple servers).
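For what it's worth, a rough sketch of what such a global order budget might look like (hypothetical, not part of this PR; `auto_ssl_orders` is an assumed shm dict, which permanent storage would replace for cross-server aggregation):

```lua
-- nginx.conf would need: lua_shared_dict auto_ssl_orders 1m;
local WINDOW = 3 * 60 * 60  -- Let's Encrypt's rate-limit window: 3 hours
local MAX_ORDERS = 300      -- tunable, defaulting to LE's newOrders limit

local function try_reserve_order()
  local dict = ngx.shared.auto_ssl_orders
  -- one counter per window slot, expiring together with the window
  local key = "orders:" .. math.floor(ngx.now() / WINDOW)
  local count, err = dict:incr(key, 1, 0, WINDOW)
  if not count then
    return false, err
  end
  if count > MAX_ORDERS then
    dict:incr(key, -1) -- give the slot back
    return false, "order budget exhausted"
  end
  return true
end
```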
