Dear dCache developers,
After the switch to dCache 11.2.3, we at KIT started to see HTTP 429 errors more often on our WebDAV doors and on our Web Frontends.
According to our investigation, this seems to be related to the rate limiters introduced in 11.x (commits 3f6d72a and 8e7c5b6).
We now understand that we have to adjust the related rate-limit properties:
| Property | Meaning |
|---|---|
| `.limits.rate.overall` | Maximum overall request rate accepted by the service |
| `.limits.rate.per-client.fractions` | Maximum share of the overall request budget one client may use |
| `.limits.error.max-allowed` | Number of auth or permission failures allowed before temporary blocking |
| `.limits.error.block.window.time` | How long a client is blocked after too many auth or permission failures |
| `.limits.rate.per-client.block.window.time` | How long a client is blocked after exceeding its per-client request rate |
| `.limits.blocked-clients.idle-time` | How long blocked-client state is retained while idle |
| `.limits.max-blocked-clients` | Upper bound for tracked blocked-client entries |
We wonder which values to use, since the current dCache defaults are too small for our operations. In addition, we cannot increase them incrementally and observe the effect for a few days, because each change requires a restart of the door and/or the frontend, which strictly speaking means a downtime.
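To make our current reading of the two rate properties concrete, here is a minimal sketch. It assumes that the per-client limit is simply the overall rate multiplied by the per-client fraction, which is our interpretation of the table above and not something we have confirmed in the dCache code. The function name and the example numbers are hypothetical, and the time unit of the rate is left open.

```python
def per_client_rate(overall_rate: float, fraction: float) -> float:
    """Rate one client may use, ASSUMING it is a plain share of the overall budget.

    overall_rate: value we would set for .limits.rate.overall (requests per time unit)
    fraction:     value we would set for .limits.rate.per-client.fractions (0.0 to 1.0)
    """
    if overall_rate < 0:
        raise ValueError("overall_rate must not be negative")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0")
    return overall_rate * fraction


# Hypothetical numbers, not dCache defaults and not a recommendation.
print(per_client_rate(1000, 0.25))
```

If this reading is right, a single busy client, such as a transfer service that funnels many users through one host, would be limited well before the door reaches its overall limit. Please correct us if the actual semantics differ.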
So far, we have found that individual rejections due to too many requests are only logged completely at DEBUG level, so we have to change the logging of the domain at runtime with the corresponding admin interface command:
`log set stdout org.dcache.util.jetty.RateLimitedHandlerList DEBUG`
We therefore wonder whether it is intentional that some of these messages are only logged at DEBUG level by default.
All in all, we would like to discuss with you the current state of dCache in that context (logging, conditions for rejection, corresponding workflow, etc.), and ask for some guidance on appropriate limit numbers.
For documentation purposes, I am also attaching the AI-based investigation that helped us understand the situation:
http-429-live-monitoring.md
Thank you very much for your help in advance!
Best,
Artur on behalf of dCache admin team at KIT