Add Cdo::Throttle module #38142
Conversation
Not familiar with this area, but this looks good and well-described in the PR description
I love the simplicity and clarity of this design, but I think that does impose some limitations on how it can be used. Let me know if you'd like to chat about this more.
value[:throttled_until] = now + throttle_for if should_throttle
end

CDO.shared_cache.write(full_key, value)
There's a race condition here if several different processes are all reading from and writing to the value at the same time, which will undercount the number of requests. (That might be acceptable given the use case, but it would be a good limitation to note.)
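To make the failure mode concrete, here's a deterministic sketch (hypothetical names, not the PR's code) of the lost-update pattern: two workers read the same cached value, each appends a timestamp, and the second write clobbers the first:

```ruby
# Simulated shared cache: a plain hash standing in for CDO.shared_cache.
store = { "throttle/user-1" => { request_timestamps: [] } }

# Two workers each take a deep-copy snapshot before either writes back.
a = Marshal.load(Marshal.dump(store["throttle/user-1"]))
b = Marshal.load(Marshal.dump(store["throttle/user-1"]))

# Each appends its own request timestamp and writes the whole value back.
a[:request_timestamps] << 100
store["throttle/user-1"] = a
b[:request_timestamps] << 101
store["throttle/user-1"] = b

# The second write overwrote the first, so one request was lost:
store["throttle/user-1"][:request_timestamps]  # => [101], not [100, 101]
```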
ah, yeah, you're right. i think it's acceptable for the current consumers (the client is caching responses and the server is doing some caching as well, so this is only called for unique/uncached requests), but i'd like to fix this to future-proof it.
i think the solution for this would be to implement a mutex or a queue here, but are there other/better solutions? i worry about a mutex because i think it would have to lock for everybody (rather than being able to lock per ID), but maybe that's okay -- i'm not super familiar with implementing/using locks in this way
If we're using Redis, then this could be a good way to implement it: https://redislabs.com/redis-best-practices/basic-rate-limiting/. (The behavior would be a bit different as you've defined it here though.)
A local lock probably wouldn't help because it's still local to an instance and a distributed lock is difficult to implement. Two viable patterns are atomic operations (such as increment) at the cache server or optimistic concurrency (where you make sure the value that you're writing hasn't changed since you read it).
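For illustration, the atomic-increment pattern might look roughly like this in Ruby. This is only a sketch: `FakeAtomicCache` is a tiny in-memory stand-in for a cache server that supports atomic increment (e.g. Redis `INCR` or memcached `incr`), and all names are hypothetical, not from the PR:

```ruby
# A tiny in-memory stand-in for a cache with atomic increment. In production
# this would be the cache server doing the increment atomically.
class FakeAtomicCache
  def initialize
    @data = Hash.new(0)
    @lock = Mutex.new
  end

  # Atomically increment and return the new count, like Redis INCR.
  def incr(key)
    @lock.synchronize { @data[key] += 1 }
  end
end

# Fixed-window rate limiting: one counter per (id, window) pair. Because the
# only write is a single atomic increment, concurrent requests can't undercount.
def over_limit?(cache, id, limit, period, now)
  window = now / period # integer window index
  cache.incr("throttle/#{id}/#{window}") > limit
end

cache = FakeAtomicCache.new
results = (1..12).map { over_limit?(cache, "user-1", 10, 60, 100) }
# the first 10 calls return false, the 11th and 12th return true
```

Note this gives fixed-window behavior (the count resets at window boundaries), which differs slightly from the sliding-window semantics in the PR.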
else
  value[:throttled_until] = nil
  earliest = now - period
  value[:request_timestamps].select! {|timestamp| timestamp >= earliest}
I'm slightly worried about this. What kind of limit values do you think we will see in practice? Throttling functions typically need to be really fast and this design grows linearly in time and space.
yeah, this is the part that i feel weird about as well. the limit values are currently 100 requests / 60 seconds for identifiable users and 1000 requests / 60 seconds for unidentifiable users:
# Allowed number of unique requests per minute before that client is throttled.
# These values are fallbacks for DCDO-configured values used below.
REQUEST_LIMIT_PER_MIN_DEFAULT = 100
REQUEST_LIMIT_PER_MIN_IP = 1000
the limits are configurable via DCDO values for each consumer. when a consumer is throttled, i am logging to honeybadger so we're notified and know if we need to adjust those limits
do you have any thoughts for how to improve this? i'm trying to keep this array of timestamps small by evicting any values outside of the period (which is currently 60 seconds for all consumers) to mitigate, but there may be a better way to do this (or to track the requests differently to avoid this problem entirely)
One way to solve this is to track counts in buckets instead of the exact timestamps. Each bucket would represent the number of requests that arrived during a particular window (say, 5 seconds) and you could sum up enough buckets to get the count for the interval that you want. Old buckets would automatically age out of the interval as time progresses. (Sorry if that's not super clear, this might be easier to explain with a sketch if it doesn't make sense.)
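As a rough Ruby sketch of that bucket idea (all names hypothetical, assuming 5-second buckets): storage per ID is bounded by `period / BUCKET_SIZE` counters, no matter how many requests arrive.

```ruby
# Sliding-window counting with fixed-size buckets instead of raw timestamps.
BUCKET_SIZE = 5 # seconds per bucket (illustrative choice)

# Record a request: bump the counter for the bucket containing `now`.
def record_request(buckets, now)
  bucket = (now / BUCKET_SIZE) * BUCKET_SIZE
  buckets[bucket] = buckets.fetch(bucket, 0) + 1
end

# Count requests in the last `period` seconds, evicting aged-out buckets.
def recent_count(buckets, period, now)
  earliest = now - period
  buckets.delete_if { |bucket, _count| bucket < earliest }
  buckets.values.sum
end

buckets = {}
# 3 requests around t=0, 2 requests at t=58
[0, 1, 2, 58, 58].each { |t| record_request(buckets, t) }
recent_count(buckets, 60, 65)  # => 2 -- the t=0..2 bucket has aged out
```

The trade-off is granularity: a bucket only ages out as a whole, so the count is approximate to within one `BUCKET_SIZE` of the true sliding window.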
# @returns [Boolean] Whether or not the request should be throttled.
def self.throttle(id, limit, period, throttle_for = throttle_time)
  full_key = CACHE_PREFIX + id.to_s
  value = CDO.shared_cache.read(full_key) || empty_value
Does this result in a network call? How is failure handled (e.g. if the shared cache is down)?
yes, this will result in a network call if `CDO.shared_cache` is using ElastiCache (it's the default, but falls back to a FileStore cache if initializing ElastiCache fails here). if i'm reading the implementation correctly, `nil` will be returned on failure. this is rails' implementation of `read`, which calls the subclass' implementation of `read_entry`, which would be this implementation when using ElastiCache.
I think we may want to make this more explicit and just bypass the throttler if elasticache is temporarily unavailable?
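A hypothetical sketch of that "fail open" behavior (illustrative names only, not the PR's actual code): if the cache read raises, treat the request as not throttled rather than letting the error propagate.

```ruby
# Stand-in for a cache whose backend is temporarily unreachable.
class DownCache
  def read(_key)
    raise IOError, "cache unavailable"
  end
end

def throttled_with_failopen?(cache, id)
  value = cache.read("throttle/#{id}")
  # Normal decision: throttled only if an entry says so.
  !!(value && value[:throttled_until] && value[:throttled_until] > Time.now)
rescue StandardError
  # Fail open: a cache outage should never take user requests down with it.
  false
end

throttled_with_failopen?(DownCache.new, "user-1")  # => false
```

Whether to fail open or closed is a policy choice; failing open prioritizes availability over strict rate enforcement during outages.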
Adds a simple module that tracks whether or not a particular ID should be throttled.

`Cdo::Throttle.throttle` uses `CDO.shared_cache` (which uses ElastiCache and is shared across all instances) to track timestamps for the given ID, then uses the provided `limit`, `period`, and `throttle_time` arguments to determine how many requests are allowed in the given timeframe and, if requests are above that threshold, how long that ID should be throttled.

Requests are also tracked while the ID is throttled, so making requests while throttled means that ID could stay throttled even after the original `throttle_time` has passed.

Important note: Because this module uses IDs to track usage, be sure the ID you provide is unique so it doesn't override unrelated IDs.

Example: `Cdo::Throttle.throttle("profanity/a1b2c3", 10, 60, 20)` -- the client tracked by ID `profanity/a1b2c3` can make 10 requests in 60 seconds before it is throttled. Once throttled, it will stay throttled for 20 seconds.

Also see example usage in #38143.
Links
Testing story
Adds unit tests.
Reviewer Checklist: