
Add Cdo::Throttle module #38142

Merged 2 commits into staging-next from cdo-throttle on Dec 9, 2020

Conversation

@maddiedierker (Contributor) commented Dec 6, 2020

Adds a simple module that tracks whether or not a particular ID should be throttled.

Cdo::Throttle.throttle uses CDO.shared_cache (which is backed by ElastiCache and shared across all instances) to track timestamps for the given ID, then uses the provided limit, period, and throttle_time arguments to determine how many requests are allowed in the given timeframe and, if requests exceed that threshold, how long that ID should be throttled.

Requests are also tracked while the ID is throttled, so making requests while throttled means that ID could stay throttled even after the original throttle_time has passed.

Important note: Because this module uses IDs to track usage, be sure the ID you provide is unique so it doesn't overwrite tracking data for unrelated IDs.

Example: Cdo::Throttle.throttle("profanity/a1b2c3", 10, 60, 20) -- the caller tracked by ID profanity/a1b2c3 can make 10 requests in 60 seconds before it is throttled. Once throttled, it stays throttled for 20 seconds.
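
For illustration, the core read-modify-write flow looks roughly like the sketch below, reconstructed from the diff fragments quoted in the review comments further down. The pieces not visible in those fragments (the throttled_until check, the limit comparison, the inlined empty value, and the two constants) are assumptions, not the exact implementation:

module Cdo
  module Throttle
    CACHE_PREFIX = "throttle/".freeze  # placeholder value
    DEFAULT_THROTTLE_TIME = 60         # placeholder default, in seconds

    # @returns [Boolean] Whether or not the request should be throttled.
    def self.throttle(id, limit, period, throttle_for = DEFAULT_THROTTLE_TIME)
      now = Time.now.to_i
      full_key = CACHE_PREFIX + id.to_s
      value = CDO.shared_cache.read(full_key) || {request_timestamps: [], throttled_until: nil}

      if value[:throttled_until] && now < value[:throttled_until]
        # Still inside an earlier throttle window.
        should_throttle = true
      else
        value[:throttled_until] = nil
        # Evict timestamps that have fallen outside the sliding window.
        earliest = now - period
        value[:request_timestamps].select! {|timestamp| timestamp >= earliest}
        should_throttle = value[:request_timestamps].size >= limit
      end

      # Requests are tracked even while throttled, which can extend the throttle.
      value[:request_timestamps] << now
      value[:throttled_until] = now + throttle_for if should_throttle

      CDO.shared_cache.write(full_key, value)
      should_throttle
    end
  end
end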

Also see example usage in #38143.

Links

Testing story

Adds unit tests.

Reviewer Checklist:

  • Tests provide adequate coverage
  • Privacy and Security impacts have been assessed
  • Code is well-commented
  • New features are translatable or updates will not break translations
  • Relevant documentation has been added or updated
  • User impact is well-understood and desirable
  • Pull Request is labeled appropriately
  • Follow-up work items (including potential tech debt) are tracked and linked

@ajpal (Contributor) left a comment

Not familiar with this area, but this looks good and well-described in the PR description

@maddiedierker merged commit 5cea064 into staging-next on Dec 9, 2020
@maddiedierker deleted the cdo-throttle branch on December 9, 2020 at 18:18

@jamescodeorg (Contributor) left a comment

I love the simplicity and clarity of this design, but I think that does impose some limitations on how it can be used. Let me know if you'd like to chat about this more.

value[:throttled_until] = now + throttle_for if should_throttle
end

CDO.shared_cache.write(full_key, value)

Contributor:

There's a race condition here if several different processes are all reading from and writing to the value at the same time, which will undercount the number of requests. (That might be acceptable given the use case, but it would be a good limitation to note.)

Contributor Author:

ah, yeah, you're right. i think it's acceptable for the current consumers (the client is caching responses and the server is doing some caching as well, so this is only called for unique/uncached requests), but i'd like to fix this to future-proof it.

i think the solution for this would be to implement a mutex or a queue here, but are there other/better solutions? i worry about a mutex because i think it would have to lock for everybody (rather than being able to lock per ID), but maybe that's okay -- i'm not super familiar with implementing/using locks in this way

Contributor:

If we're using Redis, then this could be a good way to implement it: https://redislabs.com/redis-best-practices/basic-rate-limiting/. (The behavior would be a bit different from what you've defined here, though.)

A local lock probably wouldn't help because it's still local to an instance and a distributed lock is difficult to implement. Two viable patterns are atomic operations (such as increment) at the cache server or optimistic concurrency (where you make sure the value that you're writing hasn't changed since you read it).
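
For example, the atomic-increment approach from that link could look roughly like this (a sketch assuming direct access to Redis via the redis gem rather than Rails' cache abstraction; it implements a fixed per-period window rather than the sliding window above):

require 'redis'

# Fixed-window rate limiting via an atomic INCR, per the Redis best-practices
# pattern linked above. Because the increment happens atomically on the server,
# concurrent processes can't undercount the way a read-modify-write cycle can.
def throttled?(redis, id, limit, period = 60)
  window = Time.now.to_i / period
  key = "throttle/#{id}/#{window}"
  count = redis.incr(key)
  redis.expire(key, period * 2) if count == 1  # let stale windows expire
  count > limit
end

# e.g. throttled?(Redis.new, "profanity/a1b2c3", 10)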

else
value[:throttled_until] = nil
earliest = now - period
value[:request_timestamps].select! {|timestamp| timestamp >= earliest}

Contributor:

I'm slightly worried about this. What kind of limit values do you think we will see in practice? Throttling functions typically need to be really fast, and this design grows linearly in time and space with the number of requests tracked.

Contributor Author:

yeah, this is the part that i feel weird about as well. the limit values are currently 100 requests / 60 seconds for identifiable users and 1000 requests / 60 seconds for unidentifiable users:

# Allowed number of unique requests per minute before that client is throttled.
# These values are fallbacks for DCDO-configured values used below.
REQUEST_LIMIT_PER_MIN_DEFAULT = 100
REQUEST_LIMIT_PER_MIN_IP = 1000

the limits are configurable via DCDO values for each consumer. when a consumer is throttled, i am logging to honeybadger so we're notified and know if we need to adjust those limits
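
For illustration, reading a DCDO-configured limit with the constant as a fallback might look like this (a sketch; the DCDO key name and the user_id interpolation are hypothetical):

# Hypothetical key name; the constant above is the fallback when no
# dynamic value has been configured in DCDO.
limit = DCDO.get('throttle_request_limit_per_min', REQUEST_LIMIT_PER_MIN_DEFAULT)
Cdo::Throttle.throttle("profanity/#{user_id}", limit, 60)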

Contributor Author:

do you have any thoughts for how to improve this? i'm trying to keep this array of timestamps small by evicting any values outside of the period (currently 60 seconds for all consumers) to mitigate this, but there may be a better way to do it (or to track the requests differently to avoid this problem entirely)

Contributor:

One way to solve this is to track counts in buckets instead of the exact timestamps. Each bucket would represent the number of requests that arrived during a particular window (say, 5 seconds) and you could sum up enough buckets to get the count for the interval that you want. Old buckets would automatically age out of the interval as time progresses. (Sorry if that's not super clear, this might be easier to explain with a sketch if it doesn't make sense.)
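
For example, a rough sketch of the bucket idea (the 5-second width and the hash representation are illustrative, not from the PR):

BUCKET_WIDTH = 5 # seconds per bucket

# Record one request and return the request count over the last `period`
# seconds. Storage is bounded by period / BUCKET_WIDTH buckets, no matter
# how many requests arrive.
def record_and_count(buckets, now, period)
  bucket_start = now - (now % BUCKET_WIDTH)
  buckets[bucket_start] = buckets.fetch(bucket_start, 0) + 1
  buckets.delete_if {|start, _count| start < now - period}  # age out old buckets
  buckets.values.sum
end

# buckets = {}   # e.g. persisted under the throttle key in CDO.shared_cache
# throttled = record_and_count(buckets, Time.now.to_i, 60) > limit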

# @returns [Boolean] Whether or not the request should be throttled.
def self.throttle(id, limit, period, throttle_for = throttle_time)
full_key = CACHE_PREFIX + id.to_s
value = CDO.shared_cache.read(full_key) || empty_value

Contributor:

Does this result in a network call? How is failure handled (e.g. if the shared cache is down)?

Contributor Author:

yes, this will result in a network call if CDO.shared_cache is using ElastiCache (it's the default, but falls back to a FileStore cache if initializing ElastiCache fails here).

if i'm reading the implementation correctly, nil will be returned on failure. this is rails' implementation of read, which calls the subclass' implementation of read_entry, which would be this implementation when using ElastiCache.

Contributor:

I think we may want to make this more explicit and just bypass the throttler if ElastiCache is temporarily unavailable?
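
For example, one shape for an explicit fail-open path (a sketch; the stand-in method name and the rescued error class are assumptions, and it presumes cache errors are allowed to propagate rather than being swallowed as nil):

module Cdo
  module Throttle
    # Sketch: wrap the cache-backed check so a cache outage never blocks requests.
    def self.throttle_with_fallback(id, limit, period, throttle_for)
      throttled_by_cache?(id, limit, period, throttle_for)  # stands in for the existing logic
    rescue StandardError => exception
      Honeybadger.notify(exception)  # reuse the Honeybadger reporting mentioned above
      false  # fail open: treat the request as not throttled
    end
  end
end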
