RFC: Rate limiting (throughput quota) design

afeinberg edited this page Jul 29, 2011 · 4 revisions

Rate limiting design

Goals

When multiple tenants are hosted on the same cluster, we should be able to limit the impact that a single tenant (one that happens to experience unplanned high volume or has an errant client) has on the other tenants of the cluster.

We want to limit all request types: get, put, getAll and delete.

Design

A 10,000 ft view

We specify a soft limit and a hard limit. Upon violation of a soft limit, we register the violating store in a JMX getter, allowing monitoring tools to alert the owners of the store (this is the approach used by the disk quota subsystem).

Upon violation of a hard limit, we ban the store for a specified period of time.
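The two-tier policy above can be sketched roughly as follows. This is only an illustration, not the actual implementation: `QuotaPolicy` and all of its methods are hypothetical names, the sketch assumes that bans are time-based and that the hard limit is set above the soft limit, and the plain getter stands in for whatever would actually be exposed through JMX.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; none of these names come from Voldemort.
class QuotaPolicy {
    private final long softLimit;  // requests/second above which we only alert
    private final long hardLimit;  // requests/second above which we ban
    private final long banMillis;  // assumption: a ban lasts a fixed duration

    private final Set<String> softViolators = ConcurrentHashMap.newKeySet();
    private final Map<String, Long> bannedUntil = new ConcurrentHashMap<>();

    QuotaPolicy(long softLimit, long hardLimit, long banMillis) {
        this.softLimit = softLimit;
        this.hardLimit = hardLimit;
        this.banMillis = banMillis;
    }

    /** Apply the policy to one observed request rate for a store. */
    void observe(String store, long requestsPerSecond, long nowMillis) {
        if (requestsPerSecond > hardLimit) {
            bannedUntil.put(store, nowMillis + banMillis);
        }
        if (requestsPerSecond > softLimit) {
            softViolators.add(store);
        }
    }

    /** What a JMX getter could report to monitoring tools. */
    Set<String> getSoftLimitViolators() {
        return Collections.unmodifiableSet(new HashSet<>(softViolators));
    }

    /** True while the store's ban has not yet expired. */
    boolean isBanned(String store, long nowMillis) {
        Long until = bannedUntil.get(store);
        return until != null && nowMillis < until;
    }
}
```

Timestamps are passed in explicitly so that the behavior is easy to exercise in a test; a real server would more likely read the clock itself.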

Determining whether a store violates a limit

Count the number of requests that happen during each second. Consider a store violating the limit as soon as the number of requests per second exceeds the limit. This uses the standard RequestCounter class from Voldemort (the statistics are kept for a whole interval and are accumulated during the next interval).
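As a rough illustration of the per-second check, the sketch below counts requests in fixed one-second windows. It is a simplified stand-in and does not use or reproduce Voldemort's RequestCounter; `PerSecondCounter` and `recordAndCheck` are hypothetical names.

```java
// Hypothetical sketch: a fixed one-second window counter.
class PerSecondCounter {
    private final long limit;          // allowed requests per second
    private long currentSecond = -1;   // which second the count belongs to
    private long count = 0;            // requests seen in that second

    PerSecondCounter(long limit) {
        this.limit = limit;
    }

    /**
     * Record one request arriving at nowMillis. Returns true as soon as
     * the number of requests in the current second exceeds the limit.
     */
    synchronized boolean recordAndCheck(long nowMillis) {
        long second = nowMillis / 1000;
        if (second != currentSecond) {
            // A new second has started: reset the window.
            currentSecond = second;
            count = 0;
        }
        count++;
        return count > limit;
    }
}
```

With a limit of 2, the third request within the same second is the first one reported as a violation, and the count starts over when the next second begins.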

Banning requests to the store

There are multiple approaches to this. One approach (approach A) is to respond to any request from the client with an application exception. The network requests still reach the server, and the server still replies. This is “load shedding”: the load is shed at the higher levels, before the keys and values are deserialized and before the disk is touched. Note, however, that this may still be insufficient.
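A minimal sketch of the approach A idea follows, under stated assumptions: `SheddingHandler`, `StoreBannedException` and the `Function`-based inner handler are all hypothetical and are not Voldemort's request handling classes. The point it illustrates is only the ordering: the ban check happens before the request payload is handed on for deserialization or storage access.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch of load shedding in front of a request handler.
class SheddingHandler {
    /** Application-level exception returned to clients of a banned store. */
    static class StoreBannedException extends RuntimeException {
        StoreBannedException(String store) {
            super("Store is temporarily banned: " + store);
        }
    }

    private final Set<String> bannedStores = ConcurrentHashMap.newKeySet();
    // Stand-in for the real work: deserializing keys/values and hitting disk.
    private final Function<byte[], String> inner;

    SheddingHandler(Function<byte[], String> inner) {
        this.inner = inner;
    }

    void ban(String store) {
        bannedStores.add(store);
    }

    void unban(String store) {
        bannedStores.remove(store);
    }

    /** Reject banned stores before the raw payload is touched. */
    String handle(String store, byte[] rawRequest) {
        if (bannedStores.contains(store)) {
            throw new StoreBannedException(store);
        }
        return inner.apply(rawRequest);
    }
}
```

Note that the request has already crossed the network by the time `handle` runs, which is why this approach on its own may not relieve the server enough.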

Another approach (approach B) is to send a hard exception back to the client so as to cause the client’s failure detector to mark the server as down and stop sending requests. This is somewhat coarse-grained: all the clients from the same StoreClientFactory become banned. However, 1) a StoreClientFactory is usually associated with a single application, and 2) this prevents the client from sending any data to the server, so the “banned” traffic places no additional burden on the server.

So far, approach A has been taken and integration tested. What is still needed is a manual integration test of approach A, and perhaps a side-by-side comparison with approach B.
