## Overview
A rate limiter controls the rate of traffic sent from a client to a server. The rate is usually expressed as $n$ requests per second, and any request beyond the configured limit is blocked. Using a rate limiter has several advantages:
- It prevents the system from being overwhelmed by excess traffic. Denial-of-service attacks, for example, send a large number of requests with the aim of starving the system of resources.
- It prevents excess charges when we depend on a third-party API that bills per request.

A rate limiter can be applied on both the client and the server side. However, client-side rate limiting cannot be relied upon, since the client is easy to modify.

### Granularity
In production, rate limits are usually quite granular:
- The limit is defined as $n$ requests per unit time **per user**.
- It can be more specific still, with different rates for different APIs. For example, less critical APIs such as analytics can have a lower limit, whereas more critical APIs such as transactions can have a higher limit.
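
A minimal sketch of such granular limits is given below. It assumes a simple one-second counting window, and the class name `GranularLimiter`, the key layout, and the API names used with it are all hypothetical, chosen only for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: per-user, per-API limits counted in one-second windows.
class GranularLimiter {
    // Requests allowed per second for each API (illustrative values only)
    private final Map<String, Integer> limitPerApi;
    // Limit used for APIs that have no explicit entry
    private final int defaultLimit;
    // Key is "user|api|windowSecond"; stale keys are never evicted in this sketch
    private final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();

    GranularLimiter(Map<String, Integer> limitPerApi, int defaultLimit) {
        this.limitPerApi = Map.copyOf(limitPerApi);
        this.defaultLimit = defaultLimit;
    }

    // Returns true when this user may still call this API in the current second.
    boolean allow(String userId, String api, long nowMillis) {
        int limit = limitPerApi.getOrDefault(api, defaultLimit);
        String key = userId + "|" + api + "|" + (nowMillis / 1000);
        return counters.computeIfAbsent(key, k -> new AtomicInteger()).incrementAndGet() <= limit;
    }
}
```

Because the user and the API are both part of the counter key, one user exhausting the analytics limit does not affect other users, nor that user's calls to other APIs.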

<img src="images/rate_limiter.png" />

A desirable rate limiter has the following properties:
- It is efficient: it has minimal effect on response time and uses as little memory as possible.
- It can be shared across multiple servers or processes, so it works well in a distributed system.
- Throttled users are clearly notified.
- A failure in the rate limiter does not affect the rest of the system.
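
One possible way to meet the last property is to "fail open", that is, to let requests through when the limiter itself breaks. The sketch below assumes the limiter is exposed as a `Predicate<String>` keyed by user id; the wrapper name `FailOpenLimiter` is hypothetical.

```java
import java.util.function.Predicate;

// Hypothetical wrapper: if the limiter itself fails, let the request through.
class FailOpenLimiter {
    // Returns true when the user may proceed; may throw if its backing store is down
    private final Predicate<String> delegate;

    FailOpenLimiter(Predicate<String> delegate) {
        this.delegate = delegate;
    }

    boolean allow(String userId) {
        try {
            return delegate.test(userId);
        } catch (RuntimeException e) {
            // The limiter is unreachable or misbehaving: do not block traffic
            return true;
        }
    }
}
```

Failing open trades protection for availability. A system that limits for security reasons may prefer the opposite choice and reject requests while the limiter is down.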

## Strategies
There are several ways to implement a rate limiter, as discussed below:

### Token Bucket Algorithm
<img src="images/token_bucket.png" />

In [None]:
// This implementation is not thread safe.
// Request is assumed to be an application-defined type.
import java.util.function.Consumer;

class TokenBucketLimiter {
    final int RATE;            // tokens added per second
    final int MAX_BUCKET_SIZE; // maximum number of tokens the bucket can hold

    long lastTime = System.currentTimeMillis();
    int currentBucketSize = 0; // the bucket starts empty

    TokenBucketLimiter(int rate, int maxBucketSize) {
        this.RATE = rate;
        this.MAX_BUCKET_SIZE = maxBucketSize;
    }

    void handle(Request request, Consumer<Request> successHandler, Consumer<Request> failureHandler) {
        long currentTime = System.currentTimeMillis();
        long elapsedTime = currentTime - lastTime; // in milliseconds

        // Update the bucket occupancy with the whole tokens earned since the last refill
        long newTokens = elapsedTime * RATE / 1000;
        if (newTokens > 0) {
            // Overflow? Cap the bucket at its maximum size
            currentBucketSize = (int) Math.min(MAX_BUCKET_SIZE, currentBucketSize + newTokens);
            // Advance only by the time those tokens account for, so that
            // fractions of a second are not lost between calls
            lastTime += newTokens * 1000 / RATE;
        }

        if (currentBucketSize < 1) {
            failureHandler.accept(request);
        } else {
            currentBucketSize--; // consume 1 token
            successHandler.accept(request);
        }
    }
}

To translate this process to a distributed setting, we need a global store such as Redis, with the buckets stored there. The logic above must then run atomically, something that Redis transactions or server-side Lua scripts can provide. Both `currentBucketSize` and `lastTime` are stored as a pair of keys per user.
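
The per-user, atomic update can be sketched in a single process as below. This is only an in-memory stand-in for the shared store, not Redis client code: `ConcurrentHashMap.compute` plays the role of the atomic script, and each map entry holds the two per-user values. The class name `PerUserTokenBuckets`, the explicit `nowMillis` parameter, and the choice to start each new user with a full bucket are assumptions made for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for a shared store: one entry per user holding both values.
class PerUserTokenBuckets {
    // Immutable snapshot of one user's bucket
    private static final class Bucket {
        final long tokens;
        final long lastRefillMillis;

        Bucket(long tokens, long lastRefillMillis) {
            this.tokens = tokens;
            this.lastRefillMillis = lastRefillMillis;
        }
    }

    private final long ratePerSecond;
    private final long capacity;
    private final ConcurrentHashMap<String, Bucket> buckets = new ConcurrentHashMap<>();

    PerUserTokenBuckets(long ratePerSecond, long capacity) {
        this.ratePerSecond = ratePerSecond;
        this.capacity = capacity;
    }

    boolean tryAcquire(String userId, long nowMillis) {
        boolean[] allowed = {false};
        // compute() runs the whole read-modify-write atomically for this key
        buckets.compute(userId, (id, old) -> {
            // Assumption: a user seen for the first time starts with a full bucket
            Bucket b = (old == null) ? new Bucket(capacity, nowMillis) : old;
            long elapsed = Math.max(0, nowMillis - b.lastRefillMillis);
            long newTokens = elapsed * ratePerSecond / 1000;
            long tokens = Math.min(capacity, b.tokens + newTokens);
            long lastRefill = b.lastRefillMillis;
            if (newTokens > 0) {
                // Advance only by the time that produced whole tokens
                lastRefill += newTokens * 1000 / ratePerSecond;
            }
            if (tokens >= 1) {
                allowed[0] = true;
                tokens--; // consume 1 token
            }
            return new Bucket(tokens, lastRefill);
        });
        return allowed[0];
    }
}
```

With a real shared store, the body of the lambda is the part that would have to execute atomically on the store side, so that two servers handling the same user cannot both read the same token count.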

To support burst mode (where a client may briefly exceed the steady rate), we can give the bucket a capacity larger than the number of tokens added per second. A client that has let the bucket fill up can then send requests faster than the specified rate for a brief moment, until the saved tokens are used up.
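
As a rough worked bound, assume the bucket never holds more than its capacity. Let $b$ denote the bucket capacity and $r$ the refill rate in tokens per second (both symbols are introduced here only for illustration). The number of requests $N(t)$ admitted in any interval of $t$ seconds then satisfies:

$$N(t) \le b + r \cdot t$$

For instance, with $r = 2$ and $b = 10$, a client can send at most $10 + 2 \cdot 1 = 12$ requests in any one-second interval, while its long-run average stays at 2 requests per second.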

References: [KrakenD Documentation](https://www.krakend.io/docs/throttling/token-bucket/), [Wikipedia](https://en.wikipedia.org/wiki/Token_bucket)

### Fixed Window Counter
In this strategy, we divide the timeline into fixed-width windows, for example 1-second windows. Each incoming request increments the counter of its window by one. Once the counter has reached the threshold, any further request in that window is rejected until a new window starts.

<img src="images/fixed_window.png" />

A fixed window is not always accurate, however: it can allow as many as twice the permitted number of requests within a rolling window:

<img src="images/fixed_window_problem.png" />

As the figure above shows, twice the maximum number of allowed requests pass between `1:00:00:500` and `1:00:01:500` without triggering the rate limiter.
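
The boundary effect can be reproduced with a small sketch. It assumes one-second windows and passes timestamps in explicitly so the scenario is deterministic; the class `FixedWindowBoundaryDemo` and its helper method are hypothetical and exist only for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal fixed-window counter with an explicit clock, used only to show the boundary effect.
class FixedWindowBoundaryDemo {
    private final int limit;
    private final Map<Long, Integer> counts = new HashMap<>();

    FixedWindowBoundaryDemo(int limit) {
        this.limit = limit;
    }

    boolean allow(long nowMillis) {
        long windowStart = nowMillis / 1000 * 1000;
        // merge() returns the updated count for this window
        int seen = counts.merge(windowStart, 1, Integer::sum);
        return seen <= limit;
    }

    // Sends `limit` requests at 0.5 s and `limit` more at 1.0 s,
    // then returns how many of them were accepted.
    static int acceptedAcrossBoundary(int limit) {
        FixedWindowBoundaryDemo limiter = new FixedWindowBoundaryDemo(limit);
        int accepted = 0;
        for (int i = 0; i < limit; i++) {
            if (limiter.allow(500)) accepted++;  // falls in window [0, 1000)
        }
        for (int i = 0; i < limit; i++) {
            if (limiter.allow(1000)) accepted++; // falls in window [1000, 2000)
        }
        return accepted;
    }
}
```

All of these requests fall inside the rolling one-second interval that starts at 0.5 s, yet each fixed window sees only `limit` of them, so every request is accepted.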

In [None]:
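// Request is assumed to be an application-defined type.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

class FixedWindowLimiter {
    final int RATE; // maximum number of requests per one-second window

    // In this simplistic implementation, stale entries are not being removed
    Map<Long, AtomicInteger> requestWindows = new ConcurrentHashMap<>();

    FixedWindowLimiter(int rate) {
        this.RATE = rate;
    }

    void handle(Request request, Consumer<Request> successHandler, Consumer<Request> failureHandler) {
        // Round down to the start of the current one-second window
        long windowStart = System.currentTimeMillis() / 1000 * 1000;

        // Atomically create the counter for this window if needed, then count the request
        AtomicInteger counter = requestWindows.computeIfAbsent(windowStart, k -> new AtomicInteger(0));
        boolean allowed = counter.incrementAndGet() <= RATE;

        if (!allowed) {
            failureHandler.accept(request);
        } else {
            successHandler.accept(request);
        }
    }
}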

### Sliding Window Log
In this strategy, we log the timestamp of each request. For every new request, we look back by the window duration and count the number of requests made in that window:

<img src="images/sliding_window_log.png" width="900" height="auto"/>

In [None]:
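// Not thread safe. Request is assumed to be an application-defined type.
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

class SlidingWindowLogLimiter {
    final int RATE; // maximum number of requests per one-second window

    Queue<Long> requestLog = new ArrayDeque<>();

    SlidingWindowLogLimiter(int rate) {
        this.RATE = rate;
    }

    void handle(Request request, Consumer<Request> successHandler, Consumer<Request> failureHandler) {
        long currentTime = System.currentTimeMillis();
        long windowStart = currentTime - 1000; // 1 sec window

        // Remove all logs outside of the window
        while (!requestLog.isEmpty() && requestLog.peek() <= windowStart) {
            requestLog.poll();
        }

        // Allow only if fewer than RATE requests are already logged in the window
        boolean allowed = requestLog.size() < RATE;

        if (!allowed) {
            failureHandler.accept(request);
        } else {
            requestLog.add(currentTime);
            successHandler.accept(request);
        }
    }
}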

The sliding window log is very accurate; however, it takes up extra memory in the form of the `requestLog`, as shown in the code example above.

### Sliding Window Counter
This strategy combines the sliding window log and fixed window counter approaches. It assumes that the requests made in each window are distributed evenly over the duration of the window. This assumption makes the strategy less accurate than the sliding window log, but the error evens out over time. The number of requests in a sliding window is estimated as: `req in current window + req in prev window * overlap percentage`, where the overlap percentage is the fraction of the previous window that the sliding window still covers.
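
The estimate can be sketched as follows. This is not thread safe and is only one possible reading of the formula: the class name `SlidingWindowCounterLimiter`, the explicit `nowMillis` parameter, and the rule of rejecting a request when the estimate plus the new request would exceed the limit are all assumptions made for illustration.

```java
// Not thread safe. Hypothetical sketch of the weighted two-window estimate.
class SlidingWindowCounterLimiter {
    private final int limit;         // maximum number of requests per window
    private final long windowMillis; // window length in milliseconds
    private long currentWindowStart = 0;
    private int currentCount = 0;
    private int previousCount = 0;

    SlidingWindowCounterLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    boolean allow(long nowMillis) {
        long windowStart = nowMillis / windowMillis * windowMillis;
        if (windowStart != currentWindowStart) {
            // Roll over: the old count only matters if that window is directly adjacent
            boolean adjacent = windowStart - currentWindowStart == windowMillis;
            previousCount = adjacent ? currentCount : 0;
            currentCount = 0;
            currentWindowStart = windowStart;
        }
        // Fraction of the previous window still covered by the sliding window
        double overlap = 1.0 - (double) (nowMillis - windowStart) / windowMillis;
        double estimated = currentCount + previousCount * overlap;
        if (estimated + 1 > limit) {
            return false; // admitting this request would exceed the limit
        }
        currentCount++;
        return true;
    }
}
```

Only two counters and one timestamp are kept per limiter, which is where the memory saving over the sliding window log comes from.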

<img src="images/sliding_window_counter.png" />

## Architecture
<img src="images/rate_limiter_arch.png" />

1. A request arrives from a client. The rate limiter middleware must now decide whether to allow or reject it.
2. Rate limits are usually defined as configuration. In this setup, the rate limiter middleware fetches and stores the rules, and any update to a rule is forwarded to the rate limiter.
3. The rate limiter keeps its data in a cache, preferably one with transaction support such as Redis. This cache is where the buckets would be stored in the case of the token bucket algorithm.
4. Requests below the limit reach the application servers. In this case we have stateless APIs.
5. Requests above the limit are rejected with a 429 status response. Additional information, such as the wait time or the currently configured limit, can also be included in the response headers.
6. We can also choose to save the rejected requests in a message log and retry them once the load on the system is known to be low.
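
The decision made in steps 1, 4, 5 and 6 can be sketched as below, without a real HTTP server. The class `RateLimitMiddleware`, its `Response` type, and the in-memory queue standing in for the message log are hypothetical. `Retry-After` is a standard HTTP header, while `X-RateLimit-Limit` is a common but non-standard convention, used here as an assumption.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.Predicate;

// Hypothetical middleware sketch: decides between forwarding and rejecting with 429.
class RateLimitMiddleware {
    static final class Response {
        final int status;
        final Map<String, String> headers;

        Response(int status, Map<String, String> headers) {
            this.status = status;
            this.headers = headers;
        }
    }

    private final Predicate<String> limiter; // true when the client is under its limit
    private final int limitPerSecond;        // configured limit, reported back to the client
    // Stand-in for a message log: ids of rejected requests kept for a later retry
    private final Queue<String> deferred = new ArrayDeque<>();

    RateLimitMiddleware(Predicate<String> limiter, int limitPerSecond) {
        this.limiter = limiter;
        this.limitPerSecond = limitPerSecond;
    }

    Response handle(String clientId, String requestId) {
        if (limiter.test(clientId)) {
            // Stands in for forwarding the request to an application server
            return new Response(200, Map.of());
        }
        deferred.add(requestId);
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Retry-After", "1"); // wait time in seconds
        headers.put("X-RateLimit-Limit", String.valueOf(limitPerSecond));
        return new Response(429, headers);
    }

    int deferredCount() {
        return deferred.size();
    }
}
```

Any of the strategies above can be plugged in as the `limiter` predicate, since the middleware only needs a yes or no answer per client.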