(draft) Kafka-12319: Improve rate calculation for first sample window #12034

divijvaidya · 2022-04-12T13:25:00Z

Background

Apache Kafka's rate-limiting algorithm uses a variation of fixed window algorithm where the duration of time window is controlled by two configurations: quota.window.num and quota.window.size.seconds. When a client connection is received, rate of connections at that timestamp is calculated by observing the number of requests in the duration of time window defined using the configurable parameters. If the rate is above the acceptable threshold, the connection is made to wait for a duration calculated to ensure that when a connection is accepted, the overall rate should be less than (or equal to) acceptable threshold.

This algorithm has certain edge cases to handle rate calculation, such as the scenarios where 1\ extended period of time when no connections are received or when 2\ the rate calculation is occurring within the first ever window with no prior samples. Currently, both these cases are handled in the same manner. When no prior windows are available, an assumption is made that a prior window exists with 0 data points. This is an effective way to solve scenario 1 but it leads to problems for scenario 2. Let us demonstrate using examples.

Consider configuration as config.timeWindowMs() = 1s, config.samples() = 2, max connection create rate = 30/s. A workload with a uniform rate of 40/s (1 connection per 25 ms) is started.

Record events (E) at timestamps:
E1 = CurrentTimeStamp (T1)
E2 = T1 + 25ms
E3 = T1 + 50ms
...
...
E30 = T1 + 725ms

The rate calculated as per the current algorithm (assuming a pre-existing window with 0 events) would be:

Rate T1 + 20ms = 1/ (prior window size + current window elapsed time) = 1 / (1 + 0.02) = 0.98 events per second
Rate calculated at T1 + 90ms = 3/1.09s = 2.75 events per second
Rate calculated at T1 + 730ms = 30/1.73s = 17.34 events per second

Note that even after 30 events in that window, the rate is still set to ~17 events per second, which would prevent the throttling to kick-in. This would lead to a situation where the first window allows more requests than acceptable limit to pass through and thus, to compensate, the second window serves less traffic, the third window again increases the traffic and so on....
This leads to an uneven distribution traffic across windows.

Note that this un-even distribution is the primary cause of flakiness of ConnectionQuotasTest.testListenerConnectionRateLimitWhenActualRateAboveLimit()

Solution

As a solution, this PR distinguishes between the scenario 1 and 2 above. During the Rate calculation, windowSize method has been modified to treat the case of first window separately from the case when no traffic has been received in prior windows. For the first window, the windowSize is calculated as the overall length of first window. Using the new change, the above calculation changes to:

The rate calculated as per the new algorithm (assuming a pre-existing window with 0 events) would be:

Rate T1 + 20ms = 1/ (current window size) = 1 / 1 = 1 events per second
Rate calculated at T1 + 90ms = 3/1s = 3 events per second
Rate calculated at T1 + 730ms = 30/1s = 30 events per second

Note how the throttling kicks-in as expected after 30 events and thus leading to a smoother distribution of traffic across windows.

Code Changes

Modifications in Rate.java#windowSize() method to distinguish between scenario 1 and 2.
Add a method isCurrentSampleInFirstWindow in
Fix the ConnectionQuotasTest.testListenerConnectionRateLimitWhenActualRateAboveLimit() test to run for 20s as expected (earlier, there was a difference between what )
Add new tests for scenario 1 and scenario 2 in MetricsTest.java
Modified existing tests for throttling / quota management to remove the assumptions of calculation based on prior empty windows

Testing

All unit test pass using ./gradlew unitTest

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

…enActualRateAboveLimit

divijvaidya · 2022-04-13T15:02:46Z

Discarding this since I found a bug in current approach. A new PR has been started at #12045

This patch bumps the bundled ZK jar from 3.5.9 to 3.8.3. This is needed to pick up a fix for CVE-2023-44981 GHSA-7286-pgfv-vxvh.

divijvaidya added 11 commits April 6, 2022 17:31

Fix flaky test ConnectionQuotasTest.testListenerConnectionRateLimitWh…

94671a3

…enActualRateAboveLimit

Fixed the case of incorrect rate for first window

50bc818

Add unit tests and comments

c9f2ae4

Rollback change to how connections are created in test

05363f8

Rollback change to epsilon

0385e8a

Formatting fixes

b21ec85

Formatting fixes

a416b2e

Fix failing UT

45bfaa4

Fix failing unit test

876e496

Fix client quora manager tests

17abe3a

Fix a rounding error

c77262b

divijvaidya changed the title ~~Kafka-12319: Improve rate calculation for first sample window~~ (draft) Kafka-12319: Improve rate calculation for first sample window Apr 12, 2022

divijvaidya closed this Apr 13, 2022

mumrah pushed a commit to confluentinc/kafka that referenced this pull request Nov 30, 2023

CONFLUENT: Bump ZK version to 3.8.3 for 6.2.x (apache#12034)

331dbe0

This patch bumps the bundled ZK jar from 3.5.9 to 3.8.3. This is needed to pick up a fix for CVE-2023-44981 GHSA-7286-pgfv-vxvh.

mumrah pushed a commit to confluentinc/kafka that referenced this pull request Nov 30, 2023

CONFLUENT: Bump ZK version to 3.8.3 for 6.2.x (apache#12034)

3a3df41

This patch bumps the bundled ZK jar from 3.5.9 to 3.8.3. This is needed to pick up a fix for CVE-2023-44981 GHSA-7286-pgfv-vxvh.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

(draft) Kafka-12319: Improve rate calculation for first sample window #12034

(draft) Kafka-12319: Improve rate calculation for first sample window #12034

divijvaidya commented Apr 12, 2022

divijvaidya commented Apr 13, 2022

(draft) Kafka-12319: Improve rate calculation for first sample window #12034

(draft) Kafka-12319: Improve rate calculation for first sample window #12034

Conversation

divijvaidya commented Apr 12, 2022

Background

Solution

Code Changes

Testing

Committer Checklist (excluded from commit message)

divijvaidya commented Apr 13, 2022