Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(draft) Kafka-12319: Improve rate calculation for first sample window #12034

Closed
wants to merge 11 commits into from

Conversation

divijvaidya
Copy link
Contributor

Background

Apache Kafka's rate-limiting algorithm uses a variation of fixed window algorithm where the duration of time window is controlled by two configurations: quota.window.num and quota.window.size.seconds. When a client connection is received, rate of connections at that timestamp is calculated by observing the number of requests in the duration of time window defined using the configurable parameters. If the rate is above the acceptable threshold, the connection is made to wait for a duration calculated to ensure that when a connection is accepted, the overall rate should be less than (or equal to) acceptable threshold.

This algorithm has certain edge cases to handle rate calculation, such as the scenarios where 1\ extended period of time when no connections are received or when 2\ the rate calculation is occurring within the first ever window with no prior samples. Currently, both these cases are handled in the same manner. When no prior windows are available, an assumption is made that a prior window exists with 0 data points. This is an effective way to solve scenario 1 but it leads to problems for scenario 2. Let us demonstrate using examples.

Consider configuration as config.timeWindowMs() = 1s, config.samples() = 2, max connection create rate = 30/s. A workload with a uniform rate of 40/s (1 connection per 25 ms) is started.

Record events (E) at timestamps:
E1 = CurrentTimeStamp (T1)
E2 = T1 + 25ms
E3 = T1 + 50ms
...
...
E30 = T1 + 725ms

The rate calculated as per the current algorithm (assuming a pre-existing window with 0 events) would be:

  • Rate T1 + 20ms = 1/ (prior window size + current window elapsed time) = 1 / (1 + 0.02) = 0.98 events per second
  • Rate calculated at T1 + 90ms = 3/1.09s = 2.75 events per second
  • Rate calculated at T1 + 730ms = 30/1.73s = 17.34 events per second

Note that even after 30 events in that window, the rate is still set to ~17 events per second, which would prevent the throttling to kick-in. This would lead to a situation where the first window allows more requests than acceptable limit to pass through and thus, to compensate, the second window serves less traffic, the third window again increases the traffic and so on....
This leads to an uneven distribution traffic across windows.

Note that this un-even distribution is the primary cause of flakiness of ConnectionQuotasTest.testListenerConnectionRateLimitWhenActualRateAboveLimit()

Solution

As a solution, this PR distinguishes between the scenario 1 and 2 above. During the Rate calculation, windowSize method has been modified to treat the case of first window separately from the case when no traffic has been received in prior windows. For the first window, the windowSize is calculated as the overall length of first window. Using the new change, the above calculation changes to:

The rate calculated as per the new algorithm (assuming a pre-existing window with 0 events) would be:

  • Rate T1 + 20ms = 1/ (current window size) = 1 / 1 = 1 events per second
  • Rate calculated at T1 + 90ms = 3/1s = 3 events per second
  • Rate calculated at T1 + 730ms = 30/1s = 30 events per second

Note how the throttling kicks-in as expected after 30 events and thus leading to a smoother distribution of traffic across windows.

Code Changes

  1. Modifications in Rate.java#windowSize() method to distinguish between scenario 1 and 2.
  2. Add a method isCurrentSampleInFirstWindow in
  3. Fix the ConnectionQuotasTest.testListenerConnectionRateLimitWhenActualRateAboveLimit() test to run for 20s as expected (earlier, there was a difference between what )
  4. Add new tests for scenario 1 and scenario 2 in MetricsTest.java
  5. Modified existing tests for throttling / quota management to remove the assumptions of calculation based on prior empty windows

Testing

All unit test pass using ./gradlew unitTest

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@divijvaidya divijvaidya changed the title Kafka-12319: Improve rate calculation for first sample window (draft) Kafka-12319: Improve rate calculation for first sample window Apr 12, 2022
@divijvaidya
Copy link
Contributor Author

Discarding this since I found a bug in current approach. A new PR has been started at #12045

mumrah pushed a commit to confluentinc/kafka that referenced this pull request Nov 30, 2023
This patch bumps the bundled ZK jar from 3.5.9 to 3.8.3. This is needed to pick
up a fix for CVE-2023-44981 GHSA-7286-pgfv-vxvh.
mumrah pushed a commit to confluentinc/kafka that referenced this pull request Nov 30, 2023
This patch bumps the bundled ZK jar from 3.5.9 to 3.8.3. This is needed to pick
up a fix for CVE-2023-44981 GHSA-7286-pgfv-vxvh.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
1 participant