KAFKA-17999: Fix flaky test DynamicConnectionQuotaTest.testDynamicConnectionQuota #20657
[KAFKA-17999] Deflake DynamicConnectionQuotaTest by unifying loopback family and stabilizing connection-count waits
What
This PR eliminates nondeterminism in DynamicConnectionQuotaTest caused by IPv4/IPv6 resolution differences and timing sensitivity in connection accounting.
JIRA
KAFKA-17999
Reproduction of flakiness
To force the mismatch and make the original test fail locally, you can bias the system toward IPv6 and ensure "localhost" resolves to ::1 while the test counts on 127.0.0.1.
These steps modify /etc/hosts; revert instructions are below.
Back up and edit /etc/hosts to favor IPv6 localhost
Force JVM to prefer IPv6 and disable DNS caching
Run the test repeatedly
Observed: Test fails on every alternate run due to java.net.SocketException: Broken pipe when the broker applies the per-IP limit to one literal while the client connects via another, causing unexpected disconnects at the quota boundary.
IMPORTANT (Do at the end, when done verifying the stability of the fix): Revert the host/network changes after reproducing
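Before running the test, it can help to confirm that the environment is actually biased the way the reproduction expects. The following is a minimal sketch, not part of the PR: the class name LocalhostResolutionCheck and the helper resolveLiterals are hypothetical, while java.net.preferIPv6Addresses and networkaddress.cache.ttl are the standard JDK networking settings for address preference and DNS cache lifetime.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
public class LocalhostResolutionCheck {

    /** Returns the literal addresses a host name resolves to, in the order the JVM reports them. */
    static List<String> resolveLiterals(String host) {
        try {
            List<String> literals = new ArrayList<>();
            for (InetAddress addr : InetAddress.getAllByName(host)) {
                literals.add(addr.getHostAddress());
            }
            return literals;
        } catch (UnknownHostException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Turn off the JVM's positive lookup cache; this must happen before the first lookup.
        Security.setProperty("networkaddress.cache.ttl", "0");
        System.out.println("java.net.preferIPv6Addresses="
                + System.getProperty("java.net.preferIPv6Addresses"));
        System.out.println("localhost -> " + resolveLiterals("localhost"));
    }
}
```

One way to run it is `java -Djava.net.preferIPv6Addresses=true LocalhostResolutionCheck.java`. Note that Java prints the IPv6 loopback in its uncompressed form, 0:0:0:0:0:0:0:1, rather than ::1. If that address is listed first, the JVM is set up to dial localhost over IPv6.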
Why this flakes
Environment-dependent IPv4/IPv6 mismatch
The test previously used:
connectionCount(localAddress) where localAddress was 127.0.0.1 (IPv4), but connect() dialed "localhost", which may resolve to either 127.0.0.1 or ::1 depending on the machine’s /etc/hosts and JVM preferences.
Because Kafka enforces connection quotas per literal remote IP, counting on 127.0.0.1 while connecting via ::1 (or vice versa) breaks the test’s assumptions and leads to intermittent failures.
How (the fix)
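A toy model can make both the mismatch and the remedy concrete. The sketch below is not the PR's code: PerIpCounter is a hypothetical stand-in for the broker's per-IP accounting, and resolving "localhost" once with InetAddress.getByName is only an assumed way of detecting the literal.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; none of these names come from the Kafka code base.
public class LoopbackLiteralSketch {

    /** Toy per-IP accounting: connection counts are keyed by the remote address. */
    static final class PerIpCounter {
        private final Map<InetAddress, Integer> counts = new HashMap<>();

        void onConnect(InetAddress remote) {
            counts.merge(remote, 1, Integer::sum);
        }

        int count(InetAddress remote) {
            return counts.getOrDefault(remote, 0);
        }
    }

    /** Resolves a host name or IP literal, rethrowing lookup failures as unchecked. */
    static InetAddress literal(String hostOrIp) {
        try {
            return InetAddress.getByName(hostOrIp);
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        InetAddress v4 = literal("127.0.0.1");
        InetAddress v6 = literal("::1");

        // Mismatch: the client arrives via ::1 while the test counts on 127.0.0.1.
        PerIpCounter mismatched = new PerIpCounter();
        mismatched.onConnect(v6);
        System.out.println("mismatched count: " + mismatched.count(v4)); // prints 0

        // Unified: resolve once, then dial and count with that same address.
        InetAddress detected = literal("localhost"); // assumption: one lookup, reused everywhere
        PerIpCounter unified = new PerIpCounter();
        unified.onConnect(detected);
        System.out.println("unified count: " + unified.count(detected)); // prints 1
    }
}
```

The unified count is 1 whichever family localhost resolves to, because the same address object's value is used for both the connection and the lookup.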
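Detect which loopback literal is in use (::1 vs 127.0.0.1) and use that same literal everywhere in the test.
[…] setUp() to avoid first-use side effects before taking connection-count baselines.
Use the detected literal for dialing in connect() (client sockets), for the per-IP override (MAX_CONNECTIONS_PER_IP_OVERRIDES_CONFIG), and for counting (connectionCount(localAddress)).
Verification of stability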
With the IPv6-favored setup above, the test below now consistently passes because counting, dialing, and per-IP override all use the same detected literal:
Scope