[DO NOT MERGE] CI triage: isolate TestHoodieClientMultiWriter to debug ZK BindException flake #19060
Closed
nsivabalan wants to merge 1 commit into
…Exception flake

Recent Azure CI runs on apache#18147 and apache#18650 fail on the test-spark-java17-java-tests-part2 (spark4.0) job with BindException: Address already in use followed by HoodieLockException: Failed to connect to ZooKeeper within 10000 ms in BaseZookeeperBasedLockProvider. Neither patch touches lock providers, ZK, or the test harness, so this is a runner-resource / port-bind flake in TestingServer construction.

This commit prepares an isolated repro:

- Disable the GitHub Actions bot.yml (workflow_dispatch only) so the heavy GA matrix does not consume runners during triage.
- Strip azure-pipelines-20230430.yml down to a single job that builds hudi-spark-datasource/hudi-spark with -am -DskipTests, then runs only `mvn test -Dtest=TestHoodieClientMultiWriter` against that module. Pre- and post-test runner-network diagnostic steps (ip_local_port_range, ss -tlnp, ulimit) make any port-bind contention visible in the Azure log.
- Wrap the single `new TestingServer()` call site in TestHoodieClientMultiWriter with startTestingServerWithDiagnostics(), which logs the bound port, the bind latency, JVM PID and hostname, and retries up to 5 times with backoff on nested BindException (other failures rethrow immediately).

To restore master behavior, revert this commit. Do not merge from this branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hudi-bot run azure
DO NOT MERGE — temporary triage branch.
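Recent Azure CI runs on #18147 and #18650 fail on the same Spark 4.0 java-tests-part2 job with BindException: Address already in use, followed by HoodieLockException: Failed to connect to ZooKeeper within 10000 ms in BaseZookeeperBasedLockProvider.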
Neither PR touches lock providers, ZK, or the test harness, which strongly suggests a runner-resource / port-bind flake in the Curator TestingServer used inside TestHoodieClientMultiWriter.testHoodieClientBasicMultiWriterWithEarlyConflictDetectionDirect.

This branch is opened against apache/hudi:master purely so the Apache Azure CI pipeline runs on it. Contents:

- .github/workflows/bot.yml: disabled (workflow_dispatch only) so the GA matrix does not consume runners during triage.
- azure-pipelines-20230430.yml: stripped from 10 jobs to a single job that builds hudi-spark-datasource/hudi-spark -am -DskipTests, then runs mvn test -Dtest=TestHoodieClientMultiWriter only. Pre- and post-test diagnostic steps print ip_local_port_range, ss -tlnp, ulimit -a so port-bind contention is visible in the Azure log.
- TestHoodieClientMultiWriter.java: wraps the single new TestingServer() call in startTestingServerWithDiagnostics(), which logs bound port, bind latency, JVM PID, hostname, and retries up to 5× on nested BindException.

Will close once we have the diagnostic data.
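For readers without the diff open, the retry-on-BindException behavior described above might look roughly like the sketch below. This is not the code from this branch: the class name BindRetry, the helper names, the linear backoff, and the log format are all assumptions made for illustration, and a generic Callable stands in for the `new TestingServer()` call so the example needs only the JDK (Java 9+ for ProcessHandle). Hostname and bound-port logging are omitted.

```java
import java.net.BindException;
import java.util.concurrent.Callable;

// Hypothetical helper; names and backoff policy are illustrative assumptions.
final class BindRetry {

    /** Returns true if t, or any exception in its cause chain, is a BindException. */
    static boolean hasBindCause(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof BindException) {
                return true;
            }
        }
        return false;
    }

    /**
     * Calls factory up to maxAttempts times. Retries only when the failure has a
     * BindException somewhere in its cause chain; any other failure is rethrown
     * immediately. Sleeps baseBackoffMs * attempt between tries (linear backoff).
     */
    static <T> T startWithRetry(Callable<T> factory, int maxAttempts, long baseBackoffMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            long startNanos = System.nanoTime();
            try {
                T started = factory.call();
                long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000;
                System.out.printf("attempt %d succeeded in %d ms (pid=%d)%n",
                        attempt, elapsedMs, ProcessHandle.current().pid());
                return started;
            } catch (Exception e) {
                if (!hasBindCause(e) || attempt >= maxAttempts) {
                    throw e; // not a bind failure, or out of attempts
                }
                long sleepMs = baseBackoffMs * attempt;
                System.out.printf("attempt %d hit BindException, retrying in %d ms%n",
                        attempt, sleepMs);
                Thread.sleep(sleepMs);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated factory: fails twice with a wrapped BindException, then succeeds.
        String result = startWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("wrapped",
                        new BindException("Address already in use"));
            }
            return "started";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In the test itself, the factory would presumably be something like `() -> new TestingServer()`, with the bound port logged after a successful start. Walking the cause chain matters because the BindException is described as nested, so it may arrive wrapped in another exception rather than thrown directly.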
🤖 Generated with Claude Code