[DO NOT MERGE] CI triage: isolate TestHoodieClientMultiWriter to debug ZK BindException flake #19060

Closed
nsivabalan wants to merge 1 commit into
apache:master from
nsivabalan:ci-triage-multiwriter-zk

@nsivabalan

Contributor

DO NOT MERGE — temporary triage branch.

Recent Azure CI runs on #18147 and #18650 fail on the same Spark 4.0 java-tests-part2 job with:

Caused by: java.net.BindException: Address already in use
Caused by: org.apache.hudi.exception.HoodieLockException: Failed to connect to ZooKeeper within 10000 ms
  at org.apache.hudi.client.transaction.lock.BaseZookeeperBasedLockProvider.<init>(BaseZookeeperBasedLockProvider.java:86)

Neither PR touches lock providers, ZK, or the test harness, which strongly suggests a runner-resource or port-bind flake in the Curator TestingServer used inside TestHoodieClientMultiWriter.testHoodieClientBasicMultiWriterWithEarlyConflictDetectionDirect.
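
For readers unfamiliar with the failure mode, here is a minimal JDK-only sketch of what "Address already in use" means at the socket level: a second listener cannot bind a port that another listener already holds. The class and method names (`BindCollisionDemo`, `secondBindFails`) are hypothetical and are not part of this branch. Whether the flake comes from a probe-then-bind race in the test server or from some other process on the runner is an open question that the diagnostics are meant to answer.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical demo class; not part of the Hudi code base.
final class BindCollisionDemo {

    /** Returns true if binding a second listener to an occupied port raises BindException. */
    static boolean secondBindFails() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 asks the OS to pick any free ephemeral port for the first listener.
        try (ServerSocket first = new ServerSocket(0, 50, loopback)) {
            int port = first.getLocalPort();
            try (ServerSocket second = new ServerSocket()) {
                second.setReuseAddress(false);
                // The first listener is still open, so this bind is expected to fail.
                second.bind(new InetSocketAddress(loopback, port));
                return false;
            } catch (BindException e) {
                return true;
            }
        } catch (IOException e) {
            // Any other I/O problem is unrelated to the collision being demonstrated.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("second bind failed: " + secondBindFails());
    }
}
```

If a server picks a port first and binds it later, any process that grabs the port in between produces this same exception, which is why the triage steps capture the listening sockets before and after the test.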

This branch is opened against apache/hudi:master purely so the Apache Azure CI pipeline runs on it. Contents:

  • .github/workflows/bot.yml: disabled (workflow_dispatch only) so the GA matrix does not consume runners during triage.
  • azure-pipelines-20230430.yml: stripped from 10 jobs to a single job that builds hudi-spark-datasource/hudi-spark with -am -DskipTests, then runs only mvn test -Dtest=TestHoodieClientMultiWriter. Pre- and post-test diagnostic steps print ip_local_port_range, ss -tlnp, and ulimit -a so that port-bind contention is visible in the Azure log.
  • TestHoodieClientMultiWriter.java: wraps the single new TestingServer() call in startTestingServerWithDiagnostics(), which logs the bound port, bind latency, JVM PID, and hostname, and retries up to 5× on a nested BindException.
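
The retry behavior described in the last bullet can be sketched roughly as follows. This is not the code on the branch: `BindRetry`, `startWithRetry`, `hasBindCause`, and the linear backoff are assumptions made for illustration, and a generic `Callable` stands in for `new TestingServer()` so the example needs only the JDK. Logging is reduced to attempt number and latency; the real helper reportedly also logs port, PID, and hostname.

```java
import java.net.BindException;
import java.util.concurrent.Callable;

// Hypothetical helper; the branch's startTestingServerWithDiagnostics() may differ.
final class BindRetry {

    /** Walks the cause chain (bounded depth) looking for a BindException. */
    static boolean hasBindCause(Throwable t) {
        int depth = 0;
        for (Throwable c = t; c != null && depth < 32; c = c.getCause(), depth++) {
            if (c instanceof BindException) {
                return true;
            }
        }
        return false;
    }

    /**
     * Calls the starter up to maxAttempts times. Only failures with a nested
     * BindException are retried; anything else is rethrown immediately.
     */
    static <T> T startWithRetry(Callable<T> starter, int maxAttempts, long baseBackoffMs)
            throws Exception {
        for (int attempt = 1; ; attempt++) {
            long startNanos = System.nanoTime();
            try {
                T server = starter.call();
                long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000;
                System.out.printf("attempt %d succeeded in %d ms%n", attempt, elapsedMs);
                return server;
            } catch (Exception e) {
                if (!hasBindCause(e) || attempt >= maxAttempts) {
                    throw e;
                }
                System.out.printf("attempt %d hit a bind failure, retrying%n", attempt);
                // Linear backoff (an assumption): wait longer after each failed attempt.
                Thread.sleep(baseBackoffMs * attempt);
            }
        }
    }
}
```

In a test, the call site would look something like `startWithRetry(() -> new TestingServer(), 5, 200L)`, assuming Curator's test module is on the classpath. Checking the whole cause chain matters because the BindException in the trace above is nested under other exceptions rather than thrown directly.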

Will close once we have the diagnostic data.

🤖 Generated with Claude Code

…Exception flake

Recent Azure CI runs on apache#18147 and apache#18650 fail on the
test-spark-java17-java-tests-part2 (spark4.0) job with
BindException: Address already in use followed by
HoodieLockException: Failed to connect to ZooKeeper within 10000 ms in
BaseZookeeperBasedLockProvider. Neither patch touches lock providers,
ZK, or the test harness, so this is a runner-resource / port-bind flake
in TestingServer construction.

This commit prepares an isolated repro:
  - Disable the GitHub Actions bot.yml (workflow_dispatch only) so the
    heavy GA matrix does not consume runners during triage.
  - Strip azure-pipelines-20230430.yml down to a single job that builds
    hudi-spark-datasource/hudi-spark with -am -DskipTests, then runs
    only `mvn test -Dtest=TestHoodieClientMultiWriter` against that
    module. Pre- and post-test runner-network diagnostic steps
    (ip_local_port_range, ss -tlnp, ulimit) make any port-bind
    contention visible in the Azure log.
  - Wrap the single `new TestingServer()` call site in
    TestHoodieClientMultiWriter with startTestingServerWithDiagnostics()
    which logs the bound port, the bind latency, JVM PID and hostname,
    and retries up to 5 times with backoff on nested BindException
    (other failures rethrow immediately).

To restore master behavior, revert this commit. Do not merge from this
branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nsivabalan

Contributor Author

@hudi-bot run azure

@hudi-bot

Collaborator

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@nsivabalan closed this Jun 25, 2026

Labels

size:L PR with lines of changes in (300, 1000]
