Skip to content

Shard Windows datanode unit tests to speed up Unit-Test CI#17693

Closed
JackieTien97 wants to merge 1 commit into
apache:masterfrom
JackieTien97:speedup-windows-unit-test
Closed

Shard Windows datanode unit tests to speed up Unit-Test CI#17693
JackieTien97 wants to merge 1 commit into
apache:masterfrom
JackieTien97:speedup-windows-unit-test

Conversation

@JackieTien97
Copy link
Copy Markdown
Contributor

Summary

  • Windows runner on the Unit-Test workflow's datanode cell takes ~47 min vs ~38 min on Ubuntu. With 597 UT classes and reuseForks=false (one JVM per class), Windows process startup + NTFS IO are the bottleneck.
  • Excludes the (windows-latest, datanode) cell from the existing matrix and adds a new unit-test-windows-datanode job with 3 parallel shards via -Dsurefire.includesFile=.... Same hash-mod assignment pattern as Shard Windows IT jobs to speed up 1C1D and Table 1C1D CI #17692, but for surefire instead of failsafe.
  • Expected wall clock: ~47 min → ~22 min on Windows; whole pipeline drops accordingly.

Design notes

  • Shard list written to $RUNNER_TEMP/unit-shard.txt (outside the repo):
    • apache-rat can never flag it as an unapproved-license file even if a future workflow reaches verify phase.
    • Survives mvn clean.
    • Doesn't depend on workspace layout.
  • find | awk (no xargs) sidesteps the Windows ARG_MAX trap fixed in a343cf5 (same root cause as the earlier 1C1D shard fix).
  • The single datanode IT (EnvScriptIT) runs only in shard 1; shards 2 and 3 add -DskipITs.
  • Sharded jobs adopt the netsh dynamicport tweak from cluster-it-1c1d.yml — 597 UTs × reuseForks=false will also burn through the Windows ephemeral port pool.
  • 597 / 3 = 199 classes per shard, balanced.

Notes for reviewers

  • Branch protection rules: the matrix shape changes — the old required check Unit-Test / unit-test (17, windows-latest, datanode) goes away, replaced by Unit-Test / unit-test-windows-datanode (1), (2), (3). If those are required checks in repo settings, an admin needs to update the rule (same situation as Shard Windows IT jobs to speed up 1C1D and Table 1C1D CI #17692).
  • Pom files are untouched; this is a CI-only change.

Test plan

  • Confirm all 3 Windows shards complete and total wall-clock drops to ~22 min.
  • Confirm Ubuntu jobs (unit-test (17, ubuntu-latest, datanode), (17, ubuntu-latest, others)) and unit-test (17, windows-latest, others) still run as before.
  • Spot-check that no UT class is silently skipped (sum of per-shard test counts should equal pre-change total).

The Windows runner of the Unit-Test workflow has been ~47 min on the
datanode cell vs ~38 min on Ubuntu — datanode has 597 UT classes and
surefire's reuseForks=false spawns a new JVM per class, which is slow on
Windows process startup + NTFS IO.

Excludes the (windows-latest, datanode) cell from the existing matrix and
adds a new `unit-test-windows-datanode` job with 3 parallel shards, mirroring
the failsafe.includesFile pattern from apache#17692 but using surefire.includesFile.

The shard list is written to $RUNNER_TEMP (outside the repo) so apache-rat
can never see it, it survives mvn clean, and it doesn't depend on workspace
layout. `find | awk` (no xargs) sidesteps the Windows ARG_MAX trap fixed in
a343cf5. The single datanode IT (EnvScriptIT) runs only in shard 1;
shards 2 and 3 add -DskipITs.

Expected Unit-Test wall clock: ~47 min -> ~22 min.
@JackieTien97
Copy link
Copy Markdown
Contributor Author

Closing — this change is a regression, not a speedup. Closer analysis of the CI run revealed the root cause:

What I found

The master Windows datanode job's bulk time (~49 min) comes entirely from a single surefire execution (default-test) running 3629 test methods with reuseForks=false. Sharding the test list via -Dsurefire.includesFile= did not speed this up. In fact it slowed things down to ~60 min, because the CLI flag accidentally activated two dead executions in the root pom:

<!-- pom.xml:1306-1345 -->
<execution>
  <id>unit-tests</id>
  <phase>test</phase>
  <configuration>
    <includes><include>src/test/**/*Test.java</include></includes>
    ...
  </configuration>
</execution>
<execution>
  <id>integration-tests</id>
  ...
</execution>

On master these executions run in <1 sec each and match zero tests, because surefire scans target/test-classes/ and the src/test/ path prefix never matches there. But once a CLI flag like -Dsurefire.includesFile= is supplied, it replaces the broken <includes> with a valid pattern, and all three surefire executions (default-test + unit-tests + integration-tests) end up running the same test list — 3× duplication.

Concrete evidence

  • master Windows datanode log (job 76358113571): default-test ran 3629 tests over 49 min; unit-tests and integration-tests each ran for ~1 sec with 0 tests.
  • this PR's shard 3 log (job 76362582638): all three executions show "Tests run: 1192" — same 199-class shard, run 3 times.

Next step

A separate PR will remove those two dead executions from the root pom (TODO comment in the code already acknowledges they should be cleaned up). That doesn't speed up master by itself, but it's a prerequisite for any future surefire-based sharding (or -Dtest= selection) to actually work.

Speeding up Windows Unit-Test itself needs a different approach — likely reuseForks=true + forkCount=NC, or sharing build artifacts across test-only jobs — which I'll explore separately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant