Shard Windows datanode unit tests to speed up Unit-Test CI #17693
JackieTien97 wants to merge 1 commit into
The Windows runner of the Unit-Test workflow takes ~47 min on the datanode cell vs ~38 min on Ubuntu: datanode has 597 UT classes and surefire's `reuseForks=false` spawns a new JVM per class, which is slow because of Windows process startup and NTFS IO.

Excludes the (windows-latest, datanode) cell from the existing matrix and adds a new `unit-test-windows-datanode` job with 3 parallel shards, mirroring the `failsafe.includesFile` pattern from apache#17692 but using `surefire.includesFile`.

The shard list is written to `$RUNNER_TEMP` (outside the repo), so apache-rat can never see it, it survives `mvn clean`, and it doesn't depend on workspace layout. `find | awk` (no xargs) sidesteps the Windows ARG_MAX trap fixed in a343cf5. The single datanode IT (`EnvScriptIT`) runs only in shard 1; shards 2 and 3 add `-DskipITs`.

Expected Unit-Test wall clock: ~47 min -> ~22 min.
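The hash-mod shard assignment described above can be sketched roughly as follows. This is an illustration only, not the PR's actual workflow step: the function name `write_shard_list`, the `*Test.java` filter, the polynomial hash, and the choice to emit paths relative to the test source root are all assumptions. The exact pattern syntax that `surefire.includesFile` accepts should be checked against the Surefire documentation.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR):
#   write_shard_list <test-root> <shard 1..N> <shard-count> <out-file>
# Assigns each *Test.java file under <test-root> to exactly one shard by
# hashing its relative path, and writes this shard's files to <out-file>.
# <test-root> is assumed to have no trailing slash.
write_shard_list() {
  local root="$1" shard="$2" total="$3" out="$4"
  # find | awk only: no xargs, so no long argument list is ever built.
  find "$root" -type f -name '*Test.java' |
    LC_ALL=C awk -v shard="$shard" -v total="$total" -v root="$root" '
      # awk has no ord(); build a byte -> code lookup table once.
      BEGIN { for (i = 1; i < 256; i++) ord[sprintf("%c", i)] = i }
      {
        rel = substr($0, length(root) + 2)   # strip "<root>/"
        h = 0
        n = length(rel)
        for (i = 1; i <= n; i++)
          h = (h * 31 + ord[substr(rel, i, 1)]) % 1000003
        if (h % total == shard - 1) print rel
      }' > "$out"
}

# Possible use inside a shard job (SHARD is a hypothetical variable):
#   write_shard_list src/test/java "$SHARD" 3 "$RUNNER_TEMP/unit-shard.txt"
```

Because the hash depends only on the relative path, each class lands in the same shard on every run, and the three shard files together cover every matched class exactly once.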
Closing: this change is a regression, not a speedup. Closer analysis of the CI run revealed the root cause:

What I found

The master Windows datanode job's bulk time (~49 min) comes entirely from a single surefire execution (

    <!-- pom.xml:1306-1345 -->
    <execution>
      <id>unit-tests</id>
      <phase>test</phase>
      <configuration>
        <includes><include>src/test/**/*Test.java</include></includes>
        ...
      </configuration>
    </execution>
    <execution>
      <id>integration-tests</id>
      ...
    </execution>

On master these executions run in <1 sec each and match zero tests, because surefire scans

Concrete evidence
Next step

A separate PR will remove those two dead executions from the root pom (a TODO comment in the code already acknowledges they should be cleaned up). That doesn't speed up master by itself, but it's a prerequisite for any future surefire-based sharding (or

Speeding up Windows Unit-Test itself needs a different approach, likely
Summary

- The Windows runner of the `Unit-Test` workflow's `datanode` cell takes ~47 min vs ~38 min on Ubuntu. With 597 UT classes and `reuseForks=false` (one JVM per class), Windows process startup + NTFS IO are the bottleneck.
- Excludes the `(windows-latest, datanode)` cell from the existing matrix and adds a new `unit-test-windows-datanode` job with 3 parallel shards via `-Dsurefire.includesFile=...`. Same hash-mod assignment pattern as "Shard Windows IT jobs to speed up 1C1D and Table 1C1D CI" #17692, but for surefire instead of failsafe.

Design notes

- The shard list is written to `$RUNNER_TEMP/unit-shard.txt` (outside the repo):
  - apache-rat can never see it during the `verify` phase.
  - It survives `mvn clean`.
- `find | awk` (no xargs) sidesteps the Windows ARG_MAX trap fixed in a343cf5 (same root cause as the earlier 1C1D shard fix).
- The single datanode IT (`EnvScriptIT`) runs only in shard 1; shards 2 and 3 add `-DskipITs`.
- `netsh dynamicport` tweak from `cluster-it-1c1d.yml`: 597 UTs × `reuseForks=false` will also burn through the Windows ephemeral port pool.

Notes for reviewers

- `Unit-Test / unit-test (17, windows-latest, datanode)` goes away, replaced by `Unit-Test / unit-test-windows-datanode (1)`, `(2)`, `(3)`. If those are required checks in repo settings, an admin needs to update the rule (same situation as "Shard Windows IT jobs to speed up 1C1D and Table 1C1D CI" #17692).

Test plan

- `unit-test (17, ubuntu-latest, datanode)`, `(17, ubuntu-latest, others)`, and `unit-test (17, windows-latest, others)` still run as before.
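
The rule that only shard 1 runs the single datanode IT while shards 2 and 3 add `-DskipITs` could be expressed as a small helper like the one below. This is a sketch under assumptions: `shard_mvn_flags` is a hypothetical name, and the actual workflow may well encode the same rule directly in the job's YAML rather than in a shell function.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR):
#   shard_mvn_flags <shard-index> <shard-list-file>
# Prints, one per line, the extra Maven flags for the given shard.
shard_mvn_flags() {
  local shard="$1" list_file="$2"
  # Every shard restricts surefire to its own list of test classes.
  printf '%s\n' "-Dsurefire.includesFile=${list_file}"
  # Only shard 1 keeps the integration test; the others skip ITs.
  if [ "$shard" -ne 1 ]; then
    printf '%s\n' "-DskipITs"
  fi
}

# Possible use (SHARD is a hypothetical variable):
#   mvn verify $(shard_mvn_flags "$SHARD" "$RUNNER_TEMP/unit-shard.txt")
```

Keeping the IT in exactly one shard avoids running it three times while still exercising it once per workflow run.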