Shard Windows IT jobs to speed up 1C1D and Table 1C1D CI #17692
Merged
The Windows runners for Cluster IT - 1C1D and Table Cluster IT - 1C1D are 67-77% slower than their Ubuntu counterparts, making them the bottleneck of the entire PR check pipeline (87 min and 65 min wall clock respectively).

Split each pipeline's Windows job into 3 parallel matrix shards:
- LocalStandaloneIT test classes (276) split for Cluster IT - 1C1D
- TableLocalStandaloneIT test classes (231) split for Table Cluster IT - 1C1D

Each shard uses failsafe.includesFile reading from a generated file, avoiding command-line length limits regardless of how the test suite grows. Ubuntu jobs stay as a single job since they were already fast enough.

Expected wall clock reduction:
- Cluster IT - 1C1D: 87 min -> ~49 min (capped by Ubuntu)
- Table Cluster IT - 1C1D: 65 min -> ~39 min (capped by Ubuntu)
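To illustrate the 3-way split, here is a minimal sketch that deals a list of test classes into shards round-robin. The helper name `shard_lines` and the round-robin strategy are assumptions made for illustration; the commit message does not say how classes are assigned to shards.

```shell
# shard_lines INDEX COUNT
# Hypothetical helper: reads one class path per line on stdin and prints the
# lines whose 0-based position modulo COUNT equals INDEX (round-robin split).
shard_lines() {
  local index="$1" count="$2"
  awk -v i="$index" -v n="$count" '(NR - 1) % n == i'
}

# Example: deal seven placeholder class names into 3 shards.
for shard in 0 1 2; do
  printf 'shard %s: ' "$shard"
  printf 'C%sIT\n' 1 2 3 4 5 6 7 | shard_lines "$shard" 3 | tr '\n' ' '
  printf '\n'
done
```

Round-robin keeps shard sizes within one class of each other, but it balances class counts, not run time, so a shard holding several slow classes could still finish last.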
…ndows

On Windows Git Bash ARG_MAX is much smaller than on Linux, so `xargs -0` splits the file list into many batches. Batches with no matching files make grep return 1, which makes xargs return 123, and `set -o pipefail` turns that into a hard failure for the whole shard step.

Replace the pipeline with a single `grep -rl --include='*IT.java'` call. That uses one grep invocation, so its exit code reflects whether any match was found across the entire tree. Matches always exist here, so the exit code is always 0.

Local counts on macOS confirm the logic is preserved:
- LocalStandaloneIT: 276 classes
- TableLocalStandaloneIT: 231 classes
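The single-invocation approach could look roughly like the sketch below. Only `grep -rl --include='*IT.java'` comes from the commit message; the function name `list_it_classes`, the `-w` and `-F` flags, and the `<Category>.class` search string are assumptions. `-w` is included so that a search for `LocalStandaloneIT` does not also match `TableLocalStandaloneIT`.

```shell
# list_it_classes DIR CATEGORY
# Hypothetical helper: prints the sorted paths of *IT.java files under DIR
# that reference CATEGORY.class. Returns 1 when nothing matches at all.
list_it_classes() {
  local dir="$1" category="$2" files
  # A single grep process walks the whole tree, so exit status 1 means
  # "no match anywhere" rather than "no match in one xargs batch".
  files="$(grep -rlwF --include='*IT.java' -e "${category}.class" "$dir")" || return 1
  printf '%s\n' "$files" | sort
}
```

A caller would run something like `list_it_classes integration-test LocalStandaloneIT > classes.txt`, where both the directory and the output file name are placeholders.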
The previous attempt wrote the generated shard list to
integration-test/it-shard.txt. That path is inside the repo and not
covered by the root pom.xml's RAT excludes (which only cover
**/target/**), so the license check started warning:
Files with unapproved licenses:
D:/a/iotdb/iotdb/integration-test/it-shard.txt
We can't use a target/ subdirectory because `mvn clean verify` wipes it
before our shard file would be read. Instead, write the file to
$RUNNER_TEMP/it-shard.txt — the runner-scoped tmp dir is outside the
repository entirely, so RAT never sees it. Update both
-Dfailsafe.includesFile invocations to match.
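A minimal sketch of resolving that path and handing it to Maven follows. `RUNNER_TEMP`, `it-shard.txt`, `mvn clean verify`, and `-Dfailsafe.includesFile` come from the commit message; the helper name `shard_file_path` and the fallback to `TMPDIR` or `/tmp` for local runs are assumptions, and the real workflow presumably passes further Maven flags that are not shown here.

```shell
# shard_file_path
# Hypothetical helper: prints where the generated include list should live.
# Uses the runner-scoped temp dir when RUNNER_TEMP is set (GitHub Actions),
# otherwise falls back to TMPDIR or /tmp (assumed fallback for local runs).
shard_file_path() {
  printf '%s/it-shard.txt\n' "${RUNNER_TEMP:-${TMPDIR:-/tmp}}"
}

# Print (rather than run) the Maven command that would consume the file.
echo "mvn clean verify -Dfailsafe.includesFile=$(shard_file_path)"
```

Because the file lives outside the checkout, neither `mvn clean` nor the RAT license scan touches it.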
This was referenced May 17, 2026
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             master   #17692    +/-   ##
==========================================
- Coverage       40.39%   40.39%   -0.01%
  Complexity       2574     2574
==========================================
  Files            5179     5179
  Lines          349628   349628
  Branches        44683    44683
==========================================
- Hits           141243   141239      -4
- Misses         208385   208389      +4

Summary
Windows runners for Cluster IT - 1C1D and Table Cluster IT - 1C1D are 67-77% slower than their Ubuntu counterparts and are the bottleneck of the entire PR check pipeline (87 min and 65 min wall clock respectively).
Since `forkCount=2` already saturates each 4 vCPU / 16 GB GitHub VM, the only way to speed up Windows is horizontal sharding across multiple VMs.
Approach
For each pipeline, split the Windows job into 3 parallel matrix shards, keeping Ubuntu as a single job (already fast enough):
- Test classes are selected by their `@Category` annotation (`LocalStandaloneIT` for 1C1D, `TableLocalStandaloneIT` for Table 1C1D)
Using `includesFile` (instead of `-Dit.test=Class1,Class2,...`) avoids the Windows command-line length limit (~8 KB), so this scales safely even if the test suite grows to thousands of classes.
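To see why a comma-joined `-Dit.test=` argument becomes a problem, the sketch below measures how long that argument would be for a given class list. The helper name `it_test_arg_length` is hypothetical, and the class names in the example are placeholders rather than real IoTDB tests.

```shell
# it_test_arg_length
# Hypothetical helper: reads class names (one per line) on stdin and prints
# the character count of the equivalent "-Dit.test=A,B,C" argument.
it_test_arg_length() {
  local arg
  arg="-Dit.test=$(paste -sd, -)"
  printf '%s\n' "${#arg}"
}

# Example: length of the argument for 276 placeholder class names.
seq 1 276 | awk '{ printf "PlaceholderFeature%04dIT\n", $1 }' | it_test_arg_length
```

The length grows linearly with the number of classes, whereas the `includesFile` argument stays the same size however many lines the file holds.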
Shard distribution
After splitting:
Expected effect
The wall-clock improvement comes from the slowest Windows shard finishing in ~29 min / ~22 min instead of the full ~87 min / ~65 min. Pipeline duration is now bounded by the (faster) Ubuntu job.
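The estimate can be reproduced with simple arithmetic, sketched below. It assumes the Windows run time divides evenly across 3 shards and ignores per-shard setup overhead; the Ubuntu figures of 49 and 39 minutes are the caps given in the commit message, and the helper names are hypothetical.

```shell
# max_of A B: prints the larger of two integers.
max_of() { if [ "$1" -ge "$2" ]; then echo "$1"; else echo "$2"; fi; }

# ceil_div A B: integer division rounded up.
ceil_div() { echo $(( ($1 + $2 - 1) / $2 )); }

# Columns: pipeline name, Windows minutes before sharding, Ubuntu minutes.
while read -r name windows ubuntu; do
  shard=$(ceil_div "$windows" 3)   # assumed even 3-way split
  echo "${name}: Windows shard ~${shard} min, pipeline ~$(max_of "$shard" "$ubuntu") min"
done <<'EOF'
1C1D 87 49
Table-1C1D 65 39
EOF
# prints:
# 1C1D: Windows shard ~29 min, pipeline ~49 min
# Table-1C1D: Windows shard ~22 min, pipeline ~39 min
```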
Test plan