Shard Windows IT jobs to speed up 1C1D and Table 1C1D CI #17692

Merged
JackieTien97 merged 3 commits into master from speedup-windows-ci
May 17, 2026
@JackieTien97
Contributor

Summary

Windows runners for Cluster IT - 1C1D and Table Cluster IT - 1C1D are 67-77% slower than their Ubuntu counterparts and are the bottleneck of the entire PR check pipeline:

| Pipeline | Ubuntu | Windows | Slowdown |
| --- | --- | --- | --- |
| Cluster IT - 1C1D | 49 min | 87 min | +77% |
| Table Cluster IT - 1C1D | 39 min | 65 min | +67% |

Since `forkCount=2` already saturates the 4 vCPU / 16 GB GitHub VM, adding more forks to a single Windows job would not help; the only way to speed up Windows is horizontal sharding across multiple VMs.

Approach

For each pipeline, split the Windows job into 3 parallel matrix shards, keeping Ubuntu as a single job (already fast enough):

  • Each shard scans test sources for the right @Category annotation (LocalStandaloneIT for 1C1D, TableLocalStandaloneIT for Table 1C1D)
  • Test classes are distributed across shards round-robin by line number in the class list (`awk 'NR%3==SHARD'`), which balances the class count per shard
  • The shard's class list is written to `integration-test/it-shard.txt` and passed via `-Dfailsafe.includesFile=...`
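The three steps above can be sketched as a single shell function. This is a minimal illustration, not the workflow's actual script: the function name `list_shard_classes`, its arguments, the explicit `sort`, and the `**/Name.java` pattern format written to the list are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not copied from the workflow file.
# list_shard_classes <source-root> <category-pattern> <shard-index> <shard-count>
# Prints one failsafe include pattern per test class assigned to the shard.
list_shard_classes() {
  local src_root=$1 pattern=$2 shard=$3 total=$4
  # Step 1: find IT sources mentioning the category. Sorting (an assumption
  # here) gives every shard the same line numbering.
  grep -rl --include='*IT.java' "$pattern" "$src_root" \
    | LC_ALL=C sort \
    | awk -v shard="$shard" -v total="$total" '
        # Step 2: keep only the lines whose number maps to this shard.
        NR % total == shard {
          n = split($0, parts, "/")   # Step 3: reduce the path to a file name
          print "**/" parts[n]        # and emit it as an include pattern.
        }'
}
```

A caller would redirect the output to the shard file, for example `list_shard_classes integration-test/src LocalStandaloneIT 0 3 > it-shard.txt` (the source path here is illustrative). Note that a plain `LocalStandaloneIT` pattern also matches `TableLocalStandaloneIT`, so a real script needs a stricter pattern for the 1C1D pipeline.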

Using `includesFile` (instead of `-Dit.test=Class1,Class2,...`) avoids the Windows command-line length limit (~8 KB) — this scales safely even if the test suite grows to thousands of classes.

Shard distribution

After splitting:

| Pipeline | Total classes | Per shard |
| --- | --- | --- |
| Cluster IT - 1C1D | 276 | 92 / 92 / 92 |
| Table Cluster IT - 1C1D | 231 | 77 / 77 / 77 |
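The per-shard counts follow directly from the `NR%3` rule and can be checked without the repository. The sketch below feeds `n` dummy lines through the same filter; `shard_sizes` is a hypothetical helper name introduced for this example.

```shell
# Count how many of n input lines each of the 3 shards receives under NR%3.
shard_sizes() {
  local n=$1 shard
  for shard in 0 1 2; do
    seq "$n" | awk -v shard="$shard" 'NR % 3 == shard' | wc -l | tr -d ' '
  done | paste -sd/ -
}

shard_sizes 276   # Cluster IT - 1C1D        -> 92/92/92
shard_sizes 231   # Table Cluster IT - 1C1D  -> 77/77/77
```

Both totals happen to be divisible by 3; for other totals the shard sizes differ by at most one class.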

Expected effect

| Pipeline | Before | After |
| --- | --- | --- |
| Cluster IT - 1C1D | ~87 min | ~49 min (capped by Ubuntu) |
| Table Cluster IT - 1C1D | ~65 min | ~39 min (capped by Ubuntu) |

The wall-clock improvement comes from the slowest Windows shard finishing in ~29 min / ~22 min instead of the full ~87 min / ~65 min. Pipeline duration is now bounded by the (faster) Ubuntu job.
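The estimate can be reproduced with integer arithmetic. This sketch assumes test time splits evenly across shards and ignores fixed per-job overhead such as checkout and build, so real shard times will likely be somewhat higher; `estimate_pipeline` is a hypothetical helper name.

```shell
# estimate_pipeline <ubuntu_min> <windows_min> <shards>
estimate_pipeline() {
  local ubuntu=$1 windows=$2 shards=$3
  # Ceiling division: ideal per-shard time under an even split.
  local per_shard=$(( (windows + shards - 1) / shards ))
  # The pipeline finishes when its slowest job finishes.
  local bound=$(( ubuntu > per_shard ? ubuntu : per_shard ))
  echo "per-shard ~${per_shard} min, pipeline ~${bound} min"
}

estimate_pipeline 49 87 3   # Cluster IT - 1C1D
estimate_pipeline 39 65 3   # Table Cluster IT - 1C1D
```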

Test plan

  • CI passes on this PR (verify both pipelines run all expected test classes across shards)
  • Compare Windows job wall clock to baseline — expect ~3x reduction
  • Verify shard upload artifacts are uniquely named on failure

The Windows runners for Cluster IT - 1C1D and Table Cluster IT - 1C1D
are 67-77% slower than their Ubuntu counterparts, making them the
bottleneck of the entire PR check pipeline (87 min and 65 min wall
clock respectively).

Split each pipeline's Windows job into 3 parallel matrix shards:
- LocalStandaloneIT test classes (276) split for Cluster IT - 1C1D
- TableLocalStandaloneIT test classes (231) split for Table Cluster IT - 1C1D

Each shard uses failsafe.includesFile reading from a generated file,
avoiding command-line length limits regardless of how the test suite grows.

Ubuntu jobs stay as a single job since they were already fast enough.

Expected wall clock reduction:
- Cluster IT - 1C1D: 87 min -> ~49 min (capped by Ubuntu)
- Table Cluster IT - 1C1D: 65 min -> ~39 min (capped by Ubuntu)
…ndows

On Windows Git Bash ARG_MAX is much smaller than on Linux, so `xargs -0`
splits the file list into many batches. Batches with no matching files
make grep return 1, which makes xargs return 123, and `set -o pipefail`
turns that into a hard failure for the whole shard step.

Replace the pipeline with a single `grep -rl --include='*IT.java'` call.
That uses one grep invocation, so its exit code reflects whether any
match was found across the entire tree (which is always 0 here).
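The failure mode can be reproduced in a scratch directory. In this sketch, `xargs -n 1` forces one file per grep call as a stand-in for the small-ARG_MAX batching on Git Bash; the file names and contents are made up for the demonstration.

```shell
# Scratch tree: one file that matches the category, one that does not.
demo_dir=$(mktemp -d)
printf '@Category({LocalStandaloneIT.class})\n' > "$demo_dir/AIT.java"
printf 'class B {}\n' > "$demo_dir/BIT.java"

# Batched form: the grep call that receives BIT.java finds nothing and
# exits 1, so xargs reports failure overall (status 123 with GNU xargs).
batched_status=0
find "$demo_dir" -name '*IT.java' -print0 \
  | xargs -0 -n 1 grep -l 'LocalStandaloneIT' > /dev/null || batched_status=$?

# Single-invocation form: one grep walks the whole tree and exits 0 as
# long as at least one file matched.
single_status=0
grep -rl --include='*IT.java' 'LocalStandaloneIT' "$demo_dir" > /dev/null \
  || single_status=$?

echo "batched=$batched_status single=$single_status"
rm -rf "$demo_dir"
```

Under `set -o pipefail`, the non-zero status from the batched form is what fails the step, even though the matching files were still printed.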

Local counts on macOS confirm the logic is preserved:
- LocalStandaloneIT: 276 classes
- TableLocalStandaloneIT: 231 classes

The previous attempt wrote the generated shard list to
integration-test/it-shard.txt. That path is inside the repo and not
covered by the root pom.xml's RAT excludes (which only excludes
**/target/**), so the license check started warning:

  Files with unapproved licenses:
    D:/a/iotdb/iotdb/integration-test/it-shard.txt

We can't use a target/ subdirectory because `mvn clean verify` wipes it
before our shard file would be read. Instead, write the file to
$RUNNER_TEMP/it-shard.txt — the runner-scoped tmp dir is outside the
repository entirely, so RAT never sees it. Update both
-Dfailsafe.includesFile invocations to match.
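A minimal sketch of the relocated shard file follows. The fallback to `TMPDIR` or `/tmp` is an assumption added so the snippet also runs outside GitHub Actions, the list content is a placeholder, and the other Maven arguments are left elided.

```shell
# RUNNER_TEMP is set by GitHub Actions; the fallback is for local runs only.
shard_file="${RUNNER_TEMP:-${TMPDIR:-/tmp}}/it-shard.txt"

# Placeholder content; the real list comes from the shard selection step.
printf '%s\n' '**/ExampleIT.java' > "$shard_file"

# Print the invocation shape rather than running Maven here.
echo "mvn clean verify ... -Dfailsafe.includesFile=$shard_file"
```

Because the file lives outside the checkout, neither RAT nor `mvn clean` touches it.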
codecov Bot commented May 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 40.39%. Comparing base (2f57fd6) to head (02ef20a).

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17692      +/-   ##
============================================
- Coverage     40.39%   40.39%   -0.01%     
  Complexity     2574     2574              
============================================
  Files          5179     5179              
  Lines        349628   349628              
  Branches      44683    44683              
============================================
- Hits         141243   141239       -4     
- Misses       208385   208389       +4     


@JackieTien97 JackieTien97 merged commit 81cab40 into master May 17, 2026
34 checks passed
@JackieTien97 JackieTien97 deleted the speedup-windows-ci branch May 17, 2026 02:54