Shard subscription-tree-regression-consumer to speed up Multi-Cluster IT #17694
JackieTien97 wants to merge 1 commit
The Multi-Cluster IT pipeline (pipe-it.yml) runs 11 parallel jobs on every PR. Of those, subscription-tree-regression-consumer is the longest pole: 72 IT classes annotated with @Category(MultiClusterIT2SubscriptionTreeRegressionConsumer.class), each restarting two ScalableSingleNodeMode clusters in setUp(), executed serially in a single forkCount=1 JVM. Estimated wall clock is ~30-45 min, while every other job in the workflow finishes in ~10-20 min.

Split this job into 3 parallel matrix shards using the same hash-mod pattern that cluster-it-1c1d.yml introduced (commits 89748f1, a343cf5, 02ef20a). Each shard runs ~24 of the 72 classes and is expected to finish in ~12-18 min, removing this job as the workflow's bottleneck.

The shard list is written to $RUNNER_TEMP/it-shard.txt for the same RAT-avoidance reason as 1C1D.

Two deviations from the 1C1D pattern:

1. The shard list emits paths relative to src/test/java/ (e.g., org/apache/iotdb/.../IoTDBFooIT.java) instead of bare class names. This suite has 6 pairs of duplicate simple names across pushconsumer/multi/ and pullconsumer/multi/ (e.g., IoTDBOneConsumerMultiTopicsTsfileIT exists in both). Bare names would cause failsafe to match both files for each entry, running those 6 classes twice across shards.
2. The other subscription / dual-cluster jobs in this workflow are not sharded. subscription-tree-regression-misc (13 classes) is borderline; arch-verification jobs (1-4 classes each) and dual-tree/dual-table jobs (9-13 classes) are well under the new shard wall clock and would not benefit. Revisit if any of them becomes the new long pole.

Local counts on macOS:

- Total classes matching the annotation: 72
- Per-shard distribution after hash-mod: 24/24/24
- Unique paths after sed normalization: 72 (no collisions)
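The hash-mod split described above can be sketched in Python. This is only an illustration of the idea, not the workflow's actual script: the CRC32 hash, the function names, and the `org/example/...` paths are assumptions made for the example. The one detail taken from the description is that entries are paths relative to src/test/java/ rather than bare class names.

```python
import zlib

SHARD_COUNT = 3  # the PR splits the job into 3 matrix shards


def shard_of(relative_path: str, shard_count: int) -> int:
    """Map a path to a shard index with a stable hash.

    CRC32 is an assumption for this sketch; any hash that gives the same
    value on every runner would work. Python's built-in hash() is salted
    per process for strings, so it is not suitable here.
    """
    return zlib.crc32(relative_path.encode("utf-8")) % shard_count


def select_shard(relative_paths, shard_index, shard_count):
    """Return the sorted paths that belong to one shard."""
    return sorted(
        p for p in relative_paths if shard_of(p, shard_count) == shard_index
    )


# Hypothetical paths relative to src/test/java/. Two of them share a simple
# class name, which is why full relative paths are used as the shard key.
HYPOTHETICAL_PATHS = [
    "org/example/pushconsumer/multi/IoTDBOneConsumerMultiTopicsTsfileIT.java",
    "org/example/pullconsumer/multi/IoTDBOneConsumerMultiTopicsTsfileIT.java",
    "org/example/pushconsumer/IoTDBFooIT.java",
    "org/example/pullconsumer/IoTDBBarIT.java",
]

shards = [
    select_shard(HYPOTHETICAL_PATHS, i, SHARD_COUNT) for i in range(SHARD_COUNT)
]

for index, shard in enumerate(shards):
    # One path per line, the shape an includes file would take.
    print(f"shard {index}:")
    print("\n".join(shard))
```

Each path lands in exactly one shard, and the assignment is the same on every run. An even 24/24/24 split is not guaranteed by hashing alone, so the per-shard counts still need to be checked for the real class list.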
Closing: this PR optimized the wrong target. Actual durations from 3 recent successful Multi-Cluster IT runs show that sharding subscription added ~10 runner-min per CI run for zero wall-clock benefit. A follow-up PR will shard the dual-* jobs instead, which should cut Multi-Cluster IT wall clock from ~63 min to ~22 min.
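The reasoning behind the closing comment can be illustrated with a small model: jobs in the pipeline run in parallel, so wall clock is the maximum job duration, while runner minutes are the sum. The sketch below uses hypothetical numbers. Only the ~63 min wall clock and the ~10 extra runner minutes come from the comment; the 5 min per-job overhead, the 15 min of test work, and the job names are assumptions chosen to reproduce those figures.

```python
def shard_durations(overhead_min: float, work_min: float, shard_count: int):
    """Each shard pays the fixed setup overhead plus its share of the work."""
    return [overhead_min + work_min / shard_count] * shard_count


def wall_clock(jobs: dict) -> float:
    """Parallel jobs finish when the slowest one finishes."""
    return max(jobs.values())


def runner_minutes(jobs: dict) -> float:
    """Billing-style cost: every job's duration counts."""
    return sum(jobs.values())


# Hypothetical durations in minutes (assumptions, not measured data).
OVERHEAD, WORK = 5.0, 15.0
longest_other_job = {"dual-job (hypothetical)": 63.0}

before = {**longest_other_job, "subscription": OVERHEAD + WORK}

after = dict(longest_other_job)
for i, minutes in enumerate(shard_durations(OVERHEAD, WORK, 3)):
    after[f"subscription-shard-{i}"] = minutes

wall_before = wall_clock(before)
wall_after = wall_clock(after)
extra_runner_minutes = runner_minutes(after) - runner_minutes(before)

print(wall_before, wall_after, extra_runner_minutes)
```

Under these assumptions the wall clock stays at 63 min before and after, because the sharded job was never the slowest one, while the repeated per-shard overhead adds 10 runner minutes.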
Codecov Report

✅ All modified and coverable lines are covered by tests.

@@             Coverage Diff              @@
##             master   #17694      +/-   ##
============================================
- Coverage     40.39%   40.39%   -0.01%
  Complexity     2574     2574
============================================
  Files          5179     5179
  Lines        349628   349628
  Branches      44683    44683
============================================
- Hits         141243   141215      -28
- Misses       208385   208413      +28



Summary

- Split the subscription-tree-regression-consumer job in pipe-it.yml into 3 parallel matrix shards. Expected: ~30–45 min → ~12–18 min per shard, removing it as the Multi-Cluster IT pipeline's long pole.
- Use the $RUNNER_TEMP/it-shard.txt + -Dfailsafe.includesFile pattern from PR "Shard Windows IT jobs to speed up 1C1D and Table 1C1D CI" #17692 (the 1C1D Windows sharding work).
- Emit shard-list paths relative to src/test/java/ (e.g. org/apache/iotdb/.../IoTDBFooIT.java) rather than bare class names. This is needed because this suite has 6 pairs of duplicate simple names across pushconsumer/multi/ and pullconsumer/multi/ that would otherwise run twice across shards.

Why this job specifically

The Multi-Cluster IT pipeline currently runs 11 parallel jobs. All others finish in 10–20 min; subscription-tree-regression-consumer runs 72 IT classes serially in a single forkCount=1 JVM, each restarting a 2-cluster environment. That single job dictates the whole pipeline's wall clock on every PR.

Other subscription/dual-cluster jobs in pipe-it.yml are not sharded:

- subscription-tree-regression-misc (13 classes): borderline; defer to follow-up

Local verification
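A local check of the kind mentioned in this PR (total count, duplicate simple names, unique relative paths) could look roughly like the following. It is a sketch under stated assumptions: the helper name and the `org/example/...` paths are hypothetical, and only the class name IoTDBOneConsumerMultiTopicsTsfileIT and the pushconsumer/multi/ and pullconsumer/multi/ directories come from the description.

```python
from collections import Counter
from pathlib import PurePosixPath


def duplicate_simple_names(relative_paths):
    """Return simple class names that occur under more than one path."""
    counts = Counter(PurePosixPath(p).stem for p in relative_paths)
    return sorted(name for name, n in counts.items() if n > 1)


# Hypothetical shard-list entries, relative to src/test/java/.
paths = [
    "org/example/pushconsumer/multi/IoTDBOneConsumerMultiTopicsTsfileIT.java",
    "org/example/pullconsumer/multi/IoTDBOneConsumerMultiTopicsTsfileIT.java",
    "org/example/pushconsumer/multi/IoTDBFooIT.java",
]

duplicates = duplicate_simple_names(paths)
paths_are_unique = len(set(paths)) == len(paths)

print("total entries:", len(paths))
print("duplicate simple names:", duplicates)
print("relative paths unique:", paths_are_unique)
```

In this toy list the simple name collides once while the relative paths stay unique, which is the situation that motivates using paths instead of bare class names.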
Test plan

- The subscription-tree-regression-consumer (…, 0/1/2) jobs appear under Multi-Cluster IT and all go green.

Tracker

This is item #2 from Remaining bottlenecks in the CI optimization status doc. The other two (AINode cold build and broader Subscription/daily-it sharding) will be addressed separately.