
Shard subscription-tree-regression-consumer to speed up Multi-Cluster IT #17694

Closed
JackieTien97 wants to merge 1 commit into master from shard-subscription-consumer-it

Conversation

@JackieTien97
Contributor

Summary

  • Splits the subscription-tree-regression-consumer job in pipe-it.yml into 3 parallel matrix shards. Expected: ~30–45 min → ~12–18 min per shard, removing it as the Multi-Cluster IT pipeline's long pole.
  • Reuses the $RUNNER_TEMP/it-shard.txt + -Dfailsafe.includesFile pattern from PR #17692, "Shard Windows IT jobs to speed up 1C1D and Table 1C1D CI" (the 1C1D Windows sharding work).
  • Emits paths relative to src/test/java/ (e.g. org/apache/iotdb/.../IoTDBFooIT.java) rather than bare class names — needed because this suite has 6 pairs of duplicate simple names across pushconsumer/multi/ and pullconsumer/multi/ that would otherwise run twice across shards.
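The duplicate-name problem described above can be checked directly before choosing a list format. The following is a minimal sketch, not the PR's actual tooling: the helper name `duplicate_simple_names` is hypothetical, and the default root simply mirrors the path used in the verification commands.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print every simple class name that appears more than
# once under a test source root. Any name printed here would be ambiguous in
# a shard list made of bare class names.
duplicate_simple_names() {
  local root="${1:-integration-test/src/test/java}"
  find "$root" -name '*IT.java' \
    | sed 's|.*/||; s|\.java$||' \
    | sort | uniq -d
}

# Usage (from the repository root):
#   duplicate_simple_names integration-test/src/test/java
```

An empty result would mean bare names are safe; any output is a reason to emit paths relative to src/test/java/ instead.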

Why this job specifically

The Multi-Cluster IT pipeline currently runs 11 parallel jobs. All others finish in 10–20 min; subscription-tree-regression-consumer runs 72 IT classes serially in a single forkCount=1 JVM, each restarting a 2-cluster environment. That single job dictates the whole pipeline's wall clock on every PR.

Other subscription/dual-cluster jobs in pipe-it.yml are not sharded:

  • subscription-tree-regression-misc (13 classes) — borderline; defer to follow-up
  • arch-verification jobs (1–4 classes each) — sharding overhead would exceed savings
  • dual-tree/dual-table jobs (9–13 classes) — already under the new shard wall clock

Local verification

$ grep -rlE --include='*IT.java' '\bMultiClusterIT2SubscriptionTreeRegressionConsumer\b' integration-test/src/test/java | wc -l
72
$ for s in 0 1 2; do
    grep -rlE --include='*IT.java' '\bMultiClusterIT2SubscriptionTreeRegressionConsumer\b' integration-test/src/test/java \
      | sed 's|.*/src/test/java/||' | sort | awk -v s=$s -v t=3 'NR%t==s' | wc -l
  done
24
24
24
$ # And unique-paths == total (i.e. no collisions after disambiguation):
$ grep -rlE --include='*IT.java' '\bMultiClusterIT2SubscriptionTreeRegressionConsumer\b' integration-test/src/test/java | sed 's|.*/src/test/java/||' | sort -u | wc -l
72
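The counts above show that the shards are balanced and the paths unique, but not that the shards are disjoint and together cover the whole list. A sketch of that extra check follows, using the same sort and line-number-modulo split as the commands above. The function names `shard` and `check_partition` are hypothetical and are not part of the workflow.

```shell
#!/usr/bin/env bash
set -euo pipefail

# shard INDEX TOTAL: read paths on stdin, sort them, and emit the lines whose
# (1-based) line number modulo TOTAL equals INDEX.
shard() {
  sort | awk -v s="$1" -v t="$2" 'NR % t == s'
}

# check_partition TOTAL LIST_FILE: succeed only if concatenating all shards
# reproduces the de-duplicated list exactly (nothing missing, nothing twice).
check_partition() {
  local total="$1" list="$2" i union
  union="$(for ((i = 0; i < total; i++)); do shard "$i" "$total" < "$list"; done | sort)"
  [ "$union" = "$(sort -u "$list")" ]
}

# Usage, with the path list saved to a file first:
#   check_partition 3 /tmp/it-paths.txt && echo "shards form a partition"
```

Because the union is compared against `sort -u`, a duplicated path in the input makes the check fail rather than pass silently.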

Test plan

  • CI: 3 parallel subscription-tree-regression-consumer (…, 0/1/2) jobs appear under Multi-Cluster IT and all go green.
  • Each shard finishes in ~12–18 min (down from ~30–45 min for the single un-sharded job).
  • No `Files with unapproved licenses` warning referencing `it-shard.txt` in any shard's log.
  • Union of executed test classes across the 3 shards == 72 (no class missing, no class run twice — verify via surefire-reports artifacts).
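The last item of the test plan could be checked mechanically once the report artifacts from all shards are downloaded. This sketch assumes the reports follow the usual Surefire/Failsafe naming of one `TEST-<fully.qualified.ClassName>.xml` file per class, and that each shard's artifact is unpacked into its own directory. The function names are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# List the class names recorded in the XML reports under the given directories.
# Assumption: report files are named TEST-<fully.qualified.ClassName>.xml.
executed_classes() {
  find "$@" -name 'TEST-*.xml' \
    | sed 's|.*/TEST-||; s|\.xml$||'
}

# Print how many classes ran in total and how many ran in more than one shard.
report_summary() {
  local all
  all="$(executed_classes "$@" | sort)"
  echo "total=$(printf '%s\n' "$all" | grep -c . || true)"
  echo "duplicates=$(printf '%s\n' "$all" | uniq -d | grep -c . || true)"
}

# Usage, with one directory per downloaded shard artifact:
#   report_summary shard0/ shard1/ shard2/
```

For this PR the expected outcome would be total=72 and duplicates=0.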

Tracker

This is item #2 from Remaining bottlenecks in the CI optimization status doc. The other two (AINode cold build and broader Subscription/daily-it sharding) will be addressed separately.

The Multi-Cluster IT pipeline (pipe-it.yml) runs 11 parallel jobs on every
PR. Of those, subscription-tree-regression-consumer is the longest pole:
72 IT classes annotated with
@Category(MultiClusterIT2SubscriptionTreeRegressionConsumer.class), each
restarting two ScalableSingleNodeMode clusters in setUp(), executed
serially in a single forkCount=1 JVM. Estimated wall clock ~30-45 min,
while every other job in the workflow finishes in ~10-20 min.

Split this job into 3 parallel matrix shards using the same hash-mod
pattern that cluster-it-1c1d.yml introduced (commits 89748f1,
a343cf5, 02ef20a). Each shard runs ~24 of the 72 classes and is
expected to finish in ~12-18 min, removing this job as the workflow's
bottleneck. The shard list is written to $RUNNER_TEMP/it-shard.txt for
the same RAT-avoidance reason as 1C1D.
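
A rough sketch of what one shard's list-generation step might look like,
based on the local verification commands in this PR rather than on the
actual workflow file. The function name and its argument order are
hypothetical; only the grep/sed/sort/awk pipeline, the it-shard.txt
location, and the -Dfailsafe.includesFile property come from the PR text.

```shell
#!/usr/bin/env bash
set -euo pipefail

# write_shard_file INDEX TOTAL OUT [ROOT]
# Collect the IT classes carrying the category annotation, normalize them to
# paths relative to src/test/java/, and keep only this shard's lines.
write_shard_file() {
  local index="$1" total="$2" out="$3"
  local root="${4:-integration-test/src/test/java}"
  grep -rlE --include='*IT.java' \
      '\bMultiClusterIT2SubscriptionTreeRegressionConsumer\b' "$root" \
    | sed 's|.*/src/test/java/||' | sort \
    | awk -v s="$index" -v t="$total" 'NR % t == s' > "$out"
}

# Usage inside a matrix job (the shard index would come from the matrix):
#   write_shard_file 0 3 "$RUNNER_TEMP/it-shard.txt"
#   mvn ... -Dfailsafe.includesFile="$RUNNER_TEMP/it-shard.txt"
# The remaining Maven arguments are whatever the existing job already passes.
```

Writing the list under $RUNNER_TEMP keeps it outside the checked-out
source tree, so the license check never sees it.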

Two deviations from the 1C1D pattern:

1. The shard list emits paths relative to src/test/java/ (e.g.,
   org/apache/iotdb/.../IoTDBFooIT.java) instead of bare class names.
   This suite has 6 pairs of duplicate simple names across
   pushconsumer/multi/ and pullconsumer/multi/ (e.g.,
   IoTDBOneConsumerMultiTopicsTsfileIT exists in both). Bare names would
   cause failsafe to match both files for each entry, running those 6
   classes twice across shards.

2. The other subscription / dual-cluster jobs in this workflow are not
   sharded. subscription-tree-regression-misc (13 classes) is borderline;
   arch-verification jobs (1-4 classes each) and dual-tree/dual-table
   jobs (9-13 classes) are well under the new shard wall clock and would
   not benefit. Revisit if any of them becomes the new long pole.

Local counts on macOS:
- Total classes matching the annotation: 72
- Per-shard distribution after hash-mod: 24/24/24
- Unique paths after sed normalization: 72 (no collisions)

@JackieTien97
Contributor Author

Closing — this PR optimized the wrong target.

After looking at actual durations from 3 recent successful Multi-Cluster IT runs, subscription-tree-regression-consumer was already finishing in ~5 minutes unsharded; my pre-PR estimate of 30-45 min was wrong. The real long poles are the dual-* jobs:

| Job | Duration | Classes |
|---|---|---|
| dual-table-manual-basic | ~63 min | 13 |
| dual-table-manual-enhanced | ~62 min | 11 |
| dual-tree-auto-enhanced | ~51 min | 9 |
| dual-tree-auto-basic | ~42 min | 12 |
| dual-tree-manual | ~27 min | 11 |
| subscription-tree-regression-consumer | ~5 min | 72 |

Sharding the subscription job added ~10 runner-min per CI run for zero wall-clock benefit. A follow-up PR will shard the dual-* jobs above instead, which should cut Multi-Cluster IT wall clock from ~63 min to ~22 min.
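
The reasoning can be made concrete: jobs in the pipeline run in parallel, so its wall clock is the duration of the slowest job. A small sketch using the approximate durations listed above follows. The 2-minute figure for a sharded subscription job is a made-up value for illustration only.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Approximate job durations in minutes, copied from the figures above.
durations() {
  cat <<'EOF'
dual-table-manual-basic 63
dual-table-manual-enhanced 62
dual-tree-auto-enhanced 51
dual-tree-auto-basic 42
dual-tree-manual 27
subscription-tree-regression-consumer 5
EOF
}

# Parallel jobs finish when the slowest one does: wall clock = max(duration).
wall_clock() {
  awk '$2 + 0 > max + 0 { max = $2; job = $1 } END { print job, max }'
}

# Current state: the slowest job sets the wall clock.
durations | wall_clock

# Hypothetical: even if sharding cut the subscription job to 2 min,
# the maximum (and so the wall clock) would not move.
durations \
  | sed 's/^subscription-tree-regression-consumer 5$/subscription-tree-regression-consumer 2/' \
  | wall_clock
```

Both invocations report dual-table-manual-basic at 63 minutes, which is why only sharding the dual-* jobs can shorten the pipeline.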

@JackieTien97 JackieTien97 deleted the shard-subscription-consumer-it branch May 17, 2026 02:11
@codecov

codecov Bot commented May 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 40.39%. Comparing base (2f57fd6) to head (b996d09).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17694      +/-   ##
============================================
- Coverage     40.39%   40.39%   -0.01%     
  Complexity     2574     2574              
============================================
  Files          5179     5179              
  Lines        349628   349628              
  Branches      44683    44683              
============================================
- Hits         141243   141215      -28     
- Misses       208385   208413      +28     
