Speed up parallel 'make check' scheduling #14745
Summary:
Reduce wall-clock time of parallel 'make check' by improving the scheduling and granularity of slow test binaries.

Three Makefile changes:
1) Refresh slow_test_regexp with current observed bottlenecks. Adds binaries (point_lock_manager_stress_test, compaction_service_test, corruption_test, comparator_db_test, external_sst_file_basic_test, rate_limiter_test, db_compaction_test, db_merge_operator_test, db_dynamic_level_test, db_bloom_filter_test, error_handler_fs_test, merge_helper_test, db_kv_checksum_test, inlineskiplist_test) whose shards take >=15s but were not being front-loaded for early queueing. Also drops stale FIXME comments that no longer apply and adds tier annotations + a maintenance recipe.
2) Add SHARD_SIZE_OVERRIDES, a per-binary override of GTEST_SHARD_SIZE, so binaries with slow individual tests (e.g. point_lock_manager_stress_test where each test is ~10s) can be chopped into more, smaller shards. The default of 10 stays for everything else. Each shard's effective size is reported in the 'Generating ... shards for ...' line.
3) Add 'make suggest-slow-tests' to print a per-binary aggregation of the most recent LOG (max single-shard time, total time, shard count) for any binary worth attention. Used to maintain the regex and override list above.

Test Plan:
Two runs each of 'make -j166 check', before and after this change (all compilation already finished):
Before: 197s and 198s
After: 123s and 125s
Reduction: 37%
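The front-loading idea in change 1 can be sketched in a few lines of bash. This is only an illustration under stated assumptions, not the Makefile's actual recipe: the pattern is a small subset of the binaries named above, and the `-shard-N` job names and the function name `front_load_slow_tests` are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: an illustrative subset of the slow binaries named in the
# summary, NOT the real slow_test_regexp from the Makefile.
slow_test_regexp='^(point_lock_manager_stress_test|db_compaction_test|corruption_test)'

# Read job names on stdin and print the ones matching the slow pattern
# first, so a parallel runner would start them early. Relative order
# inside each group is preserved.
front_load_slow_tests() {
  local jobs
  jobs=$(cat)
  [ -n "$jobs" ] || return 0
  # grep exits non-zero when nothing matches; that is not an error here.
  grep -E  -- "$slow_test_regexp" <<<"$jobs" || true
  grep -Ev -- "$slow_test_regexp" <<<"$jobs" || true
}

# Usage with hypothetical job names:
printf '%s\n' env_test-shard-0 db_compaction_test-shard-1 \
  util_test-shard-0 corruption_test-shard-0 | front_load_slow_tests
```

Starting the longest jobs first matters because a slow shard that is queued late can end up running alone after every other worker has gone idle.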
✅ clang-tidy: No findings on changed lines. Completed in 0.0s.
@pdillinger has imported this pull request. If you are a Meta employee, you can view this in D105332444.
🟡 Codex Code Review
Auto-triggered after CI passed, reviewing commit 18ebb6f
❌ Codex review failed before producing findings.
✅ Claude Code Review
Auto-triggered after CI passed, reviewing commit 18ebb6f

Summary
Clean, well-designed Makefile change that improves
High-severity findings (0):

Findings
🟡 MEDIUM
M1. No validation of SHARD_SIZE before arithmetic

| Concern | Status | Notes |
|---|---|---|
| CI sharding (CI_TOTAL_SHARDS) | Safe | Modular arithmetic guarantees complete coverage regardless of shard count. More shards = better load balancing across CI workers, not lost coverage. |
| valgrind_check_0 | Beneficial | Uses prioritize_long_running_tests; expanded slow_test_regexp improves valgrind scheduling too. |
| Platform compatibility | Safe | for/case/${o#*:} are POSIX-compliant; Makefile sets SHELL := bash. |
| Prefix matching (db_test vs db_test2) | Safe | Case pattern "$TEST_BINARY":* matches the whole override token, not a substring. Verified: db_test:3 does NOT match when TEST_BINARY=db_test2. |
| PARALLEL_TEST membership | Safe | Overrides for non-parallel binaries are silently ignored (no harm). |
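
The prefix-matching row can be checked with a small bash sketch of the lookup the table describes (a `for` loop, a `case` pattern of the form `"$TEST_BINARY":*`, and `${o#*:}`). The function name `shard_size_for`, its argument order, and the override values are assumptions made for illustration; the real Makefile recipe may be structured differently.

```bash
#!/usr/bin/env bash
# Resolve the shard size for one test binary.
#   $1: test binary name
#   $2: space-separated "name:size" overrides (hypothetical values below)
#   $3: default shard size (the PR states the default is 10)
shard_size_for() {
  local test_binary=$1 overrides=$2 size=${3:-10} o
  for o in $overrides; do            # intentional word splitting on spaces
    case "$o" in
      # The literal ':' after the name is what prevents db_test:3 from
      # matching when the binary is db_test2.
      "$test_binary":*) size=${o#*:} ;;   # drop everything up to the first ':'
    esac
  done
  printf '%s\n' "$size"
}

overrides='db_test:3 point_lock_manager_stress_test:1'
shard_size_for db_test "$overrides"     # prints 3
shard_size_for db_test2 "$overrides"    # prints 10 (falls back to the default)
```

Because the pattern anchors on the whole `name:` token, an override for a binary that is never sharded is simply never consulted, which matches the PARALLEL_TEST row.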
Positive Observations
- The SHARD_SIZE_OVERRIDES mechanism is clean and extensible; ?= allows user/CI override.
- The suggest-slow-tests target is a valuable maintenance tool that closes the feedback loop.
- Tier annotations in slow_test_regexp comments aid future maintenance.
- The echo line now includes shard_size= for observability.
- The 37% wall-clock improvement with a small, low-risk change is excellent ROI.
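
The per-binary aggregation that suggest-slow-tests reports (max single-shard time, total time, shard count) might look roughly like the sketch below. It assumes a simplified input of one `<binary> <seconds>` line per shard, which is not the real LOG format, and the function name `summarize_shards` is hypothetical. The 15-second default threshold mirrors the ">=15s" criterion in the PR summary.

```bash
#!/usr/bin/env bash
# Input (ASSUMED format, not the real LOG): one line per shard,
#   "<binary> <seconds>"
# Output: "<binary> max=<s> total=<s> shards=<n>" for every binary whose
# slowest shard took at least $1 seconds (default 15), slowest first.
summarize_shards() {
  local threshold=${1:-15}
  awk -v t="$threshold" '
    {
      n[$1]++                              # shard count per binary
      total[$1] += $2                      # summed time per binary
      if ($2 > max[$1]) max[$1] = $2       # slowest single shard
    }
    END {
      for (b in n)
        if (max[b] >= t)
          printf "%s max=%d total=%d shards=%d\n", b, max[b], total[b], n[b]
    }' | sort -t= -k2,2nr                  # order by the max= value, descending
}

# Usage with made-up timings:
summarize_shards 15 <<'EOF'
point_lock_manager_stress_test 10
point_lock_manager_stress_test 30
db_test 4
db_test 5
corruption_test 16
EOF
```

A high max with a low shard count suggests a candidate for a smaller shard size, while a high max alone suggests adding the binary to the front-loaded list.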
@pdillinger merged this pull request in c48b020.