Skip to content

Speed up parallel 'make check' scheduling#14745

Closed
pdillinger wants to merge 1 commit into
facebook:mainfrom
pdillinger:speed_up_make_check
Closed

Speed up parallel 'make check' scheduling#14745
pdillinger wants to merge 1 commit into
facebook:mainfrom
pdillinger:speed_up_make_check

Conversation

@pdillinger
Copy link
Copy Markdown
Contributor

Summary:
Reduce wall-clock time of parallel 'make check' by improving the scheduling and granularity of slow test binaries.

Three Makefile changes:

  1. Refresh slow_test_regexp with current observed bottlenecks. Adds
    binaries (point_lock_manager_stress_test, compaction_service_test,
    corruption_test, comparator_db_test, external_sst_file_basic_test,
    rate_limiter_test, db_compaction_test, db_merge_operator_test,
    db_dynamic_level_test, db_bloom_filter_test, error_handler_fs_test,
    merge_helper_test, db_kv_checksum_test, inlineskiplist_test) whose
    shards take >=15s but were not being front-loaded for early
    queueing. Also drops stale FIXME comments that no longer apply
    and adds tier annotations + a maintenance recipe.

  2. Add SHARD_SIZE_OVERRIDES, a per-binary override of GTEST_SHARD_SIZE,
    so binaries with slow individual tests (e.g.
    point_lock_manager_stress_test where each test is ~10s) can be
    chopped into more, smaller shards. The default of 10 stays for
    everything else. Each shard's effective size is reported in the
    'Generating ... shards for ...' line.

  3. Add 'make suggest-slow-tests' to print a per-binary aggregation of
    the most recent LOG (max single-shard time, total time, shard
    count) for any binary worth attention. Used to maintain the regex
    and override list above.

Test Plan:
Two runs each of 'make -j166 check', before and after this change (all compilation already finished):
Before: 197s and 198s
After: 123s and 125s
Reduction: 37%

Summary:
Reduce wall-clock time of parallel 'make check' by improving the
scheduling and granularity of slow test binaries.

Three Makefile changes:

1) Refresh slow_test_regexp with current observed bottlenecks. Adds
   binaries (point_lock_manager_stress_test, compaction_service_test,
   corruption_test, comparator_db_test, external_sst_file_basic_test,
   rate_limiter_test, db_compaction_test, db_merge_operator_test,
   db_dynamic_level_test, db_bloom_filter_test, error_handler_fs_test,
   merge_helper_test, db_kv_checksum_test, inlineskiplist_test) whose
   shards take >=15s but were not being front-loaded for early
   queueing. Also drops stale FIXME comments that no longer apply
   and adds tier annotations + a maintenance recipe.

2) Add SHARD_SIZE_OVERRIDES, a per-binary override of GTEST_SHARD_SIZE,
   so binaries with slow individual tests (e.g.
   point_lock_manager_stress_test where each test is ~10s) can be
   chopped into more, smaller shards. The default of 10 stays for
   everything else. Each shard's effective size is reported in the
   'Generating ... shards for ...' line.

3) Add 'make suggest-slow-tests' to print a per-binary aggregation of
   the most recent LOG (max single-shard time, total time, shard
   count) for any binary worth attention. Used to maintain the regex
   and override list above.

Test Plan:
Two runs each of 'make -j166 check', before and after this change (all
compilation already finished):
Before: 197s and 198s
After: 123s and 125s
Reduction: 37%
@pdillinger pdillinger requested a review from xingbowang May 15, 2026 16:01
@meta-cla meta-cla Bot added the CLA Signed label May 15, 2026
@github-actions
Copy link
Copy Markdown

✅ clang-tidy: No findings on changed lines

Completed in 0.0s.

@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented May 15, 2026

@pdillinger has imported this pull request. If you are a Meta employee, you can view this in D105332444.

@github-actions
Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 18ebb6f


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase

@github-actions
Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 18ebb6f


Summary

Clean, well-designed Makefile change that improves make check wall-clock time by ~37% through better shard sizing and test prioritization. The approach is sound, builds incrementally on existing infrastructure, and includes a useful maintenance tool (suggest-slow-tests).

High-severity findings (0):
No high-severity findings.

Full review (click to expand)

Findings

🟡 MEDIUM

M1. No validation of SHARD_SIZE before arithmetic — Makefile:~910
  • Issue: If a user overrides SHARD_SIZE_OVERRIDES with an entry like binary:0, binary:abc, or binary: (empty after colon), the shell arithmetic NUM_SHARDS=$$(( (TEST_COUNT + SHARD_SIZE - 1) / SHARD_SIZE )) will fail with division by zero (bash treats non-numeric strings as 0 in arithmetic context).
  • Root cause: No validation that SHARD_SIZE > 0 after the override lookup loop.
  • Suggested fix: Add a guard after the loop:
    if [ "$$SHARD_SIZE" -le 0 ] 2>/dev/null || [ -z "$$SHARD_SIZE" ]; then SHARD_SIZE=$(GTEST_SHARD_SIZE); fi; \
    Note: this is a defensive measure — the default values in the Makefile are all valid, so this only triggers on user misconfiguration.

🟢 LOW / NIT

L1. suggest-slow-tests inconsistent naming for non-sharded tests — Makefile:~985
  • Issue: Non-parallel tests appear in LOG as ./binary_name rather than t/run-binary-shard-N. The awk sub(/^t\/run-/, "", name) doesn't strip the ./ prefix, so non-sharded tests are listed separately from their sharded counterparts.
  • Suggested fix: Add sub(/^\.\//, "", name); after the existing sub calls.
L2. slow_test_regexp and SHARD_SIZE_OVERRIDES could document cross-reference — Makefile:~949
  • Issue: All binaries in SHARD_SIZE_OVERRIDES are also in slow_test_regexp (verified), but the relationship isn't documented. Future maintainers might update one without the other.
  • Suggested fix: Add a one-line comment noting that override binaries should also appear in slow_test_regexp for front-loading.
L3. Consider break after case match in override loop — Makefile:~905
  • Issue: The for/case loop continues iterating after a match, so if duplicate entries exist for the same binary, the last one wins silently. Adding break after the match would make first-wins semantics explicit and marginally faster.
  • Suggested fix: Change SHARD_SIZE=$${o#*:} ;; to SHARD_SIZE=$${o#*:}; break ;; (optional — current behavior is correct for the default values).

Cross-Component Analysis

Concern Status Notes
CI sharding (CI_TOTAL_SHARDS) Safe Modular arithmetic guarantees complete coverage regardless of shard count. More shards = better load balancing across CI workers, not lost coverage.
valgrind_check_0 Beneficial Uses prioritize_long_running_tests; expanded slow_test_regexp improves valgrind scheduling too.
Platform compatibility Safe for/case/${o#*:} are POSIX-compliant; Makefile sets SHELL := bash.
Prefix matching (db_test vs db_test2) Safe Case pattern "$TEST_BINARY":* matches the whole override token, not a substring. Verified: db_test:3 does NOT match when TEST_BINARY=db_test2.
PARALLEL_TEST membership Safe Overrides for non-parallel binaries are silently ignored (no harm).

Positive Observations

  • The SHARD_SIZE_OVERRIDES mechanism is clean and extensible — ?= allows user/CI override.
  • The suggest-slow-tests target is a valuable maintenance tool that closes the feedback loop.
  • Tier annotations in slow_test_regexp comments aid future maintenance.
  • The echo line now includes shard_size= for observability.
  • The 37% wall-clock improvement with a small, low-risk change is excellent ROI.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented May 18, 2026

@pdillinger merged this pull request in c48b020.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant