feat!: implement group-level pause and resume#81

Merged: deepjoy merged 8 commits into main from pause-all-resume-all on Mar 24, 2026

@deepjoy (Owner) commented Mar 24, 2026

Summary

  • Add PauseReasons bitmask (PREEMPTION | MODULE | GLOBAL | GROUP) so multiple
    pause sources can coexist without stranding tasks — a task only resumes when
    all reasons are cleared. Retrofit every existing pause/resume path (preemption,
    module, global) to use bitmask operations instead of bare status flips.
  • Add paused_groups SQLite table and pause_reasons column on tasks, with
    in-memory HashSet mirror for fast gate checks during dispatch.
  • Implement Scheduler::pause_group / resume_group / pause_group_until with
    full lifecycle: persist state → update in-memory set → pause pending tasks →
    cancel running tasks → emit GroupPaused/GroupResumed events. Time-boxed
    pauses auto-resume via a throttled (5 s) run-loop check.
  • Gate admission rejects tasks from paused groups; new submissions are accepted
    but inserted directly as paused with the GROUP bit. Recurring next-instances and
    blocked→pending transitions also check paused_groups and downgrade accordingly.
  • Expose group pause API on DomainHandle and ModuleHandle (delegates to scheduler).
  • Add 420-line integration test suite covering submit-to-paused-group, recurring
    downgrade, blocked→paused transition, multi-reason interaction, and handle delegation.
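The gate and submit behaviour described above can be sketched in memory as follows. This is a minimal illustration, not the crate's code: the `Scheduler`, `Task`, and `Status` types, the `admit` helper, and the `GROUP` bit value are all assumptions, and persistence, cancellation of running tasks, and event emission are left out.

```rust
use std::collections::HashSet;

// Assumed bit value for the GROUP pause reason (illustrative only).
const GROUP: u8 = 1 << 3;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Pending,
    Paused,
}

#[derive(Debug)]
struct Task {
    group: Option<String>,
    status: Status,
    pause_reasons: u8,
}

// Hypothetical stand-in for the scheduler: an in-memory set of paused
// groups mirrors what the PR persists in the paused_groups table.
#[derive(Default)]
struct Scheduler {
    paused_groups: HashSet<String>,
    tasks: Vec<Task>,
}

impl Scheduler {
    fn is_group_paused(&self, group: Option<&str>) -> bool {
        group.map_or(false, |g| self.paused_groups.contains(g))
    }

    // Submissions are always accepted. If the group is paused, the task is
    // inserted directly as paused with the GROUP bit set.
    // Returns (task index, group_paused).
    fn submit(&mut self, group: Option<&str>) -> (usize, bool) {
        let group_paused = self.is_group_paused(group);
        let (status, pause_reasons) = if group_paused {
            (Status::Paused, GROUP)
        } else {
            (Status::Pending, 0)
        };
        self.tasks.push(Task {
            group: group.map(str::to_owned),
            status,
            pause_reasons,
        });
        (self.tasks.len() - 1, group_paused)
    }

    // Dispatch-time gate: only pending tasks from unpaused groups pass.
    fn admit(&self, task: &Task) -> bool {
        task.status == Status::Pending && !self.is_group_paused(task.group.as_deref())
    }

    // Update the in-memory set, then pause the group's existing tasks by
    // OR-ing in the GROUP bit so other pause reasons are preserved.
    fn pause_group(&mut self, group: &str) {
        self.paused_groups.insert(group.to_owned());
        for t in self
            .tasks
            .iter_mut()
            .filter(|t| t.group.as_deref() == Some(group))
        {
            t.pause_reasons |= GROUP;
            t.status = Status::Paused;
        }
    }
}
```

The set lookup keeps the dispatch-time check cheap, which matches the stated motivation for mirroring the table in memory.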

Closes #36

Breaking changes

  • SubmitOutcome::Inserted(id) → SubmitOutcome::Inserted { id, group_paused };
    callers must update pattern matches.
  • TaskStore::pause(id) now requires a PauseReasons argument: pause(id, reason).
  • PauseReasons and PausedGroupInfo added to public re-exports.
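For callers migrating their pattern matches, the change might look roughly like the sketch below. The enum here is a local stand-in: only the `Inserted { id, group_paused }` shape comes from this PR, while the `i64` id type, the `Other` placeholder variant, and the `describe` function are assumptions made for illustration.

```rust
// Local stand-in for the crate's enum. Only the shape of `Inserted`
// is taken from the PR; the id type is assumed.
#[derive(Debug)]
enum SubmitOutcome {
    Inserted { id: i64, group_paused: bool },
    // Hypothetical placeholder for whatever other variants exist.
    Other,
}

fn describe(outcome: &SubmitOutcome) -> String {
    match outcome {
        // Before this PR the arm would have been `SubmitOutcome::Inserted(id)`.
        SubmitOutcome::Inserted { id, group_paused: true } => {
            format!("task {id} queued, group paused")
        }
        // Callers that do not care about the new field can ignore it with `..`.
        SubmitOutcome::Inserted { id, .. } => format!("task {id} queued"),
        SubmitOutcome::Other => "not inserted".to_string(),
    }
}
```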

deepjoy added 8 commits March 23, 2026 22:02
…use_reasons column

Foundation for group-level pause/resume (plan 042, step 1):
- Migration 010: creates `paused_groups` table and adds `pause_reasons`
  bitmask column to `tasks` with backfill for existing paused rows
- PauseReasons newtype with PREEMPTION/MODULE/GLOBAL/GROUP bit constants
  and contains/with/without/is_empty/bits/from_bits operations
- TaskRecord gains `pause_reasons` field; row mapping reads it with
  fallback to 0 for backward compatibility
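A minimal sketch of what such a newtype could look like, using the operation names listed above. The concrete bit positions, the `u8` backing type, the `NONE` constant, and the masking of unknown bits in `from_bits` are assumptions, not details confirmed by this PR.

```rust
/// Bitmask of reasons a task is paused. Bit positions are assumed.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct PauseReasons(u8);

impl PauseReasons {
    /// Convenience constant for "no reasons" (not named in the PR).
    pub const NONE: Self = Self(0);
    pub const PREEMPTION: Self = Self(1 << 0);
    pub const MODULE: Self = Self(1 << 1);
    pub const GLOBAL: Self = Self(1 << 2);
    pub const GROUP: Self = Self(1 << 3);

    /// Raw value, e.g. for storing in the pause_reasons column.
    pub const fn bits(self) -> u8 {
        self.0
    }

    /// Rebuild from a stored value. Unknown bits are dropped here;
    /// that policy is an assumption.
    pub const fn from_bits(bits: u8) -> Self {
        Self(bits & 0b1111)
    }

    /// True if every bit in `other` is set in `self`.
    pub const fn contains(self, other: Self) -> bool {
        (self.0 & other.0) == other.0
    }

    /// Returns a copy with `other` added.
    pub const fn with(self, other: Self) -> Self {
        Self(self.0 | other.0)
    }

    /// Returns a copy with `other` cleared.
    pub const fn without(self, other: Self) -> Self {
        Self(self.0 & !other.0)
    }

    /// True when no pause reason remains, i.e. the task may resume.
    pub const fn is_empty(self) -> bool {
        self.0 == 0
    }
}
```

With this shape, a task paused for both GROUP and MODULE stays non-empty after `without(GROUP)`, which is the property that prevents one resume path from releasing a task another source still wants paused.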

Plan 042 step 2: plumb PauseReasons through every existing pause and
resume code path so multiple pause sources can coexist without
stranded-task bugs.

Store changes (cancel_expire.rs):
- pause() now accepts a reason parameter and ORs the bit into
  pause_reasons
- pause_pending_by_type_prefix() ORs the MODULE bit; also touches
  already-paused tasks so the bit accumulates
- resume_paused_by_type_prefix() two-step: fully resume sole-reason
  tasks, clear bit from multi-reason tasks
- New: preemption_paused_tasks(), resume_preempted(), clear_pause_bit()
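The accumulate-then-clear semantics can be illustrated per task as below. This is an in-memory model under assumed names and bit values, not the SQL in cancel_expire.rs: the store does the resume as two bulk statements, whereas the sketch applies the same rule to a single task.

```rust
// Assumed bit values, for illustration only.
const MODULE: u8 = 1 << 1;
const GLOBAL: u8 = 1 << 2;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Pending,
    Paused,
}

#[derive(Debug)]
struct Task {
    status: Status,
    pause_reasons: u8,
}

// OR the reason into the mask. Applying this to an already-paused task
// is what lets bits accumulate across pause sources.
fn pause(task: &mut Task, reason: u8) {
    task.pause_reasons |= reason;
    task.status = Status::Paused;
}

// Clear one reason. The task returns to pending only when no reason
// remains. Returns true if the task was fully resumed.
fn resume(task: &mut Task, reason: u8) -> bool {
    if task.status != Status::Paused {
        return false;
    }
    task.pause_reasons &= !reason;
    if task.pause_reasons == 0 {
        task.status = Status::Pending;
        true
    } else {
        false
    }
}
```

For example, a task paused by both MODULE and GLOBAL stays paused after the module resumes and only becomes pending once the global pause is also lifted.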

Scheduler changes:
- cancel_pause_emit (dispatch.rs) accepts reason; callers pass
  PREEMPTION, MODULE, or GLOBAL respectively
- Auto-resume (run_loop.rs) scoped to PREEMPTION bit only via
  preemption_paused_tasks + resume_preempted
- resume_all (control.rs) clears GLOBAL bit from DB

Schema:
- pause_reasons column added to 001_tasks.sql base schema
- Migration 010 registered in migrate() with idempotent handling

Tests: 5 new store tests covering bitmask accumulation, preemption
filtering, clear_pause_bit, and module pause/resume with assertions
on pause_reasons values.
… helpers

Group pause state management (pause_group_state, resume_group_state,
paused_groups, is_group_paused, groups_due_for_resume), bulk task
pause/resume by group key with bitmask integration, and per-group
pending/paused count queries. Pure store additions exercised by tests.

Wire SchedulerInner with in-memory paused_groups set, builder startup
loading from DB, GroupPaused/GroupResumed events, PausedGroupInfo in
snapshots, and ActiveTaskMap::pause_group for cancelling running tasks.

- pause_group / resume_group / pause_group_until public API in control.rs
- Gate admission rejects tasks whose group is paused
- Finalizer deferral: re-queue parent finalize when group is paused
- Time-boxed auto-resume checked every ~5s in the run loop
- maybe_restore_fast_dispatch re-evaluates when last group is resumed
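One way to picture the throttled auto-resume check is sketched below. The `GroupResumeCheck` type, the `due_groups` helper, and the use of `Instant` deadlines are hypothetical; the PR only states that time-boxed pauses are checked roughly every 5 s in the run loop.

```rust
use std::time::{Duration, Instant};

// Hypothetical throttle: lets the run loop perform the "any groups due
// for resume?" check at most once per interval.
struct GroupResumeCheck {
    interval: Duration,
    last_check: Option<Instant>,
}

impl GroupResumeCheck {
    fn new(interval: Duration) -> Self {
        Self { interval, last_check: None }
    }

    // Returns true if the check should run now, and records the time.
    fn due(&mut self, now: Instant) -> bool {
        match self.last_check {
            Some(last) if now.duration_since(last) < self.interval => false,
            _ => {
                self.last_check = Some(now);
                true
            }
        }
    }
}

// Given (group, optional resume deadline) pairs, return the groups whose
// deadline has passed. Groups paused without a deadline are never returned.
fn due_groups(paused: &[(String, Option<Instant>)], now: Instant) -> Vec<&str> {
    paused
        .iter()
        .filter(|(_, resume_at)| matches!(resume_at, Some(t) if *t <= now))
        .map(|(group, _)| group.as_str())
        .collect()
}
```

A run loop would call `due(now)` on every iteration and, only when it returns true, resume each group reported by the deadline query. The trade-off is that a timed pause may overrun its deadline by up to one interval.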
…utcome change

Wire group-pause awareness into submit, dependency resolution, recurring
instance creation, and task-completion unblocking so newly-eligible tasks
are downgraded to paused when their group is paused.

Breaking: SubmitOutcome::Inserted is now a struct with { id, group_paused }
instead of a tuple variant.

Expose pause_group / resume_group / is_group_paused / paused_groups on
DomainHandle and ModuleHandle, delegating to the scheduler.
…p 6)

Cover submit-to-paused-group, recurring next-instance downgrade,
blocked→paused transition, multi-reason pause interaction, and
DomainHandle delegation. Also remove duplicated ALTER TABLE from
migration 010 (now lives in the earlier pause_reasons migration).
@deepjoy deepjoy force-pushed the pause-all-resume-all branch from d3f9b67 to 2756398 Compare March 24, 2026 05:04
@deepjoy deepjoy merged commit 614f831 into main Mar 24, 2026
2 checks passed
@github-actions github-actions Bot mentioned this pull request Mar 24, 2026
@github-actions (Contributor) commented

Benchmark Comparison

group                                       main                                    pr
-----                                       ----                                    --
backoff_delay/constant                      1.00     43.6±0.09ns 437.9 MElem/sec    1.02     44.5±0.13ns 428.4 MElem/sec
backoff_delay/exponential                   1.00    185.9±0.53ns 102.6 MElem/sec    1.01    188.1±1.26ns 101.4 MElem/sec
backoff_delay/exponential_jitter            1.00    450.8±8.40ns 42.3 MElem/sec     1.01    453.4±0.95ns 42.1 MElem/sec
backoff_delay/linear                        1.00     76.0±0.19ns 251.1 MElem/sec    1.00     76.2±0.16ns 250.2 MElem/sec
batch_submit/1000                           1.06     35.2±3.08ms 27.7 KElem/sec     1.00     33.1±2.27ms 29.5 KElem/sec
byte_progress/byte_reporting_500            1.05    199.4±4.55ms  2.4 KElem/sec     1.00    189.6±5.35ms  2.6 KElem/sec
byte_progress/noop_500                      1.09    193.4±4.41ms  2.5 KElem/sec     1.00    178.0±3.61ms  2.7 KElem/sec
byte_progress_snapshot/100_tasks            1.00     73.9±2.87ms  1352 Elem/sec     1.08     79.6±2.34ms  1255 Elem/sec
concurrency_scaling/1                       1.07    395.3±5.70ms  1264 Elem/sec     1.00    369.3±3.71ms  1353 Elem/sec
concurrency_scaling/2                       1.07    289.3±6.39ms  1728 Elem/sec     1.00    270.7±4.53ms  1847 Elem/sec
concurrency_scaling/4                       1.07   241.9±14.43ms  2.0 KElem/sec     1.00    226.9±6.67ms  2.2 KElem/sec
concurrency_scaling/8                       1.09    194.3±5.02ms  2.5 KElem/sec     1.00    177.4±3.46ms  2.8 KElem/sec
count_by_tags/100                           1.06    133.5±4.97µs  7.3 KElem/sec     1.00    125.7±3.06µs  7.8 KElem/sec
count_by_tags/1000                          1.03    222.0±5.59µs  4.4 KElem/sec     1.00    216.0±2.49µs  4.5 KElem/sec
count_by_tags/5000                          1.00    602.3±7.05µs  1660 Elem/sec     1.02    612.4±5.18µs  1632 Elem/sec
dep_chain_dispatch/10                       1.04     11.2±0.25ms   895 Elem/sec     1.00     10.7±0.11ms   935 Elem/sec
dep_chain_dispatch/25                       1.04     27.4±0.48ms   914 Elem/sec     1.00     26.2±0.35ms   953 Elem/sec
dep_chain_dispatch/50                       1.05     54.9±1.02ms   910 Elem/sec     1.00     52.3±0.90ms   955 Elem/sec
dep_chain_submit/10                         1.04      3.1±0.15ms  3.2 KElem/sec     1.00      3.0±0.11ms  3.3 KElem/sec
dep_chain_submit/200                        1.05     80.6±5.30ms  2.4 KElem/sec     1.00     76.5±3.86ms  2.6 KElem/sec
dep_chain_submit/50                         1.03     17.2±0.84ms  2.8 KElem/sec     1.00     16.7±0.82ms  2.9 KElem/sec
dep_fan_in_dispatch/10                      1.03      6.1±0.10ms  1815 Elem/sec     1.00      5.9±0.08ms  1874 Elem/sec
dep_fan_in_dispatch/100                     1.08     43.6±0.99ms  2.3 KElem/sec     1.00     40.2±0.74ms  2.5 KElem/sec
dep_fan_in_dispatch/50                      1.08     23.0±0.91ms  2.2 KElem/sec     1.00     21.3±0.49ms  2.3 KElem/sec
dispatch_and_complete/1000                  1.08    390.1±6.97ms  2.5 KElem/sec     1.00    359.8±5.91ms  2.7 KElem/sec
dispatch_group_scaling/1                    1.05    433.7±8.30ms  1152 Elem/sec     1.00    413.6±4.97ms  1208 Elem/sec
dispatch_group_scaling/10                   1.04    431.4±6.76ms  1159 Elem/sec     1.00    416.6±5.72ms  1200 Elem/sec
dispatch_group_scaling/100                  1.04    431.9±6.25ms  1157 Elem/sec     1.00    416.1±5.41ms  1201 Elem/sec
dispatch_group_scaling/50                   1.04    435.5±6.79ms  1148 Elem/sec     1.00    418.4±8.90ms  1195 Elem/sec
dispatch_no_groups/500                      1.09    195.3±4.91ms  2.5 KElem/sec     1.00    179.4±4.28ms  2.7 KElem/sec
dispatch_one_group/500                      1.05    433.6±7.74ms  1153 Elem/sec     1.00    412.4±9.10ms  1212 Elem/sec
dispatch_permanent_failure/500              1.09    374.0±6.53ms  1336 Elem/sec     1.00    343.2±7.20ms  1456 Elem/sec
history_by_type/100                         1.04    226.2±7.10µs  4.3 KElem/sec     1.00    217.0±6.16µs  4.5 KElem/sec
history_by_type/1000                        1.00   794.4±50.08µs  1258 Elem/sec     1.03   821.0±52.27µs  1218 Elem/sec
history_by_type/5000                        1.00   795.3±42.88µs  1257 Elem/sec     1.01   805.5±40.00µs  1241 Elem/sec
history_query/100                           1.02   418.9±24.31µs  2.3 KElem/sec     1.00   411.0±16.21µs  2.4 KElem/sec
history_query/1000                          1.00   431.5±17.35µs  2.3 KElem/sec     1.01   437.2±22.60µs  2.2 KElem/sec
history_query/5000                          1.00   429.1±18.88µs  2.3 KElem/sec     1.03   443.1±23.92µs  2.2 KElem/sec
history_stats/100                           1.05    130.8±2.21µs  7.5 KElem/sec     1.00    124.1±1.24µs  7.9 KElem/sec
history_stats/1000                          1.04    197.6±1.77µs  4.9 KElem/sec     1.00    189.4±1.44µs  5.2 KElem/sec
history_stats/5000                          1.03    487.5±3.22µs  2.0 KElem/sec     1.00    471.1±2.64µs  2.1 KElem/sec
mixed_priority_dispatch/500                 1.08    241.7±8.22ms  2.0 KElem/sec     1.00    223.0±6.13ms  2.2 KElem/sec
peek_next/100                               1.10    129.8±6.01µs  7.5 KElem/sec     1.00    118.2±2.62µs  8.3 KElem/sec
peek_next/1000                              1.07    127.3±3.89µs  7.7 KElem/sec     1.00    118.6±2.60µs  8.2 KElem/sec
peek_next/5000                              1.05    125.6±3.97µs  7.8 KElem/sec     1.00    120.0±2.70µs  8.1 KElem/sec
query_ids_by_tags/100                       1.05    192.5±6.31µs  5.1 KElem/sec     1.00    184.0±4.37µs  5.3 KElem/sec
query_ids_by_tags/1000                      1.02   825.9±20.14µs  1210 Elem/sec     1.00   810.5±17.01µs  1233 Elem/sec
query_ids_by_tags/5000                      1.00      3.6±0.14ms   279 Elem/sec     1.03      3.7±0.12ms   270 Elem/sec
retryable_dead_letter/constant              1.11    115.9±1.34ms   862 Elem/sec     1.00    104.4±1.31ms   957 Elem/sec
retryable_dead_letter/exponential           1.10    114.8±0.94ms   871 Elem/sec     1.00    104.1±0.88ms   960 Elem/sec
retryable_dead_letter/exponential_jitter    1.12    116.1±1.80ms   861 Elem/sec     1.00    103.8±1.13ms   963 Elem/sec
retryable_dead_letter/linear                1.10    115.3±0.69ms   866 Elem/sec     1.00    104.7±0.82ms   954 Elem/sec
submit_dedup_hit/1000                       1.07    216.7±7.87ms  4.5 KElem/sec     1.00    203.1±5.26ms  4.8 KElem/sec
submit_tasks/1000                           1.07    189.0±5.99ms  5.2 KElem/sec     1.00    176.6±5.52ms  5.5 KElem/sec
submit_with_tags/0                          1.06     93.0±4.09ms  5.3 KElem/sec     1.00     87.7±3.30ms  5.6 KElem/sec
submit_with_tags/10                         1.06   248.7±12.88ms  2010 Elem/sec     1.00    235.2±9.58ms  2.1 KElem/sec
submit_with_tags/20                         1.06   406.7±23.10ms  1229 Elem/sec     1.00   381.9±16.21ms  1309 Elem/sec
submit_with_tags/5                          1.05    171.0±9.20ms  2.9 KElem/sec     1.00    163.2±6.72ms  3.0 KElem/sec
tag_values/100                              1.05    137.7±5.73µs  7.1 KElem/sec     1.00    130.6±3.11µs  7.5 KElem/sec
tag_values/1000                             1.05    199.6±6.53µs  4.9 KElem/sec     1.00    190.1±3.50µs  5.1 KElem/sec
tag_values/5000                             1.02    465.7±8.15µs  2.1 KElem/sec     1.00    455.2±4.41µs  2.1 KElem/sec

Development

Successfully merging this pull request may close these issues.

Feature: Pause / resume by group