feat(aci): Delayed workflows cohort sharding #100038

kcons · 2025-09-22T19:00:22Z

Delayed workflow tasks are currently all scheduled every minute.
This frequency (60s) is what we want, but running them all at once makes our load bursty and causes some Snuba query rate limits to be hit frequently.

This change establishes "cohorts" and schedules one cohort at a time, allowing us to transition gradually to N groups of delayed_workflow tasks per minute, one every 1/Nth of a minute, smoothing out our workload.
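
As a rough illustration of the sharding idea, the sketch below assigns each project to one of N cohorts and computes when each cohort would start within a 60-second window. The modulo assignment and all function names are assumptions made for illustration; they are not taken from the PR's ProjectChooser implementation.

```python
from collections import defaultdict

PERIOD_SECONDS = 60  # every cohort should still run once per minute


def cohort_for_project(project_id: int, num_cohorts: int) -> int:
    """Hypothetical assignment: a stable mapping from project to cohort."""
    return project_id % num_cohorts


def cohort_offsets(num_cohorts: int) -> list[float]:
    """Start offset, in seconds into the minute, for each cohort."""
    return [i * PERIOD_SECONDS / num_cohorts for i in range(num_cohorts)]


def group_by_cohort(project_ids: list[int], num_cohorts: int) -> dict[int, list[int]]:
    """Bucket project IDs by the cohort they are assigned to."""
    groups: dict[int, list[int]] = defaultdict(list)
    for project_id in project_ids:
        groups[cohort_for_project(project_id, num_cohorts)].append(project_id)
    return dict(groups)
```

With num_cohorts = 1 there is a single offset at 0 seconds, which matches the current behavior of running everything once per minute.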

Note

Introduce cohort-sharded scheduling for delayed workflows, persisting cohort run timestamps in Redis and refactoring processing/cleanup; add pydantic-based Redis helpers and a new num_cohorts option.

- Workflow Engine (Scheduling & Processing)
  - Introduce cohort-sharded scheduling via ProjectChooser and chosen_projects, selecting projects per run based on workflow_engine.num_cohorts.
  - Persist cohort run timestamps in Redis as CohortUpdates using DelayedWorkflowClient.fetch_updates/persist_updates.
  - Refactor cleanup into mark_projects_processed; compute max_project_timestamp and use conditional delete or range clear accordingly.
- Buffer Layer
  - Add RedisHashSortedSetBuffer.get_parsed_key/put_parsed_key for pydantic models; extend supported ops and expiry handling.
  - Extend DelayedWorkflowClient with _COHORT_UPDATES_KEY and cohort update helpers.
- Options
  - Add workflow_engine.num_cohorts (Int, default 1).
- Tests
  - Convert to pytest-style and add coverage for cohort selection, Redis parsed key helpers, and new scheduling/cleanup flows.
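
One way to picture how persisted cohort run timestamps could drive scheduling is sketched below. It assumes a cohort is due once a full period has elapsed since its recorded last run; that rule, along with CohortState, due_cohorts, and mark_run, is hypothetical. The PR stores a CohortUpdates model in Redis through DelayedWorkflowClient, and its actual selection rule is not shown in this conversation.

```python
from dataclasses import dataclass, field


@dataclass
class CohortState:
    """Hypothetical in-memory stand-in for persisted last-run timestamps."""

    last_run: dict[int, float] = field(default_factory=dict)


def due_cohorts(
    state: CohortState, num_cohorts: int, now: float, period: float = 60.0
) -> list[int]:
    """Return cohorts never run, or last run at least `period` seconds ago."""
    due = []
    for cohort in range(num_cohorts):
        last = state.last_run.get(cohort)
        if last is None or now - last >= period:
            due.append(cohort)
    return due


def mark_run(state: CohortState, cohorts: list[int], now: float) -> None:
    """Record `now` as the last run time for each processed cohort."""
    for cohort in cohorts:
        state.last_run[cohort] = now
```

Under this assumed rule, a run with no recorded timestamps treats every cohort as due, so staggering the first runs would need separate handling.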

Written by Cursor Bugbot for commit 87a09d0.

kcons · 2025-09-22T19:01:02Z

TODO: Make cohort size an option, spell out the transition, use precise cleanup, add a test for num_cohorts = 1.

codecov · 2025-09-22T19:14:29Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

@@             Coverage Diff             @@
##           master   #100038      +/-   ##
===========================================
+ Coverage   78.80%    81.03%   +2.22%     
===========================================
  Files        8699      8701       +2     
  Lines      385940    386024      +84     
  Branches    24413     24413              
===========================================
+ Hits       304159    312799    +8640     
+ Misses      81430     72874    -8556     
  Partials      351       351

kcons · 2025-10-03T21:09:11Z

Bug: Conditional Delete Fails, Incorrectly Clears Projects

In mark_projects_processed, the fallback path for conditional deletes calculates max_project_timestamp from all retrieved projects, not just the ones actually processed. This means if the conditional delete fails, clear_project_ids may incorrectly remove unprocessed projects from the buffer.

src/sentry/workflow_engine/processors/schedule.py#L205-L209 (lines 205 to 209 in 54c7d82):

    )
    # Fallback.
    buffer_client.clear_project_ids(
        min=0,
        max=max_project_timestamp,

This is true, but it is an artifact of use_conditional_delete being required for partial clean-up. The fallback will be removed next week.
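
To make the reported failure mode concrete, here is a small in-memory model of the two cleanup strategies. It is a sketch under assumptions: the buffer is modeled as a dict of project ID to timestamp, conditional_delete removes an entry only if its stored timestamp is unchanged, and clear_range removes every entry whose timestamp falls in a range. These helpers are hypothetical and do not reproduce the real buffer_client API.

```python
def conditional_delete(buffer: dict[int, float], processed: dict[int, float]) -> list[int]:
    """Remove only processed projects whose stored timestamp is unchanged."""
    removed = []
    for project_id, seen_ts in processed.items():
        if buffer.get(project_id) == seen_ts:
            del buffer[project_id]
            removed.append(project_id)
    return removed


def clear_range(buffer: dict[int, float], min_ts: float, max_ts: float) -> list[int]:
    """Remove every project whose timestamp lies in [min_ts, max_ts]."""
    removed = [pid for pid, ts in buffer.items() if min_ts <= ts <= max_ts]
    for pid in removed:
        del buffer[pid]
    return removed


all_projects = {1: 100.0, 2: 105.0, 3: 110.0}  # everything fetched from the buffer
processed = {1: 100.0}  # only this project was actually processed

precise = dict(all_projects)
conditional_delete(precise, processed)  # projects 2 and 3 stay queued

fallback = dict(all_projects)
max_project_timestamp = max(all_projects.values())  # taken over all fetched projects
clear_range(fallback, 0, max_project_timestamp)  # also drops unprocessed 2 and 3
```

In this model the range clear empties the buffer, while the conditional delete leaves the two unprocessed projects in place.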

cathteng · 2025-10-09T22:43:12Z

src/sentry/workflow_engine/buffer/redis_hash_sorted_set_buffer.py

+        value = self._execute_redis_operation(key, "get")
+        if value is None:
+            return None
+        return model.parse_raw(value)


This is cool!
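
The helper pattern in the quoted diff can be sketched without Redis or pydantic. This is an illustration only: FakeStore stands in for the Redis-backed buffer, and a dataclass with JSON serialization is swapped in for the pydantic model (the diff calls model.parse_raw). CohortTimestamps and its fields are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class CohortTimestamps:
    """Hypothetical model; stands in for a pydantic model such as CohortUpdates."""

    values: dict[str, float] = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def parse_raw(cls, raw: str) -> "CohortTimestamps":
        return cls(**json.loads(raw))


class FakeStore:
    """In-memory stand-in for the Redis-backed buffer."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def put_parsed_key(self, key: str, model: CohortTimestamps) -> None:
        # Serialize the model to a string before storing it.
        self._data[key] = model.to_json()

    def get_parsed_key(self, key: str) -> Optional[CohortTimestamps]:
        value = self._data.get(key)
        if value is None:
            return None  # as in the diff, a missing key yields None
        return CohortTimestamps.parse_raw(value)
```

Keeping serialization inside the two helpers means callers only ever handle typed objects, never raw strings.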

Co-authored-by: getsantry[bot] <66042841+getsantry[bot]@users.noreply.github.com>

github-actions bot added the Scope: Backend label Sep 22, 2025

kcons force-pushed the kcons/cohort branch from 72d5a13 to ceeec11 Compare September 22, 2025 23:49

kcons force-pushed the kcons/cohort branch from 42d9c5e to 92ce550 Compare September 26, 2025 22:24

kcons force-pushed the kcons/cohort branch from d91e253 to 431a934 Compare September 29, 2025 18:01

kcons marked this pull request as ready for review September 29, 2025 18:02

kcons requested a review from a team as a code owner September 29, 2025 18:02

kcons force-pushed the kcons/cohort branch from 87a09d0 to c73c77f Compare September 30, 2025 18:19

kcons requested a review from saponifi3d September 30, 2025 18:19

kcons force-pushed the kcons/cohort branch from 1612609 to a64c7e4 Compare October 3, 2025 20:59

kcons force-pushed the kcons/cohort branch from 54c7d82 to 0caec12 Compare October 9, 2025 17:20

kcons requested a review from cathteng October 9, 2025 17:21

kcons force-pushed the kcons/cohort branch from e1e4650 to 299d9e2 Compare October 9, 2025 17:28

cathteng approved these changes Oct 9, 2025

kcons force-pushed the kcons/cohort branch from b3133f7 to 237f5c5 Compare October 10, 2025 20:36

initial (f190380)

kcons added 12 commits October 14, 2025 09:27

tests (58089e7)
fix (a314467)
some fixes (ca3d8bc)
more (df23413)
more tests (34505ec)
More merge (6d8d970)
fix typing (bfb12c8)
fix cleanup (8594ae3)
Merge (65aa4ac)
dox (7a581df)
fix merge error, add test for exception behavior (cd83754)
killswitch (6100a8d)

kcons force-pushed the kcons/cohort branch from c01b1bf to 6100a8d Compare October 14, 2025 17:43

🛠️ apply pre-commit fixes (95cd1e6)

kcons enabled auto-merge (squash) October 14, 2025 17:45

kcons merged commit 0692a36 into master Oct 14, 2025
64 checks passed

kcons deleted the kcons/cohort branch October 14, 2025 18:05

github-actions bot locked and limited conversation to collaborators Oct 30, 2025
