@kcons
Member

@kcons kcons commented Sep 22, 2025

Delayed workflow tasks are currently all scheduled every minute.
This frequency (60s) is what we want, but running them all at once makes our load bursty and frequently trips rate limits on Snuba queries.

This change establishes "cohorts" and schedules one cohort at a time, so we can transition gradually to N groups of delayed_workflow tasks per minute, one group every 1/Nth of a minute, which evens out our workload.
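
For readers skimming the thread, the cohort idea described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the modulo assignment, the time-slice selection, and the names cohort_of, active_cohort, and projects_for_run are all assumptions made for the example.

```python
from __future__ import annotations


def cohort_of(project_id: int, num_cohorts: int) -> int:
    # Assumption: cohorts are assigned by project ID modulo N.
    return project_id % num_cohorts


def active_cohort(now: float, num_cohorts: int, period: float = 60.0) -> int:
    # Split each period into N equal slices and find the slice containing `now`.
    slice_length = period / num_cohorts
    index = int((now % period) // slice_length)
    # Guard against floating-point rounding at the very end of the period.
    return min(index, num_cohorts - 1)


def projects_for_run(project_ids: list[int], now: float, num_cohorts: int) -> list[int]:
    # Keep only the projects whose cohort owns the current time slice.
    cohort = active_cohort(now, num_cohorts)
    return [pid for pid in project_ids if cohort_of(pid, num_cohorts) == cohort]
```

With num_cohorts set to 1 every project falls into the single cohort, so the sketch degenerates to the existing "everything once a minute" behavior, which is presumably what makes a default of 1 a safe starting point.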


Note

Introduce cohort-sharded scheduling for delayed workflows, persisting cohort run timestamps in Redis and refactoring processing/cleanup; add pydantic-based Redis helpers and a new num_cohorts option.

  • Workflow Engine (Scheduling & Processing)
    • Introduce cohort-sharded scheduling via ProjectChooser and chosen_projects, selecting projects per run based on workflow_engine.num_cohorts.
    • Persist cohort run timestamps in Redis as CohortUpdates using DelayedWorkflowClient.fetch_updates/persist_updates.
    • Refactor cleanup into mark_projects_processed; compute max_project_timestamp and use conditional delete or range clear accordingly.
  • Buffer Layer
    • Add RedisHashSortedSetBuffer.get_parsed_key/put_parsed_key for pydantic models; extend supported ops and expiry handling.
    • Extend DelayedWorkflowClient with _COHORT_UPDATES_KEY and cohort update helpers.
  • Options
    • Add workflow_engine.num_cohorts (Int, default 1).
  • Tests
    • Convert to pytest-style and add coverage for cohort selection, Redis parsed key helpers, and new scheduling/cleanup flows.

Written by Cursor Bugbot for commit 87a09d0. This will update automatically on new commits.
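
The summary's point about persisting cohort run timestamps through typed get/put helpers can be illustrated with a small stand-in. This is a sketch under stated assumptions: it uses dataclasses and JSON instead of pydantic, an in-memory dict instead of Redis, and the names CohortRuns, FakeStore, and due_cohorts are hypothetical rather than taken from the PR.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class CohortRuns:
    # Hypothetical stand-in for CohortUpdates: cohort index -> last run time.
    last_run: dict[int, float] = field(default_factory=dict)


class FakeStore:
    """In-memory stand-in for the Redis-backed buffer (an assumption)."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def put_parsed(self, key: str, value: CohortRuns) -> None:
        # Serialize the model to JSON before storing it as a string.
        self._data[key] = json.dumps(asdict(value))

    def get_parsed(self, key: str) -> CohortRuns | None:
        raw = self._data.get(key)
        if raw is None:
            return None
        payload = json.loads(raw)
        # JSON object keys are strings, so convert them back to int indices.
        return CohortRuns({int(k): v for k, v in payload["last_run"].items()})


def due_cohorts(
    runs: CohortRuns, now: float, num_cohorts: int, period: float = 60.0
) -> list[int]:
    # A cohort is due if it never ran or last ran at least one period ago.
    never = float("-inf")
    return [
        c for c in range(num_cohorts)
        if now - runs.last_run.get(c, never) >= period
    ]
```

Storing the timestamps outside the task process is what would let successive scheduler runs agree on which cohort ran last, even across restarts.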

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Sep 22, 2025
@kcons
Member Author

kcons commented Sep 22, 2025

TODO: Make cohort size an option, spell out the transition, use precise cleanup, add a test for num_cohorts = 1.

@codecov

codecov bot commented Sep 22, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files
@@             Coverage Diff             @@
##           master   #100038      +/-   ##
===========================================
+ Coverage   78.80%    81.03%   +2.22%     
===========================================
  Files        8699      8701       +2     
  Lines      385940    386024      +84     
  Branches    24413     24413              
===========================================
+ Hits       304159    312799    +8640     
+ Misses      81430     72874    -8556     
  Partials      351       351              

cursor[bot]

This comment was marked as outdated.

@kcons
Member Author

kcons commented Oct 3, 2025

Bug: Conditional Delete Fails, Incorrectly Clears Projects

In mark_projects_processed, the fallback path for conditional deletes calculates max_project_timestamp from all retrieved projects, not just the ones actually processed. This means if the conditional delete fails, clear_project_ids may incorrectly remove unprocessed projects from the buffer.

src/sentry/workflow_engine/processors/schedule.py#L205-L209

)
# Fallback.
buffer_client.clear_project_ids(
    min=0,
    max=max_project_timestamp,

This is true, but it is an artifact of use_conditional_delete being required for partial cleanup. The fallback will be removed next week.
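
To make the difference between the two cleanup paths concrete, here is a small simulation. It is only a sketch: a plain dict stands in for the sorted set of project IDs scored by timestamp, range_clear assumes that clear_project_ids removes everything scored between min and max (which its arguments suggest but the thread does not confirm), and conditional_delete is a hypothetical model of the conditional path.

```python
from __future__ import annotations


def range_clear(buffer: dict[int, float], min_score: float, max_score: float) -> None:
    # Remove every project whose score lies in [min_score, max_score].
    doomed = [pid for pid, score in buffer.items() if min_score <= score <= max_score]
    for pid in doomed:
        del buffer[pid]


def conditional_delete(buffer: dict[int, float], processed: dict[int, float]) -> None:
    # Remove a project only if its score is unchanged since it was processed.
    for pid, score in processed.items():
        if buffer.get(pid) == score:
            del buffer[pid]


retrieved = {1: 10.0, 2: 20.0, 3: 30.0}  # everything fetched from the buffer
processed = {1: 10.0}                    # only this cohort's project was handled

fallback = dict(retrieved)
range_clear(fallback, 0, max(retrieved.values()))  # max taken over all retrieved
print(sorted(fallback))  # projects 2 and 3 are gone although never processed

precise = dict(retrieved)
conditional_delete(precise, processed)
print(sorted(precise))  # only project 1 was removed
```

In this model the range clear is only safe when every retrieved project was processed, which matches the point that partial cleanup depends on the conditional delete.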

value = self._execute_redis_operation(key, "get")
if value is None:
    return None
return model.parse_raw(value)
Member

This is cool!

@kcons kcons enabled auto-merge (squash) October 14, 2025 17:45
@kcons kcons merged commit 0692a36 into master Oct 14, 2025
64 checks passed
@kcons kcons deleted the kcons/cohort branch October 14, 2025 18:05
chromy pushed a commit that referenced this pull request Oct 17, 2025
Co-authored-by: getsantry[bot] <66042841+getsantry[bot]@users.noreply.github.com>
@github-actions github-actions bot locked and limited conversation to collaborators Oct 30, 2025

3 participants