
feat(deletion): Add partition support to BulkDeleteQuery and cleanup command #107906

Merged
dashed merged 2 commits into master from alberto/partition-bulk-delete on Feb 10, 2026

Conversation

@dashed (Member) commented Feb 10, 2026

Summary

  • Add partition parameter to BulkDeleteQuery to split deletion work across multiple runs using modulo-based row bucketing
  • Add --partition-bucket, --partition-total, and --partition-key flags to the sentry cleanup CLI command
  • Fully backward compatible: behavior is unchanged when partition flags are not provided

Problem

The daily sentry cleanup CronJob for SpikeProjections and Spike models runs a tight DELETE loop via BulkDeleteQuery._continuous_query(), creating a burst of dead tuples on db-usage-1. This triggers a massive autovacuum that causes WAL replication delay, forcing db-usage-repl to fall back to GSCP recovery.

Since valid_date values are always at midnight UTC, simply running the CronJob more frequently doesn't help — all eligible rows become deletable at the same instant. We need a way to partition the rows across multiple runs.

Solution

Add id % N partitioning to BulkDeleteQuery:

```
sentry cleanup --model=SpikeProjections --days=90 --partition-bucket=0 --partition-total=4
sentry cleanup --model=SpikeProjections --days=90 --partition-bucket=1 --partition-total=4
sentry cleanup --model=SpikeProjections --days=90 --partition-bucket=2 --partition-total=4
sentry cleanup --model=SpikeProjections --days=90 --partition-bucket=3 --partition-total=4
```

Each run adds WHERE id % 4 = {bucket} to the DELETE query, handling ~25% of eligible rows. The --partition-key flag allows using a different column (defaults to id).

Using id (auto-increment) ensures uniform distribution, following the same principle as PR #18736 which switched spike projection batching from organization_id (snowflake, uneven) to subscription.id (auto-increment, uniform).
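A quick way to see the uniformity claim is a toy check over synthetic ids (illustrative numbers, not production data):

```python
# Modulo over a dense auto-increment sequence splits rows almost perfectly evenly.
from collections import Counter

total = 4
counts = Counter(i % total for i in range(1, 1_000_001))  # stand-in for sequential ids
assert max(counts.values()) - min(counts.values()) <= 1  # each bucket holds ~250k ids
```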

Changes

src/sentry/db/deletion.py

  • Added partition: tuple[int, int, str] | None parameter to BulkDeleteQuery.__init__()
  • Added partition filter to the WHERE clause in execute()
  • Added partition filter to iterator() via Func(F(key), Value(total), function="MOD") (see the sketch below)
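A minimal sketch of that iterator()-side filter; the helper name and queryset wrapper are assumptions for illustration, not the PR's exact code:

```python
from django.db.models import F, Func, IntegerField, Value

def apply_partition(queryset, bucket: int, total: int, key: str = "id"):
    # Annotate each row with key % total, then keep only the requested bucket.
    return queryset.annotate(
        partition_mod=Func(F(key), Value(total), function="MOD", output_field=IntegerField())
    ).filter(partition_mod=bucket)
```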

src/sentry/runner/commands/cleanup.py

  • Added --partition-bucket CLI flag (integer, 0-based bucket index)
  • Added --partition-total CLI flag (integer, total number of buckets)
  • Added --partition-key CLI flag (default: id)
  • Validation: both bucket and total must be used together, bucket must be non-negative and less than total, total must be positive (a sketch follows this list)
  • Threaded through cleanup() → _cleanup() → run_bulk_query_deletes() → BulkDeleteQuery()
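A hedged sketch of that validation, assuming click-style options (exact messages and exception types are guesses, not the PR's code):

```python
import click

def validate_partition(bucket: int | None, total: int | None) -> tuple[int, int] | None:
    # Both flags together or neither; behavior is unchanged when both are absent.
    if (bucket is None) != (total is None):
        raise click.UsageError("--partition-bucket and --partition-total must be used together")
    if bucket is None or total is None:
        return None
    if total <= 0:
        raise click.UsageError("--partition-total must be a positive integer")
    if not 0 <= bucket < total:
        raise click.UsageError("--partition-bucket must be >= 0 and less than --partition-total")
    return bucket, total
```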

Test plan

  • test_partition_restriction — verifies only rows in the matching bucket are deleted
  • test_partition_with_datetime_restriction — combines partition + date filter
  • test_partition_all_buckets_cover_all_rows — verifies complete coverage across all buckets (property sketched after this list)
  • test_iteration_with_partition — verifies iterator() respects partition filter
  • test_partition_bucket_exceeds_total — validation error for bucket >= total
  • test_partition_negative_bucket — validation error for negative bucket
  • test_partition_zero_total — validation error for zero total
  • test_partition_bucket_without_total — validation error when only bucket is set
  • test_partition_total_without_bucket — validation error when only total is set
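The disjointness-plus-coverage property behind test_partition_all_buckets_cover_all_rows can be sketched as a check over plain integers (the real test uses model fixtures, which are not reproduced here):

```python
def test_buckets_are_disjoint_and_cover_everything() -> None:
    ids = set(range(1, 101))  # stand-in for auto-increment primary keys
    total = 4
    seen: set[int] = set()
    for bucket in range(total):
        selected = {i for i in ids if i % total == bucket}
        assert seen.isdisjoint(selected)  # no row is handled by two buckets
        seen |= selected
    assert seen == ids  # every row belongs to exactly one bucket
```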

Related

  • Ops PR (K8s config): getsentry/ops#19081
  • Analysis: daily mass deletion on accounts_spike_projections causes replication delay on db-usage-1

@github-actions github-actions bot added the Scope: Backend label (automatically applied to PRs that change backend components) Feb 10, 2026
@dashed dashed self-assigned this Feb 10, 2026
@dashed dashed marked this pull request as ready for review February 10, 2026 17:22
@dashed dashed requested review from a team and ellisonmarks February 10, 2026 17:25
@cursor cursor bot (Contributor) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


Add a `partition` parameter to BulkDeleteQuery that enables splitting
deletion work across multiple runs using modulo-based row bucketing.

When `partition=(bucket, total, key_column)` is provided, the DELETE
query adds `WHERE {key} % {total} = {bucket}`, so each run only handles
a fraction of eligible rows. This allows spreading deletion load across
multiple scheduled jobs to reduce dead tuple bursts and autovacuum
contention on high-churn tables like accounts_spike_projections.

The partition filter is applied in the inner SELECT subquery (where the
LIMIT is), ensuring each partition independently selects and deletes its
own candidate rows.
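For intuition, the resulting statement shape might look like the following; the table name, date column, and chunk size are illustrative, and the real SQL is assembled inside BulkDeleteQuery:

```python
def build_delete_sql(
    table: str = "accounts_spike_projections",
    key: str = "id",
    bucket: int = 0,
    total: int = 4,
    chunk: int = 10_000,
) -> str:
    # The partition filter lives in the inner SELECT, next to the LIMIT,
    # so each bucket independently picks its own candidate rows.
    # %% is the modulo operator escaped for psycopg-style parameter binding.
    return (
        f"DELETE FROM {table} WHERE {key} IN ("
        f"SELECT {key} FROM {table} "
        f"WHERE valid_date < %s AND {key} %% {total} = {bucket} "
        f"LIMIT {chunk})"
    )
```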
```python
# Parse and validate --partition flag
parsed_partition: tuple[int, int, str] | None = None
if partition is not None:
    parts = partition.split("/")
```
A contributor commented on this snippet:
a thought here, is we could have had 2 separate params here for total_buckets and bucket_id or smth to make it a little more straightforward

@ajay-sentry (Contributor) left a comment
looks good, I see partition_key isn't used on the cron but makes sense if we want to easily update later

…cleanup command

Expose BulkDeleteQuery's partition support via the `sentry cleanup` CLI:

  --partition-bucket BUCKET  (0-based bucket index)
  --partition-total TOTAL    (total number of buckets)
  --partition-key COLUMN     (default: id)

This allows K8s CronJobs to split bulk deletion work across multiple
scheduled runs. For example, the spikeprotections cleanup can be split
into 4 jobs at 6-hour intervals, each handling ~25% of eligible rows
via `id % 4 = {0,1,2,3}`.

Includes input validation: both flags must be used together, bucket
must be non-negative and less than total, total must be positive.
@dashed dashed force-pushed the alberto/partition-bulk-delete branch from 2e2f9c0 to 4191754 on February 10, 2026 21:04
@dashed dashed merged commit 076b727 into master Feb 10, 2026
73 checks passed
@dashed dashed deleted the alberto/partition-bulk-delete branch February 10, 2026 22:15
jaydgoss pushed a commit that referenced this pull request Feb 12, 2026
…command (#107906)
