src/sentry/deletions/defaults/group.py (14 changes: 9 additions & 5 deletions)

@@ -275,27 +275,31 @@ def update_group_hash_metadata_in_batches(hash_ids: Sequence[int]) -> None:

     # Use cursor-based pagination with the primary key to efficiently
     # process large datasets without loading all IDs into memory or
-    # creating large NOT IN clauses
+    # creating large NOT IN clauses. We fetch IDs without ORDER BY to avoid
+    # database sorting overhead, then sort the small batch in Python.
     last_max_id = 0
     while True:
+        # Note: hash_ids is bounded to ~100 items (deletions.group-hashes-batch-size)
+        # from the caller, so this IN clause is intentionally not batched
         batch_metadata_ids = list(
             GroupHashMetadata.objects.filter(
                 seer_matched_grouphash_id__in=hash_ids, id__gt=last_max_id
-            )
-            .order_by("id")
-            .values_list("id", flat=True)[:batch_size]
+            ).values_list("id", flat=True)[:batch_size]
         )
         if not batch_metadata_ids:
             break

+        # Sort in Python to ensure we process lowest IDs first and can safely
+        # advance the cursor. Sorting a small batch (e.g., 1000 items) in Python
+        # is trivial and avoids database ORDER BY overhead.
+        batch_metadata_ids.sort()
+
         updated = GroupHashMetadata.objects.filter(id__in=batch_metadata_ids).update(
             seer_matched_grouphash=None
         )
         metrics.incr("deletions.group_hash_metadata.rows_updated", amount=updated, sample_rate=1.0)

-        last_max_id = max(batch_metadata_ids)
+        last_max_id = batch_metadata_ids[-1]  # Last element after sorting
Contributor:
Bug: Cursor-based pagination broken by missing ORDER BY

Removing .order_by("id") from the query breaks cursor-based pagination. Without ORDER BY, the database returns rows in arbitrary order, not necessarily the lowest IDs. After sorting in Python and advancing the cursor with last_max_id = batch_metadata_ids[-1], any IDs between the previous cursor and the new max ID that weren't in the arbitrary batch get permanently skipped, leaving orphaned GroupHashMetadata rows that reference deleted GroupHash rows.
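To make the failure mode concrete, here is a toy simulation of the cursor loop (plain Python, not Sentry code; random.sample stands in for the database returning matching rows in arbitrary order):

```python
import random

rows = set(range(1, 11))  # stand-in for GroupHashMetadata IDs matching the filter
processed: set[int] = set()
last_max_id = 0
batch_size = 3

while True:
    # Updated rows drop out of the filter; the cursor also excludes everything
    # at or below last_max_id, whether or not it was ever processed.
    candidates = [r for r in rows - processed if r > last_max_id]
    if not candidates:
        break
    # No ORDER BY: the database may return any subset of the matching rows.
    batch = sorted(random.sample(candidates, min(batch_size, len(candidates))))
    processed |= set(batch)
    last_max_id = batch[-1]  # can jump past unprocessed lower IDs

print(sorted(rows - processed))  # usually non-empty: permanently skipped rows
```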


Member Author:

I will follow up on this.

Member Author:

The bug mentioned here is real; however, even if we miss some hashes, we should catch them as part of the ORM deletion here:

GroupHash.objects.filter(id__in=hash_ids).delete()

If this is not the case, we can have another function that does the update with sorted IDs after this one completes.

So far I have the deletion of a group with over 1M hashes humming along for the last 20 minutes.
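If a follow-up pass does turn out to be needed, a minimal sketch (hypothetical function name; models and imports as in the diff above) would be the same batched update with the ORDER BY restored, so the id cursor only ever moves past processed rows:

```python
def update_remaining_group_hash_metadata(hash_ids: Sequence[int], batch_size: int = 1000) -> None:
    # Hypothetical cleanup pass: identical to the loop in the diff, but ordered
    # by id so the cursor can never jump past unprocessed rows.
    last_max_id = 0
    while True:
        batch = list(
            GroupHashMetadata.objects.filter(
                seer_matched_grouphash_id__in=hash_ids, id__gt=last_max_id
            )
            .order_by("id")
            .values_list("id", flat=True)[:batch_size]
        )
        if not batch:
            break
        GroupHashMetadata.objects.filter(id__in=batch).update(seer_matched_grouphash=None)
        last_max_id = batch[-1]  # lowest remaining ids were just processed
```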

Member:

Do we need last_max_id at all here? Can't we just select rows until nothing is left?
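A minimal sketch of that suggestion (same models as the diff above): each batch's UPDATE clears seer_matched_grouphash, so updated rows drop out of the filter on the next iteration and the loop terminates without any cursor, regardless of row order:

```python
while True:
    batch_metadata_ids = list(
        GroupHashMetadata.objects.filter(
            seer_matched_grouphash_id__in=hash_ids
        ).values_list("id", flat=True)[:batch_size]
    )
    if not batch_metadata_ids:
        break
    GroupHashMetadata.objects.filter(id__in=batch_metadata_ids).update(
        seer_matched_grouphash=None
    )
```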

Member Author:

You're very right. I will fix it today.



def update_group_hash_metadata_in_batches_old(hash_ids: Sequence[int]) -> int: