
Conversation

@armenzg (Member) commented Nov 7, 2025

Using `order_by()` requires fetching all rows and sorting them, which makes the query slower and causes it to fail when we have millions of rows.

Fixes [SENTRY-5C36](https://sentry.sentry.io/issues/7006347860/).
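From the diff fragments quoted in the review threads below, the change batches GroupHashMetadata IDs without an ORDER BY and sorts them in Python instead. Here is a minimal sketch of that pattern, assuming Django, a hypothetical `BATCH_SIZE`, and guessed import paths (not the exact Sentry code); note that the first review thread below points out a flaw in this cursor scheme:

```python
# Sketch only: batched update without ORDER BY, sorting IDs in Python.
# Import paths, BATCH_SIZE, and the function name are assumptions.
from sentry.models.grouphash_metadata import GroupHashMetadata  # assumed path
from sentry.utils import metrics  # assumed path

BATCH_SIZE = 1000  # assumed

def clear_seer_matched_hashes(hash_ids: list[int]) -> None:
    last_max_id = 0
    while True:
        # No .order_by("id"): Postgres would otherwise sort millions of rows.
        batch_metadata_ids = list(
            GroupHashMetadata.objects.filter(
                seer_matched_grouphash_id__in=hash_ids, id__gt=last_max_id
            ).values_list("id", flat=True)[:BATCH_SIZE]
        )
        if not batch_metadata_ids:
            break
        batch_metadata_ids.sort()  # sort in Python instead of in the database
        updated = GroupHashMetadata.objects.filter(
            id__in=batch_metadata_ids
        ).update(seer_matched_grouphash=None)
        metrics.incr(
            "deletions.group_hash_metadata.rows_updated",
            amount=updated,
            sample_rate=1.0,
        )
        # NB: the review below notes this cursor can skip rows, since an
        # unordered batch need not contain the lowest remaining IDs.
        last_max_id = batch_metadata_ids[-1]  # last element after sorting
```
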
@armenzg self-assigned this Nov 7, 2025
@armenzg requested a review from a team as a code owner Nov 7, 2025 16:25
@github-actions bot added the Scope: Backend label Nov 7, 2025
@armenzg enabled auto-merge (squash) Nov 7, 2025 16:25
```diff
  metrics.incr("deletions.group_hash_metadata.rows_updated", amount=updated, sample_rate=1.0)

- last_max_id = max(batch_metadata_ids)
+ last_max_id = batch_metadata_ids[-1]  # Last element after sorting
```
Contributor

Bug: Cursor-based pagination broken by missing ORDER BY

Removing `.order_by("id")` from the query breaks cursor-based pagination. Without ORDER BY, the database returns rows in arbitrary order, not necessarily the lowest IDs. After sorting in Python and advancing the cursor with `last_max_id = batch_metadata_ids[-1]`, any IDs between the previous cursor and the new max ID that weren't in the arbitrary batch get permanently skipped, leaving orphaned GroupHashMetadata rows that reference deleted GroupHash rows.
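
A toy, plain-Python illustration of the skip (hypothetical IDs, no database involved): with six matching rows and an unordered batch of three, advancing the cursor to the batch's max strands whatever IDs the batch happened to miss.

```python
# Toy illustration of the pagination bug described above (no database involved).
ids_in_table = [1, 2, 3, 4, 5, 6]  # all rows matching the filter

arbitrary_batch = [5, 2, 6]        # without ORDER BY, any 3 rows may come back
batch = sorted(arbitrary_batch)    # sorting in Python afterwards doesn't help
last_max_id = batch[-1]            # cursor advances to 6

remaining = [i for i in ids_in_table if i > last_max_id]
print(remaining)  # [] -> rows 1, 3, and 4 were never processed, yet the loop stops
```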

Member Author

I will follow up on this.

Member Author

The bug mentioned here is real; however, even if we miss some hashes, they should be caught as part of the ORM deletion here:

`GroupHash.objects.filter(id__in=hash_ids).delete()`

If that's not the case, we can add another function that runs the update with sorted IDs after this one completes.

So far, the deletion of a group with over 1M hashes has been humming along for the last 20 minutes.

@armenzg disabled auto-merge Nov 7, 2025 16:36
@armenzg merged commit 4499d36 into master Nov 7, 2025
66 checks passed
@armenzg deleted the 11_07/improve_perf/armenzg branch Nov 7, 2025 17:12
```diff
  metrics.incr("deletions.group_hash_metadata.rows_updated", amount=updated, sample_rate=1.0)

- last_max_id = max(batch_metadata_ids)
+ last_max_id = batch_metadata_ids[-1]  # Last element after sorting
```
Member

Do we need `last_max_id` at all here? Can't we just select rows until nothing is left?

Member Author

You're very right. I will fix it today.

armenzg added a commit that referenced this pull request Nov 10, 2025
In #102960, I added a more efficient query for GroupHashMetadata; however, there's no need to track `last_max_id`, since as we update rows there will be fewer rows to select from.

Fixes [SENTRY-5C2V](https://sentry.sentry.io/issues/7005021677/).
armenzg added a commit that referenced this pull request Nov 10, 2025
In #102960, I added a new query for GroupHashMetadata; however, using `id` together with `seer_matched_grouphash_id__in` requires a composite index, which we don't have.

Even without that, we don't actually need `last_max_id` to keep fetching and updating rows.

Fixes [SENTRY-5C2V](https://sentry.sentry.io/issues/7005021677/).
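
A minimal sketch of the no-cursor loop this commit message describes, reusing the assumed names from the earlier sketch: each `update()` NULLs the matched grouphash, so updated rows drop out of the filter and the loop terminates once the query comes back empty.

```python
# Sketch only; import paths, BATCH_SIZE, and the function name are assumptions.
from sentry.models.grouphash_metadata import GroupHashMetadata  # assumed path
from sentry.utils import metrics  # assumed path

BATCH_SIZE = 1000  # assumed

def clear_seer_matched_hashes(hash_ids: list[int]) -> None:
    while True:
        # No cursor and no composite index needed: rows already updated no
        # longer match the filter, so the result set shrinks on every pass.
        batch_metadata_ids = list(
            GroupHashMetadata.objects.filter(
                seer_matched_grouphash_id__in=hash_ids
            ).values_list("id", flat=True)[:BATCH_SIZE]
        )
        if not batch_metadata_ids:
            break  # every matching row has been updated
        updated = GroupHashMetadata.objects.filter(
            id__in=batch_metadata_ids
        ).update(seer_matched_grouphash=None)
        metrics.incr(
            "deletions.group_hash_metadata.rows_updated",
            amount=updated,
            sample_rate=1.0,
        )
```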
Jesse-Box pushed a commit that referenced this pull request Nov 12, 2025
Using `order_by()` requires fetching all rows and sorting them, which makes the query slower and causes it to fail when we have millions of rows.

Fixes [SENTRY-5C36](https://sentry.sentry.io/issues/7006347860/).
Jesse-Box pushed a commit that referenced this pull request Nov 12, 2025
In #102960, I added a new query for GroupHashMetadata; however, using `id` together with `seer_matched_grouphash_id__in` requires a composite index, which we don't have.

Even without that, we don't actually need `last_max_id` to keep fetching and updating rows.

Fixes [SENTRY-5C2V](https://sentry.sentry.io/issues/7005021677/).
andrewshie-sentry pushed a commit that referenced this pull request Nov 13, 2025
Using `order_by()` requires fetching all rows and sorting them, which makes the query slower and causes it to fail when we have millions of rows.

Fixes [SENTRY-5C36](https://sentry.sentry.io/issues/7006347860/).
andrewshie-sentry pushed a commit that referenced this pull request Nov 13, 2025
In #102960, I added a new query for GroupHashMetadata; however, using `id` together with `seer_matched_grouphash_id__in` requires a composite index, which we don't have.

Even without that, we don't actually need `last_max_id` to keep fetching and updating rows.

Fixes [SENTRY-5C2V](https://sentry.sentry.io/issues/7005021677/).