fix(deletions): Performance issue when selecting rows #102960
Conversation
Using `order_by()` requires fetching all rows and ordering them, which makes the query slower and causes it to fail when we have millions of rows. Fixes [SENTRY-5C36](https://sentry.sentry.io/issues/7006347860/).
```diff
  metrics.incr("deletions.group_hash_metadata.rows_updated", amount=updated, sample_rate=1.0)
- last_max_id = max(batch_metadata_ids)
+ last_max_id = batch_metadata_ids[-1]  # Last element after sorting
```
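For context, a minimal sketch of the batched query this diff belongs to, assuming the filter shape described later in the thread; `batch_size` and the surrounding variables are illustrative, not the exact code:

```python
# Hypothetical sketch: fetch one batch of ids without ORDER BY, then sort the
# small batch in Python instead of having Postgres sort millions of rows.
batch_metadata_ids = sorted(
    GroupHashMetadata.objects.filter(
        seer_matched_grouphash_id__in=hash_ids,  # GroupHash ids being deleted
        id__gt=last_max_id,                      # cursor from the previous batch
    ).values_list("id", flat=True)[:batch_size]
)
# The real loop would break on an empty batch before this line.
last_max_id = batch_metadata_ids[-1]  # Last element after sorting
```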
**Bug: Cursor-based pagination broken by missing ORDER BY**

Removing `.order_by("id")` from the query breaks cursor-based pagination. Without ORDER BY, the database returns rows in arbitrary order, not necessarily the lowest IDs. After sorting in Python and advancing the cursor with `last_max_id = batch_metadata_ids[-1]`, any IDs between the previous cursor and the new max ID that weren't in the arbitrary batch get permanently skipped, leaving orphaned `GroupHashMetadata` rows that reference deleted `GroupHash` rows.
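A hedged illustration of that failure mode, with invented ids and a batch size of 3:

```python
# Invented example: remaining matching metadata ids are [1, 2, 3, 4, 5, 6].
# Without ORDER BY, LIMIT 3 may return any three of them, e.g.:
batch_metadata_ids = sorted([2, 5, 6])  # sorting only orders this arbitrary batch
last_max_id = batch_metadata_ids[-1]    # cursor jumps to 6
# The next query filters id > 6, so ids 1, 3, and 4 are never selected again,
# leaving their GroupHashMetadata rows pointing at deleted GroupHash rows.
```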
I will follow up on this.
The bug mentioned here is real. However, even if we miss some hashes, we should catch them as part of the ORM deletion here, in `src/sentry/deletions/defaults/group.py` (line 389 at 45c8473):

```python
GroupHash.objects.filter(id__in=hash_ids).delete()
```
If this is not the case, we can have another function that does the update with sorted IDs after this one completes.
So far I have the deletion of a group with over 1M hashes humming along for the last 20 minutes.
Do we need `last_max_id` at all here? Can't we just select rows until nothing is left?
You're very right. I will fix it today.
In #102960, I added a more optimal query for querying GroupHashMetadata. However, using `id` with `seer_matched_grouphash_id__in` requires a composite index, which we don't have. Even without that, there's no need to track `last_max_id`: as we update rows, there will be fewer rows left to select from. Fixes [SENTRY-5C2V](https://sentry.sentry.io/issues/7005021677/).
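A minimal sketch of that loop-until-empty approach. It assumes the update nulls out `seer_matched_grouphash`, which is what makes each pass shrink the set the filter matches; the import paths, function name, and batch size are illustrative, and only the `metrics.incr` call is taken verbatim from the diff above:

```python
from sentry.utils import metrics  # metrics util referenced in this PR
from sentry.models.grouphashmetadata import GroupHashMetadata  # assumed import path

BATCH_SIZE = 1000  # illustrative


def clear_seer_matched_grouphash(hash_ids: list[int]) -> None:
    # Keep selecting and updating until the filter matches nothing. No cursor
    # is needed, because every update removes rows from the next result set.
    while True:
        batch_metadata_ids = list(
            GroupHashMetadata.objects.filter(
                seer_matched_grouphash_id__in=hash_ids
            ).values_list("id", flat=True)[:BATCH_SIZE]
        )
        if not batch_metadata_ids:
            break
        # Assumption: clearing the FK is what drops these rows out of the filter.
        updated = GroupHashMetadata.objects.filter(id__in=batch_metadata_ids).update(
            seer_matched_grouphash=None
        )
        metrics.incr(
            "deletions.group_hash_metadata.rows_updated", amount=updated, sample_rate=1.0
        )
```

Because the select and the update use the same filter, the loop terminates once every matching row has been cleared, with no risk of the skipped-ID problem described earlier.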