fix(deletions): Only use seer_matched_grouphash to filter #103051
Conversation
In #102960, I added a more optimal query for GroupHashMetadata; however, there's no need to track `last_max_id`, since as we update rows there will be fewer rows left to select from. Fixes [SENTRY-5C2V](https://sentry.sentry.io/issues/7005021677/).
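For context, a minimal sketch of what the simplified batch loop looks like without the `last_max_id` cursor; the `BATCH_SIZE` constant, the import path, and the exact queryset shape are assumptions for illustration, not the merged code:

```python
from collections.abc import Sequence

from sentry.models.grouphashmetadata import GroupHashMetadata  # import path assumed


def update_group_hash_metadata_in_batches(hash_ids: Sequence[int]) -> int:
    """Sketch: clear seer_matched_grouphash in batches for the given group hashes."""
    BATCH_SIZE = 1000  # hypothetical batch size
    total_updated = 0
    while True:
        # No ORDER BY and no id cursor: every row we update drops out of this
        # filter, so repeated passes naturally shrink the result set to zero.
        batch_metadata_ids = list(
            GroupHashMetadata.objects.filter(
                seer_matched_grouphash_id__in=hash_ids
            ).values_list("id", flat=True)[:BATCH_SIZE]
        )
        if not batch_metadata_ids:
            break
        total_updated += GroupHashMetadata.objects.filter(
            id__in=batch_metadata_ids
        ).update(seer_matched_grouphash=None)
    return total_updated
```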
# process large datasets without loading all IDs into memory or
# creating large NOT IN clauses. We fetch IDs without ORDER BY to avoid
# database sorting overhead, then sort the small batch in Python.
last_max_id = 0
This was mentioned by @wedamija here:
#102960 (comment)
# from the caller, so this IN clause is intentionally not batched
batch_metadata_ids = list(
    GroupHashMetadata.objects.filter(
        seer_matched_grouphash_id__in=hash_ids, id__gt=last_max_id
We don't need `id__gt=last_max_id` since we don't care about ordering at all. Once a row's `seer_matched_grouphash` column is updated (see the block below), we will never be able to select that row again.
GroupHashMetadata.objects.filter(id__in=batch_metadata_ids).update(
    seer_matched_grouphash=None
)
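To make that point concrete, here is a hypothetical test-style check (reusing `hash_ids` and `batch_metadata_ids` from the diff above, not code from this PR) showing that once a batch is updated, the same filter can no longer match those rows, so the loop converges without an `id` cursor:

```python
# After clearing the FK, the original filter no longer matches this batch,
# so the next SELECT only returns rows that haven't been processed yet.
GroupHashMetadata.objects.filter(id__in=batch_metadata_ids).update(
    seer_matched_grouphash=None
)
assert not GroupHashMetadata.objects.filter(
    seer_matched_grouphash_id__in=hash_ids, id__in=batch_metadata_ids
).exists()
```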
The main problem with this query is that we're filtering on two columns, so it would require a composite index.
Postgres will pick either the primary key or the foreign key index and then sequentially scan from there. Going down to one column in the condition makes query planning simpler.
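For illustration, the single-column version of the batch query might look like the sketch below; the `values_list`/slice shape, the `batch_size` variable, and the table name in the SQL comment are assumptions, not necessarily the merged code:

```python
# Filtering on the FK column alone keeps the WHERE clause to one indexed
# column, so Postgres can use the existing seer_matched_grouphash_id index
# (or the primary key) without wanting a composite
# (seer_matched_grouphash_id, id) index.
#
# Approximate SQL (table name illustrative):
#   SELECT id FROM sentry_grouphashmetadata
#   WHERE seer_matched_grouphash_id IN (...)
#   LIMIT <batch_size>
batch_metadata_ids = list(
    GroupHashMetadata.objects.filter(
        seer_matched_grouphash_id__in=hash_ids
    ).values_list("id", flat=True)[:batch_size]
)
```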
last_max_id = batch_metadata_ids[-1]  # Last element after sorting
def update_group_hash_metadata_in_batches_old(hash_ids: Sequence[int]) -> int:
We're dropping the old code since the new query is better and has not made anything worse.
markstory left a comment
Makes sense to me.
Issues attributed to commits in this pull request
This pull request was merged and Sentry observed the following issues:
In #102960, I added a new query for GroupHashMetadata; however, using `id` together with `seer_matched_grouphash_id__in` requires a composite index, which we don't have. Even without that, we don't actually need `last_max_id` to keep fetching and updating rows. Fixes [SENTRY-5C2V](https://sentry.sentry.io/issues/7005021677/).
Not production but the custom script.