@armenzg armenzg commented Nov 10, 2025

In #102960, I added a new query for GroupHashMetadata. However, filtering on `id` together with `seer_matched_grouphash_id__in` requires a composite index, which we don't have.

Even without that, we don't actually need `last_max_id` to keep fetching rows and updating them.

Fixes SENTRY-5C2V.

In #102960, I added a more efficient query for GroupHashMetadata. However, there is no need to track `last_max_id`: as we update rows, there will be fewer rows left to select from.

Fixes [SENTRY-5C2V](https://sentry.sentry.io/issues/7005021677/).
@armenzg armenzg self-assigned this Nov 10, 2025
@github-actions github-actions bot added the Scope: Backend label Nov 10, 2025
# process large datasets without loading all IDs into memory or
# creating large NOT IN clauses. We fetch IDs without ORDER BY to avoid
# database sorting overhead, then sort the small batch in Python.
last_max_id = 0
Member Author

This was mentioned by @wedamija in here:
#102960 (comment)

# from the caller, so this IN clause is intentionally not batched
batch_metadata_ids = list(
    GroupHashMetadata.objects.filter(
        seer_matched_grouphash_id__in=hash_ids, id__gt=last_max_id
Member Author

We don't need `id__gt=last_max_id` since we don't care about ordering at all. Once a row's `seer_matched_grouphash` column is set to `None` (see the block below), it no longer matches the filter, so we will never select it again.

GroupHashMetadata.objects.filter(id__in=batch_metadata_ids).update(
    seer_matched_grouphash=None
)
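
To illustrate why the cursor is unnecessary, here is a minimal sketch of the same select-then-update loop. It uses an in-memory `sqlite3` database as a stand-in for the Django ORM and Postgres; the table name `group_hash_metadata`, the helper names, and the `batch_size` parameter are assumptions made for the example, not the code in this PR.

```python
import sqlite3
from collections.abc import Sequence


def clear_seer_matches_in_batches(
    conn: sqlite3.Connection, hash_ids: Sequence[int], batch_size: int = 2
) -> int:
    """Null out seer_matched_grouphash_id for matching rows, one batch at a time."""
    if not hash_ids:
        return 0
    placeholders = ",".join("?" for _ in hash_ids)
    total_updated = 0
    while True:
        # Single-column filter: no id cursor and no ORDER BY.
        rows = conn.execute(
            "SELECT id FROM group_hash_metadata "
            f"WHERE seer_matched_grouphash_id IN ({placeholders}) LIMIT ?",
            (*hash_ids, batch_size),
        ).fetchall()
        if not rows:
            break  # Nothing matches any more, so we are done.
        batch_ids = [row[0] for row in rows]
        id_placeholders = ",".join("?" for _ in batch_ids)
        # Updated rows stop matching the SELECT above, so the next
        # iteration naturally picks up a fresh batch.
        cursor = conn.execute(
            "UPDATE group_hash_metadata SET seer_matched_grouphash_id = NULL "
            f"WHERE id IN ({id_placeholders})",
            batch_ids,
        )
        total_updated += cursor.rowcount
    return total_updated


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny hypothetical table to run the loop against."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE group_hash_metadata ("
        "id INTEGER PRIMARY KEY, seer_matched_grouphash_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO group_hash_metadata (id, seer_matched_grouphash_id) "
        "VALUES (?, ?)",
        [(1, 10), (2, 10), (3, 20), (4, 30), (5, 20)],
    )
    return conn
```

The loop ends only because every pass shrinks the set of matching rows. If an update could fail to change a row, a real implementation would likely want an extra guard against looping forever.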

Member Author

The main problem with this query is that it filters on two columns, so it would need a composite index to run efficiently.

Member

Postgres will pick one index, either the primary key or the foreign key index, and then scan sequentially from there. Going down to one column in the condition makes query planning simpler.

last_max_id = batch_metadata_ids[-1] # Last element after sorting


def update_group_hash_metadata_in_batches_old(hash_ids: Sequence[int]) -> int:
Member Author

We're dropping the old code because the new query is better and has not made anything worse.

@armenzg armenzg marked this pull request as ready for review November 10, 2025 13:02
@armenzg armenzg requested a review from a team as a code owner November 10, 2025 13:02
@armenzg armenzg changed the title from "fix(deletions): Do not use last_max_id" to "fix(deletions): Only use seer_matched_grouphash to filter" Nov 10, 2025
Member

@markstory markstory left a comment

Makes sense to me.

@armenzg armenzg merged commit ae37037 into master Nov 10, 2025
67 checks passed
@armenzg armenzg deleted the 11_07/more_perf/armenzg branch November 10, 2025 16:31
@sentry

sentry bot commented Nov 12, 2025

Issues attributed to commits in this pull request

This pull request was merged and Sentry observed the following issues:

Jesse-Box pushed a commit that referenced this pull request Nov 12, 2025
@armenzg
Member Author

armenzg commented Nov 12, 2025

Issues attributed to commits in this pull request

This pull request was merged and Sentry observed the following issues:

These did not come from production but from the custom script.

andrewshie-sentry pushed a commit that referenced this pull request Nov 13, 2025