feat(similarity): add functionality to backfill a cohort of projects #73075
Conversation
Codecov Report

Attention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #73075 +/- ##
==========================================
+ Coverage 77.94% 77.98% +0.03%
==========================================
Files 6593 6621 +28
Lines 295229 295489 +260
Branches 50884 50892 +8
==========================================
+ Hits 230127 230423 +296
+ Misses 58824 58789 -35
+ Partials 6278 6277 -1
def get_project_for_batch(last_processed_project_index, cohort_list, cohort_name):
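A minimal sketch of what a cohort-iteration helper with this signature might do, assuming it walks an ordered list of project ids and signals exhaustion with `None`; this is illustrative, not Sentry's actual implementation:

```python
def get_project_for_batch(last_processed_project_index, cohort_list, cohort_name):
    """Return (next_project_id, next_project_index) from the cohort,
    or (None, None) once every project has been processed.

    cohort_name is assumed to be used only for logging/metrics in the
    real task; it is unused in this sketch.
    """
    next_project_index = last_processed_project_index + 1
    if next_project_index >= len(cohort_list):
        return None, None  # cohort exhausted; stop spawning batches
    return cohort_list[next_project_index], next_project_index
```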
nit: would this be better in utils.py?
lgtm!
Looks good overall! There were some spots that weren't super clear to me (as someone who hasn't been in the backfill code all along) which I pointed out, but feel free to take or leave those suggestions as you see fit.
src/sentry/tasks/embeddings_grouping/backfill_seer_grouping_records_for_project.py
@@ -89,11 +134,13 @@ def backfill_seer_grouping_records_for_project(
Possibly out of scope for this PR, since it's not new here, but reading the backfill code for the first time, it's not clear quite what's happening here between lines 130 and 166. Why are we filtering the snuba results? `has_snuba_row` and `has_nodestore_row` imply that there might be some without snuba and/or nodestore rows? How/when would that happen? Similarly, `with_no_embeddings`... do some of the ones we're backfilling already have embeddings (and if so, why are we backfilling them)? What is the hashes dict, and what would make it empty?

Could we add a few comments explaining what's happening and why?
Yes, and a few reasons: 1) the row got expired in ClickHouse, and for nodestore 2) it's not so much that there were no nodestore rows, but that the stacktraces in nodestore weren't eligible for Seer.

> do some of the ones we're backfilling already have embeddings (and if so, why are we backfilling them)?

Yes, in case the backfill gets run twice or something. We don't backfill them.

> What is the hashes dict, and what would make it empty?

It is the mapping of group hashes to group_id.
If we end up needing to put a lot more work into the backfill, yes, but I imagine as we release this feature we'll end up deleting it soon, so I could add comments if it becomes opaque to us as we work on it.
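The filtering this thread discusses can be sketched roughly as follows. This is an illustrative stand-in, not Sentry's actual code; the function and parameter names are hypothetical, but it shows the three checks the reviewer asked about (expired ClickHouse rows, Seer-ineligible nodestore stacktraces, and already-backfilled groups):

```python
def filter_groups_for_backfill(group_ids, snuba_rows, nodestore_rows, groups_with_embeddings):
    """Keep only the group ids that still need an embedding written."""
    eligible = []
    for group_id in group_ids:
        # Row may have expired out of ClickHouse since the group was created.
        has_snuba_row = group_id in snuba_rows
        # Nodestore row exists but its stacktrace may not be eligible for Seer.
        has_nodestore_row = group_id in nodestore_rows
        # Skip groups that already have embeddings (e.g. the backfill ran twice).
        with_no_embeddings = group_id not in groups_with_embeddings
        if has_snuba_row and has_nodestore_row and with_no_embeddings:
            eligible.append(group_id)
    return eligible
```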
src/sentry/tasks/embeddings_grouping/backfill_seer_grouping_records_for_project.py
src/sentry/tasks/embeddings_grouping/backfill_seer_grouping_records_for_project.py
def initialize_backfill(
    project_id: int,
    cohort: str | list[int] | None,
    last_processed_group_index: int | None,
    last_processed_project_index: int | None,
):
How many times/when does this get called? Passing in the `last_processed` indices implies it's more than just at the beginning. Could we add this info to a docstring?
This gets called at the start of every backfill batch, which is what one celery task is. I can add docstrings / comments if we end up needing to loop more folks into this code.
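The batch-per-task pattern described here can be sketched without celery: each "task" run processes one slice of groups, then hands the updated last-processed index to the next run until nothing is left. Names and batch size are illustrative, not Sentry's actual task signatures:

```python
BATCH_SIZE = 2  # illustrative; the real batch size is a tuning choice

def run_backfill_batch(groups, last_processed_group_index, processed):
    """Process one batch; return the next start index, or None when finished."""
    start = last_processed_group_index
    batch = groups[start:start + BATCH_SIZE]
    if not batch:
        return None  # nothing left; the chain of tasks stops here
    processed.extend(batch)
    return start + len(batch)  # the next task resumes from here

def drive_backfill(groups):
    """Stand-in for celery re-scheduling: loop until a batch returns None."""
    processed = []
    index = 0
    while index is not None:
        index = run_backfill_batch(groups, index, processed)
    return processed
```

In the real system each iteration of the loop would instead be a separate celery task invocation carrying `last_processed_group_index` (and the project/cohort indices) as task parameters.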
if last_processed_group_index is None:
    last_processed_group_index_ret = int(
        redis_client.get(make_backfill_grouping_index_redis_key(project_id)) or 0
    )
else:
    last_processed_group_index_ret = last_processed_group_index
Nit: The way it was before (we only overwrite if the value is `None` and otherwise just return the value as given) is a little easier to grok than splitting it into an if/else and giving the value a new name.
mypy hates re-assignment :(
def make_backfill_grouping_index_redis_key(project_id: int):
    redis_key = "grouping_record_backfill.last_processed_grouping_index"
Nit: Why `grouping_index` here and `group_index` elsewhere? Aren't they all keeping track of which group you're on?
No reason, just a small inconsistency. Can end up fixing it if it causes confusion.
follow ups from #73075
Suspect Issues

This pull request was deployed and Sentry observed the following issues:
`cohort`, `last_processed_project_index` to the celery task parameters