
feat(similarity): add functionality to backfill a cohort of projects #73075

Merged — 14 commits merged into master from jferg/cohorts, Jun 21, 2024

Conversation

@JoshFerge (Member) commented Jun 20, 2024

  • since we need to daisy-chain our celery tasks, we can't simply loop through a list of projects and call the backfill on each one; instead, each celery call triggers the next one over the projects
  • adds new parameters, cohort and last_processed_project_index, to the celery task parameters
  • in places where we'd previously simply return, we now try to call the next backfill; if there are more project_ids in the cohort, we start on the next project instead of returning
  • adds logic to correctly increment the project_id_index between jobs
  • adds tests. a few more tests could be written to cover the redis key and feature branches, but we can follow up if those scenarios cause issues.
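The daisy-chaining described above can be sketched as follows. This is an illustrative sketch only — the function names, signatures, and recursive call are hypothetical stand-ins for the actual celery task, which would schedule the next task asynchronously rather than recurse:

```python
# Hypothetical sketch of advancing through a cohort of projects, where each
# "task" triggers the next one instead of the caller looping over projects.

def get_next_project(last_processed_project_index, cohort_list):
    """Return (project_id, new_index) for the next project in the cohort,
    or (None, None) when the cohort is exhausted."""
    new_index = last_processed_project_index + 1
    if new_index >= len(cohort_list):
        return None, None
    return cohort_list[new_index], new_index

def backfill_project(project_id, cohort_list, last_processed_project_index, processed):
    processed.append(project_id)  # stand-in for the real per-project backfill work
    next_project_id, next_index = get_next_project(last_processed_project_index, cohort_list)
    if next_project_id is None:
        return  # cohort exhausted; nothing left to chain
    # The real code would schedule the next celery task here (e.g. via
    # apply_async) instead of calling the function directly.
    backfill_project(next_project_id, cohort_list, next_index, processed)
```

Chaining this way means the index bookkeeping, not a loop, decides when the whole cohort is done.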

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Jun 20, 2024
@JoshFerge JoshFerge marked this pull request as ready for review June 20, 2024 20:29

codecov bot commented Jun 20, 2024

Codecov Report

Attention: Patch coverage is 87.75510% with 6 lines in your changes missing coverage. Please review.

Project coverage is 77.98%. Comparing base (54c67da) to head (82cfc1f).
Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #73075      +/-   ##
==========================================
+ Coverage   77.94%   77.98%   +0.03%     
==========================================
  Files        6593     6621      +28     
  Lines      295229   295489     +260     
  Branches    50884    50892       +8     
==========================================
+ Hits       230127   230423     +296     
+ Misses      58824    58789      -35     
+ Partials     6278     6277       -1     
Files Coverage Δ
src/sentry/conf/server.py 87.74% <100.00%> (+0.02%) ⬆️
...ping/backfill_seer_grouping_records_for_project.py 82.35% <90.47%> (+3.92%) ⬆️
src/sentry/tasks/embeddings_grouping/utils.py 91.07% <85.18%> (-0.80%) ⬇️

... and 41 files with indirect coverage changes



def get_project_for_batch(last_processed_project_index, cohort_list, cohort_name):
Member:
nit: would this be better in utils.py?

src/sentry/tasks/embeddings_grouping/utils.py (outdated; conversation resolved)
@jangjodi (Member) left a comment:

lgtm!

@lobsterkatie (Member) left a comment:

Looks good overall! There were some spots that weren't super clear to me (as someone who hasn't been in the backfill code all along) which I pointed out, but feel free to take or leave those suggestions as you see fit.

@@ -89,11 +134,13 @@ def backfill_seer_grouping_records_for_project(

Member:

Possibly out of scope for this PR, since it's not new here, but reading the backfill code for the first time, it's not clear quite what's happening here between lines 130 and 166. Why are we filtering the snuba results? has_snuba_row and has_nodestore_row imply that there might be some without snuba and/or nodestore rows? How/when would that happen? Similarly, with_no_embeddings... do some of the ones we're backfilling already have embeddings (and if so, why are we backfilling them)? What is the hashes dict, and what would make it empty?

Could we add a few comments explaining what's happening and why?

Member Author:

yes, and a few reasons: 1) the row got expired in ClickHouse, and for nodestore, 2) it's not so much that there were no nodestore rows, but that the stacktraces in nodestore weren't eligible for seer.

> do some of the ones we're backfilling already have embeddings (and if so, why are we backfilling them)?

yes, in case the backfill gets run twice or something. we don't backfill them.

> What is the hashes dict, and what would make it empty?

it is the mapping of group hashes to group_id.

> Could we add a few comments explaining what's happening and why?

if we end up needing to put a lot more work into the backfill, yes, but i imagine as we release this feature we'll end up deleting it soon, so we could add comments if it becomes opaque to us as we work on it.
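The filtering discussed in this thread can be sketched roughly as below. The data shapes and flag names here are assumptions for illustration, not the actual code; they just capture the three skip reasons the author lists (snuba row expired in ClickHouse, nodestore stacktrace not eligible for Seer, embeddings already present from an earlier run):

```python
# Illustrative sketch of filtering backfill candidates; shapes are assumed.

def filter_backfill_candidates(group_flags):
    """group_flags maps group_id -> dict of booleans describing that group."""
    eligible = []
    for group_id, flags in group_flags.items():
        if not flags["has_snuba_row"]:
            continue  # row expired in ClickHouse
        if not flags["has_nodestore_row"]:
            continue  # stacktrace in nodestore was not eligible for Seer
        if flags["has_embeddings"]:
            continue  # already backfilled (e.g. the backfill ran twice)
        eligible.append(group_id)
    return eligible
```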

Comment on lines +95 to +100
def initialize_backfill(
project_id: int,
cohort: str | list[int] | None,
last_processed_group_index: int | None,
last_processed_project_index: int | None,
):
Member:

How many times/when does this get called? Passing in the last_processed indices implies it's more than just at the beginning. Could we add this info to a docstring?

Member Author:

this gets called at the start of every backfill batch, which is what one celery task is. we can add docstrings / comments if we end up needing to loop more folks into this code.
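The batch lifecycle described in this reply — one celery task per batch, each re-deriving its starting cursor — might look roughly like this. All names and the return shape here are hypothetical:

```python
# Rough sketch: one task processes one batch of groups and reports the
# cursor for the next chained task.

def run_backfill_batch(last_processed_group_index, batch_size, total_groups):
    start = last_processed_group_index if last_processed_group_index is not None else 0
    end = min(start + batch_size, total_groups)
    # ... the real task would fetch groups[start:end] and send them to Seer ...
    done = end >= total_groups
    # `end` becomes last_processed_group_index for the next chained task
    return end, done
```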

Comment on lines +114 to +119
if last_processed_group_index is None:
last_processed_group_index_ret = int(
redis_client.get(make_backfill_grouping_index_redis_key(project_id)) or 0
)
else:
last_processed_group_index_ret = last_processed_group_index
Member:

Nit:

The way it was before (we only overwrite if the value is None and otherwise just return the value as given) is a little easier to grok than splitting it into an if/else and giving the value a new name.

Member Author:

mypy hates re-assignment :(
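A runnable version of the fallback above, using a stub redis client and keeping the new-name pattern the author chose for the `int | None` parameter (the stub and helper names are illustrative, not the real Sentry client):

```python
# Stub redis client so the fallback logic can run standalone.
class StubRedis:
    def __init__(self, data=None):
        self.data = data or {}

    def get(self, key):
        return self.data.get(key)

def resolve_group_index(last_processed_group_index, redis_client, redis_key):
    if last_processed_group_index is None:
        # Fall back to the index persisted in redis, defaulting to 0 when
        # the key is missing or empty.
        last_processed_group_index_ret = int(redis_client.get(redis_key) or 0)
    else:
        last_processed_group_index_ret = last_processed_group_index
    return last_processed_group_index_ret
```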

Comment on lines +475 to +476
def make_backfill_grouping_index_redis_key(project_id: int):
redis_key = "grouping_record_backfill.last_processed_grouping_index"
Member:

Nit:

Why grouping_index here and group_index elsewhere? Aren't they all keeping track of which group you're on?

Member Author:

no reason, just a small inconsistency. we can fix it if it causes confusion.
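For context, a plausible completion of the two-line excerpt above — the real implementation may differ, but a per-project key presumably appends the project id to the shared prefix:

```python
# Hypothetical completion of the excerpted helper; the suffix scheme is an
# assumption, not confirmed by the diff shown here.
def make_backfill_grouping_index_redis_key(project_id: int) -> str:
    redis_key = "grouping_record_backfill.last_processed_grouping_index"
    return f"{redis_key}.{project_id}"
```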

src/sentry/tasks/embeddings_grouping/utils.py (conversation resolved)
@JoshFerge JoshFerge merged commit 118771a into master Jun 21, 2024
48 of 49 checks passed
@JoshFerge JoshFerge deleted the jferg/cohorts branch June 21, 2024 03:32
JoshFerge added a commit that referenced this pull request Jun 21, 2024
priscilawebdev pushed a commit that referenced this pull request Jun 24, 2024

sentry-io bot commented Jun 24, 2024

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ AssertionError: assert {'request_has...ersion': 'v0'} == {'request_has...ersion': 'v0'} pytest.runtest.protocol tests/sentry/tasks/test... View Issue
  • ‼️ AssertionError: assert {'request_has...ersion': 'v0'} == {'request_has...ersion': 'v0'} pytest.runtest.protocol tests/sentry/tasks/test... View Issue
  • ‼️ AssertionError: assert {'request_has...ersion': 'v0'} == {'request_has...ersion': 'v0'} pytest.runtest.protocol tests/sentry/tasks/test... View Issue


@github-actions github-actions bot locked and limited conversation to collaborators Jul 18, 2024
Labels
Scope: Backend Automatically applied to PRs that change backend components
3 participants