
feat(similarity): add functionality to backfill a cohort of projects #73075

Merged — 14 commits merged into master from jferg/cohorts, Jun 21, 2024

Conversation

@JoshFerge (Member) commented Jun 20, 2024

  • since we need to daisy-chain our celery tasks, we can't simply loop through a list of projects and call the backfill on each one; instead, each celery call triggers the next one over the projects
  • adds new parameters, cohort and last_processed_project_index, to the celery task parameters
  • in places where we'd previously simply return, we now try to call the next backfill; if there are more project_ids in the cohort, we start on the next project instead of returning
  • adds logic to correctly increment the project_id_index between jobs
  • adds tests. a few more tests could be written to cover the redis key and feature branches, but we can follow up if those scenarios cause issues.
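The daisy-chaining described above can be sketched as follows. This is an illustrative sketch only — the function names, signatures, and recursive call are hypothetical stand-ins for the actual celery task, which would schedule the next task asynchronously rather than recurse:

```python
# Hypothetical sketch of advancing through a cohort of projects, where each
# "task" triggers the next one instead of the caller looping over projects.

def get_next_project(last_processed_project_index, cohort_list):
    """Return (project_id, new_index) for the next project in the cohort,
    or (None, None) when the cohort is exhausted."""
    new_index = last_processed_project_index + 1
    if new_index >= len(cohort_list):
        return None, None
    return cohort_list[new_index], new_index

def backfill_project(project_id, cohort_list, last_processed_project_index, processed):
    processed.append(project_id)  # stand-in for the real per-project backfill work
    next_project_id, next_index = get_next_project(last_processed_project_index, cohort_list)
    if next_project_id is None:
        return  # cohort exhausted; nothing left to chain
    # The real code would schedule the next celery task here (e.g. via
    # apply_async) instead of calling the function directly.
    backfill_project(next_project_id, cohort_list, next_index, processed)
```

Chaining this way means the index bookkeeping, not a loop, decides when the whole cohort is done.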

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Jun 20, 2024
@JoshFerge JoshFerge marked this pull request as ready for review June 20, 2024 20:29

codecov bot commented Jun 20, 2024

Codecov Report

Attention: Patch coverage is 87.75510% with 6 lines in your changes missing coverage. Please review.

Project coverage is 77.98%. Comparing base (54c67da) to head (82cfc1f).
Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #73075      +/-   ##
==========================================
+ Coverage   77.94%   77.98%   +0.03%     
==========================================
  Files        6593     6621      +28     
  Lines      295229   295489     +260     
  Branches    50884    50892       +8     
==========================================
+ Hits       230127   230423     +296     
+ Misses      58824    58789      -35     
+ Partials     6278     6277       -1     
Files Coverage Δ
src/sentry/conf/server.py 87.74% <100.00%> (+0.02%) ⬆️
...ping/backfill_seer_grouping_records_for_project.py 82.35% <90.47%> (+3.92%) ⬆️
src/sentry/tasks/embeddings_grouping/utils.py 91.07% <85.18%> (-0.80%) ⬇️

... and 41 files with indirect coverage changes



def get_project_for_batch(last_processed_project_index, cohort_list, cohort_name):
Member:
nit: would this be better in utils.py?

src/sentry/tasks/embeddings_grouping/utils.py (outdated; conversation resolved)
@jangjodi (Member) left a comment:

lgtm!

@lobsterkatie (Member) left a comment:

Looks good overall! There were some spots that weren't super clear to me (as someone who hasn't been in the backfill code all along) which I pointed out, but feel free to take or leave those suggestions as you see fit.

@@ -89,11 +134,13 @@ def backfill_seer_grouping_records_for_project(

Member:

Possibly out of scope for this PR, since it's not new here, but reading the backfill code for the first time, it's not clear quite what's happening here between lines 130 and 166. Why are we filtering the snuba results? has_snuba_row and has_nodestore_row imply that there might be some without snuba and/or nodestore rows? How/when would that happen? Similarly, with_no_embeddings... do some of the ones we're backfilling already have embeddings (and if so, why are we backfilling them)? What is the hashes dict, and what would make it empty?

Could we add a few comments explaining what's happening and why?

Member Author:

yes, and a few reasons: 1) the row got expired in ClickHouse, and for nodestore, 2) it's not so much that there were no nodestore rows, but that the stacktraces in nodestore weren't eligible for seer.

> do some of the ones we're backfilling already have embeddings (and if so, why are we backfilling them)?

yes, in case the backfill gets run twice or something. we don't backfill them.

> What is the hashes dict, and what would make it empty?

it is the mapping of group hashes to group_id.

> Could we add a few comments explaining what's happening and why?

if we end up needing to put a lot more work into the backfill, yes, but i imagine as we release this feature we'll end up deleting it soon, so we could add comments if it becomes opaque to us as we work on it.
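The filtering discussed in this thread can be sketched roughly as below. The data shapes and flag names here are assumptions for illustration, not the actual code; they just capture the three skip reasons the author lists (snuba row expired in ClickHouse, nodestore stacktrace not eligible for Seer, embeddings already present from an earlier run):

```python
# Illustrative sketch of filtering backfill candidates; shapes are assumed.

def filter_backfill_candidates(group_flags):
    """group_flags maps group_id -> dict of booleans describing that group."""
    eligible = []
    for group_id, flags in group_flags.items():
        if not flags["has_snuba_row"]:
            continue  # row expired in ClickHouse
        if not flags["has_nodestore_row"]:
            continue  # stacktrace in nodestore was not eligible for Seer
        if flags["has_embeddings"]:
            continue  # already backfilled (e.g. the backfill ran twice)
        eligible.append(group_id)
    return eligible
```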

Comment on lines +95 to +100
def initialize_backfill(
project_id: int,
cohort: str | list[int] | None,
last_processed_group_index: int | None,
last_processed_project_index: int | None,
):
Member:

How many times/when does this get called? Passing in the last_processed indices implies it's more than just at the beginning. Could we add this info to a docstring?

Member Author:

this gets called at the start of every backfill batch, which is what one celery task is. we can add docstrings / comments if we end up needing to loop more folks into this code.
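The batch lifecycle described in this reply — one celery task per batch, each re-deriving its starting cursor — might look roughly like this. All names and the return shape here are hypothetical:

```python
# Rough sketch: one task processes one batch of groups and reports the
# cursor for the next chained task.

def run_backfill_batch(last_processed_group_index, batch_size, total_groups):
    start = last_processed_group_index if last_processed_group_index is not None else 0
    end = min(start + batch_size, total_groups)
    # ... the real task would fetch groups[start:end] and send them to Seer ...
    done = end >= total_groups
    # `end` becomes last_processed_group_index for the next chained task
    return end, done
```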

Comment on lines +114 to +119
if last_processed_group_index is None:
last_processed_group_index_ret = int(
redis_client.get(make_backfill_grouping_index_redis_key(project_id)) or 0
)
else:
last_processed_group_index_ret = last_processed_group_index
Member:

Nit:

The way it was before (we only overwrite if the value is None and otherwise just return the value as given) is a little easier to grok than splitting it into an if/else and giving the value a new name.

Member Author:

mypy hates re-assignment :(
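A runnable version of the fallback above, using a stub redis client and keeping the new-name pattern the author chose for the `int | None` parameter (the stub and helper names are illustrative, not the real Sentry client):

```python
# Stub redis client so the fallback logic can run standalone.
class StubRedis:
    def __init__(self, data=None):
        self.data = data or {}

    def get(self, key):
        return self.data.get(key)

def resolve_group_index(last_processed_group_index, redis_client, redis_key):
    if last_processed_group_index is None:
        # Fall back to the index persisted in redis, defaulting to 0 when
        # the key is missing or empty.
        last_processed_group_index_ret = int(redis_client.get(redis_key) or 0)
    else:
        last_processed_group_index_ret = last_processed_group_index
    return last_processed_group_index_ret
```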

Comment on lines +475 to +476
def make_backfill_grouping_index_redis_key(project_id: int):
redis_key = "grouping_record_backfill.last_processed_grouping_index"
Member:

Nit:

Why grouping_index here and group_index elsewhere? Aren't they all keeping track of which group you're on?

Member Author:

no reason, just a small inconsistency. we can fix it if it causes confusion.
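For context, a plausible completion of the two-line excerpt above — the real implementation may differ, but a per-project key presumably appends the project id to the shared prefix:

```python
# Hypothetical completion of the excerpted helper; the suffix scheme is an
# assumption, not confirmed by the diff shown here.
def make_backfill_grouping_index_redis_key(project_id: int) -> str:
    redis_key = "grouping_record_backfill.last_processed_grouping_index"
    return f"{redis_key}.{project_id}"
```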

src/sentry/tasks/embeddings_grouping/utils.py (conversation resolved)
@JoshFerge JoshFerge merged commit 118771a into master Jun 21, 2024
48 of 49 checks passed
@JoshFerge JoshFerge deleted the jferg/cohorts branch June 21, 2024 03:32
JoshFerge added a commit that referenced this pull request Jun 21, 2024
priscilawebdev pushed a commit that referenced this pull request Jun 24, 2024

sentry-io bot commented Jun 24, 2024

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ AssertionError: assert {'request_has...ersion': 'v0'} == {'request_has...ersion': 'v0'} pytest.runtest.protocol tests/sentry/tasks/test... View Issue
  • ‼️ AssertionError: assert {'request_has...ersion': 'v0'} == {'request_has...ersion': 'v0'} pytest.runtest.protocol tests/sentry/tasks/test... View Issue
  • ‼️ AssertionError: assert {'request_has...ersion': 'v0'} == {'request_has...ersion': 'v0'} pytest.runtest.protocol tests/sentry/tasks/test... View Issue


@github-actions github-actions bot locked and limited conversation to collaborators Jul 18, 2024
Labels
Scope: Backend Automatically applied to PRs that change backend components
3 participants