
feat(similarity): Add delete record by hash task #72767

Merged
jangjodi merged 21 commits into master from jodi/delete-record-by-hash-task on Jun 26, 2024

Conversation

jangjodi (Member)

Add the task that calls the Seer delete record by hash endpoint.
Given a list of hashes, batch them by size 20.

github-actions bot added the Scope: Backend label on Jun 14, 2024
jangjodi marked this pull request as draft on June 14, 2024 13:09
JoshFerge (Member) left a comment

do we need to register the celery task? and could we make batch size an option? i have a feeling we can probably go pretty high with it, into the thousands

jangjodi (Member, Author)

> do we need to register the celery task? and could we make batch size an option? i have a feeling we can probably go pretty high with it, into the thousands

The task is registered here. And yes, I've made it an option; good idea, thanks!
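
For reference, a minimal sketch of how the batch size could be exposed as a Sentry option; the option key, default, and flag below are illustrative assumptions, not necessarily the exact values this PR registers in src/sentry/options/defaults.py:

# Sketch only: the option key, default, and flag are assumptions for illustration.
from sentry.options import register
from sentry.options.manager import FLAG_AUTOMATOR_MODIFIABLE

register(
    "embeddings-grouping.seer.delete-record-batch-size",
    default=100,
    flags=FLAG_AUTOMATOR_MODIFIABLE,
)

The task would then read the option at call time with options.get(...), so the batch size can be tuned without a deploy.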

jangjodi requested a review from JoshFerge on June 20, 2024 18:12
len_hashes = len(hashes)
# Delete one batch of hashes, starting where the previous batch left off
end_index = min(last_deleted_index + batch_size, len_hashes)
delete_grouping_records_by_hash(project_id, hashes[last_deleted_index:end_index])
# If hashes remain, the task re-spawns itself for the rest (see the thread below)
if end_index < len_hashes:
JoshFerge (Member)

any reason to call the task again instead of just calling the delete_grouping_records_by_hash function repeatedly?

jangjodi (Member, Author)

it's based on the logic we used for the backfill - I didn't want this task to time out, so I'm respawning the task instead

JoshFerge (Member)

what is our current timeout set to on the celery task?

jangjodi (Member, Author)

haha whoops, just realized I didn't even add a timeout, but I was thinking of using the same timeouts from the backfill: soft_time_limit=60 * 15, time_limit=60 * (15 + 5)

JoshFerge (Member)

got it. i'd consider increasing the batch size to 100, and optionally doing some calls in a loop, so that we spawn fewer celery tasks, as each one should run pretty quickly (less than a second)

jangjodi (Member, Author)

nice, I didn't realize they were that quick - thanks!
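
Putting this thread together, here is a minimal sketch of the respawning approach with the backfill-style time limits and an option-driven batch size; the task name, queue, option key, and argument order are illustrative assumptions rather than the PR's exact code:

# Sketch only: the task name, queue, and option key are assumptions for illustration.
from sentry import options
from sentry.seer.similarity.grouping_records import delete_grouping_records_by_hash
from sentry.tasks.base import instrumented_task


@instrumented_task(
    name="sentry.tasks.delete_seer_grouping_records_by_hash",
    queue="delete_seer_grouping_records_by_hash",
    soft_time_limit=60 * 15,    # same limits discussed for the backfill
    time_limit=60 * (15 + 5),
)
def delete_seer_grouping_records_by_hash(project_id, hashes, last_deleted_index=0):
    # Hypothetical option key; see the option registration sketch above.
    batch_size = options.get("embeddings-grouping.seer.delete-record-batch-size")
    len_hashes = len(hashes)
    end_index = min(last_deleted_index + batch_size, len_hashes)
    delete_grouping_records_by_hash(project_id, hashes[last_deleted_index:end_index])
    if end_index < len_hashes:
        # Re-spawn the task for the remaining hashes instead of looping here,
        # so a single invocation never approaches the soft time limit.
        delete_seer_grouping_records_by_hash.apply_async(
            args=[project_id, hashes, end_index],
        )

The loop variant suggested above would simply wrap the delete call in a small for loop over several batches per invocation before re-spawning, since each call reportedly completes in under a second.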


codecov bot commented Jun 25, 2024

Codecov Report

Attention: Patch coverage is 94.11765% with 2 lines in your changes missing coverage. Please review.

Please upload report for BASE (master@449de6f). Learn more about missing BASE report.
Report is 1 commit behind head on master.

Current head 8694303 differs from pull request most recent head c4ecda3

Please upload reports for the commit c4ecda3 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff            @@
##             master   #72767   +/-   ##
=========================================
  Coverage          ?   78.08%           
=========================================
  Files             ?     6620           
  Lines             ?   295396           
  Branches          ?    50892           
=========================================
  Hits              ?   230671           
  Misses            ?    58461           
  Partials          ?     6264           
Files Coverage Δ
src/sentry/conf/server.py 87.78% <100.00%> (ø)
src/sentry/options/defaults.py 100.00% <100.00%> (ø)
src/sentry/tasks/embeddings_grouping/utils.py 91.07% <100.00%> (ø)
src/sentry/seer/similarity/grouping_records.py 84.84% <93.75%> (ø)
...ntry/tasks/delete_seer_grouping_records_by_hash.py 92.30% <92.30%> (ø)

Base automatically changed from jodi/delete-grouping-records-by-hash to master June 26, 2024 16:12
jangjodi merged commit 34c5f0b into master on Jun 26, 2024
48 checks passed
jangjodi deleted the jodi/delete-record-by-hash-task branch on June 26, 2024 21:26
github-actions bot locked and limited conversation to collaborators on Jul 12, 2024
Labels: Scope: Backend (automatically applied to PRs that change backend components)

2 participants