
feat(similarity): Add delete record by hash task #72767

Merged 21 commits on Jun 26, 2024.

Commits:
ff6a517 feat(similarity-embedding): Delete grouping records on group delete (jangjodi, May 28, 2024)
071cb4e fix: Add mock to test (jangjodi, May 28, 2024)
89278c8 ref: Separate changes (jangjodi, Jun 13, 2024)
0095b85 feat(similarity): Add delete record by hash task (jangjodi, Jun 14, 2024)
15a166d Merge branch 'master' into jodi/delete-grouping-records-by-hash (jangjodi, Jun 14, 2024)
0a8f8fb Merge branch 'jodi/delete-grouping-records-by-hash' into jodi/delete-… (jangjodi, Jun 14, 2024)
72b01fe fix: Remove unneeded json.dumps (jangjodi, Jun 14, 2024)
7f9eabb Merge remote-tracking branch 'origin' into jodi/delete-grouping-recor… (jangjodi, Jun 14, 2024)
ea9a59b Merge branch 'master' into jodi/delete-grouping-records-by-hash (jangjodi, Jun 20, 2024)
c63330e fix: Test (jangjodi, Jun 20, 2024)
e3b7268 ref: Add option for record delete batch size (jangjodi, Jun 20, 2024)
be7f3d7 fix: Test paths (jangjodi, Jun 20, 2024)
0231899 Merge branch 'jodi/delete-grouping-records-by-hash' into jodi/delete-… (jangjodi, Jun 20, 2024)
f429cab fix: Option usage (jangjodi, Jun 20, 2024)
f606070 ref: Change batch size to 100 (jangjodi, Jun 20, 2024)
a8f5733 fix: Test (jangjodi, Jun 21, 2024)
13cba90 Merge branch 'master' into jodi/delete-grouping-records-by-hash (jangjodi, Jun 25, 2024)
b44d88d fix: Merge (jangjodi, Jun 25, 2024)
e147c4f ref: Add timeout (jangjodi, Jun 25, 2024)
8694303 Merge branch 'jodi/delete-grouping-records-by-hash' into jodi/delete-… (jangjodi, Jun 25, 2024)
c4ecda3 Merge branch 'master' into jodi/delete-record-by-hash-task (jangjodi, Jun 26, 2024)
src/sentry/options/defaults.py (6 additions, 0 deletions)

@@ -935,6 +935,12 @@
    flags=FLAG_ALLOW_EMPTY | FLAG_AUTOMATOR_MODIFIABLE,
)

register(
    "embeddings-grouping.seer.delete-record-batch-size",
    default=100,
    flags=FLAG_ALLOW_EMPTY | FLAG_AUTOMATOR_MODIFIABLE,
)

# ## sentry.killswitches
#
# The following options are documented in sentry.killswitches in more detail
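The option added above follows the register/`options.get` pattern: declare a key with a default, then read it back at call time. A toy sketch of that pattern (assumed, simplified; Sentry's real `sentry.options` store also tracks flags and supports runtime overrides):

```python
# Toy option registry illustrating the register/get pattern (not Sentry's
# actual implementation).

_REGISTRY: dict[str, object] = {}


def register(key: str, default: object = None, **flags: object) -> None:
    """Declare a known option with its default value."""
    _REGISTRY[key] = default


def get(key: str) -> object:
    """Return the current value of a registered option."""
    return _REGISTRY[key]


register("embeddings-grouping.seer.delete-record-batch-size", default=100)
```

Because the task reads the option at run time rather than baking in a constant, the batch size can be tuned without a code change.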
src/sentry/tasks/delete_seer_grouping_records_by_hash.py (new file, 33 additions)

@@ -0,0 +1,33 @@
from typing import Any

from sentry import options
from sentry.seer.similarity.grouping_records import delete_grouping_records_by_hash
from sentry.silo.base import SiloMode
from sentry.tasks.base import instrumented_task


@instrumented_task(
    name="sentry.tasks.delete_seer_grouping_records_by_hash",
    queue="delete_seer_grouping_records_by_hash",
    max_retries=0,
    silo_mode=SiloMode.REGION,
)
def delete_seer_grouping_records_by_hash(
    project_id: int,
    hashes: list[str],
    last_deleted_index: int = 0,
    soft_time_limit=60 * 15,
    time_limit=60 * (15 + 5),
    *args: Any,
    **kwargs: Any,
) -> None:
    """
    Task to delete Seer grouping records for a list of hashes.
    Calls the Seer delete-by-hash endpoint with batches of hashes whose size is read
    from the `embeddings-grouping.seer.delete-record-batch-size` option.
    """
    batch_size = options.get("embeddings-grouping.seer.delete-record-batch-size")
    len_hashes = len(hashes)
    end_index = min(last_deleted_index + batch_size, len_hashes)
    delete_grouping_records_by_hash(project_id, hashes[last_deleted_index:end_index])
    if end_index < len_hashes:
        delete_seer_grouping_records_by_hash.apply_async(args=[project_id, hashes, end_index])

Review discussion on the respawn above:

Reviewer (Member): Any reason to call the task again instead of just calling the delete_grouping_records_by_hash function repeatedly?

Author (jangjodi): It's based on the logic we used for the backfill; I didn't want this task to time out, so I'm respawning the task instead.

Reviewer (Member): What is our current timeout set to on the Celery task?

Author (jangjodi): Whoops, I just realized I didn't add a timeout; I was thinking of using the same timeouts as the backfill: soft_time_limit=60 * 15, time_limit=60 * (15 + 5).

Reviewer (Member): Got it. I'd consider increasing the batch size to 100, and optionally doing some calls in a loop, so that we spawn fewer Celery tasks; each one should run quickly (under a second).

Author (jangjodi): Nice, I didn't realize they were that quick. Thanks!
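The respawning pattern can be simulated as a self-contained sketch (hypothetical names; the real task calls Seer via delete_grouping_records_by_hash and re-schedules itself with apply_async instead of recursing in-process):

```python
# Self-contained simulation of the respawning-task batching pattern.
# Each "task run" handles one batch and, if hashes remain, invokes the
# next run starting at the index where its own batch ended.

BATCH_SIZE = 100  # mirrors the option default chosen in this PR


def run_delete_task(hashes, last_deleted_index=0, deleted_batches=None):
    deleted_batches = [] if deleted_batches is None else deleted_batches
    end_index = min(last_deleted_index + BATCH_SIZE, len(hashes))
    # Stand-in for the Seer delete-by-hash call on this slice.
    deleted_batches.append(hashes[last_deleted_index:end_index])
    if end_index < len(hashes):
        # The real task does apply_async(args=[project_id, hashes, end_index]) here.
        run_delete_task(hashes, end_index, deleted_batches)
    return deleted_batches


batches = run_delete_task([str(i) for i in range(250)])
# 250 hashes are processed as batches of 100, 100, and 50
```

Each run stays well under the task's soft time limit because it only ever issues one batched call before handing off.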
tests/sentry/tasks/test_delete_seer_grouping_records_by_hash.py (new file, 24 additions)

@@ -0,0 +1,24 @@
from unittest.mock import patch

from sentry.tasks.delete_seer_grouping_records_by_hash import delete_seer_grouping_records_by_hash
from sentry.testutils.pytest.fixtures import django_db_all


@django_db_all
@patch("sentry.tasks.delete_seer_grouping_records_by_hash.delete_grouping_records_by_hash")
@patch(
    "sentry.tasks.delete_seer_grouping_records_by_hash.delete_seer_grouping_records_by_hash.apply_async"
)
def test_delete_seer_grouping_records_by_hash_batches(
    mock_delete_seer_grouping_records_by_hash_apply_async, mock_delete_grouping_records_by_hash
):
    """
    Test that when delete_seer_grouping_records_by_hash is called with more hashes than
    the batch size (100), it spawns another task starting at the end index of the
    previous batch.
    """
    mock_delete_grouping_records_by_hash.return_value = True
    project_id, hashes = 1, [str(i) for i in range(101)]
    delete_seer_grouping_records_by_hash(project_id, hashes, 0)
    assert mock_delete_seer_grouping_records_by_hash_apply_async.call_args[1] == {
        "args": [project_id, hashes, 100]
    }
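The test's assertion style (patch the spawn point, call the task once, inspect call_args for the next start index) can be reproduced with a minimal stdlib-only stand-in (hypothetical names):

```python
from unittest.mock import MagicMock

# Stand-in for delete_seer_grouping_records_by_hash.apply_async.
spawn = MagicMock()


def task(project_id, hashes, last_deleted_index=0, batch_size=100):
    # Simplified batching: only the respawn decision matters for this check.
    end_index = min(last_deleted_index + batch_size, len(hashes))
    if end_index < len(hashes):
        spawn(args=[project_id, hashes, end_index])


hashes = [str(i) for i in range(101)]
task(1, hashes)
# spawn.call_args[1] is the kwargs dict: {"args": [1, hashes, 100]}
```

With 101 hashes and a batch size of 100, exactly one respawn is scheduled, starting at index 100, which is what the real test asserts against apply_async.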