-
-
Notifications
You must be signed in to change notification settings - Fork 4.5k
ref(aci): remove passing in detector to action.trigger attempt 2 #103099
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| return Detector.objects.get(project_id=group.project_id, type=group.issue_type.slug) | ||
| except Detector.DoesNotExist: | ||
| # return issue stream detector | ||
| return Detector.objects.get(project_id=group.project_id, type=IssueStreamGroupType.slug) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Error handling: Logging noise and crashes
The get_detector_by_group function uses logger.exception() for expected control flow when DetectorGroup.DoesNotExist is raised. This logs full stack traces to error tracking for normal fallback behavior, polluting monitoring systems. Additionally, the final fallback Detector.objects.get(project_id=group.project_id, type=IssueStreamGroupType.slug) can raise an unhandled Detector.DoesNotExist exception if no issue stream detector exists for the project, causing crashes instead of graceful error handling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it should crash if there's no detector at the end
❌ 1 Tests Failed:
View the top 1 failed test(s) by shortest run time
To view more test analytics, go to the Test Analytics Dashboard |
| pass | ||
|
|
||
| try: | ||
| return Detector.objects.get(project_id=group.project_id, type=group.issue_type.slug) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What other types of detectors (besides the issue stream type handled later) are there only 1 per project? Is it just the error detector type or are there others?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The performance detectors will probably be 1 per project(?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so there's an assumption that anything found for DetectorGroup is for metric issues, then it tries the other types and finally falls back to issue stream group types? Would an uptime or cron detector potentially hit a MultipleObjectsReturned? Maybe it'd be safer to pass a list of types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would think uptime and crons hit DetectorGroup too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated it so that this is only used for activity notifications (e.g. resolve for metric issues)
We only process updates if the detector_id is attached to the StatusChangeMessage
sentry/src/sentry/workflow_engine/handlers/workflow/workflow_status_update_handler.py
Lines 37 to 43 in 646bc83
| detector_id = status_change_message.get("detector_id") | |
| if detector_id is None: | |
| # We should not hit this case, it's should only occur if there is a bug | |
| # passing it from the workflow_engine to the issue platform. | |
| metrics.incr("workflow_engine.tasks.error.no_detector_id") | |
| return |
If detector_id exists, I believe it must have existed when the occurrence to create the issue was sent to issue platform, meaning we would have already written the DetectorGroup row
sentry/src/sentry/issues/ingest.py
Lines 256 to 257 in 646bc83
| if is_new and occurrence.evidence_data and "detector_id" in occurrence.evidence_data: | |
| associate_new_group_with_detector(group, occurrence.evidence_data["detector_id"]) |
c64e0f8 to
065ac2a
Compare
| def trigger(self, event_data: WorkflowEventData) -> None: | ||
| from sentry.workflow_engine.processors.detector import get_detector_by_group | ||
|
|
||
| detector = get_detector_by_group(event_data.group) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have we considered tracking how often a Detector is found here?
We know coverage isn't 100%, so while this is probably going to work for newer groups, maybe not older ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, I should update this to also use either get_detector_by_group or get_detector_by_event depending on whether the event is a GroupEvent or Activity
We will only send the activity through workflow engine if a detector_id exists in issue platform
sentry/src/sentry/workflow_engine/handlers/workflow/workflow_status_update_handler.py
Lines 39 to 49 in e5270d6
| if detector_id is None: | |
| # We should not hit this case, it's should only occur if there is a bug | |
| # passing it from the workflow_engine to the issue platform. | |
| metrics.incr("workflow_engine.tasks.error.no_detector_id") | |
| return | |
| process_workflow_activity.delay( | |
| activity_id=activity.id, | |
| group_id=group.id, | |
| detector_id=detector_id, | |
| ) |
I am assuming that if that's the case, then we would have already written the DetectorGroup row
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracking is good, yes
kcons
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main concern here is that we have known incompleteness of DetectorGroup that we were investigating yesterday, so moving to rely on it exclusively for action triggering will be a bit of a regression.
For error it's nbd because there's an easy fallback, but for others it seems worth being confident on coverage of DetectorGroup before making it load-bearing.
| ) | ||
| raise Detector.DoesNotExist("Detector not found for event data") | ||
|
|
||
| raise TypeError(f"Cannot determine the detector from {type(event_data.event)}.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this makes more sense as an else above.
| "group_id": event_data.group.id, | ||
| }, | ||
| ) | ||
| raise Detector.DoesNotExist("Detector not found for event data") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems slightly better to re-raise, as we've already sent an exception to sentry, and if we reraise and someone tries to send it again it should be dropped by dedup, vs raising a new one which will result in potential double-reporting.
| try: | ||
| return DetectorGroup.objects.get(group=group).detector | ||
| except DetectorGroup.DoesNotExist: | ||
| logger.exception( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This'll probably block deploy. We know DetectorGroup coverage to be incomplete for metric and error issues.
I think it's okay to report that, just making sure we're aware.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, this will fail to fetch detectors for most metric issue groups.
|
|
||
| try: | ||
| return Detector.objects.get(project_id=group.project_id, type=group.issue_type.slug) | ||
| except Detector.DoesNotExist: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my mind, this sort of logic wouldn't need to depend on our awareness of current group type conventions..
it'd be something like
if detector_type := group.issue_type.singleton_detector_type:
return Detector.objects.get(....)
that is, each type explicitly is configured to use a detectors or not. The scheme for detector association being knowable based on GroupType seems like a nice thing to have. "Which groups use the issue stream detector?" would be answerable with a grep.
mifu67
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know you're already aware of this, but just calling out that we shouldn't merge this until the backfill is done, or metric actions will break.
| try: | ||
| return DetectorGroup.objects.get(group=group).detector | ||
| except DetectorGroup.DoesNotExist: | ||
| logger.exception( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, this will fail to fetch detectors for most metric issue groups.
Same as #103000
But adds in fetching the detector correctly for Activities. Detector is specifically needed to send activity notifications correctly for metric issue resolution