ref(crons): Reorganize incident creation / issue occurrence logic#80
camcalaquian wants to merge 1 commit
…tsentry#80528) Since we'll be doing more with issue occurrences, split out the concept of incidents into its own logic module, as well as incident_occurrence into its own module. Part of getsentry GH-80527.
Tenki Code Review - Complete
Files reviewed: 4
This PR refactors monitor incident logic with 5 significant code quality issues: 1 config dict bug that silences human-readable display strings, 2 data validation bugs (unbounded DB query, unchecked array access), 1 output rendering bug, and 1 N+1 query performance issue.
Review Summary
This PR refactors monitor failure and incident detection logic into separate modules (incident_occurrence.py, incidents.py) and extracts a reusable SimpleCheckIn type into types.py. The refactoring improves code organization but introduces 5 significant issues across 3 severity/category levels.
High-Severity Bugs (3)
1. Config dict not mutated in response (cq-001)
- File: incident_occurrence.py:168
- Impact: get_monitor_environment_context() translates schedule_type to a human-readable display string (lines 160–162) but returns the original unmodified config instead of the modified copy. This silently discards the translation work.
- Affects: Issue occurrence context data sent to Kafka; incident details will show raw enum values instead of display strings.
2. Unbounded DB query before Python trim (cq-002)
- File: incidents.py:40–44
- Impact: When failure_issue_threshold > 1, the ORM query has no .values()[:N] LIMIT. The Python slice on line 48 evaluates the full queryset first, fetching ALL check-ins for an environment into memory before trimming.
- Affects: For active monitors with thousands of check-ins, this wastes memory and DB I/O.
3. Unguarded array access after DB query (cq-003)
- File: incidents.py:59
- Impact: After querying previous check-ins, starting_checkin = previous_checkins[0] is accessed without checking for an empty list. A race condition or data inconsistency could cause an IndexError.
- Affects: Monitor incident tracking reliability; could crash the check-in processor.
Medium-Severity Bugs (2)
4. Garbled user-facing output (cq-004)
- File: incident_occurrence.py:145–146
- Impact: get_failure_reason() has no guard for the empty status_counts case. When all check-ins are unrecognized statuses (or the input is empty), the output becomes " check-ins detected".
- Affects: User-visible incident failure reason text in the Sentry UI.
5. N+1 DB queries in loop (cq-005)
- File: incidents.py:93–100
- Impact: Inside the check-in loop, create_incident_occurrence() issues invariant DB queries (get_last_successful_checkin(), get_environment()) on each iteration. For failure_issue_threshold = N, this multiplies DB round-trips by N.
- Affects: Performance; large thresholds create unnecessary database load.
Recommendation
Request changes — All 5 issues should be resolved before merging, with priority on the three high-severity bugs (config loss, unbounded query, IndexError).
| "id": str(monitor_environment.monitor.guid), | ||
| "slug": str(monitor_environment.monitor.slug), | ||
| "name": monitor_environment.monitor.name, | ||
| "config": monitor_environment.monitor.config, |
🔴 get_monitor_environment_context uses original config instead of modified copy (bug)
In get_monitor_environment_context, lines 160-162 copy the monitor's config dict and translate the schedule_type field using get_schedule_type_display(). However, line 168 returns monitor_environment.monitor.config (the original dict) instead of the local config variable. This means the schedule_type translation is silently dropped — callers always receive the raw integer/enum value rather than the human-readable display string.
💡 Suggestion: Change line 168 to use the local config variable that was prepared on lines 160-162, not the original monitor_environment.monitor.config.
| "config": monitor_environment.monitor.config, | |
| "config": config, |
📋 Prompt for AI Agents
In src/sentry/monitors/logic/incident_occurrence.py at line 168, the get_monitor_environment_context function builds a mutated config copy on lines 160-162 (translating schedule_type to a human-readable string) but then returns monitor_environment.monitor.config (the unmodified original) instead of the local config variable. Change line 168 from "config": monitor_environment.monitor.config, to "config": config, so that the translated schedule_type is actually included in the returned context.
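The shape of the fix can be sketched in isolation. This is a minimal stand-alone sketch, not the Sentry code itself: `build_monitor_context` and `SCHEDULE_TYPE_DISPLAY` are hypothetical stand-ins for `get_monitor_environment_context` and `get_schedule_type_display()`.

```python
from typing import Any

# Hypothetical stand-in for Monitor.get_schedule_type_display()
SCHEDULE_TYPE_DISPLAY = {0: "crontab", 1: "interval"}

def build_monitor_context(monitor_config: dict[str, Any]) -> dict[str, Any]:
    # Copy first so the stored config is never mutated in place
    config = dict(monitor_config)
    if "schedule_type" in config:
        config["schedule_type"] = SCHEDULE_TYPE_DISPLAY.get(
            config["schedule_type"], config["schedule_type"]
        )
    # The bug in the PR: returning the original monitor_config here would
    # silently drop the translation. Return the local copy instead.
    return {"config": config}
```

The key point is that the translated copy, not the original attribute, must appear in the returned dict.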
```python
MonitorCheckIn.objects.filter(
    monitor_environment=monitor_env, date_added__lte=failed_checkin.date_added
)
.order_by("-date_added")
.values("id", "date_added", "status"),
```
🔴 Unbounded DB query: LIMIT not pushed to database before Python slice (bug)
In try_incident_threshold (incidents.py), the else branch for failure_issue_threshold > 1 assigns the full .values() queryset to previous_checkins on lines 37-45 without a .values_list()[:N] or .values()[:N] limit. The Python slice on line 48 (previous_checkins[:failure_issue_threshold]) triggers evaluation of the queryset and fetches ALL matching check-ins for the monitor environment before trimming. For environments with many check-ins this can pull thousands of rows when only failure_issue_threshold rows are needed. The comment on line 39 says "Using .values for performance reasons" but the actual DB LIMIT is missing.
💡 Suggestion: Apply the slice limit at the ORM query level by chaining [:failure_issue_threshold] onto the queryset before assigning, then remove the Python-level slice on line 48 (or keep it as a safety guard).
📋 Prompt for AI Agents
In src/sentry/monitors/logic/incidents.py at lines 40-44, the queryset MonitorCheckIn.objects.filter(...).order_by('-date_added').values('id','date_added','status') is assigned without a database LIMIT. The Python slice on line 48 (previous_checkins[:failure_issue_threshold]) only applies after all rows are fetched. Fix by appending [:failure_issue_threshold] to the ORM queryset chain (before the closing parenthesis of the cast call), converting it to a database LIMIT. Then wrap it in list(...) at assignment time and remove the redundant Python slice on line 48 (the list(reversed(...)) should operate directly on the already-limited list).
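The general principle (apply the limit before evaluation, not after) can be illustrated without Django. Here `fetch_rows` is a hypothetical stand-in for a lazily evaluated queryset; in real Django, slicing a queryset with `[:N]` before evaluation translates to a SQL LIMIT.

```python
from itertools import islice

fetched = []

def fetch_rows(n=10_000):
    # Stand-in for a lazy queryset: yields one row at a time and records
    # how many rows were actually pulled from the "database"
    for i in range(n):
        fetched.append(i)
        yield {"id": i, "status": "error"}

failure_issue_threshold = 3

# Bad: list(fetch_rows())[:failure_issue_threshold] materializes all
# 10,000 rows before trimming. Good: push the limit into the iteration,
# analogous to queryset[:failure_issue_threshold] becoming a SQL LIMIT.
previous_checkins = list(islice(fetch_rows(), failure_issue_threshold))
```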
```python
monitor_env.status = MonitorStatus.ERROR
monitor_env.save(update_fields=("status",))

starting_checkin = previous_checkins[0]
```
🔴 Potential IndexError: previous_checkins[0] accessed without empty-list guard (bug)
In try_incident_threshold (incidents.py), line 59 unconditionally accesses previous_checkins[0] to retrieve the starting_checkin. In the failure_issue_threshold > 1 branch, if the DB query on lines 40-44 returns an empty queryset (e.g. due to a race where the check-in hasn't been committed yet), previous_checkins after line 48 would be an empty list, causing an unhandled IndexError. The failure threshold branch checks any([...OK...]) and returns False only when there is an OK checkin; it does not guard against the empty-list case.
💡 Suggestion: Add an explicit guard before line 59: if not previous_checkins: return False to safely handle the empty list case.
📋 Prompt for AI Agents
In src/sentry/monitors/logic/incidents.py, add an empty-list guard immediately before line 59 (starting_checkin = previous_checkins[0]). Insert if not previous_checkins: return False (with appropriate indentation inside the if monitor_env.status in [MonitorStatus.OK, MonitorStatus.ACTIVE]: block, after the any(...) check on line 52). This prevents an IndexError if the queryset returns no results due to a race condition or data inconsistency.
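A minimal sketch of the guard, with a plain list standing in for the evaluated queryset (the helper name is illustrative, not from the PR):

```python
def pick_starting_checkin(previous_checkins: list[dict]):
    # Guard first: an empty result (e.g. from a race where the check-in
    # hasn't been committed yet) must not raise IndexError
    if not previous_checkins:
        return None
    # The query is ordered by -date_added, so index 0 is the most recent
    return previous_checkins[0]
```

In the actual function the empty case would return False rather than None, but the ordering of guard-then-index is the same.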
```python
if sum(status_counts.values()) == 1:
    return SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]]
```
🟠 get_failure_reason produces garbled output when no recognized failure statuses are present (bug)
In get_failure_reason (incident_occurrence.py), the generator on lines 139-143 filters check-ins to only those whose status is in HUMAN_FAILURE_STATUS_MAP. If failed_checkins is empty or contains only statuses not in the map (e.g. CheckInStatus.IN_PROGRESS), status_counts will be empty. sum(status_counts.values()) == 0, which is not 1, so execution falls through to line 148. get_text_list([]) returns '', and the final _('%(problem_checkins)s check-ins detected') becomes ' check-ins detected' — a confusing, user-visible string.
💡 Suggestion: Add a guard for the empty/zero case before the sum == 1 check. Return a sensible fallback string such as _('A check-in failure was detected') when status_counts is empty.
📋 Prompt for AI Agents
In src/sentry/monitors/logic/incident_occurrence.py at lines 145-146, add an early-return guard for the case where status_counts is empty (i.e., no recognized failure statuses). Before the existing if sum(status_counts.values()) == 1: check, insert:

```python
if not status_counts:
    return _("A check-in failure was detected")
```

This prevents the function from producing a garbled string like " check-ins detected" when all check-ins have unrecognized statuses or when the input list is empty.
```python
checkins = MonitorCheckIn.objects.filter(id__in=[c["id"] for c in previous_checkins])
for checkin in checkins:
    create_incident_occurrence(
        previous_checkins,
        checkin,
        incident,
        received=received,
    )
```
🟠 N+1 DB queries: per-checkin calls inside occurrence creation loop (performance)
In try_incident_threshold (incidents.py, lines 93-100), a MonitorCheckIn.objects.filter(...) result set is iterated and create_incident_occurrence is called for each item. Inside create_incident_occurrence (incident_occurrence.py), monitor_env.get_last_successful_checkin() (line 46, a DB query) and monitor_env.get_environment() (lines 67, 90, DB lookups called twice per occurrence) are issued on each iteration. For a failure_issue_threshold of N, this results in N extra DB round-trips for get_last_successful_checkin and up to 2N for get_environment. These values are identical across all loop iterations since monitor_env doesn't change.
💡 Suggestion: Hoist the invariant queries (get_last_successful_checkin, get_environment) out of the loop and pass their results as arguments to create_incident_occurrence. Alternatively, cache the results before the loop. Also consider whether firing one occurrence per checkin (rather than a single aggregate occurrence) is intentional — if a single occurrence suffices, eliminate the loop entirely.
📋 Prompt for AI Agents
In src/sentry/monitors/logic/incidents.py at lines 93-100, the loop calls create_incident_occurrence once per checkin. Inside that function (incident_occurrence.py lines 46, 67, 90), monitor_env.get_last_successful_checkin() and monitor_env.get_environment() are called on every iteration — the same queries repeated N times for identical results. Fix by caching these results before the loop: call monitor_env.get_last_successful_checkin() once before the for loop, and fetch monitor_env.get_environment() once and store it. Pass the cached values into create_incident_occurrence as additional parameters (updating its signature accordingly), so that DB queries are not duplicated per checkin.
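The hoisting pattern can be sketched with call-counting stubs standing in for the real DB lookups (the function bodies and signature below are illustrative, not the PR's code):

```python
calls = {"last_successful": 0, "environment": 0}

def get_last_successful_checkin():
    calls["last_successful"] += 1  # stand-in for a DB query
    return {"id": 1}

def get_environment():
    calls["environment"] += 1  # stand-in for a DB query
    return "production"

def create_incident_occurrence(checkin, *, last_successful, environment):
    # Invariant values arrive as parameters instead of being re-queried
    return {"checkin": checkin, "last": last_successful, "env": environment}

checkins = [{"id": i} for i in range(5)]

# Hoist the invariant lookups out of the loop: one query each, not N
last_successful = get_last_successful_checkin()
environment = get_environment()
occurrences = [
    create_incident_occurrence(
        c, last_successful=last_successful, environment=environment
    )
    for c in checkins
]
```

Because the cached values are computed once, the per-iteration cost drops to the occurrence construction itself.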
| "id": str(monitor_environment.monitor.guid), | ||
| "slug": str(monitor_environment.monitor.slug), | ||
| "name": monitor_environment.monitor.name, | ||
| "config": monitor_environment.monitor.config, |
🟡 get_monitor_environment_context returns original config instead of the modified copy
In get_monitor_environment_context, a local config copy is created and schedule_type is overwritten with a human-readable display value (line 160-162). However, the return dict on line 168 uses monitor_environment.monitor.config (the original, unmodified config) instead of the local config variable. This means the schedule_type transformation is silently discarded and the monitor context always contains the raw schedule_type value. This is a pre-existing bug that was carried over during this refactor.
| "config": monitor_environment.monitor.config, | |
| "config": config, |