
ref(crons): Reorganize incident creation / issue occurrence logic #80

Open
camcalaquian wants to merge 1 commit into tenki/base-8 from tenki/head-8

Conversation


camcalaquian commented Apr 28, 2026

…tsentry#80528)

Since we'll be doing more with issue occurrences, split the concept of
incidents out into its own logic module, as well as incident_occurrence
into its own module.

Part of getsentry GH-80527
@tenki-reviewer

tenki-reviewer Bot commented Apr 28, 2026

Tenki Code Review - Complete

Files Reviewed: 4
Findings: 5

By Severity:

  • 🔴 High: 3
  • 🟠 Medium: 2

This PR refactors monitor incident logic and introduces 5 significant code quality issues: 1 config dict bug that silently drops human-readable display strings, 2 data validation bugs (unbounded DB query, unchecked array access), 1 output rendering bug, and 1 N+1 query performance issue.

Files Reviewed (4 files)
src/sentry/monitors/logic/incident_occurrence.py
src/sentry/monitors/logic/incidents.py
src/sentry/monitors/logic/mark_failed.py
src/sentry/monitors/types.py


tenki-reviewer Bot left a comment


Review Summary

This PR refactors monitor failure and incident detection logic into separate modules (incident_occurrence.py, incidents.py) and extracts a reusable SimpleCheckIn type into types.py. The refactoring improves code organization but introduces 5 significant issues across two severity levels (three high, two medium).

High-Severity Bugs (3)

1. Config dict not mutated in response (cq-001)

  • File: incident_occurrence.py:168
  • Impact: get_monitor_environment_context() translates schedule_type to a human-readable display string (lines 160–162) but returns the original unmodified config instead of the modified copy. This silently discards the translation work.
  • Affects: Issue occurrence context data sent to Kafka; incident details will show raw enum values instead of display strings.

2. Unbounded DB query before Python trim (cq-002)

  • File: incidents.py:40–44
  • Impact: When failure_issue_threshold > 1, the ORM query has no .values()[:N] LIMIT. The Python slice on line 48 evaluates the full queryset first, fetching ALL check-ins for an environment into memory before trimming.
  • Affects: For active monitors with thousands of check-ins, this wastes memory and DB I/O.

3. Unguarded array access after DB query (cq-003)

  • File: incidents.py:59
  • Impact: After querying previous check-ins, starting_checkin = previous_checkins[0] is accessed without checking for an empty list. A race condition or data inconsistency could cause an IndexError.
  • Affects: Monitor incident tracking reliability; could crash the check-in processor.

Medium-Severity Bugs (2)

4. Garbled user-facing output (cq-004)

  • File: incident_occurrence.py:145–146
  • Impact: get_failure_reason() has no guard for the empty status_counts case. When all check-ins are unrecognized statuses (or the input is empty), the output becomes " check-ins detected".
  • Affects: User-visible incident failure reason text in Sentry UI.

5. N+1 DB queries in loop (cq-005)

  • File: incidents.py:93–100
  • Impact: Inside the checkin loop, create_incident_occurrence() calls invariant DB queries (get_last_successful_checkin(), get_environment()) on each iteration. For failure_issue_threshold=N, this multiplies DB round-trips by N.
  • Affects: Performance; for large thresholds, creates unnecessary database load.

Recommendation

Request changes — All 5 issues should be resolved before merging, with priority on the three high-severity bugs (config loss, unbounded query, IndexError).

"id": str(monitor_environment.monitor.guid),
"slug": str(monitor_environment.monitor.slug),
"name": monitor_environment.monitor.name,
"config": monitor_environment.monitor.config,


🔴 get_monitor_environment_context uses original config instead of modified copy (bug)

In get_monitor_environment_context, lines 160-162 copy the monitor's config dict and translate the schedule_type field using get_schedule_type_display(). However, line 168 returns monitor_environment.monitor.config (the original dict) instead of the local config variable. This means the schedule_type translation is silently dropped — callers always receive the raw integer/enum value rather than the human-readable display string.

💡 Suggestion: Change line 168 to use the local config variable that was prepared on lines 160-162, not the original monitor_environment.monitor.config.

Suggested change
"config": monitor_environment.monitor.config,
"config": config,
📋 Prompt for AI Agents

In src/sentry/monitors/logic/incident_occurrence.py at line 168, the get_monitor_environment_context function builds a mutated config copy on lines 160-162 (translating schedule_type to a human-readable string) but then returns monitor_environment.monitor.config (the unmodified original) instead of the local config variable. Change line 168 from "config": monitor_environment.monitor.config, to "config": config, so that the translated schedule_type is actually included in the returned context.
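The shape of the fix can be illustrated with a self-contained sketch. The SCHEDULE_TYPE_DISPLAY mapping and the trimmed function signature below are hypothetical simplifications for illustration, not Sentry's actual code:

```python
# Hypothetical, simplified sketch: return the translated copy of the config,
# not the original dict. The display mapping here is illustrative only.
SCHEDULE_TYPE_DISPLAY = {0: "crontab", 1: "interval"}

def get_monitor_environment_context(monitor_config: dict) -> dict:
    config = dict(monitor_config)  # work on a copy
    config["schedule_type"] = SCHEDULE_TYPE_DISPLAY[config["schedule_type"]]
    return {
        # The bug returned the original monitor config here, silently
        # discarding the translation above; the fix returns the local copy.
        "config": config,
    }

ctx = get_monitor_environment_context({"schedule_type": 0})
assert ctx["config"]["schedule_type"] == "crontab"
```

Note that the sketch also leaves the caller's dict untouched, since the translation is applied only to the copy.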

Comment on lines +40 to +44
MonitorCheckIn.objects.filter(
    monitor_environment=monitor_env, date_added__lte=failed_checkin.date_added
)
.order_by("-date_added")
.values("id", "date_added", "status"),


🔴 Unbounded DB query: LIMIT not pushed to database before Python slice (bug)

In try_incident_threshold (incidents.py), the else branch for failure_issue_threshold > 1 assigns the full .values() queryset to previous_checkins on lines 37-45 without a .values_list()[:N] or .values()[:N] limit. The Python slice on line 48 (previous_checkins[:failure_issue_threshold]) triggers evaluation of the queryset and fetches ALL matching check-ins for the monitor environment before trimming. For environments with many check-ins this can pull thousands of rows when only failure_issue_threshold rows are needed. The comment on line 39 says "Using .values for performance reasons" but the actual DB LIMIT is missing.

💡 Suggestion: Apply the slice limit at the ORM query level by chaining [:failure_issue_threshold] onto the queryset before assigning, then remove the Python-level slice on line 48 (or keep it as a safety guard).

📋 Prompt for AI Agents

In src/sentry/monitors/logic/incidents.py at lines 40-44, the queryset MonitorCheckIn.objects.filter(...).order_by('-date_added').values('id','date_added','status') is assigned without a database LIMIT. The Python slice on line 48 (previous_checkins[:failure_issue_threshold]) only applies after all rows are fetched. Fix by appending [:failure_issue_threshold] to the ORM queryset chain (before the closing parenthesis of the cast call), converting it to a database LIMIT. Then wrap it in list(...) at assignment time and remove the redundant Python slice on line 48 (the list(reversed(...)) should operate directly on the already-limited list).
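The principle behind the suggestion — apply the limit to the lazy sequence before it is materialized — can be sketched without Django. The generator below is a stand-in for the queryset; in the real code the fix is chaining [:failure_issue_threshold] onto the queryset, which Django compiles into a SQL LIMIT:

```python
from itertools import islice

def checkin_rows():
    """Stand-in for a lazy queryset; counts how many rows get 'fetched'."""
    checkin_rows.fetched = 0
    for i in range(10_000):
        checkin_rows.fetched += 1
        yield {"id": i, "status": "error"}

failure_issue_threshold = 3

# Limiting before materialization pulls only N rows, analogous to qs[:N]
# adding LIMIT N to the SQL. list(checkin_rows())[:N] — the pattern the
# review flags — would have pulled all 10,000 rows first.
previous_checkins = list(islice(checkin_rows(), failure_issue_threshold))
assert len(previous_checkins) == 3
assert checkin_rows.fetched == 3  # only the needed rows were consumed
```

The same reasoning applies to the ORM: slicing an unevaluated queryset narrows the SQL, while slicing after evaluation only trims an already-fetched list.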

monitor_env.status = MonitorStatus.ERROR
monitor_env.save(update_fields=("status",))

starting_checkin = previous_checkins[0]


🔴 Potential IndexError: previous_checkins[0] accessed without empty-list guard (bug)

In try_incident_threshold (incidents.py), line 59 unconditionally accesses previous_checkins[0] to retrieve the starting_checkin. In the failure_issue_threshold > 1 branch, if the DB query on lines 40-44 returns an empty queryset (e.g. due to a race where the check-in hasn't been committed yet), previous_checkins after line 48 would be an empty list, causing an unhandled IndexError. The failure threshold branch checks any([...OK...]) and returns False only when there is an OK checkin; it does not guard against the empty-list case.

💡 Suggestion: Add an explicit guard before line 59: if not previous_checkins: return False to safely handle the empty list case.

📋 Prompt for AI Agents

In src/sentry/monitors/logic/incidents.py, add an empty-list guard immediately before line 59 (starting_checkin = previous_checkins[0]). Insert if not previous_checkins: return False (with appropriate indentation inside the if monitor_env.status in [MonitorStatus.OK, MonitorStatus.ACTIVE]: block, after the any(...) check on line 52). This prevents an IndexError if the queryset returns no results due to a race condition or data inconsistency.
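A minimal sketch of the proposed guard, with the surrounding Django logic elided (the helper name and return convention here are hypothetical; in the real function the guard would `return False` inline):

```python
def pick_starting_checkin(previous_checkins: list[dict]):
    # Guard: a race or data inconsistency can leave the query result empty;
    # bail out instead of raising IndexError on previous_checkins[0].
    if not previous_checkins:
        return None
    return previous_checkins[0]

assert pick_starting_checkin([]) is None
assert pick_starting_checkin([{"id": 7}]) == {"id": 7}
```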

Comment on lines +145 to +146
if sum(status_counts.values()) == 1:
    return SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]]


🟠 get_failure_reason produces garbled output when no recognized failure statuses are present (bug)

In get_failure_reason (incident_occurrence.py), the generator on lines 139-143 filters check-ins to only those whose status is in HUMAN_FAILURE_STATUS_MAP. If failed_checkins is empty or contains only statuses not in the map (e.g. CheckInStatus.IN_PROGRESS), status_counts will be empty. sum(status_counts.values()) == 0, which is not 1, so execution falls through to line 148. get_text_list([]) returns '', and the final _('%(problem_checkins)s check-ins detected') becomes ' check-ins detected' — a confusing, user-visible string.

💡 Suggestion: Add a guard for the empty/zero case before the sum == 1 check. Return a sensible fallback string such as _('A check-in failure was detected') when status_counts is empty.

📋 Prompt for AI Agents

In src/sentry/monitors/logic/incident_occurrence.py at lines 145-146, add an early-return guard for the case where status_counts is empty (i.e., no recognized failure statuses). Before the existing if sum(status_counts.values()) == 1: check, insert:

if not status_counts:
    return _("A check-in failure was detected")

This prevents the function from producing a garbled string like ' check-ins detected' when all check-ins have unrecognized statuses or when the input list is empty.
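The empty-case guard can be sketched in isolation. The status maps below are illustrative stand-ins for Sentry's real constants, Django's get_text_list and gettext are replaced by a plain join and plain strings:

```python
from collections import Counter

# Illustrative stand-ins for the real HUMAN_FAILURE_STATUS_MAP and
# SINGULAR_HUMAN_FAILURE_MAP constants.
HUMAN_FAILURE_STATUS_MAP = {"error": "failed", "timeout": "timed out", "missed": "missed"}
SINGULAR_HUMAN_FAILURE_MAP = {
    "error": "An error check-in was detected",
    "timeout": "A timed-out check-in was detected",
    "missed": "A missed check-in was detected",
}

def get_failure_reason(failed_statuses: list[str]) -> str:
    status_counts = Counter(s for s in failed_statuses if s in HUMAN_FAILURE_STATUS_MAP)
    if not status_counts:
        # Proposed guard: avoids rendering " check-ins detected" when the
        # input is empty or contains only unrecognized statuses.
        return "A check-in failure was detected"
    if sum(status_counts.values()) == 1:
        return SINGULAR_HUMAN_FAILURE_MAP[next(iter(status_counts))]
    problems = ", ".join(f"{n} {HUMAN_FAILURE_STATUS_MAP[s]}" for s, n in status_counts.items())
    return f"{problems} check-ins detected"

assert get_failure_reason([]) == "A check-in failure was detected"
assert get_failure_reason(["in_progress"]) == "A check-in failure was detected"
```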

Comment on lines +93 to +100
checkins = MonitorCheckIn.objects.filter(id__in=[c["id"] for c in previous_checkins])
for checkin in checkins:
    create_incident_occurrence(
        previous_checkins,
        checkin,
        incident,
        received=received,
    )


🟠 N+1 DB queries: per-checkin calls inside occurrence creation loop (performance)

In try_incident_threshold (incidents.py, lines 93-100), a MonitorCheckIn.objects.filter(...) result set is iterated and create_incident_occurrence is called for each item. Inside create_incident_occurrence (incident_occurrence.py), monitor_env.get_last_successful_checkin() (line 46, a DB query) and monitor_env.get_environment() (lines 67, 90, DB lookups called twice per occurrence) are issued on each iteration. For a failure_issue_threshold of N, this results in N extra DB round-trips for get_last_successful_checkin and up to 2N for get_environment. These values are identical across all loop iterations since monitor_env doesn't change.

💡 Suggestion: Hoist the invariant queries (get_last_successful_checkin, get_environment) out of the loop and pass their results as arguments to create_incident_occurrence. Alternatively, cache the results before the loop. Also consider whether firing one occurrence per checkin (rather than a single aggregate occurrence) is intentional — if a single occurrence suffices, eliminate the loop entirely.

📋 Prompt for AI Agents

In src/sentry/monitors/logic/incidents.py at lines 93-100, the loop calls create_incident_occurrence once per checkin. Inside that function (incident_occurrence.py lines 46, 67, 90), monitor_env.get_last_successful_checkin() and monitor_env.get_environment() are called on every iteration — the same queries repeated N times for identical results. Fix by caching these results before the loop: call monitor_env.get_last_successful_checkin() once before the for loop, and fetch monitor_env.get_environment() once and store it. Pass the cached values into create_incident_occurrence as additional parameters (updating its signature accordingly), so that DB queries are not duplicated per checkin.
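The hoisting pattern can be sketched with a counting stub. FakeMonitorEnv and the trimmed create_incident_occurrence signature are hypothetical; the point is only that the invariant lookups run once, outside the loop:

```python
class FakeMonitorEnv:
    """Counting stub for the real MonitorEnvironment's query methods."""
    def __init__(self):
        self.queries = 0

    def get_last_successful_checkin(self):
        self.queries += 1  # one simulated DB round-trip
        return {"id": "last-ok"}

    def get_environment(self):
        self.queries += 1  # one simulated DB round-trip
        return {"name": "production"}

def create_incident_occurrence(checkin, *, last_successful, environment):
    # Takes the invariant values as parameters instead of re-querying.
    return (checkin["id"], last_successful["id"], environment["name"])

env = FakeMonitorEnv()
checkins = [{"id": i} for i in range(5)]

# Hoisted: the invariant lookups run once, not once (or twice) per checkin.
last_ok = env.get_last_successful_checkin()
environment = env.get_environment()
occurrences = [
    create_incident_occurrence(c, last_successful=last_ok, environment=environment)
    for c in checkins
]
assert len(occurrences) == 5
assert env.queries == 2  # vs. up to 3 * len(checkins) in the N+1 version
```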

@camcalaquian camcalaquian marked this pull request as draft April 30, 2026 16:18
@camcalaquian camcalaquian marked this pull request as ready for review April 30, 2026 16:19

devin-ai-integration Bot left a comment


Devin Review found 1 potential issue.

View 2 additional findings in Devin Review.


"id": str(monitor_environment.monitor.guid),
"slug": str(monitor_environment.monitor.slug),
"name": monitor_environment.monitor.name,
"config": monitor_environment.monitor.config,


🟡 get_monitor_environment_context returns original config instead of the modified copy

In get_monitor_environment_context, a local config copy is created and schedule_type is overwritten with a human-readable display value (line 160-162). However, the return dict on line 168 uses monitor_environment.monitor.config (the original, unmodified config) instead of the local config variable. This means the schedule_type transformation is silently discarded and the monitor context always contains the raw schedule_type value. This is a pre-existing bug that was carried over during this refactor.

Suggested change
"config": monitor_environment.monitor.config,
"config": config,

