feat(issue-detection): Add org-level scheduling for AI issue detection #113060

roggenkemper wants to merge 13 commits into master from
Conversation
Backend Test Failures
…ssue detection

Replace the hourly per-project dispatch with a 15-minute bucketed dispatcher that spreads org dispatches across time slots using md5 hashing. Each org is assigned to a deterministic slot and dispatched once per full cycle.

- Add `organizations:ai-issue-detection` feature flag (FlagPole)
- Rewrite dispatcher to iterate active orgs with RangeQuerySetWrapper
- Add `detect_llm_issues_for_org` task: picks a random project, sends 1 trace
- Remove legacy `detect_llm_issues_for_project` and the project allowlist path
- Change Celery Beat from hourly to every 15 minutes
- NUM_DISPATCH_SLOTS=10 (~2.5h cycle); increase toward 67 as org count grows

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
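The deterministic slot assignment described in the commit message can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function names are assumptions, and only the md5-bucketing scheme and `NUM_DISPATCH_SLOTS=10` come from the commit message.

```python
import hashlib

NUM_DISPATCH_SLOTS = 10  # ~2.5h full cycle at a 15-minute beat interval


def dispatch_slot_for_org(org_id: int) -> int:
    """Map an org to a stable slot in [0, NUM_DISPATCH_SLOTS) via md5.

    The same org always hashes to the same slot, so each org is
    dispatched exactly once per full cycle.
    """
    digest = hashlib.md5(str(org_id).encode()).hexdigest()
    return int(digest, 16) % NUM_DISPATCH_SLOTS


def should_dispatch(org_id: int, current_slot: int) -> bool:
    """True when this beat tick's slot matches the org's assigned slot."""
    return dispatch_slot_for_org(org_id) == current_slot
```

Because the slot is derived from the org ID alone, any dispatcher instance computes the same assignment without coordination.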
Force-pushed from 173d79f to b12c4d1
… detection

Respect the user's project-level setting to avoid wasting Snuba queries and Seer calls when AI detection is disabled for the selected project.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…dition

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bugbot run
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 5f1e8ae.
Scan all orgs and check feature flags once per cycle (slot 0), store the eligible org IDs in Redis. Subsequent ticks read from cache instead of scanning the DB. Cache TTL is 2x the cycle length as a safety margin.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
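The slot-0 scan-and-cache behaviour, with the 2x-cycle TTL and the expiry fallback, could look roughly like this. It's a sketch under stated assumptions: the cache key name and function names are made up, and the `cluster` object stands in for the Redis client; only the slot-0 scan, TTL sizing, and rebuild-on-expiry behaviour come from the commit messages above.

```python
import json

BEAT_INTERVAL = 15 * 60                      # Celery Beat runs every 15 minutes
NUM_DISPATCH_SLOTS = 10
CYCLE_SECONDS = NUM_DISPATCH_SLOTS * BEAT_INTERVAL   # one full cycle (~2.5h)
CACHE_TTL = 2 * CYCLE_SECONDS                # 2x the cycle length as a safety margin
CACHE_KEY = "llm-detection:eligible-orgs"    # assumed key name


def run_tick(cluster, slot: int, scan_eligible_orgs) -> list[int]:
    """On slot 0, scan the DB/feature flags and cache the result with a TTL;
    other ticks read the cached IDs, rebuilding only if the entry expired."""
    if slot == 0:
        org_ids = scan_eligible_orgs()
        cluster.set(CACHE_KEY, json.dumps(org_ids), ex=CACHE_TTL)
        return org_ids
    cached = cluster.get(CACHE_KEY)
    if cached is None:
        # Fallback: the entry expired mid-cycle, so rebuild it.
        org_ids = scan_eligible_orgs()
        cluster.set(CACHE_KEY, json.dumps(org_ids), ex=CACHE_TTL)
        return org_ids
    return json.loads(cached)
```

Non-zero slots never touch the DB while the cache entry is live, which is the point of the change.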
The fallback in _get_eligible_org_ids rebuilds the cache if it expires.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
500 per slot × 10 slots = 5k max, too low for 17k orgs. 2000 × 10 = 20k covers the current enrollment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
wedamija left a comment:
I'm wondering if CursoredScheduler does what you need here? You can run it like:

sentry/src/sentry/integrations/source_code_management/sync_repos.py, lines 396 to 408 at 7340e27

It doesn't do the caching of all the orgs in the batch, but we could expand it that way if it's helpful. I think it'd be good to try to genericise this kind of logic, so let me know if you want to integrate with it; I'm happy to either work on it or review any of your changes.
| """Read cached eligible org IDs, or rebuild if missing.""" | ||
| cluster = redis_clusters.get("default") | ||
| cached = cluster.get(ELIGIBLE_ORGS_CACHE_KEY) | ||
| if cached: |
Edge case that'll probably never get hit, but you may want to check `is not None` instead of truthiness, in case the eligible orgs are cached as zero orgs.
```python
if not has_access:
    return

projects = list(
```
nit: `project_ids` might be a better name
```diff
 traces_to_send: list[TraceMetadataWithSpanCount] = [
     t for t in evidence_traces if t.trace_id in unprocessed_ids
-][:NUM_TRANSACTIONS_TO_PROCESS]
+][:1]
```
Any chance of this changing in the future? Should we consider leaving it as a constant?

There's a chance, though it's hard to say right now how likely it is; it depends on some data we'll gather as we LA.
```python
    Returns the allowlist from system options.
    """
    return options.get("issue-detection.llm-detection.projects-allowlist")
```
Should/can we delete the registration for this option?

Yes, will do that in future PRs!
```python
dispatched = 0
for org_id in eligible_org_ids:
    if dispatched >= MAX_ORGS_PER_CYCLE:
```
Should we log or emit a metric when we drop some orgs?

This won't be a problem during the LA/EA, and as we get closer to GA the number of orgs will increase to the point where I doubt we will actually hit this, but having a metric could be good.
Replace md5 hash bucketing + Redis org cache with the built-in CursoredScheduler framework. It handles cursor-based batching, distributed locking, and cycle metrics out of the box.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
```python
perf_settings = project.get_option("sentry:performance_issue_settings", default={})
if not perf_settings.get("ai_issue_detection_enabled", True):
```
Bug: The task detect_llm_issues_for_org randomly selects one project. If that project has ai_issue_detection_enabled=False, the entire organization is silently skipped for the cycle.
Severity: MEDIUM
Suggested Fix
Instead of selecting one random project and exiting if it's ineligible, iterate through the organization's projects until an eligible one is found. Alternatively, filter the initial project list to only include those with ai_issue_detection_enabled=True before making a random selection. This ensures that organizations with at least one eligible project are always processed.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: src/sentry/tasks/llm_issue_detection/detection.py#L327-L328
Potential issue: The `detect_llm_issues_for_org` task processes an organization by
selecting a single random project to check for eligibility. If this randomly chosen
project has the `ai_issue_detection_enabled` setting disabled, the function returns
early. This causes the entire organization to be silently skipped for the current
detection cycle, even if other projects within the same organization have the feature
enabled. This behavior is a functional regression from the previous implementation,
which processed each eligible project individually, and leads to non-deterministic and
reduced feature coverage for multi-project organizations.
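The suggested fix (filter the project list before the random selection) could be sketched like this. The function name and the stubbed project interface are assumptions; only the option key, the `True` default, and the filter-then-pick idea come from the diff and the Bugbot report above.

```python
import random


def pick_project_for_detection(projects, rng=random):
    """Pick a random project among those with AI detection enabled.

    Filtering first guarantees an org with at least one eligible project
    is never silently skipped just because the random draw landed on a
    project that opted out.
    """
    eligible = [
        p for p in projects
        if p.get_option("sentry:performance_issue_settings", default={}).get(
            "ai_issue_detection_enabled", True
        )
    ]
    if not eligible:
        return None  # genuinely nothing to process for this org
    return rng.choice(eligible)
```

Filtering is O(n) in the org's project count, which is cheap relative to the Snuba query and Seer call the task goes on to make.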
Replace the hourly per-project dispatch with a 15-minute bucketed dispatcher that spreads org dispatches across time slots using hashing. Each org is assigned to a deterministic slot and dispatched once per full cycle. Org results are cached so each run doesn't need to re-scan for eligible organizations.