
feat(issue-detection): Add org-level scheduling for AI issue detection #113060

Draft
roggenkemper wants to merge 13 commits into master from
roggenkemper/feat/budget-paced-detection-scheduling

Conversation


@roggenkemper roggenkemper commented Apr 15, 2026

Replace the hourly per-project dispatch with a 15-minute bucketed dispatcher that spreads org dispatches across time slots using hashing. Each org is assigned to a deterministic slot and dispatched once per full cycle. Eligible-org results are cached so each run doesn't need to re-scan for eligible organizations.
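The slot scheme described above can be sketched roughly as follows. All names and constants here are assumptions for illustration (the PR's later commits replace this hand-rolled approach with CursoredScheduler):

```python
import hashlib

SLOT_INTERVAL_MINUTES = 15
NUM_DISPATCH_SLOTS = 10  # 10 slots x 15 min = one full ~2.5h cycle


def org_slot(org_id: int, num_slots: int = NUM_DISPATCH_SLOTS) -> int:
    """Assign an org to a stable slot by hashing its ID."""
    digest = hashlib.md5(str(org_id).encode()).hexdigest()
    return int(digest, 16) % num_slots


def current_slot(minute_of_day: int, num_slots: int = NUM_DISPATCH_SLOTS) -> int:
    """Map the current 15-minute tick onto the slot being dispatched now."""
    return (minute_of_day // SLOT_INTERVAL_MINUTES) % num_slots
```

On each beat tick, the dispatcher would process only the orgs whose `org_slot` matches `current_slot`, so every org is dispatched exactly once per cycle.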

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Apr 15, 2026
@github-actions
Contributor

Backend Test Failures

Failures on 44421d5 in this run:

tests/sentry/tasks/test_llm_issue_detection.py::TestRunLLMIssueDetection::test_dispatches_orgs_in_current_slot (log)
[gw0] linux -- Python 3.13.1 /home/runner/work/sentry/sentry/.venv/bin/python3
src/sentry/features/manager.py:217: in _get_feature_class
    return self._feature_registry[name]
E   KeyError: 'organizations:ai-issue-detection'

During handling of the above exception, another exception occurred:
src/sentry/testutils/helpers/features.py:72: in features_override
    feature = features.get(name, None)
src/sentry/features/manager.py:227: in get
    cls = self._get_feature_class(name)
src/sentry/features/manager.py:219: in _get_feature_class
    raise FeatureNotRegistered(name)
E   sentry.features.exceptions.FeatureNotRegistered: The "organizations:ai-issue-detection" feature has not been registered. Ensure that a feature has been added to sentry.features.default_manager

During handling of the above exception, another exception occurred:
tests/sentry/tasks/test_llm_issue_detection.py:70: in test_dispatches_orgs_in_current_slot
    run_llm_issue_detection()
.venv/lib/python3.13/site-packages/taskbroker_client/task.py:92: in __call__
    return self._func(*args, **kwargs)
src/sentry/tasks/llm_issue_detection/detection.py:292: in run_llm_issue_detection
    if not features.has("organizations:ai-issue-detection", org):
/opt/hostedtoolcache/Python/3.13.1/x64/lib/python3.13/unittest/mock.py:1167: in __call__
    return self._mock_call(*args, **kwargs)
/opt/hostedtoolcache/Python/3.13.1/x64/lib/python3.13/unittest/mock.py:1171: in _mock_call
    return self._execute_mock_call(*args, **kwargs)
/opt/hostedtoolcache/Python/3.13.1/x64/lib/python3.13/unittest/mock.py:1232: in _execute_mock_call
    result = effect(*args, **kwargs)
src/sentry/testutils/helpers/features.py:74: in features_override
    raise ValueError("Unregistered feature flag: %s", repr(name))
E   ValueError: ('Unregistered feature flag: %s', "'organizations:ai-issue-detection'")
tests/sentry/tasks/test_llm_issue_detection.py::TestRunLLMIssueDetection::test_skips_orgs_with_hidden_ai (log)
[gw0] linux -- Python 3.13.1 /home/runner/work/sentry/sentry/.venv/bin/python3
src/sentry/features/manager.py:217: in _get_feature_class
    return self._feature_registry[name]
E   KeyError: 'organizations:ai-issue-detection'

During handling of the above exception, another exception occurred:
src/sentry/testutils/helpers/features.py:72: in features_override
    feature = features.get(name, None)
src/sentry/features/manager.py:227: in get
    cls = self._get_feature_class(name)
src/sentry/features/manager.py:219: in _get_feature_class
    raise FeatureNotRegistered(name)
E   sentry.features.exceptions.FeatureNotRegistered: The "organizations:ai-issue-detection" feature has not been registered. Ensure that a feature has been added to sentry.features.default_manager

During handling of the above exception, another exception occurred:
tests/sentry/tasks/test_llm_issue_detection.py:121: in test_skips_orgs_with_hidden_ai
    run_llm_issue_detection()
.venv/lib/python3.13/site-packages/taskbroker_client/task.py:92: in __call__
    return self._func(*args, **kwargs)
src/sentry/tasks/llm_issue_detection/detection.py:292: in run_llm_issue_detection
    if not features.has("organizations:ai-issue-detection", org):
/opt/hostedtoolcache/Python/3.13.1/x64/lib/python3.13/unittest/mock.py:1167: in __call__
    return self._mock_call(*args, **kwargs)
/opt/hostedtoolcache/Python/3.13.1/x64/lib/python3.13/unittest/mock.py:1171: in _mock_call
    return self._execute_mock_call(*args, **kwargs)
/opt/hostedtoolcache/Python/3.13.1/x64/lib/python3.13/unittest/mock.py:1232: in _execute_mock_call
    result = effect(*args, **kwargs)
src/sentry/testutils/helpers/features.py:74: in features_override
    raise ValueError("Unregistered feature flag: %s", repr(name))
E   ValueError: ('Unregistered feature flag: %s', "'organizations:ai-issue-detection'")

Comment thread src/sentry/tasks/llm_issue_detection/detection.py Outdated
roggenkemper and others added 2 commits April 15, 2026 15:56
…ssue detection

Replace the hourly per-project dispatch with a 15-minute bucketed dispatcher
that spreads org dispatches across time slots using md5 hashing. Each org is
assigned to a deterministic slot and dispatched once per full cycle.

- Add `organizations:ai-issue-detection` feature flag (FlagPole)
- Rewrite dispatcher to iterate active orgs with RangeQuerySetWrapper
- Add `detect_llm_issues_for_org` task: picks random project, sends 1 trace
- Remove legacy `detect_llm_issues_for_project` and project allowlist path
- Change Celery Beat from hourly to every 15 minutes
- NUM_DISPATCH_SLOTS=10 (~2.5h cycle), increase toward 67 as org count grows

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@roggenkemper roggenkemper force-pushed the roggenkemper/feat/budget-paced-detection-scheduling branch from 173d79f to b12c4d1 on April 15, 2026 19:58
roggenkemper and others added 3 commits April 15, 2026 15:59
… detection

Respect the user's project-level setting to avoid wasting Snuba queries
and Seer calls when AI detection is disabled for the selected project.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…dition

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread src/sentry/tasks/llm_issue_detection/detection.py
@roggenkemper
Member Author

bugbot run

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 5f1e8ae.

Comment thread src/sentry/conf/server.py Outdated
roggenkemper and others added 3 commits April 16, 2026 15:06
Scan all orgs and check feature flags once per cycle (slot 0), store
the eligible org IDs in Redis. Subsequent ticks read from cache instead
of scanning the DB. Cache TTL is 2x the cycle length as a safety margin.
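A minimal sketch of the caching flow this commit describes, assuming a Redis-style client with get/set and a `rebuild` callable that performs the DB scan (the key name and helper are hypothetical):

```python
import json

# 2x the cycle length (10 slots x 15 min), per the safety margin above.
CACHE_TTL_SECONDS = 2 * 10 * 15 * 60
ELIGIBLE_ORGS_CACHE_KEY = "llm-detection:eligible-org-ids"  # assumed key name


def get_eligible_org_ids(cache, rebuild) -> list[int]:
    """Read cached eligible org IDs, rebuilding the cache on a miss.

    `cache` is any Redis-style client with get/set; `rebuild` scans the DB
    for eligible orgs. Note the explicit None check: an empty list is a
    valid cached result and must not trigger a rebuild.
    """
    cached = cache.get(ELIGIBLE_ORGS_CACHE_KEY)
    if cached is not None:
        return json.loads(cached)
    org_ids = rebuild()
    cache.set(ELIGIBLE_ORGS_CACHE_KEY, json.dumps(org_ids), ex=CACHE_TTL_SECONDS)
    return org_ids
```

With this shape, only slot 0 (or an expired key) pays the cost of the full org scan; every other tick is a single cache read.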

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The fallback in _get_eligible_org_ids rebuilds the cache if it expires.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
roggenkemper and others added 2 commits April 16, 2026 15:15
500 per slot × 10 slots = 5k max, too low for 17k orgs.
2000 × 10 = 20k covers the current enrollment.
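The capacity arithmetic above, spelled out (constant name assumed):

```python
NUM_DISPATCH_SLOTS = 10

old_cap = 500 * NUM_DISPATCH_SLOTS   # 5,000 orgs/cycle: too low for ~17k enrolled
new_cap = 2000 * NUM_DISPATCH_SLOTS  # 20,000 orgs/cycle: covers current enrollment
```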

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@roggenkemper roggenkemper marked this pull request as ready for review April 16, 2026 20:19
@roggenkemper roggenkemper requested a review from a team as a code owner April 16, 2026 20:19
Comment thread src/sentry/tasks/llm_issue_detection/detection.py Outdated
Comment thread src/sentry/tasks/llm_issue_detection/detection.py Outdated
Comment thread src/sentry/tasks/llm_issue_detection/detection.py Outdated
Member

@wedamija wedamija left a comment


I'm wondering if CursoredScheduler does what you need here? You can run it like

def scm_repo_sync_beat() -> None:
    scheduler = CursoredScheduler(
        name="scm_repo_sync",
        schedule_key="scm-repo-sync-beat",
        queryset=OrganizationIntegration.objects.filter(
            integration__provider__in=SCM_SYNC_PROVIDERS,
            integration__status=ObjectStatus.ACTIVE,
            status=ObjectStatus.ACTIVE,
        ),
        task=sync_repos_for_org,
        cycle_duration=timedelta(hours=24),
    )
    scheduler.tick()

It doesn't do the caching of all the orgs in the batch, but we could expand it that way if it's helpful. I think it'd be good to try and genericise this kind of logic, so let me know if you want to try to integrate with it; I'm happy to either work on it or review any of your changes.

Member

@shashjar shashjar left a comment


Some minor comments; also, @wedamija's comment makes sense to me if possible.

"""Read cached eligible org IDs, or rebuild if missing."""
cluster = redis_clusters.get("default")
cached = cluster.get(ELIGIBLE_ORGS_CACHE_KEY)
if cached:
Member


Edge case that'll probably never get hit, but you may want to check not None instead of truthy, in case the eligible orgs are cached as an empty list.

if not has_access:
return

projects = list(
Member


nit: project_ids might be a better name

traces_to_send: list[TraceMetadataWithSpanCount] = [
t for t in evidence_traces if t.trace_id in unprocessed_ids
][:NUM_TRANSACTIONS_TO_PROCESS]
][:1]
Member


Any chance of this changing in the future? Should we consider leaving it as a constant?

Member Author


There's a chance, though it's hard to say right now how likely it is; it depends on some data we will gather as we LA.


Returns the allowlist from system options.
"""
return options.get("issue-detection.llm-detection.projects-allowlist")
Member


Should/can we delete the registration for this option?

Member Author


yes - will do that in future PRs!


dispatched = 0
for org_id in eligible_org_ids:
if dispatched >= MAX_ORGS_PER_CYCLE:
Member


should we log/metric when we drop some orgs?

Member Author


This won't be a problem during the LA/EA, and as we get closer to GA the number of groups will increase to the point where I doubt we will actually hit this, but having a metric could be good.
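One way the suggested metric could look, as a hedged sketch: `emit_metric` stands in for a statsd-style metrics client, and the metric name is an assumption.

```python
MAX_ORGS_PER_CYCLE = 20000


def dispatch_eligible_orgs(eligible_org_ids, dispatch, emit_metric,
                           max_orgs=MAX_ORGS_PER_CYCLE):
    """Dispatch up to max_orgs orgs, emitting a metric for any dropped."""
    dispatched = 0
    for org_id in eligible_org_ids:
        if dispatched >= max_orgs:
            # Make truncation visible instead of silently dropping the
            # remainder of the eligible list.
            emit_metric("llm_issue_detection.orgs_dropped",
                        len(eligible_org_ids) - dispatched)
            break
        dispatch(org_id)
        dispatched += 1
    return dispatched
```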

Replace md5 hash bucketing + Redis org cache with the built-in
CursoredScheduler framework. It handles cursor-based batching,
distributed locking, and cycle metrics out of the box.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@roggenkemper roggenkemper requested a review from a team as a code owner April 16, 2026 21:12
@roggenkemper roggenkemper marked this pull request as draft April 16, 2026 21:14
Comment on lines 327 to 328
perf_settings = project.get_option("sentry:performance_issue_settings", default={})
if not perf_settings.get("ai_issue_detection_enabled", True):
Contributor


Bug: The task detect_llm_issues_for_org randomly selects one project. If that project has ai_issue_detection_enabled=False, the entire organization is silently skipped for the cycle.
Severity: MEDIUM

Suggested Fix

Instead of selecting one random project and exiting if it's ineligible, iterate through the organization's projects until an eligible one is found. Alternatively, filter the initial project list to only include those with ai_issue_detection_enabled=True before making a random selection. This ensures that organizations with at least one eligible project are always processed.
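The filter-then-pick variant of the suggested fix can be sketched like this (the `get_settings` accessor is a stand-in for reading the project's "sentry:performance_issue_settings" option):

```python
import random


def pick_detection_project(projects, get_settings):
    """Pick a random project with AI detection enabled.

    Filtering before the random choice avoids the bug described above,
    where one unlucky pick of a disabled project silently skips the
    entire org for the cycle.
    """
    eligible = [
        p for p in projects
        if get_settings(p).get("ai_issue_detection_enabled", True)
    ]
    return random.choice(eligible) if eligible else None
```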

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: src/sentry/tasks/llm_issue_detection/detection.py#L327-L328

Potential issue: The `detect_llm_issues_for_org` task processes an organization by
selecting a single random project to check for eligibility. If this randomly chosen
project has the `ai_issue_detection_enabled` setting disabled, the function returns
early. This causes the entire organization to be silently skipped for the current
detection cycle, even if other projects within the same organization have the feature
enabled. This behavior is a functional regression from the previous implementation,
which processed each eligible project individually, and leads to non-deterministic and
reduced feature coverage for multi-project organizations.

@roggenkemper roggenkemper removed the request for review from a team April 17, 2026 21:47