feat(upsampling) - Support upsampled error count with performance optimizations #75

Open
camcalaquian wants to merge 2 commits into tenki/base-3 from tenki/head-3

Conversation


@camcalaquian camcalaquian commented Apr 28, 2026

yuvmen and others added 2 commits July 25, 2025 09:48
…etsentry#94376)

Part of the Error Upsampling project:
https://www.notion.so/sentry/Tech-Spec-Error-Up-Sampling-1e58b10e4b5d80af855cf3b992f75894?source=copy_link

The events-stats API will now check whether all projects in the query are
allowlisted for upsampling and, if so, convert the count query to a sum
over `sample_weight` in Snuba. This is done by defining a new SnQL
function, `upsampled_count()`.
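
For illustration, a minimal sketch of what that swap could look like with the snuba-sdk SnQL builder (an assumed shape, not this PR's actual code):

```python
# Hedged sketch using the snuba-sdk builder; not the PR's actual code.
from snuba_sdk import Column, Function

# Naive count: every stored event counts once.
naive_count = Function("count", [], "count")

# Upsampled count: weight each stored event by its sample_weight so the
# aggregate estimates the pre-sampling event total.
upsampled_count = Function("sum", [Column("sample_weight")], "upsampled_count")
```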

I noticed that the eps() and epm() functions are also in use in this
endpoint. I considered (and even started working on) swapping them as
well, since for correctness they too should probably use `sample_weight`
rather than counting naively. However, that caused some complications,
and since they are only used by specific dashboard widgets and are not
available in Discover, I decided to defer changing them until we find it
is needed.

- Add 60-second cache for upsampling eligibility checks to improve performance
- Separate upsampling eligibility check from query transformation for better optimization
- Remove unnecessary null checks in upsampled_count() function per schema requirements
- Add cache invalidation utilities for configuration management

This improves performance during high-traffic periods by avoiding repeated
expensive allowlist lookups while maintaining data consistency.
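
As a rough illustration of the caching pattern described above (function and key names are assumptions, not this PR's code):

```python
# Hedged sketch of a 60-second eligibility cache on Django's shared cache;
# names and key format are illustrative assumptions.
from django.core.cache import cache

UPSAMPLING_CACHE_TTL = 60  # seconds

def cached_upsampling_eligibility(organization_id, project_ids, compute_eligibility):
    # Deterministic key: join sorted project IDs so every worker process
    # derives the same key for the same inputs.
    key_ids = ",".join(str(pid) for pid in sorted(project_ids))
    cache_key = f"error_upsampling_eligible:{organization_id}:{key_ids}"
    eligible = cache.get(cache_key)
    if eligible is None:
        eligible = compute_eligibility()  # the expensive allowlist lookup
        cache.set(cache_key, eligible, UPSAMPLING_CACHE_TTL)
    return eligible
```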

tenki-reviewer Bot commented Apr 28, 2026

Tenki Code Review - Complete

Files Reviewed: 8
Findings: 3

By Severity:

  • 🔴 High: 1
  • 🟡 Low: 2

This PR implements error event upsampling to improve query accuracy, but has one critical caching bug that breaks the feature in multi-worker deployments and two code quality issues. The non-deterministic hash-based cache key prevents cache hits across worker processes, reducing the optimization to a no-op.

Files Reviewed (8 files)
pyproject.toml
sentry-repo
src/sentry/api/endpoints/organization_events_stats.py
src/sentry/api/helpers/error_upsampling.py
src/sentry/search/events/datasets/discover.py
src/sentry/testutils/factories.py
tests/sentry/api/helpers/test_error_upsampling.py
tests/snuba/api/endpoints/test_organization_events_stats.py

@tenki-reviewer tenki-reviewer Bot left a comment

Overview

The PR adds error upsampling support to Sentry's event statistics API, allowing weighted aggregations for queries on allowlisted projects. The changes include:

  • New helper module (error_upsampling.py) with eligibility checks and query transformations
  • Updated endpoint to conditionally apply upsampling transformations based on project allowlist
  • New SnQL aggregate function for weighted counting (upsampled_count)
  • Test factory improvements to handle client_sample_rate extraction

Critical Issues

🔴 Non-deterministic Cache Key (cq-001)

The cache key uses Python's hash(tuple(...)) which is randomized per process due to PYTHONHASHSEED. In a multi-worker deployment (production), worker A stores cache entries under hash value X, but worker B computes hash value Y for the same project IDs and never finds the entry. The entire 60-second cache optimization is a no-op in production. Additionally, invalidate_upsampling_cache() fails silently because it can't delete entries keyed with other processes' hash values.
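
A quick way to see the divergence (an assumed repro; any tuple containing a string works, since string hashing is what PYTHONHASHSEED randomizes):

```python
import subprocess
import sys

# Hedged repro sketch: hash the same tuple in two fresh interpreters.
# With hash randomization enabled (the default), the printed values
# usually differ, so hash()-derived keys cannot match across worker
# processes that share one Redis cache.
snippet = 'print(hash(("error_upsampling", 42)))'
for _ in range(2):
    subprocess.run([sys.executable, "-c", snippet], check=True)
```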

Code Quality Concerns

🟡 Dead Intermediate Variable (cq-003)

The variable should_upsample is assigned and immediately copied to upsampling_enabled with no other use. Pure dead code that adds noise.

🟡 Redundant Transformation Calls (cq-004)

The same transform_query_columns_for_error_upsampling(query_columns) is called three times in separate branches with identical inputs, creating maintenance risk and misleading comments suggesting they're different transformations.

Recommendation

Request Changes: The caching bug must be fixed before merge. The non-deterministic hash key means the performance optimization doesn't work in production. Code quality issues should also be cleaned up before merging.

src/sentry/api/helpers/error_upsampling.py (quoted context):

    expensive repeated option lookups during high-traffic periods. This is safe
    because allowlist changes are infrequent and eventual consistency is acceptable.
    """
    cache_key = f"error_upsampling_eligible:{organization.id}:{hash(tuple(sorted(snuba_params.project_ids)))}"

🔴 Non-deterministic hash() in shared cache key breaks cross-worker caching (bug)

In src/sentry/api/helpers/error_upsampling.py at lines 27 and 73, the cache key is built using Python's hash(tuple(sorted(...))). Python randomizes the hash seed per process (PYTHONHASHSEED), so the same list of project IDs will produce a different integer hash value in each worker process. Because the cache is shared (Redis), entries written by process A are keyed with hash X, but process B computes hash Y for the same input and will never find them. The caching optimization is effectively a no-op in any multi-worker deployment. The invalidate_upsampling_cache function (line 73) has the same defect: it deletes a key under hash Y, but the entry was stored under hash X by another process, so invalidation silently fails.

💡 Suggestion: Replace hash(tuple(sorted(...))) with a deterministic string representation, e.g. ','.join(str(pid) for pid in sorted(snuba_params.project_ids)). Apply the same fix to invalidate_upsampling_cache at line 73.

Suggested change
cache_key = f"error_upsampling_eligible:{organization.id}:{hash(tuple(sorted(snuba_params.project_ids)))}"
cache_key = f"error_upsampling_eligible:{organization.id}:{','.join(str(pid) for pid in sorted(snuba_params.project_ids))}"
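
If shorter keys are preferred, a deterministic digest works equally well (a sketch, not part of this PR; the values are illustrative):

```python
import hashlib

# Hedged alternative: digest the joined IDs with a process-independent
# hash to keep keys short while staying stable across workers.
organization_id = 1  # illustrative values
project_ids = [42, 7, 13]

ids = ",".join(str(pid) for pid in sorted(project_ids))
digest = hashlib.sha1(ids.encode()).hexdigest()[:16]
cache_key = f"error_upsampling_eligible:{organization_id}:{digest}"
print(cache_key)  # identical in every process for the same inputs
```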
📋 Prompt for AI Agents

In src/sentry/api/helpers/error_upsampling.py, the cache keys at lines 27 and 73 use hash(tuple(sorted(...))) which is non-deterministic across Python processes due to PYTHONHASHSEED randomization. Replace both occurrences with a deterministic string representation. At line 27: change f"error_upsampling_eligible:{organization.id}:{hash(tuple(sorted(snuba_params.project_ids)))}" to f"error_upsampling_eligible:{organization.id}:{','.join(str(pid) for pid in sorted(snuba_params.project_ids))}". At line 73: change f"error_upsampling_eligible:{organization_id}:{hash(tuple(sorted(project_ids)))}" to f"error_upsampling_eligible:{organization_id}:{','.join(str(pid) for pid in sorted(project_ids))}".

Comment on lines +220 to +226
should_upsample = is_errors_query_for_error_upsampled_projects(
    snuba_params, organization, dataset, request
)

# Store the upsampling decision to apply later during query building
# This separation allows for better query optimization and caching
upsampling_enabled = should_upsample

🟡 Dead intermediate variable should_upsample in _get_event_stats (maintainability)

In src/sentry/api/endpoints/organization_events_stats.py at lines 220-226, the result of is_errors_query_for_error_upsampled_projects(...) is first assigned to should_upsample, then immediately copied verbatim to upsampling_enabled. The variable should_upsample is never read again. This is a pure no-op that adds noise and suggests the comment "Store the upsampling decision to apply later" was written for a different architecture that didn't survive.

💡 Suggestion: Remove should_upsample and assign directly: upsampling_enabled = is_errors_query_for_error_upsampled_projects(snuba_params, organization, dataset, request).

📋 Prompt for AI Agents

In src/sentry/api/endpoints/organization_events_stats.py around lines 218-226 in the _get_event_stats closure, remove the intermediate should_upsample variable. Replace the two-step should_upsample = is_errors_query_for_error_upsampled_projects(...); upsampling_enabled = should_upsample with a single assignment: upsampling_enabled = is_errors_query_for_error_upsampled_projects(snuba_params, organization, dataset, request). Also remove the comments that were explaining this unnecessary separation.

src/sentry/api/endpoints/organization_events_stats.py (quoted context):

    # Store the upsampling decision to apply later during query building
    # This separation allows for better query optimization and caching
    upsampling_enabled = should_upsample
    final_columns = query_columns

🟡 transform_query_columns_for_error_upsampling called redundantly in three branches (maintainability)

In src/sentry/api/endpoints/organization_events_stats.py the transformation transform_query_columns_for_error_upsampling(query_columns) is called separately in three execution branches (lines 233, 277, 296). All three calls pass the same unchanged query_columns input. The result could be computed once after the upsampling_enabled check and reused in all branches. As written, a future change to the transformation arguments or logic would need to be made in three places, and the three calls have misleading copy-paste comments suggesting they serve different purposes ('just before query execution', 'just before RPC query execution', 'just before standard query execution').

💡 Suggestion: Compute final_columns = transform_query_columns_for_error_upsampling(query_columns) if upsampling_enabled else query_columns once after the upsampling_enabled assignment, and remove the three repeated conditional blocks inside the branches.
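
Concretely, the consolidation could look like this (a sketch with stubs standing in for the PR's real helpers; only the shape is asserted):

```python
# Hedged sketch of the suggested consolidation; stubs stand in for the
# PR's real helpers so the shape is runnable in isolation.
def transform_query_columns_for_error_upsampling(columns):
    # Stub: the real helper swaps count()-style columns for upsampled_count().
    return [c.replace("count()", "upsampled_count()") for c in columns]

upsampling_enabled = True  # illustrative; the PR computes this from the allowlist
query_columns = ["count()", "p95(transaction.duration)"]

# Transform once; the top-events, RPC, and standard branches all reuse it,
# so a future change to the transformation happens in exactly one place.
final_columns = (
    transform_query_columns_for_error_upsampling(query_columns)
    if upsampling_enabled
    else query_columns
)
print(final_columns)  # ['upsampled_count()', 'p95(transaction.duration)']
```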

📋 Prompt for AI Agents

In src/sentry/api/endpoints/organization_events_stats.py in the _get_event_stats closure, after computing upsampling_enabled and final_columns = query_columns (around lines 226-227), add: if upsampling_enabled: final_columns = transform_query_columns_for_error_upsampling(query_columns). Then remove the three identical if upsampling_enabled: final_columns = transform_query_columns_for_error_upsampling(query_columns) blocks inside the if top_events > 0: branch (line 232-233), the if use_rpc: branch (lines 276-277), and the final standard query branch (lines 295-296). This centralizes the transformation logic and eliminates the risk of the three copies diverging.

@camcalaquian camcalaquian marked this pull request as draft April 30, 2026 14:57
@camcalaquian camcalaquian marked this pull request as ready for review April 30, 2026 14:57
@camcalaquian camcalaquian marked this pull request as draft April 30, 2026 16:02
@camcalaquian camcalaquian marked this pull request as ready for review April 30, 2026 16:02

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 3 potential issues.

View 5 additional findings in Devin Review.

Comment on lines +220 to +222
should_upsample = is_errors_query_for_error_upsampled_projects(
    snuba_params, organization, dataset, request
)

🔴 Upsampling check uses outer closure dataset instead of the scoped_dataset parameter

In _get_event_stats, the upsampling eligibility check on line 221 passes the outer closure variable dataset to is_errors_query_for_error_upsampled_projects, but the function's actual dataset is scoped_dataset (the first parameter). When _get_event_stats is called from the widget-splitting logic in get_event_stats_factory (e.g., src/sentry/api/endpoints/organization_events_stats.py:372, :386, :427, :444), a different dataset like discover can be passed as scoped_dataset while the outer dataset remains metrics_enhanced_performance. This causes _should_apply_sample_weight_transform to evaluate against the wrong dataset module, so upsampling will never be applied in widget-splitting paths even when querying error events on the discover dataset.

Suggested change
should_upsample = is_errors_query_for_error_upsampled_projects(
    snuba_params, organization, dataset, request
)
should_upsample = is_errors_query_for_error_upsampled_projects(
    snuba_params, organization, scoped_dataset, request
)

Comment on lines +135 to +140
query = request.GET.get("query", "").lower()

if "event.type:error" in query:
    return True

return False

🔴 _is_error_focused_query matches negated queries like !event.type:error

The substring check "event.type:error" in query at line 137 also matches negated queries such as !event.type:error or NOT event.type:error. A user query like !event.type:error means "everything except errors" (i.e., transactions), but this function would incorrectly return True, causing count() to be transformed to upsampled_count() for non-error events — producing incorrect results by summing sample_weight on events that may not have meaningful sample weights.

Suggested change
query = request.GET.get("query", "").lower()
if "event.type:error" in query:
    return True
return False
query = request.GET.get("query", "").lower()
if "!event.type:error" in query or "not event.type:error" in query:
    return False
if "event.type:error" in query:
    return True
return False
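
A quick illustrative check of the false positive (a hypothetical helper mirroring the quoted substring logic, not Sentry's actual function):

```python
# Hedged sketch reproducing the reported false positive.
def _is_error_focused(query: str) -> bool:
    # Mirrors the quoted check: bare substring match on the lowered query.
    return "event.type:error" in query.lower()

assert _is_error_focused("event.type:error browser.name:Chrome")
# A negated query means "everything except errors", yet the check matches:
assert _is_error_focused("!event.type:error")  # false positive
```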

src/sentry/api/helpers/error_upsampling.py (quoted context):

    expensive repeated option lookups during high-traffic periods. This is safe
    because allowlist changes are infrequent and eventual consistency is acceptable.
    """
    cache_key = f"error_upsampling_eligible:{organization.id}:{hash(tuple(sorted(snuba_params.project_ids)))}"

🔴 Non-deterministic hash() in cache key makes cross-process caching and invalidation broken

Python's hash() on tuples is randomized per-process (via PYTHONHASHSEED since Python 3.3). The cache key at line 27 uses hash(tuple(sorted(...))), so different worker processes produce different cache keys for the same org/project combination. Since sentry.utils.cache.cache is Django's default cache (typically Redis, shared across processes), this means: (1) cache entries from one process are invisible to other processes, and (2) invalidate_upsampling_cache (src/sentry/api/helpers/error_upsampling.py:73) will compute a different hash and fail to delete entries created by other processes, making explicit invalidation ineffective.

Prompt for agents
The cache key on line 27 and the matching key in invalidate_upsampling_cache on line 73 both use Python's hash() on a tuple, which is randomized per process. Replace hash(tuple(sorted(snuba_params.project_ids))) with a deterministic representation, for example a hashlib-based hash or simply use the sorted project IDs directly as a string in the cache key: e.g., ','.join(str(pid) for pid in sorted(snuba_params.project_ids)). Both locations (line 27 in is_errors_query_for_error_upsampled_projects and line 73 in invalidate_upsampling_cache) must use the same deterministic key generation.
