feat(upsampling): Support upsampled error count with performance optimizations #75
camcalaquian wants to merge 2 commits into
Conversation
…etsentry#94376) Part of the Error Upsampling project: https://www.notion.so/sentry/Tech-Spec-Error-Up-Sampling-1e58b10e4b5d80af855cf3b992f75894?source=copy_link The events-stats API will now check whether all projects in the query are allowlisted for upsampling, and if so convert the count query to a sum over `sample_weight` in Snuba; this is done by defining a new SnQL function `upsampled_count()`. I noticed there are also eps() and epm() functions in use in this endpoint. I considered (and even started work on) swapping eps() and epm() as well, since for correctness they should probably also use `sample_weight` rather than counting naively, but this caused some complications. Since they are only used by specific dashboard widgets and are not available in Discover, I decided to defer changing them until the need arises.
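For reference, a minimal sketch of how such a weighted count can be expressed with snuba_sdk; this is illustrative only and assumes a `sample_weight` column, not the PR's exact definition:

```python
from snuba_sdk import Column, Function

# Weight each event by its sample_weight instead of counting rows one-by-one.
# Illustrative sketch; the PR registers upsampled_count() inside Sentry's
# query builder rather than constructing it inline like this.
upsampled_count = Function(
    "sum",
    [Column("sample_weight")],
    alias="upsampled_count",
)
```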
- Add a 60-second cache for upsampling eligibility checks to improve performance
- Separate the upsampling eligibility check from the query transformation for better optimization
- Remove unnecessary null checks in upsampled_count() per schema requirements
- Add cache invalidation utilities for configuration management

This improves performance during high-traffic periods by avoiding repeated expensive allowlist lookups while maintaining data consistency.
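A minimal sketch of the 60-second cache pattern described above, assuming Django's shared cache backend; `_is_allowlisted` is a hypothetical stand-in for the real allowlist lookup, and the key shown here is deterministic, sidestepping the hash() issue flagged in the review below:

```python
from django.core.cache import cache

CACHE_TTL_SECONDS = 60  # eventual consistency is acceptable for allowlist changes

def is_upsampling_eligible(organization_id: int, project_ids: list[int]) -> bool:
    # Deterministic key so every worker process agrees on where the entry lives
    ids = ",".join(str(pid) for pid in sorted(project_ids))
    key = f"error_upsampling_eligible:{organization_id}:{ids}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    eligible = all(_is_allowlisted(pid) for pid in project_ids)  # hypothetical helper
    cache.set(key, eligible, CACHE_TTL_SECONDS)
    return eligible
```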
Tenki Code Review - Complete (8 files reviewed)
This PR implements error event upsampling to improve query accuracy, but it has one critical caching bug that breaks the feature in multi-worker deployments, plus two code quality issues. The non-deterministic hash-based cache key prevents cache hits across worker processes, reducing the optimization to a no-op.
Overview
The PR adds error upsampling support to Sentry's event statistics API, allowing weighted aggregations for queries on allowlisted projects. The changes include:
- New helper module (`error_upsampling.py`) with eligibility checks and query transformations
- Updated endpoint to conditionally apply upsampling transformations based on the project allowlist
- New SnQL aggregate function for weighted counting (`upsampled_count`)
- Test factory improvements to handle `client_sample_rate` extraction
Critical Issues
🔴 Non-deterministic Cache Key (cq-001)
The cache key uses Python's hash(tuple(...)) which is randomized per process due to PYTHONHASHSEED. In a multi-worker deployment (production), worker A stores cache entries under hash value X, but worker B computes hash value Y for the same project IDs and never finds the entry. The entire 60-second cache optimization is a no-op in production. Additionally, invalidate_upsampling_cache() fails silently because it can't delete entries keyed with other processes' hash values.
Code Quality Concerns
🟡 Dead Intermediate Variable (cq-003)
The variable should_upsample is assigned and immediately copied to upsampling_enabled with no other use. Pure dead code that adds noise.
🟡 Redundant Transformation Calls (cq-004)
The same transform_query_columns_for_error_upsampling(query_columns) is called three times in separate branches with identical inputs, creating maintenance risk and misleading comments suggesting they're different transformations.
Recommendation
Request Changes: The caching bug must be fixed before merge. The non-deterministic hash key means the performance optimization doesn't work in production. Code quality issues should also be cleaned up before merging.
```python
    expensive repeated option lookups during high-traffic periods. This is safe
    because allowlist changes are infrequent and eventual consistency is acceptable.
    """
    cache_key = f"error_upsampling_eligible:{organization.id}:{hash(tuple(sorted(snuba_params.project_ids)))}"
```
🔴 Non-deterministic hash() in shared cache key breaks cross-worker caching (bug)
In src/sentry/api/helpers/error_upsampling.py at lines 27 and 73, the cache key is built using Python's hash(tuple(sorted(...))). Python randomizes the hash seed per process (PYTHONHASHSEED), so the same list of project IDs will produce a different integer hash value in each worker process. Because the cache is shared (Redis), entries written by process A are keyed with hash X, but process B computes hash Y for the same input and will never find them. The caching optimization is effectively a no-op in any multi-worker deployment. The invalidate_upsampling_cache function (line 73) has the same defect: it deletes a key under hash Y, but the entry was stored under hash X by another process, so invalidation silently fails.
💡 Suggestion: Replace hash(tuple(sorted(...))) with a deterministic string representation, e.g. ','.join(str(pid) for pid in sorted(snuba_params.project_ids)). Apply the same fix to invalidate_upsampling_cache at line 73.
```diff
-cache_key = f"error_upsampling_eligible:{organization.id}:{hash(tuple(sorted(snuba_params.project_ids)))}"
+cache_key = f"error_upsampling_eligible:{organization.id}:{','.join(str(pid) for pid in sorted(snuba_params.project_ids))}"
```
📋 Prompt for AI Agents
In src/sentry/api/helpers/error_upsampling.py, the cache keys at lines 27 and 73 use hash(tuple(sorted(...))) which is non-deterministic across Python processes due to PYTHONHASHSEED randomization. Replace both occurrences with a deterministic string representation. At line 27: change f"error_upsampling_eligible:{organization.id}:{hash(tuple(sorted(snuba_params.project_ids)))}" to f"error_upsampling_eligible:{organization.id}:{','.join(str(pid) for pid in sorted(snuba_params.project_ids))}". At line 73: change f"error_upsampling_eligible:{organization_id}:{hash(tuple(sorted(project_ids)))}" to f"error_upsampling_eligible:{organization_id}:{','.join(str(pid) for pid in sorted(project_ids))}".
```python
    should_upsample = is_errors_query_for_error_upsampled_projects(
        snuba_params, organization, dataset, request
    )

    # Store the upsampling decision to apply later during query building
    # This separation allows for better query optimization and caching
    upsampling_enabled = should_upsample
```
🟡 Dead intermediate variable should_upsample in _get_event_stats (maintainability)
In src/sentry/api/endpoints/organization_events_stats.py at lines 220-226, the result of is_errors_query_for_error_upsampled_projects(...) is first assigned to should_upsample, then immediately copied verbatim to upsampling_enabled. The variable should_upsample is never read again. This is a pure no-op that adds noise and suggests the comment "Store the upsampling decision to apply later" was written for a different architecture that didn't survive.
💡 Suggestion: Remove should_upsample and assign directly: upsampling_enabled = is_errors_query_for_error_upsampled_projects(snuba_params, organization, dataset, request).
📋 Prompt for AI Agents
In src/sentry/api/endpoints/organization_events_stats.py around lines 218-226 in the _get_event_stats closure, remove the intermediate should_upsample variable. Replace the two-step should_upsample = is_errors_query_for_error_upsampled_projects(...); upsampling_enabled = should_upsample with a single assignment: upsampling_enabled = is_errors_query_for_error_upsampled_projects(snuba_params, organization, dataset, request). Also remove the comments that were explaining this unnecessary separation.
```python
    # Store the upsampling decision to apply later during query building
    # This separation allows for better query optimization and caching
    upsampling_enabled = should_upsample
    final_columns = query_columns
```
🟡 transform_query_columns_for_error_upsampling called redundantly in three branches (maintainability)
In src/sentry/api/endpoints/organization_events_stats.py the transformation transform_query_columns_for_error_upsampling(query_columns) is called separately in three execution branches (lines 233, 277, 296). All three calls pass the same unchanged query_columns input. The result could be computed once after the upsampling_enabled check and reused in all branches. As written, a future change to the transformation arguments or logic would need to be made in three places, and the three calls have misleading copy-paste comments suggesting they serve different purposes ('just before query execution', 'just before RPC query execution', 'just before standard query execution').
💡 Suggestion: Compute final_columns = transform_query_columns_for_error_upsampling(query_columns) if upsampling_enabled else query_columns once after the upsampling_enabled assignment, and remove the three repeated conditional blocks inside the branches.
📋 Prompt for AI Agents
In src/sentry/api/endpoints/organization_events_stats.py in the _get_event_stats closure, after computing upsampling_enabled and final_columns = query_columns (around lines 226-227), add: if upsampling_enabled: final_columns = transform_query_columns_for_error_upsampling(query_columns). Then remove the three identical if upsampling_enabled: final_columns = transform_query_columns_for_error_upsampling(query_columns) blocks inside the if top_events > 0: branch (line 232-233), the if use_rpc: branch (lines 276-277), and the final standard query branch (lines 295-296). This centralizes the transformation logic and eliminates the risk of the three copies diverging.
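A sketch of the suggested consolidation, using the variable names already quoted above:

```python
# Decide once, right after the eligibility check, so all three execution
# branches (top_events, RPC, standard) share a single transformation site.
final_columns = (
    transform_query_columns_for_error_upsampling(query_columns)
    if upsampling_enabled
    else query_columns
)
```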
```python
    should_upsample = is_errors_query_for_error_upsampled_projects(
        snuba_params, organization, dataset, request
    )
```
🔴 Upsampling check uses outer closure dataset instead of the scoped_dataset parameter
In _get_event_stats, the upsampling eligibility check on line 221 passes the outer closure variable dataset to is_errors_query_for_error_upsampled_projects, but the function's actual dataset is scoped_dataset (the first parameter). When _get_event_stats is called from the widget-splitting logic in get_event_stats_factory (e.g., src/sentry/api/endpoints/organization_events_stats.py:372, :386, :427, :444), a different dataset like discover can be passed as scoped_dataset while the outer dataset remains metrics_enhanced_performance. This causes _should_apply_sample_weight_transform to evaluate against the wrong dataset module, so upsampling will never be applied in widget-splitting paths even when querying error events on the discover dataset.
```diff
-should_upsample = is_errors_query_for_error_upsampled_projects(
-    snuba_params, organization, dataset, request
-)
+should_upsample = is_errors_query_for_error_upsampled_projects(
+    snuba_params, organization, scoped_dataset, request
+)
```
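The bug class is easy to reproduce in isolation; a distilled illustration with hypothetical names:

```python
def get_event_stats_factory(dataset):
    def _get_event_stats(scoped_dataset, query_columns):
        # Bug: reads `dataset` from the enclosing scope instead of the
        # `scoped_dataset` parameter, so per-call dataset overrides are ignored.
        return check_eligibility(dataset)  # should be scoped_dataset

    return _get_event_stats
```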
```python
    query = request.GET.get("query", "").lower()

    if "event.type:error" in query:
        return True

    return False
```
🔴 _is_error_focused_query matches negated queries like !event.type:error
The substring check "event.type:error" in query at line 137 also matches negated queries such as !event.type:error or NOT event.type:error. A user query like !event.type:error means "everything except errors" (i.e., transactions), but this function would incorrectly return True, causing count() to be transformed to upsampled_count() for non-error events — producing incorrect results by summing sample_weight on events that may not have meaningful sample weights.
```diff
 query = request.GET.get("query", "").lower()
+if "!event.type:error" in query or "not event.type:error" in query:
+    return False
 if "event.type:error" in query:
     return True
 return False
```
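Substring checks on the raw query string remain fragile (the negated form still contains the positive token). A slightly more robust illustrative alternative using regular expressions, not the PR's actual parser:

```python
import re

# Negated forms: "!event.type:error" or "not event.type:error"
_NEGATED = re.compile(r"(?:!|\bnot\s+)event\.type:error\b")
_POSITIVE = re.compile(r"(?<!!)\bevent\.type:error\b")

def _is_error_focused_query(query: str) -> bool:
    q = query.lower()
    if _NEGATED.search(q):  # any negation disqualifies the whole query
        return False
    return bool(_POSITIVE.search(q))
```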
```python
    expensive repeated option lookups during high-traffic periods. This is safe
    because allowlist changes are infrequent and eventual consistency is acceptable.
    """
    cache_key = f"error_upsampling_eligible:{organization.id}:{hash(tuple(sorted(snuba_params.project_ids)))}"
```
🔴 Non-deterministic hash() in cache key makes cross-process caching and invalidation broken
Python's hash() on tuples is randomized per-process (via PYTHONHASHSEED since Python 3.3). The cache key at line 27 uses hash(tuple(sorted(...))), so different worker processes produce different cache keys for the same org/project combination. Since sentry.utils.cache.cache is Django's default cache (typically Redis, shared across processes), this means: (1) cache entries from one process are invisible to other processes, and (2) invalidate_upsampling_cache (src/sentry/api/helpers/error_upsampling.py:73) will compute a different hash and fail to delete entries created by other processes, making explicit invalidation ineffective.
Prompt for agents
The cache key on line 27 and the matching key in invalidate_upsampling_cache on line 73 both use Python's hash() on a tuple, which is randomized per process. Replace hash(tuple(sorted(snuba_params.project_ids))) with a deterministic representation, for example a hashlib-based hash or simply use the sorted project IDs directly as a string in the cache key: e.g., ','.join(str(pid) for pid in sorted(snuba_params.project_ids)). Both locations (line 27 in is_errors_query_for_error_upsampled_projects and line 73 in invalidate_upsampling_cache) must use the same deterministic key generation.
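If a fixed-length key is preferred over embedding the raw ID list, a hashlib digest gives the same cross-process determinism; an illustrative helper, not code from the PR:

```python
import hashlib

def upsampling_cache_key(organization_id: int, project_ids: list[int]) -> str:
    # sha1 here is stable across processes and Python versions, unlike hash()
    ids = ",".join(str(pid) for pid in sorted(project_ids))
    digest = hashlib.sha1(ids.encode("utf-8")).hexdigest()
    return f"error_upsampling_eligible:{organization_id}:{digest}"
```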