Member

@yuvmen yuvmen commented Nov 25, 2025

Change the stacktrace length check to go by token count rather than by stacktrace frame count. We also do a sanity check on the raw string length first: if it is below the token max, there is no point counting tokens, so we just pass. We are keeping the old bypass for certain platforms so as not to change current behaviour; once we move to the v2 grouping model we will enable this check for them as well.

…tead of frame count

@yuvmen yuvmen requested a review from a team as a code owner November 25, 2025 19:49
@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Nov 25, 2025
report_token_count_metric(event, variants, "block_frames")
return True
report_token_count_metric(event, variants, "pass_string_length")
return False
Contributor

Bug: String length compared to token count without conversion

The code compares string_length (measured in characters) directly against max_token_count (measured in tokens) at line 383. These are different units and cannot be meaningfully compared. Since one token typically represents ~4 characters, a stacktrace with several thousand characters could have far fewer tokens. This comparison will almost always be true, causing most stacktraces to skip the expensive token counting and pass immediately, defeating the token-based filtering logic.

Member Author

Wrong analysis here. We are comparing them precisely because tokens are ~4 characters each, which means that if a string's length is less than the max token count, it can never exceed the token count limit. We could even have made the threshold 4 times that according to this analysis, so what we are doing is actually very conservative.
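
To make the reasoning above concrete, here is a minimal sketch of a length pre-check followed by a token count. It is not the code from this PR: `exceeds_token_limit`, `rough_token_count`, and the default limit are hypothetical names and values chosen for illustration, and the 4-characters-per-token tokenizer is only a stand-in for whatever tokenizer is really used.

```python
from typing import Callable

# Hypothetical limit for illustration; the real value is not given in this thread.
MAX_TOKEN_COUNT = 7000


def rough_token_count(text: str) -> int:
    """Stand-in tokenizer: assumes roughly 4 characters per token (rounded up)."""
    return (len(text) + 3) // 4


def exceeds_token_limit(
    stacktrace_text: str,
    count_tokens: Callable[[str], int] = rough_token_count,
    max_token_count: int = MAX_TOKEN_COUNT,
) -> bool:
    """Return True if the stacktrace has more tokens than the limit allows."""
    # Cheap pre-check: assuming every token covers at least one character,
    # a string shorter than the limit in characters cannot exceed it in tokens,
    # so the expensive tokenizer call can be skipped.
    if len(stacktrace_text) < max_token_count:
        return False
    # Only strings that are long enough to matter get tokenized.
    return count_tokens(stacktrace_text) > max_token_count
```

Under the assumption that a token never covers less than one character, the pre-check can only skip work; it cannot let through a stacktrace that the full token count would have blocked.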

codecov bot commented Nov 25, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@             Coverage Diff             @@
##           master   #103997      +/-   ##
===========================================
+ Coverage   76.12%    80.64%   +4.52%     
===========================================
  Files        9312      9318       +6     
  Lines      397283    397962     +679     
  Branches    25357     25357              
===========================================
+ Hits       302433    320954   +18521     
+ Misses      94393     76551   -17842     
  Partials      457       457              

Member

@lobsterkatie lobsterkatie left a comment

Two questions, but otherwise LGTM!

Comment on lines -360 to -361
# Exception-message-based grouping
or not hasattr(contributing_component, "frame_counts")
Member

Why remove this fail-fast option?

Member Author

Well, it seemed like I no longer need it to have frame_counts to be able to do this check, which just needs the raw stacktrace. But maybe if it doesn't have it, that's indicative of something more important?

Member

Actually, now that variants all have a `key` property, you could just do something like `if 'stacktrace' not in contributing_variant.key or not contributing_component: ...` and that'd catch all the cases mentioned in the current version of the check.
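
As a rough illustration of the suggested condition, the sketch below wraps it in a helper. `FakeVariant`, `should_skip_length_check`, and the example key strings are hypothetical and exist only to make the snippet self-contained; the real variant classes and key values are not shown in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeVariant:
    """Hypothetical stand-in for a grouping variant; models only the `key` property."""

    key: str


def should_skip_length_check(
    contributing_variant: FakeVariant,
    contributing_component: Optional[object],
) -> bool:
    """Return True when there is no stacktrace-based grouping to measure."""
    # Either the variant is not stacktrace-based, or there is no contributing
    # component at all; in both cases there is nothing to run the length check on.
    return "stacktrace" not in contributing_variant.key or not contributing_component


# Example key strings below are made up for illustration.
print(should_skip_length_check(FakeVariant("app_message"), object()))
print(should_skip_length_check(FakeVariant("app_stacktrace"), object()))
```

A single substring test on the key replaces several separate type and attribute checks, which is what makes the condition easier to read.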

Member Author

nice, refactored the conditions there to this 👍

Member

@lobsterkatie lobsterkatie Dec 1, 2025

Now that there's no type-check ahead of the mypy-appeasement check, the comment about it doesn't make sense. I'd just remove it (the comment, that is, not the check).

Contributor

@cursor cursor bot left a comment

Bug: Test expects `stacktrace_type` tag not provided by production code

The tests were updated to expect stacktrace_type in the metric tags passed to get_similarity_data_from_seer, but the production code in get_seer_similar_issues (in src/sentry/grouping/ingest/seer.py) only includes platform, model_version, training_mode, and hybrid_fingerprint in seer_request_metric_tags. Since stacktrace_type is never added to these tags, the test assertions will fail. Either the production code needs to be updated to include stacktrace_type, or these test expectations are incorrect.

tests/sentry/grouping/seer_similarity/test_seer.py#L70-L71

"hybrid_fingerprint": False,
"stacktrace_type": "system",

tests/sentry/grouping/seer_similarity/test_seer.py#L208-L209

"hybrid_fingerprint": False,
"stacktrace_type": "system",

…luding `stacktrace`

makes for a better condition and makes it clear what we are looking for
@yuvmen yuvmen merged commit db9fc5e into master Dec 2, 2025
67 checks passed
@yuvmen yuvmen deleted the yuvmen/seer-grouping-token-count-filtering branch December 2, 2025 17:17