-
-
Notifications
You must be signed in to change notification settings - Fork 4.5k
feat(seer_grouping): Filter Seer grouping requests by token count instead of frame count #103997
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…tead of frame count Change the stacktrace length check to go by token count and not be stacktrace frame count. We also do a sanity check on raw string length first, if it is below the token max then no point counting tokens, just pass. We are keeping the old bypass for certain platforms to not change current behaviour, once we move to v2 grouping model we will enable this for them as well.
| report_token_count_metric(event, variants, "block_frames") | ||
| return True | ||
| report_token_count_metric(event, variants, "pass_string_length") | ||
| return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: String length compared to token count without conversion
The code compares string_length (measured in characters) directly against max_token_count (measured in tokens) at line 383. These are different units and cannot be meaningfully compared. Since one token typically represents ~4 characters, a stacktrace with several thousand characters could have far fewer tokens. This comparison will almost always be true, causing most stacktraces to skip the expensive token counting and pass immediately, defeating the token-based filtering logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrong analysis here, we are comparing exactly because tokens are 4~ characters, which means if a string length is even less than the max token count it will never exceed the token count limit. We could have even made it 4 times that according to this anaylsis, so what we are doing is very conservative actually.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #103997 +/- ##
===========================================
+ Coverage 76.12% 80.64% +4.52%
===========================================
Files 9312 9318 +6
Lines 397283 397962 +679
Branches 25357 25357
===========================================
+ Hits 302433 320954 +18521
+ Misses 94393 76551 -17842
Partials 457 457 |
lobsterkatie
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two questions, but otherwise LGTM!
| # Exception-message-based grouping | ||
| or not hasattr(contributing_component, "frame_counts") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why remove this fail-fast option?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well it seemed like I no longer need it to have frame_counts to be able to do this check, which just needs the raw stacktrace, but maybe if it doesnt have it its indicative of something more important?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, now that variants all have a key property, you could just do something like if 'stacktrace' not in contributing_variant.key or not contributing_component: ... and that'd catch all the cases mentioned in the current version of the check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, refactored the conditions there to this 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that there's no type-check ahead of the mypy-appeasment check, the comment about it doesn't make sense - I'd just remove it (the comment that is, not the check).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Test expects `stacktrace_type` tag not provided by production code
The tests were updated to expect stacktrace_type in the metric tags passed to get_similarity_data_from_seer, but the production code in get_seer_similar_issues (in src/sentry/grouping/ingest/seer.py) only includes platform, model_version, training_mode, and hybrid_fingerprint in seer_request_metric_tags. Since stacktrace_type is never added to these tags, the test assertions will fail. Either the production code needs to be updated to include stacktrace_type, or these test expectations are incorrect.
tests/sentry/grouping/seer_similarity/test_seer.py#L70-L71
sentry/tests/sentry/grouping/seer_similarity/test_seer.py
Lines 70 to 71 in fcc165e
| "hybrid_fingerprint": False, | |
| "stacktrace_type": "system", |
tests/sentry/grouping/seer_similarity/test_seer.py#L208-L209
sentry/tests/sentry/grouping/seer_similarity/test_seer.py
Lines 208 to 209 in fcc165e
| "hybrid_fingerprint": False, | |
| "stacktrace_type": "system", |
fcc165e to
7df4650
Compare
…luding `stacktrace` makes for a better condition and clear what we are looking for
Change the stacktrace length check to go by token count and not be stacktrace frame count. We also do a sanity check on raw string length first, if it is below the token max then no point counting tokens, just pass. We are keeping the old bypass for certain platforms to not change current behaviour, once we move to v2 grouping model we will enable this for them as well.