yuvmen
Member

@yuvmen yuvmen commented Oct 14, 2025

In preparation for switching from frame count to token length when evaluating errors, we collect metrics on the token length of stacktraces being sent, so we can map out the statistics and the impact the switch would have. Instrumented get_token_count to monitor how long it takes.

Introduces the tokenizers library for token counting. Added the tokenization model to Sentry locally so that tokenization works without external dependencies.
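
As a rough illustration of the instrumentation described above, the sketch below counts tokens for a stacktrace string and measures how long the count takes. It is not the code from this PR: `timed_token_count`, `whitespace_tokenize`, the `record` callback, and the metric names are all hypothetical, and whitespace splitting stands in for the model-based tokenizer so the example runs with the standard library only.

```python
import time
from typing import Callable, Optional, Sequence, Tuple


def whitespace_tokenize(text: str) -> Sequence[str]:
    # Assumption: a stand-in for the real tokenizer, which is model-based.
    return text.split()


def timed_token_count(
    text: str,
    tokenize: Callable[[str], Sequence[str]] = whitespace_tokenize,
    record: Optional[Callable[[str, float], None]] = None,
) -> Tuple[int, float]:
    """Return (token count, seconds spent counting) for a stacktrace string."""
    start = time.perf_counter()
    count = len(tokenize(text)) if text else 0
    elapsed = time.perf_counter() - start
    if record is not None:
        # Hypothetical metric names, chosen only for this sketch.
        record("stacktrace.token_count", float(count))
        record("stacktrace.token_count.duration", elapsed)
    return count, elapsed


# Collect the two metrics in a plain dict instead of a metrics backend.
metrics = {}
count, elapsed = timed_token_count(
    "File a.py line 1 in foo", record=metrics.__setitem__
)
print(count, sorted(metrics))
```

Swapping in a real tokenizer would only require passing a different `tokenize` callable; the timing and recording logic stays the same.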

Redo of #99873 which removed tiktoken dep by mistake. It is still used in getsentry and causes build errors if removed.

yuvmen and others added 19 commits September 18, 2025 15:20
…Seer

In preparation for switching from frame count to token length when evaluating errors,
we collect metrics on the token length of stacktraces being sent, so we can map out the statistics
and the impact the switch would have. Instrumented get_token_count to monitor how long it takes.
We use the existing tiktoken library, which was already in use in Sentry.
We will be turning it off only if something goes wrong
Meant introducing new `transformers` package to Sentry
…cal file

saved the model locally under data/models and added a readme for downloading it again
It is still used in getsentry, and causes build to fail if removed
@yuvmen yuvmen requested review from a team as code owners October 14, 2025 20:52
@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Oct 14, 2025
@yuvmen yuvmen changed the title Yuvmen/token count stacktraces poc feat(ai_grouping): Send token length metrics on stacktraces sent to Seer Oct 14, 2025
Comment on lines 562 to 565
return 0
stacktrace_text = get_stacktrace_string(get_grouping_info_from_variants(variants))

if stacktrace_text:
Contributor

Potential bug: get_token_count calls get_grouping_info_from_variants, which returns data with keys incompatible with the downstream get_stacktrace_string function, causing incorrect calculations.
  • Description: The get_token_count function calls get_grouping_info_from_variants to generate grouping information when a cached stacktrace string is not available. This function creates a dictionary with keys like app_stacktrace. However, the downstream get_stacktrace_string function, which consumes this data, expects keys like app and system. This key mismatch causes get_stacktrace_string to find no relevant data, produce an empty string, and consequently makes get_token_count always return 0. This silently defeats the purpose of the new token count metrics collection.

  • Suggested fix: In get_token_count, replace the call to get_grouping_info_from_variants(variants) with get_grouping_info_from_variants_legacy(variants). This will produce a data structure with the keys that get_stacktrace_string expects, allowing for correct token count calculation.
    severity: 0.65, confidence: 0.95
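
The failure mode described in this comment can be reproduced in miniature. The sketch below uses simplified, hypothetical stand-ins: `stacktrace_string`, `token_count`, and both dictionaries are illustrative only and are not Sentry's real functions or data shapes. Only the key names `app`, `system`, and `app_stacktrace` come from the comment above.

```python
from typing import Dict, List


def stacktrace_string(grouping_info: Dict[str, List[str]]) -> str:
    # Hypothetical consumer: it only reads the "app" and "system" keys.
    frames = grouping_info.get("app") or grouping_info.get("system") or []
    return "\n".join(frames)


def token_count(grouping_info: Dict[str, List[str]]) -> int:
    text = stacktrace_string(grouping_info)
    # Whitespace splitting is a stand-in for real tokenization.
    return len(text.split()) if text else 0


frames = ["File a.py, in foo", "File b.py, in bar"]
mismatched = {"app_stacktrace": frames}  # key the consumer never looks up
expected = {"app": frames}               # key the consumer expects

print(token_count(mismatched))
print(token_count(expected))
```

No exception is raised in the mismatched case, so a bug of this kind would go unnoticed unless a test asserts a non-zero count.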

Member Author

This was actually correct. @lobsterkatie recently changed get_grouping_info_from_variants and moved existing usages to get_grouping_info_from_variants_legacy, which needed to happen here as well.

Member Author

Also realized my tests didn't cover this case. While I realize we previously said variants shouldn't be empty, I guess it doesn't hurt to guard against it, since it's a dict and in theory we can't be sure.
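
A guard of the kind mentioned here might look like the sketch below. This is an assumption about the shape of the check, not the code that was merged: `safe_token_count` and its `build_text` and `count_tokens` parameters are hypothetical names used to keep the example self-contained.

```python
from typing import Any, Callable, Mapping, Optional


def safe_token_count(
    variants: Optional[Mapping[str, Any]],
    build_text: Callable[[Mapping[str, Any]], str],
    count_tokens: Callable[[str], int],
) -> int:
    # Empty or missing variants: nothing to tokenize, so report zero.
    if not variants:
        return 0
    text = build_text(variants)
    # An empty stacktrace string also yields zero rather than an error.
    if not text:
        return 0
    return count_tokens(text)


def demo_build_text(variants: Mapping[str, Any]) -> str:
    # Hypothetical: read a pre-rendered stacktrace from an "app" entry.
    return variants.get("app", "")


def demo_count(text: str) -> int:
    # Whitespace splitting is a stand-in for real tokenization.
    return len(text.split())


print(safe_token_count({}, demo_build_text, demo_count))
print(safe_token_count({"app": "File a.py in foo"}, demo_build_text, demo_count))
```

A test that passes an empty dict, as in the first call, covers the case that was previously missing.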

codecov bot commented Oct 14, 2025

Codecov Report

❌ Patch coverage is 94.64286% with 3 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/sentry/seer/similarity/utils.py 94.54% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #101477      +/-   ##
===========================================
+ Coverage   78.81%    81.29%   +2.48%     
===========================================
  Files        8699      8595     -104     
  Lines      385940    383404    -2536     
  Branches    24413     23858     -555     
===========================================
+ Hits       304162    311698    +7536     
+ Misses      81427     71363   -10064     
+ Partials      351       343       -8     

@yuvmen yuvmen merged commit e24b4f5 into master Oct 15, 2025
66 checks passed
@yuvmen yuvmen deleted the yuvmen/token-count-stacktraces-poc branch October 15, 2025 16:53
Labels

Scope: Backend Automatically applied to PRs that change backend components

2 participants