feat(ai_grouping): Send token length metrics on stacktraces sent to Seer #101477
Conversation
…Seer In preparation for switching from frame count to token length as the size measure for errors, we collect metrics on the token length of stacktraces sent to Seer, so we can map out the statistics and the impact that change would make. Instrumented get_token_count to monitor how long it takes. We use the existing tiktoken library, which was already in use in Sentry.
We will be turning it off only if something goes wrong.
This meant introducing the new `transformers` package to Sentry.
…cal file: saved the model locally under data/models and added a README for downloading it again.
It is still used in getsentry, and the build fails if it is removed.
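A minimal sketch of the instrumentation described above. The name `get_token_count` follows the PR, but the metrics dict and the encoding class are illustrative assumptions; the real code uses the tiktoken library, which is stood in for here so the sketch is self-contained:

```python
import time


class FakeEncoding:
    """Hypothetical stand-in for a tiktoken encoding object."""

    def encode(self, text: str) -> list[int]:
        # Crude whitespace tokenization, only so the sketch runs without tiktoken.
        return [hash(tok) for tok in text.split()]


ENCODING = FakeEncoding()


def get_token_count(stacktrace_text: str, metrics: dict) -> int:
    """Count tokens in a stacktrace string, recording how long tokenization took."""
    if not stacktrace_text:
        return 0
    start = time.monotonic()
    token_count = len(ENCODING.encode(stacktrace_text))
    # Record duration and value so the cost of tokenization can be monitored.
    metrics["token_count.duration"] = time.monotonic() - start
    metrics["token_count.value"] = token_count
    return token_count
```

With the real tiktoken encoding, `encode` would return BPE token ids instead of hashes, but the timing and metric-emission shape would be the same.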
```python
        return 0

    stacktrace_text = get_stacktrace_string(get_grouping_info_from_variants(variants))

    if stacktrace_text:
```
Potential bug: `get_token_count` calls `get_grouping_info_from_variants`, which returns data with keys incompatible with the downstream `get_stacktrace_string` function, causing incorrect calculations.
- Description: The `get_token_count` function calls `get_grouping_info_from_variants` to generate grouping information when a cached stacktrace string is not available. That function creates a dictionary with keys like `app_stacktrace`. However, the downstream `get_stacktrace_string` function, which consumes this data, expects keys like `app` and `system`. This key mismatch causes `get_stacktrace_string` to find no relevant data, produce an empty string, and consequently makes `get_token_count` always return 0, silently defeating the purpose of the new token count metrics collection.
- Suggested fix: In `get_token_count`, replace the call to `get_grouping_info_from_variants(variants)` with `get_grouping_info_from_variants_legacy(variants)`. This produces a data structure with the keys that `get_stacktrace_string` expects, allowing for correct token count calculation.
severity: 0.65, confidence: 0.95
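The mismatch can be demonstrated with a toy version of the consumer. The key names (`app`/`system` vs `app_stacktrace`) come from the review comment above; the function body below is an illustrative assumption, not Sentry's actual implementation:

```python
def get_stacktrace_string(grouping_info: dict) -> str:
    """Toy consumer: only reads the 'app' and 'system' keys."""
    parts = [grouping_info.get(key, "") for key in ("app", "system")]
    return "\n".join(p for p in parts if p)


# Shape returned by get_grouping_info_from_variants (new-style keys):
new_style = {"app_stacktrace": "frame_a\nframe_b"}
# Shape returned by get_grouping_info_from_variants_legacy (expected keys):
legacy_style = {"app": "frame_a\nframe_b"}

assert get_stacktrace_string(new_style) == ""    # silently empty, so token count is 0
assert get_stacktrace_string(legacy_style) != ""  # correct data flows through
```

The failure mode is silent because an empty string is a valid return value, which is exactly why the metrics would have always reported 0.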
This was actually correct: @lobsterkatie recently made a change to `get_grouping_info_from_variants`, changing existing usages to `get_grouping_info_from_variants_legacy`, which needed to happen here as well.
Also realised my tests didn't cover this case. While I realize we previously said `variants` shouldn't be empty, I guess it doesn't hurt to protect against it, since it's a dict and we can't theoretically be sure.
Codecov Report: ❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master   #101477      +/-  ##
===========================================
+ Coverage   78.81%    81.29%   +2.48%
===========================================
  Files        8699      8595     -104
  Lines      385940    383404    -2536
  Branches    24413     23858     -555
===========================================
+ Hits       304162    311698    +7536
+ Misses      81427     71363   -10064
+ Partials      351       343       -8
```
…cy` and add test for it
In preparation for switching from frame count to token length as the size measure for errors, we collect metrics on the token length of stacktraces being sent, so we can map out the statistics and the impact that change would make. Instrumented get_token_count to monitor how long it takes.
Introduces usage of the tokenizers library for token counting. Added the local tokenization model to Sentry so tokenization works without external dependencies.
Redo of #99873, which removed the `tiktoken` dep by mistake. It is still used in `getsentry` and causes build errors if removed.