
Sampling by fingerprint #27884

Open
roberttod opened this issue Aug 2, 2021 · 10 comments

Comments

@roberttod

roberttod commented Aug 2, 2021

Problem Statement

Projects with a lot of traffic tend to be much more expensive to integrate with Sentry, because during error conditions many times more errors are sent. Roughly speaking, there's a linear correlation between traffic and the dollar cost of Sentry errors. Of course there is sampling to reduce this cost, but then you lose critical information, because errors that happen at a lower rate than the sample rate can be cut out if other errors are happening at high frequency.

One could argue that if errors are happening at high frequency, then they should be addressed, deleted or filtered. But in my experience that's not reasonable - frontend projects with many users will frequently get into this state, and by the time some thought has been put into mitigation you may have already wasted a lot of money or missed a load of new errors that were sampled out.

Why can't there be a sample rate per event fingerprint? I understand that would still incur higher costs for high traffic projects (for example, you'd need to store those fingerprints somewhere), but the cost-to-error-rate relationship would no longer be linear and you'd barely lose any functionality by applying the sampling. You could even continue counting the errors correctly but not put the full error event information into the system.

Solution Brainstorm

  • Switch in the UI for turning on sampling by error fingerprint. Select a sample rate.
  • When switched on, the server will apply fingerprint logic to the error and check a counter in a key:value store to decide whether it should be sampled out (a rough sketch follows below).
  • (optionally) Add the timestamp to the error log without any detailed info (adding to the cost).
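
A minimal sketch of what the proposed server-side check could look like, assuming a per-fingerprint counter in a key:value store (a plain in-memory Map as a stand-in here) and a rule that always keeps the first occurrence; the names and rate are illustrative, not existing Sentry behaviour:

```typescript
// Hypothetical sketch of the proposed server-side check, not actual Sentry code.
const SAMPLE_RATE = 0.1; // keep roughly 1 in 10 events per fingerprint
const seenCounts = new Map<string, number>(); // stand-in for a key:value store

/** Returns true if the full event should be stored, false if it should be sampled out. */
function shouldKeepEvent(fingerprint: string): boolean {
  const count = (seenCounts.get(fingerprint) ?? 0) + 1;
  seenCounts.set(fingerprint, count);
  // Always keep the first occurrence so new issues are never lost,
  // then keep only a fixed fraction of subsequent events for that fingerprint.
  return count === 1 || Math.random() < SAMPLE_RATE;
}
```

The counter could also keep running totals, so an issue's event count stays accurate even when the full event payload is dropped (the optional third bullet above).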
@getsentry-release

Routing to @getsentry/owners-ingest for triage. ⏲️

@BYK
Member

BYK commented Aug 3, 2021

We have features that seem like they can help with your problem:

Not sure if you have already seen these and your proposal is in addition to the existing solutions, or whether you were unable to find these resources (hence sharing them here).

@untitaker
Member

untitaker commented Aug 3, 2021

Why can't there be a sample rate per event fingerprint?

Because determining the fingerprint for a particular event depends on resolving sourcemaps and debug symbols, which is the most expensive thing we do with an event on the server. We don't want people to send us large amounts of traffic, have us process it all, and pay none of the cost that it incurs. All the existing sampling and filtering you can do today is cheap to execute because it is based on data that does not need to be computed by the server first. This doesn't apply to languages like Python/Ruby, but it does to JS, for example.

@jan-auer
Member

jan-auer commented Aug 6, 2021

@roberttod to clarify, when you say "fingerprint" do you mean:

@roberttod
Author

Thanks @BYK - I've seen these features, but spike protection is only useful for unexpected spikes rather than long-running high-frequency errors, and the filtering needs manual intervention for each error type and will stop reporting an error that might still need to be addressed.

@roberttod
Author

@jan-auer ideally the sampling could use the server-side fingerprint logic, but if that's expensive, even a simpler fingerprint like the one calculated on the client side could be useful.

I was expecting it to be a server side solution because then sampling could be applied across many clients.

I wasn't aware server side fingerprint generation was so expensive, as I expected most of the cost to be storing, sorting and searching through errors on the server. If we could apply some simple fingerprint sampling, that would be very beneficial.

Would the more complex server fingerprinting in most cases join more errors together anyway? In that case it might not be much of a difference.
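
For illustration, a rough sketch of what a rudimentary client-side fingerprint sampler could look like with the JavaScript SDK's beforeSend hook; the fingerprint recipe and sample rate are assumptions, and the counter only covers a single browser session, which is exactly why a server-side variant would help more when many clients hit the same error:

```typescript
import * as Sentry from "@sentry/browser";

const SAMPLE_RATE = 0.05; // illustrative rate
const counts = new Map<string, number>(); // per-session counter only

Sentry.init({
  dsn: "__YOUR_DSN__",
  beforeSend(event) {
    const exc = event.exception?.values?.[0];
    const frames = exc?.stacktrace?.frames ?? [];
    // Rudimentary fingerprint: error type + message + exact frame locations.
    const key = [
      exc?.type,
      exc?.value,
      ...frames.map((f) => `${f.filename}:${f.function}:${f.lineno}`),
    ].join("|");

    const n = (counts.get(key) ?? 0) + 1;
    counts.set(key, n);
    // Always send the first occurrence of a fingerprint, then sample the rest.
    return n === 1 || Math.random() < SAMPLE_RATE ? event : null;
  },
});
```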

@jan-auer
Member

jan-auer commented Aug 8, 2021

Thanks for the additional input! In short, we're aware of this challenge and evaluating options at the moment (including client-side fingerprinting). I think this generally makes sense to add.

I wasn't aware server side fingerprint generation was so expensive, as I expected most of the cost to be storing, sorting and searching through errors on the server.

It's less about being expensive and more that server-side fingerprinting and grouping run quite late in the pipeline for functional reasons: they require information from pretty much every prior step of data ingestion, including server-side processing to resolve actual function names from JavaScript sourcemaps or debug information files. Server-side sampling and related functionality runs rather early, in contrast. We're even pushing the sampling decision down to the client in some cases, to increase performance and reduce bandwidth requirements.

Would the more complex server fingerprinting in most cases join more errors together anyway? In that case it might not be much of a difference.

Generally speaking, the result of server-side fingerprinting can be completely different from what could be done with information available on the client. In many cases, it would not even be possible to group errors into issues correctly on the client side.

@roberttod
Author

roberttod commented Aug 9, 2021

Understood, that all makes a lot of sense. I think some rudimentary fingerprint of the exact error message + an exact stack match would do for most cases we've seen - it wouldn't group anything it shouldn't group and would still massively reduce error count. If done on the server instead of the client (I believe sampling is client side right now), then that would also help when there are a large number of users getting the same error.

Looking at the docs, I always assumed sampling was done client side since it's an SDK option, but I realized it's not specified. Is it done server side or client side?

(A different idea from the simple solution above.) If your fingerprint logic is done further toward the end of the pipeline, I wonder if something like this would work:

  1. Generate a fingerprint from the stack trace + error message + whatever else is involved in routing an error event (e.g. "raw-fingerprint-id")
  2. Run all the usual server-side logic that eventually produces a fingerprint (is a server-side fingerprint a 1:1 mapping to an issue ID?)
  3. Add into some table "raw-fingerprint-id" = "server-side-fingerprint"
  4. When new errors come through, you could then find the server-side fingerprint from the raw fingerprint of the incoming error (where there's a 1:many relationship of server-side fingerprints to raw fingerprints)

I am probably missing some context here, so I'm not sure if this would work - I was assuming all of the info could be gathered from just the incoming error event (which seems reasonable) and that each raw fingerprint could map directly to one server-side fingerprint. A rough sketch of the mapping is below.
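
A minimal sketch of the mapping idea described above; all names are hypothetical stand-ins, and the expensive grouping pipeline is reduced to a stub:

```typescript
// All names here are hypothetical stand-ins, not real Sentry internals.
const rawToServer = new Map<string, string>(); // raw-fingerprint-id -> server-side fingerprint
const perIssueCounts = new Map<string, number>();
const SAMPLE_RATE = 0.1;

// Stand-in for the real, expensive grouping step (sourcemap/debug-file resolution etc.).
function runFullPipeline(event: { message?: string }): string {
  return `server-fingerprint:${event.message ?? "unknown"}`;
}

function handleIncomingEvent(rawFingerprint: string, event: { message?: string }): "keep" | "drop" {
  let serverFingerprint = rawToServer.get(rawFingerprint);
  if (serverFingerprint === undefined) {
    // First time this raw fingerprint is seen: run the full pipeline once,
    // learn which server-side fingerprint it maps to, and record the mapping.
    serverFingerprint = runFullPipeline(event);
    rawToServer.set(rawFingerprint, serverFingerprint);
    return "keep";
  }
  // Known raw fingerprint: a cheap sampling decision can be made up front,
  // before any symbolication, using a counter per server-side fingerprint.
  const n = (perIssueCounts.get(serverFingerprint) ?? 0) + 1;
  perIssueCounts.set(serverFingerprint, n);
  return Math.random() < SAMPLE_RATE ? "keep" : "drop";
}
```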

@untitaker
Member

Looking at the docs, I always assumed sampling was done client side since it's an SDK option, but I realized it's not specified. Is it done server side or client side?

We have a server-side filtering and sampling product as well and are not too worried about its scalability as it sits at the front of the event processing pipeline. Client-side sampling is more widely used.

I wonder if something like this would work

Yes, that can work:

  • raw-fingerprint:fingerprint is an n:1 relationship
  • If you want to filter/sample an issue, you will discover more values for raw-fingerprint over time. So you may start out with n=10 raw fingerprints, then accidentally let through a couple more raw fingerprints that map to the same fingerprint. Then you need to add those raw fingerprints to the filter list that gets pushed to the front of the pipeline (Relay) as well.

You have to ensure that n in the first relationship does not get too large, as otherwise you will end up sending/syncing a large number of raw fingerprints to Relay.

So the rudimentary fingerprint logic can't just liberally include all event data that could ever be involved in grouping; it will have to be improved purely for the scalability of this setup.
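
One way to keep n bounded (an assumption for illustration, not something decided here) would be to cap the number of raw fingerprints remembered per issue before anything is synced to Relay:

```typescript
// Sketch of capping raw fingerprints per issue; the cap and names are assumptions.
const MAX_RAW_PER_ISSUE = 50;
const rawPerIssue = new Map<string, Set<string>>(); // server fingerprint -> raw fingerprints

function recordRawFingerprint(serverFingerprint: string, rawFingerprint: string): void {
  const set = rawPerIssue.get(serverFingerprint) ?? new Set<string>();
  if (!set.has(rawFingerprint) && set.size >= MAX_RAW_PER_ISSUE) {
    // Too many variants: the rudimentary fingerprint is too fine-grained for this
    // issue, so stop growing the list rather than syncing an unbounded set to Relay.
    return;
  }
  set.add(rawFingerprint);
  rawPerIssue.set(serverFingerprint, set);
}
```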

One could potentially make the rudimentary fingerprint logic easier to maintain by porting the regular fingerprint logic to Rust first, so it can be run from both Python/Sentry and Rust/Relay.

IIRC this is as far as we got in internal discussions 1-2 months ago.

@5achinJani

+1
Much needed. At least there should be a switch on the Sentry issue details page to stop receiving events (discard future events) instead of "delete & discard future events", because we would like to keep the issue in our issues list but not receive any further events for it, which would save our events quota.
