
Sampling by fingerprint #27884

Open
roberttod opened this issue Aug 2, 2021 · 10 comments

Comments

@roberttod

roberttod commented Aug 2, 2021

Problem Statement

Projects with a lot of traffic tend to be much more expensive to integrate with Sentry, because during error conditions many times more errors are sent. Roughly speaking, there's a linear correlation between traffic and the dollar cost of Sentry errors. Of course there is sampling to reduce this cost, but then you lose critical information, because errors that happen at a lower rate than the sample rate can be cut out if other errors are happening at high frequency.

One could argue that if errors are happening at high frequency, then they should be addressed, deleted or filtered. But in my experience that's not reasonable - frontend projects with many users will frequently get into this state, and by the time some thought has been put into mitigation you may have already wasted a lot of money or missed a load of new errors that were sampled out.

Why can't there be a sample rate per event fingerprint? I understand that would still incur higher costs for high traffic projects (for example, you'd need to store those fingerprints somewhere), but the cost-to-error-rate relationship would no longer be linear and you'd barely lose any functionality by applying the sampling. You could even continue counting the errors correctly but not put the full error event information into the system.

Solution Brainstorm

  • Switch in the UI for turning on sampling by error fingerprint. Select a sample rate.
  • When switched on, the server will apply fingerprint logic to the error and check a counter in a key:value store to decide whether it should be sampled out (a rough sketch follows below).
  • (optionally) Add the timestamp to the error log without any detailed info (adding to the cost).
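
A minimal sketch of what the proposed server-side check could look like, assuming a per-fingerprint counter in a key:value store (a plain in-memory Map as a stand-in here) and a rule that always keeps the first occurrence; the names and rate are illustrative, not existing Sentry behaviour:

```typescript
// Hypothetical sketch of the proposed server-side check, not actual Sentry code.
const SAMPLE_RATE = 0.1; // keep roughly 1 in 10 events per fingerprint
const seenCounts = new Map<string, number>(); // stand-in for a key:value store

/** Returns true if the full event should be stored, false if it should be sampled out. */
function shouldKeepEvent(fingerprint: string): boolean {
  const count = (seenCounts.get(fingerprint) ?? 0) + 1;
  seenCounts.set(fingerprint, count);
  // Always keep the first occurrence so new issues are never lost,
  // then keep only a fixed fraction of subsequent events for that fingerprint.
  return count === 1 || Math.random() < SAMPLE_RATE;
}
```

The counter could also keep running totals, so an issue's event count stays accurate even when the full event payload is dropped (the optional third bullet above).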
@getsentry-release

Routing to @getsentry/owners-ingest for triage. ⏲️

@BYK
Member

BYK commented Aug 3, 2021

We have features that seem like they can help with your problem:

Not sure if you have already seen these and your proposal is in addition to the existing solutions, or whether you were unable to find these resources (hence sharing them here).

@untitaker
Member

untitaker commented Aug 3, 2021

Why can't there be a sample rate per event fingerprint?

Because determining the fingerprint for a particular event depends on resolving sourcemaps and debug symbols, which is the most expensive thing we do with an event on the server. We don't want people to send us large amounts of traffic, have us process it all, and pay none of the cost that it incurs. All the existing sampling and filtering you can do today is cheap to execute because it is based on data that does not need to be computed by the server first. This doesn't apply to languages like Python/Ruby, but it does to JS, for example.

@jan-auer
Member

jan-auer commented Aug 6, 2021

@roberttod to clarify, when you say "fingerprint" do you mean:

@roberttod
Author

Thanks @BYK - I've seen these features, but spike protection is only useful for unexpected spikes rather than long-running high-frequency errors, and the filtering needs manual intervention for each error type and will stop reporting an error that might still need to be addressed.

@roberttod
Author

@jan-auer ideally the sampling could use the server-side fingerprint logic, but if that's expensive, even a simpler fingerprint like the one calculated on the client side could be useful.

I was expecting it to be a server side solution because then sampling could be applied across many clients.

I wasn't aware server side fingerprint generation was so expensive, as I expected most of the cost to be storing, sorting and searching through errors on the server. If we could apply some simple fingerprint sampling, that would be very beneficial.

Would the more complex server fingerprinting in most cases join more errors together anyway? In that case it might not be much of a difference.
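
For illustration, a rough sketch of what a rudimentary client-side fingerprint sampler could look like with the JavaScript SDK's beforeSend hook; the fingerprint recipe and sample rate are assumptions, and the counter only covers a single browser session, which is exactly why a server-side variant would help more when many clients hit the same error:

```typescript
import * as Sentry from "@sentry/browser";

const SAMPLE_RATE = 0.05; // illustrative rate
const counts = new Map<string, number>(); // per-session counter only

Sentry.init({
  dsn: "__YOUR_DSN__",
  beforeSend(event) {
    const exc = event.exception?.values?.[0];
    const frames = exc?.stacktrace?.frames ?? [];
    // Rudimentary fingerprint: error type + message + exact frame locations.
    const key = [
      exc?.type,
      exc?.value,
      ...frames.map((f) => `${f.filename}:${f.function}:${f.lineno}`),
    ].join("|");

    const n = (counts.get(key) ?? 0) + 1;
    counts.set(key, n);
    // Always send the first occurrence of a fingerprint, then sample the rest.
    return n === 1 || Math.random() < SAMPLE_RATE ? event : null;
  },
});
```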

@jan-auer
Member

jan-auer commented Aug 8, 2021

Thanks for the additional input! In short, we're aware of this challenge and evaluating options at the moment (including client-side fingerprinting). I think this generally makes sense to add.

I wasn't aware server side fingerprint generation was so expensive, as I expected most of the cost to be storing, sorting and searching through errors on the server.

It's less about being expensive and more that server-side fingerprinting and grouping run quite late in the pipeline for functional reasons: they require information from pretty much every prior step of data ingestion, including server-side processing to resolve actual function names from JavaScript sourcemaps or debug information files. Server-side sampling and related functionality runs rather early, in contrast. We're even pushing the sampling decision down to the client in some cases, to increase performance and reduce bandwidth requirements.

Would the more complex server fingerprinting in most cases join more errors together anyway? In that case it might not be much of a difference.

Generally speaking, the result of server-side fingerprinting can be completely different from what could be done with information available on the client. In many cases, it would not even be possible to group errors into issues correctly on the client side.

@roberttod
Author

roberttod commented Aug 9, 2021

Understood, that all makes a lot of sense. I think some rudimentary fingerprint of the exact error message + an exact stack match would do for most cases we've seen - it wouldn't group anything it shouldn't group and would still massively reduce error count. If done on the server instead of the client (I believe sampling is client side right now), then that would also help when there are a large number of users getting the same error.

Looking at the docs, I always assumed sampling was done client side since it's an SDK option, but I realized it's not specified. Is it done server side or client side?

(A different idea from the simple solution above.) If your fingerprint logic is done further toward the end of the pipeline, I wonder if something like this would work:

  1. Generate a fingerprint from the stack trace + error message + whatever else is involved in routing an error event (e.g. "raw-fingerprint-id")
  2. Run all the usual server-side logic that eventually produces a fingerprint (is a server-side fingerprint a 1:1 mapping to an issue ID?)
  3. Add into some table "raw-fingerprint-id" = "server-side-fingerprint"
  4. When new errors come through, you could then find the server-side fingerprint from the raw fingerprint of the incoming error (where there's a 1:many relationship of server-side fingerprints to raw fingerprints)

I am probably missing some context here, so I'm not sure if this would work - I was assuming all of the info could be gathered from just the incoming error event (which seems reasonable) and that each raw fingerprint could map directly to one server-side fingerprint. A rough sketch of the mapping is below.
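
A minimal sketch of the mapping idea described above; all names are hypothetical stand-ins, and the expensive grouping pipeline is reduced to a stub:

```typescript
// All names here are hypothetical stand-ins, not real Sentry internals.
const rawToServer = new Map<string, string>(); // raw-fingerprint-id -> server-side fingerprint
const perIssueCounts = new Map<string, number>();
const SAMPLE_RATE = 0.1;

// Stand-in for the real, expensive grouping step (sourcemap/debug-file resolution etc.).
function runFullPipeline(event: { message?: string }): string {
  return `server-fingerprint:${event.message ?? "unknown"}`;
}

function handleIncomingEvent(rawFingerprint: string, event: { message?: string }): "keep" | "drop" {
  let serverFingerprint = rawToServer.get(rawFingerprint);
  if (serverFingerprint === undefined) {
    // First time this raw fingerprint is seen: run the full pipeline once,
    // learn which server-side fingerprint it maps to, and record the mapping.
    serverFingerprint = runFullPipeline(event);
    rawToServer.set(rawFingerprint, serverFingerprint);
    return "keep";
  }
  // Known raw fingerprint: a cheap sampling decision can be made up front,
  // before any symbolication, using a counter per server-side fingerprint.
  const n = (perIssueCounts.get(serverFingerprint) ?? 0) + 1;
  perIssueCounts.set(serverFingerprint, n);
  return Math.random() < SAMPLE_RATE ? "keep" : "drop";
}
```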

@untitaker
Member

Looking at the docs, I always assumed sampling was done client side since it's an SDK option, but I realized it's not specified. Is it done server side or client side?

We have a server-side filtering and sampling product as well and are not too worried about its scalability as it sits at the front of the event processing pipeline. Client-side sampling is more widely used.

I wonder if something like this would work

Yes, that can work:

  • raw-fingerprint:fingerprint is an n:1 relationship
  • If you want to filter/sample an issue, you will discover more values for raw-fingerprint over time. So you may start out with n=10 raw fingerprints, then accidentally let through a couple more raw fingerprints that map to the same fingerprint. Then you need to add those raw fingerprints to the filter list that gets pushed to the front of the pipeline (Relay) as well.

You have to ensure that n in the first relationship does not get too large, as otherwise you will end up sending/syncing a large number of raw fingerprints to Relay.

So the rudimentary fingerprint logic can't just liberally include all event data that could ever be involved in grouping; it will have to be improved purely for the scalability of this setup.
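
One way to keep n bounded (an assumption for illustration, not something decided here) would be to cap the number of raw fingerprints remembered per issue before anything is synced to Relay:

```typescript
// Sketch of capping raw fingerprints per issue; the cap and names are assumptions.
const MAX_RAW_PER_ISSUE = 50;
const rawPerIssue = new Map<string, Set<string>>(); // server fingerprint -> raw fingerprints

function recordRawFingerprint(serverFingerprint: string, rawFingerprint: string): void {
  const set = rawPerIssue.get(serverFingerprint) ?? new Set<string>();
  if (!set.has(rawFingerprint) && set.size >= MAX_RAW_PER_ISSUE) {
    // Too many variants: the rudimentary fingerprint is too fine-grained for this
    // issue, so stop growing the list rather than syncing an unbounded set to Relay.
    return;
  }
  set.add(rawFingerprint);
  rawPerIssue.set(serverFingerprint, set);
}
```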

One could potentially make the rudimentary fingerprint logic easier to maintain by porting the regular fingerprint logic to Rust first, so it can be run from both Python/Sentry and Rust/Relay.

IIRC this is as far as we got in internal discussions 1-2 months ago.

@5achinJani

+1
Much needed. At least there should be a switch on the Sentry issue details page to stop receiving events (discard future events) instead of "delete & discard future events", because we would like to keep the issue in our issues list but not receive any further events for it, which would save our events quota.
