Handle Rate Limiting for Replay Events #6710

Closed
Lms24 opened this issue Jan 10, 2023 · 5 comments
Comments

Lms24 (Member) commented Jan 10, 2023

Problem Statement

For all Sentry events, we currently do not do anything if an event is not ingested by the Sentry backend due to rate limits. We of course respect the retry-after time, during which we don't send events, but we don't retry sending the original event. This is fine for regular events (errors, transactions, sessions) as they are mostly atomic. For Replay, however, this is not the case, as we send multiple events in one replay. If one segment goes missing, the replay cannot be continued after that segment, as one (or multiple) diffs would be missing.
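For illustration, here is a minimal TypeScript sketch of the drop-on-rate-limit behaviour described above. The helper names and shapes are assumptions for this example, not the actual SDK transport code.

```ts
// Minimal sketch (not the actual SDK transport): track a "rate limited
// until" timestamp per data category and silently drop replay envelopes
// while it lies in the future.
type DataCategory = 'error' | 'transaction' | 'session' | 'replay';

const rateLimitedUntil: Partial<Record<DataCategory, number>> = {};

// Called when the backend responds with a retry-after value.
function onRateLimit(category: DataCategory, retryAfterSeconds: number): void {
  rateLimitedUntil[category] = Date.now() + retryAfterSeconds * 1000;
}

function isRateLimited(category: DataCategory): boolean {
  const until = rateLimitedUntil[category];
  return until !== undefined && Date.now() < until;
}

// Current behaviour: the segment is dropped and never retried, so the
// replay cannot be stitched together afterwards.
function sendReplaySegment(segment: object, send: (s: object) => void): void {
  if (isRateLimited('replay')) {
    return; // segment is lost for good
  }
  send(segment);
}
```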

Solution Brainstorm

We have a couple of options for how to handle replays and replay events if we hit a rate limit:

Option A: Splitting Replays

When we hit a rate limit, we pause the replay and, once the retry-after period has expired, we start a new replay with a new checkout. The obvious question here is: can we link the two (or more) replays effectively? This will probably require additional complexity in the SDK and in the Sentry Replay UI, and possibly also for replay event ingestion (not sure here...)
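A rough sketch of what the linking could look like on the SDK side; `previousReplayId` and `startNewReplay` are made-up names for this illustration, not existing Replay APIs.

```ts
// Hypothetical linking sketch: after the retry-after period, start a fresh
// replay and remember which replay it continues. Ingestion and the UI would
// have to understand this link to stitch the replays together.
interface LinkedReplay {
  replayId: string;
  previousReplayId?: string;
}

let currentReplay: LinkedReplay | undefined;

function startNewReplay(): LinkedReplay {
  // A new replay also implies a new full checkout of the DOM.
  return { replayId: crypto.randomUUID() };
}

function resumeAfterRateLimit(): void {
  const previous = currentReplay;
  currentReplay = startNewReplay();
  if (previous) {
    currentReplay.previousReplayId = previous.replayId;
  }
}
```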

Pros:

  • We get a functional replay after the rate limit period

Cons:

  • We end up with multiple replays per session
  • Linking adds complexity to SDK, UI and possibly ingestion
  • We still lose the window during the rate limit period

Option B: Pausing the Replay

When we hit a rate limit, we pause the replay and continue the same replay after the rate limit period has expired. When we restart, we take a full snapshot, which should theoretically make it possible to continue the replay even though we obviously missed segments during the rate limit period. IIRC this should work out, and users would basically see a paused/inactive period of time.

Q: Can we show users in the UI that the "missing" segments are due to rate limits? What information do we need to pass along, and when?
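To make the idea concrete, here is a minimal sketch of the pause/resume flow, assuming a hypothetical recorder with a pause() and a restart-with-full-snapshot operation; neither is an existing Replay API.

```ts
// Sketch of option B: stop recording while rate limited and resume with a
// full snapshot once the retry-after window has passed.
interface ReplayRecorder {
  pause(): void;
  restartWithFullSnapshot(): void;
}

function handleReplayRateLimit(recorder: ReplayRecorder, retryAfterSeconds: number): void {
  // No point producing segments we would have to drop anyway.
  recorder.pause();

  setTimeout(() => {
    // The full snapshot makes the replay independent of the segments lost
    // during the rate limit window; viewers just see an inactive period.
    recorder.restartWithFullSnapshot();
  }, retryAfterSeconds * 1000);
}
```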

Pros:

  • We get one functional replay

Cons:

  • We still lose the window during the rate limit period, which will be shown to users as a period of inactivity in the replay
  • Still some complexity around implementing this in the SDK, but at least not on the ingestion side and mostly not in the UI (unless we want to show some sort of explanation for the inactivity).

Option C: Retrying rate-limited Replay Requests

In order not to lose any segments, we could leave rate-limited events in the queue and retry sending them at a later time. There are implications around this, as we would potentially accumulate a lot of events in the queue, which we'd try to re-send after the first rate limit period in addition to newer segments. This increases the potential for more rate limits occurring at that time, therefore again increasing the amount of queued events, etc.
This would occur even if we only attempt to retry a request 1/2/3 times.
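A simplified sketch of why the retry approach can snowball; the queue below is purely illustrative, not SDK code.

```ts
// Sketch of option C: keep rate-limited segments in a local queue and
// re-send them once the retry-after window ends. The backlog plus the new
// segments are flushed in one burst, which can trigger the next rate limit.
const retryQueue: object[] = [];

function queueRateLimitedSegment(segment: object): void {
  retryQueue.push(segment);
}

async function flushAfterRateLimit(send: (segment: object) => Promise<void>): Promise<void> {
  // The longer the rate limit lasted, the bigger this burst of requests.
  while (retryQueue.length > 0) {
    const segment = retryQueue.shift()!;
    await send(segment);
  }
}
```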

Pros:

  • We get one functional replay with the events during the rate limiting period included

Cons:

  • Can lead to an increase in queued events on the client
  • Can lead to more rate limits after the initial one
  • Possibly also has effects on sending of other Sentry events (??)
    ==> Is this scalable at all?

My strong feeling is that option B is probably the best, but I'm happy to hear everyone's opinions.

mydea (Member) commented Jan 10, 2023

Personally, I think option B is the safest & most "correct" one. Keeping stuff around when we hit rate limits is potentially problematic, so I think option B would cover us best.
One thing that would be a very good addition to B, as Lukas noted, would be to at least show in the UI that this "inactivity" period is due to rate limiting. Maybe we could find a way to let the replay know that?

billyvg (Member) commented Jan 10, 2023

I think it depends a bit on how long rate limits last -- if they are short, B/C are viable. If they are long (in the minutes), then maybe A. It would be great to track rate limit start/end events, but this also assumes that the user stays active through the rate limiting period.

Lms24 (Member, Author) commented Jan 10, 2023

billyvg (Member) commented Jan 10, 2023

@Lms24 I think in the short term option B sounds good. Long-term, a solution that combines B + C would be ideal, where 1) we can show the user when they get rate limited and 2) we keep the events that occur during the rate limiting. We should be able to buffer the events, turn off flushing for that time period, and decide to reset the buffer if it gets too large?
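A minimal sketch of that buffering idea, assuming a hypothetical event buffer with a size cap; the names and the limit are placeholders, not existing SDK behaviour.

```ts
// Sketch of B + C combined: while rate limited, keep recording into an
// in-memory buffer instead of flushing, and reset the buffer if it grows
// past a limit.
const MAX_BUFFERED_EVENTS = 1000; // illustrative cap

let flushingEnabled = true;
const eventBuffer: object[] = [];

function onRecordingEvent(event: object, flush: (events: object[]) => void): void {
  eventBuffer.push(event);
  if (!flushingEnabled) {
    if (eventBuffer.length > MAX_BUFFERED_EVENTS) {
      eventBuffer.length = 0; // reset rather than grow without bound
    }
    return;
  }
  flush(eventBuffer.splice(0, eventBuffer.length));
}

function onReplayRateLimit(retryAfterSeconds: number): void {
  flushingEnabled = false;
  setTimeout(() => {
    flushingEnabled = true;
  }, retryAfterSeconds * 1000);
}
```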

smeubank (Member) commented:

Closing with the decision made; we can revisit if need be.
