Handle Rate Limiting for Replay Events #6710

Closed
Lms24 opened this issue Jan 10, 2023 · 5 comments
Comments

Lms24 (Member) commented Jan 10, 2023

Problem Statement

For all Sentry events, we currently do not do anything if an event is not ingested by the Sentry backend due to rate limits. We of course respect the retry-after time, during which we don't send events, but we don't retry sending the original event. This is fine for regular events (errors, transactions, sessions) as they are mostly atomic. For Replay, however, this is not the case, as we send multiple events in one replay. If one segment goes missing, the replay cannot be continued after that segment, as one (or multiple) diffs would be missing.
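For illustration, here is a minimal TypeScript sketch of the drop-on-rate-limit behaviour described above. The helper names and shapes are assumptions for this example, not the actual SDK transport code.

```ts
// Minimal sketch (not the actual SDK transport): track a "rate limited
// until" timestamp per data category and silently drop replay envelopes
// while it lies in the future.
type DataCategory = 'error' | 'transaction' | 'session' | 'replay';

const rateLimitedUntil: Partial<Record<DataCategory, number>> = {};

// Called when the backend responds with a retry-after value.
function onRateLimit(category: DataCategory, retryAfterSeconds: number): void {
  rateLimitedUntil[category] = Date.now() + retryAfterSeconds * 1000;
}

function isRateLimited(category: DataCategory): boolean {
  const until = rateLimitedUntil[category];
  return until !== undefined && Date.now() < until;
}

// Current behaviour: the segment is dropped and never retried, so the
// replay cannot be stitched together afterwards.
function sendReplaySegment(segment: object, send: (s: object) => void): void {
  if (isRateLimited('replay')) {
    return; // segment is lost for good
  }
  send(segment);
}
```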

Solution Brainstorm

We have a couple of options for how to handle replays and replay events if we hit a rate limit:

Option A: Splitting Replays

When we hit a rate limit, we pause the replay and, once the retry-after period has expired, we start a new replay with a new checkout. The obvious question here is: can we link the two (or more) replays effectively? This will probably require additional complexity in the SDK and in the Sentry Replay UI, and possibly also for replay event ingestion (not sure here...)
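A rough sketch of what the linking could look like on the SDK side; `previousReplayId` and `startNewReplay` are made-up names for this illustration, not existing Replay APIs.

```ts
// Hypothetical linking sketch: after the retry-after period, start a fresh
// replay and remember which replay it continues. Ingestion and the UI would
// have to understand this link to stitch the replays together.
interface LinkedReplay {
  replayId: string;
  previousReplayId?: string;
}

let currentReplay: LinkedReplay | undefined;

function startNewReplay(): LinkedReplay {
  // A new replay also implies a new full checkout of the DOM.
  return { replayId: crypto.randomUUID() };
}

function resumeAfterRateLimit(): void {
  const previous = currentReplay;
  currentReplay = startNewReplay();
  if (previous) {
    currentReplay.previousReplayId = previous.replayId;
  }
}
```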

Pros:

  • We get a functional replay after the rate limit period

Cons:

  • We end up with multiple replays per session
  • Linking adds complexity to SDK, UI and possibly ingestion
  • We still lose the window during the rate limit period

Option B: Pausing the Replay

When we hit a rate limit, we pause the replay and continue the same replay after the rate limit period has expired. When we restart, we take a full snapshot, which should theoretically make it possible to continue the replay even though we obviously missed segments during the rate limit period. IIRC this should work out, and users would basically see a paused/inactive period of time.

Q: Can we show users in the UI that the "missing" segments are due to rate limits? What information do we need to pass along, and when?
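To make the idea concrete, here is a minimal sketch of the pause/resume flow, assuming a hypothetical recorder with a pause() and a restart-with-full-snapshot operation; neither is an existing Replay API.

```ts
// Sketch of option B: stop recording while rate limited and resume with a
// full snapshot once the retry-after window has passed.
interface ReplayRecorder {
  pause(): void;
  restartWithFullSnapshot(): void;
}

function handleReplayRateLimit(recorder: ReplayRecorder, retryAfterSeconds: number): void {
  // No point producing segments we would have to drop anyway.
  recorder.pause();

  setTimeout(() => {
    // The full snapshot makes the replay independent of the segments lost
    // during the rate limit window; viewers just see an inactive period.
    recorder.restartWithFullSnapshot();
  }, retryAfterSeconds * 1000);
}
```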

Pros:

  • We get one functional replay

Cons:

  • We still lose the window during the rate limit period, which will be shown to users as a period of inactivity in the replay
  • Still some complexity around implementing this in the SDK, but at least not on the ingestion side and mostly not in the UI (unless we want to show some sort of explanation for the inactivity).

Option C: Retrying rate-limited Replay Requests

In order not to lose any segments, we could leave rate-limited events in the queue and retry sending them at a later time. There are implications around this, as we would potentially accumulate a lot of events in the queue, which we'd try to re-send after the first rate limit period in addition to newer segments. This increases the potential for more rate limits occurring at that time, therefore again increasing the amount of queued events, etc.
This would occur even if we only attempt to retry a request 1/2/3 times.
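A simplified sketch of why the retry approach can snowball; the queue below is purely illustrative, not SDK code.

```ts
// Sketch of option C: keep rate-limited segments in a local queue and
// re-send them once the retry-after window ends. The backlog plus the new
// segments are flushed in one burst, which can trigger the next rate limit.
const retryQueue: object[] = [];

function queueRateLimitedSegment(segment: object): void {
  retryQueue.push(segment);
}

async function flushAfterRateLimit(send: (segment: object) => Promise<void>): Promise<void> {
  // The longer the rate limit lasted, the bigger this burst of requests.
  while (retryQueue.length > 0) {
    const segment = retryQueue.shift()!;
    await send(segment);
  }
}
```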

Pros:

  • We get one functional replay with the events during the rate limiting period included

Cons:

  • Can lead to an increase in queued events on the client
  • Can lead to more rate limits after the initial one
  • Possibly also has effects on sending of other Sentry events (??)
    ==> Is this scalable at all?

My strong feeling is that option B is probably the best, but I'm happy to hear everyone's opinions.

mydea (Member) commented Jan 10, 2023

Personally, I think option B is the safest & most "correct" one. Keeping stuff around when we hit rate limits is potentially problematic, so I think option B would cover us best.
One thing that would be a very good addition to B, as Lukas noted, would be to at least show in the UI that this "inactivity" period is due to rate limiting. Maybe we could find a way to let the replay know that?

billyvg (Member) commented Jan 10, 2023

I think it depends a bit on how long rate limits last -- if they are short, B/C are viable. If they are long (in the minutes), then maybe A. It would be great to track rate limit start/end events, but this also assumes that the user stays active through the rate limiting period.

Lms24 (Member, Author) commented Jan 10, 2023

billyvg (Member) commented Jan 10, 2023

@Lms24 I think in the short term option B sounds good. Long-term, a solution that combines B + C would be ideal, where 1) we can show the user when they get rate limited and 2) we keep the events that occur during the rate limiting. We should be able to buffer the events, turn off flushing for that time period, and decide to reset the buffer if it gets too large?
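A minimal sketch of that buffering idea, assuming a hypothetical event buffer with a size cap; the names and the limit are placeholders, not existing SDK behaviour.

```ts
// Sketch of B + C combined: while rate limited, keep recording into an
// in-memory buffer instead of flushing, and reset the buffer if it grows
// past a limit.
const MAX_BUFFERED_EVENTS = 1000; // illustrative cap

let flushingEnabled = true;
const eventBuffer: object[] = [];

function onRecordingEvent(event: object, flush: (events: object[]) => void): void {
  eventBuffer.push(event);
  if (!flushingEnabled) {
    if (eventBuffer.length > MAX_BUFFERED_EVENTS) {
      eventBuffer.length = 0; // reset rather than grow without bound
    }
    return;
  }
  flush(eventBuffer.splice(0, eventBuffer.length));
}

function onReplayRateLimit(retryAfterSeconds: number): void {
  flushingEnabled = false;
  setTimeout(() => {
    flushingEnabled = true;
  }, retryAfterSeconds * 1000);
}
```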

smeubank (Member) commented:

Closing with the decision made; we can revisit if need be.
