
Conversation

dnsbty commented Nov 13, 2025

Description

We've been seeing OOMs in our service pods when we receive larger-than-normal bursts of traffic:
[image: service pod memory usage during traffic bursts]

We tracked the increased memory usage to the Sentry.Transport.Sender processes, whose mailboxes get backed up when the senders are rate limited by Sentry:
[image: memory usage attributed to Sentry.Transport.Sender process mailboxes]

The current implementation of the sender ignores Sentry's rate limit headers until a 429 status is received. At that point, each sender process sleeps until the "Retry-After" period has ended. While the process sleeps, its mailbox continues to fill with more events to send.
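
To make the failure mode concrete, here is a minimal sketch of that pattern, assuming a GenServer-based sender. `SleepySender` and `post_envelope/1` are hypothetical names for illustration, not the library's actual internals:

```elixir
defmodule SleepySender do
  @moduledoc """
  Minimal sketch of the problematic pattern described above; this is not
  the library's actual code. `post_envelope/1` is a hypothetical stand-in
  for the real HTTP transport call.
  """
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, :ok, opts)
  def send_event(pid, event), do: GenServer.cast(pid, {:send, event})

  @impl true
  def init(:ok), do: {:ok, %{}}

  @impl true
  def handle_cast({:send, event}, state) do
    case post_envelope(event) do
      {:ok, _event_id} ->
        {:noreply, state}

      {:retry_after, seconds} ->
        # Blocks this process for the whole Retry-After window. Every
        # event cast during that window sits in the mailbox, so a burst
        # of traffic while rate limited grows memory without bound.
        Process.sleep(seconds * 1_000)
        {:noreply, state}
    end
  end

  # Hypothetical transport call: would return {:retry_after, n} on a 429.
  defp post_envelope(_event), do: {:ok, "event-id"}
end
```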

This pull request adds a new Sentry.Transport.RateLimiter backed by ETS, which stores the current rate limits for the categories specified in Sentry's rate limit response header. When a message's category is under an active rate limit, the client now drops the message, as recommended by Sentry's rate limiting docs. This should prevent pileup in the process mailboxes. And because rate limits are now tracked per category, other categories still get through while one is limited, so less data is lost going forward. A rough sketch of the approach follows.
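
Here is that sketch (not the PR's actual implementation), assuming Sentry's documented `X-Sentry-Rate-Limits` header format: comma-separated quotas of the form `retry_after:categories:scope:...`, with categories separated by `;`. Module and function names are illustrative.

```elixir
defmodule RateLimiterSketch do
  @moduledoc """
  Rough sketch of an ETS-backed, per-category rate limiter. Not the PR's
  actual API; names and parsing details are illustrative.
  """

  @table __MODULE__

  def init do
    :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
  end

  @doc "Records rate limits from an `x-sentry-rate-limits` header value."
  def update_from_header(header) when is_binary(header) do
    now = System.system_time(:second)

    for limit <- String.split(header, ","),
        # A quota looks like "60:transaction:key"; an empty categories
        # field means the limit applies to all categories.
        [seconds, categories | _scope] <- [String.split(String.trim(limit), ":")],
        {retry_after, ""} <- [Integer.parse(seconds)],
        category <- String.split(categories, ";") do
      key = if category == "", do: :all, else: category
      :ets.insert(@table, {key, now + retry_after})
    end

    :ok
  end

  @doc "Returns true if the given category is currently rate limited."
  def rate_limited?(category) do
    limited?(:all) or limited?(category)
  end

  defp limited?(key) do
    case :ets.lookup(@table, key) do
      [{^key, until}] -> until > System.system_time(:second)
      [] -> false
    end
  end
end
```

A sender could then check `RateLimiterSketch.rate_limited?("transaction")` before posting and drop the event when it returns `true`. A public ETS table with `read_concurrency: true` lets every sender process check limits cheaply, without funneling reads through a single process.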

AI Disclosure: I used Claude Code for an initial pass on this work, before thoroughly reviewing and refining it manually.

I'm happy to make or receive any changes requested to help get this through.

Issues

dnsbty commented Nov 14, 2025

Per the DangerJS check, should I be adding a line to the changelog? I assumed that should be left to maintainers, but happy to do that if it would be helpful.

dnsbty commented Nov 15, 2025

We shipped this in our codebase today, and you can see the results here (the change was deployed at 18:10, where memory usage drops sharply):

[image: memory usage graph; change deployed at 18:10, followed by a sharp drop]

Compare this to the same period one week earlier, and you can see that memory usage is now much more stable:

[image: memory usage graph for the same period one week earlier]

And to make sure everything is still getting through to Sentry, these are our stats for spans from today:

[image: Sentry span stats for today]

And this is for the same period a week ago:

[image: Sentry span stats for the same period a week ago]

A week ago, 32.4K spans were rate limited out of 2.1M total (~1.54%). Today, 35.9K spans were rate limited out of 3M total (~1.20%). I don't think this comparison means much on its own, though. A week ago we would have retried many events multiple times before dropping them, so fewer would have been reported as dropped because more were being held in memory. And because we deployed today while memory was high, we would have lost a lot of events that were sitting in memory but hadn't been sent yet, and the Sentry client wouldn't have reported on those. I hadn't thought about this before building the change, but it should also prevent data loss during deploys, since events won't be trapped in the process mailboxes.

dnsbty commented Nov 18, 2025

Now, with a couple more days' traffic, you can better see the longer-term results of the change (deployed at the point marked by the cursor line on the memory graph):
[image: memory usage graph over several days, with the deploy point marked]

Everything appears to be working as expected in our application at least.


Development

Successfully merging this pull request may close these issues:

More gracefully handle situations with many events & running into rate limits
