
fix(profiling): Increase profile maximum size by an order of magnitude #1321

Merged
phacops merged 4 commits into master from pierre/profiling-increase-profile-max-size on Jun 29, 2022

Conversation

phacops
Contributor

phacops commented Jun 27, 2022

With a few customers onboarded, we're seeing profiles lasting a few seconds that end up right above 10MiB after being expanded (they can be around 1MiB coming from the client, but we expand them into a more useful format and the resulting JSON sent to Kafka ends up being more than 10MiB). This change lets such profiles make it to the rest of the pipeline, as they are not a problem in themselves.

There is a good discussion to have around the maximum size we want to allow (perhaps there's a limit to what Kafka can handle without too much trouble), and we could reduce the size of the payload (using a different format than JSON, or compressing the payload before it's sent to Kafka). But for now, I want to see those profiles through and assess whether they are outliers or the norm before we put too much work into reducing the size.
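For context, a minimal Rust sketch of the kind of size guard this limit feeds into; the names (`MAX_PROFILE_SIZE`, `check_profile_size`) and the 100 MiB value are illustrative assumptions taken from this discussion, not Relay's actual code:

```rust
// Illustrative sketch only: names and values are assumptions, not Relay's actual code.
const MIB: usize = 1024 * 1024;

/// Hypothetical limit on the expanded profile payload (100 MiB per the discussion below).
const MAX_PROFILE_SIZE: usize = 100 * MIB;

/// Reject expanded profiles that would exceed the limit before they are produced to Kafka.
fn check_profile_size(expanded_json: &[u8]) -> Result<(), String> {
    if expanded_json.len() > MAX_PROFILE_SIZE {
        return Err(format!(
            "profile too large: {} bytes > {} bytes",
            expanded_json.len(),
            MAX_PROFILE_SIZE
        ));
    }
    Ok(())
}

fn main() {
    // A ~1 MiB client payload can expand to well over 10 MiB of JSON.
    let expanded = vec![b'x'; 12 * MIB];
    match check_profile_size(&expanded) {
        Ok(()) => println!("profile accepted ({} bytes)", expanded.len()),
        Err(err) => println!("profile dropped: {err}"),
    }
}
```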

phacops requested a review from a team June 27, 2022 15:15
@jan-auer
Member

  1. Have you run a performance test to check how well Relay is handling such a large JSON blob? This is now two orders of magnitude larger than any other JSON we process.
  2. Note that there are in fact two absolute size limits in production that you cannot reconfigure like this (and please don't for experiments 😄): 20MB overall request size and 100MB after inflating the content encoding (e.g. gzip). If your profiles do not compress to 20MB or less, they will still be dropped with 413.
  3. 100MB requires quite a bit of uploading. We've primarily allowed such large payloads for minidumps, which usually contain mostly empty memory, so they compress very well. I expect 100MB of JSON to compress much less, which also means that the upload from the client will take significant time (see the rough numbers sketched after this list). This means our request duration timings may be affected negatively.
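For rough intuition on points 2 and 3, a back-of-the-envelope sketch; the gzip ratio and uplink bandwidth are assumed values, not measurements:

```rust
// Back-of-the-envelope only; the compression ratio and uplink speed are assumptions, not measurements.
fn main() {
    let expanded_mb: f64 = 100.0;   // expanded JSON profile size
    let gzip_ratio: f64 = 7.0;      // assumed gzip ratio for JSON (minidumps compress far better)
    let uplink_mbit_s: f64 = 10.0;  // assumed client upload bandwidth

    let wire_mb = expanded_mb / gzip_ratio;
    let upload_secs = wire_mb * 8.0 / uplink_mbit_s;
    println!("~{wire_mb:.1} MB on the wire, ~{upload_secs:.0} s to upload");
}
```

Under these assumed numbers, a 100MB expanded profile would squeak in under the 20MB request cap but still take on the order of ten seconds to upload; a less compressible profile would be dropped with 413.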

Rather than changing the default here, would it be possible to run a canary test with a changed config instead? That would still yield test samples for you.

@phacops
Contributor Author

phacops commented Jun 28, 2022

This is sort of a way to help us find a better limit. I'm not expecting 100MB profiles and wouldn't leave the limit like this (it's way too big). This is intended to change in the near future.

Remember, we're still operating at a ridiculously low scale, receiving fewer than 2 profiles per minute, with 90% being less than 500KB. This is really about finding a good maximum size; right now, we're dropping the large ones without a clue.

We won't upload anything bigger than 20MB compressed, for sure, as the clients have hard limits anyway (5MB for Android).

I can lower this maximum to something smaller than 100MB but still bigger than what we had before and see. I brought it up to 100MB since it was the maximum payload we allowed for one other item type and it seemed more than enough.

@untitaker
Member

To my knowledge, the only thing stopping somebody from sending us a lot of profiles today is the fact that we have an organization-level feature flag. But our frontend DSN is public and we don't have rate limits to drop profiles specifically (the load balancer limits are not low enough). So assuming that is true, and considering that profiles are processed within the same infrastructure as other traffic, I think this change is quite risky, and I also don't understand what we get from it.

You already said that a limit of 100MB is not going to stick long term. What do you want to find out by deploying this limit? Is this about debugging those 100MB profiles and seeing why they got so big in the first place?

@phacops
Contributor Author

phacops commented Jun 28, 2022

Yes, the idea is that we're receiving profiles that, when expanded, are bigger than the current limit, and we want to explore why. 100MB might be too big (I mean, the max attachment size is also 100MB, so what's so different about attachments? I'd be happy to set the rate limits to the same values), but the idea is to understand those profiles better and limit/adjust accordingly.

@untitaker
Member

@phacops I think the main two differences are that attachments are written in chunks to Kafka, and that they have rather strict abuse quotas passed down from Sentry. My understanding is that we don't have any rate limits configured for profiles at the moment.
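For illustration, a minimal sketch of that chunking approach, assuming hypothetical names (`ProfileChunk`, `chunk_payload`) and chunk size; this is not Relay's actual implementation:

```rust
// Illustrative only: how a large payload could be split into fixed-size chunks
// (as attachments are) instead of being produced as one oversized Kafka message.
// Names and the chunk size are assumptions, not Relay's actual implementation.
const CHUNK_SIZE: usize = 1024 * 1024; // 1 MiB per chunk (assumed)

#[allow(dead_code)]
struct ProfileChunk<'a> {
    id: &'a str,
    index: usize,
    total: usize,
    payload: &'a [u8],
}

fn chunk_payload<'a>(id: &'a str, payload: &'a [u8]) -> Vec<ProfileChunk<'a>> {
    let total = (payload.len() + CHUNK_SIZE - 1) / CHUNK_SIZE;
    payload
        .chunks(CHUNK_SIZE)
        .enumerate()
        .map(|(index, payload)| ProfileChunk { id, index, total, payload })
        .collect()
}

fn main() {
    let profile = vec![0u8; 10 * 1024 * 1024 + 123];
    let chunks = chunk_payload("profile-1", &profile);
    // Each chunk would then be produced as its own Kafka message.
    println!("{} chunks of at most {} bytes", chunks.len(), CHUNK_SIZE);
    assert!(chunks.iter().all(|c| c.payload.len() <= CHUNK_SIZE));
}
```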

I think I would be fine with this as a temporary thing if it's not kept turned on overnight, and if we're prepared to quickly disable the profile feature flag via Flagr in case things turn south. But I'll leave the decision to @jan-auer.

@phacops
Contributor Author

phacops commented Jun 29, 2022

@untitaker OK, I see. We still have the feature flag we can turn off in case something goes weird but it's true we don't have strict abuse quotas for profiles specifically. Thanks for the explanation.

phacops merged commit 6109cec into master Jun 29, 2022
phacops deleted the pierre/profiling-increase-profile-max-size branch June 29, 2022 13:19
@untitaker
Member

And yeah @phacops, as soon as you have enough debugging data, let's revert this again.

jan-auer added a commit that referenced this pull request Jun 30, 2022
* master:
  fix(profiling): Increase profile maximum size by an order of magnitude (#1321)
  feat(dynamic-sampling): Extend protocol [INGEST-1395 INGEST-1397] (#1318)
  ref(dynamic-sampling): Sample trace after metrics extraction (#1317)
  ref(project_redis): Add counter for redis cache usage (#1322)
  feat(profiling): Add thread metadata to tell which thread is the main (#1320)