
[EPIC] Move the metrics cardinality limiter to Relay #2717

Closed
30 tasks done
Tracked by #59893
jan-auer opened this issue Nov 13, 2023 · 10 comments

jan-auer commented Nov 13, 2023

The performance of metrics queries depends on the number of time series they have to aggregate. To put an upper bound on the number of time series - and thus the number of metric buckets with a unique combination of tag values - the metrics ingestion pipeline has a cardinality limiter.

It resides in the metrics indexer consumer, currently causing more than half of its CPU usage and introducing additional latency due to a high number of Redis calls. There are two reasons to move the cardinality limiter out of the indexer consumers:

  1. The limiter currently does not apply to Relay.
  2. The indexer consumers are a bottleneck and rely on low latency and high throughput.
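
To make "unique combination of tag values" concrete, a bucket's time-series identity can be thought of as a hash over the metric name and its full tag set. This is only an illustration of the concept, not Relay's actual hashing scheme:

use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Illustration only: every distinct (metric name, tag values) combination
/// is one time series, so hashing both identifies a unique series.
fn series_hash(mri: &str, tags: &BTreeMap<String, String>) -> u64 {
    let mut hasher = std::collections::hash_map::DefaultHasher::new();
    mri.hash(&mut hasher);
    // BTreeMap iterates in sorted key order, so tag order cannot change the hash.
    for (key, value) in tags {
        key.hash(&mut hasher);
        value.hash(&mut hasher);
    }
    hasher.finish()
}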

Stage 1

Stage 1 / Re-Implementation

  21 tracked sub-issues; all assigned to Dav1dde except one backend-scoped item assigned to jjbayer.

Validation

  • Metrics
    • Same metrics as in the indexer cardinality limiter
    • Observability when org limits are hit, via Sentry errors and metrics

Rollout

  • New memstore
  • Load Test (cardinality), adapt existing indexer load tests
  • Rollout in S4S
    • Use a really low sample rate, slowly ramp up the rate
    • Verify the indexer limiter is no longer hit
    • Turn off limiter in Indexer
  • Rollout in Production
    • Same as above, just more informed decisions

Stage 2

Replace the Redis set with an in-memory HLL or Bloom filter that is periodically merged back into Redis to synchronize all nodes.

Using either probabilistic data structure, in either Redis or Relay, remains an option. A Bloom filter could cut the number of bits stored per entry roughly in half (from 32 to ~15), but at the cost of lookup and write performance. Since our current bottleneck is more likely CPU than storage, this option is less interesting and valuable.
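
A minimal sketch of the Stage 2 idea, assuming a double-hashing Bloom filter; the structure name, parameters, and the merge-back-to-Redis step are illustrative only:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative sketch: a local Bloom filter whose bit array could
/// periodically be OR-merged back into Redis to synchronize nodes
/// (the merge step is omitted here).
struct BloomFilter {
    bits: Vec<u64>,
    num_hashes: u32,
}

impl BloomFilter {
    fn new(num_bits: usize, num_hashes: u32) -> Self {
        Self {
            bits: vec![0; (num_bits + 63) / 64],
            num_hashes,
        }
    }

    /// Returns `true` if the entry was (probably) not seen before,
    /// i.e. at least one of its bits was previously unset.
    fn insert(&mut self, entry: &impl Hash) -> bool {
        let num_bits = self.bits.len() * 64;
        // Double hashing: derive all probe positions from two base hashes.
        let h1 = hash_with_seed(entry, 0);
        let h2 = hash_with_seed(entry, 1);
        let mut newly_set = false;
        for i in 0..u64::from(self.num_hashes) {
            let bit = (h1.wrapping_add(i.wrapping_mul(h2)) as usize) % num_bits;
            let (word, mask) = (bit / 64, 1u64 << (bit % 64));
            if self.bits[word] & mask == 0 {
                self.bits[word] |= mask;
                newly_set = true;
            }
        }
        newly_set
    }
}

fn hash_with_seed(entry: &impl Hash, seed: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    entry.hash(&mut hasher);
    hasher.finish()
}

Because a Bloom filter only answers membership, it would be paired with a counter that is incremented whenever insert reports a new entry; that counter is what would be compared against the limit.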

@jan-auer jan-auer self-assigned this Nov 13, 2023
@jan-auer jan-auer changed the title Move the metrics cardinality limiter to Relay [EPIC] Move the metrics cardinality limiter to Relay Nov 13, 2023
@Dav1dde Dav1dde self-assigned this Nov 14, 2023

jjbayer commented Nov 20, 2023

Ideally we can implement a solution that supports both the existing cardinality limiting and the "product-aware", per-tag limiter for span metrics. A good starting point for a combined solution is trying to come up with a common config format; I had something like this in mind:

For the classic limiter:

{
    "scope": {
        "level": "organization",
        "use_case": "transactions"
    },
    "window": 3600,
    "limit": 80000,
    "keep": "first-seen", // keep recording tag combinations until limit is exceeded
    "drop_strategy": "bucket" // drop entire buckets when limit is exceeded
}

For a per-tag limiter (most complex version):

{
    "scope": {
        "level": "project",
        "use_case": "spans",
        "metric": "d:spans/exclusive_time_light@millisecond",
        // limit specific combination of tags
        "tags": [
            "span.group",
            "transaction"
        ]
    },
    "window": 3600,
    "limit": 1000,
    "keep": {
        "type": "top-k", // maintain list of 1000 slowest spans
        "weight": "span.exclusive_time" // same syntax as sampling rules
    },
    "drop_strategy": {
        // replace tag values for the defined tags with a static value
        "type": "tag",
        "replacement": "<< other >>"
    }
}
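
One way such a combined config could be represented in Relay, sketched with serde; the type and field names below are guesses based on the JSON above, not the format that eventually shipped:

use serde::Deserialize;

/// Hypothetical Rust representation of the proposed config format.
#[derive(Debug, Deserialize)]
struct CardinalityLimit {
    scope: LimitScope,
    /// Window length in seconds.
    window: u64,
    /// Maximum number of distinct entries per window.
    limit: u64,
    keep: KeepStrategy,
    drop_strategy: DropStrategy,
}

#[derive(Debug, Deserialize)]
struct LimitScope {
    level: String,          // "organization" or "project"
    use_case: String,       // "transactions", "spans", ...
    metric: Option<String>, // only set for per-metric limits
    #[serde(default)]
    tags: Vec<String>,      // only set for per-tag limits
}

#[derive(Debug, Deserialize)]
#[serde(untagged)]
enum KeepStrategy {
    /// e.g. "first-seen"
    Simple(String),
    /// e.g. { "type": "top-k", "weight": "span.exclusive_time" }
    Detailed { r#type: String, weight: String },
}

#[derive(Debug, Deserialize)]
#[serde(untagged)]
enum DropStrategy {
    /// e.g. "bucket"
    Simple(String),
    /// e.g. { "type": "tag", "replacement": "<< other >>" }
    Detailed { r#type: String, replacement: String },
}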


jjbayer commented Nov 22, 2023

For the span group cardinality limiter, we will need to apply the limiting in a different part of the code base, because we want to update the span payload itself when a limit has been reached (e.g. drop a tag). So a common config might not make sense; in fact, the only common ground might be the bare-bones cardinality tracker.

We could cleanly separate the cardinality limiting concern from metrics by providing a reusable type like this:

struct CardinalityLimiter {
    scope: String, // used as prefix for redis entries
    window: u64,   // seconds
    limit: usize,  // maximum number of entries per window
}

impl CardinalityLimiter {
    /// Returns `true` if the entry was successfully inserted.
    ///
    /// Returns `false` if the limit for the current time window has been exceeded.
    pub fn try_insert(&self, entry: impl std::hash::Hash) -> bool;
}
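
A minimal in-memory stand-in for how `try_insert` could behave, purely to illustrate the proposed semantics; the real limiter keeps its state in Redis and shares it across nodes, and this toy version needs `&mut self`:

use std::collections::HashSet;
use std::hash::{Hash, Hasher};
use std::time::{SystemTime, UNIX_EPOCH};

/// Illustrative in-memory variant of the proposed API; not Relay's implementation.
struct InMemoryCardinalityLimiter {
    scope: String,
    window: u64, // seconds
    limit: usize,
    current_window: u64,
    seen: HashSet<u64>,
}

impl InMemoryCardinalityLimiter {
    /// Returns `true` if the entry fits into the current window's limit.
    fn try_insert(&mut self, entry: impl Hash) -> bool {
        let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs();
        let window = now / self.window;
        if window != self.current_window {
            // New time window: forget everything seen in the previous one.
            self.current_window = window;
            self.seen.clear();
        }

        let mut hasher = std::collections::hash_map::DefaultHasher::new();
        self.scope.hash(&mut hasher);
        entry.hash(&mut hasher);
        let hash = hasher.finish();

        // Entries already seen in this window are always accepted; new entries
        // are accepted only while the limit has not been reached.
        self.seen.contains(&hash) || (self.seen.len() < self.limit && self.seen.insert(hash))
    }
}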


Dav1dde commented Jan 30, 2024

Enabled in SaaS for some Sentry Orgs now. Seeing some issues with Redis memory consumption. I'll investigate before continuing with the rollout.


Dav1dde commented Feb 7, 2024

We have a 50% Rollout rate on SaaS and it looks good. We might want to move the cardinality limits to a separate cluster which has less redundancy to reduce network traffic and increase performance. Otherwise we'll have to slightly increase available memory on the existing cluster before rolling out to 100%.


Dav1dde commented Feb 15, 2024

Rolled out at 100%, but not activated due to some product concerns on our side.

@jernejstrasner

@Dav1dde can you elaborate on the product concerns please?


Dav1dde commented Feb 15, 2024

We tried rolling it out for our own Sentry org twice now; in both cases it broke something in the performance product. This is not a bug: it simply shows that the cardinality limiter is culling excessive cardinality from the metrics. Unfortunately, this means the affected metrics become unreliable (alerts stop working or fire when they shouldn't).

With the limits we currently have configured, this also affects some (10-30) customer orgs, although a lot less severely than our own.

Ideally we would work on bringing down the cardinality first, then let the limiter enforce the limits and set up alerts for the product team once the limit is breached again in the future.

@jernejstrasner


Dav1dde commented Mar 7, 2024

Implemented passive mode on a per-limit basis. We should be able to enable passive mode for a few known orgs and then close our efforts here, as sketched below.
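
For context, passive mode means a limit is still evaluated and reported but not enforced. A rough sketch of what a per-limit flag could look like; the field name `passive` and its placement are assumptions, not Relay's actual config:

/// Assumed shape, for illustration only; the real option name may differ.
#[derive(Debug, serde::Deserialize)]
struct CardinalityLimitOptions {
    /// When `true`, the limiter tracks cardinality and reports overflows,
    /// but buckets over the limit are not actually dropped.
    #[serde(default)]
    passive: bool,
}

fn should_drop(limit_exceeded: bool, options: &CardinalityLimitOptions) -> bool {
    // In passive mode the overflow is only recorded; nothing is dropped.
    limit_exceeded && !options.passive
}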


Dav1dde commented Mar 12, 2024

Limiter is rolled out for SaaS with the exception of a few known orgs.


olksdr commented Mar 13, 2024

The epic is done and we are closing this.
The cardinality limiter is enabled for all but a few orgs, which will be adjusted later.

A few follow-ups will be created for enabling alerts and monitoring.

@olksdr olksdr closed this as completed Mar 13, 2024
xurui-c added a commit to getsentry/sentry that referenced this issue Mar 27, 2024
[Relay already applies cardinality limits](getsentry/relay#2717), so we can safely remove cardinality limiting from the indexer.
