
[EPIC] Move the metrics cardinality limiter to Relay #2717

Closed
30 tasks done
Tracked by #59893
jan-auer opened this issue Nov 13, 2023 · 10 comments

jan-auer commented Nov 13, 2023

The performance of metrics queries depends on the number of time series they have to aggregate. To put an upper bound on the number of time series - and thus the number of metric buckets with a unique combination of tag values - the metrics ingestion pipeline has a cardinality limiter.

It resides in the metrics indexer consumer, currently causing more than half of its CPU usage and introducing additional latency due to a high number of Redis calls. There are two reasons to move the cardinality limiter out of the indexer consumers:

  1. The limiter currently does not apply to Relay.
  2. The indexer consumers are a bottleneck and rely on low latency and high throughput.
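
To make "unique combination of tag values" concrete, a bucket's time-series identity can be thought of as a hash over the metric name and its full tag set. This is only an illustration of the concept, not Relay's actual hashing scheme:

use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Illustration only: every distinct (metric name, tag values) combination
/// is one time series, so hashing both identifies a unique series.
fn series_hash(mri: &str, tags: &BTreeMap<String, String>) -> u64 {
    let mut hasher = std::collections::hash_map::DefaultHasher::new();
    mri.hash(&mut hasher);
    // BTreeMap iterates in sorted key order, so tag order cannot change the hash.
    for (key, value) in tags {
        key.hash(&mut hasher);
        value.hash(&mut hasher);
    }
    hasher.finish()
}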

Stage 1

Stage 1 / Re-Implementation

  21 tracked sub-issues; all assigned to Dav1dde except one backend-scoped item assigned to jjbayer.

Validation

  • Metrics
    • Same metrics as in the indexer cardinality limiter
    • Observability when org limits are hit, via Sentry errors and metrics

Rollout

  • New memstore
  • Load Test (cardinality), adapt existing indexer load tests
  • Rollout in S4S
    • Use a really low sample rate, slowly ramp up the rate
    • Verify the indexer limiter is no longer hit
    • Turn off limiter in Indexer
  • Rollout in Production
    • Same as above, just more informed decisions

Stage 2

Replace the Redis set with an in-memory HLL or Bloom filter that is periodically merged back into Redis to synchronize all nodes.

Using either probabilistic data structure, in either Redis or Relay, remains an option. A Bloom filter could cut the number of bits stored per entry roughly in half (from 32 to ~15), but at the cost of lookup and write performance. Since our current bottleneck is more likely CPU than storage, this option is less interesting and valuable.
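
A minimal sketch of the Stage 2 idea, assuming a double-hashing Bloom filter; the structure name, parameters, and the merge-back-to-Redis step are illustrative only:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative sketch: a local Bloom filter whose bit array could
/// periodically be OR-merged back into Redis to synchronize nodes
/// (the merge step is omitted here).
struct BloomFilter {
    bits: Vec<u64>,
    num_hashes: u32,
}

impl BloomFilter {
    fn new(num_bits: usize, num_hashes: u32) -> Self {
        Self {
            bits: vec![0; (num_bits + 63) / 64],
            num_hashes,
        }
    }

    /// Returns `true` if the entry was (probably) not seen before,
    /// i.e. at least one of its bits was previously unset.
    fn insert(&mut self, entry: &impl Hash) -> bool {
        let num_bits = self.bits.len() * 64;
        // Double hashing: derive all probe positions from two base hashes.
        let h1 = hash_with_seed(entry, 0);
        let h2 = hash_with_seed(entry, 1);
        let mut newly_set = false;
        for i in 0..u64::from(self.num_hashes) {
            let bit = (h1.wrapping_add(i.wrapping_mul(h2)) as usize) % num_bits;
            let (word, mask) = (bit / 64, 1u64 << (bit % 64));
            if self.bits[word] & mask == 0 {
                self.bits[word] |= mask;
                newly_set = true;
            }
        }
        newly_set
    }
}

fn hash_with_seed(entry: &impl Hash, seed: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    entry.hash(&mut hasher);
    hasher.finish()
}

Because a Bloom filter only answers membership, it would be paired with a counter that is incremented whenever insert reports a new entry; that counter is what would be compared against the limit.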

@jan-auer jan-auer self-assigned this Nov 13, 2023
@jan-auer jan-auer changed the title Move the metrics cardinality limiter to Relay [EPIC] Move the metrics cardinality limiter to Relay Nov 13, 2023
@Dav1dde Dav1dde self-assigned this Nov 14, 2023

jjbayer commented Nov 20, 2023

Ideally we can implement a solution that supports both the existing cardinality limiting and the "product-aware", per-tag limiter for span metrics. A good starting point for a combined solution is trying to come up with a common config format; I had something like this in mind:

For the classic limiter:

{
    "scope": {
        "level": "organization",
        "use_case": "transactions"
    },
    "window": 3600,
    "limit": 80000,
    "keep": "first-seen", // keep recording tag combinations until limit is exceeded
    "drop_strategy": "bucket" // drop entire buckets when limit is exceeded
}

For a per-tag limiter (most complex version):

{
    "scope": {
        "level": "project",
        "use_case": "spans",
        "metric": "d:spans/exclusive_time_light@millisecond",
        // limit specific combination of tags
        "tags": [
            "span.group",
            "transaction"
        ]
    },
    "window": 3600,
    "limit": 1000,
    "keep": {
        "type": "top-k", // maintain list of 1000 slowest spans
        "weight": "span.exclusive_time" // same syntax as sampling rules
    },
    "drop_strategy": {
        // replace tag values for the defined tags with a static value
        "type": "tag",
        "replacement": "<< other >>"
    }
}
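
One way such a combined config could be represented in Relay, sketched with serde; the type and field names below are guesses based on the JSON above, not the format that eventually shipped:

use serde::Deserialize;

/// Hypothetical Rust representation of the proposed config format.
#[derive(Debug, Deserialize)]
struct CardinalityLimit {
    scope: LimitScope,
    /// Window length in seconds.
    window: u64,
    /// Maximum number of distinct entries per window.
    limit: u64,
    keep: KeepStrategy,
    drop_strategy: DropStrategy,
}

#[derive(Debug, Deserialize)]
struct LimitScope {
    level: String,          // "organization" or "project"
    use_case: String,       // "transactions", "spans", ...
    metric: Option<String>, // only set for per-metric limits
    #[serde(default)]
    tags: Vec<String>,      // only set for per-tag limits
}

#[derive(Debug, Deserialize)]
#[serde(untagged)]
enum KeepStrategy {
    /// e.g. "first-seen"
    Simple(String),
    /// e.g. { "type": "top-k", "weight": "span.exclusive_time" }
    Detailed { r#type: String, weight: String },
}

#[derive(Debug, Deserialize)]
#[serde(untagged)]
enum DropStrategy {
    /// e.g. "bucket"
    Simple(String),
    /// e.g. { "type": "tag", "replacement": "<< other >>" }
    Detailed { r#type: String, replacement: String },
}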


jjbayer commented Nov 22, 2023

For the span group cardinality limiter, we will need to apply the limiting in a different part of the code base, because we want to update the span payload itself when a limit has been reached (e.g. drop a tag). So a common config might not make sense; in fact, the only common ground might be the bare-bones cardinality tracker.

We could cleanly separate the cardinality limiting concern from metrics by providing a reusable type like this:

struct CardinalityLimiter {
    scope: String, // used as prefix for redis entries
    window: u64,   // seconds
    limit: usize,  // maximum number of entries per window
}

impl CardinalityLimiter {
    /// Returns `true` if the entry was successfully inserted.
    ///
    /// Returns `false` if the limit for the current time window has been exceeded.
    pub fn try_insert(&self, entry: impl std::hash::Hash) -> bool;
}
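
A minimal in-memory stand-in for how `try_insert` could behave, purely to illustrate the proposed semantics; the real limiter keeps its state in Redis and shares it across nodes, and this toy version needs `&mut self`:

use std::collections::HashSet;
use std::hash::{Hash, Hasher};
use std::time::{SystemTime, UNIX_EPOCH};

/// Illustrative in-memory variant of the proposed API; not Relay's implementation.
struct InMemoryCardinalityLimiter {
    scope: String,
    window: u64, // seconds
    limit: usize,
    current_window: u64,
    seen: HashSet<u64>,
}

impl InMemoryCardinalityLimiter {
    /// Returns `true` if the entry fits into the current window's limit.
    fn try_insert(&mut self, entry: impl Hash) -> bool {
        let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs();
        let window = now / self.window;
        if window != self.current_window {
            // New time window: forget everything seen in the previous one.
            self.current_window = window;
            self.seen.clear();
        }

        let mut hasher = std::collections::hash_map::DefaultHasher::new();
        self.scope.hash(&mut hasher);
        entry.hash(&mut hasher);
        let hash = hasher.finish();

        // Entries already seen in this window are always accepted; new entries
        // are accepted only while the limit has not been reached.
        self.seen.contains(&hash) || (self.seen.len() < self.limit && self.seen.insert(hash))
    }
}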


Dav1dde commented Jan 30, 2024

Enabled in SaaS for some Sentry Orgs now. Seeing some issues with Redis memory consumption. I'll investigate before continuing with the rollout.


Dav1dde commented Feb 7, 2024

We have a 50% Rollout rate on SaaS and it looks good. We might want to move the cardinality limits to a separate cluster which has less redundancy to reduce network traffic and increase performance. Otherwise we'll have to slightly increase available memory on the existing cluster before rolling out to 100%.


Dav1dde commented Feb 15, 2024

Rolled out at 100%, but not activated due to some product concerns on our side.

@jernejstrasner

@Dav1dde can you elaborate on the product concerns please?


Dav1dde commented Feb 15, 2024

We tried rolling it out for our own Sentry org twice now; in both cases it broke something in the performance product. This is not a bug: it simply shows that the cardinality limiter is culling excessive cardinality from the metrics. Unfortunately, this means the affected metrics become unreliable (alerts stop working or fire when they shouldn't).

With the limits we currently have configured, this also affects some (10-30) customer orgs, although a lot less severely than our own.

Ideally we would work on bringing down the cardinality first, then let the limiter enforce the limits and set up alerts for the product team once the limit is breached again in the future.

@jernejstrasner


Dav1dde commented Mar 7, 2024

Implemented passive mode on a per-limit basis. We should be able to enable passive mode for a few known orgs and then close our efforts here, as sketched below.
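
For context, passive mode means a limit is still evaluated and reported but not enforced. A rough sketch of what a per-limit flag could look like; the field name `passive` and its placement are assumptions, not Relay's actual config:

/// Assumed shape, for illustration only; the real option name may differ.
#[derive(Debug, serde::Deserialize)]
struct CardinalityLimitOptions {
    /// When `true`, the limiter tracks cardinality and reports overflows,
    /// but buckets over the limit are not actually dropped.
    #[serde(default)]
    passive: bool,
}

fn should_drop(limit_exceeded: bool, options: &CardinalityLimitOptions) -> bool {
    // In passive mode the overflow is only recorded; nothing is dropped.
    limit_exceeded && !options.passive
}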


Dav1dde commented Mar 12, 2024

Limiter is rolled out for SaaS with the exception of a few known orgs.


olksdr commented Mar 13, 2024

The epic is done and we are closing this.
The cardinality limiter is enabled for all but a few orgs, which will be adjusted later.

A few follow-ups will be created for enabling alerts and monitoring.

@olksdr olksdr closed this as completed Mar 13, 2024
xurui-c added a commit to getsentry/sentry that referenced this issue Mar 27, 2024
[Relay already applies cardinality limits](getsentry/relay#2717), so we can safely remove cardinality limiting from the indexer.
