[EPIC] Move the metrics cardinality limiter to Relay #2717
Ideally we can implement a solution that supports both the existing cardinality limiting and the "product-aware", per-tag limiter for span metrics. A good starting point for a combined solution is trying to come up with a common config format; I had something like this in mind.

For the classic limiter:

```json
{
  "scope": {
    "level": "organization",
    "use_case": "transactions"
  },
  "window": 3600,
  "limit": 80000,
  "keep": "first-seen", // keep recording tag combinations until limit is exceeded
  "drop_strategy": "bucket" // drop entire buckets when limit is exceeded
}
```

For a per-tag limiter (most complex version):

```json
{
  "scope": {
    "level": "project",
    "use_case": "spans",
    "metric": "d:spans/exclusive_time_light@millisecond",
    // limit specific combination of tags
    "tags": [
      "span.group",
      "transaction"
    ]
  },
  "window": 3600,
  "limit": 1000,
  "keep": {
    "type": "top-k", // maintain list of 1000 slowest spans
    "weight": "span.exclusive_time" // same syntax as sampling rules
  },
  "drop_strategy": {
    // replace tag values for the defined tags with a static value
    "type": "tag",
    "replacement": "<< other >>"
  }
}
```
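For illustration only, here is a sketch of how these two shapes could map onto a single Rust type via serde. Every name below is hypothetical, not an agreed-upon schema, and serde_json would also require the `//` comments to be stripped before parsing:

```rust
// Hypothetical types covering both limiter configs above; field and variant
// names are illustrative, not Relay's actual schema.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct CardinalityLimitConfig {
    scope: Scope,
    /// Time window in seconds.
    window: u64,
    /// Maximum number of distinct tag combinations per window.
    limit: u64,
    keep: KeepStrategy,
    drop_strategy: DropStrategy,
}

#[derive(Debug, Deserialize)]
struct Scope {
    level: String, // "organization" | "project"
    use_case: String,
    #[serde(default)]
    metric: Option<String>,
    #[serde(default)]
    tags: Vec<String>,
}

#[derive(Debug, Deserialize)]
#[serde(untagged)]
enum KeepStrategy {
    /// e.g. "first-seen"
    Simple(String),
    /// e.g. { "type": "top-k", "weight": "span.exclusive_time" }
    Weighted { r#type: String, weight: String },
}

#[derive(Debug, Deserialize)]
#[serde(untagged)]
enum DropStrategy {
    /// e.g. "bucket"
    Simple(String),
    /// e.g. { "type": "tag", "replacement": "<< other >>" }
    Tag { r#type: String, replacement: String },
}
```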
For the span group cardinality limiter, we will need to apply the limiting in a different part of the code base, because we want to update the span payload itself if a limit has been reached (e.g. drop a tag). So a common config might not make sense; in fact, the only common ground might be the bare-bones cardinality tracker. We could cleanly separate the cardinality limiting concern from metrics by providing a reusable type like this:

```rust
struct CardinalityLimiter {
    scope: String, // used as prefix for Redis entries
    window: u64,   // seconds
    limit: usize,  // maximum number of entries per window
}

impl CardinalityLimiter {
    /// Returns `true` if the entry was successfully inserted.
    ///
    /// Returns `false` if the limit for the current time window has been exceeded.
    pub fn try_insert(&self, entry: impl std::hash::Hash) -> bool;
}
```
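A minimal sketch of how such a reusable type could be called from aggregation code, assuming a hypothetical `Bucket` type; this is not Relay's actual call site:

```rust
use std::collections::BTreeMap;

// Hypothetical metric bucket; Relay's real bucket type looks different.
struct Bucket {
    metric_name: String,
    tags: BTreeMap<String, String>,
}

// Drop buckets whose tag combination no longer fits under the limit.
fn filter_buckets(limiter: &CardinalityLimiter, buckets: Vec<Bucket>) -> Vec<Bucket> {
    buckets
        .into_iter()
        // The hash input is the unique time series: metric name plus tags.
        .filter(|bucket| limiter.try_insert((&bucket.metric_name, &bucket.tags)))
        .collect()
}
```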
This only declares the feature. The feature handler will be added in a separate PR. ref: getsentry/relay#2717
Enabled in SaaS for some Sentry orgs now. Seeing some issues with Redis memory consumption. I'll investigate before continuing with the rollout.
We have a 50% rollout rate on SaaS and it looks good. We might want to move the cardinality limits to a separate cluster with less redundancy, to reduce network traffic and increase performance. Otherwise, we'll have to slightly increase the available memory on the existing cluster before rolling out to 100%.
Rolled out to 100%, but not activated due to some product concerns on our side.
@Dav1dde can you elaborate on the product concerns, please?
We tried rolling it out for our own Sentry org twice now; in both cases it broke something in the performance product. This is not a bug, it simply shows that the cardinality limiter is culling excessive cardinality from the metrics. Unfortunately, this means all metrics become unreliable (alerts stop working or fire when they shouldn't). With the limits we currently have configured, this also affects some (10 - 30) customer orgs, although much less severely than our own org. Ideally we would work on bringing down the cardinality first, then let the limiter enforce the limits, and set up alerts for the product team once the limit is breached again in the future.
Implemented passive mode on a per-limit basis. We should be able to enable passive mode for a few known orgs and then close our efforts here.
Limiter is rolled out for SaaS with the exception of a few known orgs. |
The epic is done and we are closing this. A few follow-ups will be created for enabling alerts and monitoring.
[Relay already applies cardinality limits](getsentry/relay#2717), so we can safely remove cardinality limiting from the indexer.
The performance of metrics queries depends on the number of time series they have to aggregate. To put an upper bound on the number of time series - and thus the number of metric buckets with a unique combination of tag values - the metrics ingestion pipeline has a cardinality limiter.
It resides in the metrics indexer consumer, currently causing more than half of its CPU usage and introducing additional latency due to a high number of Redis calls. There are two reasons to move the cardinality limiter out of the indexer consumers:
Stage 1
Stage 1 / Re-Implementation
Validation
Rollout
Stage 2
Replace the Redis set with an in-memory HLL and Bloom filter that is repeatedly merged back into Redis to synchronize all nodes. While using either probabilistic data structure in either Redis or Relay is still an option, it seems a Bloom filter can reduce the number of bits we store by roughly 50% (from 32 to ~15), while at the same time decreasing lookup and write performance. Our current bottleneck is more likely to be CPU than storage, which makes this option less interesting and valuable.
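For context, a minimal sketch of the Bloom filter idea discussed above: each node inserts entries into a local filter and periodically ORs its bits into a shared buffer, which in practice would live in Redis. The sizing constants and all names here are assumptions for illustration, not measured values or Relay's implementation:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const FILTER_BITS: usize = 1 << 20; // ~15 bits per entry at ~70k entries (assumed sizing)
const NUM_HASHES: u64 = 7;

struct LocalBloom {
    bits: Vec<u8>,
}

impl LocalBloom {
    fn new() -> Self {
        Self { bits: vec![0; FILTER_BITS / 8] }
    }

    /// Returns `true` if at least one bit was newly set, i.e. the entry was
    /// not seen before and counts toward the cardinality budget.
    fn insert(&mut self, entry: &impl Hash) -> bool {
        let mut new = false;
        for seed in 0..NUM_HASHES {
            let mut hasher = DefaultHasher::new();
            seed.hash(&mut hasher);
            entry.hash(&mut hasher);
            let bit = (hasher.finish() as usize) % FILTER_BITS;
            let (byte, mask) = (bit / 8, 1u8 << (bit % 8));
            if self.bits[byte] & mask == 0 {
                self.bits[byte] |= mask;
                new = true;
            }
        }
        new
    }

    /// Merge the local filter into a shared buffer with a byte-wise OR and
    /// pull the merged state back locally. In the scenario above, `shared`
    /// would be the bytes stored in a Redis key.
    fn sync(&mut self, shared: &mut [u8]) {
        for (s, l) in shared.iter_mut().zip(self.bits.iter_mut()) {
            *s |= *l;
            *l = *s;
        }
    }
}
```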