Performance regression in 1.82.0: high CPU usage for vmagent's graphite relabeling #3466
Comments
@artli , thanks for filing the bug report! It looks like your workload hits a worst-case scenario for the relabeling optimization introduced in the commit a18d6d5, which was first included in v1.82.0. In your case the number of unique metric names is quite big - the provided test generates 2M unique metric names. This completely breaks the optimization, since it caches results for up to 100K unique label values (or metric names) per relabeling rule. The workaround is to use graphite-style relabeling for Graphite metrics - it should work much faster than the traditional Prometheus-style relabeling. @artli , could you try using the Graphite-style relabeling and report whether it works faster for your case?
hint: you can debug VictoriaMetrics relabeling rules (both Prometheus-style and Graphite-style) at https://play.victoriametrics.com/select/accounting/1/6a716b0f-38bc-4856-90ce-448fd713e3fe/metric-relabel-debug . This UI will be included in the next release of VictoriaMetrics. See this feature request for details.
…ssed during the last 5 minutes from FastStringMatcher.Match(), FastStringTransformer.Transform() and InternString()

Previously only up to 100K results were cached. This could result in sub-optimal performance when more than 100K unique strings were actually used - for example, when a relabeling rule was applied to a million unique Graphite metric names, as in #3466. This commit should reduce the long-term CPU usage for #3466 after all the unique Graphite metric names are registered in the FastStringTransformer.Transform() cache.

It is expected that the number of unique strings passed to FastStringMatcher.Match(), FastStringTransformer.Transform() and InternString() during the last 5 minutes is limited, so the function results fit in memory; otherwise an OOM crash can occur. This should be the case for typical production workloads.
@artli , the commit 3b18931 should address the issue without the need to use Graphite-style relabeling rules (though it is highly recommended to switch to Graphite-style relabeling rules when working with Graphite metric names - this improves both the performance and the readability of the relabeling rules). The commit lifts the 100K-unique-strings limit per relabeling rule, so now the cache can remember regex results for all the input strings (e.g. unique Graphite metric names) and reduce CPU usage on subsequent executions of the relabeling rules for the same Graphite metric names. It is possible to test the enhancement by building vmagent from that commit.
Thanks a lot! I'll report back when we try it out, and later also try to switch most rules to Graphite relabeling rules (though last time I checked that wasn't possible for every rule in our case). Intuitively, though, it does seem like relabeling rules in general should have different caching behavior for the Graphite endpoint, because it's expected to have much greater cardinality in metric names. I suppose that's why you lifted the restriction, but I wonder whether this might result in memory issues. Hopefully not; we'll try to check it out soon.
Thanks! I tried out 1.85.1 and it does fix the CPU usage issue (and in fact saves some CPU). It does use a lot more memory, but that might be fine for our setup.

What's worrying me now is that on the busiest of our vmagents I now see periodic dips in ingestion followed by spikes from catching up. They happen with a frequency of about a minute, so it feels like the new cache cleanup mechanism might be slowing things down considerably because it linearly scans the whole cache. We have lots of cores and lots of spare CPU capacity though, so I'm not sure why it would affect things this much; maybe the linear scan wrecks the shared CPU caches? Or maybe this is not related to the scan at all; no idea. I currently don't have great visibility into this because our vmagents are scraped once a minute, which obviously prevents me from inspecting a process that misbehaves with a period of a minute (I noticed the dips from some more frequently collected metrics in an upstream process). I might be able to narrow this down further; do you have a hunch for what the best approach would be here (besides using more graphite relabeling rules, obviously)?

Thank you for the fix anyway, this looks much better now.
It would be great if you could collect CPU profile and memory profile from the vmagent. As for the graphite relabeling, it can be mixed with the Prometheus-style relabeling. For example, you can quickly extract the needed parts from Graphite metrics into some labels and then apply the usual relabeling to the extracted labels:

# extract job, instance and metric name from the Graphite metric
- action: graphite
  match: "*.*.*.total"
  labels:
    job: "$1"
    instance: "$2"
    __name__: "${3}_total"
# drop metrics with job="foo" label
- if: '{job="foo"}'
  action: drop
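If it helps, vmagent serves the standard Go net/http/pprof endpoints on its HTTP port (8429 by default), so the profiles can be collected roughly along these lines (the host and port are assumptions about your setup):

```shell
# 30-second CPU profile (assumes vmagent listens on localhost:8429)
curl -s http://localhost:8429/debug/pprof/profile > cpu.pprof
# heap (memory) profile
curl -s http://localhost:8429/debug/pprof/heap > mem.pprof
# inspect the collected profiles interactively
go tool pprof cpu.pprof
```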
Sure, thanks! I agree that the most critical issue has been fixed, so let me close this and open a different issue when I get more data.
… single goroutine out of many concurrently running goroutines Updates #3466
@artli , could you try the updated build?
vmagent-20230518-224855-4c1241d (vmtools 1.87.6): same problem, the CPU is 100% used with anything but vmagent metrics checking. Even node_exporter from two targets uses all the available CPU. I hoped so much to get rid of memory-hogging Prometheus...
Describe the bug
When we tried to upgrade our deployment from 1.81.2 to 1.84.0, we saw our vmagents' CPU consumption jump up considerably. This only happened to the vmagents that were relabeling graphite metrics being sent to their graphite TCP endpoints.
We pinpointed this regression to the update from 1.81.2 to 1.82.0 and condensed it to a simple reproducible example.
To Reproduce
1. Create a simple relabeling config (did not reproduce with .*).
2. Generate fake metrics with seq 1 2000000 | awk '{print $0, "0 0"}' (did not reproduce with a constant name).
3. Pump these metrics into a vmagent's graphite TCP endpoint.
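Under the assumption that vmagent was started with -graphiteListenAddr=:2003 on the local machine, the generation and ingestion steps above can be sketched as:

```shell
# Generate 2M unique fake metrics in Graphite plaintext format
# ("<name> <value> <timestamp>"); each line gets a unique metric name.
seq 1 2000000 | awk '{print $0, "0 0"}' > /tmp/fake_metrics.txt
head -n 1 /tmp/fake_metrics.txt
# Pump them into the vmagent's Graphite TCP endpoint and time the ingestion
# (port 2003 is an assumption; uncomment with a running vmagent):
#   time nc -q 1 localhost 2003 < /tmp/fake_metrics.txt
```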
Here's a full script to reproduce:
For me ingestion takes on the order of a second on 1.81.2 and more than five seconds on later versions.