Performance regression in 1.82.0: high CPU usage for vmagent's graphite relabeling #3466

Closed
artli opened this issue Dec 8, 2022 · 10 comments
Labels: performance, vmagent

Comments

@artli

artli commented Dec 8, 2022

Describe the bug
When we tried to upgrade our deployment from 1.81.2 to 1.84.0, we saw our vmagents' CPU consumption jump up considerably. This only happened to the vmagents that were relabeling graphite metrics being sent to their graphite TCP endpoints.

We pinpointed this regression to the update from 1.81.2 to 1.82.0 and condensed it to a simple reproducible example.

To Reproduce

Create a simple relabeling config of

- source_labels: [__name__]
  regex: '\w*'
  action: keep

(did not reproduce with .*)

Generate fake metrics with seq 1 2000000 | awk '{print $0, "0 0"}'
(did not reproduce with a constant name)

Pump these metrics into a vmagent's graphite TCP endpoint.

Here's a full script to reproduce:

#!/usr/bin/env bash

TEMPDIR=$(mktemp -d tmp.XXXXXXXXX | xargs realpath)
cd "${TEMPDIR}" || exit
trap "rm -r ${TEMPDIR}" EXIT

download () {
    package="$1"
    version="$2"
    url="https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/v${version}/${package}-linux-amd64-v${version}.tar.gz"
    curl --fail --location --remote-name "${url}"
}

VERSIONS=(1.81.2 1.82.0 1.84.0)

for version in "${VERSIONS[@]}"; do
    download "vmutils" "${version}"
done
download "victoria-metrics" "1.84.0"

for file in *.tar.gz; do
    tar -xzf "$file" --one-top-level
done

ingest () {
    version="$1"
    echo
    echo "${version}"
    rm -r victoria-metrics-data 2>/dev/null
    rm -r vmagent-remotewrite-data 2>/dev/null
    victoria-metrics-linux-amd64-v1.84.0/victoria-metrics-prod -httpListenAddr=:8428 2>/dev/null &
    vm_pid=$!
    vmutils-linux-amd64-v${version}/vmagent-prod -httpListenAddr=:8400 -graphiteListenAddr=:2000 -remoteWrite.url=http://127.0.0.1:8428/api/v1/write -remoteWrite.relabelConfig=relabel_config.yml 2>/dev/null &
    vmagent_pid=$!
    sleep 4
    time bash -c 'cat fake-metrics.txt | nc 127.0.0.1 2000'
    kill -9 "${vm_pid}"
    kill -9 "${vmagent_pid}"
    sleep 1
}

cat <<EOF >> relabel_config.yml
- source_labels: [__name__]
  regex: '\w*'
  action: keep
EOF

seq 1 2000000 | awk '{print $0, "0 0"}' > fake-metrics.txt

for version in "${VERSIONS[@]}"; do
    ingest "${version}"
done

For me, ingestion takes on the order of a second on 1.81.2 and more than five seconds on later versions.

@valyala added the bug label Dec 10, 2022
@valyala
Collaborator

valyala commented Dec 10, 2022

@artli, thanks for filing the bug report! It looks like your workload hits the worst-case scenario for the relabeling optimization introduced in commit a18d6d5, which was included in v1.82.0. This optimization caches the computed result of a regex-based relabeling rule per input value, so the cached result is returned instead of re-evaluating the expensive regex. The optimization works best when the number of unique label values (or unique metric names) used in the relabeling rules is relatively low (e.g. fewer than 100K per label), which is the common case in practice. For example, the total number of unique metric names exported by frequently used Prometheus exporters rarely exceeds a few thousand. The same applies to commonly used labels such as job, instance and pod.

In your case the number of unique metric names is quite big: the provided test generates 2M unique metric names. This completely defeats the optimization, since it caches results for at most 100K unique label values (or metric names) per relabeling rule.

The workaround is to use graphite-style relabeling for Graphite metrics - it should work much faster than traditional Prometheus-style relabeling. @artli, could you try the Graphite-style relabeling and report whether it works faster for your case in v1.84.0?
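
For illustration, here is a minimal Go sketch of the idea behind such a size-capped regex-result cache (hypothetical code, not the actual VictoriaMetrics implementation). Once the cap is reached, every previously unseen metric name falls back to the full regex evaluation, which is what happens with 2M unique Graphite names:

package main

import (
	"fmt"
	"regexp"
	"sync"
)

// cappedRegexCache memoizes regex match results per input string,
// up to a fixed number of entries. This is only a sketch of the idea
// described above, not the VictoriaMetrics code.
type cappedRegexCache struct {
	mu       sync.Mutex
	re       *regexp.Regexp
	results  map[string]bool
	maxItems int
}

func newCappedRegexCache(expr string, maxItems int) *cappedRegexCache {
	return &cappedRegexCache{
		re:       regexp.MustCompile(expr),
		results:  make(map[string]bool),
		maxItems: maxItems,
	}
}

// Match returns the cached result when available; otherwise it runs the
// regex and stores the result while there is spare capacity. Once the cap
// is reached, every previously unseen string pays the full regex cost.
func (c *cappedRegexCache) Match(s string) bool {
	c.mu.Lock()
	ok, found := c.results[s]
	c.mu.Unlock()
	if found {
		return ok
	}
	ok = c.re.MatchString(s)
	c.mu.Lock()
	if len(c.results) < c.maxItems {
		c.results[s] = ok
	}
	c.mu.Unlock()
	return ok
}

func main() {
	c := newCappedRegexCache(`\w*`, 100_000)
	// 2M unique metric names, as in the reproduction script: only the
	// first 100K results fit the cache, the rest always hit the regex.
	for i := 0; i < 2_000_000; i++ {
		c.Match(fmt.Sprintf("metric_%d", i))
	}
	fmt.Println("cached entries:", len(c.results))
}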

@valyala
Collaborator

valyala commented Dec 10, 2022

Hint: you can debug VictoriaMetrics relabeling rules (both Prometheus-style and Graphite-style) at https://play.victoriametrics.com/select/accounting/1/6a716b0f-38bc-4856-90ce-448fd713e3fe/metric-relabel-debug . This UI will be included in the next VictoriaMetrics release. See this feature request for details.

@valyala added the performance label and removed the bug label Dec 11, 2022
valyala added a commit that referenced this issue Dec 12, 2022
…ssed during the last 5 minutes from FastStringMatcher.Match(), FastStringTransformer.Transform() and InternString()

Previously only up to 100K results were cached.
This could result in sub-optimal performance when more than 100K unique strings were actually used.
For example, when the relabeling rule was applied to a million unique Graphite metric names,
as in #3466.

This commit should reduce the long-term CPU usage for #3466
after all the unique Graphite metric names are registered in the FastStringMatcher cache.

It is expected that the number of unique strings passed to FastStringMatcher.Match(),
FastStringTransformer.Transform() and InternString() during the last 5 minutes
is limited, so the cached results fit in memory. Otherwise an OOM crash can occur.
This should be the case for typical production workloads.
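
As a rough illustration of the behavior described in this commit message, here is a hypothetical Go sketch of a "keep results seen during the last N minutes" cache built from two rotating generations. It is an assumption about the general shape of such a cache, not the actual FastStringMatcher/FastStringTransformer code:

package main

import (
	"fmt"
	"sync"
	"time"
)

// expiringCache keeps results that were looked up recently and drops the
// rest on a periodic rotation, instead of enforcing a fixed entry count.
type expiringCache struct {
	mu      sync.Mutex
	curr    map[string]bool // entries touched since the last rotation
	prev    map[string]bool // entries from the previous window
	compute func(s string) bool
}

func newExpiringCache(compute func(string) bool, window time.Duration) *expiringCache {
	c := &expiringCache{
		curr:    make(map[string]bool),
		prev:    make(map[string]bool),
		compute: compute,
	}
	go func() {
		for range time.Tick(window) {
			c.mu.Lock()
			// Entries not touched during the last window fall out of the cache.
			c.prev = c.curr
			c.curr = make(map[string]bool)
			c.mu.Unlock()
		}
	}()
	return c
}

func (c *expiringCache) Get(s string) bool {
	c.mu.Lock()
	if v, ok := c.curr[s]; ok {
		c.mu.Unlock()
		return v
	}
	if v, ok := c.prev[s]; ok {
		// Promote the entry so it survives the next rotation.
		c.curr[s] = v
		c.mu.Unlock()
		return v
	}
	c.mu.Unlock()
	v := c.compute(s) // the expensive call, e.g. a regex match
	c.mu.Lock()
	c.curr[s] = v
	c.mu.Unlock()
	return v
}

func main() {
	calls := 0
	c := newExpiringCache(func(s string) bool { calls++; return len(s) > 0 }, 5*time.Minute)
	for i := 0; i < 3; i++ {
		c.Get("foo.bar.total")
	}
	fmt.Println("compute calls:", calls) // 1: repeated lookups are served from the cache
}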
@valyala
Collaborator

valyala commented Dec 12, 2022

@artli, the commit 3b18931 should address the issue without the need for Graphite-style relabeling rules (though switching to Graphite-style relabeling rules is highly recommended when working with Graphite metric names - it improves both the performance and the readability of the relabeling rules).

The commit lifts the limit of 100K unique strings per relabeling rule, so the cache can now remember regex results for all the input strings (e.g. unique Graphite metric names) and reduce CPU usage on subsequent executions of the relabeling rules for the same metric names. You can test the enhancement by building vmagent from the commit 3b18931 according to these docs.

@artli
Author

artli commented Dec 13, 2022

Thanks a lot! I'll report back when we try it out, and later we'll also try to switch most rules to Graphite relabeling rules (though last time I checked, that wasn't possible for every rule in our case).

Intuitively, though, it does seem like relabeling rules should have different caching behavior for the Graphite endpoint, because it's expected to see much greater cardinality in metric names. I suppose that's why you lifted the restriction, but I wonder whether this might result in memory issues. Hopefully not; we'll try to check it out soon.

@valyala
Collaborator

valyala commented Dec 14, 2022

FYI, VictoriaMetrics and vmagent v1.85.1 contain the improvement mentioned here, so it is possible to test it by running VictoriaMetrics components from v1.85.1.

@artli
Author

artli commented Dec 15, 2022

Thanks! I tried out 1.85.1 and it does fix the CPU usage issue (and in fact saves some CPU). It does use a lot more memory, but that might be fine for our setup.

What's worrying me now is that on the busiest vmagents I now see periodic dips in ingestion followed by spikes from catching up. They happen with a period of a minute, so it feels like the new cache cleanup mechanism might be slowing things down considerably because it linearly scans the whole cache. We have lots of cores and lots of spare CPU capacity, though, so I'm not sure why it would affect things this much; maybe the linear scan wrecks the shared CPU caches? Or maybe this is not related to the scan at all; no idea.

I currently don't have great visibility into this because our vmagents are scraped once a minute, which obviously prevents me from observing a phenomenon with a one-minute period (I noticed the dips from more frequent metrics emitted by an upstream process). I might be able to narrow this down more; do you have a hunch for what the best approach would be here (besides using more graphite relabeling rules, obviously)?

Thank you for the fix anyway, this looks much better now.

@valyala
Collaborator

valyala commented Dec 20, 2022

We have lots of cores and lots of spare CPU capacity, though, so I'm not sure why it would affect things this much; maybe the linear scan wrecks the shared CPU caches? Or maybe this is not related to the scan at all; no idea.

It would be great if you could collect CPU and memory profiles from vmagent when it works as expected and again when it slows down, and attach them to a new issue linked to this one (since the conversation in this one is getting too long). See how to collect CPU and memory profiles from vmagent.

As for the graphite relabeling, it can be mixed with the Prometheus-style relabeling. For example, you can quickly extract the needed parts from Graphite metrics into some labels and then apply the usual relabeling to the extracted labels:

# extract job, instance and metric name from Graphite metric
- action: graphite
  match: "*.*.*.total"
  labels:
    job: "$1"
    instance: "$2"
    __name__: "${3}_total"

# drop metrics with job="foo" label.
- if: '{job="foo"}'
  action: drop

@artli
Author

artli commented Dec 21, 2022

Sure, thanks! I agree that the most critical issue has been fixed, so let me close this and open a different issue when I get more data.

@artli closed this as completed Dec 21, 2022
valyala added a commit that referenced this issue Dec 21, 2022
… single goroutine out of many concurrently running goroutines

Updates #3466
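
The truncated commit title suggests that some periodic work is now performed by a single goroutine rather than by every concurrent caller. A common Go pattern for that is to race on an atomic deadline with compare-and-swap so only the winner performs the cleanup scan; the sketch below illustrates the general pattern under that assumption and is not the actual change:

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// cache is a hypothetical string cache whose periodic cleanup must be done
// by exactly one goroutine out of many concurrently running ones.
type cache struct {
	mu              sync.Mutex
	m               map[string]bool
	nextCleanupUnix int64 // accessed atomically
}

func (c *cache) maybeCleanup(period time.Duration) {
	now := time.Now().Unix()
	deadline := atomic.LoadInt64(&c.nextCleanupUnix)
	if now < deadline {
		return // not yet time to clean up
	}
	// Only the goroutine that wins the CAS performs the (linear) scan;
	// the rest return immediately and keep serving lookups.
	if !atomic.CompareAndSwapInt64(&c.nextCleanupUnix, deadline, now+int64(period.Seconds())) {
		return
	}
	c.mu.Lock()
	c.m = make(map[string]bool) // or: drop only the stale entries
	c.mu.Unlock()
}

func main() {
	c := &cache{m: make(map[string]bool)}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.maybeCleanup(5 * time.Minute)
		}()
	}
	wg.Wait()
	fmt.Println("entries after cleanup:", len(c.m))
}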
@valyala
Collaborator

valyala commented Dec 21, 2022

@artli, could you try vmagent built from the commit 3300546 and verify whether it improves performance during the periodic slowdowns every minute? You can use either the Docker image published at docker.io/victoriametrics/vmagent:heads-public-single-node-0-g3300546ea or the binary attached to this message - it contains vmagent built from the commit above. Alternatively, you can build vmagent from this commit yourself by following these steps.

vmagent-prod.tar.gz

@tarkhil

tarkhil commented Jun 3, 2023

vmagent-20230518-224855-4c1241d (vmtools 1.87.6): same problem, the CPU is 100% used when scraping anything other than vmagent's own metrics. Even node_exporter from two targets uses all available CPU. I had so hoped to get rid of memory-hogging Prometheus...
