
vmagent: memory overlimit in porto #825

Closed
wf1nder opened this issue Oct 11, 2020 · 34 comments
Labels: bug (Something isn't working) · enhancement (New feature or request) · vmagent

wf1nder commented Oct 11, 2020

Hi.

We are trying to migrate vmagent from a bare-metal server to a porto container: https://github.com/yandex/porto/

On bare metal we have 128 GB of RAM and run vmagent with the -memory.allowedPercent=60 option. It normally grows to about 32 GB within a few days of starting, and consumes about 15 GB after the first 30 minutes.

In porto we give the container 64 GB of RAM and run vmagent with exactly the same options, but less than a minute after start it exhausts all the RAM and gets OOM-killed. For some reason vmagent tries to allocate far too much memory. I also tried running vmagent with -memory.allowedBytes=12884901888 (12 GB), and it had no effect: the same OOM a few seconds after start.

In both cases the latest version of vmagent (1.43.0) is used. All options are the same:

-enableTCP6 \
-promscrape.config=/etc/scrape.yml \
-remoteWrite.url=<url_1> \
-remoteWrite.url=<url_2> \
-remoteWrite.url=<url_3> \
-remoteWrite.url=<url_4> \
-remoteWrite.label=prom_cluster=production \
-promscrape.config.strictParse \
-remoteWrite.maxDiskUsagePerURL=1000000000 \
-memory.allowedPercent=60 \
-promscrape.consulSDCheckInterval=60s \
-promscrape.discovery.concurrency=30 \
-remoteWrite.tmpDataPath=/tmp/vmagent-remotewrite-data \
-remoteWrite.showURL=true \
-promscrape.suppressScrapeErrors=true \
-remoteWrite.queues=3

I collected a memory profile of vmagent in porto right before the OOM, when it reaches about 64 GB of RAM, and sent it to info@victoriametrics.com.
Could you please take a look at what the problem might be?
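
For reference, this is roughly how the profile was collected (a sketch; it assumes the standard Go pprof endpoint that vmagent exposes on its HTTP port, 8429 here):

# grab a heap profile from the running vmagent right before the OOM
curl -s http://localhost:8429/debug/pprof/heap > vmagent-mem.pprof
# optionally inspect it locally with the Go toolchain
go tool pprof -top vmagent-mem.pprof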

f41gh7 added the "bug" and "vmagent" labels on Oct 12, 2020
valyala (Collaborator) commented Oct 17, 2020

The provided memory profile shows that the majority of memory is consumed while parsing Prometheus data obtained from scrape targets. It is likely that vmagent scrapes targets with millions of exposed metrics per target and a big number of labels. This case isn't well optimized yet, since vmagent reads the whole scraped response into memory and then parses it in one go instead of using streaming. The workaround is to push data to vmagent via any supported data ingestion protocol, for instance the Prometheus exposition format. These data ingestion protocols parse data in batches (i.e. in a streaming manner), so vmagent can accept unlimited amounts of data per single request/connection without significant memory usage.

Could you verify that vmagent properly detects the available CPU cores in porto? The number of detected CPU cores is exported as the go_gomaxprocs metric on the http://vmagent:8429/metrics page. vmagent can use more memory if it improperly detects the available CPU cores, since it keeps per-CPU data structures when parsing Prometheus data from scrape targets.
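
A quick way to check this (a sketch; assumes vmagent listens on the default port 8429):

# the detected CPU core count is exported as the go_gomaxprocs metric
curl -s http://vmagent:8429/metrics | grep '^go_gomaxprocs'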

valyala added the "enhancement" label on Oct 17, 2020
wf1nder (Author) commented Oct 19, 2020

Thanks for the answer.

First of all I wanted to understand why vmagent consumes a different amount of memory in different environments. Your assumption about GOMAXPROCS definitely makes sense.
On bare metal we have 32 virtual cores, and vmagent detects this value correctly. In the porto cloud the process sees all the host machine's cores; we can only hard-limit CPU consumption by throttling, specified in virtual megahertz. So in porto the process sees 104 virtual cores in our case.

I set the GOMAXPROCS=8 environment variable, and memory consumption decreased significantly; for some time (hours) it was even lower than on bare metal. But consumption keeps increasing over time, despite the -memory.allowedBytes=12884901888 option being set:

[screenshot: memory usage graph]

I will continue to monitor consumption over time. If it doesn't get OOM-killed, we can assume our problem is solved.
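
For reference, a sketch of how GOMAXPROCS was pinned at startup (the binary path is illustrative; the remaining flags are the ones listed in the first comment):

# cap Go's scheduler at 8 OS threads so per-CPU parse buffers stay small;
# the rest of the flags are the same as in the original command line
GOMAXPROCS=8 /usr/bin/vmagent \
  -memory.allowedBytes=12884901888 \
  -promscrape.config=/etc/scrape.yml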

wf1nder (Author) commented Oct 21, 2020

It got OOM-killed :(

[screenshot: memory usage graph]

It looks like a memory leak, because memory consumption grows stepwise over time. This is also noticeable on bare-metal servers (the arrow marks the moment of the restart):

[screenshot: memory usage graph]

zbindenren commented:
I am not sure if we have the same issue. We have vmagent on k8s with a 4GB memory limit. We use the following (memory) option:

-memory.allowedBytes=50MiB

We have around 1000 scrape targets and the agent gets OOM-killed regularly:

[screenshot: memory usage graph]

Is this a memory leak or do I have to increase the memory limit?
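
For reference, the numbers vmagent reports about itself can be compared against the container limit (a sketch; assumes the default port 8429 and the process metrics vmagent normally exports):

# compare the k8s memory limit with what vmagent itself reports
curl -s http://vmagent:8429/metrics | grep -E '^(process_resident_memory_bytes|go_memstats_sys_bytes)'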

valyala (Collaborator) commented Nov 2, 2020

@wf1nder , could you try building vmagent from the commit e277c3d and running it with the -promscrape.streamParse command-line option as described in the troubleshooting section? The stream parse mode should significantly reduce vmagent's memory requirements when scraping targets with millions of metrics.

@zbindenren , which version of vmagent do you run? Could you upgrade to the latest available version and share a memory profile when it reaches the 4GB limit? See these docs on how to collect a memory profile.
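
For reference, a rough sketch of building vmagent from a specific commit and enabling the flag (it assumes the usual make vmagent target from the build instructions; flags other than the new one are unchanged):

git clone https://github.com/VictoriaMetrics/VictoriaMetrics
cd VictoriaMetrics
git checkout e277c3d
make vmagent                      # builds bin/vmagent
bin/vmagent -promscrape.streamParse \
  -promscrape.config=/etc/scrape.yml
# ...plus the rest of the existing command-line flags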

valyala added a commit that referenced this issue Nov 2, 2020
valyala added a commit that referenced this issue Nov 2, 2020
valyala (Collaborator) commented Nov 2, 2020

FYI, stream parsing mode has been included in vmagent starting from v1.45.0.

zbindenren commented Nov 4, 2020

> @wf1nder , could you try building vmagent from the commit e277c3d and running it with the -promscrape.streamParse command-line option as described in the troubleshooting section? The stream parse mode should significantly reduce vmagent's memory requirements when scraping targets with millions of metrics.
>
> @zbindenren , which version of vmagent do you run? Could you upgrade to the latest available version and share a memory profile when it reaches the 4GB limit? See these docs on how to collect a memory profile.

@valyala here is the requested memory profile at 3.7 GB memory usage. The vmagent version is 1.45.0. Number of targets: 960.
mem.pprof.gz

valyala (Collaborator) commented Nov 4, 2020

@zbindenren , thanks for the provided memory profile! The increased memory usage in your case could be related to #878 . Could you build vmagent from commit b2042a1 and verify whether it uses a lower amount of memory compared to vmagent v1.45.0? See the build instructions for vmagent.

zbindenren commented:
[screenshot: memory usage graph]

It is better with master (blue graph), but still increasing.

zbindenren commented Nov 6, 2020

@valyala I think -promscrape.streamParse killed our VictoriaMetrics and Thanos cluster today. As soon as I enabled the option we got a churn rate of 30k. The following screenshot shows the metric names in the dashboard:

[screenshot: metric names in the dashboard]

Here is the churn rate:
[screenshot: churn rate]

Slow queries increased:
[screenshot: slow queries]

Slow inserts:
[screenshot: slow inserts]

[additional screenshots]

As soon as I disabled the flag, everything went back to normal.

wf1nder (Author) commented Nov 6, 2020

I can confirm that enabling -promscrape.streamParse creates extra load on VictoriaMetrics: the "Active Time Series (hour metric ids)" metric instantly doubled in our case. Disabling stream parsing on the same vmagent version returns the load to its previous state.
So we have stopped testing this option for now.

shuttie commented Nov 6, 2020

With streamParse I've noticed that the metric labels emitted by vmagent are sometimes incorrectly formatted. An example looks something like lab=\"el value\" instead of label=value. This generated a ton of unique label names and killed VictoriaMetrics due to too high data cardinality.

shuttie commented Nov 6, 2020

An example heap profile (for the latest master at b2042a1):
heapx9.bin.gz

It looks like the main leak is somewhere in promrelabel.setLabelValue, but I was not able to reliably understand where such a tiny function could leak.
[screenshot: memory profile]

shuttie commented Nov 7, 2020

Yet another heap dump after a night of struggle:
heapz4.bin.gz

[screenshot: memory profile]

valyala added a commit that referenced this issue Nov 7, 2020
Previously `-promscrape.streamParse` mode could result in garbage labels for the scraped metrics because of data race.
See #825 (comment)
valyala (Collaborator) commented Nov 7, 2020

@shuttie , @wf1nder and @zbindenren , thanks for providing useful information about the -promscrape.streamParse issue. This information helped identify and fix the bug, which could result in garbage labels for the scraped metrics. The bugfix is available in commit 188325f . Could you build vmagent from this commit and verify whether the issue is gone? See the build instructions for vmagent.

valyala added a commit that referenced this issue Nov 7, 2020
Previously `-promscrape.streamParse` mode could result in garbage labels for the scraped metrics because of data race.
See #825 (comment)
valyala added a commit that referenced this issue Nov 7, 2020
…hen discovering big number of scrape targets by using string concatenation instead of fmt.Sprintf

Updates #825
valyala added a commit that referenced this issue Nov 7, 2020
…hen discovering big number of scrape targets by using string concatenation instead of fmt.Sprintf

Updates #825
valyala (Collaborator) commented Nov 7, 2020

@shuttie , the provided memory profiles show that the majority of memory is used for per-scrape-target labels after relabeling is applied. Could you share the contents of the http://vmagent:8429/api/v1/targets page? How many targets are scraped in your setup, and how many labels per scrape target are present on this page?
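
A rough way to pull those numbers from the targets endpoint (a sketch; it assumes the Prometheus-compatible JSON layout with data.activeTargets and that jq is available):

curl -s http://vmagent:8429/api/v1/targets > targets.json
jq '.data.activeTargets | length' targets.json                  # number of scraped targets
jq '[.data.activeTargets[].labels | length] | add' targets.json # total label count across all targets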

valyala added a commit that referenced this issue Nov 7, 2020
…s after applying per-target relabeling

This should reduce memory usage when per-target relabeling creates big number of temporary labels
with long names and/or values.

See #825
valyala added a commit that referenced this issue Nov 7, 2020
…s after applying per-target relabeling

This should reduce memory usage when per-target relabeling creates big number of temporary labels
with long names and/or values.

See #825
valyala (Collaborator) commented Nov 7, 2020

@shuttie , could you build vmagent from the commit 83df20b and verify whether it uses less memory for your workload?

valyala (Collaborator) commented Nov 7, 2020

FYI, all the commits mentioned above have been included in the v1.46.0 release, so the bugfixes can be tested in this release without building vmagent from source.

shuttie commented Nov 8, 2020

Switching to 1.46 seems to be almost the same as before:
[screenshot: memory usage graph]

heap1.bin.gz

The main suspect is still the same:
[screenshot: memory profile]

We have approximately 1700 active targets in the /api/v1/targets response (which seems correct), and almost no dropped targets there. But there is quite a high level of target churn by design: each target's lifetime is 1-6 hours, so pods constantly come and go.

The relabel config looks like this:

 - job_name: 'whatever_prod'
   kubernetes_sd_configs:
     - role: 'pod'
       namespaces:
         names:
         - whatever-prod
       selectors:
       - role: pod
         label: app=whatever
   relabel_configs:
     - source_labels: ['__meta_kubernetes_namespace', '__meta_kubernetes_pod_label_app', '__meta_kubernetes_pod_container_port_number']
       action: keep
       regex: 'whatever-prod;whatever;50000'
     - source_labels: ['__meta_kubernetes_pod_label_host']
       regex: '(.+)'
       action: replace
       target_label: host
     - source_labels: ['__meta_kubernetes_pod_label_merchant']
       regex: '([0-9]+)'
       action: replace
       target_label: merchant
     - source_labels: ['__meta_kubernetes_pod_label_placement']
       regex: '(.+)'
       action: replace
       target_label: placement
     - source_labels: ['__meta_kubernetes_pod_node_name']
       regex: '(.+)'
       action: replace
       target_label: node
     - source_labels: ['__meta_kubernetes_pod_name']
       regex: '(.+)'
       action: replace
       target_label: instance

I'm also going to enable streamParse during the night to see how it will perform.

valyala added a commit that referenced this issue Nov 9, 2020
…els by making a copy of actually used labels

Updates #825
valyala added a commit that referenced this issue Nov 9, 2020
…els by making a copy of actually used labels

Updates #825
valyala (Collaborator) commented Nov 9, 2020

@shuttie , could you build vmagent from 6c24c5c and verify whether it reduces memory usage under your workload?

shuttie commented Nov 9, 2020

For the last 12 hours with streamParse enabled on 1.46.0, the memory usage pattern seems different and a bit better:
[screenshot: memory usage graph]

Heap dump:
heap2.bin.gz

setLabelValue seems to be the main abuser:
[screenshot: memory profile]

I'm going to deploy the commit 6c24c5c now and will report in a couple of hours.

shuttie commented Nov 9, 2020

With a combo of 6c24c5c and streamParse the situation seems to be much more stable:
[screenshot: memory usage graph]

The load and config are still the same.

The setLabelValue leak seems to be gone:
[screenshot: memory profile]

But it seems to be leaking a bit somewhere in appendScrapeWork. Heap dump:
heap4.bin.gz

valyala (Collaborator) commented Nov 9, 2020

@shuttie , thanks for the update! The provided memory profile shows that 580MB is spent on storing the original labels for each scrape target before relabeling is applied. These labels are shown on the /api/v1/targets page in the "discoveredLabels" lists and in the "droppedTargets" list. Additionally, the original labels are shown in duplicate scrape target error messages. These labels are intended for debugging improperly configured per-target relabeling. I think we can add a special command-line option such as -promscrape.dropOriginalLabels, which would drop the original labels from all the places mentioned above. This option could reduce memory usage when scraping a big number of targets with a big number of original labels, at the cost of reduced debuggability for improperly configured per-target relabeling.

valyala added a commit that referenced this issue Nov 9, 2020
…g for reducing memory usage when discovering big number of scrape targets

Updates #878
Updates #825
valyala added a commit that referenced this issue Nov 9, 2020
…g for reducing memory usage when discovering big number of scrape targets

Updates #878
Updates #825
valyala (Collaborator) commented Nov 9, 2020

@shuttie , could you build vmagent from the commit bcd1393 and run it with the -promscrape.dropOriginalLabels command-line flag? This should reduce memory usage for your workload.
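
A sketch of the intended usage, continuing from the source checkout shown earlier (the flag name is the one introduced in that commit; all other flags stay unchanged):

git checkout bcd1393
make vmagent
bin/vmagent -promscrape.dropOriginalLabels \
  -promscrape.config=/etc/scrape.yml
# ...plus the rest of the existing command-line flags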

valyala added a commit that referenced this issue Nov 10, 2020
…k slice instead of referring to an item in this slice

This should prevent from holding previously discovered []ScrapeWork slices when a part of discovered targets changes over time.
This should reduce memory usage for the case when big number of discovered scrape targets changes over time.

Updates #825
valyala (Collaborator) commented Nov 10, 2020

@shuttie , it looks like the issue that resulted in gradual memory usage growth for your workload has been detected and fixed in commit e205975 . Could you try vmagent built from this commit?

shuttie commented Nov 10, 2020

After the bcd1393 commit memory usage seems to be stable, but it has been running for quite a short time to draw any final conclusions:
[screenshot: memory usage graph]
But anyway, thank you for your help; the issue seems to be resolved, hopefully for good!

As an experiment I will deploy the e205975 commit today to see the difference.

wf1nder (Author) commented Nov 12, 2020

It looks like the garbage metrics issue with stream parsing enabled is not completely fixed.
Today I upgraded vmagent to the latest version, v1.46.0, and after enabling stream parsing the active time series count increased (though not as dramatically as before) and slowly continues to grow:

[screenshot: active time series graph]

shuttie commented Nov 13, 2020

e205975 seems to behave almost the same as bcd1393: memory usage is stable and quite low:
[screenshot: memory usage graph]

So thanks again for the fix!

valyala (Collaborator) commented Nov 13, 2020

> It looks like the garbage metrics issue with stream parsing enabled is not completely fixed.
> Today I upgraded vmagent to the latest version, v1.46.0, and after enabling stream parsing the active time series count increased (though not as dramatically as before) and slowly continues to grow

@wf1nder , could you try finding the metrics with unexpected labels in the data collected by vmagent v1.46.0?

valyala (Collaborator) commented Nov 13, 2020

@shuttie , thanks for the update!

sti-jans pushed a commit to sti-jans/VictoriaMetrics that referenced this issue Nov 13, 2020
…k slice instead of referring to an item in this slice

This should prevent from holding previously discovered []ScrapeWork slices when a part of discovered targets changes over time.
This should reduce memory usage for the case when big number of discovered scrape targets changes over time.

Updates VictoriaMetrics#825
wf1nder (Author) commented Nov 15, 2020

@valyala Honestly, I couldn't find a way to locate those metrics. I tried searching manually, but we have too many metrics to review by eye.
Is there a way to find the most recently created metrics in VictoriaMetrics?

wf1nder (Author) commented Nov 15, 2020

On a blank vmstorage I exported the metric name list via the /api/v1/label/__name__/values query before and after enabling stream parsing on vmagent, and compared the results. The metric name lists are identical and contain no garbage names, so it looks like the difference is in the labels.
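
For reference, the comparison was roughly this (a sketch; the host, port and URL prefix depend on whether it is a single-node or cluster setup):

# dump the metric name list before enabling stream parsing
curl -s 'http://victoria-metrics:8428/api/v1/label/__name__/values' | jq -r '.data[]' | sort > names_before.txt
# enable -promscrape.streamParse on vmagent, wait a while, then dump again
curl -s 'http://victoria-metrics:8428/api/v1/label/__name__/values' | jq -r '.data[]' | sort > names_after.txt
diff names_before.txt names_after.txt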

wf1nder (Author) commented Nov 15, 2020

After a while, metrics with bad names appeared too. For some reason labels show up as metric names, for example:

machine_dc=\dc1\
machine_group=\group1\}
\nodejs_common_name\

All of these must be labels of some time series, but they ended up as time series names.
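
Such broken names can be spotted in the exported name list by filtering for characters that should never appear in a metric name (a sketch; the same endpoint caveats as above apply):

# metric names containing backslashes, quotes, '=' or braces are almost certainly garbage
curl -s 'http://victoria-metrics:8428/api/v1/label/__name__/values' | jq -r '.data[]' | grep -E '[\\"={}]'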

valyala (Collaborator) commented Nov 16, 2020

FYI, the commits that limit vmagent memory usage have been included in v1.47.0. Closing this issue as fixed.

@wf1nder, could you create a separate issue related to bad metric names / labels?

valyala closed this as completed on Nov 16, 2020
valyala added a commit that referenced this issue Dec 8, 2020
…ed from `get*ScrapeWork()`

This should prevent from possible 'memory leaks' when a pointer to ScrapeWork item stored in the slice
could prevent from releasing memory occupied by all the ScrapeWork items stored in the slice when they
are no longer used.

See the related commit e205975 and the related issue #825
valyala added a commit that referenced this issue Dec 8, 2020
…ed from `get*ScrapeWork()`

This should prevent from possible 'memory leaks' when a pointer to ScrapeWork item stored in the slice
could prevent from releasing memory occupied by all the ScrapeWork items stored in the slice when they
are no longer used.

See the related commit e205975 and the related issue #825