
Reduce memory usage for prometheus plugins #568

Closed
wants to merge 24 commits into from

Conversation

khanhntd
Contributor

@khanhntd khanhntd commented Aug 19, 2022

Description of the issue

After introducing saving the metric name before relabeling, as well as saving instance and job before relabeling, there has been an increase in memory consumption and CPU usage, since the change adds extra relabel_configs and metric_relabel_configs (roughly 200MB for the normal case when saving job and instance before relabel, and roughly 500MB for the normal case without any other metric_relabel_configs or relabel_configs, with more than 10,000 metrics). Moreover, the process of creating labels for each metric within a scrape loop is single-threaded.

Description of changes

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Note

  • Ensure backward compatibility between the current CWAgent and this PR
  • The number of metrics is quite small (<30), so it might not show the whole picture

Tests

For my PR

  • Without relabeling name, instance, and job: results in a 50% decrease in CPU utilization at startup (with lower memory utilization during runtime) and a 5%-10% decrease in memory consumption (2.05% maximum, around 1.95% on average, with the minimum going down to 1.86%). Config used:
        global:
          scrape_interval: 1m
          scrape_timeout: 10s
        scrape_configs:
          - job_name: cwagent-ecs-file-sd-config
            sample_limit: 10000
            file_sd_configs:
              - files: [ "/tmp/cwagent_ecs_auto_sd.yaml" ]

(screenshots: CPU and memory utilization graphs)

  • With relabeling name, instance, and job: mostly the same memory utilization pattern as the current CWAgent (with a lower maximum), plus a 20% decrease in CPU utilization during initialization. Config used:
        global:
          scrape_interval: 1m
          scrape_timeout: 10s
        scrape_configs:
          - job_name: cwagent-ecs-file-sd-config
            sample_limit: 10000
            file_sd_configs:
              - files: [ "/tmp/cwagent_ecs_auto_sd.yaml" ]
            relabel_configs:
              - source_labels: [__address__]
                replacement: job_replacement
                target_label: job
            metric_relabel_configs:
              - source_labels: [ __address__ ]
                replacement: instance_replacement
                target_label: instance
              - source_labels: [__name__]
                regex: "memcached_commands_total(.*)"
                target_label: __name__
                replacement: "memcached_commands"

(screenshots: CPU and memory utilization graphs)

For the current CWAgent performance:
(screenshots: CPU and memory utilization graphs for the current CWAgent)

Requirements

Before committing the code, please do the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make linter

@codecov-commenter

codecov-commenter commented Aug 20, 2022

Codecov Report

Merging #568 (ca61afc) into master (88d45d2) will decrease coverage by 0.31%.
The diff coverage is 52.17%.

@@            Coverage Diff             @@
##           master     #568      +/-   ##
==========================================
- Coverage   56.98%   56.66%   -0.32%     
==========================================
  Files         363      365       +2     
  Lines       16947    16993      +46     
==========================================
- Hits         9657     9629      -28     
- Misses       6739     6813      +74     
  Partials      551      551              
Impacted Files                                         | Coverage Δ
plugins/inputs/prometheus_scraper/calculator.go        | 0.00% <0.00%> (ø)
...ugins/inputs/prometheus_scraper/metrics_handler.go  | 0.00% <ø> (ø)
...ns/inputs/prometheus_scraper/prometheus_scraper.go  | 2.63% <0.00%> (+0.13%) ⬆️
plugins/inputs/prometheus_scraper/start.go             | 0.91% <0.00%> (+0.06%) ⬆️
...gins/inputs/prometheus_scraper/metrics_metadata.go  | 50.00% <50.00%> (ø)
...ins/inputs/prometheus_scraper/metric_prometheus.go  | 54.54% <54.54%> (ø)
...gins/inputs/prometheus_scraper/metrics_receiver.go  | 73.33% <73.33%> (-16.87%) ⬇️
plugins/inputs/prometheus_scraper/util.go              | 100.00% <100.00%> (ø)
translator/cmdutil/userutil_darwin.go                  | 10.52% <0.00%> (ø)
... and 3 more


@khanhntd khanhntd changed the title Prometheus oom Reduce memory usage for prometheus plugins Aug 22, 2022
@khanhntd khanhntd marked this pull request as ready for review August 22, 2022 14:44
@khanhntd khanhntd requested a review from a team as a code owner August 22, 2022 14:44
Contributor

@SaxyPandaBear SaxyPandaBear left a comment

Great deep dive on the issue. I can't tell what the performance difference is, based on the screenshots though. Can you explain them more?

func metadataForMetric(metricName string, mc MetadataCache) *scrape.MetricMetadata {
// Two ways to get the metric type through the metadataStore:
// * Use instance and job to look up the metadataStore. If the customer relabels job or instance, this fails
// * Use the context that holds the metadataStore, which is created within each scrape loop https://github.com/prometheus/prometheus/blob/main/scrape/scrape.go#L1154-L1161
Contributor

Nit - the link to the code in the prometheus repo should be pinned to a specific commit, not a branch. It is possible for the file to change over time in the branch so this might not be accurate again when someone has to look for it.

Contributor Author

Nice catch, thanks for pointing it out! I'm not sure whether this documentation is easy to understand for new folks (I hope it is).


// metricsReceiver implements the Appender interface for the prometheus scraper to append metrics
type metricsReceiver struct {
pmbCh chan<- PrometheusMetricBatch
}

type metricAppender struct {
receiver *metricsReceiver
batch PrometheusMetricBatch
ctx context.Context
Contributor

Why does the context need to be put in the struct and not as a function parameter?

Contributor Author

@khanhntd khanhntd Aug 24, 2022

The Appender and Append are two completely separate processes, so there is no clear way to pass this as a function parameter. Moreover, adding a parameter is not feasible, since the current Appender and Append are custom overrides of the Prometheus storage interfaces (prometheus Append and prometheus Appender).
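
For illustration, a rough sketch (not the exact code in this PR) of how the context gets captured, assuming the metricsReceiver/metricAppender types quoted above and Prometheus's storage.Appendable interface, whose Appender(ctx) method is the only point where the scrape loop hands over its context:

import (
	"context"

	"github.com/prometheus/prometheus/storage"
)

// Appender is invoked once per scrape loop. Append itself has no context
// parameter, so the context (and the metadata cache it carries) is stored
// on the appender struct for later use.
func (mr *metricsReceiver) Appender(ctx context.Context) storage.Appender {
	return &metricAppender{
		receiver:   mr,
		ctx:        ctx,  // per-scrape-loop context carrying the metadata cache
		isNewBatch: true, // field assumed from the diff hunk below
	}
}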

Comment on lines +75 to +82
if ma.isNewBatch {
metadataCache, err := getMetadataCache(ma.ctx)
if err != nil {
return err
}
ma.isNewBatch = false
ma.mc = metadataCache
}
Contributor

Why not just update the Appender function to get the metadata cache when creating the appender?

Contributor Author

The reason is that the Appender is an initialization step for Append. However, the metadata cache only appears within each scrape (in Prometheus terms, a scrape loop) and only exists within that scrape (each scrape loop is responsible for scraping one target and applying some default system labels). Therefore, we would not be able to get the exact metadata cache during initialization.

For more information (out of scope, but worth sharing): in the past, we used the relabeled job and instance to find the target, and then used the target to find the metadata cache. However, that costs an extra 200MB even in the best circumstances, plus extra runtime. Therefore, I use the context to hold the cache, and it will be garbage collected later on, based on my discussion with Anthony.
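
As a rough sketch (the actual body of getMetadataCache in this PR may differ), assuming Prometheus's scrape.MetricMetadataStoreFromContext helper, which reads the metadata store the scrape loop attached to its context:

package prometheus_scraper

import (
	"context"
	"errors"

	"github.com/prometheus/prometheus/scrape"
)

// getMetadataCache pulls the per-scrape-loop metadata store out of the context
// passed to Appender, so no job/instance -> target lookup is needed.
func getMetadataCache(ctx context.Context) (scrape.MetricMetadataStore, error) {
	mc, ok := scrape.MetricMetadataStoreFromContext(ctx)
	if !ok {
		return nil, errors.New("no metadata cache found in scrape context")
	}
	return mc, nil
}

metadataForMetric can then call GetMetadata on the returned store directly, and the store goes away together with the scrape loop's context.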


if metricMetadata.Type == textparse.MetricTypeUnknown {
// The internal metrics sometimes come back with type unknown, and we do not consider those valid metrics (only Gauge, Counter, and Summary are supported)
// https://github.com/khanhntd/amazon-cloudwatch-agent/blob/master/plugins/inputs/prometheus_scraper/metrics_filter.go#L21-L48
Contributor

same nit here, and this points to your fork

Contributor Author

Nice catch. Thanks for pointing it out to me!

@@ -1,112 +1,172 @@
// Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: MIT
// Copyright The OpenTelemetry Authors
Contributor

So this code is taken directly from the OTel receiver?

Contributor Author

It's partly copy-pasted for developing the unit tests. I studied some of their unit tests and the Prometheus unit tests before I was able to write this one, so it is not taken directly from the OTel receiver.

TargetLabel: savedScrapeNameLabel,
SourceLabels: model.LabelNames{"__name__"},
},
if relabelMetricName {
Contributor

So if any of the metrics have __name__, then all of the metrics get relabeled?

Contributor Author

Yes. To be more exact, if the customer's configuration uses __name__ as a target label, we add an extra label (a "magic" label) to store the original name before relabeling. OTel does not allow customers to modify the metric name, and this seems to be an issue in Prometheus too (Prometheus only updates the metadata cache before relabeling is applied, so we have no control over it).
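
For illustration, a minimal sketch of that extra rule, assuming the savedScrapeNameLabel and relabelMetricName names from the diff above (the label value and the relabel import path here are assumptions; the PR's actual code may differ):

import (
	"github.com/prometheus/common/model"
	"github.com/prometheus/prometheus/model/relabel"
)

// Hypothetical value; the PR's savedScrapeNameLabel may be named differently.
const savedScrapeNameLabel = "prom_metric_name_before_relabel"

// appendNameSavingRelabel injects the "save the original __name__" rule only
// when the customer's config actually relabels the metric name, so the extra
// label (and its memory cost) is not paid in the common case.
func appendNameSavingRelabel(cfgs []*relabel.Config, relabelMetricName bool) []*relabel.Config {
	if !relabelMetricName {
		return cfgs
	}
	return append(cfgs, &relabel.Config{
		Action:       relabel.Replace,
		Regex:        relabel.MustNewRegexp("(.*)"),
		Replacement:  "$1",
		SourceLabels: model.LabelNames{"__name__"},
		TargetLabel:  savedScrapeNameLabel,
	})
}

Later relabeling rules are then free to rewrite __name__, while the saved label still carries the original name for the metadata-cache lookup.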

Comment on lines +22 to +41
func (pm *PrometheusMetric) isValueValid() bool {
// treat NaN and +/-Inf values as invalid since EMF logs don't support them
return !value.IsStaleNaN(pm.metricValue) && !math.IsNaN(pm.metricValue) && !math.IsInf(pm.metricValue, 0)
}

func (pm *PrometheusMetric) isCounter() bool {
return pm.metricType == string(textparse.MetricTypeCounter)
}

func (pm *PrometheusMetric) isGauge() bool {
return pm.metricType == string(textparse.MetricTypeGauge)
}

func (pm *PrometheusMetric) isHistogram() bool {
return pm.metricType == string(textparse.MetricTypeHistogram)
}

func (pm *PrometheusMetric) isSummary() bool {
return pm.metricType == string(textparse.MetricTypeSummary)
}
Contributor

Where are these functions used?

Contributor Author

They are used in the filter, which drops unsupported metrics, and also in the calculation.

For more context on the delta calculation: the reason we perform calculations (e.g., delta for the Prometheus Counter type) is that there is no support for the Counter type in CW metrics. The only thing we can do is compute the difference and send that to the CW backend.
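
For context, a simplified sketch of what such a delta calculation can look like (this is not the PR's calculator.go; the series key and reset handling here are assumptions):

import "sync"

// deltaCalculator keeps the previously observed value per series and emits the
// increase since the last scrape, which is what gets sent to the CW backend.
type deltaCalculator struct {
	mu   sync.Mutex
	prev map[string]float64 // keyed by a series identifier, e.g. metric name + sorted labels
}

func newDeltaCalculator() *deltaCalculator {
	return &deltaCalculator{prev: make(map[string]float64)}
}

// delta returns the increase since the last observation and whether a delta
// could be computed (false on the first sample or after a counter reset).
func (d *deltaCalculator) delta(seriesKey string, current float64) (float64, bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	last, seen := d.prev[seriesKey]
	d.prev[seriesKey] = current
	if !seen || current < last {
		// First sample for this series, or the counter went backwards (target restart).
		return 0, false
	}
	return current - last, true
}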

@khanhntd
Contributor Author

I can't tell what the performance difference is, based on the screenshots though. Can you explain them more?

Sure, and thanks! As you already know, the OOM is caused by saving the metric name before relabeling. Therefore, with my PR, I benchmarked these test cases:

  • The customer's case, without relabeling metric names
  • Relabeling metric names

For the first test case, which is the customer's use case, I noticed the following:

  • The CPU utilization decreases from 2.15% to 1.11% (approximately a 50% reduction when initializing the CloudWatchAgent, plus a small decrease during runtime)
  • The memory utilization is within the range of 1.86% to 2.05% (1.95% on average), compared with the current CloudWatchAgent's range of 1.95% to 2.15% (2.05% on average). Note that this is with only a small number of metrics, so it might differ in customer environments
  • Confirmed with the customers that, with my PR, their memory utilization is reduced from 9GB to 2.5GB (an additional 500MB lower since, IIRC, their original state was 3GB)

For the second test case, which exercises the change that originally caused the OOM (relabeling metric names), I noticed the following:

  • The CPU utilization decreases from 2.15% to 1.70%, with a small decrease during runtime
  • The memory utilization is within the range of 1.95% to 2.15% (1.95% on average, spending less time at 2.15%). Again, note that this is with only a small number of metrics, so it might differ in customer environments

**Note**: The decrease in CPU utilization might help the most, since when troubleshooting with customers, increasing the CPU cores made the customer's pod stabilize around 9GB (relabeling metrics might be single-threaded, so more CPU helps Prometheus access the metrics in memory faster).

@github-actions
Contributor

github-actions bot commented Sep 1, 2022

This PR was marked stale due to lack of activity.

@github-actions github-actions bot added the Stale label Sep 1, 2022
@ashevtsov-wawa

@khanhntd I'm wondering if there are any plans to get this PR updated and merged any time soon?

@chadpatel chadpatel closed this Sep 27, 2023