Reduce memory usage for prometheus plugins #568
Conversation
Codecov Report
@@ Coverage Diff @@
## master #568 +/- ##
==========================================
- Coverage 56.98% 56.66% -0.32%
==========================================
Files 363 365 +2
Lines 16947 16993 +46
==========================================
- Hits 9657 9629 -28
- Misses 6739 6813 +74
Partials 551 551
Great deep dive on the issue. I can't tell what the performance difference is based on the screenshots, though. Can you explain them more?
func metadataForMetric(metricName string, mc MetadataCache) *scrape.MetricMetadata {
	// Two ways to get the metric type through the metadataStore:
	// * Use instance and job to get the metadataStore. If the customer relabels job or instance, it will fail.
	// * Use the Context that holds the metadataStore, which is created within each scrape loop https://github.com/prometheus/prometheus/blob/main/scrape/scrape.go#L1154-L1161
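For illustration, here is a hedged sketch of what the lookup itself could look like. The MetadataCache interface shape and the suffix-trimming fallback (summary and histogram child series such as foo_sum or foo_bucket live in the cache under the base name foo) follow the pattern used by the OTel receiver; the exact suffix list and control flow are assumptions, not this PR's implementation.

import (
	"strings"

	"github.com/prometheus/prometheus/scrape"
)

// MetadataCache is assumed to wrap the per-scrape-loop metadata store.
type MetadataCache interface {
	Metadata(metricName string) (scrape.MetricMetadata, bool)
}

func metadataForMetric(metricName string, mc MetadataCache) *scrape.MetricMetadata {
	// Fast path: the metric name is stored in the cache as-is.
	if metadata, ok := mc.Metadata(metricName); ok {
		return &metadata
	}
	// Summary/histogram child series are stored under the base metric name,
	// so retry with well-known suffixes trimmed (illustrative list).
	for _, suffix := range []string{"_sum", "_count", "_bucket", "_total"} {
		if trimmed := strings.TrimSuffix(metricName, suffix); trimmed != metricName {
			if metadata, ok := mc.Metadata(trimmed); ok {
				return &metadata
			}
		}
	}
	return nil // the caller treats a nil result as unknown type
}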
Nit - the link to the code in the prometheus repo should be pinned to a specific commit, not a branch. It is possible for the file to change over time in the branch, so this might no longer be accurate when someone has to look for it.
Nice catch. Thanks for pointing it out! I don't know if this documentation is easy to understand for new folks (I hope it is).
// metricsReceiver implements the Appender interface for the prometheus scraper to append metrics
type metricsReceiver struct {
	pmbCh chan<- PrometheusMetricBatch
}
type metricAppender struct {
	receiver   *metricsReceiver
	batch      PrometheusMetricBatch
	ctx        context.Context
	isNewBatch bool
	mc         MetadataCache
}
Why does the context need to be put in the struct instead of being passed as a function parameter?
The Appender and Append are two completely separate processes, so there would not be a clear way of passing this as a function parameter. Moreover, adding a parameter is not feasible, since the current Appender and Append are custom overrides of the Prometheus storage methods (Prometheus Append and Prometheus Appender).
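For context, a minimal sketch of the Appender side, assuming metricsReceiver implements Prometheus's storage.Appendable interface and reusing the types from the diff above; the field initializers are illustrative, and metricAppender must also implement the rest of storage.Appender (Append, Commit, Rollback) to compile.

import (
	"context"

	"github.com/prometheus/prometheus/storage"
)

// Appender is called by the scrape loop once per scrape; it is the only hook
// that receives the scrape context, so the context is captured in the struct
// for the later Append calls.
func (mr *metricsReceiver) Appender(ctx context.Context) storage.Appender {
	return &metricAppender{
		receiver:   mr,
		batch:      PrometheusMetricBatch{},
		ctx:        ctx,
		isNewBatch: true,
	}
}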
if ma.isNewBatch {
	// Fetch the metadata cache from the scrape context once per batch
	metadataCache, err := getMetadataCache(ma.ctx)
	if err != nil {
		return err
	}
	ma.isNewBatch = false
	ma.mc = metadataCache
}
Why not just update the Appender function to get the metadata cache when creating the appender?
The reason is that the Appender is the initialization step for the Append. However, the metadata cache is only created within each scrape (in Prometheus terms, a scrape loop) and only exists within that scrape (each scrape loop is responsible for scraping one target and applying some default system labels). Therefore, we won't be able to get the exact metadata cache during initialization.
For more information (out of scope, but worth sharing): in the past, we used the relabeled job and instance to find the target, and then used the target to find the metadata cache. However, that cost an extra 200MB in the best case, plus extra run time. Therefore, I utilize the context to hold the cache, and it will be garbage collected later on, based on my discussion with Anthony.
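A minimal sketch of that context-based lookup, assuming a Prometheus version that exposes scrape.MetricMetadataStoreFromContext (the same helper the OTel receiver relies on); this returns the raw store, which the PR's helper would presumably wrap in its MetadataCache type, and the error message is illustrative.

import (
	"context"
	"errors"

	"github.com/prometheus/prometheus/scrape"
)

// getMetadataCache pulls the per-scrape-loop metadata store out of the
// context that the scrape loop attached it to.
func getMetadataCache(ctx context.Context) (scrape.MetricMetadataStore, error) {
	store, ok := scrape.MetricMetadataStoreFromContext(ctx)
	if !ok {
		return nil, errors.New("unable to find the metadata store in the scrape context")
	}
	return store, nil
}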
if metricMetadata.Type == textparse.MetricTypeUnknown {
	// Internal metrics sometimes come back with type unknown, and we do not consider those valid metrics (only Gauge, Counter, and Summary are supported)
	// https://github.com/khanhntd/amazon-cloudwatch-agent/blob/master/plugins/inputs/prometheus_scraper/metrics_filter.go#L21-L48
Same nit here, and this one points to your fork.
Nice catch. Thanks for pointing it out to me!
@@ -1,112 +1,172 @@
// Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: MIT
// Copyright The OpenTelemetry Authors
So this code is taken directly from the OTel receiver?
It's partly copied and pasted for developing the unit tests. I learned from some of their unit tests and the Prometheus unit tests before being able to write this one, so it is not taken directly from the OTel receiver.
	TargetLabel:  savedScrapeNameLabel,
	SourceLabels: model.LabelNames{"__name__"},
},
if relabelMetricName {
So if any of the metrics have __name__, then all of the metrics get relabeled?
Yes. To be more exact, if the customer's configuration uses the metric name as a target label, we add an extra label (a magic label) to store the original name before relabeling. OTel does not allow customers to modify the metric name, and this seems to be an issue in Prometheus too (since Prometheus only updates the metadata cache before relabeling is applied, we have no control over it).
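For illustration, a hedged sketch of the conditional wiring being discussed, reusing the relabel rule shown in the diff above. The helper name, the stand-in value of savedScrapeNameLabel, and the relabel import path (it moved from pkg/relabel to model/relabel across Prometheus versions) are assumptions, not this PR's exact code.

import (
	"github.com/prometheus/common/model"
	"github.com/prometheus/prometheus/model/relabel"
)

// savedScrapeNameLabel is defined elsewhere in the diff; this value is a
// hypothetical stand-in.
const savedScrapeNameLabel = "__saved_scrape_name__"

// appendNameSavingRule (name hypothetical) only appends the name-saving rule
// when the customer's own relabel rules target __name__.
func appendNameSavingRule(configs []*relabel.Config, relabelMetricName bool) []*relabel.Config {
	if !relabelMetricName {
		return configs
	}
	return append(configs, &relabel.Config{
		Action:       relabel.Replace,
		Regex:        relabel.MustNewRegexp("(.*)"),
		Replacement:  "$1",
		TargetLabel:  savedScrapeNameLabel,
		SourceLabels: model.LabelNames{"__name__"},
	})
}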
func (pm *PrometheusMetric) isValueValid() bool {
	// Treat NaN and +/-Inf values as invalid, as EMF logs don't support them
	return !value.IsStaleNaN(pm.metricValue) && !math.IsNaN(pm.metricValue) && !math.IsInf(pm.metricValue, 0)
}

func (pm *PrometheusMetric) isCounter() bool {
	return pm.metricType == string(textparse.MetricTypeCounter)
}

func (pm *PrometheusMetric) isGauge() bool {
	return pm.metricType == string(textparse.MetricTypeGauge)
}

func (pm *PrometheusMetric) isHistogram() bool {
	return pm.metricType == string(textparse.MetricTypeHistogram)
}

func (pm *PrometheusMetric) isSummary() bool {
	return pm.metricType == string(textparse.MetricTypeSummary)
}
Where are these functions used?
They are used in the Filter, which drops the unsupported metrics, and also in the calculations.
For more context on the delta calculation: the reason we perform some calculations (e.g. delta for the Prometheus Counter type) is that there is no support for the Counter type in CloudWatch metrics. The only thing we can do is calculate the difference and send it over to the CloudWatch backend.
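For illustration, a minimal sketch of that counter-to-delta conversion, reusing the PrometheusMetric type from the diff above; the function name and the reset handling are illustrative, not this PR's exact code.

// counterDelta converts a cumulative counter sample into a delta against the
// previously observed sample, since CloudWatch has no cumulative counter type.
func counterDelta(prev, curr *PrometheusMetric) float64 {
	delta := curr.metricValue - prev.metricValue
	if delta < 0 {
		// A negative difference means the counter was reset (e.g. the target
		// restarted), so the current value is the best available delta estimate.
		delta = curr.metricValue
	}
	return delta
}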
Sure, and thanks too! As you already know, the OOM is caused by saving the metric name before relabeling. Therefore, with my PR, I benchmarked these test cases:
For the first test case, which is the customer's use case, I noticed the following.
For the second test case, which covers the reason we introduced the PR that causes the OOM, I noticed the following.
Note: the decrease in CPU utilization might help the most, since when troubleshooting with customers, increasing CPU cores made the customer's pod stable at around 9GB (relabeling metrics might be single-threaded, so more CPU would help Prometheus access the metrics in memory faster).
This PR was marked stale due to lack of activity.
@khanhntd I'm wondering if there are any plans to get this PR updated and merged any time soon?
Description of the issue
After introducing save name before relabel and save instance and job before relabel, there has been an increase in memory consumption and CPU usage, since these changes add extra relabel_configs and metric_relabel_configs (200MB in the normal case when using job and instance before relabel, and 500MB in the normal case without any other metric_relabel_config or relabel_config, with more than 10,000 metrics). Moreover, the process of creating labels for each metric within a scrape loop is single-threaded.
Description of changes
- Remove saving the name before relabel for each Prometheus metric, moving the metric type lookup to when metrics are appended into the batch
- Remove the relabel_config that saves job and instance before relabel to look up the metadata type
License
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Note
Tests
For my PR:
For the current CWAgent performance:
Requirements
Before committing the code, please do the following steps:
1. Run make fmt and make fmt-sh
2. Run make linter