Reduce memory usage of Metricbeat prometheus collector #17004

Open
jsoriano opened this issue Mar 13, 2020 · 7 comments
Labels
discuss (Issue needs further discussion.) · enhancement · Metricbeat · refactoring · Team:Cloudnative-Monitoring (Cloud Native Monitoring team) · Team:Integrations (Integrations team) · Team:Platforms (Integrations - Platforms team) · [zube]: Inbox

Comments

@jsoriano (Member) commented Mar 13, 2020

I am creating this issue as a brain dump to keep track of a problem that has recently come up in some conversations.

The Prometheus collector in Metricbeat can require a lot of memory to process some big Prometheus responses. In general this is not a problem, but it can be an issue in some cases, for example when collecting metrics from the federate API (this could possibly be worked around by #14983), or when collecting metrics from big Kubernetes clusters or other services with lots of resources.
From my observations, Metricbeat can need up to 20 times the size of a Prometheus response in memory to process it.

Prometheus response processing does the following:

  1. It parses all the families, creating objects and keeping them in memory.
  2. All these objects are converted to fields and grouped into events whose fields share the same labels.
  3. Each of the events generated for each set of labels is reported.

At the moment of reporting, all the objects created to process the response are still in memory, and there can be a lot of them.
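
For illustration, a simplified sketch of this flow could look roughly like the Go code below. This is not the actual Metricbeat collector code: `eventsFromResponse`, `labelsKey` and `metricValue` are made-up names, and only simple metric types are handled.

```go
package collector

import (
	"io"
	"sort"
	"strings"

	dto "github.com/prometheus/client_model/go"
	"github.com/prometheus/common/expfmt"
)

// eventsFromResponse is an illustrative stand-in for the collector logic:
// step 1 parses the whole response into families, step 2 converts every
// metric to fields grouped by label set, and only in step 3 are the
// resulting events handed over for reporting. Until then, families,
// fields and events all stay in memory at the same time.
func eventsFromResponse(body io.Reader) (map[string]map[string]interface{}, error) {
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(body) // step 1: everything parsed at once
	if err != nil {
		return nil, err
	}

	events := map[string]map[string]interface{}{} // step 2: one event per label set
	for name, family := range families {
		for _, metric := range family.GetMetric() {
			key := labelsKey(metric.GetLabel())
			if _, ok := events[key]; !ok {
				events[key] = map[string]interface{}{}
			}
			events[key][name] = metricValue(metric)
		}
	}
	return events, nil // step 3: events are reported while all of the above is still referenced
}

// labelsKey serializes a label set into a deterministic map key.
func labelsKey(labels []*dto.LabelPair) string {
	parts := make([]string, 0, len(labels))
	for _, l := range labels {
		parts = append(parts, l.GetName()+"="+l.GetValue())
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

// metricValue extracts the value of the simple metric types; the real
// collector also handles summaries and histograms.
func metricValue(m *dto.Metric) float64 {
	switch {
	case m.GetGauge() != nil:
		return m.GetGauge().GetValue()
	case m.GetCounter() != nil:
		return m.GetCounter().GetValue()
	default:
		return m.GetUntyped().GetValue()
	}
}
```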

Memory usage from point 1 could be reduced with stream parsing: we could assign metrics to events as soon as we parse them, so we don't need to keep the intermediate objects in memory. I did a quick test of this and memory usage was reduced by about 20%.
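
A minimal sketch of the stream-parsing idea, assuming `expfmt.NewDecoder` (which decodes one `MetricFamily` at a time) can replace the all-at-once `TextToMetricFamilies` call; `handleFamily` is a hypothetical callback where the existing fields/events logic would go:

```go
package collector

import (
	"io"

	dto "github.com/prometheus/client_model/go"
	"github.com/prometheus/common/expfmt"
)

// streamFamilies decodes one family at a time and hands it to a callback,
// so each parsed family can be released as soon as it has been processed
// instead of accumulating in a single map for the whole response.
func streamFamilies(body io.Reader, handleFamily func(*dto.MetricFamily) error) error {
	dec := expfmt.NewDecoder(body, expfmt.FmtText)
	for {
		family := &dto.MetricFamily{}
		if err := dec.Decode(family); err != nil {
			if err == io.EOF {
				return nil
			}
			return err
		}
		if err := handleFamily(family); err != nil {
			return err
		}
		// family is no longer referenced after this point and can be
		// garbage collected before the next one is decoded.
	}
}
```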

The bulk of the memory usage is in the grouping of metrics per label set (point 2). This is not so easy to solve because metrics with the same labels can appear in different parts of the response.

Some possible approaches could be:

  • In point 2, only group the families but don't generate the events (family objects may take less memory than their equivalent maps); after grouping the families, generate and send the events one by one.
  • Sort the raw metrics in the received response by labels, and send each event as soon as all metrics with the same labels have been read. This would increase CPU usage and would require keeping the whole response in memory, possibly twice, but it would avoid keeping all the events in memory.
  • Keep a maximum number of events in memory; as soon as we reach this limit, send them and continue processing (see the sketch after this list). This would reduce memory usage, but would create multiple events with the same labels, increasing the volume of events and the index size.
  • Investigate the logic Prometheus itself uses (see the comment below).
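
To make the third option more concrete, here is a minimal sketch under the assumption of a `report` callback standing in for the Metricbeat reporter; names like `boundedGrouper` are made up:

```go
package collector

// boundedGrouper groups metric fields per label-set key, but flushes all
// pending events once maxEvents distinct label sets are in memory. The
// trade-off: a label set seen again after a flush produces another event,
// increasing event volume and index size.
type boundedGrouper struct {
	maxEvents int
	events    map[string]map[string]interface{}
	report    func(fields map[string]interface{}) // hypothetical reporter callback
}

func newBoundedGrouper(max int, report func(map[string]interface{})) *boundedGrouper {
	return &boundedGrouper{
		maxEvents: max,
		events:    map[string]map[string]interface{}{},
		report:    report,
	}
}

// add stores one field under its serialized label-set key, flushing first
// if the cap on in-memory events has been reached.
func (g *boundedGrouper) add(key, field string, value float64) {
	event, ok := g.events[key]
	if !ok {
		if len(g.events) >= g.maxEvents {
			g.flush()
		}
		event = map[string]interface{}{}
		g.events[key] = event
	}
	event[field] = value
}

// flush reports everything collected so far and resets the buffer.
func (g *boundedGrouper) flush() {
	for _, fields := range g.events {
		g.report(fields)
	}
	g.events = map[string]map[string]interface{}{}
}
```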
@jsoriano added the enhancement, discuss, refactoring and Team:Platforms labels on Mar 13, 2020
@elasticmachine (Collaborator)

Pinging @elastic/integrations-platforms (Team:Platforms)

@exekias (Contributor) commented Mar 24, 2020

Thanks for starting this. The families iterator and grouping before constructing the events should save a fair amount of memory, right? That would also allow forgetting about already-sent groups.

@jsoriano (Member, Author)

Yes, I think this would be the best option. Memory usage will still increase with the number of metrics, but at a much lower rate.
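
A rough sketch of how that option could look, reusing the hypothetical `labelsKey`/`metricValue` helpers from the sketch in the issue description and another made-up `report` callback (this is not the actual implementation): raw values are grouped per label set while families are streamed, and the heavier event maps are only built one at a time when sending, with each group dropped as soon as its event is out.

```go
package collector

import (
	dto "github.com/prometheus/client_model/go"
)

// group holds the already-extracted values of the metrics that share a
// label set, keyed by metric name. Keeping these instead of full event
// field maps is the "group the families, generate events later" option.
type group map[string]float64

// addFamily folds one streamed family into the per-label-set groups,
// using the hypothetical labelsKey/metricValue helpers from the earlier
// sketch in the issue description.
func addFamily(groups map[string]group, family *dto.MetricFamily) {
	for _, metric := range family.GetMetric() {
		key := labelsKey(metric.GetLabel())
		if groups[key] == nil {
			groups[key] = group{}
		}
		groups[key][family.GetName()] = metricValue(metric)
	}
}

// sendGroups builds one event per label set, reports it immediately and
// forgets the group, so only one event map is alive at a time.
func sendGroups(groups map[string]group, report func(map[string]interface{})) {
	for key, g := range groups {
		fields := make(map[string]interface{}, len(g))
		for name, value := range g {
			fields[name] = value
		}
		report(fields)
		delete(groups, key) // the already-sent group can be garbage collected now
	}
}
```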

@jsoriano added the Team:Integrations label on Jan 15, 2021
@vjsamuel (Contributor)

One thing to consider is moving to the same logic that Prometheus uses. They don't use expfmt like we do, and their parser allows more erroneous endpoints to be scraped than expfmt does. Looking at that implementation, I think it would also be slightly more memory efficient than expfmt.
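
For reference, the parser Prometheus itself uses lives in its textparse package; import paths and exact signatures differ between Prometheus versions, so the following is only an approximate sketch of its shape (the `handleSeries` callback is made up). The point it illustrates is that this parser yields one series at a time without building protobuf `MetricFamily` objects at all:

```go
package collector

import (
	"io"

	"github.com/prometheus/prometheus/pkg/labels"
	"github.com/prometheus/prometheus/pkg/textparse"
)

// scrapeWithTextparse walks a scraped payload series by series, roughly the
// way the Prometheus scrape loop does, instead of materializing protobuf
// MetricFamily objects as expfmt does.
func scrapeWithTextparse(payload []byte, handleSeries func(lset labels.Labels, value float64)) error {
	p := textparse.NewPromParser(payload)
	for {
		entry, err := p.Next()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		if entry != textparse.EntrySeries {
			continue // skip TYPE/HELP/comment entries
		}
		var lset labels.Labels
		p.Metric(&lset)
		_, _, value := p.Series()
		handleSeries(lset, value)
	}
}
```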

@botelastic (bot) commented Jan 27, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@botelastic added the Stalled label on Jan 27, 2022
@jsoriano removed the Stalled label on Jan 27, 2022
@botelastic (bot) commented Jan 27, 2023

Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:.
Thank you for your contribution!

@botelastic added the Stalled label on Jan 27, 2023
@ChrsMark added the Team:Cloudnative-Monitoring label on Jan 30, 2023
@botelastic removed the Stalled label on Jan 30, 2023
@eduardofesilva

:+1:
