OpenTelemetry metrics bridge caches can cause OOM #4122

@JonasKunz

Description

The agent provides an OpenTelemetry metrics bridge: users can collect metrics via the OpenTelemetry API, and we bridge them so that they are exported via the Elastic APM agent.

This mechanism internally uses an embedded OpenTelemetry metrics SDK to do the metrics aggregation.
Since the user's OpenTelemetry API and the embedded SDK live in different classloaders, we need to copy the AttributeKeys and Attributes provided through the user's OpenTelemetry API into the embedded OpenTelemetry metrics SDK. Because these are very often static, we added WeakHashMap-based caching to this copying.
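To illustrate the caching described above, here is a minimal sketch. `WeakCopyCache` and the `copier` function are hypothetical names used only for illustration and are not the agent's actual classes; the sketch assumes the copy never holds a strong reference back to the user-side object, since that would prevent the weak key from ever being collected.

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

// Hypothetical helper for illustration only, not the agent's actual implementation.
final class WeakCopyCache<K, V> {
    // Weak keys: an entry becomes removable once the user-side object is no longer
    // strongly reachable. Lookups use equals/hashCode of the key.
    private final Map<K, V> cache = Collections.synchronizedMap(new WeakHashMap<>());
    // Creates the SDK-side copy of a user-side object (assumed to be side-effect free).
    private final Function<K, V> copier;

    WeakCopyCache(Function<K, V> copier) {
        this.copier = copier;
    }

    // Returns the cached copy, creating and caching it on first use.
    V get(K userObject) {
        return cache.computeIfAbsent(userObject, copier);
    }

    int size() {
        return cache.size();
    }
}
```

With a static key, the copy is made once and reused on every subsequent metric recording. If a new key instance is created per data point, the cache can only shrink again once the garbage collector has cleared the weak references and the stale entries have been removed.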

However, we have now had a report where this mechanism caused an OOM:
The user likely had a hot loop reporting metrics. In this loop, it seems that AttributeKeys were created over and over again before being used to report a metric data point.

At the same time, the application seemed to be CPU-starved, so the cleaning thread of the WeakHashMap never ran and never released the cached values, even though they had already gone out of scope. This resulted in an OOM crash.

Though creating the same AttributeKeys over and over again is not the best pattern, it is still valid to do so, and we shouldn't be causing an OOM in this case.

We can easily solve this by imposing a reasonable size limit on our caches, beyond which we stop caching new entries.
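A minimal sketch of such a size limit, under the same assumptions as the description above: `BoundedWeakCopyCache`, `copier`, and `maxSize` are hypothetical names, and the actual fix in the agent may look different. Once the limit is reached, new objects are still copied on every call, but the copies are no longer stored, so the cache cannot grow without bound.

```java
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

// Hypothetical helper for illustration only, not the agent's actual implementation.
final class BoundedWeakCopyCache<K, V> {
    private final Map<K, V> cache = new WeakHashMap<>();
    private final Function<K, V> copier;
    private final int maxSize;

    BoundedWeakCopyCache(Function<K, V> copier, int maxSize) {
        this.copier = copier;
        this.maxSize = maxSize;
    }

    // Synchronized so that the size check and the put happen atomically.
    synchronized V get(K userObject) {
        V cached = cache.get(userObject);
        if (cached != null) {
            return cached;
        }
        V copy = copier.apply(userObject);
        // Beyond the limit we still return a correct copy, we just do not cache it.
        if (cache.size() < maxSize) {
            cache.put(userObject, copy);
        }
        return copy;
    }

    synchronized int size() {
        return cache.size();
    }
}
```

The trade-off is that callers past the limit pay the copying cost on every call, which seems acceptable compared to an OOM. A coarse lock is used here for simplicity; a production version would likely want something cheaper on the hot path.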
