Description
The agent provides an OpenTelemetry metrics bridge: users can collect metrics via the OpenTelemetry API, and we bridge them so that they are exported via the Elastic APM agent.
This mechanism internally uses an embedded OpenTelemetry metrics SDK to do the metrics aggregation.
Because the two live in different classloaders, we need to copy the AttributeKeys and Attributes provided through the user's OpenTelemetry API into the OpenTelemetry metrics SDK. Because these are very often static, we added WeakHashMap-based caching to this copying.
However, we have now had a report where this mechanism caused an OOM:
The user likely had a hot loop reporting metrics. In this loop, it seems that AttributeKeys were created over and over again before being used to report a metric data point.
At the same time, the application seemed to be CPU-starved, so the cleaning thread of the WeakHashMap never ran and never released the cached values that had already gone out of scope. As a result, the cache kept growing until the application crashed with an OOM.
Though creating the same AttributeKeys over and over again is not the best pattern, it is still valid, and we shouldn't be causing an OOM in this case.
We can easily solve this by imposing a reasonable size limit on our caches, beyond which we stop caching new entries.
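
A minimal sketch of the proposed fix follows. It is not the agent's actual code: the class name `BoundedWeakCache`, the `copier` function and the limit parameter are hypothetical, and it uses a plain synchronized `WeakHashMap` (which compares keys with `equals()`), which may differ from what the agent uses internally. Once the limit is reached, new keys are still copied on every call but are no longer stored, so the cache cannot grow without bound even if stale entries are never cleaned up.

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

/** Hypothetical helper illustrating a size-limited weak cache. */
final class BoundedWeakCache<K, V> {
    // Weak keys: an entry becomes collectable once its key is no longer
    // strongly referenced elsewhere (the value must not reference the key).
    private final Map<K, V> cache = Collections.synchronizedMap(new WeakHashMap<>());
    private final int maxSize;
    private final Function<K, V> copier;

    BoundedWeakCache(int maxSize, Function<K, V> copier) {
        this.maxSize = maxSize;
        this.copier = copier;
    }

    V get(K key) {
        V cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        V copy = copier.apply(key);
        // Soft limit: the check and the put are not atomic, so concurrent
        // callers may overshoot by a few entries. Past the limit we simply
        // skip caching and return the fresh copy.
        if (cache.size() < maxSize) {
            cache.put(key, copy);
        }
        return copy;
    }

    int size() {
        return cache.size();
    }
}
```

With this approach a hot loop that keeps creating new keys degrades to uncached copying instead of exhausting the heap. The cost is that keys arriving after the limit is hit are copied on every call until older entries are released.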