Metrics during a batch job are lost if the underlying instances are removed #2041

vishalbollu · 2021-04-01T22:45:05Z

Even if a batch job completes successfully, it may be marked as completed with failures if an instance running one of its workers is removed. Instances can be removed for any number of reasons, the primary one being AWS reclaiming spot instances. When an instance is removed, the metrics stored on the statsd agent on that node are lost and are therefore never scraped by Prometheus.

Proposed solution:

Deploy a statsd agent dedicated to batch jobs, scheduled on the operator nodegroup. Expose it to other pods via a service, and have all batch workers push their metrics to that service.
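A rough sketch of what the worker side could look like under this proposal, assuming the agent accepts plain statsd datagrams over UDP. The `STATSD_HOST` and `STATSD_PORT` environment variables, the metric name, and both helper functions are hypothetical; the issue does not specify the service name, port, or client library.

```python
import os
import socket

# Hypothetical address of the shared statsd service; the issue does not
# name the service or its port, so both values here are assumptions.
STATSD_HOST = os.environ.get("STATSD_HOST", "127.0.0.1")
STATSD_PORT = int(os.environ.get("STATSD_PORT", "8125"))


def format_counter(name: str, value: int = 1) -> bytes:
    """Encode a counter increment in the plain statsd line format."""
    return f"{name}:{value}|c".encode("ascii")


def push_counter(name: str, value: int = 1,
                 host: str = STATSD_HOST, port: int = STATSD_PORT) -> None:
    """Send one counter datagram to the shared agent over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_counter(name, value), (host, port))


if __name__ == "__main__":
    # Hypothetical metric name, for illustration only.
    push_counter("batch.succeeded")
```

Because the datagram leaves the worker's node as soon as it is sent, removing that node afterwards would not discard counts the agent has already received.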


RobertLucian added the tests (Something related to testing) label Apr 2, 2021

miguelvr added the BatchAPI (Something related to the BatchAPI kind) and metrics (Related to metrics or dashboards) labels and removed the tests (Something related to testing) label Apr 13, 2021

vishalbollu mentioned this issue Jun 11, 2021

Move statsd agent to a deployment on the operator nodegroup #2247

Merged

2 tasks

vishalbollu closed this as completed in #2247 Jun 17, 2021

deliahu added this to the v0.37 milestone Jun 22, 2021
