Even if a batch job completes successfully, it might be marked as completed with failures if an instance running one of its workers is removed. Instances can be removed for any number of reasons, the primary one being AWS reclaiming spot instances. When an instance is removed, the metrics held by the statsd agent on that node are lost and can no longer be scraped by Prometheus.
Proposed solution:
Deploy a statsd agent dedicated to the batch jobs and schedule it on the operator nodegroup. Make it available to other pods via a service. All batch workers will push their metrics to this service, so the metrics no longer depend on the lifetime of the node a worker happens to run on.
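
As a rough illustration of the worker side of this proposal, the sketch below pushes a counter to a shared statsd endpoint over UDP using the plain statsd line format (`<name>:<value>|c`). The service DNS name `batch-statsd.default.svc.cluster.local` and the metric name are hypothetical placeholders; port 8125 is only the conventional statsd default, not something confirmed for this cluster.

```python
import socket

# Assumption: hypothetical DNS name of the Service fronting the dedicated statsd agent.
STATSD_HOST = "batch-statsd.default.svc.cluster.local"
# Assumption: 8125 is the conventional statsd UDP port; the real deployment may differ.
STATSD_PORT = 8125


def format_counter(name: str, value: int = 1) -> bytes:
    """Encode a counter in the plain statsd line format: <name>:<value>|c."""
    return f"{name}:{value}|c".encode("ascii")


def push_counter(name: str, value: int = 1,
                 host: str = STATSD_HOST, port: int = STATSD_PORT) -> bool:
    """Send one counter datagram to the shared agent.

    Returns False if the datagram could not be sent (for example, the
    service name does not resolve). UDP gives no delivery guarantee, so
    True only means the send call itself succeeded.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            sock.sendto(format_counter(name, value), (host, port))
        except OSError:
            return False
    return True
```

A batch worker could call `push_counter("batch.items_processed", 10)` (a hypothetical metric name) after each unit of work. Because the datagram goes to the service rather than to a node-local agent, losing the worker's node would not discard metrics that were already sent.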