Metrics during a batch job are lost if the underlying instances are removed #2041

Closed
vishalbollu opened this issue Apr 1, 2021 · 0 comments · Fixed by #2247
Labels
BatchAPI Something related to the BatchAPI kind metrics Related to metrics or dashboards
Milestone

Comments

@vishalbollu
Contributor

Even if a batch job completes successfully, it might be marked as completed with failures if an instance running one of its workers is removed. Instances can be removed for any number of reasons, the primary one being AWS reclaiming spot instances. If an instance is removed, the metrics stored on the statsd agent on that node are lost and are therefore not available to be scraped by Prometheus.

Proposed solution:

Deploy a statsd agent dedicated to batch jobs and schedule it on the operator nodegroup. Make it available to other pods via a service. All batch workers will push their metrics to this service instead of to the agent on their own node.
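
As a rough illustration of the worker side of this proposal, the sketch below pushes a counter to a central statsd endpoint over UDP using the plain statsd line format (`name:value|c`). The service hostname, the port, and the metric names are assumptions made up for this example; the issue does not specify them, and this is not necessarily how the fix in #2247 was implemented.

```python
import socket

# Hypothetical service address: the issue does not name the service or its port.
STATSD_HOST = "batch-statsd.default.svc.cluster.local"
STATSD_PORT = 8125  # conventional statsd UDP port, assumed here


def format_counter(name: str, value: int = 1) -> bytes:
    """Encode a counter increment in the plain statsd line format."""
    return f"{name}:{value}|c".encode("ascii")


def push_metric(payload: bytes, host: str = STATSD_HOST, port: int = STATSD_PORT) -> None:
    """Send one statsd datagram over UDP (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

A worker would call something like `push_metric(format_counter("batch.succeeded"))` after each processed batch (the metric name is hypothetical). Because the datagram goes to an agent that does not live on the worker's node, the counts survive the removal of that node.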

@RobertLucian RobertLucian added the tests Something related to testing label Apr 2, 2021
@miguelvr miguelvr added BatchAPI Something related to the BatchAPI kind metrics Related to metrics or dashboards and removed tests Something related to testing labels Apr 13, 2021
@deliahu deliahu added this to the v0.37 milestone Jun 22, 2021
4 participants