Provide metric for tasks queue when using limit-active-tasks placement strategy #5057

Closed
tenjaa opened this issue Jan 22, 2020 · 4 comments · Fixed by #5448

Comments

@tenjaa
Contributor

tenjaa commented Jan 22, 2020

What challenge are you facing?

We switched to the limit-active-tasks placement strategy, and so far it has solved a lot of our problems. We would now like to go further and scale our workers based on the size of the task queue.
We are running our Concourse in a Kubernetes environment.

What would make this better?

A metric for the size of the task queue would let us scale our workers based on demand.

Are you interested in implementing this yourself?

Sure :)
We already saw that the first proposed implementation had this metric exported: #4612

@jamieklassen
Member

jamieklassen commented Jan 24, 2020

Rather than implementing a full queue (like concurrency-safe FIFO guarantee or any kind of priority - which does not currently exist, for the record!) I suspect it would be enough for you to emit a metric whenever the

All workers are busy at the moment, please stand-by.

event happens, which seems to be around this block of code. I can imagine this being a strong enough heuristic to say "my workers are getting busy".

Then I'm thinking you could autoscale depending on how often this event has occurred in the last hour (or whatever granularity/tuning makes sense)?

Frankly I'm not sharp when it comes to k8s autoscaling, so I'd need a sanity check on this assertion. Am I making sense? off-base?
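
One way to read the suggestion above is as a sliding-window count of "all workers are busy" events that feeds a scaling decision. The following is a minimal sketch under that assumption. `BusyEventWindow`, `desired_workers`, the one-hour window, the threshold of 10 events and the cap of 20 workers are hypothetical names and values chosen for illustration; none of them come from Concourse or Kubernetes.

```python
from collections import deque


class BusyEventWindow:
    """Counts "all workers busy" events seen within a trailing time window."""

    def __init__(self, window_seconds=3600.0):
        self.window_seconds = window_seconds
        self._timestamps = deque()  # event times in seconds, oldest first

    def record(self, now):
        """Record one occurrence of the event at time `now` (seconds)."""
        self._timestamps.append(now)

    def count(self, now):
        """Return how many recorded events fall inside the trailing window."""
        cutoff = now - self.window_seconds
        # Drop events that have fallen out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


def desired_workers(current, busy_events, threshold=10, max_workers=20):
    """Hypothetical policy: add one worker when busy events exceed the threshold."""
    if busy_events > threshold:
        return min(current + 1, max_workers)
    return current
```

In practice the window and threshold would need tuning against real load, as the comment above already hints.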

@tenjaa
Contributor Author

tenjaa commented Jan 25, 2020

Oh, I only looked at the linked proposed implementation and assumed the final one was more or less the same, and that exposing the metric had simply been forgotten.

In general it should be possible to build a custom metric based on the logs.
Using heuristics we could probably say "three logs in the last minute => three jobs in the queue".
But the Prometheus FAQ advises against it: https://prometheus.io/docs/introduction/faq/#how-to-feed-logs-into-prometheus

What do you think about not having a queue but a counter?
Maybe even the number of currently active tasks would be enough. If we just subtract the number of workers (which is already known), we also have the number of unscheduled tasks.
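
The subtraction described above can be sketched as follows. This is an illustration only: `unscheduled_tasks` and `max_active_tasks_per_worker` are hypothetical names, and the sketch assumes that the "active tasks" figure also includes tasks still waiting for a worker, which this thread does not confirm. With a per-worker limit of 1 it reduces to active tasks minus workers.

```python
def unscheduled_tasks(active_tasks, workers, max_active_tasks_per_worker=1):
    """Estimate how many tasks are waiting for a free worker slot.

    Assumption: `active_tasks` counts both running and waiting tasks.
    """
    if active_tasks < 0 or workers < 0 or max_active_tasks_per_worker < 1:
        raise ValueError("counts must be non-negative and the limit at least 1")
    # Total number of tasks the current workers can run at the same time.
    capacity = workers * max_active_tasks_per_worker
    # Never report a negative backlog when workers are idle.
    return max(0, active_tasks - capacity)
```

For example, 7 active tasks on 4 workers with a limit of 1 task each would leave 3 tasks unscheduled under these assumptions.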

@tenjaa
Contributor Author

tenjaa commented Mar 5, 2020

Hey @pivotal-jamie-klassen, do you have any feedback on the counter idea?
Would it be fine if we proposed a pull request for it?

@jamieklassen
Member

@tenjaa seems fair to me. Especially if you experiment with using that metric in your own environment! I guess you'd probably emit a counter per worker.
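
A per-worker count of active tasks could look roughly like the sketch below. It is not how Concourse implements its metrics: `ActiveTasksPerWorker` and the metric name `example_active_tasks` are hypothetical, and worker names are assumed to contain no quotes or backslashes. Because the value goes down as well as up, the sketch renders it as a gauge in the Prometheus text exposition format.

```python
import threading


class ActiveTasksPerWorker:
    """Tracks the number of currently active tasks for each worker."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = {}  # worker name -> number of active tasks

    def task_started(self, worker):
        with self._lock:
            self._active[worker] = self._active.get(worker, 0) + 1

    def task_finished(self, worker):
        with self._lock:
            current = self._active.get(worker, 0)
            # Ignore a finish for a worker with no tracked tasks.
            if current > 0:
                self._active[worker] = current - 1

    def snapshot(self):
        """Return a copy of the current per-worker counts."""
        with self._lock:
            return dict(self._active)

    def render(self, metric_name="example_active_tasks"):
        """Render the counts as one labelled sample per worker."""
        lines = [f"# TYPE {metric_name} gauge"]
        for worker, value in sorted(self.snapshot().items()):
            lines.append(f'{metric_name}{{worker="{worker}"}} {value}')
        return "\n".join(lines) + "\n"
```

Summing the per-worker values gives the total number of active tasks, so the per-worker form loses nothing compared with a single global value.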
