
Improve metrics #3973

Open
ukabu opened this issue Jun 4, 2019 · 11 comments

@ukabu

ukabu commented Jun 4, 2019

There are many things here, maybe we should split them after discussion

What challenge are you facing?

We are running multiple Concourse 5.1 installations (soon to be 5.2). Our largest one has more than 180 teams, more than 13,000 jobs per week, and more than 160K resource checks/hour. We have peaks of more than 300 jobs/hour. We run 6 ATCs and 41 workers.

We have a difficult time inferring how our Concourse is used by our users. Some existing metrics are missing important labels, and it's very difficult to cross-reference performance issues in jobs with worker performance metrics.

We currently use Prometheus to collect metrics.

What would make this better?

  • Task-level metrics labeled with team, pipeline, job, task, docker-image: this would help us infer the type of tasks run by our teams (see the sketch after this list)
  • Proper labeling of job-level metrics. There are currently no labels for the job name (some metrics have had it since 5.1, but not all).
  • Label all metrics with the Concourse name (we have multiple Concourse installations that use the same Prometheus)
  • Label all pipeline/job/task metrics with the worker guid. This would help us correlate/explain why some jobs are very slow at times.
  • Remove duplicate metrics (worker volumes and containers). Currently, all ATCs export more or less the same metrics for worker containers and volumes. We average them, but it would be much better if they weren't duplicated.

We've also seen a correlation between ATC performance and Prometheus. If, for some reason, Prometheus is slow to collect the exported metrics, ATC memory utilization starts to rise.

Are you interested in implementing this yourself?

We would, but have little experience in Go today (and many other things to do).

@marco-m
Contributor

marco-m commented Jun 4, 2019

There are a lot of open tickets related to metrics improvements and Prometheus. Did you search the existing tickets before opening this one?

@ukabu
Author

ukabu commented Jun 4, 2019

Related issues:

Remove duplicate metrics (worker volumes and containers)

Proper labeling of job level metrics

Label all metrics with Concourse name

@jchesterpivotal
Contributor

Also #3958

@eedwards-sk

Also #4038 and #3196

@stale

stale bot commented Aug 19, 2019

Beep boop! This issue has been idle for long enough that it's time to check in and see if it's still important.

If it is, what is blocking it? Would anyone be interested in submitting a PR or continuing the discussion to help move things forward?

If no activity is observed within the next week, this issue will be exterminated (closed), in accordance with our stale issue process.

@stale stale bot added the wontfix label Aug 19, 2019
@eedwards-sk

Stalebot, sit! Good bot. Please keep this issue open. Metrics are still a bit of a mess and need unification and improvements per-provider.

@stale stale bot removed the wontfix label Aug 19, 2019
@cirocosta
Member

Hey, I started experimenting with trimming down the emitters to just Prometheus and having it cover more of the inner workings of the web and worker nodes: #4247

Please let me know what you think (in the PR, please 😁)!

The tl;dr is that by doing so we could:

  • focus on making Concourse observable by letting others pull info from it in a format that is becoming ubiquitous (the Prometheus exposition format)
  • make it easier for us (contributors too!) to expose metrics (see the sketch after this list)
  • not re-implement logic around pushing metrics to different systems (other tools already do that better than we do: https://github.com/influxdata/telegraf/tree/master/plugins/outputs)
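
For context, here is a generic sketch of what exposing internal state through the exposition format could look like with a custom prometheus/client_golang Collector. This is not Concourse code; the metric name and values are made up.

```go
// Generic sketch of exposing internal state (e.g. per-worker container
// counts) via a custom Prometheus Collector. NOT Concourse code; the
// metric name and values are made up.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

type containerCollector struct {
	desc *prometheus.Desc
}

func (c *containerCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c *containerCollector) Collect(ch chan<- prometheus.Metric) {
	// In a real emitter these counts would come from the workers/DB;
	// hard-coded here purely for illustration.
	for worker, n := range map[string]float64{"worker-1": 120, "worker-2": 87} {
		ch <- prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, n, worker)
	}
}

func main() {
	prometheus.MustRegister(&containerCollector{
		desc: prometheus.NewDesc(
			"concourse_worker_containers", // hypothetical name
			"Number of containers on a worker.",
			[]string{"worker"}, nil,
		),
	})
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9100", nil))
}
```

Anything that speaks the exposition format (Prometheus itself, Telegraf's prometheus input, etc.) can then pull these values, which is what makes the pull-only approach attractive.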

I'd be very happy to "hear" your concerns / thoughts on it.

thank you!

@stale

stale bot commented Oct 25, 2019

Beep boop! This issue has been idle for long enough that it's time to check in and see if it's still important.

If it is, what is blocking it? Would anyone be interested in submitting a PR or continuing the discussion to help move things forward?

If no activity is observed within the next week, this issue will be exterminated (closed), in accordance with our stale issue process.

@stale stale bot added the wontfix label Oct 25, 2019
@lrstanley
Member

stalebot pls

@stale stale bot removed the wontfix label Oct 26, 2019
@stale

stale bot commented Dec 25, 2019

Beep boop! This issue has been idle for long enough that it's time to check in and see if it's still important.

If it is, what is blocking it? Would anyone be interested in submitting a PR or continuing the discussion to help move things forward?

If no activity is observed within the next week, this issue will be exterminated (closed), in accordance with our stale issue process.

@stale stale bot added the wontfix label Dec 25, 2019
@vito
Member

vito commented Dec 25, 2019

Lemme just slap a label on here to calm the stale bot down. :P Seems like this is useful if not just as an aggregator.

@stale stale bot removed the wontfix label Dec 25, 2019