
Improve metrics #3973

Open
ukabu opened this issue Jun 4, 2019 · 11 comments

@ukabu

ukabu commented Jun 4, 2019

There are many things here, maybe we should split them after discussion

What challenge are you facing?

We are running multiple Concourse 5.1 installations (soon to be 5.2). Our largest one has more than 180 teams, more than 13,000 jobs per week, and more than 160K resource checks/hour. We have peaks of more than 300 jobs/hour. We run 6 ATCs and 41 workers.

We have a difficult time inferring how our Concourse is used by our users. Some existing metrics are missing important labels, and it's very difficult to cross-reference performance issues in jobs with worker performance metrics.

We currently use Prometheus to collect metrics.

What would make this better?

  • Task-level metrics labeled with team, pipeline, job, task, docker-image: this would help us infer the type of tasks run by our teams (see the sketch after this list)
  • Proper labeling of job-level metrics. There are currently no labels for the job name (some metrics have had it since 5.1, but not all).
  • Label all metrics with the Concourse name (we have multiple Concourse installations that use the same Prometheus)
  • Label all pipeline/job/task metrics with the worker guid. This would help us correlate/explain why some jobs are very slow at times.
  • Remove duplicate metrics (worker volumes and containers). Currently, all ATCs export more or less the same metrics for worker containers and volumes. We average them, but it would be much better if they weren't duplicated.

We've also seen a correlation between ATC performance and Prometheus. If, for some reason, Prometheus is slow to collect the exported metrics, ATC memory utilization starts to rise.

Are you interested in implementing this yourself?

We would, but have little experience in Go today (and many other things to do).

@marco-m
Contributor

marco-m commented Jun 4, 2019

There are a lot of open tickets related to metrics improvements and Prometheus. Did you search the existing tickets before opening this one?

@ukabu
Author

ukabu commented Jun 4, 2019

Related issues:

Remove duplicate metrics (worker volumes and containers)

Proper labeling of job level metrics

Label all metrics with Concourse name

@jchesterpivotal
Contributor

Also #3958

@eedwards-sk

Also #4038 and #3196

@stale

stale bot commented Aug 19, 2019

Beep boop! This issue has been idle for long enough that it's time to check in and see if it's still important.

If it is, what is blocking it? Would anyone be interested in submitting a PR or continuing the discussion to help move things forward?

If no activity is observed within the next week, this issue will be exterminated (closed), in accordance with our stale issue process.

@stale stale bot added the wontfix label Aug 19, 2019
@eedwards-sk

Stalebot, sit! Good bot. Please keep this issue open. Metrics are still a bit of a mess and need unification and improvements per-provider.

@stale stale bot removed the wontfix label Aug 19, 2019
@cirocosta
Member

Hey, I started experimenting with trimming down the emitters to just Prometheus and having it cover more of the inner workings of the web and worker nodes: #4247

Please let me know what you think (in the PR, please 😁)!

The tl;dr is that by doing so we could:

  • focus on making Concourse observable by letting others pull info from it in a format that is becoming ubiquitous (the Prometheus exposition format)
  • make it easier for us (contributors too!) to expose metrics (see the sketch after this list)
  • not re-implement logic around pushing metrics to different systems (other tools already do that better than we do: https://github.com/influxdata/telegraf/tree/master/plugins/outputs)
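
For context, here is a generic sketch of what exposing internal state through the exposition format could look like with a custom prometheus/client_golang Collector. This is not Concourse code; the metric name and values are made up.

```go
// Generic sketch of exposing internal state (e.g. per-worker container
// counts) via a custom Prometheus Collector. NOT Concourse code; the
// metric name and values are made up.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

type containerCollector struct {
	desc *prometheus.Desc
}

func (c *containerCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c *containerCollector) Collect(ch chan<- prometheus.Metric) {
	// In a real emitter these counts would come from the workers/DB;
	// hard-coded here purely for illustration.
	for worker, n := range map[string]float64{"worker-1": 120, "worker-2": 87} {
		ch <- prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, n, worker)
	}
}

func main() {
	prometheus.MustRegister(&containerCollector{
		desc: prometheus.NewDesc(
			"concourse_worker_containers", // hypothetical name
			"Number of containers on a worker.",
			[]string{"worker"}, nil,
		),
	})
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9100", nil))
}
```

Anything that speaks the exposition format (Prometheus itself, Telegraf's prometheus input, etc.) can then pull these values, which is what makes the pull-only approach attractive.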

I'd be very happy to "hear" your concerns / thoughts on it.

thank you!

@stale

stale bot commented Oct 25, 2019

Beep boop! This issue has been idle for long enough that it's time to check in and see if it's still important.

If it is, what is blocking it? Would anyone be interested in submitting a PR or continuing the discussion to help move things forward?

If no activity is observed within the next week, this issue will be exterminated (closed), in accordance with our stale issue process.

@stale stale bot added the wontfix label Oct 25, 2019
@lrstanley
Member

stalebot pls

@stale stale bot removed the wontfix label Oct 26, 2019
@stale

stale bot commented Dec 25, 2019

Beep boop! This issue has been idle for long enough that it's time to check in and see if it's still important.

If it is, what is blocking it? Would anyone be interested in submitting a PR or continuing the discussion to help move things forward?

If no activity is observed within the next week, this issue will be exterminated (closed), in accordance with our stale issue process.

@stale stale bot added the wontfix label Dec 25, 2019
@vito
Member

vito commented Dec 25, 2019

Lemme just slap a label on here to calm the stale bot down. :P Seems like this is useful if not just as an aggregator.

@stale stale bot removed the wontfix label Dec 25, 2019