
prometheus: Remove job scheduling metric #274

Merged
merged 3 commits into concourse:master from databus23:prometheus-job-label on May 30, 2018

Conversation

@databus23
Contributor

databus23 commented May 17, 2018

For each pipeline the job_scheduling_duration metric is emitted for each job in that pipeline during each scheduling tick. Without a job label the Prometheus metric is not really helpful, as it only reflects the duration of whichever job happened to be emitted last before the scrape.

/cc @TimSimmons
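For context, a minimal sketch of the change as originally proposed, using prometheus/client_golang; the metric and label names below are illustrative, not Concourse's actual ones. The point is that a per-job label turns one constantly overwritten gauge into one series per job.

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Illustrative only: a gauge labeled by pipeline and job, so each job keeps
// its own series instead of overwriting a single unlabeled value every tick.
var jobSchedulingDuration = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: "concourse",
		Subsystem: "pipelines",
		Name:      "job_scheduling_duration_seconds",
		Help:      "Time taken to schedule an individual job during the last tick.",
	},
	[]string{"pipeline", "job"},
)

func init() {
	prometheus.MustRegister(jobSchedulingDuration)
}

// recordJobScheduling would be called once per job on every scheduling tick.
func recordJobScheduling(pipeline, job string, elapsed time.Duration) {
	jobSchedulingDuration.WithLabelValues(pipeline, job).Set(elapsed.Seconds())
}
```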

@TimSimmons


Contributor

TimSimmons commented May 17, 2018

I can see what you're getting at here. I wonder if it might be more useful if these scheduling duration metrics were a Histogram? That way you'd be able to get an idea across all the instances of the event, vs. just the last one.

My other concern with this is the metric cardinality this will result in. I won't pretend I'm intimately familiar with the internals here, but is it useful to know how long a certain job in a pipeline takes to schedule? What makes jobs within a pipeline different as far as these scheduling durations go?
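A rough sketch of the histogram alternative suggested here, again with hypothetical names: every scheduling observation lands in a bucket, so a scrape reflects the whole distribution rather than only the last-emitted value. Labeling only by pipeline keeps the per-job label out of the series count, at the cost of one series per bucket per pipeline.

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical histogram variant of the scheduling metric.
var jobSchedulingHistogram = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "concourse",
		Subsystem: "pipelines",
		Name:      "job_scheduling_duration_seconds",
		Help:      "Distribution of job scheduling durations.",
		// 10ms up to ~20s; real bucket boundaries would need tuning.
		Buckets: prometheus.ExponentialBuckets(0.01, 2, 12),
	},
	[]string{"pipeline"},
)

// observeJobScheduling records one scheduling duration into the histogram.
func observeJobScheduling(pipeline string, elapsed time.Duration) {
	jobSchedulingHistogram.WithLabelValues(pipeline).Observe(elapsed.Seconds())
}
```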

@databus23


Contributor

databus23 commented May 17, 2018

Regarding the histogram idea, I don't think that's needed in this case. Each pipeline is processed in its own loop with a sleep of 10 seconds, so the events for a pipeline/job are emitted sequentially at a rate no faster than 6 times per minute. I think it's more helpful to have exact durations for the sampled scheduling ticks than a histogram providing less precision and higher metric cardinality.

Regarding the overall cardinality, it depends heavily on the size of the Concourse deployment (number of pipelines and jobs per pipeline), but it could easily be hundreds of metrics. Nonetheless I would prefer to have these metrics available.
We have been facing serious performance problems with our rather large Concourse deployment for some time now and are still trying to understand which pipelines are causing them.
As I understand it, the time it takes to schedule an individual pipeline job depends mainly on how many versions are recorded for the resources that go into that job.

For example, we had a poorly written custom resource that accidentally discovered a new version during each check run (every 30 seconds). Because we didn't trigger on the resource, we didn't notice it at first. After a few months that resource had 300k versions in the database, and all jobs using that resource took several minutes to schedule.
Having those per-job scheduling metrics available would have helped a lot in this case.
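For a purely hypothetical sense of scale: the series count for a per-job label is roughly pipelines × jobs per pipeline, so a deployment with, say, 100 pipelines averaging 10 jobs each would mean on the order of 100 × 10 = 1,000 time series for this one metric family alone.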

@TimSimmons


Contributor

TimSimmons commented May 18, 2018

Fair enough, if your large concourse deployment would appreciate it, I'm sure ours would too.

@vito


Member

vito commented May 23, 2018

Hmm, I also share @TimSimmons's concerns over cardinality here. These would definitely be valuable in understanding why your Concourse is falling over, but that may in turn make people's InfluxDB fall over instead.

Could you instead use the logging-based metrics emitter for debugging this? That won't work right now because I'm pretty sure we only allow one emitter to be configured, but that could be an easy change. You could then configure --prometheus-... and --emit-to-logs, using your log sink to view higher-fidelity metrics like per-job scheduling times, without making Prometheus unhappy in large multi-tenant installations.

@databus23


Contributor

databus23 commented May 23, 2018

@vito This change is not adding any new metrics. It's just exposing to Prometheus the metrics that are already emitted, with the correct label set. The InfluxDB emitter already sends this out periodically for every job.

Do we have any numbers available about how many distinct jobs are defined in a large concourse installation so that we can reason about the cardinality?

@databus23


Contributor

databus23 commented May 23, 2018

@TimSimmons speaking of cardinality: I doubled that with the last commit by switching from a single gauge per job to two counters. I noticed that gauges have the problem that you can't really tell when a metric goes away (e.g. the job is deleted): the exporter will happily serve the last emitted value until the application is restarted.

In general I would suggest switching to counters for the other pipeline scheduling metrics (duration_full and duration_loading) as well, so that deleted pipelines are properly reflected in the graphs (e.g. a rate of 0).
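A sketch of the two-counter pattern described here, with hypothetical names: one counter accumulates scheduling seconds and another counts ticks. Both stop increasing when a job is deleted, so its rate drops to zero instead of a stale gauge value lingering forever.

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical counter pair standing in for the per-job gauge.
var (
	jobSchedulingSeconds = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Namespace: "concourse",
			Subsystem: "pipelines",
			Name:      "job_scheduling_duration_seconds_total",
			Help:      "Accumulated time spent scheduling each job.",
		},
		[]string{"pipeline", "job"},
	)
	jobSchedulingTicks = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Namespace: "concourse",
			Subsystem: "pipelines",
			Name:      "job_scheduling_ticks_total",
			Help:      "Number of scheduling ticks observed per job.",
		},
		[]string{"pipeline", "job"},
	)
)

// recordJobSchedulingTick adds one tick's duration and increments the tick count.
func recordJobSchedulingTick(pipeline, job string, elapsed time.Duration) {
	jobSchedulingSeconds.WithLabelValues(pipeline, job).Add(elapsed.Seconds())
	jobSchedulingTicks.WithLabelValues(pipeline, job).Inc()
}
```

At query time the average duration per tick can then be derived as the ratio of the two rates, and a job that disappears simply flatlines both counters.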

@vito


Member

vito commented May 23, 2018

@databus23 Sorry, I meant Prometheus when I said InfluxDB. We've had cardinality issues with InfluxDB in the past. I remember that before Prometheus was added as a first-class metric emitter, folks using the Riemann -> Prometheus gateway had their ops team complaining about the cardinality killing their cluster, so I'm being careful here too. (Wait, was that you?)

We have an instance with 2,077 active jobs. That could realistically go much higher: we have a few large companies interested in having hundreds of teams sharing deployments, and as a design goal we would like to be able to support arbitrarily large public instances.

@databus23


Contributor

databus23 commented May 23, 2018

Hmm, our PostgreSQL database has 2,284 rows in the jobs table; that could be a bit much.

@databus23


Contributor

databus23 commented May 23, 2018

@vito Wasn't me :) I just finally got around to enabling metrics using the Prometheus emitter. I tried going from Riemann to Prometheus but gave up.

OK, but that probably means we can't expose the job-specific metrics in Prometheus, and we should remove the metric that is currently there (without the job label), as it's bogus.

databus23 added some commits May 17, 2018

Switch to counters for the job scheduling duration
Using gauges has the drawback that the last emitted value will persist even if the job is removed and no new metrics are emitted for this particular job.

@databus23 databus23 force-pushed the databus23:prometheus-job-label branch from 8860cb1 to fd160b5 May 23, 2018

@databus23


Contributor

databus23 commented May 29, 2018

@vito @TimSimmons Should I update the pull request to remove the job scheduling metric then? The current implementation is misleading and adding the appropriate job label would increase the metric cardinality significantly.

@vito


Member

vito commented May 29, 2018

@databus23 Yeah, I think that's a good idea if the data it's emitting is impossible to consume usefully.

remove job scheduling metric
This metric is not useful in its current form, as it's not labeled with the job name. Adding the job name could lead to an unwieldy number of time series, as there can be several thousand active jobs in a larger Concourse deployment.

@databus23 databus23 changed the title from prometheus: Add job label to job scheduling metric to prometheus: Remove job scheduling metric May 30, 2018

@databus23


Contributor

databus23 commented May 30, 2018

@vito @TimSimmons Ok I updated the PR to remove the job scheduling metric for now. PTAL

@vito vito added the enhancement label May 30, 2018

@vito


Member

vito commented May 30, 2018

Thanks!

@vito vito merged commit 254462b into concourse:master May 30, 2018

2 checks passed

ci/pivotal-cla: Thank you for signing the Contributor License Agreement!
concourse-ci/status: Concourse CI build success

@vito vito modified the milestone: v3.14.0 Jun 5, 2018
