experiment: standardize on prometheus for metrics #4247

Draft · cirocosta wants to merge 1 commit into master from cirocosta:prom-only

Conversation

@cirocosta (Member) commented Aug 9, 2019

Hey,

I wanted to try out the idea of getting rid of all of our emitters and standardizing on
Concourse being a system that can be observed without it having to worry about emitting its
measurements to other systems (i.e., adopting a pull-based approach and nothing more).

	
	something that cares 
	about the state of a		(e.g., prometheus, telegraf ...)
	concourse installation
	         |
	         |
	    "how are you doing?"
	         |
	         |
	         *---> concourse
	                 http://concourse:$prom_port/metrics


I haven't yet converted all of the metrics that we have, and IMO that shouldn't be done as
a 1:1 mapping anyway - there's probably a set of metrics that we just don't care about, and
others that we do care about but never implemented.

Also, I tried to be super conservative when it comes to labeling, in order to avoid
future problems with cardinality explosion, so all measurements (at the moment) have
bounded cardinality. That might mean they're not super useful for very context-specific
questions like "is this particular job getting slower over time?".

ps.: this is just an experiment to think about the area of consolidating the emitters.

I didn't play with the Telegraf Prometheus input plugin (see
https://github.com/influxdata/telegraf/tree/master/plugins/inputs/prometheus), which should
"unlock" Prometheus-based metrics for those who can't just use Prometheus and need
something to emit metrics to their systems, but I'd be curious to know if other people
did!

Also, do you see this being problematic in any sense in the future? Are super high-context
metrics relevant for you? (If so, should we start thinking about bringing other approaches
that can drive this "high-context" observability - like distributed tracing - into the
mix?)

I'd really appreciate your feedback!

Thanks!

Makefile Outdated
@@ -0,0 +1,54 @@
all: run

@cirocosta (Author, Member) commented Aug 9, 2019

(this is just for me to be able to more easily run this without having web running in a container 😅 )

make database
make worker
make run

(I should've named the `run` target `web`.)

@cirocosta (Member, Author) commented Aug 9, 2019

In terms of "developer experience", it ends up looking like this for us:

  1. add the definition of your measurement to metrics.go

	var (
		// other metrics ...

		ResourceChecksDuration = prometheus.NewHistogramVec(
			prometheus.HistogramOpts{
				Name: "concourse_resource_checks_duration_seconds",
				Help: "How long resource checks take",
			},
			[]string{LabelStatus},
		)
	)

  2. update the in-memory representation of your measurement wherever it happens

	metric.
		ResourceChecksDuration.
		WithLabelValues("failed").
		Observe(checkDuration.Seconds())
@cirocosta (Member, Author) commented Aug 9, 2019

Something I like about this is that instrumenting individual components of Concourse would
be simpler, as one would just go through that simple model of "I define a variable that
represents the measurement", and then "in my code, I increment it".

That'd mean having an easy time adding metrics to components like tsa, beacon,
baggageclaim ... etc

And yeah, in case your worker doesn't run in a network that has Prometheus scraping it,
you could have Telegraf sending that data to a datasource that your Grafana dashboard
could consume stats from.

Or, in case your workers run in a cluster where you do have a Prometheus server
running, just scrape those components directly.

@cirocosta (Member, Author) commented Aug 9, 2019


	(same cluster, diff networks and diff proms)

	.--------network 1-----
	|
	|    .---.--prom--.-----.
	|    |   |        |     |
	|  atc1 tsa1     atc2 tsa2
	|
	|


	.--------network 2------------
	|
	|    .------.---prom2-.------.
	|    |      |         |      |
	|  beacon1 bc1      beacon2 bc2 ...



or


	(all managed in the same cluster & network)

	.--------network 1-----
	|
	|    .---.--prom--.-----.---------.------.---------.------.
	|    |   |        |     |         |      |         |      |
	|  atc1 tsa1     atc2 tsa2        |      |         |      |
	|                              beacon1  bc1     beacon2  bc2 ...


or


	(external workers connected to a given cluster)

	.--------network N------------
	|
	|    .---------------------------> datadog
	|    |        .------------------> datadog ...
	|    |        |
	|  telegraf telegraf
	|    |        |     
	|  beacon1   bc1    


"resource": event.ResourceName,
"team_name": event.TeamName,
},
ResourceChecksDuration = prometheus.NewHistogramVec(

@cirocosta (Author, Member) commented Aug 9, 2019

here's an example of how you'd be defining the measurement


	if err != nil {
		if rErr, ok := err.(resource.ErrResourceScriptFailed); ok {
			logger.Info("check-failed", lager.Data{"exit-status": rErr.ExitStatus})
			metric.
				ResourceChecksDuration.
				WithLabelValues("failed").
				Observe(checkDuration.Seconds())

@cirocosta (Author, Member) commented Aug 9, 2019

the corresponding example of how you leverage that measurement you defined in metrics.go

Runner: metric.PeriodicallyEmit(
logger.Session("periodic-metrics"),
10*time.Second,
Name: "metrics-handler",

@cirocosta (Author, Member) commented Aug 9, 2019

this is all you need to have an endpoint exporting Prometheus metrics
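
For context, here's a rough sketch of what serving that endpoint looks like with client_golang (the registry variable and port are just placeholders, not necessarily what this branch does):

	package main

	import (
		"net/http"

		"github.com/prometheus/client_golang/prometheus"
		"github.com/prometheus/client_golang/prometheus/promhttp"
	)

	func main() {
		// a dedicated registry holding the metrics defined in metrics.go
		registry := prometheus.NewRegistry()

		// serve the Prometheus text exposition format to whatever scrapes us
		http.Handle("/metrics", promhttp.HandlerFor(registry, promhttp.HandlerOpts{}))
		http.ListenAndServe(":9100", nil) // port is illustrative
	}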

@kcmannem (Member) commented Aug 9, 2019

I like this.

"That'd mean having an easy time adding metrics to components like tsa, beacon,
baggageclaim ... etc"

^ that's gonna be super useful to us on runtime.

@cirocosta cirocosta force-pushed the cirocosta:prom-only branch 3 times, most recently from e8c1b46 to 241d00d Aug 25, 2019
@cirocosta (Member, Author) commented Aug 26, 2019

@kcmannem @ddadlani I started collecting metrics from the workers:

[screenshot: metrics collected from the workers (Screen Shot 2019-08-26 at 8 56 11 AM)]

what other measurements would be nice to capture that aren't possible to capture today? 🤔

@cirocosta (Member, Author) commented Aug 26, 2019

Moving on, some things that could make this better (points 1 and 2 are sketched below):

  1. use prometheus.Timer when interacting with observers (histograms & summaries)
  2. potentially (?) move to summaries rather than using histograms
  • there's CPU and memory cost associated with performing the quantile
    computations in the client
    • how costly? I'm not very sure.
    • keeping a small Objectives map (quantiles to compute) would probably
      make this not very CPU intensive, but, still, a stream needs to be
      kept in memory
  3. investigate how Influx's Telegraf Prometheus input plays with these metrics
  • I'm really not sure what these metrics would look like when exported.
    • my guess is that counters & gauges would be "as expected", but histograms
      would look weird. moving to summaries could make it better, though
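
Here's a rough sketch of points 1 and 2 (the metric name, label, and objectives are placeholders, not necessarily what's in this branch):

	package metric

	import "github.com/prometheus/client_golang/prometheus"

	// a summary with a small Objectives map instead of a histogram; each label
	// combination yields only the configured quantiles plus _sum and _count
	var ResourceChecksDuration = prometheus.NewSummaryVec(
		prometheus.SummaryOpts{
			Name:       "concourse_resource_checks_duration_seconds",
			Help:       "How long resource checks take",
			Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
		},
		[]string{"status"},
	)

	func timedCheck(doCheck func() error) error {
		// prometheus.NewTimer wraps any Observer (histogram or summary) and
		// records the elapsed time when ObserveDuration is called.
		// (in real code you'd pick the status label after the check completes)
		timer := prometheus.NewTimer(ResourceChecksDuration.WithLabelValues("succeeded"))
		defer timer.ObserveDuration()

		return doCheck()
	}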
@cirocosta (Member, Author) commented Aug 26, 2019

There are certain metrics here that are clearly lacking:

  • number of teams
  • number of pipelines
  • ...

These metrics, which are predominantly statistics from the database, are missing, as there's not a very good way of having them exposed by a single component (e.g., it'd be weird to have two ATCs each reporting the number of teams they just fetched from the database).

For that case, I'm leaning towards having a second component (https://github.com/cirocosta/flight_recorder/ 😅), rather than having a "periodic emitter" like we did before 🤔

Wdyt?

thanks!

@@ -43,6 +46,9 @@ func (wc *workerCollector) Run(ctx context.Context) error {

if len(affected) > 0 {
logger.Info("marked-workers-as-stalled", lager.Data{"count": len(affected), "workers": affected})

// workers_stalled

@cirocosta (Author, Member) commented Aug 26, 2019

Would it make more sense to have it as workers_marked_stalled as that's the actual action taking place? 🤔

That way it'd make sense to be per-ATC (still being possible to be summed up).
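
One possible shape for that rename (a sketch against client_golang, not the code in this diff; the metric name is an assumption):

	package metric

	import "github.com/prometheus/client_golang/prometheus"

	// a plain counter, incremented by however many workers this ATC just
	// marked as stalled; summing across ATCs gives the cluster-wide view
	var WorkersMarkedStalled = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "concourse_workers_marked_stalled_total",
		Help: "Number of workers this ATC has marked as stalled.",
	})

	// inside workerCollector.Run, after marking:
	// WorkersMarkedStalled.Add(float64(len(affected)))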

@@ -57,6 +59,13 @@ func (w workerHelper) createGardenContainer(
Env: env,
Properties: gardenProperties,
})

metrics.

@cirocosta (Author, Member) commented Aug 26, 2019

This is web's perspective of how long container creation is taking.

I wonder if this would be more valuable if we included the worker name in the label set, so that we could tell which particular worker is taking a long time to create containers 🤔

Naturally, in the best scenario we'd have distributed tracing in a way that would allow us to have all the context and separate the time taken as seen from web from the time taken by just Garden, but I think what we have for now (+ having the worker name in the label set) would be good enough
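
A sketch of that worker-labelled variant (the metric name and label are assumptions; worker names are bounded by cluster size, so cardinality stays reasonable):

	package metric

	import "github.com/prometheus/client_golang/prometheus"

	var ContainersCreationDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name: "concourse_containers_creation_duration_seconds",
			Help: "How long garden container creation takes, as seen from web.",
		},
		[]string{"worker"},
	)

	// at the call site in createGardenContainer:
	// ContainersCreationDuration.WithLabelValues(workerName).Observe(elapsed.Seconds())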


	HttpResponseDuration.
		With(prometheus.Labels{
			"code": strconv.Itoa(metrics.Code),

@cirocosta (Author, Member) commented Aug 26, 2019

I don't think there's a lot of value in having the plain status code here - we could, instead, move to status classes, reducing the number of possible values down to 4 (2xx, 3xx, 4xx, 5xx) - just what we care about
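
A minimal sketch of that collapsing, using a hypothetical helper:

	package metric

	// codeClass maps a concrete HTTP status code to its class, bounding the
	// label to a handful of possible values.
	func codeClass(code int) string {
		switch {
		case code < 200:
			return "1xx"
		case code < 300:
			return "2xx"
		case code < 400:
			return "3xx"
		case code < 500:
			return "4xx"
		default:
			return "5xx"
		}
	}

	// HttpResponseDuration.With(prometheus.Labels{"code": codeClass(metrics.Code), ...})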


	var (
		slowBuckets = []float64{
			1, 30, 60, 120, 180, 300, 600, 900, 1200, 1800, 2700, 3600, 7200, 18000, 36000,

@cirocosta (Author, Member) commented Aug 26, 2019

ooh yeah, this bad boy here ends up generating 16 bucket time series for each label combination we add 😅 moving to summaries would reduce quite a bit the load we put on the server (especially given that slowBuckets is used exclusively for build time - the metric with the highest cardinality).
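
For a rough sense of the arithmetic: a histogram with these 15 explicit buckets exposes 15 `_bucket` series plus the implicit `+Inf` bucket, `_sum`, and `_count`, i.e. 18 series per label combination, while a summary with, say, three objectives exposes 3 quantile series plus `_sum` and `_count`, i.e. 5.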

registry.MustRegister(collector)
}

registry.MustRegister(

@cirocosta (Author, Member) commented Aug 26, 2019

I'm still not very happy with having to remember to add each variable here, but I couldn't think of a good way of registering them automatically (without doing something like what promauto does - reimplementing the constructors).
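
For reference, a sketch of that promauto route, assuming a client_golang version that ships promauto.With - registration then happens at definition time, at the cost of depending on the wrapped constructors:

	package metric

	import (
		"github.com/prometheus/client_golang/prometheus"
		"github.com/prometheus/client_golang/prometheus/promauto"
	)

	var registry = prometheus.NewRegistry()

	// promauto.With returns a factory bound to the registry, so the metric is
	// registered as soon as the variable is initialized.
	var ResourceChecksDuration = promauto.With(registry).NewHistogramVec(
		prometheus.HistogramOpts{
			Name: "concourse_resource_checks_duration_seconds",
			Help: "How long resource checks take",
		},
		[]string{"status"},
	)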

@cirocosta cirocosta mentioned this pull request Aug 26, 2019
@kcmannem (Member) commented Aug 26, 2019

Can we also do some sort of timeline graph that plots when heartbeats happen? I feel like this could come in handy. (Is this covered by the fact that workers go into the stalled state if they don't heartbeat?)

@cirocosta (Member, Author) commented Aug 26, 2019

@kcmannem hmmmm 🤔 I think we could increment a counter whenever a heartbeat happens, labelling it as either successful or not

e.g.

// in case of success
metrics.Heartbeats.WithLabelValues("success").Inc()

// in case of error
metrics.Heartbeats.WithLabelValues("failure").Inc()

or, if we care about how long the heartbeating takes, we could keep track of the duration of that too 🤔
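
A possible set of definitions for that (names and buckets here are assumptions):

	package metric

	import "github.com/prometheus/client_golang/prometheus"

	var (
		// heartbeats counted by outcome, as in the snippet above
		Heartbeats = prometheus.NewCounterVec(
			prometheus.CounterOpts{
				Name: "concourse_worker_heartbeats_total",
				Help: "Worker heartbeats, labelled by outcome.",
			},
			[]string{"status"},
		)

		// and, optionally, how long each heartbeat round-trip takes
		HeartbeatDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
			Name:    "concourse_worker_heartbeat_duration_seconds",
			Help:    "How long a worker heartbeat takes.",
			Buckets: prometheus.DefBuckets,
		})
	)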

with the scraping going on, you could tell when those heartbeats happened 😁

wdyt?

thx!


oh, and btw, would that be worker heartbeating from the worker perspective, or the heartbeats that tsa does? 🤔 thank you!

@kcmannem (Member) commented Aug 26, 2019

@cirocosta my bad, I meant heartbeating from the worker's perspective. I'm OK with just success or failure, but if there is minimal cost, it might be helpful to know the duration as well. It'll give us visibility into how heartbeating behaves under different kinds of pressure (high CPU load, high network load).

@gerhard commented Sep 2, 2019

This cannot come soon enough! The current metrics setup based on InfluxDB is bursting at the seams - again. We're throwing more CPUs & memory at it until we have this new solution.

I discovered this PR while getting frustrated with the current metrics setup, which led me to create #4340.

I highly recommend reaching out to @brian-brazil when you are ready with this; he has valuable insights into best practices: prometheus/docs#1414

Thanks for working on this @cirocosta !

@brian-brazil commented Sep 2, 2019

A quick high-level comment:

These metrics, which are predominantly statistics from the database, are missing, as there's not a very good way of having them exposed by a single component

The /metrics of a process should expose information about the performance of the process itself, not make RPCs to other systems to gather data or expose the data it's processing. So, for example, the MySQLd exporter exposes metrics about MySQLd performance - not raw data from tables. You'd have a different exporter for this. Kubelet vs kube-state-metrics is a good example of this.

That's not to say you can't expose such metrics on another endpoint (and sometimes you can cheat a little for some cheap metrics a given process knows about anyway) from the same process though.
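
One way to read that suggestion, sketched with client_golang (the second endpoint's path is purely illustrative):

	package main

	import (
		"net/http"

		"github.com/prometheus/client_golang/prometheus"
		"github.com/prometheus/client_golang/prometheus/promhttp"
	)

	func main() {
		// process-level metrics: scheduling, HTTP handlers, checks, etc.
		processRegistry := prometheus.NewRegistry()

		// database-derived state: number of teams, pipelines, etc.
		stateRegistry := prometheus.NewRegistry()

		http.Handle("/metrics", promhttp.HandlerFor(processRegistry, promhttp.HandlerOpts{}))
		http.Handle("/state-metrics", promhttp.HandlerFor(stateRegistry, promhttp.HandlerOpts{}))

		http.ListenAndServe(":9100", nil)
	}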

@cirocosta cirocosta force-pushed the cirocosta:prom-only branch from c15c986 to 4fd6f8e Sep 6, 2019
@gerhard commented Sep 6, 2019

I would love to start dog-fooding this two weeks from now, @cirocosta. Happy to delay if you need more time, but I am seriously intrigued by what you have going here 😉

This is an experiment to try to validate how good/bad it would be to
have **ONLY** prometheus as the metric emitter that we use.

It removes some metrics, adds more context to others, and also introduces
the `worker` as a metrics provider.

e.g.:

- builds_started: removed as I'm sure the value from `builds` is already
  good enough

- builds: adds `team`, `pipeline`, and `job` labels.

- scheduling_load & scheduling_job: we don't even have them in
  dashboards right now ¯\_(ツ)_/¯

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
@marco-m (Contributor) left a comment

@cirocosta I am very happy to see this happening :-)

I have a few comments related to the roll-out strategy. Since this will be replacing all the other emitters in one go (at least this is what I understood), I think you need to consider two aspects:

  • Provide a sort of migration/adaptation document for people who do not use Prometheus, for example the Telegraf plugin you mention. For a change like this, I think it would make sense to validate that the Telegraf plugin actually works as expected.

  • Think about Prometheus metrics compatibility. By this I mean that if this PR changes or removes some of the existing exported metrics, they should be mentioned in the release notes, so that an operator will know which queries to modify in Grafana and in AlertManager. Maybe a quick way is simply to take two ATCs, one without this PR and one with it, scrape both with `curl atc/metrics`, and have a look at the diff.

Thanks!

@@ -355,6 +334,14 @@ func (cmd *RunCommand) Execute(args []string) error {
return <-ifrit.Invoke(sigmon.New(runner)).Wait()
}

func LogLockAcquired(logger lager.Logger, lockID lock.LockID) {

@marco-m (Contributor) commented Sep 26, 2019

This function doesn't log the lockID. Is this on purpose?

@cirocosta (Author, Member) commented Oct 8, 2019

good catch! yeah, I didn't put much thought into it 😁 just enough to get it compiling

@cessien commented Oct 1, 2019

This is awesome! I was pretty unhappy with the cardinality that the Prometheus metrics offered for build duration so far. We're going to build this PR and try it out internally (CarGurus). I'll let you know what we like / don't like about it.

@eedwards-sk commented Oct 3, 2019

I have concerns over this idea. I do not use prometheus, I use Datadog. I expect concourse to continue to support emitting statsd metrics that I can consume from my Datadog Agent.

The current datadog support is officially recognized by Datadog: https://docs.datadoghq.com/integrations/concourse_ci/

This means that using the current statsd configuration allows me to consume concourse metrics into my Datadog account without any additional charges (even in spite of some cardinality issues that would otherwise be very expensive: #3196, #4038).

I think losing compatibility or breaking the current contract that is documented by Datadog would be a painful regression.

In addition to not wanting to deal with prometheus, I also don't like the idea of having to manage or run a telegraf agent.

That's one more thing I'd now be responsible for configuring. Their current output plugin, assuming that's what would be used, could also possibly break the "free metrics" benefit of the current metrics namespace that Datadog recognizes. As per https://github.com/influxdata/telegraf/tree/master/plugins/outputs/datadog, "Metrics are grouped by converting any _ characters to . in the Point Name", which would mean needing to ensure that the namespace ends up identical to the default as it was before (concourse.ci).

I'm also concerned if this means going to a "pull based" model rather than "push based", and how that affects system load and the frequency of reported metrics. The Datadog agent allows for high-frequency metric submission (udp statsd can receive lots of metrics per second) and automatically handles aggregation, sampling, batching and a bunch of other useful stuff to ensure metrics are captured and shipped.

If pull based, how does this affect the frequency of metric emissions? Does an agent need to poll aggressively to get more data points? Is the endpoint going to return a series of data points over a timespan and the agent is responsible for knowing what data points it already has?

@cirocosta (Member, Author) commented Oct 8, 2019

Hi @eedwards-sk, thank you so much for taking the time to share this feedback!


The current datadog support is officially recognized by Datadog: https://docs.datadoghq.com/integrations/concourse_ci/

That's a very valid point! It can definitely be a deal breaker as I can imagine
that this significantly reduces the expenses at the end of the day (which
matters a lot).

I wonder if in this case it's a matter of engaging with Datadog 🤔

We expect to have newer metrics as the system changes, and to improve them when we
see fit, so I can imagine that we'd like not to be too dependent on a vendor
expecting a restricted set of metrics. I see this as a win-win, as being able to
better observe a system in Datadog would be something great for them too (as a
selling point).

In addition to not wanting to deal with prometheus, I also don't like the idea
of having to manage or run a telegraf agent.

Yeah .. I totally agree! I wouldn't like to do that either - now there's one more
thing to break, to upgrade, to have security vulns ...

My view, though, is that despite that looking like a problem, as the
industry moves towards standard formats (like OpenMetrics), we (as the
ones developing a CI system) get to focus less and less on how to deal with
all of these metrics systems: we just implement the standard that everyone agrees
on, and let those who spend their entire day thinking about metrics
implement the agents that collect from anything exposing that standard format.

We can see that with Datadog itself, which announced that the Datadog agent is now
capable of ingesting metrics from anything exposing them in the Prometheus
exposition format (which OpenMetrics is pretty much entirely based on).

This is also true for NewRelic, another emitter that we currently have to
maintain.

Although it might be a steep transition, my view is that it's a good thing:
having fewer of these things that are not core to what Concourse cares about
leaves us with fewer dusty edges that no one really owns.

I'm also concerned if this means going to a "pull based" model rather than
"push based", and how that affects system load and the frequency of reported
metrics. The Datadog agent allows for high-frequency metric submission (udp
statsd can receive lots of metrics per second) and automatically handles
aggregation, sampling, batching and a bunch of other useful stuff to ensure
metrics are captured and shipped.

If pull based, how does this affect the frequency of metric emissions? Does an
agent need to poll aggressively to get more data points? Is the endpoint going
to return a series of data points over a timespan and the agent is responsible
for knowing what data points it already has?

That's indeed a good point.

It might be that people are using our metrics not just as monitoring, but
as events whose exact submission timestamp they rely on for some reason
(e.g., we emit build-finished metrics on a very granular basis, so one could
perhaps leverage that to tell exactly when a given job failed). Indeed, with
a pull-based model as proposed here, that wouldn't be possible, as Prometheus
(or any system that scrapes the endpoint) would essentially be looking at
snapshots of how the counters/gauges/etc. look at the moment of scraping.

In this case, we'd indeed be breaking that functionality.

Would you mind sharing what your use case for high frequency is?


please let me know what you think - happy to discuss!

Thanks!

@eedwards-sk commented Oct 8, 2019

@cirocosta Thanks for linking the Prometheus article. Datadog is so big nowadays that I miss half the stuff they're doing.

Regarding the costs issue, as long as metrics continue to be prefixed with concourse.ci., then I believe Datadog would not charge for them. Although I'm not sure how metric namespaces work when using their prometheus (OpenMetrics) support, or how they get mapped as they're converted or processed.

I agree that concourse shouldn't have to carry the burden of knowing how to deal with a myriad of providers. I was under the impression that statsd is already a fairly generic protocol, which Datadog supports (with some straightforward enhancements for dogstatsd / tags).

So I wonder if this is really just moving from one standard to another? If this new OpenMetrics approach works as well as or better than statsd, I'm not opposed. Although it feels possibly like a k8s-driven change, and as much as the industry makes it seem otherwise, not everyone is using k8s, myself included. So perhaps I'm just not seeing the immediate need as clearly.

It might be that people are using our metrics not just as monitoring, but
as events whose exact submission timestamp they rely on for some reason
(e.g., we emit build-finished metrics on a very granular basis, so one could
perhaps leverage that to tell exactly when a given job failed). Indeed, with
a pull-based model as proposed here, that wouldn't be possible, as Prometheus
(or any system that scrapes the endpoint) would essentially be looking at
snapshots of how the counters/gauges/etc. look at the moment of scraping.

Those are definitely things I look for out of metrics -- correlation is a very strong tool. Datadog reinforces that by letting me aggregate and correlate multiple metrics from e.g. the same host or hosts on the same timescale.

[screenshot: concourse-datadog-1 - correlated Concourse metrics in a Datadog dashboard]

So it can be useful to see that a spike in CPU load was correlated to a build starting, or IOWait spiking when the container count increased.

If a polling agent operates regularly enough, then I assume it can still approximate the relative timeframe of when the events are occurring. However this still raises the issue of balancing poll frequency versus load due to polling (and serving a stats request).

@brian-brazil commented Oct 8, 2019

Although I'm not sure how metric namespaces work when using their prometheus (OpenMetrics) support, or how they get mapped as they're converted or processed.

They'll be charged as custom metrics. I don't think they have any integrations yet that use Prometheus under the covers, but I guess it's only a matter of time. Asking them is best.

Although it feels possibly like a k8s-driven change

While Prometheus is commonly used on k8s, it is not k8s-specific in any way. Plus, the major vendors all support the format, and it'll be more efficient than statsd.

It might be that people are using our metrics not just as monitoring, but
as events whose exact submission timestamp they rely on for some
reason

If you think this is likely, I'd suggest exporting gauge metrics with the relevant timestamp(s).
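
That would look roughly like this (a sketch only; the metric name is an assumption):

	package metric

	import (
		"time"

		"github.com/prometheus/client_golang/prometheus"
	)

	// the gauge carries the event time itself, so consumers don't depend on
	// when the scrape happened
	var LastBuildFinished = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "concourse_last_build_finished_timestamp_seconds",
			Help: "Unix time of the most recently finished build.",
		},
		[]string{"status"},
	)

	func recordBuildFinished(status string) {
		LastBuildFinished.WithLabelValues(status).Set(float64(time.Now().Unix()))
	}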

@eedwards-sk commented Oct 8, 2019

They'll be charged as custom metrics.

Not necessarily. It depends on the namespace. (As stated above, if the metrics show up as concourse.ci.{FOO} etc., they should still fall under the existing custom-metric exception.)

@eedwards-sk commented Oct 8, 2019

@brian-brazil

Although it feels possibly like a k8s-driven change

While Prometheus is commonly used on k8s, it is not k8s-specific in any way. Plus, the major vendors all support the format, and it'll be more efficient than statsd.

That's what I mean by "k8s driven". I didn't say k8s specific :)

Can you clarify why it's more efficient than statsd? That sounds highly relevant to this discussion.

@brian-brazil commented Oct 8, 2019

Can you clarify why it's more efficient than statsd? That sounds highly relevant to this discussion.

https://www.robustperception.io/which-kind-of-push-events-or-metrics Statsd is (usually) events, Prometheus is metrics.

@eedwards-sk commented Oct 8, 2019

@brian-brazil

Thanks! That's a well-summarized overview. I would still offer the (anecdotal) perspective that, in practice and at scale, statsd has worked effectively in production for the past few years I've used it alongside Datadog, under a myriad of configurations and load.

I personally don't have any drivers to migrate away from it, and I'll continue to have numerous stacks and services using it for some time.

Thus I'll be interested to see what this looks like when going from OpenMetrics -> Datadog. If there is relative parity in terms of configuration complexity / management liability, I don't really care what is "inside the box", as long as it surfaces the info we want :)

@brian-brazil commented Oct 8, 2019

statsd has worked effectively in production for the past few years I've used it alongside Datadog, under a myriad of configurations and load.

I believe DataDog has a per-machine statsd, which avoids the usual issues from attempting to run a single statsd for many many machines.

Thus I'll be interested to see what this looks like when going from OpenMetrics -> Datadog.

It'll be exactly the same; it's all already supported, AIUI. Datadog have contributed a number of performance improvements to the Prometheus Python client to facilitate their Prometheus/OpenMetrics integration.

@pivotal-jwinters (Contributor) commented Nov 5, 2019

Admittedly I have not read through this entire thread, so I might be missing some valuable context here, but after briefly chatting with @cirocosta it seems like we would be able to reduce the burden of adding new metrics if we added a type to our top-level metric interface. This type would represent things like counter or histogram.

That way each emitter can move away from mapping on event.name in favour of event.type. This means that an individual emitter would only ever change if a new type were added, which should be rare. Emitters would also be able to ignore types that don't translate between push and pull collection strategies.
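
In rough Go terms, it could look something like this (all names here are hypothetical; the real metric interface in Concourse may differ):

	package metric

	type Type int

	const (
		CounterType Type = iota
		GaugeType
		HistogramType
	)

	type Event struct {
		Name  string
		Type  Type // new: lets emitters dispatch on kind rather than on Name
		Value float64
		Tags  map[string]string
	}

	type someEmitter struct{}

	// an emitter switches on the type, ignoring kinds it can't express
	func (e *someEmitter) Emit(ev Event) {
		switch ev.Type {
		case CounterType:
			// increment: bump a Prometheus counter, send a statsd "c" line, ...
		case GaugeType:
			// set a gauge / send a statsd "g" line
		case HistogramType:
			// observe into a histogram; push-only backends may skip this
		}
	}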
