experiment: standardize on prometheus for metrics #4247
Conversation
Makefile

```
@@ -0,0 +1,54 @@
all: run
```
(this is just for me to be able to more easily run this without having `web` running in a container)

```
make database
make worker
make run
```

(I should've named `run` `web`.)
In terms of "developer experience", it ends up looking like this for us:
```go
var (
    // other metrics ...

    ResourceChecksDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "concourse_resource_checks_duration_seconds",
            Help: "How long resource checks take",
        },
        []string{LabelStatus},
    )
)
```

```go
metric.
    ResourceChecksDuration.
    WithLabelValues("failed").
    Observe(checkDuration.Seconds())
```
Something I like about this is that instrumenting individual components of Concourse would become straightforward. That'd mean having an easy time adding metrics to components like `tsa`, `beacon`, … And yeah, in case your … Or, in case you have your workers in a cluster where you do have a Prometheus server …
atc/metric/metrics.go

```go
        "resource": event.ResourceName,
        "team_name": event.TeamName,
    },
    ResourceChecksDuration = prometheus.NewHistogramVec(
```
here's an example of how you'd be defining the measurement
```go
if err != nil {
    if rErr, ok := err.(resource.ErrResourceScriptFailed); ok {
        logger.Info("check-failed", lager.Data{"exit-status": rErr.ExitStatus})

        metric.
            ResourceChecksDuration.
            WithLabelValues("failed").
            Observe(checkDuration.Seconds())
```
the corresponding example of how you leverage that measurement you defined in metrics.go
atc/atccmd/command.go

```go
Runner: metric.PeriodicallyEmit(
    logger.Session("periodic-metrics"),
    10*time.Second,
    // ...
Name: "metrics-handler",
```
all that you need to have an endpoint exporting prometheus metrics
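(For context, here's a minimal, self-contained sketch of what such an endpoint looks like with `client_golang` - the registry and counter below are made up purely for illustration, not the PR's actual wiring:)

```go
package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // Hypothetical registry and metric, just so there's something to scrape.
    registry := prometheus.NewRegistry()
    checks := prometheus.NewCounter(prometheus.CounterOpts{
        Name: "example_resource_checks_total",
        Help: "Example counter for illustration only.",
    })
    registry.MustRegister(checks)

    // Serving the registry over HTTP is all a Prometheus scrape target needs.
    http.Handle("/metrics", promhttp.HandlerFor(registry, promhttp.HandlerOpts{}))
    log.Fatal(http.ListenAndServe(":9090", nil))
}
```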
I like this.

> That'd mean having an easy time adding metrics to components like `tsa`, `beacon`, …

^ that's gonna be super useful to us on runtime.
Moving on, some things that could make this better:

There are certain metrics here that are clearly lacking: …

These metrics, which are predominantly statistics from the database, are missing, as there's not a very good way of having them exposed by a single component (e.g., it'd be weird to have two ATCs reporting the number of teams that they just got from the database). For that case, I'm leaning towards having a second component (https://github.com/cirocosta/flight_recorder/).

Wdyt? thanks!
```
@@ -43,6 +46,9 @@ func (wc *workerCollector) Run(ctx context.Context) error {

	if len(affected) > 0 {
		logger.Info("marked-workers-as-stalled", lager.Data{"count": len(affected), "workers": affected})

		// workers_stalled
```
Would it make more sense to have it as `workers_marked_stalled`, as that's the actual action taking place? That way it'd make sense for it to be per-ATC (while still being possible to be `sum`med up).
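(A possible shape for that rename - the metric name and package below are assumptions, not what the PR defines:)

```go
package metric

import "github.com/prometheus/client_golang/prometheus"

// WorkersMarkedStalled counts the workers that *this* ATC marked as stalled,
// so the metric stays per-ATC while dashboards can still sum it across ATCs.
var WorkersMarkedStalled = prometheus.NewCounter(prometheus.CounterOpts{
    Name: "concourse_workers_marked_stalled_total",
    Help: "Number of workers this ATC marked as stalled.",
})
```

The collector would then do something like `WorkersMarkedStalled.Add(float64(len(affected)))` after the update.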
```
@@ -57,6 +59,13 @@ func (w workerHelper) createGardenContainer(
		Env:        env,
		Properties: gardenProperties,
	})

	metrics.
```
This is web's perspective of how long a container creation is taking.

I wonder if this would be more valuable if we included the worker name in the label set, so that we could tell which particular worker is facing long container creation times.

Naturally, in the best scenario we'd have distributed tracing in a way that would allow us to have all the context and separate between time taken as seen from the web and time taken by just garden, but I think what we have for now (+ having the worker name in the label set) would be good enough.
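(A sketch of what adding the worker name to the label set could look like - the metric name, label, and helper below are illustrative only:)

```go
package metric

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

// ContainerCreationDuration records container creation time as seen from web,
// labeled by worker, so a consistently slow worker stands out.
var ContainerCreationDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name: "concourse_container_creation_duration_seconds",
        Help: "How long garden container creation takes, from web's perspective.",
    },
    []string{"worker"},
)

// ObserveContainerCreation records one creation attempt for the given worker.
func ObserveContainerCreation(workerName string, took time.Duration) {
    ContainerCreationDuration.WithLabelValues(workerName).Observe(took.Seconds())
}
```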
```go
HttpResponseDuration.
    With(prometheus.Labels{
        "code": strconv.Itoa(metrics.Code),
```
I don't think there's a lot of value in having the plain code here - we could, instead, move towards `2xx`, reducing the amount of possible values this can take down to 4 (`2xx`, `3xx`, `4xx`, `5xx`) - just what we care about.
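(One way to do that collapsing - just a sketch, the helper name is made up:)

```go
package metric

import "fmt"

// codeClass collapses an HTTP status code into its class ("2xx", "3xx", "4xx",
// "5xx", ...), so the label stays bounded to a handful of values instead of
// one per distinct status code.
func codeClass(code int) string {
    if code < 100 || code > 599 {
        return "unknown"
    }
    return fmt.Sprintf("%dxx", code/100)
}
```

The label would then be populated with something like `"code": codeClass(metrics.Code)` instead of the raw value.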
```go
var (
    slowBuckets = []float64{
        1, 30, 60, 120, 180, 300, 600, 900, 1200, 1800, 2700, 3600, 7200, 18000, 36000,
```
ooh yeah, this bad boy here ends up generating 16 bucket series for each measurement that we end up adding (`slowBuckets` is used exclusively for build time - the metric with the highest cardinality).
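(To make the series count concrete - a small, hypothetical illustration of how those bucket boundaries turn into `le` series:)

```go
package main

import (
    "fmt"

    "github.com/prometheus/client_golang/prometheus"
)

func main() {
    // Stand-in for the slowBuckets above: 15 boundaries.
    slowBuckets := []float64{
        1, 30, 60, 120, 180, 300, 600, 900, 1200, 1800, 2700, 3600, 7200, 18000, 36000,
    }

    // A histogram with these buckets exports one `_bucket{le="..."}` series per
    // boundary plus an implicit "+Inf" bucket, on top of `_sum` and `_count`.
    _ = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "example_build_duration_seconds",
        Help:    "Illustration of bucket-driven series count only.",
        Buckets: slowBuckets,
    })

    fmt.Printf("%d bucket series per label combination\n", len(slowBuckets)+1)
}
```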
```go
    registry.MustRegister(collector)
}

registry.MustRegister(
```
I'm still not very happy with having to remember to add the variable here, but I couldn't think of a good way of registering it automatically (without having to do something like what `promauto` does - reimplementing the constructors).
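(For reference, the `promauto`-style alternative mentioned above - registration happens at construction time, so there's no separate list to keep in sync; the registry and metric here are illustrative, not the PR's code:)

```go
package metric

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var registry = prometheus.NewRegistry()

// With promauto, the metric registers itself against the registry the moment
// it is constructed, so there is no MustRegister(...) list to forget about.
var ResourceChecksDuration = promauto.With(registry).NewHistogramVec(
    prometheus.HistogramOpts{
        Name: "concourse_resource_checks_duration_seconds",
        Help: "How long resource checks take.",
    },
    []string{"status"},
)
```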
Can we also do some sort of timeline graph that plots when heartbeats happen? I feel like this could come in handy. (Is this covered by the fact that workers go into a stalled state if they don't heartbeat?)
@kcmannem hmmmm, e.g.

```go
// in case of success
metrics.Heartbeats.WithLabelValues("success").Inc()

// in case of error
metrics.Heartbeats.WithLabelValues("failure").Inc()
```

or, if we care about how long the heartbeating takes, we could keep track of the duration of that too. Having the scraping going on, you could tell when those heartbeats happened. wdyt? thx!

oh, and btw, would that be worker heartbeating from the worker perspective, or the heartbeats as seen from web?
@cirocosta my bad, I meant heartbeating from the workers' perspective. I'm ok with the successes or failures, but if there is minimal cost, it might be helpful to know the duration as well. It'll give us visibility around how heartbeating works under different kinds of pressure (high CPU load, high network load).
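(A sketch of what tracking heartbeat duration from the worker's side could look like - the metric name and wrapper below are assumptions, not part of this PR:)

```go
package metric

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

// HeartbeatDuration tracks how long each heartbeat takes from the worker's
// perspective, labeled by outcome.
var HeartbeatDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name: "concourse_worker_heartbeat_duration_seconds",
        Help: "How long worker heartbeats take, by outcome.",
    },
    []string{"status"},
)

// TimeHeartbeat wraps a heartbeat attempt, recording its duration and outcome.
func TimeHeartbeat(heartbeat func() error) error {
    start := time.Now()
    err := heartbeat()

    status := "success"
    if err != nil {
        status = "failure"
    }
    HeartbeatDuration.WithLabelValues(status).Observe(time.Since(start).Seconds())

    return err
}
```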
This cannot come soon enough! The current metrics setup based on InfluxDB is bursting - again. Throwing more CPUs & memory at it until we have this new solution. I discovered this while getting frustrated with the current metrics setup, used to create #4340.

I highly recommend reaching out to @brian-brazil when you are ready with this, he has valuable insights into best practices: prometheus/docs#1414

Thanks for working on this @cirocosta !
A quick high level comment:
The /metrics of a process should expose information about the performance of the process itself, not be making RPCs to other systems to gather data or expose the data it's processing. So for example the MySQLd exporter exposes metrics about MySQLd performance - not raw data from tables. You'd have a different exporter for this. Kubelet vs kube-state-metrics is a good example of this.

That's not to say you can't expose such metrics on another endpoint (and sometimes you can cheat a little for some cheap metrics a given process knows about anyway) from the same process though.
I would enjoy starting to dog-food this in 2 weeks from now @cirocosta. Happy to delay if you need more time, but I am seriously intrigued about what you have going here.
This is an experiment to try to validate how good/bad it would be to have **ONLY** prometheus as the metric emitter that we use. It removes some metrics, adds more context to others, and also introduces `worker` as a metrics provider. e.g.:

- `builds_started`: removed, as I'm sure the value from `builds` is already good enough
- `builds`: add `team`, `pipeline`, and `job`
- `scheduling_load` & `scheduling_job`: we don't even have them in dashboards right now ¯\_(ツ)_/¯

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
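(As a rough illustration of the "add `team`, `pipeline`, and `job`" point - the metric and label names below are a sketch, not necessarily what the branch defines:)

```go
package metric

import "github.com/prometheus/client_golang/prometheus"

// BuildsFinished is a sketch of a builds counter carrying the extra context
// mentioned above; team/pipeline/job come from what's configured, keeping the
// label set meaningful while still allowing per-job drill-down.
var BuildsFinished = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "concourse_builds_finished_total",
        Help: "Number of builds finished, by team, pipeline, job, and status.",
    },
    []string{"team", "pipeline", "job", "status"},
)
```

e.g. `BuildsFinished.WithLabelValues("main", "ci", "unit", "succeeded").Inc()`.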
@cirocosta I am very happy to see this happening :-)
I have a few comments related to the roll-out strategy. Since this will be replacing all the other emitters in one go (at least this is what I understood), I think you need to consider two aspects:

- Provide a sort of migration/adaptation document for people who do not use Prometheus, for example the Telegraf plugin you mention. For a change like this, I think it would make sense to validate that the Telegraf plugin actually works as expected.
- Think about Prometheus metrics compatibility. By this I mean that if this PR changes or removes some of the existing exported metrics, they should be mentioned in the release notes, so that an operator will know which queries to modify in Grafana and in AlertManager. Maybe a quick way is simply to take two ATCs, one without this PR, one with this PR, scrape both with `curl atc/metrics` and have a look at the diff.
Thanks!
```
@@ -355,6 +334,14 @@ func (cmd *RunCommand) Execute(args []string) error {
	return <-ifrit.Invoke(sigmon.New(runner)).Wait()
}

func LogLockAcquired(logger lager.Logger, lockID lock.LockID) {
```
This function doesn't log the lockID. Is this on purpose?
good catch! yeah, I didn't put much thought into it
This is awesome! I was pretty unhappy with the cardinality for build duration that the Prometheus metrics offered so far. We're going to build this PR and try it out internally (CarGurus). I'll let you know what we like / don't like about it.
I have concerns over this idea. I do not use prometheus, I use Datadog. I expect concourse to continue to support emitting statsd metrics that I can consume from my Datadog Agent.

The current datadog support is officially recognized by Datadog: https://docs.datadoghq.com/integrations/concourse_ci/

This means using the current statsd configuration allows me to consume concourse metrics into my Datadog account without any additional charges (even in spite of some cardinality issues that would otherwise be very expensive #3196 #4038). I think losing compatibility or breaking the current contract that is documented by Datadog would be a painful regression.

In addition to not wanting to deal with prometheus, I also don't like the idea of having to manage or run a telegraf agent. That's one more thing I'd now be responsible for configuring. Their current output plugin, assuming this is what would be used, could also possibly result in a break of the "free metrics" benefit of the current metrics namespace that Datadog represents. As per https://github.com/influxdata/telegraf/tree/master/plugins/outputs/datadog, "Metrics are grouped by converting any _ characters to . in the Point Name.", which would mean needing to ensure that the namespace ended up identical to the default as it was before (…).

I'm also concerned if this means going to a "pull based" model rather than "push based", and how that affects system load and the frequency of reported metrics. The Datadog agent allows for high-frequency metric submission (udp statsd can receive lots of metrics per second) and automatically handles aggregation, sampling, batching and a bunch of other useful stuff to ensure metrics are captured and shipped. If pull based, how does this affect the frequency of metric emissions? Does an agent need to poll aggressively to get more data points? Is the endpoint going to return a series of data points over a timespan and the agent is responsible for knowing what data points it already has?
Hi @eedwards-sk, thank you a lot for putting time into sharing this feedback!

That's a very valid point! It can definitely be a deal breaker, as I can imagine … I wonder if in this case it's a matter of engaging with Datadog … We expect to have newer metrics as the system changes, and improve them when we …

Yeah .. I totally agree! I'd not like to do that either - now there's one more … My view on this though is that despite that seeming like a problem, as the … We can see that with Datadog itself, announcing that the Datadog agent is now … This is also true for NewRelic, another emitter that we currently have to … Although it might be a steep transition, my view is that it's a good thing as …

That's indeed a good point. It might be that people could be using our metrics not as "just monitoring", but … In this case, we'd indeed be breaking that functionality. Would you mind sharing what's your use case for high frequency?

Please let me know what you think - happy to discuss! Thanks!
@cirocosta Thanks for linking the prometheus article. Datadog is so big nowadays I miss half the stuff they're doing.

Regarding the costs issue, as long as metrics continue to be prefixed with …

I agree that concourse shouldn't have to carry the burden of knowing how to deal with a myriad of providers. I was under the impression that statsd is already a fairly generic protocol, which datadog supports (with some straightforward enhancement for dogstatsd / tags). So I wonder if this is really just moving from one standard to another?

If this new OpenMetrics approach works as well / better than statsd, I'm not opposed. Although it feels possibly like a k8s-driven change, and as much as the industry makes it seem, not everyone is using k8s, myself included. So perhaps I'm just not seeing this immediate need as clearly.
Those are definitely things I look for out of metrics -- correlation is a very strong tool. Datadog reinforces that by letting me aggregate and correlate multiple metrics from e.g. the same host or hosts on the same timescale. So it can be useful to see that a spike in CPU load was correlated to a build starting, or IOWait spiking when the container count increased.

If a polling agent operates regularly enough, then I assume it can still approximate the relative timeframe of when the events are occurring. However this still raises the issue of balancing poll frequency versus load due to polling (and serving a stats request).
They'll be charged as custom metrics. I don't think they've any integrations yet that use Prometheus under the covers, but I guess it's only a matter of time. Asking them is best.
While Prometheus is commonly used on k8s, it is not k8s specific in any way. Plus the major vendors all support the format, and it'll be more efficient than statsd.

If you think this is likely, I'd suggest exporting gauge metrics with the relevant timestamp(s).
Not necessarily. It depends on the namespace. (As stated above, if the metrics show up as …)
That's what I mean by "k8s driven". I didn't say k8s specific :) Can you clarify why it's more efficient than statsd? That sounds highly relevant to this discussion.
https://www.robustperception.io/which-kind-of-push-events-or-metrics Statsd is (usually) events, Prometheus is metrics. |
Thanks! That's a well summarized overview.

I would still present the (anecdotal) perspective that, in practice, at scale, statsd has worked effectively in production for the past few years I've used it alongside Datadog, under a myriad of configurations and load. I personally don't have any drivers to migrate away from it, and I'll continue to have numerous stacks and services using it for some time.

Thus I'll be interested to see what this looks like when going from OpenMetrics -> Datadog. If there is relative parity in terms of configuration complexity / management liability, I don't really care what is "inside the box", as long as it surfaces the info we want :)
I believe DataDog has a per-machine statsd, which avoids the usual issues from attempting to run a single statsd for many many machines.
It'll be the exact same, it's already all supported AIUI. Datadog have contributed a number of performance improvements to the Prometheus Python client to facilitate their Prometheus/OpenMetrics integration.
Admittedly I have not read through this entire thread so I might be missing some valuable context here, but after briefly chatting with @cirocosta it seems like we would be able to reduce the burden of adding new metrics if we added a type to each metric event.

That way each emitter can move away from mapping the `event.name` in favour of `event.type`. This means that an individual emitter would only ever change if a new type was added, which should be rare. Emitters would also be able to ignore types that don't translate between push or pull collection strategies (see the sketch below).
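(Something along these lines, perhaps - the `Event` shape, type constants, and emitter below are made up to illustrate the idea, not the actual `metric` package types:)

```go
package metric

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

// EventType is a hypothetical coarse classification each emitter would switch
// on, instead of mapping every individual event name.
type EventType int

const (
    CounterType EventType = iota
    GaugeType
    DurationType
)

// Event carries a measurement plus its type; an emitter only changes when a
// brand new EventType is introduced, which should be rare.
type Event struct {
    Name     string
    Type     EventType
    Value    float64
    Duration time.Duration
}

// PrometheusEmitter is a toy emitter keeping one collector per event name.
type PrometheusEmitter struct {
    counters   map[string]prometheus.Counter
    histograms map[string]prometheus.Histogram
}

// Emit dispatches on the event's type rather than its name; types that don't
// translate to a pull-based model can simply be ignored.
func (e *PrometheusEmitter) Emit(ev Event) {
    switch ev.Type {
    case CounterType:
        if c, ok := e.counters[ev.Name]; ok {
            c.Add(ev.Value)
        }
    case DurationType:
        if h, ok := e.histograms[ev.Name]; ok {
            h.Observe(ev.Duration.Seconds())
        }
    default:
        // e.g. GaugeType handled similarly, or skipped when it doesn't apply.
    }
}
```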
Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward? This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
thanks to everyone who helped out expressing your thoughts & concerns! I'll close this one for now in favor of #5104
Hey,
I wanted to try out the idea of getting rid of all of our emitters and standardizing on Prometheus: Concourse being a system that can be observed without having to worry about emitting its measurements to others (e.g., adopting a pull-based approach and nothing more).
At the moment, I haven't yet converted all the metrics that we have, and IMO that should not be done as a 1:1 mapping anyway - there might be a certain set of metrics that we just don't care about, and others that we do care about but just never implemented.
Also, I tried to be super conservative when it comes to labeling in order to try to avoid
future problems with cardinality explosion, so all measurements (at the moment) have
bounded cardinality, which might mean that they're not super useful for very
context-specific metrics like "is this particular job being slower over time".
ps.: this is just an experiment to think about the area of consolidating the emitters. I didn't play with the Telegraf Prometheus input plugin (see https://github.com/influxdata/telegraf/tree/master/plugins/inputs/prometheus), which should "unlock" Prometheus-based metrics for those who can't just use Prometheus and need something to emit metrics to their systems - but I'd be curious to know if other people did!
Also, do you see this being problematic in any sense in the future? Is super high context relevant for you? (If so, should we start thinking about other approaches beyond metrics that can bring this "high-context" observability - like distributed tracing - into the mix?)
I'd appreciate a lot your feedback!
Thanks!