experiment: add tracing to ATC #4607
Conversation
I think this is awesome. It's a big piece of my ideal world:

- We introduce tracing throughout Concourse so that contributors can see how all the parts work, making it much easier to identify and fix bottlenecks and strange behavior, improving efficiency and resiliency. I'm pretty excited to play with this kind of thing in a live environment - perhaps even with APM?
- We simplify our logging and focus on readability and contributor UX so it's easier to see when important things happen, like build triggering and version detection, without all the noise and curly braces. I think we might want to do this by removing all of our `lager` usage and incrementally adding important log lines back (including any error logs), probably using `lager` since that seems popular in our ecosystem. (We may want to revisit audit logs at this point?) @xoebus and I started on a project to automate this: https://github.com/vito/prohibition - but haven't gotten too far.
- We continue to invest in metrics, but stop overloading them as an event system and focus more on Prometheus-style static values, reducing cardinality and simplifying our metrics stack. Maintaining push and pull and all kinds of different implementations is a bit of a nightmare. There's a reason we started with only Riemann (...though there were super valid reasons for people to not want to use Riemann, so we backpedaled, but maybe folks would find Prometheus(+Telegraf) more agreeable).
Does that all sound about right to you?
I haven't had a chance to play with these changes (or review them) yet, but I approve in principle and would like to see this merged in and iterated on soon!
Had time to test it out manually and it works as far as I can tell. Gonna approve even though this is marked experimental since I'd like to see this kind of thing grow quickly.
Think I might work on a DataDog APM output after this is merged.
One thing I noticed: while a build is running it shows up in the Jaeger UI as `<trace-without-root-span>` - is that normal? Is it because the trace isn't done, and a span is only fully emitted with its name and such once it's done?
Also, any idea what those naked `run-script` and `find-or-create-container` spans are from? Maybe it's `image_resource` checking/fetching? It looks like the `run-script` is running `/opt/resource/check`.
Where does this leave PR #4598? I want to be able to view trace event data in Honeycomb in particular, because as far as I know it's the only system that lets me attach arbitrary fields of arbitrarily high cardinality to events. Traces let me understand how individual pieces of work flow through the system, but what I'm really looking for is "needles in a haystack full of needles" -- I want to be able to view all events of a particular type in aggregate, and then bucket them across arbitrary independent variables, to look for independent variables associated with anomalies. I'm trying to answer questions like "why do certain resources sometimes stop triggering, and why does restarting the ATC fix that problem?" I'd like to be able to examine as many properties as possible of the events associated with resource triggering (and any other operation that exhibits odd behavior like that).

It's unclear to me right now whether Honeycomb is compatible with the OpenTelemetry standard yet. I might be able to hack it together even if it's not, but that would mean spending some time getting that working, rather than working on improving the contents of the events and traces Concourse is emitting. The other path forward would be to merge my PR so I can continue to experiment with Honeycomb instrumentation, with the intention of rolling whatever work I do there into the OpenTelemetry-compatible emission path whenever Honeycomb gets compatibility there fully online.

Let me know which path you'd like me to pursue @cirocosta @vito. I probably won't be back on this in earnest until November 8th, but I'd like to be able to do the work to get some data into Honeycomb at that point and start thinking about what else might be useful to emit.
In general though -- I'm real pleased to see the vision @vito's laid out above. My main problem with Concourse right now is that I can't always understand what's happening inside of it when something goes wrong, but I think the technology is available now to give operators a "pane of glass" view of the operations of the entire system.

In my experience:
- Logs are useful for understanding the exact chain of events on a specific node, or a small number of nodes. They're also handy for storing audit events for long periods of time. Human readability is really important in both cases.
- Metrics are key for understanding whether problems that you already know a lot about are happening.

I'd love to see Concourse be able to emit some really rich events and traces, and let metrics and logs focus on being good at what they're good at, rather than serving as first-line debugging tools.
@njbennett Cheers for the feedback! These are all pretty high-impact things involving a lot of mechanical labor to change, so it's good to hear signs of alignment from the community before we commit.

For Honeycomb, it sounds like you're in a similar position to me, where I'd like to have support for emitting to DataDog APM, which doesn't seem to have a native OpenTelemetry exporter yet. However, they both have OpenCensus implementations, which seem to fit a very similar interface to OpenTelemetry's. In comparing the two ...
@cirocosta Yes, man. You showed that I was not the only one dreaming of this. Thank you.
[ipm]: (a reminder for me):
the tracing package is responsible for providing a thin wrapper on top of OpenTelemetry. by having this wrapper, we're able to protect ourselves from the changes that `opentelemetry` keeps going through as it gets ready for a final release. the package is meant to be used in a two-step fashion: 1. an exporter is configured (initializing the SDK so that it doesn't use a noop tracer) - `ConfigureTracer`; 2. spans are then created - `StartSpan()` / `span.End()` - so that code can be instrumented in no more than ~3 lines of code. Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
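To make that two-step flow concrete, here is a minimal, self-contained sketch of what such a thin wrapper could look like. `ConfigureTracer` and `StartSpan` are the names mentioned in the commit above, but the signatures, the `Tracer`/`Span` interfaces, and everything else here are illustrative assumptions rather than the actual `tracing` package:

```go
package tracing

import "context"

// Span is the minimal surface instrumented code needs to see.
type Span interface {
	End()
}

// Tracer abstracts whichever backend gets configured (e.g. an
// OpenTelemetry tracer wired to a Jaeger or stdout exporter).
type Tracer interface {
	StartSpan(ctx context.Context, name string) (context.Context, Span)
}

// until an exporter is configured, a do-nothing tracer is used.
var tracer Tracer = noopTracer{}

// ConfigureTracer is step 1: register a real, exporter-backed tracer.
func ConfigureTracer(t Tracer) {
	tracer = t
}

// StartSpan is step 2: create a span, so call sites only need
//
//	ctx, span := tracing.StartSpan(ctx, "some-operation")
//	defer span.End()
func StartSpan(ctx context.Context, name string) (context.Context, Span) {
	return tracer.StartSpan(ctx, name)
}

type noopTracer struct{}
type noopSpan struct{}

func (noopTracer) StartSpan(ctx context.Context, _ string) (context.Context, Span) {
	return ctx, noopSpan{}
}

func (noopSpan) End() {}
```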
this commit makes the tracing functionality available by allowing operators to specify `--tracing-*` flags to the `concourse web` command to leverage the tracing infrastructure put in place. submitting the traces to a Jaeger collector can be configured like so: `--tracing-jaeger-endpoint=<>` `--tracing-jaeger-service=<>`, and for just printing those spans to stdout: `--tracing-stdout=true`. ps.: a little bit of refactoring on `stdout` was made to avoid having to check for `errors` where not needed. Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
building upon the tracing package, this commit adds tracepoints to all of the steps that Concourse exposes to pipeline authors. even though most of these steps don't incur any tangible cost, making all of them visible to the operator means one can easily reason about the underlying pipeline, as opposed to when the trace doesn't mirror its exact structure. Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
by adding a tracepoint right after the ATC acquires the lock to track a build, we're able to capture the duration of the entire build as seen by the ATC that's keeping track of it. as the context propagates further down to other steps, we can correlate the entire build with its steps. ps.: this commit DOES NOT address the case of an ATC being phased out in the middle of build tracking and having the work picked up by another installation (although this should be possible if we own our trace IDs). Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
by adding a tracepoint at this low-level method, we're able to time exactly how long the execution of a given script took. with the higher-level components already instrumented, context propagation lets the span created here be connected to the step that initiated it - e.g. a `build` span containing a `get` span, which in turn contains a `run-script` span. ps.: this does not differentiate the case of attaching to a process that is already running (similar to how we do not yet deal with build tracking when the ATC didn't track a build from start to finish). Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
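As a sketch of how that context propagation plays out, building on the hypothetical wrapper above (it assumes that package is imported as `tracing`, and the function names here are illustrative, not Concourse's actual ones): each layer starts its span from the context handed down by the layer above, so the `run-script` span ends up nested under the `get` span, which nests under the `build` span.

```go
func trackBuild(ctx context.Context) {
	ctx, span := tracing.StartSpan(ctx, "build")
	defer span.End()

	runGetStep(ctx) // the child span hangs off the build span via ctx
}

func runGetStep(ctx context.Context) {
	ctx, span := tracing.StartSpan(ctx, "get")
	defer span.End()

	runScript(ctx)
}

func runScript(ctx context.Context) {
	_, span := tracing.StartSpan(ctx, "run-script")
	defer span.End()

	// ... run (or attach to) the resource script here; this span times
	// exactly how long the script execution takes.
}
```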
expanding down to the very low level of container creation, this lets us keep track of how long several things take:
- the entire time taken in `findorcreatecontainer`
  - not differentiating between a hit or a miss at the moment
  - not updating the span with the `worker` chosen yet
- the time taken to prepare the base image
  - again, not differentiating between hits/misses
- the time taken to prepare the volumes

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
having a span generated at the place where a task gets run allows us to separate the time taken to perform all of the preparation from the actual running of the executable. ps.: this commit DOES NOT deal with differentiating a complete run from attaching. Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
As of Nov 5th (2019), opentelemetry-go got its first (alpha) release: v0.1.0. With it came some changes, including: 1. the ability to specify Jaeger tags, and 2. a new global tracer registration. This commit addresses [1] by adding an extra flag to the set of `jaeger-*` flags, and [2] by using the new `trace.Provider` interface (tests updated as well). Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
it turns out that using opentelemetry's nooptracer still incurs some cost (as it updates the trace context). by introducing a variable that indicates whether we've registered or not, we're able to short-circuit that entirely, while changing zero code for those consuming the `tracing` api. Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
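In terms of the hypothetical wrapper sketched earlier, the gist of that short-circuit might look like the following (again an assumption, not the actual implementation): a package-level flag is flipped once an exporter-backed tracer is registered, and `StartSpan` bails out before touching the trace context when it isn't.

```go
// Configured is set once a real, exporter-backed tracer has been registered.
var Configured bool

func ConfigureTracer(t Tracer) {
	tracer = t
	Configured = true
}

func StartSpan(ctx context.Context, name string) (context.Context, Span) {
	if !Configured {
		// nothing registered: hand back the caller's context untouched and a
		// do-nothing span, so instrumented call sites don't change at all.
		return ctx, noopSpan{}
	}
	return tracer.StartSpan(ctx, name)
}
```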
Hey, two updates:
The reason for not going with just their NoopTracer is that the library still goes through updating the trace context anyway. E.g., for a random call to ...

Thus, here's how it looked before (using the NoopTracer):

By short-circuiting that entirely (with a variable indicating whether we've registered or not):
Added docs here: concourse/docs#275
Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
documenting the feature added by concourse/concourse#4607 Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
Superseded by #5043. Thanks!
Hey,
In #4247, I brought up that perhaps we should not think of our metrics as events, but rather as indicators of system status - but I didn't really come up with a proposal for what could be the thing that would let operators have something that resembles events (with very precise timing, high cardinality, etc).

Some time ago though, @njbennett brought up something quite interesting (#3958): Concourse could emit build events as traces!

And that's essentially what I tried implementing here, trying to be as neutral as possible with regards to vendors by going straight with OpenTelemetry.
here's the result of this PR:
(yesterday, I demoed the PR to the team - here's where you can find what we talked about: https://ops.tips/notes/tracing-builds-concourse/)
Although the PR does work (yeah, please, try it out!), my goal with this one is to gather some thoughts from all of us to see whether that's something we'd like to do, think about possible problems, and ... just reason about it - thus, I'm marking it as a "draft".
The PR is broken down into a series of commits that kinda tell a story:

- a `tracing` package provides a thin wrapper on top of OpenTelemetry, protecting ourselves from the changes that are constantly happening underneath
- `--tracing-*` flags are added to the `web` command, so that one can configure it
- tracepoints are added to the steps, build tracking, script execution, and container creation
Please let me know what you think! I'm quite excited about the possibilities here, and, if we do this right, I can see this being a big addition to Concourse.
Thank y'all!
here's a sample `docker-compose.yml` that spins up 3 workers and has Jaeger set up: https://gist.github.com/cirocosta/6a676a071d64e0076a30139e76f46a74