experiment: add tracing to ATC #4607
In #4247, I brought up that perhaps
and that's essentially what I tried implementing here, trying to be as neutral
here's the result of this PR:
(yesterday, I demoed the PR to the team - here's where you can find what we talked about: https://ops.tips/notes/tracing-builds-concourse/)
Although the PR does work (yeah, please, try it out!), my goal with this one is
The PR is broken down into a series of commits that kind of tell a story:
Please let me know what you think! I'm quite excited about the possibilities
here's a sample
vito left a comment
I think this is awesome. It's a big piece of my ideal world:
Does that all sound about right to you?
I haven't had a chance to play with these changes (or review them) yet, but I approve in principle and would like to see this merged in and iterated on soon!
vito left a comment
Had time to test it out manually and it works as far as I can tell. Gonna approve even though this is marked experimental since I'd like to see this kind of thing grow quickly.
Think I might work on a DataDog APM output after this is merged.
One thing I noticed: while a build is running, it shows up in the Jaeger UI as
Also, any idea what those naked
Where does this leave PR #4598?
I want to be able to view trace events data in Honeycomb in particular, because as far as I know this is the only system that lets me attach arbitrary fields of arbitrarily high cardinality to events. Traces let me understand how individual pieces of work flow through the system, but what I'm really looking for is "needles in a haystack full of needles" -- I want to be able to view all events of a particular type in aggregate, and then bucket them across arbitrary independent variables, to look for independent variables associated with anomalies.
I'm trying to answer questions like "why do certain resources sometimes stop triggering, and why does restarting the ATC fix that problem?" I'd like to be able to examine as many properties as possible of the events associated with resource triggering (and any other operation that exhibits odd behavior like that.)
It's unclear to me right now if Honeycomb is compatible with the Open Telemetry standard yet. I might be able to hack it together even if it's not, but that would mean spending some time getting that working, rather than working on improving the contents of the events and traces Concourse is emitting.
The other path forward would be to merge my PR so I can continue to experiment with Honeycomb instrumentation, with the intention of rolling whatever work I do there into the Open Telemetry compatible emission path, whenever Honeycomb gets compatibility there fully online.
Let me know which path you'd like me to pursue @cirocosta @vito. I probably won't be back on this in earnest until November 8th, but I'd like to be able to do the work to get some data into Honeycomb at that point and start thinking about what else might be useful to emit.
In general though -- I'm real pleased to see the vision @vito's laid out above -- my main problem with Concourse right now is that I can't always understand what's happening inside of it when something goes wrong, but I think the technology is available now to give operators a "pane of glass" view of the operations of the entire system.
In my experience--
Logs are useful for understanding the exact chain of events on a specific node, or a small number of nodes. They're also handy for storing audit events for long periods of time. Human readability is really important in both cases.
Metrics are key for understanding whether problems that you know a lot about already are happening.
I'd love to see Concourse be able to emit some really rich events and traces, and let metrics and logs focus on being good at what they're good at, rather than first-line debugging tools.
@njbennett Cheers for the feedback! These are all pretty high-impact things involving a lot of mechanical labor to change so it's good to hear signs of alignment from the community before we commit.
For Honeycomb, it sounds like you're in a similar position to mine: I'd like to have support for emitting to DataDog APM, which doesn't seem to have a native OpenTelemetry exporter yet. However, they both have OpenCensus implementations, which seem to fit a very similar interface to
In comparing the two
the tracing package is responsible for providing a thin wrapper on top of opentelemetry. by having this wrapper, we're able to protect ourselves from the underlying changes that `opentelemetry` has been going through as it gets ready for a final release.

the package is meant to be used in a two-step fashion:

1. an exporter is configured (initializing the SDK to not use a noop tracer) - `ConfigureTracer`
2. spans are then created - `StartSpan()` / `span.End()`

so that code can be instrumented in no more than ~3 lines of code.

Signed-off-by: Ciro S. Costa <email@example.com>
this commit makes the tracing functionality available by allowing operators to specify `--tracing-*` flags to the `concourse web` command to leverage the tracing infrastructure put in place.

support for submitting the traces to a jaeger collector can be enabled like so:

    --tracing-jaeger-endpoint=<>
    --tracing-jaeger-service=<>

and for just printing those spans to stdout:

    --tracing-stdout=true

ps.: a little bit of refactoring on `stdout` was made to avoid having to check for `errors` where not needed.

Signed-off-by: Ciro S. Costa <firstname.lastname@example.org>
building upon the tracing package, this commit adds tracepoints to all of the steps that Concourse exposes to pipeline authors. even though most of these steps don't incur any tangible cost, the approach here of making them all visible to the operator is so that one can easily reason about the underlying pipeline, which wouldn't be the case if the trace didn't have the same exact structure. Signed-off-by: Ciro S. Costa <email@example.com>
by adding a tracepoint right after the ATC acquires the lock to track a build, we're able to capture the duration of the entire build as seen by the ATC that's keeping track of it. as the context propagates further down to other steps, we can correlate the entire build with its steps. ps.: this commit DOES NOT address the case of an ATC being phased out in the middle of build tracking and having the work picked up by another installation (although this should be possible if we own our trace IDs). Signed-off-by: Ciro S. Costa <firstname.lastname@example.org>
by adding a tracepoint at this low-level method, we're able to time exactly how long the execution of a given script took. with other higher-level components already instrumented, context propagation lets the span created here be connected with the step that initiated it. e.g.:

    build ---------------
      get ----------
        run-script ------

ps.: this does not differentiate the case of attaching to a process that is already running (similar to how we do not yet deal with build tracking when the ATC starts tracking a build that it didn't track from start to finish). Signed-off-by: Ciro S. Costa <email@example.com>
expanding down to the very low level of container creation, this lets us keep track of how long several things take:

- the entire time taken in `findorcreatecontainer`
  - not differentiating between a hit or a miss at the moment
  - not updating the span with the `worker` chosen yet
- the time taken to prepare the base image
  - again, not differentiating between hits/misses
- the time taken to prepare the volumes

Signed-off-by: Ciro S. Costa <firstname.lastname@example.org>
having a span generated at the place where a task gets run allows us to separate the time taken to perform all of the preparation from the actual running of the executable. ps.: this commit DOES NOT deal with differentiating a complete run from attaching. Signed-off-by: Ciro S. Costa <email@example.com>
As of Nov 5th, 2019, opentelemetry-go got its first (alpha) release: v0.1.0. With it came some changes, including:

1. the ability to specify jaeger tags
2. a new global tracer registration

This commit addresses (1) by adding an extra flag to the set of `jaeger-*` flags, and (2) by using the new `trace.Provider` interface (tests updated as well). Signed-off-by: Ciro S. Costa <firstname.lastname@example.org>
it turns out that using opentelemetry's nooptracer still incurs some cost (as it updates the trace context). by introducing a variable that indicates whether we've registered a tracer or not, we're able to short-circuit that entirely, without changing any code for those consuming the `tracing` api. Signed-off-by: Ciro S. Costa <email@example.com>
The reason for not going with just their NoopTracer is that the library goes
E.g., for a random call to
Thus, here's how it looked before (using the NoopTracer):
By short-circuiting that entirely (with a
Added docs here: concourse/docs#275