
experiment: add tracing to ATC #4607

Closed
wants to merge 10 commits into from

Conversation

@cirocosta
Member

cirocosta commented Oct 16, 2019

Hey,

In #4247, I brought up that perhaps
we should not think of our metrics as events, but rather as indicators of
system status, but I didn't really come up with a proposal for what could give
operators something that resembles events (with very precise timing, high
cardinality, etc).

some time ago though, @njbennett had brought up something quite interesting
(#3958): Concourse could emit build
events as traces!

and, that's essentially what I tried implementing here, trying to be as neutral
as possible with regards to vendors by going straight with opentelemetry.

here's the result of this PR:

[screenshot: Screen Shot 2019-10-15 at 9 17 21 PM]

(yesterday, I demoed the PR to the team - here's where you can find what we talked about: https://ops.tips/notes/tracing-builds-concourse/)

Although the PR does work (yeah, please, try it out!), my goal with this one is
to gather some thoughts from all of us to see whether that's something that we'd
like to do, think about possible problems, and ... just reason about it, thus,
marking it as "draft".

The PR is broken down into a series of commits that kinda tell a story:

  • first, I implement a little wrapper on top of opentelemetry to shield
    ourselves from the changes that are constantly happening underneath
  • then, subsequently, I wire that to the web command, so that one can
    configure it
  • then, I keep adding tracepoints from the very high level build tracker down to
    container creation.

Please let me know what you think! I'm quite excited about the possibilities
here, and, if we do this right, I can see this being a big addition to
Concourse.

Thank y'all!


here's a sample docker-compose.yml that spins up 3 workers and has Jaeger set up: https://gist.github.com/cirocosta/6a676a071d64e0076a30139e76f46a74

@cirocosta cirocosta requested review from concourse/core-api as code owners Oct 16, 2019
Member

vito left a comment

I think this is awesome. It's a big piece of my ideal world:

  • We introduce tracing throughout Concourse so that contributors can see how all the parts work and make it much easier to identify and fix bottlenecks and strange behavior, improving efficiency and resiliency. I'm pretty excited to play with this kind of thing in a live environment - perhaps even with APM?
  • We simplify our logging and focus on readability and contributor UX so it's easier to see when important things happen like build triggering and version detection without all the noise and curly braces. I think we might want to do this by removing all of our lager usage and incrementally adding important log lines back (including any error logs), probably with a different logging library that seems popular in our ecosystem. (We may want to revisit audit logs at this point?) @xoebus and I started on a project to automate this: https://github.com/vito/prohibition - but haven't gotten too far.
  • We continue to invest in metrics, but stop overloading it as an event system and focus more on Prometheus-style static values, reducing cardinality and simplifying our metrics stack. Maintaining push and pull and all kinds of different implementations is a bit of a nightmare. There's a reason we started with only Riemann (...though there were super valid reasons for people to not want to use Riemann, so we backpedaled, but maybe folks would find Prometheus(+Telegraf) more agreeable).

Does that all sound about right to you?

I haven't had a chance to play with these changes (or review them) yet, but I approve in principle and would like to see this merged in and iterated on soon!

Member

vito left a comment

Had time to test it out manually and it works as far as I can tell. Gonna approve even though this is marked experimental since I'd like to see this kind of thing grow quickly. 👍

Think I might work on a DataDog APM output after this is merged.

One thing I noticed: while a build is running it shows up in the Jaeger UI as <trace-without-root-span> - is that normal? Is it because the trace isn't done, and a span is only fully emitted with its name and such once it's done?

Also, any idea what those naked run-script and find-or-create-container spans are from? Maybe it's image_resource checking/fetching? It looks like the run-script is running /opt/resource/check.

[screenshot: Screenshot from 2019-10-25 10-42-03]

@njbennett


njbennett commented Oct 25, 2019

Where does this leave PR #4598?

I want to be able to view trace events data in Honeycomb in particular, because as far as I know this is the only system that lets me attach arbitrary fields of arbitrarily high cardinality to events. Traces let me understand how individual pieces of work flow through the system, but what I'm really looking for is "needles in a haystack full of needles" -- I want to be able to view all events of a particular type in aggregate, and then bucket them across arbitrary independent variables, to look for independent variables associated with anomalies.

I'm trying to answer questions like "why do certain resources sometimes stop triggering, and why does restarting the ATC fix that problem?" I'd like to be able to examine as many properties as possible of the events associated with resource triggering (and any other operation that exhibits odd behavior like that.)

It's unclear to me right now if Honeycomb is compatible with the Open Telemetry standard yet. I might be able to hack it together even if it's not, but that would mean spending some time getting that working, rather than working on improving the contents of the events and traces Concourse is emitting.

The other path forward would be to merge my PR so I can continue to experiment with Honeycomb instrumentation, with the intention of rolling whatever work I do there into the Open Telemetry compatible emission path, whenever Honeycomb gets compatibility there fully online.

Let me know which path you'd like me to pursue @cirocosta @vito. I probably won't be back on this in earnest until November 8th, but I'd like to be able to do the work to get some data into Honeycomb at that point and start thinking about what else might be useful to emit.

@njbennett


njbennett commented Oct 25, 2019

In general though-- I'm real pleased to see the vision @vito 's laid out above -- my main problem with Concourse right now is that I can't always understand what's happening inside of it when something goes wrong, but I think the technology is available now to give operators a "pane of glass" view of the operations of the entire system.

In my experience--
Events and traces are great for getting a high level view of the entire system at once, critical for the early stages of debugging distributed systems, and for finding problems you didn't even know were happening. The big drawback with them is that the datasets tend to be extremely large because they're optimized for understanding, not long term storage.

Logs are useful for understanding the exact chain of events on a specific node, or a small number of nodes. They're also handy for storing audit events for long periods of time. Human readability is really important in both cases.

Metrics are key for understanding whether problems that you know a lot about already are happening.

I'd love to see Concourse be able to emit some really rich events and traces, and let metrics and logs focus on being good at what they're good at, rather than first-line debugging tools.

@vito

Member

vito commented Oct 28, 2019

@njbennett Cheers for the feedback! These are all pretty high-impact things involving a lot of mechanical labor to change so it's good to hear signs of alignment from the community before we commit.

For Honeycomb, it sounds like you're in a similar position to me where I'd like to have support for emitting to DataDog APM which doesn't seem to have a native OpenTelemetry exporter yet. However they both have OpenCensus implementations, which seem to fit a very similar interface to export.SpanSyncer which is returned here, so I bet we could easily add our own wrappers which use them under the hood:

In comparing the two SpanData structs, though, it looks like OpenTelemetry has dropped the Annotations field, which is unfortunate since that sounds like the part you care the most about. With OpenTelemetry being the eventual successor to OpenCensus (as far as I'm aware) I'm not sure if you'll have much choice in the long run. 🤔 I wonder if they replaced that with something else, or if they dropped it? (Or maybe Attributes replaces it? I see it's in both - maybe Annotations was deprecated in OpenCensus and dropped in OpenTelemetry?)

@marco-m

Contributor

marco-m commented Oct 29, 2019

@cirocosta Yes man. You showed that I was not the only one dreaming this. Thank you.

@cirocosta cirocosta added this to In Flight in Worker Resiliency Nov 4, 2019
@cirocosta

Member Author

cirocosta commented Nov 4, 2019

[ipm]: (a reminder for me):

  • add release notes
  • add docs mentioning the experimental state of this
  • check if there's anything left that would prevent this from being merged
cirocosta added 8 commits Oct 15, 2019
the tracing package is responsible for providing a thin wrapper on top
of opentelemetry.

by having this wrapper, we're able to protect ourselves from the changes
that `opentelemetry` keeps going through as it gets ready for a final
release.

the package is meant to be used in a two-step fashion:

1. an exporter is configured (initializing the SDK to not use a noop
   tracer)
  - `ConfigureTracer`

2. spans are then created
  - StartSpan() / span.End()

so that code can be instrumented in no more than ~3 lines of code.
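
for illustration, here's a rough sketch of what a caller of this wrapper could
look like (the `ConfigureTracer` arguments and the `Attrs` literal are guesses;
only `StartSpan`/`End` follow the shape described above):

    package main

    import (
        "context"
        "fmt"

        "github.com/concourse/concourse/atc/tracing"
    )

    func checkResource(ctx context.Context, name string) {
        // step 2: ~3 lines per instrumented code path.
        ctx, span := tracing.StartSpan(ctx, "resource-check", tracing.Attrs{
            "resource": name, // assuming Attrs is a simple map-like type
        })
        defer span.End()

        fmt.Println("checking", name) // the real work would happen here, using ctx
    }

    func main() {
        // step 1: done once at startup -- the exporter-specific arguments
        // are left out here (see the `web` command wiring in the next
        // commit):
        //
        //     tracing.ConfigureTracer(...)

        checkResource(context.Background(), "my-repo")
    }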

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
this commit makes the tracing functionality available by allowing
operators to specify `--tracing-*` flags to the `concourse web` command
to leverage the tracing infrastructure put in place.

submitting traces to a jaeger collector can be configured like so:

	--tracing-jaeger-endpoint=<>
	--tracing-jaeger-service=<>

and for just printing those spans to stdout:

	--tracing-stdout=true

ps.: a little bit of refactoring of `stdout` was done to avoid having to
check for `errors` where not needed.
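
for reference, the flag wiring could look roughly like the sketch below
(field names and the go-flags-style struct tags are illustrative, not
necessarily what this commit ends up with):

    package web

    // TracingConfig sketches how the flags above could be declared,
    // assuming the struct-tag style the other `concourse web` flags use.
    type TracingConfig struct {
        JaegerEndpoint string `long:"tracing-jaeger-endpoint" description:"jaeger collector endpoint to send spans to"`
        JaegerService  string `long:"tracing-jaeger-service" description:"service name to report along with the spans"`
        Stdout         bool   `long:"tracing-stdout" description:"print spans to stdout"`
    }

    // at startup, the web command would inspect these fields and call
    // tracing.ConfigureTracer with the matching exporter (jaeger or
    // stdout); with none of them set, the tracer stays a noop.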

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
building upon the tracing package, this commit adds tracepoints to all
of the steps that Concourse exposes to pipeline authors.

even though most of these steps don't incur any tangible cost, making
them all visible gives the operator a trace that mirrors the exact
structure of the pipeline, which is much easier to reason about than one
that doesn't.

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
by adding a tracepoint right after the ATC acquires the lock to track a
build, we're able to capture the duration of the entire build as seen by
the ATC that's keeping track of it.

as the context propagates further down to other steps, we can correlate
the entire build with its steps.

ps.: this commit DOES NOT address the case of an ATC being phased out in
the middle of the build tracking and having the work picked up by
another installation (although this should be possible if we own our
traceIDs).

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
by adding a tracepoint at this low-level method, we're able to time
exactly how long the execution of a given script takes.

with the higher-level components already instrumented, context
propagation connects the span created here to the step that initiated
it.

e.g.:

	build           ---------------
	  get                ----------
	    run-script           ------

ps.: this does not differentiate attaching to a process that is already
running from starting a new one (similar to how we do not yet deal with
an ATC picking up tracking of a build that it didn't track from start to
finish).
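
to make the propagation concrete, the nesting above falls out of simply
passing the ctx returned by each StartSpan call further down; a sketch
(function and span names are made up, and passing nil attrs assumes the
attrs argument is optional):

    package main

    import (
        "context"

        "github.com/concourse/concourse/atc/tracing"
    )

    func trackBuild(ctx context.Context) {
        ctx, span := tracing.StartSpan(ctx, "build", nil)
        defer span.End()

        runGetStep(ctx) // spans started from this ctx nest under "build"
    }

    func runGetStep(ctx context.Context) {
        ctx, span := tracing.StartSpan(ctx, "get", nil)
        defer span.End()

        runScript(ctx)
    }

    func runScript(ctx context.Context) {
        _, span := tracing.StartSpan(ctx, "run-script", nil)
        defer span.End()

        // the resource script would actually be executed here
    }

    func main() {
        trackBuild(context.Background())
    }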

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
expanding down to the very low level of container creation, here we
keep track of how long several things take:

- entire time taken in `findorcreatecontainer`
  - not differentiating between a hit or miss at the moment
  - not updating the span with the `worker` chosen yet
- time taken to prepare the base image
  - again, not differentiating between hits/misses
- time taken to prepare the volumes

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
having a span generated at the place where a task gets run allows us to
separate the time taken to perform all of the preparation from the
actual running of the executable.

ps.: this commit DOES NOT deal with differentiating a complete run from
attaching.

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
As of nov 5th (2019), opentelemetry-go got its first (alpha) release:
v0.1.0. With it, some changes came, including:

1. ability to specify jaeger tags
2. a new global tracer registration

This commit addresses [1] by adding an extra flag to the set of
`--tracing-jaeger-*` flags, and [2] by using the new `trace.Provider` interface
(tests updated as well).

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
@cirocosta cirocosta force-pushed the build-tracing-1 branch from abbbea2 to 59389fc Nov 9, 2019
Worker Resiliency automation moved this from In Flight to PR Review Nov 9, 2019
it turns out that using opentelemetry's NoopTracer still incurs some
cost (as it updates the trace context).

by introducing a variable that indicates whether we've registered an
exporter or not, we're able to short-circuit that entirely, without
changing any code for consumers of the `tracing` API.
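
a minimal sketch of the idea (the otel import paths and the
NoopSpan/Attrs details are my guesses at the v0.1.0 api; only the
`global.TraceProvider().GetTracer("atc").Start(...)` call mirrors what's
shown in the benchmark comment further down):

    package tracing

    import (
        "context"

        "go.opentelemetry.io/otel/api/global"
        "go.opentelemetry.io/otel/api/trace"
    )

    // Configured is flipped to true once an exporter gets registered.
    var Configured bool

    // Attrs is sketched as a plain map here.
    type Attrs map[string]string

    // StartSpan short-circuits to a NoopSpan when nothing was registered,
    // returning the ctx untouched and thereby avoiding the
    // context.WithValue allocation that the stock NoopTracer would incur.
    func StartSpan(ctx context.Context, component string, attrs Attrs) (context.Context, trace.Span) {
        if !Configured {
            return ctx, trace.NoopSpan{}
        }

        ctx, span := global.TraceProvider().GetTracer("atc").Start(ctx, component)
        // attrs would be copied onto the span here (elided in this sketch).
        return ctx, span
    }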

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
@cirocosta

Member Author

cirocosta commented Nov 11, 2019

Hey,

two updates:

  1. bumped opentelemetry-go to v0.1.0 - their first release (from 6 days ago
    😁)

  2. added a benchmark to ensure that we're close to a zero-cost abstraction
    ("what you don't use, you don't pay; what you use, you couldn't handcode any
    better") - more specifically w/ regards to the "you don't pay if you don't
    use".

1 was a matter of updating the tracer registration, and the imports.
2 was all about us providing the NoopSpan when not registered.

The reason for not going with just their NoopTracer is that the library goes
ahead w/ growing the context that you provide to it, which turns into 1
alloc per op.

E.g., for a random call to StartSpan, we can see that we get to context.WithValue:

    context.WithValue
        at /usr/local/go./context/context.go:487
    go.opentelemetry.io/otel/api/trace.SetCurrentSpan
        at /go/pkg/mod/go.opentelemetry.io/otel@v0.1.0/api/trace/current.go:28
    go.opentelemetry.io/otel/api/trace.NoopTracer.Start
        at /go/pkg/mod/go.opentelemetry.io/otel@v0.1.0/api/trace/noop_trace.go:50
    go.opentelemetry.io/otel/api/trace.(*NoopTracer).Start
        at <autogenerated>:1
    github.com/concourse/concourse/atc/tracing.StartSpan
        at ./atc/tracing/tracer.go:55

Thus, here's how it looked before (using the NoopTracer):

    goos: darwin
    goarch: amd64
    pkg: github.com/concourse/concourse/atc/tracing
    BenchmarkNoTracing-12               33.2 ns/op     0 B/op   0 allocs/op
    BenchmarkNoopTracer-12              84.1 ns/op    48 B/op   1 allocs/op
    BenchmarkNoopTracerWithAttrs-12      193 ns/op    96 B/op   2 allocs/op


         flat  flat%   sum%        cum   cum%
      672.53MB   100%   100%   672.53MB   100%  context.WithValue
             0     0%   100%   672.53MB   100%  github.com/concourse/concourse/atc/tracing.StartSpan
             0     0%   100%   672.53MB   100%  github.com/concourse/concourse/atc/tracing_test.BenchmarkNoopTracer
             0     0%   100%   672.53MB   100%  github.com/concourse/concourse/atc/tracing_test.noop
             0     0%   100%   672.53MB   100%  go.opentelemetry.io/otel/api/trace.NoopTracer.Start
             0     0%   100%   672.53MB   100%  go.opentelemetry.io/otel/api/trace.SetCurrentSpan


    Total: 672.53MB
             0   672.53MB (flat, cum)   100% of Total
             .          .     50:func StartSpan(
             .          .     51:   ctx context.Context,
             .          .     52:   component string,
             .          .     53:   attrs Attrs,
             .          .     54:) (context.Context, trace.Span) {
             .   672.53MB     55:   ctx, span := global.TraceProvider().GetTracer("atc").Start(
             .          .     56:           ctx,
             .          .     57:           component,
             .          .     58:   )
             .          .     59:


    Total: 672.53MB
      672.53MB   672.53MB (flat, cum)   100% of Total
             .          .    482:           panic("nil key")
             .          .    483:   }
             .          .    484:   if !reflectlite.TypeOf(key).Comparable() {
             .          .    485:           panic("key is not comparable")
             .          .    486:   }
      672.53MB   672.53MB    487:   return &valueCtx{parent, key, val}
             .          .    488:}
             .          .    489:
             .          .    490:// A valueCtx carries a key-value pair. It implements Value for that key and
             .          .    491:// delegates all other calls to the embedded Context.
             .          .    492:type valueCtx struct {

By short-circuiting that entirely (with a `tracing.Configured` variable), we're
then able to avoid all of that, which brings us down to 0 allocs:

    goos: darwin
    goarch: amd64
    pkg: github.com/concourse/concourse/atc/tracing
    BenchmarkNoTracing-12                33.2 ns/op  0 B/op    0 allocs/op
    BenchmarkNoopTracerWithAttrs-12      55.5 ns/op  0 B/op    0 allocs/op
    BenchmarkNoopTracer-12               37.0 ns/op  0 B/op    0 allocs/op
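
(for reference, a benchmark along these lines might be written roughly like
the sketch below - the name mirrors the output above, but the body is just my
guess at how the measurement could be done:)

    package tracing_test

    import (
        "context"
        "testing"

        "github.com/concourse/concourse/atc/tracing"
    )

    // exercises StartSpan with no exporter configured; with the
    // short-circuit in place this should report 0 B/op and 0 allocs/op.
    func BenchmarkNoopTracer(b *testing.B) {
        ctx := context.Background()

        b.ReportAllocs()
        for i := 0; i < b.N; i++ {
            _, span := tracing.StartSpan(ctx, "check", nil)
            span.End()
        }
    }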

Added docs here: concourse/docs#275

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
cirocosta added a commit to concourse/docs that referenced this pull request Nov 11, 2019
documenting the feature added by concourse/concourse#4607

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
@xtreme-sameer-vohra xtreme-sameer-vohra added this to In progress in Runtime side-road via automation Nov 25, 2019
@xtreme-sameer-vohra xtreme-sameer-vohra removed this from PR Review in Worker Resiliency Nov 25, 2019
@xtreme-sameer-vohra xtreme-sameer-vohra moved this from In progress to Review in progress in Runtime side-road Nov 25, 2019
@cirocosta cirocosta moved this from Review in progress to To do in Runtime side-road Jan 9, 2020
@xtreme-sameer-vohra xtreme-sameer-vohra moved this from To do to Review in progress in Runtime side-road Jan 13, 2020
@cirocosta cirocosta mentioned this pull request Jan 20, 2020
@cirocosta

Member Author

cirocosta commented Jan 20, 2020

Superseded by #5043

Thanks!

@cirocosta cirocosta closed this Jan 20, 2020
Runtime side-road automation moved this from Review in progress to Done Jan 20, 2020
@cirocosta cirocosta deleted the build-tracing-1 branch Jan 21, 2020
@xtremerui xtremerui mentioned this pull request Jan 23, 2020