Emit build events as traces in order to visualize the behavior of complex jobs and pipelines #3958

Closed
njbennett opened this issue Jun 1, 2019 · 10 comments

Comments

@njbennett njbennett commented Jun 1, 2019

What challenge are you facing?

The Cloud Foundry, PAS, and PKS release engineering teams have complex pipelines, and it's hard to understand their behavior holistically. Many jobs within the pipelines are "flaky" but it's difficult to prioritize work to reduce intermittent failures without knowing which intermittent failures are the most common.

It's also difficult for us to understand the effects of changes to our pipelines. Concourse is our implementation layer, but we're using it to model high-level concepts. We care about things like a "run" of a pipeline, or the status of a release train that might represent several partial pipeline runs.

We would like to be able to answer questions like:

  • "Which jobs fail the most?"
  • "Which job failures are the most common?"
  • "Which tasks shared across many jobs fail the most?"
  • "Which task failures are the most common?"
  • "How are many failures in our pipeline are due to underlying problems with Concourse, and how many failures are due to problems with the software under test?"
  • "How long does it take a particular version of a particular resource to make it from the start of the pipeline to the finish?"
  • "Which jobs have a particular version of a resource passed, which jobs have a particular version of a resource failed, and what was the cause of those failures?"

Answering these questions would help us prioritize work to reduce intermittent failures, identify and report bugs to the Concourse team, and measure the impact of changes to our pipelines on their performance.

We can do some of this by exploring the GUI or scraping data from the Concourse API. The former is time-consuming; the latter requires relatively fragile code that will break whenever the API changes. And, as I understand the API's authentication model, accessing the most interesting data requires a credential representing a user, which makes it difficult to poll from a machine.

We've been able to answer questions like this with fairly high precision about our acceptance tests for some time now, which has made it easier to keep intermittent failures from that source under control. However, many of the failures we're interested in occur in other steps of our release engineering process, especially deploy and teardown.

What would make this better?

I'd like to be able to generate Honeycomb build events from Concourse tasks, and chain them together into traces.

These events would ideally contain (see the sketch after this list):

  • A span ID representing a particular PAS/cf-deployment/PKS version, so we could follow the progress of a particular version across the entire set of pipelines we manage
  • A span ID representing each individual job, so we could link together tasks into jobs
  • Fields with the version (and possibly other identifying information) from each resource used in the task
  • The exit status of the task
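
For concreteness, here's a minimal sketch of what one of these events might look like if emitted with Honeycomb's libhoney-go library, using Honeycomb's standard trace fields (trace.trace_id, trace.span_id, trace.parent_id) for the chaining. All of the IDs, field names, and values below are illustrative assumptions on my part, not an existing Concourse schema:

```go
package main

import (
	libhoney "github.com/honeycombio/libhoney-go"
)

func main() {
	// Assumed configuration; the write key would come from the installation's secrets.
	libhoney.Init(libhoney.Config{
		WriteKey: "YOUR_HONEYCOMB_KEY",
		Dataset:  "concourse-builds",
	})
	defer libhoney.Close()

	ev := libhoney.NewEvent()
	ev.Add(map[string]interface{}{
		// Honeycomb's standard tracing fields, chaining task -> job -> version.
		"trace.trace_id":  "cf-deployment-v9.5.0", // hypothetical ID for a product version
		"trace.parent_id": "job-1234",             // hypothetical span ID for the enclosing job
		"trace.span_id":   "task-5678",            // hypothetical span ID for this task

		// Illustrative task-level fields.
		"name":                            "deploy",
		"resource.cf-deployment.version": "v9.5.0",
		"exit_status":                     0,
		"duration_ms":                     421000,
	})
	ev.Send()
}
```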

This is currently possible with Travis CI, CircleCI, and Jenkins via https://github.com/honeycombio/buildevents, but that implementation depends on information that Concourse (I believe; I might be wrong) does not currently make available to tasks.

We could potentially hack something together that would enable this via some kind of resource, but that requires exposing information to tasks that tasks probably shouldn't have, or information to resources about other resources that they probably shouldn't have.

The "best" solution I've been able to think of is for Concourse to generate these events itself, using the Honeycomb beeline for golang, but this increases the complexity of Concourse itself for an optional feature that may not be useful for many users.

It also potentially requires ongoing comprehension and maintenance of this event emission system by the Concourse team. These events might be just as useful, if not more so, to the Concourse team itself for debugging and understanding Concourse, and might lead to lower-level event emission about internal Concourse behavior that could make it much easier to pin down odd distributed-systems behavior. But it is a cost, and it makes me wary of attempting to PR this behavior in.

Are you interested in implementing this yourself?

Very! However, there are a lot of options, and I need guidance from the Concourse team about which is the most sensible, or if there are possibilities that I'm missing.

@jchesterpivotal jchesterpivotal commented Jun 1, 2019

I'm heartily in favour of having a standard, general stream of events with embedded sequencing or context. Right now I achieve some of this end by scraping endpoints and inserting data into a schema of my own, but it would be much easier to consume if it arrived in a predictable, uniform fashion.

@marco-m marco-m commented Jun 1, 2019

Simpler but related: #3905

@stale stale bot commented Jul 31, 2019

Beep boop! This issue has been idle for long enough that it's time to check in and see if it's still important.

If it is, what is blocking it? Would anyone be interested in submitting a PR or continuing the discussion to help move things forward?

If no activity is observed within the next week, this issue will be ~~exterminated~~ closed, in accordance with our stale issue process.

@stale stale bot added the wontfix label Jul 31, 2019
@jama22 jama22 commented Aug 2, 2019

Very interesting, @njbennett. Thanks for taking the time to frame the specific problems you're having and outlining the world of data you would like to see. I think the datapoints you mentioned would be very useful for Concourse users to have; I know as a PM I've had issues getting quick access to that data as well.

This definitely sounds like data that should be emitted by Concourse itself, so I think you're on the right track with the idea of emitting events. If you're willing to do the work, PRs are welcome. In the past we've had similar PRs come in around metrics emitters that are purely community driven and community maintained, e.g. Prometheus and Datadog. I'll only note that there seem to be some changes coming down the line with The Algorithm that may make it easier for you to calculate this data (mentioned in #3905), and we have an aspirational issue around consolidating metrics emitters (#2896) that's in the icebox.
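
For anyone unfamiliar with that pattern, here's a rough sketch of the shape a pluggable emitter usually takes. This interface is hypothetical, for illustration only, and is not our actual metric package:

```go
package metric

// Event is a hypothetical, backend-agnostic representation of a
// single measurement or occurrence inside Concourse.
type Event struct {
	Name       string
	Value      float64
	Attributes map[string]string
}

// Emitter is the hypothetical plug point: a Prometheus, Datadog, or
// Honeycomb emitter would each implement it and translate Events
// into its backend's native format.
type Emitter interface {
	Emit(event Event)
}
```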

I also have no idea how popular Honeycomb.io is, so I'll take your word that it may be a niche emitter option.

cc @vito to fact check any crazy claims I might have made
cc @scottietremendous @matthewpereira because the Concourse for PCF team is looking at building out metrics and may end up talking to you

@stale stale bot removed the wontfix label Aug 2, 2019
@njbennett njbennett commented Aug 2, 2019

Sounds good, @jama22! Today I'm going to take a crack at forking Concourse and building it locally.

If my adding internal metrics sounds reasonable to you, I may start with a PR that adds one or two event types representing actions involved in resource checking/scheduling. We periodically have problems with resources that just stop being updated, and we'd like to understand that behavior a bit better.
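
As a rough illustration of what I have in mind, here's a hypothetical hook around a resource check that emits one event per check via libhoney-go. The function names and fields are my own placeholders, not Concourse internals:

```go
package events

import (
	"context"
	"time"

	libhoney "github.com/honeycombio/libhoney-go"
)

// checkResource is a hypothetical stand-in for wherever Concourse
// actually performs a resource check.
func checkResource(ctx context.Context, pipeline, resource string) error {
	start := time.Now()
	err := doCheck(ctx, resource) // placeholder for the real check

	ev := libhoney.NewEvent()
	ev.Add(map[string]interface{}{
		"event":       "resource-check",
		"pipeline":    pipeline,
		"resource":    resource,
		"duration_ms": time.Since(start).Milliseconds(),
		"succeeded":   err == nil,
	})
	ev.Send()
	return err
}

// doCheck is a placeholder for the real check implementation.
func doCheck(ctx context.Context, resource string) error { return nil }
```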

If I do take that route, would you like me to open a separate issue for a generic "Add Honeycomb support" feature?

@Justin-W Justin-W commented Aug 16, 2019

I have very similar distributed tracing requirements to those described by @njbennett, except in our case we're looking to use Datadog (via either their native ddagent or a generic OpenTracing-compatible implementation) instead of Honeycomb.

@Justin-W Justin-W commented Aug 16, 2019

> Today I'm going to take a crack at forking Concourse and building it locally.

@njbennett: Did you make any progress on your fork? I suspect that the same sections of Concourse code would need to be touched, regardless of whether Honeycomb or Datadog is being used. So I might be able to piggyback off your work and/or collaborate if you have already made some progress.

@stale stale bot commented Oct 15, 2019

Beep boop! This issue has been idle for long enough that it's time to check in and see if it's still important.

If it is, what is blocking it? Would anyone be interested in submitting a PR or continuing the discussion to help move things forward?

If no activity is observed within the next week, this issue will be ~~exterminated~~ closed, in accordance with our stale issue process.

@stale stale bot added the wontfix label Oct 15, 2019
@ringods ringods commented Oct 16, 2019

If an implementation of this happens, please make sure to use OpenTelemetry and not OpenTracing. OpenTelemetry is the successor to, and merger of, the OpenTracing and OpenCensus projects.
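
For reference, here's a minimal sketch of what that could look like with the OpenTelemetry Go API, assuming a TracerProvider and exporter are configured elsewhere. The span and attribute names are illustrative:

```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

func runJob(ctx context.Context) {
	tracer := otel.Tracer("concourse")

	// One span for the job, with a child span per task.
	ctx, jobSpan := tracer.Start(ctx, "job: deploy")
	defer jobSpan.End()

	_, taskSpan := tracer.Start(ctx, "task: bosh-deploy")
	taskSpan.SetAttributes(
		attribute.String("concourse.pipeline", "cf-release"),         // illustrative
		attribute.String("resource.cf-deployment.version", "v9.5.0"), // illustrative
		attribute.Int("exit_status", 0),
	)
	taskSpan.End()
}
```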

@stale stale bot commented Dec 15, 2019

Beep boop! This issue has been idle for long enough that it's time to check in and see if it's still important.

If it is, what is blocking it? Would anyone be interested in submitting a PR or continuing the discussion to help move things forward?

If no activity is observed within the next week, this issue will be ~~exterminated~~ closed, in accordance with our stale issue process.

@stale stale bot added the wontfix label Dec 15, 2019
@stale stale bot closed this Dec 22, 2019