Emit build events as traces in order to visualize the behavior of complex jobs and pipelines #3958
Comments
I'm heartily in favour of having a standard, general stream of events with embedded sequencing or context. Right now I achieve some of this by scraping endpoints and inserting the data into a schema of my own, but it would be much easier to consume if it arrived in a predictable, uniform fashion.
Simpler but related: #3905
Beep boop! This issue has been idle for long enough that it's time to check in and see if it's still important. If it is, what is blocking it? Would anyone be interested in submitting a PR or continuing the discussion to help move things forward? If no activity is observed within the next week, this issue will be |
Very interesting, @njbennett. Thanks for taking the time to frame the specific problems you're having and to outline the world of data you would like to see. I think the data points you mentioned would be very useful for Concourse users to have; I know as a PM I've had issues getting quick access to that data as well. This definitely sounds like data that should be emitted by Concourse itself, so I think you're on the right track with the idea of emitting events. If you're willing to do the work, PRs are welcome. In the past we've had similar PRs come in around metrics emitters that are purely community driven and community maintained, e.g. Prometheus and Datadog. I'll only note that there seem to be some changes coming down the line with The Algorithm that may make it easier for you to calculate this data (mentioned in #3905), and we have an aspirational issue around consolidating metrics emitters, #2896, that's in the icebox. I also have no idea how popular Honeycomb.io is, so I'll take it at your word that it may be a niche emitter option. cc @vito to fact check any crazy claims I might have made
Sounds good! @jama22 Today I'm going to take a crack at forking Concourse and building it locally. If my adding internal metrics sounds reasonable to you, I may start with a PR that adds one or two event types representing actions involved in resource checking/scheduling. We periodically have problems with resources that just stop being updated, and we'd like to understand that behavior a bit better. If I do take that route, would you like me to open a separate issue for a generic "Add Honeycomb support" feature?
I have very similar distributed tracing requirements to those described by @njbennett, except in our case we're looking to use Datadog (via either their native ddagent or via a generic OpenTracing-compatible implementation) instead of Honeycomb.
@njbennett: Did you make any progress on your fork? I suspect that the same sections of Concourse code would need to be touched, regardless of whether Honeycomb or Datadog is being used. So I might be able to piggyback off your work and/or collaborate if you have already made some progress.
If an implementation of this does happen, please make sure to use OpenTelemetry and not OpenTracing. OpenTelemetry is the successor to, and merger of, the OpenTracing and OpenCensus projects.
What challenge are you facing?
The Cloud Foundry, PAS, and PKS release engineering teams have complex pipelines, and it's hard to understand their behavior holistically. Many jobs within the pipelines are "flaky" but it's difficult to prioritize work to reduce intermittent failures without knowing which intermittent failures are the most common.
It's also difficult for us to understand the effects of changes to our pipelines. Concourse is our implementation layer, but we're using it to model high-level concepts. We care about things like a "run" of a pipeline, or the status of a release train that might represent several partial pipeline runs.
We would like to be able to answer questions like:
Answering these questions would help us prioritize work to reduce intermittent failures, identify and report bugs to the Concourse team, and measure the impact of changes to our pipelines on their performance.
We can do some of this by exploring the GUI or scraping data from the Concourse API. The former is time-consuming; the latter requires building relatively fragile code that will break whenever that API changes. And, as I understand the API's authentication model, it requires a credential representing a user to access the most interesting data, which makes it difficult to poll with a machine.
We've been able to answer questions like this with fairly high precision about our acceptance tests for some time now, which has made it easier to keep intermittent failures from that source under control. However, many of the failures we're interested in occur in other steps of our release engineering process, especially deploy and teardown.
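As a rough illustration of the kind of question we mean ("which intermittent failures are the most common?"), here is a minimal sketch of ranking failures once build events are available as plain records. The record shape (`job`, `step`, `status` keys) and the `rank_failures` helper are assumptions made up for this example; Concourse does not emit events in this form today.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_failures(events: Iterable[dict]) -> List[Tuple[Tuple[str, str], int]]:
    """Count failed events per (job, step) pair, most frequent first.

    Each event is assumed to be a dict with hypothetical keys
    "job", "step", and "status".
    """
    counts = Counter(
        (e["job"], e["step"]) for e in events if e["status"] == "failed"
    )
    return counts.most_common()


# Hypothetical sample data, not real pipeline output.
sample_events = [
    {"job": "deploy", "step": "bosh-deploy", "status": "failed"},
    {"job": "deploy", "step": "bosh-deploy", "status": "succeeded"},
    {"job": "deploy", "step": "bosh-deploy", "status": "failed"},
    {"job": "teardown", "step": "delete-env", "status": "failed"},
]

for (job, step), failures in rank_failures(sample_events):
    print(f"{job}/{step}: {failures} failed")
```

With a uniform event stream, this sort of aggregation would not need to depend on scraping the API.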
What would make this better?
I'd like to be able to generate Honeycomb build events from Concourse tasks, and chain them together into traces.
These events would ideally contain:
This is currently possible with Travis-CI, CircleCI, and Jenkins via https://github.com/honeycombio/buildevents, but this implementation depends on information that Concourse (I believe, might be wrong) does not currently make available to tasks.
We could potentially hack something together that would enable this via some kind of resource, but that requires exposing information to tasks that tasks probably shouldn't have, or information to resources about other resources that they probably shouldn't have.
The "best" solution I've been able to think of is for Concourse to generate these events itself, using the Honeycomb beeline for golang, but this increases the complexity of Concourse itself for an optional feature that may not be useful for many users.
It also potentially requires ongoing comprehension and maintenance of this event emission system by the Concourse team. These events might be just as useful, if not more so, to the Concourse team itself for debugging and understanding Concourse, and might lead to lower-level event emission about internal Concourse behavior that could make it much easier to pin down odd distributed systems behavior. But it is a cost, and it makes me wary of attempting to PR this behavior in.
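To make "chain them together into traces" a bit more concrete, here is a minimal sketch of the idea: every event carries a shared trace ID plus the ID of its parent event, so a backend can reassemble a build into a tree. It uses only the Python standard library and does not call Honeycomb, OpenTelemetry, or Concourse code; the `BuildEvent` type, its field names, and the helper functions are all hypothetical.

```python
import json
import secrets
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class BuildEvent:
    """One span in a trace. All field names are illustrative assumptions."""
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    parent_id: Optional[str] = None  # None marks the root of the trace
    duration_ms: float = 0.0
    status: str = "succeeded"


def new_trace_id() -> str:
    """Generate a random identifier shared by every event in one trace."""
    return secrets.token_hex(16)


def child_of(parent: BuildEvent, name: str, duration_ms: float,
             status: str = "succeeded") -> BuildEvent:
    """Create an event in the parent's trace, linked to the parent's span."""
    return BuildEvent(
        name=name,
        trace_id=parent.trace_id,
        parent_id=parent.span_id,
        duration_ms=duration_ms,
        status=status,
    )


def to_json_line(event: BuildEvent) -> str:
    """Serialize an event as one JSON line, ready to ship to some backend."""
    return json.dumps(asdict(event), sort_keys=True)


# Hypothetical example: one build with a single failed task step.
build = BuildEvent(name="build", trace_id=new_trace_id())
deploy = child_of(build, "task: deploy", duration_ms=1234.0, status="failed")
print(to_json_line(build))
print(to_json_line(deploy))
```

The difficulty described above is that the trace ID and parent ID would need to come from somewhere that spans steps and jobs, which is why having Concourse generate the events itself looks like the cleanest option.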
Are you interested in implementing this yourself?
Very! However, there are a lot of options, and I need guidance from the Concourse team about which is the most sensible, or if there are possibilities that I'm missing.