
v6.0.0

@vito released this 25 Mar 18:33
559620b

Concourse v6.0: it does things gooder.™

A whole new algorithm for deciding job inputs has been implemented which performs much better for Concourse instances that have a ton of version and build history. This algorithm works in a fundamentally different way, and in some situations will decide on inputs that differ from the previous algorithm. (More details follow in the actual release notes below.)

In the past we've done major version bumps as a ceremony to celebrate big shiny new features. This is the first time it's been done because there are backwards-incompatible changes to fundamental pipeline semantics.

We have tested this release at a larger scale than we've ever tried to support before, and we've been using it in our own environments for a while now. Despite all that, we still recommend that anyone using Concourse for mission-critical workflows (e.g. shipping security updates) wait for the next few releases, just in case any edge cases are found.

IMPORTANT: Please expect and prepare for some downtime when upgrading to v6.0. On our large scale deployments we have observed 10-20 minutes of downtime as the database is migrated, but this will obviously vary depending on the amount of data.

As this is a significant release with changes that may affect user workflows, we will be taking some time after this release to listen for feedback before moving on to the next big thing.

Please leave any v6.0 feedback you have, good or bad, in issue #5360!

🔗 feature, fix, breaking

  • A new algorithm for determining inputs for jobs has been implemented.

    This new algorithm significantly reduces resource utilization on the web and db nodes, especially for long-lived and/or large-scale Concourse installations.

    The old algorithm used to load all of the resource versions, build inputs, and build outputs into memory and then use brute force to figure out what the next inputs would be. This method worked well enough in most cases, but on a long-lived deployment with thousands or even millions of versions or builds it would start to put a lot of strain on the web and db nodes just to load up the dataset. In the future, we plan to collect all versions of resources, which would make this even worse.

    The new algorithm takes a very different approach which does not require the entire dataset to be held in memory and cuts out nearly all of the "brute force" aspect of the old algorithm. We even make use of fancy jsonb index functionality in Postgres; a successful build's set of resource versions is stored in a table which we can easily "intersect" in order to find matching candidates when evaluating passed constraints.

    For a more detailed explanation of how the new algorithm works, check out the section on this in the v10 blog post.

    Before we show the shiny charts showing the improved performance, let's cover the breaking change that the new algorithm needed:

  • Breaking change: for inputs with passed constraints, the algorithm now chooses versions based on the build history of each job in the passed constraint, rather than version history of the input's resource.

    This might make more sense with an example. Let's say we have a pipeline with a resource (Resource) that is used as an input to two jobs (Job 1 and Job 2):

    Difference in behavior between old and new algorithm

    Resource has three versions: v1 (oldest), v2, and v3 (newest).

    Job 1 has Resource as an unconstrained input, so it will always grab the latest version available - v3. In the scenario above, it did this for Build 1, but then a pipeline operator pinned v1, so Build 2 ran with v1. So now we have both v1 and v3 having "passed" Job 1, but in reverse order.

    The difference between the old algorithm and the new one is which version Job 2 will use for its next build when v1 is un-pinned.

    With the old algorithm, Job 2 would choose v3 as the input version as shown by the orange line. This is because the old algorithm would start from the latest version and then check if that version satisfies the passed constraints.

    With the new algorithm, Job 2 will instead end up with v1, as shown by the green line. This is because the new algorithm starts with the versions from the latest build of the jobs listed in passed constraints, searching through older builds if necessary.

    The resulting behavior is that pipelines now propagate versions downstream from job to job rather than working backwards from versions.

    This approach to selecting versions is much more efficient because it cuts out the "brute force" aspect: by treating the passed jobs as the source of versions, we inherently only attempt versions which already satisfy the constraints and passed through the same build together.

    The remaining challenge then is to find versions which satisfy all of the passed constraints, which the new algorithm does with a simple query utilizing a jsonb index to perform a sort of 'set intersection' at the database level. It's pretty neato!
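
    To make this concrete, here is a minimal sketch of the pipeline described above (the resource type, job names, and task files are hypothetical):

    ```yaml
    resources:
    - name: resource
      type: git
      source:
        uri: https://example.com/repo.git

    jobs:
    - name: job-1
      plan:
      - get: resource
        trigger: true
      - task: unit
        file: resource/ci/unit.yml

    - name: job-2
      plan:
      # With the new algorithm, candidate versions for this get step come from
      # the build history of job-1 (newest build first), rather than from the
      # version history of the resource itself.
      - get: resource
        trigger: true
        passed: [job-1]
      - task: integration
        file: resource/ci/integration.yml
    ```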

  • Improved metrics: now that the breaking change is out of the way, let's take a look at the metrics from our large-scale test environment and see if the whole thing was worth it from an efficiency standpoint.

    The first metric shows the database CPU utilization:

    Database CPU Usage

    The left side shows that the CPU was completely pegged at 100% before the upgrade. This resulted in a slow web UI, slow pipeline scheduling performance, and complaints from our Concourse tenants.

    The right side shows that after upgrading to v6.0 the usage dropped to ~65%. This is still pretty high, but keep in mind that we intentionally gave this environment a pretty weak database machine so we don't just keep scaling up and pretending our users have unlimited funds for beefy hardware. Anything less than 100% usage here is a win.

    This next metric shows database data transfer:

    Database Data Transfer

    This shows that after upgrading to 6.0 we do a lot less data transfer from the database, because we no longer have to load the full algorithm dataset into memory.

    Not having to load the versions DB is also reflected in the amount of time it took just to do it as part of scheduling:

    Load VersionsDB machine hours

    This graph shows that at the time just before the upgrade, the web node was spending 1 hour and 11 minutes of time per half-hour just loading the dataset. This entire step is gone, as reflected by the graph ending upon the upgrade to 6.0.

    Another optimization we've made is to simply not do work that doesn't need to be done. Previously Concourse would schedule every job every 10 seconds, but now it'll only schedule jobs which actually need to be scheduled.

    This heat map, combined with the graph below, shows the before vs. after distribution of job scheduling time. After 6.0, much less time is spent scheduling jobs overall, freeing the web node to do other, more important things:

    Scheduling by Job

    Note that while the per-job scheduling time has increased in duration and variance, as shown in the heat map, this is not really a fair comparison: the data for the old algorithm doesn't include all the time spent loading up the dataset, which is done once per pipeline every 10 seconds.

    The fact that the new algorithm is able to schedule some jobs in the same amount of time that the older, CPU-bound, in-memory, brute-force algorithm took is actually pretty nice considering it now involves going over the network to query the database.

    Note: this new approach means that we need to carefully keep tabs on state changes so that we know when jobs need to be scheduled. If you have a job that doesn't seem to be queueing new builds when you think it should, try out the new fly schedule-job command; it's been added just in case we missed a spot. This command will mark the job as 'needs to be scheduled' and the scheduler will pick it up on the next tick.

  • Migrating existing data: you may be wondering how the upgrade's data migration works with such a huge change to how the scheduler uses data.

    The answer is: very carefully.

    If we were to do an in-place migration of all of the data to the new format used by the algorithm, the upgrade would take forever. To give you an idea of how long, even just adding a column to the builds table in our environment took about 16 minutes. Now imagine that multiplied by all of the inputs and outputs for each build.

    So instead of doing it all at once in a migration on startup, the algorithm will lazily migrate data for builds as it needs to. Overall, this should result in very little work to do as most jobs will have a satisfiable set of inputs without having to go too far back in the history of upstream jobs.

  • Bonus feature: along with the new algorithm, we wanted to make it easier to troubleshoot why a build is stuck in a "pending" state. So, if the algorithm fails to find a satisfactory set of inputs, the reason will now be shown for each input in the build preparation.

  • Bonus fix: the new algorithm fixes an edge case described in #3832. In this case, multiple resources with corresponding versions (e.g. a v1.2.3 semver resource and then a binary in S3 corresponding to that version) are correlated by virtue of being passed along the pipeline together.

    When one of the correlated versions was disabled, the old algorithm would incorrectly continue to use the other versions, matching it with an incorrect version for the resource whose version was disabled. Bad news bears!

    Because the new algorithm always works by selecting entire sets of versions at a time, they will always be correlated, and this problem goes away. Good news...uh, goats!
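
    As a rough sketch of that setup (the resource names and sources are hypothetical), the two resources are correlated by passing through the same job:

    ```yaml
    resources:
    - name: version          # e.g. a semver resource producing v1.2.3
      type: semver
      source:
        driver: git
        uri: https://example.com/version.git
        branch: main
        file: version
    - name: binary           # e.g. the artifact built for that version
      type: s3
      source:
        bucket: example-artifacts
        regexp: app-(.*).tgz

    jobs:
    - name: build
      plan:
      - get: version
        trigger: true
      - get: binary
        trigger: true

    - name: ship
      plan:
      # Both versions are selected together as a set from the same successful
      # build of 'build', so disabling one version disqualifies the whole set
      # rather than mismatching the remaining version.
      - get: version
        passed: [build]
      - get: binary
        passed: [build]
    ```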

🔗 feature, breaking

  • LIDAR is now on by default! In fact, not only is it on by default, it is now the only option. The old and busted 'Radar' resource checking component has been removed and the --enable-lidar flag will no longer be recognized. #3704

    With the switch to LIDAR, the metrics pertaining to resource checking have also changed (via #5171). Please consult the now-updated Metrics documentation and update your dashboards accordingly!

🔗 breaking

  • We have removed support for emitting metrics to Riemann.

    In the early days, Riemann gave us hope of only having to support a single metrics sink. That world didn't really pan out.

    We're now trying to standardize on OpenTelemetry, and pull-style metrics (i.e. Prometheus) rather than push, which means we'll be slowly transitioning away from our current support of many-different-metrics-sinks as it is a bit of a maintenance nightmare.

🔗 fix, security

  • Fix an edge case of CVE-2018-15798 where the redirect URI during the login flow could be embedded with a malicious host.

🔗 feature

  • #413. We finally did it.

    rerun-button

    Build re-running has been implemented.

    Build re-running was one of the longest-running and most popular feature requests. We never got around to it because most of the motivation for it came from PR flows, which was (and still is) a pretty broken usage pattern which we were trying to address directly rather than have it act as a primary motivator for design decisions like this one.

    Ultimately, the validity of re-triggering builds was never in question - but we try to look a level deeper and get to the bottom of things, sometimes to a fault. Ideas surrounding PR flows dragged on for a while and went through a few redesigns (context in the v10 blog post), so we're sorry it took so long to finally do this - but it's here now!

    Why was this feature implemented in this release after so much time, you ask? Well, the new scheduling and pinning semantics kind of made it necessary to have proper support for re-running builds. Without re-running, folks were relying on version pinning in order to re-run a job with an older version. But with the new scheduling semantics, that pinned version would propagate to jobs downstream as if it's the latest version for the pipeline to converge on, which might not be what you want. (But it probably is what you want if you're using pinning how it's originally meant to be used.)

🔗 feature

  • Following a multi-pronged attack through various optimizations, the dashboard has become more responsive:

    • With #4862, we optimized the ListAllJobs endpoint so that it no longer requires decrypting and parsing the configuration for every single job. This dramatically reduces resource utilization for deployments with a ton of pipelines.

    • With #5262, we now cache the last-fetched data in local browser storage so that navigating to the dashboard renders at least some useful data rather than blocking on all the data being fetched fresh from the backend.

    • With #5118, we implemented infinite scrolling and lazy rendering, which should greatly improve performance on installations with a ton of pipelines configured. The initial page load can still be quite laggy, but interacting with the page afterwards now performs a lot better. We'll keep chipping away at this problem and may have larger changes in store for the future.

    • With #5023, the dashboard will no longer "pile on" requests to a slow backend. Previously if the web node was under too much load, it could take longer to respond to the ListAllJobs endpoint than the default polling interval, and the dashboard could start another request before the last one finished. It will now wait for the previous request to complete before making another.

    Overall, while we're still not completely happy with the dashboard performance on gigantic installations, these changes should make the experience feel a bit better.

🔗 feature

  • @evanchaoli introduced another new step type in #4973: the load_var step! This step can be used to load a value from a file at runtime and set it in a "local var source" so that later steps in the build may pass the value to fields like params.

    With this primitive, resource type authors will no longer have to implement two ways to parameterize themselves (i.e. tag and tag_file). Resource types can now implement simpler interfaces which expect values to be set directly, and Concourse can handle the busywork of reading the value from a file.
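
    As a rough sketch (the resource and var names are hypothetical), a build plan using the new step might look like this:

    ```yaml
    plan:
    - get: release-repo
    - load_var: version
      file: release-repo/version   # the file's contents become the value of the var
    - put: release
      params:
        # local vars are referenced with the ".:" prefix
        tag: ((.:version))
    ```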

    This feature, like the set_pipeline step, is considered experimental until its corresponding RFC (RFC #27) is resolved. The step will helpfully remind you of this fact by printing a warning on every single use.

🔗 feature

  • In #4614, @julia-pu implemented a way for put steps to automatically determine the artifacts they need, by configuring inputs: detect. With detect, the step will walk over its params and look for paths that correspond to artifact names in the build plan (e.g. tag: foo/bar or just repository: foo). When it comes time to run, only those named artifacts will be given to the step, which can avoid wasting a lot of time transferring artifacts the step doesn't even need.
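
    For example (the resource and artifact names here are hypothetical), a put step using detection might look like this:

    ```yaml
    - put: image
      inputs: detect            # only stream in the artifacts referenced in params
      params:
        build: source           # the 'source' artifact is detected from this path
        tag_file: version/tag   # the 'version' artifact is detected from this one
    ```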

    This feature may become the default in the future if it turns out to be useful and safe enough in practice. Try it out!

🔗 fix

  • In #5149, @evanchaoli implemented an optimization which should lower the resource checking load on some instances: instead of checking all resources, only resources which are actually used as inputs will be checked.

🔗 fix

  • We fixed a bug where users who upgraded from Concourse v5.6.0 to v5.8.0 with lidar enabled might find that a resource never checks because it fails to create a check step. #5014

🔗 fix

  • Builds could get stuck in a pending state for jobs that are set to run serially. If a build was scheduled but not yet started and the ATC restarted, then the next time the build was picked up it would get stuck in a pending state: the ATC would see that the job is set to run in serial and that there is already a build being scheduled, so it would not continue to start that scheduled build. This is now fixed, and builds will no longer get stuck in the scheduled state because of a job's serial configuration. #4065

🔗 fix

  • If you had lidar enabled, some duplicate work could be done when creating checks for custom resource types. When multiple resources use the same custom resource type, they would all try to create a check for that custom type. In the end only one check would actually happen, but the work involved in creating it was duplicated. This bug has been fixed so that only one attempt is made to create a check for a custom resource type, even if multiple resources use it. #5158

🔗 fix

  • The length of time to keep around the history of a resource check used to default to 6 hours, but we discovered that this default could cause slowness for large deployments because of the number of checks that were kept around. The default has been changed to 1 minute, and it is left up to the user to configure it higher if they would like to keep the history of checks for longer. #5157

🔗 feature

  • We have started adding a --team flag to Fly commands so that you can run them against different teams that you're authorized to perform actions against, without having to log in to the team with a separate Fly target. (#4406)

    So far, the flag has been added to intercept, trigger-job, pause-job, unpause-job, and jobs. In the future we will likely either continue with this change or start to re-think the overall Fly flow to see if there's a better alternative.

🔗 fix

  • Previously, the build tracker would unconditionally fire off a goroutine for each in-flight build (which then locks and short-circuits if the build is already tracked). We changed it so that the build tracker will only do so if we don't have a goroutine for it already. #5075

🔗 fix

  • We fixed a bug for jobs that have any type of serial group set (serial: true, serial_groups, or max_in_flight). Whenever a build for such a job was scheduled and we checked whether the job had hit max in flight, all of the serial groups would be unnecessarily recreated in the database. #2724

🔗 fix

  • The scheduler now schedules rerun builds separately from regular builds (builds created by the scheduler or triggered manually), so that a rerun build which is failing to schedule (for example, because its input versions cannot be found) will not block the scheduler from scheduling regular builds. #5039

🔗 feature

  • You can now easily enable or disable a resource version from the comfort of your command line using the new fly commands fly enable-resource-version and fly disable-resource-version, thanks to @stigtermichiel! #4876

🔗 fix

  • We fixed a bug where a missing volume with child volumes referencing it would cause garbage collection of all missing volumes to fail. Missing volumes are volumes that exist in the database but not on the worker. #5038

🔗 fix

  • The ResourceTypeCheckingInterval is no longer respected, since radar has been removed in this release and lidar is now the default resource checker. Thanks to @evanchaoli for removing the unused --resource-type-checking-interval flag! #5100

🔗 fix

  • The link for the Helm chart in the Concourse GitHub repo README was fixed thanks to @danielhelfand! #4986

🔗 feature

  • Include job label in build duration metrics exported to Prometheus. #4976

🔗 fix

  • The database will now use a version hash to look up resource caches in order to speed up any queries that reference resource caches. This will help speed up resource cache garbage collection. #5093

🔗 fix

  • If you have lidar enabled, we fixed a bug where pinning an old version of a mock resource would cause it to become the latest version. #5127

🔗 fix

  • Explicitly whitelisted all traffic for concourse containers in order to allow outbound connections for containers on Windows. Thanks to @aemengo! #5159

🔗 feature

  • Add experimental support for exposing traces to Jaeger or Stackdriver.

    With this feature enabled (via --tracing-(jaeger|stackdriver)-* variables in concourse web), the web node starts recording traces that represent the various steps that a build goes through, sending them to the configured trace collector. #5043

    As this feature is being built using OpenTelemetry, expect to have support for other systems soon.

🔗 feature

  • @joshzarrabi added the --all flag to the fly pause-pipeline and fly unpause-pipeline commands. This allows users to pause or unpause every pipeline on a team at the same time. #4092

🔗 fix

  • In the case that a user has multiple roles on a team, the pills on the team headers on the dashboard now accurately reflect the logged-in user's most-privileged role on each team. #5133

🔗 fix

  • Set a default value of 4h for rebalance-interval. Previously, this value was unset. With the new default, the workers will reconnect to a randomly selected TSA (SSH Gateway) every 4h.

🔗 fix

  • With #5015, worker state metrics will be emitted even for states with 0 workers, rather than not emitting the metric at all. This should make it easier to confirm that there are in fact 0 stalled workers as opposed to not having any knowledge of it.

🔗 fix

  • Bump golang.org/x/crypto module from v0.0.0-20191119213627-4f8c1d86b1ba to v0.0.0-20200220183623-bac4c82f6975 to address vulnerability in ssh package.

🔗 feature

  • Improve the initial page load time by lazy-loading JavaScript that isn't necessary for the first render. #5148

🔗 feature

  • @aledeganopix4d added a last updated column to the output of fly pipelines, showing the date each pipeline was last set or reset. #5113

🔗 fix

  • Ensure the build page doesn't get reloaded when you highlight a log line, and fix auto-scrolling to a highlighted log line. #5275

🔗 fix

  • With #4168, fly sync no longer requires a target to be registered beforehand; instead, a --concourse-url (or -c) flag may be specified.

    This should make it a bit easier to keep your CLI in sync if and when we change the login process again.

🔗 fix

  • fly validate-pipeline will no longer blow up when given a pipeline config which uses var_sources.
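
    For reference, here is a minimal (hypothetical) config using var_sources that should now validate cleanly:

    ```yaml
    var_sources:
    - name: secrets
      type: vault
      config:
        url: https://vault.example.com
        path_prefix: /concourse

    jobs:
    - name: build
      plan:
      - task: unit
        config:
          platform: linux
          image_resource:
            type: registry-image
            source: {repository: busybox}
          params:
            API_TOKEN: ((secrets:api-token))
          run:
            path: echo
            args: ["hello"]
    ```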

🔗 fix

  • We've tweaked the UI on the resource page; when a version is pinned, rather than cramming the pinned version into the header, the "pin bar" for the version will now replace the "checking successfully" bar, since pinning ultimately prevents a resource from checking.