Equivalent resources defined across pipelines and teams should only correspond to a single version history #2386

Closed
vito opened this Issue Jul 16, 2018 · 27 comments

Comments

@vito
Member

vito commented Jul 16, 2018

What challenge are you facing?

Today, pipeline resource versions are collected into a versioned_resources table. This table predates the "life" epic (#629). It has the following schema:

atc=# \d versioned_resources
                               Table "public.versioned_resources"
   Column    |  Type   | Collation | Nullable |                     Default                     
-------------+---------+-----------+----------+-------------------------------------------------
 id          | integer |           | not null | nextval('versioned_resources_id_seq'::regclass)
 version     | text    |           | not null | 
 metadata    | text    |           | not null | 
 type        | text    |           | not null | 
 enabled     | boolean |           | not null | true
 resource_id | integer |           |          | 
 check_order | integer |           | not null | 0

It points to resource_id, making version history per-pipeline-resource. This means that multiple pipelines with the same resource config will redundantly collect the same version/metadata information.

A Modest Proposal

To be honest, this isn't a huge deal right now, aside from wasted database space and redundant checking. However, if we make the relationship between a pipeline's resources and the abstract version history a bit tighter, there are actually a few benefits:

  • We can reduce the amount of checking required across pipelines for equivalent resources.
  • We can reduce the amount of data recorded for equivalent resources.
  • When a user changes their pipeline resource's configuration, the history will be "re-set" (ref. #145) to the new config, and should always be correct.
  • There may be some as-yet-unknown improvements we can make to the database model by having a cleaner representation.
  • As part of concourse/rfcs#1, we're going to start collecting all versions, not just starting from pipeline configuration time. There'll be a lot more data to record, so sharing it between duplicated resources will make things a lot more efficient.

Implementation Notes

Enabling/Disabling versions

Enabling/disabling versions should remain scoped to pipeline resources, obviously. This can be done via a join table (pipeline_resource_config_versions or some such).
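A minimal sketch of what that join table could look like, assuming the shared history lives in a resource_config_versions table (all names and columns here are illustrative, not a final schema):

CREATE TABLE pipeline_resource_config_versions (
  resource_id                integer NOT NULL REFERENCES resources (id) ON DELETE CASCADE,
  resource_config_version_id integer NOT NULL REFERENCES resource_config_versions (id) ON DELETE CASCADE,
  enabled                    boolean NOT NULL DEFAULT true,
  PRIMARY KEY (resource_id, resource_config_version_id)
);

The enabled flag stays scoped to the pipeline resource, while the version itself is stored once in the shared table.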

Distinct check intervals

If we only check once per resource config, there's a little gotcha: different pipelines can have varying check_every settings.

Here's one idea: record last_checked on the resource config, and have each pipeline's radar component run a check only when the time since last_checked is >= its interval. So, we'll check at the fastest defined frequency. Pipelines with longer check_intervals will have versions show up more quickly than expected, but that really shouldn't matter.
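A rough sketch of that decision in SQL (assuming last_checked lives on resource_configs as described; the exact query shape is illustrative): each radar attempts a conditional update and only runs the check if a row comes back.

UPDATE resource_configs
   SET last_checked = now()
 WHERE id = $1                          -- the shared resource config
   AND now() - last_checked >= $2       -- $2: this pipeline's check_every, as an interval
RETURNING id;

Since the update is conditional, the shortest configured interval ends up driving the actual check frequency, and concurrent radars won't both check the same config at once.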

Pausing pipeline resources

Currently, users pause pipeline resources with the intended effect that no new versions are collected and used for later builds. This becomes really awkward when other pipelines cause the same config to be checked anyway.

We could still support today's behavior by "faking it" and having pausing a resource really just 'pin' it to whatever the version was at the time. But actually, that sounds a lot like #1288. Maybe we should just implement that instead, and remove the resource pausing functionality?

@vito vito added this to Icebox in Core via automation Jul 16, 2018

@vito vito moved this from Icebox to Backlog in Core Jul 16, 2018

@krishicks

Contributor

krishicks commented Jul 17, 2018

I really like this. Reducing the redundant check containers for a single resource would make me feel like a better netizen.

Pipelines with longer check_intervals will have versions show up more quickly than expected, but that really shouldn't matter.

Is the idea here that even though new versions would show up, the check_interval would be respected for triggering new builds? I can see someone using different check_intervals across pipelines to create varying pipeline behaviours, and that would be silently lost here.

One thing I was thinking of recently, and which I discussed with @xoebus in the past, is separating resource config from pipeline config. If multiple pipelines share the same resource config, why do I have to copy the configuration? This can become a chore when managing the same resource across many pipelines.

It would be great if I could just configure resources kind of like a cloud_config and then just refer to them in jobs in pipelines by name. Then a check_every would only be specified once for a particular configuration, and there'd be no surprise about which is the canonical resource definition.

@vito vito referenced this issue Jul 18, 2018: RFC: Resources v2 #1 (Open)

@vito

Member Author

vito commented Jul 18, 2018

Is the idea here that even though new versions would show up, the check_interval would be respected for triggering new builds? I can see someone using different check_intervals across pipelines to create varying pipeline behaviours, and that would be silently lost here.

I really hope people aren't relying on that. check_every isn't meant to have any impact on builds triggering - the cause and effect there are decoupled. It's only really meant for setting a less frequent polling interval if the server it's pointed to can't take the load.

If multiple pipelines share the same resource config, why do I have to copy the configuration? This can become a chore when managing the same resource across many pipelines.

Interesting but sounds like a separate issue. I don't think we're ready to tackle that as part of this - it kind of goes against the current approach which is to have pipelines be entirely self-contained. (It's a trade-off.)

@vito

Member Author

vito commented Jul 23, 2018

Hmm, one scary thing about this approach is that check containers can be hijacked. If we're not careful, that could allow a malicious user to hijack "their" check container and influence it in such a way that version detection is affected, thereby affecting all teams.

Similarly, who should own the check container? It'd be nice if we could just have one check container per resource config, but then which team does it belong to? That would really limit who would be able to hijack it and debug it, if it's failing.

Here's one approach: we could make the containers global, and then take a container out of commission once it's hijacked. This would still allow anyone to debug a failing check while keeping them from influencing its behavior. We should be careful not to allow hijacking of a container with an in-flight check call, too. Locking may be all we need there.
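One possible shape for that locking, assuming Postgres advisory locks keyed by resource config (the lock class and parameter here are made up for illustration):

-- take a per-resource-config lock before running a check (or before allowing a hijack);
-- if it can't be acquired, a check is already in flight, so back off
SELECT pg_try_advisory_lock(42, $1);   -- 42: hypothetical lock class, $1: resource_config_id
-- ... run the check / allow the hijack only if this returned true ...
SELECT pg_advisory_unlock(42, $1);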

@vito

Member Author

vito commented Jul 23, 2018

Oof. This is actually really difficult with credential rotation (especially with automated rotation via credential managers).

If someone rotates their password to Docker Hub or their private key for their Git repo, that shouldn't reset the entire history of the resource. But with versions tied directly to a resource config, it'll be a different resource config any time the credentials change.

At the same time, we need the resource config to be fully resolved, because there should be some sort of authentication check before we share all the versions and especially the cached bits (which are also identified in part by resource config).

Should we separate credentials from config? What would that look like? Would it require a change to the resource interface? I'm thinking a simple 'check access' call - maybe this could also be used for validation prior to use?

Back to the drawing board...

@vito

Member Author

vito commented Jul 23, 2018

Yet another concern: time resources configured with intervals like 1m, 5m, etc. will all find new versions at the same time when they're all tied to the same history. That would basically mean stampedes of periodic builds across the workers. (The same is technically true for any resource that many pipelines point to, but time resource usage would be especially severe by nature.)

Maybe sharing the version history globally just isn't worth it. :/ The initial concern was regarding database storage now that checks will be returning the entire history, but to be honest the disk usage will probably be nothing compared to the build events table.

I still think there are good things that can come out of this, but perhaps they can be prioritized differently:

  • Splitting out auth config from source config will be hugely beneficial to anyone with frequent automated credential rotation. It would enable cache reuse throughout, and this could also determine how we implement config validation in resources (something we initially wanted in concourse/rfcs#1 but didn't find the motivation to include as the scope was already large).
  • Tying pipeline resource history to its resource config makes sense for achieving #145. (Purging wouldn't be necessary, as the history would always be 'correct'.) This would also benefit from splitting auth config out, so that the version history doesn't reset every time the credentials change.
@krishicks

Contributor

krishicks commented Jul 23, 2018

What about having a single "real" check container that does the checking on the minimum configured interval. Then there are "fake" check containers that have their actual check_interval configured, and which are the containers you can hijack. They ask the real check container (which itself speaks "check") for new versions. Think of the "real" one as being Concourse's internal bookkeeping.

@vito

Member Author

vito commented Jul 23, 2018

@krishicks Yeah, that'd help make hijacking safer. (Though I'm not sure how the 'fake' container would delegate to the 'real' one).

We could also do something like provision a fresh 'check' container when they try to hijack it. We could even run a one-off check call or write the JSON payload that we would be calling check with to a temporary file, so the user can manually run the check themselves. At that point it'd be less of "hijack" and more of "provision me a debugging container".

@vito

Member Author

vito commented Jul 23, 2018

To balance out some of the pessimism in my recent comments, there's one big benefit to consolidating version history: this could dramatically reduce the load on external services caused by constant checks. It would be amazing if we could reduce things down to one check interval at a time per Concourse installation for e.g. a git repo or custom resource type. That'd make GitHub Enterprise and private Docker registry operators much happier.

@vito

Member Author

vito commented Jul 24, 2018

One way to mitigate the concerns with the time resource could be to just have "shared history" be an optional feature, as determined by the resource type (not the user).

Looking at the list of supported resource types, we'd benefit from shared history for the following:

  • git
  • hg
  • s3
  • docker-image
  • github-release
  • bosh-io-stemcell
  • bosh-io-release
  • semver (less so)
  • pool (less so)

The following resource types however wouldn't really benefit much from it at all, for different reasons:

  • time
  • cf
  • tracker (which is still there apparently)
  • bosh-deployment

So, maybe it could be an opt-in behavior via the info call we'll be adding in concourse/rfcs#1, and default to false?

All other behavior would be the same (i.e. however we handle hijack); they just wouldn't share their history.

@vito

Member Author

vito commented Jul 27, 2018

Having checking performed globally across teams would likely make it easier to fix cases like #2010 - right now we have to create the RCCS, which is global, and then the WRCCS, which is per-team, but they're in different transactions, so there's a chance the RCCS goes away while we're trying to create the WRCCS. (The fairly long grace period before the RCCS actually goes away is supposed to make this unlikely, but there must be something else going on.)

@snegivulcan


snegivulcan commented Aug 1, 2018

We are also very much interested in reducing the number of times a resource check (docker_image in our case) is performed. We are running out of GCR (Google Container Registry) version-check API quota, which is 3 per second per IP. We have a Concourse cluster with 3 worker nodes and a total of 24 pipelines. Each pipeline polls about 6 docker_image resources on average, and we have about two dozen docker_image resources in total.

@mhuangpivotal mhuangpivotal moved this from Backlog to In Flight in Core Aug 2, 2018

@clarafu

Contributor

clarafu commented Aug 13, 2018

Brief notes on some initial changes we need to make during the spike.

Migrations

  • new resource-config-versions (resource-config-id, version, metadata)
  • new join table pipeline-disabled-resource-config-versions (pipeline-id, resource-config-version-id) (not true anymore)
  • edit resource-configs (add last-checked)

(later)

  • edit resources (remove last-checked, paused later)
  • edit resource-types (remove last-checked, version, type?)

Radar

  • check for last checked on resource configs (instead of resources)
  • if the time since the last check is greater than the check_every interval, then we check

Note: two resources that reference the same resource_config might have different check_every intervals - we need to pick either the shorter or the longer one.
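A rough sketch of those migrations (names and types are illustrative, mirroring the existing versioned_resources columns, not a final schema):

CREATE TABLE resource_config_versions (
  id                 serial  PRIMARY KEY,
  resource_config_id integer NOT NULL REFERENCES resource_configs (id) ON DELETE CASCADE,
  version            text    NOT NULL,
  metadata           text    NOT NULL,
  check_order        integer NOT NULL DEFAULT 0,
  UNIQUE (resource_config_id, version)
);

ALTER TABLE resource_configs ADD COLUMN last_checked timestamp with time zone NOT NULL DEFAULT 'epoch';

For the note about differing check_every intervals, the earlier idea still applies: if each radar only checks when now() - last_checked exceeds its own interval, the shortest configured interval effectively wins.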

@clarafu

Contributor

clarafu commented Aug 13, 2018

Side refactor:

During the spike, we realized that the "UsedResourceConfig" object needs to have a connection and a lockFactory on it so that we can do locking around the resource config. This made us revisit the whole structure around creating resource configs and the purpose behind these used resource configs, and we have decided to do a refactor to make cleaner abstractions between the old "ResourceConfig" and "UsedResourceConfig" objects. "ResourceConfig" objects are used to construct a resource config in the database, which means a "ResourceConfig" isn't really a resource config itself, but rather the fields we need in order to construct one; it will be renamed to "ResourceConfigDescriptor". "UsedResourceConfig" represents a "use" of a resource config in the database, but that only really made sense when we still had the resource config uses table (which has since been removed); it will be renamed to "ResourceConfig".

clarafu added a commit to concourse/atc that referenced this issue Aug 14, 2018

use interfaces for resource config/cache objects
concourse/concourse#2386

UsedResourceConfig is renamed to ResourceConfig and
ResourceConfig/ResourceCache is renamed to
ResourceConfigDescriptor/ResourceCacheDescriptor. For more info read the
comment on the story.

clarafu added a commit that referenced this issue Aug 14, 2018

bump atc
#2386

Submodule src/github.com/concourse/atc 5a69d264c..c85f590a3:
  > use interfaces for resource config/cache objects
@vito

Member Author

vito commented Oct 3, 2018

@marco-m Hmm I don't think this will actually help that case at all. Credential resolution is done prior to even determining whether to run the check or not, since we have to resolve the credentials to determine the resource config to check in the first place.

@marco-m


marco-m commented Oct 4, 2018

@vito so I was too optimistic! ;-) Time to go re-understand all this.

@vito vito modified the milestones: Spatial Resources, Resources v2, Resource Pinning Jan 7, 2019

@vito vito added this to Planned in Spatial Resources via automation Jan 8, 2019

@vito vito added this to Planned in Resources v2 via automation Jan 8, 2019

@vito vito moved this from Planned to Done in Spatial Resources Jan 8, 2019

@vito vito moved this from Planned to Done in Resources v2 Jan 8, 2019

@vito vito added this to the v5.0.0 milestone Jan 10, 2019
