Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Configurable timeout for resource checks #2352

Closed
topherbullock opened this issue Jul 6, 2018 · 5 comments
Closed

Configurable timeout for resource checks #2352

topherbullock opened this issue Jul 6, 2018 · 5 comments

Comments

@topherbullock
Copy link
Member

@topherbullock topherbullock commented Jul 6, 2018

The ATC should have a configurable global-resource-check-timeout which allows operators to enforce a global timeout for all resource checks. This would be propagated to the radar package's resource_scanner and resource_type_scanner as a context.WithTimeout that will terminate the scanning script in the container when the timeout is reached.

This should also be configurable at a per-resource level, and override the deployment level default
eg.

resources:
- name: booklit
  type: git
  check_timeout: 1min
  source:
    uri: https://github.com/vito/booklit
    branch: master

Gotchas / Wild thoughts:
OR the deployment level config could be for a default for now, and we can add a maximum a-la build log retention

Do timeouts have some effect on the check interval? Maybe a consistently timing out resource could
back off, or indicate its sad state where we currently show "checked 23s ago"

Should custom resource type checking timeouts include the time spent checking for the custom type? My thought is it shouldn't, but the contract to the user is important here;
as a pipeline developer, I don't want to be waiting >1minute for my check if I set check_timeout: 1min. There should be some reasonable propagation of the timeouts to the dependent resource type checks.

We should look at how *deep breath* worker_resource_config_check_sessions impact how the resource check containers get balanced across the cluster; currently there's a (hardcoded) configuration for resource check containers to expire and in that period (max 5 min) there will be one overlapping container for the same check. This could change from a constant to a per-resource or cluster defaulted setting, but it would impact GC's marking of containers to delete as well.

@topherbullock topherbullock added this to Icebox in Runtime via automation Jul 6, 2018
@topherbullock topherbullock moved this from Icebox to Backlog in Runtime Aug 8, 2018
@xtremerui xtremerui moved this from Backlog to In Flight in Runtime Aug 10, 2018
@xtremerui

This comment has been minimized.

Copy link
Contributor

@xtremerui xtremerui commented Aug 10, 2018

WIP Changes are pushed to branch WIP-resource-check-timeouts for both atc and testflight

@pivotal-jamie-klassen

This comment has been minimized.

Copy link
Contributor

@pivotal-jamie-klassen pivotal-jamie-klassen commented Aug 14, 2018

Resource check timeouts and resource type check timeouts should be separate stories, as they have pretty different implementation requirements:

  1. Testing - we already have fly check-resource but not fly check-resource-type. This means that our only option for testing the timeout on a resource type check involves setting a pipeline with a custom resource where check_every is set to some very small interval and waiting "an appropriate amount of time" before asserting on the results of this check -- this has the distinct stench of a flakey/needlessly-long-running test.
  2. Feedback - my concern here developed from thinking about testing but is still relevant. How do we surface to users that their resource type check timed out? Now that I think about it, I'm not sure how to discover resource type check errors in general.
  3. Configuration -- just as users have found separate check_every/*_CHECKING_INTERVAL for resources and resource types (per #2251), I'd have to ask whether they would find it useful to have separate check timeout configuration for these two scenario as well, i.e. global-resource-check-timeout and global-resource-type-check-timeout. This is more of a 'wild thought'.

@vito @topherbullock thoughts? For the time being we will proceed by narrowing the scope of this story and writing another one for the resource type check timeout case.

@pivotal-jamie-klassen pivotal-jamie-klassen changed the title Configurable timeout for resource and resource type checks Configurable timeout for resource checks Aug 14, 2018
@vito

This comment has been minimized.

Copy link
Member

@vito vito commented Aug 14, 2018

@pivotal-jamie-klassen Good points! To be honest we should really think about surfacing resource types in the web UI soon, and I think users will really like being able to do all the things they can with resources with resource types. Resource type developers for example would love to be able to force a check, and it's obviously helpful to see check errors there too.

I think while we're working on #2386 we should keep this in mind, and we should support doing the same things you can with resources as with resource types. Maybe with a generic "resource config" API/UI? Just spitballing, but this could also be useful for things like concourse/rfcs#7 which introduces yet another toplevel "resourcey" concept.

@pivotal-jamie-klassen

This comment has been minimized.

Copy link
Contributor

@pivotal-jamie-klassen pivotal-jamie-klassen commented Aug 14, 2018

We decided to create a default value of 1h for global-resource-check-timeout.

Why a default?

Sometimes resource checks run for a long time, for one of two reasons:

  1. On purpose - I have a custom resource that does a lot of valuable stuff that takes a long time
  2. By accident - something in my resource ecosystem is broken and things have huge lag or hang

While we might shock the people in column 1, we can suggest they increase their timeout and return to happiness. The greater good is served because we might be able to save the people in column 2 from their ignorance.

Why 1h?

Long-running resource checks seem pretty normal. Longer than 1 hour is unreasonable, because that is the max expiry period for a resource check container -- if the container is considered stale, the check should certainly be considered timed out.

pivotal-saman-alvi added a commit that referenced this issue Aug 15, 2018
#2352

Submodule src/github.com/concourse/atc 8d030443a..e904dc7d5:
  > Configurable timeout for resource checks
Submodule src/github.com/concourse/testflight 44b41dc37..a47221a84:
  > Configurable timeout for resource checks
  > fixtures and test for resource check timeouts

Signed-off-by: Saman Alvi <salvi@pivotal.io>
@pivotal-jamie-klassen pivotal-jamie-klassen moved this from In Flight to Done in Runtime Aug 15, 2018
@jama22 jama22 added the size/medium label Aug 27, 2018
@vito vito added this to the v4.1.0 milestone Sep 4, 2018
@vito

This comment has been minimized.

Copy link
Member

@vito vito commented Sep 4, 2018

@topherbullock While working on #2386, @clarafu and I found that supporting pipeline-configurable check timeouts is really awkward once resource histories and checking are global. We would either have to enforce the shortest configured timeout for everyone (which seems exploitable), or only enforce it if/when the pipeline that acquired the lock has a timeout configured (which seems ineffective).

I wonder if we can get away with having only the global setting? A global default makes a lot of sense for the operator to defend against stuck resources/resource types, but I can't think of any scenarios where I, as a pipeline author, would proactively enforce timeouts on my resources. I guess you might want to configure it after you see something going wrong, but it's not so easy to notice that in the first place...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Runtime
Accepted
5 participants
You can’t perform that action at this time.