This repository has been archived by the owner on Jun 12, 2023. It is now read-only.

Global setting for per step timeout #170

Open
evilmarty opened this issue Jul 22, 2016 · 12 comments

Comments

@evilmarty

Per-step timeout is supported in build pipelines via timeout_in_seconds and via the interface, but it would be great to set a default timeout_in_minutes either as an agent option or as a build setting. By default the value could be zero to indicate an indefinite timeout.

The reason for this is to avoid agents being stuck on jobs that either run exceptionally long or hang because of bugs. Making sure every step is configured with an automatic timeout is difficult to manage, especially with numerous projects that include pipeline definitions in source control.

@ozbillwang

ozbillwang commented May 3, 2017

@evilmarty

The link you provided for timeout_in_seconds no longer mentions any timeout option.

I can currently define a timeout via #36 through the web interface.

But how do I put this timeout option in pipeline.yml?

Is it the same question raised here?

Updates

Thanks, @evilmarty

I searched again and found the document: https://buildkite.com/docs/pipelines/command-step

I can add it in pipeline.yml now.

timeout_in_minutes: 60

@evilmarty
Author

The docs have been updated and the step declaration examples have been removed. It is in the master branch of your docs, so maybe a regression?

My question is how can I set a global timeout in the absence of one being set in the UI or in a YAML file?

@avtar

avtar commented Feb 12, 2018

I'm curious about this as well. Is there a way to have a global timeout that doesn't involve the web interface?

@avtar

avtar commented Mar 7, 2018

Anyone? Bueller?

@pda
Member

pda commented Jun 27, 2018

I'd very much like to see an agent-level default job timeout so that frozen jobs don't run forever.

This is especially important because the scaling policy for https://github.com/buildkite/elastic-ci-stack-for-aws currently requires zero running jobs before scaling in. So a single frozen job can prevent scale-in and cost lots of money on a large stack.

A configuration option on https://github.com/buildkite/agent would be great. However, I did a bit of exploration in the hope of opening a PR, and it looks like the timeout is driven server-side, so there's no good way to add the option on the agent without some backend changes.

@pda
Member

pda commented Jun 27, 2018

Trying to think how this could work as agent configuration when timeouts are backend-driven.

It would be possible to implement an agent-side timeout. However I don't think there's an existing way for the agent to communicate that it was a timeout; it would look like a general command failure. And the agent timeout could race the server-side step timeout if they're similar. The agent API could be extended to allow agent-driven timeouts, but it would still be racy and inconsistent with per-step timeouts. I don't think this is a good idea.
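
To illustrate the drawback described above, here is a minimal Python sketch of an agent-side timeout. It is not taken from the Buildkite agent (which is written in Go); `run_with_timeout` is a hypothetical helper, and the choice of exit code 1 for a timeout is an assumption made for illustration.

```python
import subprocess
import sys
from typing import List


def run_with_timeout(command: List[str], timeout_seconds: float) -> int:
    """Run a command and return its exit code (hypothetical helper).

    If the command exceeds timeout_seconds, subprocess.run kills the
    child process and raises TimeoutExpired. The caller then only sees
    a generic non-zero exit code (an assumption for this sketch).
    """
    try:
        completed = subprocess.run(command, timeout=timeout_seconds)
        return completed.returncode
    except subprocess.TimeoutExpired:
        # Without a dedicated channel to report "timed out", the result
        # is indistinguishable from an ordinary command failure.
        return 1


if __name__ == "__main__":
    hung = [sys.executable, "-c", "import time; time.sleep(5)"]
    failed = [sys.executable, "-c", "raise SystemExit(1)"]
    print(run_with_timeout(hung, 0.5))
    print(run_with_timeout(failed, 5))
```

Both calls return the same exit code, which is the point: the backend would have no way to tell a timeout from a failing command unless the agent API were extended.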

Instead, when an agent connects to the backend it could advertise the default timeout. Then it can be visible on the agent listing etc. When a job is allocated to an agent, it would use the per-step timeout if present, otherwise the agent default timeout. Enforcing the timeout (per-step or per-agent) remains backend driven. That doesn't seem like such a bad option.
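
The fallback rule proposed above (per-step timeout if present, otherwise the agent default) could be sketched roughly as follows. This is only an illustration of the selection logic, not the backend's actual code; the function name is hypothetical, and treating zero or a missing value as "no timeout" is an assumption borrowed from the original request.

```python
from typing import Optional


def effective_timeout_minutes(
    step_timeout: Optional[int],
    agent_default: Optional[int],
) -> Optional[int]:
    """Pick the timeout the backend would enforce for a job.

    Assumption: None or 0 means "not set". Returns None when neither
    the step nor the agent specifies a timeout (the job runs unbounded).
    """
    if step_timeout:
        # An explicit per-step timeout always wins.
        return step_timeout
    if agent_default:
        # Otherwise fall back to the default advertised by the agent.
        return agent_default
    return None


if __name__ == "__main__":
    print(effective_timeout_minutes(10, 60))     # step value wins
    print(effective_timeout_minutes(None, 60))   # agent default applies
    print(effective_timeout_minutes(None, None)) # no timeout at all
```

Keeping this decision on the backend means there is a single place that enforces the limit, so the race between agent-side and server-side timers does not arise.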

@keithpitt
Member

I think this is an important thing to fix! Will move discussion over to the PR.

@BRMatt

BRMatt commented Jun 12, 2019

Just want to chime in to say this would be really useful - our elastic stack bill went through the roof because we didn't notice a few stuck jobs that prevented our stack from scaling down for ~3w. 😱

huonw added a commit to stellargraph/stellargraph that referenced this issue Feb 21, 2020
If a step in build hangs or takes an unusually long time, previously CI would
let it continue, occupying machines forever. In lieu of a global timeout
(buildkite/feedback#170,
https://forum.buildkite.community/t/pipeline-timeouts/722), we can manually
apply a timeout to every step, as a last resort to catch slow/hung builds. This
uses the `timeout_in_minutes`
(https://buildkite.com/docs/pipelines/command-step#command-step-attributes)
optional attribute:

> The number of minutes a job created from this step is allowed to run. If the job does not finish within this limit, it will be automatically canceled and the build will fail. 

Our steps currently range from ~30 seconds to ~10 minutes, so 30 minutes should
be a safe "something serious is wrong" timeout.

See: #905
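
The workaround in the commit message above (manually applying a timeout to every step) can also be automated. Below is a minimal Python sketch that assumes the pipeline has already been loaded into a dictionary, for example from pipeline.yml with a YAML parser. `apply_default_timeout` is a hypothetical helper, not part of any Buildkite tooling, and identifying command steps by a `command` or `commands` key is a simplification.

```python
import copy
from typing import Any, Dict

# 30 minutes is the "something serious is wrong" value from the commit message.
DEFAULT_TIMEOUT_MINUTES = 30


def apply_default_timeout(
    pipeline: Dict[str, Any],
    default_minutes: int = DEFAULT_TIMEOUT_MINUTES,
) -> Dict[str, Any]:
    """Return a copy of the pipeline where every command step has a timeout.

    Steps that already set timeout_in_minutes keep their own value.
    Non-command steps (such as a bare "wait") are left untouched.
    """
    result = copy.deepcopy(pipeline)
    for step in result.get("steps", []):
        is_command_step = isinstance(step, dict) and (
            "command" in step or "commands" in step
        )
        if is_command_step:
            step.setdefault("timeout_in_minutes", default_minutes)
    return result


if __name__ == "__main__":
    pipeline = {
        "steps": [
            {"label": "test", "command": "make test"},
            "wait",
            {"label": "deploy", "command": "make deploy", "timeout_in_minutes": 5},
        ]
    }
    for step in apply_default_timeout(pipeline)["steps"]:
        print(step)
```

A script like this could run before the pipeline is uploaded, so individual projects do not need to remember to set the attribute on every step.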
@goodspark

Sorry for yet another +1 comment, but this would be really useful.

@heidimhurst

+1, would be very useful

@samsarkleio

+1 would be very useful

@heidimhurst

heidimhurst commented Jul 26, 2022

fwiw this appears to now be available in the UI pipeline settings > builds; see Changelog notes

Suggest closing this issue @evilmarty
