Global setting for per step timeout #170
Comments
The link you provided about timeout_in_seconds no longer mentions any timeout option. I can currently define a timeout via #36 through the web interface, but how do I put this timeout option in pipeline.yml? Is it the same question raised here? Update: Thanks, @evilmarty. I searched again and found the documentation: https://buildkite.com/docs/pipelines/command-step I can add it in pipeline.yml now. |
The docs have been updated and the step declaration examples have been removed. The change is in the master branch of your docs, so maybe it is a regression? My question is: how can I set a global timeout that applies when none is set in the UI or in a YAML file? |
I'm curious about this as well. Is there a way to have a global timeout that doesn't involve the web interface? |
Anyone? Bueller? |
I'd very much like to see an agent-level default job timeout so that frozen jobs don't run forever. This is especially important because the scaling policy for https://github.com/buildkite/elastic-ci-stack-for-aws currently requires zero running jobs before scaling in. So a single frozen job can prevent scale-in and cost lots of money on a large stack. A configuration option on https://github.com/buildkite/agent would be great — however I did a bit of exploration in the hopes of opening a PR but it looks like the timeout is driven server-side so there's no good way to add the option on the agent without some backend changes. |
Trying to think how this could work as agent configuration when timeouts are backend-driven. It would be possible to implement an agent-side timeout. However I don't think there's an existing way for the agent to communicate that it was a timeout; it would look like a general command failure. And the agent timeout could race the server-side step timeout if they're similar. The agent API could be extended to allow agent-driven timeouts, but it would still be racy and inconsistent with per-step timeouts. I don't think this is a good idea. Instead, when an agent connects to the backend it could advertise the default timeout. Then it can be visible on the agent listing etc. When a job is allocated to an agent, it would use the per-step timeout if present, otherwise the agent default timeout. Enforcing the timeout (per-step or per-agent) remains backend driven. That doesn't seem like such a bad option. |
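The fallback rule proposed above (per-step timeout if present, otherwise the agent's advertised default) could be sketched roughly as follows. This is only an illustration: `effective_timeout` and its parameters are hypothetical names, not part of the Buildkite agent or API, and treating a default of zero as "no limit" is an assumption taken from the original request.

```python
from typing import Optional


def effective_timeout(step_timeout: Optional[int],
                      agent_default: Optional[int]) -> Optional[int]:
    """Hypothetical helper: pick the timeout (in minutes) the backend would enforce.

    Returns None when no limit applies.
    """
    # A timeout set on the step always wins.
    if step_timeout is not None:
        return step_timeout
    # Assumption: an agent default of 0 (or no default at all) means "run indefinitely".
    if agent_default:
        return agent_default
    return None


print(effective_timeout(10, 30))    # step value is used
print(effective_timeout(None, 30))  # falls back to the agent default
print(effective_timeout(None, 0))   # no limit
```

Enforcement would still happen on the backend in this design; the agent only advertises its default.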
I think this is an important thing to fix! Will move discussion over to the PR. |
Just want to chime in to say this would be really useful - our elastic stack bill went through the roof because we didn't notice a few stuck jobs that prevented our stack from scaling down for ~3w. 😱 |
If a step in a build hangs or takes an unusually long time, CI previously let it continue, occupying machines forever. In lieu of a global timeout (buildkite/feedback#170, https://forum.buildkite.community/t/pipeline-timeouts/722), we can manually apply a timeout to every step as a last resort to catch slow or hung builds. This uses the optional `timeout_in_minutes` attribute (https://buildkite.com/docs/pipelines/command-step#command-step-attributes):

> The number of minutes a job created from this step is allowed to run. If the job does not finish within this limit, it will be automatically canceled and the build will fail.

Our steps currently range from ~30 seconds to ~10 minutes, so 30 minutes should be a safe "something serious is wrong" timeout. See: #905
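As a rough sketch of the "apply a timeout to every step" workaround, a pipeline that has already been loaded into a Python dict could be post-processed like this before upload. The helper name `apply_default_timeout` and the sample pipeline are hypothetical; only the `timeout_in_minutes` attribute and the 30-minute value come from the comment above.

```python
import copy

DEFAULT_TIMEOUT_MINUTES = 30  # the "something serious is wrong" limit from the comment above


def apply_default_timeout(pipeline, minutes=DEFAULT_TIMEOUT_MINUTES):
    """Hypothetical helper: return a copy of the pipeline in which every
    command step without an explicit timeout gets timeout_in_minutes."""
    result = copy.deepcopy(pipeline)
    for step in result.get("steps", []):
        # Skip non-command entries such as the plain string "wait".
        if isinstance(step, dict) and ("command" in step or "commands" in step):
            step.setdefault("timeout_in_minutes", minutes)
    return result


# Made-up example pipeline for illustration.
pipeline = {
    "steps": [
        {"label": "test", "command": "make test"},
        "wait",
        {"label": "deploy", "command": "make deploy", "timeout_in_minutes": 5},
    ]
}

patched = apply_default_timeout(pipeline)
print(patched["steps"][0]["timeout_in_minutes"])
print(patched["steps"][2]["timeout_in_minutes"])
```

Steps that already declare a timeout keep their own value, so the default only acts as a safety net.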
Sorry for yet another +1 comment, but this would be really useful. |
+1, would be very useful |
+1 would be very useful |
FWIW, this now appears to be available in the UI under pipeline settings > builds; see the Changelog notes. Suggest closing this issue, @evilmarty. |
Per-step timeout is supported in build pipelines via `timeout_in_seconds` and via the interface, but it would be great to set a default `timeout_in_minutes` either as an agent option or as a build setting. By default the value could be zero to indicate an indefinite timeout.

The reason for this is to avoid agents being stuck on jobs that either run exceptionally long or hang because of bugs. Making sure every step is configured with an automatic timeout is difficult to manage, especially with numerous projects that include pipeline definitions in source control.