Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spike : Can we determine when a build step fails because the worker is unusable #2581

topherbullock opened this Issue Sep 12, 2018 · 6 comments


6 participants
Copy link

topherbullock commented Sep 12, 2018

For #755, we'll need to know whether a build step has failed because of user error (a misconfigured step, resource config, or task config) or because the worker is unusable. We'll need to be able to differentiate between the "class" of error which occurred, and retry the step if the error falls into the unusable worker bucket.

Some examples of user error which would cause a step to fail (and we don't want to retry it):

  • misconfigured run.path: ( on a task config. In this case the container will actually be created in garden, and it will fail afterwards
  • pointing to a non-existent Docker Hub image using image_resource
  • trying to map non-existent inputs
  • bad resource params causing check / get / put failures

If the worker is unusable the container will likely fail to create, requests will timeout, volumes will fail to create, or some other catastrophic error; can we differentiate these errors when they're bubbled up to the exec/*_step.go?

The outcome of this spike should be a heuristic for determining when a build step can be retried, or at least a better understanding of scenarios in which we can safely retry a step.

@topherbullock topherbullock added this to Icebox in Runtime via automation Sep 12, 2018

@topherbullock topherbullock moved this from Icebox to In Flight in Runtime Sep 12, 2018


This comment has been minimized.

Copy link

cirocosta commented Sep 12, 2018


Mark and I were looking at how possible errors from Garden can propagate throug ATC.

To do that, we started modifying some code and see how it'd react to some failures arising from Garden.

To make Garden fail, we made use the Darwin worker (compiled in our darwin-based machine) - given that it uses houdini under the hood, to inject failures we modified houdini:

diff --git a/backend.go b/backend.go
index 99248e1..49ef48c 100644
--- a/backend.go
+++ b/backend.go
@@ -1,6 +1,7 @@
 package houdini

 import (
+       "errors"
@@ -56,6 +57,8 @@ func (backend *Backend) Capacity() (garden.Capacity, error) {

 func (backend *Backend) Create(spec garden.ContainerSpec) (garden.Container, error) {
+       return nil, errors.New("a total failure!!")
        id := backend.generateContainerID()

        if spec.Handle == "" {

As a testbed, we tailored the following pipeline (that makes use of darwin as platform):

- name: manual-trigger
  - task: pass
      platform: darwin

        path: 'echo'
        - "hello!!!"

To see the error appearing in ATC, we modified two files: exec/task_step.go and engine/task_step.go:

diff --git a/engine/exec_engine.go b/engine/exec_engine.go
index b4ef2b4..86f58ff 100644
--- a/engine/exec_engine.go
+++ b/engine/exec_engine.go
@@ -3,7 +3,9 @@ package engine
 import (
+       "fmt"
+       "log"
@@ -154,7 +156,18 @@ func (build *execBuild) Resume(logger lager.Logger) {

        done := make(chan error, 1)
        go func() {
-               done <- step.Run(runCtx, state)
+               err := step.Run(runCtx, state)
+               switch err.(type) {
+               case exec.ErrFindOrCreateFailure:
+                       log.Println("++++++++++++++++ would retryyyyy!!!")
+               default:
+                       log.Println("yyyyyyyyyyyy whatever")
+               }
+               fmt.Println("============== - exec_engine")
+               fmt.Printf("ERROR = %v\n", err)
+               fmt.Println("============== - exec_engine:end")
+               done <- err

        for {
diff --git a/exec/task_step.go b/exec/task_step.go
index 2b2bbe5..c5fad26 100644
--- a/exec/task_step.go
+++ b/exec/task_step.go
@@ -9,6 +9,7 @@ import (
+       "runtime/debug"

@@ -136,6 +137,14 @@ func NewTaskStep(

+type ErrFindOrCreateFailure struct {
+       err string
+func (e ErrFindOrCreateFailure) Error() string {
+       return e.err
 // Run will first selects the worker based on the TaskConfig's platform, the
 // TaskStep's tags, and prioritized by availability of volumes for the TaskConfig's
 // inputs. Inputs that did not have volumes available on the worker will be streamed
@@ -152,6 +161,10 @@ func NewTaskStep(
 // task's entire working directory is registered as an ArtifactSource under the
 // name of the task.
 func (action *TaskStep) Run(ctx context.Context, state RunState) error {
+       defer func() {
+               fmt.Println("--------------RUN-----------")
+               debug.PrintStack()
+       }()
        logger := lagerctx.FromContext(ctx)

Running everything together, we can see the error landing in ATC.

What's left is going forward with some more exploration around the actual retries and checking what other kinds of errors would be interesting to catch.

Please let me know if you have any thoughts on this!



This comment has been minimized.

Copy link
Member Author

topherbullock commented Sep 13, 2018

@cirocosta nice!

Thinking a bit more about how this will work in practice, I think we might want to pull the worker pool's "Find or Create" logic into something which can select a worker, try to FoC, and retry if it fails with an error which indicates we can re-attempt the step.

Currently this Pool implements the Client interface, which feels a little wonky, given we also have a WorkerProvider interface for uhh.. providing workers from the "pool" (database). This feels like a good time to split two components off :

  • Worker Pool : a repository of workers we can search for candidates for a given step
  • Worker Client : the interface to a single worker

exec steps would then follow a flow of:

    • find a worker in the Pool to run the step (or maybe a set of AllSatisfying?)
    • try and run the step on the chosen worker using its Client
    • If the step fails and can be re-attempted : GOTO 1. (but exclude the worker we failed on)
    • If the step fails and can't be re-attempted: fail

@cirocosta cirocosta moved this from In Flight to Backlog in Runtime Sep 17, 2018

@jama22 jama22 moved this from Backlog to In Flight in Runtime Sep 17, 2018

@jama22 jama22 added the paused label Sep 17, 2018


This comment has been minimized.

Copy link
Member Author

topherbullock commented Oct 1, 2018

Dropping in some more context on my above comment from a meatspace conversation about this issue with @cirocosta and @mhuangpivotal. They looked at the possibility of splitting Pool and Client but its a pretty large change which may not be necessary; we might want to create a decorator which augments the current behaviour of the Pool with retry logic.


This comment has been minimized.

Copy link

ralekseenkov commented Nov 1, 2018

Might be helpful to categorize issues into three categories:

  • user errors - don't want to retry
  • external errors - want to retry on the same worker
  • worker errors - want to retry on a different worker

See #2542 for an example of external error that requires retries on the same worker.


This comment has been minimized.

Copy link

cirocosta commented Jan 15, 2019


Some more context that might be nice for whoever ends doing some work on this issue: to start a worker locally using vito/houdini (in a MacOS machine) like I did over there, here's a reference you can use to do it:

To run that w/out any modifications, you can change vito/houdini locally, add a replace directive in the go.mod file and then build it as you'd normally do - houdini gets embedded to the binary.

Hope it helps!



This comment has been minimized.

Copy link
Member Author

topherbullock commented Jan 29, 2019

Random thought on this; maybe a good feature to drive this out is to normalize the log messages a user sees when their build is running. Right now its largely dependant on which step is trying to run, and where the worker flakes along the way; timeouts, "worker dissapeared", volume not found, etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.