
Spike: Can we determine when a build step fails because the worker is unusable #2581

Closed
topherbullock opened this issue Sep 12, 2018 · 6 comments

Comments

@topherbullock
Member

For #755, we'll need to know whether a build step has failed because of user error (a misconfigured step, resource config, or task config) or because the worker is unusable. We'll need to be able to determine which "class" of error occurred, and retry the step only if the error falls into the unusable-worker bucket.

Some examples of user error which would cause a step to fail (and we don't want to retry it):

  • a misconfigured run.path (https://concourse-ci.org/tasks.html#task-run) in a task config; in this case the container will actually be created in Garden, and the step will fail afterwards
  • pointing to a non-existent Docker Hub image using image_resource
  • trying to map non-existent inputs
  • bad resource params causing check / get / put failures

If the worker is unusable, the container will likely fail to create, requests will time out, volumes will fail to create, or some other catastrophic error will occur; can we differentiate these errors when they're bubbled up to exec/*_step.go?

The outcome of this spike should be a heuristic for determining when a build step can be retried, or at least a better understanding of scenarios in which we can safely retry a step.
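
To make this concrete, here's a minimal sketch of what such a heuristic could look like, assuming a hypothetical WorkerUnusableError wrapper; figuring out which concrete Garden/worker errors actually belong in that bucket is exactly what the spike is for:

// Hypothetical sketch only: neither WorkerUnusableError nor retryable exist
// in Concourse today; they stand in for whatever the spike decides on.
type WorkerUnusableError struct {
        Err error
}

func (e WorkerUnusableError) Error() string {
        return "worker unusable: " + e.Err.Error()
}

// retryable reports whether a failed step is worth re-attempting, i.e. the
// failure came from the worker rather than from user configuration.
func retryable(err error) bool {
        switch err.(type) {
        case WorkerUnusableError:
                return true
        default:
                // misconfigured run.path, bad image_resource, missing inputs,
                // bad resource params, etc. -- retrying won't help
                return false
        }
}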

@topherbullock topherbullock added this to Icebox in Runtime via automation Sep 12, 2018
@topherbullock topherbullock moved this from Icebox to In Flight in Runtime Sep 12, 2018
@cirocosta
Member

Hey,

Mark and I were looking at how possible errors from Garden propagate through the ATC.

To do that, we started modifying some code to see how it'd react to failures arising from Garden.

To make Garden fail, we used the Darwin worker (compiled github.com/concourse/bin on our darwin-based machine) - given that it uses houdini under the hood, we injected failures by modifying houdini:

diff --git a/backend.go b/backend.go
index 99248e1..49ef48c 100644
--- a/backend.go
+++ b/backend.go
@@ -1,6 +1,7 @@
 package houdini

 import (
+       "errors"
        "path/filepath"
        "strconv"
        "sync"
@@ -56,6 +57,8 @@ func (backend *Backend) Capacity() (garden.Capacity, error) {
 }

 func (backend *Backend) Create(spec garden.ContainerSpec) (garden.Container, error) {
+       return nil, errors.New("a total failure!!")
+
        id := backend.generateContainerID()

        if spec.Handle == "" {

As a testbed, we put together the following pipeline (which uses darwin as its platform):

---
jobs:
- name: manual-trigger
  plan:
  - task: pass
    config:
      platform: darwin

      run:
        path: 'echo'
        args:
        - "hello!!!"

To see the error appearing in ATC, we modified two files, engine/exec_engine.go and exec/task_step.go:

diff --git a/engine/exec_engine.go b/engine/exec_engine.go
index b4ef2b4..86f58ff 100644
--- a/engine/exec_engine.go
+++ b/engine/exec_engine.go
@@ -3,7 +3,9 @@ package engine
 import (
        "context"
        "encoding/json"
+       "fmt"
        "io"
+       "log"
        "strconv"
        "strings"
        "sync"
@@ -154,7 +156,18 @@ func (build *execBuild) Resume(logger lager.Logger) {

        done := make(chan error, 1)
        go func() {
-               done <- step.Run(runCtx, state)
+               err := step.Run(runCtx, state)
+               switch err.(type) {
+               case exec.ErrFindOrCreateFailure:
+                       log.Println("++++++++++++++++ would retryyyyy!!!")
+               default:
+                       log.Println("yyyyyyyyyyyy whatever")
+               }
+
+               fmt.Println("============== - exec_engine")
+               fmt.Printf("ERROR = %v\n", err)
+               fmt.Println("============== - exec_engine:end")
+               done <- err
        }()

        for {
diff --git a/exec/task_step.go b/exec/task_step.go
index 2b2bbe5..c5fad26 100644
--- a/exec/task_step.go
+++ b/exec/task_step.go
@@ -9,6 +9,7 @@ import (
        "io"
        "path"
        "path/filepath"
+       "runtime/debug"
        "strconv"
        "strings"

@@ -136,6 +137,14 @@ func NewTaskStep(
        }
 }

+type ErrFindOrCreateFailure struct {
+       err string
+}
+
+func (e ErrFindOrCreateFailure) Error() string {
+       return e.err
+}
+
 // Run will first selects the worker based on the TaskConfig's platform, the
 // TaskStep's tags, and prioritized by availability of volumes for the TaskConfig's
 // inputs. Inputs that did not have volumes available on the worker will be streamed
@@ -152,6 +161,10 @@ func NewTaskStep(
 // task's entire working directory is registered as an ArtifactSource under the
 // name of the task.
 func (action *TaskStep) Run(ctx context.Context, state RunState) error {
+       defer func() {
+               fmt.Println("--------------RUN-----------")
+               debug.PrintStack()
+       }()
        logger := lagerctx.FromContext(ctx)

Running everything together, we can see the error landing in ATC.

What's left is some more exploration around the actual retries and checking what other kinds of errors would be interesting to catch.

Please let me know if you have any thoughts on this!

Thx!

@topherbullock
Member Author

@cirocosta nice!

Thinking a bit more about how this will work in practice, I think we might want to pull the worker pool's "Find or Create" logic into something which can select a worker, try to FoC, and retry if it fails with an error which indicates we can re-attempt the step.

Currently this Pool implements the Client interface, which feels a little wonky, given we also have a WorkerProvider interface for uhh.. providing workers from the "pool" (database). This feels like a good time to split off two components:

  • Worker Pool: a repository of workers we can search for candidates for a given step
  • Worker Client: the interface to a single worker

exec steps would then follow a flow of (sketched in code after this list):

    1. find a worker in the Pool to run the step (or maybe a set of AllSatisfying?)
    2. try and run the step on the chosen worker using its Client
    3. if the step fails and can be re-attempted: GOTO 1 (but exclude the worker we failed on)
    4. if the step fails and can't be re-attempted: fail
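
As a rough sketch of that loop (the pool and worker interfaces below are invented for illustration and don't match the current worker package; retryable is the hypothetical classifier from the sketch in the issue description):

// Illustrative only: candidateWorker and workerPool are made-up interfaces.
type candidateWorker interface {
        Name() string
        RunStep() error // stand-in for running the step via the worker's Client
}

type workerPool interface {
        // FindWorker picks a satisfying worker, skipping the excluded ones.
        FindWorker(excluded map[string]bool) (candidateWorker, error)
}

func runStepWithRetry(pool workerPool) error {
        failed := map[string]bool{}
        for {
                w, err := pool.FindWorker(failed) // 1. pick a candidate worker
                if err != nil {
                        return err // no satisfying worker left
                }

                err = w.RunStep() // 2. run the step on that worker's Client
                if err == nil {
                        return nil
                }
                if !retryable(err) { // 4. user error: fail the step outright
                        return err
                }
                failed[w.Name()] = true // 3. worker error: exclude it, GOTO 1
        }
}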

@cirocosta cirocosta moved this from In Flight to Backlog in Runtime Sep 17, 2018
@jama22 jama22 moved this from Backlog to In Flight in Runtime Sep 17, 2018
@jama22 jama22 added the paused label Sep 17, 2018
@topherbullock
Member Author

Dropping in some more context on my above comment from a meatspace conversation about this issue with @cirocosta and @mhuangpivotal. They looked at the possibility of splitting Pool and Client, but it's a pretty large change which may not be necessary; we might want to create a decorator which augments the current behaviour of the Pool with retry logic.
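
For illustration, such a decorator could wrap whatever the pool's find-or-create entry point ends up being; the names and signatures below are hypothetical, not the real worker package API:

// Hypothetical sketch: Container and containerCreator are placeholders.
type Container interface{}

type containerCreator interface {
        FindOrCreateContainer() (Container, error)
}

// retryingPool augments an existing pool with retry logic rather than
// splitting Pool and Client apart.
type retryingPool struct {
        inner    containerCreator
        attempts int
}

func (p retryingPool) FindOrCreateContainer() (Container, error) {
        var lastErr error
        for i := 0; i < p.attempts; i++ {
                c, err := p.inner.FindOrCreateContainer()
                if err == nil {
                        return c, nil
                }
                if !retryable(err) { // user error: retrying won't help
                        return nil, err
                }
                lastErr = err // worker-related error: try again
        }
        return nil, lastErr
}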

@romangithub1024
Contributor

It might be helpful to split failures into three categories:

  • user errors - don't want to retry
  • external errors - want to retry on the same worker
  • worker errors - want to retry on a different worker

See #2542 for an example of an external error that requires retries on the same worker.
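
Sketched as code (all three error types below are placeholders invented to show the split, not existing Concourse types):

// Placeholder error types for the three buckets.
type userError struct{ error }     // misconfigured task/resource
type externalError struct{ error } // transient external failure
type workerError struct{ error }   // unusable worker

type retryAction int

const (
        failStep         retryAction = iota // user error: don't retry
        retrySameWorker                     // external error: retry on the same worker
        retryOtherWorker                    // worker error: retry on a different worker
)

func classify(err error) retryAction {
        switch err.(type) {
        case workerError:
                return retryOtherWorker
        case externalError:
                return retrySameWorker
        default:
                return failStep
        }
}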

@cirocosta
Member

Hey,

Some more context that might be nice for whoever ends up doing some work on this issue: to start a worker locally using vito/houdini (on a macOS machine) like I did above, here's a reference you can use to do it: https://github.com/cirocosta/concourse-naive-worker/blob/158d5883658c2cd1e81975c707cef92c9c7873fd/Makefile#L48-L55

To run that without any modifications, you can change vito/houdini locally, add a replace directive to the go.mod file, and then build it as you normally would - houdini gets embedded into the binary.
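
For example, assuming houdini's module path is github.com/vito/houdini and your local checkout sits next to the concourse one, the replace directive would look something like:

// in concourse's go.mod (the relative path is just an example;
// point it at wherever your modified houdini checkout lives)
replace github.com/vito/houdini => ../houdini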

Hope it helps!

thx!

@topherbullock
Member Author

Random thought on this; maybe a good feature to drive this out is to normalize the log messages a user sees when their build is running. Right now it's largely dependent on which step is trying to run, and where the worker flakes along the way: timeouts, "worker disappeared", volume not found, etc.
