Spike : Can we determine when a build step fails because the worker is unusable #2581
Comments
Hey,

Mark and I were looking at how possible errors from Garden can propagate through ATC. To do that, we started modifying some code to see how it'd react to some failures arising from Garden. To make Garden fail, we made use of the Darwin worker (compiled from vito/houdini) with the following change:

```diff
diff --git a/backend.go b/backend.go
index 99248e1..49ef48c 100644
--- a/backend.go
+++ b/backend.go
@@ -1,6 +1,7 @@
package houdini
import (
+ "errors"
"path/filepath"
"strconv"
"sync"
@@ -56,6 +57,8 @@ func (backend *Backend) Capacity() (garden.Capacity, error) {
}
func (backend *Backend) Create(spec garden.ContainerSpec) (garden.Container, error) {
+ return nil, errors.New("a total failure!!")
+
id := backend.generateContainerID()
	if spec.Handle == "" {
```

As a testbed, we tailored the following pipeline (that makes use of the darwin platform):

```yaml
---
jobs:
- name: manual-trigger
plan:
- task: pass
config:
platform: darwin
run:
path: 'echo'
args:
- "hello!!!" To see the error appearing in ATC, we modified two files: diff --git a/engine/exec_engine.go b/engine/exec_engine.go
index b4ef2b4..86f58ff 100644
--- a/engine/exec_engine.go
+++ b/engine/exec_engine.go
@@ -3,7 +3,9 @@ package engine
import (
"context"
"encoding/json"
+ "fmt"
"io"
+ "log"
"strconv"
"strings"
"sync"
@@ -154,7 +156,18 @@ func (build *execBuild) Resume(logger lager.Logger) {
done := make(chan error, 1)
go func() {
- done <- step.Run(runCtx, state)
+ err := step.Run(runCtx, state)
+ switch err.(type) {
+ case exec.ErrFindOrCreateFailure:
+ log.Println("++++++++++++++++ would retryyyyy!!!")
+ default:
+ log.Println("yyyyyyyyyyyy whatever")
+ }
+
+ fmt.Println("============== - exec_engine")
+ fmt.Printf("ERROR = %v\n", err)
+ fmt.Println("============== - exec_engine:end")
+ done <- err
}()
for {
diff --git a/exec/task_step.go b/exec/task_step.go
index 2b2bbe5..c5fad26 100644
--- a/exec/task_step.go
+++ b/exec/task_step.go
@@ -9,6 +9,7 @@ import (
"io"
"path"
"path/filepath"
+ "runtime/debug"
"strconv"
"strings"
@@ -136,6 +137,14 @@ func NewTaskStep(
}
}
+type ErrFindOrCreateFailure struct {
+ err string
+}
+
+func (e ErrFindOrCreateFailure) Error() string {
+ return e.err
+}
+
// Run will first selects the worker based on the TaskConfig's platform, the
// TaskStep's tags, and prioritized by availability of volumes for the TaskConfig's
// inputs. Inputs that did not have volumes available on the worker will be streamed
@@ -152,6 +161,10 @@ func NewTaskStep(
// task's entire working directory is registered as an ArtifactSource under the
// name of the task.
func (action *TaskStep) Run(ctx context.Context, state RunState) error {
+ defer func() {
+ fmt.Println("--------------RUN-----------")
+ debug.PrintStack()
+ }()
logger := lagerctx.FromContext(ctx)
```

Running everything together, we can see the error landing in ATC. What's left is to go forward with some more exploration around the actual retries and to check what other kinds of errors would be interesting to catch.

Please let me know if you have any thoughts on this! Thx!
@cirocosta nice! Thinking a bit more about how this will work in practice, I think we might want to pull the worker pool's "Find or Create" logic into something which can select a worker, try to FoC, and retry if it fails with an error which indicates we can re-attempt the step. Currently this Pool implements the
exec steps would then follow a flow of:
Dropping in some more context on my above comment from a meatspace conversation about this issue with @cirocosta and @mhuangpivotal. They looked at the possibility of splitting Pool and Client, but it's a pretty large change which may not be necessary; we might want to create a decorator which augments the current behaviour of the Pool with retry logic.
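To make the decorator idea a bit more concrete, here's a rough sketch of what it could look like. Everything below is illustrative: the pool interface, the ErrWorkerUnusable type, and the retryingPool name are simplified placeholders, not the actual atc/worker API.

```go
package worker

import (
	"context"
	"time"
)

// ContainerSpec and Container are stand-ins for the real atc/worker types.
type ContainerSpec struct{ Platform string }
type Container interface{}

// pool is a simplified view of whatever the real Pool exposes for finding or
// creating a container.
type pool interface {
	FindOrCreateContainer(ctx context.Context, spec ContainerSpec) (Container, error)
}

// ErrWorkerUnusable is a placeholder for the class of errors that mean the
// worker, not the user's configuration, is at fault.
type ErrWorkerUnusable struct{ Err error }

func (e ErrWorkerUnusable) Error() string { return "worker unusable: " + e.Err.Error() }

// retryingPool decorates an existing pool with retry logic, so the exec steps
// keep a single FindOrCreateContainer call and the Pool/Client split can wait.
type retryingPool struct {
	delegate pool
	attempts int
	backoff  time.Duration
}

func (p *retryingPool) FindOrCreateContainer(ctx context.Context, spec ContainerSpec) (Container, error) {
	var lastErr error
	for i := 0; i < p.attempts; i++ {
		container, err := p.delegate.FindOrCreateContainer(ctx, spec)
		if err == nil {
			return container, nil
		}

		if _, unusable := err.(ErrWorkerUnusable); !unusable {
			// user error (bad task config, etc.) -- fail the step, don't retry
			return nil, err
		}

		// the worker looks broken: back off and let the delegate pick again
		lastErr = err
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(p.backoff):
		}
	}
	return nil, lastErr
}
```

The interesting part is only the type check in the middle; whether that check lives in a decorator, in the Pool itself, or in the exec steps is exactly the design question above.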
Might be helpful to categorize issues into three categories:
See #2542 for an example of an external error that requires retries on the same worker.
Hey,

Some more context that might be nice for whoever ends up doing some work on this issue: to start a worker locally using vito/houdini (on a macOS machine) like I did over there, here's a reference you can use to do it: https://github.com/cirocosta/concourse-naive-worker/blob/158d5883658c2cd1e81975c707cef92c9c7873fd/Makefile#L48-L55. To run that w/out any modifications, you can change

Hope it helps! thx!
Random thought on this; maybe a good feature to drive this out is to normalize the log messages a user sees when their build is running. Right now it's largely dependent on which step is trying to run, and where the worker flakes along the way; timeouts, "worker disappeared", volume not found, etc.
For #755, we'll need to know whether a build step has failed because of user error (a misconfigured step, resource config, or task config) or because the worker is unusable. We'll need to be able to differentiate between the "class" of error which occurred, and retry the step if the error falls into the unusable worker bucket.
Some examples of user error which would cause a step to fail (and we don't want to retry it):
- run.path (https://concourse-ci.org/tasks.html#task-run) on a task config. In this case the container will actually be created in garden, and it will fail afterwards
- image_resource
- inputs
If the worker is unusable, the container will likely fail to create, requests will time out, volumes will fail to create, or some other catastrophic error will occur; can we differentiate these errors when they're bubbled up to the exec/*_step.go?

The outcome of this spike should be a heuristic for determining when a build step can be retried, or at least a better understanding of scenarios in which we can safely retry a step.
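As a starting point, here's a rough sketch of what such a heuristic could look like. WorkerUnusableError and shouldRetryStep are made-up names for illustration; the real classification would have to happen wherever the Garden/baggageclaim errors actually surface.

```go
package exec

import (
	"context"
	"errors"
	"net"
)

// WorkerUnusableError is a hypothetical wrapper for errors that point at the
// worker being broken (container/volume creation failures, timeouts) rather
// than at the user's step, resource, or task config.
type WorkerUnusableError struct {
	Err error
}

func (e WorkerUnusableError) Error() string {
	return "worker unusable: " + e.Err.Error()
}

// shouldRetryStep sketches the heuristic this spike is after: only errors in
// the "unusable worker" bucket justify re-running the step; user errors (a
// bad run.path, image_resource, or inputs) should fail the build as usual.
func shouldRetryStep(err error) bool {
	if err == nil {
		return false
	}

	// explicit classification done where the error originates
	var unusable WorkerUnusableError
	if errors.As(err, &unusable) {
		return true
	}

	// catastrophic infrastructure symptoms: timeouts on requests to the worker
	if errors.Is(err, context.DeadlineExceeded) {
		return true
	}
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true
	}

	// everything else is assumed to be a user error and fails the build
	return false
}
```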