Use-case: Durable Workflow #369

Closed
mumoshu opened this issue Mar 15, 2018 · 16 comments
Labels
proposal Something to be discussed before moving forward

Comments

@mumoshu
Contributor

mumoshu commented Mar 15, 2018

Extracted from #125 (comment)

First of all, I'm not saying that we'd need to bake a workflow engine into Brigade.

But I just wanted to discuss how we could achieve this use-case in both the shorter and the longer term.

The interim solution can be implementing something on top of Brigade, and/or collaborating with other OSS projects.

Problem

Suppose a brigade.js corresponds to a "workflow" composed of one or more jobs. The workflow can be said to be "durable" when it survives pod/node/brigade failures. This characteristic - "durability" - is useful when the total time required to run the workflow from start to finish is considerably long. And therefore, even when the brigade worker failed in the middle of the workflow, the restarted worker should continue the workflow from where it had failed.

This use-case is typically achieved via a so-called (durable) workflow engine.

Think of a workflow engine as a stateful service that completes your DAG of jobs by supporting:

  • Retries / tolerating temporary brigade worker/job/node pod failure (Main concern in this issue)
  • Timeouts
  • Reliable notification on lifecycle events of builds and jobs.

From a Brigade user's perspective, if Brigade somehow achieved durability, no GitHub PR status would remain pending forever when something fails, and no time-consuming job (like running an integration test suite) would be rerun when a workflow is restarted.

Possible solutions

I have two possible solutions in my mind today.

Although configuration of the integration would be a mess, I slightly prefer 2, which keeps Brigade doing one thing very well - scripting, not running durable workflows!

  1. We can of course implement a lightweight workflow-engine-like thing inside each brigade gateway. But I feel like it would just be reinventing the wheel.

  2. We can also decide Brigade's scope to not include a durable workflow engine. Then, we can investigate possible integrations with another workflow-engine for providing durability to Brigade scripts.

I guess the possible integration may end up including:

  • A brigade gateway to invoke a durable workflow. For example, submit an Argo workflow.
  • A connector to allow external durable workflows to brig run $project -e $event_as_you_like.
    • It should be the user's responsibility to make a single brig run idempotent, so that the workflow can retry it whenever necessary.
    • If the user doesn't want to retry the whole brigade script, they can freely split the script into multiple event handlers like events.on('step1', ...) and events.on('step2', ...) so that the workflow can retry step1 and step2 independently (see the sketch below).
  • For example, reliably marking a GitHub PR status as failed when a brigade script fails could be done on the workflow engine, not on Brigade alone. The workflow invokes brig run -e build_failed on a notification step, then Brigade runs events.on('build_failed') to mark the PR status failed.
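
To make the event-handler-splitting idea above concrete, here is a minimal brigade.js sketch, assuming an external workflow engine triggers each step via brig run $project -e step1, -e step2, and -e build_failed. The event names, image names, and commands are placeholders, not part of any existing project:

const { events, Job } = require("brigadier");

events.on("step1", (e, project) => {
  // Each handler should be idempotent so the external engine can retry it safely.
  const build = new Job("step1-build", "my-build-image", ["make build"]);
  return build.run();
});

events.on("step2", (e, project) => {
  const test = new Job("step2-test", "my-test-image", ["make test"]);
  return test.run();
});

events.on("build_failed", (e, project) => {
  // Invoked by the workflow engine's notification step to mark the PR status failed.
  const notify = new Job("notify-failure", "my-notifier-image", ["./notify-github-failure.sh"]);
  return notify.run();
});
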
@mumoshu
Contributor Author

mumoshu commented Mar 23, 2018

FWIW, I've shared a possible solution with Argo here.

@blakestoddard

One thing in particular that I've found missing from Brigade is the ability to rerun failed builds. With CI tools like Buildkite, if a build fails because of an intermittent issue (say a DNS resolution fails and docker push doesn't work, causing the build to fail), I can hit "retry" and it does just that. With Brigade, that functionality doesn't exist -- the only way to retry a build is to commit something again.

I'm a big +1 on having the ability to retry builds. Whether that means full durability support is beyond me, but being able to retry things that have failed without having to find something to commit in the repo would be huge for me.

@mumoshu
Contributor Author

mumoshu commented Apr 17, 2018

@blakestoddard Thanks for chiming in! Yeah, that's a good point.

From my perspective, implementing durability into Brigade at the build level like that would make sense to me, assuming:

  • The implementation won't be that hard
  • Utilizing an external workflow engine just for this use-case seems unfriendly and a bit over-engineered, although it is possible...

We currently mark each build as "processed" (hence not needing to be retried later) as long as the brigade controller is able to "create" the worker pod, but it doesn't wait until the pod exits without an error.

Relevant code:
https://github.com/Azure/brigade/blob/ae0685f9a57e5d0ceef2bf62d0773e5f68250d17/brigade-controller/cmd/brigade-controller/controller/controller.go#L80-L96

https://github.com/Azure/brigade/blob/9f314937250463d823d69a595f158aa89000ceca/brigade-controller/cmd/brigade-controller/controller/handler.go#L24-L65

I believe we can make syncSecret async and wrap it in a control loop so that brigade periodically tries to rerun any unfinished build that has no on-going worker pod, until the build finally finishes with a successful status.

@technosophos @adamreese Would this change make sense to you, within the scope of Brigade?

Even if this feature existed, I'd prefer using another workflow engine for a complex workflow composed of multiple brigade scripts. My suggestion here covers the "at-least-once build run" use-case only.

@acham

acham commented May 15, 2019

Since a job run returns a promise, this works well for retries (tested): https://www.npmjs.com/package/promise-retry. The only catch is that the job name needs to be different for each attempt.

const { Job } = require('brigadier'); // Job comes from Brigade's brigadier library
const promiseRetry = require('promise-retry');

// Note: promise-retry's `retries` option is the number of retries *after* the first attempt.
const MAX_ATTEMPTS = 5;

let verizonPromise = promiseRetry((retry, number) => {
  console.log(`Verizon job attempt ${number}`);
  // Job names must be unique per attempt, hence the attempt-number suffix.
  let verizonJob = new Job(`vz-attempt${number}`, 'my-verizon-image');

  return verizonJob.run().catch(retry);
}, {
  retries: MAX_ATTEMPTS,
  factor: 1,
  minTimeout: 500
});

verizonPromise.catch((err) => {
  console.error(`Verizon job unsuccessful after ${MAX_ATTEMPTS} attempts, aborting workflow`);
  console.error(err);
  process.exit(1);
});

@technosophos
Contributor

Oh! That is an interesting strategy I had never considered! I wonder if it makes sense to include promise-retry as a core library for Brigade.

@acham

acham commented May 16, 2019

Probably. From the user's perspective, I think that having a property Job.attempts (1 by default) when defining the job would be a convenient interface. The other retry options seem to be designed mainly for HTTP calls.
I can attempt a pull request.
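
A rough sketch of what such an interface could look like as a brigade-utils-style add-on (the RetryableJob name, the attempts property, and the wrapping approach are all hypothetical, not an existing Brigade API):

const { Job } = require("brigadier");
const promiseRetry = require("promise-retry");

// Hypothetical wrapper: behaves like a single job, but retries run() up to
// `attempts` times, suffixing the job name per attempt to keep names unique.
class RetryableJob {
  constructor(name, image, tasks, attempts = 1) {
    this.name = name;
    this.image = image;
    this.tasks = tasks;
    this.attempts = attempts; // 1 by default, i.e. no retries
  }

  run() {
    return promiseRetry((retry, number) => {
      const job = new Job(`${this.name}-attempt${number}`, this.image, this.tasks);
      return job.run().catch(retry);
    }, { retries: this.attempts - 1, factor: 1, minTimeout: 500 });
  }
}

// Usage: tolerate up to two transient failures of the test job.
const test = new RetryableJob("integration-test", "my-test-image", ["make test"], 3);
test.run();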

@technosophos
Contributor

That would be good. I don't think jobs should do this by default, but I would love to see it as an easy add-on in pipelines for those cases where this is the desired behavior.

/cc @vdice

@krancour
Contributor

Retries aside, it seems there is also interest here in resuming a pipeline where it left off if a worker dies mid-pipeline. Is that right?

And therefore, even when the brigade worker failed in the middle of the workflow, the restarted worker should continue the workflow from where it had failed.

I've implemented this in other systems. Realistically, this cannot really be accomplished without major architectural changes and the introduction of a dependency on some kind of message-oriented middleware.

I am curious, however, how common an occurrence it is for workers to fail mid-pipeline. Is it frequent for you, and if so, do you know why? I'm in no way arguing against building for failure, but optimizations that introduce major architectural changes and significant new dependencies aren't things to enter into lightly, so I'm curious to see if we can get more bang for the buck by treating the root cause of worker failures.

@acham

acham commented Aug 8, 2019

Generally speaking, much of the design of Kubernetes considers pods to be somewhat fleeting entities which can easily fail. Brigade, being designed for Kubernetes, would do well to also consider them as such.

@radu-matei
Contributor

Possibly related: #977

@mumoshu
Contributor Author

mumoshu commented Sep 25, 2019

Dumping my memory as this issue was featured in today's Brigade mtg.

I think we have several things that can be done within the scope of this issue:

  1. To help make your build idempotent as a whole, add a "checkpointing" helper to a place like brigade-utils (see the sketch after this list).
  • This can be a js func/class that wraps Job and first checks for the existence of a specific K8s object (can be a custom resource like Checkpoint) keyed by a unique key for the job. If it exists, consider the job as already run, skip creating the Job pod, and instead pull the previous result from the K8s object.
    // DurableJob? CheckpointedJob?
    var test = new DurableJob(underlyingJob, {key: `${project}-${prNumber}-test`})
    var res = test.run()
    // `test.run` either creates a pod as usual, or consults a `Checkpoint` resource whose metadata.name is `${project}-${prNumber}-test`, if it exists, to obtain the `res`

    With this, you can rerun your build without worrying too much about duplicated results and too much wasted cluster resources.

  2. @acham's retry helper does help make your brigade script tolerate expected/voluntary transient failures. Compared to this, 1 above is a more high-level helper. Probably you can wrap a retried job in a checkpointer for completeness.

  3. Manually detecting and rerunning builds that failed due to transient errors is hard.

    To automate it, we can enhance builds (stored as K8s secrets) to include an additional field like expiration_date. We can write another K8s controller that watches and reruns expired builds.

    A build is considered expired when current_date > expiration_date and the completion date is not set.
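
To make item 1 a bit more concrete, here is a hypothetical sketch of such a checkpointing wrapper. The DurableJob name comes from the snippet above; using a ConfigMap as the "Checkpoint" store (instead of a CRD) and the @kubernetes/client-node call signatures are assumptions made to keep the sketch simple:

const { Job } = require("brigadier");
const k8s = require("@kubernetes/client-node");

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const core = kc.makeApiClient(k8s.CoreV1Api);

class DurableJob {
  constructor(underlyingJob, opts) {
    this.job = underlyingJob;
    this.key = opts.key; // unique per project/PR/job, e.g. `${project}-${prNumber}-test`
    this.namespace = opts.namespace || "default";
  }

  async run() {
    try {
      // If a checkpoint already exists, skip the pod and return the stored result.
      const cm = await core.readNamespacedConfigMap(this.key, this.namespace);
      return cm.body.data.result;
    } catch (err) {
      // Not found (a real implementation should check for a 404 specifically):
      // run the job, then record its result as a checkpoint.
      const result = await this.job.run();
      await core.createNamespacedConfigMap(this.namespace, {
        metadata: { name: this.key },
        data: { result: result.toString() },
      });
      return result;
    }
  }
}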

@krancour
Contributor

This can be a js func/class that wraps Job and first checks for the existence of a specific K8s object (can be a custom resource like Checkpoint) keyed by a unique key for the job.

I haven't thought through this all the way yet, but instead of introducing a new kind of resource type (at the moment, Brigade doesn't use any CRDs) or even a new resource of an existing type (e.g. "checkpoint" encoded in a secret) we should think about what kind of job status can already be inferred from existing resources. Job pods, for instance, stick around after completion. So, could it possibly be enough that when a worker goes to execute a given job for a given build, it checks first to see if such a pod already exists? If it exists and has completed, some status can be inferred. If it exists and is still running, it could wait for it to complete as if it had launched the job itself. If it doesn't exist, then go ahead and launch it.

Again, I haven't thought through all the details here. My suggestion is just to see what kind of mileage we can get out of all the existing resources in play before adding any new ones to the mix.
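
A rough sketch of the "infer job status from existing pods" idea. The label selector below (jobname/build) is an assumption about how Brigade labels job pods and would need to be verified against the actual worker implementation:

const k8s = require("@kubernetes/client-node");

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const core = kc.makeApiClient(k8s.CoreV1Api);

async function findExistingJobPod(jobName, buildID, namespace = "default") {
  const res = await core.listNamespacedPod(
    namespace,
    undefined, // pretty
    undefined, // allowWatchBookmarks
    undefined, // continue
    undefined, // fieldSelector
    `jobname=${jobName},build=${buildID}` // labelSelector (assumed labels)
  );
  return res.body.items.length > 0 ? res.body.items[0] : null;
}

// Before launching a job, the worker could do something like:
//   - pod exists and phase is "Succeeded": reuse its result, don't relaunch
//   - pod exists and phase is "Running"/"Pending": wait for it to complete
//   - no pod: launch the job as usual
async function decideAction(jobName, buildID) {
  const pod = await findExistingJobPod(jobName, buildID);
  if (pod === null) return "launch";
  return pod.status.phase === "Succeeded" ? "skip" : "wait";
}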

@krancour
Contributor

Closing this. Please see rationale in #995 (comment).

@krancour
Contributor

I am re-opening this issue because, after ruling this out of scope for the forthcoming Brigade 2.0 proposal due to technical constraints, I've discovered a realistic avenue to achieving it if we accept a minor compromise.

There have been two big technical limitations at work here: one is that Brigade itself doesn't understand your workflow definitions (only the worker image does, and those are customizable); the other is that restoring the shared state of the overall workflow to a correct/consistent state before resuming where it left off was not a realistic possibility without first relying on some kind of layered file system (a very big undertaking).

These can be addressed by imposing two requirements on projects that wish to take advantage of some kind of "resume" functionality: 1. it works for "stateless" workflows only (e.g. those that do not involve a workspace shared among jobs; externalizing state is ok), and 2. projects have to opt in to the "resume" functionality. Under these conditions, we could safely retry handling of a failed event and, while doing so, bypass any job whose status is already recorded as succeeded.

@krancour krancour reopened this Jun 22, 2020
@krancour krancour added 2.0 proposal Something to be discussed before moving forward and removed enhancement New feature or request labels Jun 22, 2020
@shmargum

shmargum commented Aug 5, 2020

Not sure if this really belongs here, but in addition to being able to restart failed pipelines it would be nice to be able to re-run successful pipelines. Having this functionality exposed via the Kashti UI might also be nice :)

I am more interested in being able to easily restart entire pipelines, since I expect I can just chain pipelines to achieve some intermediate checkpoint if I want.

@krancour
Contributor

This is well covered by the 2.0 proposal, which has been ratified and is now guiding the 2.0 development effort. It probably doesn't make sense to track this as a discrete issue anymore.
