Use-case: Durable Workflow #369
Comments
FWIW, I've shared a possible solution with Argo here. |
One thing in particular that I've found missing from Brigade is the ability to rerun failed builds. With CI tools like Buildkite, if a build fails because of an intermittent issue (say a DNS resolution failure), you can simply retry it. I'm a big +1 on having the ability to retry builds. Whether that means full durability support is beyond me, but being able to retry things that have failed without having to find something to commit in the repo would be huge for me. |
@blakestoddard Thanks for chiming in! Yeah, that's a good point. From my perspective, implementing durability in Brigade at the build level like that would make sense to me, assuming:
We currently mark each build as "processed" (hence not needing to be retried later) as long as the Brigade controller is able to "create" the pod, but it doesn't wait until the pod exits without an error. I believe we can change that. @technosophos @adamreese Would this change make sense to you, within the scope of Brigade? Even if this feature existed, I'd prefer using another workflow engine for a complex workflow composed of multiple Brigade scripts. My suggestion here covers the "at-least-once build run" use-case only. |
Since a job run returns a promise, this works well for retries (tested): https://www.npmjs.com/package/promise-retry. The only catch is that the job name needs to be different for each attempt.

```javascript
const { Job } = require('brigadier');
const promiseRetry = require('promise-retry');

const MAX_ATTEMPTS = 5;

let verizonPromise = promiseRetry((retry, number) => {
  console.log(`Verizon job attempt ${number}`);
  const verizonJob = new Job(`vz-attempt${number}`, 'my-verizon-image');
  return verizonJob.run().catch(retry);
}, {
  retries: MAX_ATTEMPTS,
  factor: 1,
  minTimeout: 500
});

verizonPromise.catch((err) => {
  console.error(`Verizon job unsuccessful after ${MAX_ATTEMPTS} attempts, aborting workflow`);
  console.error(err);
  process.exit(1);
});
```
|
Oh! That is an interesting strategy I had never considered! I wonder if it makes sense to include |
Probably. From the user's perspective, I think that having a property |
That would be good. I don't think jobs should do this by default, but I would love to see it as an easy add-on in pipelines for those cases where this is the desired behavior. /cc @vdice |
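A minimal sketch of what such an easy add-on could look like, building on the promise-retry approach above. The `runWithRetries` helper and its option names are hypothetical, not an existing brigadier API:

```javascript
const { Job } = require('brigadier');
const promiseRetry = require('promise-retry');

// Hypothetical helper: wraps a job factory in promise-retry.
// `makeJob` receives the attempt number so each attempt gets a unique job name.
function runWithRetries(makeJob, { retries = 3, factor = 1, minTimeout = 500 } = {}) {
  return promiseRetry((retry, attempt) => {
    const job = makeJob(attempt);
    return job.run().catch(retry);
  }, { retries, factor, minTimeout });
}

// Example usage inside a brigade.js handler: retry a flaky job up to 5 times.
// runWithRetries((n) => new Job(`deploy-attempt${n}`, 'my-image'), { retries: 5 });
```
|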
Retries aside, it seems there is also interest here in resuming a pipeline where it left off if a worker dies mid-pipeline. Is that right?
I've implemented this in other systems. Realistically, this cannot really be accomplished without major architectural changes and the introduction of a dependency on some kind of message-oriented middleware. I am curious, however, how common an occurrence it is for workers to fail mid-pipeline. Is it frequent for you, and if so, do you know why? I'm in no way arguing against building for failure, but optimizations that introduce major architectural changes and significant new dependencies aren't things to enter into lightly, so I'm curious to see if we can get more bang for the buck by treating the root cause of worker failures. |
Generally speaking, much of the design of Kubernetes considers pods to be somewhat fleeting entities which can easily fail. Brigade, being designed for Kubernetes, would do well to also consider them as such. |
Possibly related: #977 |
Dumping my memory as this issue was featured in today's Brigade mtg. I think we have several things that can be done within the scope of this issue:
|
I haven't thought through this all the way yet, but instead of introducing a new kind of resource type (at the moment, Brigade doesn't use any CRDs) or even a new resource of an existing type (e.g. "checkpoint" encoded in a secret) we should think about what kind of job status can already be inferred from existing resources. Job pods, for instance, stick around after completion. So, could it possibly be enough that when a worker goes to execute a given job for a given build, it checks first to see if such a pod already exists? If it exists and has completed, some status can be inferred. If it exists and is still running, it could wait for it to complete as if it had launched the job itself. If it doesn't exist, then go ahead and launch it. Again, I haven't thought through all the details here. My suggestion is just to see what kind of mileage we can get out of all the existing resources in play before adding any new ones to the mix. |
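As a rough illustration of that idea, the worker could ask the Kubernetes API whether a pod for a given job already exists before creating one. This is only a sketch assuming the @kubernetes/client-node package (pre-1.0 API style) and a hypothetical pod-naming convention; it is not Brigade's actual implementation:

```javascript
const k8s = require('@kubernetes/client-node');

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const core = kc.makeApiClient(k8s.CoreV1Api);

// Hypothetical: infer a job's status from an existing pod instead of always launching a new one.
// `podName` and `namespace` follow whatever convention the worker already uses for job pods.
async function jobStatusFromExistingPod(podName, namespace) {
  try {
    const { body: pod } = await core.readNamespacedPod(podName, namespace);
    if (pod.status.phase === 'Succeeded') return 'already-succeeded';
    if (pod.status.phase === 'Failed') return 'already-failed';
    return 'still-running'; // wait for it to finish as if we had launched it ourselves
  } catch (err) {
    if (err.response && err.response.statusCode === 404) {
      return 'not-found'; // no pod yet, so go ahead and launch the job
    }
    throw err;
  }
}
```
|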
Closing this. Please see rationale in #995 (comment). |
I am re-opening this issue because, after ruling this out of scope for the forthcoming Brigade 2.0 proposal due to technical constraints, I've discovered a realistic avenue to achieving it if we accept a minor compromise. There have been two big technical limitations at work here-- one being that Brigade itself doesn't understand your workflow definitions (only the worker image does-- and those are customizable) and the other being that restoring the shared state of the overall workflow to a correct / consistent state prior to resuming where a workflow left off was not a realistic possibility without first relying on some kind of layered file system (a very big undertaking). These can be addressed by imposing two requirements on projects that wish to take advantage of some kind of "resume" functionality-- 1. it works for "stateless" workflows only (e.g. those that do not involve a workspace shared among jobs; externalizing state is ok) and 2. projects have to opt in to the "resume" functionality. Under these conditions, we could safely retry handling of a failed event and, whilst doing so, bypass any job whose status is already recorded as succeeded. |
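To make that compromise concrete, here is a minimal sketch of what an opt-in "resume" guard could look like in a worker. The `statusStore` interface and the status keys are hypothetical; the point is only that, on a retried event, any job already recorded as succeeded is bypassed:

```javascript
const { Job } = require('brigadier');

// Hypothetical external status store (e.g. backed by a database or object store).
// Workspaces shared among jobs are deliberately unsupported, per the "stateless workflows only" rule.
async function runOrSkip(statusStore, eventID, job) {
  const key = `${eventID}/${job.name}`;
  if ((await statusStore.get(key)) === 'succeeded') {
    console.log(`Skipping ${job.name}: already succeeded in a previous attempt`);
    return;
  }
  const result = await job.run();
  await statusStore.set(key, 'succeeded');
  return result;
}
```
|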
Not sure if this really belongs here, but in addition to being able to restart failed pipelines, it would be nice to be able to re-run successful pipelines. Having this functionality exposed via the Kashti UI might also be nice :) I am more interested in being able to easily restart entire pipelines, since I expect that I can just chain pipelines to achieve some intermediate checkpoint if I want. |
This is well covered by the 2.0 proposal, which has been ratified and is now guiding the 2.0 development effort. It probably doesn't make sense to track this as a discrete issue anymore. |
Extracted from #125 (comment)
First of all, I'm not saying that we'd need to bake a workflow engine into Brigade.
But I just wanted to discuss how we could achieve this use-case in both the shorter term and the longer term.
An interim solution could be implementing something on top of Brigade, and/or collaborating with other OSS projects.
Problem
Suppose a brigade.js script corresponds to a "workflow" composed of one or more jobs. That workflow can be said to be "durable" when it survives pod/node/Brigade failures. This characteristic - "durability" - is useful when the total time required to run the workflow from start to finish is considerably long: even when the Brigade worker fails in the middle of the workflow, the restarted worker should continue the workflow from where it had failed. This use-case is typically achieved via a so-called (durable) workflow engine.
Think of a workflow engine as a stateful service that completes your DAG of jobs by supporting:
From a Brigade user's perspective, if Brigade somehow achieved durability, no GitHub PR status would remain pending forever when something fails, and no time-consuming job (like running an integration test suite) would be rerun when a workflow is restarted.
Possible solutions
I have two possible solutions in mind today. Although configuration of the integration would be a mess, I slightly prefer 2, which keeps Brigade doing one thing very well - scripting, not running durable workflows!
1. We can of course implement a light-weight workflow-engine-like thing inside each Brigade gateway. But I feel like that is just a reinvention of the wheel.
2. We can also decide that Brigade's scope should not include a durable workflow engine. Then, we can investigate possible integrations with another workflow engine to provide durability to Brigade scripts.
I guess the possible integration may end up including:
- The workflow engine triggering each step via `brig run $project -e $event_as_you_like`.
- Making `brig run` idempotent, so that the workflow can retry it whenever necessary.
- Splitting the script into `events.on('step1', ...)` and `events.on('step2', ...)` so that the workflow can retry step1 and step2 independently (see the sketch below).
- Running `brig run -e build_failed` on a notification step, so that Brigade then runs `events.on('build_failed')` to mark the PR status failed.
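As a sketch of the step-split idea above (handler names, images, and tasks here are made up for illustration, not taken from the original discussion):

```javascript
const { events, Job } = require('brigadier');

// Each step is its own event handler, so an external workflow engine can
// retry `brig run <project> -e step1` or `brig run <project> -e step2` independently.
events.on('step1', () => {
  return new Job('step1', 'alpine:3.18', ['echo running step 1']).run();
});

events.on('step2', () => {
  return new Job('step2', 'alpine:3.18', ['echo running step 2']).run();
});

events.on('build_failed', () => {
  // For example, mark the PR status as failed via a notification job.
  return new Job('notify-failure', 'my-notify-image').run();
});
```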