[RFC] Scaffolder: Enable TaskWorker to be Non-blocking #13799
Comments
Hey 👋 so first off, thanks for writing such a detailed RFC! It's great to see! We've actually been having some thoughts around this recently, and are wondering what the benefits of the event emitter interface actually are. Looking at the PR, it's a breaking change, and we're wondering if there's really any need for it. What we're thinking is that we actually probably need to fix the We're thinking that we change the Does this make sense? |
Hi @benjdlambert, thanks for the comment! I really had a lot of fun diving into Backstage and creating this RFC 😃 I'll summarize my interpretation of your comment, please correct me if I'm wrong on any point:
Response per point:
If we were to refactor the start() method to be non-blocking like so:
start() {
- (async () => {
for (;;) {
- const task = await this.options.taskBroker.claim();
- await this.runOneTask(task);
+ this.options.taskBroker.claim().then(task => {
+ return this.runOneTask(task);
+ });
}
- })();
}
NodeJS would run out of memory executing the
One viable option is that we could leave the
start() {
(async () => {
for (;;) {
const task = await this.options.taskBroker.claim();
- await this.runOneTask(task);
+ this.runOneTask(task);
}
})();
}
This option works, and it is Alternative 1 in the RFC. It is literally a one-line change, and it is non-breaking. If the team prefers this approach, I can change the PR. To be transparent, I opted for the more "provocative" PR because, one, it introduces a cleaner architecture, and two, I wanted to have this conversation 😄.
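For anyone following along, here is a minimal, self-contained sketch of that second variant: the loop still awaits `claim()`, so it yields to the event loop on every iteration, but it does not await `runOneTask()`. `InMemoryBroker` and `SketchTaskWorker` are hypothetical stand-ins written for illustration only, not the real scaffolder classes.

```typescript
// Hypothetical stand-ins for illustration; not the real scaffolder classes.
interface Task {
  id: string;
}

class InMemoryBroker {
  private readonly queue: Task[] = [];
  private readonly waiters: Array<(task: Task) => void> = [];

  dispatch(task: Task): void {
    const waiter = this.waiters.shift();
    if (waiter) {
      waiter(task); // hand the task straight to a pending claim()
    } else {
      this.queue.push(task);
    }
  }

  // Resolves with the next pending task, waiting if none is queued yet.
  claim(): Promise<Task> {
    const task = this.queue.shift();
    if (task) {
      return Promise.resolve(task);
    }
    return new Promise<Task>(resolve => {
      this.waiters.push(resolve);
    });
  }
}

class SketchTaskWorker {
  readonly started: string[] = [];
  readonly finished: string[] = [];
  private readonly broker: InMemoryBroker;

  constructor(broker: InMemoryBroker) {
    this.broker = broker;
  }

  start(): void {
    (async () => {
      for (;;) {
        // Awaiting claim() suspends the loop until a task is available.
        const task = await this.broker.claim();
        // Not awaited: the loop goes straight back to claiming the next task.
        this.runOneTask(task).catch(err => console.error(err));
      }
    })();
  }

  async runOneTask(task: Task): Promise<void> {
    this.started.push(task.id);
    // Simulated work; a real task would run its workflow steps here.
    await new Promise(resolve => setTimeout(resolve, 100));
    this.finished.push(task.id);
  }
}
```

With this shape, two tasks dispatched back to back both start before either finishes, which is also why nothing in the loop bounds how many tasks run at once.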
To step back for a moment, from a JS standpoint, infinite loops are a code smell. This RFC is the result of me following that smell when I first saw the infinite loop in the This is a lot of text, and I left out some other thoughts regarding the TaskWorker/TaskBroker architecture. If the team is interested, I'm happy to hop on a call next week with the team to explore some paths forward. Thanks! |
@howlowck Bit more context around (1) is that it was a conscious decision to go for the current async polling strategy with The way to go if we don't want to limit workers at all would be to drop the |
Hi @Rugvip, thanks for the context. I think I'm starting to understand more of the motivation behind the architecture. Let's see if I'm getting this right (please correct me if I'm mistaken in any way):
I think we are aligned on a few points here:
Is that right? Thanks again for being open and taking the time to have a conversation with me on this 😄 |
@howlowck Yep, pretty much that! Main thing I want to tweak there is the second alignment point. It could be useful to dynamically scale the number of available workers, although I'm thinking it's a little bit overkill. I think the biggest improvement we could make to the system at this point is if all the scaffolder instances collaborated better to distribute the load. Right now it's only the scaffolder instance that receives the task We already have similar setups in other places, and all we need imo is a polling loop that checks for available tasks in the DB. Slap a |
Thanks @Rugvip! I guess the point I want to reiterate is that just because a task is running (or has started) doesn't necessarily mean it's busy (or resource intensive). Most of our scaffolding tasks simply fetch status from an external workflow orchestrator (Argo Workflows in our case). Since simple fetches use very little memory, blocking at the TaskWorker level seems inefficient. I understand the desire to throttle the server, and that can still be the default behavior, but what if we give users the ability to choose between blocking (throttle) or non-blocking (run everything) TaskWorkers? Or do you think that behavior is too different for |
@howlowck I think the best low hanging fruit to grab would be to refactor the |
Interesting.. ok. Let's align on setting a worker limit option for now (I'd be happy to work on the implementation). How would the |
@rodmachen @OscarDHdz fyi - Do we need to set up a chat to agree on the solution? We are already working on this internally to contribute back, and would be happy to align with the community and work together so as not to duplicate effort hahaha - @Rugvip do you see a need to have a SIG for scaffolder backend? |
We do want to expand on the number of SIGs for sure! That's the intent, but we've been focusing on learning from the development of one SIG to begin with, so we know how to best scale things up. It'll probably be a little bit longer until we roll out the second, third one etc. |
@howlowck I'd say we drop the individually created task workers, and instead have a single task worker with a work limit. |
Thanks @Rugvip. I'll revise my PR to have the TaskWorker manage a limit of concurrent tasks. |
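As a rough illustration of what "a single task worker with a work limit" could mean, here is a minimal sketch. It is an assumption about the shape, not the implementation that ended up in the PR: `LimitedTaskWorker`, its constructor parameters, and the `limit` option are all hypothetical names.

```typescript
// Hypothetical sketch; names and constructor shape are assumptions.
interface Task {
  id: string;
}

class LimitedTaskWorker {
  private running = 0;
  private wakeUp: (() => void) | undefined;
  private readonly claim: () => Promise<Task>;
  private readonly run: (task: Task) => Promise<void>;
  private readonly limit: number;

  constructor(
    claim: () => Promise<Task>,
    run: (task: Task) => Promise<void>,
    limit: number,
  ) {
    this.claim = claim;
    this.run = run;
    this.limit = limit;
  }

  start(): void {
    (async () => {
      for (;;) {
        // Do not claim new work while every slot is taken.
        while (this.running >= this.limit) {
          await new Promise<void>(resolve => {
            this.wakeUp = resolve;
          });
        }
        const task = await this.claim();
        this.running += 1;

        // Free the slot and wake the loop when the task settles.
        const release = () => {
          this.running -= 1;
          const wake = this.wakeUp;
          this.wakeUp = undefined;
          if (wake) wake();
        };
        // Not awaited, so up to `limit` tasks can run concurrently.
        this.run(task).then(release, err => {
          console.error(err);
          release();
        });
      }
    })();
  }
}
```

Setting the limit to 1 reproduces the current blocking behavior, while a larger limit lets lightweight tasks overlap without removing the throttle entirely.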
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Status: Open for comments
Let's Make TaskWorker Non-blocking (in `plugins/scaffolder-backend`)
Current State of the World:
Currently, the `router` of the `scaffolder-backend` creates the `TaskBroker` and creates/starts the `TaskWorkers`. The TaskWorker's `start` then blocks and waits for the next pending task inside of an infinite loop. When a task is created, the `TaskWorker` asks the `NunjucksWorkflowRunner` to execute the task, also blocking, until the task is complete.
Issues:
- The interaction between the `TaskBroker`, the `TaskWorker`, and the `NunjucksWorkflowRunner` within the `scaffolder-backend` is very complex. I created a diagram to illustrate the code upon app start (green arrow) and when a Task is created (black arrow).
Proposal
I propose we make the TaskWorker non-blocking.
We can even leave the "start" method untouched, and create a new `listen` method in TaskWorker. Instead of using a combination of `async/await` and Promises to essentially long-poll the pending tasks, then long-poll again for the completion of the task, we can use the `events`' EventEmitter or RxJS Observable to notify the TaskWorker to asynchronously run the task.
Steps for one potential implementation (using EventEmitter):
1. In the `createRouter` method, create an EventEmitter (e.g. `taskStatusEvents`).
2. Pass `taskStatusEvents` to both the `TaskBroker` and `TaskWorker` instances upon creation.
3. In the `dispatch` method, use `taskStatusEvents.emit('newTask')` to emit a newTask event.
4. In the `TaskWorker` class, create a new `listen` method. The method uses `taskStatusEvents.on('newTask')` to listen to the event and then runs `claim` and `runOneTask` (just like in the `start` method), but it does not `await` the `runOneTask` method.
5. In the `router`, call `listen()` instead of `start()`.
Alternatives
1. Remove the `await` on the `runOneTask` method in the `TaskWorker`'s `start` method. Since the UI and the `NunjucksWorkflowRunner` are interacting outside of the TaskWorker, we don't need to wait for the completion of the task in the `start` method's infinite loop.
Risks
I can't think of any risks since it's simply a different way of starting tasks without affecting the interface surface for the users of Backstage.
Note:
I have a fork running and tested that both the Proposal, and the first alternative were able to start multiple Tasks with a single TaskWorker. I plan to make a PR along with this RFC.
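To make the EventEmitter steps in the Proposal more concrete, here is a minimal, self-contained sketch. It is not the code from the fork: `SketchTaskBroker` and `SketchTaskWorker` are simplified, hypothetical stand-ins for the real classes, and the in-memory `pending` array stands in for the task storage. Only `taskStatusEvents`, the `'newTask'` event name, and the method names `dispatch`, `claim`, `listen`, and `runOneTask` come from the text above.

```typescript
import { EventEmitter } from 'events';

// Hypothetical, simplified stand-ins for the scaffolder classes.
interface Task {
  id: string;
}

class SketchTaskBroker {
  private readonly pending: Task[] = [];
  private readonly events: EventEmitter;

  constructor(events: EventEmitter) {
    this.events = events;
  }

  dispatch(task: Task): void {
    this.pending.push(task);
    // Step 3: announce that a task is waiting to be claimed.
    this.events.emit('newTask');
  }

  async claim(): Promise<Task | undefined> {
    return this.pending.shift();
  }
}

class SketchTaskWorker {
  readonly started: string[] = [];
  readonly completed: string[] = [];
  private readonly broker: SketchTaskBroker;
  private readonly events: EventEmitter;

  constructor(broker: SketchTaskBroker, events: EventEmitter) {
    this.broker = broker;
    this.events = events;
  }

  // Step 4: react to 'newTask' instead of looping forever.
  listen(): void {
    this.events.on('newTask', () => {
      this.broker
        .claim()
        .then(task => {
          if (!task) return undefined; // already claimed elsewhere
          // Returned but never awaited by the listener, so tasks overlap.
          return this.runOneTask(task);
        })
        .catch(err => console.error(err));
    });
  }

  async runOneTask(task: Task): Promise<void> {
    this.started.push(task.id);
    // Simulated work in place of the workflow runner.
    await new Promise(resolve => setTimeout(resolve, 100));
    this.completed.push(task.id);
  }
}

// Steps 1, 2 and 5: the wiring that would live in createRouter.
const taskStatusEvents = new EventEmitter();
const sketchBroker = new SketchTaskBroker(taskStatusEvents);
const sketchWorker = new SketchTaskWorker(sketchBroker, taskStatusEvents);
sketchWorker.listen();
```

Calling `sketchBroker.dispatch({ id: 'a' })` and then `sketchBroker.dispatch({ id: 'b' })` starts both tasks right away on the single worker, since each event handler returns without waiting for the task to complete.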