
[RFC] Scaffolder: Enable TaskWorker to be Non-blocking #13799

Closed

howlowck opened this issue Sep 21, 2022 · 14 comments
Labels: area:scaffolder, rfc, stale

Comments

@howlowck (Contributor) commented Sep 21, 2022

Status: Open for comments

Let's Make TaskWorker Non-blocking (in plugins/scaffolder-backend)

Current State of the World:

Currently, the router of the scaffolder-backend creates the TaskBroker and creates/starts the TaskWorkers. The TaskWorker's start then blocks and waits for the next pending task inside an infinite loop. When a task is created, the TaskWorker asks the NunjucksWorkflowRunner to execute it and blocks again until the task is complete.

Issues:

  1. The interaction between the store, the TaskBroker, the TaskWorker, and the NunjucksWorkflowRunner within the scaffolder-backend is very complex. I created a diagram to illustrate the code paths at app start (green arrows) and when a Task is created (black arrows).

[Diagram: scaffolder-backend interactions between the store, TaskBroker, TaskWorker, and NunjucksWorkflowRunner at app start and on task creation]

  2. Each TaskWorker blocks while waiting for a new task, then blocks again while running a task. This means that in order to accommodate an increasing number of long-running tasks (especially in an enterprise scenario), we'd need to increase the TaskWorker count and restart the app every time. See "Do not block template creation, when a task is still being executed (blocking parallel execution)" #13419 for this exact issue, and the fix in "scaffolder-backend: use 3 parallel workers by default" #13558.

Proposal

I propose we make the TaskWorker non-blocking.

We can even leave the start method untouched and create a new listen method in TaskWorker. Instead of using a combination of async/await and Promises to essentially long-poll for pending tasks, then long-poll again for the completion of the task, we can use Node's events EventEmitter or an RxJS Observable to notify the TaskWorker to asynchronously run the task.

[Diagram: proposed non-blocking, event-driven flow between the TaskBroker and TaskWorker]

Steps for one potential implementation (using EventEmitter); a minimal sketch follows the list:

  1. In the createRouter method, create an EventEmitter (e.g. taskStatusEvents).
  2. Pass the taskStatusEvents to both the TaskBroker and TaskWorker instances upon creation.
  3. In the dispatch method, use taskStatusEvents.emit('newTask') to emit a newTask event.
  4. In the TaskWorker class, create a new listen method. The method uses taskStatusEvents.on('newTask') to listen to the event and then runs claim and runOneTask (just like in the start method), but it does not await the runOneTask call.
  5. In the router, instead of calling start(), call listen() instead.
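
A minimal TypeScript sketch of these steps, assuming simplified, hypothetical shapes for Task, TaskBroker, and TaskWorker named after the RFC text (not the actual Backstage interfaces):

import { EventEmitter } from 'events';

interface Task {
  id: string;
}

interface TaskBroker {
  claim(): Promise<Task>;
}

class TaskWorker {
  constructor(private readonly options: { taskBroker: TaskBroker }) {}

  // Proposed non-blocking entry point: react to 'newTask' events instead of
  // long-polling inside an infinite loop.
  listen(taskStatusEvents: EventEmitter) {
    taskStatusEvents.on('newTask', async () => {
      const task = await this.options.taskBroker.claim();
      // Deliberately not awaited, so this worker can react to further events.
      void this.runOneTask(task);
    });
  }

  private async runOneTask(task: Task) {
    // In the real implementation this would delegate to the NunjucksWorkflowRunner.
    console.log(`running task ${task.id}`);
  }
}

// Steps 1-3 in createRouter: create the emitter, share it with the broker and
// the worker, and have the broker's dispatch emit the event:
//   const taskStatusEvents = new EventEmitter();
//   taskStatusEvents.emit('newTask');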

Alternatives

  1. The easiest way to get around waiting for a long-running task is to simply remove the await on the runOneTask method in the TaskWorker's start method. Since the UI and the NunjucksWorkflowRunner interact outside of the TaskWorker, we don't need to wait for the completion of the task in the start method's infinite loop.
  2. Dynamic TaskWorkers: This approach keeps the existing blocking TaskWorker, but instead of making it non-blocking, the idea is to spin up new TaskWorkers ad hoc, circumventing the blocking issue (a rough sketch follows the list).
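
For comparison, a rough sketch of the second alternative, using the same simplified shapes and a hypothetical onTaskDispatched hook; it only illustrates the idea of short-lived, per-task workers:

interface Task {
  id: string;
}

interface TaskBroker {
  claim(): Promise<Task>;
}

class TaskWorker {
  // Existing blocking behavior: run one claimed task to completion.
  async runOneTask(task: Task) {
    console.log(`running task ${task.id}`);
  }
}

// Hypothetical hook invoked whenever a task is dispatched: spin up a
// short-lived worker for that one task instead of keeping a fixed pool
// of blocking workers around.
async function onTaskDispatched(taskBroker: TaskBroker) {
  const worker = new TaskWorker();
  const task = await taskBroker.claim();
  await worker.runOneTask(task);
  // The worker goes out of scope here; there is no long-lived pool to
  // resize or restart.
}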

Risks

I can't think of any risks since it's simply a different way of starting tasks without affecting the interface surface for the users of Backstage.

Note:

I have a fork running, and I tested that both the Proposal and the first alternative were able to start multiple Tasks with a single TaskWorker. I plan to make a PR along with this RFC.

@howlowck howlowck added the rfc Request For Comment(s) label Sep 21, 2022
@github-actions github-actions bot added the area:scaffolder Everything and all things related to the scaffolder project area label Sep 21, 2022
@benjdlambert (Member)

Hey 👋 so first off thanks for writing such a detailed RFC! It's great to see!

We've actually been having some thoughts around this recently, and are wondering what the benefits of the event emitter interface actually are. Looking at the PR, it's a breaking change, and we're wondering if there's really any need for it?

What we're thinking is that we probably need to fix the polling behaviour, as the Broker doesn't seem to poll; it just gets notified on the host when there's something available.

We're thinking that we change the workers option to be a max limit of workers that can be spun up, but if you don't provide this option then it's basically Infinity. And then we can create and tear down workers for each job rather than keeping them around?

Does this make sense?

@howlowck (Contributor, Author) commented Sep 23, 2022

Hi @benjdlambert, thanks for the comment! I really had a lot of fun diving into Backstage and creating this RFC 😃

I'll summarize my interpretation of your comment; please correct me if I'm wrong on any point:

  1. The PR has a breaking change: introduction of the eventEmitter. The Backstage Team's view is that the value of the eventEmitter doesn't warrant the impact of a breaking change.
  2. We can still achieve non-blocking in other ways without the eventEmitter, and plan to fix the polling behavior.
  3. We could create/destroy the workers (still blocking) dynamically (like Scaffolder: Dynamic Task Workers #13650)

Response per point:

  1. I can't argue that the value of the eventEmitter is greater than the impact of the breaking change, because I don't know the extent of that impact. And you are right, the eventEmitter is simply a mechanism, and maybe it can be abstracted or implemented in a different way. A personal rule of thumb for me is that using native implementations (like the events EventEmitter) is better than building our own, and that a breaking change towards a simpler mental model is worth it. To your point, when I was adding the eventEmitter, I did pause and think "was it really necessary?" and opted for it in the end because of the simplicity it introduced.
  2. That's awesome to hear that there is a plan to fix the polling behavior. To me, the issue is in the TaskWorker, specifically the infinite loop in start(). I know why it's an infinite loop given the current architecture, and if we make the body of the loop non-blocking, then it surfaces the real issue:

If we were to refactor the start() method to be non-blocking like so:

start() {
-   (async () => {
      for (;;) {
-        const task = await this.options.taskBroker.claim();
-        await this.runOneTask(task);      
+        this.options.taskBroker.claim().then(task => {
+          return this.runOneTask(task);
+        });
      }
-  })();
  }

Node.js would run out of memory executing this start() method, because without any await in the loop body the for (;;) never yields to the event loop and keeps queuing new claim() promises synchronously.

One viable option is to keep the await on claim() and only delete the await on runOneTask(), like so:

start() {
    (async () => {
      for (;;) {
        const task = await this.options.taskBroker.claim();
-       await this.runOneTask(task);
+       this.runOneTask(task);
      }
    })();
  }

This option works, and it's Alternative 1 in the RFC. It's literally a one-line, non-breaking change. If the team prefers this approach, I can change the PR. To be transparent, I opted for the more "provocative" PR because, one, it introduces a cleaner architecture, and two, I wanted to have this conversation 😄.

  3. I would caution against this path, specifically leaving the worker as a blocking process and just increasing the worker count dynamically. To me, dynamically spinning up and destroying workers is yet another layer of abstraction and another cognitive overhead. The two infinite loops (one in TaskWorker, and one in TaskBroker) are acting as their own event loops, which really should be the JS engine's responsibility. We don't need to create our own workload orchestrator too. To me, dynamically creating blocking workers (in a process that can be non-blocking) is putting a band-aid on the core issue outlined in the RFC.

To step back for a moment, from a JS standpoint, infinite loops are a code smell. This RFC is the result of me following that smell when I first saw the infinite loop in the TaskWorker (then seeing it again in the TaskBroker). I really hope the Backstage team reconsiders the value of the eventEmitter and eliminates the infinite loops and long-polling blocking code.

This is a lot of text, and I left out some other thoughts regarding the TaskWorker/TaskBroker architecture. If the team is interested, I'm happy to hop on a call next week with the team to explore some paths forward. Thanks!

@Rugvip (Member) commented Sep 25, 2022

@howlowck Bit more context around (1) is that it was a conscious decision to go for the current async polling strategy with claim() rather than a callback-like pattern such as event emitter. The reason is that async polling tends to handle back pressure better. With event emitters and other callback patterns it gets pretty complicated if you want to communicate back to the source that you're not ready to accept more work. With a claim() method it's a lot simpler to control the consumption of tasks, and with that the scaling of things out across multiple nodes.

The way to go if we don't want to limit workers at all would be to drop the await from this.runOneTask(task). We of course can't remove the await from taskBroker.claim(), because then we'd have a tight loop that never yields to the event loop. In practice I think we'll want to limit the amount of concurrent work in some way, or perhaps figure out a way to prefer nodes with less ongoing work when distributing tasks.
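
A hedged sketch of what limiting concurrent work could look like, keeping the claim() pull model and using a hypothetical maxConcurrentTasks option (not an existing Backstage option):

interface Task {
  id: string;
}

interface TaskBroker {
  claim(): Promise<Task>;
}

class TaskWorker {
  private inFlight = 0;

  constructor(
    private readonly options: {
      taskBroker: TaskBroker;
      // Hypothetical limit; Infinity would mean "never block on capacity".
      maxConcurrentTasks: number;
    },
  ) {}

  start() {
    (async () => {
      for (;;) {
        // Back pressure: wait until we're below the limit before claiming more.
        while (this.inFlight >= this.options.maxConcurrentTasks) {
          await new Promise(resolve => setTimeout(resolve, 1000));
        }
        const task = await this.options.taskBroker.claim();
        this.inFlight += 1;
        // Not awaited: the loop can claim the next task while this one runs.
        void this.runOneTask(task).finally(() => {
          this.inFlight -= 1;
        });
      }
    })();
  }

  private async runOneTask(task: Task) {
    console.log(`running task ${task.id}`);
  }
}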

@howlowck (Contributor, Author)

Hi @Rugvip, thanks for the context. I think I'm starting to understand more of the motivation behind the architecture. Let's see if I'm getting this right (please correct me if I'm mistaken in any way):

  • The blocking polling is a way to throttle the TaskBroker so it doesn't overload the system.
  • There needs to be a way for the TaskWorker to tell the TaskBroker when to take tasks and when to hold off, so a purely reactive architecture isn't suitable: the TaskBroker can't just push new events and expect the TaskWorker to start them.
  • Between the TaskWorker starting open tasks all at once and the TaskWorker taking and completing open tasks one at a time, the Spotify team chose the latter.
  • The Spotify team acknowledges that the blocking, polling behavior is not ideal, but it's the cost of having a mechanism that keeps tasks from exhausting memory.

I think we are aligned on a few points here:

  • It'd be good if the number of concurrent tasks could be set dynamically by the user, without restarting Backstage.
  • It'd be even better if the number of concurrent tasks were adjusted automatically based on available system resources (memory usage, for example), with the user able to set a threshold of concurrent tasks (defaulting to infinite).

Is that right?

Thanks again for being open and taking the time to have a conversation with me on this 😄

@Rugvip (Member) commented Sep 28, 2022

@howlowck Yep, pretty much that!

Main thing I want to tweak there is the second alignment point. It could be useful to dynamically scale the number of available workers, although I'm thinking it's a little bit overkill.

I think the biggest improvement we could make to the system at this point is if all the scaffolder instances collaborated better to distribute the load. Right now it's only the scaffolder instance that receives the task POST request that will actually execute the task, and if that worker is busy then the task will block. It would be much better if a different scaffolder instance was able to pick up the task at that point instead.

We already have similar setups in other places, and all we need imo is a polling loop that checks for available tasks in the DB. Slap a SELECT ... FOR UPDATE SKIP LOCKED on it and it's gonna be really quick too. Now if we add that, we could also introduce a little bit of fanciness to do a bit better than just a flat distribution. If we adjust the polling rate so that it's inversely proportional to the number of tasks currently running in each node, then we get a naturally balancing system that's still very simple. With that in place I'm feeling that we could have a pretty high limit for the maximum number of tasks, just to make sure nobody is intentionally or unintentionally flooding the workers.
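
A rough sketch of that polling loop, assuming a knex connection and a hypothetical tasks table with a status column (not the real scaffolder schema); the polling interval grows with the number of tasks the node is already running:

import { Knex } from 'knex';

interface TaskRow {
  id: string;
}

// Claim one open task, skipping rows already locked by other nodes.
async function claimNextTask(db: Knex): Promise<TaskRow | undefined> {
  return db.transaction(async tx => {
    const [task] = await tx('tasks')
      .where({ status: 'open' })
      .forUpdate()
      .skipLocked()
      .limit(1);
    if (task) {
      await tx('tasks').where({ id: task.id }).update({ status: 'processing' });
    }
    return task;
  });
}

function startPollLoop(
  db: Knex,
  runningTasks: () => number,
  runTask: (task: TaskRow) => Promise<void>,
) {
  const basePollMs = 1000;

  const tick = async () => {
    const task = await claimNextTask(db);
    if (task) {
      // Not awaited: execution happens concurrently with further polling.
      void runTask(task);
    }
    // Poll less often the more tasks this node is already running, so less
    // busy nodes naturally pick up more of the load.
    setTimeout(tick, basePollMs * (1 + runningTasks()));
  };

  void tick();
}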

@howlowck (Contributor, Author) commented Sep 28, 2022

Thanks @Rugvip! I guess the point I want to reiterate is that just because a task is running (or has started) doesn't necessarily mean it's busy (or resource intensive). Most of our scaffolding tasks simply fetch status from an external workflow orchestrator (Argo Workflows in our case). Since those fetches use very little memory, blocking at the TaskWorker level seems inefficient.

I understand the desire to throttle the server, and that can still be the default behavior, but what if we give users the ability to choose between blocking (throttled) or non-blocking (run everything) TaskWorkers? Or do you think that behavior is too different for the scaffolder-backend TaskWorker configuration and warrants a new plugin, scaffolder-async-backend for example?

@Rugvip (Member) commented Sep 28, 2022

@howlowck I think the best low hanging fruit to grab would be to refactor the taskWorkers option in createRouter to act more like a limit than a thing that actually instantiates that many task workers. That way you can set it to Infinity and call it a day.

@howlowck (Contributor, Author) commented Sep 28, 2022

Interesting... ok. Let's align on setting a worker limit option for now (I'd be happy to work on the implementation). How would createRouter know to either create a new worker or just let an idle worker pick it up?

@gman0922 (Contributor) commented Sep 30, 2022

@rodmachen @OscarDHdz fyi - Do we need to set up a chat to agree on the solution? We are already working on this internally to contribute back, and would be happy to align with the community and work together so as not to duplicate effort hahaha - @Rugvip do you see a need to have a SIG for the scaffolder backend?

@freben (Member) commented Oct 1, 2022

We do want to expand the number of SIGs for sure! That's the intent, but we've been focusing on learning from the development of one SIG to begin with, so we know how best to scale things up. It'll probably be a little bit longer until we roll out the second, third one, etc.

@Rugvip (Member) commented Oct 2, 2022

@howlowck I'd say we drop the individually created task workers, and instead have a single task worker with a work limit.

@howlowck (Contributor, Author) commented Oct 3, 2022

Thanks @Rugvip. I'll revise my PR to have the TaskWorker manage a limit of concurrent tasks.

@howlowck (Contributor, Author) commented Oct 4, 2022

@Rugvip I updated the PR #13817 with what we discussed. Please review. Thanks!

@github-actions bot commented Dec 3, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Dec 3, 2022
@github-actions github-actions bot closed this as not planned (stale) Dec 10, 2022