
alternate scheduling #73

Open
akseg73 opened this issue Dec 8, 2023 · 3 comments

akseg73 commented Dec 8, 2023

The fact that kvdo requires explicit tuning of threads seems to suggest (naive guess) that there may be alternate scheduling approaches that would work better: co-operative scheduling algorithms, for instance. Go coroutines and the like also have some advantages for scheduling.

Having a design in which multiple threads could be waiting on each other is inherently problematic. Have you considered a design that would not require explicit tuning of threads?

raeburn (Member) commented Dec 18, 2023

I’m not really familiar with Go and its coroutines, so I’m not sure what you imagine that would look like when used inside the Linux kernel. As I understand it from what I’ve read, goroutines run efficiently by explicitly managing the switching between them within one thread (or maybe more than one) as seen by the kernel -- essentially doing a second level of scheduling independent of the kernel, so that context switches into the kernel can be reduced. Because of that, it also needs its own management of stack space and the like.

We looked long ago at running everything in one thread, and at least when fast storage is available and data is generated quickly, a single thread rapidly becomes a bottleneck. So we went with multiple threads and message passing, with most of the worker threads owning certain data structures (address maps, allocation tables, etc.), and put a fair amount of effort into avoiding or reducing synchronization between threads, even eliminating the need for locks in most cases. With fast storage, VDO can consume enough CPU time to keep multiple cores busy.
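A minimal sketch of that ownership-plus-message-passing pattern might look like the following. This is not VDO’s actual code; all the names here are invented for illustration. The key point is that the spinlock guards only the inbox handoff; the data structures a worker owns are single-threaded by construction and need no locking.

```c
#include <linux/kthread.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

/* Illustration only -- not VDO's actual code; all names are hypothetical. */
struct vdo_msg {
	struct list_head entry;
	void (*action)(struct vdo_msg *msg);	/* runs only on the owning worker */
};

struct vdo_worker {
	struct task_struct *thread;
	spinlock_t lock;		/* protects the inbox, nothing else */
	struct list_head inbox;
	wait_queue_head_t wait;
	/* The data structures this worker owns (address map, allocation
	 * tables, ...) need no locks: only this thread ever touches them. */
};

/* Any thread may send a message; only the owner runs the action. */
static void vdo_worker_enqueue(struct vdo_worker *w, struct vdo_msg *msg)
{
	unsigned long flags;

	spin_lock_irqsave(&w->lock, flags);
	list_add_tail(&msg->entry, &w->inbox);
	spin_unlock_irqrestore(&w->lock, flags);
	wake_up(&w->wait);
}
```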

Our message passing approach seems to work pretty well under heavy load, which is our main use case; when a worker thread is resumed, it’s likely to have a bunch of work waiting for it or about to be queued up by other running threads, so it can process a bunch of updates to the data structures it owns before going back to sleep. Though, when lightly loaded, it’s probably not as efficient as one might like.
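The batching behavior described above could take roughly the shape of this worker loop, continuing the hypothetical sketch: the worker splices its entire inbox out under the lock in one step, then processes the whole batch without further synchronization.

```c
/* Continuing the sketch: the worker's main loop. Under load the inbox
 * usually holds several messages by the time the thread wakes, so the
 * lock is taken once per batch rather than once per message. */
static int vdo_worker_thread(void *arg)
{
	struct vdo_worker *w = arg;
	unsigned long flags;
	LIST_HEAD(batch);

	while (!kthread_should_stop()) {
		wait_event_interruptible(w->wait,
					 !list_empty(&w->inbox) ||
					 kthread_should_stop());

		spin_lock_irqsave(&w->lock, flags);
		list_splice_init(&w->inbox, &batch);	/* grab the whole batch */
		spin_unlock_irqrestore(&w->lock, flags);

		while (!list_empty(&batch)) {
			struct vdo_msg *msg =
				list_first_entry(&batch, struct vdo_msg, entry);

			list_del(&msg->entry);
			msg->action(msg);	/* owned data, no locks needed */
		}
	}
	return 0;
}
```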

I did at one point look at an approach that kept our current message passing but, instead of creating a dedicated thread for each worker, used a single work_struct per “worker” that would run under the kernel’s workqueues, letting that mechanism determine how many actual threads to create. (Sort of a three-layer work scheduling system -- kernel threads, workqueue tasks within a worker thread, and VDO tasks within a workqueue task.) It didn’t perform as well as the dedicated threads; I didn’t have time to investigate why, though I’ve got a few guesses. Creating single-threaded kernel workqueues (to enforce serialization around data structures) and making each I/O request use its own work_struct also performed poorly.
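For reference, that variant would sit on the standard kernel workqueue API, roughly along these lines. This is a hedged sketch of the general shape, not the actual experimental code; the struct and function names are invented.

```c
#include <linux/errno.h>
#include <linux/workqueue.h>

/* Hypothetical sketch of the workqueue variant. An ordered workqueue
 * runs at most one work item at a time, preserving the serialization a
 * dedicated thread provided, while the kernel decides how many real
 * threads back all the workqueues combined. */
struct vdo_wq_worker {
	struct workqueue_struct *wq;
	struct work_struct work;
	/* inbox and owned data structures, as in the kthread version */
};

static void vdo_wq_worker_fn(struct work_struct *work)
{
	struct vdo_wq_worker *w =
		container_of(work, struct vdo_wq_worker, work);

	/* drain the inbox here, exactly as the kthread loop does */
	(void)w;
}

static int vdo_wq_worker_init(struct vdo_wq_worker *w, const char *name)
{
	w->wq = alloc_ordered_workqueue("%s", 0, name);
	if (!w->wq)
		return -ENOMEM;
	INIT_WORK(&w->work, vdo_wq_worker_fn);
	return 0;
}

/* Senders then call queue_work(w->wq, &w->work) after filling the inbox. */
```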

(Also, we were asked a while back about comparing against using mutexes rather than passing the I/O requests off between threads so much. But since we didn’t write the driver to be switchable between models, we’ve kind of baked the design into the code in places, including assumptions about processing messages in a certain order. So it’s not the easiest experiment to run.)

I can imagine a few scheduler tweaks that might help VDO. For example, being able to say, “I’m done, but if I haven’t used too much time, go run another thread from this group.” Or being able to express that certain pairs or groups of threads exchange messages a lot and that it might be helpful to keep them on the same core or the same NUMA node, without having to externally dictate exactly which cores each thread is allowed to use.
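Those hints don’t exist in the stock scheduler; the closest approximation available today is to pin the communicating threads explicitly, for example restricting a group of workers to one NUMA node with the standard cpumask API. The helper name below is hypothetical.

```c
#include <linux/cpumask.h>
#include <linux/sched.h>
#include <linux/topology.h>

/* Hypothetical helper: lacking a "keep these threads near each other"
 * hint, pin a group of chatty worker threads to one NUMA node and let
 * the scheduler balance them among just those CPUs. */
static int vdo_pin_worker_to_node(struct task_struct *task, int node)
{
	return set_cpus_allowed_ptr(task, cpumask_of_node(node));
}
```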

> Having a design in which multiple threads could be waiting on each other is inherently problematic.

The threads don’t use a “send request, block waiting for result” type approach, if that’s what you’re picturing. The I/O requests are generally either moving forward through the state machine, or parked in one of various wait queues until certain events take place (usually completion of I/O to backing storage). Except for certain threads designated to handle actions that can block, none of the worker threads should block until they’ve completed their available work and moved all the pending I/O requests to their next steps.
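As an illustration of that shape (again, not VDO’s actual state machine; the steps and names are made up), a request might carry its current step and get parked on a queue until the event it needs fires:

```c
#include <linux/list.h>

/* Illustration only -- not VDO's actual state machine. A request either
 * advances to its next step or parks on a wait queue; worker threads
 * never block waiting on one another for a result. */
enum vio_step {
	VIO_STEP_DEDUPE,
	VIO_STEP_COMPRESS,
	VIO_STEP_WRITE,
	VIO_STEP_DONE,
};

struct vio_request {
	struct list_head entry;
	enum vio_step step;
};

/* Park a request until, say, a backing-store write completes... */
static void vio_park(struct list_head *waitq, struct vio_request *vio)
{
	list_add_tail(&vio->entry, waitq);	/* owner thread only, no lock */
}

/* ...and when the event fires, move every parked request forward. */
static void vio_resume_all(struct list_head *waitq, enum vio_step next)
{
	struct vio_request *vio, *tmp;

	list_for_each_entry_safe(vio, tmp, waitq, entry) {
		vio->step = next;
		list_del(&vio->entry);
		/* re-enqueue to whichever worker owns the next step */
	}
}
```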

> Have you considered a design that would not require explicit tuning of threads?

I would like to see such a design. Of course, it’d be easy to simply remove the knobs, fix the values to compiled-in constants, and live with whatever the resulting performance is. Getting performance comparable to what we can get now with tuning would be the trick. (Ideally, something dynamically adaptable to a workload changing over time, which our current set of knobs is not.) It needs to strike a good balance between cache contention (between cores and between NUMA nodes), queueing delays, lock contention, excessive calls into the scheduler, other demands in the system for CPU cycles, throughput vs latency, performance vs power, etc.

akseg73 (Author) commented Jan 11, 2024

@raeburn thanks for your detailed response.

akseg73 (Author) commented Jan 16, 2024

ZFS also offers compression and dedup, and while it has a lot of tunables, I don’t recall there being a need for explicit tuning of threads (I could be wrong).
