Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
Better worker mechanism #9
Assuming there is a priority queue (e.g. a fifo queue sorted according to when it has arrived) of web requests, lock/mutex-based approach in lparallel is not efficient since the CPUs wait a lot. There are 2 promising approaches to handle the conflict between the threads trying to read the same file descriptor.
Quote from Spray List:
Distributed queues in HDA* have an advantage over SprayList in that it does not assume the shared memory (i.e. not limited to threads) and scales in a distributed cluster of >768 machines. In a large scale cluster, communication between machines becomes a large bottleneck since it should be done over the network. Provided a good hash function, HDA* provides a nearly optimal load balancing and also completely eliminates communication bottleneck.
Currently HDA* seems an overkill and the quality of hash function is also a difficult part. SprayList seems best as of now.
If looked a bit a the topic. According to this paper their MultiQueue data structure is faster than a SprayList. Their MultiQueue uses a lock based structure but, uses multiple locks and therefore no thread ever has to wait on a lock. It's quite interesting.
But both (MultiQueue and Spraylist) of these datastructures are optimized as priority queues, and we're using normal FIFO queues. However the knowledge should be translatable.
After scanning the worker.lisp code I see woo has a cluster of workers in a list and circles over them when assigning jobs. So if we have three workers (worker1 ... worker3), worker1 then worker2 then worker3 and then worker1 would get the jobs after four calls to
I still have to profile this but I would suspect that the bottleneck of this process is that starvation of workers can happen quite easily if the workload for each job is not the same. Because at distributing the jobs there is no consideration for how many jobs there are left for each worker.
Another issue with this model is that it seems that after more than 128 * worker-amount cl-speedy-queue will throw errors because it queues are full.
I would propose to refactor to refactor this code to something like this: Instead of having worker specific queue's, we would have one large thread safe MultiQueue. This queue would be backed by a simple non thread safe queue. If a new job arrives this job enqueued in the main MultiQueue. Then all worker threads would get their notify-new-job call to notify them of new jobs. They will all try to dequeue from the master queue.
To make sure that they will not be stuck trying to dequeue if the queue is empty we should implement some sort of check that checks that there is in fact still work left in the queue. However if we would do after every failed lock we would kill our performance. So I would suggest something like this in our dequeue function:
(defun dequeue (queue random-state) (loop with queues-array = (queue-queues-array queue) and queues-amount = (queue-queues-amount queue) and lock-array = (queue-lock-array queue) and try-amount = (queue-repeat queue) for i = (random queues-amount) then (random queues-amount) and j = (random queues-amount) then (random queues-amount) and cur = 1 then (1+ cur) when (> (queue-peek (aref queues-array i)) (queue-peek (aref queues-array j))) do (setf i j) end when (bt:acquire-lock (aref lock-array i) nil) do (if (queue-empty-p (aref queue-array i)) (bt:release-lock (aref lock-array i)) (return (let ((res (cl-speedy-queue:dequeue (aref queues-array i)))) (bt:release-lock (aref lock-array i)) (values res t)) end when (and (= (mod cur try-amount) 0) (multi-queue-empty-p queue)) do (return (values nil nil)) end))
I'm going to try to implement it and see if this brings any performance improvements. I'll try to have something working in less than a week.
Edit: Corrected a problem with the code snippet.
Well I'm reporting back again. According to my benchmarks my new queuing system is faster. I see an improvement of an average of 61219.82 req/sec for the old queue to 67674.51 req/sec for the new queue. This is with 4 woo workers and wrk running with 4 threads and 100 connections. (btw go manages around 75000 req/sec). What is interesting to see is that the maximum drops more than the average (25% and 10%).
I still need to profile the code to find bottlenecks. But I would suspect there's at least bottleneck on how workers are notified of new jobs. That is done here. We map all workers and send them a notification. But I don't know how long
AFAIK Woo doesnt work in distributed environment and assumes the shared memory in the single machine, right? In that case university systems (possibly HPC clusters) are overkill, mine will be fine.
Woo's performance has been improved little by little, and recently, it comes to close with Go's when they both uses single thread.
I suppose your patch would help.
Sorry for the long silence, I was very busy with my studies. I analyzed the data from @guicho271828 (thanks again!) a bit today I the results are not completely positive.
The tests were done on a 36HT core machine. So in my opinion the benchmark with 36 workers is the most representative because SBCL uses OS threads so more threads only results in context switches. Further more I think we should use a relative large amount of connections as this will actually use the queue.
However the performance of
I think we can say that both the queues are performing at the same speed, with maybe a tiny advantage for
I think we can't say
Busy on my side too, sorry. Well, I have 3 concerns about this...
First and foremost, I noticed the difference between JUST QUEUE and Priority Queue. In the case of Woo, perhaps Priority Queue is not necessary, and we just need a fast parallel FIFO queue. I was looking for the wrong paper --- there is surely an additional bottleneck in maintaining the priority.
Second, perhaps we should reconsider the experimental setting. Was the usage of wrk adequate? Was the setting of wrk (amount of payload etc. --- I'm no expert on network) ok? Is OS/scheduler setting relevant? Was the 3-runs statically enough/significant? (probably not...) Also, Fukamachi's code could be highly optimized, perhaps we could see what happens if we compare with the same
Finally, why not first try SprayList, produced by 1st class researchers in Microsoft Research? The paper was accepted in SIGPLAN 2015, a top-notch refereed CS conference, compared to a wild paper on arxiv... Their results are reliable (i.e. it has theoretical guarantee on the chance of contention). I am interested if you could take time to implement it, since it also has to do with my own study (although I will be just a user of parallel queue.) --- just if you have time.
I didn't use SprayList has a few reasons. First of all I found the idea of MultiQueue interesting as their insert implementation should only take O(1) time, and this suits the current model of Woo quite well. Woo uses the 'main' thread to queue jobs for workers. So if a queue has a slow insert operation this would have a large impact on performance.
Another reason to not use SprayList is the requirement for compare-and-swap (CAS) operations. These are not supported by the Common Lisp standard and there is not a library that solves this issue. However we could just default to a slower queue on platforms and/or compilers that don't support CAS.
Another thing I thought would be a disadvantage is that SprayList is an implementation of a priority queue, and is (at least seems to be) optimized for this task. However if we known we only need a FiFo queue there might be simple optimizations for inserting. However one could argue that the need for a skiplist like data structure is not needed as the main advantage of a skiplist is faster search (and thus insert) time.
I agree that you can't draw any definitive answers from these benchmarks. There is also no profiling information, and profiling queue's are quite hard. So pinpointing the exact bottleneck for the current queue might not be possible.
I would be willing to implement a SprayList however I think that we do need (and can) modify it to suit are needs. So this would mean that we simplify the insertion code (always insert at the back and only chose a random height). This will take some time to implement for me, however it should be doable in around three weeks with my current schedule.
How does this sound?
In the earlier comment, you wrote
Is the current configuration of
I didn't know that SprayList requires CAS. Ok then that makes sense, SL would not be a viable option, though there seems to be a compatibility layer in ChanL. Perhaps you can port that. https://github.com/zkat/chanl/blob/master/src/trivial-cas.lisp
Still, I'd rather check the experimental settings before implementing something new.
It is a very good point that the current use of wrk (and mostly the woo callback) does not simulate this kind of behavior. It would be interesting to see what happens if the callback would simulate somewhat more real life execution time. We could use something like this:
(lambda (env) (sleep (/ (max 0 (- (random 1000) 250)) 750)) '(200 () ("Hello World")))
If we execute this enough times the average sleep time would be the same, so it should not affect the benchmark in relative terms.
I saw the compatibility layer in ChanL, however it simply gives a warning to the user if the platform is not supported. As the spraylist would need locks if there is no CAS this would not be unacceptable for our use case. However we could simply keep SL as a default for SBCL and CCL and fallback to a simple queue or MultiQueue for other platforms.
Yesterday I started with implementing an adapted version of SL. My idea is the following for adapting it:
Normally insertion and deletion of within a SkipList (Spraylists are Skiplists with an extra operation: spray) is focused around search. However we don't have to (and can't) search, all items should be inserted at the back and deletion should be done at the front (using spray). So my idea is to keep two pointers (actually arrays of pointers) in the spraylist struct: a head and tail. Insertion is always done using this tail pointer.