Better worker mechanism #9

Open
fukamachi opened this Issue Jan 6, 2015 · 16 comments


fukamachi commented Jan 6, 2015

Something like lparallel, but more lightweight.


guicho271828 commented Mar 9, 2015

Assuming there is a priority queue of web requests (e.g. a FIFO queue sorted by arrival time), the lock/mutex-based approach in lparallel is not efficient, since the CPUs spend a lot of time waiting. There are two promising approaches to handling the conflict between threads trying to read the same file descriptor.

  1. An efficient single parallel queue implementation, e.g. SprayList: http://research.microsoft.com/apps/pubs/default.aspx?id=209108
  2. Distributed parallel queues that minimize conflicts / duplicated effort, e.g. Hash Distributed A* (HDA*): http://metahack.org/Kishimoto-Fukunaga-Botea-icaps2009.pdf

Quote from the SprayList paper:

High-performance concurrent priority queues are essential for applications such as task scheduling and discrete event simulation. Unfortunately, even the best performing implementations do not scale past a number of threads in the single digits. This is because of the sequential bottleneck in accessing the elements at the head of the queue in order to perform a DeleteMin operation.

In this paper, we present the SprayList, a scalable priority queue with relaxed ordering semantics. Starting from a non-blocking SkipList, the main innovation behind our design is that the DeleteMin operations avoid a sequential bottleneck by “spraying” themselves onto the head of the SkipList list in a coordinated fashion. The spraying is implemented using a carefully designed random walk, so that DeleteMin returns an element among the first O(p log^3 p) in the list, with high probability, where p is the number of threads. We prove that the running time of a DeleteMin operation is O(log^3 p), with high probability, independent of the size of the list.

Distributed queues as in HDA* have an advantage over SprayList in that they do not assume shared memory (i.e. they are not limited to threads) and scale to a distributed cluster of more than 768 machines. In a large-scale cluster, communication between machines becomes a major bottleneck since it has to go over the network. Provided a good hash function, HDA* gives nearly optimal load balancing and also completely eliminates the communication bottleneck.
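The core HDA* trick is just routing each work item to the worker that owns its hash bucket. A minimal sketch, assuming hypothetical names (`job-key`, `worker-queues`, `enqueue-job`) that are not part of Woo:

```lisp
;; HDA*-style work distribution sketch: the hash of a job decides which
;; worker owns it, so two workers never contend for the same item.
;; `job-key', `worker-queues', and `enqueue-job' are hypothetical names.
(defun assign-job (job worker-queues)
  (let ((owner (mod (sxhash (job-key job))
                    (length worker-queues))))
    (enqueue-job job (aref worker-queues owner))))
```

With a well-distributed hash, each worker receives a nearly equal share of jobs without any shared lock.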

Currently HDA* seems like overkill, and the quality of the hash function is also a difficult part. SprayList seems best as of now.


guicho271828 commented Mar 9, 2015

Note that SprayList is a very fresh paper, published in 2014. If you implement it, Woo would be the first practical web server that incorporates this algorithm.


libre-man commented Aug 28, 2016

I've looked a bit at the topic. According to this paper, their MultiQueue data structure is faster than a SprayList. The MultiQueue uses a lock-based structure, but it uses multiple locks, and therefore no thread ever has to wait on a lock. It's quite interesting.

But both of these data structures (MultiQueue and SprayList) are optimized as priority queues, and we're using normal FIFO queues. However, the knowledge should be translatable.

After scanning the worker.lisp code, I see Woo keeps a cluster of workers in a list and cycles over them when assigning jobs. So if we have three workers (worker1 ... worker3), then worker1, worker2, worker3, and worker1 again would get the jobs after four calls to add-job-to-cluster. Each worker has its own FIFO queue of jobs that it can handle.
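That round-robin assignment amounts to rotating a ring of workers on each call. Roughly like this, assuming illustrative names (`cluster-workers`, `worker-queue`, `notify-new-job`), not Woo's actual internals:

```lisp
;; Round-robin sketch: rotate the worker list one step per job.
;; `worker-queue' and `notify-new-job' are illustrative names.
(defstruct cluster
  (workers nil)) ; list of workers, treated as a ring

(defun add-job-to-cluster* (cluster job)
  (let ((worker (pop (cluster-workers cluster))))
    ;; put the chosen worker at the back of the ring
    (setf (cluster-workers cluster)
          (nconc (cluster-workers cluster) (list worker)))
    (cl-speedy-queue:enqueue job (worker-queue worker))
    (notify-new-job worker)))
```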

I still have to profile this, but I would suspect that the bottleneck of this scheme is that starvation of workers can happen quite easily if the workload per job is not uniform, because the job distribution takes no account of how many jobs each worker still has left.

Another issue with this model is that after more than 128 * worker-amount queued jobs, cl-speedy-queue will throw errors because its queues are full.

I would propose to refactor this code to something like the following: instead of having worker-specific queues, we would have one large thread-safe MultiQueue, backed by simple non-thread-safe queues. When a new job arrives, it is enqueued in the main MultiQueue. Then all worker threads get their notify-new-job call to notify them of new jobs, and they all try to dequeue from the master queue.

To make sure workers don't get stuck trying to dequeue from an empty queue, we should implement some check that there is in fact still work left in the queue. However, doing that after every failed lock acquisition would kill our performance. So I would suggest something like this for our dequeue function:

(defun dequeue (queue random-state)
  (loop with queues-array = (queue-queues-array queue)
        and queues-amount = (queue-queues-amount queue)
        and lock-array = (queue-lock-array queue)
        and try-amount = (queue-repeat queue)

        for i = (random queues-amount random-state)
        and j = (random queues-amount random-state)
        and cur = 1 then (1+ cur)

        ;; of two randomly chosen sub-queues, prefer the one whose
        ;; head element has the smaller key
        when (> (queue-peek (aref queues-array i))
                (queue-peek (aref queues-array j)))
          do (setf i j)
        end
        ;; try-lock only; on failure, loop and pick new random sub-queues
        when (bt:acquire-lock (aref lock-array i) nil)
          do (if (queue-empty-p (aref queues-array i))
                 (bt:release-lock (aref lock-array i))
                 (let ((res (cl-speedy-queue:dequeue (aref queues-array i))))
                   (bt:release-lock (aref lock-array i))
                   (return (values res t))))
        end
        ;; every try-amount attempts, check whether the whole MultiQueue
        ;; is empty and give up if so
        when (and (zerop (mod cur try-amount))
                  (multi-queue-empty-p queue))
          do (return (values nil nil))
        end))
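For symmetry, the enqueue side of such a MultiQueue could look something like this sketch, which picks a random sub-queue and try-locks only it (using the same hypothetical accessors as the dequeue function above):

```lisp
;; MultiQueue enqueue sketch: try-lock a random sub-queue; if the lock
;; is contended, simply pick another, so no thread ever blocks.
;; Accessors like `queue-queues-array' are hypothetical names.
(defun enqueue (object queue random-state)
  (loop with queues-array = (queue-queues-array queue)
        and queues-amount = (queue-queues-amount queue)
        and lock-array = (queue-lock-array queue)
        for i = (random queues-amount random-state)
        when (bt:acquire-lock (aref lock-array i) nil)
          do (unwind-protect
                  (cl-speedy-queue:enqueue object (aref queues-array i))
               (bt:release-lock (aref lock-array i)))
             (return object)))
```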

I'm going to try to implement it and see if this brings any performance improvements. I'll try to have something working in less than a week.

Edit: Corrected a problem with the code snippet.


libre-man commented Aug 29, 2016

Well, I'm reporting back again. According to my benchmarks, my new queuing system is faster: an improvement from an average of 61219.82 req/sec with the old queue to 67674.51 req/sec with the new one. This is with 4 Woo workers and wrk running with 4 threads and 100 connections. (For comparison, Go manages around 75000 req/sec.) What is interesting is that the maximum improves more than the average (25% versus 10%).

I still need to profile the code to find bottlenecks, but I would suspect there's at least one bottleneck in how workers are notified of new jobs. That is done here: we map over all workers and send each one a notification. I don't know how long (lev:ev-async-send) takes, but it might be a problem.


guicho271828 commented Aug 29, 2016

Do you have a larger system to experiment on? Also, make sure you have 4 physical cores, not hyperthreaded cores.
I can play with 18-physical-core machines on Amazon's x8large and also 24-core machines in the lab.


libre-man commented Aug 29, 2016

I'm afraid I don't have a system with more cores. I could try to find somebody who does, or get access to our university systems. However, if it would be possible for you to try it on your machines, that would be really nice.


guicho271828 commented Aug 29, 2016

AFAIK Woo doesn't work in a distributed environment and assumes shared memory on a single machine, right? In that case university systems (possibly HPC clusters) are overkill; mine will be fine.
I need some instructions though: direct me to the test code and your fork.


guicho271828 commented Aug 29, 2016

BTW the certificate on your website seems expired: libremail.nl uses an invalid security certificate. The certificate is only valid for attach.libremail.nl.


libre-man commented Aug 29, 2016

Thanks, I forgot to upgrade that site to SSL as well. I'll e-mail you in a couple of hours with instructions on how to test. It might be useful to include a simple script that builds everything and runs the tests with different numbers of workers for each queue implementation.


fukamachi commented Sep 13, 2016

@libre-man
Thank you for experimenting with a new MultiQueue worker system!

Woo's performance has been improved little by little, and recently it has come close to Go's when they both use a single thread.
However, in multi-worker benchmarks, Woo is much slower than would be expected from the single-thread results.

I suppose your patch would help.
I'd like to merge your branch, but are you still working on it?


libre-man commented Oct 4, 2016

Sorry for the long silence, I was very busy with my studies. I analyzed the data from @guicho271828 (thanks again!) a bit today, and the results are not completely positive.

The tests were done on a machine with 36 hyperthreaded cores. So in my opinion the benchmark with 36 workers is the most representative, because SBCL uses OS threads, so more threads only result in context switches. Furthermore, I think we should use a relatively large number of connections, as this will actually exercise the queue.

36 Workers

As we can see, better-queue has a better maximum in almost all cases; however, for the average over three runs the difference seems negligible. What is also important is that the spread between max and average is higher for better-queue (5608 average difference between _avg and _max for better-queue against 5096 for master). This means the performance of better-queue varies more than that of master.

72 Workers

As expected, the overall performance with 72 workers is worse, so I don't know how useful these results are, as nobody should use these settings. One could argue that the concurrency aspect of a queue is exercised more in this test; however, I would need the average queue length and profiling statistics to verify or falsify such a claim.

However, the performance of better-queue is worse once we handle more than 200000 requests per second.

18 Workers


I think we can say that both queues are performing at the same speed, with maybe a tiny advantage for master. However, we once again see that better-queue varies much more (7286 against 3570).

9 Workers

This is very close again. Here master varies more than better-queue (6618 for better-queue against 6934 for master).

Conclusion

I don't think we can say better-queue is better than master. Because wrk also runs on the same machine, we don't achieve maximum performance with 36 workers, so 18 workers might be the most interesting benchmark. With 18 workers there is no significant difference between the maximums of better-queue and master, but we do see a difference in the averages. There is also a larger gap between maximum and average for better-queue, which could be explained by the randomness factor. Overall, I think better-queue is not better than the master queue anywhere, while master is better than better-queue in some places.


guicho271828 commented Oct 5, 2016

Busy on my side too, sorry. Well, I have 3 concerns about this...

First and foremost, I noticed the difference between a plain queue and a priority queue. In the case of Woo, perhaps a priority queue is not necessary, and we just need a fast parallel FIFO queue. I was looking at the wrong paper --- there is surely an additional bottleneck in maintaining the priority order.

Second, perhaps we should reconsider the experimental setting. Was the usage of wrk adequate? Was the configuration of wrk (amount of payload etc. --- I'm no expert on networking) OK? Is the OS/scheduler setting relevant? Were 3 runs statistically significant? (Probably not...) Also, Fukamachi's code could be highly optimized; perhaps we could see what happens if we compare with the same optimization settings.

Finally, why not first try SprayList, produced by first-class researchers at Microsoft Research? The paper was accepted at SIGPLAN 2015, a top-notch refereed CS venue, compared to a wild paper on arXiv... Their results are reliable (i.e. there is a theoretical guarantee on the chance of contention). I am interested if you could take the time to implement it, since it also has to do with my own research (although I will be just a user of the parallel queue) --- just if you have time.


libre-man commented Oct 5, 2016

I didn't use SprayList for a few reasons. First of all, I found the idea of MultiQueue interesting, as its insert operation should only take O(1) time, and this suits the current model of Woo quite well. Woo uses the 'main' thread to queue jobs for workers, so a queue with a slow insert operation would have a large impact on performance.

Another reason not to use SprayList is its requirement for compare-and-swap (CAS) operations. These are not part of the Common Lisp standard, and there is no portable library that solves this. However, we could fall back to a slower queue on platforms and/or compilers that don't support CAS.
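As a sketch of such a fallback, a minimal compatibility macro could dispatch on the implementation. Here sb-ext:compare-and-swap is SBCL's real primitive, but the macro name `cas` and the global-lock fallback are hypothetical, for illustration only:

```lisp
;; Sketch of a CAS compatibility layer: native on SBCL, a (much slower)
;; global-lock fallback elsewhere. Note that OLD is evaluated twice in
;; the SBCL branch; a production version would bind it once.
#-sbcl
(defvar *cas-fallback-lock* (bt:make-lock "cas-fallback"))

(defmacro cas (place old new)
  "Return true if PLACE held OLD and was atomically swapped to NEW."
  #+sbcl `(eq ,old (sb-ext:compare-and-swap ,place ,old ,new))
  #-sbcl `(bt:with-lock-held (*cas-fallback-lock*)
            (when (eq ,place ,old)
              (setf ,place ,new)
              t)))
```

The lock fallback serializes every CAS through one lock, which is exactly the bottleneck we are trying to avoid, so it is only a portability stopgap.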

Another thing I thought would be a disadvantage is that SprayList is an implementation of a priority queue and is (or at least seems to be) optimized for that task. If we know we only need a FIFO queue, there might be simple optimizations for inserting. One could even argue that a skiplist-like data structure is not needed at all, as the main advantage of a skiplist is faster search (and thus insert) time.

I agree that you can't draw any definitive conclusions from these benchmarks. There is also no profiling information, and profiling queues is quite hard, so pinpointing the exact bottleneck of the current queue might not be possible.

I would be willing to implement a SprayList, though I think we need to (and can) modify it to suit our needs. This would mean simplifying the insertion code (always insert at the back and only choose a random height). This will take some time to implement, but it should be doable in around three weeks with my current schedule.

How does this sound?


guicho271828 commented Oct 9, 2016

In an earlier comment, you wrote:

I would suspect that the bottleneck of this process is that starvation of workers can happen quite easily if the workload for each job is not the same, because the job distribution does not consider how many jobs each worker has left.

Does the current configuration of wrk exhibit this variability in the incoming requests? If all incoming requests from wrk are similar in terms of the amount of work, then it is quite natural that Naive (the current round-robin implementation in Woo) and MultiQueue (and hence SprayList too) perform comparably.


guicho271828 commented Oct 9, 2016

I didn't know that SprayList requires CAS. OK, then that makes sense; SL would not be a viable option, though there seems to be a compatibility layer in ChanL. Perhaps you can port that: https://github.com/zkat/chanl/blob/master/src/trivial-cas.lisp

Still, I'd rather check the experimental settings before implementing something new.


libre-man commented Oct 10, 2016

It is a very good point that the current use of wrk (and mostly the Woo callback) does not simulate this kind of behavior. It would be interesting to see what happens if the callback simulated somewhat more realistic execution times. We could use something like this:

(lambda (env)
  (declare (ignore env))
  (sleep (/ (max 0 (- (random 1000) 250)) 750))
  '(200 () ("Hello World")))

If we execute this enough times, the average sleep time stays the same across runs, so it should not affect the benchmark in relative terms.
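The mean of that sleep can be worked out: since (random 1000) is uniform on 0..999, the expectation of (max 0 (- (random 1000) 250)) is (/ (* 749 750) 2 1000) = 280.875, so the callback sleeps about 280.875/750 = 0.3745 seconds on average. A quick empirical check (the function name here is hypothetical):

```lisp
;; Estimate the mean simulated service time of the callback above.
;; Should converge to roughly 0.3745 seconds.
(defun mean-simulated-sleep (&optional (n 1000000))
  (/ (loop repeat n
           sum (/ (max 0 (- (random 1000) 250)) 750.0))
     n))
```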

I saw the compatibility layer in ChanL; however, it simply warns the user if the platform is not supported. As the SprayList would need locks if there is no CAS, that would be unacceptable for our use case. However, we could simply keep SL as the default for SBCL and CCL and fall back to a simple queue or MultiQueue on other platforms.

Yesterday I started implementing an adapted version of SL. My idea for adapting it is the following:

Normally, insertion and deletion within a SkipList (SprayLists are SkipLists with one extra operation: spray) are centered around search. However, we don't have to (and can't) search: all items are inserted at the back and deleted from the front (using spray). So my idea is to keep two pointers (actually arrays of pointers) in the spraylist struct, a head and a tail, and insertion always goes through the tail pointer.

We have another quite interesting possibility for simplifying the spraylist: in our model, insertion only happens in a single thread, which means that conflicts between concurrent insertions can't happen. I would prefer not to explore this option, but the implementation would be much simpler and it might also gain us some extra speed.
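Under those assumptions (single inserter, insert only at the tail), the adapted structure might be sketched like this. All names are hypothetical and the height distribution is simplified relative to a real skiplist:

```lisp
;; Sketch of the adapted spraylist: per-level head/tail pointer arrays,
;; with insertion always through the tail so no search is needed.
(defstruct node
  value
  (nexts #() :type simple-vector)) ; next pointer per level

(defstruct spraylist
  (max-level 32 :type fixnum)
  (head (make-array 32 :initial-element nil))
  (tail (make-array 32 :initial-element nil)))

(defun insert-at-tail (sl value)
  ;; pick a height (a real skiplist uses a geometric distribution)
  (let* ((height (1+ (random (spraylist-max-level sl))))
         (node (make-node :value value
                          :nexts (make-array height :initial-element nil))))
    (dotimes (level height node)
      (let ((old-tail (aref (spraylist-tail sl) level)))
        (when old-tail
          (setf (aref (node-nexts old-tail) level) node))
        (setf (aref (spraylist-tail sl) level) node)
        (unless (aref (spraylist-head sl) level)
          (setf (aref (spraylist-head sl) level) node))))))
```

Deletion via spray would still start from the head pointers; only the search-based insertion path is removed.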
