Multithreaded performance for small dense problems has regressed #1016
Parallel operations on small vectors are indeed very slow
Note that (for small vector sizes, say 128) it scales roughly linearly with the number of threads (2, 4, 8, ...), which suggests that this is the overhead of launching tasks via the thread pool. (The parallel benchmark with 1 thread measures the cost of going through the parallel code path.) Adding a threshold on the number of vector elements ( https://ceres-solver-review.googlesource.com/c/ceres-solver/+/25280 ) seems to solve this problem; here are benchmark results: https://docs.google.com/spreadsheets/d/1qfxrJ1MhzEFmyhWYWN4RY3EJcoJF6d6nv4oJkdK_EaA/edit#gid=1902184755
It seems to me that alloca doesn't really help, or am I missing something?
It is only meaningful for
Another problem is lock contention in the task queue. I think we should put the next task into the queue from the task itself. This way there should almost always be only one thread trying to do something with the queue.
Pros:
Cons:
I have added …
With the old scheme of launching all tasks from the main thread:
With the proposed scheme:
This makes too-short parallel tasks and too many threads less painful. Submitted as a separate CL: https://ceres-solver-review.googlesource.com/c/ceres-solver/+/25300
@DmitriyKorchemkin shall we get this latter patch in, since it seems like a pure win, and then re-evaluate your first patch?
@sandwichmaker yes, I think that we should merge the last one first. It will at least make the growth with the number of threads less pronounced (but the gap between 1 and 2 threads is still huge).
Okay, the scheduler change is in; let's look at the performance of https://ceres-solver-review.git.corp.google.com/c/ceres-solver/+/25280 on top of HEAD now.
I checked out HEAD; the numbers are now very competitive, with 2.2 better than 2.1 until you overspecify the number of threads, but only by a modest amount:

1 thread performance
Ceres 2.1 mean: 1.15 msec std: 552.40 usec (200 points)

2 thread performance
Ceres 2.1 mean: 921.03 usec std: 359.30 usec (200 points)
Ceres 2.1 mean: 2.58 msec std: 1.16 msec (1000 points)

4 thread performance
Ceres 2.1 mean: 845.64 usec std: 416.76 usec (200 points)
Ceres 2.1 mean: 1.96 msec std: 998.91 usec (1000 points)

8 thread performance
Ceres 2.1 mean: 801.78 usec std: 407.24 usec (200 points)
Ceres 2.1 mean: 1.85 msec std: 919.09 usec (1000 points)

12 thread performance
Ceres 2.1 mean: 814.86 usec std: 337.59 usec (200 points)
Ceres 2.1 mean: 1.65 msec std: 564.98 usec (1000 points)

16 thread performance
Ceres 2.1 mean: 946.84 usec std: 486.15 usec (200 points)
Ceres 2.1 mean: 1.63 msec std: 605.31 usec (1000 points)

23 thread performance (don't ask why 23 vs 24, nothing to do with ceres)
Ceres 2.1 mean: 1.10 msec std: 428.13 usec (200 points)
Ceres 2.1 mean: 1.71 msec std: 521.12 usec (1000 points)
@sandwichmaker here are benchmark results with old and new task enqueuing: https://docs.google.com/spreadsheets/d/1Mdg6IqvqZd8cRLw69KAf8UM9WcscQmENBay6QpWflJk/edit?usp=sharing
I think it makes sense to set 2^14..2^20 as a limit; I've chosen 2^16 as a "round" number.
@DmitriyKorchemkin I am only seeing one sheet.
@sandwichmaker Strange, it should be 8 sheets.
Ah, it was a Google account issue. From my official Google account I only see one sheet, and from sandwichmaker@gmail.com I can see all the sheets.
As pointed out by several users, the introduction of parallel operations on vectors severely impacts solver performance on small problems, with time consumption increasing with the number of threads. In order to minimize the overhead of trying to execute small tasks using a large number of threads, the task scheduling mechanism was changed to avoid scheduling all tasks at once. However, there is still a large difference in execution time, because the main thread always launches the next thread before starting the work itself. This leads to a slowdown of several orders of magnitude when going from single-threaded execution (which follows a fast-forward path of a single loop over all indices, without any synchronization involved) to two-thread execution:

/bin/parallel_vector_operations_benchmark
-------------------------------------------
Benchmark                      Time
-------------------------------------------
SetZero/128                    12.8 ns
SetZeroParallel/128/1          16.6 ns
SetZeroParallel/128/2          2211 ns

In order to eliminate this effect, we limit the block size of parallel execution of vector operations to 2^16 elements (thus starting parallel execution only for vectors of at least 2^17 elements). The threshold of 2^16 elements was chosen by evaluating thresholds from 2^10 to 2^20 (powers of 2 only), with 2^14..2^20 significantly reducing worst-case runtime degradation.

Details can be found in the discussion of the issue at ceres-solver#1016

Change-Id: I555c882d63ee53323ceb426743b970f989b65503
Both Roger and Frank have confirmed that @DmitriyKorchemkin's two changes have fixed this issue.
@rlabbe and @cfneuhaus report slowdown on small dense problems with multiple threads.
@rlabbe reported
v/s
He narrowed it down to commit b158515
In his tests, performance degrades pretty substantially with an increasing number of threads.
@cfneuhaus also reports similar problems.
v/s
and when compared to single threaded performance on the same problem
Single-threaded performance of the two versions seems to be the same, but multithreaded performance has gotten substantially worse.