
Implement CPU multithreaded version (pthreads/TBB/OpenCL/MPI etc) #126

Closed
erwincoumans opened this issue Mar 30, 2014 · 11 comments

@erwincoumans
Member

commented Mar 30, 2014

Right now Bullet 2 is single threaded and we discarded the obsolete BulletMultiThreaded. Bullet 3 OpenCL is mainly designed for GPU and OpenCL drivers are unreliable. It would be good to create a new multithreaded version of Bullet keeping in mind Bullet 2 and Bullet 3 OpenCL.

@georgwuensch


commented Sep 2, 2014

Hi Erwin, I am also very interested in CPU support and may be able to help develop a solution.

My main reasons for wanting a CPU multithreaded version are the following:

  • I would like to make heavy use of collision callbacks into C++ application code, and I fear that this will not be as easy on the GPU. (There may be another way to do this by fetching collision data from the GPU, but that would also consume bandwidth. For stability reasons I need all the data at the physics step frequency, and sometimes I need to apply forces back to the physics bodies, and so on.)
  • Secondly, I would like to use the good old GImpact concave collision detection, as it does quite a good job for me. My guess is that this will not be available under OpenCL/GPU acceleration, or am I wrong?

What were the exact reasons for dropping CPU multithreaded support in Bullet? I was able to reactivate the code in 2.82, and it seemed to scale linearly with the number of CPUs available, at least with the Many Box demos. How does it behave with GImpact concave meshes in this demo? That kind of scaling with the number of CPUs would be very useful for our applications.

Could you please give me some directions on how to find a solution? Thank you, Georg.

@lunkhound

Contributor

commented Nov 22, 2014

I'm working on getting Bullet 2 running on multiple CPU threads. So far, I've got the collision detection (dispatchAllPairs) part working. I had to add a couple of locks to the collision dispatcher (around the manifold pool and collision algorithm pool allocs and deallocs). I also had to get rid of the persistent btVoronoiSimplexSolver that was being shared across threads and replace it with local versions. So far the changes to Bullet have been pretty minimal.

I'm using TBB as the task scheduler, but I've avoided introducing any TBB-dependency into Bullet libraries. All of the TBB-specific code (which is very little) is residing in the MultiThreadedDemo app (which I've dusted off and stripped out the BulletMultiThreaded parts of).

Also, in order to avoid adding any overhead to the single-threaded version, the locks that I added get compiled out unless a BT_THREADSAFE macro is set to 1.
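
As a rough illustration of the idea (this is not the code from the fork), a lock that disappears from single-threaded builds might look like the sketch below. `DEMO_THREADSAFE`, `PoolLock` and `SlotPool` are hypothetical names invented for this example; they stand in for the `BT_THREADSAFE` macro and the pool allocators mentioned above and are not Bullet's own classes.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical switch modelled on the BT_THREADSAFE macro described above.
#ifndef DEMO_THREADSAFE
#define DEMO_THREADSAFE 1
#endif

#if DEMO_THREADSAFE
// Real lock when thread safety is requested.
struct PoolLock {
    std::mutex m;
    void lock() { m.lock(); }
    void unlock() { m.unlock(); }
};
#else
// Empty stand-in: the calls inline to nothing in a single-threaded build.
struct PoolLock {
    void lock() {}
    void unlock() {}
};
#endif

// Hypothetical fixed-size pool of slot indices (not Bullet's pool allocator).
class SlotPool {
public:
    explicit SlotPool(std::size_t count) {
        // Push indices in reverse so that slot 0 is handed out first.
        for (std::size_t i = 0; i < count; ++i) free_.push_back(count - 1 - i);
    }
    // Returns a free slot index, or -1 if the pool is exhausted.
    long allocate() {
        std::lock_guard<PoolLock> guard(lock_);
        if (free_.empty()) return -1;
        long slot = static_cast<long>(free_.back());
        free_.pop_back();
        return slot;
    }
    // Returns a previously allocated slot to the pool.
    void release(long slot) {
        std::lock_guard<PoolLock> guard(lock_);
        free_.push_back(static_cast<std::size_t>(slot));
    }
    std::size_t freeCount() {
        std::lock_guard<PoolLock> guard(lock_);
        return free_.size();
    }
private:
    PoolLock lock_;
    std::vector<std::size_t> free_;
};
```

Building with `-DDEMO_THREADSAFE=0` swaps in the empty lock, so the single-threaded path pays nothing for the guard objects once the compiler inlines them.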

If anyone is interested I can put this up on GitHub.

@dgu123


commented Nov 22, 2014

Could you share it?

@lunkhound

Contributor

commented Nov 23, 2014

I'll put up a fork of it shortly. It is still a work in progress. The lock is only implemented for win32 at the moment. Also, Bullet's built-in profiling isn't threadsafe, so I've disabled it. And I haven't tried to set up CMake to work with TBB yet.

I did a bit of profiling on my 4-core machine. My test scene has about 1300 capsules on a triangle mesh terrain. Single-threaded, dispatchAllPairs was taking about 2.2 ms, whereas multi-threaded it was taking 0.9 ms, which is a speedup of about 2.4x. It seems like quite a lot of collision algorithm objects are being allocated and freed each simulation step (about 1300, in fact). So I think there is contention for the lock on the collision algorithm pool allocator, which may account for the lackluster speedup. I need to look into this further.

@lunkhound

Contributor

commented Nov 25, 2014

My changes are here: https://github.com/lunkhound/bullet3

Instructions:

  • install TBB 4.3 (build if using the open source version)
  • download my bullet3 fork
  • run CMake on Bullet
  • look for and enable the CMake option called BULLET2_USE_THREAD_LOCKS
  • look for and enable the CMake option called BULLET2_MULTITHREADED_TBB_DEMO
  • do a CMake configure (new options should appear)
  • set the option called BULLET2_TBB_INCLUDE_DIR to the path to the TBB includes directory
  • set the option called BULLET2_TBB_LIB_DIR to the path to the TBB .lib files (needs tbb.lib and tbbmalloc.lib)
  • do a CMake generate
  • open up the resulting solution in Visual Studio (there should be a project called "AppMultiThreadedDemo")
  • build the MultiThreadedDemo project in "Release"
  • find the appropriate TBB .dlls (tbb.dll and tbbmalloc.dll) for your version of Visual Studio and manually copy them into the same directory as AppMultiThreadedDemo.exe

Now you should be able to run the demo. Pay attention to the numbers after "collision detection time" on screen. Those are the numbers that should be improved compared to a single-threaded version. To compile a single-threaded version for comparison, edit MultiThreadedDemo.cpp, and change USE_PARALLEL_DISPATCHER to 0.

Oh, and keep in mind that in this particular demo, most of the time is going towards solving constraints, and that part is still completely single threaded.

@erwincoumans

Member Author

commented Dec 23, 2018

This has now been contributed, and CPU multithreading can be enabled. See Bullet/examples/BenchmarkDemo for an example. Thanks @lunkhound !

@georgwuensch


commented Dec 25, 2018

Dear Erwin, Dear Lunkhound, great that this is finally integrated. Thank you!

One question, based on your experience: what differences do you see between using OpenMP, ThreadSupport and TBB? Are there any?

I am working on a little logger to do more testing on different laptops and workstations with the different options.

Maybe I can share some results on this if you like.

Yours Georg.

@georgwuensch


commented Dec 26, 2018

Hi, while looking into it I found (I am still compiling with Visual Studio 2010...) that the code uses concurrency:: as the namespace for the MS PPL, while for me it only compiles with Concurrency::. Am I missing a setting there?

@erwincoumans

Member Author

commented Dec 26, 2018

You may have to fix a few things to get the various multithreading options working. Good luck!

> Maybe I can share some results on this if you like.

If you have comparison numbers for the different schedulers, please share them (for example from running some of the benchmarks in the Bullet/ExampleBrowser BenchmarkDemo).

@georgwuensch


commented Dec 26, 2018

Hi Erwin, thank you. For me the sample browser is really useful for quickly trying out and judging the different settings.

@lunkhound

Contributor

commented Dec 27, 2018

Hi,

The PPL option should compile for MSVC 2013 and 2015 as I recall. I never tried it with VC 2010, so it sounds like it only works with newer versions of PPL. I don't really recommend using PPL. It's mainly there as an example of how to implement a task scheduler for Bullet. Performance wasn't great compared to the other options.

The ThreadSupport option is probably the best choice for most cases. It's built-in (to Bullet) so there's no extra setup, and it has the best performance (on the Windows platform at least, according to my testing).

I found Intel TBB to also have good performance, but it's a bit of a hassle to set up since it's an external library.

OpenMP on Windows with MSVC works and is very easy to set up, since it is built into the compiler, but it doesn't perform that well. With other compilers the performance is hopefully better.

Please feel free to share your results with CPU multithreading on this Forum thread
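
To make the "task scheduler" idea concrete for anyone comparing options, here is a minimal sketch of a parallel-for abstraction with two interchangeable back ends and a timing helper. Everything in it (`TaskScheduler`, `SequentialScheduler`, `ThreadPerChunkScheduler`, `timeParallelForMs`) is a hypothetical name written for this illustration, not Bullet's own scheduler class, and spawning one thread per chunk is deliberately naive.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical interface for illustration, not Bullet's own scheduler class.
struct TaskScheduler {
    virtual ~TaskScheduler() {}
    // Calls body(lo, hi) on sub-ranges that together cover [begin, end).
    virtual void parallelFor(int begin, int end, int grainSize,
                             const std::function<void(int, int)>& body) = 0;
};

// Reference back end: run the whole range on the calling thread.
struct SequentialScheduler : TaskScheduler {
    void parallelFor(int begin, int end, int /*grainSize*/,
                     const std::function<void(int, int)>& body) override {
        if (begin < end) body(begin, end);
    }
};

// Naive threaded back end: one std::thread per chunk of grainSize items.
// A production scheduler would reuse a persistent pool of workers instead.
struct ThreadPerChunkScheduler : TaskScheduler {
    void parallelFor(int begin, int end, int grainSize,
                     const std::function<void(int, int)>& body) override {
        grainSize = std::max(grainSize, 1);
        std::vector<std::thread> workers;
        for (int i = begin; i < end; i += grainSize) {
            const int chunkEnd = std::min(i + grainSize, end);
            workers.emplace_back([&body, i, chunkEnd] { body(i, chunkEnd); });
        }
        for (std::thread& w : workers) w.join();
    }
};

// Wall-clock time in milliseconds for one parallelFor over n items.
double timeParallelForMs(TaskScheduler& scheduler, int n, int grainSize,
                         const std::function<void(int, int)>& body) {
    const auto t0 = std::chrono::steady_clock::now();
    scheduler.parallelFor(0, n, grainSize, body);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Because both back ends sit behind the same interface, the same workload can be timed against each one, which is the kind of comparison a small benchmark logger would automate. The body must only touch data belonging to its own sub-range, otherwise the threaded back end would race.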
