
dynamic backend implementation #126

Merged: 9 commits, merged Jun 3, 2022

Conversation

@pca006132 (Collaborator) commented May 30, 2022

By reinventing the wheel, we can switch backends dynamically depending on the workload, get better performance, and run a CUDA-enabled build on machines without CUDA (it switches to OMP automatically).

This patch is large because I had to change every single thrust::ALGORITHM call to a custom function that selects the backend based on an execution policy, which is in turn determined by the workload size (simply comparing it against some constant threshold). We cannot just pass a different execution policy to thrust at runtime, because each execution policy has a different type, so the implementation here uses a macro to build the wrapper functions and a switch to choose the correct invocation.
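
A rough sketch of that pattern, for illustration only: the enum and macro names are approximations, not necessarily the exact ones in the patch, while ExecutionPolicy::Par and the thrust::cuda::par / thrust::omp::par targets follow the excerpts quoted further down.

#include <thrust/gather.h>
#include <thrust/system/cpp/execution_policy.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/system/omp/execution_policy.h>

enum class ExecutionPolicy { ParUnseq, Par, Seq };

// Generates a wrapper that picks the thrust backend at run time from the
// policy, since thrust's own policies are distinct types chosen at compile
// time. Call sites then use e.g. gather(policy, ...) instead of
// thrust::gather(thrust::device, ...).
#define DYNAMIC_BACKEND_VOID(NAME) \
  template <typename... Args> \
  void NAME(ExecutionPolicy policy, Args... args) { \
    switch (policy) { \
      case ExecutionPolicy::ParUnseq: \
        thrust::NAME(thrust::cuda::par, args...); \
        break; \
      case ExecutionPolicy::Par: \
        thrust::NAME(thrust::omp::par, args...); \
        break; \
      case ExecutionPolicy::Seq: \
        thrust::NAME(thrust::cpp::par, args...); \
        break; \
    } \
  }

DYNAMIC_BACKEND_VOID(gather)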

CUDA GPU detection is done by calling cudaGetDeviceCount and setting a global variable. If no CUDA devices are available, the GPU code path is skipped and we fall back to the OpenMP or sequential implementation.
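
For illustration, the selection could look roughly like this (autoPolicy, kSeqThreshold, and the threshold value are hypothetical; CUDA_ENABLED is the global flag mentioned above):

extern bool CUDA_ENABLED;  // set at startup from cudaGetDeviceCount

// Pick a policy from the workload size and the CUDA_ENABLED flag. The
// threshold value is an assumption for the sketch, not the constant used in
// the patch.
constexpr size_t kSeqThreshold = 1000;

ExecutionPolicy autoPolicy(size_t size) {
  if (size <= kSeqThreshold) return ExecutionPolicy::Seq;  // small job: sequential
  if (CUDA_ENABLED) return ExecutionPolicy::ParUnseq;      // GPU present: CUDA
  return ExecutionPolicy::Par;                             // otherwise: OpenMP
}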

This patch also changes the VecDH implementation from std::vector + thrust::universal_vector to a custom vector implementation, which allows constructing vectors with uninitialized memory, performs uninitialized_copy/uninitialized_fill on the GPU or CPU based on data size, and also performs memory prefetching to reduce the number of cache misses. This custom vector is needed because the old implementation caused a huge number of page faults by calling thrust::uninitialized_fill(thrust::device, ...) when initializing the vector, which slowed performance considerably.
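
The gist of the allocation change, as a hedged sketch (the helper names are made up, and the real VecDH does more bookkeeping, e.g. falling back to plain host allocation when CUDA is unavailable):

#include <cuda_runtime.h>

// Allocate managed (unified) memory without value-initializing it, so nothing
// has to touch every element just to zero it out.
template <typename T>
T* allocManagedUninitialized(size_t n) {
  T* ptr = nullptr;
  cudaMallocManaged(&ptr, n * sizeof(T));
  return ptr;
}

// Prefetch a buffer to the current GPU before a large kernel runs, so the
// pages are resident and the kernel does not stall on demand paging.
void prefetchToGpu(const void* ptr, size_t bytes) {
  int device = 0;
  cudaGetDevice(&device);
  cudaMemPrefetchAsync(ptr, bytes, device, 0);
}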


Benchmark:

For the CPP and OMP backends, the difference is not very significant. Below are the results for CUDA:

Old (#121):
nTri = 512, time = 0.0123526 sec
nTri = 2048, time = 0.0141026 sec
nTri = 8192, time = 0.0195818 sec
nTri = 32768, time = 0.0374303 sec
nTri = 131072, time = 0.120768 sec
nTri = 524288, time = 0.266435 sec
nTri = 2097152, time = 0.924646 sec
nTri = 8388608, time = 4.01982 sec

New:
nTri = 512, time = 0.00276084 sec
nTri = 2048, time = 0.00633459 sec
nTri = 8192, time = 0.00877285 sec
nTri = 32768, time = 0.0524452 sec
nTri = 131072, time = 0.0994011 sec
nTri = 524288, time = 0.285462 sec
nTri = 2097152, time = 0.947242 sec
nTri = 8388608, time = 3.70755 sec

Most importantly, the time required for small operations is reduced significantly.

TODO:

  • Run clang-format for the cpp/h files. The indentation is messed up for now, but the diff should be readable.
  • Rethink how we should specify the backend in our build script. Previously the backends were mutually exclusive, but now they are not (we can have OMP, CUDA, CUDA + OMP, CUDA + TBB, etc.).

@elalish (Owner) left a comment


Nice work getting 5x speedup on the small scales!

THRUST_DYNAMIC_BACKEND_VOID(gather, )
THRUST_DYNAMIC_BACKEND_VOID(gather_if, )
THRUST_DYNAMIC_BACKEND_VOID(remove_if, _void)
THRUST_DYNAMIC_BACKEND_VOID(unique, _void)
@elalish (Owner):

What's the story with just a few having _void?

@pca006132 (Collaborator, Author):

To avoid a name collision with the one defined below with THRUST_DYNAMIC_BACKEND. The reason for having these void variants is to avoid having to specify the return type via remove_if<RET> when we don't need it. I did this a few days ago but am not sure whether it is still needed; I will have a look and remove these variants if they are not used.
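
A caller-side illustration of the distinction (declarations only, simplified; the generated names and signatures are an approximation of what the macros produce):

// The plain variant is templated on its return type, so a caller that wants
// the result must spell the type out; the _void variant simply discards it.
template <typename Ret, typename... Args>
Ret remove_if(ExecutionPolicy policy, Args... args);

template <typename... Args>
void remove_if_void(ExecutionPolicy policy, Args... args);

// Usage (Iter is a stand-in for the container's iterator type):
//   Iter newEnd = remove_if<Iter>(policy, v.begin(), v.end(), pred);
//   remove_if_void(policy, v.begin(), v.end(), pred);   // no <Ret> needed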

THRUST_DYNAMIC_BACKEND(remove, void)
THRUST_DYNAMIC_BACKEND(copy_if, void)
THRUST_DYNAMIC_BACKEND(remove_if, void)
THRUST_DYNAMIC_BACKEND(unique, void)
@elalish (Owner):

What's the difference between THRUST_DYNAMIC_BACKEND_VOID(copy, ) and e.g. THRUST_DYNAMIC_BACKEND(copy_if, void)?

@elalish (Owner):

Thanks for the simplification, but I still don't have the answer to this question. Also, I thought these functions do have return values, so void seems odd?

@pca006132 (Collaborator, Author):

Oh, I forgot about this. I was lazy when implementing these functions and forgot to give them a return type. Will fix.

void check_cuda_available() {
int device_count = 0;
cudaError_t error = cudaGetDeviceCount(&device_count);
CUDA_ENABLED = device_count != 0;
@elalish (Owner):

👍

thrust::NAME(thrust::cuda::par, args...); \
break; \
case ExecutionPolicy::Par: \
thrust::NAME(thrust::omp::par, args...); \
@elalish (Owner):

So, if you wanted TBB, that would replace OMP here, right? Maybe for build rules we need something like ParUnseq = CUDA | NONE and Par = OMP | TBB | NONE? I guess NONE is mostly about building for compilers that don't support these.
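
One way that build-time mapping could be spelled (the MANIFOLD_USE_* flags and MANIFOLD_PAR* macros here are hypothetical, just to illustrate the ParUnseq = CUDA | NONE and Par = OMP | TBB | NONE idea):

#if defined(MANIFOLD_USE_CUDA)
#define MANIFOLD_PAR_UNSEQ thrust::cuda::par  // ParUnseq = CUDA
#else
#define MANIFOLD_PAR_UNSEQ thrust::cpp::par   // ParUnseq = NONE (sequential fallback)
#endif

#if defined(MANIFOLD_USE_TBB)
#define MANIFOLD_PAR thrust::tbb::par         // Par = TBB
#elif defined(MANIFOLD_USE_OMP)
#define MANIFOLD_PAR thrust::omp::par         // Par = OMP
#else
#define MANIFOLD_PAR thrust::cpp::par         // Par = NONE (sequential fallback)
#endif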

@pca006132 (Collaborator, Author):

Yes

@pca006132 (Collaborator, Author):

I guess I will update this and the CMake script after #125 is merged.

@pca006132 (Collaborator, Author):

Ah, forgot that the build now depends on OMP; will fix the CMake script.

@pca006132 (Collaborator, Author) commented Jun 1, 2022

The OMP test failure is weird; I'm not exactly sure what is going on. I managed to reproduce it in Docker with the CI settings and OMP_NUM_THREADS=2.

@pca006132 (Collaborator, Author):

I tried running the program with the thread sanitizer enabled. On my machine (NixOS), I only got data race warnings for manifold/src/impl.cpp:644:33, manifold/src/impl.cpp:264:24, and manifold/src/impl.cpp:265:24, which are fine. On Docker with Ubuntu, however, there are too many data race warnings, and it is impossible to identify what causes the nondeterministic behavior. I'm not entirely sure whether this bug is caused by my changes or already existed and just became more apparent with this change.

@elalish (Owner) commented Jun 1, 2022

This PR is looking good to me; I'd like to test it locally a bit first. Do you feel like it's ready to merge? What behavior are you seeing from the OMP build on Ubuntu?

@pca006132 (Collaborator, Author):

I think this should be ready to merge after fixing the minor issue above (I forgot to specify the return type of some functions) and cleaning up the formatting. For the OMP build on Ubuntu, the Boolean.Gyroid test fails with:

[ RUN      ] Boolean.Gyroid
/__w/manifold/manifold/test/mesh_test.cpp:762: Failure
Expected: (result.NumDegenerateTris()) <= (42), actual: 43 vs 42

It fails only with OMP_NUM_THREADS=2, and quite deterministically with that setting.

@elalish (Owner) commented Jun 1, 2022

Oh, that happens a lot; just change the value to 43 and call it good. It just serves to check if the numbers suddenly blow up (in which case edge collapse is broken), but the actual value is arbitrary and tends to shift.

@pca006132 (Collaborator, Author):

By the way, do you want to add a formatting commit to this? If so, do you have a clang-format file that you use, or should I just use the default?

@pca006132 (Collaborator, Author):

Looks like an issue with the GitHub API. There have been quite a lot of outages recently.

@elalish (Owner) commented Jun 2, 2022

A format commit is a good idea. I just have my VSCode set up to apply Google style with clang-format every time I save. But it's probably better to put it in the CI. 👍

@elalish (Owner) commented Jun 2, 2022

Also, should the CI just build the limited version of Assimp and avoid the other formats, now that you have that flag? In fact, that could even be the default...

@pca006132 (Collaborator, Author):

> Also, should the CI just build the limited version of Assimp and avoid the other formats, now that you have that flag? In fact, that could even be the default...

Yes, it is now the default.

@elalish (Owner) left a comment


The tests are looking good on my side, thanks!

elalish merged commit d1e71eb into elalish:master Jun 3, 2022
@elalish (Owner) commented Aug 27, 2022

@pca006132 After a couple of days of baffled debugging, I finally found the root of a crazy problem that showed up in a test for #187. I thought you might appreciate what I found - I went to bed convinced the compiler was broken, but woke up realizing the truth.

I was seeing manifold errors in the Boolean (printed assertions like k = 1 and no intersection, coming from deep in the internals), which I hadn't seen for a long time. It's concerning because this is the guaranteed part of the algorithm - those assertions really shouldn't be possible to hit. Stranger still, it worked fine on the CPU - only CUDA had the error.

My first guess was that my broad phase had an error so I wasn't checking for all the intersections I should be. I tested this by eliminating it and forcing a brute force check of every intersection. Sure enough, this made CUDA work fine again. But delving into the code, I couldn't find an error anywhere. Finally I got far enough down that I found a case where I was getting two slightly different results from my Interpolate() function for exactly the same input (I had to print all the floating-point digits to see the difference in the last decimal place). Sure enough, that violated the assumption of the algorithm and caused the failure, but how could this function not be deterministic?

Finally I remembered: we're using two compilers, one for CPU and one for CUDA. And the choice of which version of each function to use is determined at run time by the vector length. It turns out one pass was long, running CUDA, and one was short, running on the CPU. They must use subtly different operations or ordering that result in slightly different rounding errors occasionally. Overriding the policy to force them onto the same compiled function fixed the issue and verified my hunch. It also explained why my brute force approach seemed to fix the problem: it made all the vectors long enough to choose CUDA.

Anyway, I'm going to change the policy system in #187, so that the policy becomes a member of the Boolean class, evaluated only once or twice and then used consistently throughout the internal operations. I think this may even improve performance a touch, as the real cost of CUDA is in moving the data between the host and device; even short vectors are fast as long as the data is already on the device. This should help us be more consistent.
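
A sketch of that direction (the class shape and names here are hypothetical): the policy is computed once from the input size and then reused for every internal call, so a whole Boolean runs on one backend and sees one rounding behavior.

class BooleanOp {
 public:
  explicit BooleanOp(size_t numTri) : policy_(autoPolicy(numTri)) {}  // decided once, up front

  void Run() {
    // Every internal wrapper call receives policy_ explicitly, e.g.
    //   gather(policy_, ...);
    //   remove_if_void(policy_, ...);
    // instead of re-deriving the policy from each vector's length.
  }

 private:
  const ExecutionPolicy policy_;
};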

@pca006132 (Collaborator, Author):

@elalish Nice! This is actually the way I wanted to implement it, to avoid moving the data many times, but I was a bit too lazy to modify the Boolean class to make it store the policy.

But it seems pretty weird to me that a slight rounding difference between the CPU and GPU can cause such a failure: they should both abide by IEEE 754, so the difference probably comes from increased precision. As mentioned in the CUDA documentation, there are quite a lot of low-level quirks that may give slightly different (slightly more precise) rounding. I wonder if it is possible to make the algorithm more robust against this kind of error. Can you point me to the part of the paper that has such a limitation on floating-point operations?
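
For illustration, here is a tiny self-contained example of the kind of single-ulp difference being discussed, assuming a fused multiply-add is the culprit (nvcc fuses multiply-adds by default, while host compilers may not; whether that is what happened in this particular case is a guess):

#include <cmath>
#include <cstdio>

int main() {
  // Build with -ffp-contract=off so the host compiler does not fuse the first
  // expression itself. a*b is not exactly representable in float, so rounding
  // it before the add loses the tiny remainder that the fused path keeps.
  float a = 1.000244140625f;   // 1 + 2^-12
  float b = 1.000244140625f;   // 1 + 2^-12
  float c = -1.00048828125f;   // -(1 + 2^-11)
  float separate = a * b + c;        // two roundings: prints 0
  float fused = std::fmaf(a, b, c);  // one rounding:  prints ~5.96e-08
  std::printf("%.9g vs %.9g\n", separate, fused);
  return 0;
}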

@elalish (Owner) commented Aug 27, 2022

So, the sensitivity isn't really due to the paper, but to the way I optimized it. The issue is Shadow 01 (vertex-edge); the first time I implemented it I calculated these values once and stored them, which meant there was no sensitivity. However, that used a lot of memory, and I found it quite a bit faster to recalculate the values as needed rather than storing them. That in turn meant I needed to get repeatable values (the value doesn't matter, only that they match), which wasn't a problem until we started interchanging the two compilers.

@pca006132 (Collaborator, Author):

Oh OK, this makes sense to me now. I guess we should also document it so we are aware of this in the future.

elalish mentioned this pull request Aug 31, 2022
pca006132 deleted the dynamic-backend branch December 22, 2022
cartesian-theatrics pushed a commit to SovereignShop/manifold that referenced this pull request Mar 11, 2024