
dynamic backend implementation #126

Merged: 9 commits, merged Jun 3, 2022

Conversation

@pca006132 (Collaborator) commented May 30, 2022

By reinventing the wheel, we can switch backends dynamically depending on the workload, get better performance, and run a CUDA-enabled build on machines without CUDA (it switches to OMP automatically).

This patch is large because I had to change every single thrust::ALGORITHM call to a custom function that selects the backend based on an execution policy, which is in turn determined by the workload size (simply comparing it against some constant threshold). We cannot just pass a different execution policy to thrust at runtime, because each execution policy has a different type, so the implementation here uses a macro to build the wrapper functions and a switch to choose the correct invocation.
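
A rough sketch of that pattern, for illustration only: the enum and macro names are approximations, not necessarily the exact ones in the patch, while ExecutionPolicy::Par and the thrust::cuda::par / thrust::omp::par targets follow the excerpts quoted further down.

#include <thrust/gather.h>
#include <thrust/system/cpp/execution_policy.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/system/omp/execution_policy.h>

enum class ExecutionPolicy { ParUnseq, Par, Seq };

// Generates a wrapper that picks the thrust backend at run time from the
// policy, since thrust's own policies are distinct types chosen at compile
// time. Call sites then use e.g. gather(policy, ...) instead of
// thrust::gather(thrust::device, ...).
#define DYNAMIC_BACKEND_VOID(NAME) \
  template <typename... Args> \
  void NAME(ExecutionPolicy policy, Args... args) { \
    switch (policy) { \
      case ExecutionPolicy::ParUnseq: \
        thrust::NAME(thrust::cuda::par, args...); \
        break; \
      case ExecutionPolicy::Par: \
        thrust::NAME(thrust::omp::par, args...); \
        break; \
      case ExecutionPolicy::Seq: \
        thrust::NAME(thrust::cpp::par, args...); \
        break; \
    } \
  }

DYNAMIC_BACKEND_VOID(gather)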

CUDA GPU detection is done by calling cudaGetDeviceCount and setting a global variable. If no CUDA devices are available, the GPU code path is skipped and we fall back to the OpenMP or sequential implementation.
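
For illustration, the selection could look roughly like this (autoPolicy, kSeqThreshold, and the threshold value are hypothetical; CUDA_ENABLED is the global flag mentioned above):

extern bool CUDA_ENABLED;  // set at startup from cudaGetDeviceCount

// Pick a policy from the workload size and the CUDA_ENABLED flag. The
// threshold value is an assumption for the sketch, not the constant used in
// the patch.
constexpr size_t kSeqThreshold = 1000;

ExecutionPolicy autoPolicy(size_t size) {
  if (size <= kSeqThreshold) return ExecutionPolicy::Seq;  // small job: sequential
  if (CUDA_ENABLED) return ExecutionPolicy::ParUnseq;      // GPU present: CUDA
  return ExecutionPolicy::Par;                             // otherwise: OpenMP
}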

This patch also changes the VecDH implementation from std::vector + thrust::universal_vector to a custom vector implementation, which allows constructing vectors with uninitialized memory, performs uninitialized_copy/uninitialized_fill on the GPU or CPU based on data size, and also performs memory prefetching to reduce the number of cache misses. This custom vector is needed because the old implementation caused a huge number of page faults by calling thrust::uninitialized_fill(thrust::device, ...) when initializing the vector, which slowed performance considerably.
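
The gist of the allocation change, as a hedged sketch (the helper names are made up, and the real VecDH does more bookkeeping, e.g. falling back to plain host allocation when CUDA is unavailable):

#include <cuda_runtime.h>

// Allocate managed (unified) memory without value-initializing it, so nothing
// has to touch every element just to zero it out.
template <typename T>
T* allocManagedUninitialized(size_t n) {
  T* ptr = nullptr;
  cudaMallocManaged(&ptr, n * sizeof(T));
  return ptr;
}

// Prefetch a buffer to the current GPU before a large kernel runs, so the
// pages are resident and the kernel does not stall on demand paging.
void prefetchToGpu(const void* ptr, size_t bytes) {
  int device = 0;
  cudaGetDevice(&device);
  cudaMemPrefetchAsync(ptr, bytes, device, 0);
}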


Benchmark:

For the CPP and OMP backends, the difference is not very significant. Below are the results for CUDA:

Old (#121):
nTri = 512, time = 0.0123526 sec
nTri = 2048, time = 0.0141026 sec
nTri = 8192, time = 0.0195818 sec
nTri = 32768, time = 0.0374303 sec
nTri = 131072, time = 0.120768 sec
nTri = 524288, time = 0.266435 sec
nTri = 2097152, time = 0.924646 sec
nTri = 8388608, time = 4.01982 sec

New:
nTri = 512, time = 0.00276084 sec
nTri = 2048, time = 0.00633459 sec
nTri = 8192, time = 0.00877285 sec
nTri = 32768, time = 0.0524452 sec
nTri = 131072, time = 0.0994011 sec
nTri = 524288, time = 0.285462 sec
nTri = 2097152, time = 0.947242 sec
nTri = 8388608, time = 3.70755 sec

Most importantly, the time required for small operations is reduced significantly.

TODO:

  • Run clang-format for the cpp/h files. The indentation is messed up for now, but the diff should be readable.
  • Rethink how we should specify the backend in our build script. Previously the backends were mutually exclusive, but now they are not (we can have OMP, CUDA, CUDA + OMP, CUDA + TBB, etc.).

@elalish (Owner) left a comment


Nice work getting 5x speedup on the small scales!

THRUST_DYNAMIC_BACKEND_VOID(gather, )
THRUST_DYNAMIC_BACKEND_VOID(gather_if, )
THRUST_DYNAMIC_BACKEND_VOID(remove_if, _void)
THRUST_DYNAMIC_BACKEND_VOID(unique, _void)
@elalish (Owner):

What's the story with just a few having _void?

@pca006132 (Collaborator, Author):

To avoid a name collision with the one defined below with THRUST_DYNAMIC_BACKEND. The reason for having these void variants is to avoid having to specify the return type via remove_if<RET> when we don't need it. I did this a few days ago but am not sure whether it is still needed; I will have a look and remove these variants if they are not used.
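
A caller-side illustration of the distinction (declarations only, simplified; the generated names and signatures are an approximation of what the macros produce):

// The plain variant is templated on its return type, so a caller that wants
// the result must spell the type out; the _void variant simply discards it.
template <typename Ret, typename... Args>
Ret remove_if(ExecutionPolicy policy, Args... args);

template <typename... Args>
void remove_if_void(ExecutionPolicy policy, Args... args);

// Usage (Iter is a stand-in for the container's iterator type):
//   Iter newEnd = remove_if<Iter>(policy, v.begin(), v.end(), pred);
//   remove_if_void(policy, v.begin(), v.end(), pred);   // no <Ret> needed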

THRUST_DYNAMIC_BACKEND(remove, void)
THRUST_DYNAMIC_BACKEND(copy_if, void)
THRUST_DYNAMIC_BACKEND(remove_if, void)
THRUST_DYNAMIC_BACKEND(unique, void)
@elalish (Owner):

What's the difference between THRUST_DYNAMIC_BACKEND_VOID(copy, ) and e.g. THRUST_DYNAMIC_BACKEND(copy_if, void)?

@elalish (Owner):

Thanks for the simplification, but I still don't have the answer to this question. Also, I thought these functions do have return values, so void seems odd?

@pca006132 (Collaborator, Author):

Oh, I forgot about this. I was lazy when implementing these functions and forgot to give them a return type. Will fix.

void check_cuda_available() {
int device_count = 0;
cudaError_t error = cudaGetDeviceCount(&device_count);
CUDA_ENABLED = device_count != 0;
@elalish (Owner):

👍

thrust::NAME(thrust::cuda::par, args...); \
break; \
case ExecutionPolicy::Par: \
thrust::NAME(thrust::omp::par, args...); \
@elalish (Owner):

So, if you wanted TBB, that would replace OMP here, right? Maybe for build rules we need something like ParUnseq = CUDA | NONE and Par = OMP | TBB | NONE? I guess NONE is mostly about building for compilers that don't support these.
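
One way that build-time mapping could be spelled (the MANIFOLD_USE_* flags and MANIFOLD_PAR* macros here are hypothetical, just to illustrate the ParUnseq = CUDA | NONE and Par = OMP | TBB | NONE idea):

#if defined(MANIFOLD_USE_CUDA)
#define MANIFOLD_PAR_UNSEQ thrust::cuda::par  // ParUnseq = CUDA
#else
#define MANIFOLD_PAR_UNSEQ thrust::cpp::par   // ParUnseq = NONE (sequential fallback)
#endif

#if defined(MANIFOLD_USE_TBB)
#define MANIFOLD_PAR thrust::tbb::par         // Par = TBB
#elif defined(MANIFOLD_USE_OMP)
#define MANIFOLD_PAR thrust::omp::par         // Par = OMP
#else
#define MANIFOLD_PAR thrust::cpp::par         // Par = NONE (sequential fallback)
#endif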

@pca006132 (Collaborator, Author):

Yes

@pca006132 (Collaborator, Author):

I guess I will update this and the CMake script after #125 is merged.

@pca006132 (Collaborator, Author):

Ah, forgot that the build now depends on OMP; will fix the CMake script.

@pca006132 (Collaborator, Author) commented Jun 1, 2022

The OMP test failure is weird; I'm not exactly sure what is going on. I managed to reproduce it in Docker with the CI settings and OMP_NUM_THREADS=2.

@pca006132 (Collaborator, Author):

I tried running the program with the thread sanitizer enabled. On my machine (NixOS), I only got data race warnings for manifold/src/impl.cpp:644:33, manifold/src/impl.cpp:264:24, and manifold/src/impl.cpp:265:24, which are fine. On Docker with Ubuntu, however, there are too many data race warnings, and it is impossible to identify what causes the nondeterministic behavior. I'm not entirely sure whether this bug is caused by my changes or already existed and just became more apparent with this change.

@elalish (Owner) commented Jun 1, 2022

This PR is looking good to me; I'd like to test it locally a bit first. Do you feel like it's ready to merge? What behavior are you seeing from the OMP build on Ubuntu?

@pca006132 (Collaborator, Author):

I think this should be ready to merge after fixing the minor issue above (I forgot to specify the return type of some functions) and cleaning up the formatting. For the OMP build on Ubuntu, the Boolean.Gyroid test fails with:

[ RUN      ] Boolean.Gyroid
/__w/manifold/manifold/test/mesh_test.cpp:762: Failure
Expected: (result.NumDegenerateTris()) <= (42), actual: 43 vs 42

It fails only with OMP_NUM_THREADS=2, and quite deterministically with that setting.

@elalish (Owner) commented Jun 1, 2022

Oh, that happens a lot; just change the value to 43 and call it good. It just serves to check if the numbers suddenly blow up (in which case edge collapse is broken), but the actual value is arbitrary and tends to shift.

@pca006132 (Collaborator, Author):

By the way, do you want to add a formatting commit to this? If so, do you have a clang-format file that you use, or should I just use the default?

@pca006132 (Collaborator, Author):

Looks like an issue with the GitHub API. There have been quite a lot of outages recently.

@elalish (Owner) commented Jun 2, 2022

A format commit is a good idea. I just have my VSCode set up to apply Google style with clang-format every time I save. But it's probably better to put it in the CI. 👍

@elalish (Owner) commented Jun 2, 2022

Also, should the CI just build the limited version of Assimp and avoid the other formats, now that you have that flag? In fact, that could even be the default...

@pca006132 (Collaborator, Author):

> Also, should the CI just build the limited version of Assimp and avoid the other formats, now that you have that flag? In fact, that could even be the default...

Yes, it is now the default.

@elalish (Owner) left a comment


The tests are looking good on my side, thanks!

elalish merged commit d1e71eb into elalish:master Jun 3, 2022
@elalish (Owner) commented Aug 27, 2022

@pca006132 After a couple of days of baffled debugging, I finally found the root of a crazy problem that showed up in a test for #187. I thought you might appreciate what I found - I went to bed convinced the compiler was broken, but woke up realizing the truth.

I was seeing manifold errors in the Boolean (printed assertions like k = 1 and no intersection, coming from deep in the internals), which I hadn't seen for a long time. It's concerning because this is the guaranteed part of the algorithm - those assertions really shouldn't be possible to hit. Stranger still, it worked fine on the CPU - only CUDA had the error.

My first guess was that my broad phase had an error so I wasn't checking for all the intersections I should be. I tested this by eliminating it and forcing a brute force check of every intersection. Sure enough, this made CUDA work fine again. But delving into the code, I couldn't find an error anywhere. Finally I got far enough down that I found a case where I was getting two slightly different results from my Interpolate() function for exactly the same input (I had to print all the floating-point digits to see the difference in the last decimal place). Sure enough, that violated the assumption of the algorithm and caused the failure, but how could this function not be deterministic?

Finally I remembered: we're using two compilers, one for CPU and one for CUDA. And the choice of which version of each function to use is determined at run time by the vector length. It turns out one pass was long, running CUDA, and one was short, running on the CPU. They must use subtly different operations or ordering that result in slightly different rounding errors occasionally. Overriding the policy to force them onto the same compiled function fixed the issue and verified my hunch. It also explained why my brute force approach seemed to fix the problem: it made all the vectors long enough to choose CUDA.

Anyway, I'm going to change the policy system in #187, so that the policy becomes a member of the Boolean class, evaluated only once or twice and then used consistently throughout the internal operations. I think this may even improve performance a touch, as the real cost of CUDA is in moving the data between the host and device; even short vectors are fast as long as the data is already on the device. This should help us be more consistent.
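
A sketch of that direction (the class shape and names here are hypothetical): the policy is computed once from the input size and then reused for every internal call, so a whole Boolean runs on one backend and sees one rounding behavior.

class BooleanOp {
 public:
  explicit BooleanOp(size_t numTri) : policy_(autoPolicy(numTri)) {}  // decided once, up front

  void Run() {
    // Every internal wrapper call receives policy_ explicitly, e.g.
    //   gather(policy_, ...);
    //   remove_if_void(policy_, ...);
    // instead of re-deriving the policy from each vector's length.
  }

 private:
  const ExecutionPolicy policy_;
};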

@pca006132 (Collaborator, Author):

@elalish Nice! This is actually the way I wanted to implement it, to avoid moving the data many times, but I was a bit too lazy to modify the Boolean class to make it store the policy.

But it seems pretty weird to me that a slight rounding difference between the CPU and GPU can cause such a failure: they should both abide by IEEE 754, so the difference probably comes from increased precision. As mentioned in the CUDA documentation, there are quite a lot of low-level quirks that may give slightly different (slightly more precise) rounding. I wonder if it is possible to make the algorithm more robust against this kind of error. Can you point me to the part of the paper that has such a limitation on floating-point operations?
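
For illustration, here is a tiny self-contained example of the kind of single-ulp difference being discussed, assuming a fused multiply-add is the culprit (nvcc fuses multiply-adds by default, while host compilers may not; whether that is what happened in this particular case is a guess):

#include <cmath>
#include <cstdio>

int main() {
  // Build with -ffp-contract=off so the host compiler does not fuse the first
  // expression itself. a*b is not exactly representable in float, so rounding
  // it before the add loses the tiny remainder that the fused path keeps.
  float a = 1.000244140625f;   // 1 + 2^-12
  float b = 1.000244140625f;   // 1 + 2^-12
  float c = -1.00048828125f;   // -(1 + 2^-11)
  float separate = a * b + c;        // two roundings: prints 0
  float fused = std::fmaf(a, b, c);  // one rounding:  prints ~5.96e-08
  std::printf("%.9g vs %.9g\n", separate, fused);
  return 0;
}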

@elalish (Owner) commented Aug 27, 2022

So, the sensitivity isn't really due to the paper, but to the way I optimized it. The issue is Shadow 01 (vertex-edge); the first time I implemented it I calculated these values once and stored them, which meant there was no sensitivity. However, that used a lot of memory, and I found it quite a bit faster to recalculate the values as needed rather than storing them. That in turn meant I needed to get repeatable values (the value doesn't matter, only that they match), which wasn't a problem until we started interchanging the two compilers.

@pca006132 (Collaborator, Author):

Oh OK, this makes sense to me now. I guess we should also document it so we are aware of this in the future.

elalish mentioned this pull request Aug 31, 2022
pca006132 deleted the dynamic-backend branch December 22, 2022
cartesian-theatrics pushed a commit to SovereignShop/manifold that referenced this pull request Mar 11, 2024