
performance tuning and disabling deprecated backends #525

Merged
merged 19 commits into elalish:master from misc on Aug 12, 2023

Conversation

pca006132
Collaborator

@pca006132 pca006132 commented Aug 10, 2023

There are various things in this PR:

  1. Disabled the OMP and CUDA backends. The CUDA-related code is still there because removing it will take a large cleanup (e.g. all the `__host__ __device__` attributes).
  2. Added hooks for allocation/deallocation, so Tracy can be used to inspect memory allocation data (number of allocations, size of allocations, and when they happen). This is limited to VecDH for now; I can remove this commit if we don't really need it.
  3. Performance optimization: use a single vector for SparseIndices instead of two, and use a faster binary search adapted from https://github.com/scandum/binary_search/blob/master/binary_search.c#L264. I will post benchmark data later today.
  4. Used the thread sanitizer to debug data races. I ignored src/third_party/thrust/thrust/system/tbb/detail/reduce.inl via an ignorelist, since Thrust is known to have a data race in its parallel reduce. I fixed some of the simpler races using atomics, but two cases remain: UpdateProperties and CoplanarEdge. Note that by the C++ standard, a data race between threads that does not use atomics or memory fences is undefined behavior.

Closes #524

@codecov

codecov bot commented Aug 10, 2023

Codecov Report

Patch coverage: 78.22% and project coverage change: +0.33% 🎉

Comparison is base (fca6638) 90.37% compared to head (1a1678c) 90.70%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #525      +/-   ##
==========================================
+ Coverage   90.37%   90.70%   +0.33%     
==========================================
  Files          35       34       -1     
  Lines        4456     4292     -164     
==========================================
- Hits         4027     3893     -134     
+ Misses        429      399      -30     
Files Changed Coverage Δ
src/collider/include/collider.h 66.66% <ø> (ø)
src/manifold/src/impl.h 72.72% <ø> (ø)
src/utilities/include/utils.h 55.88% <20.00%> (ø)
src/manifold/src/constructors.cpp 94.84% <33.33%> (+0.05%) ⬆️
src/manifold/src/shared.h 61.03% <50.00%> (+0.78%) ⬆️
src/utilities/include/sparse.h 54.38% <51.16%> (+5.20%) ⬆️
src/manifold/src/sort.cpp 88.42% <52.94%> (-0.72%) ⬇️
src/manifold/src/properties.cpp 83.91% <61.53%> (-0.90%) ⬇️
src/manifold/src/csg_tree.cpp 91.46% <66.66%> (+0.05%) ⬆️
src/manifold/src/smoothing.cpp 91.88% <66.66%> (+0.38%) ⬆️
... and 11 more

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.

@pca006132
Collaborator Author

pca006132 commented Aug 10, 2023

[image: benchmark results]

I only included benchmarks that run longer than 100 ms here. This was run on an i5-13600KF @ 3.5 GHz (turbo boost disabled), averaged over 5 runs. The two cases that show a slight regression are not related to the boolean operation, so I am not sure why. For the other cases, we got improvements of about 8% up to 30%, and the improvement is larger when there are fewer cores (so I suspect the diminishing returns are mostly due to a memory bandwidth bottleneck).

[image: scaling plot]

This shows that we probably still have around 20% sequential work (e.g. SimplifyTopology) or memory-bandwidth constraints. The staircase pattern is probably due to how my benchmark script and CPU interact: I was using taskset to restrict the program to CPUs 0..(n-1), and the i-th P-core has ids 2i and 2i+1, so half the time I was just enabling SMT on a core already in use instead of adding a real core.

Also, note that the seemingly worse scaling does not mean this PR makes running time worse when you have more cores; as the results show, tests running with 12 cores improved as well.

@pca006132 pca006132 changed the title [WIP] performance tuning and disabling deprecated backends performance tuning and disabling deprecated backends Aug 10, 2023
@pca006132
Collaborator Author

And for some reason the Windows TBB build is now working, nice!

Owner

@elalish elalish left a comment

This looks amazing! I'll update our list of required tests.

src/collider/src/collider.cpp (outdated)
else
// https://github.com/scandum/binary_search/blob/master/README.md
// much faster than standard binary search on large arrays
int monobound_quaternary_search(const int64_t *array, unsigned int array_size,
Owner

Fancy!

src/manifold/src/boolean3.cpp (outdated)
src/manifold/src/boolean3.cpp (outdated)
src/manifold/src/impl.cpp
@@ -185,7 +186,8 @@ struct CoplanarEdge {
const int numProp;
const float precision;

__host__ __device__ void operator()(
// FIXME: race condition
Owner

Is this just non-determinism, or can it be worse?

Collaborator Author

In theory it can be worse, as this is UB. In practice, it may or may not cause non-determinism, depending on whether the values written by the racing threads differ.

Collaborator Author

I will just put this into a new issue

src/utilities/include/sparse.h (outdated)
#else
#define TracyAllocS(ptr, size, n) (void)0
#define TracyFreeS(ptr, n) (void)0
#endif
Owner

The whole point of VecDH (at least originally) was to handle the fact that CUDA needs separate host and device vectors that need to be synced. With CUDA now removed, is there a chance we can remove VecDH entirely? Doesn't have to be in this PR, but it would be great if we can simplify here.

Collaborator Author

Yes, this is what I am going to do, but I plan to do it in another PR, as it will involve a lot of changes.

Collaborator Author

Thinking about it, I wonder if we may want Vulkan acceleration later. Vulkan would require something similar, or perhaps a bit more complicated (we may not have unified memory, depending on the target hardware).

Collaborator Author

Also, as our VecDH class has specialized parallel-fill and uninitialized-memory routines, changing to std::vector<T> can result in a performance regression of more than 20% for some test cases.

Owner

I have no problem with leaving the class if it's actually doing something useful, but if it can be simplified to one internal storage instead of two, that would be great. We can always redesign it later if needed for Vulkan, but I don't want to plan early for that, especially considering that TBB ended up being just as fast as CUDA. That makes me think Vulkan may be more trouble than it's worth.

Collaborator Author

Yeah, I can simplify that in another PR. It was really surprising when I saw that switching to std::vector caused a large slowdown.

Collaborator Author

Regarding Vulkan, I think the only case where the GPU can be faster is SDF.

Owner

Good to know. Maybe someday SDF will be used enough that it'll be worth looking into, but for now I'm thinking TBB is just fine.

@elalish
Owner

elalish commented Aug 10, 2023

Nice job on the benchmarks! We should really add our spheres perf benchmark to our standard tests so we remember to run them. I consider that to be more indicative of common real-world usage and scaling than these degenerate-focused tests.

@pca006132
Collaborator Author

> Nice job on the benchmarks! We should really add our spheres perf benchmark to our standard tests so we remember to run them. I consider that to be more indicative of common real-world usage and scaling than these degenerate-focused tests.

Agreed. I actually checked the spheres benchmark as well, but I forgot to include it in the plots.

@pca006132
Collaborator Author

[image: updated benchmark results]

[image: updated benchmark results]

I also did some optimization for csg_tree today; I think the result is noticeable. However, a weird thing is that after changing the SparseIndices API the tests in general run a bit slower. I suspect this is due to pointer arithmetic not being optimized well by GCC (or not being inlined enough, as we originally used pointers directly but now go through methods).

Owner

@elalish elalish left a comment

This is excellent! A few nits, but I think this is ready when you are.

};

// common cases
if (results.size() == 0) return std::make_shared<Manifold::Impl>();
if (results.size() == 1)
Owner

💯

endif()

if(EMSCRIPTEN)
set(TBB_GIT_TAG 8db4efacad3b8aa6eea62281a9e8444e5dc8f16a)
Owner

What's this?

Collaborator Author

Oh, this is the TBB-on-wasm thing that I accidentally committed; I will remove it.

@@ -26,7 +26,7 @@ typedef Uint64 (*hash_fun_t)(Uint64);
constexpr Uint64 kOpen = std::numeric_limits<Uint64>::max();

template <typename T>
__host__ __device__ T AtomicCAS(T& target, T compare, T val) {
T AtomicCAS(T& target, T compare, T val) {
Owner

Looks like some more __CUDA_ARCH__ below here we can remove.

return ExecutionPolicy::Par;
}
return ExecutionPolicy::ParUnseq;
return ExecutionPolicy::Par;
Owner

Looks like another good constexpr candidate above.

}

#ifdef MANIFOLD_USE_CUDA
#define THRUST_DYNAMIC_BACKEND_VOID(NAME) \
Owner

💯

@@ -138,13 +95,12 @@ inline ExecutionPolicy autoPolicy(int size) {
} \
}

#if MANIFOLD_PAR == 'T' && !(__APPLE__)
#if MANIFOLD_PAR == 'T' && __has_include(<pstl/glue_execution_defs.h>)
Owner

👍

}
std::cout << std::endl;
}
#endif

private:
VecDH<int64_t> data;
inline int* ptr() { return reinterpret_cast<int32_t*>(data.ptrD()); }
inline const int* ptr() const {
Owner

Nice job making these private! Love the readability improvements here.

@@ -255,42 +232,19 @@ class ManagedVec {
if (bytes >= (1ull << 63)) {
throw std::bad_alloc();
}
#ifdef MANIFOLD_USE_CUDA
if (CudaEnabled())
cudaMallocManaged(reinterpret_cast<void **>(ptr), bytes);
Owner

Oh, that's right - I forgot you already refactored VecDH for managed memory. Looks like you've taken care of everything I'd hoped for!

@pca006132
Collaborator Author

And I also went ahead and implemented the vec_dh refactoring.

@pca006132
Collaborator Author

It seems that some Thrust functions trigger GCC warnings; I am not sure whether they are false positives. The code should be basically identical to the old code, and so far I haven't gotten any address sanitizer errors, so I am not too worried about them for now.

@pca006132
Collaborator Author

@elalish do you want to review this further, or should I just merge it?

Owner

@elalish elalish left a comment

Looks great, thanks!

@pca006132 pca006132 merged commit 9290d42 into elalish:master Aug 12, 2023
15 checks passed
@pca006132 pca006132 deleted the misc branch August 15, 2023 12:54
@elalish elalish mentioned this pull request Nov 3, 2023
cartesian-theatrics pushed a commit to SovereignShop/manifold that referenced this pull request Mar 11, 2024
* use single vector SparseIndices

* faster binary search

* tracy

* disable cuda and omp backends

* formatting

* find data races

* remove some cuda stuff

* rename indices to vertices

* fix

* better SparseIndices API

* minor fixes

* parallel boolean

* optimize csg_tree

* relax Samples.GyroidModule

* small fixes

* vec_dh refactor

* fix compiler errors

* disable some gcc warnings

* only apply new warning flags for GCC
Successfully merging this pull request may close these issues:

deprecate OMP and CUDA backend