
performance tuning and disabling deprecated backends #525

Merged
merged 19 commits into elalish:master from misc on Aug 12, 2023

Conversation

pca006132
Collaborator

@pca006132 pca006132 commented Aug 10, 2023

There are various things in this PR:

  1. Disabled the OMP and CUDA backends. The CUDA-related code is still there because removing it will take a large cleanup (e.g. all the `__host__ __device__` attributes).
  2. Added hooks for allocation/deallocation, so Tracy can be used to inspect memory allocation data (number of allocations, size of allocations, and when they happen). This is limited to VecDH for now; I can remove this commit if we don't really need it.
  3. Performance optimization: use a single vector for SparseIndices instead of two, and use a faster binary search adapted from https://github.com/scandum/binary_search/blob/master/binary_search.c#L264. I will post benchmark data later today.
  4. Used the thread sanitizer to debug data races. I ignored src/third_party/thrust/thrust/system/tbb/detail/reduce.inl via an ignorelist, since Thrust is known to have a data race in its parallel reduce. I fixed some of the simpler races using atomics, but two cases remain: UpdateProperties and CoplanarEdge. Note that by the C++ standard, a data race between threads that does not use atomics or memory fences is undefined behavior.

Closes #524

@codecov

codecov bot commented Aug 10, 2023

Codecov Report

Patch coverage: 78.22% and project coverage change: +0.33% 🎉

Comparison is base (fca6638) 90.37% compared to head (1a1678c) 90.70%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #525      +/-   ##
==========================================
+ Coverage   90.37%   90.70%   +0.33%     
==========================================
  Files          35       34       -1     
  Lines        4456     4292     -164     
==========================================
- Hits         4027     3893     -134     
+ Misses        429      399      -30     
Files Changed Coverage Δ
src/collider/include/collider.h 66.66% <ø> (ø)
src/manifold/src/impl.h 72.72% <ø> (ø)
src/utilities/include/utils.h 55.88% <20.00%> (ø)
src/manifold/src/constructors.cpp 94.84% <33.33%> (+0.05%) ⬆️
src/manifold/src/shared.h 61.03% <50.00%> (+0.78%) ⬆️
src/utilities/include/sparse.h 54.38% <51.16%> (+5.20%) ⬆️
src/manifold/src/sort.cpp 88.42% <52.94%> (-0.72%) ⬇️
src/manifold/src/properties.cpp 83.91% <61.53%> (-0.90%) ⬇️
src/manifold/src/csg_tree.cpp 91.46% <66.66%> (+0.05%) ⬆️
src/manifold/src/smoothing.cpp 91.88% <66.66%> (+0.38%) ⬆️
... and 11 more

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.

@pca006132
Collaborator Author

pca006132 commented Aug 10, 2023

[image: benchmark results]

I only included benchmarks that run longer than 100 ms here. This was run on an i5-13600KF @ 3.5 GHz (turbo boost disabled), averaged over 5 runs. The two cases that show a slight regression are not related to the boolean operation, so I am not sure why. For the other cases, we got improvements of about 8% up to 30%, and the improvement is larger when there are fewer cores (so I suspect the diminishing returns are mostly due to a memory bandwidth bottleneck).

[image: scaling plot]

This shows that we probably still have around 20% sequential work (e.g. SimplifyTopology) or memory-bandwidth constraints. The staircase pattern is probably due to how my benchmark script and CPU interact: I was using taskset to restrict the program to CPUs 0..(n-1), and the i-th P-core has ids 2i and 2i+1, so half the time I was just enabling SMT on a core already in use instead of adding a real core.

Also, note that the seemingly worse scaling does not mean this PR makes running time worse when you have more cores; as the results show, tests running with 12 cores improved as well.

@pca006132 pca006132 changed the title [WIP] performance tuning and disabling deprecated backends performance tuning and disabling deprecated backends Aug 10, 2023
@pca006132
Collaborator Author

And for some reason the Windows TBB build is now working, nice!

Owner

@elalish elalish left a comment

This looks amazing! I'll update our list of required tests.

src/collider/src/collider.cpp (outdated)
else
// https://github.com/scandum/binary_search/blob/master/README.md
// much faster than standard binary search on large arrays
int monobound_quaternary_search(const int64_t *array, unsigned int array_size,
Owner

Fancy!

src/manifold/src/boolean3.cpp (outdated)
src/manifold/src/boolean3.cpp (outdated)
src/manifold/src/impl.cpp
@@ -185,7 +186,8 @@ struct CoplanarEdge {
const int numProp;
const float precision;

__host__ __device__ void operator()(
// FIXME: race condition
Owner

Is this just non-determinism, or can it be worse?

Collaborator Author

In theory it can be worse, as this is UB. In practice, it may or may not cause non-determinism, depending on whether the values written by the racing threads differ.

Collaborator Author

I will just put this into a new issue

src/utilities/include/sparse.h (outdated)
#else
#define TracyAllocS(ptr, size, n) (void)0
#define TracyFreeS(ptr, n) (void)0
#endif
Owner

The whole point of VecDH (at least originally) was to handle the fact that CUDA needs separate host and device vectors that need to be synced. With CUDA now removed, is there a chance we can remove VecDH entirely? Doesn't have to be in this PR, but it would be great if we can simplify here.

Collaborator Author

Yes, this is what I am going to do, but I plan to do it in another PR, as it will involve a lot of changes.

Collaborator Author

Thinking about it, I wonder if we may want Vulkan acceleration later. Vulkan would require something similar, or perhaps a bit more complicated (we may not have unified memory, depending on the target hardware).

Collaborator Author

Also, as our VecDH class has specialized parallel-fill and uninitialized-memory routines, changing to std::vector<T> can result in a performance regression of more than 20% for some test cases.

Owner

I have no problem with leaving the class if it's actually doing something useful, but if it can be simplified to one internal storage instead of two, that would be great. We can always redesign it later if needed for Vulkan, but I don't want to plan early for that, especially considering that TBB ended up being just as fast as CUDA. That makes me think Vulkan may be more trouble than it's worth.

Collaborator Author

Yeah, I can simplify that in another PR. It was really surprising when I saw that switching to std::vector caused a large slowdown.

Collaborator Author

Regarding Vulkan, I think the only case where the GPU can be faster is SDF.

Owner

Good to know. Maybe someday SDF will be used enough that it'll be worth looking into, but for now I'm thinking TBB is just fine.

@elalish
Owner

elalish commented Aug 10, 2023

Nice job on the benchmarks! We should really add our spheres perf benchmark to our standard tests so we remember to run them. I consider that to be more indicative of common real-world usage and scaling than these degenerate-focused tests.

@pca006132
Collaborator Author

> Nice job on the benchmarks! We should really add our spheres perf benchmark to our standard tests so we remember to run them. I consider that to be more indicative of common real-world usage and scaling than these degenerate-focused tests.

Agreed. I actually checked the spheres benchmark as well, but I forgot to include it in the plots.

@pca006132
Collaborator Author

[image: updated benchmark results]

[image: updated benchmark results]

I also did some optimization for csg_tree today; I think the result is noticeable. However, a weird thing is that after changing the SparseIndices API the tests in general run a bit slower. I suspect this is due to pointer arithmetic not being optimized well by GCC (or not being inlined enough, as we originally used pointers directly but now go through methods).

Owner

@elalish elalish left a comment

This is excellent! A few nits, but I think this is ready when you are.

};

// common cases
if (results.size() == 0) return std::make_shared<Manifold::Impl>();
if (results.size() == 1)
Owner

💯

endif()

if(EMSCRIPTEN)
set(TBB_GIT_TAG 8db4efacad3b8aa6eea62281a9e8444e5dc8f16a)
Owner

What's this?

Collaborator Author

Oh, this is the TBB-on-wasm thing that I accidentally committed; I will remove it.

@@ -26,7 +26,7 @@ typedef Uint64 (*hash_fun_t)(Uint64);
constexpr Uint64 kOpen = std::numeric_limits<Uint64>::max();

template <typename T>
__host__ __device__ T AtomicCAS(T& target, T compare, T val) {
T AtomicCAS(T& target, T compare, T val) {
Owner

Looks like some more __CUDA_ARCH__ below here we can remove.

return ExecutionPolicy::Par;
}
return ExecutionPolicy::ParUnseq;
return ExecutionPolicy::Par;
Owner

Looks like another good constexpr candidate above.

}

#ifdef MANIFOLD_USE_CUDA
#define THRUST_DYNAMIC_BACKEND_VOID(NAME) \
Owner

💯

@@ -138,13 +95,12 @@ inline ExecutionPolicy autoPolicy(int size) {
} \
}

#if MANIFOLD_PAR == 'T' && !(__APPLE__)
#if MANIFOLD_PAR == 'T' && __has_include(<pstl/glue_execution_defs.h>)
Owner

👍

}
std::cout << std::endl;
}
#endif

private:
VecDH<int64_t> data;
inline int* ptr() { return reinterpret_cast<int32_t*>(data.ptrD()); }
inline const int* ptr() const {
Owner

Nice job making these private! Love the readability improvements here.

@@ -255,42 +232,19 @@ class ManagedVec {
if (bytes >= (1ull << 63)) {
throw std::bad_alloc();
}
#ifdef MANIFOLD_USE_CUDA
if (CudaEnabled())
cudaMallocManaged(reinterpret_cast<void **>(ptr), bytes);
Owner

Oh, that's right - I forgot you already refactored VecDH for managed memory. Looks like you've taken care of everything I'd hoped for!

@pca006132
Collaborator Author

And I also went ahead and implemented the vec_dh refactoring.

@pca006132
Collaborator Author

It seems that some Thrust functions trigger GCC warnings; I am not sure whether they are false positives. The code should be basically identical to the old code, and so far I haven't gotten any address sanitizer errors, so I am not too worried about them for now.

@pca006132
Collaborator Author

@elalish do you want to review this further, or should I just merge it?

Owner

@elalish elalish left a comment

Looks great, thanks!

@pca006132 pca006132 merged commit 9290d42 into elalish:master Aug 12, 2023
15 checks passed
@pca006132 pca006132 deleted the misc branch August 15, 2023 12:54
@elalish elalish mentioned this pull request Nov 3, 2023
cartesian-theatrics pushed a commit to SovereignShop/manifold that referenced this pull request Mar 11, 2024
* use single vector SparseIndices

* faster binary search

* tracy

* disable cuda and omp backends

* formatting

* find data races

* remove some cuda stuff

* rename indices to vertices

* fix

* better SparseIndices API

* minor fixes

* parallel boolean

* optimize csg_tree

* relax Samples.GyroidModule

* small fixes

* vec_dh refactor

* fix compiler errors

* disable some gcc warnings

* only apply new warning flags for GCC
Successfully merging this pull request may close these issues:

deprecate OMP and CUDA backend