thrust universal vector #120

Closed
pca006132 opened this issue May 13, 2022 · 22 comments · Fixed by #121

Comments

@pca006132
Collaborator

pca006132 commented May 13, 2022

Just googled for unified memory support in thrust. It seems they have something called universal_vector, which can be accessed from both the host and the device. I guess we could use this instead of VecDH, which would give cleaner code and support complex modeling on GPUs that only have a small amount of RAM? And perhaps we could choose to run operations on the host or the device depending on the workload, without much code duplication. I haven't tried this yet though, so I'm not sure about its performance or whether there are any quirks.

@elalish
Owner

elalish commented May 13, 2022

Ooh, that does look promising! I had always thought of CUB as being CUDA-specific and so not usable with other backends, but from the description that looks like just what we need. Thrust implements its CUDA backend with CUB, so I trust it to be as performant as possible.

@pca006132
Collaborator Author

pca006132 commented May 14, 2022

I wrote a simple patch to try universal vector. Perhaps my implementation is incorrect, but the performance does not look great (it is actually a lot slower for CUDA). You can have a look at it here: pca006132@933eade

For the CPP backend there is a small performance improvement, probably due to less memory copying. For the OMP backend the performance is worse, and I don't know why. For the TBB backend there is also a small performance improvement (and it is better than the OMP backend). Apart from the performance improvement, it uses less memory, as there is only one vector now.

@pca006132
Collaborator Author

Did some debugging on this. The slowness seems to be caused by a large number of GPU page faults (https://stackoverflow.com/questions/39782746/why-is-nvidia-pascal-gpus-slow-on-running-cuda-kernels-when-using-cudamallocmana/40011988#40011988). I tried prefetching but it does not work; perhaps I'm getting the wrong pointer or something. I think I will leave it for later.

@pca006132
Collaborator Author

filed an issue: NVIDIA/cccl#809

@pca006132
Collaborator Author

Tried to work around this problem with a small cache and more or less succeeded: most of the tests pass on CUDA except the knot example, performance is significantly better, and perfTest no longer causes OOM on my machine. There is also some performance improvement for the CPP and OMP backends, and they use less memory. All tests pass for the other backends.

However I have no idea about the problem with the knot example. I can clean up the changes and open a PR if you are interested in it, or I can wait for thrust to fix the performance issue so no workaround will be needed.

@elalish
Owner

elalish commented May 15, 2022

Definitely interested! And let's not wait for thrust; that could easily be a long wait.

@pca006132
Collaborator Author

This also allows a dynamic backend: we can choose to run an algorithm on the host or on the device, and fall back to the CPU when the target machine has no GPU (without recompiling with another backend; not yet tested). A prototype implementation is here: https://github.com/pca006132/manifold/blob/dynamic-backend/utilities/include/par.h

I got some performance improvement for CUDA by running small operations on the host, although not much. Slowness due to slow managed-memory malloc is a huge problem.
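
For illustration, here is a minimal sketch of the run-time dispatch idea, assuming a simple size cutoff. Every name in it (`ExecutionPolicy`, `ChoosePolicy`, `SortWithPolicy`, `kDeviceThreshold`) is hypothetical and not taken from the linked par.h, and the threshold value is a placeholder. Both branches call std::sort so that the sketch builds without CUDA; a real build would presumably forward the device branch to a thrust algorithm.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical names throughout; this is not the API of the linked par.h.
enum class ExecutionPolicy { Host, Device };

// Assumed cutoff: below this many elements, launch and migration overhead
// is taken to outweigh the GPU's throughput. Placeholder value that would
// need tuning per machine.
constexpr std::size_t kDeviceThreshold = 1 << 16;

// Pick where to run based on workload size and whether a GPU is present.
inline ExecutionPolicy ChoosePolicy(std::size_t n, bool gpuAvailable) {
  if (!gpuAvailable || n < kDeviceThreshold) return ExecutionPolicy::Host;
  return ExecutionPolicy::Device;
}

// Single entry point. The decision is made at run time, so one binary
// covers machines with and without a GPU.
inline ExecutionPolicy SortWithPolicy(std::vector<int>& v, bool gpuAvailable) {
  const ExecutionPolicy policy = ChoosePolicy(v.size(), gpuAvailable);
  // A real build would branch on `policy` and run the device case on the
  // GPU. std::sort stands in for both cases so this compiles without CUDA.
  std::sort(v.begin(), v.end());
  return policy;
}
```

With the placeholder threshold, `ChoosePolicy(512, true)` returns `Host`, `ChoosePolicy(1 << 20, true)` returns `Device`, and any call with `gpuAvailable == false` returns `Host`.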

@elalish
Owner

elalish commented May 16, 2022

Wow, that's pretty slick! I really like the idea of one compiled version automatically working for both CPU and GPU. Can you show a few numbers regarding the performance hit you're seeing from managed memory?

@pca006132
Collaborator Author

pca006132 commented May 16, 2022

====== CUDA, Host Device Vector =======
nTri = 512, time = 0.00396219 sec
nTri = 2048, time = 0.00530954 sec
nTri = 8192, time = 0.00963976 sec
nTri = 32768, time = 0.0254835 sec

====== CUDA, Unified Memory =======
nTri = 512, time = 0.0123526 sec
nTri = 2048, time = 0.0141026 sec
nTri = 8192, time = 0.0195818 sec
nTri = 32768, time = 0.0374303 sec

And the tests run significantly slower.

I tried profiling it, and there seem to be a lot of page faults. malloc is actually very quick. IIRC the suggested workaround is prefetching, but that might not be easy to do without making the code very complicated.
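
To make the gap between the two runs easier to read, here is a small sketch that computes the slowdown ratio and the absolute extra time per problem size. The timings are copied from the output above; the helper names (`Sample`, `Slowdown`, `ExtraSeconds`) are made up for this example.

```cpp
#include <array>
#include <cstddef>

// One row of the benchmark output: times are in seconds.
struct Sample {
  int nTri;
  double hostDevice;  // "CUDA, Host Device Vector" run
  double unified;     // "CUDA, Unified Memory" run
};

// Timings copied from the two runs above.
constexpr std::array<Sample, 4> kSamples{{
    {512, 0.00396219, 0.0123526},
    {2048, 0.00530954, 0.0141026},
    {8192, 0.00963976, 0.0195818},
    {32768, 0.0254835, 0.0374303},
}};

// How many times slower the unified-memory run is.
constexpr double Slowdown(const Sample& s) { return s.unified / s.hostDevice; }

// Absolute extra time of the unified-memory run, in seconds.
constexpr double ExtraSeconds(const Sample& s) {
  return s.unified - s.hostDevice;
}
```

For these four sizes the ratio falls from about 3.1x at nTri = 512 to about 1.5x at nTri = 32768, while the absolute extra time only grows from about 8.4 ms to about 11.9 ms. That looks more like a largely fixed overhead than a cost proportional to problem size, at least over this range.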

@pca006132
Collaborator Author

I'm wondering whether we should use the old design, i.e. having a host vector and a device vector, but replace the device vector with a universal vector. That way we still get the benefit of being able to run the algorithm on the host or on the device depending on the workload (as the device vector is now universal), but avoid page faults by keeping a host vector for most of the host operations.

I think I will try to implement that and compare their performance.
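
A rough sketch of what such a pair could look like, assuming lazy synchronization with validity flags. This is an assumption for illustration, not the actual VecDH code, and `HybridVec` and its members are hypothetical names. Both sides are std::vector here so that it builds without CUDA; in the design described above the second one would be a universal vector.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the host + universal pair. Each side is copied
// from the other only when it is stale, and the copies are counted.
template <typename T>
class HybridVec {
 public:
  explicit HybridVec(std::size_t n) : host_(n), shared_(n) {}

  // Host-side access: refresh the host copy if the shared side is newer.
  std::vector<T>& Host() {
    if (!hostValid_) {
      host_ = shared_;
      ++copies_;
      hostValid_ = true;
    }
    sharedValid_ = false;  // the caller may write through the reference
    return host_;
  }

  // Access for parallel algorithms, under either a host or device policy.
  std::vector<T>& Shared() {
    if (!sharedValid_) {
      shared_ = host_;
      ++copies_;
      sharedValid_ = true;
    }
    hostValid_ = false;  // the caller may write through the reference
    return shared_;
  }

  // Number of full-buffer copies performed so far.
  std::size_t Copies() const { return copies_; }

 private:
  std::vector<T> host_;
  std::vector<T> shared_;
  bool hostValid_ = true;
  bool sharedValid_ = true;
  std::size_t copies_ = 0;
};
```

Repeated calls to `Host()` cost nothing extra; a copy only happens when access switches from one side to the other, which is the trade this design makes against page faults on every host touch.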

@pca006132
Collaborator Author

Using the old design, the performance is something like this:

nTri = 512, time = 0.00869293 sec
nTri = 2048, time = 0.0102029 sec
nTri = 8192, time = 0.0150084 sec
nTri = 32768, time = 0.0343637 sec
nTri = 131072, time = 0.107823 sec
nTri = 524288, time = 0.367812 sec
nTri = 2097152, time = 1.34327 sec
nTri = 8388608, time = 5.8539 sec
	Command being timed: "./perfTest"
	User time (seconds): 8.11
	System time (seconds): 2.70
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:10.83
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 4777648
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 24
	Minor (reclaiming a frame) page faults: 2297348
	Voluntary context switches: 195
	Involuntary context switches: 33
	Swaps: 0
	File system inputs: 3488
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

The performance for CPP and OMP is similar: a bit worse than before.

There are a lot fewer page faults, but the performance is slightly worse than the original version (although it will not OOM when GPU memory is insufficient). Maybe the best approach would be to allow reads/writes on the universal vector based on workload? E.g. act directly on the universal vector if there are not many reads/writes, and duplicate the vector to the host if we need to perform batch processing. But this might be a bit too complicated.

@elalish
Owner

elalish commented May 16, 2022

And if I'm not mistaken, it's slower for smaller problems and faster for larger ones? Seems like a good trade to avoid OOM. I wonder if some of this is due to how/where the vector gets initialized?

@pca006132
Collaborator Author

Indeed, I guess this is the typical tradeoff we get when using a GPU. I think the page faults are due to how CUDA migrates unified memory (at least in the default mode): it migrates the memory to a device when that device touches it. I guess the page faults I got were due to frequent switching between GPU and CPU for small buffers, with a fault each time the unified memory has migrated to the other side. Will see if using the CPU for smaller problems can eliminate this issue.
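
To illustrate why the switching matters, here is a toy model of on-touch migration. It only counts events under the simplifying assumption that each page lives on exactly one processor and moves whenever the other one touches it; it is not how the CUDA driver is implemented, and all names in it are made up.

```cpp
#include <cstddef>
#include <vector>

enum class Processor { Cpu, Gpu };

// Toy model: each page has one owner, and an access from the other
// processor counts as a fault and moves the page to the accessor.
class ManagedBufferModel {
 public:
  ManagedBufferModel(std::size_t pages, Processor initial)
      : owner_(pages, initial) {}

  // Touch every page from `who`; returns the number of faults triggered.
  std::size_t TouchAll(Processor who) {
    std::size_t faults = 0;
    for (Processor& owner : owner_) {
      if (owner != who) {
        ++faults;
        owner = who;
      }
    }
    totalFaults_ += faults;
    return faults;
  }

  std::size_t TotalFaults() const { return totalFaults_; }

 private:
  std::vector<Processor> owner_;
  std::size_t totalFaults_ = 0;
};

// Run `steps` operations over a buffer that starts on the CPU. With
// alternate == true the operations switch between GPU and CPU (GPU first);
// otherwise they all stay on the CPU.
inline std::size_t CountFaults(std::size_t pages, int steps, bool alternate) {
  ManagedBufferModel buffer(pages, Processor::Cpu);
  for (int i = 0; i < steps; ++i) {
    const bool onGpu = alternate && (i % 2 == 0);
    buffer.TouchAll(onGpu ? Processor::Gpu : Processor::Cpu);
  }
  return buffer.TotalFaults();
}
```

In this model, `CountFaults(4, 6, true)` gives 24 faults (every page faults on every step), while `CountFaults(4, 6, false)` gives 0. Keeping small buffers on one side removes the faults entirely, which is the motivation for running small problems on the CPU.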

@pca006132
Collaborator Author

Did some profiling and performance tuning. By running small operations on the CPU and using cudaMemAdvise to ask CUDA to keep the unified memory on the CPU for small buffers, I got the following results:

nTri = 512, time = 0.00757743 sec
nTri = 2048, time = 0.0128708 sec
nTri = 8192, time = 0.0375996 sec
nTri = 32768, time = 0.0620162 sec
nTri = 131072, time = 0.266587 sec
nTri = 524288, time = 0.437117 sec
nTri = 2097152, time = 1.76358 sec
nTri = 8388608, time = 3.81932 sec

... which is still suboptimal. Profiling with nvprof suggests that initializing a universal vector calls uninitialized_fill_n on the GPU, which causes page faults and synchronization overhead and migrates the universal vector to the GPU, after which I have to move it back to the CPU... so nTri = 512 took 7ms.

GPU trace (grep page faults):

282.80ms  172.93us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                    24   0x7f5a72000000  [Unified Memory GPU page faults]
283.08ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x47f393   0x7f5a72000000  [Unified Memory CPU page faults]
283.34ms  62.143us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     1   0x7f5a72000000  [Unified Memory GPU page faults]
283.41ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x4975db   0x7f5a72000000  [Unified Memory CPU page faults]
283.61ms  52.448us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     1   0x7f5a72000000  [Unified Memory GPU page faults]
283.95ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x540b69   0x7f5a72001000  [Unified Memory CPU page faults]
284.04ms  86.623us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     1   0x7f5a72001000  [Unified Memory GPU page faults]
284.15ms         -                    -               -         -         -         -         -           -           -           -                -         -         -         PC 0xaac9b7db   0x7f5a72001000  [Unified Memory CPU page faults]
284.30ms  46.176us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     1   0x7f5a72001000  [Unified Memory GPU page faults]
284.55ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x4f18a0   0x7f5a72002000  [Unified Memory CPU page faults]
284.63ms  71.008us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                    21   0x7f5a72009000  [Unified Memory GPU page faults]
284.72ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x4f1cea   0x7f5a72009000  [Unified Memory CPU page faults]
284.80ms  38.943us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                    24   0x7f5a7200a000  [Unified Memory GPU page faults]
284.85ms  55.264us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                    27   0x7f5a72010000  [Unified Memory GPU page faults]
284.96ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x47f3bf   0x7f5a72010000  [Unified Memory CPU page faults]
285.02ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x47f3f1   0x7f5a7200b000  [Unified Memory CPU page faults]
285.22ms  55.648us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     9   0x7f5a72009000  [Unified Memory GPU page faults]
285.28ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x4c5dff   0x7f5a72009000  [Unified Memory CPU page faults]
285.36ms  46.368us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     9   0x7f5a72009000  [Unified Memory GPU page faults]
285.42ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x4b3450   0x7f5a7200a000  [Unified Memory CPU page faults]
285.53ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x4c6796   0x7f5a72002000  [Unified Memory CPU page faults]
285.58ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x4c6937   0x7f5a72005000  [Unified Memory CPU page faults]
285.63ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x4c67a8   0x7f5a72003000  [Unified Memory CPU page faults]
285.75ms  58.656us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     8   0x7f5a72010000  [Unified Memory GPU page faults]
285.87ms  48.800us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     2   0x7f5a7200a000  [Unified Memory GPU page faults]
285.94ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x4c6fc4   0x7f5a72010000  [Unified Memory CPU page faults]
285.98ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x4c703b   0x7f5a72011000  [Unified Memory CPU page faults]
286.00ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x4c703b   0x7f5a72014000  [Unified Memory CPU page faults]
286.02ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x4c6fc4   0x7f5a72016000  [Unified Memory CPU page faults]
286.15ms  59.295us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     2   0x7f5a7200b000  [Unified Memory GPU page faults]
286.25ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x480d42   0x7f5a7200b000  [Unified Memory CPU page faults]
286.32ms  73.055us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                    15   0x7f5a72017000  [Unified Memory GPU page faults]
286.41ms  56.928us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                    11   0x7f5a7200b000  [Unified Memory GPU page faults]
286.49ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x540b69   0x7f5a7200c000  [Unified Memory CPU page faults]
286.54ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x5429a0   0x7f5a72016000  [Unified Memory CPU page faults]
286.72ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x46c260   0x7f5a72006000  [Unified Memory CPU page faults]
286.82ms  81.759us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                    16   0x7f5a72010000  [Unified Memory GPU page faults]
286.90ms  18.048us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     6   0x7f5a7201d000  [Unified Memory GPU page faults]
286.92ms  16.480us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     2   0x7f5a72016000  [Unified Memory GPU page faults]
286.97ms  47.744us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     2   0x7f5a7200d000  [Unified Memory GPU page faults]
287.01ms  14.464us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     1   0x7f5a7200e000  [Unified Memory GPU page faults]
287.06ms  48.768us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     3   0x7f5a7200b000  [Unified Memory GPU page faults]
287.14ms  56.576us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                    34   0x7f5a72005000  [Unified Memory GPU page faults]
287.19ms  9.6000us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     2   0x7f5a72002000  [Unified Memory GPU page faults]
287.20ms  8.8960us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     1   0x7f5a72003000  [Unified Memory GPU page faults]
287.25ms  28.320us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     9   0x7f5a72017000  [Unified Memory GPU page faults]
287.27ms  28.576us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                    14   0x7f5a72018000  [Unified Memory GPU page faults]
287.34ms  8.8640us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     9   0x7f5a7200c000  [Unified Memory GPU page faults]
287.39ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x46b98d   0x7f5a72003000  [Unified Memory CPU page faults]
287.45ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x46b719   0x7f5a72004000  [Unified Memory CPU page faults]
287.50ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x47ff9d   0x7f5a7200e000  [Unified Memory CPU page faults]
287.55ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x47ffa6   0x7f5a7200f000  [Unified Memory CPU page faults]
287.57ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x542be5   0x7f5a72022000  [Unified Memory CPU page faults]
287.59ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x542c06   0x7f5a72023000  [Unified Memory CPU page faults]
287.66ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x48785b   0x7f5a72010000  [Unified Memory CPU page faults]
287.67ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x487843   0x7f5a7202a000  [Unified Memory CPU page faults]
287.69ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x48785b   0x7f5a72011000  [Unified Memory CPU page faults]
287.70ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x487866   0x7f5a7202b000  [Unified Memory CPU page faults]
287.72ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x48785b   0x7f5a72014000  [Unified Memory CPU page faults]
287.77ms  15.456us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                    26   0x7f5a7202f000  [Unified Memory GPU page faults]
287.82ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x481367   0x7f5a72030000  [Unified Memory CPU page faults]
287.86ms  10.400us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     8   0x7f5a72033000  [Unified Memory GPU page faults]
287.92ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x5433eb   0x7f5a72029000  [Unified Memory CPU page faults]
287.94ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x5437a8   0x7f5a72036000  [Unified Memory CPU page faults]
287.98ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x5437f1   0x7f5a72037000  [Unified Memory CPU page faults]
288.05ms  14.720us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     8   0x7f5a72010000  [Unified Memory GPU page faults]
288.09ms  15.648us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     4   0x7f5a72036000  [Unified Memory GPU page faults]
288.11ms  7.1030us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     6   0x7f5a72037000  [Unified Memory GPU page faults]
288.13ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x46e280   0x7f5a72010000  [Unified Memory CPU page faults]
288.30ms  9.9200us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     4   0x7f5a72010000  [Unified Memory GPU page faults]
288.42ms  49.887us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                    20   0x7f5a7202b000  [Unified Memory GPU page faults]
288.47ms  12.384us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     2   0x7f5a7202a000  [Unified Memory GPU page faults]
288.50ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x487843   0x7f5a7202a000  [Unified Memory CPU page faults]
288.52ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x487866   0x7f5a7202b000  [Unified Memory CPU page faults]
288.57ms  54.912us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                    24   0x7f5a72030000  [Unified Memory GPU page faults]
288.65ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x481367   0x7f5a72030000  [Unified Memory CPU page faults]
288.72ms  56.352us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                    30   0x7f5a72038000  [Unified Memory GPU page faults]
288.81ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x543797   0x7f5a72034000  [Unified Memory CPU page faults]
288.91ms  47.200us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     8   0x7f5a72034000  [Unified Memory GPU page faults]
289.00ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x46e280   0x7f5a7203a000  [Unified Memory CPU page faults]
289.16ms  46.624us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     4   0x7f5a7203a000  [Unified Memory GPU page faults]
289.36ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x543c23   0x7f5a72035000  [Unified Memory CPU page faults]
289.41ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x543c23   0x7f5a72036000  [Unified Memory CPU page faults]
289.66ms  51.871us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     4   0x7f5a72035000  [Unified Memory GPU page faults]
289.71ms  9.9210us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     1   0x7f5a72036000  [Unified Memory GPU page faults]
289.87ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x543c23   0x7f5a72038000  [Unified Memory CPU page faults]
290.05ms  51.615us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     4   0x7f5a72038000  [Unified Memory GPU page faults]
290.19ms  57.823us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                    18   0x7f5a7202b000  [Unified Memory GPU page faults]
290.25ms  10.144us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     4   0x7f5a7202a000  [Unified Memory GPU page faults]
290.29ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x50e25e   0x7f5a7202a000  [Unified Memory CPU page faults]
290.35ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x50e286   0x7f5a7202d000  [Unified Memory CPU page faults]
290.40ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x50e2aa   0x7f5a7202e000  [Unified Memory CPU page faults]
290.51ms  44.287us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     2   0x7f5a7202a000  [Unified Memory GPU page faults]
290.84ms  47.808us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     8   0x7f5a72030000  [Unified Memory GPU page faults]
290.92ms  39.999us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                    27   0x7f5a72040000  [Unified Memory GPU page faults]
290.97ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x50e862   0x7f5a72030000  [Unified Memory CPU page faults]
291.00ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x50eaeb   0x7f5a7203b000  [Unified Memory CPU page faults]
291.21ms  69.056us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                    75   0x7f5a7203b000  [Unified Memory GPU page faults]
291.27ms  21.087us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                    20   0x7f5a72030000  [Unified Memory GPU page faults]
292.38ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x4fa31b   0x7f5a72032000  [Unified Memory CPU page faults]
292.43ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x4fa31e   0x7f5a72039000  [Unified Memory CPU page faults]
292.64ms  39.551us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                    18   0x7f5a72039000  [Unified Memory GPU page faults]
294.17ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x5204df   0x7f5a72042000  [Unified Memory CPU page faults]
294.35ms  52.224us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                    20   0x7f5a72047000  [Unified Memory GPU page faults]
294.49ms  50.176us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     9   0x7f5a72032000  [Unified Memory GPU page faults]
294.57ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x531d6f   0x7f5a72006000  [Unified Memory CPU page faults]
294.62ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x531e1e   0x7f5a72032000  [Unified Memory CPU page faults]
294.68ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x5320fa   0x7f5a72048000  [Unified Memory CPU page faults]
294.72ms  67.648us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                    10   0x7f5a72048000  [Unified Memory GPU page faults]
294.89ms  36.896us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                    13   0x7f5a72032000  [Unified Memory GPU page faults]
294.98ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x53b8b9   0x7f5a7204a000  [Unified Memory CPU page faults]
295.03ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x53b824   0x7f5a72048000  [Unified Memory CPU page faults]
295.26ms  60.448us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     8   0x7f5a7204b000  [Unified Memory GPU page faults]
295.46ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x47f43e   0x7f5a7204c000  [Unified Memory CPU page faults]
295.49ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x47f3a0   0x7f5a72050000  [Unified Memory CPU page faults]
296.00ms  74.271us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                    21   0x7f5a72054000  [Unified Memory GPU page faults]
296.10ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x4c6796   0x7f5a72054000  [Unified Memory CPU page faults]
296.27ms  71.488us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                    25   0x7f5a72059000  [Unified Memory GPU page faults]
296.60ms  51.072us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                    32   0x7f5a72060000  [Unified Memory GPU page faults]
296.65ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x4c6fc4   0x7f5a72058000  [Unified Memory CPU page faults]
296.71ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x4c6f7d   0x7f5a72051000  [Unified Memory CPU page faults]
296.73ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x4c6fc4   0x7f5a7205d000  [Unified Memory CPU page faults]
296.75ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x4c6fc4   0x7f5a72060000  [Unified Memory CPU page faults]
296.96ms  78.880us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                    14   0x7f5a7204c000  [Unified Memory GPU page faults]
297.09ms  47.039us                    -               -         -         -         -         -           -           -           -  NVIDIA GeForce          -         -                     9   0x7f5a72041000  [Unified Memory GPU page faults]
297.19ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x540b69   0x7f5a72041000  [Unified Memory CPU page faults]
297.22ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x542998   0x7f5a72054000  [Unified Memory CPU page faults]
297.27ms         -                    -               -         -         -         -         -           -           -           -                -         -         -           PC 0x542998   0x7f5a72055000  [Unified Memory CPU page faults]

Actually, for our use case, we just need a fixed-size universal memory buffer. I wonder if we should just roll our own vector for this.

@elalish
Owner

elalish commented May 17, 2022

Well, the longest run time is still coming down, so that's a good sign at least. You're welcome to roll your own; you seem pretty deep in the guts of CUDA already. Perhaps having beginD and beginH would be useful for indicating the prefetch direction? I trust you to find a good solution.

@pca006132
Collaborator Author

Yeah, I did not plan to dig into such implementation details 2 days ago. Prefetching cannot help in this case because the initializer of the universal vector does an uninitialized fill, which causes GPU page faults and subsequent CPU page faults due to memory migration. I cannot insert a prefetch before the uninitialized fill, so I can't really deal with that.

I think I should probably file an issue with thrust about this as well: universal vectors are supposed to be used on both the CPU and the GPU, and this will clearly cause performance problems in that case (mainly for small vectors, I guess).
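To make the idea of a hand-rolled buffer concrete, here is a minimal sketch of a fixed-size buffer that allocates without touching its elements, so a prefetch hint can be issued before the first access. Everything in it is an assumption for illustration: the name `ManagedBuffer`, the `Prefetch(bool)` interface (loosely following the beginD/beginH suggestion above), and the `malloc` fallback for builds without CUDA. It is not the code from the eventual PR, and the `cudaMemPrefetchAsync` call uses the four-argument form from CUDA 11/12.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <type_traits>
#ifdef __CUDACC__
#include <cuda_runtime.h>
#endif

// Hypothetical fixed-size buffer; not the Manifold or Thrust implementation.
template <typename T>
class ManagedBuffer {
  // Elements are never constructed, so only allow trivially copyable types.
  static_assert(std::is_trivially_copyable<T>::value,
                "ManagedBuffer leaves its elements uninitialized");

 public:
  explicit ManagedBuffer(std::size_t size) : size_(size) {
    if (size_ == 0) return;
#ifdef __CUDACC__
    // Unified memory: one pointer that is valid on both host and device.
    void* raw = nullptr;
    if (cudaMallocManaged(&raw, size_ * sizeof(T)) != cudaSuccess)
      throw std::bad_alloc();
    ptr_ = static_cast<T*>(raw);
#else
    // Host-only fallback so the sketch also builds without CUDA.
    ptr_ = static_cast<T*>(std::malloc(size_ * sizeof(T)));
    if (ptr_ == nullptr) throw std::bad_alloc();
#endif
    // Deliberately no fill here: no page is touched until the caller
    // has had a chance to call Prefetch().
  }

  ~ManagedBuffer() {
#ifdef __CUDACC__
    cudaFree(ptr_);
#else
    std::free(ptr_);
#endif
  }

  ManagedBuffer(const ManagedBuffer&) = delete;
  ManagedBuffer& operator=(const ManagedBuffer&) = delete;

  // Hint where the data will be used next; a no-op without CUDA.
  void Prefetch(bool toDevice) const {
#ifdef __CUDACC__
    if (ptr_ == nullptr) return;
    int device = cudaCpuDeviceId;
    if (toDevice) cudaGetDevice(&device);
    // Four-argument form (default stream), as in CUDA 11/12.
    cudaMemPrefetchAsync(ptr_, size_ * sizeof(T), device);
#else
    (void)toDevice;
#endif
  }

  T* data() { return ptr_; }
  const T* data() const { return ptr_; }
  std::size_t size() const { return size_; }
  T& operator[](std::size_t i) { return ptr_[i]; }
  const T& operator[](std::size_t i) const { return ptr_[i]; }

 private:
  T* ptr_ = nullptr;
  std::size_t size_ = 0;
};
```

A caller would construct the buffer, call `Prefetch(true)` before launching a kernel on it or `Prefetch(false)` before reading it on the host, and only then touch the elements. Whether this actually avoids the page faults in the trace above would need to be measured.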

@elalish
Owner

elalish commented May 17, 2022

Sounds good. Anyway, you've averted a bunch of GPU OOM problems, while cutting down the run time of large problems. Even at the cost of some reduced performance on small problems, this still seems good to merge. What do you think?

@pca006132
Collaborator Author

Yes, it should be good to merge. I think I can work on the performance for small problems in later PRs.

@pca006132
Collaborator Author

FYI: using a custom vector implementation gets the times to something like this:

nTri = 512, time = 0.00285064 sec
nTri = 2048, time = 0.00649843 sec
nTri = 8192, time = 0.00892476 sec
nTri = 32768, time = 0.0603074 sec
nTri = 131072, time = 0.124639 sec
nTri = 524288, time = 0.383436 sec
nTri = 2097152, time = 1.26768 sec
nTri = 8388608, time = 5.03177 sec

Not sure why complicated problems are now slower; I guess this is probably related to how I do prefetching (I did not fine-tune it, and it seems pretty hard to fine-tune anyway).

However, there are quite a few test failures. They already failed when using the universal vector, so I guess there is some problem in the implementation of the dynamic backend feature. I will try to fix that and submit a PR that includes these two features (custom vector + dynamic backend).

@elalish
Owner

elalish commented May 17, 2022

That still looks great, thanks! Might it help to do just the custom vector first and then do the dynamic backend as a follow-on? Always nice to break up the PRs if we can, especially if the later one is triggering the debug.

@pca006132
Collaborator Author

That result is with the dynamic backend enabled; I have to try putting it back on master and see if it works and still performs this well. The problem is that the heuristic I wrote to decide where to put the vector depends on the dynamic backend parameters: if the vector size is less than X, put it on the CPU, and on the GPU otherwise. I will clean it up and open a PR tomorrow if everything works fine.
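The size-based rule described above could be sketched roughly as follows. The names `Placement`, `ChoosePlacement`, and `kDeviceThreshold` are hypothetical, and the threshold value is a placeholder: the comment only calls it "X" and ties it to the dynamic backend parameters, so the real value would have to come from benchmarking.

```cpp
#include <cstddef>

// Where a vector's memory should live for the next operation.
enum class Placement { Host, Device };

// Placeholder for the "X" mentioned above; NOT a value taken from the thread.
constexpr std::size_t kDeviceThreshold = std::size_t{1} << 16;

// Small vectors stay on the CPU, larger ones go to the GPU.
// Without a usable device, everything stays on the host.
inline Placement ChoosePlacement(std::size_t numElements, bool deviceAvailable,
                                 std::size_t threshold = kDeviceThreshold) {
  if (!deviceAvailable || numElements < threshold) return Placement::Host;
  return Placement::Device;
}
```

Keeping the threshold as a parameter makes it easy to feed in whatever cutoff the dynamic backend already uses, so the vector placement and the choice of execution backend do not drift apart.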

@pca006132
Collaborator Author

Tuning prefetch is harder than I thought... the results are not very consistent. I guess I will just leave it as is and work on other items first.
