thrust universal vector #120
Just googled for unified memory support in Thrust. It seems they have something called universal_vector which can be accessed from both the host and the device. I guess we could use this instead of VecDH, giving cleaner code and support for complex modeling on GPUs that only have a small amount of RAM? And perhaps we could choose to run operations on the host or the device depending on the workload, without much code duplication. I haven't tried it yet though, so I'm not sure about its performance or whether there are any quirks.
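For context, a minimal sketch of what using thrust::universal_vector could look like. This is not code from this repo; it assumes a Thrust recent enough to ship thrust/universal_vector.h.

```cpp
// Minimal sketch: the same managed buffer is handed to a device algorithm
// and then read directly on the host, with no explicit host/device copies.
#include <thrust/universal_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  thrust::universal_vector<int> v(4, 1);     // backed by managed (unified) memory
  thrust::transform(v.begin(), v.end(), v.begin(),
                    thrust::negate<int>());  // dispatched to the device backend
  cudaDeviceSynchronize();                   // be explicit before touching it on the host
  int first = v[0];                          // plain host read, no copy-back
  printf("first element: %d\n", first);
  return 0;
}
```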
Ooh, that does look promising! I had always thought of CUB as being CUDA-specific and so not usable with other backends, but from the description that looks like just what we need. Thrust implements its CUDA backend with CUB, so I trust it to be as performant as possible.
I did a simple patch to try universal vector; perhaps my implementation is incorrect, but it seems the performance is not great (actually a lot slower for CUDA). You can have a look at it here: pca006132@933eade. For the CPP backend there is a bit of a performance improvement, probably due to less memory copying. For the OMP backend the performance is worse, and I don't know why. For the TBB backend there is also a bit of a performance improvement (and it is better than the OMP backend). Apart from the performance changes, it uses less memory, as there is only one vector now.
Did some debugging on this; it seems the cause is a large number of GPU page faults, which is what makes it slow (https://stackoverflow.com/questions/39782746/why-is-nvidia-pascal-gpus-slow-on-running-cuda-kernels-when-using-cudamallocmana/40011988#40011988). I tried prefetching but it did not help; perhaps I'm getting the wrong pointer or something. I think I will just leave it for later.
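A hedged sketch of what the prefetch attempt looks like, assuming the data lives in a thrust::universal_vector<float>; the helper name here is made up.

```cpp
// Sketch: hint the driver to migrate the managed pages to the GPU up front,
// instead of taking a page fault per touched page inside the kernel.
#include <thrust/universal_vector.h>
#include <thrust/device_ptr.h>
#include <cuda_runtime.h>

void prefetchToDevice(thrust::universal_vector<float>& v, cudaStream_t stream = 0) {
  int device = 0;
  cudaGetDevice(&device);
  // raw_pointer_cast unwraps Thrust's fancy pointer to the raw managed pointer.
  cudaMemPrefetchAsync(thrust::raw_pointer_cast(v.data()),
                       v.size() * sizeof(float), device, stream);
  // Pass cudaCpuDeviceId as the destination instead to pull the pages back to the host.
}
```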
Filed an issue: NVIDIA/cccl#809
Tried to work around this problem by using a little cache and mostly succeeded: most of the tests pass on CUDA except the knot example, with a significant performance improvement, and perfTest no longer causes OOM on my machine. There is also some performance improvement for the CPP and OMP backends, as well as lower memory use. All tests pass for the other backends. However, I have no idea what the problem with the knot example is. I can clean up the changes and open a PR if you are interested, or I can wait for Thrust to fix the performance issue so no workaround is needed.
Definitely interested! And let's not wait for thrust; that could easily be a long wait.
This also allows us to do a dynamic backend: we can choose to run an algorithm on the host or on the device, and can fall back to the CPU when there is no GPU on the target machine (without recompiling with another backend; not yet tested). A prototype implementation is here: https://github.com/pca006132/manifold/blob/dynamic-backend/utilities/include/par.h I got some performance improvement for CUDA by running small operations on the host, although not much. Slowness due to slow allocation of managed memory is a huge problem.
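A rough sketch of the dynamic-backend idea; the real prototype is in the par.h linked above, and the names, threshold, and dispatch below are invented for illustration.

```cpp
// Sketch: with managed memory the same pointers are valid on both sides, so the
// backend can be chosen at runtime instead of at compile time.
#include <thrust/execution_policy.h>
#include <thrust/for_each.h>
#include <cstddef>

enum class Backend { Host, Device };

inline Backend pickBackend(size_t n, bool hasGpu) {
  constexpr size_t kGpuThreshold = 1 << 16;  // assumed cutoff, needs tuning
  return (hasGpu && n > kGpuThreshold) ? Backend::Device : Backend::Host;
}

// first/last are assumed to point into managed (universal) memory.
// F must be callable on both host and device (e.g. a __host__ __device__ functor).
template <typename T, typename F>
void for_each_dyn(T* first, T* last, F f, bool hasGpu) {
  if (pickBackend(static_cast<size_t>(last - first), hasGpu) == Backend::Device)
    thrust::for_each(thrust::device, first, last, f);  // run on the GPU backend
  else
    thrust::for_each(thrust::host, first, last, f);    // stay on the CPU
}
```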
Wow, that's pretty slick! I really like the idea of one compiled version automatically working for both CPU and GPU. Can you show a few numbers regarding the performance hit you're seeing from managed memory?
And the tests run significantly slower. I tried profiling it; it seems there are a lot of page faults. malloc is actually very quick. IIRC the suggested workaround for this is prefetching, but it might not be easy to do without making the code very complicated.
I'm wondering whether we should use the old design, i.e. having a host and a device vector, but use a universal vector in place of the device vector. That way we get the benefit of being able to run an algorithm on the host or on the device depending on the workload (as the device vector is now universal), but we don't suffer page faults, because most host operations go through the host vector. I think I will try to implement that and compare the performance.
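To illustrate the shape of that design, here is a hedged sketch: a plain host vector plus a universal vector standing in for the old device vector, synced lazily. Every name and detail here is invented for illustration; it is not the actual VecDH code.

```cpp
#include <thrust/universal_vector.h>
#include <thrust/copy.h>
#include <vector>
#include <cstddef>

template <typename T>
class VecDH_sketch {
 public:
  explicit VecDH_sketch(size_t n) : host_(n), device_(n) {}

  // Host-heavy passes work on ordinary pageable memory: no GPU page faults.
  std::vector<T>& H() {
    if (!hostValid_) {
      thrust::copy(device_.begin(), device_.end(), host_.begin());
      hostValid_ = true;
    }
    deviceValid_ = false;  // caller may write through this reference
    return host_;
  }

  // Device passes get managed memory, so they can spill past GPU RAM without OOM.
  thrust::universal_vector<T>& D() {
    if (!deviceValid_) {
      thrust::copy(host_.begin(), host_.end(), device_.begin());
      deviceValid_ = true;
    }
    hostValid_ = false;
    return device_;
  }

 private:
  std::vector<T> host_;
  thrust::universal_vector<T> device_;
  bool hostValid_ = true, deviceValid_ = true;
};
```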
Using the old design, the performance is something like this:
The performance for CPP and OMP is similar: a bit worse than before. There are a lot fewer page faults, but the performance is slightly worse than the original version (although it will not OOM when GPU memory is insufficient). Maybe the best approach would be to decide how to access the universal vector based on the workload? E.g. act directly on the universal vector if there is not much reading/writing, and duplicate the vector to the host if we need to perform batch processing. But this might be a bit too complicated.
And if I'm not mistaken, it's slower for smaller problems and faster for larger ones? Seems like a good trade to avoid OOM. I wonder if some of this is due to how/where the vector gets initialized?
Indeed, I guess this is the typical tradeoff we get when using a GPU. I think the page faults are due to how CUDA migrates unified memory (at least in the default mode): it migrates pages to a device when that device touches the memory. I guess the page faults I saw came from frequently switching between GPU and CPU for small buffers, each access faulting after the unified memory had migrated to the other device. Will see if using the CPU for smaller problems can eliminate this issue.
Did some profiling and performance tuning, by making small operations run on the CPU + using
... which is still suboptimal. Did some profiling with GPU trace (
Actually, for our use case, we just need a fixed-size universal memory buffer. Wondering if we should just roll our own vector for this.
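A hand-rolled fixed-size managed buffer could be as small as this. A sketch only, with invented names, using plain CUDA managed-memory calls; error handling is omitted.

```cpp
// Sketch: one cudaMallocManaged allocation, no resize, and an explicit
// prefetch hook instead of relying purely on demand paging.
#include <cuda_runtime.h>
#include <cstddef>

template <typename T>
class ManagedBuffer {
 public:
  explicit ManagedBuffer(size_t n) : size_(n) {
    cudaMallocManaged(&data_, n * sizeof(T));  // accessible from host and device
  }
  ~ManagedBuffer() { cudaFree(data_); }
  ManagedBuffer(const ManagedBuffer&) = delete;
  ManagedBuffer& operator=(const ManagedBuffer&) = delete;

  T* data() { return data_; }
  size_t size() const { return size_; }

  // Hint the driver to migrate the pages before a kernel (or the CPU, via
  // cudaCpuDeviceId) touches them.
  void prefetch(int device, cudaStream_t stream = 0) const {
    cudaMemPrefetchAsync(data_, size_ * sizeof(T), device, stream);
  }

 private:
  T* data_ = nullptr;
  size_t size_ = 0;
};
```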
Well, the longest run time is still coming down, so that's a good sign at least. You're welcome to roll your own; you seem pretty deep in the guts of CUDA already. Perhaps having
Yeah, I did not plan to dig into such implementation details 2 days ago. Prefetch cannot help in this case because the initializer of the universal vector does an uninitialized fill, and that causes GPU page faults and subsequent CPU page faults due to memory migration. I cannot insert a prefetch before the uninitialized fill, so I can't really deal with that. I think I should probably file an issue with Thrust about this as well: universal vectors are supposed to be usable on both the CPU and GPU, and this will clearly cause performance problems in that case (mainly for small vectors, I guess).
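One possible workaround, sketched here after the pattern of Thrust's uninitialized_vector.cu example and assuming universal_vector accepts an allocator parameter the way device_vector does: an allocator whose construct() is a no-op, so constructing a sized vector skips the device-side fill entirely.

```cpp
// Sketch: skip value-initialization on construction so no device fill runs,
// and therefore no pages get faulted onto the GPU before first real use.
#include <thrust/universal_vector.h>

template <typename T>
struct no_init_allocator : thrust::universal_allocator<T> {
  // No-op construct: the caller promises to fill the buffer before reading it.
  __host__ __device__ void construct(T*) {}
};

template <typename T>
using uninitialized_universal_vector =
    thrust::universal_vector<T, no_init_allocator<T>>;

// uninitialized_universal_vector<float> buf(n);  // sized, but no fill kernel
```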
Sounds good. Anyway, you've averted a bunch of GPU OOM problems, while cutting down the run time of large problems. Even at the cost of some reduced performance on small problems, this still seems good to merge. What do you think?
Yes, it should be good to merge. I think I can work on the performance for small problems in later PRs.
FYI: using a custom vector implementation gets the time to something like this:
Not sure why complicated problems are now slower; I guess this is probably related to how I do prefetching (I did not fine-tune it, and it seems pretty hard to fine-tune anyway). However, there are quite a few test failures. They already failed when using the universal vector, so I guess there is some problem in the implementation of the dynamic backend feature. I will try to fix that and submit a PR that includes these two features (custom vector + dynamic backend).
That still looks great, thanks! Might it help to do just the custom vector first and then do the dynamic backend as a follow-on? Always nice to break up the PRs if we can, especially if the later one is the one triggering the debugging.
That result is with the dynamic backend enabled; I have to try putting it back onto master and see if it works and still performs this well. The problem is that the heuristic I wrote to determine where to put the vector depends on the dynamic backend parameters: if the vector size is less than X, put it on the CPU, and on the GPU otherwise. Will clean it up and open a PR tomorrow if everything works fine.
Tuning prefetch is harder than I thought... the results are not very consistent. I guess I will just leave it as is and work on other items first.