Fuse gpu_hist all-reduce calls where possible #7867
Conversation
Out of curiosity, why is a contiguous histogram required? We could simply use an NCCL group call, right?
I think it's possible, but the number of nodes can grow very large, and this seems likely to break NCCL.
The upper limit is 2048, as documented in https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/groups.html (last paragraph). I think one of the problems we are trying to solve by fusing nodes is deep trees, so it might be desirable to reuse histograms to avoid allocation.
If I try to reuse memory, I need to write some kind of mutex so that histograms can't be recycled while they are being used, and the memory management gets more complicated. We would also have to zero the memory one kernel at a time if we take non-contiguous memory. I will think about it, but I believe there are still benefits to doing it this way.
EDIT: The results below are actually for 1 GPU. The performance regression for epsilon also reversed after running again, so it is just noise.

Gbm-bench results below: minor improvements in a few places, some regression for epsilon/bosch. [Benchmark tables for max_depth=8 and max_depth=16 omitted.]
Any regression might be a result of not recycling memory and of fewer uses of the subtraction trick. Investigating further.
Results for depth=16 with 8 GPUs this time.

Examining the output in debug mode, we can see that the number of all-reduce calls for Bosch has decreased by 10x.
The results look exciting! Please let me know your plan regarding this optimization. Do we want to merge even if there is a regression? Do we want a better design for the histogram allocator? Do we want to implement a static fuse size (8 or 16, using stack memory) or use the heap for bookkeeping? Just curious; no need to provide a concrete plan if it hasn't been decided yet. Please let me know when it's ready for review, or mark it as draft/WIP.
There is no actual regression for a single GPU; after I ran it again, it was just variance. I actually prefer the current design due to its simplicity: not much is gained by recycling nodes, and tracking the status of each memory location becomes more complicated when dealing with batches. So my preference is to go ahead with this, and it is ready for review. I think static fusing can work very well, in particular for updating position and evaluating splits, so those are my next goals.
The change looks good to me. It would be great if you could investigate the perf change of epsilon at depth 16 in future PRs. The variance seems significant, but I assume it's a special case, and overall performance is more important.
This PR changes the driver class slightly to handle identification of invalid nodes.
'gpu_hist' is changed to iterate over batches of nodes for each operation (e.g. apply split, update position).
All-reduce may then be called on batches of nodes at the layer level.
The histogram memory allocator is redesigned to allocate contiguous memory for the current batch of nodes. It can no longer recycle old memory; instead, it simply stops caching allocations once it reaches a maximum size (finding and reusing memory blocks is too difficult now that they have variable sizes).
Future optimisations can be made to fuse more node operations together.