
[RFC] Remove support for single process multi-GPU #4531

Closed
RAMitchell opened this issue Jun 3, 2019 · 17 comments · Fixed by #4773

Comments

@RAMitchell
Member

This issue proposes removing support for single process multi-GPU in xgboost. What this means is that we would no longer support the 'n_gpus' parameter and the ability to train with multiple GPUs without some distributed framework.

We would focus on multi-GPU support through distributed frameworks such as Dask (see #4473) and Spark, where the framework would be expected to assign a worker process for each GPU independently.

The reasoning is that it would greatly simplify all GPU code in xgboost. Currently GPU code works by accepting an input DMatrix and then manually partitioning rows across each available GPU. This partitioning as well as managing multiple GPUs via threads imposes significant complexity on developers. The result is more code that is harder to read, more bug prone (we have many issues documenting this) and less flexible when developing new features. It is difficult to overstate how beneficial this would be for developers currently working on GPU code.

By doing this we are effectively deferring the task of partitioning data for a single machine with multiple GPUs onto the higher level distributed framework.
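To make the deferral concrete, here is a rough sketch (not part of the proposal itself) of what a one-process-per-GPU workflow could look like with Dask, assuming dask_cuda's LocalCUDACluster and the xgboost.dask interface being developed in #4473; exact names may differ from what finally ships:

```python
# Hypothetical sketch: Dask spawns one worker process per GPU and handles data partitioning;
# each worker then runs ordinary single-GPU xgboost training.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster  # one worker process per visible GPU
import dask.array as da
from xgboost import dask as dxgb

cluster = LocalCUDACluster()   # partitioning across GPUs becomes the framework's job
client = Client(cluster)

X = da.random.random((100_000, 50), chunks=(10_000, 50))
y = da.random.random(100_000, chunks=10_000)

dtrain = dxgb.DaskDMatrix(client, X, y)
output = dxgb.train(client,
                    {"tree_method": "gpu_hist"},  # each worker trains on its own GPU
                    dtrain,
                    num_boost_round=100)
booster = output["booster"]
```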

In the short term this will be a direct removal of an existing feature, but I believe in the medium term (1 year) it will result in a significantly better product.

We would provide a path for users to transition to the new multi-GPU support. This will also be better for users in the long run, as these distributed solutions are arbitrarily scalable. Between deciding to deprecate this functionality and removing the code, we would wait at least three months.

This is still very much up for debate and feedback is welcome from developers and users, particularly those who currently use multi-GPU support as a part of their workflow.

@TimZaman

TimZaman commented Jun 3, 2019

Yes! This is one of the reasons why horovod got so popular in DL land. The single-process story of TensorFlow was too complicated, and didn't easily translate to multi-node either. Nowadays most multi-GPU code I see uses horovod or torch.distributed. Both follow the multiprocess, single-GPU-per-process paradigm. This also naturally scales to multiple nodes without any code editing and gives the user (and dev) a simpler multi-GPU programming model.
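For illustration (not from the original comment), the per-process GPU assignment described above typically looks something like this with torch.distributed, assuming a torchrun-style launcher that sets LOCAL_RANK:

```python
# Hypothetical sketch of the multiprocess, single-GPU-per-process pattern.
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher for each spawned process
    torch.cuda.set_device(local_rank)           # this process owns exactly one GPU
    dist.init_process_group(backend="nccl")     # NCCL handles inter-GPU/inter-node communication
    # ... build the model, wrap it in DistributedDataParallel, train as usual ...

if __name__ == "__main__":
    main()
```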

@trivialfis
Member

trivialfis commented Jun 3, 2019

I like the idea of decoupling data distribution from the algorithm. But I insist that this is better done in C++ rather than in high-level libraries like Dask; there are R and Julia packages that don't attract as much attention as the others. I want the least possible maintenance burden added to language bindings. Thick bindings are much harder to maintain than the current, slightly bloated internal code. Also, both the Java and Scala bindings are currently maintained by @CodingCat; we might need to join the development, or at least be part of the review process, if language-dependent external packages are used.

@trivialfis
Member

@TimZaman I thought the way TF distributes subgraphs was quite elegant. Seems I'm a little behind on this. :)

@datametrician

Agree with @TimZaman. RAPIDS/cuML is also following a one-process-per-GPU (OPG) paradigm, and we feel it's much simpler to develop and maintain (no hard evidence, just our initial development experience compared to single-process multi-GPU).

@pseudotensor
Contributor

Pros:

  1. xgboost becomes simpler for developers.
  2. xgboost multi-GPU currently performs poorly and would need significant attention to become beneficial.

Cons:

  1. No longer able to easily run a single model on multiple GPUs. Users would be required to use Dask or other third-party libraries, so xgboost by itself becomes strongly dependent on third-party libraries for functionality it already had.

  2. No longer able to train a single model on big data. But xgboost's memory usage is extreme, and lightgbm offers a vastly better memory footprint, which makes multi-GPU xgboost not especially useful anyway.

I would have wanted xgboost multi-GPU to work very well, using NVLink or whatever. But if that's not going to happen, then there's not much point in keeping the multi-GPU capability in xgboost.

@RAMitchell
Member Author

> I like the idea of decoupling data distribution from the algorithm. But I insist that this is better done in C++ rather than in high-level libraries like Dask; there are R and Julia packages that don't attract as much attention as the others. I want the least possible maintenance burden added to language bindings. Thick bindings are much harder to maintain than the current, slightly bloated internal code. Also, both the Java and Scala bindings are currently maintained by @CodingCat; we might need to join the development, or at least be part of the review process, if language-dependent external packages are used.

@trivialfis I don't see how to do this. Consider that HostDeviceVector is used everywhere throughout our API; the purpose of this RFC is to remove the complexity of multi-GPU data partitioning from data structures like this.

I look at this as a question of scope: what functionality are we, as developers with limited resources, prepared to provide? It's very tempting to try to do everything, but I think many of the features we currently have for GPU are not implemented well enough for my liking, and the quality of the code base is not at the level I want it to be. You are correct that we will not provide out-of-the-box functionality to use multiple GPUs from R and Julia, but @pseudotensor is also right that the multi-GPU solution is not particularly compelling vs. single GPU.

I would personally be satisfied if we produced a high-quality building block that others could use to implement a distributed machine learning system.

@trivialfis
Member

@RAMitchell One problem at a time.

  • So there's a need to reduce memory usage, for both dense and sparse data.
  • Also, multiple GPUs will have different memory footprints.
  • Then we need to think of a solution for robust, scalable data distribution.

Let's divide and conquer. :)

@sh1ng
Contributor

sh1ng commented Jun 5, 2019

I'd start with a design doc and a POC showing how to parallelize the algorithm without introducing too much overhead. Correct me if I'm wrong, but finding the best splits requires (2^LEVEL)*N_FEATURES comparisons if a tree's node doesn't get partitioned across a few devices, and (2^LEVEL)*N_FEATURES*N_BINS if it does. Those operations will require cross-process communication that may introduce some overhead.
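To get a feel for those counts, a quick back-of-the-envelope calculation with made-up values (LEVEL, N_FEATURES and N_BINS here are illustrative only):

```python
# Plugging illustrative numbers into the estimate above.
LEVEL, N_FEATURES, N_BINS = 6, 100, 256

nodes = 2 ** LEVEL                          # nodes on this tree level
local_only = nodes * N_FEATURES             # node kept on one device: compare best splits only
partitioned = nodes * N_FEATURES * N_BINS   # node split across devices: merge per-bin histograms

print(local_only, partitioned)              # 6,400 vs 1,638,400
```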

In the end, if it scales well enough, that outweighs downsides like an extra third-party library dependency.

@RAMitchell
Member Author

Communication throughput via NCCL is the same across processes as it is for the threaded version. Running the algorithm in parallel is essentially a question of starting a rabit tracker and connecting workers to it. We can certainly document this more to enable others to potentially incorporate this into their distributed framework. I think some kind of blog post explaining the distributed architecture would be helpful for a lot of people.
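As a rough, hypothetical illustration of "starting a rabit tracker and connecting workers to it", the worker side might look roughly like this; the tracker setup, the DMLC_* argument names, and the exact rabit Python API are assumptions based on the dmlc tracker conventions of the time, not a documented recipe:

```python
# Hypothetical worker-side sketch: each process drives one GPU and joins a rabit tracker.
import xgboost as xgb

def run_worker(rank, n_workers, tracker_host, tracker_port, dtrain):
    rabit_args = [
        f"DMLC_TRACKER_URI={tracker_host}".encode(),
        f"DMLC_TRACKER_PORT={tracker_port}".encode(),
        f"DMLC_NUM_WORKER={n_workers}".encode(),
    ]
    xgb.rabit.init(rabit_args)  # connect this worker to the tracker
    params = {"tree_method": "gpu_hist", "gpu_id": rank}  # one GPU per worker process
    booster = xgb.train(params, dtrain, num_boost_round=100)
    xgb.rabit.finalize()
    return booster
```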

@arnocandel

I’m ok with this path - robustness wins over features.

@RAMitchell
Member Author

I think we have collected enough feedback without significant objections from customers. I will file a PR enabling deprecation warnings and updating documentation. We will aim to start removing code around September.

@trivialfis
Member

@RAMitchell Is this line a bug?

LOG(FATAL) << "Distributed training is not available with GPU algoritms";

@RAMitchell
Member Author

I've never seen this error message, and we even have distributed GPU tests now, so it must not be reaching this code block.

@RAMitchell
Member Author

The documentation has been updated and deprecation warnings will now be issued. I will leave this issue open until we start removing this feature in case of any further feedback.

@rongou
Contributor

rongou commented Aug 7, 2019

Looks like half the CUDA code is dealing with multi-GPU. Does anyone have any idea how to proceed? Just bite the bullet and forge ahead? :)

Some of the DMatrix refactoring might be blocked by this.

@RAMitchell @trivialfis @sriramch

@trivialfis
Member

@rongou @RAMitchell Can we close this now? What remains is to gradually clean up the code.

@rongou
Contributor

rongou commented Aug 19, 2019

I'm working on #4773, which should cover the bulk of the changes needed. Maybe we can close this after it goes in?

@lock bot locked as resolved and limited conversation to collaborators on Nov 23, 2019