
[RFC] Remove support for single process multi-GPU #4531

Closed
RAMitchell opened this issue Jun 3, 2019 · 17 comments · Fixed by #4773

Comments

@RAMitchell
Member

This issue proposes removing support for single process multi-GPU in xgboost. What this means is that we would no longer support the 'n_gpus' parameter and the ability to train with multiple GPUs without some distributed framework.

We would focus on multi-GPU support through distributed frameworks such as Dask (see #4473) and Spark, where the framework would be expected to assign a worker process for each GPU independently.

The reasoning is that it would greatly simplify all GPU code in xgboost. Currently GPU code works by accepting an input DMatrix and then manually partitioning rows across each available GPU. This partitioning as well as managing multiple GPUs via threads imposes significant complexity on developers. The result is more code that is harder to read, more bug prone (we have many issues documenting this) and less flexible when developing new features. It is difficult to overstate how beneficial this would be for developers currently working on GPU code.

By doing this we are effectively deferring the task of partitioning data for a single machine with multiple GPUs onto the higher level distributed framework.
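To make the deferral concrete, here is a rough sketch (not part of the proposal itself) of what a one-process-per-GPU workflow could look like with Dask, assuming dask_cuda's LocalCUDACluster and the xgboost.dask interface being developed in #4473; exact names may differ from what finally ships:

```python
# Hypothetical sketch: Dask spawns one worker process per GPU and handles data partitioning;
# each worker then runs ordinary single-GPU xgboost training.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster  # one worker process per visible GPU
import dask.array as da
from xgboost import dask as dxgb

cluster = LocalCUDACluster()   # partitioning across GPUs becomes the framework's job
client = Client(cluster)

X = da.random.random((100_000, 50), chunks=(10_000, 50))
y = da.random.random(100_000, chunks=10_000)

dtrain = dxgb.DaskDMatrix(client, X, y)
output = dxgb.train(client,
                    {"tree_method": "gpu_hist"},  # each worker trains on its own GPU
                    dtrain,
                    num_boost_round=100)
booster = output["booster"]
```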

In the short term this will be a direct removal of an existing feature, but I believe in the medium term (1 year) it will result in a significantly better product.

We would provide a path for users to transition to the new multi-GPU support. This will also be better for users in the long run, as these distributed solutions are arbitrarily scalable. Between deciding to deprecate this functionality and removing the code, we would wait at least three months.

This is still very much up for debate and feedback is welcome from developers and users, particularly those who currently use multi-GPU support as a part of their workflow.

@TimZaman

TimZaman commented Jun 3, 2019

Yes! This is one of the reasons why horovod got so popular in DL land. The single-process story of TensorFlow was too complicated, and didn't easily translate to multi-node either. Nowadays most multi-GPU code I see uses horovod or torch.distributed. Both follow the multiprocess, single-GPU-per-process paradigm. This also naturally scales to multiple nodes without any code editing and gives the user (and dev) a simpler multi-GPU programming model.
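For illustration (not from the original comment), the per-process GPU assignment described above typically looks something like this with torch.distributed, assuming a torchrun-style launcher that sets LOCAL_RANK:

```python
# Hypothetical sketch of the multiprocess, single-GPU-per-process pattern.
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher for each spawned process
    torch.cuda.set_device(local_rank)           # this process owns exactly one GPU
    dist.init_process_group(backend="nccl")     # NCCL handles inter-GPU/inter-node communication
    # ... build the model, wrap it in DistributedDataParallel, train as usual ...

if __name__ == "__main__":
    main()
```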

@trivialfis
Member

trivialfis commented Jun 3, 2019

I like the idea of decoupling data distribution from the algorithm. But I insist that this is better done in C++ rather than in high-level libraries like Dask; there are R and Julia packages that don't attract as much attention as the others. I want the least possible maintenance burden added to language bindings. Thick bindings are much harder to maintain than the current, slightly bloated internal code. Also, both the Java and Scala bindings are currently maintained by @CodingCat; we might need to join the development, or at least be part of the review process, if language-dependent external packages are used.

@trivialfis
Member

@TimZaman I thought the way TF distributes subgraphs was quite elegant. Seems I'm a little behind on this. :)

@datametrician

Agree with @TimZaman. RAPIDS/cuML is also following a one-process-per-GPU (OPG) paradigm, and we feel it's much simpler to develop and maintain (no hard evidence, just our initial development experience compared to single-process multi-GPU).

@pseudotensor
Contributor

Pros:

  1. xgboost becomes simpler for developers.
  2. xgboost multi-GPU currently performs poorly and would need significant attention to become beneficial.

Cons:

  1. No longer able to easily run a single model on multiple GPUs. Users would be required to use Dask or other third-party libraries, so xgboost by itself becomes strongly dependent on third-party libraries for functionality it already had.

  2. No longer able to train a single model on big data. But xgboost's memory usage is extreme, and lightgbm offers a vastly better memory footprint, which makes multi-GPU xgboost not especially useful anyway.

I would have wanted xgboost multi-GPU to work very well, using NVLink or whatever. But if that's not going to happen, then there's not much point in keeping the multi-GPU capability in xgboost.

@RAMitchell
Member Author

> I like the idea of decoupling data distribution from the algorithm. But I insist that this is better done in C++ rather than in high-level libraries like Dask; there are R and Julia packages that don't attract as much attention as the others. I want the least possible maintenance burden added to language bindings. Thick bindings are much harder to maintain than the current, slightly bloated internal code. Also, both the Java and Scala bindings are currently maintained by @CodingCat; we might need to join the development, or at least be part of the review process, if language-dependent external packages are used.

@trivialfis I don't see how to do this. Consider that HostDeviceVector is used everywhere throughout our API; the purpose of this RFC is to remove the complexity of multi-GPU data partitioning from data structures like this.

I look at this as a question of scope: what functionality are we, as developers with limited resources, prepared to provide? It's very tempting to try to do everything, but I think many of the features we currently have for GPU are not implemented well enough for my liking, and the quality of the code base is not at the level I want it to be. You are correct that we will not provide out-of-the-box functionality to use multiple GPUs from R and Julia, but @pseudotensor is also right that the multi-GPU solution is not particularly compelling vs. single GPU.

I would personally be satisfied if we produced a high-quality building block that others could use to implement a distributed machine learning system.

@trivialfis
Member

@RAMitchell One problem at a time.

  • So there's a need to reduce memory usage, for both dense and sparse data.
  • Also, multiple GPUs will have different memory footprints.
  • Then we need to think of a solution for robust, scalable data distribution.

Let's divide and conquer. :)

@sh1ng
Contributor

sh1ng commented Jun 5, 2019

I'd start with a design doc and a POC showing how to parallelize the algorithm without introducing too much overhead. Correct me if I'm wrong, but finding the best splits requires (2^LEVEL)*N_FEATURES comparisons if a tree's node doesn't get partitioned across a few devices, and (2^LEVEL)*N_FEATURES*N_BINS if it does. Those operations will require cross-process communication that may introduce some overhead.
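To get a feel for those counts, a quick back-of-the-envelope calculation with made-up values (LEVEL, N_FEATURES and N_BINS here are illustrative only):

```python
# Plugging illustrative numbers into the estimate above.
LEVEL, N_FEATURES, N_BINS = 6, 100, 256

nodes = 2 ** LEVEL                          # nodes on this tree level
local_only = nodes * N_FEATURES             # node kept on one device: compare best splits only
partitioned = nodes * N_FEATURES * N_BINS   # node split across devices: merge per-bin histograms

print(local_only, partitioned)              # 6,400 vs 1,638,400
```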

In the end, if it scales well enough, that outweighs downsides like an extra third-party library dependency.

@RAMitchell
Member Author

Communication throughput via NCCL is the same across processes as it is for the threaded version. Running the algorithm in parallel is essentially a question of starting a rabit tracker and connecting workers to it. We can certainly document this more to enable others to potentially incorporate this into their distributed framework. I think some kind of blog post explaining the distributed architecture would be helpful for a lot of people.
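As a rough, hypothetical illustration of "starting a rabit tracker and connecting workers to it", the worker side might look roughly like this; the tracker setup, the DMLC_* argument names, and the exact rabit Python API are assumptions based on the dmlc tracker conventions of the time, not a documented recipe:

```python
# Hypothetical worker-side sketch: each process drives one GPU and joins a rabit tracker.
import xgboost as xgb

def run_worker(rank, n_workers, tracker_host, tracker_port, dtrain):
    rabit_args = [
        f"DMLC_TRACKER_URI={tracker_host}".encode(),
        f"DMLC_TRACKER_PORT={tracker_port}".encode(),
        f"DMLC_NUM_WORKER={n_workers}".encode(),
    ]
    xgb.rabit.init(rabit_args)  # connect this worker to the tracker
    params = {"tree_method": "gpu_hist", "gpu_id": rank}  # one GPU per worker process
    booster = xgb.train(params, dtrain, num_boost_round=100)
    xgb.rabit.finalize()
    return booster
```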

@arnocandel

I’m ok with this path - robustness wins over features.

@RAMitchell
Member Author

I think we have collected enough feedback without significant objections from customers. I will file a PR enabling deprecation warnings and updating documentation. We will aim to start removing code around September.

@trivialfis
Member

@RAMitchell Is this line a bug?

LOG(FATAL) << "Distributed training is not available with GPU algoritms";

@RAMitchell
Member Author

I've never seen this error message, and we even have distributed GPU tests now, so it must not be reaching this code block.

@RAMitchell
Member Author

The documentation has been updated and deprecation warnings will now be issued. I will leave this issue open until we start removing this feature in case of any further feedback.

@rongou
Contributor

rongou commented Aug 7, 2019

Looks like half the CUDA code is dealing with multi-GPU. Does anyone have any idea how to proceed? Just bite the bullet and forge ahead? :)

Some of the DMatrix refactoring might be blocked by this.

@RAMitchell @trivialfis @sriramch

@trivialfis
Member

@rongou @RAMitchell Can we close this now? What remains is to gradually clean up the code.

@rongou
Contributor

rongou commented Aug 19, 2019

I'm working on #4773, which should cover the bulk of the changes needed. Maybe we can close this after it goes in?

@lock bot locked as resolved and limited conversation to collaborators on Nov 23, 2019