Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Common interface for collective communication #8057

Merged
merged 58 commits into from
Sep 12, 2022
Merged

Conversation

rongou
Copy link
Contributor

@rongou rongou commented Jul 8, 2022

Part of #7778.

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the great work on cleaning up the communication interface. Do you think we can have some of these abstractions under the rabit umbrella? Rabit has a Python interface that we can use for other purposes like synchronizing user-defined metrics. (used in the callback module)

@trivialfis
Copy link
Member

Another benefit is that sometimes nccl has issues with users' network setup and the error is difficult to interpret (unhandled system error) and is only raised during training. I want to have a Python interface unified with rabit for GPU communication that we can easily test and debug.

@rongou
Copy link
Contributor Author

rongou commented Jul 18, 2022

Yes I do plan to hook this up with a python interface. Do you think we should keep the existing rabit api as is and just swap out the implementation? Or deprecate rabit and have a new interface, say, collective?

@trivialfis
Copy link
Member

I think we can just do a swap. But since you have been working on it you might have better plan.

@rongou rongou changed the title [WIP] Common interface for collective communication Common interface for collective communication Sep 1, 2022
@rongou
Copy link
Contributor Author

rongou commented Sep 2, 2022

@trivialfis got rid of the factories. Please take another look. Should we try to merge this? So far this is a non-breaking change. Once I start modifying the callers, it'd probably become a breaking change.

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR looks good to me overall. Left my comments in the code review.

As for the interface, I hope that we can eventually merge the interfaces for CPU and GPU in the future by having some dispatching mechanism for recognizing input data device ordinal. I think it's doable in recent future.

include/xgboost/c_api.h Show resolved Hide resolved
plugin/federated/federated_server.cc Show resolved Hide resolved
python-package/xgboost/collective.py Outdated Show resolved Hide resolved
python-package/xgboost/collective.py Outdated Show resolved Hide resolved
python-package/xgboost/collective.py Outdated Show resolved Hide resolved
src/collective/communicator.cc Outdated Show resolved Hide resolved
namespace xgboost {
namespace collective {

thread_local int Communicator::device_ordinal_{-1};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to do something for this parameter, could you please take a look at #7308 ? One issue in my experimental implementation is that we need to consider multi-threaded environments like Python async functions. In such environments, it's difficult to enforce synchronization of these thread local parameters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposed python interface in #7308 looks good.

Here the device id doesn't have to be per thread, assuming we have a global configuration that sets it, and it's available when we call Communicator::Init(). But the communicators do need to be thread local since they are not thread safe.

@rongou
Copy link
Contributor Author

rongou commented Sep 8, 2022

@trivialfis can this be merged? Is there anything else you want me to change? Thanks!

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's merge it once the CI is back to healh. Thank you for the great work. Could you please share more details about the future plan for federated learning?

@rongou
Copy link
Contributor Author

rongou commented Sep 9, 2022

The main goal is to include federated learning in a release so users can just pip install xgboost to get the functionality. A few steps left to do:

  • Add the JNI wrapper for the communicator api
  • Change all the callers to use communicator instead of rabit
  • Change CI to build with federated learning enabled by default
  • Release with federated learning included

Once this is done, we can probably do more with Rabit (rewrite, swap out), unify host/device apis.

For federated learning, we may look at things like homomorphic encryption, differential privacy, and vertical federated learning (features are split between participants).

@trivialfis
Copy link
Member

Restarted jenkins linux.

@lidh15
Copy link

lidh15 commented Jan 15, 2024

homomorphic encryption, differential privacy

The main goal is to include federated learning in a release so users can just pip install xgboost to get the functionality. A few steps left to do:

  • Add the JNI wrapper for the communicator api
  • Change all the callers to use communicator instead of rabit
  • Change CI to build with federated learning enabled by default
  • Release with federated learning included

Once this is done, we can probably do more with Rabit (rewrite, swap out), unify host/device apis.

For federated learning, we may look at things like homomorphic encryption, differential privacy, and vertical federated learning (features are split between participants).

do we have homomorphic encryption or differential privacy in the plugin now?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants