Common interface for collective communication #8057
Conversation
Thank you for the great work on cleaning up the communication interface. Do you think we can have some of these abstractions under the rabit umbrella? Rabit has a Python interface that we can use for other purposes, like synchronizing user-defined metrics (used in the callback module).
Another benefit is that NCCL sometimes has issues with users' network setup, and the error ("unhandled system error") is difficult to interpret and is only raised during training. I want a Python interface for GPU communication, unified with rabit, that we can easily test and debug.
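For concreteness, here is a minimal sketch of what such a common collective abstraction might look like. Every name below (the enums, the method signatures) is an illustrative assumption, not the interface actually added by this PR; concrete backends (rabit, NCCL, a federated channel) would each subclass it, which is also what would let a Python binding synchronize user-defined metrics through one entry point.

```cpp
// Hypothetical sketch of a common collective interface; names are
// illustrative assumptions, not the actual XGBoost API.
#include <cstddef>

namespace xgboost {
namespace collective {

enum class DataType { kInt32, kInt64, kFloat, kDouble };
enum class Operation { kMax, kMin, kSum };

class Communicator {
 public:
  virtual ~Communicator() = default;
  // Combine `count` elements across all workers, in place.
  virtual void AllReduce(void *send_receive_buffer, std::size_t count,
                         DataType data_type, Operation op) = 0;
  // Send `size` bytes from worker `root` to every other worker.
  virtual void Broadcast(void *send_receive_buffer, std::size_t size,
                         int root) = 0;
  virtual int GetRank() const = 0;
  virtual int GetWorldSize() const = 0;
};

}  // namespace collective
}  // namespace xgboost
```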
Yes, I do plan to hook this up with a Python interface. Do you think we should keep the existing rabit API as is and just swap out the implementation? Or deprecate rabit and have a new interface, say, …
I think we can just do a swap. But since you have been working on it, you might have a better plan.
@trivialfis got rid of the factories. Please take another look. Should we try to merge this? So far this is a non-breaking change; once I start modifying the callers, it would probably become a breaking change.
The PR looks good to me overall. I left my comments in the code review.
As for the interface, I hope we can eventually merge the CPU and GPU interfaces by having a dispatching mechanism that recognizes the device ordinal of the input data. I think it's doable in the near future.
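As a rough illustration of that dispatching idea, the routing could key off the device ordinal of the buffer. The two backend functions below are hypothetical stand-ins, not actual XGBoost symbols:

```cpp
#include <cstddef>

// Hypothetical backends: a rabit-style CPU path and an NCCL-style GPU path.
void CpuAllReduceSum(double *buffer, std::size_t count);
void GpuAllReduceSum(int device_ordinal, double *buffer, std::size_t count);

// Route one collective call based on where the input data lives. By the
// usual convention, a negative ordinal means host memory.
void AllReduceSum(int device_ordinal, double *buffer, std::size_t count) {
  if (device_ordinal < 0) {
    CpuAllReduceSum(buffer, count);
  } else {
    GpuAllReduceSum(device_ordinal, buffer, count);
  }
}
```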
```cpp
namespace xgboost {
namespace collective {

thread_local int Communicator::device_ordinal_{-1};
```
We want to do something about this parameter; could you please take a look at #7308? One issue in my experimental implementation is that we need to consider multi-threaded environments like Python async functions. In such environments, it's difficult to enforce synchronization of these thread-local parameters.
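A tiny self-contained C++ demonstration of the pitfall being described (the analogue of a Python async task running on an executor thread): a `thread_local` value set on one thread is invisible on another.

```cpp
#include <iostream>
#include <thread>

thread_local int device_ordinal = -1;  // per-thread copy, default -1

int main() {
  device_ordinal = 0;  // configured on the main thread only
  std::thread worker([] {
    // The worker thread gets its own copy, still holding the default.
    std::cout << "worker sees " << device_ordinal << '\n';  // prints -1
  });
  worker.join();
  std::cout << "main sees " << device_ordinal << '\n';  // prints 0
}
```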
The proposed Python interface in #7308 looks good.
Here the device ID doesn't have to be per thread, assuming we have a global configuration that sets it and that it's available when we call Communicator::Init(). But the communicators do need to be thread local, since they are not thread safe.
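A minimal sketch of that arrangement, assuming hypothetical member names: the device ordinal is a single process-wide setting captured at Init(), while each thread lazily constructs its own communicator instance.

```cpp
class Communicator {
 public:
  // Process-wide configuration, expected to be set once before use.
  static void Init(int device_ordinal) { device_ordinal_ = device_ordinal; }

  // One instance per thread, because the implementation is not thread safe.
  static Communicator *Get() {
    thread_local Communicator instance;
    return &instance;
  }

 private:
  Communicator() = default;
  static int device_ordinal_;  // shared by all threads, read-only after Init()
};

int Communicator::device_ordinal_{-1};
```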
@trivialfis can this be merged? Is there anything else you want me to change? Thanks!
Let's merge it once the CI is back to health. Thank you for the great work. Could you please share more details about the future plan for federated learning?
The main goal is to include federated learning in a release so users can just …
Once this is done, we can probably do more with rabit (rewrite, swap out) and unify the host/device APIs. For federated learning, we may look at things like homomorphic encryption, differential privacy, and vertical federated learning (features are split between participants).
Restarted the Jenkins Linux job.
Do we have homomorphic encryption or differential privacy in the plugin now?
Part of #7778.