Initial support for federated learning #7831
Conversation
Here is the output from running the integration test:
(venv) rou@rou:~/src/xgboost/tests/distributed$ ./runtests-federated.sh
[15:30:05] Connecting to federated server localhost:9091, world size 3, rank 0
[15:30:05] Connecting to federated server localhost:9091, world size 3, rank 1
[15:30:05] XGBoost federated mode detected, not splitting data among workers
[15:30:05] XGBoost federated mode detected, not splitting data among workers
[15:30:05] Connecting to federated server localhost:9091, world size 3, rank 2
[15:30:05] XGBoost federated mode detected, not splitting data among workers
[15:30:05] XGBoost federated mode detected, not splitting data among workers
[15:30:05] XGBoost federated mode detected, not splitting data among workers
[15:30:05] XGBoost federated mode detected, not splitting data among workers
[15:30:06] [0] eval-logloss:0.22669 train-logloss:0.23338
[15:30:07] [1] eval-logloss:0.13787 train-logloss:0.13666
[15:30:07] [2] eval-logloss:0.08046 train-logloss:0.08253
[15:30:08] [3] eval-logloss:0.05833 train-logloss:0.05647
[15:30:08] [4] eval-logloss:0.03829 train-logloss:0.04151
[15:30:09] [5] eval-logloss:0.02663 train-logloss:0.02961
[15:30:09] [6] eval-logloss:0.01388 train-logloss:0.01919
[15:30:10] [7] eval-logloss:0.01020 train-logloss:0.01332
[15:30:10] [8] eval-logloss:0.00848 train-logloss:0.01113
[15:30:11] [9] eval-logloss:0.00692 train-logloss:0.00663
[15:30:11] [10] eval-logloss:0.00544 train-logloss:0.00504
[15:30:12] [11] eval-logloss:0.00445 train-logloss:0.00420
[15:30:12] [12] eval-logloss:0.00336 train-logloss:0.00356
[15:30:13] [13] eval-logloss:0.00277 train-logloss:0.00281
[15:30:13] [14] eval-logloss:0.00252 train-logloss:0.00244
[15:30:14] [15] eval-logloss:0.00177 train-logloss:0.00194
[15:30:15] [16] eval-logloss:0.00157 train-logloss:0.00161
[15:30:15] [17] eval-logloss:0.00135 train-logloss:0.00142
[15:30:16] [18] eval-logloss:0.00123 train-logloss:0.00125
[15:30:16] [19] eval-logloss:0.00107 train-logloss:0.00107
[15:30:16] Finished training
Implementing this as a plug-in works well for now. We don't want dependencies on protobuf etc. in main xgboost.
I assume gRPC is a placeholder here and we want something encrypted.
I'm hoping this work also eventually leads to refactoring, improvements, and a better understanding of the underlying rabit code.
Looks good as a first attempt. Are you wanting us to merge this and do the next steps in stages, or is this just for feedback? I would probably not want to merge it because I don't want to suggest to users that there is a functional federated learning plug-in yet.
void Accumulate(std::string& buffer, std::string const& input, DataType data_type,
                ReduceOperation reduce_operation) const {
  switch (data_type) {
Note to self: If we update xgboost to C++17, we can do this type of switch with variant/visit in 3 lines.
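(For illustration, a minimal C++17 sketch of what that might look like; TypedPtr and AccumulateSum are hypothetical names, and a real version would cover every DataType and ReduceOperation, not just sum over these four types:)

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <type_traits>
#include <variant>

// One alternative per supported element type; a hypothetical stand-in for DataType.
using TypedPtr = std::variant<std::int32_t*, std::int64_t*, float*, double*>;

// Sum n elements of src into dst; both variants must hold the same pointer type.
void AccumulateSum(TypedPtr dst, TypedPtr src, std::size_t n) {
  std::visit([n](auto* d, auto* s) {
    if constexpr (std::is_same_v<decltype(d), decltype(s)>) {
      std::transform(s, s + n, d, d, std::plus<>{});
    }
  }, dst, src);
}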
plugin/federated/federated_server.cc (outdated diff)
}

int const world_size_;
AllgatherHandler allgather_handler_;
These handlers don't really have members, can they just be functions?
Changed them to functors.
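(For reference, the functor shape being described might look roughly like this; a sketch with hypothetical signatures, not the PR's exact code:)

#include <string>
#include <vector>

// A stateless handler as a functor: no data members, but it can still be held
// as a member field (e.g. allgather_handler_) and passed wherever a callable
// is expected.
struct AllgatherHandler {
  std::string operator()(std::vector<std::string> const& inputs) const {
    std::string result;
    for (auto const& chunk : inputs) {
      result += chunk;  // concatenate the per-rank buffers in rank order
    }
    return result;
  }
};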
@@ -198,11 +198,15 @@ XGB_DLL int XGDMatrixCreateFromFile(const char *fname,
                                    DMatrixHandle *out) {
  API_BEGIN();
  bool load_row_split = false;
#if defined(XGBOOST_USE_FEDERATED)
So each worker needs to call the C API with manually specified file locations?
Yeah, in a federated environment, presumably all the local data on each federated worker is used for training, so it doesn't make sense to split further.
Thank you for the exciting feature!
Out of curiosity, is it preferred to launch a CLI application instead of exposing a C function (along with Python API) to let users launch it from somewhere within their program?
@RAMitchell I added SSL/TLS encryption (server and clients are mutually authenticated). I'm hoping we can merge this as a bare-bones implementation of federated learning, and improve on it with followup PRs.
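(For context, mutually-authenticated TLS in gRPC C++ looks roughly like the following; an illustrative sketch, not the PR's exact code, with certificate contents as placeholders supplied by the caller:)

#include <memory>
#include <string>
#include <grpcpp/grpcpp.h>
#include <grpcpp/security/server_credentials.h>

std::shared_ptr<grpc::ServerCredentials> MakeCredentials(
    std::string const& root_cert, std::string const& server_key,
    std::string const& server_cert) {
  // Require and verify a client certificate, so clients are authenticated too.
  grpc::SslServerCredentialsOptions options(
      GRPC_SSL_REQUEST_AND_REQUIRE_CLIENT_CERTIFICATE_AND_VERIFY);
  options.pem_root_certs = root_cert;  // CA used to verify client certificates
  options.pem_key_cert_pairs.push_back({server_key, server_cert});
  return grpc::SslServerCredentials(options);
}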
@trivialfis I did the CLI because it was easier. :) We can certainly add a C API/Python wrapper if needed. Perhaps as a followup?
> @trivialfis I did the CLI because it was easier. :) We can certainly add a C API/Python wrapper if needed. Perhaps as a followup?

I think it would be better to avoid adding an executable by replacing the main function with a C API and integrating it into libxgboost.so. But it's fine if you want to make it a followup.
Could you please enable the tests on GitHub Actions? The rest looks fine to me as a bare-bones implementation.
I'm okay to merge the prototype. Any ideas on how to solve the quantile issue?
@trivialfis I added some unit tests along with the integration test, but the federated learning plugin is disabled by default, so they are not being run by the CI. I need to send a followup PR to tweak the CI pipelines to add them. I also added the C API and the Python wrapper as you suggested. @RAMitchell what do you mean by the quantile issue? For now the quantiles are still constructed globally using allreduce. We need to do some followup work to enhance the privacy.
I think we need a plan for how to solve distributed quantiles while preserving privacy. It's hard for me to see how this can be possible with any reasonable guarantees. For example, in small datasets or datasets with few unique values, the quantiles could capture all of the data, so even sharing the final quantiles among workers would represent a significant leakage.
As I mentioned in the RFC, this first iteration is really about putting the basic framework in place so that federated learning can be done in a somewhat high-trust, "enterprise" environment. We can then incrementally add more security and privacy features to widen the use cases. For the quantile leakage issue, one possibility is to have each party compute a histogram whose bin size depends on the size of its local data, then fuse the histograms at the server, something like https://arxiv.org/abs/2012.06670. This doesn't rely on homomorphic encryption or differential privacy, but of course there are other approaches we can also consider.
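(To make the fusion idea concrete, here is a rough sketch; the names and the equal-width binning are assumptions for illustration, and the method in the cited paper is more sophisticated:)

#include <algorithm>
#include <cstddef>
#include <vector>

// A per-party histogram over [lo, hi) with equal-width bins.
struct Histogram {
  double lo, hi;
  std::vector<double> counts;
};

// Re-bin per-party histograms (possibly with different bin counts, chosen by
// each party based on its local data size) onto num_bins common bins.
Histogram Fuse(std::vector<Histogram> const& parts, std::size_t num_bins) {
  double lo = parts.front().lo, hi = parts.front().hi;
  for (auto const& p : parts) {
    lo = std::min(lo, p.lo);
    hi = std::max(hi, p.hi);
  }
  Histogram fused{lo, hi, std::vector<double>(num_bins, 0.0)};
  for (auto const& p : parts) {
    double width = (p.hi - p.lo) / p.counts.size();
    for (std::size_t i = 0; i < p.counts.size(); ++i) {
      double center = p.lo + (i + 0.5) * width;  // place each bin's mass at its center
      auto j = static_cast<std::size_t>((center - lo) / (hi - lo) * num_bins);
      fused.counts[std::min(j, num_bins - 1)] += p.counts[i];
    }
  }
  return fused;
}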
Federated learning plugin for xgboost:
Additional followups are needed to address GPU support, better security and privacy, etc.
Part of #7778