
Initial support for federated learning #7831

Merged: 28 commits, May 5, 2022
Conversation

rongou (Contributor) commented Apr 21, 2022

Federated learning plugin for xgboost:

  • A gRPC server to aggregate MPI-style requests (allgather, allreduce, broadcast) from federated workers.
  • A Rabit engine for the federated environment.
  • Integration test to simulate federated learning.

Additional follow-ups are needed to address GPU support, better security and privacy, etc.

Part of #7778
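For readers skimming the PR, here is a rough sketch of the aggregation idea behind the first bullet above: the server collects each worker's contribution and reduces them element-wise before replying to everyone. All names, signatures, and the float-only reduction below are illustrative assumptions, not the plugin's actual code.

// Illustrative sketch only: not the plugin's actual classes or signatures.
#include <cstddef>
#include <cstring>
#include <string>

// Element-wise sum of `input` into `buffer`, treating both byte strings as
// arrays of floats (the real plugin dispatches on a data-type enum).
void AccumulateFloats(std::string* buffer, std::string const& input) {
  std::size_t n = input.size() / sizeof(float);
  for (std::size_t i = 0; i < n; ++i) {
    float a, b;
    std::memcpy(&a, buffer->data() + i * sizeof(float), sizeof(float));
    std::memcpy(&b, input.data() + i * sizeof(float), sizeof(float));
    a += b;
    std::memcpy(&(*buffer)[i * sizeof(float)], &a, sizeof(float));
  }
}

// Collects one allreduce contribution per worker; once `world_size` requests
// have arrived, every pending RPC would be answered with the reduced buffer.
class AllreduceAggregator {
 public:
  explicit AllreduceAggregator(int world_size) : world_size_{world_size} {}

  std::string const& Add(std::string const& input) {
    if (received_ == 0) {
      buffer_ = input;  // first contribution seeds the buffer
    } else {
      AccumulateFloats(&buffer_, input);
    }
    ++received_;
    // A real gRPC server would block here until received_ == world_size_
    // (e.g. with a condition variable) before releasing the replies.
    return buffer_;
  }

 private:
  int world_size_;
  int received_{0};
  std::string buffer_;
};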

rongou (Contributor, Author) commented Apr 21, 2022

Here is the output from running the integration test:

(venv) rou@rou:~/src/xgboost/tests/distributed$ ./runtests-federated.sh 
[15:30:05] Connecting to federated server localhost:9091, world size 3, rank 0
[15:30:05] Connecting to federated server localhost:9091, world size 3, rank 1
[15:30:05] XGBoost federated mode detected, not splitting data among workers
[15:30:05] XGBoost federated mode detected, not splitting data among workers
[15:30:05] Connecting to federated server localhost:9091, world size 3, rank 2
[15:30:05] XGBoost federated mode detected, not splitting data among workers
[15:30:05] XGBoost federated mode detected, not splitting data among workers
[15:30:05] XGBoost federated mode detected, not splitting data among workers
[15:30:05] XGBoost federated mode detected, not splitting data among workers
[15:30:06] [0]	eval-logloss:0.22669	train-logloss:0.23338
[15:30:07] [1]	eval-logloss:0.13787	train-logloss:0.13666
[15:30:07] [2]	eval-logloss:0.08046	train-logloss:0.08253
[15:30:08] [3]	eval-logloss:0.05833	train-logloss:0.05647
[15:30:08] [4]	eval-logloss:0.03829	train-logloss:0.04151
[15:30:09] [5]	eval-logloss:0.02663	train-logloss:0.02961
[15:30:09] [6]	eval-logloss:0.01388	train-logloss:0.01919
[15:30:10] [7]	eval-logloss:0.01020	train-logloss:0.01332
[15:30:10] [8]	eval-logloss:0.00848	train-logloss:0.01113
[15:30:11] [9]	eval-logloss:0.00692	train-logloss:0.00663
[15:30:11] [10]	eval-logloss:0.00544	train-logloss:0.00504
[15:30:12] [11]	eval-logloss:0.00445	train-logloss:0.00420
[15:30:12] [12]	eval-logloss:0.00336	train-logloss:0.00356
[15:30:13] [13]	eval-logloss:0.00277	train-logloss:0.00281
[15:30:13] [14]	eval-logloss:0.00252	train-logloss:0.00244
[15:30:14] [15]	eval-logloss:0.00177	train-logloss:0.00194
[15:30:15] [16]	eval-logloss:0.00157	train-logloss:0.00161
[15:30:15] [17]	eval-logloss:0.00135	train-logloss:0.00142
[15:30:16] [18]	eval-logloss:0.00123	train-logloss:0.00125
[15:30:16] [19]	eval-logloss:0.00107	train-logloss:0.00107
[15:30:16] Finished training

RAMitchell (Member) left a comment
Implementing this as a plug-in works well for now. We don't want dependencies on protobuf etc. in main xgboost.

I assume gRPC is a placeholder here and we want something encrypted.

I'm hoping this work also eventually leads to re-factoring, improvements, and a better understanding of the underlying rabit code.

Looks good as a first attempt. Are you wanting us to merge this and do the next steps in stages, or is this just for feedback? I would probably not want to merge it because I don't want to suggest to users that there is a functional federated learning plug-in yet.


void Accumulate(std::string& buffer, std::string const& input, DataType data_type,
ReduceOperation reduce_operation) const {
switch (data_type) {
Member:
Note to self: if we update xgboost to C++17 we can do this type of switch with variant/visit in three lines.
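A minimal sketch of the C++17 variant/visit idea mentioned in that note (purely illustrative; the type list and names are assumptions):

#include <cstddef>
#include <type_traits>
#include <variant>
#include <vector>

using Buffer = std::variant<std::vector<int>, std::vector<float>, std::vector<double>>;

// One visitor replaces the per-DataType switch: element-wise sum when the two
// buffers hold the same element type, otherwise do nothing.
void Accumulate(Buffer& buffer, Buffer const& input) {
  std::visit([](auto& dst, auto const& src) {
    if constexpr (std::is_same_v<std::decay_t<decltype(dst)>, std::decay_t<decltype(src)>>) {
      for (std::size_t i = 0; i < src.size() && i < dst.size(); ++i) dst[i] += src[i];
    }
  }, buffer, input);
}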

}

int const world_size_;
AllgatherHandler allgather_handler_;
Member:
These handlers don't really have members, can they just be functions?

rongou (Contributor, Author):
Changed them to functors.
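For context, a tiny sketch of what "functors" means here (names are illustrative): a stateless handler becomes a struct with operator(), so it can still be held as a member of the service while carrying no state of its own.

#include <string>

struct AllgatherFunctor {
  // Gather/concatenation logic would live here; the body is a placeholder.
  std::string operator()(std::string const& input, int rank, int world_size) const {
    (void)rank;
    (void)world_size;
    return input;
  }
};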

@@ -198,11 +198,15 @@ XGB_DLL int XGDMatrixCreateFromFile(const char *fname,
DMatrixHandle *out) {
API_BEGIN();
bool load_row_split = false;
#if defined(XGBOOST_USE_FEDERATED)
Member:
So each worker needs to call the c_api with manually specified file locations?

rongou (Contributor, Author):
Yeah, in a federated environment, presumably all the local data on each federated worker is used for training, so it doesn't make sense to split it further.
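A self-contained sketch of the behaviour this thread describes (the helper and the environment variable are assumptions for illustration, not necessarily what the PR's c_api.cc does):

#include <cstdlib>
#include <iostream>

// Assumption: federated mode is detected from the worker's environment,
// e.g. a server address being set when the worker was launched.
bool IsFederatedMode() { return std::getenv("FEDERATED_SERVER_ADDRESS") != nullptr; }

bool ShouldSplitRowsAmongWorkers() {
  if (IsFederatedMode()) {
    // Each worker trains on all of its local file, so no further splitting.
    std::cout << "XGBoost federated mode detected, not splitting data among workers\n";
    return false;
  }
  return true;  // classic distributed mode still splits rows among workers
}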

trivialfis (Member) left a comment
Thank you for the exciting feature!

Out of curiosity, is it preferred to launch a CLI application instead of exposing a C function (along with a Python API) to let users launch it from within their program?

rongou (Contributor, Author) left a comment
@RAMitchell I added SSL/TLS encryption (server and clients are mutually authenticated). I'm hoping we can merge this as a bare-bones implementation of federated learning, and improve on it with followup PRs.
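For reference, a minimal sketch of mutually-authenticated TLS on the gRPC C++ server side, along the lines described above; the file paths and helper names are placeholders, and the plugin's actual option plumbing may differ.

#include <fstream>
#include <memory>
#include <sstream>
#include <string>

#include <grpcpp/grpcpp.h>

std::string ReadFile(std::string const& path) {
  std::ifstream in(path);
  std::ostringstream ss;
  ss << in.rdbuf();
  return ss.str();
}

// Server credentials that both present a certificate and require a verified
// client certificate, so server and clients authenticate each other.
std::shared_ptr<grpc::ServerCredentials> MakeMutualTlsCredentials(
    std::string const& ca_cert_path, std::string const& server_key_path,
    std::string const& server_cert_path) {
  grpc::SslServerCredentialsOptions options(
      GRPC_SSL_REQUEST_AND_REQUIRE_CLIENT_CERTIFICATE_AND_VERIFY);
  options.pem_root_certs = ReadFile(ca_cert_path);  // CA used to verify clients
  options.pem_key_cert_pairs.push_back({ReadFile(server_key_path), ReadFile(server_cert_path)});
  return grpc::SslServerCredentials(options);
}

On the worker side, grpc::SslCredentials would be configured with the client certificate and key so that both ends verify each other.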

rongou (Contributor, Author) commented Apr 25, 2022

@trivialfis I did the CLI because it was easier. :) We can certainly add a C API/Python wrapper if needed. Perhaps as a followup?

trivialfis (Member) left a comment
> @trivialfis I did the CLI because it was easier. :) We can certainly add a C API/Python wrapper if needed. Perhaps as a followup?

I think it would be better to avoid adding an executable by replacing the main function with a C API and integrating it into libxgboost.so. But it's fine if you want to make it a follow-up.

Could you please enable the tests on GitHub Actions? The rest looks fine to me as a bare-bones implementation.

RAMitchell (Member):
I'm okay to merge the prototype. Any ideas on how to solve the quantile issue?

rongou (Contributor, Author) commented Apr 29, 2022

@trivialfis I added some unit tests along with the integration test, but the federated learning plugin is disabled by default, so they are not being run by the CI. I'll need to send a follow-up PR to tweak the CI pipelines to add them. I also added the C API and the Python wrapper as you suggested.

@RAMitchell what do you mean by the quantile issue? For now the quantiles are still constructed globally using allreduce. We need to do some followup work to enhance the privacy.

Outdated review threads on plugin/federated/CMakeLists.txt and src/c_api/c_api.cc were marked resolved.
RAMitchell (Member):
> @RAMitchell what do you mean by the quantile issue? For now the quantiles are still constructed globally using allreduce. We need to do some followup work to enhance the privacy.

I think we need a plan for how to solve distributed quantiles while preserving privacy. It's hard for me to see how this can be possible with any reasonable guarantees. For example, in small datasets or datasets with few unique values, the quantiles could capture all of the data, so even sharing the final quantiles among workers would represent a significant leakage.

rongou (Contributor, Author) commented May 3, 2022

As I mentioned in the RFC, this first iteration is really about putting the basic framework in place so that federated learning can be done in a somewhat high-trust, "enterprise" environment. We can then incrementally add more security and privacy features to widen the use cases.

For the quantile leakage issue, one possibility is to have each party compute a histogram whose bin count depends on the size of its local data, then fuse the histograms at the server, something like https://arxiv.org/abs/2012.06670. This doesn't rely on homomorphic encryption or differential privacy, but of course there are other approaches we can also consider.
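As a purely illustrative sketch of the bin-fusion idea (not the cited paper's algorithm and not plugin code): each party builds a histogram whose bin count reflects its local data size, and the server re-bins each one onto a common grid before summing.

#include <cstddef>
#include <vector>

// Re-bin one party's histogram (counts over [lo, hi) with its own bin count)
// onto the common grid held in `fused`, assigning each source bin's mass to
// the common bin containing its midpoint (a crude but simple rule).
void FuseHistogram(std::vector<double> const& party_hist, double lo, double hi,
                   std::vector<double>* fused) {
  std::size_t const common_bins = fused->size();
  std::size_t const party_bins = party_hist.size();
  double const party_width = (hi - lo) / party_bins;
  double const common_width = (hi - lo) / common_bins;
  for (std::size_t i = 0; i < party_bins; ++i) {
    double const midpoint = lo + (i + 0.5) * party_width;
    auto j = static_cast<std::size_t>((midpoint - lo) / common_width);
    if (j >= common_bins) j = common_bins - 1;
    (*fused)[j] += party_hist[i];
  }
}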

@rongou rongou requested a review from RAMitchell May 4, 2022 21:25