xgboost-comprehensive with bagging aggregation #2554

Merged
merged 35 commits into from Nov 15, 2023

yan-gao-GY
Contributor

Issue

There is no easy-to-use XGBoost example with Flower.

Description

eXtreme Gradient Boosting (XGBoost) is a robust and efficient implementation of gradient-boosted decision trees (GBDT). Given this robustness and efficiency, combining XGBoost with federated learning offers a promising way to train models while protecting data privacy.

Proposal

This example demonstrates how to train XGBoost models within Flower using the xgboost package on the HIGGS dataset. The server aggregates the clients' trees with a tree-based bagging method.
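To make the bagging idea concrete, here is a minimal sketch in plain Python. It assumes a toy model format (a dict with a `trees` list) that is only a stand-in for XGBoost's real serialized model; the function name `bagging_aggregate` and the `owner` field are hypothetical and not taken from this PR.

```python
from copy import deepcopy


def bagging_aggregate(global_model, client_models):
    """Append every client's newly trained trees to the global ensemble.

    Toy format (an assumption for illustration): {"trees": [{"id": int, ...}]}.
    """
    # Start from a copy so the previous global model is left untouched.
    merged = deepcopy(global_model) if global_model else {"trees": []}
    for client_model in client_models:
        for tree in client_model["trees"]:
            new_tree = dict(tree)
            # Re-index so tree ids stay unique across the whole ensemble.
            new_tree["id"] = len(merged["trees"])
            merged["trees"].append(new_tree)
    return merged


# Round 1: there is no global model yet, and two clients each send one tree.
round_1 = bagging_aggregate(
    None,
    [
        {"trees": [{"id": 0, "owner": "client_a"}]},
        {"trees": [{"id": 0, "owner": "client_b"}]},
    ],
)
```

In this sketch the global ensemble grows by one tree per participating client per round, which is the core of bagging-style aggregation; the strategy in the PR operates on actual XGBoost models rather than on this toy structure.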

Warning

Note that this example uses SizePartitioner for FL data partitioning, so this PR should be merged after fds-size-partitioner.

@adam-narozniak
Member

adam-narozniak commented Nov 6, 2023

Hi @yan-gao-GY, here is my general review regarding the client and dataset.
I think that the abstractions created in dataset.py do not necessarily improve the readability of the code. I'd say it would be more readable if the code from init_higgs and load_partition were used directly in the client. We can keep the partitioner initialization in dataset.py. I'd keep split_train_test in dataset.py but create a separate method that converts the Dataset produced by fds to xgb.DMatrix, because it's not expected that this conversion happens inside split_train_test, as it does right now.
Also, currently, only the uniform partitioning method is used. Will the code allow the use of others too? If not, we can remove the whole choice of partitioner. If yes, will that be part of, e.g., the next PR?

Here is my recommendation:

# main
from flwr_datasets import FederatedDataset

from dataset import instantiate_partitioner, train_test_split

partitioner = instantiate_partitioner(partitioner_type=partitioner_type, num_partitions=num_partitions)
# alternatively not `partitioner_type` but `node_id_to_samples_correlation` or just `correlation`
fds = FederatedDataset(dataset="jxie/higgs", partitioners={"train": partitioner})
partition = fds.load_partition(idx=partition_id, split="train")
partition.set_format("numpy")
# split_rate is not informative keyword to me, I'd stick to e.g. test_size or test_fraction
# I'd also drop the size returns but I think it's more personal choice
train_data, valid_data = train_test_split(partition, test_size=test_size, seed=SEED)
# I'd rename the _reformat_data, but it'd serve the same purpose
train_dmatrix = transform_dataset_to_dmatrix(train_data)
valid_dmatrix = transform_dataset_to_dmatrix(valid_data)

Also, I'd rename SPLIT_DICT to CORRELATION_TO_PARTITIONER or something similar, in line with the parameter name chosen for instantiate_partitioner.
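The renamed mapping and `instantiate_partitioner` could look roughly like the sketch below. The two partitioner classes are hypothetical stand-ins defined here so the snippet runs on its own; they are not the classes from flwr_datasets, and the key names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class UniformStandIn:
    """Hypothetical placeholder for a uniform (IID) partitioner class."""

    num_partitions: int


@dataclass
class LinearStandIn:
    """Hypothetical placeholder for a size-skewed partitioner class."""

    num_partitions: int


# Maps the user-facing option to the class that implements it.
CORRELATION_TO_PARTITIONER = {
    "uniform": UniformStandIn,
    "linear": LinearStandIn,
}


def instantiate_partitioner(partitioner_type, num_partitions):
    """Look up the partitioner class by name and construct it."""
    try:
        partitioner_cls = CORRELATION_TO_PARTITIONER[partitioner_type]
    except KeyError:
        options = ", ".join(sorted(CORRELATION_TO_PARTITIONER))
        raise ValueError(
            f"Unknown partitioner_type {partitioner_type!r}; expected one of: {options}"
        ) from None
    return partitioner_cls(num_partitions=num_partitions)


partitioner = instantiate_partitioner("uniform", num_partitions=5)
```

Keeping the mapping in one dictionary means that supporting another partitioning scheme is a one-line change, which is relevant to the question above about whether other partitioners will be allowed.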

@adam-narozniak
Member

Also, I'd add the train and valid data as parameters to FlowerClient and then reference them via self.
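A minimal sketch of that suggestion follows. `SketchClient` is a hypothetical, simplified class used only for illustration; the real FlowerClient in the example derives from a Flower client base class and has a different method signature.

```python
class SketchClient:
    """Hypothetical client that receives its data through the constructor."""

    def __init__(self, train_data, valid_data):
        # Store the data on the instance instead of relying on module globals.
        self.train_data = train_data
        self.valid_data = valid_data

    def num_examples(self):
        """Report how many examples each split holds, read via self."""
        return {"train": len(self.train_data), "valid": len(self.valid_data)}


# Plain lists stand in for the real training and validation data here.
client = SketchClient(train_data=[1, 2, 3, 4], valid_data=[5])
sizes = client.num_examples()
```

Passing the data in explicitly makes the client's dependencies visible at construction time and makes it easier to create several clients with different partitions in one process.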

@yan-gao-GY
Contributor Author

@adam-narozniak thanks a lot for your suggestion! I think it makes sense. I'll make the changes later.

@yan-gao-GY yan-gao-GY mentioned this pull request Nov 6, 2023
@adam-narozniak
Member

Also, one more thing. Let's make all the comments start with a capital letter. (I know that we don't necessarily even do full type hints in the examples, but let's keep it consistent across the project.)

Member

@adam-narozniak adam-narozniak left a comment

This is inconsistent with the pyproject.toml

partition = fds.load_partition(idx=partition_id, split="train")
partition.set_format("numpy")

if args.centralised_eval:
Member

Just one more question. In the case of centralized eval, each of the (federated) nodes also uses the centralized dataset for the federated evaluation. Is that intended, or is it controlled by the server?

Contributor Author

Whether centralised evaluation is performed is controlled by the server with --centralised_eval. Without centralised evaluation, the user can still choose whether client evaluation uses the centralised test set or the client test set (split from the client's training data). For example, running client.py --centralised_eval enables client evaluation on the centralised test set.
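The client-side choice described here can be sketched as follows. Only the `--centralised_eval` flag name comes from the discussion; the helper names `parse_args` and `choose_eval_set` are hypothetical, and the flag is assumed to be a simple on/off switch.

```python
import argparse


def parse_args(argv=None):
    """Parse the client's command-line options."""
    parser = argparse.ArgumentParser(description="Client sketch")
    parser.add_argument(
        "--centralised_eval",
        action="store_true",
        help="Evaluate on the centralised test set instead of the client's own split.",
    )
    return parser.parse_args(argv)


def choose_eval_set(args, client_valid_set, centralised_test_set):
    """Return the dataset the client should use for evaluation."""
    if args.centralised_eval:
        return centralised_test_set
    return client_valid_set


# Simulate running `client.py --centralised_eval`.
args = parse_args(["--centralised_eval"])
eval_set = choose_eval_set(args, client_valid_set="client_valid", centralised_test_set="central_test")
```

Without the flag, `choose_eval_set` falls back to the validation split taken from the client's own training data.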

yan-gao-GY and others added 11 commits November 15, 2023 10:51
Co-authored-by: Daniel J. Beutel <daniel@flower.dev>
@danieljanes danieljanes enabled auto-merge (squash) November 15, 2023 18:38
@danieljanes danieljanes changed the title Quickstart-xgboost with bagging aggregation xgboost-comprehensive with bagging aggregation Nov 15, 2023
@danieljanes danieljanes merged commit f056175 into main Nov 15, 2023
26 checks passed
@danieljanes danieljanes deleted the xgboost branch November 15, 2023 18:42