
Add distributed training #2

Open

hcho3 opened this issue Dec 19, 2019 · 6 comments

Comments

@hcho3

hcho3 commented Dec 19, 2019

No description provided.

@thvasilo
Collaborator

I could try adding this through the scripts I currently have available.

My setup requires that the user already has AWS credentials set up (through aws-cli or as environment variables, I think).

Also, I currently much prefer using aws-parallelcluster, but that involves running XGBoost communication over SLURM rather than YARN.

If we need YARN, I'd have to go back and verify that it still works as expected, or I suppose we could have a Spark-based benchmark, which I assume still works fine.
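
A minimal preflight check for the credentials requirement mentioned above could look like the sketch below; it assumes boto3 is installed and is illustrative, not part of the scripts being discussed:

```python
# Illustrative preflight check, not part of the scripts referenced above.
# boto3 resolves credentials from aws-cli config files or from the
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables.
import boto3

def check_aws_credentials() -> None:
    creds = boto3.Session().get_credentials()
    if creds is None:
        raise RuntimeError("No AWS credentials found; run `aws configure` or set env vars.")

check_aws_credentials()
```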

@hcho3
Author

hcho3 commented Dec 20, 2019

@thvasilo I was thinking of using Dask and running the benchmark locally on a big AWS machine, to make it easy to manage. But yes, it would be nice if you could put your scripts up in a separate directory (cluster). The more the merrier.
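
A minimal sketch of the single-big-machine Dask setup described above, using XGBoost's `xgboost.dask` interface with a `LocalCluster`; the dataset shape and training parameters are placeholders, not values from this thread:

```python
# Minimal single-machine Dask benchmark sketch; dataset and parameters
# are placeholders, not settings agreed on in this issue.
import time

import dask.array as da
import xgboost as xgb
from dask.distributed import Client, LocalCluster

def main() -> None:
    # One big AWS machine: a LocalCluster with several worker processes.
    with LocalCluster(n_workers=4, threads_per_worker=4) as cluster:
        with Client(cluster) as client:
            # Synthetic regression data; a real benchmark would load a dataset.
            X = da.random.random((1_000_000, 100), chunks=(100_000, 100))
            y = da.random.random(1_000_000, chunks=100_000)
            dtrain = xgb.dask.DaskDMatrix(client, X, y)

            start = time.perf_counter()
            xgb.dask.train(
                client,
                {"tree_method": "hist", "objective": "reg:squarederror"},
                dtrain,
                num_boost_round=100,
            )
            print(f"elapsed: {time.perf_counter() - start:.1f}s")

if __name__ == "__main__":
    main()
```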

@trivialfis
Member

trivialfis commented Dec 30, 2019

@hcho3 I have an initial set of scripts for running Dask benchmarks, but I use cuDF as the primary backend for data handling there: https://github.com/trivialfis/dxgb_bench I will add more datasets to it as I make progress.

It can be extended with other backends like CPU Dask or plain pandas. Would you like to take a look and see if it's suitable for merging here?
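
As a rough illustration of what swapping backends might look like, here is a hypothetical selector; the function name and backend labels are invented for this sketch and are not taken from dxgb_bench:

```python
# Hypothetical backend selector; names are illustrative only. Each
# loader returns a collection that the benchmark can feed to
# xgboost.dask.DaskDMatrix (or to plain xgboost.DMatrix for pandas).
def make_loader(backend: str):
    if backend == "dask-cudf":   # GPU: partitions are cuDF DataFrames
        import dask_cudf
        return dask_cudf.read_csv
    if backend == "dask-cpu":    # CPU: partitions are pandas DataFrames
        import dask.dataframe as dd
        return dd.read_csv
    if backend == "pandas":      # single-node, in-memory
        import pandas as pd
        return pd.read_csv
    raise ValueError(f"unknown backend: {backend}")

# Example usage: read_csv = make_loader("dask-cpu"); df = read_csv("data.csv")
```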

@hcho3
Author

hcho3 commented Dec 31, 2019

@trivialfis I will take a look, thanks! Is it fair to assume that Dask will have the same performance characteristics as the underlying native distributed algorithm? My impression of Dask is that it is a lightweight cluster application.

@terrytangyuan
Member

It would also be good to have a distributed benchmark suite on a Kubernetes cluster using the XGBoost Operator, if anyone is interested in contributing: https://github.com/kubeflow/xgboost-operator

@trivialfis
Member

Yes, but it will have higher memory consumption due to pandas and partition management.
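
A small illustration of where that overhead can come from (my reading of the comment, not a quote from it): each partition of a dask.dataframe is a full pandas DataFrame, so partition management adds per-partition pandas overhead on top of the raw data:

```python
# Each partition of a dask.dataframe is a complete pandas DataFrame,
# so memory use includes per-partition pandas overhead in addition to
# the raw data itself.
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({"x": range(1_000_000)})
ddf = dd.from_pandas(pdf, npartitions=100)

# Materializing one partition yields an ordinary pandas DataFrame.
part = ddf.get_partition(0).compute()
print(type(part), len(part))  # <class 'pandas.core.frame.DataFrame'> 10000
```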
