
Add distributed training #2

Open

hcho3 opened this issue Dec 19, 2019 · 6 comments

Comments

@hcho3

hcho3 commented Dec 19, 2019

No description provided.

@thvasilo
Collaborator

I could try adding this through the scripts I currently have available.

My setup requires that the user already has AWS credentials set up (through aws-cli or as environment variables, I think).

Also, I currently much prefer using aws-parallelcluster, but that involves running XGBoost communication over SLURM rather than YARN.

If we need YARN, I'd have to go back and verify that it still works as expected, or I suppose we could have a Spark-based benchmark, which I assume still works fine.
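
A minimal preflight check for the credentials requirement mentioned above could look like the sketch below; it assumes boto3 is installed and is illustrative, not part of the scripts being discussed:

```python
# Illustrative preflight check, not part of the scripts referenced above.
# boto3 resolves credentials from aws-cli config files or from the
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables.
import boto3

def check_aws_credentials() -> None:
    creds = boto3.Session().get_credentials()
    if creds is None:
        raise RuntimeError("No AWS credentials found; run `aws configure` or set env vars.")

check_aws_credentials()
```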

@hcho3
Author

hcho3 commented Dec 20, 2019

@thvasilo I was thinking of using Dask and running the benchmark locally on a big AWS machine, to make it easy to manage. But yes, it would be nice if you could put your scripts up in a separate directory (cluster). The more the merrier.
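
A minimal sketch of the single-big-machine Dask setup described above, using XGBoost's `xgboost.dask` interface with a `LocalCluster`; the dataset shape and training parameters are placeholders, not values from this thread:

```python
# Minimal single-machine Dask benchmark sketch; dataset and parameters
# are placeholders, not settings agreed on in this issue.
import time

import dask.array as da
import xgboost as xgb
from dask.distributed import Client, LocalCluster

def main() -> None:
    # One big AWS machine: a LocalCluster with several worker processes.
    with LocalCluster(n_workers=4, threads_per_worker=4) as cluster:
        with Client(cluster) as client:
            # Synthetic regression data; a real benchmark would load a dataset.
            X = da.random.random((1_000_000, 100), chunks=(100_000, 100))
            y = da.random.random(1_000_000, chunks=100_000)
            dtrain = xgb.dask.DaskDMatrix(client, X, y)

            start = time.perf_counter()
            xgb.dask.train(
                client,
                {"tree_method": "hist", "objective": "reg:squarederror"},
                dtrain,
                num_boost_round=100,
            )
            print(f"elapsed: {time.perf_counter() - start:.1f}s")

if __name__ == "__main__":
    main()
```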

@trivialfis
Member

trivialfis commented Dec 30, 2019

@hcho3 I have an initial set of scripts for running Dask benchmarks, but I use cuDF as the primary backend for data handling there: https://github.com/trivialfis/dxgb_bench I will add more datasets to it as I make progress.

It can be extended with other backends like CPU Dask or plain pandas. Would you like to take a look and see if it's suitable for merging here?
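
As a rough illustration of what swapping backends might look like, here is a hypothetical selector; the function name and backend labels are invented for this sketch and are not taken from dxgb_bench:

```python
# Hypothetical backend selector; names are illustrative only. Each
# loader returns a collection that the benchmark can feed to
# xgboost.dask.DaskDMatrix (or to plain xgboost.DMatrix for pandas).
def make_loader(backend: str):
    if backend == "dask-cudf":   # GPU: partitions are cuDF DataFrames
        import dask_cudf
        return dask_cudf.read_csv
    if backend == "dask-cpu":    # CPU: partitions are pandas DataFrames
        import dask.dataframe as dd
        return dd.read_csv
    if backend == "pandas":      # single-node, in-memory
        import pandas as pd
        return pd.read_csv
    raise ValueError(f"unknown backend: {backend}")

# Example usage: read_csv = make_loader("dask-cpu"); df = read_csv("data.csv")
```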

@hcho3
Author

hcho3 commented Dec 31, 2019

@trivialfis I will take a look, thanks! Is it fair to assume that Dask will have the same performance characteristics as the underlying native distributed algorithm? My impression of Dask is that it is a lightweight cluster application.

@terrytangyuan
Member

It would also be good to have a distributed benchmark suite on a Kubernetes cluster using the XGBoost Operator, if anyone is interested in contributing: https://github.com/kubeflow/xgboost-operator

@trivialfis
Member

Yes, but it will have higher memory consumption due to pandas and partition management.
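
A small illustration of where that overhead can come from (my reading of the comment, not a quote from it): each partition of a dask.dataframe is a full pandas DataFrame, so partition management adds per-partition pandas overhead on top of the raw data:

```python
# Each partition of a dask.dataframe is a complete pandas DataFrame,
# so memory use includes per-partition pandas overhead in addition to
# the raw data itself.
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({"x": range(1_000_000)})
ddf = dd.from_pandas(pdf, npartitions=100)

# Materializing one partition yields an ordinary pandas DataFrame.
part = ddf.get_partition(0).compute()
print(type(part), len(part))  # <class 'pandas.core.frame.DataFrame'> 10000
```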
