
[jvm-packages] add a new trainDistributed API for dataset #5950

Closed

Conversation


@wbo4958 wbo4958 commented Jul 28, 2020

This PR adds a new trainDistributed API for Dataset and creates a
TrainIntf trait with four different implementations (train for ranking with eval sets,
train for ranking without eval sets, train for non-ranking with eval sets, and train for
non-ranking without eval sets).

This PR does not change any logic of the training pipeline; it is essentially a code reorganization.


wbo4958 commented Jul 28, 2020

Hi @CodingCat @trivialfis @hcho3 @sperlingxx,

Since gpu_hist (#5171) has been merged, the JVM packages can also be accelerated by GPUs on Spark 3.

Looking at the whole XGBoost workflow (data loading, ETL, training), only the training step can currently be accelerated by GPUs. But training models or running prediction is only a small portion of the whole job, especially once training itself is GPU-accelerated. Most of the time is spent on data preparation (data loading, ETL), so data preparation can slow down the whole pipeline.

So the question is: can we also accelerate the ETL pipeline?
The answer is yes!

As of Apache Spark 3.0, users can schedule GPU resources and replace the backend for many SQL and DataFrame operations so that they are accelerated on GPUs. Concretely, it is the RAPIDS plugin / cuDF (https://github.com/NVIDIA/spark-rapids) that accelerates SQL/DataFrame processing.

What I am trying to say is that we can leverage the RAPIDS plugin to accelerate XGBoost4j-Spark end to end, from data loading to training. The RAPIDS plugin loads data into GPU memory in columnar format and executes operations like join/filter/groupBy on GPUs, which is very fast. After ETL, the data is still stored in GPU memory, so we can build a device DMatrix directly in GPU memory; the XGBoost library already supports building a DMatrix via the cudf interface.
In fact, we have some internal tests showing that this typically speeds up XGBoost4j significantly: https://cloud.google.com/blog/products/data-analytics/ml-with-xgboost-gets-faster-with-dataproc-on-gpus and https://databricks.com/blog/2020/05/15/shrink-training-time-and-cost-using-nvidia-gpu-accelerated-xgboost-and-apache-spark-on-databricks.html.

The RAPIDS plugin and cuDF are, as far as I know, still developing rapidly, so I would rather not add the RAPIDS plugin as a direct dependency. Instead, I'd like to add an XGBoost plugin mechanism to support this.

To support plugins in XGBoost, I'd like to split the work into two PRs.

The first PR (this PR) groups the RDD transforms together so that the whole RDD transform can later be replaced by a plugin.

To achieve this,

1. deprecate the RDD-based trainDistributed

  /**
   * @return A tuple of the booster and the metrics used to build training summary
   */
  @deprecated("use trainDistributed for Dataset instead of RDD", "1.2.0")
  @throws(classOf[XGBoostError])
  private[spark] def trainDistributed(
      trainingData: RDD[XGBLabeledPoint],
      params: Map[String, Any],
      hasGroup: Boolean = false,
      evalSetsMap: Map[String, RDD[XGBLabeledPoint]] = Map()):
    (Booster, Map[String, Array[Float]]) = {

2. add a new trainDistributed for Dataset[_]

  /**
   * @return A tuple of the booster and the metrics used to build training summary
   */
  @throws(classOf[XGBoostError])
  private[spark] def trainDistributed(
      trainingData: Dataset[_],
      dsToRddParams: DataFrameToRDDParams,
      params: Map[String, Any],
      hasGroup: Boolean,
      evalData: Map[String, Dataset[_]]): (Booster, Map[String, Array[Float]]) = {

2.1. add a function in trainDistributed to choose the training implementation

val trainImpl = chooseTrainImpl(hasGroup, xgbExecParams, evalData)

e.g., if an XGBoost plugin is detected, trainImpl points to the plugin implementation; otherwise it falls back to the default CPU pipeline.

2.2. convert the Dataset to an RDD and build Watches

        val boostersAndMetrics = trainImpl.datasetToRDD(trainingData, dsToRddParams).mapPartitions(
          iter => {
            val watches = trainImpl.buildWatches(iter)
            buildDistributedBooster(watches, xgbExecParams, rabitEnv,
              xgbExecParams.obj, xgbExecParams.eval, prevBooster)
        }).cache()

The code above is the main framework; trainingData is a Dataset[_].

The main framework does not care how trainImpl is implemented, whether by a plugin or by the default CPU implementation; effectively, it treats everything as a plugin.

So whether it is the plugin or the CPU pipeline, both must implement datasetToRDD and buildWatches, as sketched below.
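
To make the intent concrete, here is a rough sketch of the abstraction this PR aims at. The trait and method names (TrainIntf, datasetToRDD, buildWatches, chooseTrainImpl) follow this PR; the parameter types and the pluginDetected / pluginTrainImpl / defaultCpuTrainImpl names are placeholders for illustration, not the exact code:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.Dataset

  // Both the default CPU pipeline and a plugin implement the same two hooks,
  // so trainDistributed never needs to know which one it is talking to.
  trait TrainIntf {
    // Turn the incoming Dataset into an RDD whose partitions can feed Watches.
    def datasetToRDD(trainingData: Dataset[_], params: DataFrameToRDDParams): RDD[_]
    // Build the Watches (training/eval DMatrix set) from one partition's iterator.
    def buildWatches(iter: Iterator[_]): Watches
  }

  // Pick the implementation: if an XGBoost plugin is detected, use it;
  // otherwise fall back to one of the four default CPU implementations
  // (rank / non-rank, with / without eval sets).
  def chooseTrainImpl(hasGroup: Boolean,
                      xgbExecParams: XGBoostExecutionParams,
                      evalData: Map[String, Dataset[_]]): TrainIntf =
    if (pluginDetected) pluginTrainImpl
    else defaultCpuTrainImpl(hasGroup, evalData)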

BTW, this PR mainly re-architects the CPU pipeline; it does not change the original logic.

The second PR

The second PR will handle plugin detection and replacement. That should be simple.

Talk is cheap, so please check the code. Many thanks for your precious time.

Any suggestions and comments are appreciated. Thx in advance.

@CodingCat CodingCat self-requested a review July 31, 2020 20:25

@CodingCat CodingCat left a comment


Why does this have to be done in XGBoost/XGBoost.scala? XGBoost.scala exposes an RDD-based API for XGBoostClassifier/Regressor to use, not for end users.

You can use various approaches to accelerate the construction of the DataFrame and pass the resulting DF to XGBoostClassifier/Regressor... and then start training with GPU-HIST.

I don't understand why data preprocessing should leak into XGBoost.scala, or even into XGBoost itself.


wbo4958 commented Aug 2, 2020

Hi @CodingCat, thanks a lot for the review.

> Why does this have to be done in XGBoost/XGBoost.scala? XGBoost.scala exposes an RDD-based API for XGBoostClassifier/Regressor to use, not for end users.

Yeah, it seems the RDD-based API of XGBoost.scala only works with RDD[XGBLabeledPoint], and the RDD transforms are also based on XGBLabeledPoint; finally, the booster is built from XGBLabeledPoint as well. So you can see that XGBoost.scala is tightly coupled with XGBLabeledPoint.

The plugin design wants XGBoost.scala to focus on building the DMatrix, passing it to the XGBoost library, training, and finally returning the model. As for how to build the DMatrix, different plugins may have different implementations. For example, the CPU pipeline uses XGDMatrixCreateFromDataIter, which builds the DMatrix from an Iterator, while the GPU pipeline may need XGBoosterPredictFromArrayInterfaceColumns, which is based on the array-interface JSON format rather than LabeledPoint.
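
For reference, a minimal sketch of today's CPU path, where the DMatrix is built from an Iterator of host-memory LabeledPoint. The buildCpuDMatrix helper name is mine, for illustration only; a GPU plugin would replace this step with a columnar, device-memory build:

  import ml.dmlc.xgboost4j.LabeledPoint
  import ml.dmlc.xgboost4j.scala.DMatrix

  // CPU path: the rows already live in host RAM as LabeledPoint, so the
  // DMatrix is constructed from a plain iterator (which goes through
  // XGDMatrixCreateFromDataIter in the native library).
  def buildCpuDMatrix(rows: Iterator[LabeledPoint]): DMatrix =
    new DMatrix(rows)

  // A GPU plugin would instead build the DMatrix from columnar data that is
  // already resident in device memory, so rows never have to be copied back
  // to the CPU. (Conceptual only; the plugin API is not defined here.)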

> You can use various approaches to accelerate the construction of the DataFrame and pass the resulting DF to XGBoostClassifier/Regressor... and then start training with GPU-HIST.

> I don't understand why data preprocessing should leak into XGBoost.scala, or even into XGBoost itself.

Yeah. If we use spark-rapids to implement the plugin, consider this situation: spark-rapids (a Spark plugin that can accelerate most SQL/DataFrame operators) loads the data into GPU memory in columnar format from the beginning and then performs operations like filter/agg/join in GPU memory.

If we still use RDD[XGBLabeledPoint], we have to convert the columnar data to rows and construct LabeledPoint, which copies the data from GPU to CPU; then the DMatrix is built on the CPU, and finally training runs on the GPU.

But if we use the plugin approach and XGBoosterPredictFromArrayInterfaceColumns to build the DMatrix, the data never falls back to the CPU and the DMatrix is also built on the GPU, which is really fast.

                       sql/dataframe   building DMatrix   train
RDD[XGBLabeledPoint]   cpu             cpu                gpu
Plugin                 gpu             gpu                gpu


CodingCat commented Aug 2, 2020

@wbo4958 thanks for the explanation! Now I understand why you chose to structure this PR this way...

regarding your last comparison

>                        sql/dataframe   building DMatrix   train
> RDD[XGBLabeledPoint]   cpu             cpu                gpu
> Plugin                 gpu             gpu                gpu

If we use RDD[XGBLabeledPoint], the sql/dataframe stage is not necessarily on the CPU. As I said, you can build the DataFrame with the GPU, and then only the DMatrix construction would be on the CPU (though I agree there is an extra memory-copy overhead).

  1. The major concern here is that we SHOULD AVOID building a data-preprocessing interface framework inside XGBoost. It is hard to imagine what will happen after a few years; if any change is needed, that means another round of API deprecation and renewal.

Why don't we just add a new API in XGBoost.scala that accepts RDD[DMatrix]? We can do some type checking in XGBoost.scala: if an RDD[DMatrix] is passed in, we can skip all the transformations in the current implementation and go directly to the training phase.

In this way, we leave complete flexibility to different hardware vendors to provide their own implementations of data transformers; for XGBoost, we only care about the DMatrix.
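
A minimal sketch of what such an entry point could look like (this signature is hypothetical and only illustrates the suggestion; it is not the API actually proposed in #5972):

  import ml.dmlc.xgboost4j.scala.{Booster, DMatrix}
  import org.apache.spark.rdd.RDD

  /**
   * Hypothetical sketch: the partitions already hold a DMatrix, so
   * XGBoost.scala can skip every XGBLabeledPoint transformation and go
   * straight to distributed training.
   */
  private[spark] def trainDistributed(
      trainingData: RDD[DMatrix],
      params: Map[String, Any]): (Booster, Map[String, Array[Float]]) = {
    // set up the tracker/Rabit env, mapPartitions over trainingData,
    // train one booster per partition's DMatrix, and collect the metrics
    ???
  }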

@CodingCat

In XGBoostClassifier/XGBoostRegressor, it seems we could also accept Dataset[DMatrix] and skip all the logic that currently exists only for the CPU path?

trainImpl.cleanUpDriver()
}
}


A lot of the code here is just copy-pasted from the original method?

At the very least we should provide some abstraction to avoid duplicated code.


CodingCat commented Aug 3, 2020

After checking the PR, I think there is some misunderstanding of the architecture of XGBoost-Spark. I would like to highlight the design principle of XGBoost.scala: this file should be kept very low level; it integrates Spark and XGBoost's JVM binding through their lowest-level interfaces. Many years ago, when we started this project, that interface was RDD[XGBoostLabeledPoint], which was the only option at the time.

Now that new hardware is surging, this level of abstraction looks incomplete, since XGBoostLabeledPoint is essentially an abstraction of XGBoost input data in host RAM. I think we should adjust the abstraction provided by this file.

Following this path, we should either extend XGBoostLabeledPoint to cover device memory, which does not seem to be a good solution for many reasons, or provide integration through RDD[DMatrix], since DMatrix is the unified interface for interacting with XGBoost data.

In the current PR, Dataset is a high-level abstraction of Spark's distributed structured data (higher level than RDD), which places more restrictions on how this file can be used, and the _ in Dataset[_] means mixing different layers of a layered design. Both of these violate the original design principles.


wbo4958 commented Aug 3, 2020

Hi @CodingCat, many thanks for reviewing this in your spare time.

> After checking the PR, I think there is some misunderstanding of the architecture of XGBoost-Spark. I would like to highlight the design principle of XGBoost.scala: this file should be kept very low level; it integrates Spark and XGBoost's JVM binding through their lowest-level interfaces. Many years ago, when we started this project, that interface was RDD[XGBoostLabeledPoint], which was the only option at the time.

Thanks for sharing the original design principle; the design seems quite reasonable.

> Now that new hardware is surging, this level of abstraction looks incomplete, since XGBoostLabeledPoint is essentially an abstraction of XGBoost input data in host RAM. I think we should adjust the abstraction provided by this file.
>
> Following this path, we should either extend XGBoostLabeledPoint to cover device memory, which does not seem to be a good solution for many reasons, or provide integration through RDD[DMatrix], since DMatrix is the unified interface for interacting with XGBoost data.
>
> In the current PR, Dataset is a high-level abstraction of Spark's distributed structured data (higher level than RDD), which places more restrictions on how this file can be used, and the _ in Dataset[_] means mixing different layers of a layered design. Both of these violate the original design principles.

For your two concerns, let me explain my view.

  1. Extract the XGBLabeledPoint-related code from XGBoost.scala into a separate file, because it is the CPU pipeline implementation and no longer belongs in XGBoost.scala if we plan to support plugins. (I didn't extract XGBLabeledPoint in this PR because that would be a little aggressive; it would change a lot of code.)
  2. Let XGBoost.scala behave like a bridge that connects Spark and the XGBoost JVM binding (see the xgboost-plugin diagram attached to the original comment).

So if XGBoost.scala behaves like a bridge, the main framework of that bridge is: convert the Dataset to an RDD, build the DMatrix inside the RDD, and finally train. The bridge (XGBoost.scala) does not care how each step of this framework is implemented; it only cares that the framework works.

But I would like to say that RDD[DMatrix] is a brilliant idea, and I have an initial PR for an RDD[DMatrix] implementation in #5972. There may be one issue: when building the DMatrix, XGBoost native may need to do some synchronization between different workers (right now the sync is not enabled, but I suppose it will be enabled soon; @trivialfis please correct me). So the DMatrix building should happen inside the Rabit env, and in that case it seems we cannot move the DMatrix building out of the Rabit env, which an RDD[DMatrix] built beforehand would require.
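
To illustrate the ordering concern, here is a rough sketch, assuming the native DMatrix build may need to synchronize across workers and therefore has to run after Rabit.init on every worker. The buildWatches and trainBooster helpers are placeholders (not the actual methods), and the other types are assumed to be those already imported in XGBoost.scala:

  import ml.dmlc.xgboost4j.java.Rabit
  import ml.dmlc.xgboost4j.scala.Booster

  // Sketch only: each partition joins the Rabit tracker first, then builds
  // its DMatrix (via Watches) *inside* the Rabit environment, trains, and
  // finally leaves the tracker. An RDD[DMatrix] materialized before this
  // point would have been built outside the Rabit environment.
  def trainPartition(iter: Iterator[XGBLabeledPoint],
                     rabitEnv: java.util.Map[String, String]): Iterator[Booster] = {
    Rabit.init(rabitEnv)                     // 1. join the Rabit tracker
    try {
      val watches = buildWatches(iter)       // 2. DMatrix construction, inside the Rabit env
      Iterator.single(trainBooster(watches)) // 3. distributed training
    } finally {
      Rabit.shutdown()                       // 4. leave the tracker
    }
  }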

@CodingCat

This can be closed since we are in favor of #5972?

@CodingCat

Closing, as the latest contribution is at #5972.
