
[jvm-packages][spark-gpu] Add GPU support for XGBoost4j-Spark-Gpu #7361

Closed
wants to merge 6 commits

Conversation

@wbo4958 (Contributor) commented Oct 26, 2021

This PR brings spark-rapids in to accelerate XGBoost end to end on GPU. It is a follow-up to (and replacement for) #5950; for more history, refer to this comment.

This PR first reworks the CPU train/transform pipeline and then adds a GPU pipeline for xgboost4j-spark-gpu.

Reworked CPU pipeline

  1. Reworked the CPU train/transform pipeline
  2. Moved data preparation into PreXGBoost, ahead of XGBoost training
  3. Changed the XGBoost API to train on RDD[Watches] instead of RDD[LabeledPoint]; a Watches is a collection of DMatrix objects (see the sketch just below)
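
To make the new contract concrete, here is a minimal sketch of the Watches-based entry point; the class and method shapes are approximations for illustration, not the PR's exact code:

    import ml.dmlc.xgboost4j.scala.{Booster, DMatrix}
    import org.apache.spark.rdd.RDD

    // Watches bundles the DMatrix instances (train set plus any eval sets)
    // for one partition; delete() releases the native memory they hold.
    class Watches(val datasets: Map[String, DMatrix]) {
      def delete(): Unit = datasets.values.foreach(_.delete())
    }

    object XGBoost {
      // Reworked entry point: train directly on prepared Watches instead of
      // converting RDD[LabeledPoint] internally (signature illustrative).
      def trainDistributed(
          watchesRDD: RDD[Watches],
          params: Map[String, Any]): RDD[Booster] = ???
    }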

Added GPU pipeline in XGBoost4j-Spark-Gpu

The goal of XGBoost4j-Spark-Gpu is to support training/transforming on both CPU and GPU.

  1. Soft-link the CPU pipeline into XGBoost4j-Spark-Gpu, so it fully supports CPU
  2. Add some GPU-only parameters
  3. Add the spark-rapids dependency to accelerate ETL on GPU
  4. Define PreXGBoostProvider and implement it for GPU
  5. Add a ServiceLoader entry to discover the GPU implementation
  6. Implement the APIs of PreXGBoostProvider

Design and APIs

[Flowchart: xgboost-plugin-way]

The flowchart above gives the whole picture.

This PR defines the APIs below, with separate implementations for CPU and GPU.

buildDatasetToRDD

buildDatasetToRDD converts a Dataset into RDD[Watches], where a Watches is a collection of DMatrix objects. This PR moves the data-preparation code into PreXGBoost. For the built-in CPU pipeline, data preparation first converts the Dataset into RDD[LabeledPoint], then into RDD[Watches], and training finally yields RDD[Booster]. The LabeledPoint path is inherently CPU-oriented: each Row is converted into a row-wise LabeledPoint. The GPU, by contrast, operates on column-wise data and does not need LabeledPoint at all; it can build a DMatrix directly from the column data.

transformSchema

The built-in transformSchema in the Spark ML framework requires the feature column and the label column, where the feature column is a vectorized assembly of all the individual feature columns. As described above, the GPU path operates on column data and does not need to vectorize the feature columns, so for GPU we need to intercept transformSchema and perform the check ourselves; a sketch follows.
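
For illustration only (validateGpuSchema and its parameters are hypothetical names, not the PR's API), the GPU-side check could look like this:

    import org.apache.spark.sql.types.{NumericType, StructType}

    // Hypothetical sketch: every feature column and the label column must
    // exist and be a plain numeric column; no vectorization step is needed.
    def validateGpuSchema(
        schema: StructType,
        featureNames: Seq[String],
        labelName: String): Unit = {
      (featureNames :+ labelName).foreach { name =>
        val field = schema(name) // throws IllegalArgumentException if missing
        require(field.dataType.isInstanceOf[NumericType],
          s"Column '$name' must be numeric for the GPU pipeline, " +
            s"found ${field.dataType}")
      }
    }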

transformDataset

As with buildDatasetToRDD, CPU and GPU have different implementations for transforming (predicting over) the Dataset. Together, these three hooks form the provider interface sketched below.
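
A hedged sketch of that provider interface, using the Watches type sketched earlier; the method signatures are an approximation based on the description above, not copied from the PR:

    import org.apache.spark.ml.PipelineStage
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
    import org.apache.spark.sql.types.StructType

    trait PreXGBoostProvider {
      // Whether this provider should take over (e.g. the GPU build is on
      // the classpath and spark-rapids is active); otherwise CPU is used.
      def providerEnabled(ss: Option[SparkSession]): Boolean
      // GPU intercepts schema validation instead of Spark ML's built-in check.
      def transformSchema(stage: PipelineStage, schema: StructType): StructType
      // Convert the input Dataset into RDD[Watches] ready for training.
      def buildDatasetToRDD(estimator: PipelineStage, dataset: Dataset[_],
          params: Map[String, Any]): RDD[Watches]
      // CPU and GPU implement prediction over the Dataset differently.
      def transformDataset(model: PipelineStage, dataset: Dataset[_]): DataFrame
    }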

How the GPU implementation is discovered

This PR packages the GPU implementation as a plugin and loads it via ServiceLoader.

For xgboost4j-spark, this PR does not declare the service in its resources, so by default it will not detect a GPU implementation.

For xgboost4j-spark-gpu, this PR declares the GPU implementation in the service resource, so the GPU implementation is detected by default. The PR also defines an API to check whether the GPU implementation is enabled; if it is not, XGBoost falls back to the CPU implementation. A sketch of the registration and lookup follows.
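
Standard ServiceLoader discovery is driven by a provider-configuration file on the classpath. The file path and class names below are illustrative, not copied from the PR:

    # xgboost4j-spark-gpu resource file (illustrative path and class name):
    # src/main/resources/META-INF/services/ml.dmlc.xgboost4j.scala.spark.PreXGBoostProvider
    ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost

On the loading side, the CPU module can pick the first registered and enabled provider, and otherwise fall back to its built-in pipeline:

    import java.util.ServiceLoader
    import scala.collection.JavaConverters._

    // Returns the registered GPU provider when present and enabled;
    // None means use the built-in CPU pipeline (sketch, names illustrative).
    def discoverProvider(): Option[PreXGBoostProvider] =
      ServiceLoader.load(classOf[PreXGBoostProvider])
        .asScala.find(_.providerEnabled(None))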

Usage for GPU

    // Users can omit this and instead specify the configs when submitting the application
    val conf = new SparkConf()
      .set("spark.rapids.sql.enabled", "true")
      .set("spark.plugins", "com.nvidia.spark.SQLPlugin")

    val spark = SparkSession.builder()
      .master("local[1]")
      .config(conf)
      .appName(classOf[BobbyXGBoostSuite].getSimpleName)
      .getOrCreate()

    val schema = new StructType(Array(
      StructField("sepal length", DoubleType, true),
      StructField("sepal width", DoubleType, true),
      StructField("petal length", DoubleType, true),
      StructField("petal width", DoubleType, true),
      StructField("class", StringType, true)))
    val rawInput = spark.read.schema(schema).csv(path)

    val label = "class"
    // get all feature column names
    val featuresNames = schema.fieldNames.filterNot(_ == label)

    val xgbParam = Map("eta" -> 0.1f,
      "max_depth" -> 2,
      "objective" -> "multi:softprob",
      "num_class" -> 3,
      "num_round" -> 100,
      "num_workers" -> 1,
      "tree_method" -> "gpu_hist")

    val xgbClassifier = new XGBoostClassifier(xgbParam)
      .setLabelCol(label)
      .setFeaturesCols(featuresNames)  // API for GPU-only

    val xgbClassificationModel = xgbClassifier.fit(rawInput)

    val df = xgbClassificationModel.transform(rawInput)
    df.show()

Commits:
  • Add PreXGBoost to build RDD[Watches] from Dataset
  • Feed RDD[Watches] built from PreXGBoost to XGBoost to train
  • Extract the common transform code from XGBoostClassifier and XGBoostRegressor
  • Add Rapids plugin support
Oct 26, 2021: @wbo4958 changed the title to [jvm-packages][spark-gpu] Add GPU support for XGBoost4j-Spark-Gpu
@wbo4958 (Contributor Author) commented Oct 26, 2021

@trivialfis @hcho3 @RAMitchell please help review. Thanks.

@trivialfis (Member) left a comment

Have a few preliminary questions:

  • Does the user need to call functions from pre-xgboost before training/prediction?
  • The functions marked "GPU Only": is that a limitation of the interface, or can they be extended to CPU in the future?
  • What's buildUnsafeRows and why do you need it?

Assuming the API is fine, we can work with the C++ code first.

}

test("distributed training with group data") {
  val trainingRDD = sc.parallelize(Ranking.train, 5)
  val buildTrainingRDD = PreXGBoost.buildRDDLabeledPointToRDDWatches(trainingRDD, hasGroup = true)
@trivialfis (Member): Is this required for users? If so, that's a breaking change; can we hide it inside XGBoost?

@wbo4958 (Contributor Author): It's not really a breaking change, since trainDistributed is restricted to the [spark] package, meaning users are not supposed to use this function.

@trivialfis (Member), Nov 7, 2021, quoting the reply above: Then why do you need to change the test?

@@ -254,85 +253,28 @@ class XGBoostRegressionModel private[ml] (

def setInferBatchSize(value: Int): this.type = set(inferBatchSize, value)

/**
* This API is only used in GPU train pipeline of xgboost4j-spark-gpu, which requires
* all feature columns must be numeric types.
@trivialfis (Member): So the CPU can have other types? Also, the GPU now handles categorical data types.

try {
  cb.close();
} catch (Exception e) {
  e.printStackTrace();
@trivialfis (Member) Oct 28, 2021: So ... is the exception being thrown?

import ml.dmlc.xgboost4j.gpu.java.CudfColumnBatch;

/**
* CudfTable with schema for scala
@trivialfis (Member): Please document its functionality, i.e., what it's for.

@wbo4958 (Contributor Author) commented Oct 28, 2021


  • Does the user need to call functions from pre-xgboost before training/prediction?

No. PreXGBoost is a utility that converts a Dataset into RDD[DMatrix]; users will not use this function directly.

  • The functions marked "GPU Only": is that a limitation of the interface, or can they be extended to CPU in the future?

Yes, it's a limitation, since the GPU is column-wise while the CPU is row-wise. In the future we may be able to unify the interface for GPU and CPU.

  • What's buildUnsafeRows and why do you need it?

On the GPU the transformed data is row-flattened, so buildUnsafeRows converts it into the format Spark needs, copies it to the CPU, and finally feeds it to Spark.

@trivialfis (Member) commented

Sharing the offline discussion here: I think the user interface is fine since it aligns with the existing CPU implementation. We will start with the smaller, lower-level infrastructure first by eliminating unnecessary code.

cc @hcho3 @RAMitchell

@wbo4958 (Contributor Author) commented Nov 3, 2021

Thanks @trivialfis, I just removed the unsafe-row building since it's unnecessary. Could you review it again?

@trivialfis (Member) left a comment

Initial review. Great work: this PR is a lot cleaner than the previous version.

Let me try to summarize the PR based on my own understanding; feel free to correct me if I'm wrong. You have extracted the pre-processing steps into individual CPU and GPU modules as PreXGBoostXXX and added glue code for converting a Dataset into an RDD. That seems to be a viable approach.

Some questions are inlined in the review.

@@ -18,6 +18,7 @@ endif (ENABLE_ALL_WARNINGS)
target_link_libraries(xgboost4j PRIVATE objxgboost)
target_include_directories(xgboost4j
PRIVATE
${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES}
@trivialfis (Member): Now that we no longer have new CUDA code, is this necessary?

@wbo4958 (Contributor Author): This is a rough PR; will clean it up.

<scala.version>2.12.8</scala.version>
<scala.binary.version>2.12</scala.binary.version>
<hadoop.version>2.7.3</hadoop.version>
<maven.wagon.http.retryHandler.count>5</maven.wagon.http.retryHandler.count>
<log.capi.invocation>OFF</log.capi.invocation>
<use.cuda>OFF</use.cuda>
<cudf.version>21.08.2</cudf.version>
@trivialfis (Member): Is it necessary to have this in the CPU package?

@wbo4958 (Contributor Author): No big deal; it's just a property, not a dependency.

@@ -0,0 +1,276 @@
/*
Copyright (c) 2014 by Contributors
@trivialfis (Member): Is it necessary to put files under the nvidia namespace? xgboost is not an NVIDIA project by itself.

@wbo4958 (Contributor Author): OK, will change the package name.


/** Slice the columns indicated by indices into a Table */
public Table slice(List<Integer> indices) {
  if (indices == null || indices.size() == 0) {
@trivialfis (Member): When will it be NULL?

@wbo4958 (Contributor Author): no

}

/** Slice the columns indicated by indices into a Table */
public Table slice(List<Integer> indices) {
@trivialfis (Member): Does the cuDF Java binding support slicing?

@wbo4958 (Contributor Author): no

  return schema;
}

public double getMaxInColumn(int colIndex) {
@trivialfis (Member): Where is this function being used?

@wbo4958 (Contributor Author): Nowhere; will clean it up.

* @param weightInfo Weight information calculated from earlier batches.
* @return The group id of last group in current column batch.
*/
public int groupAndAggregateOnColumnsHost(int groupIdx, int weightIdx, int prevTailGid,
@trivialfis (Member): Is this function used anywhere?

@wbo4958 (Contributor Author): This is a rough PR; will clean it up.

    missing: Float,
    maxBin: Int): DMatrix = {
  // FIXME add option or dynamic to check.
  if (true) {
@trivialfis (Member): What is this?

@wbo4958 (Contributor Author): Rough PR; will clean it up.


private[this] class RapidsIterator(base: Iterator[GpuColumnBatch],
    indices: ColumnIndices) extends Iterator[CudfColumnBatch] {
  var maxLabels: Double = 0.0f
@trivialfis (Member): Where is this used?

@wbo4958 (Contributor Author): Draft PR; will clean it up.

@@ -0,0 +1,53 @@
/*
Copyright (c) 2014 by Contributors
@trivialfis (Member) Nov 7, 2021: Is it intended to put this file in the CPU package?

@wbo4958 (Contributor Author): Yes, the parameter is shared by CPU and GPU, since the entry points (XGBoostClassifier/XGBoostRegressor) live in the CPU package.

@wbo4958 (Contributor Author) commented Nov 8, 2021

Since this is a draft PR meant to get the GPU pipeline implementation in place, I didn't take much care over the trivial things; those will be resolved in the follow-up smaller PRs. Let me start putting up the real PR. Thx @trivialfis
