
XGBoostRegressor with GPU on the Spark+Rapids JDK8 #7994

Closed
Dartya opened this issue Jun 14, 2022 · 38 comments · Fixed by #8015

Comments


Dartya commented Jun 14, 2022

I am trying to implement an example using Java JDK8.
The example says that when using the GPU, you must set the value of featuresCols using the setFeaturesCols() method.

val xgbRegressor = new XGBoostRegressor(xgbParamFinal)
  .setLabelCol(labelName)
  .setFeaturesCols(featureNames)

The problem is that no such method setFeaturesCols() exists; only .setFeaturesCol("features") and .ml$dmlc$xgboost4j$scala$spark$params$HasFeaturesCols$_setter_$featuresCols_$eq() are available.

If this parameter is not set, the call fails with an exception:

2022-06-14 19:57:54.865 ERROR 1 --- [   scheduling-1] o.s.s.s.TaskUtils$LoggingErrorHandler    : Unexpected error occurred in scheduled task

java.util.NoSuchElementException: Failed to find a default value for featuresCols
        at org.apache.spark.ml.param.Params.$anonfun$getOrDefault$2(params.scala:756) ~[spark-mllib_2.12-3.2.1.jar!/:3.2.1]
        at scala.Option.getOrElse(Option.scala:189) ~[scala-library-2.12.15.jar!/:na]
        at org.apache.spark.ml.param.Params.getOrDefault(params.scala:756) ~[spark-mllib_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.ml.param.Params.getOrDefault$(params.scala:753) ~[spark-mllib_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:41) ~[spark-mllib_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.ml.param.Params.$(params.scala:762) ~[spark-mllib_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.ml.param.Params.$$(params.scala:762) ~[spark-mllib_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:41) ~[spark-mllib_2.12-3.2.1.jar!/:3.2.1]
        at ml.dmlc.xgboost4j.scala.spark.params.HasFeaturesCols.getFeaturesCols(GeneralParams.scala:262) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar!/:na]
        at ml.dmlc.xgboost4j.scala.spark.params.HasFeaturesCols.getFeaturesCols$(GeneralParams.scala:262) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar!/:na]
        at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.getFeaturesCols(XGBoostRegressor.scala:37) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar!/:na]
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$.transformSchema(GpuPreXGBoost.scala:382) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar!/:na]
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost.transformSchema(GpuPreXGBoost.scala:90) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar!/:na]
        at ml.dmlc.xgboost4j.scala.spark.PreXGBoost$.transformSchema(PreXGBoost.scala:86) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar!/:na]
        at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.transformSchema(XGBoostRegressor.scala:169) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar!/:na]
        at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71) ~[spark-mllib_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.ml.Predictor.fit(Predictor.scala:133) ~[spark-mllib_2.12-3.2.1.jar!/:3.2.1]
        at com.alekscapital.mltests.service.MLService.xgBoostTest(MLService.java:148) ~[classes!/:0.0.1-SNAPSHOT]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_212]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_212]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_212]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_212]
        at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84) ~[spring-context-5.3.10.jar!/:5.3.10]
        at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) ~[spring-context-5.3.10.jar!/:5.3.10]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_212]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_212]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_212]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_212]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_212]

The problem is exacerbated by the fact that I can't find a working example of converting an array of strings into a StringArrayParam. I tried to implement the transformers from the example, but that fails with an error too (I didn't save the stack trace).

My questions:

  1. Is it possible to run XGBoostRegressor on Java 8 with Spark, RAPIDS, and a GPU?
  2. If yes, which method should I use to set the parameter?
  3. How do I build a StringArrayParam value from a String[] array?

Here is my code:

    public void xgBoostTest() {
        String labelName = "label";
        List<StructField> schemaFields = Arrays.asList(
                DataTypes.createStructField("vendor_id", DataTypes.DoubleType, true),
                DataTypes.createStructField("passenger_count", DataTypes.DoubleType, true),
                DataTypes.createStructField("trip_distance", DataTypes.DoubleType, true),
                DataTypes.createStructField("pickup_longitude", DataTypes.DoubleType, true),
                DataTypes.createStructField("pickup_latitude", DataTypes.DoubleType, true),
                DataTypes.createStructField("rate_code", DataTypes.DoubleType, true),
                DataTypes.createStructField("store_and_fwd", DataTypes.DoubleType, true),
                DataTypes.createStructField("dropoff_longitude", DataTypes.DoubleType, true),
                DataTypes.createStructField("dropoff_latitude", DataTypes.DoubleType, true),
                DataTypes.createStructField(labelName, DataTypes.DoubleType, true),
                DataTypes.createStructField("hour", DataTypes.DoubleType, true),
                DataTypes.createStructField("year", DataTypes.IntegerType, true),
                DataTypes.createStructField("month", DataTypes.IntegerType, true),
                DataTypes.createStructField("day", DataTypes.DoubleType, true),
                DataTypes.createStructField("day_of_week", DataTypes.DoubleType, true),
                DataTypes.createStructField("is_weekend", DataTypes.DoubleType, true)
        );
        StructType schema = DataTypes.createStructType(schemaFields);

        String trainPath = "/opt/spark/train.csv";
        //test
        String evalPath  = "/opt/spark/eval.csv";

        Dataset<Row> tdf = session.read()
                .option("inferSchema", "false")
                .option("header", true)
                .schema(schema)
                .csv(trainPath);
        Dataset<Row> edf = session.read()
                .option("inferSchema", "false")
                .option("header", true)
                .schema(schema)
                .csv(evalPath);

        String[] featureColumns = {"passenger_count", "trip_distance", "pickup_longitude", "pickup_latitude", "rate_code",
                "dropoff_longitude", "dropoff_latitude", "hour", "day_of_week", "is_weekend"};

        Map<String, Object> paramMap = new HashMap<>();
        paramMap.put("learning_rate", 0.05);
        paramMap.put("max_depth", 8);
        paramMap.put("subsample", 0.8);
        paramMap.put("gamma", 1);
        paramMap.put("num_round", 500);
        paramMap.put("tree_method", "gpu_hist");
        paramMap.put("num_workers", 1);

        Dataset<Row> ntdf = transform(tdf, featureColumns, labelName);
        Dataset<Row> nedf = transform(edf, featureColumns, labelName);

        XGBoostRegressor regressor = new XGBoostRegressor(paramMap)
                .setLabelCol(labelName)
                .setFeaturesCol("features");

        PredictionModel<Vector, XGBoostRegressionModel> model = regressor.fit(ntdf);
        Dataset<Row> result = model.transform(nedf);
        result.show();
    }

    public Dataset<Row> transform(Dataset<Row> df, String[] strArr, String labelName) {
        Dataset<Row> ndf = df
                .withColumn("year", df.col("year").cast(DataTypes.DoubleType))
                .withColumn("month", df.col("month").cast(DataTypes.DoubleType));

        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(strArr)
                .setOutputCol("features");

        return assembler
                .transform(ndf.select(
                        col("vendor_id").cast(DataTypes.FloatType),
                        col("passenger_count").cast(DataTypes.FloatType),
                        col("trip_distance").cast(DataTypes.FloatType),
                        col("pickup_longitude").cast(DataTypes.FloatType),
                        col("pickup_latitude").cast(DataTypes.FloatType),
                        col("rate_code").cast(DataTypes.FloatType),
                        col("store_and_fwd").cast(DataTypes.FloatType),
                        col("dropoff_longitude").cast(DataTypes.FloatType),
                        col("dropoff_latitude").cast(DataTypes.FloatType),
                        col(labelName).cast(DataTypes.FloatType),
                        col("hour").cast(DataTypes.FloatType),
                        col("year").cast(DataTypes.FloatType),
                        col("month").cast(DataTypes.FloatType),
                        col("day").cast(DataTypes.FloatType),
                        col("day_of_week").cast(DataTypes.FloatType),
                        col("is_weekend").cast(DataTypes.FloatType)
                    ))
                .select(col("features"), col(labelName));
    }
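A side note on the parameter map: when porting the Scala samples to Java it is easy to mix up the two Map APIs. Scala's immutable `Map.updated` returns a new map, whereas `java.util.HashMap` is mutated in place with `put` (calling `updated` on a `java.util.HashMap` does not compile). A minimal stdlib-only sketch of the translated parameter map (the `ParamMapDemo` class name is just for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class ParamMapDemo {
    public static Map<String, Object> buildParams() {
        // java.util.HashMap is mutated in place; there is no `updated` as in Scala.
        Map<String, Object> paramMap = new HashMap<>();
        paramMap.put("learning_rate", 0.05);
        paramMap.put("max_depth", 8);
        paramMap.put("subsample", 0.8);
        paramMap.put("gamma", 1);
        paramMap.put("num_round", 500);
        paramMap.put("tree_method", "gpu_hist");
        paramMap.put("num_workers", 1);
        return paramMap;
    }

    public static void main(String[] args) {
        Map<String, Object> p = buildParams();
        System.out.println(p.get("tree_method")); // prints "gpu_hist"
        System.out.println(p.size());             // prints 7
    }
}
```

The resulting `Map<String, Object>` is then passed to the `XGBoostRegressor` constructor as in the snippet above.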

dependencies of pom.xml:

  <properties>
       <java.version>1.8</java.version>
       <scala.version>2.12</scala.version>
       <spark.version>3.2.1</spark.version>
   </properties>

       <!-- spark -->
       <dependency>
           <groupId>org.apache.spark</groupId>
           <artifactId>spark-core_${scala.version}</artifactId>
           <version>${spark.version}</version>
       </dependency>
       <dependency>
           <groupId>org.apache.spark</groupId>
           <artifactId>spark-hive_${scala.version}</artifactId>
           <version>${spark.version}</version>
       </dependency>
       <dependency>
           <groupId>org.apache.spark</groupId>
           <artifactId>spark-streaming_${scala.version}</artifactId>
           <version>${spark.version}</version>
       </dependency>
       <dependency>
           <groupId>org.apache.spark</groupId>
           <artifactId>spark-sql_${scala.version}</artifactId>
           <version>${spark.version}</version>
       </dependency>
       <!-- spark-mllib -->
       <dependency>
           <groupId>org.apache.spark</groupId>
           <artifactId>spark-mllib_${scala.version}</artifactId>
           <version>${spark.version}</version>
       </dependency>
       <!-- https://mvnrepository.com/artifact/ml.dmlc/xgboost4j-spark -->
       <dependency>
           <groupId>ml.dmlc</groupId>
           <artifactId>xgboost4j-spark-gpu_${scala.version}</artifactId>
           <version>1.6.1</version>
       </dependency>
       <!-- https://mvnrepository.com/artifact/ai.rapids/xgboost4j-spark -->
       <dependency>
           <groupId>ai.rapids</groupId>
           <artifactId>xgboost4j-spark_2.x</artifactId>
           <version>1.0.0-Beta5</version>
       </dependency>
@trivialfis
Member

@wbo4958 Could you please take a look when time allows?

Contributor

wbo4958 commented Jun 15, 2022

Hi @Dartya, the sample is a bit out of date. Please use "setFeaturesCol" instead of "setFeaturesCols". There is also a newer sample at https://github.com/NVIDIA/spark-rapids-examples/blob/branch-22.06/examples/XGBoost-Examples/taxi/scala/src/com/nvidia/spark/examples/taxi/Main.scala; please check it out and try it. The performance is impressive.

Author

Dartya commented Jun 15, 2022

@wbo4958 Hi!
I used "setFeaturesCol" and got "java.util.NoSuchElementException: Failed to find a default value for featuresCols". The stack trace is in the first post.

Contributor

wbo4958 commented Jun 15, 2022

@Dartya Sorry for that,

Can you change

        Dataset<Row> ntdf = transform(tdf, featureColumns, labelName);
        Dataset<Row> nedf = transform(edf, featureColumns, labelName);

        XGBoostRegressor regressor = new XGBoostRegressor(paramMap)
                .setLabelCol(labelName)
                .setFeaturesCol("features");

        PredictionModel<Vector, XGBoostRegressionModel> model = regressor.fit(ntdf);
        Dataset<Row> result = model.transform(nedf);

to

        // Dataset<Row> ntdf = transform(tdf, featureColumns, labelName);
        // Dataset<Row> nedf = transform(edf, featureColumns, labelName);

        XGBoostRegressor regressor = new XGBoostRegressor(paramMap)
                .setLabelCol(labelName)
                .setFeaturesCol(featureColumns);

        PredictionModel<Vector, XGBoostRegressionModel> model = regressor.fit(tdf);
        Dataset<Row> result = model.transform(edf);

Contributor

wbo4958 commented Jun 15, 2022

There is no need to do the vectorization; you only need to specify the feature column names.

Author

Dartya commented Jun 15, 2022

@wbo4958, thanks for the tip! I no longer transform the data types; instead I read all the data as floats from the start.

Apparently the data has started loading into GPU memory, since the regressor parameters were written to the log and... new errors appeared. I identified this cause:
Type 'ai/rapids/cudf/ColumnVector' (current frame, stack[1]) is not assignable to 'ai/rapids/cudf/ColumnView'
Maybe using Dataset this way is not the right approach.

A Google search hasn't turned up any results yet, but I'm still looking. Just in case, here are the resulting code and output.

I also want to say that it is now important for me to get this example working correctly: it affects the choice of technology for my project, whose central technical component is distributed training across several nodes. I probably lack in-depth knowledge of the libraries and APIs, but working examples are how you get a feel for a technology with minimal knowledge. I apologize for the confusion and for any possibly naive questions.

Code:

public void xgBoostTest() {
	String labelName = "fare_amount";
	List<StructField> schemaFields = Arrays.asList(
			DataTypes.createStructField("vendor_id", DataTypes.FloatType, true),
			DataTypes.createStructField("passenger_count", DataTypes.FloatType, true),
			DataTypes.createStructField("trip_distance", DataTypes.FloatType, true),
			DataTypes.createStructField("pickup_longitude", DataTypes.FloatType, true),
			DataTypes.createStructField("pickup_latitude", DataTypes.FloatType, true),
			DataTypes.createStructField("rate_code", DataTypes.FloatType, true),
			DataTypes.createStructField("store_and_fwd", DataTypes.FloatType, true),
			DataTypes.createStructField("dropoff_longitude", DataTypes.FloatType, true),
			DataTypes.createStructField("dropoff_latitude", DataTypes.FloatType, true),
			DataTypes.createStructField(labelName, DataTypes.FloatType, true),
			DataTypes.createStructField("hour", DataTypes.FloatType, true),
			DataTypes.createStructField("year", DataTypes.FloatType, true),
			DataTypes.createStructField("month", DataTypes.FloatType, true),
			DataTypes.createStructField("day", DataTypes.FloatType, true),
			DataTypes.createStructField("day_of_week", DataTypes.FloatType, true),
			DataTypes.createStructField("is_weekend", DataTypes.FloatType, true)
	);
	StructType schema = DataTypes.createStructType(schemaFields);

	String trainPath = "/opt/spark/train.csv";
	//test
	String evalPath  = "/opt/spark/eval.csv";

	Dataset<Row> tdf = session.read()
			.option("inferSchema", "false")
			.option("header", true)
			.schema(schema)
			.csv(trainPath);
	tdf.show();
	Dataset<Row> edf = session.read()
			.option("inferSchema", "false")
			.option("header", true)
			.schema(schema)
			.csv(evalPath);
	edf.show();

	String[] featureColumns = {"passenger_count", "trip_distance", "pickup_longitude", "pickup_latitude", "rate_code",
			"dropoff_longitude", "dropoff_latitude", "hour", "day_of_week", "is_weekend"};

	Map<String, Object> paramMap = new HashMap<>();
	paramMap.put("learning_rate", 0.05);
	paramMap.put("max_depth", 8);
	paramMap.put("subsample", 0.8);
	paramMap.put("gamma", 1);
	paramMap.put("num_round", 500);
	paramMap.put("tree_method", "gpu_hist");
	paramMap.put("num_workers", 1);

	XGBoostRegressor regressor = new XGBoostRegressor(paramMap);
	regressor.setLabelCol(labelName);
	regressor.setFeaturesCol(featureColumns);

	PredictionModel<Vector, XGBoostRegressionModel> model = regressor.fit(tdf);
	Dataset<Row> result = model.transform(edf);
	result.show();
}

Logs:

+------------+---------------+-------------+----------------+---------------+-------------+-------------+-----------------+----------------+-----------+----+------+-----+----+-----------+----------+
|   vendor_id|passenger_count|trip_distance|pickup_longitude|pickup_latitude|    rate_code|store_and_fwd|dropoff_longitude|dropoff_latitude|fare_amount|hour|  year|month| day|day_of_week|is_weekend|
+------------+---------------+-------------+----------------+---------------+-------------+-------------+-----------------+----------------+-----------+----+------+-----+----+-----------+----------+
|1.55973043E9|            2.0|          0.7|        -73.9746|      40.759945| -6.7741894E8|    2313200.0|        -73.98473|       40.759388|        5.0|23.0|2012.0| 11.0| 7.0|        0.0|       0.0|
|1.55973043E9|            3.0|         10.7|       -73.98994|      40.756775| -6.7741894E8|    2313200.0|        -73.86525|        40.77063|       34.0|12.0|2012.0| 11.0| 7.0|        0.0|       0.0|
|1.55973043E9|            1.0|          2.3|       -73.98851|      40.774307| -6.7741894E8|    2313200.0|       -73.981094|       40.755325|       10.0| 7.0|2012.0| 11.0| 7.0|        0.0|       0.0|
|1.55973043E9|            1.0|          4.4|       -74.01039|      40.708702| -6.7741894E8|    2313200.0|        -73.98785|       40.756104|       16.5|14.0|2012.0| 11.0|10.0|        3.0|       0.0|
|1.55973043E9|            1.0|          1.5|       -73.99211|        40.6897| -6.7741894E8|  9.3351347E8|        -74.00716|       40.679295|        7.0| 8.0|2012.0| 11.0| 7.0|        0.0|       0.0|
|1.55973043E9|            1.0|          0.8|       -73.97713|       40.74831| -6.7741894E8|    2313200.0|        -73.99091|       40.751053|        7.5|11.0|2012.0| 11.0| 5.0|        5.0|       1.0|
|1.55973043E9|            1.0|          1.2|       -73.98238|        40.7521| -6.7741894E8|    2313200.0|        -73.99333|       40.736393|        5.5|18.0|2012.0| 11.0| 1.0|        1.0|       0.0|
|1.55973043E9|            1.0|          3.0|       -73.98702|      40.759373| -6.7741894E8|    2313200.0|        -73.86202|       40.768017|        2.5| 6.0|2012.0| 11.0| 2.0|        2.0|       0.0|
|4.52563168E8|            1.0|         2.34|       -73.99062|       40.75646| -6.7741894E8|         -1.0|         -73.9915|         40.7395|        9.5|10.0|2012.0| 11.0|13.0|        6.0|       1.0|
|4.52563168E8|            1.0|         3.17|       -73.97528|       40.75582| -6.7741894E8|         -1.0|        -73.94726|       40.793488|       12.0|10.0|2012.0| 11.0|13.0|        6.0|       1.0|
|4.52563168E8|            1.0|         1.49|       -73.99667|       40.69319| -6.7741894E8|         -1.0|         -73.9778|        40.68419|        8.0|10.0|2012.0| 11.0|13.0|        6.0|       1.0|
|4.52563168E8|            1.0|         1.28|       -74.00064|      40.742344| -6.7741894E8|         -1.0|        -74.00345|       40.732536|        5.5|10.0|2012.0| 11.0|13.0|        6.0|       1.0|
|4.52563168E8|            5.0|         1.82|      -73.971596|        40.7465| -6.7741894E8|         -1.0|       -73.973595|       40.765022|        8.0|10.0|2012.0| 11.0|13.0|        6.0|       1.0|
|4.52563168E8|            1.0|         0.81|       -73.99934|       40.73405| -6.7741894E8|         -1.0|       -73.992195|       40.735596|        5.0|10.0|2012.0| 11.0|13.0|        6.0|       1.0|
|4.52563168E8|            1.0|         1.75|       -73.98272|      40.771534| -6.7741894E8|         -1.0|        -74.00029|       40.755962|        7.0| 7.0|2012.0| 11.0|13.0|        6.0|       1.0|
|4.52563168E8|            1.0|        17.27|       -73.78198|       40.64467|-1.97494176E8|         -1.0|       -73.978584|       40.741043|       52.0| 7.0|2012.0| 11.0|13.0|        6.0|       1.0|
|4.52563168E8|            5.0|         0.61|       -73.95043|       40.78659| -6.7741894E8|         -1.0|        -73.95238|       40.783436|        4.0| 7.0|2012.0| 11.0|13.0|        6.0|       1.0|
|4.52563168E8|            2.0|         0.42|       -74.00052|       40.72251| -6.7741894E8|         -1.0|        -74.00525|       40.722633|        4.0| 7.0|2012.0| 11.0|13.0|        6.0|       1.0|
|4.52563168E8|            1.0|         1.28|      -73.990326|       40.74196| -6.7741894E8|         -1.0|       -73.993355|        40.75226|        6.5| 9.0|2012.0| 11.0|13.0|        6.0|       1.0|
|4.52563168E8|            1.0|         1.46|       -73.96373|      40.757343| -6.7741894E8|         -1.0|       -73.958694|       40.773037|        6.0| 9.0|2012.0| 11.0|13.0|        6.0|       1.0|
+------------+---------------+-------------+----------------+---------------+-------------+-------------+-----------------+----------------+-----------+----+------+-----+----+-----------+----------+
only showing top 20 rows

2022-06-15 19:30:23.810  INFO 1 --- [   scheduling-1] XGBoostSpark                             : Running XGBoost 1.6.1 with parameters:
alpha -> 0.0
learning_rate -> 0.05
min_child_weight -> 1.0
sample_type -> uniform
base_score -> 0.5
rabit_timeout -> -1
colsample_bylevel -> 1.0
grow_policy -> depthwise
skip_drop -> 0.0
lambda_bias -> 0.0
silent -> 0
scale_pos_weight -> 1.0
seed -> 0
cache_training_set -> false
handle_invalid -> error
features_col -> features
num_early_stopping_rounds -> 0
label_col -> fare_amount
num_workers -> 1
subsample -> 0.8
lambda -> 1.0
max_depth -> 8
tree_limit -> 0
custom_eval -> null
dmlc_worker_connect_retry -> 5
rate_drop -> 0.0
max_bin -> 256
train_test_ratio -> 1.0
use_external_memory -> false
objective -> reg:squarederror
features_cols -> [Ljava.lang.String;@534ae7f2
eval_metric -> rmse
num_round -> 500
timeout_request_workers -> 1800000
missing -> NaN
rabit_ring_reduce_threshold -> 32768
checkpoint_path ->
tracker_conf -> TrackerConf(0,python,,)
tree_method -> gpu_hist
max_delta_step -> 0.0
eta -> 0.3
verbosity -> 1
colsample_bytree -> 1.0
normalize_type -> tree
allow_non_zero_for_missing -> false
custom_obj -> null
gamma -> 1.0
sketch_eps -> 0.03
nthread -> 1
prediction_col -> prediction
checkpoint_interval -> -1
2022-06-15 19:30:23.811  WARN 1 --- [   scheduling-1] XGBoostSpark                             : train_test_ratio is deprecated since XGBoost 0.82, we recommend to explicitly pass a training and multiple evaluation datasets by passing 'eval_sets' and 'eval_set_names'
2022-06-15 19:30:23.841  INFO 1 --- [   scheduling-1] o.a.s.s.e.d.FileSourceStrategy           : Pushed Filters:
2022-06-15 19:30:23.841  INFO 1 --- [   scheduling-1] o.a.s.s.e.d.FileSourceStrategy           : Post-Scan Filters:
2022-06-15 19:30:23.842  INFO 1 --- [   scheduling-1] o.a.s.s.e.d.FileSourceStrategy           : Output Data Schema: struct<vendor_id: float, passenger_count: float, trip_distance: float, pickup_longitude: float, pickup_latitude: float ... 14 more fields>
2022-06-15 19:30:23.855 ERROR 1 --- [   scheduling-1] com.nvidia.spark.rapids.GpuOverrideUtil  : Encountered an exception applying GPU overrides java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:
    com/nvidia/spark/rapids/GpuCast.doColumnar(Lcom/nvidia/spark/rapids/GpuColumnVector;)Lai/rapids/cudf/ColumnVector; @27: invokevirtual
  Reason:
    Type 'ai/rapids/cudf/ColumnVector' (current frame, stack[1]) is not assignable to 'ai/rapids/cudf/ColumnView'
  Current Frame:
    bci: @27
    flags: { }
    locals: { 'com/nvidia/spark/rapids/GpuCast', 'com/nvidia/spark/rapids/GpuColumnVector' }
    stack: { 'com/nvidia/spark/rapids/GpuCast$', 'ai/rapids/cudf/ColumnVector', 'org/apache/spark/sql/types/DataType', 'org/apache/spark/sql/types/DataType', integer, integer, integer }
  Bytecode:
    0x0000000: b200 392b b601 a42b b601 a52a b600 d32a
    0x0000010: b600 ea2a b601 a72a b601 a9b6 0082 b0


java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:
    com/nvidia/spark/rapids/GpuCast.doColumnar(Lcom/nvidia/spark/rapids/GpuColumnVector;)Lai/rapids/cudf/ColumnVector; @27: invokevirtual
  Reason:
    Type 'ai/rapids/cudf/ColumnVector' (current frame, stack[1]) is not assignable to 'ai/rapids/cudf/ColumnView'
  Current Frame:
    bci: @27
    flags: { }
    locals: { 'com/nvidia/spark/rapids/GpuCast', 'com/nvidia/spark/rapids/GpuColumnVector' }
    stack: { 'com/nvidia/spark/rapids/GpuCast$', 'ai/rapids/cudf/ColumnVector', 'org/apache/spark/sql/types/DataType', 'org/apache/spark/sql/types/DataType', integer, integer, integer }
  Bytecode:
    0x0000000: b200 392b b601 a42b b601 a52a b600 d32a
    0x0000010: b600 ea2a b601 a72a b601 a9b6 0082 b0

        at com.nvidia.spark.rapids.CastExprMeta.convertToGpu(GpuCast.scala:155) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.UnaryExprMeta.convertToGpu(RapidsMeta.scala:1098) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.UnaryExprMeta.convertToGpu(RapidsMeta.scala:1090) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.UnaryExprMeta.convertToGpu(RapidsMeta.scala:1098) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.UnaryExprMeta.convertToGpu(RapidsMeta.scala:1090) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.UnaryExprMeta.convertToGpu(RapidsMeta.scala:1098) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.UnaryExprMeta.convertToGpu(RapidsMeta.scala:1090) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuProjectExecMeta.$anonfun$convertToGpu$1(basicPhysicalOperators.scala:49) ~[spark3xx-common/:na]
        at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.generic.Growable.loop$1(Growable.scala:57) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:61) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:184) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:47) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.TraversableLike.to(TraversableLike.scala:786) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.TraversableLike.to$(TraversableLike.scala:783) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.AbstractTraversable.to(Traversable.scala:108) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.TraversableOnce.toList(TraversableOnce.scala:350) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.TraversableOnce.toList$(TraversableOnce.scala:350) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.AbstractTraversable.toList(Traversable.scala:108) ~[scala-library-2.12.15.jar!/:na]
        at com.nvidia.spark.rapids.GpuProjectExecMeta.convertToGpu(basicPhysicalOperators.scala:49) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuProjectExecMeta.convertToGpu(basicPhysicalOperators.scala:41) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:723) ~[spark3xx-common/:na]
        at org.apache.spark.sql.rapids.execution.GpuShuffleMeta.convertToGpu(GpuShuffleExchangeExecBase.scala:112) ~[spark3xx-common/:na]
        at org.apache.spark.sql.rapids.execution.GpuShuffleMeta.convertToGpu(GpuShuffleExchangeExecBase.scala:42) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:723) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.SparkPlanMeta.$anonfun$convertToCpu$1(RapidsMeta.scala:602) ~[spark3xx-common/:na]
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.Iterator.foreach(Iterator.scala:943) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.Iterator.foreach$(Iterator.scala:943) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.IterableLike.foreach(IterableLike.scala:74) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.IterableLike.foreach$(IterableLike.scala:73) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.AbstractIterable.foreach(Iterable.scala:56) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.TraversableLike.map(TraversableLike.scala:286) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.TraversableLike.map$(TraversableLike.scala:279) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.AbstractTraversable.map(Traversable.scala:108) ~[scala-library-2.12.15.jar!/:na]
        at com.nvidia.spark.rapids.SparkPlanMeta.convertToCpu(RapidsMeta.scala:602) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:725) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuOverrides$.com$nvidia$spark$rapids$GpuOverrides$$doConvertPlan(GpuOverrides.scala:3937) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuOverrides.applyOverrides(GpuOverrides.scala:4182) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuOverrides.$anonfun$apply$4(GpuOverrides.scala:4134) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuOverrides$.logDuration(GpuOverrides.scala:463) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuOverrides.$anonfun$apply$2(GpuOverrides.scala:4132) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuOverrideUtil$.$anonfun$tryOverride$1(GpuOverrides.scala:4101) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuOverrides.apply(GpuOverrides.scala:4144) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuQueryStagePrepOverrides.$anonfun$apply$1(GpuOverrides.scala:4118) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuOverrideUtil$.$anonfun$tryOverride$1(GpuOverrides.scala:4101) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuQueryStagePrepOverrides.apply(GpuOverrides.scala:4121) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuQueryStagePrepOverrides.apply(GpuOverrides.scala:4114) ~[spark3xx-common/:na]
        at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.$anonfun$applyPhysicalRules$2(AdaptiveSparkPlanExec.scala:769) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.immutable.List.foldLeft(List.scala:91) ~[scala-library-2.12.15.jar!/:na]
        at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.applyPhysicalRules(AdaptiveSparkPlanExec.scala:768) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$initialPlan$1(AdaptiveSparkPlanExec.scala:180) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.<init>(AdaptiveSparkPlanExec.scala:179) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan.applyInternal(InsertAdaptiveSparkPlan.scala:63) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan.apply(InsertAdaptiveSparkPlan.scala:43) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan.apply(InsertAdaptiveSparkPlan.scala:40) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution$.$anonfun$prepareForExecution$1(QueryExecution.scala:449) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.immutable.List.foldLeft(List.scala:91) ~[scala-library-2.12.15.jar!/:na]
        at org.apache.spark.sql.execution.QueryExecution$.prepareForExecution(QueryExecution.scala:448) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$executedPlan$2(QueryExecution.scala:170) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) ~[spark-catalyst_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:196) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:196) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$executedPlan$1(QueryExecution.scala:170) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution.withCteMap(QueryExecution.scala:73) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:163) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:163) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:185) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:184) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3247) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.Dataset.rdd(Dataset.scala:3245) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.rapids.execution.InternalColumnarRddConverter$.extractRDDColumnarBatch(InternalColumnarRddConverter.scala:670) ~[spark3xx-common/:na]
        at org.apache.spark.sql.rapids.execution.InternalColumnarRddConverter$.convert(InternalColumnarRddConverter.scala:718) ~[spark3xx-common/:na]
        at org.apache.spark.sql.rapids.execution.InternalColumnarRddConverter.convert(InternalColumnarRddConverter.scala) ~[spark3xx-common/:na]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_212]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_212]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_212]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_212]
        at com.nvidia.spark.rapids.ColumnarRdd$.convert(ColumnarRdd.scala:52) ~[rapids-4-spark_2.12-22.04.0.jar!/:na]
        at com.nvidia.spark.rapids.ColumnarRdd$.apply(ColumnarRdd.scala:48) ~[rapids-4-spark_2.12-22.04.0.jar!/:na]
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuUtils$.toColumnarRdd(GpuUtils.scala:30) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar!/:na]
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$.buildRDDWatches(GpuPreXGBoost.scala:458) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar!/:na]
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$.$anonfun$buildDatasetToRDD$3(GpuPreXGBoost.scala:173) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar!/:na]
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:407) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar!/:na]
        at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:190) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar!/:na]
        at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:37) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar!/:na]
        at org.apache.spark.ml.Predictor.fit(Predictor.scala:151) ~[spark-mllib_2.12-3.2.1.jar!/:3.2.1]
        at com.alekscapital.mltests.service.MLService.xgBoostTest(MLService.java:171) ~[classes!/:0.0.1-SNAPSHOT]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_212]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_212]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_212]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_212]
        at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84) ~[spring-context-5.3.10.jar!/:5.3.10]
        at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) ~[spring-context-5.3.10.jar!/:5.3.10]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_212]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) ~[na:1.8.0_212]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) ~[na:1.8.0_212]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) ~[na:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_212]
        at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_212]

2022-06-15 19:30:23.856 ERROR 1 --- [   scheduling-1] com.nvidia.spark.rapids.GpuOverrideUtil  : Encountered an exception applying GPU overrides java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:
    com/nvidia/spark/rapids/GpuCast.doColumnar(Lcom/nvidia/spark/rapids/GpuColumnVector;)Lai/rapids/cudf/ColumnVector; @27: invokevirtual
  Reason:
    Type 'ai/rapids/cudf/ColumnVector' (current frame, stack[1]) is not assignable to 'ai/rapids/cudf/ColumnView'
  Current Frame:
    bci: @27
    flags: { }
    locals: { 'com/nvidia/spark/rapids/GpuCast', 'com/nvidia/spark/rapids/GpuColumnVector' }
    stack: { 'com/nvidia/spark/rapids/GpuCast$', 'ai/rapids/cudf/ColumnVector', 'org/apache/spark/sql/types/DataType', 'org/apache/spark/sql/types/DataType', integer, integer, integer }
  Bytecode:
    0x0000000: b200 392b b601 a42b b601 a52a b600 d32a
    0x0000010: b600 ea2a b601 a72a b601 a9b6 0082 b0


java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:
    com/nvidia/spark/rapids/GpuCast.doColumnar(Lcom/nvidia/spark/rapids/GpuColumnVector;)Lai/rapids/cudf/ColumnVector; @27: invokevirtual
  Reason:
    Type 'ai/rapids/cudf/ColumnVector' (current frame, stack[1]) is not assignable to 'ai/rapids/cudf/ColumnView'
  Current Frame:
    bci: @27
    flags: { }
    locals: { 'com/nvidia/spark/rapids/GpuCast', 'com/nvidia/spark/rapids/GpuColumnVector' }
    stack: { 'com/nvidia/spark/rapids/GpuCast$', 'ai/rapids/cudf/ColumnVector', 'org/apache/spark/sql/types/DataType', 'org/apache/spark/sql/types/DataType', integer, integer, integer }
  Bytecode:
    0x0000000: b200 392b b601 a42b b601 a52a b600 d32a
    0x0000010: b600 ea2a b601 a72a b601 a9b6 0082 b0

        at com.nvidia.spark.rapids.CastExprMeta.convertToGpu(GpuCast.scala:155) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.UnaryExprMeta.convertToGpu(RapidsMeta.scala:1098) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.UnaryExprMeta.convertToGpu(RapidsMeta.scala:1090) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.UnaryExprMeta.convertToGpu(RapidsMeta.scala:1098) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.UnaryExprMeta.convertToGpu(RapidsMeta.scala:1090) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.UnaryExprMeta.convertToGpu(RapidsMeta.scala:1098) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.UnaryExprMeta.convertToGpu(RapidsMeta.scala:1090) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuProjectExecMeta.$anonfun$convertToGpu$1(basicPhysicalOperators.scala:49) ~[spark3xx-common/:na]
        at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.generic.Growable.loop$1(Growable.scala:57) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:61) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:184) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:47) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.TraversableLike.to(TraversableLike.scala:786) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.TraversableLike.to$(TraversableLike.scala:783) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.AbstractTraversable.to(Traversable.scala:108) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.TraversableOnce.toList(TraversableOnce.scala:350) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.TraversableOnce.toList$(TraversableOnce.scala:350) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.AbstractTraversable.toList(Traversable.scala:108) ~[scala-library-2.12.15.jar!/:na]
        at com.nvidia.spark.rapids.GpuProjectExecMeta.convertToGpu(basicPhysicalOperators.scala:49) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuProjectExecMeta.convertToGpu(basicPhysicalOperators.scala:41) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:723) ~[spark3xx-common/:na]
        at org.apache.spark.sql.rapids.execution.GpuShuffleMeta.convertToGpu(GpuShuffleExchangeExecBase.scala:112) ~[spark3xx-common/:na]
        at org.apache.spark.sql.rapids.execution.GpuShuffleMeta.convertToGpu(GpuShuffleExchangeExecBase.scala:42) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:723) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.SparkPlanMeta.$anonfun$convertToCpu$1(RapidsMeta.scala:602) ~[spark3xx-common/:na]
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.Iterator.foreach(Iterator.scala:943) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.Iterator.foreach$(Iterator.scala:943) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.IterableLike.foreach(IterableLike.scala:74) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.IterableLike.foreach$(IterableLike.scala:73) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.AbstractIterable.foreach(Iterable.scala:56) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.TraversableLike.map(TraversableLike.scala:286) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.TraversableLike.map$(TraversableLike.scala:279) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.AbstractTraversable.map(Traversable.scala:108) ~[scala-library-2.12.15.jar!/:na]
        at com.nvidia.spark.rapids.SparkPlanMeta.convertToCpu(RapidsMeta.scala:602) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:725) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuOverrides$.com$nvidia$spark$rapids$GpuOverrides$$doConvertPlan(GpuOverrides.scala:3937) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuOverrides.applyOverrides(GpuOverrides.scala:4182) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuOverrides.$anonfun$apply$4(GpuOverrides.scala:4134) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuOverrides$.logDuration(GpuOverrides.scala:463) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuOverrides.$anonfun$apply$2(GpuOverrides.scala:4132) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuOverrideUtil$.$anonfun$tryOverride$1(GpuOverrides.scala:4101) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuOverrides.apply(GpuOverrides.scala:4144) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuQueryStagePrepOverrides.$anonfun$apply$1(GpuOverrides.scala:4118) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuOverrideUtil$.$anonfun$tryOverride$1(GpuOverrides.scala:4101) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuQueryStagePrepOverrides.apply(GpuOverrides.scala:4121) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuQueryStagePrepOverrides.apply(GpuOverrides.scala:4114) ~[spark3xx-common/:na]
        at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.$anonfun$applyPhysicalRules$2(AdaptiveSparkPlanExec.scala:769) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.immutable.List.foldLeft(List.scala:91) ~[scala-library-2.12.15.jar!/:na]
        at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.applyPhysicalRules(AdaptiveSparkPlanExec.scala:768) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$initialPlan$1(AdaptiveSparkPlanExec.scala:180) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.<init>(AdaptiveSparkPlanExec.scala:179) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan.applyInternal(InsertAdaptiveSparkPlan.scala:63) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan.apply(InsertAdaptiveSparkPlan.scala:43) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan.apply(InsertAdaptiveSparkPlan.scala:40) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution$.$anonfun$prepareForExecution$1(QueryExecution.scala:449) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.immutable.List.foldLeft(List.scala:91) ~[scala-library-2.12.15.jar!/:na]
        at org.apache.spark.sql.execution.QueryExecution$.prepareForExecution(QueryExecution.scala:448) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$executedPlan$2(QueryExecution.scala:170) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) ~[spark-catalyst_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:196) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:196) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$executedPlan$1(QueryExecution.scala:170) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution.withCteMap(QueryExecution.scala:73) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:163) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:163) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:185) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:184) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3247) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.Dataset.rdd(Dataset.scala:3245) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.rapids.execution.InternalColumnarRddConverter$.extractRDDColumnarBatch(InternalColumnarRddConverter.scala:670) ~[spark3xx-common/:na]
        at org.apache.spark.sql.rapids.execution.InternalColumnarRddConverter$.convert(InternalColumnarRddConverter.scala:718) ~[spark3xx-common/:na]
        at org.apache.spark.sql.rapids.execution.InternalColumnarRddConverter.convert(InternalColumnarRddConverter.scala) ~[spark3xx-common/:na]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_212]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_212]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_212]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_212]
        at com.nvidia.spark.rapids.ColumnarRdd$.convert(ColumnarRdd.scala:52) ~[rapids-4-spark_2.12-22.04.0.jar!/:na]
        at com.nvidia.spark.rapids.ColumnarRdd$.apply(ColumnarRdd.scala:48) ~[rapids-4-spark_2.12-22.04.0.jar!/:na]
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuUtils$.toColumnarRdd(GpuUtils.scala:30) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar!/:na]
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$.buildRDDWatches(GpuPreXGBoost.scala:458) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar!/:na]
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$.$anonfun$buildDatasetToRDD$3(GpuPreXGBoost.scala:173) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar!/:na]
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:407) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar!/:na]
        at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:190) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar!/:na]
        at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:37) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar!/:na]
        at org.apache.spark.ml.Predictor.fit(Predictor.scala:151) ~[spark-mllib_2.12-3.2.1.jar!/:3.2.1]
        at com.alekscapital.mltests.service.MLService.xgBoostTest(MLService.java:171) ~[classes!/:0.0.1-SNAPSHOT]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_212]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_212]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_212]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_212]
        at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84) ~[spring-context-5.3.10.jar!/:5.3.10]
        at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) ~[spring-context-5.3.10.jar!/:5.3.10]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_212]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) ~[na:1.8.0_212]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) ~[na:1.8.0_212]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) ~[na:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_212]
        at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_212]

2022-06-15 19:30:23.858 ERROR 1 --- [   scheduling-1] o.s.s.s.TaskUtils$LoggingErrorHandler    : Unexpected error occurred in scheduled task

java.lang.reflect.InvocationTargetException: null
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_212]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_212]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_212]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_212]
        at com.nvidia.spark.rapids.ColumnarRdd$.convert(ColumnarRdd.scala:52) ~[rapids-4-spark_2.12-22.04.0.jar!/:na]
        at com.nvidia.spark.rapids.ColumnarRdd$.apply(ColumnarRdd.scala:48) ~[rapids-4-spark_2.12-22.04.0.jar!/:na]
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuUtils$.toColumnarRdd(GpuUtils.scala:30) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar!/:na]
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$.buildRDDWatches(GpuPreXGBoost.scala:458) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar!/:na]
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$.$anonfun$buildDatasetToRDD$3(GpuPreXGBoost.scala:173) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar!/:na]
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:407) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar!/:na]
        at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:190) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar!/:na]
        at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:37) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar!/:na]
        at org.apache.spark.ml.Predictor.fit(Predictor.scala:151) ~[spark-mllib_2.12-3.2.1.jar!/:3.2.1]
        at com.alekscapital.mltests.service.MLService.xgBoostTest(MLService.java:171) ~[classes!/:0.0.1-SNAPSHOT]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_212]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_212]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_212]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_212]
        at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84) ~[spring-context-5.3.10.jar!/:5.3.10]
        at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) ~[spring-context-5.3.10.jar!/:5.3.10]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_212]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_212]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_212]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_212]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_212]
Caused by: java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:
    com/nvidia/spark/rapids/GpuCast.doColumnar(Lcom/nvidia/spark/rapids/GpuColumnVector;)Lai/rapids/cudf/ColumnVector; @27: invokevirtual
  Reason:
    Type 'ai/rapids/cudf/ColumnVector' (current frame, stack[1]) is not assignable to 'ai/rapids/cudf/ColumnView'
  Current Frame:
    bci: @27
    flags: { }
    locals: { 'com/nvidia/spark/rapids/GpuCast', 'com/nvidia/spark/rapids/GpuColumnVector' }
    stack: { 'com/nvidia/spark/rapids/GpuCast$', 'ai/rapids/cudf/ColumnVector', 'org/apache/spark/sql/types/DataType', 'org/apache/spark/sql/types/DataType', integer, integer, integer }
  Bytecode:
    0x0000000: b200 392b b601 a42b b601 a52a b600 d32a
    0x0000010: b600 ea2a b601 a72a b601 a9b6 0082 b0

        at com.nvidia.spark.rapids.CastExprMeta.convertToGpu(GpuCast.scala:155) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.UnaryExprMeta.convertToGpu(RapidsMeta.scala:1098) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.UnaryExprMeta.convertToGpu(RapidsMeta.scala:1090) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.UnaryExprMeta.convertToGpu(RapidsMeta.scala:1098) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.UnaryExprMeta.convertToGpu(RapidsMeta.scala:1090) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.UnaryExprMeta.convertToGpu(RapidsMeta.scala:1098) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.UnaryExprMeta.convertToGpu(RapidsMeta.scala:1090) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuProjectExecMeta.$anonfun$convertToGpu$1(basicPhysicalOperators.scala:49) ~[spark3xx-common/:na]
        at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.generic.Growable.loop$1(Growable.scala:57) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:61) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:184) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:47) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.TraversableLike.to(TraversableLike.scala:786) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.TraversableLike.to$(TraversableLike.scala:783) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.AbstractTraversable.to(Traversable.scala:108) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.TraversableOnce.toList(TraversableOnce.scala:350) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.TraversableOnce.toList$(TraversableOnce.scala:350) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.AbstractTraversable.toList(Traversable.scala:108) ~[scala-library-2.12.15.jar!/:na]
        at com.nvidia.spark.rapids.GpuProjectExecMeta.convertToGpu(basicPhysicalOperators.scala:49) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuProjectExecMeta.convertToGpu(basicPhysicalOperators.scala:41) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:723) ~[spark3xx-common/:na]
        at org.apache.spark.sql.rapids.execution.GpuShuffleMeta.convertToGpu(GpuShuffleExchangeExecBase.scala:112) ~[spark3xx-common/:na]
        at org.apache.spark.sql.rapids.execution.GpuShuffleMeta.convertToGpu(GpuShuffleExchangeExecBase.scala:42) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:723) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.SparkPlanMeta.$anonfun$convertToCpu$1(RapidsMeta.scala:602) ~[spark3xx-common/:na]
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.Iterator.foreach(Iterator.scala:943) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.Iterator.foreach$(Iterator.scala:943) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.IterableLike.foreach(IterableLike.scala:74) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.IterableLike.foreach$(IterableLike.scala:73) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.AbstractIterable.foreach(Iterable.scala:56) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.TraversableLike.map(TraversableLike.scala:286) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.TraversableLike.map$(TraversableLike.scala:279) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.AbstractTraversable.map(Traversable.scala:108) ~[scala-library-2.12.15.jar!/:na]
        at com.nvidia.spark.rapids.SparkPlanMeta.convertToCpu(RapidsMeta.scala:602) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:725) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuOverrides$.com$nvidia$spark$rapids$GpuOverrides$$doConvertPlan(GpuOverrides.scala:3937) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuOverrides.applyOverrides(GpuOverrides.scala:4182) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuOverrides.$anonfun$apply$4(GpuOverrides.scala:4134) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuOverrides$.logDuration(GpuOverrides.scala:463) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuOverrides.$anonfun$apply$2(GpuOverrides.scala:4132) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuOverrideUtil$.$anonfun$tryOverride$1(GpuOverrides.scala:4101) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuOverrides.apply(GpuOverrides.scala:4144) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuQueryStagePrepOverrides.$anonfun$apply$1(GpuOverrides.scala:4118) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuOverrideUtil$.$anonfun$tryOverride$1(GpuOverrides.scala:4101) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuQueryStagePrepOverrides.apply(GpuOverrides.scala:4121) ~[spark3xx-common/:na]
        at com.nvidia.spark.rapids.GpuQueryStagePrepOverrides.apply(GpuOverrides.scala:4114) ~[spark3xx-common/:na]
        at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.$anonfun$applyPhysicalRules$2(AdaptiveSparkPlanExec.scala:769) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.immutable.List.foldLeft(List.scala:91) ~[scala-library-2.12.15.jar!/:na]
        at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.applyPhysicalRules(AdaptiveSparkPlanExec.scala:768) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$initialPlan$1(AdaptiveSparkPlanExec.scala:180) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.<init>(AdaptiveSparkPlanExec.scala:179) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan.applyInternal(InsertAdaptiveSparkPlan.scala:63) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan.apply(InsertAdaptiveSparkPlan.scala:43) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan.apply(InsertAdaptiveSparkPlan.scala:40) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution$.$anonfun$prepareForExecution$1(QueryExecution.scala:449) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) ~[scala-library-2.12.15.jar!/:na]
        at scala.collection.immutable.List.foldLeft(List.scala:91) ~[scala-library-2.12.15.jar!/:na]
        at org.apache.spark.sql.execution.QueryExecution$.prepareForExecution(QueryExecution.scala:448) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$executedPlan$2(QueryExecution.scala:170) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) ~[spark-catalyst_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:196) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:196) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$executedPlan$1(QueryExecution.scala:170) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution.withCteMap(QueryExecution.scala:73) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:163) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:163) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:185) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:184) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3247) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.Dataset.rdd(Dataset.scala:3245) ~[spark-sql_2.12-3.2.1.jar!/:3.2.1]
        at org.apache.spark.sql.rapids.execution.InternalColumnarRddConverter$.extractRDDColumnarBatch(InternalColumnarRddConverter.scala:670) ~[spark3xx-common/:na]
        at org.apache.spark.sql.rapids.execution.InternalColumnarRddConverter$.convert(InternalColumnarRddConverter.scala:718) ~[spark3xx-common/:na]
        at org.apache.spark.sql.rapids.execution.InternalColumnarRddConverter.convert(InternalColumnarRddConverter.scala) ~[spark3xx-common/:na]
        ... 27 common frames omitted

Perhaps the following information is relevant:

  1. I'm running the application in a Docker container and launching the job via Spring's @Scheduled. The container is based on Ubuntu 20.04 and Oracle JDK 8.
  2. The Spark executor runs in a Docker container based on nvcr.io/nvidia/cuda:11.6.2-devel-ubuntu20.04 with JDK 8.
  3. The host OS is Windows 10 Pro with WSL2.
  4. Spark executor output:
sh-5.0$ nvidia-smi
Wed Jun 15 20:00:20 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.68.02    Driver Version: 512.95       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:0B:00.0  On |                  N/A |
| 29%   34C    P8    10W / 190W |   5667MiB /  6144MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     20788      C   /java                           N/A      |
+-----------------------------------------------------------------------------+
sh-5.0$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0

@wbo4958
Contributor

wbo4958 commented Jun 15, 2022

@Dartya, could you share the Spark configuration and which RAPIDS Accelerator version you were using? To be honest, I have not tested it with Java code, but we can make it work with Scala code without any issue. More details can be found at https://github.com/NVIDIA/spark-rapids-examples/tree/branch-22.06/examples/XGBoost-Examples

@wbo4958
Contributor

wbo4958 commented Jun 16, 2022

Hi @Dartya, I just tried the Java way, and it still works.

First, download the taxi dataset from the taxi dataset link and decompress it somewhere.

Then download the latest cuDF and RAPIDS jars from here and here.

I chose the Parquet format and changed some code to read Parquet; here is the code:

import org.apache.spark.ml.PredictionModel;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel;
import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor;

public static void testRegression() {
    String trainPath = "YOUR_PATH/taxi-small/taxi/parquet/train";
    //test
    String evalPath = "YOUR_PATH/taxi-small/taxi/parquet/eval";

    SparkSession session = SparkSession.builder()
        .master("local[*]")
        .getOrCreate();

    Dataset<Row> tdf = session.read().parquet(trainPath);
    tdf.show();
    Dataset<Row> edf = session.read().parquet(evalPath);
    edf.show();

    String labelName = "fare_amount";
    String[] featureColumns = {"passenger_count", "trip_distance", "pickup_longitude", "pickup_latitude", "rate_code",
        "dropoff_longitude", "dropoff_latitude", "hour", "day_of_week", "is_weekend"};

    scala.collection.immutable.Map map = new scala.collection.immutable.HashMap<String, Object>();
    map = map.updated("learning_rate", 0.05);
    map = map.updated("max_depth", 8);
    map = map.updated("subsample", 0.8);
    map = map.updated("gamma", 1);
    map = map.updated("num_round", 500);
    map = map.updated("tree_method", "gpu_hist");
    map = map.updated("num_workers", 1);

    XGBoostRegressor regressor = new XGBoostRegressor(map);
    regressor.setLabelCol(labelName);
    regressor.setFeaturesCol(featureColumns);

    PredictionModel<Vector, XGBoostRegressionModel> model = regressor.fit(tdf);
    Dataset<Row> result = model.transform(edf);
    result.show();
}

Then submit the XGBoost application with the command below:

#! /bin/bash

spark-submit \
   --master local[12] \
   --conf spark.rapids.memory.gpu.pooling.enabled=false \
   --conf spark.rapids.memory.gpu.minAllocFraction=0.0001 \
   --conf spark.rapids.memory.gpu.reserve=20 \
   --conf spark.rapids.sql.enabled=true \
   --conf spark.sql.adaptive.enabled=false \
   --conf spark.rapids.sql.explain=ALL \
   --conf spark.rapids.sql.hasNans=false \
   --conf spark.executor.cores=12 \
   --conf spark.plugins=com.nvidia.spark.SQLPlugin \
   --conf spark.task.cpus=12 \
   --executor-memory 30G \
   --driver-memory 5G \
   --jars xgboost4j-spark-gpu_2.12-1.6.1.jar,xgboost4j-gpu_2.12-1.6.1.jar,cudf-22.04.0-cuda11.jar,rapids-4-spark_2.12-22.04.0.jar \
   --class RegressionMain \
     me.bobby.xgboost-1.0-SNAPSHOT.jar

@Dartya
Author

Dartya commented Jun 16, 2022

Hi @wbo4958 !
Unfortunately, I could not get to this today due to a heavy workload; I'll get to it tomorrow or this weekend. Thank you in advance for the example.
The CSV data at the link you provided is the same as in the NVIDIA example at https://github.com/rapidsai/spark-examples/blob/master/datasets/taxi-small.tar.gz.
I tried the Parquet files and configurations you provided and got a different exception. I won't say which one yet; I want to figure it out myself. The reason may be that my remote PC rebooted and the Spark configuration broke: my master runs in a virtual machine, and the worker in a Docker container.

My previous Spark config is below:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import java.net.InetAddress;
import java.net.UnknownHostException;

@Configuration
public class SparkConfiguration {
    @Value("${spring.application.name}")
    private String appName;
    @Value("${spark.masterHost}")
    private String masterHost;

    @Bean
    public JavaSparkContext javaSparkContext() throws UnknownHostException {
        String host = InetAddress.getLocalHost().getHostAddress();
        SparkConf sparkConf = new SparkConf(true)
                .setAppName(appName)
                .setMaster("spark://" + masterHost)
                .setJars(new String[]{
                        "MLTests/target/service.jar",
                        "MLTests/connector/config-1.4.1.jar",
                        "MLTests/connector/cudf-22.04.0-cuda11.jar",
                        "MLTests/connector/rapids-4-spark_2.12-22.04.0.jar",
                        "MLTests/connector/spark-nlp_2.12-3.4.1.jar"})
                // Spark settings
                .set("spark.worker.cleanup.enabled", "true")
                // executors
                .set("spark.executor.cores", "4")
                .set("spark.executor.memory", "4g")
                // driver
                .set("spark.ui.enabled", "true")
                .set("spark.ui.port", "4040")
                .set("spark.driver.host", host)
                .set("spark.driver.port", "10000")
                .set("spark.sql.files.maxPartitionBytes", "512m")
                .set("spark.plugins", "com.nvidia.spark.SQLPlugin")
                .set("spark.driver.extraClassPath", "/opt/sparkRapidsPlugin/cudf-22.04.0-cuda11.jar:/opt/sparkRapidsPlugin/rapids-4-spark_2.12-22.04.0.jar");
        return new JavaSparkContext(sparkConf);
    }

    @Bean
    public SparkSession sparkSession(JavaSparkContext context) {
        return SparkSession.builder()
                .master("spark://" + masterHost)
                .appName(appName)
                .config(context.getConf())
                .getOrCreate();
    }
}

I saw that our configs are different and added the jar files:
"MLTests/connector/xgboost4j-gpu_2.12-1.6.1.jar",
"MLTests/connector/xgboost4j-spark-gpu_2.12-1.6.1.jar"

@wbo4958
Contributor

wbo4958 commented Jun 16, 2022

@Dartya yeah, it looks like you're using standalone mode, and XGBoost then requires GPU scheduling configuration on both the driver and the worker.

For the worker:

Please get getGpusResources.sh

and add the configuration below to ${SPARK_HOME}/conf/spark-defaults.conf:

spark.worker.resource.gpu.amount 1
spark.worker.resource.gpu.discoveryScript YOUR_PATH/getGpusResources.sh
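For reference, a discovery script has to print a JSON ResourceInformation object listing the GPU addresses. The sketch below is modeled on Spark's stock getGpusResources.sh; the real script queries `nvidia-smi --query-gpu=index --format=csv,noheader`, but here a fixed two-GPU index list stands in for that call so the formatting logic is visible (this is an illustration, not the exact script):

```shell
#!/usr/bin/env bash
# Join the GPU index lines with '","' and wrap them in the JSON shape that
# Spark expects from a resource discovery script.
# A fixed `printf '0\n1\n'` stands in for the real nvidia-smi query here.
ADDRS=$(printf '0\n1\n' | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/","/g')
echo "{\"name\": \"gpu\", \"addresses\": [\"$ADDRS\"]}"
```

With two GPUs this prints `{"name": "gpu", "addresses": ["0","1"]}`.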

For the driver:

Please add the configuration below:

  --conf spark.executor.resource.gpu.amount=1
  --conf spark.task.resource.gpu.amount=1

So I would suggest you run in local mode first, and then try standalone.

@Dartya
Author

Dartya commented Jun 17, 2022

Hi @wbo4958 !
In the current situation it is difficult for me to run a Spark cluster locally, because I am working over a remote connection to my two PCs with GPUs; I'm in another country now. Both PCs run Windows 10 Pro. I am developing on one of the machines, running a Debian 11 virtual machine under VMware Player (the Spark master) and running the Spark worker in a Docker container on the same machine.

Since the target setup is to run Spark in Docker/Kubernetes, I used a test Spark master from my local virtual-machine cluster. Also, because the ML logic should be part of a Spring web service, and I have experience running services in this form, I decided to test in this configuration right away.

I'm running my Spring app with XGBoost either under the IntelliJ IDEA debugger or in a Docker container. The next step in my tests is to run the training job distributed across two machines in Docker containers, which matches the target scheme.

Of course, following the NVIDIA example and the RAPIDS documentation, I used the getGpusResources.sh GPU resource discovery script, placed it in my Docker image, and applied the following setting in spark-defaults.conf:

spark.worker.resource.gpu.amount 1
spark.worker.resource.gpu.discoveryScript /opt/sparkRapidsPlugin/getGpusResources.sh

and in spark-env.sh:
SPARK_WORKER_OPTS="-Dspark.worker.resource.gpu.amount=1 -Dspark.worker.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh -Dspark.rapids.memory.pinnedPool.size=4G -Dspark.executor.resource.gpu.amount=1"

I debugged with the following example:

// assumes a static import: import static org.apache.spark.sql.functions.col;
public void testRapids() {
	int capacity = 1000000;
	List<LongValue> list = new ArrayList<>(capacity);
	for (long i = 1; i < (capacity + 1); i++) {
		list.add(new LongValue(i));
	}

	Dataset<Row> df = session.createDataFrame(list, LongValue.class);
	Dataset<Row> df2 = session.createDataFrame(list, LongValue.class);

	long result = df.select(col("value").as("a"))
			.join(df2.select(col("value").as("b")), col("a").equalTo(col("b"))).count();

	log.info("count result {}", result);
}

I am very grateful for the Java code example, for pointing out the use of Parquet files, and for the configuration. I report the following:

  1. The datasets from my source and yours have the same structure, and the data is approximately the same. But for the purity of the experiment, I use the datasets and Parquet files you specified.

  2. Judging by the output:

*Exec <CollectLimitExec> will run on GPU
  *Partitioning <SinglePartition$> will run on GPU
  !Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced
    @Expression <Alias> cast(vendor_id#0 as string) AS vendor_id#51 could run on GPU
      @Expression <Cast> cast(vendor_id#0 as string) could run on GPU
        @Expression <AttributeReference> vendor_id#0 could run on GPU
    @Expression <Alias> cast(passenger_count#1 as string) AS passenger_count#52 could run on GPU
      @Expression <Cast> cast(passenger_count#1 as string) could run on GPU
        @Expression <AttributeReference> passenger_count#1 could run on GPU
    @Expression <Alias> cast(trip_distance#2 as string) AS trip_distance#53 could run on GPU
      !Expression <Cast> cast(trip_distance#2 as string) cannot run on GPU because the GPU will use different precision than Java's toString method when converting floating point data types to strings and this can produce results that differ from the default behavior in Spark.  To enable this operation on the GPU, set spark.rapids.sql.castFloatToString.enabled to true.
        @Expression <AttributeReference> trip_distance#2 could run on GPU
    @Expression <Alias> cast(pickup_longitude#3 as string) AS pickup_longitude#54 could run on GPU
      !Expression <Cast> cast(pickup_longitude#3 as string) cannot run on GPU because the GPU will use different precision than Java's toString method when converting floating point data types to strings and this can produce results that differ from the default behavior in Spark.  To enable this operation on the GPU, set spark.rapids.sql.castFloatToString.enabled to true.
        @Expression <AttributeReference> pickup_longitude#3 could run on GPU
    @Expression <Alias> cast(pickup_latitude#4 as string) AS pickup_latitude#55 could run on GPU
      !Expression <Cast> cast(pickup_latitude#4 as string) cannot run on GPU because the GPU will use different precision than Java's toString method when converting floating point data types to strings and this can produce results that differ from the default behavior in Spark.  To enable this operation on the GPU, set spark.rapids.sql.castFloatToString.enabled to true.
        @Expression <AttributeReference> pickup_latitude#4 could run on GPU
    @Expression <Alias> cast(rate_code#5 as string) AS rate_code#56 could run on GPU
      @Expression <Cast> cast(rate_code#5 as string) could run on GPU
        @Expression <AttributeReference> rate_code#5 could run on GPU
    @Expression <Alias> cast(store_and_fwd#6 as string) AS store_and_fwd#57 could run on GPU
      @Expression <Cast> cast(store_and_fwd#6 as string) could run on GPU
        @Expression <AttributeReference> store_and_fwd#6 could run on GPU
    @Expression <Alias> cast(dropoff_longitude#7 as string) AS dropoff_longitude#58 could run on GPU
      !Expression <Cast> cast(dropoff_longitude#7 as string) cannot run on GPU because the GPU will use different precision than Java's toString method when converting floating point data types to strings and this can produce results that differ from the default behavior in Spark.  To enable this operation on the GPU, set spark.rapids.sql.castFloatToString.enabled to true.
        @Expression <AttributeReference> dropoff_longitude#7 could run on GPU
    @Expression <Alias> cast(dropoff_latitude#8 as string) AS dropoff_latitude#59 could run on GPU
      !Expression <Cast> cast(dropoff_latitude#8 as string) cannot run on GPU because the GPU will use different precision than Java's toString method when converting floating point data types to strings and this can produce results that differ from the default behavior in Spark.  To enable this operation on the GPU, set spark.rapids.sql.castFloatToString.enabled to true.
        @Expression <AttributeReference> dropoff_latitude#8 could run on GPU
    @Expression <Alias> cast(fare_amount#9 as string) AS fare_amount#60 could run on GPU
      !Expression <Cast> cast(fare_amount#9 as string) cannot run on GPU because the GPU will use different precision than Java's toString method when converting floating point data types to strings and this can produce results that differ from the default behavior in Spark.  To enable this operation on the GPU, set spark.rapids.sql.castFloatToString.enabled to true.
        @Expression <AttributeReference> fare_amount#9 could run on GPU
    @Expression <Alias> cast(hour#10 as string) AS hour#61 could run on GPU
      @Expression <Cast> cast(hour#10 as string) could run on GPU
        @Expression <AttributeReference> hour#10 could run on GPU
    @Expression <Alias> cast(year#11 as string) AS year#62 could run on GPU
      @Expression <Cast> cast(year#11 as string) could run on GPU
        @Expression <AttributeReference> year#11 could run on GPU
    @Expression <Alias> cast(month#12 as string) AS month#63 could run on GPU
      @Expression <Cast> cast(month#12 as string) could run on GPU
        @Expression <AttributeReference> month#12 could run on GPU
    @Expression <Alias> cast(day#13 as string) AS day#64 could run on GPU
      @Expression <Cast> cast(day#13 as string) could run on GPU
        @Expression <AttributeReference> day#13 could run on GPU
    @Expression <Alias> cast(day_of_week#14 as string) AS day_of_week#65 could run on GPU
      !Expression <Cast> cast(day_of_week#14 as string) cannot run on GPU because the GPU will use different precision than Java's toString method when converting floating point data types to strings and this can produce results that differ from the default behavior in Spark.  To enable this operation on the GPU, set spark.rapids.sql.castFloatToString.enabled to true.
        @Expression <AttributeReference> day_of_week#14 could run on GPU
    @Expression <Alias> cast(is_weekend#15 as string) AS is_weekend#66 could run on GPU
      !Expression <Cast> cast(is_weekend#15 as string) cannot run on GPU because the GPU will use different precision than Java's toString method when converting floating point data types to strings and this can produce results that differ from the default behavior in Spark.  To enable this operation on the GPU, set spark.rapids.sql.castFloatToString.enabled to true.
        @Expression <AttributeReference> is_weekend#15 could run on GPU
    @Expression <Alias> cast(h_distance#16 as string) AS h_distance#67 could run on GPU
      !Expression <Cast> cast(h_distance#16 as string) cannot run on GPU because the GPU will use different precision than Java's toString method when converting floating point data types to strings and this can produce results that differ from the default behavior in Spark.  To enable this operation on the GPU, set spark.rapids.sql.castFloatToString.enabled to true.
        @Expression <AttributeReference> h_distance#16 could run on GPU
    *Exec <FileSourceScanExec> will run on GPU

when reading both Parquet and CSV files, you need to set the following configurations to correctly convert floats to strings:

.set("spark.rapids.sql.csv.read.float.enabled", "true")
.set("spark.rapids.sql.castFloatToString.enabled", "true")
.set("spark.rapids.sql.csv.read.double.enabled", "true")
.set("spark.rapids.sql.castDoubleToString.enabled", "true")
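For context on the castFloatToString warning: Spark's CPU cast follows Java's `Float.toString`/`Double.toString`, which print the shortest decimal string that round-trips to the same bits, and the GPU cast does not reproduce this exactly. A small self-contained illustration of the Java-side behavior (class name is mine, for illustration only):

```java
// Demonstrates the Java toString semantics that Spark's CPU float-to-string
// cast follows (shortest decimal representation that round-trips exactly).
// The GPU cast uses different precision, hence the opt-in configs above.
public class FloatToStringDemo {
    public static void main(String[] args) {
        System.out.println(Float.toString(0.1f));        // prints 0.1
        System.out.println(Float.toString(1.0f / 3.0f)); // prints 0.33333334
        System.out.println(Double.toString(1.0 / 3.0));  // prints 0.3333333333333333
    }
}
```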
  3. In either scenario (with the configurations above or without them), when reading CSV or Parquet files I get a deserialization error:
 <DeserializeToObjectExec> cannot run on GPU because not all expressions can be replaced; GPU does not currently support the operator class org.apache.spark.sql.execution.DeserializeToObjectExec
  ! <CreateExternalRow> createexternalrow(vendor_id#0, passenger_count#226, trip_distance#245, pickup_longitude#264, pickup_latitude#283, rate_code#302, store_and_fwd#6, dropoff_longitude#321, dropoff_latitude#340, fare_amount#435, hour#359, year#11, month#12, day#13, day_of_week#378, is_weekend#397, h_distance#416, StructField(vendor_id,IntegerType,true), StructField(passenger_count,FloatType,true), StructField(trip_distance,FloatType,true), StructField(pickup_longitude,FloatType,true), StructField(pickup_latitude,FloatType,true), StructField(rate_code,FloatType,true), StructField(store_and_fwd,DoubleType,true), ... 10 more fields) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.CreateExternalRow
    @Expression <AttributeReference> vendor_id#0 could run on GPU
    @Expression <AttributeReference> passenger_count#226 could run on GPU
    @Expression <AttributeReference> trip_distance#245 could run on GPU
    @Expression <AttributeReference> pickup_longitude#264 could run on GPU
    @Expression <AttributeReference> pickup_latitude#283 could run on GPU
    @Expression <AttributeReference> rate_code#302 could run on GPU
    @Expression <AttributeReference> store_and_fwd#6 could run on GPU
    @Expression <AttributeReference> dropoff_longitude#321 could run on GPU
    @Expression <AttributeReference> dropoff_latitude#340 could run on GPU
    @Expression <AttributeReference> fare_amount#435 could run on GPU
    @Expression <AttributeReference> hour#359 could run on GPU
    @Expression <AttributeReference> year#11 could run on GPU
    @Expression <AttributeReference> month#12 could run on GPU
    @Expression <AttributeReference> day#13 could run on GPU
    @Expression <AttributeReference> day_of_week#378 could run on GPU
    @Expression <AttributeReference> is_weekend#397 could run on GPU
    @Expression <AttributeReference> h_distance#416 could run on GPU
  !Expression <AttributeReference> obj#453 cannot run on GPU because expression AttributeReference obj#453 produces an unsupported type ObjectType(interface org.apache.spark.sql.Row)
  *Exec <ShuffleExchangeExec> will run on GPU
    *Partitioning <SinglePartition$> will run on GPU
    *Exec <ProjectExec> will run on GPU
      *Expression <Alias> cast(passenger_count#1 as float) AS passenger_count#226 will run on GPU
        *Expression <Cast> cast(passenger_count#1 as float) will run on GPU
      *Expression <Alias> cast(trip_distance#2 as float) AS trip_distance#245 will run on GPU
        *Expression <Cast> cast(trip_distance#2 as float) will run on GPU
      *Expression <Alias> cast(pickup_longitude#3 as float) AS pickup_longitude#264 will run on GPU
        *Expression <Cast> cast(pickup_longitude#3 as float) will run on GPU
      *Expression <Alias> cast(pickup_latitude#4 as float) AS pickup_latitude#283 will run on GPU
        *Expression <Cast> cast(pickup_latitude#4 as float) will run on GPU
      *Expression <Alias> cast(rate_code#5 as float) AS rate_code#302 will run on GPU
        *Expression <Cast> cast(rate_code#5 as float) will run on GPU
      *Expression <Alias> cast(dropoff_longitude#7 as float) AS dropoff_longitude#321 will run on GPU
        *Expression <Cast> cast(dropoff_longitude#7 as float) will run on GPU
      *Expression <Alias> cast(dropoff_latitude#8 as float) AS dropoff_latitude#340 will run on GPU
        *Expression <Cast> cast(dropoff_latitude#8 as float) will run on GPU
      *Expression <Alias> cast(fare_amount#9 as float) AS fare_amount#435 will run on GPU
        *Expression <Cast> cast(fare_amount#9 as float) will run on GPU
      *Expression <Alias> cast(hour#10 as float) AS hour#359 will run on GPU
        *Expression <Cast> cast(hour#10 as float) will run on GPU
      *Expression <Alias> cast(day_of_week#14 as float) AS day_of_week#378 will run on GPU
        *Expression <Cast> cast(day_of_week#14 as float) will run on GPU
      *Expression <Alias> cast(is_weekend#15 as float) AS is_weekend#397 will run on GPU
        *Expression <Cast> cast(is_weekend#15 as float) will run on GPU
      *Expression <Alias> cast(h_distance#16 as float) AS h_distance#416 will run on GPU
        *Expression <Cast> cast(h_distance#16 as float) will run on GPU
      *Exec <FileSourceScanExec> will run on GPU

and then the following error output:

2022-06-18 00:54:50.203 ERROR 34156 --- [   scheduling-1] o.s.s.s.TaskUtils$LoggingErrorHandler    : Unexpected error occurred in scheduled task

org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(5, 0) finished unsuccessfully.
ml.dmlc.xgboost4j.java.XGBoostError: [20:54:50] /workspace/src/tree/updater_gpu_hist.cu:712: Exception in gpu_hist: [20:54:50] /workspace/src/common/device_helpers.cuh:132: NCCL failure :unhandled system error /workspace/src/common/device_helpers.cu(67)
Stack trace:
  [bt] (0) /tmp/libxgboost4j1229365283550628244.so(+0x584a3d) [0x7f80989d9a3d]
  [bt] (1) /tmp/libxgboost4j1229365283550628244.so(dh::ThrowOnNcclError(ncclResult_t, char const*, int)+0x2d9) [0x7f80989db739]
  [bt] (2) /tmp/libxgboost4j1229365283550628244.so(dh::AllReducer::Init(int)+0x8c8) [0x7f80989da998]
  [bt] (3) /tmp/libxgboost4j1229365283550628244.so(xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::InitDataOnce(xgboost::DMatrix*)+0x127) [0x7f8098c89e97]
  [bt] (4) /tmp/libxgboost4j1229365283550628244.so(xgboost::tree::GPUHistMaker::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&)+0x3b6) [0x7f8098c95ee6]
  [bt] (5) /tmp/libxgboost4j1229365283550628244.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >*)+0x7e3) [0x7f80988157c3]
  [bt] (6) /tmp/libxgboost4j1229365283550628244.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::PredictionCacheEntry*)+0x317) [0x7f8098816367]
  [bt] (7) /tmp/libxgboost4j1229365283550628244.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x312) [0x7f8098852212]
  [bt] (8) /tmp/libxgboost4j1229365283550628244.so(XGBoosterUpdateOneIter+0x68) [0x7f80986f5118]



Stack trace:
  [bt] (0) /tmp/libxgboost4j1229365283550628244.so(+0x81ff39) [0x7f8098c74f39]
  [bt] (1) /tmp/libxgboost4j1229365283550628244.so(xgboost::tree::GPUHistMaker::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&)+0x695) [0x7f8098c961c5]
  [bt] (2) /tmp/libxgboost4j1229365283550628244.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >*)+0x7e3) [0x7f80988157c3]
  [bt] (3) /tmp/libxgboost4j1229365283550628244.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::PredictionCacheEntry*)+0x317) [0x7f8098816367]
  [bt] (4) /tmp/libxgboost4j1229365283550628244.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x312) [0x7f8098852212]
  [bt] (5) /tmp/libxgboost4j1229365283550628244.so(XGBoosterUpdateOneIter+0x68) [0x7f80986f5118]
  [bt] (6) [0x7f81fd017de7]


	at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
	at ml.dmlc.xgboost4j.java.Booster.update(Booster.java:172)
	at ml.dmlc.xgboost4j.java.XGBoost.trainAndSaveCheckpoint(XGBoost.java:217)
	at ml.dmlc.xgboost4j.java.XGBoost.train(XGBoost.java:304)
	at ml.dmlc.xgboost4j.scala.XGBoost$.$anonfun$trainAndSaveCheckpoint$5(XGBoost.scala:66)
	at scala.Option.getOrElse(Option.scala:189)
	at ml.dmlc.xgboost4j.scala.XGBoost$.trainAndSaveCheckpoint(XGBoost.scala:62)
	at ml.dmlc.xgboost4j.scala.XGBoost$.train(XGBoost.scala:106)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.buildDistributedBooster(XGBoost.scala:349)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$3(XGBoost.scala:426)
	at scala.Option.map(Option.scala:230)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$2(XGBoost.scala:424)
	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2454) ~[spark-core_2.12-3.2.1.jar:3.2.1]
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2403) ~[spark-core_2.12-3.2.1.jar:3.2.1]
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2402) ~[spark-core_2.12-3.2.1.jar:3.2.1]
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) ~[scala-library-2.12.15.jar:na]
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) ~[scala-library-2.12.15.jar:na]
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) ~[scala-library-2.12.15.jar:na]
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2402) ~[spark-core_2.12-3.2.1.jar:3.2.1]
	at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:2040) ~[spark-core_2.12-3.2.1.jar:3.2.1]
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2639) ~[spark-core_2.12-3.2.1.jar:3.2.1]
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2584) ~[spark-core_2.12-3.2.1.jar:3.2.1]
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2573) ~[spark-core_2.12-3.2.1.jar:3.2.1]
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) ~[spark-core_2.12-3.2.1.jar:3.2.1]
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938) ~[spark-core_2.12-3.2.1.jar:3.2.1]
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2214) ~[spark-core_2.12-3.2.1.jar:3.2.1]
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2235) ~[spark-core_2.12-3.2.1.jar:3.2.1]
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2254) ~[spark-core_2.12-3.2.1.jar:3.2.1]
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2279) ~[spark-core_2.12-3.2.1.jar:3.2.1]
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030) ~[spark-core_2.12-3.2.1.jar:3.2.1]
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) ~[spark-core_2.12-3.2.1.jar:3.2.1]
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) ~[spark-core_2.12-3.2.1.jar:3.2.1]
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) ~[spark-core_2.12-3.2.1.jar:3.2.1]
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029) ~[spark-core_2.12-3.2.1.jar:3.2.1]
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:431) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar:na]
	at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:190) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar:na]
	at ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor.train(XGBoostRegressor.scala:37) ~[xgboost4j-spark-gpu_2.12-1.6.1.jar:na]
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:151) ~[spark-mllib_2.12-3.2.1.jar:3.2.1]
	at com.alekscapital.mltests.service.MLService.xgBoostTest(MLService.java:221) ~[classes/:na]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_321]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_321]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_321]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_321]
	at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84) ~[spring-context-5.3.10.jar:5.3.10]
	at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) ~[spring-context-5.3.10.jar:5.3.10]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_321]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_321]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_321]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_321]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_321]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_321]
	at java.lang.Thread.run(Thread.java:750) [na:1.8.0_321]

2022-06-18 00:54:50.411  INFO 34156 --- [uler-event-loop] org.apache.spark.scheduler.DAGScheduler  : Resubmitting failed stages

Just in case, here is the output of this command in the Spark executor Docker container:

dpkg -l | grep nccl
hi  libnccl-dev                     2.12.10-1+cuda11.6                amd64        NVIDIA Collective Communication Library (NCCL) Development Files
hi  libnccl2                        2.12.10-1+cuda11.6                amd64        NVIDIA Collective Communication Library (NCCL) Runtime

NVIDIA driver version 512.95;
Spark version 3.2.1.

@wbo4958 I want to thank you again for your help. If the experiments succeed, I can contribute all my results with the CI/CD pipeline as a pull request and/or an article with an example of running this in a Spring application.

Now I'm off to deal with the last exception; I found a couple of relevant links on Google.

@Dartya

Dartya commented Jun 19, 2022

I also tried to run 3.1.3 on a local cluster (3.2.1 does not run under Windows; see the discussion on Stack Overflow).

I am getting this error in the stack trace:

Caused by: java.io.FileNotFoundException: Could not locate native dependency amd64/Windows 10/cudf.dll
        at ai.rapids.cudf.NativeDepsLoader.createFile(NativeDepsLoader.java:204) ~[cudf-22.04.0-cuda11.jar:?]
        at ai.rapids.cudf.NativeDepsLoader.lambda$loadNativeDeps$0(NativeDepsLoader.java:146) ~[cudf-22.04.0-cuda11.jar:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_321]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_321]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_321]
        at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_321]

This is a catastrophe and an endless source of pain. Still, I will continue my experiments with a standalone cluster on a virtual machine and Docker. I'll keep you informed.

@Dartya

Dartya commented Jun 19, 2022

I checked all the configs and dependencies again and removed a dependency that was not needed:

<dependency>
	<groupId>ai.rapids</groupId>
	<artifactId>xgboost4j-spark_2.x</artifactId>
	<version>1.0.0-Beta5</version>
</dependency>
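For reference, here is a sketch of the GPU dependency set that remained. The coordinates are inferred from the jar names that appear in the executor log later in this thread (xgboost4j-gpu_2.12-1.6.1.jar, xgboost4j-spark-gpu_2.12-1.6.1.jar, cudf-22.04.0-cuda11.jar); verify the groupIds and classifier against your own repository before relying on them:

```xml
<!-- Coordinates inferred from the jar names in the executor log; verify
     against Maven Central for your environment. -->
<dependency>
	<groupId>ml.dmlc</groupId>
	<artifactId>xgboost4j-gpu_2.12</artifactId>
	<version>1.6.1</version>
</dependency>
<dependency>
	<groupId>ml.dmlc</groupId>
	<artifactId>xgboost4j-spark-gpu_2.12</artifactId>
	<version>1.6.1</version>
</dependency>
<dependency>
	<groupId>ai.rapids</groupId>
	<artifactId>cudf</artifactId>
	<version>22.04.0</version>
	<classifier>cuda11</classifier>
</dependency>
```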

After the job started, the stack trace showed that Python needed to be installed.

I installed python and python3 in the executor and driver images. I made one attempt with the Parquet files and your example, and two attempts with my own example, based on a schema that maps the CSV file columns to Double and Float.

The results are the same; a stack trace of the following form is returned:

22/06/19 19:50:24 INFO XGBoostSpark: Leveraging gpu device 0 to train
22/06/19 19:50:24 ERROR XGBoostSpark: XGBooster worker 0 has failed 0 times due to 
ml.dmlc.xgboost4j.java.XGBoostError: [19:50:24] /workspace/src/tree/updater_gpu_hist.cu:712: Exception in gpu_hist: [19:50:24] /workspace/src/common/device_helpers.cuh:132: NCCL failure :unhandled system error /workspace/src/common/device_helpers.cu(67)
Stack trace:
  [bt] (0) /tmp/libxgboost4j7538021946772371634.so(+0x584a3d) [0x7f96409d9a3d]
  [bt] (1) /tmp/libxgboost4j7538021946772371634.so(dh::ThrowOnNcclError(ncclResult_t, char const*, int)+0x2d9) [0x7f96409db739]
  [bt] (2) /tmp/libxgboost4j7538021946772371634.so(dh::AllReducer::Init(int)+0x8c8) [0x7f96409da998]
  [bt] (3) /tmp/libxgboost4j7538021946772371634.so(xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::InitDataOnce(xgboost::DMatrix*)+0x127) [0x7f9640c89e97]
  [bt] (4) /tmp/libxgboost4j7538021946772371634.so(xgboost::tree::GPUHistMaker::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&)+0x3b6) [0x7f9640c95ee6]
  [bt] (5) /tmp/libxgboost4j7538021946772371634.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >*)+0x7e3) [0x7f96408157c3]
  [bt] (6) /tmp/libxgboost4j7538021946772371634.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::PredictionCacheEntry*)+0x317) [0x7f9640816367]
  [bt] (7) /tmp/libxgboost4j7538021946772371634.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x312) [0x7f9640852212]
  [bt] (8) /tmp/libxgboost4j7538021946772371634.so(XGBoosterUpdateOneIter+0x68) [0x7f96406f5118]



Stack trace:
  [bt] (0) /tmp/libxgboost4j7538021946772371634.so(+0x81ff39) [0x7f9640c74f39]
  [bt] (1) /tmp/libxgboost4j7538021946772371634.so(xgboost::tree::GPUHistMaker::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&)+0x695) [0x7f9640c961c5]
  [bt] (2) /tmp/libxgboost4j7538021946772371634.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >*)+0x7e3) [0x7f96408157c3]
  [bt] (3) /tmp/libxgboost4j7538021946772371634.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::PredictionCacheEntry*)+0x317) [0x7f9640816367]
  [bt] (4) /tmp/libxgboost4j7538021946772371634.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x312) [0x7f9640852212]
  [bt] (5) /tmp/libxgboost4j7538021946772371634.so(XGBoosterUpdateOneIter+0x68) [0x7f96406f5118]
  [bt] (6) [0x7f9795017de7]


	at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
	at ml.dmlc.xgboost4j.java.Booster.update(Booster.java:172)
	at ml.dmlc.xgboost4j.java.XGBoost.trainAndSaveCheckpoint(XGBoost.java:217)
	at ml.dmlc.xgboost4j.java.XGBoost.train(XGBoost.java:304)
	at ml.dmlc.xgboost4j.scala.XGBoost$.$anonfun$trainAndSaveCheckpoint$5(XGBoost.scala:66)
	at scala.Option.getOrElse(Option.scala:189)
	at ml.dmlc.xgboost4j.scala.XGBoost$.trainAndSaveCheckpoint(XGBoost.scala:62)
	at ml.dmlc.xgboost4j.scala.XGBoost$.train(XGBoost.scala:106)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.buildDistributedBooster(XGBoost.scala:349)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$3(XGBoost.scala:426)
	at scala.Option.map(Option.scala:230)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$2(XGBoost.scala:424)
	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

I looked at the discussion here, but checking shows that nvidia-smi works in my container, and based on the RAPIDS example that operates on two Long datasets, I can say for sure that GPU tasks in the container are working.

I also verified neural network training with the DJL library: in the Spark executor container, the model trains on the PyTorch engine.

I saw here that the root of the problem could be a mismatch between the XGBoost and NCCL versions, and this is my only remaining theory for why the example throws an exception.

So far, I don't know where else to dig. I will try to complete my task without XGBoost, using the random forest in the Spark ML API.

@wbo4958

wbo4958 commented Jun 20, 2022

@Dartya,

Please ignore the "deserialization error:"; it is printed by the RAPIDS Accelerator and means some Spark physical plans can't run on the GPU. For the XGBoost case, however, the whole ETL pipeline will run on the GPU, so don't worry about it.

As for the NCCL issue, please add the configuration below and check the "stdout" log on the executor side:

--conf spark.executorEnv.NCCL_DEBUG=INFO
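For reference, when submitting via spark-submit instead of building the SparkConf in code, the same setting might be passed like this (the master URL, main class name, and jar path here are placeholders, not taken from the thread):

```shell
spark-submit \
  --master spark://<master-host>:7077 \
  --conf spark.executorEnv.NCCL_DEBUG=INFO \
  --class com.example.MLApp \
  service.jar
```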

@Dartya

Dartya commented Jun 20, 2022

Hi @wbo4958 ,

I set spark.executorEnv.NCCL_DEBUG=INFO in the driver

@Bean
public JavaSparkContext javaSparkContext() throws UnknownHostException {
    String host = InetAddress.getLocalHost().getHostAddress();
    SparkConf sparkConf = new SparkConf(true)
            .setAppName(appName)
            .setMaster("spark://" + masterHost)
            .set("spark.rapids.memory.gpu.pooling.enabled", "false")
            .set("spark.rapids.memory.gpu.minAllocFraction", "0.0001")
            .set("spark.executorEnv.NCCL_DEBUG", "INFO");
    return new JavaSparkContext(sparkConf);
}

but did not see any difference, so I stopped the executor and put it in conf/spark-env.sh. I still did not see a difference in the logs, so I stopped the executor again and set it in conf/spark-env.sh as well.

Additionally, I set an NCCL_DEBUG environment variable directly in the executor's Docker container.
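A minimal sketch of how that environment variable might be set in the executor image's Dockerfile (NCCL_DEBUG_SUBSYS is an optional, additional NCCL knob and is my assumption, not something from this thread):

```dockerfile
# Make NCCL log verbosely in every process started inside the container.
ENV NCCL_DEBUG=INFO
# Optionally select which NCCL subsystems are logged (assumed extra knob).
ENV NCCL_DEBUG_SUBSYS=ALL
```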

I did not see any difference or additional information. Here is the executor log and some screenshots:

Spark Executor Command: "/usr/lib/jvm/java-8-openjdk-amd64//bin/java" "-cp" "/opt/spark/conf/:/opt/spark/jars/*" "-Xmx4096M" "-Dspark.driver.port=10000" "-Dspark.ui.port=4040" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@192.168.0.125:10000" "--executor-id" "0" "--hostname" "172.17.0.5" "--cores" "4" "--app-id" "app-20220620064100-0056" "--worker-url" "spark://Worker@172.17.0.5:34689" "--resourcesFile" "/opt/spark/work/app-20220620064100-0056/0/resource-executor-3802894113121577606.json"
========================================

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/06/20 03:41:00 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 959@d43c9fbb175b
22/06/20 03:41:00 INFO SignalUtils: Registering signal handler for TERM
22/06/20 03:41:00 INFO SignalUtils: Registering signal handler for HUP
22/06/20 03:41:00 INFO SignalUtils: Registering signal handler for INT
22/06/20 03:41:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/06/20 03:41:01 INFO SecurityManager: Changing view acls to: appuser,alexp
22/06/20 03:41:01 INFO SecurityManager: Changing modify acls to: appuser,alexp
22/06/20 03:41:01 INFO SecurityManager: Changing view acls groups to: 
22/06/20 03:41:01 INFO SecurityManager: Changing modify acls groups to: 
22/06/20 03:41:01 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(appuser, alexp); groups with view permissions: Set(); users  with modify permissions: Set(appuser, alexp); groups with modify permissions: Set()
22/06/20 03:41:01 INFO TransportClientFactory: Successfully created connection to /192.168.0.125:10000 after 49 ms (0 ms spent in bootstraps)
22/06/20 03:41:01 INFO SecurityManager: Changing view acls to: appuser,alexp
22/06/20 03:41:01 INFO SecurityManager: Changing modify acls to: appuser,alexp
22/06/20 03:41:01 INFO SecurityManager: Changing view acls groups to: 
22/06/20 03:41:01 INFO SecurityManager: Changing modify acls groups to: 
22/06/20 03:41:01 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(appuser, alexp); groups with view permissions: Set(); users  with modify permissions: Set(appuser, alexp); groups with modify permissions: Set()
22/06/20 03:41:01 INFO TransportClientFactory: Successfully created connection to /192.168.0.125:10000 after 3 ms (0 ms spent in bootstraps)
22/06/20 03:41:01 INFO DiskBlockManager: Created local directory at /tmp/spark-d68c5338-cd9f-4854-9512-7c578965e036/executor-14015d59-b6dc-41b7-805f-7ba37818d4e3/blockmgr-86d73675-26fd-4270-b818-c3c19f5d5ee7
22/06/20 03:41:01 INFO MemoryStore: MemoryStore started with capacity 2004.6 MiB
22/06/20 03:41:01 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@192.168.0.125:10000
22/06/20 03:41:01 INFO WorkerWatcher: Connecting to worker spark://Worker@172.17.0.5:34689
22/06/20 03:41:01 INFO TransportClientFactory: Successfully created connection to /172.17.0.5:34689 after 2 ms (0 ms spent in bootstraps)
22/06/20 03:41:01 INFO WorkerWatcher: Successfully connected to spark://Worker@172.17.0.5:34689
22/06/20 03:41:01 INFO ResourceUtils: ==============================================================
22/06/20 03:41:01 INFO ResourceUtils: Custom resources for spark.executor:
gpu -> [name: gpu, addresses: 0]
22/06/20 03:41:01 INFO ResourceUtils: ==============================================================
22/06/20 03:41:01 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
22/06/20 03:41:01 INFO Executor: Starting executor ID 0 on host 172.17.0.5
22/06/20 03:41:01 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39181.
22/06/20 03:41:01 INFO NettyBlockTransferService: Server created on 172.17.0.5:39181
22/06/20 03:41:01 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/06/20 03:41:01 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(0, 172.17.0.5, 39181, None)
22/06/20 03:41:01 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(0, 172.17.0.5, 39181, None)
22/06/20 03:41:01 INFO BlockManager: Initialized BlockManager: BlockManagerId(0, 172.17.0.5, 39181, None)
22/06/20 03:41:02 INFO Executor: Fetching spark://192.168.0.125:10000/jars/xgboost4j-gpu_2.12-1.6.1.jar with timestamp 1655696457800
22/06/20 03:41:02 INFO TransportClientFactory: Successfully created connection to /192.168.0.125:10000 after 3 ms (0 ms spent in bootstraps)
22/06/20 03:41:02 INFO Utils: Fetching spark://192.168.0.125:10000/jars/xgboost4j-gpu_2.12-1.6.1.jar to /tmp/spark-d68c5338-cd9f-4854-9512-7c578965e036/executor-14015d59-b6dc-41b7-805f-7ba37818d4e3/spark-c95559bb-4f0e-4771-b9e6-c603b8f9ddd1/fetchFileTemp4323835884637724814.tmp
22/06/20 03:41:07 INFO Utils: Copying /tmp/spark-d68c5338-cd9f-4854-9512-7c578965e036/executor-14015d59-b6dc-41b7-805f-7ba37818d4e3/spark-c95559bb-4f0e-4771-b9e6-c603b8f9ddd1/-17922330801655696457800_cache to /opt/spark/work/app-20220620064100-0056/0/./xgboost4j-gpu_2.12-1.6.1.jar
22/06/20 03:41:07 INFO Executor: Adding file:/opt/spark/work/app-20220620064100-0056/0/./xgboost4j-gpu_2.12-1.6.1.jar to class loader
22/06/20 03:41:07 INFO Executor: Fetching spark://192.168.0.125:10000/jars/cudf-22.04.0-cuda11.jar with timestamp 1655696457800
22/06/20 03:41:07 INFO Utils: Fetching spark://192.168.0.125:10000/jars/cudf-22.04.0-cuda11.jar to /tmp/spark-d68c5338-cd9f-4854-9512-7c578965e036/executor-14015d59-b6dc-41b7-805f-7ba37818d4e3/spark-c95559bb-4f0e-4771-b9e6-c603b8f9ddd1/fetchFileTemp8764111347725763262.tmp
22/06/20 03:41:15 INFO Utils: Copying /tmp/spark-d68c5338-cd9f-4854-9512-7c578965e036/executor-14015d59-b6dc-41b7-805f-7ba37818d4e3/spark-c95559bb-4f0e-4771-b9e6-c603b8f9ddd1/8599044691655696457800_cache to /opt/spark/work/app-20220620064100-0056/0/./cudf-22.04.0-cuda11.jar
22/06/20 03:41:16 INFO Executor: Adding file:/opt/spark/work/app-20220620064100-0056/0/./cudf-22.04.0-cuda11.jar to class loader
22/06/20 03:41:16 INFO Executor: Fetching spark://192.168.0.125:10000/jars/xgboost4j-spark-gpu_2.12-1.6.1.jar with timestamp 1655696457800
22/06/20 03:41:16 INFO Utils: Fetching spark://192.168.0.125:10000/jars/xgboost4j-spark-gpu_2.12-1.6.1.jar to /tmp/spark-d68c5338-cd9f-4854-9512-7c578965e036/executor-14015d59-b6dc-41b7-805f-7ba37818d4e3/spark-c95559bb-4f0e-4771-b9e6-c603b8f9ddd1/fetchFileTemp1096880564689893858.tmp
22/06/20 03:41:16 INFO Utils: Copying /tmp/spark-d68c5338-cd9f-4854-9512-7c578965e036/executor-14015d59-b6dc-41b7-805f-7ba37818d4e3/spark-c95559bb-4f0e-4771-b9e6-c603b8f9ddd1/-4119855441655696457800_cache to /opt/spark/work/app-20220620064100-0056/0/./xgboost4j-spark-gpu_2.12-1.6.1.jar
22/06/20 03:41:16 INFO Executor: Adding file:/opt/spark/work/app-20220620064100-0056/0/./xgboost4j-spark-gpu_2.12-1.6.1.jar to class loader
22/06/20 03:41:16 INFO Executor: Fetching spark://192.168.0.125:10000/jars/rapids-4-spark_2.12-22.04.0.jar with timestamp 1655696457800
22/06/20 03:41:16 INFO Utils: Fetching spark://192.168.0.125:10000/jars/rapids-4-spark_2.12-22.04.0.jar to /tmp/spark-d68c5338-cd9f-4854-9512-7c578965e036/executor-14015d59-b6dc-41b7-805f-7ba37818d4e3/spark-c95559bb-4f0e-4771-b9e6-c603b8f9ddd1/fetchFileTemp420339122760152486.tmp
22/06/20 03:41:16 INFO Utils: Copying /tmp/spark-d68c5338-cd9f-4854-9512-7c578965e036/executor-14015d59-b6dc-41b7-805f-7ba37818d4e3/spark-c95559bb-4f0e-4771-b9e6-c603b8f9ddd1/-16637798311655696457800_cache to /opt/spark/work/app-20220620064100-0056/0/./rapids-4-spark_2.12-22.04.0.jar
22/06/20 03:41:16 INFO Executor: Adding file:/opt/spark/work/app-20220620064100-0056/0/./rapids-4-spark_2.12-22.04.0.jar to class loader
22/06/20 03:41:16 INFO Executor: Fetching spark://192.168.0.125:10000/jars/spark-nlp_2.12-3.4.1.jar with timestamp 1655696457800
22/06/20 03:41:16 INFO Utils: Fetching spark://192.168.0.125:10000/jars/spark-nlp_2.12-3.4.1.jar to /tmp/spark-d68c5338-cd9f-4854-9512-7c578965e036/executor-14015d59-b6dc-41b7-805f-7ba37818d4e3/spark-c95559bb-4f0e-4771-b9e6-c603b8f9ddd1/fetchFileTemp2379215212974648005.tmp
22/06/20 03:41:17 INFO Utils: Copying /tmp/spark-d68c5338-cd9f-4854-9512-7c578965e036/executor-14015d59-b6dc-41b7-805f-7ba37818d4e3/spark-c95559bb-4f0e-4771-b9e6-c603b8f9ddd1/-8455308651655696457800_cache to /opt/spark/work/app-20220620064100-0056/0/./spark-nlp_2.12-3.4.1.jar
22/06/20 03:41:17 INFO Executor: Adding file:/opt/spark/work/app-20220620064100-0056/0/./spark-nlp_2.12-3.4.1.jar to class loader
22/06/20 03:41:17 INFO Executor: Fetching spark://192.168.0.125:10000/jars/config-1.4.1.jar with timestamp 1655696457800
22/06/20 03:41:17 INFO Utils: Fetching spark://192.168.0.125:10000/jars/config-1.4.1.jar to /tmp/spark-d68c5338-cd9f-4854-9512-7c578965e036/executor-14015d59-b6dc-41b7-805f-7ba37818d4e3/spark-c95559bb-4f0e-4771-b9e6-c603b8f9ddd1/fetchFileTemp4772566440996669269.tmp
22/06/20 03:41:17 INFO Utils: Copying /tmp/spark-d68c5338-cd9f-4854-9512-7c578965e036/executor-14015d59-b6dc-41b7-805f-7ba37818d4e3/spark-c95559bb-4f0e-4771-b9e6-c603b8f9ddd1/-17661293231655696457800_cache to /opt/spark/work/app-20220620064100-0056/0/./config-1.4.1.jar
22/06/20 03:41:17 INFO Executor: Adding file:/opt/spark/work/app-20220620064100-0056/0/./config-1.4.1.jar to class loader
22/06/20 03:41:17 INFO Executor: Fetching spark://192.168.0.125:10000/jars/service.jar with timestamp 1655696457800
22/06/20 03:41:17 INFO Utils: Fetching spark://192.168.0.125:10000/jars/service.jar to /tmp/spark-d68c5338-cd9f-4854-9512-7c578965e036/executor-14015d59-b6dc-41b7-805f-7ba37818d4e3/spark-c95559bb-4f0e-4771-b9e6-c603b8f9ddd1/fetchFileTemp3360542029716246917.tmp
22/06/20 03:42:11 INFO Utils: Copying /tmp/spark-d68c5338-cd9f-4854-9512-7c578965e036/executor-14015d59-b6dc-41b7-805f-7ba37818d4e3/spark-c95559bb-4f0e-4771-b9e6-c603b8f9ddd1/-967860391655696457800_cache to /opt/spark/work/app-20220620064100-0056/0/./service.jar
22/06/20 03:42:13 INFO Executor: Adding file:/opt/spark/work/app-20220620064100-0056/0/./service.jar to class loader
22/06/20 03:42:13 INFO ShimLoader: Loading shim for Spark version: 3.2.1
22/06/20 03:42:13 INFO ShimLoader: Complete Spark build info: 3.2.1, https://github.com/apache/spark, HEAD, 4f25b3f71238a00508a356591553f2dfa89f8290, 2022-01-20T19:26:14Z
22/06/20 03:42:13 INFO ShimLoader: Forcing shim caller classloader update (default behavior). If it causes issues with userClassPathFirst, set spark.rapids.force.caller.classloader to false!
22/06/20 03:42:13 INFO ShimLoader: Falling back on ShimLoader caller's classloader org.apache.spark.util.MutableURLClassLoader@5eebd5ba
22/06/20 03:42:13 INFO ShimLoader: Updating spark classloader org.apache.spark.util.MutableURLClassLoader@5eebd5ba with the URLs: jar:file:/opt/spark/work/app-20220620064100-0056/0/./rapids-4-spark_2.12-22.04.0.jar!/spark3xx-common/, jar:file:/opt/spark/work/app-20220620064100-0056/0/./rapids-4-spark_2.12-22.04.0.jar!/spark321/
22/06/20 03:42:13 INFO ShimLoader: Spark classLoader org.apache.spark.util.MutableURLClassLoader@5eebd5ba updated successfully
22/06/20 03:42:13 INFO RapidsPluginUtils: RAPIDS Accelerator build: {version=22.04.0, user=, url=https://github.com/NVIDIA/spark-rapids.git, date=2022-04-14T08:57:01Z, revision=0a6b5f4fb1aa2cc753725f81f395b18451b86433, cudf_version=22.04.0, branch=HEAD}
22/06/20 03:42:13 INFO RapidsPluginUtils: cudf build: {version=22.04.0, user=, date=2022-04-07T12:10:26Z, revision=8bf0520170bc4528bbf5896a950930e92f1dad7b, branch=HEAD}
22/06/20 03:42:13 WARN RapidsPluginUtils: RAPIDS Accelerator 22.04.0 using cudf 22.04.0.
22/06/20 03:42:13 INFO RapidsExecutorPlugin: RAPIDS Accelerator build: {version=22.04.0, user=, url=https://github.com/NVIDIA/spark-rapids.git, date=2022-04-14T08:57:01Z, revision=0a6b5f4fb1aa2cc753725f81f395b18451b86433, cudf_version=22.04.0, branch=HEAD}
22/06/20 03:42:13 INFO RapidsExecutorPlugin: cudf build: {version=22.04.0, user=, date=2022-04-07T12:10:26Z, revision=8bf0520170bc4528bbf5896a950930e92f1dad7b, branch=HEAD}
22/06/20 03:42:14 INFO RapidsExecutorPlugin: Initializing memory from Executor Plugin
22/06/20 03:42:20 WARN GpuDeviceManager: RMM pool is disabled since spark.rapids.memory.gpu.pooling.enabled is set to false; however, this configuration is deprecated and the behavior may change in a future release.
22/06/20 03:42:20 INFO GpuDeviceManager: Initializing RMM  pool size = 4837.6240234375 MB on gpuId 0
22/06/20 03:42:20 INFO GpuDeviceManager: Using per-thread default stream
22/06/20 03:42:20 INFO ShimDiskBlockManager: Created local directory at /tmp/spark-d68c5338-cd9f-4854-9512-7c578965e036/executor-14015d59-b6dc-41b7-805f-7ba37818d4e3/blockmgr-ceb8e633-113a-4b2a-b93f-9a8884f2a89d
22/06/20 03:42:20 INFO RapidsBufferCatalog: Installing GPU memory handler for spill
22/06/20 03:42:20 INFO RapidsExecutorPlugin: The number of concurrent GPU tasks allowed is 1
22/06/20 03:42:20 INFO ExecutorPluginContainer: Initialized executor component for plugin com.nvidia.spark.SQLPlugin.
22/06/20 03:42:20 INFO CoarseGrainedExecutorBackend: Got assigned task 0
22/06/20 03:42:20 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
22/06/20 03:42:20 INFO TorrentBroadcast: Started reading broadcast variable 1 with 1 pieces (estimated total size 4.0 MiB)
22/06/20 03:42:20 INFO TransportClientFactory: Successfully created connection to /192.168.0.125:54186 after 3 ms (0 ms spent in bootstraps)
22/06/20 03:42:20 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 11.3 KiB, free 2004.6 MiB)
22/06/20 03:42:20 INFO TorrentBroadcast: Reading broadcast variable 1 took 86 ms
22/06/20 03:42:20 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 23.3 KiB, free 2004.6 MiB)
22/06/20 03:42:21 INFO FileScanRDD: Reading File path: file:///opt/spark/train.csv, range: 0-1114146, partition values: [empty row]
22/06/20 03:42:21 INFO TorrentBroadcast: Started reading broadcast variable 0 with 1 pieces (estimated total size 4.0 MiB)
22/06/20 03:42:21 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 34.9 KiB, free 2004.5 MiB)
22/06/20 03:42:21 INFO TorrentBroadcast: Reading broadcast variable 0 took 13 ms
22/06/20 03:42:21 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 541.0 KiB, free 2004.0 MiB)
22/06/20 03:42:21 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 3208 bytes result sent to driver
22/06/20 03:42:21 INFO CoarseGrainedExecutorBackend: Got assigned task 1
22/06/20 03:42:21 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
22/06/20 03:42:21 INFO MapOutputTrackerWorker: Updating epoch to 1 and clearing cache
22/06/20 03:42:21 INFO TorrentBroadcast: Started reading broadcast variable 2 with 1 pieces (estimated total size 4.0 MiB)
22/06/20 03:42:21 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 7.5 KiB, free 2004.0 MiB)
22/06/20 03:42:21 INFO TorrentBroadcast: Reading broadcast variable 2 took 12 ms
22/06/20 03:42:21 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 14.0 KiB, free 2004.0 MiB)
22/06/20 03:42:21 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 0, fetching them
22/06/20 03:42:21 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@192.168.0.125:10000)
22/06/20 03:42:22 INFO MapOutputTrackerWorker: Got the map output locations
22/06/20 03:42:22 INFO ShuffleBlockFetcherIterator: Getting 1 (234.8 KiB) non-empty blocks including 1 (234.8 KiB) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
22/06/20 03:42:22 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 11 ms
[03:42:24] task 0 got new rank 0
22/06/20 03:42:24 INFO XGBoostSpark: Leveraging gpu device 0 to train
22/06/20 03:42:24 ERROR XGBoostSpark: XGBooster worker 0 has failed 0 times due to 
ml.dmlc.xgboost4j.java.XGBoostError: [03:42:24] /workspace/src/tree/updater_gpu_hist.cu:712: Exception in gpu_hist: [03:42:24] /workspace/src/common/device_helpers.cuh:132: NCCL failure :unhandled system error /workspace/src/common/device_helpers.cu(67)
Stack trace:
  [bt] (0) /tmp/libxgboost4j4845375869119201849.so(+0x584a3d) [0x7fd9a09d9a3d]
  [bt] (1) /tmp/libxgboost4j4845375869119201849.so(dh::ThrowOnNcclError(ncclResult_t, char const*, int)+0x2d9) [0x7fd9a09db739]
  [bt] (2) /tmp/libxgboost4j4845375869119201849.so(dh::AllReducer::Init(int)+0x8c8) [0x7fd9a09da998]
  [bt] (3) /tmp/libxgboost4j4845375869119201849.so(xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::InitDataOnce(xgboost::DMatrix*)+0x127) [0x7fd9a0c89e97]
  [bt] (4) /tmp/libxgboost4j4845375869119201849.so(xgboost::tree::GPUHistMaker::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&)+0x3b6) [0x7fd9a0c95ee6]
  [bt] (5) /tmp/libxgboost4j4845375869119201849.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >*)+0x7e3) [0x7fd9a08157c3]
  [bt] (6) /tmp/libxgboost4j4845375869119201849.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::PredictionCacheEntry*)+0x317) [0x7fd9a0816367]
  [bt] (7) /tmp/libxgboost4j4845375869119201849.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x312) [0x7fd9a0852212]
  [bt] (8) /tmp/libxgboost4j4845375869119201849.so(XGBoosterUpdateOneIter+0x68) [0x7fd9a06f5118]



Stack trace:
  [bt] (0) /tmp/libxgboost4j4845375869119201849.so(+0x81ff39) [0x7fd9a0c74f39]
  [bt] (1) /tmp/libxgboost4j4845375869119201849.so(xgboost::tree::GPUHistMaker::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&)+0x695) [0x7fd9a0c961c5]
  [bt] (2) /tmp/libxgboost4j4845375869119201849.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >*)+0x7e3) [0x7fd9a08157c3]
  [bt] (3) /tmp/libxgboost4j4845375869119201849.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::PredictionCacheEntry*)+0x317) [0x7fd9a0816367]
  [bt] (4) /tmp/libxgboost4j4845375869119201849.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x312) [0x7fd9a0852212]
  [bt] (5) /tmp/libxgboost4j4845375869119201849.so(XGBoosterUpdateOneIter+0x68) [0x7fd9a06f5118]
  [bt] (6) [0x7fdafd017de7]


	at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
	at ml.dmlc.xgboost4j.java.Booster.update(Booster.java:172)
	at ml.dmlc.xgboost4j.java.XGBoost.trainAndSaveCheckpoint(XGBoost.java:217)
	at ml.dmlc.xgboost4j.java.XGBoost.train(XGBoost.java:304)
	at ml.dmlc.xgboost4j.scala.XGBoost$.$anonfun$trainAndSaveCheckpoint$5(XGBoost.scala:66)
	at scala.Option.getOrElse(Option.scala:189)
	at ml.dmlc.xgboost4j.scala.XGBoost$.trainAndSaveCheckpoint(XGBoost.scala:62)
	at ml.dmlc.xgboost4j.scala.XGBoost$.train(XGBoost.scala:106)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.buildDistributedBooster(XGBoost.scala:349)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$3(XGBoost.scala:426)
	at scala.Option.map(Option.scala:230)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$2(XGBoost.scala:424)
	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
22/06/20 03:42:24 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
ml.dmlc.xgboost4j.java.XGBoostError: [03:42:24] /workspace/src/tree/updater_gpu_hist.cu:712: Exception in gpu_hist: [03:42:24] /workspace/src/common/device_helpers.cuh:132: NCCL failure :unhandled system error /workspace/src/common/device_helpers.cu(67)
Stack trace:
  [bt] (0) /tmp/libxgboost4j4845375869119201849.so(+0x584a3d) [0x7fd9a09d9a3d]
  [bt] (1) /tmp/libxgboost4j4845375869119201849.so(dh::ThrowOnNcclError(ncclResult_t, char const*, int)+0x2d9) [0x7fd9a09db739]
  [bt] (2) /tmp/libxgboost4j4845375869119201849.so(dh::AllReducer::Init(int)+0x8c8) [0x7fd9a09da998]
  [bt] (3) /tmp/libxgboost4j4845375869119201849.so(xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::InitDataOnce(xgboost::DMatrix*)+0x127) [0x7fd9a0c89e97]
  [bt] (4) /tmp/libxgboost4j4845375869119201849.so(xgboost::tree::GPUHistMaker::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&)+0x3b6) [0x7fd9a0c95ee6]
  [bt] (5) /tmp/libxgboost4j4845375869119201849.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >*)+0x7e3) [0x7fd9a08157c3]
  [bt] (6) /tmp/libxgboost4j4845375869119201849.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::PredictionCacheEntry*)+0x317) [0x7fd9a0816367]
  [bt] (7) /tmp/libxgboost4j4845375869119201849.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x312) [0x7fd9a0852212]
  [bt] (8) /tmp/libxgboost4j4845375869119201849.so(XGBoosterUpdateOneIter+0x68) [0x7fd9a06f5118]



Stack trace:
  [bt] (0) /tmp/libxgboost4j4845375869119201849.so(+0x81ff39) [0x7fd9a0c74f39]
  [bt] (1) /tmp/libxgboost4j4845375869119201849.so(xgboost::tree::GPUHistMaker::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&)+0x695) [0x7fd9a0c961c5]
  [bt] (2) /tmp/libxgboost4j4845375869119201849.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >*)+0x7e3) [0x7fd9a08157c3]
  [bt] (3) /tmp/libxgboost4j4845375869119201849.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::PredictionCacheEntry*)+0x317) [0x7fd9a0816367]
  [bt] (4) /tmp/libxgboost4j4845375869119201849.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x312) [0x7fd9a0852212]
  [bt] (5) /tmp/libxgboost4j4845375869119201849.so(XGBoosterUpdateOneIter+0x68) [0x7fd9a06f5118]
  [bt] (6) [0x7fdafd017de7]


	at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
	at ml.dmlc.xgboost4j.java.Booster.update(Booster.java:172)
	at ml.dmlc.xgboost4j.java.XGBoost.trainAndSaveCheckpoint(XGBoost.java:217)
	at ml.dmlc.xgboost4j.java.XGBoost.train(XGBoost.java:304)
	at ml.dmlc.xgboost4j.scala.XGBoost$.$anonfun$trainAndSaveCheckpoint$5(XGBoost.scala:66)
	at scala.Option.getOrElse(Option.scala:189)
	at ml.dmlc.xgboost4j.scala.XGBoost$.trainAndSaveCheckpoint(XGBoost.scala:62)
	at ml.dmlc.xgboost4j.scala.XGBoost$.train(XGBoost.scala:106)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.buildDistributedBooster(XGBoost.scala:349)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$3(XGBoost.scala:426)
	at scala.Option.map(Option.scala:230)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$2(XGBoost.scala:424)
	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)


@Dartya
Author

Dartya commented Jun 20, 2022

@wbo4958 I'm sorry. Out of habit, I looked at the stderr log and did not pay attention to stdout. Here is its output:

0d2e39cf9b5e:201:268 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.5<0>
0d2e39cf9b5e:201:268 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

0d2e39cf9b5e:201:268 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
0d2e39cf9b5e:201:268 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.5<0>
0d2e39cf9b5e:201:268 [0] NCCL INFO Using network Socket
NCCL version 2.8.3+cuda11.0

0d2e39cf9b5e:201:268 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:0b/../../0000:0b:00.0
0d2e39cf9b5e:201:268 [0] NCCL INFO graph/xml.cc:469 -> 2
0d2e39cf9b5e:201:268 [0] NCCL INFO graph/xml.cc:660 -> 2
0d2e39cf9b5e:201:268 [0] NCCL INFO graph/topo.cc:522 -> 2
0d2e39cf9b5e:201:268 [0] NCCL INFO init.cc:627 -> 2
0d2e39cf9b5e:201:268 [0] NCCL INFO init.cc:878 -> 2
0d2e39cf9b5e:201:268 [0] NCCL INFO init.cc:914 -> 2
0d2e39cf9b5e:201:268 [0] NCCL INFO init.cc:926 -> 2

@wbo4958
Contributor

wbo4958 commented Jun 20, 2022

@Dartya Could you share the stdout of all executors?

@wbo4958
Contributor

wbo4958 commented Jun 20, 2022

The questions below are from the NCCL team:

Is /sys present on the system?
If so, what is the output of ls -l /sys/class/net, ls -l /sys/class/net/eth0, and ls -l /sys/class/net/eth0/device?
It seems NCCL is trying to find where eth0 is attached but fails to do so.

@Dartya
Author

Dartya commented Jun 20, 2022

@wbo4958, I have only one executor, and its stdout contains only this log.

0d2e39cf9b5e:201:268 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.5<0>
0d2e39cf9b5e:201:268 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

0d2e39cf9b5e:201:268 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
0d2e39cf9b5e:201:268 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.5<0>
0d2e39cf9b5e:201:268 [0] NCCL INFO Using network Socket
NCCL version 2.8.3+cuda11.0

0d2e39cf9b5e:201:268 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:0b/../../0000:0b:00.0
0d2e39cf9b5e:201:268 [0] NCCL INFO graph/xml.cc:469 -> 2
0d2e39cf9b5e:201:268 [0] NCCL INFO graph/xml.cc:660 -> 2
0d2e39cf9b5e:201:268 [0] NCCL INFO graph/topo.cc:522 -> 2
0d2e39cf9b5e:201:268 [0] NCCL INFO init.cc:627 -> 2
0d2e39cf9b5e:201:268 [0] NCCL INFO init.cc:878 -> 2
0d2e39cf9b5e:201:268 [0] NCCL INFO init.cc:914 -> 2
0d2e39cf9b5e:201:268 [0] NCCL INFO init.cc:926 -> 2

0d2e39cf9b5e:201:284 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:0b/../../0000:0b:00.0
0d2e39cf9b5e:201:284 [0] NCCL INFO graph/xml.cc:469 -> 2
0d2e39cf9b5e:201:284 [0] NCCL INFO graph/xml.cc:660 -> 2
0d2e39cf9b5e:201:284 [0] NCCL INFO graph/topo.cc:522 -> 2
0d2e39cf9b5e:201:284 [0] NCCL INFO init.cc:627 -> 2
0d2e39cf9b5e:201:284 [0] NCCL INFO init.cc:878 -> 2
0d2e39cf9b5e:201:284 [0] NCCL INFO init.cc:914 -> 2
0d2e39cf9b5e:201:284 [0] NCCL INFO init.cc:926 -> 2

I attached the full stderr log above.
For the other questions, here is the output from the executor Docker container:

sh-5.0$ cd /sys

sh-5.0$ ls
block  bus  class  dev  devices  firmware  fs  kernel  module

sh-5.0$ ls -l /sys/class/net
total 0
-rw-r--r-- 1 root root 4096 Jun 20 08:19 bonding_masters
lrwxrwxrwx 1 root root    0 Jun 20 08:19 eth0 -> ../../devices/virtual/net/eth0
lrwxrwxrwx 1 root root    0 Jun 20 08:19 lo -> ../../devices/virtual/net/lo
lrwxrwxrwx 1 root root    0 Jun 20 08:19 sit0 -> ../../devices/virtual/net/sit0
lrwxrwxrwx 1 root root    0 Jun 20 08:19 tunl0 -> ../../devices/virtual/net/tunl0

sh-5.0$ ls -l /sys/class/net/eth0
lrwxrwxrwx 1 root root 0 Jun 20 08:19 /sys/class/net/eth0 -> ../../devices/virtual/net/eth0

sh-5.0$ ls -l /sys/class/net/eth0/device
ls: cannot access '/sys/class/net/eth0/device': No such file or directory

sh-5.0$ ls -l /sys/devices/virtual/net/eth0/
total 0
-r--r--r-- 1 root root 4096 Jun 20 08:20 addr_assign_type
-r--r--r-- 1 root root 4096 Jun 20 08:20 addr_len
-r--r--r-- 1 root root 4096 Jun 20 08:20 address
-r--r--r-- 1 root root 4096 Jun 20 08:20 broadcast
-rw-r--r-- 1 root root 4096 Jun 20 08:20 carrier
-r--r--r-- 1 root root 4096 Jun 20 08:20 carrier_changes
-r--r--r-- 1 root root 4096 Jun 20 08:20 carrier_down_count
-r--r--r-- 1 root root 4096 Jun 20 08:20 carrier_up_count
-r--r--r-- 1 root root 4096 Jun 20 08:20 dev_id
-r--r--r-- 1 root root 4096 Jun 20 08:20 dev_port
-r--r--r-- 1 root root 4096 Jun 20 08:20 dormant
-r--r--r-- 1 root root 4096 Jun 20 08:20 duplex
-rw-r--r-- 1 root root 4096 Jun 20 08:20 flags
-rw-r--r-- 1 root root 4096 Jun 20 08:20 gro_flush_timeout
-rw-r--r-- 1 root root 4096 Jun 20 08:20 ifalias
-r--r--r-- 1 root root 4096 Jun 20 08:20 ifindex
-r--r--r-- 1 root root 4096 Jun 20 08:20 iflink
-r--r--r-- 1 root root 4096 Jun 20 08:20 link_mode
-rw-r--r-- 1 root root 4096 Jun 20 08:20 mtu
-r--r--r-- 1 root root 4096 Jun 20 08:20 name_assign_type
-rw-r--r-- 1 root root 4096 Jun 20 08:20 napi_defer_hard_irqs
-rw-r--r-- 1 root root 4096 Jun 20 08:20 netdev_group
-r--r--r-- 1 root root 4096 Jun 20 08:20 operstate
-r--r--r-- 1 root root 4096 Jun 20 08:20 phys_port_id
-r--r--r-- 1 root root 4096 Jun 20 08:20 phys_port_name
-r--r--r-- 1 root root 4096 Jun 20 08:20 phys_switch_id
-rw-r--r-- 1 root root 4096 Jun 20 08:20 proto_down
drwxr-xr-x 4 root root    0 Jun 20 08:20 queues
-r--r--r-- 1 root root 4096 Jun 20 08:20 speed
drwxr-xr-x 2 root root    0 Jun 20 08:20 statistics
lrwxrwxrwx 1 root root    0 Jun 20 08:20 subsystem -> ../../../../class/net
-r--r--r-- 1 root root 4096 Jun 20 08:20 testing
-rw-r--r-- 1 root root 4096 Jun 20 08:20 tx_queue_len
-r--r--r-- 1 root root 4096 Jun 20 08:20 type
-rw-r--r-- 1 root root 4096 Jun 20 08:20 uevent

@sjeaugey

sjeaugey commented Jun 20, 2022

Thanks. I believe recent versions of NCCL would no longer fail and only perhaps print a warning in this case. The new code just ignores the topology detection when a NIC is virtual and attaches it to the first CPU.
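To see more detail on why topology detection fails before upgrading, NCCL's own debug logging can be turned up. A minimal sketch, assuming a Debian/Ubuntu-based executor container; in a Spark standalone setup the variables could be forwarded to executors via `spark.executorEnv.NCCL_DEBUG` and so on:

```shell
# Enable verbose NCCL logging for the next training run, focused on the
# topology/graph detection and init phases that fail in this thread
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=GRAPH,INIT

# Report the NCCL packages installed in the container, if dpkg is available
command -v dpkg >/dev/null 2>&1 && dpkg -l | grep -i nccl || true
```

This does not fix anything by itself; it only makes the failing run print the NCCL version and the exact step where topology detection gives up.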

@Dartya
Author

Dartya commented Jun 20, 2022

@sjeaugey Thank you. However, I'm not quite sure what I should do. Do I need to update NCCL?

@wbo4958
Contributor

wbo4958 commented Jun 21, 2022

@trivialfis, could you help with this by upgrading NCCL to the latest version?

@wbo4958
Contributor

wbo4958 commented Jun 21, 2022

@Dartya, could you watch the CI on #8015 and get the jars to test?

@sjeaugey

@sjeaugey Thank you. However, I'm not quite sure what I should do. Do I need to update NCCL?

Yes, sorry if that was unclear. Upgrading NCCL to 2.12 should hopefully fix the issue.
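For a reproducible image, the NCCL upgrade can be pinned to an explicit version in the executor Dockerfile rather than taking whatever apt resolves. A sketch; the 2.12.12-1+cuda11.7 version string is taken from the dpkg output reported later in this thread and would need adjusting for other CUDA releases:

```dockerfile
# Pin libnccl2/libnccl-dev to an explicit 2.12 build (version string is an
# assumption based on the dpkg output in this thread, for CUDA 11.7 images)
RUN apt-get update && \
    apt-get install -y --allow-change-held-packages \
        libnccl2=2.12.12-1+cuda11.7 \
        libnccl-dev=2.12.12-1+cuda11.7 && \
    rm -rf /var/lib/apt/lists/*
```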

@Dartya
Author

Dartya commented Jun 21, 2022

@wbo4958 Of course. I already looked, and so far I saw only the Dockerfile change. Do you mean that I need to test my jars against the updated NCCL? I will certainly check, and, as I said, if the scenario succeeds I am even ready to write an article with examples, in a separate repository or as a pull request.

I'll update the executor image and report back in a day.

@wbo4958
Contributor

wbo4958 commented Jun 21, 2022

@Dartya, yeah, you can cherry-pick the patch, build the xgboost jar locally, and test it in your environment.

After applying the patch #8015, you can build the xgboost jars locally by

cd xgboost
CI_DOCKER_EXTRA_PARAMS_INIT='--cpuset-cpus 0-3' tests/ci_build/ci_build.sh jvm_gpu_build nvidia-docker --build-arg CUDA_VERSION_ARG=11.0 tests/ci_build/build_jvm_packages.sh 3.0.1 -Duse.cuda=ON

@Dartya
Author

Dartya commented Jun 21, 2022

@wbo4958 oh, I see. I'll get back to you with an answer in a day.

@Dartya
Author

Dartya commented Jun 21, 2022

I updated the NVIDIA driver to 516.40; the CUDA version is now 11.7.0.
I updated the base image to FROM nvcr.io/nvidia/cuda:11.7.0-devel-ubuntu20.04.
The base image Dockerfile:

FROM nvcr.io/nvidia/cuda:11.7.0-devel-ubuntu20.04

ENV LANG='en_US.UTF-8' LANGUAGE='en_US:en' LC_ALL='en_US.UTF-8'

ARG DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt install -y bash tini libc6 libpam-modules libnss3 procps nano iputils-ping net-tools

RUN apt-get update && \
	apt-get install -y openjdk-8-jdk && \
	apt-get install -y ant && \
	apt-get clean && \
	rm -rf /var/lib/apt/lists/* && \
	rm -rf /var/cache/oracle-jdk8-installer;

# Fix certificate issues, found as of
# https://bugs.launchpad.net/ubuntu/+source/ca-certificates-java/+bug/983302
RUN apt-get update && \
	apt-get install -y ca-certificates-java && \
	apt-get clean && \
	update-ca-certificates -f && \
	rm -rf /var/lib/apt/lists/* && \
	rm -rf /var/cache/oracle-jdk8-installer;

# Setup JAVA_HOME, this is useful for docker commandline
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/
RUN export JAVA_HOME

CMD ["tail", "-f", "/dev/null"]

There is no NCCL in this image by default, which is fine.

Then I build a Docker image for the Spark executor based on the one auto-generated by Spark itself, passing the tag of the image from the previous step as a build argument.
Its Dockerfile is below:

ARG java_image_tag=localhost:5000/cuda-jdk8:v1
FROM ${java_image_tag}

ARG spark_uid=1001
ARG UID_GID=1001
ENV UID=${UID_GID}
ENV GID=${UID_GID}

ENV SPARK_RAPIDS_DIR=/opt/sparkRapidsPlugin
ENV SPARK_CUDF_JAR=${SPARK_RAPIDS_DIR}/cudf-22.04.0-cuda11.jar
ENV SPARK_RAPIDS_PLUGIN_JAR=${SPARK_RAPIDS_DIR}/rapids-4-spark_2.12-22.04.0.jar

RUN set -ex && \
    sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' /etc/apt/sources.list && \
    apt-get update && \
    ln -s /lib /lib64 && \
    apt install -y bash tini libc6 libpam-modules libnss3 procps nano iputils-ping net-tools \
    wget software-properties-common build-essential libnss3-dev zlib1g-dev libgdbm-dev libncurses5-dev \
    libssl-dev libffi-dev libreadline-dev libsqlite3-dev libbz2-dev python3 && \
    mkdir -p /opt/spark && \
    mkdir -p /opt/spark/examples && \
    mkdir -p /opt/spark/conf && \
    mkdir -p /opt/spark/work-dir && \
    mkdir -p /opt/sparkRapidsPlugin && \
    touch /opt/spark/RELEASE && \
    rm /bin/sh && \
    ln -sv /bin/bash /bin/sh && \
    echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
    chgrp root /etc/passwd && chmod ug+rw /etc/passwd && \
    rm -rf /var/cache/apt/*

COPY jars /opt/spark/jars
COPY rapids /opt/sparkRapidsPlugin
COPY bin /opt/spark/bin
COPY sbin /opt/spark/sbin
COPY conf /opt/spark/conf
COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/
COPY kubernetes/dockerfiles/spark/decom.sh /opt/
#COPY examples /opt/spark/examples
COPY kubernetes/tests /opt/spark/tests
COPY data /opt/spark/data
COPY datasets /opt/spark/

ENV SPARK_HOME /opt/spark

WORKDIR /opt/spark/work-dir
RUN chmod g+w /opt/spark/work-dir
RUN chmod a+x /opt/decom.sh

# USER
RUN groupadd --gid $UID appuser && useradd --uid $UID --gid appuser --shell /bin/bash --create-home appuser
RUN mkdir /var/logs && chown -R appuser:appuser /var/logs
RUN mkdir /opt/spark/logs && chown -R appuser:appuser /opt/spark/
RUN chown -R appuser:appuser /tmp

RUN ls -lah /home/appuser
RUN touch /home/appuser/.bashrc

RUN echo -e '\
export SPARK_HOME=/opt/spark\n\
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin\
' > /home/appuser/.bashrc

RUN chown -R appuser:appuser /home/appuser

EXPOSE 4040
EXPOSE 8081

RUN apt-get install libnccl2 libnccl-dev -y --allow-change-held-packages

# Specify the user that the actual main process will run as
USER ${spark_uid}

ENTRYPOINT [ "/opt/entrypoint.sh" ]

Note that the NVIDIA image does not include NCCL, so I install it myself. I then launch a container from the resulting image.
The output:

sh-5.0$ nvidia-smi
Tue Jun 21 19:22:50 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 516.40       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:0B:00.0  On |                  N/A |
| 29%   37C    P8    10W / 190W |    325MiB /  6144MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
sh-5.0$ dpkg -l | grep nccl
ii  libnccl-dev                     2.12.12-1+cuda11.7                amd64        NVIDIA Collective Communication Library (NCCL) Development Files
ii  libnccl2                        2.12.12-1+cuda11.7                amd64        NVIDIA Collective Communication Library (NCCL) Runtime

I run the application and the output is the same as before.
stderr:

Spark Executor Command: "/usr/lib/jvm/java-8-openjdk-amd64//bin/java" "-cp" "/opt/spark/conf/:/opt/spark/jars/*" "-Xmx4096M" "-Dspark.driver.port=10000" "-Dspark.ui.port=4040" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@192.168.0.125:10000" "--executor-id" "0" "--hostname" "172.17.0.5" "--cores" "4" "--app-id" "app-20220621221846-0062" "--worker-url" "spark://Worker@172.17.0.5:40891" "--resourcesFile" "/opt/spark/work/app-20220621221846-0062/0/resource-executor-3446868575476822123.json"
========================================

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/06/21 19:18:46 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 90@f0503ed3f584
22/06/21 19:18:46 INFO SignalUtils: Registering signal handler for TERM
22/06/21 19:18:46 INFO SignalUtils: Registering signal handler for HUP
22/06/21 19:18:46 INFO SignalUtils: Registering signal handler for INT
22/06/21 19:18:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/06/21 19:18:46 INFO SecurityManager: Changing view acls to: appuser,alexp
22/06/21 19:18:46 INFO SecurityManager: Changing modify acls to: appuser,alexp
22/06/21 19:18:46 INFO SecurityManager: Changing view acls groups to: 
22/06/21 19:18:46 INFO SecurityManager: Changing modify acls groups to: 
22/06/21 19:18:46 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(appuser, alexp); groups with view permissions: Set(); users  with modify permissions: Set(appuser, alexp); groups with modify permissions: Set()
22/06/21 19:18:46 INFO TransportClientFactory: Successfully created connection to /192.168.0.125:10000 after 64 ms (0 ms spent in bootstraps)
22/06/21 19:18:47 INFO SecurityManager: Changing view acls to: appuser,alexp
22/06/21 19:18:47 INFO SecurityManager: Changing modify acls to: appuser,alexp
22/06/21 19:18:47 INFO SecurityManager: Changing view acls groups to: 
22/06/21 19:18:47 INFO SecurityManager: Changing modify acls groups to: 
22/06/21 19:18:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(appuser, alexp); groups with view permissions: Set(); users  with modify permissions: Set(appuser, alexp); groups with modify permissions: Set()
22/06/21 19:18:47 INFO TransportClientFactory: Successfully created connection to /192.168.0.125:10000 after 4 ms (0 ms spent in bootstraps)
22/06/21 19:18:47 INFO DiskBlockManager: Created local directory at /tmp/spark-b5ae6d67-3ed1-42c6-a266-be6140b128f5/executor-09996af7-e1b6-4c8d-9673-17ddf2124b03/blockmgr-5b78eb15-e5b2-41e1-9f17-b416ceb9328a
22/06/21 19:18:47 INFO MemoryStore: MemoryStore started with capacity 2004.6 MiB
22/06/21 19:18:47 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@192.168.0.125:10000
22/06/21 19:18:47 INFO WorkerWatcher: Connecting to worker spark://Worker@172.17.0.5:40891
22/06/21 19:18:47 INFO TransportClientFactory: Successfully created connection to /172.17.0.5:40891 after 3 ms (0 ms spent in bootstraps)
22/06/21 19:18:47 INFO WorkerWatcher: Successfully connected to spark://Worker@172.17.0.5:40891
22/06/21 19:18:47 INFO ResourceUtils: ==============================================================
22/06/21 19:18:47 INFO ResourceUtils: Custom resources for spark.executor:
gpu -> [name: gpu, addresses: 0]
22/06/21 19:18:47 INFO ResourceUtils: ==============================================================
22/06/21 19:18:47 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
22/06/21 19:18:47 INFO Executor: Starting executor ID 0 on host 172.17.0.5
22/06/21 19:18:48 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 36263.
22/06/21 19:18:48 INFO NettyBlockTransferService: Server created on 172.17.0.5:36263
22/06/21 19:18:48 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/06/21 19:18:48 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(0, 172.17.0.5, 36263, None)
22/06/21 19:18:48 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(0, 172.17.0.5, 36263, None)
22/06/21 19:18:48 INFO BlockManager: Initialized BlockManager: BlockManagerId(0, 172.17.0.5, 36263, None)
22/06/21 19:18:48 INFO Executor: Fetching spark://192.168.0.125:10000/jars/xgboost4j-gpu_2.12-1.6.1.jar with timestamp 1655839122288
22/06/21 19:18:48 INFO TransportClientFactory: Successfully created connection to /192.168.0.125:10000 after 5 ms (0 ms spent in bootstraps)
22/06/21 19:18:48 INFO Utils: Fetching spark://192.168.0.125:10000/jars/xgboost4j-gpu_2.12-1.6.1.jar to /tmp/spark-b5ae6d67-3ed1-42c6-a266-be6140b128f5/executor-09996af7-e1b6-4c8d-9673-17ddf2124b03/spark-fff074da-07d3-4bd3-9e3f-66f428bb99c8/fetchFileTemp4307387157446435698.tmp
22/06/21 19:18:55 INFO Utils: Copying /tmp/spark-b5ae6d67-3ed1-42c6-a266-be6140b128f5/executor-09996af7-e1b6-4c8d-9673-17ddf2124b03/spark-fff074da-07d3-4bd3-9e3f-66f428bb99c8/-17922330801655839122288_cache to /opt/spark/work/app-20220621221846-0062/0/./xgboost4j-gpu_2.12-1.6.1.jar
22/06/21 19:18:58 INFO Executor: Adding file:/opt/spark/work/app-20220621221846-0062/0/./xgboost4j-gpu_2.12-1.6.1.jar to class loader
22/06/21 19:18:58 INFO Executor: Fetching spark://192.168.0.125:10000/jars/cudf-22.04.0-cuda11.jar with timestamp 1655839122288
22/06/21 19:18:58 INFO Utils: Fetching spark://192.168.0.125:10000/jars/cudf-22.04.0-cuda11.jar to /tmp/spark-b5ae6d67-3ed1-42c6-a266-be6140b128f5/executor-09996af7-e1b6-4c8d-9673-17ddf2124b03/spark-fff074da-07d3-4bd3-9e3f-66f428bb99c8/fetchFileTemp2215826725064333494.tmp
22/06/21 19:19:04 INFO Utils: Copying /tmp/spark-b5ae6d67-3ed1-42c6-a266-be6140b128f5/executor-09996af7-e1b6-4c8d-9673-17ddf2124b03/spark-fff074da-07d3-4bd3-9e3f-66f428bb99c8/8599044691655839122288_cache to /opt/spark/work/app-20220621221846-0062/0/./cudf-22.04.0-cuda11.jar
22/06/21 19:19:09 INFO Executor: Adding file:/opt/spark/work/app-20220621221846-0062/0/./cudf-22.04.0-cuda11.jar to class loader
22/06/21 19:19:09 INFO Executor: Fetching spark://192.168.0.125:10000/jars/xgboost4j-spark-gpu_2.12-1.6.1.jar with timestamp 1655839122288
22/06/21 19:19:09 INFO Utils: Fetching spark://192.168.0.125:10000/jars/xgboost4j-spark-gpu_2.12-1.6.1.jar to /tmp/spark-b5ae6d67-3ed1-42c6-a266-be6140b128f5/executor-09996af7-e1b6-4c8d-9673-17ddf2124b03/spark-fff074da-07d3-4bd3-9e3f-66f428bb99c8/fetchFileTemp2540881109552251512.tmp
22/06/21 19:19:09 INFO Utils: Copying /tmp/spark-b5ae6d67-3ed1-42c6-a266-be6140b128f5/executor-09996af7-e1b6-4c8d-9673-17ddf2124b03/spark-fff074da-07d3-4bd3-9e3f-66f428bb99c8/-4119855441655839122288_cache to /opt/spark/work/app-20220621221846-0062/0/./xgboost4j-spark-gpu_2.12-1.6.1.jar
22/06/21 19:19:09 INFO Executor: Adding file:/opt/spark/work/app-20220621221846-0062/0/./xgboost4j-spark-gpu_2.12-1.6.1.jar to class loader
22/06/21 19:19:09 INFO Executor: Fetching spark://192.168.0.125:10000/jars/rapids-4-spark_2.12-22.04.0.jar with timestamp 1655839122288
22/06/21 19:19:09 INFO Utils: Fetching spark://192.168.0.125:10000/jars/rapids-4-spark_2.12-22.04.0.jar to /tmp/spark-b5ae6d67-3ed1-42c6-a266-be6140b128f5/executor-09996af7-e1b6-4c8d-9673-17ddf2124b03/spark-fff074da-07d3-4bd3-9e3f-66f428bb99c8/fetchFileTemp7690899961961015458.tmp
22/06/21 19:19:09 INFO Utils: Copying /tmp/spark-b5ae6d67-3ed1-42c6-a266-be6140b128f5/executor-09996af7-e1b6-4c8d-9673-17ddf2124b03/spark-fff074da-07d3-4bd3-9e3f-66f428bb99c8/-16637798311655839122288_cache to /opt/spark/work/app-20220621221846-0062/0/./rapids-4-spark_2.12-22.04.0.jar
22/06/21 19:19:10 INFO Executor: Adding file:/opt/spark/work/app-20220621221846-0062/0/./rapids-4-spark_2.12-22.04.0.jar to class loader
22/06/21 19:19:10 INFO Executor: Fetching spark://192.168.0.125:10000/jars/spark-nlp_2.12-3.4.1.jar with timestamp 1655839122288
22/06/21 19:19:10 INFO Utils: Fetching spark://192.168.0.125:10000/jars/spark-nlp_2.12-3.4.1.jar to /tmp/spark-b5ae6d67-3ed1-42c6-a266-be6140b128f5/executor-09996af7-e1b6-4c8d-9673-17ddf2124b03/spark-fff074da-07d3-4bd3-9e3f-66f428bb99c8/fetchFileTemp670554508690637461.tmp
22/06/21 19:19:10 INFO Utils: Copying /tmp/spark-b5ae6d67-3ed1-42c6-a266-be6140b128f5/executor-09996af7-e1b6-4c8d-9673-17ddf2124b03/spark-fff074da-07d3-4bd3-9e3f-66f428bb99c8/-8455308651655839122288_cache to /opt/spark/work/app-20220621221846-0062/0/./spark-nlp_2.12-3.4.1.jar
22/06/21 19:19:11 INFO Executor: Adding file:/opt/spark/work/app-20220621221846-0062/0/./spark-nlp_2.12-3.4.1.jar to class loader
22/06/21 19:19:11 INFO Executor: Fetching spark://192.168.0.125:10000/jars/config-1.4.1.jar with timestamp 1655839122288
22/06/21 19:19:11 INFO Utils: Fetching spark://192.168.0.125:10000/jars/config-1.4.1.jar to /tmp/spark-b5ae6d67-3ed1-42c6-a266-be6140b128f5/executor-09996af7-e1b6-4c8d-9673-17ddf2124b03/spark-fff074da-07d3-4bd3-9e3f-66f428bb99c8/fetchFileTemp7945961993200441953.tmp
22/06/21 19:19:11 INFO Utils: Copying /tmp/spark-b5ae6d67-3ed1-42c6-a266-be6140b128f5/executor-09996af7-e1b6-4c8d-9673-17ddf2124b03/spark-fff074da-07d3-4bd3-9e3f-66f428bb99c8/-17661293231655839122288_cache to /opt/spark/work/app-20220621221846-0062/0/./config-1.4.1.jar
22/06/21 19:19:11 INFO Executor: Adding file:/opt/spark/work/app-20220621221846-0062/0/./config-1.4.1.jar to class loader
22/06/21 19:19:11 INFO Executor: Fetching spark://192.168.0.125:10000/jars/service.jar with timestamp 1655839122288
22/06/21 19:19:11 INFO Utils: Fetching spark://192.168.0.125:10000/jars/service.jar to /tmp/spark-b5ae6d67-3ed1-42c6-a266-be6140b128f5/executor-09996af7-e1b6-4c8d-9673-17ddf2124b03/spark-fff074da-07d3-4bd3-9e3f-66f428bb99c8/fetchFileTemp6315743495012679741.tmp
22/06/21 19:20:26 INFO Utils: Copying /tmp/spark-b5ae6d67-3ed1-42c6-a266-be6140b128f5/executor-09996af7-e1b6-4c8d-9673-17ddf2124b03/spark-fff074da-07d3-4bd3-9e3f-66f428bb99c8/-967860391655839122288_cache to /opt/spark/work/app-20220621221846-0062/0/./service.jar
22/06/21 19:20:53 INFO Executor: Adding file:/opt/spark/work/app-20220621221846-0062/0/./service.jar to class loader
22/06/21 19:20:53 INFO ShimLoader: Loading shim for Spark version: 3.2.1
22/06/21 19:20:53 INFO ShimLoader: Complete Spark build info: 3.2.1, https://github.com/apache/spark, HEAD, 4f25b3f71238a00508a356591553f2dfa89f8290, 2022-01-20T19:26:14Z
22/06/21 19:20:53 INFO ShimLoader: Forcing shim caller classloader update (default behavior). If it causes issues with userClassPathFirst, set spark.rapids.force.caller.classloader to false!
22/06/21 19:20:53 INFO ShimLoader: Falling back on ShimLoader caller's classloader org.apache.spark.util.MutableURLClassLoader@656d243f
22/06/21 19:20:53 INFO ShimLoader: Updating spark classloader org.apache.spark.util.MutableURLClassLoader@656d243f with the URLs: jar:file:/opt/spark/work/app-20220621221846-0062/0/./rapids-4-spark_2.12-22.04.0.jar!/spark3xx-common/, jar:file:/opt/spark/work/app-20220621221846-0062/0/./rapids-4-spark_2.12-22.04.0.jar!/spark321/
22/06/21 19:20:54 INFO ShimLoader: Spark classLoader org.apache.spark.util.MutableURLClassLoader@656d243f updated successfully
22/06/21 19:20:54 INFO RapidsPluginUtils: RAPIDS Accelerator build: {version=22.04.0, user=, url=https://github.com/NVIDIA/spark-rapids.git, date=2022-04-14T08:57:01Z, revision=0a6b5f4fb1aa2cc753725f81f395b18451b86433, cudf_version=22.04.0, branch=HEAD}
22/06/21 19:20:54 INFO RapidsPluginUtils: cudf build: {version=22.04.0, user=, date=2022-04-07T12:10:26Z, revision=8bf0520170bc4528bbf5896a950930e92f1dad7b, branch=HEAD}
22/06/21 19:20:54 WARN RapidsPluginUtils: RAPIDS Accelerator 22.04.0 using cudf 22.04.0.
22/06/21 19:20:54 INFO RapidsExecutorPlugin: RAPIDS Accelerator build: {version=22.04.0, user=, url=https://github.com/NVIDIA/spark-rapids.git, date=2022-04-14T08:57:01Z, revision=0a6b5f4fb1aa2cc753725f81f395b18451b86433, cudf_version=22.04.0, branch=HEAD}
22/06/21 19:20:54 INFO RapidsExecutorPlugin: cudf build: {version=22.04.0, user=, date=2022-04-07T12:10:26Z, revision=8bf0520170bc4528bbf5896a950930e92f1dad7b, branch=HEAD}
22/06/21 19:20:54 INFO RapidsExecutorPlugin: Initializing memory from Executor Plugin
22/06/21 19:21:02 WARN GpuDeviceManager: RMM pool is disabled since spark.rapids.memory.gpu.pooling.enabled is set to false; however, this configuration is deprecated and the behavior may change in a future release.
22/06/21 19:21:02 INFO GpuDeviceManager: Initializing RMM  pool size = 4837.6240234375 MB on gpuId 0
22/06/21 19:21:02 INFO GpuDeviceManager: Using per-thread default stream
22/06/21 19:21:02 INFO ShimDiskBlockManager: Created local directory at /tmp/spark-b5ae6d67-3ed1-42c6-a266-be6140b128f5/executor-09996af7-e1b6-4c8d-9673-17ddf2124b03/blockmgr-3fdd5b0a-b5ba-4778-a211-3c037a31b124
22/06/21 19:21:02 INFO RapidsBufferCatalog: Installing GPU memory handler for spill
22/06/21 19:21:02 INFO RapidsExecutorPlugin: The number of concurrent GPU tasks allowed is 1
22/06/21 19:21:02 INFO ExecutorPluginContainer: Initialized executor component for plugin com.nvidia.spark.SQLPlugin.
22/06/21 19:21:02 INFO CoarseGrainedExecutorBackend: Got assigned task 0
22/06/21 19:21:02 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
22/06/21 19:21:02 INFO TorrentBroadcast: Started reading broadcast variable 0 with 1 pieces (estimated total size 4.0 MiB)
22/06/21 19:21:02 INFO TransportClientFactory: Successfully created connection to /192.168.0.125:54078 after 3 ms (0 ms spent in bootstraps)
22/06/21 19:21:02 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 37.5 KiB, free 2004.6 MiB)
22/06/21 19:21:02 INFO TorrentBroadcast: Reading broadcast variable 0 took 92 ms
22/06/21 19:21:02 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 104.4 KiB, free 2004.5 MiB)
22/06/21 19:21:03 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2520 bytes result sent to driver
22/06/21 19:21:04 INFO CoarseGrainedExecutorBackend: Got assigned task 1
22/06/21 19:21:04 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
22/06/21 19:21:04 INFO TorrentBroadcast: Started reading broadcast variable 1 with 1 pieces (estimated total size 4.0 MiB)
22/06/21 19:21:04 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 37.5 KiB, free 2004.6 MiB)
22/06/21 19:21:04 INFO TorrentBroadcast: Reading broadcast variable 1 took 13 ms
22/06/21 19:21:04 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 104.4 KiB, free 2004.5 MiB)
22/06/21 19:21:04 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 2434 bytes result sent to driver
22/06/21 19:21:09 INFO CoarseGrainedExecutorBackend: Got assigned task 2
22/06/21 19:21:09 INFO Executor: Running task 0.0 in stage 2.0 (TID 2)
22/06/21 19:21:09 INFO TorrentBroadcast: Started reading broadcast variable 3 with 1 pieces (estimated total size 4.0 MiB)
22/06/21 19:21:09 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 11.0 KiB, free 2004.5 MiB)
22/06/21 19:21:09 INFO TorrentBroadcast: Reading broadcast variable 3 took 14 ms
22/06/21 19:21:09 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 22.9 KiB, free 2004.4 MiB)
22/06/21 19:21:09 INFO TorrentBroadcast: Started reading broadcast variable 2 with 1 pieces (estimated total size 4.0 MiB)
22/06/21 19:21:09 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 34.8 KiB, free 2004.4 MiB)
22/06/21 19:21:09 INFO TorrentBroadcast: Reading broadcast variable 2 took 16 ms
22/06/21 19:21:09 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 541.0 KiB, free 2003.9 MiB)
22/06/21 19:21:09 INFO GpuParquetMultiFilePartitionReaderFactory: Using the coalesce multi-file Parquet reader, files: file:///opt/spark/train/train.parquet task attemptid: 2
22/06/21 19:21:10 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 3165 bytes result sent to driver
22/06/21 19:21:10 INFO CoarseGrainedExecutorBackend: Got assigned task 3
22/06/21 19:21:10 INFO Executor: Running task 0.0 in stage 3.0 (TID 3)
22/06/21 19:21:10 INFO MapOutputTrackerWorker: Updating epoch to 1 and clearing cache
22/06/21 19:21:10 INFO TorrentBroadcast: Started reading broadcast variable 4 with 1 pieces (estimated total size 4.0 MiB)
22/06/21 19:21:10 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 7.5 KiB, free 2003.9 MiB)
22/06/21 19:21:10 INFO TorrentBroadcast: Reading broadcast variable 4 took 14 ms
22/06/21 19:21:10 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 14.1 KiB, free 2003.8 MiB)
22/06/21 19:21:10 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 0, fetching them
22/06/21 19:21:10 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@192.168.0.125:10000)
22/06/21 19:21:10 INFO MapOutputTrackerWorker: Got the map output locations
22/06/21 19:21:10 INFO ShuffleBlockFetcherIterator: Getting 1 (343.8 KiB) non-empty blocks including 1 (343.8 KiB) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
22/06/21 19:21:10 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 13 ms
[19:21:13] task 0 got new rank 0
22/06/21 19:21:13 INFO XGBoostSpark: Leveraging gpu device 0 to train
22/06/21 19:21:13 ERROR XGBoostSpark: XGBooster worker 0 has failed 0 times due to 
ml.dmlc.xgboost4j.java.XGBoostError: [19:21:13] /workspace/src/tree/updater_gpu_hist.cu:712: Exception in gpu_hist: [19:21:13] /workspace/src/common/device_helpers.cuh:132: NCCL failure :unhandled system error /workspace/src/common/device_helpers.cu(67)
Stack trace:
  [bt] (0) /tmp/libxgboost4j8022417957078819105.so(+0x584a3d) [0x7fa5209daa3d]
  [bt] (1) /tmp/libxgboost4j8022417957078819105.so(dh::ThrowOnNcclError(ncclResult_t, char const*, int)+0x2d9) [0x7fa5209dc739]
  [bt] (2) /tmp/libxgboost4j8022417957078819105.so(dh::AllReducer::Init(int)+0x8c8) [0x7fa5209db998]
  [bt] (3) /tmp/libxgboost4j8022417957078819105.so(xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::InitDataOnce(xgboost::DMatrix*)+0x127) [0x7fa520c8ae97]
  [bt] (4) /tmp/libxgboost4j8022417957078819105.so(xgboost::tree::GPUHistMaker::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&)+0x3b6) [0x7fa520c96ee6]
  [bt] (5) /tmp/libxgboost4j8022417957078819105.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >*)+0x7e3) [0x7fa5208167c3]
  [bt] (6) /tmp/libxgboost4j8022417957078819105.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::PredictionCacheEntry*)+0x317) [0x7fa520817367]
  [bt] (7) /tmp/libxgboost4j8022417957078819105.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x312) [0x7fa520853212]
  [bt] (8) /tmp/libxgboost4j8022417957078819105.so(XGBoosterUpdateOneIter+0x68) [0x7fa5206f6118]



Stack trace:
  [bt] (0) /tmp/libxgboost4j8022417957078819105.so(+0x81ff39) [0x7fa520c75f39]
  [bt] (1) /tmp/libxgboost4j8022417957078819105.so(xgboost::tree::GPUHistMaker::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&)+0x695) [0x7fa520c971c5]
  [bt] (2) /tmp/libxgboost4j8022417957078819105.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >*)+0x7e3) [0x7fa5208167c3]
  [bt] (3) /tmp/libxgboost4j8022417957078819105.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::PredictionCacheEntry*)+0x317) [0x7fa520817367]
  [bt] (4) /tmp/libxgboost4j8022417957078819105.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x312) [0x7fa520853212]
  [bt] (5) /tmp/libxgboost4j8022417957078819105.so(XGBoosterUpdateOneIter+0x68) [0x7fa5206f6118]
  [bt] (6) [0x7fa691017de7]


	at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
	at ml.dmlc.xgboost4j.java.Booster.update(Booster.java:172)
	at ml.dmlc.xgboost4j.java.XGBoost.trainAndSaveCheckpoint(XGBoost.java:217)
	at ml.dmlc.xgboost4j.java.XGBoost.train(XGBoost.java:304)
	at ml.dmlc.xgboost4j.scala.XGBoost$.$anonfun$trainAndSaveCheckpoint$5(XGBoost.scala:66)
	at scala.Option.getOrElse(Option.scala:189)
	at ml.dmlc.xgboost4j.scala.XGBoost$.trainAndSaveCheckpoint(XGBoost.scala:62)
	at ml.dmlc.xgboost4j.scala.XGBoost$.train(XGBoost.scala:106)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.buildDistributedBooster(XGBoost.scala:349)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$3(XGBoost.scala:426)
	at scala.Option.map(Option.scala:230)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$2(XGBoost.scala:424)
	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
22/06/21 19:21:13 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
ml.dmlc.xgboost4j.java.XGBoostError: [19:21:13] /workspace/src/tree/updater_gpu_hist.cu:712: Exception in gpu_hist: [19:21:13] /workspace/src/common/device_helpers.cuh:132: NCCL failure :unhandled system error /workspace/src/common/device_helpers.cu(67)
Stack trace:
  [bt] (0) /tmp/libxgboost4j8022417957078819105.so(+0x584a3d) [0x7fa5209daa3d]
  [bt] (1) /tmp/libxgboost4j8022417957078819105.so(dh::ThrowOnNcclError(ncclResult_t, char const*, int)+0x2d9) [0x7fa5209dc739]
  [bt] (2) /tmp/libxgboost4j8022417957078819105.so(dh::AllReducer::Init(int)+0x8c8) [0x7fa5209db998]
  [bt] (3) /tmp/libxgboost4j8022417957078819105.so(xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::InitDataOnce(xgboost::DMatrix*)+0x127) [0x7fa520c8ae97]
  [bt] (4) /tmp/libxgboost4j8022417957078819105.so(xgboost::tree::GPUHistMaker::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&)+0x3b6) [0x7fa520c96ee6]
  [bt] (5) /tmp/libxgboost4j8022417957078819105.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >*)+0x7e3) [0x7fa5208167c3]
  [bt] (6) /tmp/libxgboost4j8022417957078819105.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::PredictionCacheEntry*)+0x317) [0x7fa520817367]
  [bt] (7) /tmp/libxgboost4j8022417957078819105.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x312) [0x7fa520853212]
  [bt] (8) /tmp/libxgboost4j8022417957078819105.so(XGBoosterUpdateOneIter+0x68) [0x7fa5206f6118]



Stack trace:
  [bt] (0) /tmp/libxgboost4j8022417957078819105.so(+0x81ff39) [0x7fa520c75f39]
  [bt] (1) /tmp/libxgboost4j8022417957078819105.so(xgboost::tree::GPUHistMaker::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&)+0x695) [0x7fa520c971c5]
  [bt] (2) /tmp/libxgboost4j8022417957078819105.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >*)+0x7e3) [0x7fa5208167c3]
  [bt] (3) /tmp/libxgboost4j8022417957078819105.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::PredictionCacheEntry*)+0x317) [0x7fa520817367]
  [bt] (4) /tmp/libxgboost4j8022417957078819105.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x312) [0x7fa520853212]
  [bt] (5) /tmp/libxgboost4j8022417957078819105.so(XGBoosterUpdateOneIter+0x68) [0x7fa5206f6118]
  [bt] (6) [0x7fa691017de7]


	at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
	at ml.dmlc.xgboost4j.java.Booster.update(Booster.java:172)
	at ml.dmlc.xgboost4j.java.XGBoost.trainAndSaveCheckpoint(XGBoost.java:217)
	at ml.dmlc.xgboost4j.java.XGBoost.train(XGBoost.java:304)
	at ml.dmlc.xgboost4j.scala.XGBoost$.$anonfun$trainAndSaveCheckpoint$5(XGBoost.scala:66)
	at scala.Option.getOrElse(Option.scala:189)
	at ml.dmlc.xgboost4j.scala.XGBoost$.trainAndSaveCheckpoint(XGBoost.scala:62)
	at ml.dmlc.xgboost4j.scala.XGBoost$.train(XGBoost.scala:106)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.buildDistributedBooster(XGBoost.scala:349)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$3(XGBoost.scala:426)
	at scala.Option.map(Option.scala:230)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$2(XGBoost.scala:424)
	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

stdout:

f0503ed3f584:90:159 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.5<0>
f0503ed3f584:90:159 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

f0503ed3f584:90:159 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
f0503ed3f584:90:159 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.5<0>
f0503ed3f584:90:159 [0] NCCL INFO Using network Socket
NCCL version 2.8.3+cuda11.0

f0503ed3f584:90:159 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:0b/../../0000:0b:00.0
f0503ed3f584:90:159 [0] NCCL INFO graph/xml.cc:469 -> 2
f0503ed3f584:90:159 [0] NCCL INFO graph/xml.cc:660 -> 2
f0503ed3f584:90:159 [0] NCCL INFO graph/topo.cc:522 -> 2
f0503ed3f584:90:159 [0] NCCL INFO init.cc:627 -> 2
f0503ed3f584:90:159 [0] NCCL INFO init.cc:878 -> 2
f0503ed3f584:90:159 [0] NCCL INFO init.cc:914 -> 2
f0503ed3f584:90:159 [0] NCCL INFO init.cc:926 -> 2

I don't know what else to do.

I'll try Spark ML with a random forest model instead.

Maybe it will work in future versions, or maybe my configuration (Windows 10 Pro with WSL2) is the problem and native Linux is needed to run this logic in Docker containers.
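For what it's worth, the NCCL output above fails while probing the PCI topology (Could not find real path of /sys/class/pci_bus/...), and WSL2 does not expose the full /sys PCI tree. NCCL has documented environment variables that skip parts of that probing; exporting them on the executors is a hedged experiment for this setup, not a verified fix:

```shell
# Real NCCL environment variables; whether they help under WSL2 is an
# assumption to test, not a confirmed workaround for this issue.
export NCCL_DEBUG=INFO        # verbose NCCL diagnostics
export NCCL_IB_DISABLE=1      # skip InfiniBand (see the libibverbs warning above)
export NCCL_P2P_DISABLE=1     # disable peer-to-peer transport probing
export NCCL_SHM_DISABLE=1     # fall back to the plain socket transport
```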

BTW thanks @wbo4958 @sjeaugey

@wbo4958

wbo4958 commented Jun 22, 2022

@Dartya, it looks like the xgboost jar is still using NCCL version 2.8.3+cuda11.0? Did you compile the xgboost jar yourself? XGBoost statically links the NCCL library, so you don't need to install it manually.

So please try compiling the newest jar from the master branch:

git clone git@github.com:dmlc/xgboost.git 
cd xgboost 
git submodule init
git submodule update
CI_DOCKER_EXTRA_PARAMS_INIT='--cpuset-cpus 0-3' tests/ci_build/ci_build.sh jvm_gpu_build nvidia-docker --build-arg CUDA_VERSION_ARG=11.0 tests/ci_build/build_jvm_packages.sh 3.0.1 -Duse.cuda=ON

Note: please compile xgboost on a machine with a GPU installed.

@wbo4958

wbo4958 commented Jun 22, 2022

@trivialfis do we have the snapshot jars?

@Dartya

Dartya commented Jun 22, 2022

The three of us built the image on Windows: the CTO, a Java/DevOps team lead, and a senior Java developer. The build would not run until we added echo debugging to the sh files and launched it under WSL2; that took 3 hours.

Then mvn package crashed when accessing symlinks:

[INFO]
[INFO] --- maven-resources-plugin:3.1.0:resources (default-resources) @ xgboost4j-gpu_2.12 ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 3 resources
[INFO]
[INFO] --- scala-maven-plugin:3.2.2:compile (default) @ xgboost4j-gpu_2.12 ---
[WARNING]  Expected all dependencies to require Scala version: 2.12.8
[WARNING]  com.typesafe.akka:akka-actor_2.12:2.5.23 requires scala version: 2.12.8
[WARNING]  org.scala-lang.modules:scala-java8-compat_2.12:0.8.0 requires scala version: 2.12.0
[WARNING] Multiple versions of scala libraries detected!
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for XGBoost JVM Package 2.0.0-SNAPSHOT:
[INFO]
[INFO] XGBoost JVM Package ................................ SUCCESS [ 50.797 s]
[INFO] xgboost4j-gpu_2.12 ................................. FAILURE [16:56 min]
[INFO] xgboost4j-spark-gpu_2.12 ........................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  17:57 min
[INFO] Finished at: 2022-06-22T14:15:31Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (default) on project xgboost4j-gpu_2.12: Execution default of goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed: basedir /workspace/jvm-packages/xgboost4j-gpu/src/main/scala is not a directory -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :xgboost4j-gpu_2.12

We renamed the symlinked files and copied the directories; now we are waiting another 15 minutes.
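Rather than renaming symlinked files by hand, the links can be dereferenced in one pass with cp's -L flag, which copies the link targets as real files. This is a sketch of the technique on a toy layout; whether it is sufficient for the actual xgboost jvm-packages source tree is an assumption:

```shell
# Build a toy tree with a symlinked directory, then copy it with -L so the
# copy contains real files instead of links.
mkdir -p jvm-demo/real/src
echo 'object A' > jvm-demo/real/src/A.scala
ln -s real jvm-demo/link               # symlink, like the jvm-packages layout
cp -rL jvm-demo/link jvm-demo/copy     # -L dereferences symlinks while copying
test -f jvm-demo/copy/src/A.scala && echo ok
```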

Do you have a public repo with snapshots?

@Dartya

Dartya commented Jun 22, 2022

We tried to use https://xgboost.readthedocs.io/en/latest/install.html#id9, but the repo was not found.

Update: it seems AWS has banned KZ and RU IPs -_-

@Dartya

Dartya commented Jun 22, 2022

We downloaded the 3 latest jars, compiled today, from the snapshot Maven repo https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/list.html:
xgboost4j_2.12-2.0.0-SNAPSHOT.jar
xgboost4j-gpu_2.12-2.0.0-SNAPSHOT.jar
xgboost4j-spark-gpu_2.12-2.0.0-SNAPSHOT.jar

But training still failed with:

/workspace/jvm-packages/xgboost4j-gpu/src/native/../../../../src/common/common.h:239: XGBoost version not compiled with GPU support.
Stack trace:
  [bt] (0) /tmp/libxgboost4j7293428402978184512.so(+0x118b5d) [0x7f3dc1582b5d]
  [bt] (1) /tmp/libxgboost4j7293428402978184512.so(XGDeviceQuantileDMatrixCreateFromCallbackImpl+0x3f) [0x7f3dc1582cff]
  [bt] (2) [0x7f3e89017de7]

@wbo4958

wbo4958 commented Jun 23, 2022

@Dartya, please delete xgboost4j_2.12-2.0.0-SNAPSHOT.jar, use only the two jars below, and retry:

xgboost4j-gpu_2.12-2.0.0-SNAPSHOT.jar
xgboost4j-spark-gpu_2.12-2.0.0-SNAPSHOT.jar

@Dartya

Dartya commented Jun 23, 2022

Dear @wbo4958!
Thank you very much for your invaluable help! Everything works now: the driver launches in a container, the job runs in the Spark executor in a container, and the training runs on the GPU.

I want to write all of this up in an article with sample code and a repo. I will attach the link here, and if needed I can make a pull request to the examples repository.
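For anyone reproducing this, the working launch roughly corresponds to a spark-submit of the following shape. This is a sketch, not the exact command: the master URL and jar names are taken from the executor logs above, while the class name com.example.TrainJob and the conf values are placeholder assumptions (the conf keys themselves are standard Spark/RAPIDS options).

```shell
# Sketch of the launch shape; note the CPU xgboost4j jar is deliberately
# absent from --jars, per the advice above.
spark-submit \
  --master spark://192.168.0.125:10000 \
  --jars cudf-22.04.0-cuda11.jar,rapids-4-spark_2.12-22.04.0.jar,\
xgboost4j-gpu_2.12-2.0.0-SNAPSHOT.jar,xgboost4j-spark-gpu_2.12-2.0.0-SNAPSHOT.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=1 \
  --class com.example.TrainJob \
  service.jar
```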

@wbo4958

wbo4958 commented Jun 24, 2022

@Dartya, glad to see you got it working. Looking forward to your article and PR.

@Dartya

Dartya commented Jul 29, 2022

Hi @wbo4958!
Here is my article.
If the information seems useful for publication, tell me where to make a pull request.

@wbo4958

wbo4958 commented Aug 1, 2022

Thx @Dartya, really amazing article.
