XGBoostRegressor with GPU on the Spark+Rapids JDK8 #7994
Comments
@wbo4958 Could you please take a look when time allows? |
Hi @Dartya, the sample is a little bit old. Please use "setFeaturesCol" instead of "setFeaturesCols". There is also a new sample at https://github.com/NVIDIA/spark-rapids-examples/blob/branch-22.06/examples/XGBoost-Examples/taxi/scala/src/com/nvidia/spark/examples/taxi/Main.scala; please check it out and try it. The performance is impressive. |
@wbo4958 Hi! |
@Dartya Sorry for that. Can you change

Dataset<Row> ntdf = transform(tdf, featureColumns, labelName);
Dataset<Row> nedf = transform(edf, featureColumns, labelName);
XGBoostRegressor regressor = new XGBoostRegressor(paramMap)
    .setLabelCol(labelName)
    .setFeaturesCol("features");
PredictionModel<Vector, XGBoostRegressionModel> model = regressor.fit(ntdf);
Dataset<Row> result = model.transform(nedf);

to

// Dataset<Row> ntdf = transform(tdf, featureColumns, labelName);
// Dataset<Row> nedf = transform(edf, featureColumns, labelName);
XGBoostRegressor regressor = new XGBoostRegressor(paramMap)
    .setLabelCol(labelName)
    .setFeaturesCol(featureColumns);
PredictionModel<Vector, XGBoostRegressionModel> model = regressor.fit(tdf);
Dataset<Row> result = model.transform(edf); |
There is no need to do the vectorization; you only need to specify the feature column names. |
@wbo4958, thanks for the tip! I no longer do data type transformation; from the very beginning I now read all the data as float. Apparently the data began to be loaded into GPU memory, since the regressor parameters were written to the log and... new errors appeared. I identified the cause as follows: a Google search hasn't turned up any results yet, but I'm continuing. Just in case, here are the resulting code and output. I also want to say that it is now important for me to make the example work correctly; this affects the choice of technology for my project, whose central technical part is a distributed training system across several nodes. I probably lack in-depth knowledge of the libraries and APIs, but use cases are how you get a feel for a technology with minimal knowledge. My apologies for the confusion and possibly stupid questions. Code:
Logs:
Perhaps the following information is relevant:
|
@Dartya, could you share the Spark configuration and which RAPIDS Accelerator you were using? To be honest, I have not tested it with Java code, but we can make it work with Scala code without any issue. More details can be found at https://github.com/NVIDIA/spark-rapids-examples/tree/branch-22.06/examples/XGBoost-Examples |
Hi @Dartya, I just tried the Java way; it still works. First download the taxi dataset and decompress it somewhere, then download the latest cudf and rapids jars. I chose the parquet format and changed some code to read parquet; here is the code:

import org.apache.spark.ml.PredictionModel;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel;
import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor;

public static void testRegression() {
    // train
    String trainPath = "YOUR_PATH/taxi-small/taxi/parquet/train";
    // test
    String evalPath = "YOUR_PATH/taxi-small/taxi/parquet/eval";
    SparkSession session = SparkSession.builder()
            .master("local[*]")
            .getOrCreate();
    Dataset<Row> tdf = session.read().parquet(trainPath);
    tdf.show();
    Dataset<Row> edf = session.read().parquet(evalPath);
    edf.show();
    String labelName = "fare_amount";
    String[] featureColumns = {"passenger_count", "trip_distance", "pickup_longitude", "pickup_latitude", "rate_code",
            "dropoff_longitude", "dropoff_latitude", "hour", "day_of_week", "is_weekend"};
    scala.collection.immutable.Map map = new scala.collection.immutable.HashMap<String, Object>();
    map = map.updated("learning_rate", 0.05);
    map = map.updated("max_depth", 8);
    map = map.updated("subsample", 0.8);
    map = map.updated("gamma", 1);
    map = map.updated("num_round", 500);
    map = map.updated("tree_method", "gpu_hist");
    map = map.updated("num_workers", 1);
    XGBoostRegressor regressor = new XGBoostRegressor(map);
    regressor.setLabelCol(labelName);
    regressor.setFeaturesCol(featureColumns);
    PredictionModel<Vector, XGBoostRegressionModel> model = regressor.fit(tdf);
    Dataset<Row> result = model.transform(edf);
    result.show();
}

and then submit the xgboost application with the commands below:

#!/bin/bash
spark-submit \
  --master local[12] \
  --conf spark.rapids.memory.gpu.pooling.enabled=false \
  --conf spark.rapids.memory.gpu.minAllocFraction=0.0001 \
  --conf spark.rapids.memory.gpu.reserve=20 \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.sql.adaptive.enabled=false \
  --conf spark.rapids.sql.explain=ALL \
  --conf spark.rapids.sql.hasNans=false \
  --conf spark.executor.cores=12 \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.task.cpus=12 \
  --executor-memory 30G \
  --driver-memory 5G \
  --jars xgboost4j-spark-gpu_2.12-1.6.1.jar,xgboost4j-gpu_2.12-1.6.1.jar,cudf-22.04.0-cuda11.jar,rapids-4-spark_2.12-22.04.0.jar \
  --class RegressionMain \
  me.bobby.xgboost-1.0-SNAPSHOT.jar |
Hi @wbo4958 ! My previous spark config is below:
I saw that our configs are different and added the jar files: |
@Dartya Yeah, it looks like you're using standalone mode, and then xgboost requires GPU scheduling configuration on both the driver and the worker.

For the worker, please get getGpusResources.sh and add the configuration below to ${SPARK_HOME}/conf/spark-defaults.conf:

spark.worker.resource.gpu.amount 1
spark.worker.resource.gpu.discoveryScript YOUR_PATH/getGpusResources.sh

For the driver, please add the configuration below:

--conf spark.executor.resource.gpu.amount=1
--conf spark.task.resource.gpu.amount=1

So I would suggest you run locally first, and then try it standalone. |
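[Editor's note] Put together, a standalone-mode submit might look like the sketch below. This is a hypothetical adaptation of the local-mode script earlier in the thread: the master URL is a placeholder, and the conf keys and jar names are only the ones already shown in this discussion.

```shell
#!/bin/bash
# Hypothetical standalone-mode variant of the earlier local-mode submit script.
# YOUR_MASTER_HOST and the jar paths are placeholders.
spark-submit \
  --master spark://YOUR_MASTER_HOST:7077 \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=1 \
  --jars xgboost4j-spark-gpu_2.12-1.6.1.jar,xgboost4j-gpu_2.12-1.6.1.jar,cudf-22.04.0-cuda11.jar,rapids-4-spark_2.12-22.04.0.jar \
  --class RegressionMain \
  me.bobby.xgboost-1.0-SNAPSHOT.jar
```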
Hi @wbo4958! Since the target scheme is to run Spark in Docker/Kubernetes, I used a test Spark master from my local virtual-machine cluster. Also, since the ML logic should be part of a Spring web service, and I have experience running services in this form, I decided to do the tests in this configuration right away. I'm running my Spring app with XGBoost either in the IntelliJ IDEA debugger or in a Docker container. The next step in my tests is to run the training job distributed across two machines in Docker containers, which matches the target scheme. Of course, following the NVIDIA example and the RAPIDS documentation, I used the getGpusResources.sh GPU resource discovery script, placed it in my Docker image, and applied the following setting in spark-defaults.conf:
and in spark-env.sh:
I debugged with the following example:
I am very grateful for the Java code example, for pointing out the use of parquet files, and for pointing out the configuration, and I report the following:
when reading both parquet and csv files, you need to set the following configurations to correctly convert float to string:
and then the following error output appears:
Just in case, here is the output of the command in the Spark executor Docker container:
NVIDIA driver version 512.95. @wbo4958 I want to thank you again for helping; given a successful outcome of the experiments, I can contribute all my results, together with the CI/CD pipeline, as a pull request and/or an article with an example of launching it in a Spring application. Now I'm off to deal with the last exception; I saw a couple of links on Google. |
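[Editor's note] For context on the getGpusResources.sh discovery script mentioned above: Spark expects such a script to print a small JSON document listing the GPU addresses. The sketch below shows only that JSON formatting; `format_gpu_json` is a made-up helper name, and in a real script its input would come from `nvidia-smi --query-gpu=index --format=csv,noheader` rather than a literal list.

```shell
# format_gpu_json: turn a newline-separated list of GPU indices into the
# JSON payload Spark expects from a resource discovery script.
format_gpu_json() {
  local addrs
  # join the lines with "," so they become separate JSON string elements
  addrs=$(printf '%s' "$1" | sed -e ':a;N;$!ba;s/\n/","/g')
  printf '{"name":"gpu","addresses":["%s"]}\n' "$addrs"
}

# Example with two GPUs, indices 0 and 1:
format_gpu_json "$(printf '0\n1')"
# prints {"name":"gpu","addresses":["0","1"]}
```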
I nevertheless tried to run Spark 3.1.3 on a local cluster (3.2.1 does not run under Windows; see the stackoverflow link). I am getting this error in the stacktrace:
This is some kind of catastrophe and universal pain. Still, I will continue the experiments with a standalone cluster on a virtual machine and Docker. I'll keep you informed. |
I checked all the configs and dependencies again and removed the dependency that was not needed:
And after the job started, I saw in the stacktrace that an installed Python was required. I installed python and python3 in the executor and driver images. I made one attempt with the parquet files and your example, and two attempts with my example based on a scheme with CSV file column mappings in Double and Float. The results are the same; a stacktrace of the following form is returned:
I paid attention to the discussion here, but a check shows that nvidia-smi is working in my container, and based on the RAPIDS example of counting two Long datasets, I can say for sure that tasks on the GPU in the container are working. I also checked neural-network training based on the DJL library: in the Spark executor container, the model trains on the PyTorch engine. I saw that the root of the problem could be a mismatch between the XGBoost and NCCL versions (see here), and this is my only remaining hypothesis for why the example throws an exception. So far, I don't know where else to dig. I will try to complete my task without XGBoost, using the random forest in the Spark ML API. |
Please ignore "deserialization error:"; that is printed by the RAPIDS accelerator. It means some Spark physical plans can't run on the GPU, but for the xgboost case the whole ETL pipeline will run on the GPU, so don't worry about it. As for the NCCL issue, please add the configuration below and check the "stdout" log on the executor side: --conf spark.executorEnv.NCCL_DEBUG=INFO |
Hi @wbo4958, I set spark.executorEnv.NCCL_DEBUG=INFO in the driver and did not see a difference, so I stopped the executor and put it in conf/spark-env.sh. I still did not see a difference in the logs, so I stopped the executor again and set it additionally in conf/spark-env.sh. I also set an NCCL_DEBUG environment variable in the executor's Docker container. I did not see any difference or additional information. Here are the executor log and some screenshots
|
@wbo4958 I'm sorry. Out of habit I looked at the stderr log and did not pay attention to stdout. Here is its output:
|
@Dartya Could you share the "stdout" logs of all executors? |
The question below is from the NCCL team: is /sys present on the system?
If so, what is the output of ls -l /sys/class/net, ls -l /sys/class/net/eth0, and ls -l /sys/class/net/eth0/device? It seems NCCL is trying to find where eth0 is attached but fails to do so. |
@wbo4958, I have only one executor, and its stdout contains only this stdout log.
I attached the full stderr log above.
|
Thanks. I believe recent versions of NCCL no longer fail and perhaps only print a warning in this case. The new code simply skips topology detection when a NIC is virtual and attaches it to the first CPU. |
@sjeaugey Thank you. However, I did not quite understand what I should do. Do I need to update NCCL? |
@trivialfis, would you help on this by upgrading the NCCL to the latest version? |
Yes, sorry if that was unclear. Upgrading NCCL to 2.12 should hopefully fix the issue. |
@wbo4958 Of course. I already looked and saw only the Dockerfile. Do you mean that I need to check my jars against the updated NCCL? Of course I will check, and, as I said, if the scenario succeeds I am even ready to write an article, with a separate repository of examples, or a pull request. I'll update the executor image and be back in a day. |
@Dartya, yeah, you can cherry-pick the patch, build the xgboost jar locally, and test it in your environment. After applying patch #8015, you can build the xgboost jars locally with:

cd xgboost
CI_DOCKER_EXTRA_PARAMS_INIT='--cpuset-cpus 0-3' tests/ci_build/ci_build.sh jvm_gpu_build nvidia-docker --build-arg CUDA_VERSION_ARG=11.0 tests/ci_build/build_jvm_packages.sh 3.0.1 -Duse.cuda=ON
|
@wbo4958 Oh, I see. I'll be back with an answer in a day. |
I updated the NVIDIA driver to 516.40, and my CUDA version is 11.7.0.
There is no nccl in the image by default, OK. So I build a Docker image of the Spark executor based on the one auto-generated by Spark itself; as an image argument, I pass the tag of the image obtained in the previous step.
Note that the NVIDIA image has no nccl, so I install it myself. I launch the container from the resulting image.
I run the application and the output is the same.
stdout:
I don't know what else to do. I'll try Spark ML with a random forest model. Maybe it will work in future versions, or maybe my configuration with Windows 10 Pro and WSL2 is not good and native Linux is needed to run the logic in Docker containers. |
@Dartya, looks like the xgboost jar is still using the old NCCL. So please try to compile the newest jar from the master branch:

git clone git@github.com:dmlc/xgboost.git
cd xgboost
git submodule init
git submodule update
CI_DOCKER_EXTRA_PARAMS_INIT='--cpuset-cpus 0-3' tests/ci_build/ci_build.sh jvm_gpu_build nvidia-docker --build-arg CUDA_VERSION_ARG=11.0 tests/ci_build/build_jvm_packages.sh 3.0.1 -Duse.cuda=ON

Note: please compile xgboost on a machine with a GPU installed.
@trivialfis do we have the snapshot jars? |
The three of us assembled the image on Windows: CTO, Java/DevOps team lead, Java senior. The build did not go until we put echo debugging in the sh files and launched it under WSL2. Three hours gone. Then the maven package crashed when accessing symlinks.
We renamed the files with symlinks and copied the directories; we are waiting another 15 minutes. Maybe you have a public repo with snapshots? |
We tried to use https://xgboost.readthedocs.io/en/latest/install.html#id9, but the repo was not found. Update: it seems AWS has banned KZ and RU IPs -_- |
We've downloaded the 3 latest jars, compiled today, from the snapshots maven repo https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/list.html:
|
@Dartya, please delete xgboost4j_2.12-2.0.0-SNAPSHOT.jar, use just the two jars below, and retry
|
Dear @wbo4958! I want to put all this into an article with sample code and a repo; I will attach the link here, and if necessary I can make a pull request to the examples repository. |
@Dartya Glad to see you got it working. Expecting your article and PR. |
Thx @Dartya, really amazing article. |
I am trying to implement an example using Java JDK8. The example says that when using the GPU, you must set the value of featuresCols using the setFeaturesCols() method:

val xgbRegressor = new XGBoostRegressor(xgbParamFinal)
    .setLabelCol(labelName)
    .setFeaturesCols(featureNames)

The problem is that there is no such method setFeaturesCols(); only .setFeaturesCol("features") and .ml$dmlc$xgboost4j$scala$spark$params$HasFeaturesCols$_setter_$featuresCols_$eq() are available. If this parameter is not specified, the method ends with an exception:

The problem is exacerbated by the fact that I can't find a working example of converting an array of strings to an object of class StringArrayParam. I tried to implement the transformers from the example, but this ends in an error too (I didn't save the stacktrace).

My questions:
Here is my code:
dependencies of pom.xml: