
[jvm-packages] XGBoost4J-Spark 2.0.0-RC1 fails for Spark 3.4.0 on EMR #9512

Open · nawidsayed opened this issue Aug 22, 2023 · 18 comments

nawidsayed commented Aug 22, 2023

Hello everybody,

I am trying to implement XGBoost4J-Spark in a Scala project. Everything works fine locally (on an Intel MacBook); however, when deploying to EMR (running EMR 6.12.0 and Spark 3.4.0 with Scala 2.12.17), I receive the following error:

java.lang.NoClassDefFoundError: Could not initialize class ml.dmlc.xgboost4j.java.XGBoostJNI

For my build.sbt I added the following to libraryDependencies, as suggested by the tutorial (running with sbt 1.9.2):

libraryDependencies ++= Seq(
  "ml.dmlc" %% "xgboost4j" % "1.7.6",
  "ml.dmlc" %% "xgboost4j-spark" % "1.7.6"
)

I packaged everything into a single JAR via the sbt-assembly plugin. I believe this packs all the dependencies needed to run the Spark application on EMR into the JAR, so I am really out of ideas about this error. Not sure if this is an error on my end or an actual bug. Help is appreciated!

hcho3 (Collaborator) commented Aug 22, 2023

Can you install XGBoost4J-Spark from Maven Central? Locally building JARs is more complex, as you might have issues with bundling the native library (libxgboost4j.so).
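
If you do go the uber-JAR route, a common pitfall is an assembly merge strategy that drops the bundled native library. A minimal sbt-assembly sketch, assuming the xgboost4j JAR ships libxgboost4j.so under a lib/ path (the exact JAR layout and plugin syntax depend on your versions, so treat this as a starting point rather than official guidance):

// build.sbt — hedged sketch, not official XGBoost guidance
ThisBuild / assemblyMergeStrategy := {
  case PathList("lib", _*)      => MergeStrategy.first   // keep bundled native libraries
  case PathList("META-INF", _*) => MergeStrategy.discard // drop JAR signatures and manifests
  case _                        => MergeStrategy.first
}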

nawidsayed (Author) commented Aug 22, 2023

I assume you mean to build the packages from source?

I tried that on the master node of the EMR cluster, but ran into errors. These are the steps I ran:

Installing Maven:

wget https://dlcdn.apache.org/maven/maven-3/3.9.4/binaries/apache-maven-3.9.4-bin.tar.gz
tar xzvf apache-maven-3.9.4-bin.tar.gz
export PATH=$PATH:/home/hadoop/apache-maven-3.9.4/bin

Cloning the repo, switching to 1.7.6, and packaging it up (following the steps in the tutorial):

git clone --recursive https://github.com/dmlc/xgboost

cd xgboost
git checkout 36eb41c
cd jvm-packages
mvn package

This ended with the following error:

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for XGBoost JVM Package 1.7.6:
[INFO]
[INFO] XGBoost JVM Package ................................ SUCCESS [  2.298 s]
[INFO] xgboost4j_2.12 ..................................... FAILURE [  1.973 s]
[INFO] xgboost4j-spark_2.12 ............................... SKIPPED
[INFO] xgboost4j-flink_2.12 ............................... SKIPPED
[INFO] xgboost4j-example_2.12 ............................. SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  4.651 s
[INFO] Finished at: 2023-08-22T23:09:24Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.6.0:exec (native) on project xgboost4j_2.12: Command execution failed.: Process exited with an error: 1 (Exit value: 1) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <args> -rf :xgboost4j_2.12

which seems to be caused by this:

  File "create_jni.py", line 125
    run(f'"{sys.executable}" mapfeat.py')
                                       ^
SyntaxError: invalid syntax
[ERROR] Command execution failed.

Similarly, I tried building it by running ./xgboost/jvm-packages/dev/build-linux.sh, as suggested by the README in jvm-packages. This too fails somewhere down the line with:

docker: Error response from daemon: pull access denied for dmlc/xgboost4j-build, repository does not exist or may require 'docker login'

I feel like I am well off the beaten path here and am probably missing something quite obvious...

hcho3 (Collaborator) commented Aug 23, 2023

Do you have a working Python 3 installation?

I didn't realize you have to build from source when using EMR. Do you need an uber-JAR with all dependencies included? I found it hard to build such a JAR.

nawidsayed (Author) commented Aug 23, 2023

Yes, I do have a Python 3 installation, but it seems this error is caused by a Python 2 invocation of create_jni.py: the f-string on line 125 is only valid in Python 3.6+. Invoking python from the command line opens a Python 3.7.16 shell, so I am not sure how or why Python 2 gets invoked.

I just want something that reliably works in production; building these uber-JARs hasn't failed me so far.

wbo4958 (Contributor) commented Aug 23, 2023

@nawidsayed, I guess you can hack the Python path here: https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j/pom.xml#L88

nawidsayed (Author) commented Aug 23, 2023

Thanks for your help so far, everybody! I noticed that I am running on EMR Graviton 2 processors (r6gd instances), which are ARM-based, and I believe those might not be well supported by XGBoost4J. I switched to r5d instances (Intel Xeon) and all the dependencies now seem to be present. However, I am still encountering a very generic error (from the master node):

23/08/23 08:55:07 WARN XGBoostSpark: train_test_ratio is deprecated since XGBoost 0.82, we recommend to explicitly pass a training and multiple evaluation datasets by passing 'eval_sets' and 'eval_set_names'
23/08/23 08:55:07 ERROR RabitTracker$TrackerProcessLogger: Tracker Process ends with exit code 1
23/08/23 08:55:08 ERROR XGBoostSpark: the job was aborted due to 
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed.
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:418) ~[test.jar:1.0.0]
	at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:202) ~[test.jar:1.0.0]
	at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:34) ~[test.jar:1.0.0]
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:114) ~[spark-mllib_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
	at com.jobs.PrototypeSparkJob$.run(PrototypeSparkJob.scala:65) ~[test.jar:1.0.0]
	at com.jobs.PrototypeSparkJob$.run(PrototypeSparkJob.scala:18) ~[test.jar:1.0.0]
	at com.jobs.SparkJobWithJson.main(SparkJobWithJson.scala:34) ~[test.jar:1.0.0]
	at com.jobs.PrototypeSparkJob.main(PrototypeSparkJob.scala) ~[test.jar:1.0.0]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_382]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_382]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_382]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_382]
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1066) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1158) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1167) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
Exception in thread "main" ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed.
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:418)
	at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:202)
	at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:34)
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:114)
	at com.jobs.PrototypeSparkJob$.run(PrototypeSparkJob.scala:65)
	at com.jobs.PrototypeSparkJob$.run(PrototypeSparkJob.scala:18)
	at com.jobs.SparkJobWithJson.main(SparkJobWithJson.scala:34)
	at com.jobs.PrototypeSparkJob.main(PrototypeSparkJob.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1066)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1158)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1167)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Command exiting with ret '1'

I haven't found a remedy for this. Almost all references to this error involve the GPU implementation, which is not the case for me here. It's confusing because if I check stderr on the executors, it's clear that training is actually happening, with no indication of an error there:

[09:34:54] [0]	train-mlogloss:0.98398036223191476
[09:34:54] [0]	train-mlogloss:0.97309246502424540
[09:34:54] [1]	train-mlogloss:0.88586563941759944
[09:34:54] [1]	train-mlogloss:0.86604528834945282
[09:34:54] [2]	train-mlogloss:0.80109514334262943
[09:34:54] [2]	train-mlogloss:0.77383518846411459
[09:34:54] [3]	train-mlogloss:0.72730388396825540
[09:34:54] [4]	train-mlogloss:0.66267788104521919
[09:34:54] [3]	train-mlogloss:0.69377712826979787
[09:34:54] [5]	train-mlogloss:0.60579290756812465
[09:34:54] [4]	train-mlogloss:0.62382606613008595
[09:34:54] [6]	train-mlogloss:0.55597261434946310
....


23/08/23 09:34:56 INFO Executor: 1 block locks were not released by task 0.0 in stage 4.0 (TID 4)
[rdd_25_0]
23/08/23 09:34:56 INFO MemoryStore: Block taskresult_4 stored as bytes in memory (estimated size 6.1 MiB, free 6.1 GiB)
23/08/23 09:34:56 INFO Executor: Finished task 0.0 in stage 4.0 (TID 4). 6442557 bytes result sent via BlockManager)
23/08/23 09:34:56 INFO YarnCoarseGrainedExecutorBackend: Driver commanded a shutdown
23/08/23 09:34:56 INFO MemoryStore: MemoryStore cleared
23/08/23 09:34:56 INFO BlockManager: BlockManager stopped
23/08/23 09:34:56 INFO ShutdownHookManager: Shutdown hook called
23/08/23 09:34:56 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/hadoop/appcache/application_1692778153848_0006/spark-d1d9df8a-dde2-47ce-8ca7-fc09fde80055

trivialfis changed the title from "XGBoost4J-Spark fails on EMR" to "[jvm-packages] XGBoost4J-Spark fails on EMR" on Aug 23, 2023
nawidsayed (Author) commented

So it seems it's related to Spark & XGBoost versioning. Using Spark 3.4.0 on Scala 2.12 with the XGBoost packages at version 1.7.6, I get the aforementioned error, which is probably related to the Rabit tracker. Stdout prints "Tracker started, with env={}" just before erroring out.

However, I don't have any issues when running Spark 2.4.8 on Scala 2.11 with xgboost4j and xgboost4j-spark at version 1.1.2. In that case, just before the training routine, stdout reads: "Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=172.31.89.29, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=36}".

nawidsayed (Author) commented

Is there any way to make it work properly with Spark 3.4?

hcho3 (Collaborator) commented Aug 23, 2023

XGBoost 1.7.6 supports Spark 3.0.1:

<spark.version>3.0.1</spark.version>

To use Spark 3.4.0, you can use XGBoost 2.0.0:

<spark.version>3.4.0</spark.version>
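
In build.sbt terms, that would be something like the following (a sketch assuming the final 2.0.0 artifacts are published to Maven Central; at the time of this thread, only 2.0.0-RC1 was available):

// build.sbt — hypothetical until 2.0.0 is published to Maven Central
libraryDependencies ++= Seq(
  "ml.dmlc" %% "xgboost4j" % "2.0.0",
  "ml.dmlc" %% "xgboost4j-spark" % "2.0.0"
)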

nawidsayed (Author) commented Aug 23, 2023

Thanks for pointing this out. Unfortunately, adding the library according to the instructions here fails in the following way when running sbt compile:

[error] (update) java.net.URISyntaxException: Illegal character in path at index 106: https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/release/ml/dmlc/xgboost4j_2.12/2.0.0-RC1/xgboost4j_${scala.binary.version}-2.0.0-RC1.jar

(The path still contains the literal ${scala.binary.version} placeholder, so the repository's POM apparently references an unresolved Maven property.)

nawidsayed (Author) commented Aug 23, 2023

Even when manually adding the 2.0.0-RC1 packages to the JAR, we run into the Rabit tracker error:

Tracker started, with env={}
23/08/23 16:53:11 ERROR RabitTracker$TrackerProcessLogger: Tracker Process ends with exit code 1
23/08/23 16:53:12 ERROR XGBoostSpark: the job was aborted due to
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed

Even after this error, the executors still proceed with training, according to their logs:

[16:59:23] [97]	train-mlogloss:0.63060436905694506
[16:59:23] [97]	train-mlogloss:0.63249886897005347
[16:59:24] [98]	train-mlogloss:0.63300104089375020
...

nawidsayed changed the title from "[jvm-packages] XGBoost4J-Spark fails on EMR" to "[jvm-packages] XGBoost4J-Spark 2.0.0-RC1 fails for Spark 3.4.0" on Aug 23, 2023
nawidsayed changed the title from "[jvm-packages] XGBoost4J-Spark 2.0.0-RC1 fails for Spark 3.4.0" to "[jvm-packages] XGBoost4J-Spark 2.0.0-RC1 fails for Spark 3.4.0 on EMR" on Aug 23, 2023
trivialfis (Member) commented

I think we should prioritize the refactoring of the tracker; otherwise, JVM-related issues are quite difficult to resolve.

wbo4958 (Contributor) commented Aug 23, 2023

Is it possible the tracker is also running with Python 2?

nawidsayed (Author) commented Aug 24, 2023

I don't know; isn't it written in C? The default python command resolves to Python 3.7.16 on EMR, though. Anyway, I was able to run XGBoost4J-Spark 1.1.2 on EMR 5.36.1 (Spark 2.4.8) successfully, and I didn't change anything besides the EMR and XGBoost versions to get it running.

If it helps, I could write out a minimal example that leads to the aforementioned success and failure, respectively.
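
A rough sketch of what that minimal example would look like (the column names, parameters, and toy data here are illustrative, not taken from the actual job):

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

object MinimalXGBRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("xgb-emr-repro").getOrCreate()
    import spark.implicits._

    // Tiny three-class toy dataset (the mlogloss in the logs implies multi-class).
    val df = Seq(
      (0.0, 1.0, 2.0, 0.0),
      (1.0, 0.0, 2.0, 1.0),
      (2.0, 1.0, 0.0, 2.0),
      (0.5, 1.5, 2.5, 0.0)
    ).toDF("f0", "f1", "f2", "label")

    val assembled = new VectorAssembler()
      .setInputCols(Array("f0", "f1", "f2"))
      .setOutputCol("features")
      .transform(df)

    val classifier = new XGBoostClassifier(Map[String, Any](
      "objective"   -> "multi:softprob",
      "num_class"   -> 3,
      "num_round"   -> 10,
      "num_workers" -> 2
    )).setFeaturesCol("features").setLabelCol("label")

    // On the failing setups, fit() aborts with "Tracker Process ends with
    // exit code 1", even though the executors log training progress.
    val model = classifier.fit(assembled)
    spark.stop()
  }
}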

djmarti commented Nov 29, 2023

I bumped into the exact same generic error reported by the OP, using a very similar setup (EMR 6.5.0, Spark 3.1.2). Even though I am using Scala Spark, there is a Python dependency through RabitTracker, which requires Python >= 3.8. But EMR 6.5.0 provides Python 3.7. Setting up a virtual environment that allows the cluster to use a higher Python version solved the problem for me.

djmarti commented May 20, 2024

Coming back again, after the solution I suggested in my November post didn't seem to work out on a second attempt. For me, it was important to activate the virtual environment with the right Python version before starting my spark-shell session on the master node.

So on the master node I would run:

source pyspark_venv_python_3.9.9/bin/activate

and then I would launch my spark-shell session with:

MASTER=yarn-client /usr/bin/spark-shell \
  --name my_static_shell \
  --queue default \
  --driver-memory 20G \
  --executor-memory 16G \
  --executor-cores 1 \
  --num-executors 90 \
  --archives s3://mypath/pyspark_venv_python_3.9.9.tar.gz#environment \
  --conf spark.yarn.maxAppAttempts=0 \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.task.cpus=1 \
  --conf spark.kryoserializer.buffer.max=2047mb \
  --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python \
  --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python \
  --jars s3://path_to_one.jar

Only then is the tracker able to start with a sensible environment:

Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.x.x.x, DMLC_TRACKER_PORT=xxxxx, DMLC_NUM_WORKER=80}

If I am not in the virtual environment before launching the shell, the tracker fails.

trivialfis (Member) commented

That's caused by the Python dependency. We have removed the use of Python in the master branch.

djmarti commented May 20, 2024

Thanks @trivialfis. I am bound to use version 1.7.3, but it's great to hear the Python dependency has been removed in recent versions. It was really a pain to deal with.
