
[jvm-packages] XGBoost4J-Spark 2.0.0-RC1 fails for Spark 3.4.0 on EMR #9512

Open · nawidsayed opened this issue Aug 22, 2023 · 18 comments

nawidsayed commented Aug 22, 2023

Hello everybody,

I am trying to implement XGBoost4J-Spark in a Scala project. Everything works fine locally (on an Intel MacBook); however, when deploying to EMR (running EMR 6.12.0 and Spark 3.4.0 with Scala 2.12.17), I receive the following error:

java.lang.NoClassDefFoundError: Could not initialize class ml.dmlc.xgboost4j.java.XGBoostJNI

For my build.sbt I added the following to libraryDependencies, as suggested by the tutorial (running with sbt 1.9.2):

libraryDependencies ++= Seq(
  "ml.dmlc" %% "xgboost4j" % "1.7.6",
  "ml.dmlc" %% "xgboost4j-spark" % "1.7.6"
)

I packaged everything into a single JAR via the sbt-assembly plugin. I believe this packs all the dependencies needed to run the Spark application on EMR into the JAR, so I am really out of ideas about this error. Not sure if this is an error on my end or an actual bug. Help is appreciated!

hcho3 (Collaborator) commented Aug 22, 2023

Can you install XGBoost4J-Spark from Maven Central? Locally building JARs is more complex, as you might have issues with bundling the native library (libxgboost4j.so).
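
If you do go the uber-JAR route, a common pitfall is an assembly merge strategy that drops the bundled native library. A minimal sbt-assembly sketch, assuming the xgboost4j JAR ships libxgboost4j.so under a lib/ path (the exact JAR layout and plugin syntax depend on your versions, so treat this as a starting point rather than official guidance):

// build.sbt — hedged sketch, not official XGBoost guidance
ThisBuild / assemblyMergeStrategy := {
  case PathList("lib", _*)      => MergeStrategy.first   // keep bundled native libraries
  case PathList("META-INF", _*) => MergeStrategy.discard // drop JAR signatures and manifests
  case _                        => MergeStrategy.first
}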

nawidsayed (Author) commented Aug 22, 2023

I assume you mean to build the packages from source?

I tried that on the master node of the EMR cluster, but ran into errors. These are the steps I ran:

Installing Maven:

wget https://dlcdn.apache.org/maven/maven-3/3.9.4/binaries/apache-maven-3.9.4-bin.tar.gz
tar xzvf apache-maven-3.9.4-bin.tar.gz
export PATH=$PATH:/home/hadoop/apache-maven-3.9.4/bin

Cloning the repo, switching to 1.7.6, and packaging it up (following the steps in the tutorial):

git clone --recursive https://github.com/dmlc/xgboost

cd xgboost
git checkout 36eb41c
cd jvm-packages
mvn package

This ended with the following error:

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for XGBoost JVM Package 1.7.6:
[INFO]
[INFO] XGBoost JVM Package ................................ SUCCESS [  2.298 s]
[INFO] xgboost4j_2.12 ..................................... FAILURE [  1.973 s]
[INFO] xgboost4j-spark_2.12 ............................... SKIPPED
[INFO] xgboost4j-flink_2.12 ............................... SKIPPED
[INFO] xgboost4j-example_2.12 ............................. SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  4.651 s
[INFO] Finished at: 2023-08-22T23:09:24Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.6.0:exec (native) on project xgboost4j_2.12: Command execution failed.: Process exited with an error: 1 (Exit value: 1) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <args> -rf :xgboost4j_2.12

which seems to be caused by this:

  File "create_jni.py", line 125
    run(f'"{sys.executable}" mapfeat.py')
                                       ^
SyntaxError: invalid syntax
[ERROR] Command execution failed.

Similarly, I tried building it by running ./xgboost/jvm-packages/dev/build-linux.sh, as suggested by the README in jvm-packages. This too fails somewhere down the line with:

docker: Error response from daemon: pull access denied for dmlc/xgboost4j-build, repository does not exist or may require 'docker login'

I feel like I am well off the beaten path here and am probably missing something quite obvious...

hcho3 (Collaborator) commented Aug 23, 2023

Do you have a working Python 3 installation?

I didn't realize you have to build from source when using EMR. Do you need an uber-JAR with all dependencies included? I found it hard to build such a JAR.

nawidsayed (Author) commented Aug 23, 2023

Yes, I do have a Python 3 installation, but it seems this error is caused by a Python 2 invocation of create_jni.py: the f-string on line 125 is only valid in Python 3.6+. Invoking python from the command line opens a Python 3.7.16 shell, so I am not sure how or why Python 2 gets invoked.

I just want something that reliably works in production; building these uber-JARs hasn't failed me so far.

wbo4958 (Contributor) commented Aug 23, 2023

@nawidsayed, I guess you can hack the Python path here: https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j/pom.xml#L88

nawidsayed (Author) commented Aug 23, 2023

Thanks for your help so far, everybody! I noticed that I am running on EMR Graviton 2 processors (r6gd instances), which are ARM-based, and I believe those might not be well supported by XGBoost4J. I switched to r5d instances (Intel Xeon) and all the dependencies now seem to be present. However, I am still encountering a very generic error (from the master node):

23/08/23 08:55:07 WARN XGBoostSpark: train_test_ratio is deprecated since XGBoost 0.82, we recommend to explicitly pass a training and multiple evaluation datasets by passing 'eval_sets' and 'eval_set_names'
23/08/23 08:55:07 ERROR RabitTracker$TrackerProcessLogger: Tracker Process ends with exit code 1
23/08/23 08:55:08 ERROR XGBoostSpark: the job was aborted due to 
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed.
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:418) ~[test.jar:1.0.0]
	at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:202) ~[test.jar:1.0.0]
	at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:34) ~[test.jar:1.0.0]
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:114) ~[spark-mllib_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
	at com.jobs.PrototypeSparkJob$.run(PrototypeSparkJob.scala:65) ~[test.jar:1.0.0]
	at com.jobs.PrototypeSparkJob$.run(PrototypeSparkJob.scala:18) ~[test.jar:1.0.0]
	at com.jobs.SparkJobWithJson.main(SparkJobWithJson.scala:34) ~[test.jar:1.0.0]
	at com.jobs.PrototypeSparkJob.main(PrototypeSparkJob.scala) ~[test.jar:1.0.0]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_382]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_382]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_382]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_382]
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1066) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1158) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1167) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
Exception in thread "main" ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed.
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:418)
	at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:202)
	at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:34)
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:114)
	at com.jobs.PrototypeSparkJob$.run(PrototypeSparkJob.scala:65)
	at com.jobs.PrototypeSparkJob$.run(PrototypeSparkJob.scala:18)
	at com.jobs.SparkJobWithJson.main(SparkJobWithJson.scala:34)
	at com.jobs.PrototypeSparkJob.main(PrototypeSparkJob.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1066)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1158)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1167)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Command exiting with ret '1'

I haven't found a remedy for this. Almost all references to this error involve the GPU implementation, which is not the case for me here. It's confusing because if I check stderr on the executors, it's clear that training is actually happening, with no indication of an error there:

[09:34:54] [0]	train-mlogloss:0.98398036223191476
[09:34:54] [0]	train-mlogloss:0.97309246502424540
[09:34:54] [1]	train-mlogloss:0.88586563941759944
[09:34:54] [1]	train-mlogloss:0.86604528834945282
[09:34:54] [2]	train-mlogloss:0.80109514334262943
[09:34:54] [2]	train-mlogloss:0.77383518846411459
[09:34:54] [3]	train-mlogloss:0.72730388396825540
[09:34:54] [4]	train-mlogloss:0.66267788104521919
[09:34:54] [3]	train-mlogloss:0.69377712826979787
[09:34:54] [5]	train-mlogloss:0.60579290756812465
[09:34:54] [4]	train-mlogloss:0.62382606613008595
[09:34:54] [6]	train-mlogloss:0.55597261434946310
....


23/08/23 09:34:56 INFO Executor: 1 block locks were not released by task 0.0 in stage 4.0 (TID 4)
[rdd_25_0]
23/08/23 09:34:56 INFO MemoryStore: Block taskresult_4 stored as bytes in memory (estimated size 6.1 MiB, free 6.1 GiB)
23/08/23 09:34:56 INFO Executor: Finished task 0.0 in stage 4.0 (TID 4). 6442557 bytes result sent via BlockManager)
23/08/23 09:34:56 INFO YarnCoarseGrainedExecutorBackend: Driver commanded a shutdown
23/08/23 09:34:56 INFO MemoryStore: MemoryStore cleared
23/08/23 09:34:56 INFO BlockManager: BlockManager stopped
23/08/23 09:34:56 INFO ShutdownHookManager: Shutdown hook called
23/08/23 09:34:56 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/hadoop/appcache/application_1692778153848_0006/spark-d1d9df8a-dde2-47ce-8ca7-fc09fde80055

trivialfis changed the title from "XGBoost4J-Spark fails on EMR" to "[jvm-packages] XGBoost4J-Spark fails on EMR" on Aug 23, 2023
nawidsayed (Author) commented

So it seems it's related to Spark & XGBoost versioning. Using Spark 3.4.0 on Scala 2.12 with the XGBoost packages at version 1.7.6, I get the aforementioned error, which is probably related to the Rabit tracker. Stdout prints "Tracker started, with env={}" just before erroring out.

However, I don't have any issues when running Spark 2.4.8 on Scala 2.11 with xgboost4j and xgboost4j-spark at version 1.1.2. In that case, just before the training routine, stdout reads: "Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=172.31.89.29, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=36}".

nawidsayed (Author) commented

Is there any way to make it work properly with Spark 3.4?

hcho3 (Collaborator) commented Aug 23, 2023

XGBoost 1.7.6 supports Spark 3.0.1:

<spark.version>3.0.1</spark.version>

To use Spark 3.4.0, you can use XGBoost 2.0.0:

<spark.version>3.4.0</spark.version>
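
In build.sbt terms, that would be something like the following (a sketch assuming the final 2.0.0 artifacts are published to Maven Central; at the time of this thread, only 2.0.0-RC1 was available):

// build.sbt — hypothetical until 2.0.0 is published to Maven Central
libraryDependencies ++= Seq(
  "ml.dmlc" %% "xgboost4j" % "2.0.0",
  "ml.dmlc" %% "xgboost4j-spark" % "2.0.0"
)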

nawidsayed (Author) commented Aug 23, 2023

Thanks for pointing this out. Unfortunately, adding the library according to the instructions here fails in the following way when running sbt compile:

[error] (update) java.net.URISyntaxException: Illegal character in path at index 106: https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/release/ml/dmlc/xgboost4j_2.12/2.0.0-RC1/xgboost4j_${scala.binary.version}-2.0.0-RC1.jar

(The path still contains the literal ${scala.binary.version} placeholder, so the repository's POM apparently references an unresolved Maven property.)

nawidsayed (Author) commented Aug 23, 2023

Even when manually adding the 2.0.0-RC1 packages to the JAR, we run into the Rabit tracker error:

Tracker started, with env={}
23/08/23 16:53:11 ERROR RabitTracker$TrackerProcessLogger: Tracker Process ends with exit code 1
23/08/23 16:53:12 ERROR XGBoostSpark: the job was aborted due to
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed

Even after this error, the executors still proceed with training, according to their logs:

[16:59:23] [97]	train-mlogloss:0.63060436905694506
[16:59:23] [97]	train-mlogloss:0.63249886897005347
[16:59:24] [98]	train-mlogloss:0.63300104089375020
...

nawidsayed changed the title from "[jvm-packages] XGBoost4J-Spark fails on EMR" to "[jvm-packages] XGBoost4J-Spark 2.0.0-RC1 fails for Spark 3.4.0" on Aug 23, 2023
nawidsayed changed the title from "[jvm-packages] XGBoost4J-Spark 2.0.0-RC1 fails for Spark 3.4.0" to "[jvm-packages] XGBoost4J-Spark 2.0.0-RC1 fails for Spark 3.4.0 on EMR" on Aug 23, 2023
trivialfis (Member) commented

I think we should prioritize the refactoring of the tracker; otherwise, JVM-related issues are quite difficult to resolve.

wbo4958 (Contributor) commented Aug 23, 2023

Is it possible the tracker is also running with Python 2?

nawidsayed (Author) commented Aug 24, 2023

I don't know; isn't it written in C? The default python command resolves to Python 3.7.16 on EMR, though. Anyway, I was able to run XGBoost4J-Spark 1.1.2 on EMR 5.36.1 (Spark 2.4.8) successfully, and I didn't change anything besides the EMR and XGBoost versions to get it running.

If it helps, I could write out a minimal example that leads to the aforementioned success and failure, respectively.
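
A rough sketch of what that minimal example would look like (the column names, parameters, and toy data here are illustrative, not taken from the actual job):

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

object MinimalXGBRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("xgb-emr-repro").getOrCreate()
    import spark.implicits._

    // Tiny three-class toy dataset (the mlogloss in the logs implies multi-class).
    val df = Seq(
      (0.0, 1.0, 2.0, 0.0),
      (1.0, 0.0, 2.0, 1.0),
      (2.0, 1.0, 0.0, 2.0),
      (0.5, 1.5, 2.5, 0.0)
    ).toDF("f0", "f1", "f2", "label")

    val assembled = new VectorAssembler()
      .setInputCols(Array("f0", "f1", "f2"))
      .setOutputCol("features")
      .transform(df)

    val classifier = new XGBoostClassifier(Map[String, Any](
      "objective"   -> "multi:softprob",
      "num_class"   -> 3,
      "num_round"   -> 10,
      "num_workers" -> 2
    )).setFeaturesCol("features").setLabelCol("label")

    // On the failing setups, fit() aborts with "Tracker Process ends with
    // exit code 1", even though the executors log training progress.
    val model = classifier.fit(assembled)
    spark.stop()
  }
}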

djmarti commented Nov 29, 2023

I bumped into the exact same generic error reported by the OP, using a very similar setup (EMR 6.5.0, Spark 3.1.2). Even though I am using Scala Spark, there is a Python dependency through RabitTracker, which requires Python >= 3.8. But EMR 6.5.0 provides Python 3.7. Setting up a virtual environment that allows the cluster to use a higher Python version solved the problem for me.

djmarti commented May 20, 2024

Coming back again, after the solution I suggested in my November post didn't seem to work out on a second attempt. For me, it was important to activate the virtual environment with the right Python version before starting my spark-shell session on the master node.

So on the master node I would run:

source pyspark_venv_python_3.9.9/bin/activate

and then I would launch my spark-shell session with:

MASTER=yarn-client /usr/bin/spark-shell \
  --name my_static_shell \
  --queue default \
  --driver-memory 20G \
  --executor-memory 16G \
  --executor-cores 1 \
  --num-executors 90 \
  --archives s3://mypath/pyspark_venv_python_3.9.9.tar.gz#environment \
  --conf spark.yarn.maxAppAttempts=0 \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.task.cpus=1 \
  --conf spark.kryoserializer.buffer.max=2047mb \
  --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python \
  --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python \
  --jars s3://path_to_one.jar

Only then is the tracker able to start with a sensible environment:

Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.x.x.x, DMLC_TRACKER_PORT=xxxxx, DMLC_NUM_WORKER=80}

If I am not in the virtual environment before launching the shell, the tracker fails.

trivialfis (Member) commented

That's caused by the Python dependency. We have removed the use of Python in the master branch.

djmarti commented May 20, 2024

Thanks @trivialfis. I am bound to use version 1.7.3, but it's great to hear the Python dependency has been removed in recent versions. It was really a pain to deal with.
