[SPARK-48651][DOC] Configuring different JDK for Spark on YARN #47010

34 changes: 34 additions & 0 deletions docs/running-on-yarn.md
@@ -33,6 +33,9 @@ Please see [Spark Security](security.html) and the specific security sections in

# Launching Spark on YARN

As of 3.4.0, Apache Hadoop does not support Java 17, while Apache Spark requires Java 17 or later since 4.0.0, so a different JDK needs to be configured for Spark applications.
Please refer to [Configuring different JDKs for Spark Applications](#configuring-different-jdks-for-spark-applications) for details.

Ensure that `HADOOP_CONF_DIR` or `YARN_CONF_DIR` points to the directory which contains the (client side) configuration files for the Hadoop cluster.
These configs are used to write to HDFS and connect to the YARN ResourceManager. The
configuration contained in this directory will be distributed to the YARN cluster so that all
@@ -1032,3 +1035,34 @@ and one should be configured with:
spark.shuffle.service.name = spark_shuffle_y
spark.shuffle.service.port = <other value>
```

# Configuring different JDKs for Spark Applications

In some cases it may be desirable to run Spark applications with a JDK different from the one used by the YARN NodeManager.
This can be achieved by setting the `JAVA_HOME` environment variable for the YARN containers and for the `spark-submit`
process.

Note that Spark assumes all JVM processes run by one application use the same JDK version; otherwise,
you may encounter JDK serialization issues.

To configure a Spark application to use a JDK which has been pre-installed on all nodes at `/opt/openjdk-17`:

$ export JAVA_HOME=/opt/openjdk-17
$ ./bin/spark-submit --class path.to.your.Class \
--master yarn \
--conf spark.yarn.appMasterEnv.JAVA_HOME=/opt/openjdk-17 \
--conf spark.executorEnv.JAVA_HOME=/opt/openjdk-17 \
<app jar> [app options]
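
If the same JDK should be used by every application submitted from a node, the two properties above can equivalently be set once in `conf/spark-defaults.conf` instead of being passed to each `spark-submit` invocation. A minimal sketch, assuming the same `/opt/openjdk-17` install path as above:

```
# conf/spark-defaults.conf
# JDK for the YARN ApplicationMaster (and for the driver in cluster deploy mode)
spark.yarn.appMasterEnv.JAVA_HOME   /opt/openjdk-17
# JDK for the executors
spark.executorEnv.JAVA_HOME         /opt/openjdk-17
```

Note that the `spark-submit` launcher itself (and the driver in `client` deploy mode) still picks up `JAVA_HOME` from the local environment, so the `export` shown above is still needed.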

Alternatively, to avoid installing a separate JDK on the YARN cluster nodes, the JDK can be distributed to the
containers using YARN's Distributed Cache. For example, to use Java 21 to run a Spark application, prepare a
JDK 21 tarball `openjdk-21.tar.gz` and untar it to `/opt` on the local node, then submit the Spark application:

$ export JAVA_HOME=/opt/openjdk-21
$ ./bin/spark-submit --class path.to.your.Class \
--master yarn \
--archives path/to/openjdk-21.tar.gz \
--conf spark.yarn.appMasterEnv.JAVA_HOME=./openjdk-21.tar.gz/openjdk-21 \
--conf spark.executorEnv.JAVA_HOME=./openjdk-21.tar.gz/openjdk-21 \
<app jar> [app options]

Comment on lines +1066 to +1067 (Member Author):

@yaooqinn @tgravescs sorry for correcting this in 5bbe200 after your approval, I also updated the PR description to add the manual test result on a YARN cluster
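
YARN unpacks a distributed archive into the container's working directory under the archive's file name, which is why the example above sets `JAVA_HOME` to the relative path `./openjdk-21.tar.gz/openjdk-21`. As a minimal sketch (the layout below is an assumption; the only requirement is that the configured `JAVA_HOME` resolves to a directory inside the unpacked archive that contains `bin/java`), such a tarball could be prepared with:

```
# Assumes the JDK 21 installation lives at /opt/openjdk-21 on the local node,
# so the resulting tarball contains a single top-level directory named openjdk-21.
$ tar -czf openjdk-21.tar.gz -C /opt openjdk-21
```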