[SPARK-31582][YARN] Being able to not populate Hadoop classpath
### What changes were proposed in this pull request?
We are adding a new Spark Yarn configuration, `spark.yarn.populateHadoopClasspath` to not populate Hadoop classpath from `yarn.application.classpath` and `mapreduce.application.classpath`.
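
For illustration, a minimal sketch (not code from this patch) of how the new flag would be set; it is an ordinary boolean conf that must be in place at submission time:

```scala
// Minimal sketch, not from this patch: the new flag is a plain boolean conf.
// Illustrative shell form:
//   spark-submit --conf spark.yarn.populateHadoopClasspath=false ...
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.yarn.populateHadoopClasspath", "false") // default is "true"
```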

### Why are the changes needed?
The Spark Yarn client populates extra Hadoop classpath entries from `yarn.application.classpath` and `mapreduce.application.classpath` when a job is submitted to a Yarn Hadoop cluster.

However, for a `with-hadoop` Spark build that embeds the Hadoop runtime, this can cause jar conflicts, because the Spark distribution may contain a different version of the Hadoop jars.

One case we have seen is a user running an Apache Spark distribution with its own embedded Hadoop and submitting a job to a Cloudera or Hortonworks Yarn cluster; with two different, incompatible sets of Hadoop jars on the classpath, the job runs into errors.

Not populating the Hadoop classpath supplied by the cluster addresses this issue.
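
For context, a simplified sketch of how the YARN half of that classpath population is resolved (illustrative only; `yarnAppClasspath` is a made-up name, and the real Spark helper differs in detail):

```scala
// Illustrative sketch: resolve the YARN application classpath, falling back
// to Hadoop's defaults when yarn.application.classpath is unset.
// This approximates, but is not, the actual Spark helper.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.yarn.conf.YarnConfiguration

def yarnAppClasspath(conf: Configuration): Seq[String] =
  Option(conf.getStrings(YarnConfiguration.YARN_APPLICATION_CLASSPATH))
    .map(_.toSeq)
    .getOrElse(YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH.toSeq)
```

Each entry resolved this way ends up on the container's CLASSPATH; skipping that step is exactly what the new flag allows.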

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
A unit test is added. It is very hard to add a new integration test, since that would require using different, incompatible versions of Hadoop.

We also manually tested this PR: we were able to submit a Spark job, using a Spark distribution built with Apache Hadoop 2.10, to a CDH 5.6 cluster without populating the CDH classpath.
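
A hedged sketch of that kind of submission (the application jar and main class below are placeholders, not from this PR):

```scala
// Illustrative programmatic submission with Hadoop classpath population
// disabled; the jar path and main class are placeholders.
import org.apache.spark.launcher.SparkLauncher

val process = new SparkLauncher()
  .setMaster("yarn")
  .setConf("spark.yarn.populateHadoopClasspath", "false")
  .setAppResource("/path/to/app.jar") // placeholder
  .setMainClass("org.example.MyApp")  // placeholder
  .launch()
process.waitFor()
```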

Closes #28376 from dbtsai/yarn-classpath.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
dbtsai committed Apr 29, 2020
1 parent 226301a commit ecfee82
Showing 4 changed files with 43 additions and 1 deletion.
11 changes: 11 additions & 0 deletions docs/running-on-yarn.md
@@ -385,6 +385,17 @@ To use a custom metrics.properties for the application master and executors, upd
   </td>
   <td>1.4.0</td>
 </tr>
+<tr>
+  <td><code>spark.yarn.populateHadoopClasspath</code></td>
+  <td>true</td>
+  <td>
+    Whether to populate the Hadoop classpath from <code>yarn.application.classpath</code> and
+    <code>mapreduce.application.classpath</code>. Note that if this is set to <code>false</code>,
+    it requires a <code>with-Hadoop</code> Spark distribution that bundles the Hadoop runtime, or
+    the user has to provide a Hadoop installation separately.
+  </td>
+  <td>2.4.6</td>
+</tr>
 <tr>
   <td><code>spark.yarn.maxAppAttempts</code></td>
   <td><code>yarn.resourcemanager.am.max-attempts</code> in YARN</td>
5 changes: 4 additions & 1 deletion resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
@@ -1353,7 +1353,10 @@ private object Client extends Logging {
       }
     }
 
-    populateHadoopClasspath(conf, env)
+    if (sparkConf.get(POPULATE_HADOOP_CLASSPATH)) {
+      populateHadoopClasspath(conf, env)
+    }
+
     sys.env.get(ENV_DIST_CLASSPATH).foreach { cp =>
       addClasspathEntry(getClusterPath(sparkConf, cp), env)
     }
9 changes: 9 additions & 0 deletions resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala
@@ -70,6 +70,15 @@ package object config {
     .booleanConf
     .createWithDefault(false)
 
+  private[spark] val POPULATE_HADOOP_CLASSPATH = ConfigBuilder("spark.yarn.populateHadoopClasspath")
+    .doc("Whether to populate the Hadoop classpath from `yarn.application.classpath` and " +
+      "`mapreduce.application.classpath`. Note that if this is set to `false`, it requires " +
+      "a `with-Hadoop` Spark distribution that bundles the Hadoop runtime, or the user has " +
+      "to provide a Hadoop installation separately.")
+    .version("2.4.6")
+    .booleanConf
+    .createWithDefault(true)
+
   private[spark] val GATEWAY_ROOT_PATH = ConfigBuilder("spark.yarn.config.gatewayPath")
     .doc("Root of configuration paths that is present on gateway nodes, and will be replaced " +
       "with the corresponding path in cluster machines.")
19 changes: 19 additions & 0 deletions resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala
@@ -485,6 +485,25 @@ class ClientSuite extends SparkFunSuite with Matchers {
     }
   }
 
+  test("SPARK-31582 Being able to not populate Hadoop classpath") {
+    Seq(true, false).foreach { populateHadoopClassPath =>
+      withAppConf(Fixtures.mapAppConf) { conf =>
+        val sparkConf = new SparkConf()
+          .set(POPULATE_HADOOP_CLASSPATH, populateHadoopClassPath)
+        val env = new MutableHashMap[String, String]()
+        val args = new ClientArguments(Array("--jar", USER))
+        populateClasspath(args, conf, sparkConf, env)
+        if (populateHadoopClassPath) {
+          classpath(env) should
+            (contain (Fixtures.knownYARNAppCP) and contain (Fixtures.knownMRAppCP))
+        } else {
+          classpath(env) should
+            (not contain (Fixtures.knownYARNAppCP) and not contain (Fixtures.knownMRAppCP))
+        }
+      }
+    }
+  }
+
   private val matching = Seq(
     ("files URI match test1", "file:///file1", "file:///file2"),
     ("files URI match test2", "file:///c:file1", "file://c:file2"),
