[SPARK-31582][YARN] Being able to not populate Hadoop classpath
### What changes were proposed in this pull request?
We are adding a new Spark Yarn configuration, `spark.yarn.populateHadoopClasspath` to not populate Hadoop classpath from `yarn.application.classpath` and `mapreduce.application.classpath`.
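
For illustration, a minimal sketch (not code from this patch) of how the new flag would be set; it is an ordinary boolean conf that must be in place at submission time:

```scala
// Minimal sketch, not from this patch: the new flag is a plain boolean conf.
// Illustrative shell form:
//   spark-submit --conf spark.yarn.populateHadoopClasspath=false ...
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.yarn.populateHadoopClasspath", "false") // default is "true"
```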

### Why are the changes needed?
The Spark Yarn client populates extra Hadoop classpath entries from `yarn.application.classpath` and `mapreduce.application.classpath` when a job is submitted to a Yarn Hadoop cluster.

However, for a `with-hadoop` Spark build that embeds the Hadoop runtime, this can cause jar conflicts, because the Spark distribution may contain a different version of the Hadoop jars.

One case we have seen is a user running an Apache Spark distribution with its own embedded Hadoop and submitting a job to a Cloudera or Hortonworks Yarn cluster; with two different, incompatible sets of Hadoop jars on the classpath, the job runs into errors.

Not populating the Hadoop classpath supplied by the cluster addresses this issue.
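
For context, a simplified sketch of how the YARN half of that classpath population is resolved (illustrative only; `yarnAppClasspath` is a made-up name, and the real Spark helper differs in detail):

```scala
// Illustrative sketch: resolve the YARN application classpath, falling back
// to Hadoop's defaults when yarn.application.classpath is unset.
// This approximates, but is not, the actual Spark helper.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.yarn.conf.YarnConfiguration

def yarnAppClasspath(conf: Configuration): Seq[String] =
  Option(conf.getStrings(YarnConfiguration.YARN_APPLICATION_CLASSPATH))
    .map(_.toSeq)
    .getOrElse(YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH.toSeq)
```

Each entry resolved this way ends up on the container's CLASSPATH; skipping that step is exactly what the new flag allows.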

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
A unit test is added. It is very hard to add a new integration test, since that would require using different, incompatible versions of Hadoop.

We also manually tested this PR: we were able to submit a Spark job, using a Spark distribution built with Apache Hadoop 2.10, to a CDH 5.6 cluster without populating the CDH classpath.
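
A hedged sketch of that kind of submission (the application jar and main class below are placeholders, not from this PR):

```scala
// Illustrative programmatic submission with Hadoop classpath population
// disabled; the jar path and main class are placeholders.
import org.apache.spark.launcher.SparkLauncher

val process = new SparkLauncher()
  .setMaster("yarn")
  .setConf("spark.yarn.populateHadoopClasspath", "false")
  .setAppResource("/path/to/app.jar") // placeholder
  .setMainClass("org.example.MyApp")  // placeholder
  .launch()
process.waitFor()
```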

Closes #28376 from dbtsai/yarn-classpath.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
dbtsai committed Apr 29, 2020
1 parent 226301a commit ecfee82
Showing 4 changed files with 43 additions and 1 deletion.
11 changes: 11 additions & 0 deletions docs/running-on-yarn.md
@@ -385,6 +385,17 @@ To use a custom metrics.properties for the application master and executors, upd
   </td>
   <td>1.4.0</td>
 </tr>
+<tr>
+  <td><code>spark.yarn.populateHadoopClasspath</code></td>
+  <td>true</td>
+  <td>
+    Whether to populate the Hadoop classpath from <code>yarn.application.classpath</code> and
+    <code>mapreduce.application.classpath</code>. Note that if this is set to <code>false</code>,
+    it requires a <code>with-Hadoop</code> Spark distribution that bundles the Hadoop runtime, or
+    the user has to provide a Hadoop installation separately.
+  </td>
+  <td>2.4.6</td>
+</tr>
 <tr>
   <td><code>spark.yarn.maxAppAttempts</code></td>
   <td><code>yarn.resourcemanager.am.max-attempts</code> in YARN</td>
5 changes: 4 additions & 1 deletion resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
@@ -1353,7 +1353,10 @@ private object Client extends Logging {
       }
     }
 
-    populateHadoopClasspath(conf, env)
+    if (sparkConf.get(POPULATE_HADOOP_CLASSPATH)) {
+      populateHadoopClasspath(conf, env)
+    }
+
     sys.env.get(ENV_DIST_CLASSPATH).foreach { cp =>
       addClasspathEntry(getClusterPath(sparkConf, cp), env)
     }
9 changes: 9 additions & 0 deletions resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala
@@ -70,6 +70,15 @@ package object config {
     .booleanConf
     .createWithDefault(false)
 
+  private[spark] val POPULATE_HADOOP_CLASSPATH = ConfigBuilder("spark.yarn.populateHadoopClasspath")
+    .doc("Whether to populate the Hadoop classpath from `yarn.application.classpath` and " +
+      "`mapreduce.application.classpath`. Note that if this is set to `false`, it requires " +
+      "a `with-Hadoop` Spark distribution that bundles the Hadoop runtime, or the user has " +
+      "to provide a Hadoop installation separately.")
+    .version("2.4.6")
+    .booleanConf
+    .createWithDefault(true)
+
   private[spark] val GATEWAY_ROOT_PATH = ConfigBuilder("spark.yarn.config.gatewayPath")
     .doc("Root of configuration paths that is present on gateway nodes, and will be replaced " +
       "with the corresponding path in cluster machines.")
19 changes: 19 additions & 0 deletions resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala
@@ -485,6 +485,25 @@ class ClientSuite extends SparkFunSuite with Matchers {
     }
   }
 
+  test("SPARK-31582 Being able to not populate Hadoop classpath") {
+    Seq(true, false).foreach { populateHadoopClassPath =>
+      withAppConf(Fixtures.mapAppConf) { conf =>
+        val sparkConf = new SparkConf()
+          .set(POPULATE_HADOOP_CLASSPATH, populateHadoopClassPath)
+        val env = new MutableHashMap[String, String]()
+        val args = new ClientArguments(Array("--jar", USER))
+        populateClasspath(args, conf, sparkConf, env)
+        if (populateHadoopClassPath) {
+          classpath(env) should
+            (contain (Fixtures.knownYARNAppCP) and contain (Fixtures.knownMRAppCP))
+        } else {
+          classpath(env) should
+            (not contain (Fixtures.knownYARNAppCP) and not contain (Fixtures.knownMRAppCP))
+        }
+      }
+    }
+  }
+
   private val matching = Seq(
     ("files URI match test1", "file:///file1", "file:///file2"),
     ("files URI match test2", "file:///c:file1", "file://c:file2"),
