[SPARK-34955][SQL] ADD JAR command cannot add jar files which contains whitespaces in the path #32052

sarutak · 2021-04-05T06:19:27Z

What changes were proposed in this pull request?

This PR fixes an issue that ADD JAR command can't add jar files which contain whitespaces in the path though ADD FILE and ADD ARCHIVE work with such files.

If we have /some/path/test file.jar and execute the following command:

ADD JAR "/some/path/test file.jar";

The following exception is thrown.

21/04/05 10:40:38 ERROR SparkSQLDriver: Failed in [add jar "/some/path/test file.jar"]
java.lang.IllegalArgumentException: Illegal character in path at index 9: /some/path/test file.jar
	at java.net.URI.create(URI.java:852)
	at org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:129)
	at org.apache.spark.sql.execution.command.AddJarCommand.run(resources.scala:34)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)

This is because HiveSessionStateBuilder and SessionStateBuilder don't check whether the form of the path is URI or plain path and it always regards the path as URI form.
Whitespces should be encoded to %20 so /some/path/test file.jar is rejected.
We can resolve this part by checking whether the given path is URI form or not.

Unfortunatelly, if we fix this part, another problem occurs.
When we execute ADD JAR command, Hive's ADD JAR command is executed in HiveClientImpl.addJar and AddResourceProcessor.run is transitively invoked.
In AddResourceProcessor.run, the command line is just split by s+ and the path is also split into /some/path/test and file.jar and passed to ss.add_resources.
https://github.com/apache/hive/blob/f1e87137034e4ecbe39a859d4ef44319800016d7/ql/src/java/org/apache/hadoop/hive/ql/processors/AddResourceProcessor.java#L56-L75
So, the command still fails.

Even if we convert the form of the path to URI like file:/some/path/test%20file.jar and execute the following command:

ADD JAR "file:/some/path/test%20file";

The following exception is thrown.

21/04/05 10:40:53 ERROR SessionState: file:/some/path/test%20file.jar does not exist
java.lang.IllegalArgumentException: file:/some/path/test%20file.jar does not exist
	at org.apache.hadoop.hive.ql.session.SessionState.validateFiles(SessionState.java:1168)
	at org.apache.hadoop.hive.ql.session.SessionState$ResourceType.preHook(SessionState.java:1289)
	at org.apache.hadoop.hive.ql.session.SessionState$ResourceType$1.preHook(SessionState.java:1278)
	at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1378)
	at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1336)
	at org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:74)

The reason is Utilities.realFile invoked in SessionState.validateFiles returns null as the result of fs.exists(path) is false.
https://github.com/apache/hive/blob/f1e87137034e4ecbe39a859d4ef44319800016d7/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1052-L1064

fs.exists checks the existence of the given path by comparing the string representation of Hadoop's Path.
The string representation of Path is similar to URI but it's actually different.
Path doesn't encode the given path.
For example, the URI form of /some/path/jar file.jar is file:/some/path/jar%20file.jar but the Path form of it is file:/some/path/jar file.jar. So fs.exists returns false.

So the solution I come up with is removing Hive's ADD JAR from HiveClientimpl.addJar.
I think Hive's ADD JAR was used to add jar files to the class loader for metadata and isolate the class loader from the one for execution.
https://github.com/apache/spark/pull/6758/files#diff-cdb07de713c84779a5308f65be47964af865e15f00eb9897ccf8a74908d581bbR94-R103

But, as of SPARK-10810 and SPARK-10902 (#8909) are resolved, the class loaders for metadata and execution seem to be isolated with different way.
https://github.com/apache/spark/pull/8909/files#diff-8ef7cabf145d3fe7081da799fa415189d9708892ed76d4d13dd20fa27021d149R635-R641

In the current implementation, such class loaders seem to be isolated by SharedState.jarClassLoader and IsolatedClientLoader.classLoader.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala#L173-L188
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L956-L967

So I wonder we can remove Hive's ADD JAR from HiveClientImpl.addJar.

Why are the changes needed?

This is a bug.

Does this PR introduce any user-facing change?

How was this patch tested?

…s whitespaces in its path.

SparkQA · 2021-04-05T07:44:45Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41480/

SparkQA · 2021-04-05T07:44:47Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41480/

SparkQA · 2021-04-05T11:19:46Z

Test build #136903 has finished for PR 32052 at commit c88104e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

+1, LGTM. Thank you, @sarutak . Merged to master for Apache Spark 3.2.0.
cc @HyukjinKwon and @wangyum

HyukjinKwon · 2021-04-08T03:32:47Z

I agree with this change. Thanks @sarutak and @dongjoon-hyun

Fix the issue that ADD JAR command cannot add jar files which contain…

c88104e

…s whitespaces in its path.

github-actions bot added the SQL label Apr 5, 2021

sarutak mentioned this pull request Apr 7, 2021

[SPARK-34977][SQL] LIST FILES/JARS/ARCHIVES cannot handle multiple arguments properly when at least one path is quoted #32074

Closed

dongjoon-hyun approved these changes Apr 7, 2021

View reviewed changes

dongjoon-hyun closed this in e5d972e Apr 7, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-34955][SQL] ADD JAR command cannot add jar files which contains whitespaces in the path #32052

[SPARK-34955][SQL] ADD JAR command cannot add jar files which contains whitespaces in the path #32052

sarutak commented Apr 5, 2021 •

edited

SparkQA commented Apr 5, 2021

SparkQA commented Apr 5, 2021

SparkQA commented Apr 5, 2021

dongjoon-hyun left a comment

HyukjinKwon commented Apr 8, 2021

[SPARK-34955][SQL] ADD JAR command cannot add jar files which contains whitespaces in the path #32052

[SPARK-34955][SQL] ADD JAR command cannot add jar files which contains whitespaces in the path #32052

Conversation

sarutak commented Apr 5, 2021 • edited

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

SparkQA commented Apr 5, 2021

SparkQA commented Apr 5, 2021

SparkQA commented Apr 5, 2021

dongjoon-hyun left a comment

Choose a reason for hiding this comment

HyukjinKwon commented Apr 8, 2021

sarutak commented Apr 5, 2021 •

edited