Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-34955][SQL] ADD JAR command cannot add jar files which contains whitespaces in the path #32052

Closed
wants to merge 1 commit into from

Conversation

sarutak
Copy link
Member

@sarutak sarutak commented Apr 5, 2021

What changes were proposed in this pull request?

This PR fixes an issue that ADD JAR command can't add jar files which contain whitespaces in the path though ADD FILE and ADD ARCHIVE work with such files.

If we have /some/path/test file.jar and execute the following command:

ADD JAR "/some/path/test file.jar";

The following exception is thrown.

21/04/05 10:40:38 ERROR SparkSQLDriver: Failed in [add jar "/some/path/test file.jar"]
java.lang.IllegalArgumentException: Illegal character in path at index 9: /some/path/test file.jar
	at java.net.URI.create(URI.java:852)
	at org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:129)
	at org.apache.spark.sql.execution.command.AddJarCommand.run(resources.scala:34)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)

This is because HiveSessionStateBuilder and SessionStateBuilder don't check whether the form of the path is URI or plain path and it always regards the path as URI form.
Whitespces should be encoded to %20 so /some/path/test file.jar is rejected.
We can resolve this part by checking whether the given path is URI form or not.

Unfortunatelly, if we fix this part, another problem occurs.
When we execute ADD JAR command, Hive's ADD JAR command is executed in HiveClientImpl.addJar and AddResourceProcessor.run is transitively invoked.
In AddResourceProcessor.run, the command line is just split by s+ and the path is also split into /some/path/test and file.jar and passed to ss.add_resources.
https://github.com/apache/hive/blob/f1e87137034e4ecbe39a859d4ef44319800016d7/ql/src/java/org/apache/hadoop/hive/ql/processors/AddResourceProcessor.java#L56-L75
So, the command still fails.

Even if we convert the form of the path to URI like file:/some/path/test%20file.jar and execute the following command:

ADD JAR "file:/some/path/test%20file";

The following exception is thrown.

21/04/05 10:40:53 ERROR SessionState: file:/some/path/test%20file.jar does not exist
java.lang.IllegalArgumentException: file:/some/path/test%20file.jar does not exist
	at org.apache.hadoop.hive.ql.session.SessionState.validateFiles(SessionState.java:1168)
	at org.apache.hadoop.hive.ql.session.SessionState$ResourceType.preHook(SessionState.java:1289)
	at org.apache.hadoop.hive.ql.session.SessionState$ResourceType$1.preHook(SessionState.java:1278)
	at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1378)
	at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1336)
	at org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:74)

The reason is Utilities.realFile invoked in SessionState.validateFiles returns null as the result of fs.exists(path) is false.
https://github.com/apache/hive/blob/f1e87137034e4ecbe39a859d4ef44319800016d7/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1052-L1064

fs.exists checks the existence of the given path by comparing the string representation of Hadoop's Path.
The string representation of Path is similar to URI but it's actually different.
Path doesn't encode the given path.
For example, the URI form of /some/path/jar file.jar is file:/some/path/jar%20file.jar but the Path form of it is file:/some/path/jar file.jar. So fs.exists returns false.

So the solution I come up with is removing Hive's ADD JAR from HiveClientimpl.addJar.
I think Hive's ADD JAR was used to add jar files to the class loader for metadata and isolate the class loader from the one for execution.
https://github.com/apache/spark/pull/6758/files#diff-cdb07de713c84779a5308f65be47964af865e15f00eb9897ccf8a74908d581bbR94-R103

But, as of SPARK-10810 and SPARK-10902 (#8909) are resolved, the class loaders for metadata and execution seem to be isolated with different way.
https://github.com/apache/spark/pull/8909/files#diff-8ef7cabf145d3fe7081da799fa415189d9708892ed76d4d13dd20fa27021d149R635-R641

In the current implementation, such class loaders seem to be isolated by SharedState.jarClassLoader and IsolatedClientLoader.classLoader.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala#L173-L188
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L956-L967

So I wonder we can remove Hive's ADD JAR from HiveClientImpl.addJar.

Why are the changes needed?

This is a bug.

Does this PR introduce any user-facing change?

How was this patch tested?

@github-actions github-actions bot added the SQL label Apr 5, 2021
@SparkQA
Copy link

SparkQA commented Apr 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41480/

@SparkQA
Copy link

SparkQA commented Apr 5, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41480/

@SparkQA
Copy link

SparkQA commented Apr 5, 2021

Test build #136903 has finished for PR 32052 at commit c88104e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @sarutak . Merged to master for Apache Spark 3.2.0.
cc @HyukjinKwon and @wangyum

@HyukjinKwon
Copy link
Member

I agree with this change. Thanks @sarutak and @dongjoon-hyun

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants