
[SPARK-18910][SQL] Resolve failure to use a UDF whose jar file is on HDFS #16324

Closed · wants to merge 1 commit

Conversation

@shenh062326 (Contributor) commented Dec 17, 2016

What changes were proposed in this pull request?

When I create a UDF whose jar file is on HDFS, I can't use the UDF.

spark-sql> create function trans_array as 'com.test.udf.TransArray' using jar 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar';
spark-sql> describe function trans_array;
Function: test_db.trans_array
Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray
Usage: N/A.
Time taken: 0.127 seconds, Fetched 3 row(s)
spark-sql> select trans_array(1, '\|', id, position) as (id0, position0) from test_spark limit 10;
Error in query: Undefined function: 'trans_array'. This function is neither a registered temporary function nor a permanent function registered in the database 'test_db'.; line 1 pos 7

The reason is that when org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource runs, uri.toURL throws an exception with "failed unknown protocol: hdfs":

def addJar(path: String): Unit = {
  sparkSession.sparkContext.addJar(path)
  val uri = new Path(path).toUri
  val jarURL = if (uri.getScheme == null) {
    // `path` is a local file path without a URL scheme
    new File(path).toURI.toURL
  } else {
    // `path` is a URL with a scheme
    uri.toURL
  }
  jarClassLoader.addURL(jarURL)
  Thread.currentThread().setContextClassLoader(jarClassLoader)
}
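The failure can be reproduced without Spark at all. This is a minimal sketch (the host and jar path below are made up) showing that a stock JVM has no stream handler for the hdfs scheme, so the uri.toURL call above throws:

```scala
import java.net.{MalformedURLException, URI}

// Without a registered URLStreamHandlerFactory that knows "hdfs",
// URI.toURL (which the addJar code above calls) throws
// MalformedURLException: unknown protocol: hdfs.
val uri = new URI("hdfs://host1:9000/spark/dev/share/libs/some-udf.jar")
val failed =
  try { uri.toURL; false }
  catch { case _: MalformedURLException => true }
println(failed) // prints: true
```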

I think we should call URL.setURLStreamHandlerFactory with an instance of FsUrlStreamHandlerFactory, just like:

static {
  URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
}
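As a self-contained illustration of why registering a factory fixes the parse error (a made-up "demo" scheme stands in for "hdfs" here, so no Hadoop dependency is needed; the actual proposal installs Hadoop's FsUrlStreamHandlerFactory):

```scala
import java.net.{URL, URLConnection, URLStreamHandler, URLStreamHandlerFactory}

// Once a factory that recognizes a scheme is installed, new URL(...) for
// that scheme stops throwing MalformedURLException.
URL.setURLStreamHandlerFactory(new URLStreamHandlerFactory {
  def createURLStreamHandler(protocol: String): URLStreamHandler =
    if (protocol == "demo") new URLStreamHandler {
      // Parsing only needs the handler to exist; opening is irrelevant here.
      def openConnection(u: URL): URLConnection = null
    } else null // null defers to the JVM's built-in handlers (http, file, ...)
})

val url = new URL("demo://host1:9000/libs/some-udf.jar")
println(url.getProtocol) // prints: demo
```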

How was this patch tested?

I have tested it in my cluster.

@shenh062326 shenh062326 changed the title from "Resolve faile to use UDF that jar file in hdfs." to "[SPARK-18910][SQL]Resolve faile to use UDF that jar file in hdfs." on Dec 17, 2016
@@ -2373,6 +2373,7 @@ class SparkContext(config: SparkConf) extends Logging {
* various Spark features.
*/
object SparkContext extends Logging {
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory())
Member:

SparkContext might not be a good place; object SharedState might be better.

In addition, you also need to add a comment before this line explaining why it is done here.

Contributor Author:

Thanks for your comment; I will add it to SharedState.

@gatorsmile (Member):

Please add the description from the JIRA to the PR description. FYI, you can still edit the description after creating the PR.


SparkQA commented Dec 17, 2016

Test build #70296 has finished for PR 16324 at commit a91b08f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member):

This method can only be called once per JVM, so it is typically executed in a static block. This limitation means that if some other part of your program (perhaps a third-party component outside your control) sets a URLStreamHandlerFactory, you won't be able to use this approach for reading data from Hadoop.
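The once-per-JVM limitation is easy to demonstrate in isolation. A minimal sketch with a no-op factory:

```scala
import java.net.{URL, URLStreamHandler, URLStreamHandlerFactory}

// URL.setURLStreamHandlerFactory is a one-shot, process-wide setting:
// a second call throws java.lang.Error ("factory already defined").
val noopFactory = new URLStreamHandlerFactory {
  def createURLStreamHandler(protocol: String): URLStreamHandler = null
}
URL.setURLStreamHandlerFactory(noopFactory)
val secondCallRejected =
  try { URL.setURLStreamHandlerFactory(noopFactory); false }
  catch { case _: Error => true }
println(secondCallRejected) // prints: true
```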

@gatorsmile (Member):

This might not be the right solution, as explained above.

@shenh062326 (Contributor Author):

Should we download the UDF jar from HDFS?


gatorsmile commented Dec 17, 2016

First, I am not sure whether we should support reading UDF jars from HDFS. cc @rxin

Second, if we want to support it, the best reviewers are @zsxwing and @tdas; they recently added the file HDFSMetadataLog.scala.

@shenh062326 (Contributor Author):

Currently, we can create a UDF with a jar on HDFS, but it fails when used. The Spark driver won't download the jar from HDFS; it only adds the path to the class loader. So either we support reading UDF jars from HDFS, or we download the jar to the driver. I think supporting reading UDF jars from HDFS is better.


rxin commented Dec 17, 2016

This is to allow using jars referenced via the HDFS API, not just HDFS, right? In that case it sounds like a good idea too, but we need a test case for it.

@shenh062326 (Contributor Author):

I'm sorry, @rxin, I don't understand what you mean.


rxin commented Dec 18, 2016

I was saying we need to create a test case for this change.


HyukjinKwon commented Feb 9, 2017

(gentle ping @shenh062326 )

melin commented Mar 16, 2017

In Spark 2.1.0:
0: jdbc:hive2://localhost:10000> add jar hdfs:///user/datacompute/datacompute-udf-1.0-
Error: java.net.MalformedURLException: unknown protocol: hdfs (state=,code=0)
0: jdbc:hive2://localhost:10000>

ghost pushed a commit to dbtsai/spark that referenced this pull request Apr 26, 2017
## What changes were proposed in this pull request?
Spark 2.2 is going to be cut; it'll be great if SPARK-12868 can be resolved before that. There have been several PRs for this, like [PR#16324](apache#16324), but all of them have been inactive for a long time or have been closed.

This PR adds a SparkUrlStreamHandlerFactory, which relies on the URL protocol to choose the appropriate URLStreamHandlerFactory (such as FsUrlStreamHandlerFactory) to create the URLStreamHandler.

## How was this patch tested?
1. Add a new unit test.
2. Check manually.
Before: throws an exception with "failed unknown protocol: hdfs"
<img width="914" alt="screen shot 2017-03-17 at 9 07 36 pm" src="https://cloud.githubusercontent.com/assets/8546874/24075277/5abe0a7c-0bd5-11e7-900e-ec3d3105da0b.png">

After:
<img width="1148" alt="screen shot 2017-03-18 at 11 42 18 am" src="https://cloud.githubusercontent.com/assets/8546874/24075283/69382a60-0bd5-11e7-8d30-d9405c3aaaba.png">

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes apache#17342 from weiqingy/SPARK-18910.
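The delegating-factory approach in that follow-up PR can be sketched without Hadoop. DelegatingFactory and the "toyfs" scheme below are made-up illustrations; the real SparkUrlStreamHandlerFactory dispatches to Hadoop's FsUrlStreamHandlerFactory for filesystem schemes like hdfs:

```scala
import java.net.{URL, URLConnection, URLStreamHandler, URLStreamHandlerFactory}

// Dispatch on the protocol; returning null defers to the JVM's built-in
// handlers, so http, file, etc. keep working.
class DelegatingFactory(delegates: Map[String, URLStreamHandlerFactory])
    extends URLStreamHandlerFactory {
  def createURLStreamHandler(protocol: String): URLStreamHandler =
    delegates.get(protocol).map(_.createURLStreamHandler(protocol)).orNull
}

// Toy factory for a made-up "toyfs" scheme, standing in for Hadoop's
// FsUrlStreamHandlerFactory and its "hdfs" scheme.
val toyFactory = new URLStreamHandlerFactory {
  def createURLStreamHandler(protocol: String): URLStreamHandler =
    new URLStreamHandler {
      def openConnection(u: URL): URLConnection = null
    }
}

URL.setURLStreamHandlerFactory(new DelegatingFactory(Map("toyfs" -> toyFactory)))
println(new URL("toyfs://host/a.jar").getProtocol) // prints: toyfs
println(new URL("http://host/a.jar").getProtocol)  // prints: http
```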
asfgit pushed a commit that referenced this pull request Apr 26, 2017

(cherry picked from commit 2ba1eba)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>