Skip to content

Conversation

@jinxing64
Copy link

What changes were proposed in this pull request?

In our cluster, there are lots of UDF jars, some of them have the same filename but different path, for example:

hdfs://A/B/udf.jar  -> udfA
hdfs://C/D/udf.jar  -> udfB

When user uses udfA and udfB in same sql, executor will fetch both hdfs://A/B/udf.jar and hdfs://C/D/udf.jar to local. There will be a conflict for the same name.

Can we config to fetch jars and save with a filename with MD5 prefix, so there will be no conflict.

How was this patch tested?

UT

@SparkQA
Copy link

SparkQA commented Mar 13, 2018

Test build #88208 has finished for PR 20812 at commit f78c273.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jinxing64
Copy link
Author

@vanzin @zsxwing @jerryshao
How do you think about this ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we still need the localName here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I think we don't need it . The File returned by Utils.fetchFile is the local file. We don't need localName to initialize the local file here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about s"${DigestUtils.md5Hex(url)}-${decodeFileNameInURI(new URI(url))}"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we have this conf in internal/conf ?

@jiangxb1987
Copy link
Contributor

The idea looks good, just a few comments.

@jinxing64
Copy link
Author

@jiangxb1987
Thanks a lot for review. I will refine soon !

@jerryshao
Copy link
Contributor

Does it only fix the jars added by sc.addJar or using non-yarn mode? Because yarn uses distributed cache at start, so it has a different code path, right?

@jinxing64
Copy link
Author

@jerryshao
Thanks for comment;
Yes, this change is only for sc.addJar and the jars will be named with a prefix when executor updateDependencies.

@SparkQA
Copy link

SparkQA commented Mar 16, 2018

Test build #88303 has finished for PR 20812 at commit 4473878.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 16, 2018

Test build #88302 has finished for PR 20812 at commit 4ac1e8e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao
Copy link
Contributor

jerryshao commented Mar 29, 2018

@jinxing64 , I think using same name jars which contains different classes seems practically not a best practice. Ideally different udfs should be packaged in different jars with different name/version. That will be easy for user to manage. Also same name jars could easily cause classpath issue usually.

As you always has a workaround for this issue out of Spark. So I would suggest not to fix it, since this is a quite user specific issue.

@jinxing64
Copy link
Author

jinxing64 commented Mar 30, 2018

@jerryshao
Understood, Ideally different udfs should be packaged in different jars with different name/version. True. But we are faced with tons of udf/jars migrating from other engine. I made this pr not just for merging, but also want to know you experts ideas.

@jiangxb1987 @jerryshao Thanks again for your comments.

@jiangxb1987
Copy link
Contributor

Should we close this then? @jinxing64 @jerryshao

@jinxing64 jinxing64 closed this Apr 8, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants