
[SPARK-6144] When in cluster mode using ADD JAR with a hdfs:// sourced ja... #4880

Closed

Conversation

trystanleftwich

...r will fail
While in cluster mode, if you use ADD JAR with an HDFS-sourced jar, it will fail trying to source that jar on the worker nodes with the following error:

@AmplabJenkins

Can one of the admins verify this patch?

@marmbrus
Contributor

marmbrus commented Mar 3, 2015

ok to test

@SparkQA

SparkQA commented Mar 3, 2015

Test build #28246 has started for PR 4880 at commit 5931cc9.

  • This patch merges cleanly.

-      fileOverwrite: Boolean): Unit = {
+      fileOverwrite: Boolean,
+      filename: String = ""): Unit = {
     if (!targetDir.mkdir()) {
Contributor

I'd use Option[String] = None. Then in L648 you can do val targetFile = new File(targetDir, filename.getOrElse(innerPath.getName)).
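A minimal sketch of that suggestion (the helper name and standalone signature are illustrative, not the actual Utils.scala code):

```scala
import java.io.File

// Illustrative only: mirrors the reviewer's Option[String] idea.
// Callers that don't need to rename the file simply omit the argument,
// and the source entry's own name is used as the default.
def resolveTarget(targetDir: File, innerName: String,
                  filename: Option[String] = None): File =
  new File(targetDir, filename.getOrElse(innerName))
```

With `None` the fetched file keeps its source name; passing `Some("renamed.jar")` overrides it, which avoids the sentinel empty string entirely.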

Contributor

+1 to adding some kind of type safety here

@vanzin
Contributor

vanzin commented Mar 3, 2015

LGTM aside from a minor style issue. I also think this should really go into 1.3...

@vanzin
Contributor

vanzin commented Mar 3, 2015

@pwendell adding to your radar.

new File(targetDir, filename)
} else {
new File(targetDir, innerPath.getName)
}
Contributor

Is this correct? If path refers to a directory with multiple files in it, then this will fetch all of those files using the same name, overwriting all but the last one fetched. IIUC we need to distinguish between path being a directory and it being a file at the beginning of this method:

// L641, before the listStatus logic
if (fs.isFile(path)) {
  val targetFile = new File(targetDir, filename.getOrElse(path.getName))
  val in = fs.open(path)
  downloadFile(path.toString, in, targetFile, fileOverwrite)
} else {
  ... // do the listStatus thing we've been doing before
}

where filename should be set if and only if the path refers to a file.
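A local-filesystem analogue of that rule (helper name is illustrative): the filename override applies only when the source is a file, while a directory always keeps its own name.

```scala
import java.io.File

// Illustrative only: the rename applies to files, never to directories,
// matching the "filename should be set iff path refers to a file" rule.
def targetFor(src: File, targetDir: File, filename: Option[String]): File =
  if (src.isFile) new File(targetDir, filename.getOrElse(src.getName))
  else new File(targetDir, src.getName) // a directory keeps its own name
```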

Contributor

I think it's a little weird, but it works. Only the very first call to fetchHcfsFile defines filename. If the path passed to it is a directory, it will recursively call itself without setting filename. If it's a file, it will write the file using the given filename. So even though it could be clearer, the code as is should work.

But I'm ok with making it clearer, exactly to avoid this kind of discussion. :-)

Contributor

If the path passed to it is a directory, it will recursively call itself without setting filename

That's not actually true. If you call listStatus on a directory it will list the directory's contents but not include the directory itself (I just verified this). So if the directory contains multiple files they will all go into the else case in L646 and be renamed to the same thing.
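A quick local-filesystem analogue of that listing behavior (java.io.File standing in for the Hadoop FileSystem API): listing a directory yields its children, never the directory itself.

```scala
import java.io.File
import java.nio.file.Files

// Listing a directory returns only its children; the directory's own
// name is absent, so any filename override would hit every child.
val dir = Files.createTempDirectory("lsdemo").toFile
new File(dir, "a.txt").createNewFile()
new File(dir, "b.txt").createNewFile()
val names = dir.listFiles().map(_.getName).sorted.toSeq
```

This is the same contract Hadoop's listStatus follows, which is why every child would take the else branch and be renamed to the same thing.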

Contributor

Yes, the problem here is that now targetDir is the parent directory of where path should be, and the children are being written directly to that parent path. It needs some code to create this directory corresponding to path before downloading the children.
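A hedged sketch of that fix (helper name is illustrative): when the source path is a directory, first create the matching directory under targetDir, then recurse with it as the new target so children land inside it rather than in the parent.

```scala
import java.io.File

// Illustrative only: create the directory corresponding to the source
// path before downloading its children, then recurse into it.
def ensureChildDir(targetDir: File, dirName: String): File = {
  val newDir = new File(targetDir, dirName)
  if (!newDir.isDirectory && !newDir.mkdirs()) {
    throw new java.io.IOException(s"Failed to create directory $newDir")
  }
  newDir // children are then fetched into newDir, not its parent
}
```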

@andrewor14
Contributor

@trystanleftwich thanks for fixing this. Given the current way we call fetchHcfsFile, I believe your existing patch is sufficient to fix the problem. However, at the risk of being pedantic, I believe it is technically not correct if the path refers to a directory, for the reason I described above. This is only a concern when fetching directories through fetchHcfsFile, though, which is something I don't think we support anyway.

In other words I think this patch in its current state is probably fine to merge, but I'd be interested to hear what others think.

@vanzin
Contributor

vanzin commented Mar 4, 2015

I tried this patch locally, and while it works for addFile(String), it does not seem to work for addFile(String, boolean) (i.e. the version that supports directories). Here's the error I got:

Exception in thread "Driver" org.apache.spark.SparkException: File /dataroot/local/yarn/nm/usercache/systest/appcache/application_1425400850634_0015/userFiles-ed754418-8d53-4e86-a324-738b70fab5cd/spark-files.23921
exists and does not match  contents of hdfs://vanzin-st1-1.vpc.cloudera.com:8020/tmp/spark-files.23921/core-site.xml
        at org.apache.spark.util.Utils$.copyFile(Utils.scala:519)
        at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$downloadFile(Utils.scala:471)
        at org.apache.spark.util.Utils$$anonfun$fetchHcfsFile$1.apply(Utils.scala:654)
        at org.apache.spark.util.Utils$$anonfun$fetchHcfsFile$1.apply(Utils.scala:641)

Let me take a look to see if I figure out what's missing.

@andrewor14
Contributor

Ah, I didn't realize addFile also supports directories for Hadoop file systems. Then this does seem to be a correctness problem.

@SparkQA

SparkQA commented Mar 4, 2015

Test build #28246 has finished for PR 4880 at commit 5931cc9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28246/

@trystanleftwich
Author

I fat-fingered and accidentally closed this ticket, and for some reason it's not picking up that the branch has changes in it. I reopened here:
#4881
