
[SPARK-21549][CORE] Respect OutputFormats with no/invalid output directory provided #19497

Closed
wants to merge 1 commit

Conversation

mridulm
Contributor

@mridulm mridulm commented Oct 14, 2017

What changes were proposed in this pull request?

PR #19294 added support for null paths, but Spark 2.1 also handled other cases where the path argument can be invalid, namely:

  • empty string
  • URI parse exception while creating Path

This is a resubmission of PR #19487, which I messed up while updating my repo.
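
As a sketch of the idea (hedged: the helper name hasValidPath and the exact shape are illustrative, not necessarily the merged change), all three invalid inputs can be caught with a single guard, since new Path rejects null, empty, and unparseable strings alike with IllegalArgumentException:

import scala.util.Try
import org.apache.hadoop.fs.Path

// Illustrative guard: new Path(null), new Path(""), and new Path("::invalid:::")
// all throw IllegalArgumentException, so one Try covers every invalid input.
def hasValidPath(path: String): Boolean = Try { new Path(path) }.isSuccess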

How was this patch tested?

Enhanced the test to cover the new support added.

@mridulm
Contributor Author

mridulm commented Oct 14, 2017

+CC @HyukjinKwon, @steveloughran

Sorry for messing up PR #19487.
The only change in this PR is to use ::invalid:: instead of test: in the test, to address @steveloughran's comment.

Thanks.

@SparkQA

SparkQA commented Oct 14, 2017

Test build #82753 has finished for PR 19497 at commit a319df3.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@HyukjinKwon
Member

Let me take a look with a few tests and get back. Also, I think I should cc @jiangxb1987 too.

@SparkQA

SparkQA commented Oct 14, 2017

Test build #82754 has finished for PR 19497 at commit a319df3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

HyukjinKwon commented Oct 15, 2017

@mridulm, I just checked through the related past changes and verified that the tests pass on branch-2.1.

It seems this PR will actually also allow the cases below:

.saveAsNewAPIHadoopFile[...]("")
.saveAsNewAPIHadoopFile[...]("::invalid:::")

Currently both fail, but it looks like they won't after this PR.

Can not create a Path from an empty string
java.lang.IllegalArgumentException: Can not create a Path from an empty string
	at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
	at org.apache.hadoop.fs.Path.<init>(Path.java:135)
	at org.apache.hadoop.fs.Path.<init>(Path.java:89)
	at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:61)
...
java.net.URISyntaxException: Relative path in absolute URI: ::invalid:::
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: ::invalid:::
	at org.apache.hadoop.fs.Path.initialize(Path.java:206)
	at org.apache.hadoop.fs.Path.<init>(Path.java:172)
	at org.apache.hadoop.fs.Path.<init>(Path.java:89)
	at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:61)
...

I think we should protect these cases.
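
A hedged reconstruction of where the failure happens, inferred from the stack traces above (the method body is a guess, not copied from Spark): absPathStagingDir builds a Path from the raw path string, so empty or unparseable strings throw during job setup even when the OutputFormat never uses a directory at all.

import org.apache.hadoop.fs.Path

// Sketch of HadoopMapReduceCommitProtocol's staging-dir helper, inferred from
// the traces above; `jobId` and `path` stand in for the protocol's fields.
class CommitProtocolSketch(jobId: String, path: String) {
  private def absPathStagingDir: Path = new Path(path, "_temporary-" + jobId)
}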

For the corresponding cases with the old API:

.saveAsHadoopFile[...]("")
.saveAsHadoopFile[...]("::invalid:::")

these look to fail fast (whether or not that was initially intended), and I guess this PR does not affect them:

Can not create a Path from an empty string
java.lang.IllegalArgumentException: Can not create a Path from an empty string
	at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
	at org.apache.hadoop.fs.Path.<init>(Path.java:135)
	at org.apache.spark.internal.io.SparkHadoopWriterUtils$.createPathFromString(SparkHadoopWriterUtils.scala:54)
java.net.URISyntaxException: Relative path in absolute URI: ::invalid:::
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: ::invalid:::
	at org.apache.hadoop.fs.Path.initialize(Path.java:206)
	at org.apache.hadoop.fs.Path.<init>(Path.java:172)
	at org.apache.spark.internal.io.SparkHadoopWriterUtils$.createPathFromString(SparkHadoopWriterUtils.scala:54)
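
These two failure modes can be reproduced without Spark at all; it is the Hadoop Path constructor itself that rejects both inputs (self-contained snippet, assuming hadoop-common on the classpath):

import scala.util.{Failure, Try}
import org.apache.hadoop.fs.Path

// Both inputs fail inside the Path constructor, matching the traces above.
Seq("", "::invalid:::").foreach { p =>
  Try(new Path(p)) match {
    case Failure(e: IllegalArgumentException) => println(s"'$p' rejected: ${e.getMessage}")
    case other                                => println(s"'$p' unexpectedly gave $other")
  }
}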

Member

@HyukjinKwon HyukjinKwon left a comment


If we protect the other cases I mentioned above, I don't see a reason to block this. I don't see any downside to restoring the previous behaviour. I am not sure this fix is the best possible one, but it looks minimal. So LGTM, but @jiangxb1987, I believe it needs your look before we go further.

@mridulm
Contributor Author

mridulm commented Oct 15, 2017

Thanks for taking a deeper look @HyukjinKwon, much appreciated!
I will wait for @jiangxb1987 to also opine before committing; I want to make sure we are not adding incorrect behavior, given that this is a follow-up to an earlier PR (some excellent work by @szhem, btw).

Contributor

@jiangxb1987 jiangxb1987 left a comment


LGTM, thanks for working on this @mridulm !

asfgit pushed a commit that referenced this pull request Oct 16, 2017
[SPARK-21549][CORE] Respect OutputFormats with no/invalid output directory provided

## What changes were proposed in this pull request?

PR #19294 added support for null paths, but Spark 2.1 also handled other cases where the path argument can be invalid, namely:

* empty string
* URI parse exception while creating Path

This is a resubmission of PR #19487, which I messed up while updating my repo.

## How was this patch tested?

Enhanced the test to cover the new support added.

Author: Mridul Muralidharan <mridul@gmail.com>

Closes #19497 from mridulm/master.

(cherry picked from commit 13c1559)
Signed-off-by: Mridul Muralidharan <mridul@gmail.com>
@asfgit asfgit closed this in 13c1559 Oct 16, 2017
@mridulm
Contributor Author

mridulm commented Oct 16, 2017

Thanks for the reviews, everyone!

@HyukjinKwon
Member

@mridulm, BTW, WDYT about disallowing:

.saveAsNewAPIHadoopFile[...]("")
.saveAsNewAPIHadoopFile[...]("::invalid:::")

within the APIs? If I tested this correctly, this PR also allows both cases, but I think we should disallow them, since the API requires a path and overrides it, and saveAsHadoopFile disallows them. If I recall correctly, this was also allowed in branch-2.1, but disallowing it seems to make more sense in any event.

@mridulm
Contributor Author

mridulm commented Oct 16, 2017

@HyukjinKwon My intention was to preserve earlier behavior.
For non-path-based committers in particular, the path variable and its use/processing are not relevant; it makes more sense to ignore that codepath entirely.
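
To illustrate what a non-path-based committer looks like (the class name and details are hypothetical, e.g. a committer writing to an external key-value store): for such committers the path argument is meaningless, which is why tolerating an absent or invalid path matters.

import org.apache.hadoop.mapreduce.{JobContext, OutputCommitter, TaskAttemptContext}

// Hypothetical committer with no filesystem output directory; every hook is a
// no-op because commit happens against an external store, not a Path.
class ExternalStoreCommitter extends OutputCommitter {
  override def setupJob(context: JobContext): Unit = ()
  override def setupTask(context: TaskAttemptContext): Unit = ()
  override def needsTaskCommit(context: TaskAttemptContext): Boolean = false
  override def commitTask(context: TaskAttemptContext): Unit = ()
  override def abortTask(context: TaskAttemptContext): Unit = ()
}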

@mridulm
Contributor Author

mridulm commented Oct 16, 2017

To clarify, we can look at changing the behavior (if required) in the future, but that should be an explicit design choice informed by Hadoop committer design. Until then, we should look to interoperate.

@HyukjinKwon
Member

I support this PR itself of course. I have no problem with this.

I meant a separate (soft) question about having saveAsNewAPIHadoopFile (not saveAsNewAPIHadoopDataset) validate the path parameter, which we take explicitly in saveAsNewAPIHadoopFile.

@mridulm
Contributor Author

mridulm commented Oct 16, 2017

saveAsNewAPIHadoopFile simply delegates to saveAsNewAPIHadoopDataset (with some options set), right? The behavior would be similar?

Do you mean saveAsHadoopDataset instead?
I did not change behavior there, since the exception was getting raised from within Hadoop code and not from our code (when we pass invalid values), and it preserves behavior from earlier code.
I was focused more on the regression introduced.
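
A hedged sketch of the delegation being discussed (heavily simplified; the real methods live on PairRDDFunctions and also take key/value/OutputFormat classes): the file variant only injects the output-directory property before handing off to the dataset variant, which is why their behavior ends up similar.

import org.apache.hadoop.conf.Configuration

// Simplified stand-ins for the two RDD save methods under discussion.
def saveAsNewAPIHadoopDataset(conf: Configuration): Unit = { /* commit-protocol machinery */ }

def saveAsNewAPIHadoopFile(path: String, conf: Configuration): Unit = {
  conf.set("mapreduce.output.fileoutputformat.outputdir", path)
  saveAsNewAPIHadoopDataset(conf)
}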

@HyukjinKwon
Member

HyukjinKwon commented Oct 16, 2017

I meant saveAsNewAPIHadoopFile compared to saveAsHadoopFile.

saveAsNewAPIHadoopFile[...]("") // succeeds
saveAsHadoopFile[...]("") // fails

Can not create a Path from an empty string
java.lang.IllegalArgumentException: Can not create a Path from an empty string
	at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
	at org.apache.hadoop.fs.Path.<init>(Path.java:135)
	at org.apache.spark.internal.io.SparkHadoopWriterUtils$.createPathFromString(SparkHadoopWriterUtils.scala:54)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1066)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)

I wanted to talk about this. saveAsHadoopFile seems to fail fast within saveAsHadoopFile itself, specifically before the delegation:

SparkHadoopWriterUtils.createPathFromString(path, hadoopConf))

So I suspected saveAsNewAPIHadoopFile should also throw an exception in this way: saveAsHadoopFile validates the path, so I thought saveAsNewAPIHadoopFile should also validate it.

/**
 * Output the RDD to any Hadoop-supported file system, using a Hadoop `OutputFormat` class
 * supporting the key and value types K and V in this RDD. Compress with the supplied codec.
 */
def saveAsHadoopFile(
    path: String,

/**
 * Output the RDD to any Hadoop-supported file system, using a new Hadoop API `OutputFormat`
 * (mapreduce.OutputFormat) object supporting the key and value types K and V in this RDD.
 */
def saveAsNewAPIHadoopFile(
    path: String,

@HyukjinKwon
Member

I agree this was focused more on the regression introduced and should be good enough already; I am talking about a different thing, a behaviour change.

Let me organise my idea and try to file a JIRA later. I think strictly this is separate anyway, if I haven't missed something or am not simply wrong.

@mridulm
Contributor Author

mridulm commented Oct 17, 2017

@HyukjinKwon Thanks for clarifying.

The way I look at it is: saveAsHadoopFile explicitly refers to "Output the RDD to any Hadoop-supported file system" in its description (and name), and so a valid Path is a reasonable requirement.

Additionally, in createPathFromString we explicitly throw IllegalArgumentException for path == null (new Path will do the same now, but I think this changed in the past, when it used to result in an NPE?).
The subsequent val outputPath = new Path(path) will do the same for other invalid input paths.
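
A hedged paraphrase of that logic (reconstructed from the description and stack traces above, not copied from Spark):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapred.JobConf

// Null gets an explicit IllegalArgumentException; `new Path(path)` itself
// rejects empty strings and unparseable URIs with the same exception type.
def createPathFromString(path: String, conf: JobConf): Path = {
  if (path == null) {
    throw new IllegalArgumentException("Output path is null")
  }
  val outputPath = new Path(path) // throws for "" and malformed URIs
  val fs = outputPath.getFileSystem(conf)
  outputPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
}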

In contrast, saveAsHadoopDataset is not tied to a file system but to "Output the RDD to any Hadoop-supported storage system", where the output being a valid Path is not a requirement.

Having said that, we can always iterate in a JIRA if you feel there is some confusion; it is always better to be explicitly clear about the interfaces we expose and support!
Thanks.

@HyukjinKwon
Member

Thank you @mridulm. I regret that I raised this here, causing confusion. Let's talk more in another place. I will cc you (and @jiangxb1987) when I happen to file a JIRA or see a similar issue related to this.

@steveloughran
Contributor

I guess one aspect of saveAsNewAPIHadoopFile is that it calls jobConfiguration.set("mapreduce.output.fileoutputformat.outputdir", path), and Configuration.set(String key, String value) has a check for null key or value.

If handling of paths is to be done in the committer, saveAsNewAPIHadoopFile should really be looking at path and calling jobConfiguration.unset("mapreduce.output.fileoutputformat.outputdir") if path == null.

Looking at how Hadoop's FileOutputFormat implementations work, they can handle a null/undefined output dir property, but not an empty one.

public static Path getOutputPath(JobContext job) {
  String name = job.getConfiguration().get(FileOutputFormat.OUTDIR);
  return name == null ? null : new Path(name);
}

This implies that saveAsNewAPIHadoopFile("") might want to unset the config option too, offloading the problem of what happens on an empty path to the committer. Though I'd recommend checking what meaningful exceptions actually get raised in this situation when the committer is the normal FileOutputFormat/FileOutputCommitter setup.
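
In Scala, that suggestion sketches out roughly as follows (an assumption about a possible change, not merged behaviour): clear the property when no usable path is supplied, so FileOutputFormat sees it as undefined (null) rather than empty, matching getOutputPath's null handling above.

import org.apache.hadoop.conf.Configuration

// Unset rather than set-to-empty, so getOutputPath returns null instead of
// constructing a Path from "" and throwing.
def setOrUnsetOutputDir(conf: Configuration, path: String): Unit = {
  if (path == null || path.isEmpty) {
    conf.unset("mapreduce.output.fileoutputformat.outputdir")
  } else {
    conf.set("mapreduce.output.fileoutputformat.outputdir", path)
  }
}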

MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
[SPARK-21549][CORE] Respect OutputFormats with no/invalid output directory provided