[SPARK-21549][CORE] Respect OutputFormats with no output directory provided #19294
Conversation
…ted to an absolute output location in case of custom output formats
…ted to an absolute location - reformatting imports
```scala
pairs.saveAsNewAPIHadoopDataset(jobConfiguration)
} finally {
  // close to prevent filesystem caching across different tests
  fs.close()
```
Avoid this. Either use `FileSystem.newInstance()` or skip the close. Given you aren't playing with low-level FS options, it's faster and more efficient to reuse.
I was counting on indirect filesystem caching, so that it was exactly the same in the tests as in `SparkHadoopWriter`, and calling `newInstance` prevents us from such a possibility. Currently I've updated the PR to not use the filesystem at all.
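The caching hazard being discussed can be illustrated without Hadoop at all. The sketch below is plain Scala with hypothetical names (`Handle`, `HandleCache`); it only models the behaviour of Hadoop's `FileSystem.get` (shared, cached instance per key) versus `FileSystem.newInstance` (private instance), which is why calling `close()` on a cached handle is dangerous:

```scala
import scala.collection.mutable

// A plain Handle standing in for an org.apache.hadoop.fs.FileSystem.
class Handle {
  var closed = false
  def close(): Unit = { closed = true }
}

object HandleCache {
  private val cache = mutable.Map.empty[String, Handle]
  // Like FileSystem.get: returns the shared, cached instance for a key.
  def get(key: String): Handle = cache.getOrElseUpdate(key, new Handle)
  // Like FileSystem.newInstance: returns a private, uncached instance.
  def newInstance(key: String): Handle = new Handle
}

val a = HandleCache.get("hdfs://ns1")
val b = HandleCache.get("hdfs://ns1") // same shared instance as `a`
a.close()
println(b.closed) // the other holder's handle is now closed too
```

Closing the shared instance invalidates it for every other holder in the JVM, while `newInstance` hands out a handle the caller owns and may close freely.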
```scala
pairs.saveAsHadoopDataset(conf)
} finally {
  // close to prevent filesystem caching across different tests
  fs.close()
```
again, you don't need this.
I've updated the PR to not use the filesystem at all.
…s with absolute names to rename in addedAbsPathFiles
ok to test
cc @jiangxb1987 who I believe is interested in this. Without a super close look, it looks like it makes sense. The actual problem here indeed looks to be:

- `spark/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala`, line 39 (at e47f48c)
- `spark/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala`, line 58 (at e47f48c)
Test build #82015 has finished for PR 19294 at commit
```scala
val fs = absPathStagingDir.getFileSystem(jobContext.getConfiguration)
for ((src, dst) <- filesToMove) {
  fs.rename(new Path(src), new Path(dst))
if (addedAbsPathFiles != null && addedAbsPathFiles.nonEmpty) {
```
Please consider using a common method instead of duplicating the code in the 2 if statements.
Introduced a method:

```scala
/**
 * Checks whether there are files to be committed to an absolute output location.
 */
private def hasAbsPathFiles: Boolean = addedAbsPathFiles != null && addedAbsPathFiles.nonEmpty
```
…les to be committed to an absolute output location
Test build #82020 has finished for PR 19294 at commit
```scala
}
fs.delete(absPathStagingDir, true)
```
Given the changes being made here, it seems a good place to add the suggestion of SPARK-20045 and make that `abort()` call resilient to failures, by doing that delete even if the Hadoop committer raised an IOE.
Wouldn't it be better to fix it in a separate PR?
Can do. Now you've got a little mock committer in, someone can just extend it to optionally throw an IOE in `abort()`.
```scala
/**
 * Checks whether there are files to be committed to an absolute output location.
 */
private def hasAbsPathFiles: Boolean = addedAbsPathFiles != null && addedAbsPathFiles.nonEmpty
```
When `addedAbsPathFiles` is null and when it is not is slightly confusing. Can we move to using an `Option[Map[String, String]]` instead? In the earlier code it had to always be non-null; but now it becomes optional.
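A minimal sketch of the `Option`-based variant being suggested, in plain Scala with illustrative names (this is not the code the PR finally adopted): the "never populated on the driver" invariant becomes visible in the type rather than hiding behind a null check.

```scala
import scala.collection.mutable

class OptionVariant {
  // None on the driver; populated lazily on executors as tasks
  // register files destined for absolute output locations.
  private var addedAbsPathFiles: Option[mutable.Map[String, String]] = None

  def addFile(src: String, dst: String): Unit = {
    val files = addedAbsPathFiles.getOrElse {
      val fresh = mutable.Map.empty[String, String]
      addedAbsPathFiles = Some(fresh)
      fresh
    }
    files += (src -> dst)
  }

  // No null check needed: an absent or empty map reads the same way.
  def hasAbsPathFiles: Boolean = addedAbsPathFiles.exists(_.nonEmpty)
}
```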
Good catch, thank you!

According to the `FileCommitProtocol`, `addedAbsPathFiles` is always null on the driver, so we will not be able to commit or remove these files. Replaced it with:

```scala
private def hasAbsPathFiles: Boolean = path != null
```
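A minimal plain-Scala sketch (illustrative names, not the actual `HadoopMapReduceCommitProtocol`) of why the guard moves to the output path: on the driver the file map is never populated, so only the path can tell whether absolute-path commits are in play.

```scala
class GuardSketch(path: String) {
  // Per FileCommitProtocol, this map is never populated on the driver,
  // so it cannot be used there to decide anything.
  private val addedAbsPathFiles: Map[String, String] = null

  // The output path, by contrast, is available on driver and executors.
  def hasAbsPathFiles: Boolean = path != null
}

val external = new GuardSketch(null)            // e.g. a Cassandra-style format
val hdfsLike = new GuardSketch("hdfs://nn/out") // a filesystem-backed format
```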
IMO it should be fine to not provide an output directory if you are not using absolute output paths. I also don't think we should always create absolute output paths in
+CC @weiqingy. @szhem, incorporating a test for the SQL part will also help in this matter. [1] SHC will need the workaround even if this issue is resolved (since 2.2 has been released with this bug).
As I play with commit logic all the way through the stack, I can't help thinking everyone's lives would be better if we tagged the MRv1 commit APIs as deprecated in Hadoop 3, and uses of the commit protocols went fully onto the v2 committers: one codepath to get confused by, half as much complexity. The issue with the custom stuff is inevitably Hive related, isn't it? It's always liked to scatter data around a filesystem and pretend it's a single dataset.
What should be the expected behaviour in case of SQL?
@szhem You are correct, currently it fails in the driver itself. With this PR, the job submission should succeed, but the subsequent execution in SQL could fail (since SQL uses some of the methods which have not been patched in this PR if I am not wrong: `newTaskTempFileAbsPath`, `newTaskTempFile`, etc.). A testcase to validate successful writes from a datasource in Spark SQL would clarify things.
… an absolute output location by means of checking whether the output path specified
…ted by the changes
Test build #82131 has finished for PR 19294 at commit
@mridulm Updated
…er.write in tests
…ain hadoop conf, etc.
Test build #82130 has finished for PR 19294 at commit
Test build #82133 has finished for PR 19294 at commit
If In our Spark SQL code path, how can
@gatorsmile I believe that in the Spark SQL code path

The interesting part is that Hadoop's `FileOutputCommitter` allows null output paths, and the line you highlighted is executed only in case of

So there may be a chance that someone would like to use a custom implementation of
Test build #82411 has finished for PR 19294 at commit
retest this please
@szhem that null path support in

I don't think that's a codepath Spark goes near; in the normal execution paths,

(disclaimer: the more I read of that code, the less I understand it. Do not treat my opinions as normative in any way)
Test build #82412 has finished for PR 19294 at commit
@gatorsmile have your concerns been addressed? If yes, I will merge this into master and 2.2.1. This patch is clearly better than the existing state for 2.2 and master, for Spark core and some of the data sources I tested with.
Since this is not related to Spark SQL, please do not add the test cases to the Spark SQL side.
@gatorsmile Sounds good. @szhem, can we remove the Spark SQL tests you added (due to my request)?
…cted by the patch
Test build #82504 has finished for PR 19294 at commit
@mridulm SQL-related tests were removed.
LGTM
```scala
/**
 * Checks whether there are files to be committed to an absolute output location.
 *
 * As the committing and aborting the job occurs on driver where `addedAbsPathFiles` is always
```
`As the committing` -> `As committing`
done
```scala
 *
 * As the committing and aborting the job occurs on driver where `addedAbsPathFiles` is always
 * null, it is necessary to check whether the output path is specified, that may not be the case
 * for committers not writing to distributed file systems.
```
This also has a grammar issue, and it is not clear either.
How about:

> As committing and aborting a job occurs on driver, where `addedAbsPathFiles` is always null, it is necessary to check whether the output path is specified. Output path may not be required for committers not writing to distributed file systems.
This is much better.
Thanks a lot, guys! I've just updated the comment
Test build #82529 has finished for PR 19294 at commit
…ovided

## What changes were proposed in this pull request?

Fix for the https://issues.apache.org/jira/browse/SPARK-21549 JIRA issue.

Since version 2.2 Spark does not respect an OutputFormat with no output paths provided. Examples of such formats are [Cassandra OutputFormat](https://github.com/finn-no/cassandra-hadoop/blob/08dfa3a7ac727bb87269f27a1c82ece54e3f67e6/src/main/java/org/apache/cassandra/hadoop2/AbstractColumnFamilyOutputFormat.java), [Aerospike OutputFormat](https://github.com/aerospike/aerospike-hadoop/blob/master/mapreduce/src/main/java/com/aerospike/hadoop/mapreduce/AerospikeOutputFormat.java), etc., which do not have an ability to roll back results written to external systems on job failure.

The provided output directory is required by Spark to allow files to be committed to an absolute output location; that is not the case for output formats which write data to external systems.

This pull request prevents accessing `absPathStagingDir`, which causes the error described in SPARK-21549, unless there are files to rename in `addedAbsPathFiles`.

## How was this patch tested?

Unit tests

Author: Sergey Zhemzhitsky <szhemzhitski@gmail.com>

Closes #19294 from szhem/SPARK-21549-abs-output-commits.

(cherry picked from commit 2030f19) Signed-off-by: Mridul Muralidharan <mridul@gmail.com>
Thanks for the fix @szhem, great work!
…ctory provided

## What changes were proposed in this pull request?

PR #19294 added support for nulls, but Spark 2.1 handled other error cases where the path argument can be invalid. Namely:

* empty string
* URI parse exception while creating Path

This is a resubmission of PR #19487, which I messed up while updating my repo.

## How was this patch tested?

Enhanced test to cover the new support added.

Author: Mridul Muralidharan <mridul@gmail.com>

Closes #19497 from mridulm/master.

(cherry picked from commit 13c1559) Signed-off-by: Mridul Muralidharan <mridul@gmail.com>
What changes were proposed in this pull request?

Fix for the https://issues.apache.org/jira/browse/SPARK-21549 JIRA issue.

Since version 2.2 Spark does not respect an OutputFormat with no output paths provided. Examples of such formats are Cassandra OutputFormat, Aerospike OutputFormat, etc., which do not have an ability to roll back results written to external systems on job failure.

The provided output directory is required by Spark to allow files to be committed to an absolute output location; that is not the case for output formats which write data to external systems.

This pull request prevents accessing `absPathStagingDir`, which causes the error described in SPARK-21549, unless there are files to rename in `addedAbsPathFiles`.

How was this patch tested?

Unit tests
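The overall shape of the fix can be sketched in plain Scala (simplified and hypothetical names; not the actual `HadoopMapReduceCommitProtocol` code): `commitJob` only touches the absolute-path staging directory when there is something to move there.

```scala
class CommitSketch(path: String) {
  // The guard this PR introduces: usable on the driver, unlike the
  // addedAbsPathFiles map, which is only populated on executors.
  private def hasAbsPathFiles: Boolean = path != null

  // Returns the renames that would be performed against the staging dir.
  def commitJob(filesToMove: Map[String, String]): Seq[(String, String)] =
    if (hasAbsPathFiles) {
      // Only here would the real protocol resolve absPathStagingDir,
      // rename the files into place, and delete the staging directory.
      filesToMove.toSeq
    } else {
      // External-system formats (Cassandra, Aerospike, ...) provide no
      // output directory, so the staging-dir path is never touched and
      // the SPARK-21549 failure cannot occur.
      Seq.empty
    }
}
```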