[SPARK-21549][CORE] Respect OutputFormats with no output directory provided #19294
Conversation
…ted to an absolute output location in case of custom output formats
…ted to an absolute location - reformatting imports
```scala
pairs.saveAsNewAPIHadoopDataset(jobConfiguration)
} finally {
  // close to prevent filesystem caching across different tests
  fs.close()
```
Avoid this. Either use `FileSystem.newInstance()` or skip the close. Given you aren't playing with low-level FS options, it's faster and more efficient to reuse.
I was counting on indirect filesystem caching, so that it was exactly the same in the tests as in `SparkHadoopWriter`, and calling `newInstance` prevents us from such a possibility. Currently I've updated the PR to not use the filesystem at all.
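The caching hazard being discussed can be illustrated without Hadoop at all. The sketch below is plain Scala with hypothetical names (`Handle`, `HandleCache`); it only models the behaviour of Hadoop's `FileSystem.get` (shared, cached instance per key) versus `FileSystem.newInstance` (private instance), which is why calling `close()` on a cached handle is dangerous:

```scala
import scala.collection.mutable

// A plain Handle standing in for an org.apache.hadoop.fs.FileSystem.
class Handle {
  var closed = false
  def close(): Unit = { closed = true }
}

object HandleCache {
  private val cache = mutable.Map.empty[String, Handle]
  // Like FileSystem.get: returns the shared, cached instance for a key.
  def get(key: String): Handle = cache.getOrElseUpdate(key, new Handle)
  // Like FileSystem.newInstance: returns a private, uncached instance.
  def newInstance(key: String): Handle = new Handle
}

val a = HandleCache.get("hdfs://ns1")
val b = HandleCache.get("hdfs://ns1") // same shared instance as `a`
a.close()
println(b.closed) // the other holder's handle is now closed too
```

Closing the shared instance invalidates it for every other holder in the JVM, while `newInstance` hands out a handle the caller owns and may close freely.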
```scala
pairs.saveAsHadoopDataset(conf)
} finally {
  // close to prevent filesystem caching across different tests
  fs.close()
```
again, you don't need this.
I've updated the PR to not use the filesystem at all.
…s with absolute names to rename in addedAbsPathFiles
ok to test
cc @jiangxb1987 who I believe is interested in this. Without a super close look, it looks like it makes sense. The actual problem here indeed looks to be:

- `spark/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala`, line 39 (at e47f48c)
- `spark/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala`, line 58 (at e47f48c)
Test build #82015 has finished for PR 19294 at commit
```scala
val fs = absPathStagingDir.getFileSystem(jobContext.getConfiguration)
for ((src, dst) <- filesToMove) {
  fs.rename(new Path(src), new Path(dst))
if (addedAbsPathFiles != null && addedAbsPathFiles.nonEmpty) {
```
Please consider using a common method instead of duplicating the code in the 2 if statements.
Introduced a method:

```scala
/**
 * Checks whether there are files to be committed to an absolute output location.
 */
private def hasAbsPathFiles: Boolean = addedAbsPathFiles != null && addedAbsPathFiles.nonEmpty
```
…les to be committed to an absolute output location
Test build #82020 has finished for PR 19294 at commit
```scala
}
fs.delete(absPathStagingDir, true)
```
Given the changes being made here, it seems a good place to add the suggestion of SPARK-20045 and make that `abort()` call resilient to failures, by doing that delete even if the Hadoop committer raised an IOE.
Wouldn't it be better to fix it in a separate PR?
Can do. Now you've got a little mock committer in, someone can just extend it to optionally throw an IOE in `abort()`.
```scala
/**
 * Checks whether there are files to be committed to an absolute output location.
 */
private def hasAbsPathFiles: Boolean = addedAbsPathFiles != null && addedAbsPathFiles.nonEmpty
```
When `addedAbsPathFiles` is null and when it is not is slightly confusing. Can we move to using an `Option[Map[String, String]]` instead? In the earlier code it had to always be non-null; but now it becomes optional.
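A minimal sketch of the `Option`-based variant being suggested, in plain Scala with illustrative names (this is not the code the PR finally adopted): the "never populated on the driver" invariant becomes visible in the type rather than hiding behind a null check.

```scala
import scala.collection.mutable

class OptionVariant {
  // None on the driver; populated lazily on executors as tasks
  // register files destined for absolute output locations.
  private var addedAbsPathFiles: Option[mutable.Map[String, String]] = None

  def addFile(src: String, dst: String): Unit = {
    val files = addedAbsPathFiles.getOrElse {
      val fresh = mutable.Map.empty[String, String]
      addedAbsPathFiles = Some(fresh)
      fresh
    }
    files += (src -> dst)
  }

  // No null check needed: an absent or empty map reads the same way.
  def hasAbsPathFiles: Boolean = addedAbsPathFiles.exists(_.nonEmpty)
}
```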
Good catch, thank you!

According to the `FileCommitProtocol`, `addedAbsPathFiles` is always null on the driver, so we will not be able to commit or remove these files. Replaced it with:

```scala
private def hasAbsPathFiles: Boolean = path != null
```
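A minimal plain-Scala sketch (illustrative names, not the actual `HadoopMapReduceCommitProtocol`) of why the guard moves to the output path: on the driver the file map is never populated, so only the path can tell whether absolute-path commits are in play.

```scala
class GuardSketch(path: String) {
  // Per FileCommitProtocol, this map is never populated on the driver,
  // so it cannot be used there to decide anything.
  private val addedAbsPathFiles: Map[String, String] = null

  // The output path, by contrast, is available on driver and executors.
  def hasAbsPathFiles: Boolean = path != null
}

val external = new GuardSketch(null)            // e.g. a Cassandra-style format
val hdfsLike = new GuardSketch("hdfs://nn/out") // a filesystem-backed format
```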
IMO it should be fine to not provide an output directory if you are not using absolute output paths. I also don't think we should always create absolute output paths in
+CC @weiqingy. @szhem, incorporating a test for the SQL part will also help in this matter. [1] SHC will need the workaround even if this issue is resolved (since 2.2 has been released with this bug).
As I play with commit logic all the way through the stack, I can't help thinking everyone's lives would be better if we tagged the MRv1 commit APIs as deprecated in Hadoop 3, and uses of the commit protocols went fully onto the v2 committers: one codepath to get confused by, half as much complexity. The issue with the custom stuff is inevitably Hive related, isn't it? It's always liked to scatter data around a filesystem and pretend it's a single dataset.
What should be the expected behaviour in case of SQL?
@szhem You are correct, currently it fails in the driver itself. With this PR, the job submission should succeed, but the subsequent execution in SQL could fail (since SQL uses some of the methods which have not been patched in this PR if I am not wrong: `newTaskTempFileAbsPath`, `newTaskTempFile`, etc.). A testcase to validate successful writes from a datasource in Spark SQL would clarify things.
… an absolute output location by means of checking whether the output path specified
…ted by the changes
Test build #82131 has finished for PR 19294 at commit
@mridulm Updated
…er.write in tests
…ain hadoop conf, etc.
Test build #82130 has finished for PR 19294 at commit
Test build #82133 has finished for PR 19294 at commit
If In our Spark SQL code path, how can
@gatorsmile I believe that in the Spark SQL code path

The interesting part is that Hadoop's `FileOutputCommitter` allows null output paths, and the line you highlighted is executed only in case of

So there may be a chance that someone would like to use a custom implementation of
Test build #82411 has finished for PR 19294 at commit
retest this please
@szhem that null path support in

I don't think that's a codepath Spark goes near; in the normal execution paths,

(disclaimer: the more I read of that code, the less I understand it. Do not treat my opinions as normative in any way)
Test build #82412 has finished for PR 19294 at commit
@gatorsmile have your concerns been addressed? If yes, I will merge this into master and 2.2.1. This patch is clearly better than the existing state for 2.2 and master, for Spark core and some of the data sources I tested with.
Since this is not related to Spark SQL, please do not add the test cases to the Spark SQL side.
@gatorsmile Sounds good. @szhem, can we remove the Spark SQL tests you added (due to my request)?
…cted by the patch
Test build #82504 has finished for PR 19294 at commit
@mridulm SQL-related tests were removed.
LGTM
```scala
/**
 * Checks whether there are files to be committed to an absolute output location.
 *
 * As the committing and aborting the job occurs on driver where `addedAbsPathFiles` is always
```
`As the committing` -> `As committing`
done
```scala
 *
 * As the committing and aborting the job occurs on driver where `addedAbsPathFiles` is always
 * null, it is necessary to check whether the output path is specified, that may not be the case
 * for committers not writing to distributed file systems.
```
This also has a grammar issue, and it is not clear either.
How about:

> As committing and aborting a job occurs on driver, where `addedAbsPathFiles` is always null, it is necessary to check whether the output path is specified. Output path may not be required for committers not writing to distributed file systems.
This is much better.
Thanks a lot, guys! I've just updated the comment
Test build #82529 has finished for PR 19294 at commit
…ovided

## What changes were proposed in this pull request?

Fix for the https://issues.apache.org/jira/browse/SPARK-21549 JIRA issue.

Since version 2.2 Spark does not respect an OutputFormat with no output paths provided. Examples of such formats are [Cassandra OutputFormat](https://github.com/finn-no/cassandra-hadoop/blob/08dfa3a7ac727bb87269f27a1c82ece54e3f67e6/src/main/java/org/apache/cassandra/hadoop2/AbstractColumnFamilyOutputFormat.java), [Aerospike OutputFormat](https://github.com/aerospike/aerospike-hadoop/blob/master/mapreduce/src/main/java/com/aerospike/hadoop/mapreduce/AerospikeOutputFormat.java), etc., which do not have an ability to roll back results written to external systems on job failure.

The provided output directory is required by Spark to allow files to be committed to an absolute output location; that is not the case for output formats which write data to external systems.

This pull request prevents accessing `absPathStagingDir`, which causes the error described in SPARK-21549, unless there are files to rename in `addedAbsPathFiles`.

## How was this patch tested?

Unit tests

Author: Sergey Zhemzhitsky <szhemzhitski@gmail.com>

Closes #19294 from szhem/SPARK-21549-abs-output-commits.

(cherry picked from commit 2030f19) Signed-off-by: Mridul Muralidharan <mridul@gmail.com>
Thanks for the fix @szhem, great work!
…ctory provided

## What changes were proposed in this pull request?

PR #19294 added support for nulls, but Spark 2.1 handled other error cases where the path argument can be invalid. Namely:

* empty string
* URI parse exception while creating Path

This is a resubmission of PR #19487, which I messed up while updating my repo.

## How was this patch tested?

Enhanced test to cover the new support added.

Author: Mridul Muralidharan <mridul@gmail.com>

Closes #19497 from mridulm/master.

(cherry picked from commit 13c1559) Signed-off-by: Mridul Muralidharan <mridul@gmail.com>
What changes were proposed in this pull request?

Fix for the https://issues.apache.org/jira/browse/SPARK-21549 JIRA issue.

Since version 2.2 Spark does not respect an OutputFormat with no output paths provided. Examples of such formats are Cassandra OutputFormat, Aerospike OutputFormat, etc., which do not have an ability to roll back results written to external systems on job failure.

The provided output directory is required by Spark to allow files to be committed to an absolute output location; that is not the case for output formats which write data to external systems.

This pull request prevents accessing `absPathStagingDir`, which causes the error described in SPARK-21549, unless there are files to rename in `addedAbsPathFiles`.

How was this patch tested?

Unit tests
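The overall shape of the fix can be sketched in plain Scala (simplified and hypothetical names; not the actual `HadoopMapReduceCommitProtocol` code): `commitJob` only touches the absolute-path staging directory when there is something to move there.

```scala
class CommitSketch(path: String) {
  // The guard this PR introduces: usable on the driver, unlike the
  // addedAbsPathFiles map, which is only populated on executors.
  private def hasAbsPathFiles: Boolean = path != null

  // Returns the renames that would be performed against the staging dir.
  def commitJob(filesToMove: Map[String, String]): Seq[(String, String)] =
    if (hasAbsPathFiles) {
      // Only here would the real protocol resolve absPathStagingDir,
      // rename the files into place, and delete the staging directory.
      filesToMove.toSeq
    } else {
      // External-system formats (Cassandra, Aerospike, ...) provide no
      // output directory, so the staging-dir path is never touched and
      // the SPARK-21549 failure cannot occur.
      Seq.empty
    }
}
```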