[SPARK-41551][SQL] Dynamic/absolute path support in PathOutputCommitters #40221
What changes were proposed in this pull request?
Follow on to SPARK-40034, Dynamic/absolute path support in PathOutputCommitters
Dynamic partitioning through the PathOutputCommitProtocol needs to add the target directories to the superclass's partition list; otherwise the partition delete doesn't take place and the job extends the dataset rather than replacing it.
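The effect described above can be illustrated with a small in-memory model. This is only a sketch: `DynamicOverwriteModel`, `commitJob` and the map-based "dataset" are hypothetical names invented for illustration, not Spark or Hadoop APIs, and a real job commit deletes and renames directories on a filesystem.

```scala
// Hypothetical model: partition name -> set of file names in that partition.
object DynamicOverwriteModel {
  type Dataset = Map[String, Set[String]]

  // Simulates job commit for a dynamic partition overwrite.
  def commitJob(dest: Dataset, staged: Dataset, trackedPartitions: Set[String]): Dataset = {
    // Only partitions the protocol was told about are deleted.
    val afterDelete = dest -- trackedPartitions
    // Staged files are then moved into place, merging with whatever survived.
    staged.foldLeft(afterDelete) { case (acc, (partition, files)) =>
      acc.updated(partition, acc.getOrElse(partition, Set.empty[String]) ++ files)
    }
  }
}
```

In this model, committing with the partition tracked replaces its contents, while committing with an empty tracked list leaves the old files alongside the new ones, which is the "extends rather than replaces" behaviour.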
Fix:
- `addPartition()` method subclasses can use.
- `getPartitions` method to return an immutable copy of the list for testing.
- `newTaskTempFileAbsPath()` to return a path, irrespective of committer type.
In dynamic mode, because the parent dir of an absolute path is deleted, there's a safety check to reject any requests for a file in a parent dir. This is something which could be pulled up to `HadoopMapReduceCommitProtocol`; it needs the same check, if the risk is considered realistic.
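A minimal sketch of the shape of these changes, under stated assumptions: `PartitionTrackingSketch` and `checkedAbsoluteDir` are hypothetical names, `java.nio.file.Path` stands in for Hadoop's `Path`, and the safety check is read here as "the requested directory must not be the staging directory or one of its ancestors". The patch itself may implement the check differently.

```scala
import java.nio.file.{Path, Paths}
import scala.collection.mutable

// Hypothetical stand-in for a commit protocol; not Spark's actual class.
class PartitionTrackingSketch(stagingDir: Path, dynamicPartitionOverwrite: Boolean) {

  // Partitions written by this job; job commit would delete and replace these.
  private val partitions = mutable.Set.empty[String]

  // Lets a subclass register a partition directory it has written to.
  def addPartition(partition: String): Unit = partitions.synchronized {
    partitions += partition
  }

  // Returns an immutable snapshot, so callers such as tests cannot mutate internal state.
  def getPartitions: Set[String] = partitions.synchronized {
    partitions.toSet
  }

  // True if `candidate` is `path` itself or one of its ancestors.
  private def isSameOrAncestor(candidate: Path, path: Path): Boolean =
    path.startsWith(candidate)

  // Validates an absolute output directory and returns it normalized.
  def checkedAbsoluteDir(absoluteDir: String): Path = {
    val target = Paths.get(absoluteDir).toAbsolutePath.normalize
    if (dynamicPartitionOverwrite) {
      // Assumed reading of the safety check: deleting around `target` must not
      // take the staging directory with it.
      require(!isSameOrAncestor(target, stagingDir.toAbsolutePath.normalize),
        s"cannot use $target as a destination: it contains the staging directory $stagingDir")
    }
    target
  }
}
```

Returning a copy from `getPartitions` keeps the mutable set private, which is why it is safe to expose for testing.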
The patch now downgrades from failing on dynamic partitioning, if the committer doesn't declare it supports it, to printing a warning. Why this? Well, it does work through the s3a committers, it's just `O(data)`. If someone does want to do `INSERT OVERWRITE` then they can be allowed to, just warned about it. The outcome will be correct except in the case of: "if the driver fails partway through dir rename, only some of the files will be there"
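The change from failing to warning can be sketched as a small decision function. `DynamicOverwriteCheck`, `warningFor` and the message text are hypothetical and only illustrate the policy; the patch's actual logging call and wording are not shown here.

```scala
// Hypothetical helper: produces a warning message instead of throwing.
object DynamicOverwriteCheck {
  def warningFor(dynamicPartitionOverwrite: Boolean,
                 committerDeclaresSupport: Boolean,
                 committerName: String): Option[String] =
    if (dynamicPartitionOverwrite && !committerDeclaresSupport) {
      // Previously this case would have been a failure; now the job proceeds.
      Some(s"Committer $committerName does not declare support for dynamic partition " +
        "overwrite; the job will proceed, but commit may be O(data)")
    } else {
      None
    }
}
```

A caller would log any returned message at warning level and continue with the job.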
Finally, it updates the protocol spec in `HadoopMapReduceCommitProtocol` to cover the dynamic partition job commit in more detail.
Why are the changes needed?
`newFileAbsPath()` code is required of all committers, despite its near-total lack of use.
Does this PR introduce any user-facing change?
Updates the cloud docs to say that dynamic partition overwrite does work everywhere, just may be really slow.
How was this patch tested?
New unit tests in `CommitterBindingSuite`; new test suite `PathOutputPartitionedWriteSuite extends PartitionedWriteSuite`, which runs the `PartitionedWriteSuite` through the PathOutputCommitter and, on Hadoop 3.3.5+, with the manifest committer chosen.
executed with