[SPARK-3007][SQL] Fixes dynamic partitioning support for lower Hadoop versions #2663
This is a follow-up of #2226 and #2616 to fix the Jenkins master SBT build failures seen with lower Hadoop versions (1.0.x and 2.0.x).

The root cause is a semantics difference in `FileSystem.globStatus()` between Hadoop versions: given the same target directory structure produced by a dynamic-partition insert, globbing the output directory returns different results under Hadoop 2.4.1 and Hadoop 1.0.4.
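The original test snippet and its per-version output did not survive extraction. As a hedged reconstruction (the Hadoop classes and methods are real, but the local path and glob pattern are assumptions for illustration), a probe exposing the difference looks roughly like this:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Probe globStatus() semantics: glob a dynamic-partition output directory
// such as /tmp/wh, which contains a _SUCCESS marker at the top level plus
// partition subdirectories like part1=1/part2=1/part-00000.
object GlobStatusProbe extends App {
  val fs = FileSystem.getLocal(new Configuration())
  // The glob depth here is an assumption; Hive globs the job output
  // directory when loading dynamic partitions.
  val statuses = fs.globStatus(new Path("/tmp/wh/*/*"))
  statuses.foreach(status => println(status.getPath))
}
```

Per the behavior described below, Hadoop 2.4.1 lists only the partition data files, while Hadoop 1.0.4 also includes the `_SUCCESS` marker, which is why it later surfaces as a bogus partition file.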
In #2226 and #2616, we call `FileOutputCommitter.commitJob()` at the end of the job, which writes the `_SUCCESS` marker file. With lower Hadoop versions, because of the `globStatus()` semantics issue, `_SUCCESS` is picked up as a separate partition data file by `Hive.loadDynamicPartitions()` and fails partition spec checking. The fix introduced in this PR is admittedly a hack: when inserting data with dynamic partitioning, we intentionally avoid writing the `_SUCCESS` marker to work around this issue.

Hive doesn't suffer from this issue because
`FileSinkOperator` doesn't call `FileOutputCommitter.commitJob()`; instead, it calls `Utilities.mvFileToFinalPath()` to clean up the output directory and then loads the data into the Hive warehouse via `loadDynamicPartitions()` / `loadPartition()` / `loadTable()`. This approach is better because it properly handles failed jobs and speculative tasks. We should add this step to `InsertIntoHiveTable` in another PR.
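As a sketch of the workaround above, suppressing the `_SUCCESS` marker can go through `FileOutputCommitter`'s standard configuration switch (a plausible wiring, not necessarily this PR's exact diff; the helper name and the `jobConf` parameter are hypothetical):

```scala
import org.apache.hadoop.mapred.JobConf

// Hypothetical helper: given the job's configuration, disable the _SUCCESS
// marker that FileOutputCommitter.commitJob() would otherwise write.
// "mapreduce.fileoutputcommitter.marksuccessfuljobs" is the standard Hadoop
// key controlling that marker.
def disableSuccessMarker(jobConf: JobConf): Unit = {
  jobConf.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false)
}
```

With the marker suppressed only for dynamic-partition inserts, `commitJob()` still performs its normal output promotion, and `Hive.loadDynamicPartitions()` no longer sees a stray non-partition file in the output directory.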