New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-11500][SQL] Not deterministic order of columns when using merging schemas. #9517
Conversation
cc @liancheng |
ok to test |
add to whitelist |
Test build #45235 has started for PR 9517 at commit |
retest this please |
Test build #45324 has finished for PR 9517 at commit
|
val pathTwo = s"${dir.getCanonicalPath}/table2" | ||
Seq(1, 1).zipWithIndex.toDF("c", "b").write.parquet(pathTwo) | ||
val pathThree = s"${dir.getCanonicalPath}/table3" | ||
Seq(1, 1).zipWithIndex.toDF("d", "b").write.parquet(pathThree) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably use a partitioned table here. Directories like base/table1
, base/table2
, and base/table3
are not valid partition directory names, and loading base
as a Parquet file should throw an exception. It's not expected that this test case can pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for commands!
…t and rename variable.
In this commit, I added partitioned tables for the test and sorted the There are several things to mention here. Firstly, now we do not need to change Secondly, in any cases, the columns of the lexicographically first file head first, which might be a matter for files containing numeric values. However, I left this as I thought anyway it is deterministic. |
Test build #45359 has finished for PR 9517 at commit
|
I used |
Test build #45489 has finished for PR 9517 at commit
|
LGTM, merged to master. Thanks! |
…ing schemas. https://issues.apache.org/jira/browse/SPARK-11500 As filed in SPARK-11500, if merging schemas is enabled, the order of files to touch is a matter which might affect the ordering of the output columns. This was mostly because of the use of `Set` and `Map` so I replaced them to `LinkedHashSet` and `LinkedHashMap` to keep the insertion order. Also, I changed `reduceOption` to `reduceLeftOption`, and replaced the order of `filesToTouch` from `metadataStatuses ++ commonMetadataStatuses ++ needMerged` to `needMerged ++ metadataStatuses ++ commonMetadataStatuses` in order to touch the part-files first which always have the schema in footers whereas the others might not exist. One nit is, If merging schemas is not enabled, but when multiple files are given, there is no guarantee of the output order, since there might not be a summary file for the first file, which ends up putting ahead the columns of the other files. However, I thought this should be okay since disabling merging schemas means (assumes) all the files have the same schemas. In addition, in the test code for this, I only checked the names of fields. Author: hyukjinkwon <gurwls223@gmail.com> Closes #9517 from HyukjinKwon/SPARK-11500. (cherry picked from commit 1bc4112) Signed-off-by: Cheng Lian <lian@databricks.com>
Also backported to branch-1.6. |
https://issues.apache.org/jira/browse/SPARK-11500
As filed in SPARK-11500, if merging schemas is enabled, the order of files to touch is a matter which might affect the ordering of the output columns.
This was mostly because of the use of
Set
andMap
so I replaced them toLinkedHashSet
andLinkedHashMap
to keep the insertion order.Also, I changed
reduceOption
toreduceLeftOption
, and replaced the order offilesToTouch
frommetadataStatuses ++ commonMetadataStatuses ++ needMerged
toneedMerged ++ metadataStatuses ++ commonMetadataStatuses
in order to touch the part-files first which always have the schema in footers whereas the others might not exist.One nit is, If merging schemas is not enabled, but when multiple files are given, there is no guarantee of the output order, since there might not be a summary file for the first file, which ends up putting ahead the columns of the other files.
However, I thought this should be okay since disabling merging schemas means (assumes) all the files have the same schemas.
In addition, in the test code for this, I only checked the names of fields.