
[SPARK-11500][SQL] Not deterministic order of columns when using merging schemas. #9517

Closed
Wants to merge 5 commits.

Conversation

HyukjinKwon
Member

https://issues.apache.org/jira/browse/SPARK-11500

As filed in SPARK-11500, when schema merging is enabled, the order in which files are touched matters and can affect the ordering of the output columns.

This was mostly because of the use of `Set` and `Map`, so I replaced them with `LinkedHashSet` and `LinkedHashMap` to preserve insertion order.
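The difference can be sketched in plain Scala (a toy example, not the actual Parquet reader code): the default immutable `Set` gives no iteration-order guarantee once it grows beyond a few elements, while `LinkedHashSet` always iterates in insertion order.

```scala
import scala.collection.mutable

object InsertionOrderDemo extends App {
  // A plain immutable Set gives no ordering guarantee once it grows past
  // four elements (the small specialized Set1..Set4 classes happen to
  // preserve order, which can mask the problem in small tests).
  val plain = Set("e", "d", "c", "b", "a", "f")

  // LinkedHashSet iterates in insertion order, so the first file touched
  // determines which columns come first in the merged schema.
  val linked = mutable.LinkedHashSet("e", "d", "c", "b", "a", "f")

  println(plain.toSeq.mkString(", "))  // order is implementation-defined
  println(linked.toSeq.mkString(", ")) // always: e, d, c, b, a, f
}
```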

Also, I changed `reduceOption` to `reduceLeftOption`, and reordered `filesToTouch` from `metadataStatuses ++ commonMetadataStatuses ++ needMerged` to `needMerged ++ metadataStatuses ++ commonMetadataStatuses`, so that the part-files are touched first: they always carry the schema in their footers, whereas the summary files might not exist.
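Why the left fold matters can be sketched with toy schemas and a hypothetical `merge` function (the real code merges Parquet `StructType`s, not lists of names): `reduceLeftOption` folds strictly left to right, so the schema of the first file touched anchors the column order of the merged result.

```scala
object SchemaMergeDemo extends App {
  // Toy stand-in for a schema: just an ordered list of column names.
  type Schema = Seq[String]

  // Hypothetical merge: keep the left side's order, append unseen columns.
  def merge(left: Schema, right: Schema): Schema =
    left ++ right.filterNot(left.contains)

  // reduceLeftOption folds strictly left to right, so the first file's
  // columns come first in the merged schema.
  val schemas: Seq[Schema] = Seq(Seq("a", "b"), Seq("c", "b"), Seq("d", "b"))
  val merged = schemas.reduceLeftOption(merge)

  println(merged)  // Some(List(a, b, c, d))
}
```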

One nit: if schema merging is disabled but multiple files are given, there is no guarantee of the output order, since the first file might have no summary file, which can put the columns of the other files ahead.

However, I think this should be okay, since disabling schema merging assumes all the files have the same schema.

In addition, the test code for this only checks the names of the fields.

@HyukjinKwon
Member Author

cc @liancheng

@cloud-fan
Contributor

ok to test

@yhuai
Contributor

yhuai commented Nov 6, 2015

add to whitelist

@SparkQA

SparkQA commented Nov 6, 2015

Test build #45235 has started for PR 9517 at commit bcf72d3.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Nov 9, 2015

Test build #45324 has finished for PR 9517 at commit bcf72d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val pathTwo = s"${dir.getCanonicalPath}/table2"
Seq(1, 1).zipWithIndex.toDF("c", "b").write.parquet(pathTwo)
val pathThree = s"${dir.getCanonicalPath}/table3"
Seq(1, 1).zipWithIndex.toDF("d", "b").write.parquet(pathThree)
Contributor

We should probably use a partitioned table here. Directories like base/table1, base/table2, and base/table3 are not valid partition directory names, and loading base as a Parquet file should throw an exception. It's not expected that this test case can pass.
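The naming rule this comment relies on can be illustrated with a toy check (this is not Spark's actual partition-discovery code): partition directories must be named `column=value`, so names like `table1` do not parse as partitions.

```scala
object PartitionDirDemo extends App {
  // Partition discovery expects directories named column=value,
  // e.g. base/p=1. A name like base/table1 does not match, so loading
  // `base` as one partitioned Parquet table fails.
  val PartitionDir = "([^=/]+)=([^=/]+)".r

  def isPartitionDir(name: String): Boolean = name match {
    case PartitionDir(_, _) => true
    case _                  => false
  }

  println(isPartitionDir("p=1"))    // true
  println(isPartitionDir("table1")) // false
}
```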

Member Author

Thanks for the comments!

@HyukjinKwon
Member Author

In this commit, I added partitioned tables for the test and sorted the FileStatuses.

There are several things to mention here.

Firstly, we no longer need to change `Set` to `LinkedHashSet` and `Map` to `LinkedHashMap` for this issue, since the `FileStatus`es are now sorted manually. However, I left those changes in, as I thought it is better for the files to stay in the order in which they are retrieved. If that looks weird, I am happy to revert it.

Secondly, in any case, the columns of the lexicographically first file come first, which might be surprising for file names containing numbers, since lexicographic order differs from numeric order. However, I left this as is, since it is at least deterministic.
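The lexicographic caveat can be shown with a small example (plain Scala, no Spark): sorting is deterministic, but it diverges from numeric order as soon as the numbers have different widths.

```scala
object LexicographicOrderDemo extends App {
  // Toy file names: lexicographic order is deterministic, but it differs
  // from numeric order once the numbers have different widths.
  val names = Seq("part-10", "part-2", "part-1")

  val sorted = names.sorted // lexicographic String ordering
  println(sorted.mkString(", ")) // part-1, part-10, part-2
}
```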

@SparkQA

SparkQA commented Nov 9, 2015

Test build #45359 has finished for PR 9517 at commit 4f47063.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

I used `sortBy` instead of `sortWith`.
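For illustration, the two spellings produce the same ordering on a hypothetical `FileStatus` stand-in (not Hadoop's actual class); `sortBy` just names the sort key once instead of spelling out a full comparison.

```scala
object SortByDemo extends App {
  // Hypothetical stand-in for Hadoop's FileStatus; only the path matters here.
  case class FileStatus(path: String)

  val statuses = Seq(FileStatus("/t/b.parquet"), FileStatus("/t/a.parquet"))

  // sortWith takes a full comparison; sortBy just names the key,
  // which is shorter and harder to get wrong.
  val withSortWith = statuses.sortWith(_.path < _.path)
  val withSortBy   = statuses.sortBy(_.path)

  println(withSortBy == withSortWith) // true
}
```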

@SparkQA

SparkQA commented Nov 10, 2015

Test build #45489 has finished for PR 9517 at commit 32dfb87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

LGTM, merged to master. Thanks!

@asfgit asfgit closed this in 1bc4112 Nov 11, 2015
asfgit pushed a commit that referenced this pull request Nov 11, 2015

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #9517 from HyukjinKwon/SPARK-11500.

(cherry picked from commit 1bc4112)
Signed-off-by: Cheng Lian <lian@databricks.com>
@liancheng
Contributor

Also backported to branch-1.6.

@HyukjinKwon HyukjinKwon deleted the SPARK-11500 branch September 23, 2016 18:28