[SPARK-8580] [SPARK-10136] [SQL] Fixes Parquet support for Avro array of primitive array #8341

liancheng · 2015-08-20T15:59:37Z

This PR was originally for SPARK-8580, but I caught SPARK-10136 while adding more test cases to ParquetAvroCompatibilitySuite. Actual bug fix code lies in CatalystRowConverter.scala.

liancheng · 2015-08-20T16:10:01Z

Please note that the .avpr file and all Java files under gen-java are auto-generated.

liancheng · 2015-08-20T16:15:45Z

@rxin This issue is a 1.5 regression, so I marked it as blocker. Despite of all those newly added test cases and auto-generated files, the actual fix is quite small. And here is the test case dedicated for this issue.

marmbrus · 2015-08-20T17:53:57Z

.../gen-java/org/apache/spark/sql/execution/datasources/parquet/test/avro/AvroArrayOfArray.java

@@ -0,0 +1,142 @@
+/**
+ * Autogenerated by Avro


We can merge this so its included in 1.5, but why are we checking in auto generated code? Can't we build it on demand.

When adding the first set of Parquet Avro/Thrift/Hive compatibility suites, I considered building these files on demand, but that complicates the build process. Especially, it requires developers and Jenkins to install Thrift binary executable for code generation (or is there any way to avoid this?). Besides that, it's super tricky to compile the version of the Thrift binary executable Parquet uses under Mac because Mac lacks some C++ header files (I always compile parquet-thrift tests under a Ubuntu VM). That's why I decided to include the generated Java files. In this way, only developers who update these test suites need to install these extra dependencies.

rxin · 2015-08-20T17:55:11Z

This is not for SPARK-7231 is it? SPARK-7231 is a SparkR ticket.

… array I caught SPARK-10136 while adding more test cases to `ParquetAvroCompatibilitySuite`. Actual bug fix code lies in `CatalystRowConverter.scala`. Author: Cheng Lian <lian@databricks.com> Closes #8341 from liancheng/spark-10136/parquet-avro-nested-primitive-array. (cherry picked from commit 85f9a61) Signed-off-by: Michael Armbrust <michael@databricks.com>

SparkQA · 2015-08-20T18:27:51Z

Test build #41323 has finished for PR 8341 at commit 34547d6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

liancheng · 2015-08-21T01:05:36Z

@rxin Yeah, sorry, it's for SPARK-8580, and it tests changes made majorly in PR #7231. I updated the PR title and description. Thanks for updating the commit message while merging this!

PR #8341 is a valid fix for SPARK-10136, but it didn't catch the real root cause. The real problem can be rather tricky to explain, and requires audiences to be pretty familiar with parquet-format spec, especially details of `LIST` backwards-compatibility rules. Let me have a try to give an explanation here. The structure of the problematic Parquet schema generated by parquet-avro is something like this: ``` message m { <repetition> group f (LIST) { // Level 1 repeated group array (LIST) { // Level 2 repeated <primitive-type> array; // Level 3 } } } ``` (The schema generated by parquet-thrift is structurally similar, just replace the `array` at level 2 with `f_tuple`, and the other one at level 3 with `f_tuple_tuple`.) This structure consists of two nested legacy 2-level `LIST`-like structures: 1. The repeated group type at level 2 is the element type of the outer array defined at level 1 This group should map to an `CatalystArrayConverter.ElementConverter` when building converters. 2. The repeated primitive type at level 3 is the element type of the inner array defined at level 2 This group should also map to an `CatalystArrayConverter.ElementConverter`. The root cause of SPARK-10136 is that, the group at level 2 isn't properly recognized as the element type of level 1. Thus, according to parquet-format spec, the repeated primitive at level 3 is left as a so called "unannotated repeated primitive type", and is recognized as a required list of required primitive type, thus a `RepeatedPrimitiveConverter` instead of a `CatalystArrayConverter.ElementConverter` is created for it. According to parquet-format spec, unannotated repeated type shouldn't appear in a `LIST`- or `MAP`-annotated group. PR #8341 fixed this issue by allowing such unannotated repeated type appear in `LIST`-annotated groups, which is a non-standard, hacky, but valid fix. (I didn't realize this when authoring #8341 though.) As for the reason why level 2 isn't recognized as a list element type, it's because of the following `LIST` backwards-compatibility rule defined in the parquet-format spec: > If the repeated field is a group with one field and is named either `array` or uses the `LIST`-annotated group's name with `_tuple` appended then the repeated type is the element type and elements are required. (The `array` part is for parquet-avro compatibility, while the `_tuple` part is for parquet-thrift.) This rule is implemented in [`CatalystSchemaConverter.isElementType`] [1], but neglected in [`CatalystRowConverter.isElementType`] [2]. This PR delivers a more robust fix by adding this rule in the latter method. Note that parquet-avro 1.7.0 also suffers from this issue. Details can be found at [PARQUET-364] [3]. [1]: https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L259-L305 [2]: https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala#L456-L463 [3]: https://issues.apache.org/jira/browse/PARQUET-364 Author: Cheng Lian <lian@databricks.com> Closes #8361 from liancheng/spark-10136/proper-version. (cherry picked from commit bf03fe6) Signed-off-by: Cheng Lian <lian@databricks.com>

PR #8341 is a valid fix for SPARK-10136, but it didn't catch the real root cause. The real problem can be rather tricky to explain, and requires audiences to be pretty familiar with parquet-format spec, especially details of `LIST` backwards-compatibility rules. Let me have a try to give an explanation here. The structure of the problematic Parquet schema generated by parquet-avro is something like this: ``` message m { <repetition> group f (LIST) { // Level 1 repeated group array (LIST) { // Level 2 repeated <primitive-type> array; // Level 3 } } } ``` (The schema generated by parquet-thrift is structurally similar, just replace the `array` at level 2 with `f_tuple`, and the other one at level 3 with `f_tuple_tuple`.) This structure consists of two nested legacy 2-level `LIST`-like structures: 1. The repeated group type at level 2 is the element type of the outer array defined at level 1 This group should map to an `CatalystArrayConverter.ElementConverter` when building converters. 2. The repeated primitive type at level 3 is the element type of the inner array defined at level 2 This group should also map to an `CatalystArrayConverter.ElementConverter`. The root cause of SPARK-10136 is that, the group at level 2 isn't properly recognized as the element type of level 1. Thus, according to parquet-format spec, the repeated primitive at level 3 is left as a so called "unannotated repeated primitive type", and is recognized as a required list of required primitive type, thus a `RepeatedPrimitiveConverter` instead of a `CatalystArrayConverter.ElementConverter` is created for it. According to parquet-format spec, unannotated repeated type shouldn't appear in a `LIST`- or `MAP`-annotated group. PR #8341 fixed this issue by allowing such unannotated repeated type appear in `LIST`-annotated groups, which is a non-standard, hacky, but valid fix. (I didn't realize this when authoring #8341 though.) As for the reason why level 2 isn't recognized as a list element type, it's because of the following `LIST` backwards-compatibility rule defined in the parquet-format spec: > If the repeated field is a group with one field and is named either `array` or uses the `LIST`-annotated group's name with `_tuple` appended then the repeated type is the element type and elements are required. (The `array` part is for parquet-avro compatibility, while the `_tuple` part is for parquet-thrift.) This rule is implemented in [`CatalystSchemaConverter.isElementType`] [1], but neglected in [`CatalystRowConverter.isElementType`] [2]. This PR delivers a more robust fix by adding this rule in the latter method. Note that parquet-avro 1.7.0 also suffers from this issue. Details can be found at [PARQUET-364] [3]. [1]: https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L259-L305 [2]: https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala#L456-L463 [3]: https://issues.apache.org/jira/browse/PARQUET-364 Author: Cheng Lian <lian@databricks.com> Closes #8361 from liancheng/spark-10136/proper-version.

Fixes SPARK-10136 and adds more tests

34547d6

marmbrus reviewed Aug 20, 2015
View reviewed changes

asfgit closed this in 85f9a61 Aug 20, 2015

liancheng deleted the spark-10136/parquet-avro-nested-primitive-array branch August 20, 2015 23:52

liancheng changed the title ~~[SPARK-7231] [SPARK-10136] [SQL] Fixes Parquet support for Avro array of primitive array~~ [SPARK-8580] [SPARK-10136] [SQL] Fixes Parquet support for Avro array of primitive array Aug 21, 2015

liancheng mentioned this pull request Aug 21, 2015

[SPARK-10136] [SQL] A more robust fix for SPARK-10136 #8361

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-8580] [SPARK-10136] [SQL] Fixes Parquet support for Avro array of primitive array #8341

[SPARK-8580] [SPARK-10136] [SQL] Fixes Parquet support for Avro array of primitive array #8341

liancheng commented Aug 20, 2015

liancheng commented Aug 20, 2015

liancheng commented Aug 20, 2015

marmbrus Aug 20, 2015

liancheng Aug 21, 2015

rxin commented Aug 20, 2015

SparkQA commented Aug 20, 2015

liancheng commented Aug 21, 2015

[SPARK-8580] [SPARK-10136] [SQL] Fixes Parquet support for Avro array of primitive array #8341

[SPARK-8580] [SPARK-10136] [SQL] Fixes Parquet support for Avro array of primitive array #8341

Conversation

liancheng commented Aug 20, 2015

liancheng commented Aug 20, 2015

liancheng commented Aug 20, 2015

marmbrus Aug 20, 2015

Choose a reason for hiding this comment

liancheng Aug 21, 2015

Choose a reason for hiding this comment

rxin commented Aug 20, 2015

SparkQA commented Aug 20, 2015

liancheng commented Aug 21, 2015