[SPARK-10136] [SQL] A more robust fix for SPARK-10136 #8361

liancheng · 2015-08-21T17:51:49Z

PR #8341 is a valid fix for SPARK-10136, but it didn't catch the real root cause. The real problem can be rather tricky to explain, and requires audiences to be pretty familiar with parquet-format spec, especially details of LIST backwards-compatibility rules. Let me have a try to give an explanation here.

The structure of the problematic Parquet schema generated by parquet-avro is something like this:

message m {
  <repetition> group f (LIST) {         // Level 1
    repeated group array (LIST) {       // Level 2
      repeated <primitive-type> array;  // Level 3
    }
  }
}

(The schema generated by parquet-thrift is structurally similar, just replace the array at level 2 with f_tuple, and the other one at level 3 with f_tuple_tuple.)

This structure consists of two nested legacy 2-level LIST-like structures:

The repeated group type at level 2 is the element type of the outer array defined at level 1

This group should map to an CatalystArrayConverter.ElementConverter when building converters.
The repeated primitive type at level 3 is the element type of the inner array defined at level 2

This group should also map to an CatalystArrayConverter.ElementConverter.

The root cause of SPARK-10136 is that, the group at level 2 isn't properly recognized as the element type of level 1. Thus, according to parquet-format spec, the repeated primitive at level 3 is left as a so called "unannotated repeated primitive type", and is recognized as a required list of required primitive type, thus a RepeatedPrimitiveConverter instead of a CatalystArrayConverter.ElementConverter is created for it.

According to parquet-format spec, unannotated repeated type shouldn't appear in a LIST- or MAP-annotated group. PR #8341 fixed this issue by allowing such unannotated repeated type appear in LIST-annotated groups, which is a non-standard, hacky, but valid fix. (I didn't realize this when authoring #8341 though.)

As for the reason why level 2 isn't recognized as a list element type, it's because of the following LIST backwards-compatibility rule defined in the parquet-format spec:

If the repeated field is a group with one field and is named either array or uses the LIST-annotated group's name with _tuple appended then the repeated type is the element type and elements are required.

(The array part is for parquet-avro compatibility, while the _tuple part is for parquet-thrift.)

This rule is implemented in CatalystSchemaConverter.isElementType, but neglected in CatalystRowConverter.isElementType. This PR delivers a more robust fix by adding this rule in the latter method.

Note that parquet-avro 1.7.0 also suffers from this issue. Details can be found at PARQUET-364.

liancheng · 2015-08-21T17:54:03Z

Although #8341 is hacky, it's still a valid fix. I won't list this PR as a blocker for 1.5 at this time since RC1 is already cut. But I'd like to have this merged into branch-1.5 if there's RC2.

liancheng · 2015-08-21T18:06:14Z

(No more test cases added in this PR because those added in #8341 already cover this issue.)

SparkQA · 2015-08-21T20:21:46Z

Test build #41376 has finished for PR 8361 at commit 15cd105.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-08-22T15:49:43Z

Test build #41407 has finished for PR 8361 at commit 74802cd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

PR #8341 is a valid fix for SPARK-10136, but it didn't catch the real root cause. The real problem can be rather tricky to explain, and requires audiences to be pretty familiar with parquet-format spec, especially details of `LIST` backwards-compatibility rules. Let me have a try to give an explanation here. The structure of the problematic Parquet schema generated by parquet-avro is something like this: ``` message m { <repetition> group f (LIST) { // Level 1 repeated group array (LIST) { // Level 2 repeated <primitive-type> array; // Level 3 } } } ``` (The schema generated by parquet-thrift is structurally similar, just replace the `array` at level 2 with `f_tuple`, and the other one at level 3 with `f_tuple_tuple`.) This structure consists of two nested legacy 2-level `LIST`-like structures: 1. The repeated group type at level 2 is the element type of the outer array defined at level 1 This group should map to an `CatalystArrayConverter.ElementConverter` when building converters. 2. The repeated primitive type at level 3 is the element type of the inner array defined at level 2 This group should also map to an `CatalystArrayConverter.ElementConverter`. The root cause of SPARK-10136 is that, the group at level 2 isn't properly recognized as the element type of level 1. Thus, according to parquet-format spec, the repeated primitive at level 3 is left as a so called "unannotated repeated primitive type", and is recognized as a required list of required primitive type, thus a `RepeatedPrimitiveConverter` instead of a `CatalystArrayConverter.ElementConverter` is created for it. According to parquet-format spec, unannotated repeated type shouldn't appear in a `LIST`- or `MAP`-annotated group. PR #8341 fixed this issue by allowing such unannotated repeated type appear in `LIST`-annotated groups, which is a non-standard, hacky, but valid fix. (I didn't realize this when authoring #8341 though.) As for the reason why level 2 isn't recognized as a list element type, it's because of the following `LIST` backwards-compatibility rule defined in the parquet-format spec: > If the repeated field is a group with one field and is named either `array` or uses the `LIST`-annotated group's name with `_tuple` appended then the repeated type is the element type and elements are required. (The `array` part is for parquet-avro compatibility, while the `_tuple` part is for parquet-thrift.) This rule is implemented in [`CatalystSchemaConverter.isElementType`] [1], but neglected in [`CatalystRowConverter.isElementType`] [2]. This PR delivers a more robust fix by adding this rule in the latter method. Note that parquet-avro 1.7.0 also suffers from this issue. Details can be found at [PARQUET-364] [3]. [1]: https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L259-L305 [2]: https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala#L456-L463 [3]: https://issues.apache.org/jira/browse/PARQUET-364 Author: Cheng Lian <lian@databricks.com> Closes #8361 from liancheng/spark-10136/proper-version. (cherry picked from commit bf03fe6) Signed-off-by: Cheng Lian <lian@databricks.com>

liancheng · 2015-08-25T06:59:56Z

Merged to master and branch-1.5.

Fixes SPARK-10136 in a more proper way

15cd105

Minor refactoring for readability

74802cd

asfgit closed this in bf03fe6 Aug 25, 2015

liancheng deleted the spark-10136/proper-version branch August 25, 2015 07:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-10136] [SQL] A more robust fix for SPARK-10136 #8361

[SPARK-10136] [SQL] A more robust fix for SPARK-10136 #8361

liancheng commented Aug 21, 2015

liancheng commented Aug 21, 2015

liancheng commented Aug 21, 2015

SparkQA commented Aug 21, 2015

SparkQA commented Aug 22, 2015

liancheng commented Aug 25, 2015

[SPARK-10136] [SQL] A more robust fix for SPARK-10136 #8361

[SPARK-10136] [SQL] A more robust fix for SPARK-10136 #8361

Conversation

liancheng commented Aug 21, 2015

liancheng commented Aug 21, 2015

liancheng commented Aug 21, 2015

SparkQA commented Aug 21, 2015

SparkQA commented Aug 22, 2015

liancheng commented Aug 25, 2015