Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-10136] [SQL] A more robust fix for SPARK-10136 #8361

Closed
wants to merge 2 commits into from

Conversation

liancheng
Copy link
Contributor

PR #8341 is a valid fix for SPARK-10136, but it didn't catch the real root cause. The real problem can be rather tricky to explain, and requires audiences to be pretty familiar with parquet-format spec, especially details of LIST backwards-compatibility rules. Let me have a try to give an explanation here.

The structure of the problematic Parquet schema generated by parquet-avro is something like this:

message m {
  <repetition> group f (LIST) {         // Level 1
    repeated group array (LIST) {       // Level 2
      repeated <primitive-type> array;  // Level 3
    }
  }
}

(The schema generated by parquet-thrift is structurally similar, just replace the array at level 2 with f_tuple, and the other one at level 3 with f_tuple_tuple.)

This structure consists of two nested legacy 2-level LIST-like structures:

  1. The repeated group type at level 2 is the element type of the outer array defined at level 1

    This group should map to an CatalystArrayConverter.ElementConverter when building converters.

  2. The repeated primitive type at level 3 is the element type of the inner array defined at level 2

    This group should also map to an CatalystArrayConverter.ElementConverter.

The root cause of SPARK-10136 is that, the group at level 2 isn't properly recognized as the element type of level 1. Thus, according to parquet-format spec, the repeated primitive at level 3 is left as a so called "unannotated repeated primitive type", and is recognized as a required list of required primitive type, thus a RepeatedPrimitiveConverter instead of a CatalystArrayConverter.ElementConverter is created for it.

According to parquet-format spec, unannotated repeated type shouldn't appear in a LIST- or MAP-annotated group. PR #8341 fixed this issue by allowing such unannotated repeated type appear in LIST-annotated groups, which is a non-standard, hacky, but valid fix. (I didn't realize this when authoring #8341 though.)

As for the reason why level 2 isn't recognized as a list element type, it's because of the following LIST backwards-compatibility rule defined in the parquet-format spec:

If the repeated field is a group with one field and is named either array or uses the LIST-annotated group's name with _tuple appended then the repeated type is the element type and elements are required.

(The array part is for parquet-avro compatibility, while the _tuple part is for parquet-thrift.)

This rule is implemented in CatalystSchemaConverter.isElementType, but neglected in CatalystRowConverter.isElementType. This PR delivers a more robust fix by adding this rule in the latter method.

Note that parquet-avro 1.7.0 also suffers from this issue. Details can be found at PARQUET-364.

@liancheng
Copy link
Contributor Author

Although #8341 is hacky, it's still a valid fix. I won't list this PR as a blocker for 1.5 at this time since RC1 is already cut. But I'd like to have this merged into branch-1.5 if there's RC2.

@liancheng
Copy link
Contributor Author

(No more test cases added in this PR because those added in #8341 already cover this issue.)

@SparkQA
Copy link

SparkQA commented Aug 21, 2015

Test build #41376 has finished for PR 8361 at commit 15cd105.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 22, 2015

Test build #41407 has finished for PR 8361 at commit 74802cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Aug 25, 2015
PR #8341 is a valid fix for SPARK-10136, but it didn't catch the real root cause.  The real problem can be rather tricky to explain, and requires audiences to be pretty familiar with parquet-format spec, especially details of `LIST` backwards-compatibility rules.  Let me have a try to give an explanation here.

The structure of the problematic Parquet schema generated by parquet-avro is something like this:

```
message m {
  <repetition> group f (LIST) {         // Level 1
    repeated group array (LIST) {       // Level 2
      repeated <primitive-type> array;  // Level 3
    }
  }
}
```

(The schema generated by parquet-thrift is structurally similar, just replace the `array` at level 2 with `f_tuple`, and the other one at level 3 with `f_tuple_tuple`.)

This structure consists of two nested legacy 2-level `LIST`-like structures:

1. The repeated group type at level 2 is the element type of the outer array defined at level 1

   This group should map to an `CatalystArrayConverter.ElementConverter` when building converters.

2. The repeated primitive type at level 3 is the element type of the inner array defined at level 2

   This group should also map to an `CatalystArrayConverter.ElementConverter`.

The root cause of SPARK-10136 is that, the group at level 2 isn't properly recognized as the element type of level 1.  Thus, according to parquet-format spec, the repeated primitive at level 3 is left as a so called "unannotated repeated primitive type", and is recognized as a required list of required primitive type, thus a `RepeatedPrimitiveConverter` instead of a `CatalystArrayConverter.ElementConverter` is created for it.

According to  parquet-format spec, unannotated repeated type shouldn't appear in a `LIST`- or `MAP`-annotated group.  PR #8341 fixed this issue by allowing such unannotated repeated type appear in `LIST`-annotated groups, which is a non-standard, hacky, but valid fix.  (I didn't realize this when authoring #8341 though.)

As for the reason why level 2 isn't recognized as a list element type, it's because of the following `LIST` backwards-compatibility rule defined in the parquet-format spec:

> If the repeated field is a group with one field and is named either `array` or uses the `LIST`-annotated group's name with `_tuple` appended then the repeated type is the element type and elements are required.

(The `array` part is for parquet-avro compatibility, while the `_tuple` part is for parquet-thrift.)

This rule is implemented in [`CatalystSchemaConverter.isElementType`] [1], but neglected in [`CatalystRowConverter.isElementType`] [2].  This PR delivers a more robust fix by adding this rule in the latter method.

Note that parquet-avro 1.7.0 also suffers from this issue. Details can be found at [PARQUET-364] [3].

[1]: https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L259-L305
[2]: https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala#L456-L463
[3]: https://issues.apache.org/jira/browse/PARQUET-364

Author: Cheng Lian <lian@databricks.com>

Closes #8361 from liancheng/spark-10136/proper-version.

(cherry picked from commit bf03fe6)
Signed-off-by: Cheng Lian <lian@databricks.com>
@liancheng
Copy link
Contributor Author

Merged to master and branch-1.5.

@asfgit asfgit closed this in bf03fe6 Aug 25, 2015
@liancheng liancheng deleted the spark-10136/proper-version branch August 25, 2015 07:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants