[SPARK-10005] [SQL] Fixes schema merging for nested structs #8228

liancheng · 2015-08-16T11:28:21Z

In case of schema merging, we only handled first level fields when converting Parquet groups to InternalRows. Nested struct fields are not properly handled.

For example, the schema of a Parquet file to be read can be:

message individual {
  required group f1 {
    optional binary f11 (utf8);
  }
}

while the global schema is:

message global {
  required group f1 {
    optional binary f11 (utf8);
    optional int32 f12;
  }
}

This PR fixes this issue by padding missing fields when creating actual converters.

liancheng · 2015-08-16T11:28:34Z

cc @yhuai

SparkQA · 2015-08-16T13:56:35Z

Test build #40991 has finished for PR 8228 at commit 63eb764.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yhuai · 2015-08-16T16:36:55Z

...src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala

+    val paddedParquetFields = {
+      val parquetFields = parquetType.getFields
+      val parquetFieldNames = parquetFields.map(_.getName).toSet
+      val missingFields = catalystType.filterNot(f => parquetFieldNames.contains(f.name))


Will the case-sensitivity of field names introduce any issue at here?

No. Currently reading Parquet metastore tables may introduce case sensitivity issue. But it's already reconciled within ParquetRelation before building the converters.

yhuai · 2015-08-16T17:17:23Z

LGTM. Merging to master and branch 1.5. Let's also take a look at cases when catalyst schema is a subset of parquet schema (it may happen when users provide schema of the dataset) and see if our read path has any issue.

In case of schema merging, we only handled first level fields when converting Parquet groups to `InternalRow`s. Nested struct fields are not properly handled. For example, the schema of a Parquet file to be read can be: ``` message individual { required group f1 { optional binary f11 (utf8); } } ``` while the global schema is: ``` message global { required group f1 { optional binary f11 (utf8); optional int32 f12; } } ``` This PR fixes this issue by padding missing fields when creating actual converters. Author: Cheng Lian <lian@databricks.com> Closes #8228 from liancheng/spark-10005/nested-schema-merging. (cherry picked from commit ae2370e) Signed-off-by: Yin Huai <yhuai@databricks.com>

In case of schema merging, we only handled first level fields when converting Parquet groups to `InternalRow`s. Nested struct fields are not properly handled. For example, the schema of a Parquet file to be read can be: ``` message individual { required group f1 { optional binary f11 (utf8); } } ``` while the global schema is: ``` message global { required group f1 { optional binary f11 (utf8); optional int32 f12; } } ``` This PR fixes this issue by padding missing fields when creating actual converters. Author: Cheng Lian <lian@databricks.com> Closes apache#8228 from liancheng/spark-10005/nested-schema-merging.

This PR can be quite challenging to review. I'm trying to give a detailed description of the problem as well as its solution here. When reading Parquet files, we need to specify a potentially nested Parquet schema (of type `MessageType`) as requested schema for column pruning. This Parquet schema is translated from a Catalyst schema (of type `StructType`), which is generated by the query planner and represents all requested columns. However, this translation can be fairly complicated because of several reasons: 1. Requested schema must conform to the real schema of the physical file to be read. This means we have to tailor the actual file schema of every individual physical Parquet file to be read according to the given Catalyst schema. Fortunately we are already doing this in Spark 1.5 by pushing request schema conversion to executor side in PR #7231. 1. Support for schema merging. A single Parquet dataset may consist of multiple physical Parquet files come with different but compatible schemas. This means we may request for a column path that doesn't exist in a physical Parquet file. All requested column paths can be nested. For example, for a Parquet file schema ``` message root { required group f0 { required group f00 { required int32 f000; required binary f001 (UTF8); } } } ``` we may request for column paths defined in the following schema: ``` message root { required group f0 { required group f00 { required binary f001 (UTF8); required float f002; } } optional double f1; } ``` Notice that we pruned column path `f0.f00.f000`, but added `f0.f00.f002` and `f1`. The good news is that Parquet handles non-existing column paths properly and always returns null for them. 1. The map from `StructType` to `MessageType` is a one-to-many map. This is the most unfortunate part. Due to historical reasons (dark histories!), schemas of Parquet files generated by different libraries have different "flavors". For example, to handle a schema with a single non-nullable column, whose type is an array of non-nullable integers, parquet-protobuf generates the following Parquet schema: ``` message m0 { repeated int32 f; } ``` while parquet-avro generates another version: ``` message m1 { required group f (LIST) { repeated int32 array; } } ``` and parquet-thrift spills this: ``` message m1 { required group f (LIST) { repeated int32 f_tuple; } } ``` All of them can be mapped to the following _unique_ Catalyst schema: ``` StructType( StructField( "f", ArrayType(IntegerType, containsNull = false), nullable = false)) ``` This greatly complicates Parquet requested schema construction, since the path of a given column varies in different cases. To read the array elements from files with the above schemas, we must use `f` for `m0`, `f.array` for `m1`, and `f.f_tuple` for `m2`. In earlier Spark versions, we didn't try to fix this issue properly. Spark 1.4 and prior versions simply translate the Catalyst schema in a way more or less compatible with parquet-hive and parquet-avro, but is broken in many other cases. Earlier revisions of Spark 1.5 only try to tailor the Parquet file schema at the first level, and ignore nested ones. This caused [SPARK-10301] [spark-10301] as well as [SPARK-10005] [spark-10005]. In PR #8228, I tried to avoid the hard part of the problem and made a minimum change in `CatalystRowConverter` to fix SPARK-10005. However, when taking SPARK-10301 into consideration, keeping hacking `CatalystRowConverter` doesn't seem to be a good idea. So this PR is an attempt to fix the problem in a proper way. For a given physical Parquet file with schema `ps` and a compatible Catalyst requested schema `cs`, we use the following algorithm to tailor `ps` to get the result Parquet requested schema `ps'`: For a leaf column path `c` in `cs`: - if `c` exists in `cs` and a corresponding Parquet column path `c'` can be found in `ps`, `c'` should be included in `ps'`; - otherwise, we convert `c` to a Parquet column path `c"` using `CatalystSchemaConverter`, and include `c"` in `ps'`; - no other column paths should exist in `ps'`. Then comes the most tedious part: > Given `cs`, `ps`, and `c`, how to locate `c'` in `ps`? Unfortunately, there's no quick answer, and we have to enumerate all possible structures defined in parquet-format spec. They are: 1. the standard structure of nested types, and 1. cases defined in all backwards-compatibility rules for `LIST` and `MAP`. The core part of this PR is `CatalystReadSupport.clipParquetType()`, which tailors a given Parquet file schema according to a requested schema in its Catalyst form. Backwards-compatibility rules of `LIST` and `MAP` are covered in `clipParquetListType()` and `clipParquetMapType()` respectively. The column path selection algorithm is implemented in `clipParquetGroupFields()`. With this PR, we no longer need to do schema tailoring in `CatalystReadSupport` and `CatalystRowConverter`. Another benefit is that, now we can also read Parquet datasets consist of files with different physical Parquet schema but share the same logical schema, for example, files generated by different Parquet libraries. This situation is illustrated by [this test case] [test-case]. [spark-10301]: https://issues.apache.org/jira/browse/SPARK-10301 [spark-10005]: https://issues.apache.org/jira/browse/SPARK-10005 [test-case]: liancheng@38644d8#diff-a9b98e28ce3ae30641829dffd1173be2R26 Author: Cheng Lian <lian@databricks.com> Closes #8509 from liancheng/spark-10301/fix-parquet-requested-schema.

Fixes schema merging for nested structs

63eb764

yhuai reviewed Aug 16, 2015
View reviewed changes

asfgit closed this in ae2370e Aug 16, 2015

liancheng deleted the spark-10005/nested-schema-merging branch August 16, 2015 17:24

liancheng mentioned this pull request Aug 16, 2015

[SPARK-9899] [SQL] Fixes HadoopFsRelation speculative writes when used together with direct output committer #8191

Closed

liancheng mentioned this pull request Aug 28, 2015

[SPARK-10301] [SQL] Fixes schema merging for nested structs #8509

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-10005] [SQL] Fixes schema merging for nested structs #8228

[SPARK-10005] [SQL] Fixes schema merging for nested structs #8228

liancheng commented Aug 16, 2015

liancheng commented Aug 16, 2015

SparkQA commented Aug 16, 2015

yhuai Aug 16, 2015

liancheng Aug 16, 2015

yhuai commented Aug 16, 2015

[SPARK-10005] [SQL] Fixes schema merging for nested structs #8228

[SPARK-10005] [SQL] Fixes schema merging for nested structs #8228

Conversation

liancheng commented Aug 16, 2015

liancheng commented Aug 16, 2015

SparkQA commented Aug 16, 2015

yhuai Aug 16, 2015

Choose a reason for hiding this comment

liancheng Aug 16, 2015

Choose a reason for hiding this comment

yhuai commented Aug 16, 2015