Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-8811][SQL] Read array struct data from parquet error #7209

Closed
wants to merge 7 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -446,7 +446,7 @@ private[parquet] class CatalystSchemaConverter(
field.name,
Types
.buildGroup(REPEATED)
.addField(convertField(StructField("element", elementType, nullable)))
.addField(convertField(StructField("array", elementType, nullable)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I made a mistake here. We should leave this array as array_element.

This is a little bit complicated... So in the early days, when Spark SQL Parquet support was firstly authored, Parquet format spec wasn't clear about how to write arrays and maps. So Spark SQL took a somewhat weird approach here: if the array may contain nulls, we mimic parquet-hive, which writes a 3-level structure with array_element as the 2nd level type name; otherwise, we mimic parquet-avro, which writes a 2-level structure with array as the 2nd level type name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear, PR #7231 already covers the original bug this PR tried to fix. We'll be able to read Hive data with legacy format. The field names changed here matter for the write path, because we want to write exactly the same format as older Spark SQL versions when compatible mode is turned on.

.named(CatalystConverter.ARRAY_CONTAINS_NULL_BAG_SCHEMA_NAME))

// Spark 1.4.x and prior versions convert ArrayType with non-nullable elements into a 2-level
Expand All @@ -459,7 +459,7 @@ private[parquet] class CatalystSchemaConverter(
ConversionPatterns.listType(
repetition,
field.name,
convertField(StructField("element", elementType, nullable), REPEATED))
convertField(StructField("array", elementType, nullable), REPEATED))

// Spark 1.4.x and prior versions convert MapType into a 3-level group annotated by
// MAP_KEY_VALUE. This is covered by `convertGroupField(field: GroupType): DataType`.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -174,7 +174,7 @@ class ParquetSchemaInferenceSuite extends ParquetSchemaTest {
"""
|message root {
| optional group _1 (LIST) {
| repeated int32 element;
| repeated int32 array;
| }
|}
""".stripMargin)
Expand All @@ -198,7 +198,7 @@ class ParquetSchemaInferenceSuite extends ParquetSchemaTest {
|message root {
| optional group _1 (LIST) {
| repeated group bag {
| optional int32 element;
| optional int32 array;
| }
| }
|}
Expand Down Expand Up @@ -267,7 +267,7 @@ class ParquetSchemaInferenceSuite extends ParquetSchemaTest {
| optional binary _1 (UTF8);
| optional group _2 (LIST) {
| repeated group bag {
| optional group element {
| optional group array {
| required int32 _1;
| required double _2;
| }
Expand Down Expand Up @@ -467,7 +467,7 @@ class ParquetSchemaSuite extends ParquetSchemaTest {
"""message root {
| optional group f1 (LIST) {
| repeated group list {
| optional int32 element;
| optional int32 array;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes made are limited to Catalyst to Parquet schema conversion, so there's no need to modify this Parquet to Catalyst test case.

| }
| }
|}
Expand All @@ -478,11 +478,12 @@ class ParquetSchemaSuite extends ParquetSchemaTest {
StructType(Seq(
StructField(
"f1",
ArrayType(IntegerType, containsNull = true),
ArrayType(
StructType(Seq(StructField("num", IntegerType))), containsNull = false),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, this change is supposed to be wrong. The Parquet schema below just represents an integer array. Don't know why it passed tests... Trying to figure this out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Figured out why the test passes. It's because array is considered as a special name by Parquet LIST backwards-compatibility rule 3 (this is specific to parquet-avro). Whenever we see a 2nd level group type named array, Parquet format spec see it as the element type. That's why it's inferred as an array of a struct with a single int field, rather than simply an int array here.

nullable = true))),
"""message root {
| optional group f1 (LIST) {
| repeated group element {
| repeated group array {
| optional int32 num;
| }
| }
Expand All @@ -496,7 +497,7 @@ class ParquetSchemaSuite extends ParquetSchemaTest {
"""message root {
| optional group f1 (LIST) {
| repeated group list {
| required int32 element;
| required int32 array;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change should be unnecessary either.

| }
| }
|}
Expand All @@ -505,10 +506,13 @@ class ParquetSchemaSuite extends ParquetSchemaTest {
testParquetToCatalyst(
"Backwards-compatibility: LIST with non-nullable element type - 2",
StructType(Seq(
StructField("f1", ArrayType(IntegerType, containsNull = false), nullable = true))),
StructField("f1",
ArrayType(StructType(Seq(StructField("num", IntegerType, nullable = false))),
containsNull = false),
nullable = true))),
"""message root {
| optional group f1 (LIST) {
| repeated group element {
| repeated group array {
| required int32 num;
| }
| }
Expand All @@ -521,7 +525,7 @@ class ParquetSchemaSuite extends ParquetSchemaTest {
StructField("f1", ArrayType(IntegerType, containsNull = false), nullable = true))),
"""message root {
| optional group f1 (LIST) {
| repeated int32 element;
| repeated int32 array;
| }
|}
""".stripMargin)
Expand All @@ -539,7 +543,7 @@ class ParquetSchemaSuite extends ParquetSchemaTest {
nullable = true))),
"""message root {
| optional group f1 (LIST) {
| repeated group element {
| repeated group array {
| required binary str (UTF8);
| required int32 num;
| }
Expand Down Expand Up @@ -616,7 +620,7 @@ class ParquetSchemaSuite extends ParquetSchemaTest {
"""message root {
| optional group f1 (LIST) {
| repeated group bag {
| optional int32 element;
| optional int32 array;
| }
| }
|}
Expand Down Expand Up @@ -648,7 +652,7 @@ class ParquetSchemaSuite extends ParquetSchemaTest {
nullable = true))),
"""message root {
| optional group f1 (LIST) {
| repeated int32 element;
| repeated int32 array;
| }
|}
""".stripMargin)
Expand Down