
[SPARK-8811][SQL] Read array struct data from parquet error #7209

Closed · 7 commits

Conversation

@Sephiroth-Lin (Contributor)

JIRA: https://issues.apache.org/jira/browse/SPARK-8811

We have a table:
t1(c1 string, 
   c2 string, 
   arr_c1 array<struct<in_c1 string, in_c2 string>>, 
   arr_c2 array<struct<in_c1 string, in_c2 string>>
)

We save its data in Parquet.
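For context, a minimal reproduction might look like the sketch below (assuming Spark 1.4.x with a HiveContext; the table and column names come from this report, everything else is illustrative):

// Sketch only: create a Parquet-backed Hive table with the schema above.
// Assumes `hiveContext` is an org.apache.spark.sql.hive.HiveContext.
hiveContext.sql("""
  CREATE TABLE t1 (
    c1 string,
    c2 string,
    arr_c1 array<struct<in_c1:string,in_c2:string>>,
    arr_c2 array<struct<in_c1:string,in_c2:string>>
  )
  STORED AS PARQUET
""")

// After inserting some rows, a full scan fails on affected versions:
hiveContext.sql("SELECT * FROM t1").show()  // java.lang.ArrayIndexOutOfBoundsException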

For select * from t1, we know the Parquet file schema may be:
message hive_schema {
  optional binary c1;
  optional binary c2;
  optional group arr_c1 (LIST) {
    repeated group bag {
      optional group array_element {
        optional binary IN_C1;
        optional binary IN_C2;
      }
    }
  }
  optional group arr_c2 (LIST) {
    repeated group bag {
      optional group array_element {
        optional binary IN_C1;
        optional binary IN_C2;
      }
    }
  }
}
But the requested schema is:
message root {
  optional binary c1;
  optional binary c2;
  optional group arr_c1 (LIST) {
    repeated group bag {
      optional group element {
        optional binary IN_C1;
        optional binary IN_C2;
      }
    }
  }
  optional group arr_c2 (LIST) {
    repeated group bag {
      optional group element {
        optional binary IN_C1;
        optional binary IN_C2;
      }
    }
  }
}

So reading this data from Parquet throws a java.lang.ArrayIndexOutOfBoundsException: the inner group is named array_element in the file schema but element in the requested schema, so the two no longer match up.

@SparkQA commented Jul 3, 2015

Test build #36489 has finished for PR 7209 at commit ecd2547.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@scwf (Contributor) commented Jul 3, 2015

@liancheng Changing it like this resolves the Parquet query issue I sent you, but it fails the unit tests. Can you take a look?

@SparkQA commented Jul 3, 2015

Test build #36494 has finished for PR 7209 at commit 3d38a75.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SaintBacchus (Contributor)

LGTM

@liancheng (Contributor)

@Sephiroth-Lin @scwf This issue is actually much more complicated than it looks. The TL;DR is that, in the early days, Parquet didn't explicitly specify how LIST and MAP should be constructed, and different systems and tools reinvented their own wheels. The consequence is broken Parquet interoperability: files written by system A might not be readable by system B. The most recent Parquet format spec tries to fix this by specifying LIST and MAP structures explicitly and by adding backwards-compatibility rules (1, 2) to cover existing legacy data files.

We are trying to make Spark SQL compatible with the Parquet format spec. This work consists of three parts:

  1. Refactoring schema conversion between Parquet and Spark SQL (done: [SPARK-6777] [SQL] Implements backwards compatibility rules in CatalystSchemaConverter #6617)

    This makes Spark SQL recognize all the "weird" LIST and MAP structures in legacy data files. But it only fixes schema conversion; #6617 doesn't refactor the actual data read path. So there's an internal feature flag, spark.sql.parquet.followParquetFormatSpec, which is turned off by default to stay consistent with the current data read path (see the config sketch after this list).

  2. Refactoring the Parquet data read path

    After this part is finished, we should be able to read all kinds of legacy Parquet files, including the one mentioned in this PR.

  3. Refactoring the Parquet data write path

    So that Spark SQL writes standard Parquet data that conforms to the Parquet format spec.

I'm currently working on part 2, which fixes your problem here. A PR will be sent out soon.
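For reference, the internal flag mentioned in part 1 is set like any other SQL conf (a sketch; the flag was internal and off by default at the time):

// Opt in to the Parquet-format-spec-compliant conversion rules
// (internal flag; defaults to false to preserve the old behavior).
sqlContext.setConf("spark.sql.parquet.followParquetFormatSpec", "true")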

@liancheng (Contributor)

After rethinking this PR, I think it does spot another issue: the current master breaks backwards compatibility when reading Parquet files created by parquet-avro. When converting a Spark SQL schema to a Parquet schema, Spark 1.4.0 and prior versions mostly follow parquet-avro and convert arrays that may contain null values into something like this:

message root {
  optional group _c0 (LIST) {
    repeated group bag {
      optional group array {
        <element-type>
      }
    }
  }
}

Please note the field name array. However, the current master changes this to element even in compatible mode.

@Sephiroth-Lin Would you mind fixing this issue by changing the array_element string to array? My motivation is that we should behave exactly the same as Spark 1.4.0- for now, and then fix SPARK-8811 in the work I mentioned in my previous comment. You may either continue working on this PR or close it and start a new PR for the parquet-avro compatibility issue.

@liancheng (Contributor)

@Sephiroth-Lin @scwf The aforementioned PR is here: #7231. A test case for SPARK-8811 has been added.

@scwf (Contributor) commented Jul 6, 2015

Wow, that's cool!!

@scwf (Contributor) commented Jul 6, 2015

Do we still need to file a PR changing the array_element string to array?

@Sephiroth-Lin (Contributor, Author)

@liancheng OK, good, thank you.

@liancheng (Contributor)

@scwf Yeah, I didn't make the element-to-array change in #7231. It would be good to have one, either based on this PR or as a new one. The tricky part is that it needs parquet-avro to write the test case. We may generate a Parquet file with parquet-avro and then add it as a test resource.
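Generating such a resource might look roughly like this (a sketch, not tested; assumes the parquet-avro 1.6.x-era API, where the package was still parquet.avro rather than org.apache.parquet.avro, and the schema and paths are purely illustrative):

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import parquet.avro.AvroParquetWriter

// An Avro schema with a single int-array field.
val avroSchema = new Schema.Parser().parse(
  """{"type": "record", "name": "TestRecord", "fields": [
    |  {"name": "f", "type": {"type": "array", "items": "int"}}
    |]}""".stripMargin)

// parquet-avro writes the 2-level LIST layout with "array" as the
// repeated field name -- exactly the legacy shape discussed here.
val writer = new AvroParquetWriter[GenericRecord](
  new Path("/tmp/parquet-avro-array.parquet"), avroSchema)

val record = new GenericData.Record(avroSchema)
record.put("f", java.util.Arrays.asList(1: Integer, 2: Integer, 3: Integer))
writer.write(record)
writer.close()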

@Sephiroth-Lin (Contributor, Author)

@liancheng I have updated it, please help review, thank you!

@SparkQA commented Jul 7, 2015

Test build #36638 has finished for PR 7209 at commit e887706.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@pzzs (Contributor) commented Jul 7, 2015

LGTM

@@ -490,7 +490,7 @@ private[parquet] class CatalystSchemaConverter(
       .buildGroup(repetition).as(LIST)
       .addField(
         Types.repeatedGroup()
-          .addField(convertField(StructField("element", elementType, containsNull)))
+          .addField(convertField(StructField("array", elementType, containsNull)))
Contributor: This line shouldn't be changed. As commented above, this case branch implements standard Parquet schema conversion following the Parquet format spec, which explicitly requires the innermost element type name to be element.
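For reference, the standard layout mandated by the Parquet format spec is a 3-level structure whose 2nd-level repeated group is named list and whose element field is named element:

<list-repetition> group <name> (LIST) {
  repeated group list {
    <element-repetition> <element-type> element;
  }
}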

@SparkQA commented Jul 7, 2015

Test build #36651 has finished for PR 7209 at commit d931141.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 7, 2015

Test build #36653 has finished for PR 7209 at commit 0069895.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 7, 2015

Test build #36663 has finished for PR 7209 at commit 2480abd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -446,7 +446,7 @@ private[parquet] class CatalystSchemaConverter(
       field.name,
       Types
         .buildGroup(REPEATED)
-        .addField(convertField(StructField("element", elementType, nullable)))
+        .addField(convertField(StructField("array", elementType, nullable)))
Contributor: Actually I made a mistake here; we should use array_element rather than array.

This is a little bit complicated... In the early days, when Spark SQL's Parquet support was first authored, the Parquet format spec wasn't clear about how to write arrays and maps, so Spark SQL took a somewhat odd approach: if the array may contain nulls, we mimic parquet-hive, which writes a 3-level structure with array_element as the 2nd-level type name; otherwise, we mimic parquet-avro, which writes a 2-level structure with array as the 2nd-level type name.
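Concretely, reconstructed from the description above (with <element-type> standing for the converted element schema), Spark 1.4.x- writes:

containsNull = true (3-level, parquet-hive style):

<repetition> group <name> (LIST) {
  repeated group bag {
    optional group array_element {
      <element-type>
    }
  }
}

containsNull = false (2-level, parquet-avro style):

<repetition> group <name> (LIST) {
  repeated group array {
    <element-type>
  }
}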

Contributor: Just to be clear, PR #7231 already covers the original bug this PR tried to fix, so we'll be able to read Hive data in the legacy format. The field names changed here matter for the write path, because we want to write exactly the same format as older Spark SQL versions when compatible mode is turned on.

@liancheng (Contributor)

Hey @Sephiroth-Lin, do you mind me forking this PR branch and continuing the work on it (you will still be credited as the main author)? Parquet schema conversion is particularly hard to get right because there are a bunch of head-scratching historical compatibility issues :(

@Sephiroth-Lin (Contributor, Author)

@liancheng OK, no problem. Thank you!

@liancheng (Contributor)

Cool, then would you mind closing this PR for now?

@liancheng (Contributor)

Opened #7304 to fix this issue.

@scwf (Contributor) commented Jul 9, 2015

@Sephiroth-Lin please close this PR.

asfgit pushed a commit that referenced this pull request Jul 9, 2015:
…when handling Parquet LISTs in compatible mode

This PR is based on #7209 authored by Sephiroth-Lin.

Author: Weizhong Lin <linweizhong@huawei.com>

Closes #7304 from liancheng/spark-8928 and squashes the following commits:

75267fe [Cheng Lian] Makes CatalystSchemaConverter sticking to 1.4.x- when handling LISTs in compatible mode
asfgit pushed a commit that referenced this pull request Jul 9, 2015:
…when handling Parquet LISTs in compatible mode

This PR is based on #7209 authored by Sephiroth-Lin.

Author: Weizhong Lin <linweizhong@huawei.com>

Closes #7314 from liancheng/spark-8928 and squashes the following commits:

75267fe [Cheng Lian] Makes CatalystSchemaConverter sticking to 1.4.x- when handling LISTs in compatible mode
@rxin (Contributor) commented Jul 9, 2015

@Sephiroth-Lin You should add the email you used for the commit to your GitHub profile. Then it will show up as your commit.

@liancheng (Contributor)

@Sephiroth-Lin BTW, I added your email address manually when merging #7314. (I failed to update the author field when merging this PR, so I reverted it and reopened it as #7314.)

@Sephiroth-Lin deleted the SPARK-8811 branch May 15, 2016