
[SPARK-8811][SQL] Read array struct data from parquet error #7209

Closed · 7 commits

Conversation

@Sephiroth-Lin (Contributor)

JIRA: https://issues.apache.org/jira/browse/SPARK-8811

We have a table:
t1(c1 string, 
   c2 string, 
   arr_c1 array<struct<in_c1 string, in_c2 string>>, 
   arr_c2 array<struct<in_c1 string, in_c2 string>>
)

We save its data in Parquet.
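For context, a minimal reproduction might look like the sketch below (assuming Spark 1.4.x with a HiveContext; the table and column names come from this report, everything else is illustrative):

// Sketch only: create a Parquet-backed Hive table with the schema above.
// Assumes `hiveContext` is an org.apache.spark.sql.hive.HiveContext.
hiveContext.sql("""
  CREATE TABLE t1 (
    c1 string,
    c2 string,
    arr_c1 array<struct<in_c1:string,in_c2:string>>,
    arr_c2 array<struct<in_c1:string,in_c2:string>>
  )
  STORED AS PARQUET
""")

// After inserting some rows, a full scan fails on affected versions:
hiveContext.sql("SELECT * FROM t1").show()  // java.lang.ArrayIndexOutOfBoundsException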

For select * from t1, we know the Parquet file schema may be:
message hive_schema {
  optional binary c1;
  optional binary c2;
  optional group arr_c1 (LIST) {
    repeated group bag {
      optional group array_element {
        optional binary IN_C1;
        optional binary IN_C2;
      }
    }
  }
  optional group arr_c2 (LIST) {
    repeated group bag {
      optional group array_element {
        optional binary IN_C1;
        optional binary IN_C2;
      }
    }
  }
}
But the requested schema is:
message root {
  optional binary c1;
  optional binary c2;
  optional group arr_c1 (LIST) {
    repeated group bag {
      optional group element {
        optional binary IN_C1;
        optional binary IN_C2;
      }
    }
  }
  optional group arr_c2 (LIST) {
    repeated group bag {
      optional group element {
        optional binary IN_C1;
        optional binary IN_C2;
      }
    }
  }
}

So reading this data from Parquet throws a java.lang.ArrayIndexOutOfBoundsException: the inner group is named array_element in the file schema but element in the requested schema, so the two no longer match up.

@SparkQA commented Jul 3, 2015

Test build #36489 has finished for PR 7209 at commit ecd2547.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@scwf (Contributor) commented Jul 3, 2015

@liancheng Changing it like this resolves the Parquet query issue I sent you, but it fails the unit tests. Can you take a look?

@SparkQA commented Jul 3, 2015

Test build #36494 has finished for PR 7209 at commit 3d38a75.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SaintBacchus (Contributor)

LGTM

@liancheng (Contributor)

@Sephiroth-Lin @scwf This issue is actually much more complicated than it looks. The TL;DR is that, in the early days, Parquet didn't explicitly specify how LIST and MAP should be constructed, and different systems and tools reinvented their own wheels. The consequence is broken Parquet interoperability: files written by system A might not be readable by system B. The most recent Parquet format spec tries to fix this by specifying LIST and MAP structures explicitly and by adding backwards-compatibility rules (1, 2) to cover existing legacy data files.

We are trying to make Spark SQL compatible with the Parquet format spec. This work consists of three parts:

  1. Refactoring schema conversion between Parquet and Spark SQL (done: [SPARK-6777] [SQL] Implements backwards compatibility rules in CatalystSchemaConverter #6617)

    This makes Spark SQL recognize all the "weird" LIST and MAP structures in legacy data files. But it only fixes schema conversion; #6617 doesn't refactor the actual data read path. So there's an internal feature flag, spark.sql.parquet.followParquetFormatSpec, which is turned off by default to stay consistent with the current data read path (see the config sketch after this list).

  2. Refactoring the Parquet data read path

    After this part is finished, we should be able to read all kinds of legacy Parquet files, including the one mentioned in this PR.

  3. Refactoring the Parquet data write path

    So that Spark SQL writes standard Parquet data that conforms to the Parquet format spec.

I'm currently working on part 2, which fixes your problem here. A PR will be sent out soon.
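For reference, the internal flag mentioned in part 1 is set like any other SQL conf (a sketch; the flag was internal and off by default at the time):

// Opt in to the Parquet-format-spec-compliant conversion rules
// (internal flag; defaults to false to preserve the old behavior).
sqlContext.setConf("spark.sql.parquet.followParquetFormatSpec", "true")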

@liancheng (Contributor)

After rethinking this PR, I think it does spot another issue: the current master breaks backwards compatibility when reading Parquet files created by parquet-avro. When converting a Spark SQL schema to a Parquet schema, Spark 1.4.0 and prior versions mostly follow parquet-avro and convert arrays that may contain null values into something like this:

message root {
  optional group _c0 (LIST) {
    repeated group bag {
      optional group array {
        <element-type>
      }
    }
  }
}

Please note the field name array. However, the current master changes this to element even in compatible mode.

@Sephiroth-Lin Would you mind fixing this issue by changing the array_element string to array? My motivation is that we should behave exactly the same as Spark 1.4.0- for now, and then fix SPARK-8811 in the work I mentioned in my previous comment. You may either continue working on this PR or close it and start a new PR for the parquet-avro compatibility issue.

@liancheng (Contributor)

@Sephiroth-Lin @scwf The aforementioned PR is here: #7231. A test case for SPARK-8811 has been added.

@scwf (Contributor) commented Jul 6, 2015

Wow, that's cool!!

@scwf (Contributor) commented Jul 6, 2015

Do we still need to file a PR changing the array_element string to array?

@Sephiroth-Lin (Contributor, Author)

@liancheng OK, good, thank you.

@liancheng (Contributor)

@scwf Yeah, I didn't make the element-to-array change in #7231. It would be good to have one, either based on this PR or as a new one. The tricky part is that it needs parquet-avro to write the test case. We may generate a Parquet file with parquet-avro and then add it as a test resource.
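Generating such a resource might look roughly like this (a sketch, not tested; assumes the parquet-avro 1.6.x-era API, where the package was still parquet.avro rather than org.apache.parquet.avro, and the schema and paths are purely illustrative):

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import parquet.avro.AvroParquetWriter

// An Avro schema with a single int-array field.
val avroSchema = new Schema.Parser().parse(
  """{"type": "record", "name": "TestRecord", "fields": [
    |  {"name": "f", "type": {"type": "array", "items": "int"}}
    |]}""".stripMargin)

// parquet-avro writes the 2-level LIST layout with "array" as the
// repeated field name -- exactly the legacy shape discussed here.
val writer = new AvroParquetWriter[GenericRecord](
  new Path("/tmp/parquet-avro-array.parquet"), avroSchema)

val record = new GenericData.Record(avroSchema)
record.put("f", java.util.Arrays.asList(1: Integer, 2: Integer, 3: Integer))
writer.write(record)
writer.close()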

@Sephiroth-Lin (Contributor, Author)

@liancheng I have updated it, please help review, thank you!

@SparkQA commented Jul 7, 2015

Test build #36638 has finished for PR 7209 at commit e887706.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@pzzs (Contributor) commented Jul 7, 2015

LGTM

@@ -490,7 +490,7 @@ private[parquet] class CatalystSchemaConverter(
       .buildGroup(repetition).as(LIST)
       .addField(
         Types.repeatedGroup()
-          .addField(convertField(StructField("element", elementType, containsNull)))
+          .addField(convertField(StructField("array", elementType, containsNull)))
Contributor: This line shouldn't be changed. As commented above, this case branch implements standard Parquet schema conversion following the Parquet format spec, which explicitly requires the innermost element type name to be element.
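For reference, the standard layout mandated by the Parquet format spec is a 3-level structure whose 2nd-level repeated group is named list and whose element field is named element:

<list-repetition> group <name> (LIST) {
  repeated group list {
    <element-repetition> <element-type> element;
  }
}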

@SparkQA commented Jul 7, 2015

Test build #36651 has finished for PR 7209 at commit d931141.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 7, 2015

Test build #36653 has finished for PR 7209 at commit 0069895.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 7, 2015

Test build #36663 has finished for PR 7209 at commit 2480abd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -446,7 +446,7 @@ private[parquet] class CatalystSchemaConverter(
       field.name,
       Types
         .buildGroup(REPEATED)
-        .addField(convertField(StructField("element", elementType, nullable)))
+        .addField(convertField(StructField("array", elementType, nullable)))
Contributor: Actually I made a mistake here; we should use array_element rather than array.

This is a little bit complicated... In the early days, when Spark SQL's Parquet support was first authored, the Parquet format spec wasn't clear about how to write arrays and maps, so Spark SQL took a somewhat odd approach: if the array may contain nulls, we mimic parquet-hive, which writes a 3-level structure with array_element as the 2nd-level type name; otherwise, we mimic parquet-avro, which writes a 2-level structure with array as the 2nd-level type name.
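Concretely, reconstructed from the description above (with <element-type> standing for the converted element schema), Spark 1.4.x- writes:

containsNull = true (3-level, parquet-hive style):

<repetition> group <name> (LIST) {
  repeated group bag {
    optional group array_element {
      <element-type>
    }
  }
}

containsNull = false (2-level, parquet-avro style):

<repetition> group <name> (LIST) {
  repeated group array {
    <element-type>
  }
}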

Contributor: Just to be clear, PR #7231 already covers the original bug this PR tried to fix, so we'll be able to read Hive data in the legacy format. The field names changed here matter for the write path, because we want to write exactly the same format as older Spark SQL versions when compatible mode is turned on.

@liancheng (Contributor)

Hey @Sephiroth-Lin, do you mind me forking this PR branch and continuing the work on it (you will still be credited as the main author)? Parquet schema conversion is particularly hard to get right because there are a bunch of head-scratching historical compatibility issues :(

@Sephiroth-Lin (Contributor, Author)

@liancheng OK, no problem. Thank you!

@liancheng (Contributor)

Cool, then would you mind closing this PR for now?

@liancheng (Contributor)

Opened #7304 to fix this issue.

@scwf (Contributor) commented Jul 9, 2015

@Sephiroth-Lin please close this PR.

asfgit pushed a commit that referenced this pull request Jul 9, 2015:
…when handling Parquet LISTs in compatible mode

This PR is based on #7209 authored by Sephiroth-Lin.

Author: Weizhong Lin <linweizhong@huawei.com>

Closes #7304 from liancheng/spark-8928 and squashes the following commits:

75267fe [Cheng Lian] Makes CatalystSchemaConverter sticking to 1.4.x- when handling LISTs in compatible mode
asfgit pushed a commit that referenced this pull request Jul 9, 2015:
…when handling Parquet LISTs in compatible mode

This PR is based on #7209 authored by Sephiroth-Lin.

Author: Weizhong Lin <linweizhong@huawei.com>

Closes #7314 from liancheng/spark-8928 and squashes the following commits:

75267fe [Cheng Lian] Makes CatalystSchemaConverter sticking to 1.4.x- when handling LISTs in compatible mode
@rxin (Contributor) commented Jul 9, 2015

@Sephiroth-Lin You should add the email you used for the commit to your GitHub profile. Then it will show up as your commit.

@liancheng (Contributor)

@Sephiroth-Lin BTW, I added your email address manually when merging #7314. (I failed to update the author field when merging this PR, so I reverted it and reopened it as #7314.)

@Sephiroth-Lin deleted the SPARK-8811 branch May 15, 2016