[SPARK-9876][SQL]: Update Parquet to 1.8.1. #13280

rdblue · 2016-05-24T18:38:28Z

What changes were proposed in this pull request?

This includes minimal changes to get Spark using the current release of Parquet, 1.8.1.

How was this patch tested?

This uses the existing Parquet tests.

rxin · 2016-05-24T18:46:02Z

in the past parquet upgrades brought perf regressions. Any idea about this release?

SparkQA · 2016-05-24T18:46:08Z

Test build #59214 has finished for PR 13280 at commit 022dd6b.

This patch fails build dependency tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-05-24T18:47:23Z

Test build #3015 has finished for PR 13280 at commit 022dd6b.

This patch fails build dependency tests.
This patch merges cleanly.
This patch adds no public classes.

rdblue · 2016-05-24T19:22:14Z

@rxin, I agree that we shouldn't upgrade if there are perf regressions. I would like to know what they are so we can fix them in Parquet upstream though. This should be a big performance improvement for selective queries on sorted data because we can't currently push down any string predicates. It seems like a good thing to pair with the new DataFrameWriter#sortBy feature.

rdblue · 2016-05-24T22:11:15Z

I'm not sure what should be done to fix the dependency test failure. Looks like there's a list of dependencies that needs to be updated. Is that something I should include in this PR?

rxin · 2016-05-24T22:14:00Z

Yea you would need to explicitly update the dependency list. We added that as a safe-guard to accidentally changing dependencies.

rxin · 2016-05-24T22:14:13Z

cc @liancheng who might have idea about past parquet perf regressions.

srowen · 2016-05-24T23:07:16Z

The dev/test-dependencies script can auto update the deps files for this purpose.

One thing we ask people to investigate are changes between old and new version so we can think through benefits vs potential incompatibilities

liancheng · 2016-05-25T02:30:38Z

(Updated the performance regression discussion link below.)

I had once tried to upgrade Parquet to 1.8.1, and one more change needs to be done for the upgrade: https://github.com/apache/spark/pull/9225/files#diff-b4108187503e0f3ac64c1630d266b122R115

For the performance regression, here is the full thread of previous discussion. We observed that 1.8.2-SNAPSHOT didn't have the regression any more.

I had tried to bisect, but failed to find anything useful. I thought 1.8.2 would probably be released soon at that time, so didn't try hard to dig it...

yhuai · 2016-05-26T17:15:38Z

@rdblue Is there any perf evaluation of this new version that we can refer to ?

liancheng · 2016-05-26T20:16:58Z

I think we should upgrade Parquet to 1.8.1 in Spark 2.0 due to the following reasons:

Get PARQUET-251 fixed so that we no longer write corrupted statistics for string/binary columns, and can re-enable filter push-down for these columns (1.8.1 ignores corrupted statistics automatically).
The performance regression was discovered by a micro benchmark that simply scans the flat TPC-DS store_sales table (scale factor 15). However, we are using our own vectorized Parquet reader for flat tables now. So at least table scan for flat schemata shouldn't be affected at the moment.

My only concern for this PR is that we should add the hack done in PR #9225. Otherwise there would be noticeable performance regression for queries like SELECT COUNT(*) FROM t.

rdblue · 2016-05-26T22:36:28Z

@liancheng, thanks for pointing out that fix, I've added it. I thought that was already committed since it has been a while since we fixed the Parquet side.

I've also updated the dependency files, thanks @srowen.

I think given the fix for PARQUET-251, updating is a good idea. There isn't a perf evaluation for 1.8.1 that I can point to, but we've been using it without a noticeable change for months.

liancheng · 2016-05-26T23:44:02Z

@rdblue LGTM pending rebasing and Jenkins. Thanks for fixing this!

rdblue · 2016-05-26T23:55:49Z

@liancheng: rebased. Sorry I missed that earlier.

SparkQA · 2016-05-27T00:41:37Z

Test build #59422 has finished for PR 13280 at commit 30769bd.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2016-05-27T01:33:25Z

Test build #59430 has finished for PR 13280 at commit d1c79c7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-05-27T02:35:14Z

Test build #59441 has finished for PR 13280 at commit af3957f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-05-27T17:48:30Z

Test build #59500 has finished for PR 13280 at commit 85e03f9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

liancheng · 2016-05-27T17:54:56Z

...e/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala

-             |$actual
+              |$expectedSchema
+              |Actual clipped schema:
+              |$actual


Nit: Could you please revert this indentation change? (IntelliJ's bad I guess...)

rdblue · 2016-05-27T18:00:40Z

@liancheng, fixed. Yeah, IntelliJ has a few annoyances like that with scala. Imports are a mess.

liancheng · 2016-05-27T18:09:40Z

...e/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala

+  val emptySchema: MessageType = Types.buildMessage()
+      .required(PrimitiveType.PrimitiveTypeName.INT32).named("dummy")
+      .named("root")
+  emptySchema.getFields.clear(); // remove the dummy column


Maybe move emptySchema into object CatalystSchemaConverter as a method? Now the same "HACK ALERT" appears 3 times in total:

def emptyMessageType(name: String): MessageType = { // (Add the HACK ALERT here.) val messageType = Types.buildMessage() .required(PrimitiveType.PrimitiveTypeName.INT32).named("dummy") .named(name) messageType.getFields.clear() messageType }

Also, let's make sure the hack alert message mentions the specific parquet-mr version (1.8.2-SNAPSHOT) that fixes PARQUET-363. So that people clearly know when we can remove this hack.

Just a double check, so the following two are equivalent, right? I wasn't aware of the first approach (which is a little bit confusing since there are two consecutive named calls).

// (1) Types .buildMessage() .required(PrimitiveType.PrimitiveTypeName.INT32).named("dummy") .named("root") // (2) Types .buildMessage() .addField( Types.required(PrimitiveType.PrimitiveTypeName.INT32).named("dummy") ) .named("root")

Yes, those are equivalent. The builders nest so that it reads like the schema definition and the parent builder is returned by named.

Fixed the duplicates by adding CatalystSchemaConverter.EMPTY_MESSAGE.

rdblue · 2016-05-28T00:03:34Z

Thanks @liancheng! It will be great to have predicate push-down for strings in 2.0!

yhuai · 2016-06-01T06:37:36Z

Hello @rdblue, we are pretty late in this release cycle. I am afraid that we cannot actually upgrade Parquet to 1.8.1 because of the following two reasons:

Since this change was merged pretty late, I am not sure this change can be thoroughly tested. All of our previous testing was based on the Parquet 1.7.0. Also, some of our internal jobs started to fail after this upgrade.
Parquet 1.8.1 potentially introduces performance regression. I am not sure we can upgrade Parquet before finish the investigation of this regression.

So, I'd like to propose to revert this upgrade. We can try to upgrade Parquet in the early development cycle of 2.1 (assuming we have figured out the regression). So, we can have more time to test this change.

rxin · 2016-06-01T06:40:40Z

+1 on reverting for the reasons Yin mentioned. It's very risky to do dep updates at this point for 2.0, and I was surprised this got merged without at least verifying the prior performance regression we found went away.

markhamstra · 2016-06-01T13:48:58Z

@rxin Huh? The merge was to master, not branch-2.0. Doesn't that put it on the 2.1 track and not into 2.0.0? I think that is all that Yin was saying, that @rdblue was mistaken about this change going into 2.0.

markhamstra · 2016-06-01T13:50:45Z

Oh, wait... sorry, I just realized that @liancheng said he also merged to branch-2.0. +1 on reverting that.

liancheng · 2016-06-01T15:06:12Z

I'm working on verifying the regression using the micro benchmark I did before. Sorry for the troubles.

rdblue · 2016-06-01T16:37:29Z

As I said on PR #13445: It sounds reasonable, but we should follow up on this. If we revert the change I suggest that we only revert it in 2.0 or add it to master as soon as 2.0 is branched. That way we don't adversely affect 2.0, but we do get this addressed. Is that a reasonable path forward?

rdblue · 2016-06-01T16:38:01Z

@yhuai, what started failing?

… 1.8.1." ## What changes were proposed in this pull request? Since we are pretty late in the 2.0 release cycle, it is not clear if this upgrade can be tested thoroughly and if we can resolve the regression issue that we observed before. This PR reverts #13280 from branch 2.0. ## How was this patch tested? Existing tests This reverts commit 776d183. Author: Yin Huai <yhuai@databricks.com> Closes #13450 from yhuai/revertParquet1.8.1-branch-2.0.

## What changes were proposed in this pull request? revived #13464 Fix Java Lint errors introduced by #13286 and #13280 Before: ``` Using `mvn` from path: /Users/pichu/Project/spark/build/apache-maven-3.3.9/bin/mvn Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0 Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[340,5] (whitespace) FileTabCharacter: Line contains a tab character. [ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[341,5] (whitespace) FileTabCharacter: Line contains a tab character. [ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[342,5] (whitespace) FileTabCharacter: Line contains a tab character. [ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[343,5] (whitespace) FileTabCharacter: Line contains a tab character. [ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[41,28] (naming) MethodName: Method name 'Append' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'. [ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[52,28] (naming) MethodName: Method name 'Complete' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'. [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[61,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.PrimitiveType. [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[62,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.Type. ``` ## How was this patch tested? ran `dev/lint-java` locally Author: Sandeep Singh <sandeep@techaddict.me> Closes #13559 from techaddict/minor-3.

## What changes were proposed in this pull request? revived #13464 Fix Java Lint errors introduced by #13286 and #13280 Before: ``` Using `mvn` from path: /Users/pichu/Project/spark/build/apache-maven-3.3.9/bin/mvn Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0 Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[340,5] (whitespace) FileTabCharacter: Line contains a tab character. [ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[341,5] (whitespace) FileTabCharacter: Line contains a tab character. [ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[342,5] (whitespace) FileTabCharacter: Line contains a tab character. [ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[343,5] (whitespace) FileTabCharacter: Line contains a tab character. [ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[41,28] (naming) MethodName: Method name 'Append' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'. [ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[52,28] (naming) MethodName: Method name 'Complete' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'. [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[61,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.PrimitiveType. [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[62,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.Type. ``` ## How was this patch tested? ran `dev/lint-java` locally Author: Sandeep Singh <sandeep@techaddict.me> Closes #13559 from techaddict/minor-3. (cherry picked from commit f958c1c) Signed-off-by: Sean Owen <sowen@cloudera.com>

## What changes were proposed in this pull request? revived apache#13464 Fix Java Lint errors introduced by apache#13286 and apache#13280 Before: ``` Using `mvn` from path: /Users/pichu/Project/spark/build/apache-maven-3.3.9/bin/mvn Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0 Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[340,5] (whitespace) FileTabCharacter: Line contains a tab character. [ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[341,5] (whitespace) FileTabCharacter: Line contains a tab character. [ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[342,5] (whitespace) FileTabCharacter: Line contains a tab character. [ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[343,5] (whitespace) FileTabCharacter: Line contains a tab character. [ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[41,28] (naming) MethodName: Method name 'Append' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'. [ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[52,28] (naming) MethodName: Method name 'Complete' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'. [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[61,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.PrimitiveType. [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[62,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.Type. ``` ## How was this patch tested? ran `dev/lint-java` locally Author: Sandeep Singh <sandeep@techaddict.me> Closes apache#13559 from techaddict/minor-3.

yhuai · 2016-06-14T16:29:58Z

.../main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala

+  // To workaround this problem, here we first construct a `MessageType` with a single dummy
+  // field, and then remove the field to obtain an empty `MessageType`.
+  //
+  // TODO Reverts this change after upgrading parquet-mr to 1.8.2+


@rdblue When will Parquet 1.8.2 be released?

We're close to a 1.9.0 release, just working through some performance issues with the switch to ByteBuffer APIs. No estimate on that yet. We can do a 1.8.2 if there's interest so that we can fix some things like this without pulling in all those changes.

…t predicate pushdown and replace deprecated fromByteArray. ## What changes were proposed in this pull request? It seems Parquet has been upgraded to 1.8.1 by #13280. So, this PR enables string and binary predicate push down which was disabled due to [SPARK-11153](https://issues.apache.org/jira/browse/SPARK-11153) and [PARQUET-251](https://issues.apache.org/jira/browse/PARQUET-251) and cleans up some comments unremoved (I think by mistake). This PR also replace the API, `fromByteArray()` deprecated in [PARQUET-251](https://issues.apache.org/jira/browse/PARQUET-251). ## How was this patch tested? Unit tests in `ParquetFilters` Author: hyukjinkwon <gurwls223@gmail.com> Closes #13389 from HyukjinKwon/parquet-1.8-followup.

rdblue force-pushed the SPARK-9876-update-parquet branch from 022dd6b to 30769bd Compare May 26, 2016 22:31

rdblue force-pushed the SPARK-9876-update-parquet branch from 30769bd to d1c79c7 Compare May 26, 2016 23:55

rdblue force-pushed the SPARK-9876-update-parquet branch from d1c79c7 to af3957f Compare May 27, 2016 00:53

rdblue force-pushed the SPARK-9876-update-parquet branch from af3957f to 85e03f9 Compare May 27, 2016 16:13

liancheng reviewed May 27, 2016
View reviewed changes

rdblue force-pushed the SPARK-9876-update-parquet branch from 85e03f9 to 40241dc Compare May 27, 2016 17:59

liancheng reviewed May 27, 2016
View reviewed changes

rdblue force-pushed the SPARK-9876-update-parquet branch 2 times, most recently from 31125de to 37b4978 Compare May 27, 2016 18:21

SPARK-9876: Update Parquet to 1.8.1.

33e07ee

HyukjinKwon mentioned this pull request May 29, 2016

[SPARK-9876][SQL][FOLLOWUP] Enable string and binary tests for Parquet predicate pushdown and replace deprecated fromByteArray. #13389

Closed

yhuai mentioned this pull request Jun 1, 2016

[SPARK-9876] Revert "[SPARK-9876][SQL] Update Parquet to 1.8.1." #13445

Closed

yhuai mentioned this pull request Jun 1, 2016

[SPARK-9876] [BRANCH-2.0] Revert "[SPARK-9876][SQL] Update Parquet to 1.8.1." #13450

Closed

techaddict mentioned this pull request Jun 2, 2016

[Minor] Fix Java Lint errors introduced by #13286 and 13280 #13464

Closed

techaddict mentioned this pull request Jun 8, 2016

[Minor] Fix Java Lint errors introduced by #13286 and #13280 #13559

Closed

yhuai reviewed Jun 14, 2016
View reviewed changes

gatorsmile mentioned this pull request Apr 24, 2018

[SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0. #21070

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-9876][SQL]: Update Parquet to 1.8.1. #13280

[SPARK-9876][SQL]: Update Parquet to 1.8.1. #13280

rdblue commented May 24, 2016

rxin commented May 24, 2016

SparkQA commented May 24, 2016

SparkQA commented May 24, 2016

rdblue commented May 24, 2016 •

edited

rdblue commented May 24, 2016

rxin commented May 24, 2016

rxin commented May 24, 2016

srowen commented May 24, 2016

liancheng commented May 25, 2016 •

edited

yhuai commented May 26, 2016

liancheng commented May 26, 2016 •

edited

rdblue commented May 26, 2016

liancheng commented May 26, 2016

rdblue commented May 26, 2016

SparkQA commented May 27, 2016

SparkQA commented May 27, 2016

SparkQA commented May 27, 2016

SparkQA commented May 27, 2016

liancheng May 27, 2016

rdblue commented May 27, 2016

liancheng May 27, 2016 •

edited

liancheng May 27, 2016

rdblue May 27, 2016

rdblue May 27, 2016

rdblue commented May 28, 2016

yhuai commented Jun 1, 2016

rxin commented Jun 1, 2016 •

edited

markhamstra commented Jun 1, 2016

markhamstra commented Jun 1, 2016

liancheng commented Jun 1, 2016

rdblue commented Jun 1, 2016

rdblue commented Jun 1, 2016

yhuai Jun 14, 2016

rdblue Jun 14, 2016

[SPARK-9876][SQL]: Update Parquet to 1.8.1. #13280

[SPARK-9876][SQL]: Update Parquet to 1.8.1. #13280

Conversation

rdblue commented May 24, 2016

What changes were proposed in this pull request?

How was this patch tested?

rxin commented May 24, 2016

SparkQA commented May 24, 2016

SparkQA commented May 24, 2016

rdblue commented May 24, 2016 • edited

rdblue commented May 24, 2016

rxin commented May 24, 2016

rxin commented May 24, 2016

srowen commented May 24, 2016

liancheng commented May 25, 2016 • edited

yhuai commented May 26, 2016

liancheng commented May 26, 2016 • edited

rdblue commented May 26, 2016

liancheng commented May 26, 2016

rdblue commented May 26, 2016

SparkQA commented May 27, 2016

SparkQA commented May 27, 2016

SparkQA commented May 27, 2016

SparkQA commented May 27, 2016

liancheng May 27, 2016

Choose a reason for hiding this comment

rdblue commented May 27, 2016

liancheng May 27, 2016 • edited

Choose a reason for hiding this comment

liancheng May 27, 2016

Choose a reason for hiding this comment

rdblue May 27, 2016

Choose a reason for hiding this comment

rdblue May 27, 2016

Choose a reason for hiding this comment

rdblue commented May 28, 2016

yhuai commented Jun 1, 2016

rxin commented Jun 1, 2016 • edited

markhamstra commented Jun 1, 2016

markhamstra commented Jun 1, 2016

liancheng commented Jun 1, 2016

rdblue commented Jun 1, 2016

rdblue commented Jun 1, 2016

yhuai Jun 14, 2016

Choose a reason for hiding this comment

rdblue Jun 14, 2016

Choose a reason for hiding this comment

rdblue commented May 24, 2016 •

edited

liancheng commented May 25, 2016 •

edited

liancheng commented May 26, 2016 •

edited

liancheng May 27, 2016 •

edited

rxin commented Jun 1, 2016 •

edited