[SPARK-9876][SQL]: Update Parquet to 1.8.1. #13280
Conversation
In the past, Parquet upgrades brought perf regressions. Any idea about this release?
Test build #59214 has finished for PR 13280 at commit
Test build #3015 has finished for PR 13280 at commit
@rxin, I agree that we shouldn't upgrade if there are perf regressions. I would like to know what they are so we can fix them in Parquet upstream, though. This should be a big performance improvement for selective queries on sorted data, because we currently can't push down any string predicates. It seems like a good thing to pair with the new
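(Editor's aside.) The point about selective queries on sorted data can be illustrated with a toy sketch of min/max-based row-group skipping. This is illustrative Python with invented names, not Spark or parquet-mr code; it only models the idea that a pushed-down string predicate lets the reader skip row groups whose statistics rule out a match:

```python
# Toy model of row-group pruning via min/max statistics (not actual Parquet code).
# On sorted data, each row group covers a narrow value range, so an equality
# predicate can skip most groups without reading their rows at all.

row_groups = [
    {"min": "apple", "max": "cherry", "rows": ["apple", "banana", "cherry"]},
    {"min": "date",  "max": "fig",    "rows": ["date", "elderberry", "fig"]},
    {"min": "grape", "max": "kiwi",   "rows": ["grape", "honeydew", "kiwi"]},
]

def scan_equals(groups, value):
    """Read only the groups whose [min, max] range can contain `value`."""
    hits = []
    for g in groups:
        if g["min"] <= value <= g["max"]:  # stats say the value may be here
            hits.extend(r for r in g["rows"] if r == value)
        # otherwise the whole group is skipped: no I/O for its rows
    return hits

print(scan_equals(row_groups, "fig"))  # → ['fig']  (only the middle group is read)
```

Without string push-down (the pre-1.8.1 situation, due to the corrupted statistics in PARQUET-251), the reader has to scan every group regardless of the predicate.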
I'm not sure what should be done to fix the dependency test failure. It looks like there's a list of dependencies that needs to be updated. Is that something I should include in this PR?
Yea, you would need to explicitly update the dependency list. We added that as a safeguard against accidentally changing dependencies.
cc @liancheng, who might have an idea about past Parquet perf regressions.
The dev/test-dependencies script can auto-update the deps files for this purpose. One thing we ask people to investigate is the set of changes between the old and new version, so we can think through benefits vs. potential incompatibilities.
(Updated the performance regression discussion link below.) I had once tried to upgrade Parquet to 1.8.1, and one more change needs to be made for the upgrade: https://github.com/apache/spark/pull/9225/files#diff-b4108187503e0f3ac64c1630d266b122R115 For the performance regression, here is the full thread of the previous discussion. We observed that 1.8.2-SNAPSHOT didn't have the regression anymore. I had tried to bisect, but failed to find anything useful. I thought 1.8.2 would probably be released soon at that time, so I didn't try hard to dig into it...
@rdblue Is there any perf evaluation of this new version that we can refer to?
I think we should upgrade Parquet to 1.8.1 in Spark 2.0 for the following reasons:
My only concern for this PR is that we should add the hack done in PR #9225. Otherwise there would be a noticeable performance regression for queries like
Force-pushed from 022dd6b to 30769bd.
@liancheng, thanks for pointing out that fix, I've added it. I thought that was already committed, since it has been a while since we fixed the Parquet side. I've also updated the dependency files; thanks @srowen. I think that given the fix for PARQUET-251, updating is a good idea. There isn't a perf evaluation for 1.8.1 that I can point to, but we've been using it without a noticeable change for months.
@rdblue LGTM pending rebasing and Jenkins. Thanks for fixing this!
Force-pushed from 30769bd to d1c79c7.
@liancheng: rebased. Sorry I missed that earlier.
Test build #59422 has finished for PR 13280 at commit
Force-pushed from d1c79c7 to af3957f.
Test build #59430 has finished for PR 13280 at commit
Test build #59441 has finished for PR 13280 at commit
Force-pushed from af3957f to 85e03f9.
Test build #59500 has finished for PR 13280 at commit
|$actual
|$expectedSchema
|Actual clipped schema:
|$actual
Nit: Could you please revert this indentation change? (IntelliJ's bad I guess...)
Force-pushed from 85e03f9 to 40241dc.
@liancheng, fixed. Yeah, IntelliJ has a few annoyances like that with Scala. Imports are a mess.
val emptySchema: MessageType = Types.buildMessage()
  .required(PrimitiveType.PrimitiveTypeName.INT32).named("dummy")
  .named("root")
emptySchema.getFields.clear() // remove the dummy column
Maybe move `emptySchema` into object `CatalystSchemaConverter` as a method? Now the same "HACK ALERT" appears 3 times in total:

```scala
def emptyMessageType(name: String): MessageType = {
  // (Add the HACK ALERT here.)
  val messageType = Types.buildMessage()
    .required(PrimitiveType.PrimitiveTypeName.INT32).named("dummy")
    .named(name)
  messageType.getFields.clear()
  messageType
}
```
Also, let's make sure the hack-alert message mentions the specific parquet-mr version (1.8.2-SNAPSHOT) that fixes PARQUET-363, so that people clearly know when we can remove this hack.
Just to double-check: the following two are equivalent, right? I wasn't aware of the first approach (which is a little bit confusing, since there are two consecutive `named` calls).

```scala
// (1)
Types
  .buildMessage()
  .required(PrimitiveType.PrimitiveTypeName.INT32).named("dummy")
  .named("root")

// (2)
Types
  .buildMessage()
  .addField(
    Types.required(PrimitiveType.PrimitiveTypeName.INT32).named("dummy"))
  .named("root")
```
Yes, those are equivalent. The builders nest so that it reads like the schema definition, and the parent builder is returned by `named`.
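(Editor's aside.) The "parent builder returned by `named`" idea can be modeled in a toy Python sketch. All names here are invented; this is not the real parquet-mr `Types` API, just an illustration of why two consecutive `named` calls make sense: the first finishes the field and hands back the message builder, the second finishes the message itself.

```python
# Toy model of a nested builder where finishing a child ("named") returns
# the parent builder, so chained calls read like the schema definition.

class FieldBuilder:
    def __init__(self, parent, type_name):
        self.parent = parent
        self.type_name = type_name

    def named(self, name):
        # Finish this field and hand control back to the parent builder.
        self.parent.fields.append((name, self.type_name))
        return self.parent

class MessageBuilder:
    def __init__(self):
        self.fields = []

    def required(self, type_name):
        # Start a nested field builder; the chain "dives into" the field.
        return FieldBuilder(self, type_name)

    def named(self, name):
        # Finish the message itself.
        return {"name": name, "fields": self.fields}

schema = (MessageBuilder()
          .required("int32").named("dummy")  # first named: finishes the field
          .named("root"))                    # second named: finishes the message
```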
Fixed the duplicates by adding CatalystSchemaConverter.EMPTY_MESSAGE.
Force-pushed from 31125de to 37b4978.
Thanks @liancheng! It will be great to have predicate push-down for strings in 2.0!
Hello @rdblue, we are pretty late in this release cycle. I am afraid that we cannot actually upgrade Parquet to 1.8.1, for the following two reasons:
So, I'd like to propose reverting this upgrade. We can try to upgrade Parquet early in the 2.1 development cycle (assuming we have figured out the regression), so we have more time to test this change.
+1 on reverting for the reasons Yin mentioned. It's very risky to do dependency updates at this point for 2.0, and I was surprised this got merged without at least verifying that the prior performance regression we found went away.
Oh, wait... sorry, I just realized that @liancheng said he also merged this to branch-2.0. +1 on reverting that.
I'm working on verifying the regression using the micro-benchmark I did before. Sorry for the trouble.
As I said on PR #13445: it sounds reasonable, but we should follow up on this. If we revert the change, I suggest that we only revert it in 2.0, or add it back to master as soon as 2.0 is branched. That way we don't adversely affect 2.0, but we do get this addressed. Is that a reasonable path forward?
@yhuai, what started failing?
… 1.8.1."

## What changes were proposed in this pull request?

Since we are pretty late in the 2.0 release cycle, it is not clear if this upgrade can be tested thoroughly and if we can resolve the regression issue that we observed before. This PR reverts #13280 from branch 2.0.

## How was this patch tested?

Existing tests.

This reverts commit 776d183.

Author: Yin Huai <yhuai@databricks.com>

Closes #13450 from yhuai/revertParquet1.8.1-branch-2.0.
## What changes were proposed in this pull request?

Revived #13464. Fix Java lint errors introduced by #13286 and #13280. Before:

```
Using `mvn` from path: /Users/pichu/Project/spark/build/apache-maven-3.3.9/bin/mvn
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[340,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[341,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[342,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[343,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[41,28] (naming) MethodName: Method name 'Append' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[52,28] (naming) MethodName: Method name 'Complete' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[61,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.PrimitiveType.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[62,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.Type.
```

## How was this patch tested?

Ran `dev/lint-java` locally.

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13559 from techaddict/minor-3.
(cherry picked from commit f958c1c)
Signed-off-by: Sean Owen <sowen@cloudera.com>
// To work around this problem, here we first construct a `MessageType` with a single dummy
// field, and then remove the field to obtain an empty `MessageType`.
//
// TODO: Revert this change after upgrading parquet-mr to 1.8.2+
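(Editor's aside.) The dummy-field workaround above can be sketched generically in Python. This is a toy builder with invented names, not parquet-mr: it only demonstrates the pattern of building with a placeholder and mutating the result when a builder refuses to produce an empty group (the PARQUET-363 limitation in parquet-mr before 1.8.2).

```python
# Toy sketch of the "build with a dummy, then clear" workaround
# (illustrative only; the real parquet-mr Types builder is Java, not this).

class GroupBuilder:
    def __init__(self):
        self._fields = []

    def add_field(self, field):
        self._fields.append(field)
        return self

    def build(self):
        if not self._fields:
            # Mirrors parquet-mr < 1.8.2 rejecting empty groups (PARQUET-363).
            raise ValueError("Cannot build an empty group")
        return {"fields": self._fields}

# Direct construction of an empty group fails...
try:
    GroupBuilder().build()
except ValueError:
    pass

# ...so build with a dummy field, then mutate the result to be empty.
message = GroupBuilder().add_field("dummy").build()
message["fields"].clear()  # remove the dummy column
```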
@rdblue When will Parquet 1.8.2 be released?
We're close to a 1.9.0 release, just working through some performance issues with the switch to ByteBuffer APIs. No estimate on that yet. We can do a 1.8.2 if there's interest so that we can fix some things like this without pulling in all those changes.
…t predicate pushdown and replace deprecated fromByteArray.

## What changes were proposed in this pull request?

It seems Parquet has been upgraded to 1.8.1 by #13280. So, this PR enables string and binary predicate push-down, which was disabled due to [SPARK-11153](https://issues.apache.org/jira/browse/SPARK-11153) and [PARQUET-251](https://issues.apache.org/jira/browse/PARQUET-251), and cleans up some comments that were left behind (I think by mistake). This PR also replaces the API `fromByteArray()`, deprecated in [PARQUET-251](https://issues.apache.org/jira/browse/PARQUET-251).

## How was this patch tested?

Unit tests in `ParquetFilters`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #13389 from HyukjinKwon/parquet-1.8-followup.
What changes were proposed in this pull request?
This includes minimal changes to get Spark using the current release of Parquet, 1.8.1.
How was this patch tested?
This uses the existing Parquet tests.