Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-9876][SQL]: Update Parquet to 1.8.1. #13280

Closed
wants to merge 1 commit into from

Conversation

rdblue
Copy link
Contributor

@rdblue rdblue commented May 24, 2016

What changes were proposed in this pull request?

This includes minimal changes to get Spark using the current release of Parquet, 1.8.1.

How was this patch tested?

This uses the existing Parquet tests.

@rxin
Copy link
Contributor

rxin commented May 24, 2016

in the past parquet upgrades brought perf regressions. Any idea about this release?

@SparkQA
Copy link

SparkQA commented May 24, 2016

Test build #59214 has finished for PR 13280 at commit 022dd6b.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 24, 2016

Test build #3015 has finished for PR 13280 at commit 022dd6b.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Copy link
Contributor Author

rdblue commented May 24, 2016

@rxin, I agree that we shouldn't upgrade if there are perf regressions. I would like to know what they are so we can fix them in Parquet upstream though. This should be a big performance improvement for selective queries on sorted data because we can't currently push down any string predicates. It seems like a good thing to pair with the new DataFrameWriter#sortBy feature.

@rdblue
Copy link
Contributor Author

rdblue commented May 24, 2016

I'm not sure what should be done to fix the dependency test failure. Looks like there's a list of dependencies that needs to be updated. Is that something I should include in this PR?

@rxin
Copy link
Contributor

rxin commented May 24, 2016

Yea you would need to explicitly update the dependency list. We added that as a safe-guard to accidentally changing dependencies.

@rxin
Copy link
Contributor

rxin commented May 24, 2016

cc @liancheng who might have idea about past parquet perf regressions.

@srowen
Copy link
Member

srowen commented May 24, 2016

The dev/test-dependencies script can auto update the deps files for this purpose.

One thing we ask people to investigate are changes between old and new version so we can think through benefits vs potential incompatibilities

@liancheng
Copy link
Contributor

liancheng commented May 25, 2016

(Updated the performance regression discussion link below.)

I had once tried to upgrade Parquet to 1.8.1, and one more change needs to be done for the upgrade: https://github.com/apache/spark/pull/9225/files#diff-b4108187503e0f3ac64c1630d266b122R115

For the performance regression, here is the full thread of previous discussion. We observed that 1.8.2-SNAPSHOT didn't have the regression any more.

I had tried to bisect, but failed to find anything useful. I thought 1.8.2 would probably be released soon at that time, so didn't try hard to dig it...

@yhuai
Copy link
Contributor

yhuai commented May 26, 2016

@rdblue Is there any perf evaluation of this new version that we can refer to ?

@liancheng
Copy link
Contributor

liancheng commented May 26, 2016

I think we should upgrade Parquet to 1.8.1 in Spark 2.0 due to the following reasons:

  1. Get PARQUET-251 fixed so that we no longer write corrupted statistics for string/binary columns, and can re-enable filter push-down for these columns (1.8.1 ignores corrupted statistics automatically).
  2. The performance regression was discovered by a micro benchmark that simply scans the flat TPC-DS store_sales table (scale factor 15). However, we are using our own vectorized Parquet reader for flat tables now. So at least table scan for flat schemata shouldn't be affected at the moment.

My only concern for this PR is that we should add the hack done in PR #9225. Otherwise there would be noticeable performance regression for queries like SELECT COUNT(*) FROM t.

@rdblue rdblue force-pushed the SPARK-9876-update-parquet branch from 022dd6b to 30769bd Compare May 26, 2016 22:31
@rdblue
Copy link
Contributor Author

rdblue commented May 26, 2016

@liancheng, thanks for pointing out that fix, I've added it. I thought that was already committed since it has been a while since we fixed the Parquet side.

I've also updated the dependency files, thanks @srowen.

I think given the fix for PARQUET-251, updating is a good idea. There isn't a perf evaluation for 1.8.1 that I can point to, but we've been using it without a noticeable change for months.

@liancheng
Copy link
Contributor

@rdblue LGTM pending rebasing and Jenkins. Thanks for fixing this!

@rdblue rdblue force-pushed the SPARK-9876-update-parquet branch from 30769bd to d1c79c7 Compare May 26, 2016 23:55
@rdblue
Copy link
Contributor Author

rdblue commented May 26, 2016

@liancheng: rebased. Sorry I missed that earlier.

@SparkQA
Copy link

SparkQA commented May 27, 2016

Test build #59422 has finished for PR 13280 at commit 30769bd.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@rdblue rdblue force-pushed the SPARK-9876-update-parquet branch from d1c79c7 to af3957f Compare May 27, 2016 00:53
@SparkQA
Copy link

SparkQA commented May 27, 2016

Test build #59430 has finished for PR 13280 at commit d1c79c7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 27, 2016

Test build #59441 has finished for PR 13280 at commit af3957f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue rdblue force-pushed the SPARK-9876-update-parquet branch from af3957f to 85e03f9 Compare May 27, 2016 16:13
@SparkQA
Copy link

SparkQA commented May 27, 2016

Test build #59500 has finished for PR 13280 at commit 85e03f9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

|$actual
|$expectedSchema
|Actual clipped schema:
|$actual
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Could you please revert this indentation change? (IntelliJ's bad I guess...)

@rdblue rdblue force-pushed the SPARK-9876-update-parquet branch from 85e03f9 to 40241dc Compare May 27, 2016 17:59
@rdblue
Copy link
Contributor Author

rdblue commented May 27, 2016

@liancheng, fixed. Yeah, IntelliJ has a few annoyances like that with scala. Imports are a mess.

val emptySchema: MessageType = Types.buildMessage()
.required(PrimitiveType.PrimitiveTypeName.INT32).named("dummy")
.named("root")
emptySchema.getFields.clear(); // remove the dummy column
Copy link
Contributor

@liancheng liancheng May 27, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe move emptySchema into object CatalystSchemaConverter as a method? Now the same "HACK ALERT" appears 3 times in total:

    def emptyMessageType(name: String): MessageType = {
      // (Add the HACK ALERT here.)
      val messageType = Types.buildMessage()
        .required(PrimitiveType.PrimitiveTypeName.INT32).named("dummy")
        .named(name)
      messageType.getFields.clear()
      messageType
    }

Also, let's make sure the hack alert message mentions the specific parquet-mr version (1.8.2-SNAPSHOT) that fixes PARQUET-363. So that people clearly know when we can remove this hack.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a double check, so the following two are equivalent, right? I wasn't aware of the first approach (which is a little bit confusing since there are two consecutive named calls).

    // (1)
    Types
      .buildMessage()
      .required(PrimitiveType.PrimitiveTypeName.INT32).named("dummy")
      .named("root")

    // (2)
    Types
      .buildMessage()
      .addField(
        Types.required(PrimitiveType.PrimitiveTypeName.INT32).named("dummy")
      )
      .named("root")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, those are equivalent. The builders nest so that it reads like the schema definition and the parent builder is returned by named.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the duplicates by adding CatalystSchemaConverter.EMPTY_MESSAGE.

@rdblue rdblue force-pushed the SPARK-9876-update-parquet branch 2 times, most recently from 31125de to 37b4978 Compare May 27, 2016 18:21
@rdblue
Copy link
Contributor Author

rdblue commented May 28, 2016

Thanks @liancheng! It will be great to have predicate push-down for strings in 2.0!

@yhuai
Copy link
Contributor

yhuai commented Jun 1, 2016

Hello @rdblue, we are pretty late in this release cycle. I am afraid that we cannot actually upgrade Parquet to 1.8.1 because of the following two reasons:

  1. Since this change was merged pretty late, I am not sure this change can be thoroughly tested. All of our previous testing was based on the Parquet 1.7.0. Also, some of our internal jobs started to fail after this upgrade.
  2. Parquet 1.8.1 potentially introduces performance regression. I am not sure we can upgrade Parquet before finish the investigation of this regression.

So, I'd like to propose to revert this upgrade. We can try to upgrade Parquet in the early development cycle of 2.1 (assuming we have figured out the regression). So, we can have more time to test this change.

@rxin
Copy link
Contributor

rxin commented Jun 1, 2016

+1 on reverting for the reasons Yin mentioned. It's very risky to do dep updates at this point for 2.0, and I was surprised this got merged without at least verifying the prior performance regression we found went away.

@markhamstra
Copy link
Contributor

@rxin Huh? The merge was to master, not branch-2.0. Doesn't that put it on the 2.1 track and not into 2.0.0? I think that is all that Yin was saying, that @rdblue was mistaken about this change going into 2.0.

@markhamstra
Copy link
Contributor

Oh, wait... sorry, I just realized that @liancheng said he also merged to branch-2.0. +1 on reverting that.

@liancheng
Copy link
Contributor

I'm working on verifying the regression using the micro benchmark I did before. Sorry for the troubles.

@rdblue
Copy link
Contributor Author

rdblue commented Jun 1, 2016

As I said on PR #13445: It sounds reasonable, but we should follow up on this. If we revert the change I suggest that we only revert it in 2.0 or add it to master as soon as 2.0 is branched. That way we don't adversely affect 2.0, but we do get this addressed. Is that a reasonable path forward?

@rdblue
Copy link
Contributor Author

rdblue commented Jun 1, 2016

@yhuai, what started failing?

asfgit pushed a commit that referenced this pull request Jun 1, 2016
… 1.8.1."

## What changes were proposed in this pull request?
Since we are pretty late in the 2.0 release cycle, it is not clear if this upgrade can be tested thoroughly and if we can resolve the regression issue that we observed before. This PR reverts #13280 from branch 2.0.

## How was this patch tested?
Existing tests

This reverts commit 776d183.

Author: Yin Huai <yhuai@databricks.com>

Closes #13450 from yhuai/revertParquet1.8.1-branch-2.0.
asfgit pushed a commit that referenced this pull request Jun 8, 2016
## What changes were proposed in this pull request?

revived #13464

Fix Java Lint errors introduced by #13286 and #13280
Before:
```
Using `mvn` from path: /Users/pichu/Project/spark/build/apache-maven-3.3.9/bin/mvn
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[340,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[341,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[342,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[343,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[41,28] (naming) MethodName: Method name 'Append' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[52,28] (naming) MethodName: Method name 'Complete' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[61,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.PrimitiveType.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[62,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.Type.
```

## How was this patch tested?
ran `dev/lint-java` locally

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13559 from techaddict/minor-3.
asfgit pushed a commit that referenced this pull request Jun 8, 2016
## What changes were proposed in this pull request?

revived #13464

Fix Java Lint errors introduced by #13286 and #13280
Before:
```
Using `mvn` from path: /Users/pichu/Project/spark/build/apache-maven-3.3.9/bin/mvn
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[340,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[341,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[342,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[343,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[41,28] (naming) MethodName: Method name 'Append' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[52,28] (naming) MethodName: Method name 'Complete' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[61,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.PrimitiveType.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[62,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.Type.
```

## How was this patch tested?
ran `dev/lint-java` locally

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13559 from techaddict/minor-3.

(cherry picked from commit f958c1c)
Signed-off-by: Sean Owen <sowen@cloudera.com>
zjffdu pushed a commit to zjffdu/spark that referenced this pull request Jun 10, 2016


## What changes were proposed in this pull request?

revived apache#13464

Fix Java Lint errors introduced by apache#13286 and apache#13280
Before:
```
Using `mvn` from path: /Users/pichu/Project/spark/build/apache-maven-3.3.9/bin/mvn
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[340,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[341,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[342,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[343,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[41,28] (naming) MethodName: Method name 'Append' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[52,28] (naming) MethodName: Method name 'Complete' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[61,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.PrimitiveType.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[62,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.Type.
```

## How was this patch tested?
ran `dev/lint-java` locally

Author: Sandeep Singh <sandeep@techaddict.me>

Closes apache#13559 from techaddict/minor-3.
// To workaround this problem, here we first construct a `MessageType` with a single dummy
// field, and then remove the field to obtain an empty `MessageType`.
//
// TODO Reverts this change after upgrading parquet-mr to 1.8.2+
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue When will Parquet 1.8.2 be released?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're close to a 1.9.0 release, just working through some performance issues with the switch to ByteBuffer APIs. No estimate on that yet. We can do a 1.8.2 if there's interest so that we can fix some things like this without pulling in all those changes.

asfgit pushed a commit that referenced this pull request Jul 5, 2016
…t predicate pushdown and replace deprecated fromByteArray.

## What changes were proposed in this pull request?

It seems Parquet has been upgraded to 1.8.1 by #13280. So,  this PR enables string and binary predicate push down which was disabled due to [SPARK-11153](https://issues.apache.org/jira/browse/SPARK-11153) and [PARQUET-251](https://issues.apache.org/jira/browse/PARQUET-251) and cleans up some comments unremoved (I think by mistake).

This PR also replace the API, `fromByteArray()` deprecated in [PARQUET-251](https://issues.apache.org/jira/browse/PARQUET-251).

## How was this patch tested?

Unit tests in `ParquetFilters`

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #13389 from HyukjinKwon/parquet-1.8-followup.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants