[SPARK-22217] [SQL] ParquetFileFormat to support arbitrary OutputCommitters #19448

Closed
wants to merge 5 commits into
base: master
from

Contributor

steveloughran commented Oct 6, 2017

What changes were proposed in this pull request?

ParquetFileFormat to relax its requirement of output committer class from org.apache.parquet.hadoop.ParquetOutputCommitter or subclass thereof (and so implicitly Hadoop FileOutputCommitter) to any committer implementing org.apache.hadoop.mapreduce.OutputCommitter

This enables output committers which don't write to the filesystem the way FileOutputCommitter does to save parquet data from a dataframe: at present you cannot do this.

Before using a committer which isn't a subclass of ParquetOutputCommitter, it checks whether the context has requested summary metadata by setting parquet.enable.summary-metadata. If it has, and the committer class isn't a Parquet committer, it raises a RuntimeException with an error message.

(It could downgrade, of course, but raising an exception makes it clear there won't be a summary. It also makes the behaviour testable.)

Note that SQLConf already states that any OutputCommitter can be used, but that typically it's a subclass of ParquetOutputCommitter. That's not currently true. This patch will make the code consistent with the docs, adding tests to verify.
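
For readers who want to see the shape of the check described above, here is a minimal sketch. It is not the patch itself: the committer classes are hypothetical stand-ins declared locally so the snippet runs without Hadoop or Parquet on the classpath, and `CommitterCheck.check` is a name invented for illustration. It returns a message rather than throwing, which leaves the caller free to raise an exception or merely log.

```scala
// Hypothetical stand-ins for the Hadoop/Parquet committer hierarchy.
// They only mirror the inheritance relationships described in the PR text.
abstract class OutputCommitter
class FileOutputCommitter extends OutputCommitter
class ParquetOutputCommitter extends FileOutputCommitter
class CustomCommitter extends OutputCommitter // does not extend FileOutputCommitter

object CommitterCheck {
  /**
   * Returns None when the committer/summary combination is acceptable,
   * or Some(message) when summaries were requested but the committer
   * is not a ParquetOutputCommitter (or a subclass of it).
   */
  def check(
      committerClass: Class[_ <: OutputCommitter],
      summaryRequested: Boolean): Option[String] = {
    val isParquetCommitter =
      classOf[ParquetOutputCommitter].isAssignableFrom(committerClass)
    if (summaryRequested && !isParquetCommitter) {
      Some(s"Committer ${committerClass.getName} is not a ParquetOutputCommitter " +
        "and cannot create job summaries.")
    } else {
      None
    }
  }
}
```

With these stand-ins, `CommitterCheck.check(classOf[CustomCommitter], summaryRequested = true)` yields a message, while any call with summaries disabled, or with `ParquetOutputCommitter` itself, yields `None`.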

How was this patch tested?

The patch includes a test suite, ParquetCommitterSuite, with a new committer, MarkingFileOutputCommitter which extends FileOutputCommitter and writes a marker file in the destination directory. The presence of the marker file can be used to verify the new committer was used. The tests then try the combinations of Parquet committer summary/no-summary and marking committer summary/no-summary.

| committer | summary | outcome |
|-----------|---------|---------|
| parquet   | true    | success |
| parquet   | false   | success |
| marking   | false   | success with marker |
| marking   | true    | exception |

All tests are happy.
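
The marker-file idea can be sketched in isolation. This is an illustration under stated assumptions, not the code in `ParquetCommitterSuite`: `BaseCommitter` is a hypothetical stand-in for Hadoop's `FileOutputCommitter`, the marker file name is invented, and the destination is a local temp directory accessed through `java.nio.file` instead of a Hadoop filesystem.

```scala
import java.nio.file.{Files, Path}

// Hypothetical stand-in for a committer base class; its commit is a no-op here.
class BaseCommitter {
  def commitJob(dest: Path): Unit = ()
}

// Writes an empty marker file into the destination directory on commit,
// so a test can tell that this committer, and not the default, was used.
class MarkingCommitter extends BaseCommitter {
  override def commitJob(dest: Path): Unit = {
    super.commitJob(dest)
    Files.createFile(dest.resolve(MarkerDemo.MarkerName))
  }
}

object MarkerDemo {
  // Assumed name, chosen only for this sketch.
  val MarkerName = "_MARKER"

  /** Commits into a fresh temp directory and reports whether the marker exists. */
  def markerWritten(committer: BaseCommitter): Boolean = {
    val dest = Files.createTempDirectory("committer-demo")
    committer.commitJob(dest)
    Files.exists(dest.resolve(MarkerName))
  }
}
```

Checking `MarkerDemo.markerWritten(new MarkingCommitter)` against `MarkerDemo.markerWritten(new BaseCommitter)` corresponds to the "success with marker" row versus the plain "success" rows in the table.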

SPARK-22217 tuning ParquetOutputCommitter to support any committer class, provided saveSummaries is disabled. With Tests

Change-Id: I19872dc1c095068ed5a61985d53cb7258bd9a9bb

SparkQA commented Oct 6, 2017

Test build #82517 has finished for PR 19448 at commit e6fdbdc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

rdblue commented Oct 7, 2017

+1

I completely agree that using a ParquetOutputCommitter should be optional.

@vanzin

One minor suggestion; otherwise LGTM.

@HyukjinKwon

HyukjinKwon approved these changes Oct 11, 2017

LGTM too, just a few tiny nits while double-checking.

SPARK-22217 review suggestions: use require() for exceptions; slight test suite tuning

Change-Id: Ib7e99860fab66cb2bc47e2e4f90f4fc8041c7f03

SparkQA commented Oct 11, 2017

Test build #82643 has finished for PR 19448 at commit d634f9e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

jiangxb1987 commented Oct 12, 2017

retest this please

SPARK-22217 updated error text and check for it in the test suite
Change-Id: I92420bff4afe180eda106337df253b0445e56979

SparkQA commented Oct 12, 2017

Test build #82680 has finished for PR 19448 at commit d634f9e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

HyukjinKwon commented Oct 12, 2017

Still LGTM except for a few nits.

SPARK-22217 log @ warn, with test modified to expect it. warning text tells user to unset the parquet property...is that needed now?

Change-Id: I1c34b341fb4e0e3297becec4fc3dd3e63c005b7c

Member

gatorsmile commented Oct 12, 2017

LGTM pending Jenkins

SparkQA commented Oct 12, 2017

Test build #82692 has finished for PR 19448 at commit c93eb1b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
SPARK-22217 include the fact that you don't get summaries in SQLConf text

Change-Id: Ibcc8ada3c57091dd6a03e3efbcbc4791c556a287

Contributor

rdblue commented Oct 12, 2017

Still +1 from me as well.

SparkQA commented Oct 12, 2017

Test build #82700 has finished for PR 19448 at commit 42afccb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 12, 2017

Test build #82702 has finished for PR 19448 at commit f486263.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

HyukjinKwon commented Oct 12, 2017

Merged to master and branch-2.2.

@asfgit asfgit closed this in 9104add Oct 12, 2017

Member

dongjoon-hyun commented Oct 13, 2017

Hi, All.
Can we have this in Apache Spark 2.2.1?

asfgit pushed a commit that referenced this pull request Oct 13, 2017

[SPARK-22217][SQL] ParquetFileFormat to support arbitrary OutputCommitters

## What changes were proposed in this pull request?

`ParquetFileFormat` to relax its requirement of output committer class from `org.apache.parquet.hadoop.ParquetOutputCommitter` or subclass thereof (and so implicitly Hadoop `FileOutputCommitter`) to any committer implementing `org.apache.hadoop.mapreduce.OutputCommitter`

This enables output committers which don't write to the filesystem the way `FileOutputCommitter` does to save parquet data from a dataframe: at present you cannot do this.

Before using a committer which isn't a subclass of `ParquetOutputCommitter`, it checks whether the context has requested summary metadata by setting `parquet.enable.summary-metadata`. If it has, and the committer class isn't a Parquet committer, it raises a RuntimeException with an error message.

(It could downgrade, of course, but raising an exception makes it clear there won't be a summary. It also makes the behaviour testable.)

Note that `SQLConf` already states that any `OutputCommitter` can be used, but that typically it's a subclass of ParquetOutputCommitter. That's not currently true. This patch will make the code consistent with the docs, adding tests to verify.

## How was this patch tested?

The patch includes a test suite, `ParquetCommitterSuite`, with a new committer, `MarkingFileOutputCommitter` which extends `FileOutputCommitter` and writes a marker file in the destination directory. The presence of the marker file can be used to verify the new committer was used. The tests then try the combinations of Parquet committer summary/no-summary and marking committer summary/no-summary.

| committer | summary | outcome |
|-----------|---------|---------|
| parquet   | true    | success |
| parquet   | false   | success |
| marking   | false   | success with marker |
| marking   | true    | exception |

All tests are happy.

Author: Steve Loughran <stevel@hortonworks.com>

Closes #19448 from steveloughran/cloud/SPARK-22217-committer.

Member

HyukjinKwon commented Oct 13, 2017

I didn't backport this one, respecting the JIRA issue type (Improvement), but yeah, it sounds more like a bug fix.

&& !classOf[ParquetOutputCommitter].isAssignableFrom(committerClass)) {
// output summary is requested, but the class is not a Parquet Committer
logWarning(s"Committer $committerClass is not a ParquetOutputCommitter and cannot" +
s" create job summaries. " +

@HyukjinKwon

HyukjinKwon Oct 13, 2017

Member

D'oh, s ...

Member

gatorsmile commented Oct 13, 2017

This is not eligible for backporting. We should not do it next time.

Member

HyukjinKwon commented Oct 13, 2017

I think this is a bug to fix as the previous behaviour does not work as documented:

subclass of org.apache.hadoop.mapreduce.OutputCommitter...

and does not change existing behaviour.

Could you elaborate why you think it is not eligible?

Member

gatorsmile commented Oct 13, 2017

That conf is an internal one. The end users will not see it. This is not a bug fix.

We should not extend the existing functions or introduce new behaviors/features in 2.2.x releases.

Member

gatorsmile commented Oct 13, 2017

Since the risk is low, I did not revert it.

Member

HyukjinKwon commented Oct 13, 2017

How come fixing the behaviour as documented is not a bug fix? I think that basically means we don't backport fixes for things not working as documented for other internal configurations.

This does not extend the functionalities. This fixes functionalities to work as documented and expected, and I call it a bugfix.

Member

gatorsmile commented Oct 13, 2017

This behaviour dates back to at least Spark 1.5. If you are not confident whether this is a bug or not, please check before merging it.

Member

HyukjinKwon commented Oct 13, 2017

I did this because I was confident it is a bug: the doc says it should work but it actually does not, and the fix does not break the previous support.

Member

gatorsmile commented Oct 13, 2017

Ok. Next time, please check it with the committers who are familiar with Spark SQL.

Member

HyukjinKwon commented Oct 13, 2017

Will check it if I am not confident next time.

Contributor

steveloughran commented Oct 13, 2017

Thanks for reviewing this and getting it in. Personally, I had it in the "improvement" category rather than bug fix. If it wasn't for that line in the docs, there'd be no ambiguity about improvement vs. fix, and there is always a lower-risk way to fix a doc/code mismatch: change the docs.

But I'm grateful for it being in; with the backport to branch-2, Ryan should be able to use it semi-immediately.

Contributor

steveloughran commented Oct 13, 2017

PS, for people who are interested in dynamic committers, MAPREDUCE-6823 is something to look at. It allows you to switch committers under pretty much everything other than parquet...this patch helps make Parquet manageable too

Member

HyukjinKwon commented Oct 13, 2017

I guess we wouldn't change the docs in branch-2.2 alone as we have a safe fix here for this mismatch anyway. I think I just wanted to say this backport can be justified.

Member

gatorsmile commented Oct 13, 2017

@steveloughran Thanks for your input. I totally agree with your opinions.

Spark is infrastructure software. We have to be very careful when backporting PRs.

Contributor

yhuai commented Oct 13, 2017

@HyukjinKwon branch-2.2 is a maintenance branch; I am not sure it is appropriate to merge this change to branch-2.2 since it is not really a bug fix. If the doc is not accurate, we should fix the doc. For a maintenance branch, we need to be very careful about what we merge and we should always avoid unnecessary changes.

Member

HyukjinKwon commented Oct 13, 2017

Okay. I am sorry for this trouble. Should we revert this if you guys feel strongly about it? I am okay with reverting it.

Contributor

rdblue commented Oct 13, 2017

I have a lot of sympathy for the argument that infrastructure software shouldn't have too many backports and that those should be generally bug fixes. But, if I were working on a Spark distribution at a vendor, this is something I would definitely include because it's such a useful feature. I think that by not backporting this, we're just pushing that work downstream. Plus, the risk to adding this is low: the main behavior change is that users can specify a previously-banned committer for Parquet writes. Is it a bug fix? Probably not. But it fixes a big blocker.

Contributor

yhuai commented Oct 13, 2017

I am not really worried about this particular change. It's already merged and it seems a small and safe change. I am not planning to revert it.

But, in general, let's avoid merging changes that are not bug fixes to a maintenance branch. If there is an exception, it will be better to make it clear earlier.

Member

HyukjinKwon commented Oct 13, 2017

Sure, I will, and I'll note it ahead of time next time. I made a mistake while trying to think of reasons for this backport.

Contributor

yhuai commented Oct 13, 2017

Thank you :)

Contributor

steveloughran commented Oct 13, 2017

But, if I were working on a Spark distribution at a vendor, this is something I would definitely include because it's such a useful feature.

I concur :)

ptkool added a commit to ptkool/spark that referenced this pull request Nov 13, 2017

[SPARK-22217][SQL] ParquetFileFormat to support arbitrary OutputCommitters

Closes apache#19448 from steveloughran/cloud/SPARK-22217-committer.

MatthewRBruce added a commit to Shopify/spark that referenced this pull request Jul 31, 2018

[SPARK-22217][SQL] ParquetFileFormat to support arbitrary OutputCommitters

Closes apache#19448 from steveloughran/cloud/SPARK-22217-committer.