
[SPARK-22217] [SQL] ParquetFileFormat to support arbitrary OutputCommitters #19448

Closed

Conversation

@steveloughran (Contributor) commented Oct 6, 2017

What changes were proposed in this pull request?

`ParquetFileFormat` to relax its requirement of output committer class from `org.apache.parquet.hadoop.ParquetOutputCommitter` or subclass thereof (and so implicitly Hadoop `FileOutputCommitter`) to any committer implementing `org.apache.hadoop.mapreduce.OutputCommitter`.

This enables output committers which don't write to the filesystem the way `FileOutputCommitter` does to save Parquet data from a DataFrame; at present you cannot do this.

Before using a committer which isn't a subclass of `ParquetOutputCommitter`, it checks whether the context has requested summary metadata by setting `parquet.enable.summary-metadata`. If that is true and the committer class isn't a Parquet committer, it raises a RuntimeException with an error message.

(It could downgrade, of course, but raising an exception makes it clear there won't be a summary. It also makes the behaviour testable.)

Note that `SQLConf` already states that any `OutputCommitter` can be used, but that typically it's a subclass of `ParquetOutputCommitter`. That's not currently true. This patch makes the code consistent with the docs, adding tests to verify it.
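As a rough illustration of what this unlocks, here is a minimal sketch of selecting an alternative committer through the SQLConf key discussed in this PR; the committer class named below is hypothetical:

    // A minimal sketch, assuming this patch is applied.
    // "spark.sql.parquet.output.committer.class" is the (internal) SQLConf key
    // referenced in this PR; com.example.CloudOutputCommitter is hypothetical.
    spark.conf.set(
      "spark.sql.parquet.output.committer.class",
      "com.example.CloudOutputCommitter") // any org.apache.hadoop.mapreduce.OutputCommitter
    df.write.parquet("s3a://bucket/output")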

How was this patch tested?

The patch includes a test suite, `ParquetCommitterSuite`, with a new committer, `MarkingFileOutputCommitter`, which extends `FileOutputCommitter` and writes a marker file in the destination directory. The presence of the marker file can be used to verify the new committer was used. The tests then try the combinations of Parquet committer summary/no-summary and marking committer summary/no-summary.

| committer | summary | outcome |
|-----------|---------|---------|
| parquet   | true    | success |
| parquet   | false   | success |
| marking   | false   | success with marker |
| marking   | true    | exception |

All tests are happy.
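For readers skimming the thread, a hedged sketch of what such a marking committer could look like; this is illustrative only, not the exact test code from the patch:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.{JobContext, TaskAttemptContext}
    import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter

    // Illustrative sketch: commit normally, then drop a marker file so a
    // test can assert that this committer actually ran.
    class MarkingFileOutputCommitter(outputPath: Path, context: TaskAttemptContext)
      extends FileOutputCommitter(outputPath, context) {

      override def commitJob(jobContext: JobContext): Unit = {
        super.commitJob(jobContext)
        val fs = outputPath.getFileSystem(jobContext.getConfiguration)
        fs.create(new Path(outputPath, "marker")).close()
      }
    }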

…ass, provided saveSummaries is disabled. With Tests

Change-Id: I19872dc1c095068ed5a61985d53cb7258bd9a9bb

SparkQA commented Oct 6, 2017

Test build #82517 has finished for PR 19448 at commit e6fdbdc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


rdblue commented Oct 7, 2017

+1

I completely agree that using a ParquetOutputCommitter should be optional.

@vanzin (Contributor) left a comment


One minor suggestion, otherwise LGTM.

    if (conf.getBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false)
        && !classOf[ParquetOutputCommitter].isAssignableFrom(committerClass)) {
      // output summary is requested, but the class is not a Parquet Committer
      throw new RuntimeException(s"Committer $committerClass is not a ParquetOutputCommitter" +
Contributor:

IllegalArgumentException or some other better exception?

Member:

How about require maybe?

Contributor Author:

will do

        && !classOf[ParquetOutputCommitter].isAssignableFrom(committerClass)) {
      // output summary is requested, but the class is not a Parquet Committer
      throw new RuntimeException(s"Committer $committerClass is not a ParquetOutputCommitter" +
        s" and cannot create job summaries.")
Member:

Looks like we can remove this s BTW.

Contributor Author:

Depends on the policy about what to do if it's not a Parquet committer and the option for job summaries is set. It could just mean "you don't get summaries", which works for me :). May want to log at info though?

Member:

Oh, I mean the s in s" .. " (s for string interpolation).

Contributor Author:

aah

Contributor Author:

Aah. In the move to require(), everything is going back onto a single line, so now moot.

@HyukjinKwon (Member) left a comment


LGTM too, just a few tiny nits while double checking.


    override def afterAll(): Unit = {
      spark.stop()
      spark = null
Member:

maybe super.afterAll()?

Contributor Author:

Done; I'll also add a check for spark == null so that if a failure happens during setup, the exception doesn't get lost in teardown.

    }

test("alternative output committer, no merge schema") {
writeDataFrame(MarkingFileOutput.COMMITTER, false, true)
Member:

I think it might be a little bit better to use named arguments for readability: writeDataFrame(MarkingFileOutput.COMMITTER, summary = false, check = true)

Contributor Author:

OK

…test suite tuning

Change-Id: Ib7e99860fab66cb2bc47e2e4f90f4fc8041c7f03
@@ -138,6 +138,10 @@ class ParquetFileFormat
      conf.setBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false)
    }

    require(!conf.getBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false)
Member:

We need to issue an AnalysisException here.

Contributor:

AnalysisException? Shouldn't this be SparkException? By the time this runs, Spark has already analyzed, optimized, and planned the job. Doesn't seem like failing analysis is appropriate.

Member:

SparkException is better. Normally, we want to issue a Spark-specific exception type.

Contributor:

SparkException makes it sound like it's a problem that Spark caused in some way, while this is actually caused by incorrect user input, in which case the suggested IllegalArgumentException (which require throws) is better imo.
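For reference, Scala's built-in require does raise that exception when the condition fails:

    // Scala's Predef.require throws IllegalArgumentException on failure:
    require(1 == 2, "boom")
    // => java.lang.IllegalArgumentException: requirement failed: boom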

Member:

In Spark SQL, we do issue AnalysisException in many similar cases. I am also fine with using SparkException.

In this specific case, the users are able to control the conf to make it work. Thus, we also need to improve the message to let users know how to resolve it by changing the conf.

Contributor:

I think I'd prefer the warn & continue option. It does little good to fail so late in a job, when the caller has already indicated that they want to use a different committer. Let them write the data out since this isn't a correctness issue, and they can add a summary file later if they want. Basically, there's less annoyance and interruption by not writing a summary file than by failing a job and forcing the user to re-run near the end.

Member:

+1 for warn and continue.

Member:

If we issue a warning log, we will see such a warning message for each write operation. Doesn't that look annoying?

Contributor Author:

Yes, there is that. Options: do something complicated with a static field to only print it once, or log at debug so people only see the message if they are trying to track things down.
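A hedged sketch of the static-field "warn once" option floated here (not what the patch ultimately did, which was to warn on every write); the object and helper names are illustrative:

    import java.util.concurrent.atomic.AtomicBoolean

    // Illustrative only: emit the committer warning at most once per JVM.
    object CommitterWarning {
      private val warned = new AtomicBoolean(false)

      def warnOnce(logWarning: String => Unit, committerClass: Class[_]): Unit = {
        if (warned.compareAndSet(false, true)) {
          logWarning(s"Committer $committerClass is not a ParquetOutputCommitter " +
            "and cannot create job summaries.")
        }
      }
    }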

Contributor:

I think once per write operation is fine. It's not like it is once per file.

      spark.stop()
      spark = null
    }
    super.afterAll()
Member:

    try {
      ...
    } finally {
      super.afterAll()
    }

Contributor Author:

good point
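Putting the two suggestions together, the teardown would look roughly like this (a sketch, not necessarily the final committed code):

    override def afterAll(): Unit = {
      try {
        if (spark != null) {
          spark.stop()
          spark = null
        }
      } finally {
        super.afterAll()
      }
    }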

SparkQA commented Oct 11, 2017

Test build #82643 has finished for PR 19448 at commit d634f9e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987 (Contributor) left a comment


LGTM

import org.apache.spark.sql.test.SQLTestUtils

    /**
     * Test logic related to choice of output commtters
Contributor:

nit: commtters -> committers

@jiangxb1987 (Contributor):

retest this please

Change-Id: I92420bff4afe180eda106337df253b0445e56979
SparkQA commented Oct 12, 2017

Test build #82680 has finished for PR 19448 at commit d634f9e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    require(!conf.getBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false)
      || classOf[ParquetOutputCommitter].isAssignableFrom(committerClass),
      s"Committer $committerClass is not a ParquetOutputCommitter and cannot create job summaries."
        + " Set Parquet option " + ParquetOutputFormat.ENABLE_JOB_SUMMARY + " to false.")
Member:

nit:

  ...
  s"Committer $committerClass is not a ParquetOutputCommitter and cannot create job summaries. " +
  s"Set Parquet option '${ParquetOutputFormat.ENABLE_JOB_SUMMARY}' to false.")

Contributor Author:

I'd thought about that; it didn't look any better or worse. Will change it for the log message.

@HyukjinKwon (Member):

Still LGTM except for a few nits.

… tells user to unset the parquet property...is that needed now?

Change-Id: I1c34b341fb4e0e3297becec4fc3dd3e63c005b7c
@gatorsmile (Member):

LGTM pending Jenkins

SparkQA commented Oct 12, 2017

Test build #82692 has finished for PR 19448 at commit c93eb1b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

…text

Change-Id: Ibcc8ada3c57091dd6a03e3efbcbc4791c556a287

rdblue commented Oct 12, 2017

Still +1 from me as well.

SparkQA commented Oct 12, 2017

Test build #82700 has finished for PR 19448 at commit 42afccb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 12, 2017

Test build #82702 has finished for PR 19448 at commit f486263.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


HyukjinKwon commented Oct 12, 2017

Merged to master and branch-2.2.

@asfgit asfgit closed this in 9104add Oct 12, 2017
@dongjoon-hyun (Member):

Hi, all.
Can we have this in Apache Spark 2.2.1?

asfgit pushed a commit that referenced this pull request Oct 13, 2017
[SPARK-22217][SQL] ParquetFileFormat to support arbitrary OutputCommitters

Author: Steve Loughran <stevel@hortonworks.com>

Closes #19448 from steveloughran/cloud/SPARK-22217-committer.
@HyukjinKwon (Member):

I didn't backport this one in line with the JIRA issue type (Improvement), but yeah, it sounds more like a bug fix.

        && !classOf[ParquetOutputCommitter].isAssignableFrom(committerClass)) {
      // output summary is requested, but the class is not a Parquet Committer
      logWarning(s"Committer $committerClass is not a ParquetOutputCommitter and cannot" +
        s" create job summaries. " +
Member:

D'oh, s ...

@gatorsmile (Member):

This is not eligible for backporting. We should not do it next time.

@HyukjinKwon (Member):

I think this is a bug fix, as the previous behaviour does not work as documented:

> subclass of org.apache.hadoop.mapreduce.OutputCommitter...

and it does not change existing behaviour.

Could you elaborate why you think it is not eligible?

@gatorsmile (Member):

That conf is an internal one. The end users will not see it. This is not a bug fix.

We should not extend the existing functions or introduce new behaviors/features in 2.2.x releases.

@gatorsmile (Member):

Since the risk is low, I did not revert it.

@HyukjinKwon (Member):

How come fixing the behaviour as documented is not a bug fix? That basically means we don't backport fixes for things not working as documented for other internal configurations.

This does not extend the functionality. This fixes the functionality to work as documented and expected, and I call that a bug fix.

@gatorsmile (Member):

This behaviour goes back at least to Spark 1.5. If you are not confident whether this is a bug or not, please check before merging.

@HyukjinKwon (Member):

I did this because I was confident it is a bug: the doc says it should work but it actually doesn't, and the fix doesn't break the previous support.

@gatorsmile (Member):

OK. Next time, please check with the committers who are familiar with Spark SQL.

@HyukjinKwon (Member):

Will check next time if I am not confident.

@steveloughran (Contributor Author):

Thanks for reviewing this and getting it in. Personally, I had it in the "improvement" category rather than bug fix. If it weren't for that line in the docs, there'd be no ambiguity about improvement vs. fix, and there is always a lower-risk way to fix a doc/code mismatch: change the docs.

But I'm grateful for it being in; with the backport to branch-2.2, Ryan should be able to use it semi-immediately.

@steveloughran (Contributor Author):

PS: for people who are interested in dynamic committers, MAPREDUCE-6823 is something to look at. It allows you to switch committers under pretty much everything other than Parquet; this patch helps make Parquet manageable too.

@HyukjinKwon (Member):

I guess we wouldn't change the docs in branch-2.2 alone, as we have a safe fix for this mismatch anyway. I just wanted to say this backport can be justified.

@gatorsmile (Member):

@steveloughran Thanks for your input. I totally agree with your points.

Spark is infrastructure software. We have to be very careful when backporting PRs.


yhuai commented Oct 13, 2017

@HyukjinKwon branch-2.2 is a maintenance branch; I am not sure it is appropriate to merge this change there, since it is not really a bug fix. If the doc is not accurate, we should fix the doc. For a maintenance branch, we need to be very careful about what we merge, and we should always avoid unnecessary changes.


HyukjinKwon commented Oct 13, 2017

Okay. I am sorry for this trouble. Should we revert this if you guys feel strongly about it? I am okay with reverting it.


rdblue commented Oct 13, 2017

I have a lot of sympathy for the argument that infrastructure software shouldn't have too many backports and that those should generally be bug fixes. But if I were working on a Spark distribution at a vendor, this is something I would definitely include because it's such a useful feature. I think that by not backporting this, we're just pushing that work downstream. Plus, the risk of adding this is low: the main behavior change is that users can specify a previously banned committer for Parquet writes. Is it a bug fix? Probably not. But it fixes a big blocker.


yhuai commented Oct 13, 2017

I am not really worried about this particular change. It's already merged, and it seems like a small and safe change. I am not planning to revert it.

But, in general, let's avoid merging changes that are not bug fixes to a maintenance branch. If there is an exception, it's better to make that clear earlier.

@HyukjinKwon (Member):

Sure, I will, and I'll note it ahead of time next time. I made a mistake while trying to think of reasons for this backport.


yhuai commented Oct 13, 2017

Thank you :)

@steveloughran (Contributor Author):

> But, if I were working on a Spark distribution at a vendor, this is something I would definitely include because it's such a useful feature.

I concur :)

MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018

[SPARK-22217][SQL] ParquetFileFormat to support arbitrary OutputCommitters

Author: Steve Loughran <stevel@hortonworks.com>

Closes apache#19448 from steveloughran/cloud/SPARK-22217-committer.