[SPARK-18012][SQL] Simplify WriterContainer #15551
Conversation
cc @cloud-fan, @liancheng, @hvanhovell What do you think about the high-level idea? I got rid of WriterContainer and the associated OOP, since there isn't really any polymorphism in the old code. There was an "if" branch that decided which object to instantiate. I basically moved that "if" and the object hierarchy into an if branch in executeTask. I'm hoping this provides a clearer high-level control flow.
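The shape of that dispatch can be sketched in a few lines. This is an illustrative Python sketch of the pattern only — the helper names, fields, and grouping logic are hypothetical stand-ins, not the actual Spark code:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class WriteJobDescription:
    # Hypothetical stand-in for the job description each task receives.
    partition_columns: List[str] = field(default_factory=list)
    bucket_spec: Optional[object] = None

def execute_single_directory_write_task(description, rows):
    # Single-directory path: all rows land in one output location.
    return {(): list(rows)}

def execute_dynamic_partition_write_task(description, rows):
    # Dynamic-partition path: rows are grouped by their partition key.
    grouped = {}
    for row in rows:
        key = tuple(row[c] for c in description.partition_columns)
        grouped.setdefault(key, []).append(row)
    return grouped

def execute_task(description, rows):
    # The former WriterContainer class hierarchy collapses into this branch:
    # no polymorphism, just an "if" that picks the write path.
    if not description.partition_columns and description.bucket_spec is None:
        return execute_single_directory_write_task(description, rows)
    else:
        return execute_dynamic_partition_write_task(description, rows)
```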
object WriteOutput extends Logging {

  def write(
      sparkSession: SparkSession,
I'm not too happy that this function has a lot of arguments.
  }
}

class WriteJobDescription(
This includes all the description needed in each task for the write job.
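The "lots of arguments" concern above and this description class are two sides of the same refactor: bundle everything a task needs into one serializable object instead of a long parameter list. A Python sketch of the idea — the field names here are hypothetical stand-ins, not the real class:

```python
from dataclasses import dataclass
from typing import List, Optional

# Before: each task function took a long positional argument list,
# which is easy to misorder and hard to extend.
def write_task_before(uuid, path, all_columns, partition_columns,
                      non_partition_columns, bucket_spec, is_append):
    return path

# After: one immutable description object carries everything a task needs.
@dataclass(frozen=True)
class WriteJobDescription:
    uuid: str
    path: str
    all_columns: List[str]
    partition_columns: List[str]
    non_partition_columns: List[str]
    bucket_spec: Optional[object]
    is_append: bool

def write_task_after(description: WriteJobDescription):
    return description.path
```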
      writer.writeInternal(internalRow)
    }

    try {
Looks like I can remove this try/catch.
        committer,
        iterator)
    } else {
      executeDynamicPartitionWriteTask(
Note to self: it'd be better if we could move the task commit and abort logic into this function.
      description.outputFormatClass, taskAttemptContext, description.path, description.isAppend)
    committer.setupTask(taskAttemptContext)

    if (description.partitionColumns.isEmpty && description.bucketSpec.isEmpty) {
Now we will run this if-else on the executor side, right? Not a big deal, and worth it to make the code simpler.
yup.
Test build #67180 has finished for PR 15551 at commit
Test build #67182 has finished for PR 15551 at commit
Test build #67183 has finished for PR 15551 at commit
Test build #67209 has finished for PR 15551 at commit
Test build #67214 has finished for PR 15551 at commit
Overall LGTM, left some minor comments and questions.
      SparkHadoopMapRedUtil.commitTask(committer, taskAttemptContext, jobId.getId, taskId.getId)
    })(catchBlock = {
      // If there is an error, release resource and then abort the task
      try {
Can we eliminate this `try` and just put the `finally` clause into the `finallyBlock` of `tryWithSafeFinallyAndFailureCallbacks`?
No, because we must not abort the task when there is no error, which is what would happen if we put it in `finally`.
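The distinction being discussed — abort only on failure, release resources always — can be sketched as follows. This is an illustrative Python sketch of the pattern, not the actual Spark helper; the event names are made up for the demonstration:

```python
events = []

def run_write_task(fail: bool):
    # Why abort lives in the catch block, not finally:
    # - commit must run only on success,
    # - abort must run only on failure,
    # - resource release must run either way.
    events.clear()
    try:
        if fail:
            raise RuntimeError("write failed")
        events.append("commit")   # success path only
    except BaseException:
        events.append("abort")    # failure path only
        raise
    finally:
        events.append("release")  # runs either way
```

Putting "abort" into the `finally` block would make it run on the success path too, which is exactly the problem described above.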
    } catch {
      case t: Throwable =>
        throw new SparkException("Task failed while writing rows", t)
    }
Seems that the only purpose of this outermost `try` is to wrap a `SparkException` over any thrown exception. Curious about the contract here: when do we want a `SparkException` rather than some arbitrary exception class?
I'm preserving old behavior here.
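The preserved contract — whatever the task body throws surfaces as one well-known exception type, with the original error attached as the cause — can be sketched like this. Illustrative Python only; these are not the actual Spark classes:

```python
class SparkException(Exception):
    """Illustrative stand-in for a single, well-known failure type."""

def run_task(body):
    # Any failure inside the task body is re-raised as SparkException,
    # so callers can catch one type while the original error survives
    # as the cause for debugging.
    try:
        return body()
    except BaseException as t:
        raise SparkException("Task failed while writing rows") from t
```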
    // Returns the data columns to be written given an input row
    val getOutputRow =
      UnsafeProjection.create(description.nonPartitionColumns, description.allColumns)
Nit: Indentation is off.
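The projection above can be read as a plain function: given the full input row, keep only the data columns. This Python sketch mirrors the idea of `UnsafeProjection.create(nonPartitionColumns, allColumns)` only loosely; the helper name is hypothetical:

```python
def make_output_projection(non_partition_columns, all_columns):
    # Build a function that maps a full input row to just the data
    # columns to be written (partition columns stripped out).
    indices = [all_columns.index(c) for c in non_partition_columns]
    return lambda row: tuple(row[i] for i in indices)
```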
Test build #67223 has finished for PR 15551 at commit
Test build #67226 has finished for PR 15551 at commit
Test build #67227 has finished for PR 15551 at commit
LGTM
cc @tejasapatil FYI on the change
@rxin: Thanks for notifying me.
## What changes were proposed in this pull request?

This patch refactors WriterContainer to simplify the logic and make the control flow more obvious. The previous code setup made it pretty difficult to track the actual dependencies on variables and setups, because the driver side and the executor side were using the same set of variables.

## How was this patch tested?

N/A - this should be covered by existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes apache#15551 from rxin/writercontainer-refactor.