
[SPARK-18012][SQL] Simplify WriterContainer #15551

Closed
wants to merge 10 commits into from

Conversation

rxin (Contributor) commented Oct 19, 2016

What changes were proposed in this pull request?

This patch refactors WriterContainer to simplify the logic and make the control flow more obvious. The previous code made it difficult to track the actual dependencies among variables and setup steps, because the driver side and the executor side shared the same set of variables.

How was this patch tested?

N/A - this should be covered by existing tests.

rxin (Contributor, Author) commented Oct 19, 2016

cc @cloud-fan, @liancheng, @hvanhovell

What do you think about the high-level idea? I got rid of WriterContainer and the associated OOP hierarchy, since there isn't really any polymorphism in the old code: a single "if" branch decided which object to instantiate. I basically moved that branch and the object hierarchy into an if branch in executeTask.

I'm hoping this provides a clearer high-level control flow.
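Concretely, the dispatch being described might look something like the sketch below. This is illustrative only: executeDynamicPartitionWriteTask and the partition/bucket check do appear in the diff, but executeSingleDirectoryWriteTask and the exact signatures here are assumptions.

```scala
// Sketch: instead of an "if" on the driver choosing which WriterContainer
// subclass to instantiate, executeTask branches directly on the executor.
private def executeTask(
    description: WriteJobDescription,
    taskAttemptContext: TaskAttemptContext,
    committer: OutputCommitter,
    iterator: Iterator[InternalRow]): Unit = {
  committer.setupTask(taskAttemptContext)
  if (description.partitionColumns.isEmpty && description.bucketSpec.isEmpty) {
    // Single-directory write: no partitioning or bucketing involved.
    executeSingleDirectoryWriteTask(description, taskAttemptContext, committer, iterator)
  } else {
    // Dynamic-partition (and/or bucketed) write.
    executeDynamicPartitionWriteTask(description, taskAttemptContext, committer, iterator)
  }
}
```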

object WriteOutput extends Logging {

  def write(
      sparkSession: SparkSession,
      ...
  }
}

rxin (Contributor, Author): I'm not too happy that this function has a lot of arguments.

class WriteJobDescription(

rxin (Contributor, Author): This includes all the description needed in each task for the write job.
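Judging from the fields referenced elsewhere in this diff (outputFormatClass, path, isAppend, partitionColumns, bucketSpec, nonPartitionColumns, allColumns), the class presumably looks roughly like this. The exact field list, ordering, and the serializableHadoopConf field are assumptions, not the patch's actual definition:

```scala
// Sketch only: a serializable bag of everything each write task needs,
// assembled once on the driver and shipped to executors.
class WriteJobDescription(
    val serializableHadoopConf: SerializableConfiguration,   // assumed field
    val outputFormatClass: Class[_ <: OutputFormat[_, _]],
    val path: String,
    val isAppend: Boolean,
    val allColumns: Seq[Attribute],
    val nonPartitionColumns: Seq[Attribute],
    val partitionColumns: Seq[Attribute],
    val bucketSpec: Option[BucketSpec])
  extends Serializable
```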

    writer.writeInternal(internalRow)
  }

  try {

rxin (Contributor, Author): It looks like I can remove this try/catch.

      committer,
      iterator)
  } else {
    executeDynamicPartitionWriteTask(

rxin (Contributor, Author): Note to self: it would be better if we could move the task commit and abort logic into this function.

      description.outputFormatClass, taskAttemptContext, description.path, description.isAppend)
  committer.setupTask(taskAttemptContext)

  if (description.partitionColumns.isEmpty && description.bucketSpec.isEmpty) {

Contributor: Now we will run this if-else on the executor side, right? Not a big deal, and it's worth it to make the code simpler.

rxin (Contributor, Author): Yup.

SparkQA commented Oct 19, 2016

Test build #67180 has finished for PR 15551 at commit 140eddb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 19, 2016

Test build #67182 has finished for PR 15551 at commit 006e5dd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 19, 2016

Test build #67183 has finished for PR 15551 at commit dc46cc2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin rxin changed the title [SQL] WriterContainer refactoring - WIP [SPARK-18012][SQL] WriterContainer refactoring - WIP Oct 19, 2016
@rxin rxin changed the title [SPARK-18012][SQL] WriterContainer refactoring - WIP [SPARK-18012][SQL] Simplify WriterContainer - WIP Oct 19, 2016
SparkQA commented Oct 19, 2016

Test build #67209 has finished for PR 15551 at commit af07e73.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 19, 2016

Test build #67214 has finished for PR 15551 at commit 455f6e4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait ExecuteWriteTask

liancheng (Contributor) left a review: Overall LGTM, left some minor comments and questions.

  SparkHadoopMapRedUtil.commitTask(committer, taskAttemptContext, jobId.getId, taskId.getId)
})(catchBlock = {
  // If there is an error, release resources and then abort the task
  try {

Contributor: Can we eliminate this try and just put the finally clause into the finallyBlock of tryWithSafeFinallyAndFailureCallbacks?

rxin (Contributor, Author): No, because we can't abort the task if there is no error (which would happen if we put it in finally).
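The distinction can be sketched with Spark's Utils.tryWithSafeFinallyAndFailureCallbacks: commit belongs in the main block (success path only) and abort in catchBlock (failure path only), so neither can live in a finallyBlock, which runs in both cases. The writer/resource names here are illustrative:

```scala
// Commit must happen only on success; abort only on failure. Putting abort
// into a finally block would (incorrectly) abort successfully committed tasks.
Utils.tryWithSafeFinallyAndFailureCallbacks(block = {
  // Write all rows, then commit the task output on success.
  while (iterator.hasNext) {
    writer.writeInternal(iterator.next())
  }
  SparkHadoopMapRedUtil.commitTask(committer, taskAttemptContext, jobId.getId, taskId.getId)
})(catchBlock = {
  // Runs only if the block above threw: release resources, then abort.
  try {
    writer.close()
  } finally {
    committer.abortTask(taskAttemptContext)
  }
})
```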

} catch {
  case t: Throwable =>
    throw new SparkException("Task failed while writing rows", t)
}

Contributor: It seems the only purpose of this outermost try is to wrap a SparkException around any thrown exception. Curious about the contract here: when do we want a SparkException rather than some arbitrary exception class?

rxin (Contributor, Author): I'm preserving the old behavior here.


// Returns the data columns to be written given an input row
val getOutputRow =
  UnsafeProjection.create(description.nonPartitionColumns, description.allColumns)

Contributor: Nit: indentation is off.

@rxin rxin changed the title [SPARK-18012][SQL] Simplify WriterContainer - WIP [SPARK-18012][SQL] Simplify WriterContainer Oct 20, 2016
SparkQA commented Oct 20, 2016

Test build #67223 has finished for PR 15551 at commit d405a3c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 20, 2016

Test build #67226 has finished for PR 15551 at commit cfba7bb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 20, 2016

Test build #67227 has finished for PR 15551 at commit 0ca0c81.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

liancheng (Contributor): LGTM

@asfgit asfgit closed this in f313117 Oct 20, 2016
rxin (Contributor, Author) commented Oct 20, 2016

cc @tejasapatil fyi on the change

tejasapatil (Contributor): @rxin: Thanks for notifying me.

robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016

Author: Reynold Xin <rxin@databricks.com>

Closes apache#15551 from rxin/writercontainer-refactor.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017