
[SPARK-6369] [SQL] Uses commit coordinator to help committing Hive and Parquet tables #5139

Closed
wants to merge 3 commits

Conversation

liancheng
Contributor

This PR leverages the output commit coordinator introduced in #4066 to help committing Hive and Parquet tables.

This PR extracts output commit code in SparkHadoopWriter.commit to SparkHadoopMapRedUtil.commitTask, and reuses it for committing Parquet and Hive tables on executor side.
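The coordination pattern the extracted `commitTask` relies on can be sketched roughly as follows. This is an illustrative stand-in, not Spark's actual implementation: the object name and method signature are invented for the example, and the real coordinator lives on the driver and is contacted over RPC rather than through a local map.

```scala
// Hypothetical sketch of the output-commit-coordinator idea: the first task
// attempt to ask for a given (jobId, splitId) partition wins the right to
// commit; any later (e.g. speculative) attempt for the same partition is
// denied, so duplicates cannot clobber each other's output.
object CommitCoordinatorSketch {
  // (jobId, splitId) -> attemptId that was authorized to commit
  private val authorized = scala.collection.mutable.Map.empty[(Int, Int), Int]

  def canCommit(jobId: Int, splitId: Int, attemptId: Int): Boolean =
    synchronized {
      authorized.get((jobId, splitId)) match {
        case Some(winner) => winner == attemptId // only the winner may retry
        case None =>
          authorized((jobId, splitId)) = attemptId // first asker wins
          true
      }
    }
}
```

In the real code path, the executor-side helper asks the driver-side coordinator this question before invoking the Hadoop `OutputCommitter`.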

TODO

  • Add tests


@SparkQA

SparkQA commented Mar 23, 2015

Test build #29004 has started for PR 5139 at commit dfdf3ef.

  • This patch merges cleanly.

@marmbrus
Copy link
Contributor

@aarondav if you have time, I'd appreciate your input here.

@SparkQA

SparkQA commented Mar 23, 2015

Test build #29004 has finished for PR 5139 at commit dfdf3ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29004/

@liancheng
Contributor Author

@JoshRosen It would be great if you could also help review this. Thanks in advance!

sparkTaskContext.attemptNumber())
}

def commitTask(

Please add documentation to this method explaining what it means to commitTask: we may contact the driver to become authorized to commit, which ensures speculative tasks do not override each other, and this may cause us to abort the task by throwing a CommitDeniedException if we cannot become authorized. Please also point to the JIRA that this fixes (the original one).
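The behavior described in this comment can be illustrated with a small sketch. The helper below is a simplified stand-in, not the actual Spark implementation (Spark's real `CommitDeniedException` lives in `org.apache.spark.executor`; here a local class is defined so the example is self-contained):

```scala
// Illustrative only: a commitTask-style helper either commits the task's
// output or aborts the attempt by throwing a CommitDeniedException when the
// driver-side coordinator refuses authorization.
class CommitDeniedException(msg: String) extends Exception(msg)

object CommitTaskSketch {
  // `driverApproved` stands in for the answer from the driver's coordinator.
  def commitTask(driverApproved: Boolean, jobId: Int, splitId: Int, attemptId: Int): String =
    if (driverApproved) {
      s"committed job $jobId, partition $splitId, attempt $attemptId"
    } else {
      throw new CommitDeniedException(
        s"attempt $attemptId of partition $splitId was not authorized to commit")
    }
}
```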

@aarondav
Contributor

LGTM from my side, but @JoshRosen should confirm that the driver side is happy with this. My only comment is that, now that this code is extracted and used in a common location, we need to make sure its API is well-documented.

@liancheng
Contributor Author

@aarondav Thanks! Added javadoc for this.

@SparkQA

SparkQA commented Mar 30, 2015

Test build #29405 has started for PR 5139 at commit 9a4b82b.

@SparkQA

SparkQA commented Mar 30, 2015

Test build #29405 has finished for PR 5139 at commit 9a4b82b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29405/

* the driver in order to determine whether this attempt can commit (please see SPARK-4879 for
* details).
*
* Commit output coordinator is only contacted when the following two configurations are both set

"Commit output coordinator" -> "Output commit coordinator"

@JoshRosen
Contributor

LGTM from a Spark core point-of-view. One of the biggest risks here is passing the Long-valued parameters in the wrong order, but it looks like we've done it correctly here. I suppose that another risk might be calling the commit function with values of jobId, splitId, and attemptId that don't match / correspond to the ones used in the MapReduceTaskAttemptContext (that would undermine the whole scheme because the coordination wouldn't necessarily be guarding the right output paths), but our usage here looks fine as far as I can tell.
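The parameter-order risk raised in this review comes from several IDs sharing the same primitive type, so the compiler cannot catch a swapped call. One common guard, sketched below with invented wrapper types (this is not Spark's actual API), is to give each ID its own type so that a call with the arguments in the wrong order fails to compile instead of silently coordinating the wrong partition:

```scala
// Illustrative sketch: distinct wrapper types for same-typed ID parameters.
// describeCommit(SplitId(2L), JobId(1L), AttemptId(3L)) would not compile,
// whereas swapping two raw Longs would go unnoticed until runtime.
case class JobId(value: Long)
case class SplitId(value: Long)
case class AttemptId(value: Long)

def describeCommit(job: JobId, split: SplitId, attempt: AttemptId): String =
  s"job=${job.value} split=${split.value} attempt=${attempt.value}"
```

The trade-off is extra wrapping noise at call sites, which is why code like this often relies on careful review of argument order instead.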

@liancheng liancheng changed the title [SPARK-6369] [SQL] [WIP] Uses commit coordinator to help committing Hive and Parquet tables [SPARK-6369] [SQL] Uses commit coordinator to help committing Hive and Parquet tables Mar 30, 2015
@SparkQA

SparkQA commented Mar 30, 2015

Test build #29434 has started for PR 5139 at commit 72eb628.

@liancheng
Contributor Author

Thanks @JoshRosen! Fixed the typo. I'm merging this to master and 1.3.

asfgit pushed a commit that referenced this pull request Mar 30, 2015
…d Parquet tables

This PR leverages the output commit coordinator introduced in #4066 to help committing Hive and Parquet tables.

This PR extracts output commit code in `SparkHadoopWriter.commit` to `SparkHadoopMapRedUtil.commitTask`, and reuses it for committing Parquet and Hive tables on executor side.

TODO

- [ ] Add tests


Author: Cheng Lian <lian@databricks.com>

Closes #5139 from liancheng/spark-6369 and squashes the following commits:

72eb628 [Cheng Lian] Fixes typo in javadoc
9a4b82b [Cheng Lian] Adds javadoc and addresses @aarondav's comments
dfdf3ef [Cheng Lian] Uses commit coordinator to help committing Hive and Parquet tables

(cherry picked from commit fde6945)
Signed-off-by: Cheng Lian <lian@databricks.com>
@asfgit asfgit closed this in fde6945 Mar 30, 2015
@liancheng liancheng deleted the spark-6369 branch March 30, 2015 23:52
@SparkQA

SparkQA commented Mar 31, 2015

Test build #29434 has finished for PR 5139 at commit 72eb628.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29434/

yu-iskw pushed a commit to yu-iskw/spark that referenced this pull request Apr 3, 2015
…d Parquet tables

This PR leverages the output commit coordinator introduced in apache#4066 to help committing Hive and Parquet tables.

This PR extracts output commit code in `SparkHadoopWriter.commit` to `SparkHadoopMapRedUtil.commitTask`, and reuses it for committing Parquet and Hive tables on executor side.

TODO

- [ ] Add tests


Author: Cheng Lian <lian@databricks.com>

Closes apache#5139 from liancheng/spark-6369 and squashes the following commits:

72eb628 [Cheng Lian] Fixes typo in javadoc
9a4b82b [Cheng Lian] Adds javadoc and addresses @aarondav's comments
dfdf3ef [Cheng Lian] Uses commit coordinator to help committing Hive and Parquet tables
6 participants