[SPARK-6369] [SQL] Uses commit coordinator to help committing Hive and Parquet tables #5139
Test build #29004 has started for PR 5139 at commit
@aarondav if you have time, I'd appreciate your input here.
Test build #29004 has finished for PR 5139 at commit
Test PASSed.
@JoshRosen It would be great if you could also help review this. Thanks in advance!
      sparkTaskContext.attemptNumber())
  }

  def commitTask(
Please add documentation to this guy mentioning what it means to commitTask (i.e., we may contact the driver to become authorized to commit to ensure speculative tasks do not override each other, and that this may cause us to abort the task by throwing a CommitDeniedException if we cannot become authorized as such), pointing to the JIRA that this fixes (the original one).
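For readers following the thread, the behaviour this comment asks to have documented can be sketched roughly as follows. This is a self-contained illustration under stated assumptions, not the code in the PR: `Coordinator`, `FirstWinsCoordinator`, `TaskCommitter`, `RecordingCommitter`, `CommitDenied`, and the `coordinationEnabled` flag are hypothetical stand-ins for the driver-side coordinator, the Hadoop committer, `CommitDeniedException`, and the configuration check.

```scala
import scala.collection.mutable

// Hypothetical stand-in for CommitDeniedException; not Spark's real class.
final case class CommitDenied(msg: String) extends RuntimeException(msg)

// Hypothetical view of the driver-side coordinator.
trait Coordinator {
  def canCommit(stage: Int, partition: Int, attempt: Int): Boolean
}

// Assumption: the first attempt to ask for a given (stage, partition) wins,
// and every other attempt for that partition is refused.
final class FirstWinsCoordinator extends Coordinator {
  private val winners = mutable.Map.empty[(Int, Int), Int]

  override def canCommit(stage: Int, partition: Int, attempt: Int): Boolean =
    synchronized {
      winners.getOrElseUpdate((stage, partition), attempt) == attempt
    }
}

// Hypothetical, simplified committer interface declared locally.
trait TaskCommitter {
  def needsTaskCommit: Boolean
  def commitTask(): Unit
  def abortTask(): Unit
}

// Test double that records which path was taken.
final class RecordingCommitter extends TaskCommitter {
  var committed = false
  var aborted = false
  override def needsTaskCommit: Boolean = true
  override def commitTask(): Unit = { committed = true }
  override def abortTask(): Unit = { aborted = true }
}

object CommitSketch {
  def commitTask(
      committer: TaskCommitter,
      coordinator: Coordinator,
      coordinationEnabled: Boolean,
      stage: Int,
      partition: Int,
      attempt: Int): Unit = {
    if (!committer.needsTaskCommit) {
      () // No output to commit for this attempt.
    } else if (!coordinationEnabled || coordinator.canCommit(stage, partition, attempt)) {
      // Either coordination is off, or this attempt was authorized.
      committer.commitTask()
    } else {
      // Denied: discard this attempt's output and signal the denial.
      committer.abortTask()
      throw CommitDenied(
        s"stage $stage, partition $partition, attempt $attempt: commit not authorized")
    }
  }
}
```

In this sketch, two speculative attempts for the same partition that both reach `CommitSketch.commitTask` end differently: the first commits, and the second aborts its output and throws `CommitDenied`.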
LGTM from my side, but @JoshRosen should confirm the driver side should be happy with this. Only comment was that now that it's extracted and used in a common location, we need to make sure its API is well-documented.
@aarondav Thanks! Added javadoc for this.
Test build #29405 has started for PR 5139 at commit
Test build #29405 has finished for PR 5139 at commit
Test PASSed.
   * the driver in order to determine whether this attempt can commit (please see SPARK-4879 for
   * details).
   *
   * Commit output coordinator is only contacted when the following two configurations are both set
"Commit output coordinator" -> "Output commit coordinator"
LGTM from a Spark core point-of-view. One of the biggest risks here is passing the Long-valued parameters in the wrong order, but it looks like we've done it correctly here. I suppose that another risk might be calling the commit function with values of jobId, splitId, and attemptId that don't match / correspond to the ones used in the MapReduceTaskAttemptContext (that would undermine the whole scheme because the coordination wouldn't necessarily be guarding the right output paths), but our usage here looks fine as far as I can tell.
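The parameter-order risk mentioned above can be illustrated with a small sketch. This is not what the PR does, which passes plain numeric ids; it only shows one way such mix-ups could be caught at compile time. The wrapper types `JobId`, `SplitId`, and `AttemptId` and the helper `TypedIds.describe` are hypothetical names introduced for illustration.

```scala
// Hypothetical wrapper types; the PR itself passes plain numeric ids.
final case class JobId(value: Long)
final case class SplitId(value: Long)
final case class AttemptId(value: Long)

object TypedIds {
  // Hypothetical helper: with distinct types per id, swapping two arguments
  // is a compile-time type error instead of a silent runtime bug.
  def describe(job: JobId, split: SplitId, attempt: AttemptId): String =
    s"job=${job.value} split=${split.value} attempt=${attempt.value}"
}
```

With these types, a call such as `TypedIds.describe(SplitId(2L), JobId(1L), AttemptId(3L))` would be rejected by the compiler, whereas the same swap with three bare `Long` arguments would compile without complaint. The trade-off is one extra wrapper allocation per id in this simple form.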
Test build #29434 has started for PR 5139 at commit
Thanks @JoshRosen! Fixed the typo. I'm merging this to master and 1.3.
…d Parquet tables

This PR leverages the output commit coordinator introduced in #4066 to help committing Hive and Parquet tables.

This PR extracts output commit code in `SparkHadoopWriter.commit` to `SparkHadoopMapRedUtil.commitTask`, and reuses it for committing Parquet and Hive tables on executor side.

TODO

- [ ] Add tests

Author: Cheng Lian <lian@databricks.com>

Closes #5139 from liancheng/spark-6369 and squashes the following commits:

72eb628 [Cheng Lian] Fixes typo in javadoc
9a4b82b [Cheng Lian] Adds javadoc and addresses @aarondav's comments
dfdf3ef [Cheng Lian] Uses commit coordinator to help committing Hive and Parquet tables

(cherry picked from commit fde6945)
Signed-off-by: Cheng Lian <lian@databricks.com>
Test build #29434 has finished for PR 5139 at commit
Test PASSed.
This PR leverages the output commit coordinator introduced in #4066 to help committing Hive and Parquet tables.

This PR extracts output commit code in `SparkHadoopWriter.commit` to `SparkHadoopMapRedUtil.commitTask`, and reuses it for committing Parquet and Hive tables on executor side.

TODO

- [ ] Add tests