-
Notifications
You must be signed in to change notification settings - Fork 29.1k
[SPARK-18493] Add missing python APIs: withWatermark and checkpoint to dataframe #15921
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @davies for PySpark changes |
|
No, I don't think we need to throw any exceptions. Watermarks are defined at batch boundaries, so it would just have no affect for a batch job. We should make sure that the batch planner knows that it can elide the operator (it probably does not today). |
|
Test build #68799 has finished for PR 15921 at commit
|
|
Test build #68801 has finished for PR 15921 at commit
|
|
Test build #68814 has finished for PR 15921 at commit
|
|
@marmbrus Why don't we want to throw exceptions? Wouldn't it help users catch errors early. |
|
Test build #68826 has finished for PR 15921 at commit
|
|
Test build #68866 has finished for PR 15921 at commit
|
|
Test build #68867 has finished for PR 15921 at commit
|
zsxwing
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall. Just several nits.
| * Returns a checkpointed version of this Dataset. | ||
| * Eagerly checkpoint a Dataset and return the new Dataset. Checkpointing can be used to truncate | ||
| * the logical plan of this Dataset, which is especially useful in iterative algorithms where the | ||
| * plan may grow exponentially. It will be saved to a file inside the checkpoint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: a file -> files.
Each partition will be saved to one file.
| * Returns a checkpointed version of this Dataset. | ||
| * Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the | ||
| * logical plan of this Dataset, which is especially useful in iterative algorithms where the | ||
| * plan may grow exponentially. It will be saved to a file inside the checkpoint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: a file -> files.
Each partition will be saved to one file.
python/pyspark/sql/dataframe.py
Outdated
| def checkpoint(self, eager=True): | ||
| """Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the | ||
| logical plan of this DataFrame, which is especially useful in iterative algorithms where the | ||
| plan may grow exponentially. It will be saved to a file inside the checkpoint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: a file -> files.
Each partition will be saved to one file.
|
Test build #68868 has finished for PR 15921 at commit
|
|
Test build #68950 has finished for PR 15921 at commit
|
|
thanks @gatorsmile and @tdas. I addressed your comments. The semantics look a lot cleaner now. That doesn't still mean it's clean though :P |
|
Test build #68959 has finished for PR 15921 at commit
|
|
Hallelujah! @zsxwing shall we merge this? |
|
LGTM. Merging to master and 2.1. |
…o dataframe ## What changes were proposed in this pull request? This PR adds two of the newly added methods of `Dataset`s to Python: `withWatermark` and `checkpoint` ## How was this patch tested? Doc tests Author: Burak Yavuz <brkyvz@gmail.com> Closes #15921 from brkyvz/py-watermark. (cherry picked from commit 97a8239) Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
…o dataframe ## What changes were proposed in this pull request? This PR adds two of the newly added methods of `Dataset`s to Python: `withWatermark` and `checkpoint` ## How was this patch tested? Doc tests Author: Burak Yavuz <brkyvz@gmail.com> Closes apache#15921 from brkyvz/py-watermark.
…o dataframe ## What changes were proposed in this pull request? This PR adds two of the newly added methods of `Dataset`s to Python: `withWatermark` and `checkpoint` ## How was this patch tested? Doc tests Author: Burak Yavuz <brkyvz@gmail.com> Closes apache#15921 from brkyvz/py-watermark.
What changes were proposed in this pull request?
This PR adds two of the newly added methods of
Datasets to Python:withWatermarkandcheckpointHow was this patch tested?
Doc tests