@brkyvz
Contributor

@brkyvz brkyvz commented Nov 17, 2016

What changes were proposed in this pull request?

This PR adds two recently added `Dataset` methods to Python:
`withWatermark` and `checkpoint`.

How was this patch tested?

Doc tests
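
For readers unfamiliar with the two methods named in the description, here is a minimal sketch of how they might be called from PySpark. The helper name `checkpoint_then_watermark`, the column name `time`, and the delay `10 minutes` are illustrative assumptions and not part of this PR; `setCheckpointDir`, `DataFrame.checkpoint`, and `DataFrame.withWatermark` are the PySpark calls being exercised.

```python
def checkpoint_then_watermark(spark, df, checkpoint_dir,
                              event_col="time", delay="10 minutes"):
    """Hypothetical helper: checkpoint `df`, then attach a watermark.

    `spark` is expected to be a SparkSession and `df` a DataFrame;
    `event_col` and `delay` are placeholder values for illustration.
    """
    # checkpoint() needs a checkpoint directory set on the SparkContext.
    spark.sparkContext.setCheckpointDir(checkpoint_dir)
    # eager=True asks for the checkpoint to be written right away.
    checkpointed = df.checkpoint(eager=True)
    # Declare how late rows in `event_col` are allowed to arrive.
    return checkpointed.withWatermark(event_col, delay)
```

The helper only chains the calls, so it can be tried against any object that offers the same three methods. On a non-streaming DataFrame the watermark is expected to have no effect, as discussed in this thread.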

@brkyvz brkyvz changed the title from "Add missing python APIs: withWatermark and checkpoint to dataframe" to "[SPARK-18493] Add missing python APIs: withWatermark and checkpoint to dataframe" Nov 17, 2016
@brkyvz
Contributor Author

brkyvz commented Nov 17, 2016

cc @davies for PySpark changes
cc @liancheng for checkpoint API and javadoc update
cc @marmbrus for withWatermark API. Question here: should we throw an analysis exception if the Dataset that withWatermark is called on is non-streaming?

@marmbrus
Contributor

No, I don't think we need to throw any exceptions. Watermarks are defined at batch boundaries, so it would just have no effect for a batch job.

We should make sure that the batch planner knows that it can elide the operator (it probably does not today).
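
The point about batch boundaries can be illustrated with a toy model in plain Python. This is a sketch under simplifying assumptions, not Spark's implementation: the function name `run_micro_batches` is hypothetical, event times are plain `datetime` values, and "too late" rows are simply filtered out. The watermark only advances after a batch has been processed, so a job with a single batch never sees a watermark that could drop anything.

```python
from datetime import datetime, timedelta


def run_micro_batches(batches, delay):
    """Toy model: return (rows kept per batch, final watermark).

    The watermark is updated only at the boundary between batches,
    as max event time seen so far minus `delay`.
    """
    watermark = datetime.min  # nothing is "too late" before the first batch
    kept = []
    for batch in batches:
        # Rows older than the watermark from earlier batches are dropped.
        kept.append([t for t in batch if t >= watermark])
        if batch:
            # Advance the watermark at the batch boundary; it never moves back.
            watermark = max(watermark, max(batch) - delay)
    return kept, watermark


t0 = datetime(2016, 11, 17, 12, 0)
minute = timedelta(minutes=1)

# Streaming-like run: the second batch contains a row 20 minutes in the past.
streaming_kept, final_watermark = run_micro_batches(
    [[t0, t0 + 10 * minute], [t0 - 20 * minute, t0 + 15 * minute]],
    delay=5 * minute,
)

# Batch-like run: one batch only, so the watermark never filters anything.
batch_kept, _ = run_micro_batches([[t0 - 60 * minute, t0]], delay=5 * minute)
```

In the streaming-like run the old row in the second batch is dropped, while the single-batch run keeps every row, which matches the "no effect for a batch job" reasoning.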

@SparkQA

SparkQA commented Nov 17, 2016

Test build #68799 has finished for PR 15921 at commit 96da9e9.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 17, 2016

Test build #68801 has finished for PR 15921 at commit 9fc8aff.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 18, 2016

Test build #68814 has finished for PR 15921 at commit 7d7bc4d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Nov 18, 2016

@marmbrus Why don't we want to throw exceptions? Wouldn't it help users catch errors early?

@SparkQA

SparkQA commented Nov 18, 2016

Test build #68826 has finished for PR 15921 at commit da5de14.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 18, 2016

Test build #68866 has finished for PR 15921 at commit 3e4f7c1.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 18, 2016

Test build #68867 has finished for PR 15921 at commit 8ee9e2c.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@zsxwing zsxwing left a comment

Looks good overall. Just a few nits.

* Returns a checkpointed version of this Dataset.
* Eagerly checkpoint a Dataset and return the new Dataset. Checkpointing can be used to truncate
* the logical plan of this Dataset, which is especially useful in iterative algorithms where the
* plan may grow exponentially. It will be saved to a file inside the checkpoint
Member

nit: a file -> files.

Each partition will be saved to one file.

* Returns a checkpointed version of this Dataset.
* Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the
* logical plan of this Dataset, which is especially useful in iterative algorithms where the
* plan may grow exponentially. It will be saved to a file inside the checkpoint
Member

nit: a file -> files.

Each partition will be saved to one file.

def checkpoint(self, eager=True):
"""Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the
logical plan of this DataFrame, which is especially useful in iterative algorithms where the
plan may grow exponentially. It will be saved to a file inside the checkpoint
Member

nit: a file -> files.

Each partition will be saved to one file.
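
The nit repeated in the review comments above (one file per partition, hence "files") can be pictured with a small plain-Python sketch. It is only a toy model: `save_partitions` and the `part-00000` style names are assumptions made for illustration, and Spark's real checkpoint format and file naming are not modeled here.

```python
import os
import tempfile


def save_partitions(partitions, checkpoint_dir):
    """Toy model: write each partition to its own file, return the paths."""
    paths = []
    for index, rows in enumerate(partitions):
        # Hypothetical naming scheme, one file per partition.
        path = os.path.join(checkpoint_dir, "part-%05d" % index)
        with open(path, "w") as handle:
            for row in rows:
                handle.write("%s\n" % row)
        paths.append(path)
    return paths


# Three partitions in, three files out.
with tempfile.TemporaryDirectory() as checkpoint_dir:
    written = save_partitions([[1, 2], [3], [4, 5, 6]], checkpoint_dir)
    file_count = len(os.listdir(checkpoint_dir))
```

Because the number of files follows the number of partitions, a docstring that says "a file" would only be accurate for a single-partition Dataset.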

@SparkQA

SparkQA commented Nov 18, 2016

Test build #68868 has finished for PR 15921 at commit 306f7fd.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 21, 2016

Test build #68950 has finished for PR 15921 at commit 9de8518.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz
Contributor Author

brkyvz commented Nov 21, 2016

Thanks @gatorsmile and @tdas. I addressed your comments. The semantics look a lot cleaner now. That still doesn't mean it's clean, though :P

@SparkQA

SparkQA commented Nov 22, 2016

Test build #68959 has finished for PR 15921 at commit c7c046f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz
Contributor Author

brkyvz commented Nov 22, 2016

Hallelujah! @zsxwing shall we merge this?

@zsxwing
Member

zsxwing commented Nov 22, 2016

LGTM. Merging to master and 2.1.

@asfgit asfgit closed this in 97a8239 Nov 22, 2016
asfgit pushed a commit that referenced this pull request Nov 22, 2016
…o dataframe

## What changes were proposed in this pull request?

This PR adds two of the newly added methods of `Dataset`s to Python:
`withWatermark` and `checkpoint`

## How was this patch tested?

Doc tests

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #15921 from brkyvz/py-watermark.

(cherry picked from commit 97a8239)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 2, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
@brkyvz brkyvz deleted the py-watermark branch February 3, 2019 20:58