[BEAM-1630] Adds API for defining Splittable DoFns using Python SDK. #3882
See https://s.apache.org/splittable-do-fn-python-sdk for the design.
This PR and the above doc were updated to reflect the following recent updates to Splittable DoFn:
* Support for ProcessContinuations
* Support for dynamically updating the output watermark irrespective of the output element production.
This will be followed by a PR that adds support for reading Splittable DoFns using DirectRunner.
Thanks!
need to write a Splittable DoFn.

Not all runners support Splittable DoFn. See the capability matrix
(a href="https://beam.apache.org/documentation/runners/capability-matrix/)
The matrix is currently focused on Java, so this is a bit misleading.
Removed this paragraph.
A Splittable DoFn must provide suitable overrides for the following methods
of the ``DoFn`` class.
* new_tracker()
* restriction_coder()
In Java, restriction_coder() and split() are not required (have defaults). Will this be addressed in Python too?
Made those two methods optional and added default implementations.
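A minimal sketch of what "optional with defaults" could look like. `SplittableDoFnBase` and `CountingDoFn` are hypothetical names, the pass-through `split()` default is an assumption, and returning `None` from `restriction_coder()` is only a placeholder because the PR's actual default coder is not shown in this thread.

```python
class SplittableDoFnBase(object):
    """Hypothetical base class; method names follow the snippet above."""

    def new_tracker(self, restriction):
        # Required: every splittable DoFn must supply its own tracker.
        raise NotImplementedError

    def split(self, element, restriction):
        # Optional: by default the restriction is not split any further.
        yield restriction

    def restriction_coder(self):
        # Optional: None stands for "use a general-purpose fallback coder"
        # in this sketch; the real default is not shown in the thread.
        return None


class CountingDoFn(SplittableDoFnBase):
    """Overrides only the required method and inherits both defaults."""

    def new_tracker(self, restriction):
        return list(restriction)  # placeholder tracker for illustration
```

With this shape, an author who only needs a tracker overrides one method, which mirrors the Java behaviour mentioned in the question.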
pass

@staticmethod
def stop():
Is this necessary in Python? Seems like with the "last yielded element is a ProcessContinuation" thing, instead of saying yield stop() you can simply not yield anything.
Yeah, makes sense to remove this. Removed and updated docs.
Thanks.
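The "last yielded element is a ProcessContinuation" convention can be sketched as follows. This is an illustration only: the `ProcessContinuation` class here is a hypothetical stand-in for the SDK class, and `run_process` is an assumed runner-side helper that does not appear in the PR.

```python
class ProcessContinuation(object):
    """Hypothetical stand-in for the SDK class; only carries a resume delay."""

    def __init__(self, resume_delay=0):
        self.resume_delay = resume_delay


def process(element, limit):
    """Toy process(): emits up to `limit` items, then asks to be resumed."""
    for i in range(min(limit, len(element))):
        yield element[i]
    if limit < len(element):
        # Work remains: the last yielded value is a continuation request.
        yield ProcessContinuation(resume_delay=5)
    # Otherwise nothing more is yielded, which already means "stop".


def run_process(element, limit):
    """Runner-side sketch: separates outputs from a trailing continuation."""
    outputs = list(process(element, limit))
    continuation = None
    if outputs and isinstance(outputs[-1], ProcessContinuation):
        continuation = outputs.pop()
    return outputs, continuation
```

Because the absence of a trailing continuation already signals completion, no explicit `stop()` value is needed in this scheme.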
Thanks, delegating the rest of the review to Robert.
Friendly ping :)
See following documents for more details.
* https://s.apache.org/splittable-do-fn
* https://s.apache.org/splittable-do-fn-python-sdk
"""
Please document which methods may be called concurrently (and hence need locking).
Done.
"""Returns the current restriction.

Returns a restriction accurately describing the full range of work the
current ``DoFn.process()`` call will do, including already completed work.
Will this be updated over time? Should we note that anything in this restriction is a candidate for removal, and try_claim must be called before actually doing any work?
Currently we don't have try_claim() (and other methods needed for dynamic work rebalancing). For now I mentioned that current restriction might be updated dynamically.
...due to concurrent invocation of other methods...
I think the fact that this may be called from other threads is key.
Done.
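To illustrate why cross-thread calls matter here, the following sketch guards the restriction with a lock. `ToyRangeTracker` and `shrink_to` are hypothetical names invented for this example and are not part of the PR; `shrink_to` merely stands in for whatever concurrent operation updates the restriction.

```python
import threading


class ToyRangeTracker(object):
    """Hypothetical tracker over [start, stop); not the SDK implementation."""

    def __init__(self, start, stop):
        self._lock = threading.Lock()
        self._start = start
        self._stop = stop

    def current_restriction(self):
        # May be called from another thread while process() is running.
        with self._lock:
            return (self._start, self._stop)

    def shrink_to(self, new_stop):
        # Stands in for any concurrent update of the restriction.
        with self._lock:
            if not self._start <= new_stop <= self._stop:
                raise ValueError('new_stop out of range')
            self._stop = new_stop
```

Holding the same lock in both methods means a reader never observes a half-updated pair of bounds.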
fully read.

Returns: ``True`` if current restriction has been fully processed.
Raises ValueError: if there is still any unclaimed work remaining in the
Isn't it legal for a DoFn to return before fully processing the restriction?
Seems like it'd be better to call checkpoint() after the process has completed, at which point, if the entire restriction was processed, the current (remaining) restriction would be None.
Yeah, the runner will take a checkpoint after the processing has completed. This method will be called after that as an additional check to make sure that no data remains in the current restriction after the checkpoint is taken.
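The ordering described in this reply (checkpoint first, then `check_done()` as a safety check) can be sketched as below. `ToyTracker` and its `advance()` helper are hypothetical, and representing restrictions as `(start, stop)` tuples is an assumption made only for this illustration.

```python
class ToyTracker(object):
    """Hypothetical tracker for the offset range [start, stop)."""

    def __init__(self, start, stop):
        self._start = start
        self._stop = stop
        self._position = start  # next offset that has not been processed

    def advance(self, n=1):
        # Illustrative helper: records that n more offsets were processed.
        self._position = min(self._position + n, self._stop)

    def current_restriction(self):
        return (self._start, self._stop)

    def checkpoint(self):
        # Keep the processed part as the current restriction and return
        # the unprocessed remainder as the residual.
        residual = (self._position, self._stop)
        self._stop = self._position
        return residual

    def check_done(self):
        if self._position < self._stop:
            raise ValueError(
                'Offsets [%d, %d) have not been processed'
                % (self._position, self._stop))
        return True
```

Calling `check_done()` before the checkpoint raises if work remains, while calling it afterwards succeeds because the current restriction has shrunk to the processed part.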
sdks/python/apache_beam/io/iobase.py
Outdated
"""
raise NotImplementedError

def checkpoint(self):
What about try_split? I'd like to avoid redundancy with that one.
Currently we don't support dynamic work rebalancing (neither does Java SDF). I think we should have further discussions before finalizing the API needed for that.
We shouldn't introduce a method that we know will be redundant with future developments.
single parameter of type ``Timestamp`` or as an integer that gives the
watermark in number of seconds.

** Splittable DoFns **
Perhaps move the Args section up to here (or maybe even higher). It does feel odd that 99% of DoFns won't be splittable, but 80% of this docstring is about splittable DoFns.
Good point, done :).
yield restriction

@staticmethod
def resume(resume_delay=0):
Is this intended to be used as
return self.resume(...)
Maybe it'd be better to put this as a static method on ProcessContinuation.
Done.
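A rough sketch of the suggested placement, with `resume()` as a static factory on the continuation class rather than on the DoFn. The class body is a hypothetical stand-in and omits everything except the resume delay.

```python
class ProcessContinuation(object):
    """Hypothetical stand-in for the SDK class discussed in this thread."""

    def __init__(self, resume_delay=0):
        self.resume_delay = resume_delay

    @staticmethod
    def resume(resume_delay=0):
        # Factory used at the call site instead of a method on the DoFn.
        return ProcessContinuation(resume_delay=resume_delay)


def process(element):
    """Toy process(): emits the element, then asks to be resumed later."""
    yield element
    yield ProcessContinuation.resume(resume_delay=10)
```

The call site then reads `ProcessContinuation.resume(...)`, which makes the returned type obvious without referring to `self`.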
@@ -189,6 +276,54 @@ def finish_bundle(self):
"""
pass

def new_tracker(self, restriction):
It's unclear that all of these methods are only used in the splittable case. Maybe some documentation? (Maybe the names should all reflect this, e.g. containing the word "splittable"?)
I might go so far as to put these in a mixin class (interface) that would be checked for if the splittable argument was used, which we could then also use to enforce that they're implemented.
Updated documentation. I don't think it makes sense to move these to a MixIn since we don't expect any type other than DoFn to have this functionality.
The advantage is that we can make these many methods abstract such that the user won't forget any of them when they need them.
Thanks. PTAL.
Failure seems to be unrelated. Retest this please.
Sorry, I didn't realize I hadn't sent these comments out yet.
"""

def __init__(self, should_resume, resume_delay=0):
"""Initializes a ProcessContinuation object.
Do we need the "should_resume" parameter if it's never returned (yielded) from a DoFn that shouldn't?
Thanks.
sdks/python/apache_beam/io/iobase.py
Outdated
"""
raise NotImplementedError

def checkpoint(self):
robertwb wrote:
We shouldn't introduce a method that we know will be redundant with future developments.
Seems like, based on previous discussions [1], we are moving towards maintaining both methods (try_split() and checkpoint()) due to the need for calling checkpoint() after a resume()? Also, the Java API currently has the checkpoint() method [2]; we should add this to Python for consistency and, if needed, update both APIs (again maintaining consistency) after future discussions/experiments?
[1] https://docs.google.com/document/d/1BGc8pM1GOvZhwR9SARSVte-20XEoBUxrGJ5gTWXdv3c/edit
[2] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/splittabledofn/RestrictionTracker.java#L42
Semantically, checkpoint() enforces that the immediately following claim (try_claim() once we have that method) is rejected, while try_split(0) does not enforce that.
"""

def __init__(self, should_resume, resume_delay=0):
"""Initializes a ProcessContinuation object.
robertwb wrote:
Do we need the "should_resume" parameter if it's never returned (yielded) from a DoFn that shouldn't?
Removed.
@@ -189,6 +276,54 @@ def finish_bundle(self):
"""
pass

def new_tracker(self, restriction):
robertwb wrote:
The advantage is that we can make these many methods abstract such that the user won't forget any of them when they need them.
I'm not 100% sure about this one. It seems like you are suggesting that users directly implement the MixIn (say SplittableMixIn). But this means that every SDF author will have to implement two classes just to make it easy for the SDK to perform validation. I think it might be better to make the API easy for users (just implement DoFn and implement the required SDF methods if needed) and do validation the (slightly) harder way.
To follow up, I'm OK with leaving Regarding the mixin/documentation issues, here's a better idea: Rather than take a contentless |
Thanks. Added 'RestrictionProvider' and updated documentation accordingly. PTAL.
Retest this please.
Looking good!
Splittable ``DoFn``s.

To denote a ``DoFn`` class to be Splittable ``DoFn``, ``DoFn.process()``
method of that class should have exactly one parameter of a type that is a
one parameter whose default value is an instance of RestrictionProvider.
Done.
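One way an SDK could detect "a parameter whose default value is an instance of RestrictionProvider" is to inspect the signature of `process()`. The sketch below is an assumption about how such a check might look, not the PR's implementation; `RestrictionProvider` is an empty stand-in, and `get_restriction_provider`, `PlainDoFn`, and `MySplittableDoFn` are hypothetical names.

```python
import inspect


class RestrictionProvider(object):
    """Hypothetical stand-in for the class added in this PR."""


def get_restriction_provider(process_fn):
    """Returns the RestrictionProvider default of process_fn, or None."""
    providers = [
        p.default
        for p in inspect.signature(process_fn).parameters.values()
        if isinstance(p.default, RestrictionProvider)]
    if len(providers) > 1:
        raise ValueError(
            'process() may declare at most one RestrictionProvider')
    return providers[0] if providers else None


class PlainDoFn(object):
    def process(self, element):
        yield element


class MySplittableDoFn(object):
    def process(self, element, tracker=RestrictionProvider()):
        yield element
```

Keying on the default value rather than the parameter name lets authors call the parameter anything they like, and a DoFn with no such default is simply treated as non-splittable.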
gives the watermark in number of seconds.
"""

def new_tracker(self, restriction):
s/new/create/ (which is consistent with the rest of our API).
Done.
"""
raise NotImplementedError

def restriction_coder(self):
Move this one last.
Done.
``DoFn.SideInputParam``: a side input that may be used when processing.
``DoFn.TimestampParam``: timestamp of the input element.
``DoFn.WindowParam``: ``Window`` the input element belongs to.
An object of type ``RestrictionProvider``: having a parameter of a type that
An ``iobase.RestrictionProvider`` instance: a restriction tracker will be provided here to allow treatment as a Splittable ``DoFn``.
Done.
Thanks. PTAL.
Retest this please.
Jenkins failure above is unrelated. Retest this please.
LGTM, thanks!
sdks/python/apache_beam/io/iobase.py
Outdated
``current_restriction()`` and the return value of this method invocation
combined.

This method must be called at most once on a given ``RestrictionTracker``
Nit: we can't enforce or require this, as we may call checkpoint() immediately before a resume is returned.
Done.
Thanks.