-
Notifications
You must be signed in to change notification settings - Fork 4.5k
[BEAM-6695] Latest PTransform for Python SDK #8206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Run Portable_Python PreCommit |
|
Run Python PreCommit |
|
Run RAT PreCommit |
|
Run Python_PVR_Flink PreCommit |
|
Run Portable_Python PreCommit |
|
R: @robinyqiu |
|
Hi Tanay! Thanks for the contribution. Currently the Python preCommit are all failling. I can review the code while we wait for the preCommit to pass. |
|
Hi @robinyqiu! Sorry I was AFK. I'll fix the failing Python Pre-Commit before the review. I'll ping you once the tests pass. |
abb6583 to
8d9b97a
Compare
|
Hi @robinyqiu the tests have passed. It is ready for review :) |
Added Latest PTransform and Combine Fns for the Python SDK. Latest PTransform is used to compute the element(s) with the latest timestamp from a PCollection.
|
Observation from working on this: I have a PR on my fork with a failing test to illustrate this: ttanay#6 I'm not sure whether a PCollection of TimestampedValues would be of much use. I found a tuple of (value, timestamp) to be a useful replacement for it. |
| class LatestTest(unittest.TestCase): | ||
|
|
||
| def test_globally(self): | ||
| l = [window.GlobalWindows.windowed_value(1, 100), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe shuffle the example inputs a bit to make them look more random? Like [(2, 100), (3, 300), (1, 200)]? Same below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! I'll make the change.
| output = self.fn.extract_output(accumulator) | ||
| self.assertEquals(output, 1) | ||
|
|
||
| def test_input_type_violation(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this test should go to LatestTest instead of LatestCombineFnTest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test - test_input_type_violation is used to test the type imposed on the input to the CombineFn here.
The test - test_per_key_type_violation is used to check that the input is a KV pair in the DoFn - add_timestamp of class PerKey here.
I think test_per_key_type_violation can be written better to display the same behaviour. I'll make that change.
|
Run Portable_Python PreCommit |
|
Run RAT PreCommit |
|
Run Python_PVR_Flink PreCommit |
|
Run Python PreCommit |
1 similar comment
|
Run Python PreCommit |
|
Hey @robinyqiu I have made the changes. PTAL. 🚀 |
| @staticmethod | ||
| def add_timestamp(element, timestamp=core.DoFn.TimestampParam): | ||
| _check_instance_type(KV[T, T], element) | ||
| (K, V) = element |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe consider changing these local variables to (k, v) since K, V are type variables already defined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or in a similar vein to element, timestamp etc, use key, value = element. Also I believe you do not need the paranthesis on the left.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank You @robinyqiu and @aaltay. I'll make the change.
| | core.CombinePerKey(LatestCombineFn())) | ||
|
|
||
|
|
||
| @with_input_types(KV[T, T]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using KV as type hint seems a bit weird to me because the input is really a (element, timestamp) tuple. Maybe we can just use something like Tuple[E, T] as input type and E as output type. For more about Python type hints, see: https://beam.apache.org/documentation/sdks/python-type-safety/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. It should be a Tuple rather than a KV.
Also, only types int, float, long, Timestamp and Duration can be interpreted as timestamps[1] so they can be compared to window.MIN_TIMESTAMP in the accumulator. So, having the timestamp's type as T would be less restrictive.
Does it make sense to create a new type - TimestampT/TimestampType = Union[int, float, long, Timestamp, Duration] in typehints.py or the combiners module? This would make the input type Tuple[T, TimestampType].
[1] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/timestamp.py#L71
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good suggestion. We can add the definition in this module.
Why is Duration also allowed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both Timestamp and Duration have an attribute - micros. This is used to compare the two.
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/timestamp.py#L161-L165
| from apache_beam.typehints import Union | ||
| from apache_beam.typehints import with_input_types | ||
| from apache_beam.typehints import with_output_types | ||
| from apache_beam.typehints.decorators import _check_instance_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Functions whose names start with underscore are intended to be used only internally. So we shouldn't import _check_instance_type from another package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried using the decorator with_input_types. But, it was comparing the typehint with WindowedValue, whereas, the type of the element in the DoFn was a tuple which was actually the WindowedValue object's value attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with_input_types won't work here. Are there other public functions that does the check available? I still believe importing private functions is not a good approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are no public functions that do the same/something similar, except for with_input_types AFAIK.
I think the problem here is that we want to check the type of the element which is different at pipeline construction-time and runtime.
While creating the PCollection, the type of the elements is WindowedValue and inside the DoFn, the type of the element is Tuple.
If the input is an object of WindowedValue, any DoFn will have its element param as the value attribute of the WindowedValue. I think the type evaluated at pipeline construction-time should also be the same.
If that is the case, then with_input_types will work as intended here.
| @staticmethod | ||
| def add_timestamp(element, timestamp=core.DoFn.TimestampParam): | ||
| _check_instance_type(KV[T, T], element) | ||
| _check_instance_type(Tuple[T, T], element) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe you have accidentally changed KV to Tuple here, I think here we do want to check the input has KV[K, V] type.
| def expand(self, pcoll): | ||
| return (pcoll | ||
| | core.ParDo(self.add_timestamp) | ||
| .with_output_types(Tuple[T, Tuple[T, TimestampType]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. I believe it should be KV[K, Tuple[T, TimestampType]]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry. My bad. I'll change it.
|
Hi @robinyqiu |
|
LGTM, and thank you very much for making these changes! |
|
Thank you @robinyqiu! 😄 |
| pc = p | Create(l_int) | ||
| _ = pc | beam.ParDo(combine.Latest.PerKey.add_timestamp) | ||
|
|
||
| with self.assertRaises(TypeCheckError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is being tested here? Why passing KV values here will result in an assert?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was to test that the _check_instance_type used above raises an Exception. The last positional argument to window.GlobalWindows.windowed_value is the timestamp of each element. So, the input PCollections are those that violate the KV constraint.
|
|
||
| @staticmethod | ||
| def add_timestamp(element, timestamp=core.DoFn.TimestampParam): | ||
| _check_instance_type(KV[K, V], element) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can drop this. The reason is, during pipeline construction these validation will happen, the best thing to do at transform level is to use typehints as much as possible to express what the transform is doing. These are limited but it is better than introducing one off checks inside transforms.
See Top transform for an example: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/combiners.py#L314
(This will also be good for not using an internal function.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank You @aaltay! I'll go through it and make the necessary changes.
|
@aaltay @robinyqiu The need for type validation inside the staticmethod # Elements to create PCollection from
elem_list = [window.GlobalWindows.windowed_value(('a', 1), 300),
window.GlobalWindows.windowed_value(('b', 3), 100),
window.GlobalWindows.windowed_value(('a', 2), 200)]
with TestPipeline() as p:
pc = p | Create(elem_list)
latest = pc | combine.Latest.PerKey()
assert_that(latest, equal_to([('a', 1), ('b', 3)]))In the staticmethod Is this expected behaviour? If not, I'd love to fix this. It would solve the problem. |
|
Try this instead: In your example, the inputs you are passing are not in type Tuple[K, V]. For example, whatever test you come up with, it should be valid to do Create(elem_list) | GroupByKey(), and that input will work on PerKey transform. I will defer to @robinyqiu for the remainder of the review. |
|
Thank You @aaltay 😄 I tried using TimestampedValue, but, I need to set the timestamp so it is accessible as the TimestampParam in the DoFn - @robinyqiu I've removed the call to |
|
Thank you @ttanay for your patience. TimestampedValue is supposed to pack its timestamp. I expected that TimestampParam should work with that, if not, that might be a bug. |
|
Hi @aaltay sorry for late reply. |
|
Hi @ttanay , I spent some time to investigate into the issues. I am sharing my findings here and I hope it will be helpful.
In other words, the action item is that we should add type annotations here @with_input_types(T)
@with_output_types(T)
class Globally(ptransform.PTransform): ...and here @with_input_types(KV[K, V])
@with_output_types(KV[K, V])
class PerKey(ptransform.PTransform): ...This is the intended behavior for these transforms. I know that the current tests will break after we add these annotations, but I think the right thing to do is that we make these changes and fix the tests (how we create a PCollection of timestamped values/kvs using
I have tried it and the following test passed: def test_per_key(self):
l = [window.TimestampedValue(('a', 1), 300),
window.TimestampedValue(('b', 3), 100),
window.TimestampedValue(('a', 2), 200)]
with TestPipeline() as p:
pc = p | Create(l) | Map(lambda x: x)
latest = pc | combine.Latest.PerKey()
assert_that(latest, equal_to([('a', 1), ('b', 3)]))
|
|
Thank You @robinyqiu! The action points and the hack were really helpful. I now understand why the hack was neccessary. This approach of type-checking looks much cleaner! |
| def test_globally_empty(self): | ||
| l = [] | ||
| with TestPipeline() as p: | ||
| pc = p | Create(l) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add an explicit Map here as well, similar to what we did below, for the sake of consistency.
(This test is passing now only because in the assert_that function there is an implicit Map transform added to the pipeline.)
| window.GlobalWindows.windowed_value(('b', 3), 100), | ||
| window.GlobalWindows.windowed_value(('a', 2), 200)] | ||
| with TestPipeline() as p: | ||
| pc = p | Create(l) | Map(lambda x: x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should add a short comment here to explain why this hack is needed. Same above.
| class LatestTest(unittest.TestCase): | ||
|
|
||
| def test_globally(self): | ||
| l = [window.GlobalWindows.windowed_value(3, 100), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using TimestampedValue instead of windowed_value? I think it will make the code easier to understand. Same below.
|
Made the changes. @robinyqiu PTAL :) |
|
Thank you @ttanay and @robinyqiu. This was a great collaboration between you two! |
Added Latest PTransform and Combine Fns for the Python SDK.
Latest PTransform is used to compute the element(s) with the
latest timestamp from a PCollection.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username).[BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replaceBEAM-XXXwith the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.