[BEAM-6619] [BEAM-6593] Add example integration tests to postcommit #8076
Run Python PostCommit
Run Python PostCommit
49cbdeb to a12b64b
Run Python PostCommit
fc57d0e to 98d7f0f
Run Python PostCommit
9a05dd2 to 9aa9f37
Run Python PostCommit
tvalentyn
left a comment
Thanks a lot, @Juta, for the changes and the cleanup. I left some comments; please take a look.
    def process(self, elem):
      try:
        if isinstance(elem, bytes):
Can we always decode here? It is better to have a clear expectation of input arguments as much as possible on Python 3: either always encoded bytes, or always strings, but not mixing the two.
The integration test uses pubsub as its input source, while the unit test currently passes strings. That's why we cannot always decode.
I am not sure I follow - could you please elaborate - what exactly in either pubsub or unit test makes it so that the input type is not consistent here?
The code for leader_board and game_stats is used both with strings https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/complete/game/leader_board_test.py#L45 and with pubsub messages: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/complete/game/leader_board_it_test.py#L146.
Ok, the fix would then be to:
- Expect elem to be a string.
- Decode the messages as they arrive from PubSub using | 'DecodeString' >> Map(lambda b: b.decode('utf-8')) in the example pipeline.
Note that there are several example pipelines that will need this change.
Note that we used to have a ReadStringsFromPubSub method that was deprecated (beam/sdks/python/apache_beam/io/gcp/pubsub.py, line 194 in a5ed104):
    @deprecated(since='2.7.0', extra_message='Use ReadFromPubSub instead.')
We may want to revisit this deprecation in the future.
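The suggested fix can be sketched in plain Python, without Beam itself. This is only an illustration of the idea: `parse_game_event` and `decode_payload` are hypothetical names, not functions from the example pipelines, and the list comprehension stands in for the `'DecodeString' >> Map(...)` step. The parser accepts text only, and bytes are decoded once at the point where they arrive from Pub/Sub.

```python
def parse_game_event(elem):
    """Hypothetical parser: assumes `elem` is always text, never bytes."""
    if not isinstance(elem, str):
        raise TypeError('expected str, got %s' % type(elem).__name__)
    # Fields: user, team, score, processing time, window start.
    user, team, score, timestamp = elem.split(',')[:4]
    return {'user': user, 'team': team, 'score': int(score),
            'timestamp': int(timestamp)}


def decode_payload(payload):
    """Same logic as the suggested Map lambda: bytes in, text out."""
    return payload.decode('utf-8')


# Payloads as they would arrive from Pub/Sub (bytes); values are made up.
pubsub_payloads = [b'user1,teamA,10,1447697463000,2015-11-02 09:09:28.224']

# Decode once at the boundary, then parse text only.
events = [parse_game_event(decode_payload(p)) for p in pubsub_payloads]

# The unit-test path already has strings and skips the decode step.
unit_test_event = parse_game_event(
    'user2,teamB,7,1447697463000,2015-11-02 09:09:28.224')
```

With this split, the parsing code has a single input type to reason about, and only the Pub/Sub-facing side of each pipeline needs to know about bytes.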
@tvalentyn I applied this fix
      # Input event containing user, team, score, processing time, window start.
    - INPUT_EVENT = 'user1,teamA,10,%d,2015-11-02 09:09:28.224'
    + INPUT_EVENT = b'user1,teamA,10,%d,2015-11-02 09:09:28.224'
I am not sure if this should be bytes. Does self.pub_client.publish require bytes? If so, we should encode before passing data to that method.
Yes, self.pub_client.publish requires bytes. In this case I think specifying the input event as bytes is what is expected, because it is passed directly to the client. What do you think?
So the question is then, to the humans who read and edit this code, should INPUT_EVENT be textual data or encoded data? I think it's easier to consider it text up until it's time to feed it to pubsub, at which point we can encode it to bytes.
I'd keep INPUT_EVENT as is and change publishing to:
    event = self.INPUT_EVENT % self._test_timestamp
    self.pub_client.publish(event.encode('utf-8'))
I would expect users to follow a similar pattern in their pipelines, and they might refer to Beam examples for guidance, so I suggest changing this.
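A self-contained sketch of the same pattern, for illustration only. `FakePublisher` is a hypothetical stand-in that, like the client discussed above, accepts only bytes; its `publish(data)` signature and the timestamp value are assumptions made for this example, not the real client API.

```python
class FakePublisher(object):
    """Hypothetical stand-in for a publish client that only accepts bytes."""

    def __init__(self):
        self.published = []

    def publish(self, data):
        if not isinstance(data, bytes):
            raise TypeError('data must be bytes')
        self.published.append(data)


# Keep the template as text so it stays readable and easy to format.
INPUT_EVENT = 'user1,teamA,10,%d,2015-11-02 09:09:28.224'

test_timestamp = 1447697463000  # arbitrary value chosen for the sketch
pub_client = FakePublisher()

# Format as text first, then encode at the last moment before publishing.
event = INPUT_EVENT % test_timestamp
pub_client.publish(event.encode('utf-8'))
```

The template stays ordinary text for anyone reading or editing the test, and the encoding step is visible at the one place where bytes are required.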
      # Input events containing user, team, score, processing time, window start.
    - INPUT_EVENT = 'user1,teamA,10,%d,2015-11-02 09:09:28.224'
    + INPUT_EVENT = b'user1,teamA,10,%d,2015-11-02 09:09:28.224'
I am not sure if this should be bytes. Does self.pub_client.publish require bytes? If so, we should encode before passing data to that method.
To quote https://docs.python.org/3/howto/pyporting.html: "Decode binary data to text as soon as possible, encode text as binary data as late as possible."
Run Python PostCommit
Run Python PostCommit
c11d669 to c548b69
Run Python PostCommit
Run Python PostCommit
tvalentyn
left a comment
Thanks, @Juta!
Thank you @Juta and @tvalentyn
This is part of a series of PRs with the goal of making Apache Beam PY3 compatible. The proposal with the outlined approach has been documented here: https://s.apache.org/beam-python-3.
This PR adds more integration tests to the postcommit jobs on the direct and Dataflow runners.
Post-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.