[BEAM-3744] Python PubSub API Fixes and Tests#5952
[BEAM-3744] Python PubSub API Fixes and Tests#5952charlesccychen merged 5 commits intoapache:masterfrom
Conversation
- Add WriteToPubSub, which can write messages with attributes. - [BEAM-4536] Fix and re-enable with_attributes support for reads and write. - Add pubsub_integration_test: - Does reads and writes with and without attributes. - Uses timestamp attributes and label ids features. - Only runs using TestDataflowRunner for now (DirectRunner-based testing TBD). - Deprecate ReadStringsFromPubSub, WriteStringsToPubSub. - Update examples to use newer (non-deprecated) PubSub PTransforms. - Misc: Rename payload -> data, since payload is data+attributes, according to the official google-cloud-pubsub client.
| def expand(self, pcoll): | ||
| if self.with_attributes: | ||
| pcoll = pcoll | 'ToProtobuf' >> Map(self.to_proto) | ||
| pcoll.element_type = six.binary_type |
There was a problem hiding this comment.
Is this valid? Shouldn't the output type be some protobuf type?
There was a problem hiding this comment.
After some investigation I believe that this is correct.
The output of this mapping is a string (uses the proto object's SerializeToString()).
On Dataflow, this string is passed to a runner harness written in Java, so it must be a serialized protobuf to be understood.
|
It also looks like there is a test failure: https://builds.apache.org/job/beam_PreCommit_Python_Commit/421/testReport/junit/apache_beam.io.gcp.pubsub_test/TestPubsubMessage/test_proto_conversion/ |
charlesccychen
left a comment
There was a problem hiding this comment.
Thanks, LGTM after minor nits.
| if transform.source.timestamp_attribute is not None: | ||
| step.add_property(PropertyNames.PUBSUB_TIMESTAMP_ATTRIBUTE, | ||
| transform.source.timestamp_attribute) | ||
| logging.info('pubsub source') |
| if transform.source.with_attributes: | ||
| # Setting this property signals Dataflow runner to return full | ||
| # PubsubMessages instead of just the payload. | ||
| # PubsubMessages instead of just data. |
There was a problem hiding this comment.
"the payload data" for clarity? This is the official doc for PubsubMessage: https://cloud.google.com/pubsub/docs/reference/rest/v1/PubsubMessage
| if transform.source.timestamp_attribute is not None: | ||
| step.add_property(PropertyNames.PUBSUB_TIMESTAMP_ATTRIBUTE, | ||
| transform.source.timestamp_attribute) | ||
| logging.info('pubsub source') |
| if transform.source.with_attributes: | ||
| # Setting this property signals Dataflow runner to return full | ||
| # PubsubMessages instead of just the payload. | ||
| # PubsubMessages instead of just data. |
|
Thank you! This LGTM. |
|
Thank you! I appreciate this. |
Fixes attributes and adds an integration test for Dataflow.
Follow this checklist to help us incorporate your contribution quickly and easily:
[BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replaceBEAM-XXXwith the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.It will help us expedite review of your Pull Request if you tag someone (e.g.
@username) to look at it.Post-Commit Tests Status (on master branch)