[BEAM-10962] Add Multiple PubSub reader to Python SDK #12930

InigoSJ · 2020-09-24T15:18:26Z

As discussed on the dev mailing list, a very common use case in Dataflow / Beam is reading from multiple PubSub topics/subscriptions and flattening them into a single PCollection.

This PR adds a PTransform to do so.

It takes two parameters:

source_list: List of topics/subscriptions.
with_context: Option to return each message as a key-value pair of the form (source, message).

Other ReadFromPubSub options can also be passed as kwargs, except topic and subscription.
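
To make the two parameters concrete, here is a plain-Python model of the output shape only. It is not Beam code and not this PR's implementation; `flatten_sources` and the sample topic paths are hypothetical names used purely for illustration.

```python
from typing import Dict, Iterable, List, Tuple, Union

Message = bytes
Tagged = Tuple[str, Message]


def flatten_sources(
    messages_by_source: Dict[str, Iterable[Message]],
    with_context: bool = False,
) -> List[Union[Message, Tagged]]:
    """Merge per-source messages into one list.

    With with_context=True each element becomes (source, message),
    mirroring the key-value form described above.
    """
    merged: List[Union[Message, Tagged]] = []
    for source, messages in messages_by_source.items():
        for message in messages:
            merged.append((source, message) if with_context else message)
    return merged


# Hypothetical sample data: two topics in the same project.
sample = {
    "projects/my-project/topics/a": [b"a1", b"a2"],
    "projects/my-project/topics/b": [b"b1"],
}
plain = flatten_sources(sample)
tagged = flatten_sources(sample, with_context=True)
```

In a real streaming pipeline the elements arrive interleaved rather than grouped by source; the model only shows what each element looks like with and without the context key.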

InigoSJ · 2020-09-24T15:24:50Z

R: @chamikaramj

Adding you as reviewer since you replied in the email thread. Thanks a lot!

Let me know if there's something I should change.

codecov · 2020-09-24T15:37:42Z

Codecov Report

Merging #12930 (34b92bf) into master (3d6cc0e) will increase coverage by 0.02%.
The diff coverage is 87.50%.

@@            Coverage Diff             @@
##           master   #12930      +/-   ##
==========================================
+ Coverage   82.48%   82.51%   +0.02%     
==========================================
  Files         455      455              
  Lines       54876    54924      +48     
==========================================
+ Hits        45266    45322      +56     
+ Misses       9610     9602       -8

Impacted Files	Coverage Δ
sdks/python/apache_beam/io/gcp/pubsub.py	`91.24% <87.50%> (-1.07%)`	⬇️
...hon/apache_beam/runners/worker/bundle_processor.py	`94.47% <0.00%> (+0.13%)`	⬆️
sdks/python/apache_beam/runners/common.py	`89.20% <0.00%> (+0.44%)`	⬆️
...ks/python/apache_beam/runners/worker/sdk_worker.py	`90.11% <0.00%> (+0.47%)`	⬆️
.../python/apache_beam/transforms/periodicsequence.py	`98.24% <0.00%> (+1.75%)`	⬆️
...ks/python/apache_beam/runners/worker/data_plane.py	`91.31% <0.00%> (+1.79%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

chamikaramj · 2020-09-28T15:05:41Z

R: @boyuanzz

Passing to Boyuan who is working on updating PubSub source/sink for Dataflow.

InigoSJ · 2020-09-28T15:15:22Z

I was discussing this PR with Pablo, and it may be better to add this functionality to ReadFromPubSub itself rather than as a separate PTransform. I see advantages in both approaches:

Modifying ReadFromPubSub

Just one PTransform that does both things
Less documentation needed

New PTransform

Easier to maintain: since MultipleReadFromPubSub would only expand into ReadFromPubSub, any modification to ReadFromPubSub would carry over directly.
Easier on different runners: if I'm not mistaken, Dataflow performs some overrides on ReadFromPubSub, so MultipleReadFromPubSub would not be affected by them (since it expands into it). Considering Dataflow is probably the main runner for this operation, we should take this into account.
Less overhead: having ReadFromPubSub accept both a single topic/subscription and a list of them may be a bit too much.

It should not be hard to move this PR from a separate PTransform into ReadFromPubSub, so please let me know what you think.

Thanks!

chamikaramj · 2020-09-28T15:32:57Z

The usual pattern for sources is:

(1) A transform that reads from a given source config
(2) A "ReadAll" transform that reads a PCollection of configs.

Given that PubSub is a native transform for Dataflow though, we cannot really implement (2).

I'm not really sure a composite that just wraps a Flatten adds much value since pipeline authors can easily do that themselves (we can do that for every other transform as well but that will just clutter the API in my opinion).

(2) above will be more useful and will enable new use-cases. But we cannot really do that for Dataflow.
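
For reference, the do-it-yourself composition mentioned above (one read per source followed by a Flatten) could look roughly like the sketch below. `read_and_flatten` and its step labels are hypothetical names; the Beam imports are deferred into the function body so the definition loads even where apache_beam is not installed.

```python
from typing import Sequence


def read_and_flatten(pipeline, topics: Sequence[str]):
    """Apply one ReadFromPubSub per topic and flatten the results.

    `pipeline` is expected to be an apache_beam.Pipeline running in
    streaming mode; `topics` are full 'projects/<p>/topics/<t>' paths
    and are assumed to be non-empty.
    """
    # Deferred imports: apache_beam[gcp] is only needed when this runs.
    import apache_beam as beam
    from apache_beam.io.gcp.pubsub import ReadFromPubSub

    # One uniquely labeled read step per topic.
    reads = tuple(
        pipeline | f"Read {topic}" >> ReadFromPubSub(topic=topic)
        for topic in topics
    )
    # Merge all per-topic PCollections into a single PCollection.
    return reads | "Flatten sources" >> beam.Flatten()
```

This is the pattern pipeline authors can already write by hand; the discussion in this thread is about whether wrapping it in a dedicated transform is worth the extra API surface.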

InigoSJ · 2020-09-29T12:02:34Z

> The usual pattern for sources is:
>
> (1) A transform that reads from a given source config
> (2) A "ReadAll" transform that reads a PCollection of configs.
>
> Given that PubSub is a native transform for Dataflow though, we cannot really implement (2).
>
> I'm not really sure a composite that just wraps a Flatten adds much value since pipeline authors can easily do that themselves (we can do that for every other transform as well but that will just clutter the API in my opinion).
>
> (2) above will be more useful and will enable new use-cases. But we cannot really do that for Dataflow.

I agree with you that adding a similar concept for all IOs would clutter the API. The main idea here is that this use case is widely shared and a lot of users are implementing it themselves. This PTransform would speed up their code a lot and would help organize the sources better (rather than a wide pipeline graph, the sources are grouped by topic/subscription and by project).

Anyhow, I do understand your concern. Let me know whether I should proceed (by fixing the errors) or close the PR.

Thanks again!

boyuanzz · 2020-10-01T00:03:03Z

Besides what Cham has mentioned, another issue is that the current implementation of MultipleReadFromPubSub can only configure multiple ReadFromPubSub transforms with the same attributes, such as the same with_attributes, timestamp_label, and id_label, which is not ideal. Given that ReadFromPubSub is a native transform for Dataflow, having MultipleReadFromPubSub seems like the only solution for now. I'm thinking we could create a PubSubSourceDescriptor that includes the topic, subscription, and other attributes, and expose an add API on MultipleReadFromPubSub to allow end users to add a new read.
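
One possible shape for such a descriptor, sketched as a NamedTuple. The class name `SourceDescriptor` and its fields are illustrative assumptions modeled on ReadFromPubSub's keyword arguments, with a single `source` path standing in for the topic/subscription pair; this is not necessarily the API that was eventually merged.

```python
from typing import List, NamedTuple, Optional


class SourceDescriptor(NamedTuple):
    """Hypothetical per-source read configuration."""
    source: str  # full topic or subscription path
    id_label: Optional[str] = None
    with_attributes: bool = False
    timestamp_attribute: Optional[str] = None


# Each source carries its own settings instead of sharing one set of kwargs.
sources: List[SourceDescriptor] = [
    SourceDescriptor("projects/my-project/topics/a"),
    SourceDescriptor(
        "projects/my-project/subscriptions/b",
        with_attributes=True,
        timestamp_attribute="ts",
    ),
]
```

Keeping the options on the descriptor lets each read be configured independently, which addresses the shared-attribute limitation raised in this comment.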

InigoSJ · 2020-10-09T08:33:04Z

Run Python PreCommit

InigoSJ · 2020-10-09T12:09:47Z

retest this please

InigoSJ · 2020-10-09T12:43:15Z

@boyuanzz, I don't know how to retry the 3 failing tests. They were successful in the previous commit, but with the new one (which only adds a line to disable Pylint in a block) they failed. Any idea?

Thanks!

InigoSJ · 2020-10-11T20:15:32Z

Run Python PreCommit

InigoSJ · 2020-10-12T10:00:24Z

Run Python PreCommit

boyuanzz · 2020-10-12T17:03:18Z

The Python unit tests are not triggered by Run Python PreCommit. You can click Details, where there is a re-run all tasks button to rerun these tests.

boyuanzz · 2020-10-19T18:53:27Z

sdks/python/apache_beam/io/gcp/pubsub.py

+      step_name_base = 'PubSub %s/project:%s' % (source_type, source_project)
+      read_step_name = '%s/Read %s' % (step_name_base, source_name)
+
+      if source_type == 'topics':


PubSubSource has checking logic similar to this. We should be able to move this check.

InigoSJ · 2020-11-04T08:08:46Z

Run Python_PVR_Flink PreCommit

InigoSJ · 2020-11-09T11:18:46Z

Run Python_PVR_Flink PreCommit

boyuanzz

Please address the comments and resolve the merge conflicts. The PVR Flink failure should not be related.

boyuanzz · 2020-11-13T00:07:09Z

sdks/python/apache_beam/io/gcp/pubsub.py

+  timestamp_attribute: str = None
+
+
+class MultipleReadFromPubSub(PTransform):


Please add a code snippet showing how to use this transform, and highlight the benefit of using it compared to ReadFromPubSub.

boyuanzz · 2020-11-13T00:08:45Z

sdks/python/apache_beam/io/gcp/pubsub.py

+  def expand(self, pcol):
+    sources_pcol = []
+    for source in self.source_list:
+      source_split = source.source.split('/')


Instead of using split, can we use re.match(TOPIC_REGEXP, source.source) and re.match(SUBSCRIPTION_REGEXP, source.source) as well?

InigoSJ · 2020-11-18

I used a new regex (PUBSUB_DESCRIPTOR_REGEXP) that is valid for both, so I can use match.group to check whether the source is a topic or a subscription. Let me know what you think.
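
A minimal sketch of the single-regex approach described here, using only the standard library. The pattern and the helper `parse_descriptor` are assumptions written for illustration; the PR's actual PUBSUB_DESCRIPTOR_REGEXP may differ.

```python
import re
from typing import Tuple

# Hypothetical pattern; the PR's actual PUBSUB_DESCRIPTOR_REGEXP may differ.
DESCRIPTOR_PATTERN = re.compile(
    r"^projects/([^/]+)/(topics|subscriptions)/(.+)$"
)


def parse_descriptor(path: str) -> Tuple[str, str, str]:
    """Return (project, kind, name) for a topic or subscription path."""
    match = DESCRIPTOR_PATTERN.match(path)
    if match is None:
        raise ValueError(f"Not a Pub/Sub topic or subscription path: {path!r}")
    # Group 2 tells the caller whether this is a topic or a subscription.
    project, kind, name = match.groups()
    return project, kind, name
```

Compared with splitting on '/', one pattern validates the path and exposes the source type in a single step, so malformed paths are rejected up front instead of surfacing later as index errors.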

boyuanzz · 2020-11-13T00:14:09Z

Run Python_PVR_Flink PreCommit

boyuanzz · 2020-11-19T00:16:15Z

Please address the conflicting files. @InigoSJ

boyuanzz

Thanks! This PR almost looks good to me, except for one additional comment and the merge conflicts.

boyuanzz · 2020-11-19T00:20:16Z

sdks/python/apache_beam/io/gcp/pubsub.py

@@ -20,6 +20,27 @@
 Cloud Pub/Sub sources and sinks are currently supported only in streaming
 pipelines, during remote execution.

+Multiple Read from Pub/Sub


Please move this into the MultipleReadFromPubSub class pydoc.

boyuanzz · 2020-11-20T20:07:42Z

Run Python_PVR_Flink PreCommit

boyuanzz

Thanks for your contribution! Before merging, please squash all commits into one commit. You can do so by:

git rebase -i HEAD~14
git push -f

[BEAM-10962] Add Multiple PubSub reader to Python SDK

72b8894

probot-autolabeler bot added gcp io python labels Sep 24, 2020

Merge branch 'master' into multi-pubsub-reader

296f3a4

InigoSJ added 2 commits October 8, 2020 18:48

[BEAM-10962] Add Multiple PubSub reader to Python SDK

50aaf77

Merge remote-tracking branch 'origin/multi-pubsub-reader' into multi-pubsub-reader

d0263c7

aaltay requested a review from boyuanzz October 8, 2020 18:41

[BEAM-10962] Add Multiple PubSub reader to Python SDK

34b92bf

boyuanzz requested changes Oct 19, 2020

InigoSJ added 2 commits November 3, 2020 16:45

[BEAM-10962] Add Multiple PubSub reader to Python SDK

e046a63

[BEAM-10962] Add Multiple PubSub reader to Python SDK

660cdf5

boyuanzz reviewed Nov 13, 2020

[BEAM-10962] Add Multiple PubSub reader to Python SDK

dfc57ee

InigoSJ added 2 commits November 18, 2020 10:46

[BEAM-10962] Add Multiple PubSub reader to Python SDK

0a0dc6b

[BEAM-10962] Add Multiple PubSub reader to Python SDK

c33e97f

boyuanzz reviewed Nov 19, 2020

InigoSJ and others added 4 commits November 19, 2020 12:05

[BEAM-10962] Add Multiple PubSub reader to Python SDK

2508a4f

[BEAM-10962] Add Multiple PubSub reader to Python SDK

2e7bc4a

[BEAM-10962] Add Multiple PubSub reader to Python SDK

a279c7d

Merge branch 'master' into multi-pubsub-reader

17af76c

boyuanzz approved these changes Nov 20, 2020

boyuanzz merged commit 682f2ea into apache:master Nov 25, 2020
