[BEAM-10962] Add Multiple PubSub reader to Python SDK #12930
R: @chamikaramj Adding you as reviewer since you replied in the email thread. Thanks a lot! Let me know if there's something I should change.
Codecov Report
@@ Coverage Diff @@
## master #12930 +/- ##
==========================================
+ Coverage 82.48% 82.51% +0.02%
==========================================
Files 455 455
Lines 54876 54924 +48
==========================================
+ Hits 45266 45322 +56
+ Misses 9610 9602 -8
R: @boyuanzz Passing to Boyuan, who is working on updating the PubSub source/sink for Dataflow.
I was discussing this PR with Pablo and maybe it would be better to add it within the Modifying ReadFromPubSub
New PTransform
It should not be hard to move this PR from a different Thanks! |
The usual pattern for sources is: (1) A transform that reads from a given source config Given that PubSub is a native transform for Dataflow, though, we cannot really implement (2). I'm not really sure a composite that just wraps a Flatten adds much value, since pipeline authors can easily do that themselves (we could do the same for every other transform, but that would just clutter the API, in my opinion). (2) above would be more useful and would enable new use cases, but we cannot really do that for Dataflow.
I agree with you that adding a similar concept for all IOs would clutter the API. The main idea here is that this use case is widely shared and a lot of users are implementing it themselves. This PTransform would shorten their code considerably and help organize the sources better (rather than a wide pipeline graph, the sources are grouped by topic/subscription and by project). Anyhow, I do understand your concern; let me know if I should proceed (by fixing the errors) or close the PR. Thanks again!
Besides what Cham has mentioned, another thing is the current implementation of
Run Python PreCommit
retest this please
@boyuanzz, I don't know how to retry the 3 failing tests. They passed on the previous commit, but they failed on the new one, which only adds a line to disable Pylint in a block. Any idea? Thanks!
Run Python PreCommit
The Python unit tests are not triggered by
step_name_base = 'PubSub %s/project:%s' % (source_type, source_project)
read_step_name = '%s/Read %s' % (step_name_base, source_name)

if source_type == 'topics':
PubSubSource has checking logic similar to this. We should be able to move this check.
Run Python_PVR_Flink PreCommit
Please address the comments and resolve the merge conflicts. The PVR Flink failure should not be related.
  timestamp_attribute: str = None


class MultipleReadFromPubSub(PTransform):
Please add a code snippet showing how to use this transform, and highlight the benefit of using it compared to ReadFromPubSub.
def expand(self, pcol):
  sources_pcol = []
  for source in self.source_list:
    source_split = source.source.split('/')
Instead of using split, can we use re.match(TOPIC_REGEXP, source.source) and re.match(SUBSCRIPTION_REGEXP, source.source) as well?
I used a new regex (PUBSUB_DESCRIPTOR_REGEXP) that is valid for both, so I could use match.group to check whether the source is a topic or a subscription. Let me know what you think.
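The single-regex approach described in this reply can be sketched as follows. This is only an illustration: the pattern below is an assumed stand-in, since the PR's actual PUBSUB_DESCRIPTOR_REGEXP is not shown in this thread, and parse_descriptor and read_kwargs are hypothetical helper names.

```python
import re

# Assumed pattern: a stand-in for the PR's PUBSUB_DESCRIPTOR_REGEXP, which is
# not shown in this thread. It accepts both topic and subscription paths.
DESCRIPTOR_PATTERN = re.compile(
    r'projects/([^/]+)/(topics|subscriptions)/(.+)')


def parse_descriptor(source):
    """Split a Pub/Sub path into (project, source_type, name)."""
    match = DESCRIPTOR_PATTERN.fullmatch(source)
    if match is None:
        raise ValueError('Not a topic or subscription path: %r' % source)
    # group(2) is 'topics' or 'subscriptions', so one match covers both cases.
    return match.group(1), match.group(2), match.group(3)


def read_kwargs(source):
    """Pick the keyword a per-source read would need (hypothetical helper)."""
    _, source_type, _ = parse_descriptor(source)
    key = 'topic' if source_type == 'topics' else 'subscription'
    return {key: source}
```

Compared with splitting on '/', a single match both validates the path and exposes the source type through a capture group, so malformed inputs fail early instead of producing an unexpected list of segments.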
Run Python_PVR_Flink PreCommit
Please address the conflicting files. @InigoSJ |
Thanks! This PR almost looks good to me, except for one additional comment and the merge conflicts.
@@ -20,6 +20,27 @@
Cloud Pub/Sub sources and sinks are currently supported only in streaming
pipelines, during remote execution.

Multiple Read from Pub/Sub
Please move this into the MultipleReadFromPubSub class pydoc.
Run Python_PVR_Flink PreCommit
Thanks for your contribution! Before merging, please squash all commits into one commit. You can do so by:
git rebase -i HEAD~14
git push -f
As discussed on the dev mailing list, a very common use case in Dataflow / Beam is reading from multiple PubSub topics/subscriptions and flattening the results into a single PCollection.
This PR adds a PTransform to do so.
It takes two parameters:
source_list: List of topics/subscriptions.
with_context: Option to return a key-value pair of the form (source, actual message).
Other options from ReadFromPubSub can also be included as kwargs, except topic or subscription.
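The semantics in the PR description can be modelled in plain Python as a rough sketch. This is not Beam code and not the PR's implementation: in-memory lists stand in for PCollections, and multiple_read and messages_by_source are hypothetical names introduced only for illustration.

```python
from itertools import chain


def multiple_read(messages_by_source, source_list, with_context=False,
                  **kwargs):
    """Model the described behaviour with in-memory lists, not Beam."""
    # Per the description, topic/subscription may not be passed as kwargs,
    # because each entry in source_list supplies its own source.
    forbidden = {'topic', 'subscription'} & set(kwargs)
    if forbidden:
        raise ValueError('Unsupported options: %s' % sorted(forbidden))

    per_source = []
    for source in source_list:
        messages = messages_by_source.get(source, [])
        if with_context:
            # Tag each message with the source it came from.
            per_source.append([(source, m) for m in messages])
        else:
            per_source.append(list(messages))
    # Stand-in for Flatten: merge the per-source collections into one.
    return list(chain.from_iterable(per_source))
```

In a real streaming pipeline the flattened output has no guaranteed ordering across sources; the list order here is only an artifact of the in-memory model.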