[BEAM-3079] add Samza runner #4340
Nice! I'll take a look. Have patience with my review - it is a pretty big PR :-)
Reviewed 11 of 61 files at r1. a discussion (no related file): runners/pom.xml, line 48 at r1 (raw file):
I want to note that we are currently dropping Java 7 support so that this can stay here. But it is worth seeing the runners/samza/.gitignore, line 1 at r1 (raw file):
Out of curiosity - where does this come from? runners/samza/src/main/java/org/apache/beam/runners/samza/SamzaPipelineOptions.java, line 46 at r1 (raw file):
To make sure I understand this - it means that every Samza source will be set to this same max parallelism?
Review status: 11 of 61 files reviewed at latest revision, 4 unresolved discussions, some commit checks failed. runners/pom.xml, line 48 at r1 (raw file): Previously, kennknowles (Kenn Knowles) wrote…
Thanks for letting me know. runners/samza/.gitignore, line 1 at r1 (raw file): Previously, kennknowles (Kenn Knowles) wrote…
This is the folder for the RocksDb local state files of Samza processors. By default it will be created under the current folder, which is runners/samza/ when running the tests. In real deployment we set a Samza env var for the location. runners/samza/src/main/java/org/apache/beam/runners/samza/SamzaPipelineOptions.java, line 46 at r1 (raw file): Previously, kennknowles (Kenn Knowles) wrote…
Right. This is similar to Flink's parallelism, and it's used by all the sources in Samza. The subsequent Samza tasks will be created based on the actual splits returned from the source. Do we allow users to set splits/parallelism at the individual source level? I am curious how other runners like Dataflow split the sources.
Review status: 11 of 61 files reviewed at latest revision, 4 unresolved discussions, some commit checks failed. runners/samza/src/main/java/org/apache/beam/runners/samza/SamzaPipelineOptions.java, line 46 at r1 (raw file): Previously, xinyuiscool wrote…
One more question: for Samza, it would be nice to know the partition count of an Unbounded/Bounded source if it is partitioned, e.g. Kafka. With this I can split a source by its partition count by default, and each partition will be a split. This is how Samza itself works today by default, so it's consistent with the current model. Right now I don't see such metadata being exposed in the general UnboundedSource/BoundedSource.
Reviewed 49 of 61 files at r1, 1 of 1 files at r2. a discussion (no related file): runners/samza/.gitignore, line 1 at r1 (raw file): Previously, xinyuiscool wrote…
Perhaps SamzaPipelineOptions is a good place to manage this and set it up to be a tmpdir in local tests? runners/samza/src/main/java/org/apache/beam/runners/samza/SamzaPipelineOptions.java, line 46 at r1 (raw file): Previously, xinyuiscool wrote…
@rangadi any comment on this? I think that the way it works is that the source itself will split into the default number. Dataflow used to do this during translation, which was not a great thing. Now splittable DoFn makes it necessary to do it dynamically.
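On the earlier suggestion to point the RocksDB local state directory at a tmpdir in local tests: a minimal sketch of that fallback, assuming a hypothetical helper `resolveStateDir` and a hypothetical configured-directory string. Neither is part of SamzaPipelineOptions or Samza itself; the block only illustrates "use the configured location if set, otherwise a fresh temp directory".

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of the Samza runner.
class StateDirSketch {
  // Returns the configured directory (created if missing) when one is given,
  // otherwise a fresh temp directory so tests do not write under runners/samza/.
  static Path resolveStateDir(String configuredDir) {
    try {
      if (configuredDir != null && !configuredDir.isEmpty()) {
        return Files.createDirectories(Paths.get(configuredDir));
      }
      return Files.createTempDirectory("samza-state-");
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // No directory configured: falls back to a temp directory.
    System.out.println(resolveStateDir(null));
  }
}
```

With a fallback like this, the .gitignore entry would only matter when someone explicitly configures a directory inside the source tree.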
runners/samza/pom.xml
<profiles>
<profile>
<id>local-validates-runner-tests</id>
<activation><activeByDefault>false</activeByDefault></activation>
While it is on a feature branch, let's set this to true
I've updated the |
Review status: all files reviewed at latest revision, 4 unresolved discussions, some commit checks failed. a discussion (no related file): Previously, kennknowles (Kenn Knowles) wrote…
Thanks for the heads up. Can I take a look early next week or do you need feedback sooner?
Review status: all files reviewed at latest revision, 4 unresolved discussions, some commit checks failed. runners/samza/src/main/java/org/apache/beam/runners/samza/SamzaPipelineOptions.java, line 46 at r1 (raw file): Previously, kennknowles (Kenn Knowles) wrote…
The UnboundedSource interface is somewhat similar to Hadoop's. The split() method passes in 'desiredNumSplits', which is a hint from the runner to the source. The source can try to obey it or return any number of splits that makes sense for the source. The runner decides how to map those splits into its internal processing parallelism. In that sense, does 'maxSourceParallelism' directly influence 'desiredNumSplits'? How it works in streaming applications in Dataflow: its 'desiredNumSplits' is based on the max number of cores the job is configured with (due to autoscaling, the actual number of cores might be fewer). If the source returns more splits than suggested, I think Dataflow tries to run all of the splits in parallel (need to check if there is a limit).
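To make the "hint" semantics above concrete, here is a rough, self-contained sketch. It does not use the Beam API: `SplitHintSketch` and its `split` method are hypothetical stand-ins that model a partitioned source deciding how many splits to return for a given desired number, assuming the source never returns more splits than it has partitions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a partitioned source; not the Beam UnboundedSource API.
class SplitHintSketch {
  // Treats desiredNumSplits as a hint: the result has at most that many splits,
  // and never more splits than partitions. Partitions are assigned round-robin.
  static List<List<Integer>> split(int partitionCount, int desiredNumSplits) {
    int numSplits = Math.max(1, Math.min(desiredNumSplits, partitionCount));
    List<List<Integer>> splits = new ArrayList<>();
    for (int i = 0; i < numSplits; i++) {
      splits.add(new ArrayList<>());
    }
    for (int p = 0; p < partitionCount; p++) {
      splits.get(p % numSplits).add(p);
    }
    return splits;
  }

  public static void main(String[] args) {
    // 8 partitions with a hint of 3 yields 3 splits.
    System.out.println(split(8, 3));
    // 2 partitions with a hint of 5 yields only 2 splits: the hint is not binding.
    System.out.println(split(2, 5));
  }
}
```

The runner then decides how to map whatever comes back onto its own units of parallelism (tasks, in Samza's case).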
The |
Rebased with master and samza is with all the other runners in pom.xml. Review status: all files reviewed at latest revision, 4 unresolved discussions, some commit checks failed. runners/samza/.gitignore, line 1 at r1 (raw file): Previously, kennknowles (Kenn Knowles) wrote…
I agree, let me put up some changes for it. runners/samza/pom.xml, line 40 at r2 (raw file): Previously, kennknowles (Kenn Knowles) wrote…
Sure, just committed the change. runners/samza/src/main/java/org/apache/beam/runners/samza/SamzaPipelineOptions.java, line 46 at r1 (raw file): Previously, rangadi (Raghu Angadi) wrote…
@rangadi you're right about maxSourceParallelism, which is used to set the "desiredNumSplits" so the splits are bounded by this number. In Samza, this will decide the number of tasks for a job. What I was saying (I think Kenneth meant the same) is that it would be super nice if BEAM provided an API for the source to split into a default number, like the number of partitions in Kafka. Then users don't need to bother finding out the partition count of each input Kafka topic when they run the job. For LinkedIn, this is very valuable since a Kafka topic might have different partition counts in different fabrics, so a default will be very helpful to the users. That's also the behavior in Samza today if the user doesn't provide a customized grouper of partitions: each partition becomes a task (or a split in this case).
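The default being asked for here could look roughly like the following. This is only an illustration under assumptions: `PartitionDefaultSketch`, `defaultSplits`, and the topic names are hypothetical, and the partition counts are passed in by hand, whereas a real implementation would have to obtain them from the source's metadata.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of "one split per partition"; not Beam or Samza API.
class PartitionDefaultSketch {
  // Produces one split name per partition of every input topic, so the user
  // never has to look up partition counts to pick a parallelism.
  static List<String> defaultSplits(Map<String, Integer> partitionsByTopic) {
    List<String> splits = new ArrayList<>();
    for (Map.Entry<String, Integer> entry : partitionsByTopic.entrySet()) {
      for (int p = 0; p < entry.getValue(); p++) {
        splits.add(entry.getKey() + "-" + p);
      }
    }
    return splits;
  }

  public static void main(String[] args) {
    // Made-up topics and partition counts, for illustration only.
    Map<String, Integer> topics = new LinkedHashMap<>();
    topics.put("pageviews", 2);
    topics.put("clicks", 3);
    System.out.println(defaultSplits(topics));
  }
}
```

Because the counts are read per topic, the same job definition would adapt when a topic has a different number of partitions in another fabric.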
Review status: 0 of 77 files reviewed at latest revision, 4 unresolved discussions, some commit checks failed. runners/samza/src/main/java/org/apache/beam/runners/samza/SamzaPipelineOptions.java, line 46 at r1 (raw file): Previously, xinyuiscool wrote…
Thanks. I am not sure if you would like me to address any specific question. Please let me know. If the purpose is to let the user set the number of tasks for a job, maybe the option could be named '(max)NumberOfTasks'. I am just thinking aloud with little overall context.
mvn install -pl runners/samza --also-make -DskipTests -Dcheckstyle.skip -Dfindbugs.skip -Drat.skip -Dmdeps.analyze.skip
mvn verify -P local-validates-runner-tests -pl runners/samza/
I think this is definitely ready to go in. Reviewed 3 of 64 files at r3, 11 of 11 files at r4, 2 of 2 files at r5, 61 of 61 files at r6. runners/samza/pom.xml, line 40 at r2 (raw file): Previously, xinyuiscool wrote…
Ah, I realized that since the Jenkins job has narrowed its scope, it will not pick this up. I will run it myself and then we will put it into the Gradle build later. runners/samza/src/main/java/org/apache/beam/runners/samza/SamzaPipelineOptions.java, line 46 at r1 (raw file): Previously, rangadi (Raghu Angadi) wrote…
If I understand everything correctly, things work in the best way: you don't need this pipeline option because For user-defined By the way this is not a blocker for getting this onto a branch. It will be better to review specific pull requests on focused changes.
Can you fix this:
@kennknowles : Added the licenses. Thanks!
OK, the Maven failure is legitimate due to findbugs. We can fix it on the feature branch, actually. It will be nicer to do them one at a time, and we should first integrate to Gradle.
This PR adds Samza runner to BEAM. The overall design is here. The Samza runner supports most of the BEAM transformations, side input/output, and unbounded/bounded sources. The features not in the scope of this PR are:
Integration tests are verified by running mvn install -P local-validates-runner-tests.
@kennknowles : Please help us take a look. Thanks!