[BEAM-2795] Use portable constructs in Flink batch translator #4343
kennknowles merged 2 commits into apache:master from
Conversation
I've left a protobuf round-trip in this change to ensure that all tests pass on rehydrated pipelines. I can remove this before submitting. I've left a few TODOs inline as questions to the reviewer. Finally, I'm not sure stylistically if it's appropriate to use these try-catch blocks everywhere. How is this normally handled in Beam? Should I bother with better error messages?
kennknowles
left a comment
Seems fine to me. I think @aljoscha should take a look.
(PCollection<?>) application.getInputs().get(new TupleTag<>(sideInputTag)),
    "no input with tag %s",
    sideInputTag);
// TODO: Should ParDoTranslation#viewFromProto live elsewhere?
Yea, seems like it could live in something like PCollectionViewTranslation
@Override
public Map<TupleTag<?>, PValue> getAdditionalInputs() {
  // TODO: This was ripped from ParDoTranslation. Is this correct?
Yes, this looks fine to me. Probably this could also be a helper that is shared between ParDo and Combine, since the canonical definition of how these side inputs should work is based on the composite definition of Combine as a GBK followed by a ParDo.
Where should such a helper live?
Any suggestions? Based on the size of this code, I don't think it's worth refactoring into a brand new class. I'll leave it here unless you have a better suggestion.
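For reference, the shared helper discussed here could be little more than a map filter: expose each side input under its tag as an "additional" input. A minimal plain-Java sketch, assuming stand-in `String` tags and values (in Beam the real types would be `TupleTag<?>` and `PValue`, and the `additionalInputsFor` name is hypothetical):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AdditionalInputs {
  // Hypothetical shared helper: collect the declared side inputs into the
  // map returned by getAdditionalInputs(), keyed by their tags.
  static Map<String, String> additionalInputsFor(
      List<String> sideInputTags, Map<String, String> availableInputs) {
    Map<String, String> additional = new LinkedHashMap<>();
    for (String tag : sideInputTags) {
      String value = availableInputs.get(tag);
      if (value == null) {
        throw new IllegalArgumentException("no input with tag " + tag);
      }
      additional.put(tag, value);
    }
    return additional;
  }

  public static void main(String[] args) {
    Map<String, String> inputs = new LinkedHashMap<>();
    inputs.put("side1", "pcollection-a");
    inputs.put("side2", "pcollection-b");
    // Only the declared side inputs are reported as additional inputs.
    System.out.println(additionalInputsFor(Arrays.asList("side1"), inputs));
  }
}
```

Because both ParDo and Combine would call the same logic, the only open question is which translation class hosts it.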
LOG.info(node.getTransform().getClass().toString());
throw new UnsupportedOperationException("The transform " + transform
String transformUrn = PTransformTranslation.urnForTransform(transform);
LOG.info(transformUrn);
Logging seems redundant with the exception?
Yes, that was probably my mistake, leaving that log output in.
Map<TupleTag<?>, PValue> outputs = context.getOutputs(transform);

TupleTag<?> mainOutputTag;
try {
Kind of cluttered having this construct everywhere. I wonder if there is a way to lift the exception catching.
Is Java 8 supported by all the submodules? The only clean way I can think of is to demote the exception to runtime using a lambda wrapper.
I've gone the lambda route to address this. It's fairly general and will be needed again in the streaming translator. However, I've left it inside of FlinkBatchTransformTranslators for now because it's unclear to me whether Java 8 is allowed globally.
Please advise on a better location/name.
Looks like I wrote that too soon. The Maven build appears to be configured to use -source 1.7, at least for this module. I'm not sure what else to do.
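For reference, the "lambda route" discussed in this thread can be sketched as a tiny wrapper that demotes checked exceptions to unchecked ones. This is a hypothetical `tryGet` helper, not Beam's actual code, and the lambda syntax at call sites is exactly what requires `-source 1.8`:

```java
import java.util.concurrent.Callable;

public class Unchecked {
  /** Runs a block that may throw a checked exception, rethrowing it unchecked. */
  static <T> T tryGet(Callable<T> block) {
    try {
      return block.call();
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    // Each call site shrinks from a four-line try/catch to one expression.
    Integer n = tryGet(() -> Integer.parseInt("42"));
    System.out.println(n); // prints 42
  }
}
```

Under `-source 1.7` the same wrapper still works, but call sites must pass anonymous `Callable` classes, which is no less cluttered than the original try/catch blocks.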
run flink validatesrunner
aljoscha
left a comment
I second @kennknowles's comments and made some of my own. Overall I think this looks good but it's currently failing the Jenkins hooks.
sideInputTag);
// TODO: Should ParDoTranslation#viewFromProto live elsewhere?
views.add(
    ParDoTranslation.viewFromProto(sideInput, sideInputTag, originalPCollection, combineProto, components));
checkstyle violation: line's too long
I was wondering why this wasn't caught at compile time. Forgot I had been disabling checkstyle for faster builds. ;)
}
}

// TODO: Why does the UnionCoder order have to match the output map order? Why does this
The map of tags is created here:
And it's used again here to create individual Flink DataSets for each of the output tag indices:
If the order, i.e. the index, changes in between then the mapping to outputs won't be correct anymore.
That first link is exactly what's happening above. Unfortunately, it's insufficient to get the union coder working properly. Leaving that as-is breaks when I do a protobuf round-trip. I have to create a union coder that lists individual coders in the same order they appear in the output map in order to get tests to pass. My question is: why is this only necessary when using a rehydrated pipeline?
I think it might be because rehydration messes up the order of the individual coders. I also just realised that the code that constructs the outputMap is making sure to put the main coder at index 0 while the code that constructs the lists of union coders doesn't do it. I think this just happened to work because the main-output coder always was at index 0.
To be more specific: I think rehydration changes the order so that the main input is no longer at index 0.
Ah, this is interesting. Is there any requirement that the rest of the outputs match the order of their respective coders?
UnionCoder requires that the order of the inputs / outputs for the tags to match because the union coder encodes values in a specific order and when reading them in needs to decode them in that same order.
OK, thanks for confirming this. I'll leave the fix as-is and remove the comment.
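The fix agreed on in this thread can be pictured in plain Java: build the union coder's component list by iterating the insertion-ordered output map, so that tag index i selects the same component on both the encode and decode side. The `String` coder names below are placeholder stand-ins; in Beam this would be a list of real `Coder` instances handed to `UnionCoder`:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class UnionCoderOrder {
  public static void main(String[] args) {
    // Stand-in for the translator's output map: tag -> coder, with the
    // main output deliberately placed at index 0 (a LinkedHashMap keeps
    // insertion order, which is what makes the indices stable).
    Map<String, String> outputMap = new LinkedHashMap<>();
    outputMap.put("main", "VarLongCoder");
    outputMap.put("side", "StringUtf8Coder");

    // Derive the union components from the output map itself rather than
    // from an independently ordered list, so index i always means the same
    // tag for both the encoder and the decoder.
    List<String> unionComponents = new ArrayList<>(outputMap.values());
    System.out.println(unionComponents);
  }
}
```

If the component list were built in any other order, a rehydrated pipeline could encode a value with coder i and decode it with a different coder at the same index, which is exactly the breakage described above.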
bsidhom
left a comment
Responding to comments.
The main thing to note here is that I've added a dependency on Java 8. Our current Flink dependency (1.4.0) doesn't require this, but Flink 1.5.0 will. I think it makes sense to start moving in this direction, but it's not required yet.
If we need to stick with Java 7, then I can bring the try/catch boilerplate back.
We do need to stick with Java 7 until the vote concludes. But this PR could perhaps wait for that result rather than reverting.
runners/flink/pom.xml
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
  <source>1.8</source>
For some reason, compilation fails on my machine even with Java 8 set for the runners/flink module. I'm reverting it for now and adding a TODO to refactor it once Java 8 is the default.
From what I can tell, Jenkins is failing for some reason unrelated to these changes. Let me know if this is not the case.
Rebasing to see if it helps with the Jenkins issue.
Force-pushed 8878f94 to a65e98f
OK, at least some of the old failures were fixed. The gradle build now fails due to what looks like a corrupt dependency file. The maven build fails due to unspecified "dependency problems".
Force-pushed a65e98f to e8e36bb
run Flink ValidatesRunner
Friendly ping.
The result of the Flink ValidatesRunner is gone since the last commit was pushed but it was https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Java_ValidatesRunner_Flink/4665/ so this is as green as it gets for now.
What's up with the pre-commit failures? Other than that I think this LGTM! (Sorry for the tardy responses, I was traveling and I will be on vacation until end of next week.)
Run Flink ValidatesRunner
There's a known issue with Dataflow right now causing the WordCountIT to fail - it is because of the monster classpath containing all of Flink, Spark, Apex, Gearpump, and Dataflow deps in the same Maven profile. Gradle runs the same tests in a better way so we know they pass. And the gradle build is failing only because of a known broken MqttIO test. I will sickbay all of these today; didn't get to them yesterday. So, yea, this is green as far as I am concerned.
Ben - can you reorganize the commits into a history of meaningful changes, squashing in any that were just incremental changes during development and review?
Force-pushed a1cd53f to 5d14026
I've rebased and cleaned this up. Should be ready for merging.
Force-pushed 5d14026 to ee95b95
It does still include a commit and a revert of that commit. Seems silly to add both.
I left that so it's easy enough to cherry-pick the change back in when we switch to Java 8. I'll just remove it altogether.
…ation

`CombineTranslation` uses a new side input extractor modeled after `ParDoTranslation#getSideInputs`. The `RawCombine` rehydrated transform exposes side inputs via `getAdditionalInputs`. Side inputs were not previously exposed as "additional" inputs, so portable translators could not properly extract the main output collection when side inputs were used.

`ParDoTranslation.viewFromProto` was used all over this package for general view translations. This method has been moved into a new `PCollectionViewTranslation` class.

This was tested by round-tripping batch pipelines to and from protobuf form. It works with both real Java pipelines and rehydrated pipelines.

References and downcasts to specific transform subclasses are replaced with generic `PTransform`s. Transform metadata is now accessed through the translation utilities under `org.apache.beam.runners.core.construction`.

The `ParDo` union coder is picky about ordering. It appears that coders must appear at the same indexes as their respective output collection tags. This ordering is now preserved.
Force-pushed ee95b95 to 5ddf50b
I've removed the commit/revert pair.
This broke checkstyle in Flink; filed BEAM-3478 and cut PR #4410.
This was tested by round-tripping batch pipelines to and from protobuf form. It works with both real Java pipelines and rehydrated pipelines.

References and downcasts to specific transform subclasses are replaced with generic `PTransform`s. Transform metadata is now accessed through the translation utilities under `org.apache.beam.runners.core.construction`. `CombineTranslation` uses a new side input extractor modeled after `ParDoTranslation#getSideInputs`. The `RawCombine` rehydrated transform exposes side inputs via `getAdditionalInputs`. Side inputs were not previously exposed as "additional" inputs, so `FlinkBatchTranslationContext#getInput` could not properly extract the main output collection when side inputs were used.

The `ParDo` union coder is picky about ordering. It appears that coders must appear at the same indexes as their respective output collection tags. This ordering is now preserved.