[BEAM-2021] Make Most StandardCoders CustomCoders #2668
tgroh wants to merge 8 commits into apache:master
Standard Coders have a defined serialization format and are understood within the Runner API; Custom Coders are not. Move existing "StandardCoders" to extend CustomCoder, and remove custom cloud-object-related serialization logic where possible. Still remaining: splitting the CustomCoder side of the class hierarchy from the StandardCoder side, moving IterableLikeCoder to be a CustomCoder, and having IterableCoder forward to an internal implementation (to ensure it remains a StandardCoder).
kennknowles left a comment:
What does this improve? Can you clarify what the changes here mean? Are they just temporary as things are put in place?
As I understand it, all CustomCoder subclasses will share a single coder URN (something like "urn:beam:coder:javasdk") and the payload will contain a Java serialized blob, hence the data will only really be readable by the Java SDK harness. In some sense, this is what "custom coder" means at the level of the portable Beam model.
The other possibility is that a coder has a meaningful URN (like "urn:beam:coder:map") and payload/components determined by that.
So the latter is more meaningful, portable, and compact, just less convenient for one-offs. So I want to understand the rationale for moving things basically from the better case to the worse case. I'm ready to believe it is mandatory for some reason...
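The two cases described above can be sketched roughly as follows. This is a minimal illustration rather than Beam's actual Runner API: `CoderSpec`, `customCoder`, and `mapCoder` are hypothetical names, and the URN strings are the "something like" examples from the comment, not confirmed identifiers.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class CoderSpecSketch {
  /** Hypothetical portable description of a coder: URN, opaque payload, component coders. */
  static final class CoderSpec {
    final String urn;
    final byte[] payload;
    final List<CoderSpec> components;

    CoderSpec(String urn, byte[] payload, List<CoderSpec> components) {
      this.urn = urn;
      this.payload = payload;
      this.components = components;
    }
  }

  // Example URNs quoted from the discussion; assumptions, not real registered URNs.
  static final String JAVA_SDK_URN = "urn:beam:coder:javasdk";
  static final String MAP_URN = "urn:beam:coder:map";

  /** "Custom" case: one shared URN, payload is a Java-serialized blob only Java can read. */
  static CoderSpec customCoder(Serializable coder) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(coder);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return new CoderSpec(JAVA_SDK_URN, bytes.toByteArray(), Collections.<CoderSpec>emptyList());
  }

  /** "Standard" case: a meaningful URN, no blob, and the structure exposed as components. */
  static CoderSpec mapCoder(CoderSpec keyCoder, CoderSpec valueCoder) {
    return new CoderSpec(MAP_URN, new byte[0], Arrays.asList(keyCoder, valueCoder));
  }

  public static void main(String[] args) {
    // Strings stand in for serializable coder instances.
    CoderSpec opaque = customCoder("stand-in for a whole map coder");
    CoderSpec structured = mapCoder(customCoder("key coder"), customCoder("value coder"));
    System.out.println(opaque.urn + " payloadBytes=" + opaque.payload.length
        + " components=" + opaque.components.size());
    System.out.println(structured.urn + " payloadBytes=" + structured.payload.length
        + " components=" + structured.components.size());
  }
}
```

In the first form, anything that is not the Java SDK sees only an opaque blob. In the second, the key and value coders stay visible as components even if they are themselves opaque.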
   * @param <T> the type of the values being transcoded
   */
- public class LengthPrefixCoder<T> extends StandardCoder<T> {
+ public class LengthPrefixCoder<T> extends CustomCoder<T> {
Doesn't this one, in particular, need to have a well-defined URN?
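For context, a length-prefixed encoding can be read by a consumer that does not understand the element itself, which is one plausible reason a well-defined format matters for this coder in particular. The sketch below assumes a base-128 varint length prefix; that choice and all names in it (`LengthPrefixSketch`, `writeVarInt`, `readVarInt`) are illustrative assumptions, not taken from this PR.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class LengthPrefixSketch {
  /** Writes a non-negative int as a base-128 varint, least significant group first. */
  static void writeVarInt(int value, ByteArrayOutputStream out) {
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80); // high bit set: more bytes follow
      value >>>= 7;
    }
    out.write(value);
  }

  /** Reads a varint written by writeVarInt. */
  static int readVarInt(ByteArrayInputStream in) {
    int result = 0;
    int shift = 0;
    while (true) {
      int b = in.read();
      if (b < 0) {
        throw new IllegalStateException("truncated varint");
      }
      result |= (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        return result;
      }
      shift += 7;
      if (shift > 28) {
        throw new IllegalStateException("varint too long for an int");
      }
    }
  }

  /** Prefixes an already-encoded element with its byte length. */
  static byte[] encode(byte[] element) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeVarInt(element.length, out);
    out.write(element, 0, element.length);
    return out.toByteArray();
  }

  /** Reads one length-prefixed element without interpreting its contents. */
  static byte[] decode(ByteArrayInputStream in) {
    int length = readVarInt(in);
    byte[] element = new byte[length];
    int read = length == 0 ? 0 : in.read(element, 0, length);
    if (read != length) {
      throw new IllegalStateException("truncated element");
    }
    return element;
  }

  public static void main(String[] args) {
    ByteArrayOutputStream stream = new ByteArrayOutputStream();
    for (String s : new String[] {"ab", "", "cde"}) {
      byte[] encoded = encode(s.getBytes(StandardCharsets.UTF_8));
      stream.write(encoded, 0, encoded.length);
    }
    // The reader recovers element boundaries from the prefixes alone.
    ByteArrayInputStream in = new ByteArrayInputStream(stream.toByteArray());
    for (int i = 0; i < 3; i++) {
      System.out.println("[" + new String(decode(in), StandardCharsets.UTF_8) + "]");
    }
  }
}
```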
  }

  @Override
  public Collection<String> getAllowedEncodings() {
Incidentally, getEncodingId should be removed. Possibly replaced by a URN, but that can live in the CoderEncoder. And getAllowedEncodings fails to address the "version N+2" problem effectively, so I think that design is DOA.
   * @param <V> the type of the values of the KVs being transcoded
   */
- public class MapCoder<K, V> extends StandardCoder<Map<K, V>> {
+ public class MapCoder<K, V> extends CustomCoder<Map<K, V>> {
I don't see the value in making this a CustomCoder. Why?
   * @param <T> the type of the values being transcoded
   */
- public class NullableCoder<T> extends StandardCoder<T> {
+ public class NullableCoder<T> extends CustomCoder<T> {
Ditto, here. Even though null is a Java-specific interpretation, it is universal to have a coder that is "either a thing or nothing".
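The "either a thing or nothing" idea can be expressed as a language-neutral byte format, for example a one-byte presence marker followed by the value. The sketch below is an assumption for illustration only: the marker values, the UTF-8 string payload, and the class name `NullableEncodingSketch` are hypothetical and not taken from this PR.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class NullableEncodingSketch {
  // Marker bytes are illustrative assumptions, not a confirmed wire format.
  static final int ABSENT = 0;
  static final int PRESENT = 1;

  /** Encodes null as a single ABSENT byte; otherwise PRESENT followed by the UTF-8 value. */
  static byte[] encode(String value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    if (value == null) {
      out.write(ABSENT);
      return out.toByteArray();
    }
    out.write(PRESENT);
    byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
    out.write(utf8, 0, utf8.length);
    return out.toByteArray();
  }

  /** Decodes a value produced by encode, returning null for the ABSENT marker. */
  static String decode(byte[] encoded) {
    if (encoded.length == 0) {
      throw new IllegalArgumentException("missing presence marker");
    }
    if (encoded[0] == ABSENT) {
      return null;
    }
    if (encoded[0] != PRESENT) {
      throw new IllegalArgumentException("unknown presence marker: " + encoded[0]);
    }
    return new String(encoded, 1, encoded.length - 1, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    System.out.println(decode(encode("beam")));
    System.out.println(decode(encode(null)));
  }
}
```

Nothing in this format is Java-specific: another SDK could map the absent case to its own notion of "no value".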
The reasoning to swap everything to be a custom coder is to remove the need to create URNs and specifications for all the non-Fn API coders. Swapping to CustomCoder is one way of freeing us up to later define a specification equivalent to what each coder encodes and provide a URN.
OK, that sounds fine as a temporary measure. Obviously many of these coders already exist as a well-defined language-independent format. But it is fine to not commit to it right now.
(We can also always just bump the URN version)
There is no restriction that a class that extends My understanding of the "meaning" of
   * @deprecated For {@code AvroCoder} internal use only.
   */
  // TODO: once we can remove this deprecated function, inline in constructor.
  @Deprecated
Can we make this method package-private and mark it with @VisibleForTesting now, since these were marked @Deprecated for that reason?
https://issues.apache.org/jira/browse/BEAM-2077 for followup. Re-deprecated in the meantime.
   * @deprecated For {@code AvroCoder} internal use only.
   */
  // TODO: once we can remove this deprecated function, inline in constructor.
  @Deprecated
Can we make this method package-private and mark it with @VisibleForTesting now, since these were marked @Deprecated for that reason?
https://issues.apache.org/jira/browse/BEAM-2077 for followup. Re-deprecated in the meantime.
  }

  @Override
  public void verifyDeterministic() throws NonDeterministicException {
I think this method should still exist.
I don't believe this Coder should exist; rm'd
LGTM
LGTM again, rebase/merge pain
Another problem is that compatibility across, e.g., Flink savepoints or Dataflow's pipeline update depends on the coder having the same representation and transparent components. This change may make coders less stable in that respect.
Changes Unknown when pulling 880d800 on tgroh:fewer_standard_coders into apache:master.