Skip to content

[BEAM-2021] Make Most StandardCoders CustomCoders#2668

Closed
tgroh wants to merge 8 commits intoapache:masterfrom
tgroh:fewer_standard_coders
Closed

[BEAM-2021] Make Most StandardCoders CustomCoders#2668
tgroh wants to merge 8 commits intoapache:masterfrom
tgroh:fewer_standard_coders

Conversation

@tgroh
Copy link
Member

@tgroh tgroh commented Apr 24, 2017

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

  • Make sure the PR title is formatted like:
    [BEAM-<Jira issue #>] Description of pull request
  • Make sure tests pass via mvn clean verify. (Even better, enable
    Travis-CI on your fork and ensure the whole test matrix passes).
  • Replace <Jira issue #> in the title with the actual Jira issue
    number, if there is one.
  • If this contribution is large, please file an Apache
    Individual Contributor License Agreement.

Standard Coders have a defined serialization format and are understood
within the Runner API, Custom Coders are not. Move existing
"StandardCoders" to extend CustomCoder, and remove custom cloud object
related serialization logic where possible.

@tgroh tgroh force-pushed the fewer_standard_coders branch from 6fc50b2 to ca05834 Compare April 25, 2017 01:08
Standard Coders have a defined serialization format and are understood
within the Runner API, Custom Coders are not. Move existing
"StandardCoders" to extend CustomCoder, and remove custom cloud object
related serialization logic where possible.

Still remaining: Splitting the CustomCoder side of the class hierarchy
from the StandardCoder side of the hierarchy, moving IterableLikeCoder
to be a CustomCoder, and have IterableCoder forward to an internal
implementation (to ensure it remains a StandardCoder).
@tgroh tgroh force-pushed the fewer_standard_coders branch from ca05834 to 032e1ef Compare April 25, 2017 01:21
@tgroh
Copy link
Member Author

tgroh commented Apr 25, 2017

R: @lukecwik @kennknowles

Copy link
Member

@kennknowles kennknowles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this improve? Can you clarify what the meaning of the changes are here? Are they just temporary as things are put in place?

As I understand it, all CustomCoder subclasses will share a single coder URN (something like "urn:beam:coder:javasdk") and the payload will contain a Java serialized blob, hence the data will only really be readable by the Java SDK harness. In some sense, this is what "custom coder" means at the level of the portable Beam model.

The other possibility is that a coder has a meaningful URN (like "urn:beam:coder:map") and payload/components determined by that.

So the latter is more meaningful, portable, and compact, just less convenient for one-offs. So I want to understand the rationale for moving things basically from the better case to the worse case. I'm ready to believe it is mandatory for some reason...

* @param <T> the type of the values being transcoded
*/
public class LengthPrefixCoder<T> extends StandardCoder<T> {
public class LengthPrefixCoder<T> extends CustomCoder<T> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this one, in particular, need to have a well-defined URN?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

@Override
public Collection<String> getAllowedEncodings() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incidentally, getEncodingId should be removed. Possibly replaced by a URN, but that can live in the CoderEncoder. And getAllowedEncodings fails to address the "version N+2" problem effectively, so I think that design is DOA.

* @param <V> the type of the values of the KVs being transcoded
*/
public class MapCoder<K, V> extends StandardCoder<Map<K, V>> {
public class MapCoder<K, V> extends CustomCoder<Map<K, V>> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the value in making this a CustomCoder. Why?

* @param <T> the type of the values being transcoded
*/
public class NullableCoder<T> extends StandardCoder<T> {
public class NullableCoder<T> extends CustomCoder<T> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto, here. Even though null is a Java-specific interpretation, it is universal to have a coder that is "either a thing or nothing".

@lukecwik
Copy link
Member

The reasoning to swap everything to be a custom coder is to remove the need to create URNs and specifications for all the non-Fn API coders. Swapping to using customcoder is one way of freeing us up in the future to define a specification equivalent to what it encodes and provide an URN.

@kennknowles
Copy link
Member

OK, that sounds fine as a temporary measure. Obviously many of these coders already exist as a well-defined language-independent format. But it is fine to not commit to it right now.

@kennknowles
Copy link
Member

(We can also always just bump the URN version)

@tgroh
Copy link
Member Author

tgroh commented Apr 25, 2017

There is no restriction that a class that extends CustomCoder be a class that serializes purely via Java serialization. The mechanism to actually register custom urn-based serializers isn't implemented yet. Coders which extend CustomCoder and have no specifically defined Serialization mechanism (e.g. shared urn and payload format) will be serialized via Java, but the idea is that we'll be able to register additional coder serializations (e.g. NullableCoder, MapCoder, etc)

My understanding of the "meaning" of StandardCoder is that it's a coder that must be understood by the runner.

* @deprecated For {@code AvroCoder} internal use only.
*/
// TODO: once we can remove this deprecated function, inline in constructor.
@Deprecated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this method package private and marked with @VisibleForTesting now since they were marked @deprecated for that reason

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://issues.apache.org/jira/browse/BEAM-2077 for followup. Re-deprecated in the meantime.

* @deprecated For {@code AvroCoder} internal use only.
*/
// TODO: once we can remove this deprecated function, inline in constructor.
@Deprecated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this method package private and marked with @VisibleForTesting now since they were marked @deprecated for that reason

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://issues.apache.org/jira/browse/BEAM-2077 for followup. Re-deprecated in the meantime.

}

@Override
public void verifyDeterministic() throws NonDeterministicException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this method should still exist.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe this Coder should exist; rm'd

@lukecwik
Copy link
Member

LGTM

@tgroh tgroh force-pushed the fewer_standard_coders branch from 13d9b79 to 880d800 Compare April 25, 2017 21:59
@lukecwik
Copy link
Member

LGTM again, rebase/merge pain

@asfgit asfgit closed this in 0d69611 Apr 25, 2017
@kennknowles
Copy link
Member

kennknowles commented Apr 26, 2017

Another problem is that compatibility across e.g. Flink savepoints or Dataflow's pipeline update depends on the coder having the same representation and transparent components. This may make them less stable.

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling 880d800 on tgroh:fewer_standard_coders into ** on apache:master**.

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling 880d800 on tgroh:fewer_standard_coders into ** on apache:master**.

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling 880d800 on tgroh:fewer_standard_coders into ** on apache:master**.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants