[BEAM-7886] Make row coder a standard coder and implement in Python #9188

TheNeuralBit · 2019-07-30T00:43:07Z

Defines a URN, beam:coder:row:v1, to represent row coder as a standard coder.
Implements row coder as a standard coder in Java (previously it was a custom coder). It is now serialized to a portable representation based on the Schema specification defined in https://s.apache.org/beam-schemas
Adds very basic support for Beam Schemas in the Python SDK based on typing (detailed below)
Implements row coder in Python (currently only strings, integers, doubles, and lists are supported)
Adds some simple test cases for row coder to standard_coders.yaml to verify compatibility between the Java and Python implementations.

Beam Schemas in Python

As noted above this PR adds basic support for Beam Schemas in Python. Currently this relies on Python's typing module as a native Python representation of Beam Schemas.

apache_beam.typehints.schemas includes two functions, typing_from_runner_api and typing_to_runner_api which convert supported typing instances to/from portable schemas. Primitive types are mapped to numpy types (e.g. np.int32, np.double), Arrays are mapped to typing.List[T], Maps are mapped to typing.Mapping[K,V], and Rows are mapped to typing.NamedTuple. Logical types are not yet supported.
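To make the mapping above concrete, here is a minimal standard-library sketch of how typing annotations could be walked to produce a schema-like description. It is not the code in apache_beam.typehints.schemas: the `describe` function, the `ATOMIC` table, and the output strings are hypothetical, and builtin types stand in for the numpy types the PR uses so that the sketch has no dependencies.

```python
import collections.abc
import typing
from typing import List, Mapping, NamedTuple, Optional

# Hypothetical atomic-type table; the PR maps numpy types instead of builtins.
ATOMIC = {str: "string", int: "int64", float: "double", bool: "boolean"}


def describe(type_hint):
    """Return a readable, schema-like description of a typing annotation."""
    origin = typing.get_origin(type_hint)
    args = typing.get_args(type_hint)
    # Optional[T] is Union[T, None]; treat it as a nullable field.
    if origin is typing.Union and type(None) in args:
        inner = [a for a in args if a is not type(None)]
        if len(inner) != 1:
            raise TypeError("only Optional[T] unions are handled here")
        return "nullable " + describe(inner[0])
    if origin is list:
        return "array<%s>" % describe(args[0])
    if origin is collections.abc.Mapping:
        return "map<%s, %s>" % (describe(args[0]), describe(args[1]))
    # NamedTuple classes are tuple subclasses that carry a _fields attribute.
    if (isinstance(type_hint, type) and issubclass(type_hint, tuple)
            and hasattr(type_hint, "_fields")):
        hints = typing.get_type_hints(type_hint)
        fields = ", ".join(
            "%s: %s" % (name, describe(hints[name]))
            for name in type_hint._fields)
        return "row<%s>" % fields
    if type_hint in ATOMIC:
        return ATOMIC[type_hint]
    raise TypeError("unsupported type: %r" % (type_hint,))


class Movie(NamedTuple):
    name: str
    year: Optional[int]
    aliases: List[str]


description = describe(Movie)
```

The recursion mirrors the structure described above: atomic types at the leaves, with arrays, maps, and rows as containers around them.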

With the changes in this PR it's possible to use Python's row coder for simple structured data types with code like the following:

import apache_beam as beam
from apache_beam import coders
from typing import NamedTuple
from typing import Optional
import numpy as np

class Movie(NamedTuple):
  name: np.unicode
  year: Optional[np.int16]

# The class/type annotation syntax doesn't work in Python 2. Instead you can use:
# Movie = NamedTuple('Movie', [('name', np.unicode), ('year', Optional[np.int16])])

coders.registry.register_coder(Movie, coders.RowCoder)

# Create a PCollection with a NamedTuple type to assign a Schema
movies = p | 'create movies' >> beam.Map(some_function).with_output_types(Movie)

# Retrieve the original type by accessing movies.element_type

TheNeuralBit · 2019-07-30T01:03:00Z

Run Portable_Python PreCommit

TheNeuralBit · 2019-07-31T17:23:42Z

Run Python PreCommit

TheNeuralBit · 2019-08-02T16:50:08Z

Run Java PreCommit

sdks/python/apache_beam/coders/row_coder_test.py

sdks/python/apache_beam/typehints/schemas.py

robinyqiu

Did a partial review on the Java and portable side. Will review the Python side code later.

model/pipeline/src/main/proto/beam_runner_api.proto

...struction-java/src/main/java/org/apache/beam/runners/core/construction/CoderTranslators.java

...ction-java/src/test/java/org/apache/beam/runners/core/construction/CoderTranslationTest.java

...nstruction-java/src/test/java/org/apache/beam/runners/core/construction/CommonCoderTest.java

sdks/python/apache_beam/coders/row_coder_test.py

TheNeuralBit · 2019-08-14T00:42:16Z

R: @reuvenlax would you mind taking a look at the model changes and the Java changes? I can separate out the relevant commits in their own PR if that helps.

reuvenlax · 2019-08-19T22:28:01Z

taking a look

udim · 2019-08-21T16:21:30Z

I'm reviewing the Python changes

TheNeuralBit · 2019-08-21T16:38:33Z

Thank you both! Let me know if you have any questions

udim

Only a partial review. I need to look more closely at the files under typehints/.

sdks/python/apache_beam/coders/row_coder.py

sdks/python/apache_beam/typehints/schemas.py

sdks/python/apache_beam/coders/row_coder_test.py

sdks/python/apache_beam/coders/row_coder.py

sdks/python/setup.py

udim · 2019-08-21T17:46:25Z

sdks/python/apache_beam/typehints/schemas.py

+PRIMITIVE_TO_ATOMIC_TYPE.update({
+    # In python 3, this is a no-op because str == unicode,
+    # but in python 2 it overrides the bytes -> BYTES mapping.
+    str: schema_pb2.AtomicType.STRING,


Is there a test that does conversion to/from these types?
I suspect that in Python 2 a str such as '\xff' will fail because it's not valid UTF-8.
In other words, str should be invalid or converted to AtomicType.BYTES in Python 2.

So the runtime types don't necessarily line up exactly with the typing representation of the schema. For example even though a schema may have an attribute with np.int* type, we still actually produce and consume int instances, and never use np.int* instances at runtime.

In this case, the typing might say str, but RowCoder uses StrUtf8Coder to produce/consume instances of past.builtins.unicode at runtime.

I agree this could be a little confusing for users. We discussed it on the ML and @robertwb suggested this approach:

In both Python 2 and Python 3 one would use str for STRING, it would decode to past.builtins.unicode. This seems to capture the intent better than mapping str to BYTES in Python 2 only.

There are tests over in row_coder_test.py, and in standard_coders.yaml/standard_coders_test.py

Thinking about this some more, what about just rejecting str in Python 2 (forcing the user to say unicode)? We can loosen things if this becomes too cumbersome in the future (but going the other way is backwards incompatible).

Do we really need to support python2? If this is going to be a burden in general, I would rather not add support for it.

I think it'll be harder to consistently exclude support for Python 2 for all schema use.

I think that makes sense. It doesn't seem too onerous to ask people to use unicode in python 2. And that's a good point that it's a backwards compatible change if we find out otherwise. I pushed a commit that does this: ecaf73c
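The concern about '\xff' in this thread can be reproduced in Python 3 with a bytes value, which plays the role that str played in Python 2. This is only an illustrative sketch using the built-in codec machinery, not the code path RowCoder or StrUtf8Coder actually takes.

```python
# A Python 2 str is a byte string; b"\xff" stands in for it here.
raw = b"\xff"

try:
    raw.decode("utf-8")
    decodes = True
except UnicodeDecodeError:
    # 0xff can never appear in well-formed UTF-8.
    decodes = False

# Text (unicode) values always round-trip through UTF-8.
text = u"caf\u00e9"
round_tripped = text.encode("utf-8").decode("utf-8")
```

Arbitrary byte strings are not guaranteed to be valid UTF-8, which is why requiring unicode for STRING fields in Python 2 avoids the failure mode described above.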

sdks/python/apache_beam/coders/row_coder.py

udim · 2019-08-21T19:18:54Z

sdks/python/apache_beam/coders/row_coder_test.py

+    ("aliases", typing.List[unicode]),
+])
+
+coders_registry.register_coder(Person, RowCoder)


Is this required for regular use?

Yes it is right now. I was hesitant to make RowCoder the default coder for any NamedTuple sub-class, and I thought this was a good way to make it opt in.

sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/PortableSchemaCoder.java

reuvenlax · 2019-08-23T00:13:10Z

Trying to think of a better name than PortableSchemaCoder, but I guess this is fine for now.

aaltay · 2019-08-23T20:32:12Z

Can we just use SchemaCoder as the name? I agree that this does not matter much. Would changing this in the future be difficult?

reuvenlax · 2019-08-23T21:34:35Z

This is a subclass of schema coder. Ideally we should get rid of the current row coder (make it a utility class) and call this RowCoder.

TheNeuralBit · 2019-08-23T23:08:38Z

Let me see if I can put together a patch that does that. I think I could just get rid of the current RowCoder and move its logic to SchemaCoder and RowCoderGenerator.

TheNeuralBit · 2019-08-27T17:21:25Z

Run Java PreCommit

TheNeuralBit · 2019-09-26T21:36:14Z

Apologies for letting this go stale. After BEAM-8111 I wanted to make sure we had some better test coverage on the Java side.

@udim, @aaltay, @robertwb, and/or @chadrik - would you mind taking another look at the Python changes now?

robertwb

Just a couple of comments, but this is looking pretty good to me.

robertwb · 2019-10-07T20:54:47Z

...struction-java/src/main/java/org/apache/beam/runners/core/construction/CoderTranslators.java

+    return new CoderTranslator<RowCoder>() {
+      @Override
+      public List<? extends Coder<?>> getComponents(RowCoder from) {
+        return ImmutableList.of();


So for the time being, we're inlining everything, rather than using components. Was there a bug tracking doing better for this?

Yes right now there's just a fixed mapping from fieldtype to coder. There's not a bug filed for using components, I was thinking that we would just continue inlining everything. Do you think we should plan on using components instead? What does that get us?

For coders, if one had a coder T one was likely to have KV<K, T> for various K, an Iterable, WindowedValue for possibly several window types, and various other permutations. Coupled with the fact that leaf coders were often huge serialized blobs made for some pretty significant savings.

Maybe this'll be less of an issue in the streaming world. I think it should not be a blocker assuming we'll be able to update this in the (short-term) future.
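The savings described in this thread can be illustrated with a toy comparison of inlining a large leaf spec in every wrapper versus storing it once and referring to it by id. Everything here is hypothetical: the dict layout, the URNs, and the `leaf_id` key are simplified stand-ins, not Beam's actual proto messages.

```python
import json

# Hypothetical leaf coder with a large serialized payload.
leaf = {"urn": "example:leaf", "payload": "x" * 1000}

wrappers = ["kv", "iterable", "windowed_value"]


def inlined(wrapper_urns):
    """Every wrapper carries its own full copy of the leaf spec."""
    return [{"urn": w, "component": leaf} for w in wrapper_urns]


def by_reference(wrapper_urns):
    """The leaf is stored once; wrappers refer to it by id."""
    return {
        "components": {"leaf_id": leaf},
        "coders": [{"urn": w, "component_id": "leaf_id"}
                   for w in wrapper_urns],
    }


inlined_size = len(json.dumps(inlined(wrappers)))
referenced_size = len(json.dumps(by_reference(wrappers)))
```

With three wrappers the inlined form repeats the payload three times, so the gap grows with the number of permutations that share the same leaf.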

robertwb · 2019-10-07T20:58:14Z

...nstruction-java/src/test/java/org/apache/beam/runners/core/construction/CommonCoderTest.java

+      case FLOAT:
+        return Float.parseFloat((String) value);
+      case DOUBLE:
+        return Double.parseDouble((String) value);


Why are these strings?

I tried to be consistent with the YAML representations used for other coders in convertValues. We use strings for DoubleCoder there, presumably to avoid the possibility of running into precision errors? It looks like yaml does support floating point but explicitly states it doesn't specify a required accuracy for implementations.

@lukecwik added the parseDouble line I linked in #8205, maybe he can clarify?

YAML is a superset of JSON, which supports doubles (really, as its only numeric type). @lukecwik I'm curious, but again this is not a blocker.

OK, I created BEAM-8437; let's take this conversation over there.
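The string-based test values discussed in this thread can be sketched in Python as follows. The `parse_double` helper is hypothetical, and the big-endian 8-byte encoding is an assumption made for illustration rather than a statement about what the YAML test harness does.

```python
import struct


def parse_double(text):
    """Parse a double from its string form, as Double.parseDouble would."""
    return float(text)


value = parse_double("0.1")

# Assumed wire form for illustration: 8 bytes, big-endian IEEE 754.
encoded = struct.pack(">d", value)
decoded = struct.unpack(">d", encoded)[0]

# repr() yields a string that parses back to exactly the same double.
same_after_text_round_trip = parse_double(repr(value)) == value
```

Keeping the value as a string in the test file leaves parsing to each SDK's own float parser, so the result does not depend on how a particular YAML library handles floating-point scalars.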

sdks/python/apache_beam/coders/row_coder.py

robertwb · 2019-10-07T21:50:59Z

sdks/python/apache_beam/coders/row_coder.py

+
+    # Note that if this coder's schema has *fewer* attributes than the encoded
+    # value, we just need to ignore the additional values, which will occur
+    # here because we only decode as many values as we have coders for.


I don't see how we can safely do this, as we have to pull the extra values off the stream, right?

That's a good point. I was just trying to replicate what is implemented in Java's RowCoderGenerator. It seems like this would be an issue over there as well, unless something else is consuming the unread bytes? Could there be logic to do that when there's a length-prefix?

I'd be fine with just leaving this out for now and filing a jira if we can't get a satisfying answer.
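The stream-alignment problem raised in this thread can be shown with a toy encoding. This is a sketch only: the one-byte length prefixes and the helper names are hypothetical and are not the beam:coder:row:v1 wire format.

```python
import io


def encode_row(values):
    """Encode each string as a 1-byte length followed by UTF-8 bytes."""
    out = bytearray()
    for v in values:
        data = v.encode("utf-8")
        out.append(len(data))  # toy format: values must be under 256 bytes
        out.extend(data)
    return bytes(out)


def decode_row(stream, num_fields):
    """Decode exactly num_fields values, leaving any others unread."""
    values = []
    for _ in range(num_fields):
        (length,) = stream.read(1)
        values.append(stream.read(length).decode("utf-8"))
    return values


def encode_prefixed(values):
    """Wrap a row in an outer 1-byte length prefix."""
    body = encode_row(values)
    return bytes([len(body)]) + body


def decode_prefixed(stream, num_fields):
    """Consume the whole row, then decode only the known fields."""
    (size,) = stream.read(1)
    return decode_row(io.BytesIO(stream.read(size)), num_fields)


rows = [["a", "b", "c"], ["d", "e", "f"]]

# Without an outer prefix, a two-field reader leaves "c" on the stream.
plain = io.BytesIO(b"".join(encode_row(r) for r in rows))
plain_first = decode_row(plain, 2)
plain_second = decode_row(plain, 2)

# With an outer prefix, the unread field is skipped along with the row.
prefixed = io.BytesIO(b"".join(encode_prefixed(r) for r in rows))
prefixed_first = decode_prefixed(prefixed, 2)
prefixed_second = decode_prefixed(prefixed, 2)
```

In the unprefixed case the second row decodes as `['c', 'd']`, which is the misalignment the comment warns about; the outer length prefix is one way the extra bytes could be consumed.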

robertwb · 2019-10-07T21:51:51Z

sdks/python/apache_beam/coders/row_coder_test.py

+    ("aliases", typing.List[unicode]),
+])
+
+coders_registry.register_coder(Person, RowCoder)


sdks/python/apache_beam/coders/standard_coders_test.py

sdks/python/apache_beam/typehints/native_type_compatibility.py

TheNeuralBit · 2019-10-12T23:20:30Z

@robertwb could you take another look?

robertwb

LGTM, thanks.

TheNeuralBit · 2019-10-23T22:36:33Z

Run Python PreCommit

TheNeuralBit · 2019-10-24T00:46:04Z

Run Python PreCommit

TheNeuralBit · 2019-10-24T17:57:53Z

Run Python PreCommit

TheNeuralBit · 2019-10-25T00:55:15Z

Run Python PreCommit

TheNeuralBit · 2019-10-25T18:26:34Z

Finally got CI green! @robertwb is this ok to merge now?

tweise · 2019-11-04T03:49:37Z

Excited to see this happening. Perhaps a good time to squash the fixup commits?

robertwb · 2019-11-04T18:12:22Z

Yes, go ahead and squash into meaningful commits and merge.

TheNeuralBit · 2019-11-06T01:56:31Z

Run Python PreCommit

TheNeuralBit · 2019-11-06T18:24:35Z

Run Python PreCommit

TheNeuralBit · 2019-11-06T18:25:57Z

@robertwb I squashed it down to reasonable commits, so I think this is ready to merge. I think the Python failures have been flakes; hopefully it will pass this time.

reuvenlax · 2019-11-06T19:04:39Z

I can merge this once Python passes

TheNeuralBit · 2019-11-07T17:53:46Z

🎉 Thanks everyone!

TheNeuralBit changed the title ~~Make row coder a standard coder and implement in Python~~ WIP: Make row coder a standard coder and implement in Python Jul 30, 2019

TheNeuralBit force-pushed the row-coder-standard branch 2 times, most recently from 29fddd1 to f381123 Compare July 31, 2019 00:46

TheNeuralBit force-pushed the row-coder-standard branch from 74804ea to 9e730ab Compare August 2, 2019 18:14

aaltay requested review from robertwb, udim and chamikaramj August 2, 2019 21:54

TheNeuralBit changed the title ~~WIP: Make row coder a standard coder and implement in Python~~ [BEAM-7886] Make row coder a standard coder and implement in Python Aug 2, 2019

chadrik reviewed Aug 3, 2019

sdks/python/apache_beam/coders/row_coder_test.py (outdated, resolved)

chadrik reviewed Aug 3, 2019

sdks/python/apache_beam/typehints/schemas.py (outdated, resolved)

chadrik reviewed Aug 3, 2019

sdks/python/apache_beam/typehints/schemas.py (outdated, resolved)

chadrik reviewed Aug 3, 2019

sdks/python/apache_beam/typehints/schemas.py (outdated, resolved)

chadrik mentioned this pull request Aug 4, 2019

[7746] Create a more user friendly external transform API #9098

Merged

robinyqiu reviewed Aug 8, 2019

TheNeuralBit commented Aug 10, 2019

sdks/python/apache_beam/coders/row_coder_test.py (outdated, resolved)

TheNeuralBit force-pushed the row-coder-standard branch 2 times, most recently from 17bfdbb to 2571d89 Compare August 13, 2019 00:53

udim requested changes Aug 21, 2019

reuvenlax reviewed Aug 21, 2019

sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/PortableSchemaCoder.java (outdated, resolved)

kennknowles requested a review from udim October 7, 2019 20:23

robertwb reviewed Oct 7, 2019

robertwb approved these changes Oct 18, 2019

TheNeuralBit mentioned this pull request Oct 22, 2019

[BEAM-7738] Add external transform support to PubsubIO #9268

Merged

TheNeuralBit force-pushed the row-coder-standard branch from c2d8fde to 90f8f25 Compare October 22, 2019 23:10

TheNeuralBit added 7 commits November 5, 2019 14:39

Add beam:coder:row:v1 to standard coders

f80b50d

Java: Implement RowCoder fn api translation

05ff238

Java: Add suport for beam:coder:row:v1 in CommonCoderTest

de196e0

Python: Add conversions between native types and protobuf schema representation

83322b9

Python: Add RowCoder

38459c0

Implements the beam:coder:row:v1 standard coder (the same serialization format as Java's RowCoder). Used to encode namedtuples.

Python: Add support for beam:coder:row:v1 in standard_coders_test.py

bbf46c2

Add simple beam:coder:row:v1 test to standard_coders.yaml

f7ce06d

TheNeuralBit force-pushed the row-coder-standard branch from 90f8f25 to f7ce06d Compare November 5, 2019 22:51

reuvenlax merged commit 01726e9 into apache:master Nov 6, 2019

damccorm mentioned this pull request Jun 4, 2022

Add support for remaining data types in python RowCoder #19815

Open
