[BEAM-5953] Fix py3 type error in bundle_processor #7521

markflyhigh · 2019-01-16T00:07:53Z

This is one step toward running Python 3 pipelines on DataflowRunner. This change fixes the error reported in BEAM-5953#comment.

markflyhigh · 2019-01-16T17:34:07Z

+R: @robertwb @tvalentyn

tvalentyn · 2019-01-16T18:33:18Z

sdks/python/apache_beam/runners/worker/bundle_processor.py

@@ -590,9 +590,11 @@ def get_coder(self, coder_id):
    if coder_proto.spec.spec.urn:
      return self.context.coders.get_by_id(coder_id)
    else:
+      payload = coder_proto.spec.spec.payload
+      if isinstance(payload, bytes):


Do we need the if? Is it not always bytes?

I'm not a hundred percent sure. @robertwb Do we need a check here?

Since this decoding is only needed in Python 3, I'd prefer to add a version check here. The same goes for the changes to operation_specs.

markflyhigh · 2019-01-16T18:44:50Z

sdks/python/apache_beam/runners/worker/operation_specs.py

@@ -354,7 +354,7 @@ def get_coder_from_spec(coder_spec):

  # We pass coders in the form "<coder_name>$<pickled_data>" to make the job
  # description JSON more readable.
-  return coders.coders.deserialize_coder(coder_spec['@type'])
+  return coders.coders.deserialize_coder(coder_spec['@type'].encode('utf-8'))


deserialize_coder only accepts bytes. If we decode the payload above, we need to convert it back to bytes here.
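The mismatch described in this comment can be sketched as follows. This is only an illustration: `fake_deserialize_coder` is a hypothetical stand-in for a bytes-only function (it is not Beam's `deserialize_coder`), and the payload contents are made up.

```python
import json


def fake_deserialize_coder(serialized):
    """Hypothetical stand-in for a function that only accepts bytes."""
    if not isinstance(serialized, bytes):
        raise TypeError('expected bytes, got %s' % type(serialized).__name__)
    # Split "<coder_name>$<pickled_data>" into its two parts.
    return serialized.split(b'$', 1)


# Hypothetical payload; a real coder spec carries pickled data after the '$'.
payload = b'{"@type": "SomeCoder$c29tZS1kYXRh"}'

# Once the payload is decoded to text, every string that json.loads()
# returns is text as well (str on Python 3).
coder_spec = json.loads(payload.decode('utf-8'))
coder_type = coder_spec['@type']

# So the value has to be converted back to bytes before it is handed to a
# bytes-only function.
name, data = fake_deserialize_coder(coder_type.encode('utf-8'))
```

On Python 3, passing `coder_type` directly to the stand-in raises `TypeError`, which is the kind of failure the `.encode(...)` call in the diff avoids.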

tvalentyn · 2019-01-16T21:22:21Z

sdks/python/apache_beam/runners/worker/operation_specs.py

@@ -354,7 +354,10 @@ def get_coder_from_spec(coder_spec):

  # We pass coders in the form "<coder_name>$<pickled_data>" to make the job
  # description JSON more readable.
-  return coders.coders.deserialize_coder(coder_spec['@type'])
+  coder = coder_spec['@type']
+  if not isinstance(coder, bytes):


Do we understand in which code path we end up populating coder_spec without encoding it to bytes? If so, can we encode at creation time? I think it would be easier to reason about SDK internals if we can state that this method always expects the same datatype (a bytestring) as input.

Here coder_spec is a cloud object dictionary, typically parsed from JSON, hence the unicode. We can unconditionally encode this for deserialization, but it's quite possible that utf-8 would not be the "right" encoding in this case for pickle.

Because of issues with passing arbitrary bytes through cloud protos, we actually base64-encode our serialized data in internal.pickler.loads/dumps, including here. As such, it should be safe to encode this with 'ascii', which would throw errors if there happen to be any higher code points (which there should not be, but if any creep in, it'd be better to have an explicit error here than a harder-to-decipher one later).
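The reasoning above can be illustrated with a small sketch. The helpers `dumps_b64`, `loads_b64`, and `encode_strict` are hypothetical and only mimic the base64 step described in the comment; they are not Beam's `internal.pickler` functions.

```python
import base64
import pickle


def dumps_b64(obj):
    """Pickle obj, then base64-encode it so the result is pure ASCII bytes."""
    return base64.b64encode(pickle.dumps(obj))


def loads_b64(encoded):
    """Inverse of dumps_b64: base64-decode, then unpickle."""
    return pickle.loads(base64.b64decode(encoded))


def encode_strict(text):
    """Encode text as ASCII; raises UnicodeEncodeError above code point 127."""
    return text.encode('ascii')


original = {'coder': 'example', 'args': [1, 2, 3]}

# After a trip through JSON the serialized data arrives as text, not bytes.
as_text = dumps_b64(original).decode('ascii')

# Base64 output only uses ASCII characters, so a strict ASCII encode is
# lossless and restores the exact bytes that were serialized.
restored = loads_b64(encode_strict(as_text))
```

If a non-ASCII character ever crept into the text, `encode_strict` would fail immediately with `UnicodeEncodeError`, whereas `encode('utf-8')` would succeed and defer the failure to the decoding step.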

tvalentyn · 2019-01-16T22:29:49Z

sdks/python/apache_beam/runners/worker/bundle_processor.py

-      return operation_specs.get_coder_from_spec(
-          json.loads(coder_proto.spec.spec.payload))
+      payload = coder_proto.spec.spec.payload
+      if isinstance(payload, bytes) and sys.version_info[0] == 3:


I think we should not be checking whether input is bytes on Python 3, and should consistently expect the same datatype as input. Can we change this to:

    if sys.version_info[0] > 2:
      # json.loads() does not accept `bytes` on some versions of Python 3.
      payload = payload.decode('utf-8')

The payload is always bytes; we should unconditionally decode it before passing it to json, on both Python 2 and 3.
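A minimal sketch of the unconditional decode suggested here, assuming the payload is UTF-8 encoded JSON. The helper name `parse_payload` and the sample payload are hypothetical. For context, `json.loads()` accepts `bytes` input only from Python 3.6 onward, so decoding first avoids depending on the interpreter version.

```python
import json


def parse_payload(payload):
    """Decode a bytes payload as UTF-8 text, then parse it as JSON.

    No isinstance() or version check: the caller is expected to always
    pass bytes, so the same code path runs on every Python version.
    """
    return json.loads(payload.decode('utf-8'))


# Hypothetical payload, shaped like the coder spec in the diff above.
spec = parse_payload(b'{"@type": "SomeCoder$data"}')
```

Because the function always expects bytes, passing text by mistake fails loudly on Python 3 (a `str` has no `decode` method) instead of being silently accepted.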

Of course what would be better is if the Dataflow runner harness was fixed to always use proper coder URNs rather than stuffing json-ized dataflow v1b3 cloud proto representations into the payload. Can you find/file a JIRA for this?

sg. Also filed https://issues.apache.org/jira/browse/BEAM-6506

markflyhigh · 2019-01-25T06:03:48Z

PTAL @tvalentyn @robertwb

robertwb · 2019-01-25T09:34:12Z

Run Python PostCommit

robertwb

LGTM, assuming post-commit tests (triggered) also pass.

markflyhigh force-pushed the py-df-worker-py3-2 branch from 0e90215 to ee9fbac Compare January 16, 2019 20:07

markflyhigh force-pushed the py-df-worker-py3-2 branch from ee9fbac to 6189694 Compare January 25, 2019 00:59

Fix py3 type error in bundle_processor

a519547

markflyhigh force-pushed the py-df-worker-py3-2 branch from 6189694 to a519547 Compare January 25, 2019 01:52

robertwb approved these changes Jan 25, 2019

robertwb merged commit 07c014e into apache:master Jan 25, 2019
