
[BEAM-4236, BEAM-2927] Make Python SDK side inputs work with non well known coders and also work with Dataflow #5302

Merged
lukecwik merged 3 commits into apache:master from lukecwik:side_input2 on May 16, 2018

Conversation

@robertwb (Contributor) commented May 8, 2018

Several issues addressed:

  • Lack of support for mapping windows
  • Reliance on a side input specification serialized within the DoFn, instead of one read from the pipeline description, when using the Fn API
  • Eager pre-fetching of side inputs even when the process bundle had zero elements, which led to state requests for bundles containing zero elements

Streaming Dataflow job w/ Fn API side inputs: 2018-05-09_14_40_12-2860575533296523734 (note SIDE_INPUT being logged)
Batch Dataflow job w/ legacy side inputs: 2018-05-09_14_30_24-10730914164827927481


Follow this checklist to help us incorporate your contribution quickly and easily:

  • Make sure there is a JIRA issue filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue.
  • Write a pull request description that is detailed enough to understand:
    • What the pull request does
    • Why it does it
    • How it does it
    • Why this approach
  • Each commit in the pull request should have a meaningful subject line and body.
  • Run ./gradlew build to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.


@staticmethod
def from_runner_api(proto, context):
def from_runner_api(proto, coder):
robertwb (Contributor, Author):

Given the coder comes from the PCollection and can be mutated, it's probably better to remove the coder from the side input object altogether, keeping the signature of this method as is, and instead passing it when we bind the PCollection.
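The split suggested here — a side input spec that carries no coder, with the coder supplied only when the spec is bound to a PCollection — might look roughly like this. This is a minimal sketch with hypothetical names, not the actual Beam API:

```python
# Sketch (hypothetical names) of keeping the coder off the side input
# object and supplying it only at bind time, so later mutations of the
# PCollection's coder are always observed.

class SideInputSpec:
    """Describes how a side input is accessed (e.g. singleton, iterable)."""
    def __init__(self, access_pattern):
        self.access_pattern = access_pattern  # note: no coder stored here

class BoundSideInput:
    """Pairs a spec with the coder taken from the PCollection at bind time."""
    def __init__(self, spec, pcollection_coder):
        self.spec = spec
        self.coder = pcollection_coder

class FakePCollection:
    def __init__(self, coder):
        self.coder = coder

def bind(spec, pcollection):
    # The coder is read from the PCollection only when binding happens.
    return BoundSideInput(spec, pcollection.coder)

pc = FakePCollection(coder='VarIntCoder')
bound = bind(SideInputSpec('singleton'), pc)
assert bound.coder == 'VarIntCoder'  # coder came from the PCollection
```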

robertwb (Contributor, Author):

Seems this could also obviate the need for passing the transform proto itself in the PTransform.from_runner_api too.

Reply (Member):

Looks much cleaner, trying this out now.

input_args = input_args if input_args else []
input_kwargs = input_kwargs if input_kwargs else {}

if not self.has_windowed_inputs:
robertwb (Contributor, Author):

This is an important optimization for batch pipelines that use side inputs (e.g the TFX stuff). I see now how we were requesting side inputs in start(), but they're not window dependent. Perhaps we could defer this optimization to the first element that is processed. (At least a TODO would be in order.)

Reply (Member):

Like you suggest, checking for at least one element or loading on first element should work, just trying to get everything working E2E before I try to improve this optimization.
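The deferral being discussed — fetching side inputs on the first processed element instead of in start(), so an empty bundle never issues a state request — could be sketched like this. This is illustrative only, with hypothetical names, not Beam's actual bundle-processing code:

```python
# Sketch (hypothetical) of lazily fetching side inputs on the first
# element of a bundle, so zero-element bundles never trigger the
# expensive state request.

class LazySideInputRunner:
    def __init__(self, fetch_side_inputs):
        self._fetch = fetch_side_inputs   # the expensive state request
        self._side_inputs = None

    def start_bundle(self):
        # Deliberately do NOT fetch here; an empty bundle stays cheap.
        pass

    def process(self, element):
        if self._side_inputs is None:     # first element triggers the fetch
            self._side_inputs = self._fetch()
        return element, self._side_inputs

calls = []
def fetch():
    calls.append(1)
    return ['x']

runner = LazySideInputRunner(fetch)
runner.start_bundle()     # zero-element bundle: no state request issued
assert calls == []
runner.process('a')
runner.process('b')
assert calls == [1]       # fetched exactly once, on the first element
```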

def _pardo_fn_data(self):
si_tags_and_types = None
windowing = None
return self.fn, self.args, self.kwargs, si_tags_and_types, windowing
robertwb (Contributor, Author):

This is the cause of the

  File "/usr/local/google/home/lcwik/git/beam/sdks/python/apache_beam/runners/worker/operations.py", line 360, in start
    pickler.loads(self.spec.serialized_fn))
ValueError: need more than 4 values to unpack

failure. I don't remember if we completely removed its use in the legacy worker, but if so, we can probably remove it there too rather than re-introduce it here.

Reply (Member):

I was able to remove this in the few places it was referenced.

@charlesccychen (Contributor):

The following sample fails to run at this commit:

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
 
from __future__ import absolute_import
 
import argparse
import logging
 
import six
 
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.options.pipeline_options import StandardOptions
from apache_beam.testing.util import assert_that, equal_to
 
 
def run(argv=None):
  """Build and run the pipeline."""
 
  pipeline_options = PipelineOptions(argv)
  pipeline_options.view_as(SetupOptions).save_main_session = True
  pipeline_options.view_as(StandardOptions).streaming = True
  p = beam.Pipeline(options=pipeline_options)
 
  main = p | 'MainCreate' >> beam.Create(['a', 'b'])
 
  single_side = p | 'SingletonSide' >> beam.Create(['x'])
  main | beam.Map(lambda x, side: (x, side),
                      beam.pvalue.AsSingleton(single_side))
  # assert_that(
  #     main | beam.Map(lambda x, side: (x, side),
  #                     beam.pvalue.AsSingleton(single_side)),
  #     equal_to([('a', 'x'), ('b', 'x')]),
  #     label='AssertSingleton')
 
  # iter_side = p | 'IterSide' >> beam.Create(['x', 'y', 'z'])
  # assert_that(
  #     main | beam.Map(lambda x, side: (x, sorted(side)),
  #                     beam.pvalue.AsIter(iter_side)),
  #     equal_to([('a', ['x', 'y', 'z']), ('b', ['x', 'y', 'z'])]),
  #     label='AssertIter')
 
  # multimap_side = p | 'MultimapSide' >> beam.Create(
  #     [('a', 'aa'), ('b', 'bb'), ('a', 'aaa')])
  # assert_that(
  #     main | beam.Map(lambda x, side: (x, sorted(side[x])),
  #                     beam.pvalue.AsMultiMap(multimap_side)),
  #     equal_to([('a', ['aa', 'aaa']), ('b', ['bb'])]),
  #     label='AssertMultimap')
 
  result = p.run()
  result.wait_until_finish()
 
 
if __name__ == '__main__':
  logging.getLogger().setLevel(logging.INFO)
  run()

The following error is emitted:

INFO:root:LCWIKA [u'ref_PCollection_PCollection_2', u'ref_PCollection_PCollection_1']
INFO:root:LCWIKB [u'ref_PCollection_PCollection_3', u'ref_PCollection_PCollection_2', u'ref_PCollection_PCollection_1']
INFO:root:LCWIKC {u'side0': unique_name: "18SingletonSide/Read.None"
coder_id: "eNprYEpOyczJ0QMRXPE5+Ykp8SWVBalchQyhXMElRZl56SFAbiFjayFTUCGzHgCKDA/Z"
is_bounded: BOUNDED
windowing_strategy_id: "ref_Windowing_Windowing_1"
}
INFO:root:LCWIKD [u'ref_Coder_GlobalWindowCoder_1']
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/google/home/ccy/git/beam/sdks/python/apache_beam/examples/streaming_wordcount.py", line 72, in <module>
    run()
  File "/usr/local/google/home/ccy/git/beam/sdks/python/apache_beam/examples/streaming_wordcount.py", line 66, in run
    result = p.run()
  File "apache_beam/pipeline.py", line 389, in run
    self.to_runner_api(), self.runner, self._options).run(False)
  File "apache_beam/pipeline.py", line 618, in from_runner_api
    context.transforms.get_by_id(root_transform_id)]
  File "apache_beam/runners/pipeline_context.py", line 85, in get_by_id
    self._id_to_proto[id], self._pipeline_context)
  File "apache_beam/pipeline.py", line 863, in from_runner_api
    part = context.transforms.get_by_id(transform_id)
  File "apache_beam/runners/pipeline_context.py", line 85, in get_by_id
    self._id_to_proto[id], self._pipeline_context)
  File "apache_beam/pipeline.py", line 854, in from_runner_api
    transform=ptransform.PTransform.from_runner_api(proto, context),
  File "apache_beam/transforms/ptransform.py", line 560, in from_runner_api
    context)
  File "apache_beam/transforms/core.py", line 904, in from_runner_api_parameter
    for tag in pardo_payload.side_inputs.keys()
  File "apache_beam/transforms/core.py", line 904, in <dictcomp>
    for tag in pardo_payload.side_inputs.keys()
  File "apache_beam/runners/pipeline_context.py", line 85, in get_by_id
    self._id_to_proto[id], self._pipeline_context)
KeyError: u'eNprYEpOyczJ0QMRXPE5+Ykp8SWVBalchQyhXMElRZl56SFAbiFjayFTUCGzHgCKDA/Z'

@lukecwik lukecwik force-pushed the side_input2 branch 2 times, most recently from 62ea484 to 1bc4716 on May 9, 2018 16:37
@lukecwik lukecwik changed the title Side input2 [BEAM-4236, BEAM-2927] Make Python SDK side inputs work with non well known coders and also work with Dataflow May 9, 2018
@lukecwik lukecwik requested a review from aaltay May 9, 2018 17:38

lukecwik commented May 9, 2018

R: @charlesccychen


lukecwik commented May 9, 2018

CC: @pabloem


pabloem commented May 9, 2018

Question:
Is this going to change how batch side inputs work in Python Dataflow?

Where can I find the code that will fetch the side input from now on?

Will we not be using the existing code path in runners/worker/sideinputs.py ?


lukecwik commented May 9, 2018

@pabloem You're right, this breaks the way the Dataflow SDK does legacy worker batch side inputs.


lukecwik commented May 9, 2018

Reverted the removal of tags_and_types, restoring the legacy batch Python Dataflow code paths.

@lukecwik lukecwik force-pushed the side_input2 branch 3 times, most recently from e7c750f to 93fcc6c on May 9, 2018 21:24

lukecwik commented May 9, 2018

The hardest part was understanding the changes that I could make that would be compatible with Dataflow.

@charlesccychen (Contributor):

For the record, the sample in #5302 (comment) still fails at this PR for an unrelated reason, likely having to do with PTransform replacement logic.

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/google/home/ccy/git/beam/sdks/python/apache_beam/examples/streaming_wordcount.py", line 72, in <module>
    run()
  File "/usr/local/google/home/ccy/git/beam/sdks/python/apache_beam/examples/streaming_wordcount.py", line 66, in run
    result = p.run()
  File "apache_beam/pipeline.py", line 389, in run
    self.to_runner_api(), self.runner, self._options).run(False)
  File "apache_beam/pipeline.py", line 402, in run
    return self.runner.run_pipeline(self)
  File "apache_beam/runners/dataflow/dataflow_runner.py", line 347, in run_pipeline
    super(DataflowRunner, self).run_pipeline(pipeline)
  File "apache_beam/runners/runner.py", line 170, in run_pipeline
    pipeline.visit(RunVisitor(self))
  File "apache_beam/pipeline.py", line 430, in visit
    self._root_transform().visit(visitor, self, visited)
  File "apache_beam/pipeline.py", line 785, in visit
    part.visit(visitor, pipeline, visited)
  File "apache_beam/pipeline.py", line 788, in visit
    visitor.visit_transform(self)
  File "apache_beam/runners/runner.py", line 165, in visit_transform
    self.runner.run_transform(transform_node)
  File "apache_beam/runners/runner.py", line 208, in run_transform
    return m(transform_node)
  File "apache_beam/runners/dataflow/dataflow_runner.py", line 581, in run_ParDo
    input_step = self._cache.get_pvalue(transform_node.inputs[0])
  File "apache_beam/runners/runner.py", line 286, in get_pvalue
    value_with_refcount = self._cache[self.key(pvalue)]
KeyError: (u'SingletonSide/Decode Values', None)

@charlesccychen (Contributor):

The issue with the KeyError above happens because of incorrect refcount numbers in PValueCache. The entry is garbage-collected because its refcount reaches zero before it can be used in the side input. Does anyone know why we have refcounts in PValueCache at all? My hunch is that this is not necessary, and is an artifact of when PValueCache was used to hold actual result elements. I am inclined to remove this feature of PValueCache. Does anyone have an opinion here?

(CC: @robertwb @aaltay)
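The failure mode described here can be illustrated with a toy model. This is hypothetical, not the real PValueCache: a cache that evicts an entry when its refcount reaches zero will raise KeyError for any consumer that was not counted up front, which matches the KeyError seen for 'SingletonSide/Decode Values' above:

```python
# Toy model (hypothetical, not the real PValueCache) of eviction-at-zero:
# if the side input consumer is not included in the refcount, the entry
# is garbage-collected before the side input can read it.

class RefcountedCache:
    def __init__(self):
        self._entries = {}   # key -> [value, refcount]

    def put(self, key, value, refcount):
        self._entries[key] = [value, refcount]

    def get(self, key):
        value, refs = self._entries[key]   # raises KeyError if evicted
        refs -= 1
        if refs == 0:
            del self._entries[key]         # evicted once refcount hits zero
        else:
            self._entries[key][1] = refs
        return value

cache = RefcountedCache()
# Refcount of 1 undercounts the consumers: the side input is not counted.
cache.put('SingletonSide', 'value', refcount=1)
cache.get('SingletonSide')                 # main consumer drains the entry
try:
    cache.get('SingletonSide')             # side input consumer arrives late
except KeyError:
    evicted = True                         # entry was already evicted
```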


aaltay commented May 10, 2018

I don't know why PValueCache has a refcount. I have the same feeling as you about why it has this feature.

@lukecwik (Member):

A virtualenv issue is preventing the Dataflow Python postcommit from running; more details in BEAM-4249.


aaltay commented May 10, 2018

@charlesccychen could you run the equivalent of the Python postcommit in your own environment? Let's not wait for the Jenkins fixes.


charlesccychen commented May 11, 2018

@aaltay: Yes.

@robertwb: I noticed that you very recently touched the ref counting logic in 00f3e22fccb. Can you chime in on whether we still need refcounts in PValueCache?

@robertwb (Contributor, Author):

The refcounts should only be needed for the old DirectRunner.

@lukecwik (Member):

Run Python Dataflow ValidatesRunner

@lukecwik (Member):

run python precommit

@robertwb robertwb left a comment


All LGTM, thanks!


safe_coders = {}

def fix_pcoll_coder(pcoll, pipeline_components):
robertwb (Contributor, Author):

Maybe call this length_prefix_unknown_coders?

Reply (Member):

Done
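The renamed length_prefix_unknown_coders makes the intent explicit. The idea behind it can be sketched as follows — a simplified, self-contained illustration, not the real implementation (which rewrites coder protos in the pipeline); Beam's LengthPrefixCoder uses a varint, while a fixed 4-byte prefix is used here only for brevity:

```python
# Sketch of length-prefixing: wrapping an unknown coder's output with a
# length prefix lets a runner move whole encoded values around without
# understanding how to decode them.

import struct

def length_prefix_encode(payload: bytes) -> bytes:
    # Prefix the opaque payload with its size (4-byte big-endian here).
    return struct.pack('>I', len(payload)) + payload

def length_prefix_decode(stream: bytes, offset: int = 0):
    # Read the size, slice out the payload, and return the next offset,
    # so a reader can skip values it cannot interpret.
    (size,) = struct.unpack_from('>I', stream, offset)
    start = offset + 4
    return stream[start:start + size], start + size

# The runner can now step over opaque values it cannot decode:
buf = length_prefix_encode(b'opaque-bytes') + length_prefix_encode(b'next')
value, pos = length_prefix_decode(buf)
assert value == b'opaque-bytes'
value2, _ = length_prefix_decode(buf, pos)
assert value2 == b'next'
```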

@lukecwik lukecwik merged commit 1e69890 into apache:master May 16, 2018
udim added a commit to udim/beam that referenced this pull request May 22, 2019
This might also resolve BEAM-4782.

Removes unused _input_element_coder, whose usage was removed in
PR apache#5302.
udim added a commit to udim/beam that referenced this pull request May 23, 2019
This might also resolve BEAM-4782.

Removes unused _input_element_coder, whose usage was removed in
PR apache#5302.

Includes squashed commits by @robertwb.