[BEAM-4236, BEAM-2927] Make Python SDK side inputs work with non-well-known coders and with Dataflow (#5302)
Conversation
sdks/python/apache_beam/pvalue.py (outdated)

      @staticmethod
    - def from_runner_api(proto, context):
    + def from_runner_api(proto, coder):
Given the coder comes from the PCollection and can be mutated, it's probably better to remove the coder from the side input object altogether, keeping the signature of this method as is, and instead passing it when we bind the PCollection.
Seems this could also obviate the need for passing the transform proto itself in the PTransform.from_runner_api too.
Looks much cleaner, trying this out now.
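The suggested refactor can be sketched roughly as follows (the class and attribute names below are illustrative, not Beam's actual API): the coder is dropped from the deserialized side-input object entirely and attached only when the side input is bound to its PCollection, so any later mutation of the PCollection's coder is still observed.

```python
# Sketch only: names are hypothetical stand-ins for the real Beam classes.

class SideInputData(object):
  def __init__(self, access_pattern, view_fn):
    self.access_pattern = access_pattern
    self.view_fn = view_fn
    self.coder = None  # not taken from the proto; bound later

  @staticmethod
  def from_runner_api(proto, context):
    # Signature stays as-is: no coder argument needed here.
    return SideInputData(proto['access_pattern'], proto['view_fn'])

  def bind(self, pcollection):
    # The coder comes from the PCollection at binding time, so a
    # PCollection whose coder was mutated after deserialization still
    # yields the up-to-date coder.
    self.coder = pcollection.coder
    return self


class FakePCollection(object):
  def __init__(self, coder):
    self.coder = coder


proto = {'access_pattern': 'iterable', 'view_fn': None}
side = SideInputData.from_runner_api(proto, context=None)
side.bind(FakePCollection(coder='VarIntCoder'))
```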
    input_args = input_args if input_args else []
    input_kwargs = input_kwargs if input_kwargs else {}
    ...
    if not self.has_windowed_inputs:
This is an important optimization for batch pipelines that use side inputs (e.g. the TFX stuff). I see now how we were requesting side inputs in start(), but they're not window dependent. Perhaps we could defer this optimization to the first element that is processed. (At least a TODO would be in order.)
Like you suggest, checking for at least one element or loading on the first element should work; I'm just trying to get everything working E2E before improving this optimization.
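A minimal sketch of the deferred-fetch idea, with hypothetical names (the real operation lives in Beam's worker code): the fetch moves out of start() and fires on the first processed element, so a bundle with no elements never pays for the side-input request.

```python
# Sketch only: DoOperation here is a toy, not Beam's worker operation.

class DoOperation(object):
  def __init__(self, fetch_side_inputs):
    self._fetch_side_inputs = fetch_side_inputs  # callable doing the fetch
    self._side_inputs = None

  def start(self):
    # The eager version would fetch side inputs here; the lazy version
    # waits for the first element instead.
    pass

  def process(self, element):
    if self._side_inputs is None:  # first element triggers the fetch
      self._side_inputs = self._fetch_side_inputs()
    return element, self._side_inputs


calls = []

def fetch():
  calls.append(1)
  return ['x']

op = DoOperation(fetch)
op.start()
assert calls == []   # nothing fetched when no element has arrived
op.process('a')
op.process('b')
assert calls == [1]  # fetched exactly once, on the first element
```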
    def _pardo_fn_data(self):
      si_tags_and_types = None
      windowing = None
      return self.fn, self.args, self.kwargs, si_tags_and_types, windowing
This is the cause of the

    File "/usr/local/google/home/lcwik/git/beam/sdks/python/apache_beam/runners/worker/operations.py", line 360, in start
      pickler.loads(self.spec.serialized_fn))
    ValueError: need more than 4 values to unpack

failure. I don't remember if we completely removed its use in the legacy worker, but if so, we can probably remove it there too rather than re-introduce it here.
I was able to remove this in the few places it was referenced.
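For context, the quoted failure is plain tuple unpacking: the worker unpacks the serialized DoFnInfo into five fields, so a producer that emits only four raises a ValueError at unpack time. A toy reproduction (the helper names below are made up):

```python
# Hypothetical helpers illustrating the shape mismatch.

def pardo_fn_data_old():
  # Legacy five-field shape: (fn, args, kwargs, si_tags_and_types, windowing)
  return ('fn', [], {}, None, None)

def pardo_fn_data_broken():
  # A producer emitting only four fields.
  return ('fn', [], {}, None)

# Unpacking the five-field tuple works fine.
fn, args, kwargs, tags_and_types, windowing = pardo_fn_data_old()

# Unpacking four values into five names raises; Python 2 phrased the
# message as "need more than 4 values to unpack".
raised = False
try:
  fn, args, kwargs, tags_and_types, windowing = pardo_fn_data_broken()
except ValueError:
  raised = True
assert raised
```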
The following sample fails to run at this commit:
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from __future__ import absolute_import
import argparse
import logging
import six
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.options.pipeline_options import StandardOptions
from apache_beam.testing.util import assert_that, equal_to
def run(argv=None):
  """Build and run the pipeline."""
  pipeline_options = PipelineOptions(argv)
  pipeline_options.view_as(SetupOptions).save_main_session = True
  pipeline_options.view_as(StandardOptions).streaming = True
  p = beam.Pipeline(options=pipeline_options)
  main = p | 'MainCreate' >> beam.Create(['a', 'b'])
  single_side = p | 'SingletonSide' >> beam.Create(['x'])
  main | beam.Map(lambda x, side: (x, side),
                  beam.pvalue.AsSingleton(single_side))
  # assert_that(
  #     main | beam.Map(lambda x, side: (x, side),
  #                     beam.pvalue.AsSingleton(single_side)),
  #     equal_to([('a', 'x'), ('b', 'x')]),
  #     label='AssertSingleton')
  # iter_side = p | 'IterSide' >> beam.Create(['x', 'y', 'z'])
  # assert_that(
  #     main | beam.Map(lambda x, side: (x, sorted(side)),
  #                     beam.pvalue.AsIter(iter_side)),
  #     equal_to([('a', ['x', 'y', 'z']), ('b', ['x', 'y', 'z'])]),
  #     label='AssertIter')
  # multimap_side = p | 'MultimapSide' >> beam.Create(
  #     [('a', 'aa'), ('b', 'bb'), ('a', 'aaa')])
  # assert_that(
  #     main | beam.Map(lambda x, side: (x, sorted(side[x])),
  #                     beam.pvalue.AsMultiMap(multimap_side)),
  #     equal_to([('a', ['aa', 'aaa']), ('b', ['bb'])]),
  #     label='AssertMultimap')
  result = p.run()
  result.wait_until_finish()


if __name__ == '__main__':
  logging.getLogger().setLevel(logging.INFO)
  run()

The following error is emitted:
Force-pushed 62ea484 to 1bc4716
CC: @pabloem
Question: Where can I find the code that will fetch the side input from now on? Will we not be using the existing code path in runners/worker/sideinputs.py?
@pabloem You're right, this breaks the way that the Dataflow SDK does legacy-worker batch side inputs.
Reverted the removal of tags_and_types and restored support for the legacy batch Python Dataflow code paths.
Force-pushed e7c750f to 93fcc6c
The hardest part was understanding which changes I could make that would remain compatible with Dataflow.
For the record, the sample in #5302 (comment) still fails at this PR for an unrelated reason, likely having to do with PTransform replacement logic.
The issue with the KeyError above happens because of incorrect refcount numbers in PValueCache. The entry is garbage-collected because its refcount reaches zero before it can be used in the side input. Does anyone know why we have refcounts in PValueCache at all? My hunch is that this is not necessary, and is an artifact of when PValueCache was used to hold actual result elements. I am inclined to remove this feature of PValueCache. Does anyone have an opinion here?
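The suspected failure mode can be reproduced with a toy refcounted cache (all names below are hypothetical, not the actual PValueCache API): if the recorded refcount is lower than the real number of consumers, the entry is evicted before the last consumer, here standing in for the side input, reads it.

```python
# Toy model of a refcount-evicting cache; not Beam's PValueCache.

class RefcountedCache(object):
  def __init__(self):
    self._values = {}
    self._refcounts = {}

  def put(self, key, value, refcount):
    self._values[key] = value
    self._refcounts[key] = refcount

  def get(self, key):
    value = self._values[key]  # KeyError if the entry was already evicted
    self._refcounts[key] -= 1
    if self._refcounts[key] == 0:
      # Eviction on reaching zero: premature if refcount was undercounted.
      del self._values[key]
      del self._refcounts[key]
    return value


cache = RefcountedCache()
# Refcount computed as 1, but the value is actually consumed twice:
# once by the main consumer and once by the side input.
cache.put('pcoll', ['a', 'b'], refcount=1)
cache.get('pcoll')       # refcount drops to 0; entry is evicted
evicted_early = False
try:
  cache.get('pcoll')     # the "side input" read arrives too late
except KeyError:
  evicted_early = True
assert evicted_early
```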
I don't know why PValueCache has a refcount. I have the same feeling as you about why it has this feature.
A virtualenv issue is preventing the Dataflow Python postcommit from running; more details in BEAM-4249.
@charlesccychen could you run the equivalent of the Python postcommit in your own environment? Let's not wait for the Jenkins fixes.
@aaltay: Yes. @robertwb: I noticed that you very recently touched the refcounting logic in 00f3e22fccb. Can you chime in on whether we still need refcounts in PValueCache?
The refcounts should only be needed for the old DirectRunner.
Run Python Dataflow ValidatesRunner
run python precommit
robertwb left a comment: All LGTM, thanks!
    safe_coders = {}
    ...
    def fix_pcoll_coder(pcoll, pipeline_components):
Maybe call this length_prefix_unknown_coders?
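For background, length-prefixing is what lets a runner handle elements whose coder it cannot interpret: if each encoded element is preceded by its length, the byte stream can still be split into elements without understanding their contents. A toy sketch using a fixed 4-byte length (Beam's actual LengthPrefixCoder uses a varint; the function names below are illustrative):

```python
import struct

def length_prefix_encode(payloads):
  # Frame each opaque payload with a 4-byte big-endian length.
  out = b''
  for p in payloads:
    out += struct.pack('>I', len(p)) + p
  return out

def length_prefix_decode(data):
  # Split the stream back into payloads without knowing what they encode.
  payloads, i = [], 0
  while i < len(data):
    (n,) = struct.unpack('>I', data[i:i + 4])
    i += 4
    payloads.append(data[i:i + n])
    i += n
  return payloads

# Opaque element encodings from some unknown coder round-trip intact:
elems = [b'\x01\x02', b'', b'hello']
assert length_prefix_decode(length_prefix_encode(elems)) == elems
```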
…the usage of tags_and_types from the serialized DoFnInfo
This might also resolve BEAM-4782. Removes unused _input_element_coder, whose usage was removed in PR apache#5302.
This might also resolve BEAM-4782. Removes unused _input_element_coder, whose usage was removed in PR apache#5302. Includes squashed commits by @robertwb.
Several issues addressed:
- Streaming Dataflow job w/ Fn API side inputs: 2018-05-09_14_40_12-2860575533296523734 (note SIDE_INPUT being logged)
- Batch Dataflow job w/ legacy side inputs: 2018-05-09_14_30_24-10730914164827927481