[BEAM-1925] Updates DoFn invocation logic to be more extensible. #2519

Closed
chamikaramj wants to merge 8 commits into apache:master from chamikaramj:sdf_direct_runner2

@chamikaramj
Contributor

Adds the following abstractions:

DoFnSignature: describes the signature of a given DoFn object.
DoFnInvoker: defines a particular way for invoking DoFn methods.

I believe existing tests cover the updated code paths.
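
As a rough illustration of how the two abstractions could fit together, here is a simplified sketch. It is not the code in this PR: `ToyDoFn`, `ToySignature`, and `ToyInvoker` are hypothetical stand-ins, and the real classes handle defaults, side inputs, and windowing as well.

```python
class ToyDoFn(object):
    """Hypothetical stand-in for a Beam DoFn."""

    def start_bundle(self):
        self.seen = 0

    def process(self, element):
        self.seen += 1
        return [element * 2]

    def finish_bundle(self):
        return self.seen


class ToySignature(object):
    """Describes a DoFn: which methods it offers, looked up once."""

    def __init__(self, do_fn):
        self.do_fn = do_fn
        self.start_bundle_method = do_fn.start_bundle
        self.process_method = do_fn.process
        self.finish_bundle_method = do_fn.finish_bundle


class ToyInvoker(object):
    """One particular way of invoking the methods a signature describes."""

    def __init__(self, signature):
        self.signature = signature

    def invoke_start_bundle(self):
        return self.signature.start_bundle_method()

    def invoke_process(self, element):
        return self.signature.process_method(element)

    def invoke_finish_bundle(self):
        return self.signature.finish_bundle_method()


invoker = ToyInvoker(ToySignature(ToyDoFn()))
invoker.invoke_start_bundle()
outputs = [invoker.invoke_process(x) for x in (1, 2, 3)]
processed = invoker.invoke_finish_bundle()
```

The split means a runner only talks to the invoker, so a different invocation strategy can be swapped in without touching the runner.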

@chamikaramj
Contributor Author

R: @sb2nov @robertwb

@asfbot

asfbot commented Apr 13, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/9482/

Build result: FAILURE

[...truncated 1.51 MB...]
    at hudson.remoting.UserRequest.perform(UserRequest.java:50)
    at hudson.remoting.Request$2.run(Request.java:336)
    at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.maven.plugin.MojoExecutionException: Archetype IT 'basic' failed: Execution failure: exit code = 1
    at org.apache.maven.archetype.mojos.IntegrationTestMojo.execute(IntegrationTestMojo.java:269)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
    ... 31 more
2017-04-13T00:46:12.122 [ERROR]
2017-04-13T00:46:12.122 [ERROR] Re-run Maven using the -X switch to enable full debug logging.
2017-04-13T00:46:12.122 [ERROR]
2017-04-13T00:46:12.122 [ERROR] For more information about the errors and possible solutions, please read the following articles:
2017-04-13T00:46:12.122 [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
2017-04-13T00:46:12.122 [ERROR]
2017-04-13T00:46:12.122 [ERROR] After correcting the problems, you can resume the build with the command
2017-04-13T00:46:12.122 [ERROR] mvn -rf :beam-sdks-java-maven-archetypes-examples-java8
channel stopped
Setting status of ea54211 to FAILURE with url https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/9482/ and message: 'Build finished. '
Using context: Jenkins: Maven clean install
--none--

@chamikaramj
Contributor Author

Retest this please

@coveralls

Coverage Status

Coverage increased (+0.01%) to 70.099% when pulling ea54211 on chamikaramj:sdf_direct_runner2 into dc672f4 on apache:master.

@asfbot

asfbot commented Apr 13, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/9489/
--none--

self.process(windowed_value)

def process(self, windowed_value):
self._invoke_process_method(windowed_value)
Contributor

Why do we need this extra invoke function? Can we move the contents of _invoke_process_method here?

Contributor Author

Removed this function.


arguments, _, _, defaults = self.dofn.get_function_arguments('process')
def invoke_finish_bundle(self, process_output_fn):
defaults = self.signature.start_bundle_method.defaults
Contributor

finish_bundle_method

Contributor Author

Done.

self.finish_bundle_method = None
self.do_fn = None

@staticmethod
Contributor

can we just do all this in the __init__ instead of having a static method ?

Contributor Author

Done.

self.dofn_process = fn.process
def invoke_start_bundle(self, process_output_fn):
defaults = self.signature.start_bundle_method.defaults
defaults = defaults if defaults else []
Contributor

don't need this as signature takes care of this.

Contributor Author

Done.

def _invoke_bundle_method(self, method):
self.window_fn = windowing.windowfn

self.do_fn_invoker = DoFnInvoker.create_invoker(
Contributor

we should move the use_simple_invoker boolean inside this as that makes it a pure factory .. the runner shouldn't decide what invoker is returned.

Contributor Author

Done.



cdef class DoFnInvoker(object):
cpdef invoke_process(self, WindowedValue element, process_output_fn)
Contributor

do static methods need to be specified here?

Contributor Author

I tried but kept getting errors. Left a TODO for now. I don't think this will have a significant impact on performance anyway.


class DoFnRunner(Receiver):
"""A helper class for executing ParDo operations.
class Method(object):
Contributor

This name is rather generic, perhaps DoFnMethodWrapper or similar?

Contributor

It's also unclear here what args and defaults are supposed to be.

Contributor Author

Done.

self.do_fn = None

@staticmethod
def create_signature(do_fn):
Contributor

It feels like all of this would be better put in DoFnSignature's constructor.

Contributor Author

Done.

def create_invoker(
signature, use_simple_invoker, context, side_inputs, input_args,
input_kwargs):
if use_simple_invoker:
Contributor

Docstring?

Also, it feels like "use_simple_invoker" (and perhaps other arguments) should be deduced here, not passed in.

Contributor Author

Done.

I removed use_simple_invoker. Other arguments have to be passed in here or during process call. Seems like passing here is the better option.
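
To make the "deduce it in the factory" idea concrete, here is a minimal sketch. The class bodies and the exact condition are assumptions for illustration (only the names `create_invoker`, `SimpleInvoker`, and `PerWindowInvoker` come from this PR); the point is that the caller passes what it has and the factory picks the cheapest invoker that can cope.

```python
class WindowParam(object):
    """Hypothetical placeholder: 'pass the current window to process()'."""


class Signature(object):
    """Hypothetical, minimal signature: just the defaults of process()."""

    def __init__(self, process_defaults=()):
        self.process_defaults = tuple(process_defaults)


class Invoker(object):
    def __init__(self, signature):
        self.signature = signature

    @staticmethod
    def create_invoker(signature, side_inputs=(), input_args=(),
                       input_kwargs=None):
        """Returns the cheapest invoker that can handle this signature."""
        # Assumed rule: anything beyond a bare process(element) call needs
        # the generic path.
        needs_generic_path = (
            bool(side_inputs) or bool(input_args) or bool(input_kwargs)
            or bool(signature.process_defaults))
        if needs_generic_path:
            return PerWindowInvoker(
                signature, side_inputs, input_args, input_kwargs)
        return SimpleInvoker(signature)


class SimpleInvoker(Invoker):
    """Fast path: process(element) with no extra arguments."""


class PerWindowInvoker(Invoker):
    """Generic path: side inputs, extra arguments, or placeholders."""

    def __init__(self, signature, side_inputs, input_args, input_kwargs):
        super().__init__(signature)
        self.side_inputs = side_inputs
        self.input_args = input_args
        self.input_kwargs = input_kwargs or {}


simple = Invoker.create_invoker(Signature())
generic = Invoker.create_invoker(Signature(process_defaults=(WindowParam,)))
with_args = Invoker.create_invoker(Signature(), input_args=[1])
```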

self.dofn_process = fn.process
def invoke_start_bundle(self, process_output_fn):
defaults = self.signature.start_bundle_method.defaults
defaults = defaults if defaults else []
Contributor

Normalize to the empty list (or better, empty tuple) earlier.

Contributor Author

Done.
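
One way to read "normalize earlier" is to do it once where the method is inspected, so no call site needs a `defaults if defaults else []` guard. A small sketch under that assumption, using the standard library's `inspect.getfullargspec` (the `MethodWrapper` and `Example` classes here are hypothetical, not the PR's DoFnMethodWrapper):

```python
import inspect


class MethodWrapper(object):
    """Hypothetical wrapper recording a method's argument names and defaults."""

    def __init__(self, obj, method_name):
        self.method_value = getattr(obj, method_name)
        spec = inspect.getfullargspec(self.method_value)
        self.args = tuple(spec.args)
        # getfullargspec reports "no defaults" as None; store () instead so
        # callers can iterate or test membership without a None check.
        self.defaults = spec.defaults or ()


class Example(object):
    def start_bundle(self):
        pass

    def process(self, element, window='placeholder'):
        pass


no_defaults = MethodWrapper(Example(), 'start_bundle')
with_defaults = MethodWrapper(Example(), 'process')
```

A tuple is a reasonable choice here because the defaults never change after construction.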

def invoke_start_bundle(self, process_output_fn):
defaults = self.signature.start_bundle_method.defaults
defaults = defaults if defaults else []
args = [self.context if d == core.DoFn.ContextParam else d
Contributor

Where is the validation that ContextParam is the only valid pluggable value here?

Contributor Author

We don't have this validation currently, right ? I added this to DoFnSignature class.

default_arg_values = signature.process_method.defaults
self.has_windowed_inputs = (self.has_windowed_inputs or
core.DoFn.WindowParam in defaults)
core.DoFn.WindowParam in default_arg_values)
Contributor

Avoid re-assigning values (here and elsewhere).

Contributor Author

Done.

super(PerWindowInvoker, self).__init__(signature)
self.side_inputs = side_inputs
self.context = context
self.has_windowed_inputs = not all(
Contributor

I wonder if the has_windowed_inputs bit should instead be a third class, not a bit in the second Invoker class.

Contributor Author

Could you explain the rationale for a third class ? I don't see a justification for another class here. I might be missing something.

Contributor

We're branching on this in a couple of places. However, looking at it again, I think that's OK (and the primary branch depends on what's known only at runtime, i.e. the number of actual windows for this element).
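
The runtime branch mentioned here can be pictured roughly as follows. This is a hedged sketch with hypothetical names (`PerWindowSketch`, a namedtuple standing in for WindowedValue), not the PR's implementation: the flag is fixed when the invoker is built, but the window count is only known per element.

```python
from collections import namedtuple

# Hypothetical stand-in for Beam's WindowedValue.
WindowedValue = namedtuple('WindowedValue', ['value', 'windows'])


class PerWindowSketch(object):
    def __init__(self, process_fn, has_windowed_inputs):
        self.process_fn = process_fn
        # Known when the invoker is built (from the signature, side inputs).
        self.has_windowed_inputs = has_windowed_inputs

    def invoke_process(self, windowed_value):
        outputs = []
        # The number of windows is only known per element, at runtime.
        if self.has_windowed_inputs and len(windowed_value.windows) != 1:
            # Window-dependent arguments differ per window: call once each.
            for window in windowed_value.windows:
                outputs.extend(self.process_fn(windowed_value.value, window))
        else:
            windows = windowed_value.windows
            window = windows[0] if windows else None
            outputs.extend(self.process_fn(windowed_value.value, window))
        return outputs


def tag_with_window(value, window):
    return [(value, window)]


invoker = PerWindowSketch(tag_with_window, has_windowed_inputs=True)
one = invoker.invoke_process(WindowedValue('a', ('w1',)))
two = invoker.invoke_process(WindowedValue('a', ('w1', 'w2')))
```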

self._dofn_per_window_invoker(element)
self.logging_context = get_logging_context(logger, step_name=step_name)

# TODO(sourabh): Deprecate the use of context
Contributor

Is this happening by first stable release?

Contributor

This is already done. I can clean it up after the PR is merged

self.defaults = defaults
self._method_value = method_value

def __call__(self, *args, **kwargs):
Contributor

This worries me from a performance standpoint. One of the key reasons to have different invokers (methods) was to avoid the generic overhead for the simple case(s).

Contributor Author

I assume you meant performance could be impacted due to using a callable here ?

I removed __call__ and replaced it with a call() method, which is also more readable.

@chamikaramj
Contributor Author

cc: @jkff

@chamikaramj force-pushed the sdf_direct_runner2 branch 2 times, most recently from 0128f6b to 9c49ed4 on April 20, 2017 00:32
Contributor Author

@chamikaramj left a comment

Thanks. PTAL

@coveralls

Coverage Status

Coverage increased (+0.01%) to 70.146% when pulling 9c49ed4 on chamikaramj:sdf_direct_runner2 into 3ef614c on apache:master.

Contributor

@sb2nov left a comment

Just some small comments now. This makes the code really readable.

cdef Receiver main_receivers
cdef DoFnInvoker do_fn_invoker

cpdef process(self, WindowedValue element)
Contributor

change the variable name here to reflect the python file change

Contributor Author

Done.


def _dofn_simple_invoker(self, element):
self._process_outputs(element, self.dofn_process(element.value))
def invoke_process(self, element, process_output_fn):
Contributor

element -> windowed_value ??

Contributor Author

Done.

class SimpleInvoker(DoFnInvoker):
"""An invoker that processes elements ignoring windowing information."""

def invoke_process(self, element, process_output_fn):
Contributor

element -> windowed_value ??

Contributor

+1

Contributor Author

Done.


cdef DoFnSignature signature

cpdef invoke_process(self, WindowedValue element, process_output_fn)
Contributor

element -> windowed_value ??

Contributor Author

Done.

A DoFnInvoker describes a particular way for invoking methods of a DoFn
represented by a given DoFnSignature."""

def __init__(self, signature):
Contributor

Should keep the constructor across all invoker implementations and then each subclass can choose to use it or not ??

Contributor Author

Not sure if I understood this comment. This constructor is already used by sub-classes.

# Also cache all the placeholders needed in the process function.

# Fill in sideInputs if they are globally windowed

Contributor

remove this empty line as the comment is related

Contributor Author

Done.


global_window = GlobalWindow()

args = input_args if input_args else []
Contributor

Thanks for changing to input_args. I think we can improve it a bit more as it is still confusing to understand what is input_args/args/final_args ??

Contributor Author

Done.


if not kwargs:
self._process_outputs(element, self.dofn_process(*args))
process_output_fn(element, self.signature.process_method.call(args, {}))
Contributor

I don't think this case is possible so we should just pass kwargs directly

Contributor Author

Done.



cdef class SimpleInvoker(DoFnInvoker):
pass
Contributor

invoke_process method here ??

Contributor Author

Done.



cdef class PerWindowInvoker(DoFnInvoker):

Contributor

invoke_process method here ??

Contributor Author

Done.

Contributor

@robertwb left a comment

Please add a microbenchmark that instantiates a DoFnRunner and calls it with trivial DoFns of various signatures and compares timings before and after.
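
A microbenchmark of the kind requested might be shaped like the sketch below. It does not instantiate Beam's DoFnRunner; `IdentityDoFn`, `DirectCallRunner`, and `WrappedCallRunner` are hypothetical stand-ins that only contrast a cached direct call with a call routed through a generic `call(args, kwargs)` wrapper, timed with the standard library's `timeit`.

```python
import timeit


class IdentityDoFn(object):
    """Hypothetical trivial DoFn: returns its input unchanged."""

    def process(self, element):
        return element


class DirectCallRunner(object):
    """Calls a cached bound process method directly."""

    def __init__(self, do_fn):
        self.dofn_process = do_fn.process

    def process(self, element):
        return self.dofn_process(element)


class WrappedCallRunner(object):
    """Routes each call through a wrapper that builds a list and a dict."""

    def __init__(self, do_fn):
        self.method_value = do_fn.process

    def call(self, args, kwargs):
        return self.method_value(*args, **kwargs)

    def process(self, element):
        return self.call([element], {})


def time_runner(runner, num_elements=10000, repeat=5):
    """Returns the best-of-`repeat` wall time for num_elements calls."""
    def run():
        for i in range(num_elements):
            runner.process(i)
    return min(timeit.repeat(run, number=1, repeat=repeat))


direct_seconds = time_runner(DirectCallRunner(IdentityDoFn()))
wrapped_seconds = time_runner(WrappedCallRunner(IdentityDoFn()))
print('direct: %.4fs  wrapped: %.4fs' % (direct_seconds, wrapped_seconds))
```

Taking the minimum over several repeats reduces noise from other processes; absolute numbers will differ between interpreted and Cython-compiled builds.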

"""An invoker that processes elements ignoring windowing information."""

def invoke_process(self, element, process_output_fn):
process_output_fn(element, self.signature.process_method.call(
Contributor

As mentioned, this is performance critical code. "def call" vs __call__ doesn't make much of a difference here, but what does matter is that we're creating a new list containing the single element.value, creating a new empty dictionary, then invoking this via _method_value(*input_args, **input_kwargs). In other words, passing through DoFnMethodWrapper is negating much if not all of the benefits of having a special SimpleInvoker. (We could probably simply eliminate/inline this class altogether. On that note I think it'd be fine for now if start/finish always ran the generic code as long as process was as fast as possible if that kept the code simpler.)

Contributor Author

I could get rid of the extra per-element overhead you mentioned above by caching the process method within invokers. There is still an extra method call compared to the current implementation (calling a method within DoFnInvoker instead of a method within DoFnRunner). But this does not seem to be significant based on results of some benchmarks I ran. I'll post a comment with benchmark results.
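
The caching described in this reply could look something like the following sketch (hypothetical classes, assuming the signature exposes the DoFn's bound process method): the attribute chain is resolved once in the constructor, and the per-element call passes the value positionally with no temporary list or dict.

```python
class Signature(object):
    """Hypothetical signature exposing the DoFn's bound process method."""

    def __init__(self, do_fn):
        self.process_method_value = do_fn.process


class SimpleInvokerSketch(object):
    def __init__(self, signature):
        # Resolve the attribute chain once instead of on every element.
        self.process_method = signature.process_method_value

    def invoke_process(self, value):
        # One positional call: no temporary list, no empty dict, no
        # *args/**kwargs unpacking.
        return self.process_method(value)


class Upper(object):
    def process(self, element):
        return [element.upper()]


invoker = SimpleInvokerSketch(Signature(Upper()))
output = invoker.invoke_process('beam')
```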

Adds following abstractions.

DoFnSignature: describes the signature of a given DoFn object.
DoFnInvoker: defines a particular way for invoking DoFn methods.
Contributor Author

@chamikaramj left a comment

Thanks.

# start_bundle and finish_bundle methods should only have ContextParam as a
# default argument.
self._validate_start_bundle()
self._validate_finish_bundle()
Contributor Author

Done.

We can validate at construction time by building a signature before job submission.

@chamikaramj
Contributor Author

I ran some benchmarks to compare results with and without this PR. See the following document for results.

https://docs.google.com/document/d/1SZa6C3a7EHy9-qE_9QTBRxyN3DR9yenLKLjK2wXNkhw/edit?usp=sharing

Based on these results, this PR does not seem to be adding extra overhead. Please let me know if you want me to run additional experiments.

@chamikaramj
Contributor Author

PTAL.

@sb2nov
Contributor

sb2nov commented Apr 22, 2017

Looks good to me

@robertwb
Contributor

robertwb commented Apr 22, 2017

The benchmarks are probably running uncompiled, and the direct runner does not do fusion (as well as having a lot more overhead than our worker code).

https://gist.github.com/robertwb/5b64829fa91d9a61f55886e5ff6b1f8c shows on average a 30% overhead with this change. Once we have an external worker runner, this'll be easier to test.

@chamikaramj
Contributor Author

Thanks. I'll try to determine the reason for the performance difference here and get back to you.

@coveralls

Coverage Status

Coverage increased (+0.007%) to 70.043% when pulling 49cd48a on chamikaramj:sdf_direct_runner2 into 781c155 on apache:master.

@chamikaramj
Contributor Author

Thanks Robert for creating the benchmark. The values I get when running your benchmark are as follows.

Around 0.2 sec when running without this PR.
Around 0.3 sec when running with this PR.

So, to clarify, the overhead seems to be about 0.1 sec per 10k elements per DoFn, which is within the variation of my experiment (and was not significant when running jobs such as WordCount and BigShuffle, both of which have several ParDo/Map steps, using the Dataflow runner).

| Job | With PR | Without PR |
| --- | --- | --- |
| WordCount 1GB (10 workers) | 11 mins 9 secs | 12 mins 42 sec |
| WordCount 100GB (100 workers) | 2 hours 16 mins | 2 hours 8 mins |
| BigShuffle 100GB (100 workers) | 20 mins 25 sec | 20 mins 22 sec |
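
For scale, the microbenchmark timings quoted in this comment (about 0.2 sec without the PR, about 0.3 sec with it, per 10k elements) work out to roughly 10 microseconds per element, or about a 50% relative increase on that benchmark. The snippet below just spells out that arithmetic; the inputs are the approximate figures from the comment, not new measurements.

```python
elements = 10000          # elements per run, as stated in the comment
without_pr_sec = 0.2      # approximate timing without the PR
with_pr_sec = 0.3         # approximate timing with the PR

overhead_sec = with_pr_sec - without_pr_sec
per_element_us = overhead_sec / elements * 1e6
relative_increase = overhead_sec / without_pr_sec

print('%.1f us per element, %.0f%% relative increase'
      % (per_element_us, relative_increase * 100))
```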

@chamikaramj
Contributor Author

The extra overhead mentioned above for SimpleInvoker seems to come from process invocation in DoFnRunner now calling a method of a DoFnInvoker object instead of directly invoking a cached process method (the overhead of doing invoker.invoke_process() instead of process()). To remove this overhead we'd have to remove the modularity of the invocation logic introduced by this CL. I don't think the amount of overhead observed in Robert's benchmark (0.1 sec per 10k invocations of the process() method) justifies totally removing the modularity/structure of our DoFn invocation logic. Robert, WDYT?

This avoids the overhead of passing around (and calling) a
Python callable, completely resolving the performance regression.
Contributor

@robertwb left a comment

"""
self.args = args
self.defaults = defaults
self.method_value = method_value
Contributor

Nit: order the same as the argument order.

Contributor Author

Done.

"""

def __init__(self, do_fn):
# We add a property here for all methods defined by Beam DoFn features.
Contributor

Omit these unused assignments, they are written to their final values below.

Contributor Author

Done.

cdef public object defaults
cdef public object method_value

cpdef call(self, list args, dict kwargs)
Contributor

I'd just remove this method and inline its two uses.

Contributor Author

Done.

pass

def _validate_bundle_method(self):
assert core.DoFn.ElementParam not in self.start_bundle_method.defaults
Contributor

I wonder if this could be a positive assertion instead, so any *ParamType that's added wouldn't need to get listed here.

Contributor Author

Done.

Updated this to support new params as well.
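
One possible shape for such a positive check is sketched below. It assumes, purely for illustration, that every pluggable placeholder is an instance of a common base class (`ParamPlaceholder`, a hypothetical name); the validation then names only what is allowed, so a placeholder added later is rejected without touching this function.

```python
class ParamPlaceholder(object):
    """Hypothetical base class for all pluggable parameter placeholders."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


CONTEXT_PARAM = ParamPlaceholder('ContextParam')
ELEMENT_PARAM = ParamPlaceholder('ElementParam')

# The only placeholders a bundle method may use as a default value.
ALLOWED_IN_BUNDLE_METHODS = frozenset([CONTEXT_PARAM])


def validate_bundle_defaults(defaults):
    """Raises ValueError if a disallowed placeholder appears in defaults."""
    for default in defaults:
        # Ordinary default values (numbers, strings, ...) are left alone.
        if (isinstance(default, ParamPlaceholder)
                and default not in ALLOWED_IN_BUNDLE_METHODS):
            raise ValueError(
                'Unsupported default for a bundle method: %r' % (default,))


validate_bundle_defaults((CONTEXT_PARAM, 5))  # passes silently
```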

assert isinstance(do_fn, core.DoFn)
self.do_fn = do_fn

def _create_do_fn_method(do_fn, method_name):
Contributor

Perhaps this should just be the DoFnMethodWrapper constructor?

Contributor Author

Done.

input_args: arguments to be used when invoking the process method
input_kwargs: kwargs to be used when invoking the process method.
"""
default_arg_values = signature.process_method.defaults
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a method on signature rather than reaching into its innards?

Contributor Author

I think this is fine since DoFnMethodWrapper is a known interface.

Statically bind output processor.
@sb2nov
Contributor

sb2nov commented Apr 24, 2017

Just a reminder, please squash before merging to make it easier to rollback if need be.

def invoke_start_bundle(self):
"""Invokes the DoFn.start_bundle() method.

Args:
Contributor

Do not remove the Args string

Contributor Author

Removed since it's empty.

def invoke_finish_bundle(self):
"""Invokes the DoFn.finish_bundle() method.

Args:
Contributor

Don't remove Args string

Contributor Author

Removed since it's empty.


self.do_fn_invoker = DoFnInvoker.create_invoker(
do_fn_signature, context, side_inputs, args, kwargs)
self, do_fn_signature, context, side_inputs, args, kwargs)
Contributor

Can we pass just the _process_output function instead of self, so that the dependent class doesn't need to know about the _process_output function?

Contributor

Not unless we want to declare it as a C function pointer (or pay the overhead of calling it as a Python function, which was the improvement here). One could create another object to just hold this method, but that'd probably be overkill. Renaming _process_output now that it's not private could make sense as well.

Contributor

Got it. I think it'll be good to rename it to process_output if we don't want to go down the function pointer route.

Contributor Author

Turns out, memory usage significantly increases when we pass DoFnRunner to DoFnInvoker (as the output_processor) and maintain a reference to it. I suspect this is due to recursive referencing between DoFnInvoker and DoFnRunner. I could fix this by creating a new OutputProcessor class that contains the process_output method. Processing time improvement introduced by Robert's update stays the same.
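
To make the suspected reference cycle concrete, here is a small sketch under stated assumptions. All class names (`CyclicRunner`, `AcyclicRunner`, `Invoker`, `OutputProcessorSketch`) are hypothetical and heavily simplified, and the demonstration relies on CPython's reference counting. It only shows that a runner that hands itself to its invoker forms a cycle that is not freed until the cyclic garbage collector runs, while a separate output-processor object avoids the cycle. It does not prove that this was the cause of the memory increase reported above.

```python
import gc
import weakref


class OutputProcessorSketch(object):
    """Hypothetical holder for the process_outputs method."""

    def __init__(self):
        self.outputs = []

    def process_outputs(self, results):
        self.outputs.extend(results)


class Invoker(object):
    """Hypothetical invoker that keeps a reference to its output processor."""

    def __init__(self, output_processor):
        self.output_processor = output_processor


class CyclicRunner(object):
    """Runner passes itself to the invoker: runner -> invoker -> runner."""

    def __init__(self):
        self.outputs = []
        self.invoker = Invoker(self)

    def process_outputs(self, results):
        self.outputs.extend(results)


class AcyclicRunner(object):
    """Runner passes a separate object instead, so no cycle is formed."""

    def __init__(self):
        self.output_processor = OutputProcessorSketch()
        self.invoker = Invoker(self.output_processor)


def freed_without_gc(factory):
    """Returns True if the object is freed by reference counting alone."""
    was_enabled = gc.isenabled()
    gc.disable()  # Keep the cyclic collector from running during the check.
    try:
        obj = factory()
        ref = weakref.ref(obj)
        del obj
        return ref() is None
    finally:
        if was_enabled:
            gc.enable()


print(freed_without_gc(CyclicRunner), freed_without_gc(AcyclicRunner))
```

On CPython, the cyclic variant stays alive after its last external reference is dropped until a collection pass happens, whereas the acyclic variant is released immediately.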

Contributor Author

@chamikaramj chamikaramj left a comment

Thanks. PTAL.

(I'll squash commits before merging. Leaving as separate commits for now for ease of reviewing).

Contributor

@sb2nov sb2nov left a comment

LGTM apart from two minor comments.

cdef public DoFnMethodWrapper start_bundle_method
cdef public DoFnMethodWrapper finish_bundle_method
cdef public object do_fn

Contributor

Do the validate functions need to be mentioned here?

Contributor Author

Done.

Contributor

There's no need to make these cdef methods.

Contributor Author

Done. So, private methods should not be defined in pxd files?

i.endswith('Param') and i != 'ContextParam')]

for param in unsupported_dofn_params:
assert param not in self.start_bundle_method.defaults
Contributor

This check can cover both the start and finish bundle methods.

Contributor Author

Done

Contributor Author

@chamikaramj chamikaramj left a comment

Thanks.

@chamikaramj
Contributor Author

Robert, PTAL.

@coveralls

Coverage Status

Coverage decreased (-0.03%) to 70.003% when pulling 60e3977 on chamikaramj:sdf_direct_runner2 into 781c155 on apache:master.

Contributor

@robertwb robertwb left a comment

Just some minor comments, but LGTM. Thanks!

self._validate_finish_bundle()
self._validate_process()

def _validate_start_bundle(self):
Contributor

I would just inline these three private methods.

Contributor Author

Done.

i.endswith('Param') and i != 'ContextParam')]

for param in unsupported_dofn_params:
assert param not in method_wrapper.defaults
Contributor

Here you're implicitly using the fact that XxxParam is both named XxxParam and has value "XxxParam." At least call this out.

Contributor Author

Done.
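
The validation discussed in this thread can be sketched as follows. This is an illustration rather than the merged code: `FakeDoFn` and the helper names are hypothetical, and the marker values are placeholders. Unlike the snippet under review, the sketch looks up each marker's value with `getattr` instead of relying on the attribute name being equal to its value, which is the implicit assumption called out above.

```python
class FakeDoFn(object):
    """Hypothetical stand-in for core.DoFn with string-valued param markers."""
    ElementParam = 'ElementParam'
    ContextParam = 'ContextParam'
    WindowParam = 'WindowParam'
    SideInputParam = 'SideInputParam'


def unsupported_bundle_params(do_fn_class):
    # Every *Param marker except ContextParam is treated as unsupported in
    # bundle methods, so newly added markers are picked up automatically.
    return [getattr(do_fn_class, name) for name in dir(do_fn_class)
            if name.endswith('Param') and name != 'ContextParam']


def find_unsupported_defaults(do_fn_class, defaults):
    # Returns the unsupported markers that appear among a method's defaults.
    return [param for param in unsupported_bundle_params(do_fn_class)
            if param in defaults]


def validate_bundle_method(do_fn_class, defaults):
    offending = find_unsupported_defaults(do_fn_class, defaults)
    assert not offending, (
        'Unsupported params in bundle method: %s' % offending)


# The same check applies to the defaults of both bundle methods.
validate_bundle_method(FakeDoFn, ['ContextParam'])
```

Because the list of unsupported markers is derived from the class itself, the same helper can be called once for the start bundle method's defaults and once for the finish bundle method's defaults.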

Contributor Author

@chamikaramj chamikaramj left a comment

Thanks.

@asfgit asfgit closed this in 0094699 Apr 26, 2017
@robertwb
Contributor

Done. So, private methods should not be defined in pxd files?

The distinction is that only modified methods (or classes) need to be defined in pxd files (e.g. designating that this is a cdef, rather than def, method, and typing its parameters). These validate methods are not performance critical, so leaving them as "ordinary" methods is fine (and less verbose).

@chamikaramj
Contributor Author

I see. Thanks for the explanation.
