[BEAM-3042] Add tracking of bytes read / time spent when reading side inputs #3943

pabloem · 2017-10-04T22:56:50Z

Testing changes to track msecs and bytes spent while reading from side inputs.

coveralls · 2017-10-05T19:49:48Z

Changes Unknown when pulling 6e2c396 on pabloem:sicounters into apache:master.

pabloem · 2017-10-06T17:15:35Z

@chamikaramj
Hi Cham. I'm prototyping this change to go in fairly soon. I want to track bytes read from side inputs and msecs blocked waiting for side inputs. Could you take a look?

chamikaramj · 2017-10-06T17:26:44Z

cc: @charlesccychen

pabloem · 2017-10-10T22:18:21Z

Cham has advised me to find a reviewer that is more familiar with side inputs.
r: @charlesccychen could you take a look?

A quick summary of the change:
In the Insights team we're working on a project to keep track of time spent / bytes read from side inputs, so that users may be better informed of bottlenecks in their pipelines.
For this, we track the msecs spent blocked waiting for side inputs, that is, only the time during which the consumer finds the element queue already empty and has to wait for the next element.

An interesting and perhaps counter-intuitive scenario that we're planning for is shown in the following dummy example:

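import apache_beam as beam

si_iterable = beam.pvalue.AsIter(p | 'side' >> beam.Create(long_list))

def emit_side_input(e, si_iter):
  # Map emits the return value as a single element, so return the iterable.
  return si_iter

pcoll = (
    p
    | 'main' >> beam.Create([0])
    | 'step1' >> beam.Map(emit_side_input, si_iter=si_iterable)
    | 'step2' >> beam.FlatMap(lambda x: list(x)))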

In this case, step1 emits the iterable, and step2 is the one that actually may be blocked waiting for side inputs. That's why we need to check the current step when the thread blocks.
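A minimal sketch of that idea, outside of Beam: time is recorded only when the queue is found empty, and the step to charge is looked up at the moment of blocking. `BlockedTimeTracker`, `timed_get`, and `current_step_fn` are hypothetical names made up for this illustration, not the code in this PR.

```python
import queue
import threading
import time


class BlockedTimeTracker(object):
  """Hypothetical helper: accumulates msecs spent blocked, keyed by step."""

  def __init__(self):
    self.msecs_by_step = {}

  def add(self, step, msecs):
    self.msecs_by_step[step] = self.msecs_by_step.get(step, 0) + msecs


def timed_get(element_queue, tracker, current_step_fn):
  """Gets one element, recording wait time only if the queue was empty."""
  if not element_queue.empty():
    # Data is already available, so no blocked time is recorded.
    return element_queue.get()
  # Look up the step now, not when the iterable was created: the consumer
  # (step2 in the example above) may differ from the step that emitted it.
  step = current_step_fn()
  start = time.monotonic()
  element = element_queue.get()
  tracker.add(step, int((time.monotonic() - start) * 1000))
  return element


q = queue.Queue()
tracker = BlockedTimeTracker()
# A producer that delivers one element slightly later, forcing a wait.
threading.Timer(0.05, q.put, args=['x']).start()
value = timed_get(q, tracker, lambda: 'step2')
```

Here the wait is charged to 'step2' because that is the step running when the `get()` blocks, regardless of which step first received the side input.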

I'd like a review of the approach, and I'll come back with benchmark results soon. Hope it's not much trouble, Charles. Thanks.
Best
-P.

chamikaramj

Some comments.

Could you create a JIRA with appropriate context?

chamikaramj · 2017-10-10T22:35:55Z

sdks/python/apache_beam/runners/worker/sideinputs.py

@@ -78,29 +82,51 @@ def _start_reader_threads(self):
      t.start()
      self.reader_threads.append(t)

+  def _get_source_position(self, range_tracker=None, reader=None):
+    if reader:
+      return reader.get_progress().position.byte_offset


What about side input sources that do not have byte offsets?

I need Charles or Robert to weigh in here. As I understand it, side inputs always use Avro files, so they are always byte-offset-based sources. Is that correct? @charlesccychen

I think they might now, but we should verify the case where one uses the result of a read (e.g. Create or ReadTextIO) directly as a side input.

Even if it is the case, best to assert this assumption explicitly somewhere.
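One way to make the byte-offset assumption explicit is a small guard at the point where positions are read. The sketch below is illustrative only: `require_byte_offset` and `consumed_bytes` are hypothetical helpers, not existing Beam functions, and the choice to raise (rather than silently count nothing) is an assumption.

```python
import numbers


def require_byte_offset(position):
  """Returns position unchanged if it looks like a byte offset, else raises.

  Hypothetical helper; not part of apache_beam.
  """
  # bool is a subclass of int, so reject it explicitly.
  if isinstance(position, bool) or not isinstance(position, numbers.Integral):
    raise TypeError(
        'Side input position must be an integer byte offset, got %r'
        % (position,))
  if position < 0:
    raise ValueError('Byte offset must be non-negative, got %d' % position)
  return position


def consumed_bytes(initial_position, current_position):
  """Computes bytes consumed between two validated byte offsets."""
  return (require_byte_offset(current_position)
          - require_byte_offset(initial_position))
```

With a guard like this, a source whose positions are keys or fractions fails loudly instead of producing a meaningless byte count.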

chamikaramj · 2017-10-10T22:35:55Z

sdks/python/apache_beam/runners/worker/sideinputs.py

    self.sources = sources
    self.num_reader_threads = min(max_reader_threads, len(self.sources))
+    self.read_counter = read_counter or opcounters.TransformIoCounter()
+    # self.read_counter = opcounters.TransformIoCounter()


Why is this commented out?

This lets me test a no-op counter against the real implementation. It will be removed before merging.

chamikaramj · 2017-10-10T22:35:55Z

sdks/python/apache_beam/runners/worker/opcounters.py

@@ -42,6 +43,58 @@ def value(self):
    return self._value


+class TransformIoCounter(object):


Can you add documentation? Also, is this a user-facing interface? If so, this should be discussed more broadly.

This won't be a user-facing interface. It's meant to be used by the IO infrastructure classes. I'll add documentation in a bit.
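As a rough illustration of what a documented, internal-only counter interface could look like, here is a sketch of a no-op base class plus one subclass. The names `NoOpIoCounter`, `InMemoryIoCounter`, `add_bytes_read`, and `read_all` are assumptions made for this example; the `TransformIoCounter` in this PR may be shaped differently.

```python
class NoOpIoCounter(object):
  """Internal interface for tracking IO done on behalf of a transform.

  Not user-facing. The base class records nothing, so IO infrastructure
  code can always call it without checking for None. Illustrative only.
  """

  def add_bytes_read(self, count):
    """Reports that `count` more bytes were read. No-op by default."""
    pass


class InMemoryIoCounter(NoOpIoCounter):
  """Keeps a running total in memory; useful for tests and benchmarks."""

  def __init__(self):
    self.bytes_read = 0

  def add_bytes_read(self, count):
    self.bytes_read += count


def read_all(chunks, read_counter=None):
  """Consumes byte chunks, reporting each chunk's size to the counter."""
  # Fall back to the no-op counter so the read path has a single code path.
  read_counter = read_counter or NoOpIoCounter()
  collected = []
  for chunk in chunks:
    read_counter.add_bytes_read(len(chunk))
    collected.append(chunk)
  return b''.join(collected)
```

Swapping the default between the two classes is what makes a no-op versus real-implementation comparison straightforward.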

chamikaramj · 2017-10-10T22:35:55Z

sdks/python/apache_beam/runners/worker/sideinputs.py

+      return reader.get_progress().position.byte_offset
+    else:
+      return range_tracker.position_at_fraction(
+          range_tracker.fraction_consumed()) if range_tracker else 0


Is it valid to return 0 here, or should that case be an error?

chamikaramj · 2017-10-10T22:35:55Z

sdks/python/apache_beam/runners/worker/sideinputs.py

@@ -78,29 +82,51 @@ def _start_reader_threads(self):
      t.start()
      self.reader_threads.append(t)

+  def _get_source_position(self, range_tracker=None, reader=None):


When does this get a range_tracker vs. a reader? Please clarify with a comment.

+1. Is there a way to avoid this either/or altogether?
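One possible way to avoid the either/or signature is to wrap each kind of source in a small adapter exposing a single `position()` method, so the caller never branches. This is only a sketch: `ReaderPosition` and `RangeTrackerPosition` are hypothetical names, and the stand-in objects below merely mimic the two call chains visible in the diff.

```python
from types import SimpleNamespace


class ReaderPosition(object):
  """Adapter for sources that report progress through a reader."""

  def __init__(self, reader):
    self._reader = reader

  def position(self):
    return self._reader.get_progress().position.byte_offset


class RangeTrackerPosition(object):
  """Adapter for sources that report progress through a range tracker."""

  def __init__(self, range_tracker):
    self._range_tracker = range_tracker

  def position(self):
    rt = self._range_tracker
    return rt.position_at_fraction(rt.fraction_consumed())


# Stand-ins for demonstration; real readers and range trackers come from Beam.
fake_reader = SimpleNamespace(
    get_progress=lambda: SimpleNamespace(
        position=SimpleNamespace(byte_offset=128)))
fake_tracker = SimpleNamespace(
    fraction_consumed=lambda: 0.5,
    position_at_fraction=lambda fraction: int(fraction * 1000))

positions = [ReaderPosition(fake_reader).position(),
             RangeTrackerPosition(fake_tracker).position()]
```

The adapter is chosen once, where the source is opened, which also gives a natural place to document when each kind applies.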

chamikaramj · 2017-10-10T22:35:56Z

sdks/python/apache_beam/runners/worker/operations.py

+          # Inputs are 1-indexed, so we add 1 to i in the side input id
+          counters.side_input_id(self.operation_name, i+1))
+      iterator_fn = sideinputs.get_iterator_fn_for_sources(
+          sources, read_counter=si_counter)


Do we want to use the same counter for all the sources here?

Yes. A single side input may have different sources, but we want to track bytes/msecs for the side input, not per-source.
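A small sketch of what per-side-input aggregation implies when several reader threads share one counter: updates need to be safe under concurrency. `SharedBytesCounter` and `read_source` are hypothetical names for this illustration, and the lock is an assumption about how one might guard the total, not a description of this PR.

```python
import threading


class SharedBytesCounter(object):
  """Hypothetical thread-safe accumulator shared by all reader threads."""

  def __init__(self):
    self._lock = threading.Lock()
    self._bytes_read = 0

  def add_bytes_read(self, count):
    with self._lock:
      self._bytes_read += count

  @property
  def bytes_read(self):
    with self._lock:
      return self._bytes_read


def read_source(source, counter):
  """Reads every record of one source, reporting sizes to the counter."""
  for record in source:
    counter.add_bytes_read(len(record))


# Three sources backing one side input, read by one thread each.
sources = [[b'aa', b'bbb'], [b'c'], [b'dddd', b'ee']]
counter = SharedBytesCounter()
threads = [threading.Thread(target=read_source, args=(s, counter))
           for s in sources]
for t in threads:
  t.start()
for t in threads:
  t.join()
```

After all threads finish, `counter.bytes_read` holds the total for the side input as a whole, with no per-source breakdown.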

chamikaramj · 2017-10-10T22:37:24Z

R: @robertwb or @charlesccychen

robertwb

Some initial comments.

robertwb · 2017-11-20T23:16:28Z

sdks/python/apache_beam/runners/worker/sideinputs.py

@@ -78,29 +82,51 @@ def _start_reader_threads(self):
      t.start()
      self.reader_threads.append(t)

+  def _get_source_position(self, range_tracker=None, reader=None):
+    if reader:
+      return reader.get_progress().position.byte_offset


I think they might now, but we should verify the case where one uses the result of a read (e.g. Create or ReadTextIO) directly as a side input.

Even if it is the case, best to assert this assumption explicitly somewhere.

robertwb · 2017-11-20T23:17:27Z

sdks/python/apache_beam/runners/worker/sideinputs.py

@@ -78,29 +82,51 @@ def _start_reader_threads(self):
      t.start()
      self.reader_threads.append(t)

+  def _get_source_position(self, range_tracker=None, reader=None):


+1. Is there a way to avoid this either/or altogether?

robertwb · 2017-11-20T23:20:33Z

sdks/python/apache_beam/runners/worker/opcounters.py

+    self._state_sampler = state_sampler
+    self._bytes_read_cache = 0
+    self.io_target = io_target
+    self.check_step()


What does this do? Documentation on the parent class would be helpful even if it's not user-facing. The implementation below doesn't look like it's checking stuff.

robertwb · 2017-11-20T23:22:21Z

sdks/python/apache_beam/runners/worker/opcounters.py

+  def __exit__(self, unused_exc_type, unused_exc_value, unused_traceback):
+    self.exit()
+
+  def enter(self):


Are these needed in addition to __enter__ and __exit__?
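For reference, the `with` statement only requires `__enter__` and `__exit__`; separate `enter`/`exit` methods matter only if some caller cannot use a `with` block. A minimal sketch, using a hypothetical `BlockedTimer` class rather than the counter from this PR:

```python
import time


class BlockedTimer(object):
  """Hypothetical timer that accumulates msecs spent inside `with` blocks."""

  def __init__(self):
    self.total_msecs = 0
    self._start = None

  def __enter__(self):
    self._start = time.monotonic()
    return self

  def __exit__(self, unused_exc_type, unused_exc_value, unused_traceback):
    self.total_msecs += int((time.monotonic() - self._start) * 1000)
    self._start = None
    # Returning None (falsy) lets any exception from the block propagate.


timer = BlockedTimer()
with timer:
  time.sleep(0.01)
```

The same object can be re-entered repeatedly, and `total_msecs` keeps accumulating across blocks.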

robertwb · 2017-11-20T23:25:22Z

sdks/python/apache_beam/runners/worker/sideinputs.py

@@ -128,7 +154,14 @@ def __iter__(self):
    num_readers_finished = 0
    try:
      while True:
-        element = self.element_queue.get()
+        if self.element_queue.empty():
+          # The queue is empty. We check the current state.


I'm not following the relationship here.

robertwb · 2017-11-20T23:26:45Z

sdks/python/apache_beam/runners/worker/sideinputs.py

              returns_windowed_values = reader.returns_windowed_values
              for value in reader:
                if self.has_errored:
-                  # If any reader has errored, just return.
+                  # If any reader has errored, just return.`


Extra backtick.

robertwb · 2017-11-20T23:28:15Z

sdks/python/apache_beam/runners/worker/sideinputs.py

              if self.has_errored:
                # If any reader has errored, just return.
                return
+
+              current_position = self._get_source_position(range_tracker=rt)
+              consumed_bytes = current_position - initial_position


Do we just assume position is in bytes, and can be subtracted?

chamikaramj reviewed Oct 10, 2017

pabloem changed the title ~~Adding tracking for bytes and msecs spent while reading from side inputs~~ [BEAM-3042] Add tracking of bytes read / time spent when reading side inputs Oct 10, 2017

pabloem added 2 commits October 18, 2017 11:44

Adding tracking for bytes and msecs spent while reading from side inputs

1339474

Fixing lint issues

902ca7f

pabloem force-pushed the sicounters branch from 6e2c396 to 902ca7f Compare October 18, 2017 18:44

robertwb reviewed Nov 20, 2017

pabloem closed this Dec 11, 2017
