
# Custom inputs - BoundedSource

`ParDo` is the recommended way to implement a custom input, since implementing a `BoundedSource` can be tricky.

To learn more about when to use a `BoundedSource`, see [When to use the Source interface](https://beam.apache.org/documentation/io/developing-io-overview/index.html#when-to-use-source).

To learn more about ParDo for custom inputs, see [Custom Inputs - ParDo](https://colab.research.google.com/drive/1r-o2QJ-D-I0TV4NIQR6pAN2t4Pxt2-Ey).

# Setup

First, let's install `apache-beam`.

In [0]:
# Run and print a shell command.
def run(cmd):
  print('>> {}'.format(cmd))
  !{cmd}
  print('')

# Install apache-beam.
run('pip install --quiet apache-beam')

>> pip install --quiet apache-beam



# Example: RangeSource

In this example, we create a `BoundedSource` that behaves like Python's built-in [`range`](https://docs.python.org/2.7/library/functions.html#range) function.

First, let's see how to divide a range into bundles of at most `bundle_size` positions each.

In [0]:
start = 0
stop = 42
bundle_size = 10

for bundle_start in range(start, stop, bundle_size):
  bundle_stop = min(stop, bundle_start + bundle_size)
  print('{:>2} - {:>2}'.format(bundle_start, bundle_stop))

 0 - 10
10 - 20
20 - 30
30 - 40
40 - 42
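
The loop above assumes a step of 1. When the range has a step, bundle boundaries need not coincide with elements, so each bundle has to begin reading at the first element at or after its start. The following is a minimal sketch in plain Python; `aligned_bundles` is a hypothetical helper written for illustration, not part of Beam, and it assumes a positive `step`.

```python
def aligned_bundles(start, stop, step, bundle_size):
  """Yield (bundle_start, bundle_stop, elements) for a stepped range.

  Hypothetical helper for illustration only; not part of Apache Beam.
  Assumes step > 0.
  """
  for bundle_start in range(start, stop, bundle_size):
    bundle_stop = min(stop, bundle_start + bundle_size)
    # Distance from bundle_start to the next element of
    # range(start, stop, step); zero if bundle_start is itself an element.
    offset = (start - bundle_start) % step
    elements = list(range(bundle_start + offset, bundle_stop, step))
    yield bundle_start, bundle_stop, elements


bundles = list(aligned_bundles(start=0, stop=42, step=4, bundle_size=10))
for bundle_start, bundle_stop, elements in bundles:
  print('{:>2} - {:>2}: {}'.format(bundle_start, bundle_stop, elements))
```

Taken together, the bundles contain exactly the elements of `range(0, 42, 4)`, with no element read twice and none skipped.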


Below is what `RangeSource` looks like. We are using [`OffsetRangeTracker`](https://beam.apache.org/releases/pydoc/current/apache_beam.io.range_trackers.html?highlight=range%20tracker#apache_beam.io.range_trackers.OffsetRangeTracker) in this example, but more complex sources may need to [implement their own `RangeTracker`](https://beam.apache.org/documentation/io/developing-io-python/#implementing-the-rangetracker-subclass).

In [0]:
import apache_beam as beam
import logging

class RangeSource(beam.io.iobase.BoundedSource):
  def __init__(self, start, stop, step=1):
    self.start = start
    self.stop = stop
    self.step = step

  def estimate_size(self):
    # Beam expects an estimate in bytes; the element count is a rough proxy.
    return len(range(self.start, self.stop, self.step))

  def get_range_tracker(self, start, stop):
    if start is None:
      start = self.start
    if stop is None:
      stop = self.stop
    return beam.io.range_trackers.OffsetRangeTracker(start, stop)

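  def read(self, range_tracker):
    start = range_tracker.start_position()
    stop = range_tracker.stop_position()
    # A bundle may start between two elements, so skip ahead to the first
    # element of the range at or after the bundle's start (assumes step > 0).
    start += (self.start - start) % self.step
    for i in range(start, stop, self.step):
      # Claim each position before emitting it; a failed claim ends the read.
      if not range_tracker.try_claim(i):
        return
      yield i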

  def split(self, bundle_size, start=None, stop=None):
    if start is None:
      start = self.start
    if stop is None:
      stop = self.stop

    for bundle_start in range(start, stop, bundle_size):
      bundle_stop = min(stop, bundle_start + bundle_size)
      yield beam.io.iobase.SourceBundle(
          weight=bundle_stop - bundle_start,
          source=self,
          start_position=bundle_start,
          stop_position=bundle_stop,
      )

# Running locally in the DirectRunner.
with beam.Pipeline() as pipeline:
  (
      pipeline
      | 'Read from RangeSource' >> beam.io.Read(RangeSource(10, 100, 10))
      | 'Inspect elements' >> beam.Map(logging.warning)
  )
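
In `read`, every position is claimed with `try_claim` before its element is emitted. The tracker refuses a claim at or past its stop position, which is how a bundle can be shrunk while it is being read. The sketch below imitates that contract in plain Python. `ToyOffsetTracker` and `read_range` are hypothetical stand-ins for illustration only; they are not Beam's `OffsetRangeTracker`, which does considerably more.

```python
class ToyOffsetTracker(object):
  """Simplified, hypothetical stand-in for an offset-based range tracker."""

  def __init__(self, start, stop):
    self.start = start
    self.stop = stop
    self.last_claimed = None

  def try_claim(self, position):
    # A reader must never move backwards.
    if self.last_claimed is not None and position < self.last_claimed:
      raise ValueError('positions must not be claimed out of order')
    # A claim at or past the stop position fails: the reader must stop.
    if position >= self.stop:
      return False
    self.last_claimed = position
    return True


def read_range(tracker, step=1):
  """Yield every successfully claimed position, like `read` does."""
  position = tracker.start
  while tracker.try_claim(position):
    yield position
    position += step


tracker = ToyOffsetTracker(10, 100)
elements = []
for element in read_range(tracker, step=10):
  elements.append(element)
  if element == 40:
    # Imitate the range being shrunk while the read is in progress.
    tracker.stop = 60

print(elements)
```

Because the reader asks before emitting each element, lowering the stop position mid-read simply makes the next out-of-range claim fail, and the positions from 60 onward are left for another reader.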



We can also wrap the source in a `PTransform`, so that pipelines can read from it without calling `beam.io.Read` directly.

In [0]:
class ReadFromRangeSource(beam.PTransform):
  def __init__(self, start, stop, step=1):
    super(ReadFromRangeSource, self).__init__()
    self.start = start
    self.stop = stop
    self.step = step

  def expand(self, pcollection):
    return (
      pcollection
      | 'RangeSource' >> beam.io.Read(RangeSource(self.start, self.stop, self.step))
    )

# Running locally in the DirectRunner.
with beam.Pipeline() as pipeline:
  (
      pipeline
      | 'Read from RangeSource' >> ReadFromRangeSource(10, 100, 10)
      | 'Inspect elements' >> beam.Map(logging.warning)
  )

