<a href="https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/io/custom-outputs-pardo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Custom outputs - ParDo

Implementing a full Sink can be tricky, so the recommended way to write a custom output is with a `ParDo`.

# Setup

First, let's install `apache-beam`.

In [0]:
# Run and print a shell command.
def run(cmd):
  print('>> {}'.format(cmd))
  !{cmd}
  print('')

# Install apache-beam.
run('pip install --quiet apache-beam')

>> pip install --quiet apache-beam



# Example: Write to files

A PCollection might contain more elements than fit into the memory of a single machine, so it's a good idea to break it into batches and deal with each batch independently. To keep things simple, we'll create one file per batch.

A very simple batching strategy is to just assign each element to a random batch.
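To see what random assignment does before running a pipeline, here is a minimal plain-Python sketch. The helper `assign_random_batches` and its `seed` parameter are hypothetical names introduced only for illustration; they are not part of Beam.

```python
import random
from collections import defaultdict

def assign_random_batches(elements, num_batches, seed=None):
    """Group elements by a randomly chosen batch number in [1, num_batches].

    Hypothetical helper for illustration only; not a Beam API.
    """
    rng = random.Random(seed)  # Local RNG, so global random state is untouched.
    batches = defaultdict(list)
    for element in elements:
        batches[rng.randint(1, num_batches)].append(element)
    return dict(batches)

batches = assign_random_batches(range(10), num_batches=3, seed=42)
for batch_idx, values in sorted(batches.items()):
    print(batch_idx, values)
```

Note that the batch sizes are usually uneven, and a batch number that never gets picked produces no batch at all.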

In [0]:
run('rm -rf outputs')

>> rm -rf outputs



In [0]:
import apache_beam as beam
import os
import random

if not os.path.exists('outputs'):
  os.makedirs('outputs')

def write_to_file(key_value):
  batch_idx, values = key_value
  with open('outputs/part-{}'.format(batch_idx), 'w') as f:
    for value in values:
      f.write('{}\n'.format(value))

# Running locally in the DirectRunner.
with beam.Pipeline() as pipeline:
  (
      pipeline
      | 'Create inputs' >> beam.Create([i for i in range(10)])
      | 'Set key to a randomized batch number' >> beam.Map(
          lambda element: (random.randint(1, 3), element))
      | 'Group into batches' >> beam.GroupByKey()
      | 'Write to files' >> beam.ParDo(write_to_file)
  )

# Check the outputs.
!ls -lh outputs/
!head outputs/part*

total 12K
-rw-r--r-- 1 root root  6 Jan 30 00:41 part-1
-rw-r--r-- 1 root root 12 Jan 30 00:41 part-2
-rw-r--r-- 1 root root  2 Jan 30 00:41 part-3
==> outputs/part-1 <==
0
1
5

==> outputs/part-2 <==
2
3
4
7
8
9

==> outputs/part-3 <==
6


However, this requires us to know in advance how many batches to split the data into, which might not be possible in a streaming scenario.

Another option is to buffer elements as they arrive and yield the buffer as a batch once it reaches the desired size.
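The buffering logic itself can be sketched in plain Python, independent of Beam. The generator `batch_elements` is a hypothetical name used only for illustration; in a `DoFn`, the final flush of a partial batch would happen in `finish_bundle` instead of after the loop.

```python
def batch_elements(elements, batch_size):
    """Yield (batch_idx, batch) pairs with at most batch_size elements each.

    Hypothetical helper for illustration only; not a Beam API.
    """
    buffer = []
    batch_idx = 0
    for element in elements:
        buffer.append(element)
        if len(buffer) == batch_size:
            # The buffer is full: emit it and start a new batch.
            yield batch_idx, buffer
            buffer = []
            batch_idx += 1
    if buffer:
        # Emit the leftover elements as a final, smaller batch.
        yield batch_idx, buffer

print(list(batch_elements(range(10), 3)))
# prints [(0, [0, 1, 2]), (1, [3, 4, 5]), (2, [6, 7, 8]), (3, [9])]
```

Unlike random assignment, this approach does not need the number of batches up front; only the batch size is fixed.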

In [0]:
run('rm -rf outputs')

>> rm -rf outputs



In [0]:
import apache_beam as beam
import os

if not os.path.exists('outputs'):
  os.makedirs('outputs')

def write_to_file(key_value):
  batch_idx, values = key_value
  with open('outputs/part-{}'.format(batch_idx), 'w') as f:
    for value in values:
      f.write('{}\n'.format(value))

class GroupIntoBatches(beam.DoFn):
  def __init__(self, n):
    self.n = n
    self.buffer = []
    self.batch_idx = 0

  def process(self, element):
    self.buffer.append(element)
    if len(self.buffer) == self.n:
      yield self.batch_idx, list(self.buffer)
      self.buffer = []
      self.batch_idx += 1

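  def finish_bundle(self):
    # Flush the last, possibly partial, batch, then reset the state so a
    # reused DoFn instance does not emit the same elements again.
    if len(self.buffer) != 0:
      value = self.batch_idx, list(self.buffer)
      self.buffer = []
      self.batch_idx += 1
      yield beam.utils.windowed_value.WindowedValue(value, -1, [])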

# Running locally in the DirectRunner.
with beam.Pipeline() as pipeline:
  (
      pipeline
      | 'Create inputs' >> beam.Create([i for i in range(10)])
      | 'Group into batches' >> beam.ParDo(GroupIntoBatches(3))
      | 'Write to files' >> beam.ParDo(write_to_file)
  )

# Check the outputs.
!ls -lh outputs/
!head outputs/part*

total 16K
-rw-r--r-- 1 root root 6 Jan 30 00:52 part-0
-rw-r--r-- 1 root root 6 Jan 30 00:52 part-1
-rw-r--r-- 1 root root 6 Jan 30 00:52 part-2
-rw-r--r-- 1 root root 2 Jan 30 00:52 part-3
==> outputs/part-0 <==
0
1
2

==> outputs/part-1 <==
3
4
5

==> outputs/part-2 <==
6
7
8

==> outputs/part-3 <==
9
