<a href="https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/documentation/transforms/python/elementwise/pardo-py.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/></a>

<table align="left"><td><a target="_blank" href="https://beam.apache.org/documentation/transforms/python/elementwise/pardo"><img src="https://beam.apache.org/images/logos/full-color/name-bottom/beam-logo-full-color-name-bottom-100.png" width="32" height="32" />View the docs</a></td></table>

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License")
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# ParDo


<table align="left" style="margin-right:1em">
  <td>
    <a class="button" target="_blank" href="https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.ParDo"><img src="https://beam.apache.org/images/logos/sdks/python.png" width="32px" height="32px" alt="Pydoc"/> Pydoc</a>
  </td>
</table>

<br/><br/><br/>

A transform for generic parallel processing.
A `ParDo` transform considers each element in the input `PCollection`,
performs some processing function (your user code) on that element,
and emits zero or more elements to an output `PCollection`.

See more information in the
[Beam Programming Guide](https://beam.apache.org/documentation/programming-guide/#pardo).

## Setup

To run a code cell, you can click the **Run cell** button at the top left of the cell,
or select it and press **`Shift+Enter`**.
Try modifying a code cell and re-running it to see what happens.

> To learn more about Colab, see
> [Welcome to Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb).

First, let's install the `apache-beam` module.

In [None]:
!pip install --quiet -U apache-beam

## Examples

In the following examples, we explore how to create custom `DoFn`s and access
the timestamp and windowing information.

### Example 1: ParDo with a simple DoFn

The following example defines a simple `DoFn` class called `SplitWords`
which stores the `delimiter` as an object field.
The `process` method is called once per element,
and it can yield zero or more output elements.

In [None]:
import apache_beam as beam

class SplitWords(beam.DoFn):
  def __init__(self, delimiter=','):
    self.delimiter = delimiter

  def process(self, text):
    for word in text.split(self.delimiter):
      yield word

with beam.Pipeline() as pipeline:
  plants = (
      pipeline
      | 'Gardening plants' >> beam.Create([
          '🍓Strawberry,🥕Carrot,🍆Eggplant',
          '🍅Tomato,🥔Potato',
      ])
      | 'Split words' >> beam.ParDo(SplitWords(','))
      | beam.Map(print))

<table align="left" style="margin-right:1em">
  <td>
    <a class="button" target="_blank" href="https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/pardo.py"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" width="32px" height="32px" alt="View source code"/> View source code</a>
  </td>
</table>

<br/><br/><br/>

### Example 2: ParDo with timestamp and window information

In this example, we add new parameters to the `process` method to bind parameter values at runtime.

* [`beam.DoFn.TimestampParam`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.DoFn.TimestampParam)
  binds the timestamp information as an
  [`apache_beam.utils.timestamp.Timestamp`](https://beam.apache.org/releases/pydoc/current/apache_beam.utils.timestamp.html#apache_beam.utils.timestamp.Timestamp)
  object.
* [`beam.DoFn.WindowParam`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.DoFn.WindowParam)
  binds the window information as the appropriate
  [`apache_beam.transforms.window.*Window`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.window.html)
  object.

In [None]:
import apache_beam as beam

class AnalyzeElement(beam.DoFn):
  def process(
      self,
      elem,
      timestamp=beam.DoFn.TimestampParam,
      window=beam.DoFn.WindowParam):
    yield '\n'.join([
        '# timestamp',
        'type(timestamp) -> ' + repr(type(timestamp)),
        'timestamp.micros -> ' + repr(timestamp.micros),
        'timestamp.to_rfc3339() -> ' + repr(timestamp.to_rfc3339()),
        'timestamp.to_utc_datetime() -> ' + repr(timestamp.to_utc_datetime()),
        '',
        '# window',
        'type(window) -> ' + repr(type(window)),
        'window.start -> {} ({})'.format(
            window.start, window.start.to_utc_datetime()),
        'window.end -> {} ({})'.format(
            window.end, window.end.to_utc_datetime()),
        'window.max_timestamp() -> {} ({})'.format(
            window.max_timestamp(), window.max_timestamp().to_utc_datetime()),
    ])

with beam.Pipeline() as pipeline:
  dofn_params = (
      pipeline
      | 'Create a single test element' >> beam.Create([':)'])
      | 'Add timestamp (Spring equinox 2020)' >> beam.Map(
          lambda elem: beam.window.TimestampedValue(elem, 1584675660))
      | 'Fixed 30sec windows' >> beam.WindowInto(beam.window.FixedWindows(30))
      | 'Analyze element' >> beam.ParDo(AnalyzeElement())
      | beam.Map(print))

<table align="left" style="margin-right:1em">
  <td>
    <a class="button" target="_blank" href="https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/pardo.py"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" width="32px" height="32px" alt="View source code"/> View source code</a>
  </td>
</table>

<br/><br/><br/>
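
To see where the `window.start` and `window.end` values in Example 2 come from, the arithmetic behind fixed windows can be modeled in plain Python. This is an illustrative sketch only: `fixed_window_bounds` is a hypothetical helper, not a Beam API, and it assumes windows are half-open intervals `[start, end)` aligned to an offset of zero unless stated otherwise.

```python
def fixed_window_bounds(timestamp, size, offset=0):
  """Returns (start, end), in seconds, of the fixed window holding timestamp.

  Plain-Python model of fixed windowing: each window is the half-open
  interval [start, end) of length `size`, aligned to `offset`.
  """
  start = timestamp - (timestamp - offset) % size
  return start, start + size

# The timestamp used in Example 2, with 30-second windows.
start, end = fixed_window_bounds(1584675660, size=30)
print(start, end)  # 1584675660 1584675690
```

Since `end` is exclusive, the last timestamp that still belongs to the window lies just before `end`, which is what `window.max_timestamp()` reports in the example output.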

### Example 3: ParDo with DoFn methods

A [`DoFn`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.DoFn)
can be customized with a number of methods that can help create more complex behaviors.
You can customize what a worker does when it starts and shuts down with `setup` and `teardown`.
You can also customize what to do when a
[*bundle of elements*](https://beam.apache.org/documentation/runtime/model/#bundling-and-persistence)
starts and finishes with `start_bundle` and `finish_bundle`.

* [`DoFn.setup()`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.DoFn.setup):
  Called *once per `DoFn` instance* when the `DoFn` instance is initialized.
  `DoFn` instances need not be cached, so `setup` could be called more than once per worker.
  This is a good place to connect to database instances, open network connections, or acquire other resources.

* [`DoFn.start_bundle()`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.DoFn.start_bundle):
  Called *once per bundle of elements* before calling `process` on the first element of the bundle.
  This is a good place to start keeping track of the bundle elements.

* [**`DoFn.process(element, *args, **kwargs)`**](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.DoFn.process):
  Called *once per element*, can *yield zero or more elements*.
  Additional `*args` or `**kwargs` can be passed through
  [`beam.ParDo()`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.ParDo).
  **[required]**

* [`DoFn.finish_bundle()`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.DoFn.finish_bundle):
  Called *once per bundle of elements* after calling `process` on the last element of the bundle,
  can *yield zero or more elements*. This is a good place to do batch calls on a bundle of elements,
  such as running a database query.

  For example, you can initialize a batch in `start_bundle`,
  add elements to the batch in `process` instead of yielding them,
  then run a batch query on those elements in `finish_bundle` and yield all the results.

  Note that yielded elements from `finish_bundle` must be of the type
  [`apache_beam.utils.windowed_value.WindowedValue`](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/windowed_value.py).
  You need to provide a timestamp as a unix timestamp, which you can get from the last processed element.
  You also need to provide a window, which you can get from the last processed element like in the example below.

* [`DoFn.teardown()`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.DoFn.teardown):
  Called *once (as a best effort) per `DoFn` instance* when the `DoFn` instance is shutting down.
  This is a good place to close database connections, network connections, or other resources.

  Note that `teardown` is called as a *best effort* and is *not guaranteed*.
  For example, if the worker crashes, `teardown` might not be called.

In [None]:
import apache_beam as beam

class DoFnMethods(beam.DoFn):
  def __init__(self):
    print('__init__')
    self.window = beam.window.GlobalWindow()

  def setup(self):
    print('setup')

  def start_bundle(self):
    print('start_bundle')

  def process(self, element, window=beam.DoFn.WindowParam):
    self.window = window
    yield '* process: ' + element

  def finish_bundle(self):
    yield beam.utils.windowed_value.WindowedValue(
        value='* finish_bundle: 🌱🌳🌍',
        timestamp=0,
        windows=[self.window],
    )

  def teardown(self):
    print('teardown')

with beam.Pipeline() as pipeline:
  results = (
      pipeline
      | 'Create inputs' >> beam.Create(['🍓', '🥕', '🍆', '🍅', '🥔'])
      | 'DoFn methods' >> beam.ParDo(DoFnMethods())
      | beam.Map(print))

<table align="left" style="margin-right:1em">
  <td>
    <a class="button" target="_blank" href="https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/pardo.py"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" width="32px" height="32px" alt="View source code"/> View source code</a>
  </td>
</table>

<br/><br/><br/>

> *Known issues:*
>
> * [[BEAM-7340]](https://issues.apache.org/jira/browse/BEAM-7340)
>   `DoFn.teardown()` metrics are lost.

## Related transforms

* [Map](https://beam.apache.org/documentation/transforms/python/elementwise/map) applies a function to each element like `ParDo`, but produces exactly one output for each input.
* [FlatMap](https://beam.apache.org/documentation/transforms/python/elementwise/flatmap) behaves the same as `Map`,
  but for each input it may produce zero or more outputs.
* [Filter](https://beam.apache.org/documentation/transforms/python/elementwise/filter) is useful if the function is just
  deciding whether to output an element or not.
