In [None]:
#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the "License")

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# Learn Beam PTransforms

After this notebook, you should be able to:
1. Use user-defined functions in your `PTransforms`
2. Use the Beam SDK's built-in composite transforms
3. Create your own composite transforms to simplify your `Pipeline`

For basic Beam `PTransforms`, please check out [this Notebook](https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/learn_beam_basics_by_doing.ipynb).

The Beam Python SDK also provides [a list of built-in transforms](https://beam.apache.org/documentation/transforms/python/overview/).


## How To Approach This Tutorial

This tutorial was designed for someone who likes to learn by doing. There will be code cells where you can write your own code to test your understanding.

As such, to get the most out of this tutorial, we strongly recommend typing code by hand as you’re working through the tutorial and not using copy/paste. This will help you develop muscle memory and a stronger understanding.

To begin, run the cell below to install and import Apache Beam.

In [None]:
# Run a shell command and import beam
!pip install --quiet apache-beam
import apache_beam as beam
beam.__version__

In [None]:
# Set the logging level to reduce verbose information
import logging

logging.root.setLevel(logging.ERROR)



---



## 1. Simple User-Defined Function (UDF)

Some `PTransforms` allow you to run your own functions and user-defined code to specify how your transform is applied. For example, the `CombineGlobally` transform below takes a user-defined function that sums the values and caps the result at a bound:

In [None]:
pc = [1, 10, 100, 1000]

# User-defined function
def bounded_sum(values, bound=500):
  return min(sum(values), bound)

small_sum = pc | beam.CombineGlobally(bounded_sum)  # [500]
large_sum = pc | beam.CombineGlobally(bounded_sum, bound=5000)  # [1111]

print(small_sum, large_sum)

## 2. Transforms: ParDo and Combine

A `ParDo` transform considers each element in the input `PCollection`, runs your user code on it, and emits zero, one, or multiple elements to an output `PCollection`. `Combine` is another Beam transform, used for combining collections of elements or values in your data.
Both accept UDFs that define how you process the data.

### 2.1 DoFn

`DoFn` is a Beam Python class that defines a distributed processing function. You pass an instance of it to [ParDo](https://beam.apache.org/documentation/programming-guide/#pardo).

In [None]:
data = [1, 2, 3, 4]

# create a DoFn to multiply each element by five
# you can define the processing code under `process`
class MultiplyByFive(beam.DoFn):
  def process(self, element):
    return [element*5]

with beam.Pipeline() as pipeline:
  outputs = (
      pipeline
      | 'Create values' >> beam.Create(data)
      | 'Multiply by 5' >> beam.ParDo(MultiplyByFive())
  )

  outputs | beam.Map(print)

### 2.2 CombineFn

`CombineFn` is a Beam Python class that defines an associative and commutative aggregation. You pass an instance of it to [Combine](https://beam.apache.org/documentation/programming-guide/#combine).

In [None]:
data = [1, 2, 3, 4]

# create a CombineFn to get the product of all elements
# you need to provide four operations
class ProductFn(beam.CombineFn):
  def create_accumulator(self):
    # creates a new accumulator to store the initial value
    return 1

  def add_input(self, current_prod, input):
    # adds an input element to an accumulator
    return current_prod*input

  def merge_accumulators(self, accumulators):
    # merge several accumulators into a single accumulator
    prod = 1
    for accu in accumulators:
      prod *= accu
    return prod

  def extract_output(self, prod):
    # performs the final computation
    return prod

with beam.Pipeline() as pipeline:
  outputs = (
      pipeline
      | 'Create values' >> beam.Create(data)
      | 'Compute product' >> beam.CombineGlobally(ProductFn())
  )
  outputs | beam.LogElements()


Note: The above `DoFn` and `CombineFn` examples are for demonstration purposes. You could easily achieve the same functionality by using the simple function illustrated in section 1.
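
To see why a `CombineFn` aggregation needs to be associative and commutative, the following plain-Python sketch mimics how a runner might drive the four methods. It does not use Beam at all: `ProductModel` and `simulate_combine` are hypothetical names, and the way the input is split into bundles is an assumption made for illustration, not a description of how any particular runner schedules work.

```python
# Plain-Python stand-in with the same four methods as ProductFn above.
class ProductModel:
  def create_accumulator(self):
    return 1

  def add_input(self, accumulator, value):
    return accumulator * value

  def merge_accumulators(self, accumulators):
    prod = 1
    for accumulator in accumulators:
      prod *= accumulator
    return prod

  def extract_output(self, accumulator):
    return accumulator

# Hypothetical driver: combine each bundle on its own, then merge the partials.
def simulate_combine(fn, bundles):
  partials = []
  for bundle in bundles:
    accumulator = fn.create_accumulator()
    for value in bundle:
      accumulator = fn.add_input(accumulator, value)
    partials.append(accumulator)
  return fn.extract_output(fn.merge_accumulators(partials))

fn = ProductModel()
one_bundle = simulate_combine(fn, [[1, 2, 3, 4]])
two_bundles = simulate_combine(fn, [[4, 1], [3, 2]])
print(one_bundle, two_bundles)  # 24 24
```

Because multiplication is associative and commutative, the result is the same however the elements are grouped or ordered. An aggregation without those properties could give different answers depending on how the data happened to be split.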



---



## 3. Composite Transforms

Now that you've learned the basic `PTransforms`, you can simplify how you process and transform your data with [Composite Transforms](https://beam.apache.org/documentation/programming-guide/#composite-transforms).

Composite transforms can nest multiple transforms into a single composite transform, making your code easier to understand.

To see an example of this, let's take a look at how we can improve the `Pipeline` we built to count each word in Shakespeare's *King Lear*.

Below is the `Pipeline` we built in the [WordCount tutorial](https://colab.research.google.com/drive/1_EncqFT_SmwXp7wlRqEf39m9efyrmm9p?usp=sharing):

In [None]:
!mkdir -p data
!gsutil cp gs://dataflow-samples/shakespeare/kinglear.txt data/

In [None]:
import re

# Function used to run and display the result
def run(cmd):
  print('>> {}'.format(cmd))
  !{cmd}
  print('')

inputs_pattern = 'data/*'
outputs_prefix = 'outputs/part'

# Running locally in the DirectRunner.
with beam.Pipeline() as pipeline:
  word_count = (
      pipeline
        | 'Read lines' >> beam.io.ReadFromText(inputs_pattern)
        | 'Find words' >> beam.FlatMap(lambda line: re.findall(r"[a-zA-Z']+", line))
        | 'Pair words with 1' >> beam.Map(lambda word: (word, 1))
        | 'Group and sum' >> beam.CombinePerKey(sum)
        | 'Write results' >> beam.io.WriteToText(outputs_prefix)
  )

# Sample the first 20 results, remember there are no ordering guarantees.
run('head -n 20 {}-00000-of-*'.format(outputs_prefix))

Although the code above is a viable way to design your `Pipeline`, you can see that we use multiple transforms to perform one process:
1. `FlatMap` is used to find words in each line
2. `Map` is used to create key-value pairs with each word where the value is 1
3. `CombinePerKey` is used to group by each word and sum the values

Together, these `PTransforms` are meant to count each word in *King Lear*. You can simplify the process and combine these three transforms into one by using composite transforms.

There are two ways to do this:
1. Using Beam SDK's built-in composite transforms
2. Creating your own composite transforms

### 3.1 Beam SDK Composite Transforms
The Beam SDK comes with many useful composite transforms. We will only cover one in this tutorial; for a list of the transforms you can use, see the API reference page: [Pre-written Beam Transforms for Python](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.html).


By using a Beam SDK composite transform, you're able to easily combine multiple transforms into one line.

For this tutorial, we will use the SDK-provided [`Count` transform](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.combiners.html#apache_beam.transforms.combiners.Count), which counts each element in the `PCollection`.


```
beam.combiners.Count.PerElement()
```



This `Count` transform does the work of both the `Map` and `CombinePerKey` transforms from our Word Count `Pipeline`, but in one line.

Edit the Word Count `Pipeline` below to use a composite transform by implementing Beam's `Count` transform (see above). Applying a composite transform is just like applying a `PTransform` to your `PCollection`.

Below the code cell you will edit is a hidden answer code cell to check your work. If you're stuck, try opening the hint first!

In [None]:
#@title Open code to show the hint

#Hint: Replace the `Map` and `CombinePerKey` transforms with Beam's `Count` transform (see above)

In [None]:
#@title EDIT THIS CODE CELL TO USE beam.combiners.Count.PerElement
# EDIT THIS CODE CELL

inputs_pattern = 'data/*'
outputs_prefix = 'outputs/userans'

with beam.Pipeline() as pipeline:
  word_count = (
      pipeline
        | 'Read lines' >> beam.io.ReadFromText(inputs_pattern)
        | 'Find words' >>
        beam.FlatMap(lambda line: re.findall(r"[a-zA-Z']+", line))
        | 'Pair words with 1' >> beam.Map(lambda word: (word, 1))
        | 'Group and sum' >> beam.CombinePerKey(sum)
        | 'Write results' >> beam.io.WriteToText(outputs_prefix)
  )

# After you're done, check to see if your code outputs
# the same PCollection by uncommenting the code below
'''
# Sample the first 20 results, remember there are no ordering guarantees.
run('head -n 20 {}-00000-of-*'.format(outputs_prefix))
'''

Below is our answer to check your work. It is the Word Count example from above, but it now combines `Map` and `CombinePerKey` into one line using the `Count` composite transform.

In [None]:
#@title Answer
inputs_pattern = 'data/*'
outputs_prefix = 'outputs/part2'

# Running locally in the DirectRunner.
with beam.Pipeline() as pipeline:
  word_count = (
      pipeline
        | 'Read lines' >> beam.io.ReadFromText(inputs_pattern)
        | 'Find words' >>
        beam.FlatMap(lambda line: re.findall(r"[a-zA-Z']+", line))
        # Count composite transform from Beam SDK
        | 'Count words' >> beam.combiners.Count.PerElement()
        | 'Write results' >> beam.io.WriteToText(outputs_prefix)
  )

# Sample the first 20 results, remember there are no ordering guarantees.
run('head -n 20 {}-00000-of-*'.format(outputs_prefix))

> Summary: Applying a composite transform is just like applying a `PTransform` to your `PCollection`, but it simplifies the process by combining multiple `PTransforms` in one line.

### 3.2 Creating Your Own Composite Transform

We simplified the original code using a Beam SDK composite transform, but we can simplify it further by creating our own composite transform function.

Below is an example of a composite transform you can create that the Beam SDK does not cover. The function combines the `Count` composite transform you implemented above with the `FlatMap` transform that converts lines of text into individual words.

Note that because `Count` is itself a composite transform, `CountWords` is also a nested composite transform.

In [None]:
# The CountWords Composite Transform inside the WordCount pipeline.
@beam.ptransform_fn
def CountWords(pcoll):
  return (
      pcoll
      # Convert lines of text into individual words.
      | 'ExtractWords' >> beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x))
      # Count the number of times each word occurs.
      | beam.combiners.Count.PerElement()
  )

You can then use this `CountWords` composite transform in your `Pipeline`, making your pipeline easier to read.

Try editing the Word Count `Pipeline` below to incorporate this transform into the pipeline.

Below the code cell you will edit is a hidden answer code cell to check your work. If you're stuck, try opening the hint first!

In [None]:
#@title Open code to show the hint

#Hint: The newly defined transform combines the Count and FlatMap transform.
#Replace the `FlatMap` and `Count` transforms with CountWords() (see above)

In [None]:
#@title EDIT THIS CODE CELL TO USE YOUR `CountWords`
# EDIT THIS CODE CELL

inputs_pattern = 'data/*'
outputs_prefix = 'outputs/part3'

# Running locally in the DirectRunner.
with beam.Pipeline() as pipeline:
  word_count = (
      pipeline
        | 'Read lines' >> beam.io.ReadFromText(inputs_pattern)
        | 'Find words' >> beam.FlatMap(lambda line: re.findall(r"[a-zA-Z']+", line))
        | 'Count words' >> beam.combiners.Count.PerElement()
        | 'Write results' >> beam.io.WriteToText(outputs_prefix)
  )

# The `with` block already ran the pipeline on exit, so no explicit run() is needed.
# Sample the first 20 results, remember there are no ordering guarantees.
run('head -n 20 {}-00000-of-*'.format(outputs_prefix))

In [None]:
#@title Answer
inputs_pattern = 'data/*'
outputs_prefix = 'outputs/part3'

# Running locally in the DirectRunner.
with beam.Pipeline() as pipeline:
  word_count = (
      pipeline
        | 'Read lines' >> beam.io.ReadFromText(inputs_pattern)
        # The composite transform function you created above
        | 'Count Words' >> CountWords()
        | 'Write results' >> beam.io.WriteToText(outputs_prefix)
  )

# Sample the first 20 results, remember there are no ordering guarantees.
run('head -n 20 {}-00000-of-*'.format(outputs_prefix))

### 3.3 Creating Your Own Composite Transform With `PTransform` Directly

To create your own composite transform, create a subclass of the `PTransform` class and override the `expand` method to specify the actual processing logic
([more details](https://beam.apache.org/documentation/programming-guide/#composite-transform-creation)).

For example, suppose we want to create our own composite transform that computes the length of each word.

The following code sample shows how to declare a `PTransform` that accepts a `PCollection` of strings as input and outputs a `PCollection` of `(word, length)` pairs.

Within your `PTransform` subclass, you’ll need to override the `expand` method. The `expand` method is where you add the processing logic for the `PTransform`. Your override of `expand` must accept the appropriate type of input `PCollection` as a parameter, and specify the output `PCollection` as the return value.

As long as you override the `expand` method in your `PTransform` subclass to accept the appropriate input `PCollection`(s) and return the corresponding output `PCollection`(s), you can include as many transforms as you want. These transforms can include core transforms, composite transforms, or the transforms included in the Beam SDK libraries.

Your composite transform’s parameters and return value must match the initial input type and final return type for the entire transform, even if the transform’s intermediate data changes type multiple times.

Note: The `expand` method of a `PTransform` is not meant to be invoked directly by the user of a transform. Instead, apply the transform to the `PCollection` itself, which in the Python SDK usually means using the `|` operator (for example, `pcoll | ComputeWordLengths()`). This allows transforms to be nested within the structure of your pipeline.



In [None]:
class ComputeWordLengths(beam.PTransform):
  def expand(self, pcoll):
    # Transform logic goes here.
    return pcoll | beam.Map(lambda x: (x, len(x)))

In [None]:
# quickly test it works
["KING", "OF"] | ComputeWordLengths()

In [None]:
#@title Click to check how to use your composite transform to build the pipeline

# put this into the Beam pipeline to compute the length of each word
inputs_pattern = 'data/*'
outputs_prefix = 'outputs/part33'

# Running locally in the DirectRunner.
with beam.Pipeline() as pipeline:
  word_lengths = (
      pipeline
        | 'Read lines' >> beam.io.ReadFromText(inputs_pattern)
        | 'Find words' >> beam.FlatMap(lambda line: re.findall(r"[a-zA-Z']+", line))
        | 'Count Word Lengths' >> ComputeWordLengths()
        | 'Write results' >> beam.io.WriteToText(outputs_prefix)
  )

# Sample the first 20 results, remember there are no ordering guarantees.
run('head -n 20 {}-00000-of-*'.format(outputs_prefix))

## Final Reading

The [PTransform Style Guide](https://beam.apache.org/contribute/ptransform-style-guide/) contains additional information not included here, such as style guidelines, logging and testing guidance, and language-specific considerations. The guide is a useful starting point when you want to write new composite PTransforms.