<a href="https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


In [None]:
#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the "License")

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# Try Apache Beam - YAML

While Beam provides powerful APIs for authoring sophisticated data processing pipelines, it still has a high barrier for getting started and authoring simple pipelines. Even setting up the environment, installing the dependencies, and setting up the project can be a challenge.

Here we provide a simple YAML syntax for describing pipelines that does not require coding experience or learning how to use an SDK. You can use any text editor.

Please note: The YAML API is still EXPERIMENTAL and subject to change.

In this notebook, you set up your development environment and write a simple pipeline using YAML. Then you run it locally, using the [DirectRunner](https://beam.apache.org/documentation/runners/direct/). You can explore other runners with the [Beam Capability Matrix](https://beam.apache.org/documentation/runners/capability-matrix/).

To navigate through different sections, use the table of contents. From the **View** drop-down list, select **Table of contents**.

To run a code cell, click the **Run cell** button at the top left of the cell, or select it and press **`Shift+Enter`**. Try modifying a code cell and re-running it to see what happens.

To learn more about Colab, see [Welcome to Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb).

# Setup

First, you need to set up your environment. The following code installs `apache-beam` and downloads some text files from Cloud Storage to your local file system. We'll use these text files as input to the pipelines.

In [None]:
# Run and print a shell command.
def run(cmd):
  print('>> {}'.format(cmd))
  !{cmd}
  print('')

def save_to_file(content, file_name):
  with open(file_name, 'w') as f:
    f.write(content)

# Install apache-beam.
run('pip install --quiet apache-beam')

# Copy the input files into the local file system.
run('mkdir -p data')
run('wget -O data/kinglear.txt https://storage.googleapis.com/dataflow-samples/shakespeare/kinglear.txt')
run('wget -O data/SMSSpamCollection.csv https://storage.googleapis.com/apache-beam-samples/SMSSpamCollection/SMSSpamCollection')

## Inspect the data
We'll be working with two datasets. We'll use `kinglear.txt` for the first example, and `SMSSpamCollection.csv` for the second and third examples.
Let's first take a look at the `kinglear.txt` dataset.

In [None]:
run('head data/kinglear.txt')

This is just a `txt` file that contains lines of text.
Let's take a look at the other dataset.

In [None]:
run('head data/SMSSpamCollection.csv')
run('wc -l data/SMSSpamCollection.csv')

This dataset is a `csv` file that contains 5,574 rows of SMS messages labeled either spam or not-spam ("ham"). Each row contains two columns separated by a tab character:
1. `Column 1`: The label, either `ham` or `spam`
2. `Column 2`: The SMS message as raw text (type `string`)
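
The row layout described above can be sketched in plain Python. This is only an illustration: the three sample rows are made up rather than taken from the dataset, and the variable names are arbitrary. It uses the standard `csv` module with a tab delimiter, and `QUOTE_NONE` so that quote characters inside a message are treated as ordinary text.

```python
import csv
import io
from collections import Counter

# Made-up rows in the same "label<TAB>message" layout as the dataset.
sample = "ham\tSee you at lunch?\nspam\tYou won a prize! Call now\nham\tOk\n"

# Read tab-separated rows; QUOTE_NONE keeps quote characters as plain text.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

# Count how many rows carry each label.
label_counts = Counter(label for label, _message in rows)

print(rows[0])
print(label_counts)
```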

## Example 1: Word count
This example is a version of the [WordCount](https://beam.apache.org/get-started/wordcount-example/) example. It reads lines of text from the input dataset `kinglear.txt` and counts the number of times each word appears in the text.
To start, we'll create a `.yaml` file specifying our pipeline.

In [None]:
pipeline = '''
pipeline:
  # Read input data. Each line from the txt file is a String.
  - type: ReadFromText
    name: InputText
    config:
      file_pattern: data/kinglear.txt

  # Using a regex, we'll split each line (one long string) into words (list of strings).
  # The 'fn' parameter accepts functions written in Python
  - type: PyFlatMap
    name: FindWords
    input: InputText
    config:
      fn: |
        import re
        lambda line: re.findall(r"[a-zA-Z]+", line)

  # Pair each word with a '1'. The result of this step is pairs (word: 1).
  - type: PyMap
    name: PairWordsWith1
    input: FindWords
    config:
      fn: 'lambda word: (word, 1)'

  # Using CombinePerKey transform with the 'sum' function as a combine function,
  # we'll calculate the occurrence of each word.
  - type: CombinePerKey
    config:
      combine_fn: sum
    name: GroupAndSum
    input: PairWordsWith1

  # Format results - each record should be represented as 'word: count'.
  # The 'fn' parameter accepts functions written in Python
  - type: PyMap
    name: FormatResults
    input: GroupAndSum
    config:
      fn: "lambda word_count_tuple: f'{word_count_tuple[0]}: {word_count_tuple[1]}'"

  # Save results to a text file.
  - type: WriteToText
    name: SaveToText
    input: FormatResults
    config:
      file_path_prefix: "data/result-pipeline-01"
      file_name_suffix: ".txt"
'''
save_to_file(pipeline, 'pipeline-01.yaml')

Each pipeline specification must start with a `pipeline` key on the first line.
The `pipeline` key is followed by a list of transforms. For example, the first transform reads the input file:
```
  # Read input data. Each line from the txt file is a String.
  - type: ReadFromText
    name: InputText
    config:
      file_pattern: data/kinglear.txt
```
Note: Indentation is significant because it defines the object hierarchy.
YAML supports comments. Outside quoted and multiline strings, a `#` at the start of a line or after whitespace begins a comment that runs to the end of the line. Use comments to improve readability.
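
To make the hierarchy concrete, here is a sketch of the nested structure that the snippet above describes, written by hand as Python literals rather than produced by a YAML parser. The `pipeline` key maps to a list of transforms, and each transform is a mapping with its own nested `config` mapping. The variable name `spec` is chosen only for this illustration.

```python
# Hand-written Python equivalent of the YAML snippet above.
# Each level of YAML indentation becomes one level of nesting here;
# the YAML comment carries no data, so it does not appear.
spec = {
    "pipeline": [
        {
            "type": "ReadFromText",
            "name": "InputText",
            "config": {"file_pattern": "data/kinglear.txt"},
        },
    ],
}

first = spec["pipeline"][0]
print(first["type"])
print(first["config"]["file_pattern"])
```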

Each operation must specify the `type` descriptor and other fields, such as `name` and other transform-specific parameters.
For a list of available transforms and their parameters, see the YAML API documentation. # todo(yaml) add link

To link two operations, use the `input` field. The `input` field specifies the name of another transform.
For example, the third operation in this pipeline takes the `FindWords` transform as input:
```
  # Pair each word with a '1'. The result of this step is pairs (word: 1).
  - type: PyMap
    name: PairWordsWith1
    input: FindWords
    config:
      fn: 'lambda word: (word, 1)'
```
This particular operation takes `fn` (which stands for function) as an argument. Currently only Python functions are supported.
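
The way `input` fields chain transforms together can be sketched in plain Python. The transform names below are copied from the word-count pipeline; the helper `upstream_chain` is hypothetical and only illustrates following the references by name. It is not how Beam itself builds the pipeline graph.

```python
# Transform names and `input` references from the word-count pipeline.
transforms = [
    {"name": "InputText"},
    {"name": "FindWords", "input": "InputText"},
    {"name": "PairWordsWith1", "input": "FindWords"},
    {"name": "GroupAndSum", "input": "PairWordsWith1"},
    {"name": "FormatResults", "input": "GroupAndSum"},
    {"name": "SaveToText", "input": "FormatResults"},
]

def upstream_chain(transforms, name):
    """Follow `input` references from `name` back to the source transform."""
    by_name = {t["name"]: t for t in transforms}
    chain = [name]
    while "input" in by_name[chain[-1]]:
        chain.append(by_name[chain[-1]]["input"])
    # Reverse so the chain reads from the source to the requested transform.
    return list(reversed(chain))

print(" -> ".join(upstream_chain(transforms, "SaveToText")))
```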

For more complicated functions, you can take advantage of YAML's multiline feature, as shown in the second operation:
```
  # Using a regex, we'll split each line (one long string) into words (list of strings).
  - type: PyFlatMap
    name: FindWords
    input: InputText
    config:
      fn: |
        import re
        lambda line: re.findall(r"[a-zA-Z]+", line)
```
In this transform, we need to import Python's regular expression module, `re`. To do that, we use the `|` character to start a multiline string.
This lets us write the function across two lines.
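
To see what the function in this transform returns, and why a flat-map is used instead of a plain map, here is a small plain-Python sketch. It does not use Beam. The two input lines are short samples chosen for illustration, and `find_words` is just a local name for the same regular expression.

```python
import re
from itertools import chain

def find_words(line):
    """Same expression as the `fn` above: all runs of ASCII letters."""
    return re.findall(r"[a-zA-Z]+", line)

# Short sample lines, for illustration only.
lines = ["Blow, winds!", "Nothing will come of nothing."]

# Map-style: one output element (a list of words) per input line.
mapped = [find_words(line) for line in lines]

# FlatMap-style: every word becomes its own output element.
flat_mapped = list(chain.from_iterable(find_words(line) for line in lines))

print(mapped)
print(flat_mapped)
```

With the flat-map form, the next step receives one word at a time, which is what `PairWordsWith1` expects.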

Let's run the pipeline by executing the Python entry-point script (`apache_beam.yaml.main`) with our pipeline file as an argument:

In [None]:
run('python -m apache_beam.yaml.main --pipeline_spec_file=pipeline-01.yaml')

Let's inspect the results. Each line contains a word and an associated count.

In [None]:
run('head data/result-pipeline-01-00000-of-00001.txt')
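
Each output line is the result of summing the `1`s for one word. As a rough plain-Python sketch of what combining per key with `sum` computes, followed by the same `word: count` formatting, consider the following. The input pairs are made up, and this does not reflect how Beam executes the step. The order of lines in the real output file may also differ from the order shown here.

```python
from collections import defaultdict

# Made-up (word, 1) pairs, shaped like the output of PairWordsWith1.
pairs = [("king", 1), ("lear", 1), ("king", 1), ("fool", 1), ("king", 1)]

# Group the values by key, then apply `sum` to each group.
grouped = defaultdict(list)
for word, one in pairs:
    grouped[word].append(one)
counts = {word: sum(ones) for word, ones in grouped.items()}

# Same formatting as the FormatResults step: 'word: count'.
formatted = [f"{word}: {count}" for word, count in counts.items()]
print(formatted)
```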

## Example 2: Load data, filter unwanted lines, and save results to a text file.
This example creates a pipeline that loads a `.csv` file containing SMS messages, filters out the legitimate (`ham`) messages, and saves the spam messages to a file.
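
Before writing the pipeline, here is the same split-then-filter logic as a plain-Python sketch. The rows are made up for illustration and Beam is not involved. The two list comprehensions correspond to the `SplitLine` and `KeepSpam` steps defined in the pipeline for this example.

```python
# Made-up rows in the dataset's "label<TAB>message" layout.
lines = [
    "ham\tAre we still on for tonight?",
    "spam\tFree entry! Text WIN to claim",
    "ham\tRunning late, sorry",
]

# SplitLine step: turn each line into [label, message].
rows = [line.split("\t") for line in lines]

# KeepSpam step: keep only rows whose label is "spam".
spam_rows = [row for row in rows if row[0] == "spam"]

print(spam_rows)
```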


In [None]:
pipeline = '''
pipeline:
  # Read input data. Each line from the csv file is a String.
  - type: ReadFromText
    name: SmsData
    config:
      file_pattern: data/SMSSpamCollection.csv

  # Split each line into an array, where the first element is message label (ham or spam) and the second is the content of the message.
  - type: PyMap
    name: SplitLine
    input: SmsData
    config:
      fn: 'lambda line: line.split("\\t")'

  # Keep only the rows that contain spam messages, based on the first element in the array - the label.
  - type: PyFilter
    name: KeepSpam
    input: SplitLine
    config:
      keep: 'lambda row: row[0] == "spam"' # this is a function in Python, similar to the 'fn' in the previous example.

  # Save only the rows from the input file that are classified as spam.
  - type: WriteToText
    name: SaveToText
    input: KeepSpam
    config:
      file_path_prefix: "data/result-pipeline-02"
      file_name_suffix: ".txt"
'''
save_to_file(pipeline, 'pipeline-02.yaml')

Let's run the pipeline with our `.yaml` file as an input:

In [None]:
run('python -m apache_beam.yaml.main --pipeline_spec_file=pipeline-02.yaml')

Verify the results and see the content of the output file.

In [None]:
run('head data/result-pipeline-02-00000-of-00001.txt')

You should see only spam messages from the input dataset. Congratulations, onto the next example!


## Example 3: Count words in spam messages, select the top 10 popular words, and write the results to a file.

This example counts words occurring in spam messages, selects the most popular words, and writes the result to a file.
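
The "most popular words" selection can be sketched in plain Python with `heapq.nlargest`, using the count as the comparison key. The word counts below are made up. The sketch only illustrates the idea of a keyed top-N selection and says nothing about how the Beam transform is implemented or how it formats its output.

```python
import heapq

# Made-up (word, count) pairs, shaped like the output of a per-word sum.
word_counts = [("free", 12), ("call", 30), ("to", 45), ("now", 18), ("txt", 9)]

# Keep the n pairs with the largest counts; the key picks the count (row[1]).
top_3 = heapq.nlargest(3, word_counts, key=lambda row: row[1])

print(top_3)
```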


In [None]:
pipeline = '''
pipeline:
  # Read input data. Each line from the csv file is a String.
  - type: ReadFromText
    name: SmsData
    config:
      file_pattern: data/SMSSpamCollection.csv

  # Split each line into an array, where the first element is message label (ham or spam) and the second is the content of the message.
  - type: PyMap
    name: SplitLine
    input: SmsData
    config:
      fn: 'lambda line: line.split("\\t")'

  # Keep only the rows that contain spam messages, based on the first element in the array - the label.
  - type: PyFilter
    name: SpamMessages
    input: SplitLine
    config:
      keep: 'lambda row: row[0] == "spam"'

  # Using a regex, we'll split the content of the message (one long string) into words (list of strings)
  - type: PyFlatMap
    name: FindWords
    input: SpamMessages
    config:
      fn: |
        import re
        lambda line: re.findall(r"[a-zA-Z]+", line[1])

  # Transform each word to lower case and pair it with a '1'. The result of this step is pairs (word: 1).
  - type: PyMap
    name: PairLoweredWordsWith1
    input: FindWords
    config:
      fn: 'lambda word: (word.lower(), 1)'

  # Using CombinePerKey transform with the 'sum' function as a combine function,
  # we'll calculate the occurrence of each word.
  - type: CombinePerKey
    config:
      combine_fn: sum
    name: GroupAndSum
    input: PairLoweredWordsWith1

  # Select 10 most popular words. Input format to this step is a tuple (word: count),
  # so we provide the count (row[1]) as the key to compare the numbers
  - type: TopNLargest
    name: MostPopular
    input: GroupAndSum
    config:
      n: 10
      key: 'lambda row: row[1]'

  # Save results to a text file.
  - type: WriteToText
    name: SaveToText
    input: MostPopular
    config:
      file_path_prefix: "data/result-pipeline-03"
      file_name_suffix: ".txt"
'''
save_to_file(pipeline, 'pipeline-03.yaml')

Run the pipeline:

In [None]:
run('python -m apache_beam.yaml.main --pipeline_spec_file=pipeline-03.yaml')

Finally, view the output:

In [None]:
run('head data/result-pipeline-03-00000-of-00001.txt')

## Summary
Congratulations! You've just run Apache Beam pipelines using YAML.

For all the available operations visit the documentation: # todo(yaml) add url

For a list of available transforms, visit # todo(yaml) add url

To run your pipeline in Dataflow, you'll need to set up a Google Cloud project and run the pipeline with the `DataflowRunner`. For more information, see https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#run-on-dataflow