<a href="https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


In [None]:
#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the "License")

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# Try Apache Beam - YAML

While Beam provides powerful APIs for authoring sophisticated data processing pipelines, the barrier to getting started and authoring even simple pipelines is still high. Setting up the environment, installing the dependencies, and configuring the project can involve an overwhelming amount of boilerplate.

Here we provide a simple YAML syntax for describing pipelines that does not require coding experience or learning how to use an SDK; any text editor will do.

In this notebook, we set up your development environment and write a simple pipeline using YAML. We'll run it locally, using the [DirectRunner](https://beam.apache.org/documentation/runners/direct/). You can explore other runners with the [Beam Capability Matrix](https://beam.apache.org/documentation/runners/capability-matrix/).

To navigate through the different sections, use the table of contents. From the **View** drop-down list, select **Table of contents**.

To run a code cell, click the **Run cell** button at the top left of the cell, or select the cell and press **`Shift+Enter`**. Try modifying a code cell and re-running it to see what happens.

To learn more about Colab, see [Welcome to Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb).

# Setup

First, you need to set up your environment, which includes installing `apache-beam` and downloading a text file from Cloud Storage to your local file system. We use this file to test the pipelines.

In [None]:
# Run and print a shell command.
def run(cmd):
  print('>> {}'.format(cmd))
  !{cmd}
  print('')

def save_to_file(content, file_name):
  with open(file_name, 'w') as f:
    f.write(content)

# Install apache-beam.
run('pip install --quiet apache-beam')

# Copy the input file into the local file system.
run('mkdir -p data')
run('gsutil cp gs://apache-beam-samples/SMSSpamCollection/SMSSpamCollection data/SMSSpamCollection.csv')

## Inspect the data
Let’s see how our data looks.

In [None]:
run('head data/SMSSpamCollection.csv')
run('wc -l data/SMSSpamCollection.csv')

This dataset is a tab-separated text file (saved here with a `.csv` extension) with 5,574 rows and 2 columns:
1. `Column 1`: The label (either `ham` or `spam`)
2. `Column 2`: The SMS as raw text (type `string`)
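
As a minimal sketch of this layout, the snippet below parses two made-up lines in the same `label<TAB>text` format. The sample messages and the `parse_line` helper are hypothetical and are not taken from the dataset or from Beam.

```python
# Hypothetical sample lines in the same label<TAB>text layout as the dataset.
sample_lines = [
    "ham\tSee you at lunch?",
    "spam\tWINNER! Claim your free prize now",
]

def parse_line(line):
    """Split one line into (label, text) on the first tab only."""
    label, text = line.split("\t", 1)
    return label, text

records = [parse_line(line) for line in sample_lines]
for label, text in records:
    print(f"{label!r}: {text!r}")
```

Splitting on the first tab only keeps the message intact if it happens to contain a tab itself. The pipelines in this notebook use a plain `line.split("\t")`, which behaves the same way for lines that contain exactly one tab.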

## Example 1
We’ll start by creating a pipeline that loads the data, filters out the legitimate (`ham`) messages so that only spam remains, and saves the spam lines to a file.


In [None]:
pipeline = '''
pipeline:
  - type: ReadFromText
    name: SmsData
    file_pattern: data/SMSSpamCollection.csv

  - type: PyMap
    name: SplitLine
    input: SmsData
    fn: 'lambda line: line.split("\\t")'

  - type: PyFilter
    name: KeepSpam
    input: SplitLine
    keep: 'lambda row: row[0] == "spam"'

  - type: WriteToText
    name: SaveToText
    input: KeepSpam
    file_path_prefix: "data/result-pipeline-01"
    file_name_suffix: ".txt"
'''
save_to_file(pipeline, 'pipeline-01.yaml')

In this example, each transform after the first contains an `input` key. If the pipeline is linear, as ours is, we can leave the inputs implicit by declaring the pipeline to be of type `chain`.


In [None]:
pipeline = '''
pipeline:
  type: chain
  transforms:
    - type: ReadFromText
      name: SmsData
      file_pattern: data/SMSSpamCollection.csv

    - type: PyMap
      name: SplitLine
      fn: 'lambda line: line.split("\\t")'

    - type: PyFilter
      name: KeepSpam
      keep: 'lambda row: row[0] == "spam"'

    - type: WriteToText
      name: SaveToText
      file_path_prefix: "data/result-pipeline-01"
      file_name_suffix: ".txt"
'''
save_to_file(pipeline, 'pipeline-01-chain.yaml')

To run the pipeline locally with the DirectRunner, run Beam YAML's main Python module, passing the `pipeline-01-chain.yaml` (or `pipeline-01.yaml`) file as input:

In [None]:
run('python -m apache_beam.yaml.main --pipeline_spec_file=pipeline-01-chain.yaml')

Let's verify the results and see the content of the output file.

In [None]:
run('head data/result-pipeline-01-00000-of-00001.txt')

If everything went well, you should see only spam messages from our input dataset. Congratulations, onto the next example!
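
To make the expected result concrete, here is a plain-Python sketch that applies the same two lambdas used in the `SplitLine` and `KeepSpam` steps to a few made-up lines. It only illustrates the logic and does not use Beam; the sample lines are hypothetical.

```python
# Hypothetical input lines; the real pipeline reads them from the data file.
sample_lines = [
    "ham\tAre we still on for lunch?",
    "spam\tFree entry! Text WIN to claim",
    "ham\tCall me when you get home",
]

# The same callables that the YAML pipeline passes as `fn` and `keep`.
split_line = lambda line: line.split("\t")
keep_spam = lambda row: row[0] == "spam"

rows = [split_line(line) for line in sample_lines]
spam_rows = [row for row in rows if keep_spam(row)]

print(spam_rows)
```

Only rows whose first element is `spam` survive the filter, which is what the output file should reflect.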


## Example 2: Count words in spam messages, select the 10 most popular words, and write the results to a file

We'd like to write a pipeline which counts words occurring in spam messages, selects the most popular ones and writes the result to a file.
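
Before expressing this in YAML, it may help to see the same steps in plain Python. The sketch below is an illustration under assumptions: the sample rows are made up, it keeps the top 3 words instead of 10 to stay short, and it uses `collections.Counter` and `sorted` in place of Beam's per-key sum and top-N transforms.

```python
import re
from collections import Counter

# Hypothetical rows, already split into [label, text].
sample_rows = [
    ["spam", "Free prize! Claim your FREE prize now"],
    ["ham", "Are we still on for lunch?"],
    ["spam", "Claim a free ringtone"],
]

# Keep only spam rows, based on the label.
spam_rows = [row for row in sample_rows if row[0] == "spam"]

# Split each message into words with the same regex as the pipeline.
words = [w for row in spam_rows for w in re.findall(r"[a-zA-Z]+", row[1])]

# Pair each lower-cased word with 1, then sum the ones per word.
pairs = [(w.lower(), 1) for w in words]
counts = Counter()
for word, n in pairs:
    counts[word] += n

# Select the most frequent words, comparing on the count (row[1]).
top = sorted(counts.items(), key=lambda row: row[1], reverse=True)[:3]
print(top)
```

The YAML pipeline that follows performs the same sequence of steps, one transform per step.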


In [None]:
pipeline = '''
pipeline:
  type: chain

  transforms:
    # Read input data. Each line from the csv file is a String.
    - type: ReadFromText
      name: SmsData
      file_pattern: data/SMSSpamCollection.csv

    # Split each line into an array, where the first element is message label (ham or spam) and the second is the content of the message.
    - type: PyMap
      name: SplitLine
      fn: 'lambda line: line.split("\\t")'

    # Select only the rows that contain spam messages, based on the label.
    - type: PyFilter
      name: SpamMessages
      keep: 'lambda row: row[0] == "spam"'

    # Using a regex, we'll split the content of the message (one long string) into words (list of strings)
    - type: PyFlatMap
      name: FindWords
      fn: |
        import re
        lambda line: re.findall(r"[a-zA-Z]+", line[1])

    # Transform each word to lower case and pair it with a 1. The result of this step is a collection of (word, 1) pairs.
    - type: PyMap
      name: PairLoweredWordsWith1
      fn: 'lambda word: (word.lower(), 1)'

    # Using the SumPerKey transform, we'll calculate the number of occurrences of each word.
    - type: SumPerKey
      name: GroupAndSum

    # Select 10 most popular words. Input format to this step is a tuple (word: count),
    # so we provide the count (row[1]) as the key to compare the numbers
    - type: TopNLargest
      name: Largest
      n: 10
      key: 'lambda row: row[1]'

    # Save results to a text file.
    - type: WriteToText
      name: SaveToText
      file_path_prefix: "data/result-pipeline-02"
      file_name_suffix: ".txt"
'''
save_to_file(pipeline, 'pipeline-02.yaml')

Let's run the pipeline:

In [None]:
run('python -m apache_beam.yaml.main --pipeline_spec_file=pipeline-02.yaml')

To view the output:

In [None]:
run('head data/result-pipeline-02-00000-of-00001.txt')

## Summary
Congratulations! You've just run YAML pipelines in Apache Beam.

For all the available operations visit the documentation: # todo(yaml) add url

For a list of available transforms, visit # todo(yaml) add url