<a href="https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


In [None]:
#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the "License")

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# Try Apache Beam - YAML

While Beam provides powerful APIs for authoring sophisticated data processing pipelines, it still has a high barrier for getting started and authoring simple pipelines. Even setting up the environment, installing the dependencies, and setting up the project can be an overwhelming amount of boilerplate.

Here we provide a simple YAML syntax for describing pipelines that does not require coding experience or learning how to use an SDK&mdash;any text editor will do.

Please note: the YAML API is still EXPERIMENTAL and subject to change.

In this notebook, we set up your development environment and write simple pipelines using YAML. We'll run them locally, using the [DirectRunner](https://beam.apache.org/documentation/runners/direct/). You can explore other runners with the [Beam Capability Matrix](https://beam.apache.org/documentation/runners/capability-matrix/).

To navigate through different sections, use the table of contents. From the **View** drop-down list, select **Table of contents**.

To run a code cell, click the **Run cell** button at the top left of the cell, or select the cell and press **`Shift+Enter`**. Try modifying a code cell and re-running it to see what happens.

To learn more about Colab, see [Welcome to Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb).

# Setup

First, you need to set up your environment, which includes installing `apache-beam` and downloading two sample files from Cloud Storage to your local file system. We'll use these files to test the pipelines.

In [134]:
# Run and print a shell command.
def run(cmd):
  print('>> {}'.format(cmd))
  !{cmd}
  print('')

def save_to_file(content, file_name):
  with open(file_name, 'w') as f:
    f.write(content)

# Install apache-beam.
run('pip install --quiet apache-beam')

# Copy the input files into the local file system.
run('mkdir -p data')
run('gsutil cp gs://dataflow-samples/shakespeare/kinglear.txt data/kinglear.txt')
run('gsutil cp gs://apache-beam-samples/SMSSpamCollection/SMSSpamCollection data/SMSSpamCollection.csv')

>> pip install --quiet apache-beam

>> mkdir -p data

>> gsutil cp gs://dataflow-samples/shakespeare/kinglear.txt data/kinglear.txt
Copying gs://dataflow-samples/shakespeare/kinglear.txt...
/ [1 files][153.6 KiB/153.6 KiB]                                                
Operation completed over 1 objects/153.6 KiB.                                    

>> gsutil cp gs://apache-beam-samples/SMSSpamCollection/SMSSpamCollection data/SMSSpamCollection.csv
Copying gs://apache-beam-samples/SMSSpamCollection/SMSSpamCollection...
/ [1 files][466.7 KiB/466.7 KiB]                                                
Operation completed over 1 objects/466.7 KiB.                                    
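
The `run` helper above relies on the `!` shell escape, which is only available in IPython-based environments such as Colab. As a minimal sketch for a plain Python script, assuming the standard `subprocess` module is acceptable, something like the following could stand in for it. The name `run_plain` is hypothetical and not part of this notebook.

```python
import shlex
import subprocess


def run_plain(cmd):
    """Print a command, run it without a shell, and return its exit code."""
    print('>> {}'.format(cmd))
    # shlex.split turns the command string into an argument list,
    # so no shell is involved and quoting is handled predictably.
    completed = subprocess.run(shlex.split(cmd))
    print('')
    return completed.returncode
```

Unlike the notebook helper, this version returns the exit code, so a script can stop early when a download or install step fails.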



## Inspect the data
We'll be working with 2 datasets. We'll use `kinglear.txt` for the first example (word count) and `SMSSpamCollection.csv` for the second and third.
Let's first take a look at the `kinglear.txt` dataset.

In [135]:
run('head data/kinglear.txt')

>> head data/kinglear.txt
	KING LEAR


	DRAMATIS PERSONAE


LEAR	king of Britain  (KING LEAR:)

KING OF FRANCE:




This is just a `txt` file - it contains lines of text.
Let's take a look at the other dataset.

In [136]:
run('head data/SMSSpamCollection.csv')
run('wc -l data/SMSSpamCollection.csv')

>> head data/SMSSpamCollection.csv
ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham	Ok lar... Joking wif u oni...
spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham	U dun say so early hor... U c already then say...
ham	Nah I don't think he goes to usf, he lives around here though
spam	FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
ham	Even my brother is not like to speak with me. They treat me like aids patent.
ham	As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
spam	WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Cl

This dataset is a `csv` file (tab-separated, despite the extension) with 5,574 rows and 2 columns recording the following attributes:
1. `Column 1`: The label (either `ham` or `spam`)
2. `Column 2`: The SMS as raw text (type `string`)
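
Because each row is a label and a message separated by a tab, a single row can be split into its two attributes with plain Python. The sketch below uses a hypothetical helper name, `parse_row`, and splits on the first tab only, on the assumption that a message body might itself contain a tab.

```python
def parse_row(line):
    """Split one tab-separated row into a (label, message) pair."""
    # Split on the first tab only, so any later tab stays in the message.
    label, message = line.rstrip('\n').split('\t', 1)
    return label, message


# A row taken from the head of the dataset shown above.
print(parse_row('ham\tOk lar... Joking wif u oni...'))
```

The pipelines later in this notebook use a plain `line.split("\t")` instead, which gives the same result whenever a row contains exactly one tab.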

## Example 1: word count
In this popular introductory exercise, we will build a pipeline that reads lines of text from the input dataset `kinglear.txt` and counts the number of times each word appears in the text.
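
For orientation, the same computation can be sketched in plain Python before expressing it as a Beam pipeline. This only illustrates the logic (find words with a regex, count each occurrence) and says nothing about how Beam executes it. The helper `count_words` and the sample lines are hypothetical.

```python
import re
from collections import Counter


def count_words(lines):
    """Count case-sensitive occurrences of each alphabetic word."""
    counts = Counter()
    for line in lines:
        # Runs of ASCII letters count as words; everything else is a separator.
        for word in re.findall(r"[a-zA-Z]+", line):
            counts[word] += 1
    return counts


sample = ['KING LEAR', 'LEAR king of Britain (KING LEAR:)']
for word, count in count_words(sample).items():
    print(f'{word}: {count}')
```

Note that `KING` and `king` are counted separately here, since nothing lowercases the words.
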
To start, we'll create a `yaml` file specifying our pipeline.

In [153]:
pipeline = '''
pipeline:
  type: chain
  transforms:

    # Read input data. Each line from the text file is a string.
    - type: ReadFromText
      name: InputText
      file_pattern: data/kinglear.txt

    # Using a regex, we'll split each line (one long string) into words (list of strings).
    - type: PyFlatMap
      name: FindWords
      fn: |
        import re
        lambda line: re.findall(r"[a-zA-Z]+", line)

    # Pair each word with a '1' (case is preserved). The result of this step is a collection of pairs (word, 1).
    - type: PyMap
      name: PairWordsWith1
      fn: 'lambda word: (word, 1)'

    # Using SumPerKey transform, we'll calculate the occurrence of each word.
    - type: SumPerKey
      name: GroupAndSum

    # Format results - each record should be represented as 'word: count'.
    # The 'fn' parameter accepts functions written in Python
    - type: PyMap
      name: FormatResults
      fn: "lambda word_count_tuple: f'{word_count_tuple[0]}: {word_count_tuple[1]}'"

    # Save results to a text file.
    - type: WriteToText
      name: SaveToText
      file_path_prefix: "data/result-pipeline-01"
      file_name_suffix: ".txt"
'''
save_to_file(pipeline, 'pipeline-01.yaml')

Let's run the pipeline by executing Beam's YAML main module with the pipeline file as an argument:

In [154]:
run('python -m apache_beam.yaml.main --pipeline_spec_file=pipeline-01.yaml')

>> python -m apache_beam.yaml.main --pipeline_spec_file=pipeline-01.yaml
Building pipeline...
INFO:apache_beam.yaml.yaml_transform:Expanding "InputText" at line 7 
INFO:apache_beam.yaml.yaml_transform:Expanding "FindWords" at line 12 
INFO:apache_beam.yaml.yaml_transform:Expanding "PairWordsWith1" at line 19 
INFO:apache_beam.yaml.yaml_transform:Expanding "GroupAndSum" at line 24 
INFO:apache_beam.yaml.yaml_transform:Expanding "FormatResults" at line 28 
INFO:apache_beam.yaml.yaml_transform:Expanding "SaveToText" at line 33 
Running pipeline...



Let's inspect the data. Each line contains a word and an associated count.

In [155]:
run('head data/result-pipeline-01-00000-of-00001.txt')

>> head data/result-pipeline-01-00000-of-00001.txt
KING: 243
LEAR: 236
DRAMATIS: 1
PERSONAE: 1
king: 65
of: 447
Britain: 2
OF: 15
FRANCE: 10
DUKE: 3
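
The output file name carries a shard suffix (`-00000-of-00001`), so in general a runner may write more than one file. Below is a minimal sketch for loading the counts back into Python, assuming the `word: count` line format produced by the `FormatResults` step. The helper `load_counts` is hypothetical, and the glob pattern is an assumption based on the file name above.

```python
import glob


def load_counts(pattern):
    """Read 'word: count' lines from every file matching pattern into a dict."""
    counts = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                # Split on the last ': ' so the count is always the final field.
                word, count = line.rsplit(': ', 1)
                counts[word] = int(count)
    return counts


counts = load_counts('data/result-pipeline-01-*.txt')
print(len(counts), 'distinct words')
```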



## Example 2: load data, filter unwanted lines and save results to a text file.
In this example, we'll create a pipeline which loads the data, filters out legitimate (`ham`) messages, and saves only the spam lines to a file.
Please note that this time we didn't specify the top-level type as `chain`. This allows for more flexibility when creating a pipeline, but requires us to specify the `input` explicitly for each transform.


In [138]:
pipeline = '''
pipeline:

  - type: ReadFromText
    name: SmsData
    file_pattern: data/SMSSpamCollection.csv

  - type: PyMap
    name: SplitLine
    input: SmsData
    fn: 'lambda line: line.split("\\t")'

  - type: PyFilter
    name: KeepSpam
    input: SplitLine
    keep: 'lambda row: row[0] == "spam"'

  - type: WriteToText
    name: SaveToText
    input: KeepSpam
    file_path_prefix: "data/result-pipeline-02"
    file_name_suffix: ".txt"
'''
save_to_file(pipeline, 'pipeline-02.yaml')

Since this pipeline is linear, we can rewrite it as a `chain` and drop the `input` field from each transform.

In [None]:
pipeline = '''
pipeline:
  type: chain
  transforms:
    - type: ReadFromText
      name: SmsData
      file_pattern: data/SMSSpamCollection.csv

    - type: PyMap
      name: SplitLine
      fn: 'lambda line: line.split("\\t")'

    - type: PyFilter
      name: KeepSpam
      keep: 'lambda row: row[0] == "spam"'

    - type: WriteToText
      name: SaveToText
      file_path_prefix: "data/result-pipeline-02"
      file_name_suffix: ".txt"
'''
save_to_file(pipeline, 'pipeline-02-chain.yaml')
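
The relationship between the two forms can be sketched in plain Python: in a chain, each transform implicitly takes the previous transform's output as its input. The helper below, `chain_to_explicit`, is hypothetical and not part of Beam. It only illustrates the idea on dictionaries shaped like the transforms above.

```python
def chain_to_explicit(transforms):
    """Return copies of chained transforms with an explicit 'input' on each
    transform after the first, pointing at the previous transform's name."""
    explicit = []
    previous_name = None
    for transform in transforms:
        step = dict(transform)  # copy so the caller's data is not modified
        if previous_name is not None:
            step['input'] = previous_name
        explicit.append(step)
        previous_name = step['name']
    return explicit


chain = [
    {'type': 'ReadFromText', 'name': 'SmsData'},
    {'type': 'PyMap', 'name': 'SplitLine'},
    {'type': 'PyFilter', 'name': 'KeepSpam'},
    {'type': 'WriteToText', 'name': 'SaveToText'},
]
for step in chain_to_explicit(chain):
    print(step['name'], '<-', step.get('input'))
```

The first transform has no input, which matches the explicit version of the pipeline, where `SmsData` is the only transform without an `input` field.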

To run the pipeline locally, using the DirectRunner, run Beam's YAML main module, passing the `pipeline-02-chain.yaml` (or `pipeline-02.yaml`) file as input:

In [None]:
run('python -m apache_beam.yaml.main --pipeline_spec_file=pipeline-02-chain.yaml')

Let's verify the results and see the content of the output file.

In [None]:
run('head data/result-pipeline-02-00000-of-00001.txt')

If everything went well, you should see only spam messages from our input dataset. Congratulations, on to the next example!


## Example 3: count words in spam messages, select top 10 popular words and write results to a file.

We'd like to write a pipeline which counts words occurring in spam messages, selects the most popular ones and writes the result to a file.


In [137]:
pipeline = '''
pipeline:
  type: chain

  transforms:
    # Read input data. Each line from the csv file is a String.
    - type: ReadFromText
      name: SmsData
      file_pattern: data/SMSSpamCollection.csv

    # Split each line into an array, where the first element is message label (ham or spam) and the second is the content of the message.
    - type: PyMap
      name: SplitLine
      fn: 'lambda line: line.split("\\t")'

    # Select only the rows that contain spam messages, based on the label.
    - type: PyFilter
      name: SpamMessages
      keep: 'lambda row: row[0] == "spam"'

    # Using a regex, we'll split the content of the message (one long string) into words (list of strings)
    - type: PyFlatMap
      name: FindWords
      fn: |
        import re
        lambda line: re.findall(r"[a-zA-Z]+", line[1])

    # Transforming each word to lower case and combining it with a '1'. Result of this step are pairs (word: 1).
    - type: PyMap
      name: PairLoweredWordsWith1
      fn: 'lambda word: (word.lower(), 1)'

    # Using SumPerKey transform, we'll calculate the occurrence of each word.
    - type: SumPerKey
      name: GroupAndSum

    # Select 10 most popular words. Input format to this step is a tuple (word: count),
    # so we provide the count (row[1]) as the key to compare the numbers
    - type: TopNLargest
      name: Largest
      n: 10
      key: 'lambda row: row[1]'

    # Save results to a text file.
    - type: WriteToText
      name: SaveToText
      file_path_prefix: "data/result-pipeline-03"
      file_name_suffix: ".txt"
'''
save_to_file(pipeline, 'pipeline-03.yaml')
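
The final selection step can be pictured in plain Python with `heapq.nlargest`, using the count as the comparison key just as the `key` lambda does in the pipeline above. This is a sketch of the idea with made-up counts, not a description of how Beam implements `TopNLargest`. The helper `top_n` is hypothetical.

```python
import heapq


def top_n(word_counts, n):
    """Return the n (word, count) pairs with the largest counts."""
    # row[1] is the count, matching the key used in the YAML pipeline.
    return heapq.nlargest(n, word_counts, key=lambda row: row[1])


# Made-up counts for illustration only.
sample = [('free', 5), ('call', 9), ('to', 12), ('now', 3)]
print(top_n(sample, 2))  # [('to', 12), ('call', 9)]
```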

Let's run the pipeline:

In [None]:
run('python -m apache_beam.yaml.main --pipeline_spec_file=pipeline-03.yaml')

To view the output:

In [None]:
run('head data/result-pipeline-03-00000-of-00001.txt')

## Summary
Congratulations! You've just run Apache Beam pipelines using YAML.

For all the available operations visit the documentation: # todo(yaml) add url

For a list of available transforms, visit # todo(yaml) add url