<a href="https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


In [None]:
#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the "License")

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# Try Apache Beam - YAML

While Beam provides powerful APIs for authoring sophisticated data processing pipelines, it still has a high barrier for getting started and authoring simple pipelines. Even setting up the environment, installing the dependencies, and setting up the project can be a challenge.

Here we provide a simple declarative syntax for describing pipelines that does not require coding experience or learning how to use an SDK; any text editor will do. Some installation may be required to actually *execute* a pipeline, but we envision various services (such as Dataflow) accepting YAML pipelines directly, which would remove the need for even that in the future. We also anticipate the ability to generate code directly from these higher-level YAML descriptions, should one want to graduate to a full Beam SDK (and possibly the other direction as well, as far as possible).

Everything here is still under development, but any features already included are considered stable. Feedback is welcome at dev@beam.apache.org.

In this notebook, you set up your development environment and write a simple pipeline using YAML. Then you run it locally, using the [DirectRunner](https://beam.apache.org/documentation/runners/direct/). You can explore other runners with the [Beam Capability Matrix](https://beam.apache.org/documentation/runners/capability-matrix/).

To navigate through different sections, use the table of contents. From the **View** drop-down list, select **Table of contents**.

To run a code cell, click the **Run cell** button at the top left of the cell, or select it and press **`Shift+Enter`**. Try modifying a code cell and re-running it to see what happens.

To learn more about Colab, see [Welcome to Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb).

# Setup

First, you need to set up your environment. The following code installs `apache-beam` and creates directories for your data, pipelines and results.

In [None]:
# Install apache-beam
! pip install --quiet apache-beam[yaml]

# Create directories for storing data, pipelines and results
! mkdir -p data
! mkdir -p pipelines
! mkdir -p results

We'll also create an artificial dataset that represents a simple database. The CSV data contains information about different people. Each line represents a single person, and the person's details are separated by commas. The file contains 5 columns: id, firstname, age, country and profession.

In [None]:
%%writefile 'data/people.csv'
id,firstname,age,country,profession
1,Reeba,58,Belgium,unemployed
2,Maud,45,Spain,firefighter
3,Meg,11,France,unemployed
4,Rani,53,Spain,doctor
5,Natka,26,France,doctor
6,Aurore,32,Italy,police officer
7,Elvira,39,Italy,doctor
8,Asia,10,Belgium,doctor
9,Lesly,35,Spain,firefighter
10,Orelia,31,Germany,police officer
11,Theodora,16,Italy,unemployed
12,Viviene,44,Germany,police officer
13,Teriann,50,Belgium,police officer
14,Carol-Jean,23,Germany,unemployed
15,Drucie,15,Spain,police officer
16,Elie,10,Italy,doctor
17,Shaylyn,34,Belgium,worker
18,Fayre,33,Spain,police officer
19,Sabina,52,Germany,police officer
20,Aryn,20,Belgium,police officer
21,Darlleen,49,Spain,worker
22,Jere,18,Italy,worker
23,Candi,60,Germany,police officer
24,Sindee,40,Germany,firefighter
25,Selma,20,Spain,worker
26,Vonny,35,Germany,doctor
27,Kate,53,Spain,worker
28,Annabela,48,Belgium,worker
29,Jenilee,55,Germany,police officer
30,Gusella,44,France,police officer
31,Fawne,35,Spain,worker
32,Karolina,39,Spain,police officer
33,Sadie,58,Germany,firefighter
34,Clo,10,Italy,police officer
35,Beth,46,Spain,firefighter
36,Adore,18,Italy,firefighter
37,Tarra,29,Spain,doctor
38,Jessamyn,36,France,police officer
39,Deedee,24,Germany,unemployed
40,Patricia,45,Italy,doctor
41,Wileen,39,Spain,police officer
42,Paola,55,Italy,worker
43,Gwyneth,37,Italy,worker
44,Stacey,36,Spain,worker
45,Camile,60,Germany,unemployed
46,Sheree,10,Spain,unemployed
47,Albertina,53,France,police officer
48,Jinny,30,Spain,firefighter
49,Kayla,57,Italy,firefighter
50,Jaime,55,France,doctor

Let's check that the file was created correctly. The command below prints the first few lines of the generated file; confirm that they match the beginning of the content declared above.

In [None]:
! head data/people.csv
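
If you would rather check the file programmatically than by eye, a minimal sketch with Python's standard `csv` module follows. To stay self-contained it parses an inline copy of the header and the first three rows; the `sample` and `rows` names are illustrative only.

```python
import csv
import io

# Inline copy of the header and first three rows of data/people.csv.
sample = """id,firstname,age,country,profession
1,Reeba,58,Belgium,unemployed
2,Maud,45,Spain,firefighter
3,Meg,11,France,unemployed
"""

expected_columns = ["id", "firstname", "age", "country", "profession"]

reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)

# The header must match the five declared columns, in order.
assert reader.fieldnames == expected_columns
# Short rows are padded with None by DictReader, so None signals a missing value.
assert all(None not in row.values() for row in rows)

print(len(rows), "rows;", "first:", rows[0])
```

In the notebook itself you could pass `open('data/people.csv', newline='')` to `csv.DictReader` instead of the inline string. Note that `csv` returns every value as a string, so `age` would need an explicit `int(...)` conversion before any numeric comparison.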

# Your first YAML pipelines

In this section we'll show you the basic structure of a YAML pipeline and introduce some of the available transforms.
Below is a simple pipeline that reads data from the CSV file we've just created and logs the elements for debugging purposes.

The `LogForTesting` transform lets us log the data when developing a pipeline. Remember, it is not advised to use this transform in production.

Let's define the pipeline and save it to a file:

In [None]:
%%writefile 'pipelines/pipeline-01.yaml'
pipeline:
  type: chain
  transforms:
    - type: ReadFromCsv
      config:
        path: data/people.csv
    - type: LogForTesting

We can verify the contents of this file by running:

In [None]:
! cat pipelines/pipeline-01.yaml

Now, we can execute the yaml pipeline by passing this file as an argument to the following command:

In [None]:
! python -m apache_beam.yaml.main --pipeline_spec_file=pipelines/pipeline-01.yaml

Here we use Python and the `apache_beam` package to execute the pipeline, but we envision various services (such as Dataflow) accepting YAML pipelines directly, which would remove the need for that in the future.

If you scroll through the output logs, you'll find entries such as:
```
INFO:root:BeamSchema_edf39b51_91da_418a_b28e_af04c9bae811(id=1, firstname='Reeba', age=58, country='Belgium', profession='unemployed')
INFO:root:BeamSchema_edf39b51_91da_418a_b28e_af04c9bae811(id=2, firstname='Maud', age=45, country='Spain', profession='firefighter')
INFO:root:BeamSchema_edf39b51_91da_418a_b28e_af04c9bae811(id=3, firstname='Meg', age=11, country='France', profession='unemployed')
INFO:root:BeamSchema_edf39b51_91da_418a_b28e_af04c9bae811(id=4, firstname='Rani', age=53, country='Spain', profession='doctor')
```
This is a representation of records from the input dataset.

Let's add a transform: `Filter`. To use this transform you need to specify the `keep` condition and the language your condition is written in. Below you'll find an example with a condition written in Python.
This pipeline filters out the records of people who are younger than 18 years old, so only records representing adults reach the next transform. Verify this by scrolling to the bottom of the output logs.

In [None]:
%%writefile 'pipelines/pipeline-filter-01.yaml'
pipeline:
  type: chain
  transforms:
    - type: ReadFromCsv
      config:
        path: data/people.csv
    - type: Filter
      config:
        language: python
        keep: "age >= 18"
    - type: LogForTesting

In [None]:
! python -m apache_beam.yaml.main --pipeline_spec_file=pipelines/pipeline-filter-01.yaml
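
Conceptually, the `keep` condition is checked once per record, with the record's fields available by name, and only records for which it is true continue downstream. The plain-Python sketch below mimics that on four rows of the dataset; it illustrates the semantics only and is not how Beam evaluates the expression internally.

```python
# Four records from data/people.csv, reduced to three fields for brevity.
people = [
    {"id": 1, "firstname": "Reeba", "age": 58},
    {"id": 3, "firstname": "Meg", "age": 11},
    {"id": 11, "firstname": "Theodora", "age": 16},
    {"id": 22, "firstname": "Jere", "age": 18},
]

# Same condition as `keep: "age >= 18"` in the pipeline above.
adults = [p for p in people if p["age"] >= 18]

print([p["firstname"] for p in adults])  # ['Reeba', 'Jere']
```

Note that the comparison is inclusive, so a person who is exactly 18 is kept.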

Similarly, we can write the condition in other languages, for example SQL. In this example we keep all adults, plus any minors whose profession is 'unemployed'. In other words, the only records filtered out are people younger than 18 who have any other profession.

In [None]:
%%writefile 'pipelines/pipeline-filter-02.yaml'
pipeline:
  type: chain
  transforms:
    - type: ReadFromCsv
      config:
        path: data/people.csv
    - type: Filter
      config:
        language: sql
        keep: "age >= 18 or (age < 18 and profession = 'unemployed')"
    - type: LogForTesting

In [None]:
! python -m apache_beam.yaml.main --pipeline_spec_file=pipelines/pipeline-filter-02.yaml
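
The `age < 18` test inside the parentheses is logically redundant: whenever the left side of the `or` is false, the age is already known to be below 18. A quick brute-force check in Python (an illustration of the boolean logic only, assuming no field is missing, and not a model of Beam's SQL engine) suggests the condition could equally be written as `age >= 18 or profession = 'unemployed'`.

```python
from itertools import product

# The professions that occur in data/people.csv.
professions = ["unemployed", "firefighter", "doctor", "police officer", "worker"]

def original(age, profession):
    """The condition exactly as written in pipeline-filter-02.yaml."""
    return age >= 18 or (age < 18 and profession == "unemployed")

def simplified(age, profession):
    """The same condition without the redundant age test."""
    return age >= 18 or profession == "unemployed"

# Compare both predicates for every age 0..119 and every profession.
equivalent = all(
    original(a, p) == simplified(a, p)
    for a, p in product(range(120), professions)
)
print(equivalent)  # True
```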

You'll notice that the output of this pipeline is in a different format than the previous one. That's because this pipeline uses an SQL Filter transform, an example of a [multi-language transform](https://beam.apache.org/documentation/programming-guide/#multi-language-pipelines). Multi-language pipelines are an important feature of Beam, but in this notebook we'll focus on YAML.

To find the output of this pipeline, look for lines that begin with the `message` keyword and whose associated `transform_id` starts with `LogForTesting`.
Example:
```
message: "{\"id\":49,\"firstname\":\"Kayla\",\"age\":57,\"country\":\"Italy\",\"profession\":\"firefighter\"}"
instruction_id: "bundle_6"
transform_id: "LogForTesting/beam:schematransform:org.apache.beam:yaml:log_for_testing:v1/LogAsJson/ParMultiDo(Anonymous)"
```
Each log entry represents one element from the output data.
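
Because the `message` value is a JSON document wrapped in a quoted, escaped string, it can be decoded in two steps. The sketch below assumes the log line has exactly the shape shown above; the variable names are illustrative.

```python
import json

# One `message:` line copied from the log output above.
log_line = r'message: "{\"id\":49,\"firstname\":\"Kayla\",\"age\":57,\"country\":\"Italy\",\"profession\":\"firefighter\"}"'

# Split off the key, leaving the quoted and escaped payload.
_, _, quoted = log_line.partition(": ")

# First decode: the payload is a JSON string literal, which yields the inner JSON text.
inner = json.loads(quoted)
# Second decode: parse the inner JSON text into a dictionary.
record = json.loads(inner)

print(record["firstname"], record["age"])  # Kayla 57
```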

Another useful transform is `MapToFields`. This transform lets us manipulate fields of a record. For example, we can add a field to our records, which is a boolean field specifying if the person is adult or not.

In [None]:
%%writefile 'pipelines/pipeline-map-01.yaml'
pipeline:
  type: chain
  transforms:
    - type: ReadFromCsv
      config:
        path: data/people.csv
    - type: MapToFields
      config:
        language: python
        append: true
        fields:
          is_adult: "age >= 18"
    - type: LogForTesting


In [None]:
! python -m apache_beam.yaml.main --pipeline_spec_file=pipelines/pipeline-map-01.yaml
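
In plain Python terms, `append: true` keeps every existing field and adds the new ones next to them. A rough sketch of the effect on two records from the dataset (an illustration of the resulting shape, not Beam's implementation):

```python
records = [
    {"id": 3, "firstname": "Meg", "age": 11, "country": "France", "profession": "unemployed"},
    {"id": 4, "firstname": "Rani", "age": 53, "country": "Spain", "profession": "doctor"},
]

# Copy all original fields, then add is_adult computed from age.
with_flag = [{**r, "is_adult": r["age"] >= 18} for r in records]

for r in with_flag:
    print(r["firstname"], r["is_adult"])
```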

Beam will try to infer the types involved in the mappings, but sometimes this is not possible. In these cases we can explicitly declare the expected output type, for example:

In [None]:
%%writefile 'pipelines/pipeline-map-02.yaml'
pipeline:
  type: chain
  transforms:
    - type: ReadFromCsv
      config:
        path: data/people.csv
    - type: MapToFields
      config:
        language: python
        append: true
        fields:
          is_adult:
            expression: "age >= 18"
            output_type: boolean
    - type: LogForTesting

In [None]:
! python -m apache_beam.yaml.main --pipeline_spec_file=pipelines/pipeline-map-02.yaml

When the `append` field is specified, you can also `drop` fields, for example:

In [None]:
%%writefile 'pipelines/pipeline-map-03.yaml'
pipeline:
  type: chain
  transforms:
    - type: ReadFromCsv
      config:
        path: data/people.csv
    - type: MapToFields
      config:
        language: python
        append: true
        fields:
          is_adult: "age >= 18"
        drop:
          - age
    - type: LogForTesting

In [None]:
! python -m apache_beam.yaml.main --pipeline_spec_file=pipelines/pipeline-map-03.yaml
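
The combination of `fields` and `drop` can be pictured with a small helper. `append_and_drop` below is a hypothetical function written for this illustration, not part of Beam; it assumes new fields are computed from the original record before the dropped fields are removed, which is what lets `is_adult` depend on `age` even though `age` is dropped.

```python
def append_and_drop(record, new_fields, drop=()):
    """Hypothetical helper: add computed fields, then remove the dropped ones."""
    # Compute new fields from the original record, before anything is removed.
    computed = {name: fn(record) for name, fn in new_fields.items()}
    kept = {k: v for k, v in record.items() if k not in drop}
    return {**kept, **computed}

row = {"id": 8, "firstname": "Asia", "age": 10, "country": "Belgium", "profession": "doctor"}
result = append_and_drop(row, {"is_adult": lambda r: r["age"] >= 18}, drop=["age"])
print(result)
```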

We can also create simple UDFs (User Defined Functions) using Python or other languages. In the example below we add a field `random_number` whose value is a random non-negative integer strictly smaller than the person's age.

In [None]:
%%writefile 'pipelines/pipeline-map-04.yaml'
pipeline:
  type: chain
  transforms:
    - type: ReadFromCsv
      config:
        path: data/people.csv
    - type: MapToFields
      config:
        language: python
        append: true
        fields:
          random_number:
            callable: |
              import random
              def my_mapping(row):
                return random.randrange(row.age)
    - type: LogForTesting

In [None]:
! python -m apache_beam.yaml.main --pipeline_spec_file=pipelines/pipeline-map-04.yaml
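
`random.randrange(n)` returns an integer from 0 up to, but not including, `n`, and raises `ValueError` when `n` is 0 or negative. Every age in this dataset is at least 10, so the callable above is safe here. The short sketch below checks the range empirically; the seed is set only to make the run repeatable.

```python
import random

random.seed(0)  # fixed seed so repeated runs draw the same numbers

age = 10
draws = [random.randrange(age) for _ in range(1000)]

# All draws fall in 0..age-1; the age itself is never returned.
print(min(draws), max(draws))

# A zero age would make the range empty and raise ValueError.
zero_age_failed = False
try:
    random.randrange(0)
except ValueError as err:
    zero_age_failed = True
    print("ValueError:", err)
```

If your own data could contain an age of 0, the callable would need a guard for that case.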

Beam YAML can also aggregate data, grouping and combining values across records. This is accomplished via the `Combine` transform type.

In this example we'll aggregate our records based on the `is_adult` classification and calculate the average age of each group. Note that the output field is named `total` in the pipeline below, even though it holds a mean.

In [None]:
%%writefile 'pipelines/pipeline-combine-01.yaml'
pipeline:
  type: chain
  transforms:
    - type: ReadFromCsv
      config:
        path: data/people.csv
    - type: MapToFields
      config:
        language: python
        append: true
        fields:
          is_adult: "age >= 18"
    - type: Combine
      config:
        group_by: is_adult
        combine:
          total:
            value: age
            fn: mean
    - type: LogForTesting

In [None]:
! python -m apache_beam.yaml.main --pipeline_spec_file=pipelines/pipeline-combine-01.yaml

If all was executed correctly, you should see the following lines at the bottom of the output log:
```
INFO:root:Result(is_adult=True, total=40.674418604651166)
INFO:root:Result(is_adult=False, total=11.714285714285714)
```
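
These two numbers can be reproduced without Beam. The sketch below copies the `age` column from `data/people.csv` in file order and computes the same per-group mean in plain Python, which is a handy sanity check on the expected output.

```python
from statistics import mean

# The age column of data/people.csv, in file order (ids 1 through 50).
ages = [58, 45, 11, 53, 26, 32, 39, 10, 35, 31,
        16, 44, 50, 23, 15, 10, 34, 33, 52, 20,
        49, 18, 60, 40, 20, 35, 53, 48, 55, 44,
        35, 39, 58, 10, 46, 18, 29, 36, 24, 45,
        39, 55, 37, 36, 60, 10, 53, 30, 57, 55]

# Group ages by the same is_adult rule used in the pipeline.
groups = {True: [], False: []}
for age in ages:
    groups[age >= 18].append(age)

# Print the group key, the group size and the mean age.
for is_adult, values in groups.items():
    print(is_adult, len(values), mean(values))
```

The dataset contains 43 adults and 7 minors, so the adult mean is 1749 / 43 and the minor mean is 82 / 7.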

All the previous pipelines were linear: the output of one transform was the input to the next. This is also known as a `chain` pipeline, and it is designated in the top-level pipeline configuration, for example:
```
pipeline:
  type: chain
  transforms:
    ...
```
In YAML we can also create nonlinear pipelines. To do this, we can specify `type: composite`, or omit this line completely (this is the default pipeline type). In these pipelines, we must specify the `input` in each of the transforms that take the output of previous transforms. This `input` is the name, or collection of names, of the transform(s) that feed into the receiving transform.
The specification below will create the following pipeline:
```
             +----> Doctors -----------> SaveDoctors
InputData ---+
             +----> OtherProfessions --> SaveOtherProfessions
```

In [None]:
%%writefile 'pipelines/pipeline-nonlinear-01.yaml'
pipeline:
  type: composite
  transforms:
    - type: ReadFromCsv
      name: InputData
      config:
        path: data/people.csv
    - type: Filter
      name: Doctors
      input: InputData
      config:
        language: python
        keep: "profession == 'doctor'"
    - type: Filter
      name: OtherProfessions
      input: InputData
      config:
        language: python
        keep: "profession != 'doctor'"
    - type: WriteToCsv
      name: SaveDoctors
      input: Doctors
      config:
        path: results/doctors
    - type: WriteToCsv
      name: SaveOtherProfessions
      input: OtherProfessions
      config:
        path: results/other-professions

In [None]:
! python -m apache_beam.yaml.main --pipeline_spec_file=pipelines/pipeline-nonlinear-01.yaml
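
The `input` fields are what turn the flat list of transforms into a graph. As a sketch, the snippet below mirrors the names and inputs of the specification above in a plain Python list (written by hand rather than parsed from the YAML file) and derives the edges of the diagram from it.

```python
# (name, input) pairs copied from pipeline-nonlinear-01.yaml; None marks a source.
transforms = [
    ("InputData", None),
    ("Doctors", "InputData"),
    ("OtherProfessions", "InputData"),
    ("SaveDoctors", "Doctors"),
    ("SaveOtherProfessions", "OtherProfessions"),
]

# Each transform with an input contributes one edge: producer -> consumer.
edges = [(source, name) for name, source in transforms if source is not None]

# Collect the consumers of each producer to see where the pipeline branches.
fan_out = {}
for source, name in edges:
    fan_out.setdefault(source, []).append(name)

for source, consumers in fan_out.items():
    print(source, "->", ", ".join(consumers))
```

`InputData` is the only producer with two consumers, which is the branch point drawn in the diagram.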

The output consists of two files: `results/doctors-00000-of-00001` and `results/other-professions-00000-of-00001`. Let's see their contents:

In [None]:
! head results/doctors-00000-of-00001

In [None]:
! head results/other-professions-00000-of-00001

# Summary
Congratulations! You've just run Apache Beam pipelines using YAML.

To learn more about Beam YAML, visit the [Beam YAML API documentation page](https://beam.apache.org/documentation/sdks/yaml/).

To run your pipeline in Dataflow, you'll need to set up your Google Cloud project and run the pipeline with the `DataflowRunner`. For more information, follow https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#run-on-dataflow