# Introduction to Apache Beam Python SDK & Google Dataflow

![title](./image/beam_mascot.png)

## Prepared and presented by Setia Budi

## What is Apache Beam?

Apache Beam is a unified programming model, with a set of SDKs, for building data processing pipelines that handle both batch and stream processing. A pipeline is defined once against Beam's abstractions and can then be executed on any supported processing engine (a "runner"), such as Apache Flink, Apache Spark, or Google Cloud Dataflow. Apache Kafka, by contrast, is a messaging system that a Beam pipeline can read from and write to; it is not a runner.

BEAM -> Batch + strEAM

## Apache Beam at a Glance

![title](./image/learner_graph.png)

## Basic Components

### Pipeline
A Pipeline encapsulates the entire data processing task, from start to finish. This includes reading input data, transforming that data, and writing output data.

### PCollection
A PCollection represents a distributed data set that your Beam pipeline operates on. The data set can be bounded, meaning it comes from a fixed source like a file, or unbounded, meaning it comes from a continuously updating source via a subscription or other mechanism. PCollections are the inputs and outputs for each step in your pipeline.

### PTransform
A PTransform represents a data processing operation, or a step, in your pipeline. Every PTransform takes one or more PCollection objects as the input, performs a processing function that you provide on the elements of that PCollection, and then produces zero or more output PCollection objects.

### I/O transforms
Beam comes with a number of “IOs” - library PTransforms that read or write data to various external storage systems.

## Illustration for Pipeline, PCollection, and PTransform

![title](./image/pcollection_ptransform.png)

## Installation

In [None]:
!python --version

In [None]:
!which python

In [None]:
!pip install apache-beam
!pip install "apache-beam[gcp]"

## Sample Dataset

In [10]:
!head -n 20 ./example/dept_data.txt

149633CM,Marco,10,Accounts,1-01-2019
212539MU,Rebekah,10,Accounts,1-01-2019
231555ZZ,Itoe,10,Accounts,1-01-2019
503996WI,Edouard,10,Accounts,1-01-2019
704275DC,Kyle,10,Accounts,1-01-2019
957149WC,Kyle,10,Accounts,1-01-2019
241316NX,Kumiko,10,Accounts,1-01-2019
796656IE,Gaston,10,Accounts,1-01-2019
331593PS,Beryl,20,HR,1-01-2019
560447WH,Olga,20,HR,1-01-2019
222997TJ,Leslie,20,HR,1-01-2019
171752SY,Mindy,20,HR,1-01-2019
153636AS,Vicky,20,HR,1-01-2019
745411HT,Richard,20,HR,1-01-2019
298464HN,Kirk,20,HR,1-01-2019
783950BW,Kaori,20,HR,1-01-2019
892691AR,Beryl,20,HR,1-01-2019
245668UZ,Oscar,20,HR,1-01-2019
231206QD,Kumiko,30,Finance,1-01-2019
357919KT,Wendy,30,Finance,1-01-2019


## Case 1: Simple and Not So Useful Pipeline

In [11]:
import apache_beam as beam

p1 = beam.Pipeline()

(
    p1
    | beam.io.ReadFromText("./example/dept_data.txt")
    | beam.io.WriteToText("./output/output_data")
)

p1.run()



<apache_beam.runners.portability.fn_api_runner.fn_runner.RunnerResult at 0x7fd03090d0d0>

In [12]:
!head -n 20 ./output/output_data-00000-of-00001

149633CM,Marco,10,Accounts,1-01-2019
212539MU,Rebekah,10,Accounts,1-01-2019
231555ZZ,Itoe,10,Accounts,1-01-2019
503996WI,Edouard,10,Accounts,1-01-2019
704275DC,Kyle,10,Accounts,1-01-2019
957149WC,Kyle,10,Accounts,1-01-2019
241316NX,Kumiko,10,Accounts,1-01-2019
796656IE,Gaston,10,Accounts,1-01-2019
331593PS,Beryl,20,HR,1-01-2019
560447WH,Olga,20,HR,1-01-2019
222997TJ,Leslie,20,HR,1-01-2019
171752SY,Mindy,20,HR,1-01-2019
153636AS,Vicky,20,HR,1-01-2019
745411HT,Richard,20,HR,1-01-2019
298464HN,Kirk,20,HR,1-01-2019
783950BW,Kaori,20,HR,1-01-2019
892691AR,Beryl,20,HR,1-01-2019
245668UZ,Oscar,20,HR,1-01-2019
231206QD,Kumiko,30,Finance,1-01-2019
357919KT,Wendy,30,Finance,1-01-2019
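
The file read back above is `output_data-00000-of-00001` rather than `output_data` because `WriteToText` treats its argument as a file name prefix and appends a shard suffix; Beam's default shard template is `-SSSSS-of-NNNNN`. The helper below is a plain-Python sketch of that naming pattern only. `shard_file_name` is a hypothetical name used for illustration, not a Beam function.

```python
def shard_file_name(prefix, shard_index, num_shards, suffix=""):
    """Mimic the default `-SSSSS-of-NNNNN` shard naming (illustration only)."""
    return f"{prefix}-{shard_index:05d}-of-{num_shards:05d}{suffix}"


# A single-shard write, as in the run above.
print(shard_file_name("./output/output_data", 0, 1))
# ./output/output_data-00000-of-00001
```

With a larger input or a distributed runner the write may produce several shards, so reading results back with a glob such as `./output/output_data-*` is more robust than hard-coding `00000-of-00001`.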


## Case 1.1: Simple and Not So Useful Pipeline, Now with Labels

In [11]:
import apache_beam as beam

p1 = beam.Pipeline()

(
    p1
    | "ReadFromText" >> beam.io.ReadFromText("./example/dept_data.txt")
    | "WriteOutput" >> beam.io.WriteToText("./output/output_data")
)

p1.run()



<apache_beam.runners.portability.fn_api_runner.fn_runner.RunnerResult at 0x7fd03090d0d0>

In [12]:
!head -n 20 ./output/output_data-00000-of-00001

149633CM,Marco,10,Accounts,1-01-2019
212539MU,Rebekah,10,Accounts,1-01-2019
231555ZZ,Itoe,10,Accounts,1-01-2019
503996WI,Edouard,10,Accounts,1-01-2019
704275DC,Kyle,10,Accounts,1-01-2019
957149WC,Kyle,10,Accounts,1-01-2019
241316NX,Kumiko,10,Accounts,1-01-2019
796656IE,Gaston,10,Accounts,1-01-2019
331593PS,Beryl,20,HR,1-01-2019
560447WH,Olga,20,HR,1-01-2019
222997TJ,Leslie,20,HR,1-01-2019
171752SY,Mindy,20,HR,1-01-2019
153636AS,Vicky,20,HR,1-01-2019
745411HT,Richard,20,HR,1-01-2019
298464HN,Kirk,20,HR,1-01-2019
783950BW,Kaori,20,HR,1-01-2019
892691AR,Beryl,20,HR,1-01-2019
245668UZ,Oscar,20,HR,1-01-2019
231206QD,Kumiko,30,Finance,1-01-2019
357919KT,Wendy,30,Finance,1-01-2019
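
The `"Label" >> transform` and `pipeline | transform` syntax is ordinary Python operator overloading: Beam's transform classes implement `__rrshift__` to attach a label, and pipelines and PCollections implement `__or__` to apply a transform. The toy classes below sketch that mechanism on an in-memory list. `ToyTransform` and `ToyCollection` are hypothetical names and are far simpler than Beam's own classes.

```python
class ToyTransform:
    """A per-element function with an optional label."""

    def __init__(self, fn, label=None):
        self.fn = fn
        self.label = label or fn.__name__

    def __rrshift__(self, label):
        # Called for `"Label" >> transform` because str does not implement `>>`.
        return ToyTransform(self.fn, label)


class ToyCollection:
    """An in-memory stand-in for a PCollection that records applied labels."""

    def __init__(self, elements, applied=()):
        self.elements = list(elements)
        self.applied = list(applied)

    def __or__(self, transform):
        # Called for `collection | transform`; returns a new collection.
        new_elements = [transform.fn(e) for e in self.elements]
        return ToyCollection(new_elements, self.applied + [transform.label])


rows = ToyCollection(["149633CM,Marco,10,Accounts,1-01-2019"])
result = (
    rows
    | "SplitRecord" >> ToyTransform(lambda line: line.split(","))
    | "PickName" >> ToyTransform(lambda fields: fields[1])
)
print(result.applied)   # ['SplitRecord', 'PickName']
print(result.elements)  # ['Marco']
```

Labels matter in practice because Beam requires every step in a pipeline to have a unique label, and those labels are what appear in the job graph when the pipeline runs on Dataflow.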


## Case 2: Simple Pipeline with Simple Function

In [13]:
import apache_beam as beam


def split_row(element):
    return element.split(",")


def filtering(element):
    return element[3] == "Accounts"


p1 = beam.Pipeline()

(
    p1
    | "ReadFromText" >> beam.io.ReadFromText("./example/dept_data.txt")
    | "SplitRecord" >> beam.Map(split_row)
    | "FilterAccounts" >> beam.Filter(filtering)
    | "WriteOutput" >> beam.io.WriteToText("./output/output_data")
)

p1.run()



<apache_beam.runners.portability.fn_api_runner.fn_runner.RunnerResult at 0x7fd03089fbb0>

In [14]:
!head -n 20 ./output/output_data-00000-of-00001

['149633CM', 'Marco', '10', 'Accounts', '1-01-2019']
['212539MU', 'Rebekah', '10', 'Accounts', '1-01-2019']
['231555ZZ', 'Itoe', '10', 'Accounts', '1-01-2019']
['503996WI', 'Edouard', '10', 'Accounts', '1-01-2019']
['704275DC', 'Kyle', '10', 'Accounts', '1-01-2019']
['957149WC', 'Kyle', '10', 'Accounts', '1-01-2019']
['241316NX', 'Kumiko', '10', 'Accounts', '1-01-2019']
['796656IE', 'Gaston', '10', 'Accounts', '1-01-2019']
['149633CM', 'Marco', '10', 'Accounts', '2-01-2019']
['212539MU', 'Rebekah', '10', 'Accounts', '2-01-2019']
['231555ZZ', 'Itoe', '10', 'Accounts', '2-01-2019']
['503996WI', 'Edouard', '10', 'Accounts', '2-01-2019']
['704275DC', 'Kyle', '10', 'Accounts', '2-01-2019']
['957149WC', 'Kyle', '10', 'Accounts', '2-01-2019']
['241316NX', 'Kumiko', '10', 'Accounts', '2-01-2019']
['796656IE', 'Gaston', '10', 'Accounts', '2-01-2019']
['718737IX', 'Ayumi', '10', 'Accounts', '2-01-2019']
['149633CM', 'Marco', '10', 'Accounts', '3-01-2019']
['212539MU', 'Rebekah', '10', 'Accounts'
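
To see what `Map` and `Filter` do to each element without running a pipeline, the two functions from this case can be applied to a few sample rows with plain Python. This is only a sketch of the per-element semantics; Beam applies the same functions, but in parallel and with no guaranteed ordering.

```python
def split_row(element):
    return element.split(",")


def filtering(element):
    return element[3] == "Accounts"


sample_lines = [
    "149633CM,Marco,10,Accounts,1-01-2019",
    "331593PS,Beryl,20,HR,1-01-2019",
    "231206QD,Kumiko,30,Finance,1-01-2019",
]

# beam.Map(split_row): exactly one output element per input element.
records = [split_row(line) for line in sample_lines]

# beam.Filter(filtering): keep only the elements for which the function is true.
accounts = [record for record in records if filtering(record)]

print(accounts)
# [['149633CM', 'Marco', '10', 'Accounts', '1-01-2019']]
```

Because `WriteToText` writes each element's string form, the output file contains Python list literals rather than CSV lines. Adding a step such as `beam.Map(lambda record: ",".join(record))` before the write would restore the comma-separated format.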

## Case 3: Simple Pipeline with Lambda Expression

In [15]:
import apache_beam as beam

p1 = beam.Pipeline()

(
    p1
    | "ReadFromText" >> beam.io.ReadFromText("./example/dept_data.txt")
    | "SplitRecord" >> beam.Map(lambda record: record.split(","))
    | "FilterAccounts" >> beam.Filter(lambda record: record[3] == "Accounts")
    | "WriteOutput" >> beam.io.WriteToText("./output/output_data")
)

p1.run()



<apache_beam.runners.portability.fn_api_runner.fn_runner.RunnerResult at 0x7fd004768a30>

In [16]:
!head -n 20 ./output/output_data-00000-of-00001

['149633CM', 'Marco', '10', 'Accounts', '1-01-2019']
['212539MU', 'Rebekah', '10', 'Accounts', '1-01-2019']
['231555ZZ', 'Itoe', '10', 'Accounts', '1-01-2019']
['503996WI', 'Edouard', '10', 'Accounts', '1-01-2019']
['704275DC', 'Kyle', '10', 'Accounts', '1-01-2019']
['957149WC', 'Kyle', '10', 'Accounts', '1-01-2019']
['241316NX', 'Kumiko', '10', 'Accounts', '1-01-2019']
['796656IE', 'Gaston', '10', 'Accounts', '1-01-2019']
['149633CM', 'Marco', '10', 'Accounts', '2-01-2019']
['212539MU', 'Rebekah', '10', 'Accounts', '2-01-2019']
['231555ZZ', 'Itoe', '10', 'Accounts', '2-01-2019']
['503996WI', 'Edouard', '10', 'Accounts', '2-01-2019']
['704275DC', 'Kyle', '10', 'Accounts', '2-01-2019']
['957149WC', 'Kyle', '10', 'Accounts', '2-01-2019']
['241316NX', 'Kumiko', '10', 'Accounts', '2-01-2019']
['796656IE', 'Gaston', '10', 'Accounts', '2-01-2019']
['718737IX', 'Ayumi', '10', 'Accounts', '2-01-2019']
['149633CM', 'Marco', '10', 'Accounts', '3-01-2019']
['212539MU', 'Rebekah', '10', 'Accounts'

## Case 4: Using Google Dataflow as the runner

In [15]:
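import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Replace the placeholder project, region, and bucket values with your own.
pipeline_options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project-id",
    region="us-central1",  # placeholder; Dataflow needs a region to run the job
    job_name="unique-job-name",
    temp_location="gs://my-bucket/temp",
)

# Pass the options by keyword: the first positional argument of Pipeline is the runner.
p1 = beam.Pipeline(options=pipeline_options)

# Dataflow workers cannot see the notebook's local disk, so the input and
# output live in Cloud Storage (placeholder paths in the same bucket).
(
    p1
    | "ReadFromText" >> beam.io.ReadFromText("gs://my-bucket/example/dept_data.txt")
    | "SplitRecord" >> beam.Map(lambda record: record.split(","))
    | "FilterAccounts" >> beam.Filter(lambda record: record[3] == "Accounts")
    | "WriteOutput" >> beam.io.WriteToText("gs://my-bucket/output/output_data")
)

p1.run()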



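
The `job_name` placeholder above hints that each submission needs its own name. One common way to get that is to append a timestamp to a readable prefix, as in the sketch below. `make_job_name` is a hypothetical helper, and restricting the name to lowercase letters, digits, and hyphens is a conservative assumption rather than a statement of Dataflow's exact naming rules.

```python
import re
from datetime import datetime, timezone


def make_job_name(prefix, now=None):
    """Return `prefix` plus a UTC timestamp, using only [a-z0-9-] characters."""
    now = now or datetime.now(timezone.utc)
    # Replace anything outside [a-z0-9-] so the name stays in a conservative character set.
    safe_prefix = re.sub(r"[^a-z0-9-]+", "-", prefix.lower()).strip("-")
    return f"{safe_prefix}-{now:%Y%m%d-%H%M%S}"


fixed_time = datetime(2019, 1, 1, 12, 30, 0, tzinfo=timezone.utc)
print(make_job_name("Dept Accounts", fixed_time))
# dept-accounts-20190101-123000
```

The result can be passed as `job_name=make_job_name("dept-accounts")` when building the `PipelineOptions`.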

## Case 5: Simple Pipeline with CombinePerKey

In [20]:
import apache_beam as beam

p1 = beam.Pipeline()

(
    p1
    | "ReadFromText" >> beam.io.ReadFromText("./example/dept_data.txt")
    | "SplitRecord" >> beam.Map(lambda record: record.split(","))
    | "FilterAccounts" >> beam.Filter(lambda record: record[3] == "Accounts")
    | "MapToKeyValue" >> beam.Map(lambda record: (record[1], 1))
    | "CombineByKey" >> beam.CombinePerKey(sum)
    | "WriteOutput" >> beam.io.WriteToText("./output/output_data")
)

p1.run()



<apache_beam.runners.portability.fn_api_runner.fn_runner.RunnerResult at 0x7fd02b73d3d0>

In [18]:
!head -n 20 ./output/output_data-00000-of-00001

('Marco', 31)
('Rebekah', 31)
('Itoe', 31)
('Edouard', 31)
('Kyle', 62)
('Kumiko', 31)
('Gaston', 31)
('Ayumi', 30)
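
`CombinePerKey(sum)` groups the `(name, 1)` pairs by key and sums the values for each key, so the result is a count of Accounts records per first name. The plain-Python sketch below reproduces that logic on the eight Accounts rows dated 1-01-2019 from the sample data. It illustrates the semantics only, since Beam performs the combine in parallel across workers.

```python
from collections import defaultdict

# Accounts rows for 1-01-2019, taken from the sample dataset.
accounts_rows = [
    "149633CM,Marco,10,Accounts,1-01-2019",
    "212539MU,Rebekah,10,Accounts,1-01-2019",
    "231555ZZ,Itoe,10,Accounts,1-01-2019",
    "503996WI,Edouard,10,Accounts,1-01-2019",
    "704275DC,Kyle,10,Accounts,1-01-2019",
    "957149WC,Kyle,10,Accounts,1-01-2019",
    "241316NX,Kumiko,10,Accounts,1-01-2019",
    "796656IE,Gaston,10,Accounts,1-01-2019",
]

# MapToKeyValue: emit (name, 1) for each record.
pairs = [(row.split(",")[1], 1) for row in accounts_rows]

# CombinePerKey(sum): collect the values per key, then reduce them with sum().
values_by_name = defaultdict(list)
for name, one in pairs:
    values_by_name[name].append(one)
counts = {name: sum(values) for name, values in values_by_name.items()}

print(counts["Kyle"])   # 2
print(counts["Marco"])  # 1
```

Two different employees (`704275DC` and `957149WC`) share the name Kyle, which is why Kyle's total in the full output is 62 while most other names show 31. Keying on the employee ID (`record[0]`) instead of the name would count each person separately.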
