## First Pipeline

In [4]:
import apache_beam as beam

inputs = [0, 1, 2, 3]

# Create a pipeline.
with beam.Pipeline() as pipeline:
  # Feed it some input elements with `Create`.
  outputs = (
      pipeline
      | 'Create initial values' >> beam.Create(inputs)
  )

  # `outputs` is a PCollection with our input elements.
  # But printing it directly won't show us its contents :(
  print(f"outputs: {outputs}")



outputs: PCollection[[4]: Create initial values/Map(decode).None]


`Note:` In Beam, you can NOT access the elements of a PCollection directly as you would with a Python list. This means we can't simply print the output PCollection to see its elements.

This is because, depending on the runner, the PCollection elements might live on multiple worker machines.

To print the elements in the PCollection, we'll do a little trick:

In [1]:
import apache_beam as beam

inputs = [0, 1, 2, 3]

with beam.Pipeline() as pipeline:
  outputs = (
      pipeline
      | 'Create initial values' >> beam.Create(inputs)
  )

  # We can only access the elements through another transform.
  outputs | beam.Map(print)



0
1
2
3


## Map

In Beam, there is the Map transform, but we must use it within a pipeline.

First we create a pipeline and feed it our input elements. Then we pipe those elements into a Map transform where we apply our function.

In [2]:
import apache_beam as beam

inputs = [0, 1, 2, 3]

with beam.Pipeline() as pipeline:
  outputs = (
      pipeline
      | 'Create values' >> beam.Create(inputs)
      | 'Multiply by 2' >> beam.Map(lambda x: x * 2)
  )

  outputs | beam.Map(print)



0
2
4
6


## FlatMap

FlatMap accepts a function that takes a single input element and outputs an iterable of elements.
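
As a rough plain-Python analogy (no Beam involved), the difference between mapping and flat-mapping the same function can be sketched as follows. The helper name `expand` is only illustrative; it mirrors the lambda used in the Beam cell in this section.

```python
from itertools import chain

inputs = [0, 1, 2, 3]

def expand(x):
    # Return an iterable with x copies of x; for 0 this is an empty list.
    return [x for _ in range(x)]

# Mapping keeps exactly one output (here, a list) per input element.
mapped = [expand(x) for x in inputs]

# Flat-mapping flattens those iterables into individual elements,
# so an input that produces an empty iterable contributes nothing.
flat_mapped = list(chain.from_iterable(expand(x) for x in inputs))

print(mapped)       # [[], [1], [2, 2], [3, 3, 3]]
print(flat_mapped)  # [1, 2, 2, 3, 3, 3]
```

This is why the input `0` does not appear in the flattened output: its iterable is empty.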

In [3]:
import apache_beam as beam

inputs = [0, 1, 2, 3]

with beam.Pipeline() as pipeline:
  outputs = (
      pipeline
      | 'Create values' >> beam.Create(inputs)
      | 'Expand elements' >> beam.FlatMap(lambda x: [x for _ in range(x)])
  )

  outputs | beam.Map(print)



1
2
2
3
3
3


## Filter

Sometimes we want to only process certain elements while ignoring others.

We want to filter each element in a collection using a function.

`filter` takes a function that checks a single element and returns `True` to keep the element or `False` to discard it.

ℹ️ For example, we only want to keep numbers that are even, that is, divisible by two. We can use the modulo operator `%` for a simple check.
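
In plain Python, the same idea can be sketched with the built-in `filter` function. The predicate name `is_even` is just an illustrative choice.

```python
inputs = [0, 1, 2, 3]

def is_even(x):
    # True keeps the element, False discards it.
    return x % 2 == 0

# filter() is lazy, so wrap it in list() to materialize the results.
evens = list(filter(is_even, inputs))

print(evens)  # [0, 2]
```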

In Beam, there is the Filter transform.

In [4]:
import apache_beam as beam

inputs = [0, 1, 2, 3]

with beam.Pipeline() as pipeline:
  outputs = (
      pipeline
      | 'Create values' >> beam.Create(inputs)
      | 'Keep only even numbers' >> beam.Filter(lambda x: x % 2 == 0)
  )

  outputs | beam.Map(print)



0
2


## Combine

We also need a way to get a single value from an entire PCollection. We might want to get the total number of elements, or the average value, or any other type of aggregation of values.

We want to combine the elements in a collection into a single output.

`combine` takes a function that transforms an iterable of inputs and returns a single output.

Other common names for this function are fold and reduce.

ℹ️ For example, we want to add all numbers together.
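
As a plain-Python sketch of the same idea, the built-in `sum` and `functools.reduce` both collapse an iterable into a single value. The variable names here are illustrative.

```python
from functools import reduce

inputs = [0, 1, 2, 3]

# The built-in sum takes an iterable and returns a single value.
total_builtin = sum(inputs)

# reduce (a "fold") does the same by applying a two-argument
# function cumulatively, starting from the initial value 0.
total_reduce = reduce(lambda acc, x: acc + x, inputs, 0)

print(total_builtin)  # 6
print(total_reduce)   # 6
```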

In Beam, there are aggregation transforms.

For this particular example, we can use the CombineGlobally transform which accepts a function that takes an iterable of elements as an input and outputs a single value.

We can pass the built-in function sum into CombineGlobally.

In [5]:
import apache_beam as beam

inputs = [0, 1, 2, 3]

with beam.Pipeline() as pipeline:
  outputs = (
      pipeline
      | 'Create values' >> beam.Create(inputs)
      | 'Sum all values together' >> beam.CombineGlobally(sum)
  )

  outputs | beam.Map(print)



6


ℹ️ There are many ways to combine values in Beam. You could even combine them into a different data type by defining a custom `CombineFn`.

You can learn more about them by checking the available [aggregation transforms](https://beam.apache.org/documentation/transforms/python/overview/#aggregation).
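
To give a feel for how a custom combiner can produce a different data type, here is a plain-Python sketch that computes an average. In Beam, a custom combiner subclasses `beam.CombineFn` and implements `create_accumulator`, `add_input`, `merge_accumulators`, and `extract_output`; the class below only mimics those four methods and is not a Beam class. The class name and the way the data is split into two chunks are assumptions made for illustration.

```python
class AverageAccumulator:
    """Plain-Python stand-in mirroring the four CombineFn methods."""

    def create_accumulator(self):
        # The accumulator is a (running sum, count) pair.
        return (0.0, 0)

    def add_input(self, accumulator, element):
        total, count = accumulator
        return total + element, count + 1

    def merge_accumulators(self, accumulators):
        # Partial results from different chunks are merged pairwise.
        totals, counts = zip(*accumulators)
        return sum(totals), sum(counts)

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float('nan')


combiner = AverageAccumulator()

# Pretend two workers each saw part of the data.
partials = []
for chunk in ([0, 1], [2, 3]):
    acc = combiner.create_accumulator()
    for element in chunk:
        acc = combiner.add_input(acc, element)
    partials.append(acc)

average = combiner.extract_output(combiner.merge_accumulators(partials))
print(average)  # 1.5
```

Keeping a small accumulator per chunk and merging them at the end is what lets this kind of aggregation be computed in pieces: the inputs are integers, while the output is a float.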

## GroupByKey

Sometimes it's useful to pair each element with a key that we can use to group related elements together.

Think of it as creating a Python dict from a list of (key, value) pairs, but instead of replacing the value on a "duplicate" key, you would get a list of all the values associated with that key.

ℹ️ For example, we want to group each animal with the list of foods they like, and we start with (animal, food) pairs.
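
The dict analogy can be sketched in plain Python. The ASCII names below are illustrative stand-ins for the emoji pairs used in this section.

```python
from collections import defaultdict

pairs = [
    ('hamster', 'corn'),
    ('panda', 'bamboo'),
    ('rabbit', 'carrot'),
    ('hamster', 'chestnut'),
    ('rabbit', 'cucumber'),
]

# A plain dict keeps only the last value seen for each key.
as_dict = dict(pairs)

# Grouping collects every value that shares a key instead.
grouped = defaultdict(list)
for animal, food in pairs:
    grouped[animal].append(food)

print(as_dict['hamster'])  # chestnut
print(dict(grouped))
```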

In Beam, there is the GroupByKey transform.

In [6]:
import apache_beam as beam

inputs = [
  ('🐹', '🌽'),
  ('🐼', '🎋'),
  ('🐰', '🥕'),
  ('🐹', '🌰'),
  ('🐰', '🥒'),
]

with beam.Pipeline() as pipeline:
  outputs = (
      pipeline
      | 'Create (animal, food) pairs' >> beam.Create(inputs)
      | 'Group foods by animals' >> beam.GroupByKey()
  )

  outputs | beam.Map(print)



('🐹', ['🌽', '🌰'])
('🐼', ['🎋'])
('🐰', ['🥕', '🥒'])


## Reading and writing data

So far we've learned some of the basic transforms like [`Map`](https://beam.apache.org/documentation/transforms/python/elementwise/map), [`FlatMap`](https://beam.apache.org/documentation/transforms/python/elementwise/flatmap), [`Filter`](https://beam.apache.org/documentation/transforms/python/elementwise/filter), [`Combine`](https://beam.apache.org/documentation/transforms/python/aggregation/combineglobally), and [`GroupByKey`](https://beam.apache.org/documentation/transforms/python/aggregation/groupbykey). These allow us to transform data in any way, but so far we've used [`Create`](https://beam.apache.org/documentation/transforms/python/other/create) to get data from an in-memory [`iterable`](https://docs.python.org/3/glossary.html#term-iterable), like a `list`.

This works well for experimenting with small datasets. For larger datasets we can use `Source` transforms to read data and `Sink` transforms to write data. If there is no built-in `Source` or `Sink` transform for our data, we can also create our own custom I/O transforms.

Let's create some data files and see how we can read them in Beam.

In [7]:
%%writefile data/sample1.txt
This is just a plain text file, UTF-8 strings are allowed 🎉.
Each line in the file is one element in the PCollection.

Writing data/sample1.txt


In [8]:
%%writefile data/sample2.txt
There are no guarantees on the order of the elements.
ฅ^•ﻌ•^ฅ

Writing data/sample2.txt


In [9]:
%%writefile data/penguins.csv
species,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g
0,0.2545454545454545,0.6666666666666666,0.15254237288135594,0.2916666666666667
0,0.26909090909090905,0.5119047619047618,0.23728813559322035,0.3055555555555556
1,0.5236363636363636,0.5714285714285713,0.3389830508474576,0.2222222222222222
1,0.6509090909090909,0.7619047619047619,0.4067796610169492,0.3333333333333333
2,0.509090909090909,0.011904761904761862,0.6610169491525424,0.5
2,0.6509090909090909,0.38095238095238104,0.9830508474576272,0.8333333333333334

Writing data/penguins.csv


### Reading from text files

We can use the [`ReadFromText`](https://beam.apache.org/releases/pydoc/current/apache_beam.io.textio.html#apache_beam.io.textio.ReadFromText) transform to read text files into `str` elements.

It takes a [*glob pattern*](https://en.wikipedia.org/wiki/Glob_%28programming%29) as an input, and reads all the files that match that pattern. It returns one element for each line in the file.

For example, in the pattern `data/*.txt`, the `*` is a wildcard that matches anything. This pattern matches all the files in the `data/` directory with a `.txt` extension.
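
To get a feel for how wildcard matching behaves, here is a small sketch using Python's standard `fnmatch` module on a list of file names. This only illustrates glob-style matching in general; Beam's own file matching may differ in details, and `data/notes.txt` is a hypothetical file name added for contrast.

```python
from fnmatch import fnmatchcase

candidates = [
    'data/sample1.txt',
    'data/sample2.txt',
    'data/penguins.csv',
    'data/notes.txt',  # hypothetical file, not created in this notebook
]

def matching(pattern):
    # Keep only the names that match the glob-style pattern.
    return [name for name in candidates if fnmatchcase(name, pattern)]

all_txt = matching('data/*.txt')
samples = matching('data/sample*.txt')

print(all_txt)  # ['data/sample1.txt', 'data/sample2.txt', 'data/notes.txt']
print(samples)  # ['data/sample1.txt', 'data/sample2.txt']
```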

In [10]:
import apache_beam as beam

input_files = 'data/sample*.txt'
with beam.Pipeline() as pipeline:
  (
      pipeline
      | 'Read files' >> beam.io.ReadFromText(input_files)
      | 'Print contents' >> beam.Map(print)
  )



This is just a plain text file, UTF-8 strings are allowed 🎉.
Each line in the file is one element in the PCollection.
There are no guarantees on the order of the elements.
ฅ^•ﻌ•^ฅ


### Writing to text files

We can use the [`WriteToText`](https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.io.textio.html#apache_beam.io.textio.WriteToText) transform to write `str` elements into text files.

It takes a *file path prefix* as an input, and it writes all the `str` elements into one or more files with filenames starting with that prefix. You can optionally pass a `file_name_suffix` as well, which is usually used for the file extension. Each element goes on its own line in the output files.

In [11]:
import apache_beam as beam

output_file_name_prefix = 'output/sample'

with beam.Pipeline() as pipeline:
  (
      pipeline
      | 'Create file lines' >> beam.Create([
          'Each element must be a string.',
          'It writes one element per line.',
          'There are no guarantees on the line order.',
          'The data might be written into multiple files.',
      ])
      | 'Write to files' >> beam.io.WriteToText(
          output_file_name_prefix,
          file_name_suffix='.txt')
  )



In [12]:
# Let's look at the output files and their contents.
!head output/sample*.txt

Each element must be a string.
It writes one element per line.
There are no guarantees on the line order.
The data might be written into multiple files.


### Reading from a SQLite database

Let's begin by creating a small local SQLite database file.

Run the following cell to create a new SQLite3 database with the filename you choose.

In [13]:
import sqlite3

database_file = "data/moon-phases.db" #@param {type:"string"}

with sqlite3.connect(database_file) as db:
  cursor = db.cursor()

  # Create the moon_phases table.
  cursor.execute('''
    CREATE TABLE IF NOT EXISTS moon_phases (
      id INTEGER PRIMARY KEY,
      phase_emoji TEXT NOT NULL,
      peak_datetime DATETIME NOT NULL,
      phase TEXT NOT NULL)''')

  # Truncate the table if it's already populated.
  cursor.execute('DELETE FROM moon_phases')

  # Insert some sample data.
  insert_moon_phase = 'INSERT INTO moon_phases(phase_emoji, peak_datetime, phase) VALUES(?, ?, ?)'
  cursor.execute(insert_moon_phase, ('🌕', '2017-12-03 15:47:00', 'Full Moon'))
  cursor.execute(insert_moon_phase, ('🌗', '2017-12-10 07:51:00', 'Last Quarter'))
  cursor.execute(insert_moon_phase, ('🌑', '2017-12-18 06:30:00', 'New Moon'))
  cursor.execute(insert_moon_phase, ('🌓', '2017-12-26 09:20:00', 'First Quarter'))
  cursor.execute(insert_moon_phase, ('🌕', '2018-01-02 02:24:00', 'Full Moon'))
  cursor.execute(insert_moon_phase, ('🌗', '2018-01-08 22:25:00', 'Last Quarter'))
  cursor.execute(insert_moon_phase, ('🌑', '2018-01-17 02:17:00', 'New Moon'))
  cursor.execute(insert_moon_phase, ('🌓', '2018-01-24 22:20:00', 'First Quarter'))
  cursor.execute(insert_moon_phase, ('🌕', '2018-01-31 13:27:00', 'Full Moon'))

  # Query for the data in the table to make sure it's populated.
  cursor.execute('SELECT * FROM moon_phases')
  for row in cursor.fetchall():
    print(row)

(1, '🌕', '2017-12-03 15:47:00', 'Full Moon')
(2, '🌗', '2017-12-10 07:51:00', 'Last Quarter')
(3, '🌑', '2017-12-18 06:30:00', 'New Moon')
(4, '🌓', '2017-12-26 09:20:00', 'First Quarter')
(5, '🌕', '2018-01-02 02:24:00', 'Full Moon')
(6, '🌗', '2018-01-08 22:25:00', 'Last Quarter')
(7, '🌑', '2018-01-17 02:17:00', 'New Moon')
(8, '🌓', '2018-01-24 22:20:00', 'First Quarter')
(9, '🌕', '2018-01-31 13:27:00', 'Full Moon')


We could use a FlatMap transform to receive a SQL query and yield each result row, but that would mean creating a new database connection for each query. If we generated a large number of queries, creating that many connections could be a bottleneck.

It would be nice to create the database connection only once for each worker, and every query could use the same connection if needed.

We can use a custom DoFn transform for this. It allows us to open and close resources, like the database connection, only once per DoFn instance by using the setup and teardown methods.

ℹ️ It should be safe to read from a database with multiple concurrent processes using the same connection, but only one process should be writing at once.
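
The open-once, query-many-times lifecycle can be sketched without Beam using only the standard `sqlite3` module. In a real pipeline the runner calls `setup` and `teardown` for us; here they are called by hand. The class name, the table `t`, and the `connections_opened` counter are all assumptions made for illustration.

```python
import sqlite3

class ReusedConnectionReader:
    """Plain-Python stand-in for a DoFn that reuses one connection."""

    def __init__(self, database_file):
        self.database_file = database_file
        self.connection = None
        self.connections_opened = 0

    def setup(self):
        # Open the connection once, before any query is processed.
        self.connection = sqlite3.connect(self.database_file)
        self.connections_opened += 1

    def process(self, sql):
        # Every query reuses the connection opened in setup().
        yield from self.connection.execute(sql)

    def teardown(self):
        # Close the connection once, after all queries are done.
        if self.connection is not None:
            self.connection.close()
            self.connection = None


reader = ReusedConnectionReader(':memory:')
reader.setup()
reader.connection.execute('CREATE TABLE t (n INTEGER)')
reader.connection.executemany('INSERT INTO t VALUES (?)', [(1,), (2,), (3,)])

rows = []
for sql in ['SELECT n FROM t ORDER BY n', 'SELECT COUNT(*) FROM t']:
    rows.extend(reader.process(sql))
reader.teardown()

print(rows)                       # [(1,), (2,), (3,), (3,)]
print(reader.connections_opened)  # 1
```

Both queries run on the single connection created in `setup`, which is the behavior we want from the custom DoFn.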

In [14]:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
import sqlite3
from typing import Iterable, List, Tuple, Dict

class SQLiteSelect(beam.DoFn):
  def __init__(self, database_file: str):
    self.database_file = database_file
    self.connection = None

  def setup(self):
    self.connection = sqlite3.connect(self.database_file)

  def process(self, query: Tuple[str, List[str]]) -> Iterable[Dict[str, str]]:
    table, columns = query
    cursor = self.connection.cursor()
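    # Table and column names are interpolated directly into the SQL text,
    # so they must come from trusted input.
    cursor.execute(f"SELECT {','.join(columns)} FROM {table}")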
    for row in cursor.fetchall():
      yield dict(zip(columns, row))

  def teardown(self):
    self.connection.close()

@beam.ptransform_fn
@beam.typehints.with_input_types(beam.pvalue.PBegin)
@beam.typehints.with_output_types(Dict[str, str])
def SelectFromSQLite(
    pbegin: beam.pvalue.PBegin,
    database_file: str,
    queries: List[Tuple[str, List[str]]],
) -> beam.PCollection[Dict[str, str]]:
  return (
      pbegin
      | 'Create queries' >> beam.Create(queries)
      | 'SQLite SELECT' >> beam.ParDo(SQLiteSelect(database_file))
  )

queries = [
    # (table_name, [column1, column2, ...])
    ('moon_phases', ['phase_emoji', 'peak_datetime', 'phase']),
    ('moon_phases', ['phase_emoji', 'phase']),
]

options = PipelineOptions(flags=[], type_check_additional='all')
with beam.Pipeline(options=options) as pipeline:
  (
      pipeline
      | 'Read from SQLite' >> SelectFromSQLite(database_file, queries)
      | 'Print rows' >> beam.Map(print)
  )



{'phase_emoji': '🌕', 'peak_datetime': '2017-12-03 15:47:00', 'phase': 'Full Moon'}
{'phase_emoji': '🌗', 'peak_datetime': '2017-12-10 07:51:00', 'phase': 'Last Quarter'}
{'phase_emoji': '🌑', 'peak_datetime': '2017-12-18 06:30:00', 'phase': 'New Moon'}
{'phase_emoji': '🌓', 'peak_datetime': '2017-12-26 09:20:00', 'phase': 'First Quarter'}
{'phase_emoji': '🌕', 'peak_datetime': '2018-01-02 02:24:00', 'phase': 'Full Moon'}
{'phase_emoji': '🌗', 'peak_datetime': '2018-01-08 22:25:00', 'phase': 'Last Quarter'}
{'phase_emoji': '🌑', 'peak_datetime': '2018-01-17 02:17:00', 'phase': 'New Moon'}
{'phase_emoji': '🌓', 'peak_datetime': '2018-01-24 22:20:00', 'phase': 'First Quarter'}
{'phase_emoji': '🌕', 'peak_datetime': '2018-01-31 13:27:00', 'phase': 'Full Moon'}
{'phase_emoji': '🌕', 'phase': 'Full Moon'}
{'phase_emoji': '🌗', 'phase': 'Last Quarter'}
{'phase_emoji': '🌑', 'phase': 'New Moon'}
{'phase_emoji': '🌓', 'phase': 'First Quarter'}
{'phase_emoji': '🌕', 'phase': 'Full Moon'}
{'phase_emoji': '🌗',