# Basics of the Beam model

## Pipeline

![image](https://beam.apache.org/images/design-your-pipeline-multiple-pcollections.svg)

* **Pipeline** - A pipeline is a user-constructed graph of transformations that defines the desired data processing operations.
* **PCollection** - A PCollection is a data set or data stream. The data that a pipeline processes is part of a PCollection.
* **PTransform** - A PTransform (or transform) represents a data processing operation, or a step, in your pipeline. A transform is applied to zero or more PCollection objects, and produces zero or more PCollection objects.
* **Aggregation** - Aggregation is computing a value from multiple (1 or more) input elements.
* **User-defined function (UDF)** - Some Beam operations allow you to run user-defined code as a way to configure the transform.
* **Schema** - A schema is a language-independent type definition for a PCollection. The schema for a PCollection defines elements of that PCollection as an ordered list of named fields.
* **SDK** - A language-specific library that lets pipeline authors build transforms, construct their pipelines, and submit them to a runner.
* **Runner** - A runner runs a Beam pipeline using the capabilities of your chosen data processing engine.
* **Window** - A PCollection can be subdivided into windows based on the timestamps of the individual elements. Windows enable grouping operations over collections that grow over time by dividing the collection into windows of finite collections.
* **Watermark** - A watermark is a guess as to when all data in a certain window is expected to have arrived. This is needed because data isn’t always guaranteed to arrive in a pipeline in time order, or to always arrive at predictable intervals.
* **Trigger** - A trigger determines when to aggregate the results of each window.
* **State and timers** - Per-key state and timer callbacks are lower level primitives that give you full control over aggregating input collections that grow over time.
* **Splittable DoFn** - Splittable DoFns let you process elements in a non-monolithic way. You can checkpoint the processing of an element, and the runner can split the remaining work to yield additional parallelism.

> For more information, see [Basics of the Beam model](https://beam.apache.org/documentation/basics/).

# First Steps

## Import Apache Beam

In [None]:
import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib

## Setting interactivity options

In [None]:
# Record data from unbounded sources for at most 10 minutes.
ib.options.recording_duration = '10m'
# Set the recording size limit to 1 GB (1e9 bytes).
ib.options.recording_size_limit = 1e9

## Creating your pipeline

In [None]:
p = beam.Pipeline(InteractiveRunner())

## Reading and visualizing the data

### Creating a PCollection from in-memory data

In [None]:
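# Label the step so it does not clash with other Create steps on the same pipeline.
output = p | 'CreateNumbers' >> beam.Create([1, 2, 3, 4])
# Run the required part of the pipeline and display the elements.
ib.show(output)
# Printing a PCollection shows the object itself, not its elements.
print(output)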

In [None]:
lines = (
    p
    # A unique label avoids a clash with other Create steps on the same pipeline.
    | 'CreateLines' >> beam.Create([
        'To be, or not to be: that is the question: ',
        "Whether 'tis nobler in the mind to suffer ",
        'The slings and arrows of outrageous fortune, ',
        'Or to take arms against a sea of troubles, ',
    ]))
ib.show(lines)

### Reading from an external source

In [None]:
lines = p | 'ReadMyFile' >> beam.io.ReadFromText('./my_text.txt')
ib.show(lines)

## Common structure of a Pipeline

![image1](pipe.svg)

![image2](DoFn.svg)

[Python transform catalog overview](https://beam.apache.org/documentation/transforms/python/overview/)

[Built-in I/O Transforms](https://beam.apache.org/documentation/io/built-in/)

### Let's create a first transformation

In [None]:
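class ComputeWordLengthFn(beam.DoFn):
  """Emits the length of each input element (here, the characters in a line)."""
  def process(self, element):
    # process() returns an iterable of output elements.
    return [len(element)]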

## Applying transforms

![image](https://beam.apache.org/images/design-your-pipeline-linear.svg)

In [None]:
word_lengths = lines | beam.ParDo(ComputeWordLengthFn())
ib.show(word_lengths)

[Big Data Analytics: An Interactive Introduction to Apache Beam](https://www.youtube.com/watch?v=w0L1rjU_Ib4)