Dataflow

  • Executes Apache Beam Pipelines

  • Can be used for batch or stream data

  • Scalable, fault-tolerant, multi-step processing of data.

  • Often used for data preparation/ETL for data sets.

  • Integrates with other tools (GCP and external)

    • Natively – PubSub, BigQuery, Cloud AI Platform
      • BQ I/O connector can stream JSON through Dataflow to BQ
    • Connectors – BigTable, Apache Kafka
  • Pipelines are regional

  • Follows the Apache Beam programming model

    • Data Source -> Transformations -> Data Sink
  • Use when:

    • No dependencies on Apache Hadoop/Spark
    • Favor hands-off/serverless
    • Preprocessing for machine learning with Cloud ML Engine
  • Templates

    • Google provided templates.
      • WordCount
      • Bulk compress/decompress GCS Files
      • PubSub to (Avro, PubSub, GCS Text, BigQuery)
      • Datastore to GCS Text
      • GCS Text to (BigQuery, PubSub, DataStore)
  • Requires a staging location (a GCS path) where intermediate files may be stored; a minimal pipeline-options sketch follows this list.

  • System Lag

    • Max time an element has been waiting for processing in this stage of the pipeline.
  • Wall Time

    • Approximate time spent processing the step, summed across all worker threads.
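
A minimal sketch (Beam Python SDK) of submitting a pipeline to the Dataflow runner with a staging location and region; the project, region, and bucket names are placeholders, not from these notes:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# All project, region, and bucket names below are placeholders.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',                        # pipelines are regional
    staging_location='gs://my-bucket/staging',   # required staging location
    temp_location='gs://my-bucket/temp',
)

with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input/*.txt')
     | 'ToUpper' >> beam.Map(str.upper)
     | 'Write' >> beam.io.WriteToText('gs://my-bucket/output/result'))
```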

Apache Beam Architecture

  • Pipeline
    • Entire set of computations
    • Not linear; it is a DAG
    • Beam programs start by constructing a Pipeline object.
    • A single, potentially repeatable job, from start to finish, in Dataflow.
    • Defined by a driver program.
      • The actual computations run on a backend, abstracted in the driver by a runner.
        • Driver: defines the DAG
        • Runner: executes the DAG
  • Supports multiple backends
    • Spark
    • Flink
    • Dataflow
    • Beam Model
  • Element
    • A single entry of data (e.g. table row)
  • PCollection
    • Distributed data set in pipeline (immutable)
    • Specialized container classes that can represent data sets of virtually unlimited size.
      • Fixed size: Text file or BQ table
      • Unbounded: Pub/Sub subscription
    • Side inputs
      • Inject additional data into a transform, alongside its main input PCollection
      • Can be passed to ParDo transforms.
  • PTransform
    • Data processing operation (step) in pipeline
      • Input: one or more PCollections
      • A processing function applied to the elements of the PCollection
      • Output: one or more PCollections
  • ParDo
    • Core of parallel processing in Beam SDKs
    • Applies a user-defined function (DoFn) to each element independently and collects the zero or more output elements into an output PCollection.
    • Useful for a variety of common data processing operations, including:
      • Filtering a data set.
        • More flexible than the built-in Filter transform, which can only filter based on the input element itself (no side inputs allowed); see the ParDo sketch after this list.
      • Formatting or type-converting each element in a data set.
      • Extracting parts of each element in a data set.
      • Performing computations on each element in a data set.
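
A rough Beam Python sketch of a ParDo that uses a side input to filter a PCollection; the values and threshold are invented for illustration:

```python
import apache_beam as beam

class FilterAboveThreshold(beam.DoFn):
    """ParDo function: emits zero or one output elements per input element."""
    def process(self, element, threshold):
        if element > threshold:
            yield element

with beam.Pipeline() as p:
    values = p | 'Values' >> beam.Create([1, 5, 12, 7, 20])
    threshold = p | 'Threshold' >> beam.Create([10])
    filtered = (values
                | 'FilterWithSideInput' >> beam.ParDo(
                    FilterAboveThreshold(),
                    # Side input: a single value injected into the ParDo.
                    threshold=beam.pvalue.AsSingleton(threshold)))
    filtered | 'Print' >> beam.Map(print)
```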

Dealing with late/out of order data

  • Latency is to be expected (network latency, processing time, etc.)
  • PubSub does not deal with late data; that is resolved in Dataflow.
  • Resolved with Windows, Watermarks, and Triggers.
  • Windows = logically divide elements into groups by time span.
  • Watermarks = Timestamp
    • Event time – When data was generated
    • Processing time – when data processed anywhere in pipeline
    • Can use Pub/Sub provided watermark or source generated.
  • Trigger = determines when the results in a window are emitted.
    • (i.e., emitted as complete)
    • Allows late-arriving data within the allowed lateness window to re-aggregate previously emitted results.
    • Can be based on timestamps, element count, or combinations of both (a windowed-aggregation sketch with allowed lateness follows this list).
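
A hedged sketch of windows, watermark-based triggering, and allowed lateness in the Beam Python SDK; the keys, timestamps, and durations are illustrative only:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import AfterWatermark, AfterCount, AccumulationMode

with beam.Pipeline() as p:
    events = (p
              | 'Create' >> beam.Create([('user1', 1), ('user2', 1), ('user1', 1)])
              # Attach event-time timestamps (epoch seconds; values are placeholders).
              | 'Timestamps' >> beam.Map(
                  lambda kv: window.TimestampedValue(kv, 1609459200)))
    counts = (events
              | 'Window' >> beam.WindowInto(
                  window.FixedWindows(60),                     # 1-minute event-time windows
                  trigger=AfterWatermark(late=AfterCount(1)),  # fire at watermark, re-fire per late element
                  allowed_lateness=600,                        # accept data up to 10 minutes late
                  accumulation_mode=AccumulationMode.ACCUMULATING)
              | 'CountPerKey' >> beam.combiners.Count.PerKey()
              | 'Print' >> beam.Map(print))
```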

Stopping a Dataflow Job

Cancelling

  • Immediately stop and abort all data ingestion and processing.
  • Buffered data may be lost.

Draining

  • Cease ingestion but will attempt to finish processing any remaining buffered data.
  • Pipeline resources will be maintained until buffered data has finished processing and any pending output has finished writing.

Pipeline Update

  • Replaces an existing pipeline in-place with a new one while preserving Dataflow’s exactly-once processing guarantee (see the options sketch after this list).
  • When updating a pipeline manually, use DRAIN instead of CANCEL to preserve in-flight data.
    • Drain command is supported for streaming pipelines only
  • Pipelines cannot share data or transforms.
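
A sketch of the pipeline options typically passed when resubmitting a streaming job as an in-place update, assuming the Python SDK's update/job_name options; all names are placeholders:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/region/bucket; job_name must match the running streaming job.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/temp',
    streaming=True,
    update=True,                  # replace the running job in place
    job_name='my-streaming-job',  # name of the job being updated
)
# Build and run the revised pipeline with these options as usual.
```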

Handling Invalid Inputs in Dataflow

  • Catch exception, log an error, then drop input.
    • Not ideal
  • Have a dead letter file where all failing inputs are written for later analysis and reprocessing.
    • Add a side output (an additional tagged output) in Dataflow to accomplish this, as sketched below.
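
A sketch of the dead-letter pattern with an additional tagged output in Beam Python; the ParseJson DoFn, tag name, and output path are invented for illustration:

```python
import json
import apache_beam as beam

class ParseJson(beam.DoFn):
    def process(self, element):
        try:
            yield json.loads(element)
        except (ValueError, TypeError):
            # Route the bad input to a dead-letter output instead of dropping it.
            yield beam.pvalue.TaggedOutput('dead_letter', element)

with beam.Pipeline() as p:
    lines = p | 'Input' >> beam.Create(['{"id": 1}', 'not-json'])
    results = lines | 'Parse' >> beam.ParDo(ParseJson()).with_outputs(
        'dead_letter', main='parsed')
    results.parsed | 'GoodSink' >> beam.Map(print)
    # Persist failures for later analysis and reprocessing (path is a placeholder).
    results.dead_letter | 'DeadLetter' >> beam.io.WriteToText('/tmp/dead_letter')
```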

Converting from Kafka to PubSub

  • CloudPubSubConnector is a connector to be used with Kafka Connect to publish messages from Kafka to PubSub and vice versa.
  • Provides both a sink connector (to copy messages from Kafka to PubSub) and a source connector (to copy messages from PubSub to Kafka).

Windowing

  • Can apply windowing to streams to compute per-window aggregates, e.g. a rolling average or the max within a window.

Fixed Time Windows (Tumbling Window)

  • Fixed window size
  • Non-overlapping time
  • Number of elements can differ between windows (see the FixedWindows sketch below)
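
For example, a fixed (tumbling) window transform in Beam Python; the 60-second size is arbitrary:

```python
import apache_beam as beam
from apache_beam import window

# Applied as: events | 'Fixed' >> tumbling
# Every element lands in exactly one non-overlapping 60-second window.
tumbling = beam.WindowInto(window.FixedWindows(60))
```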

Sliding Time Windows (overlapped)

  • Fixed window size
  • Overlapping time
  • Number of elements can differ between windows
  • Window Interval: how large the window is
  • Sliding Interval: how far the window moves each step (see the SlidingWindows sketch below)
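
A sliding-window sketch in Beam Python; the size/period values are arbitrary:

```python
import apache_beam as beam
from apache_beam import window

# Window interval of 60s, sliding interval of 10s:
# each element can belong to several overlapping windows.
sliding = beam.WindowInto(window.SlidingWindows(size=60, period=10))
```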

Session Windows

  • Changing window size based on session data
  • No overlapping time
  • Number of elements can differ between windows
  • Session gap determines window size
  • Per-key basis
  • Useful for data that is irregularly distributed with respect to time (see the Sessions sketch below).
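
A session-window sketch in Beam Python; the 10-minute gap is arbitrary:

```python
import apache_beam as beam
from apache_beam import window

# Per-key windows that close after 10 minutes with no new data for that key.
sessions = beam.WindowInto(window.Sessions(gap_size=10 * 60))
```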

Single Global Window

  • Late data is discarded
  • Okay for bounded size data
  • Can be used with unbounded data, but use with caution when applying transforms such as GroupByKey and Combine (see the global-window sketch below)
  • Default windowing behavior is to assign all elements of a PCollection to a single, global window, even for unbounded PCollections.
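
A sketch of pairing the single global window with a trigger for unbounded data in Beam Python; the element count is arbitrary:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import Repeatedly, AfterCount, AccumulationMode

# Pair the single global window with a trigger so GroupByKey/Combine can emit
# results on an unbounded source; the element count is illustrative.
global_windowed = beam.WindowInto(
    window.GlobalWindows(),
    trigger=Repeatedly(AfterCount(1000)),
    accumulation_mode=AccumulationMode.DISCARDING)
```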

Triggers

  • Determines when a window’s contents should be emitted, based on a certain condition being met.
    • Allows specifying a trigger to control when (in processing time) results for a given window are produced.
    • If unspecified, the default behavior is to trigger first when the watermark passes the end of the window, and then trigger again every time late-arriving data shows up (see the trigger sketch below).
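
A sketch making the watermark trigger explicit, with early and late firings, in Beam Python; the durations and counts are illustrative:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterCount, AfterProcessingTime, AfterWatermark)

# Watermark trigger with optional early (speculative) and late firings.
triggered = beam.WindowInto(
    window.FixedWindows(60),
    trigger=AfterWatermark(
        early=AfterProcessingTime(30),  # speculative pane 30s (processing time) after the first element
        late=AfterCount(1)),            # re-fire for each late-arriving element
    allowed_lateness=600,
    accumulation_mode=AccumulationMode.ACCUMULATING)
```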

Time-Based Trigger

Event Time Triggers

  • Operate on event time, as indicated by the timestamp on each data element.
  • This is the default trigger.

Processing Time Triggers

  • Operate on the processing time – the time when the data element is processed at any given stage in the pipeline.

Data-Driven Trigger

  • Operate by examining the data as it arrives in each window, and firing when that data meets a certain property.
  • Currently, the only supported data-driven trigger fires after a certain number of data elements.

Composite Triggers

  • Combine multiple triggers in various ways (see the composite-trigger sketch below).
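
A composite-trigger sketch in Beam Python; the count and delay are illustrative:

```python
from apache_beam.transforms.trigger import (
    AfterAny, AfterCount, AfterProcessingTime, Repeatedly)

# Fire whenever 100 elements have arrived OR 60 seconds of processing time
# have passed since the first element in the pane, and keep repeating.
composite = Repeatedly(AfterAny(AfterCount(100), AfterProcessingTime(60)))
```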

Watermarks

  • System’s notion of when all data in a certain window can be expected to have arrived in the pipeline.
  • Tracks watermark because data is not guaranteed to arrive in a pipeline in order or at predictable intervals.
  • No guarantees about ordering.
  • Indicates all windows ending before or at this timestamp are closed.
  • Streaming entities with timestamps before the watermark are no longer accepted into those windows (unless allowed lateness is configured).
  • For unbounded data, results are emitted when the watermark passes the end of the window, indicating that the system believes all input data for that window has been processed.
  • Used with Processing Time

IAM

  • Project-level only – all pipelines in the project (or none)
  • Pipeline data access separate from pipeline access.
  • Dataflow Admin
    • Full pipeline access
    • Machine type/storage bucket config access
  • Dataflow Developer
    • Full pipeline access
    • No machine type/storage bucket access (data privacy)
  • Dataflow Viewer
    • View permissions only.
  • Dataflow Worker
    • Enables service account to execute work units for a Dataflow pipeline in Compute Engine.
    • Dataflow API also needs to be enabled.