Dataflow
Dataflow is a managed service for executing a wide variety of data processing patterns.
Google Cloud Dataflow makes it easy to process and analyze streaming data so that you can derive insights and react to new information in real time.
https://www.youtube.com/watch?v=7lJyq1hw_KI
Dataflow uses your pipeline code to create an execution graph that represents your pipeline's PCollections and transforms.
https://www.youtube.com/watch?v=cqDBnOaS6O4
Apache Beam is an open source, unified programming model; its SDKs enable you to develop both batch and streaming pipelines.
You create your pipelines with an Apache Beam program and then run them on the Dataflow service.
The Apache Beam documentation provides in-depth conceptual information and reference material for the Apache Beam programming model, SDKs, and other runners.
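A minimal sketch of that idea with the Python SDK: the transform chain below is what Dataflow turns into an execution graph of PCollections and transforms, and swapping the runner option switches between local and Dataflow execution. The project, region, bucket, and output path are placeholders.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",             # or "DirectRunner" for local execution
    project="my-project",                # placeholder
    region="europe-west1",               # placeholder
    temp_location="gs://my-bucket/tmp",  # placeholder
)

with beam.Pipeline(options=options) as p:
    # Each step yields a PCollection; the chain of transforms is what the
    # service turns into the pipeline's execution graph.
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://dataflow-samples/shakespeare/kinglear.txt")
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "Count" >> beam.combiners.Count.PerElement()
        | "Format" >> beam.MapTuple(lambda word, n: f"{word}: {n}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/wordcount")  # placeholder
    )
```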
https://medium.com/syntio/data-processing-with-dataflow-sql-part-1-2-fe57e47f4bb0
https://mkuthan.github.io/blog/2022/01/28/stream-processing-part1/
https://mkuthan.github.io/blog/2022/03/08/stream-processing-part2/
https://cloud.google.com/blog/products/data-analytics/developing-beam-pipelines-using-scala/
Cloud Dataflow is not the only big data processing option on Google Cloud: you can also use Apache Spark on Cloud Dataproc, Cloud Composer (managed Airflow), or Cloud Workflows.
https://github.com/manuzhang/awesome-streaming
- Serverless: We don’t have to manage computing resources
- Processing code is separate from the execution environment
- Processing in batch and streaming mode with the same programming model (see the sketch below)
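A rough sketch of the last point, assuming the Python SDK: the counting logic is identical in both modes; only the source (bounded files vs. an unbounded Pub/Sub topic) and the windowing differ. The topic and bucket names are placeholders.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def count_words(lines):
    # Same core logic regardless of whether `lines` is bounded or unbounded.
    return (
        lines
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "Window" >> beam.WindowInto(beam.window.FixedWindows(60))  # 1-minute windows
        | "Count" >> beam.combiners.Count.PerElement()
    )

streaming = True  # flip to False for a batch run over files
options = PipelineOptions(streaming=streaming)

with beam.Pipeline(options=options) as p:
    if streaming:
        lines = (p
                 | beam.io.ReadFromPubSub(topic="projects/my-project/topics/my-topic")  # placeholder
                 | beam.Map(lambda b: b.decode("utf-8")))
    else:
        lines = p | beam.io.ReadFromText("gs://my-bucket/input/*.txt")  # placeholder
    counts = count_words(lines)
```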
https://beam.apache.org/documentation/programming-guide/
https://cloud.google.com/blog/products/databases/apache-beam-firestore-connector-released
https://medium.com/bb-tutorials-and-thoughts/how-to-get-started-with-gcp-dataflow-822295dce7b4
https://doppelfelix.medium.com/pipeline-in-the-cloud-6edb007c4d52
https://cloud.google.com/blog/products/data-analytics/debunking-myths-about-python-on-dataflow
https://beam.apache.org/get-started/wordcount-example/
https://beam.apache.org/get-started/mobile-gaming-example/
In this notebook, we set up a development environment and work through a simple example using the DirectRunner. You can explore other runners in the Beam Capability Matrix.
Using the Apache Beam interactive runner with JupyterLab notebooks lets you iteratively develop pipelines, inspect your pipeline graph, and parse individual PCollections in a read-eval-print-loop (REPL) workflow. These Apache Beam notebooks are made available through AI Platform Notebooks.
https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development
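A small sketch of that notebook workflow, assuming the Python interactive runner (`apache_beam.runners.interactive`); the PCollection contents here are illustrative only.

```python
import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
from apache_beam.runners.interactive import interactive_beam as ib

p = beam.Pipeline(InteractiveRunner())

words = (p
         | beam.Create(["to", "be", "or", "not", "to", "be"])
         | beam.combiners.Count.PerElement())

ib.show(words)       # renders the PCollection's contents in the notebook
df = ib.collect(words)  # materializes it as a pandas DataFrame for further inspection
ib.show_graph(p)     # visualizes the current pipeline graph
```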
The Dataflow templates are an effort to solve simple, but large, in-Cloud data tasks, including data import/export/backup/restore and bulk API operations, without a development environment. The technology under the hood which makes these operations possible is the Google Cloud Dataflow service combined with a set of Apache Beam SDK templated pipelines.
https://cloud.google.com/blog/products/data-analytics/dataflow-templates-gets-your-data-into-motion
Google provides a set of open-source Dataflow templates.
Dataflow templates use runtime parameters to accept values that are only available during pipeline execution. To customize the execution of a templated pipeline, you can pass these parameters to functions that run within the pipeline (such as a DoFn).
To create a template from your Apache Beam pipeline, you must modify your pipeline code to support runtime parameters.
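A hedged sketch of such a modification for a Python classic template: the `--greeting` parameter and `AddGreeting` DoFn are hypothetical, but the pattern (declare the option with `add_value_provider_argument`, read the value only inside pipeline code via `.get()`) follows the runtime-parameters approach described above.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Resolved only at template execution time, not at template creation time.
        parser.add_value_provider_argument("--greeting", type=str, default="hello")

class AddGreeting(beam.DoFn):
    def __init__(self, greeting):
        self.greeting = greeting  # a ValueProvider, resolved lazily

    def process(self, element):
        yield f"{self.greeting.get()} {element}"

pipeline_options = PipelineOptions()
my_options = pipeline_options.view_as(MyOptions)

with beam.Pipeline(options=pipeline_options) as p:
    (p
     | beam.Create(["world"])
     | beam.ParDo(AddGreeting(my_options.greeting)))
```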
https://cloud.google.com/dataflow/docs/guides/templates/creating-templates
https://cloud.google.com/dataflow/docs/guides/templates/running-templates
https://cloud.google.com/blog/products/data-analytics/dataflow-templates-for-elastic-cloud
A UDF is a JavaScript snippet that implements simple element-processing logic and is provided as an input parameter to the Dataflow pipeline. The UDF JavaScript code runs on the Nashorn JavaScript engine included in the Dataflow worker's Java runtime (applicable to Java pipelines such as the Google-provided Dataflow templates). The code is invoked locally by a Dataflow worker for each element separately; element payloads are serialized and passed back and forth as JSON strings.
Dataflow Flex Templates package the pipeline and its dependencies as a Docker container image; the execution graph is built when the template is launched rather than when it is created.
https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates#python
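A sketch of what a Flex Template entry point might look like in Python, assuming the script is packaged into a container image per the guide above. Because the graph is constructed at launch time inside the container, ordinary argparse/pipeline options can be used instead of ValueProviders. The parameter names, subscription, and table are placeholders.

```python
import argparse
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_subscription", required=True)  # placeholder parameter
    parser.add_argument("--output_table", required=True)        # e.g. project:dataset.table
    known_args, pipeline_args = parser.parse_known_args(argv)

    options = PipelineOptions(pipeline_args, streaming=True)
    with beam.Pipeline(options=options) as p:
        (p
         | beam.io.ReadFromPubSub(subscription=known_args.input_subscription)
         | beam.Map(lambda msg: {"payload": msg.decode("utf-8")})
         | beam.io.WriteToBigQuery(known_args.output_table, schema="payload:STRING"))

if __name__ == "__main__":
    run()
```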
https://cloud.google.com/dataflow/docs/guides/gae-mapreduce-migration
ETL from a relational database into BigQuery using Dataflow (see the sketch below)
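A loose sketch of such an ETL with the Python SDK, assuming the cross-language JDBC connector (`apache_beam.io.jdbc.ReadFromJdbc`) can reach the database; the connection string, credentials, table names, and type handling are placeholders and would need adapting to the real schema.

```python
import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | "ReadFromDB" >> ReadFromJdbc(
           table_name="orders",
           driver_class_name="org.postgresql.Driver",
           jdbc_url="jdbc:postgresql://db-host:5432/shop",  # placeholder
           username="user",                                  # placeholder
           password="secret")                                # placeholder
     | "ToDict" >> beam.Map(lambda row: row._asdict())       # rows arrive as named tuples
     | "WriteToBQ" >> beam.io.WriteToBigQuery(
           "my-project:analytics.orders",                    # placeholder
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
```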
https://cloud.google.com/architecture/processing-logs-at-scale-using-dataflow
https://medium.com/everything-full-stack/dataflow-ci-cd-with-github-actions-65765f09713f
https://medium.com/@larry_nguyen/how-to-import-data-into-google-firestore-2c31614c567c
https://lakshmanok.medium.com/how-to-do-product-mix-optimization-in-real-time-d79ac1bf1c97
https://medium.com/@inigosj/how-to-properly-play-wordle-using-dataflow-and-bigquery-825d2f4099ac
https://medium.com/@mazlum.tosun/error-handling-with-apache-beam-asgarde-with-kotlin-8b742fca120e
You cannot delete a Dataflow job; you can only stop it, either by cancelling it (stops immediately, discarding in-flight data) or by draining it (streaming jobs only: stops ingesting new input but finishes processing buffered data).
https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline