From 277b91a88ec42683055b967338e97977b6c68e38 Mon Sep 17 00:00:00 2001 From: melissa Date: Mon, 17 Apr 2017 17:13:33 -0700 Subject: [PATCH 1/4] [BEAM-1741] Update direct and Cloud Dataflow runner pages with Python info --- src/documentation/runners/dataflow.md | 75 ++++++++++++++++++++++----- src/documentation/runners/direct.md | 24 +++++++-- 2 files changed, 83 insertions(+), 16 deletions(-) diff --git a/src/documentation/runners/dataflow.md b/src/documentation/runners/dataflow.md index f2037a298cb..dc5c0b096ab 100644 --- a/src/documentation/runners/dataflow.md +++ b/src/documentation/runners/dataflow.md @@ -6,6 +6,14 @@ redirect_from: /learn/runners/dataflow/ --- # Using the Google Cloud Dataflow Runner + + The Google Cloud Dataflow Runner uses the [Cloud Dataflow managed service](https://cloud.google.com/dataflow/service/dataflow-service-desc). When you run your pipeline with the Cloud Dataflow service, the runner uploads your executable code and dependencies to a Google Cloud Storage bucket and creates a Cloud Dataflow job, which executes your pipeline on managed resources in Google Cloud Platform. The Cloud Dataflow Runner and service are suitable for large scale, continuous jobs, and provide: @@ -40,8 +48,7 @@ For more information, see the *Before you begin* section of the [Cloud Dataflow ### Specify your dependency -You must specify your dependency on the Cloud Dataflow Runner. - +When using Java, you must specify your dependency on the Cloud Dataflow Runner in your `pom.xml`. ```java org.apache.beam @@ -51,6 +58,8 @@ You must specify your dependency on the Cloud Dataflow Runner. ``` +When using Python, you do not need to specify your dependency on the Cloud Dataflow Runner. + ### Authentication Before running your pipeline, you must authenticate with the Google Cloud Platform. Run the following command to get [Application Default Credentials](https://developers.google.com/identity/protocols/application-default-credentials). 
@@ -61,7 +70,8 @@ gcloud auth application-default login ## Pipeline options for the Cloud Dataflow Runner -When executing your pipeline with the Cloud Dataflow Runner, set these pipeline options. +When executing your pipeline with the Cloud Dataflow Runner (Java), consider these common pipeline options. +When executing your pipeline with the Cloud Dataflow Runner (Python), consider these common pipeline options. @@ -69,36 +79,74 @@ When executing your pipeline with the Cloud Dataflow Runner, set these pipeline + - + + - + + + + - - + + - + + + + - + - + + + + + + + + + + + + + + + + +
Description Default Value
runner The pipeline runner to use. This option allows you to determine the pipeline runner at runtime. Set to dataflow to run on the Cloud Dataflow Service. Set to dataflow or DataflowRunner to run on the Cloud Dataflow Service.
project The project ID for your Google Cloud Project. If not set, defaults to the default project in the current environment. The default project is set via gcloud.
streaming Whether streaming mode is enabled or disabled; true if enabled. Set to true if running pipelines with unbounded PCollections. false
tempLocation Optional. Path for temporary files. If set to a valid Google Cloud Storage URL that begins with gs://, tempLocation is used as the default value for gcpTempLocation. + tempLocation + temp_location + + Optional. + Required. + Path for temporary files. Must be a valid Google Cloud Storage URL that begins with gs://. + If set, tempLocation is used as the default value for gcpTempLocation. + No default value.
gcpTempLocation Cloud Storage bucket path for temporary files. Must be a valid Cloud Storage URL that begins with gs://. If not set, defaults to the value of tempLocation, provided that tempLocation is a valid Cloud Storage URL. If tempLocation is not a valid Cloud Storage URL, you must set gcpTempLocation.
stagingLocation + stagingLocation + staging_location + Optional. Cloud Storage bucket path for staging your binary and any temporary files. Must be a valid Cloud Storage URL that begins with gs://. If not set, defaults to a staging directory within gcpTempLocation. + If not set, defaults to a staging directory within gcpTempLocation. + If not set, defaults to a staging directory within temp_location. +
save_main_session Save the main session state so that pickled functions and classes defined in __main__ (e.g. interactive session) can be unpickled. Some workflows do not need the session state if, for instance, all of their functions/classes are defined in proper modules (not __main__) and the modules are importable in the worker. false
sdk_location Override the default GitHub location from which the Cloud Dataflow SDK is downloaded. This value can be a URL, a Cloud Storage path, or a local path to an SDK tarball. Workflow submissions will download or copy the SDK tarball from this location. If set to the string default, a standard SDK location is used. If empty, no SDK is copied. default
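To make the options in the table above concrete, here is a minimal, SDK-free sketch of the flags a Python pipeline might be launched with. The project ID and bucket names are hypothetical placeholders, and in a real pipeline the parsing is done by Beam's `PipelineOptions` rather than `argparse`:

```python
import argparse

# Hypothetical flags for a Cloud Dataflow run; the project ID and
# bucket name below are made-up placeholders.
argv = [
    "--runner", "DataflowRunner",
    "--project", "my-project-id",
    "--temp_location", "gs://my-bucket/temp",
    "--staging_location", "gs://my-bucket/staging",
]

parser = argparse.ArgumentParser()
for flag in ("--runner", "--project", "--temp_location", "--staging_location"):
    parser.add_argument(flag)
opts = parser.parse_args(argv)

# As the table notes, temp_location must be a Cloud Storage path.
assert opts.temp_location.startswith("gs://")
print(opts.runner)  # DataflowRunner
```

With the Beam SDK for Python, the same `argv` list would typically be handed straight to `PipelineOptions`, which recognizes these flags without any per-flag setup.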
See the reference documentation for the [DataflowPipelineOptions]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/runners/dataflow/options/DataflowPipelineOptions.html)[PipelineOptions](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/pipeline_options.py) interface (and its subinterfaces) for the complete list of pipeline configuration options. @@ -111,8 +159,11 @@ While your pipeline executes, you can monitor the job's progress, view details o ### Blocking Execution -To connect to your job and block until it is completed, call `waitToFinish` on the `PipelineResult` returned from `pipeline.run()`. The Cloud Dataflow Runner prints job status updates and console messages while it waits. While the result is connected to the active job, note that pressing **Ctrl+C** from the command line does not cancel your job. To cancel the job, you can use the [Dataflow Monitoring Interface](https://cloud.google.com/dataflow/pipelines/dataflow-monitoring-intf) or the [Dataflow Command-line Interface](https://cloud.google.com/dataflow/pipelines/dataflow-command-line-intf). +To connect to your job and block until it is completed, call `waitUntilFinish` (Java) or `wait_until_finish` (Python) on the `PipelineResult` returned from `pipeline.run()`. The Cloud Dataflow Runner prints job status updates and console messages while it waits. While the result is connected to the active job, note that pressing **Ctrl+C** from the command line does not cancel your job. To cancel the job, you can use the [Dataflow Monitoring Interface](https://cloud.google.com/dataflow/pipelines/dataflow-monitoring-intf) or the [Dataflow Command-line Interface](https://cloud.google.com/dataflow/pipelines/dataflow-command-line-intf). + ### Streaming Execution -If your pipeline uses an unbounded data source or sink, you must set the `streaming` option to `true`. +If your pipeline uses an unbounded data source or sink, you must set the `streaming` option to `true`. 
+The Beam SDK for Python does not currently support streaming pipelines. + diff --git a/src/documentation/runners/direct.md b/src/documentation/runners/direct.md index babe4cb906b..cb45cd1a2ce 100644 --- a/src/documentation/runners/direct.md +++ b/src/documentation/runners/direct.md @@ -6,6 +6,14 @@ redirect_from: /learn/runners/direct/ --- # Using the Direct Runner + + The Direct Runner executes pipelines on your machine and is designed to validate that pipelines adhere to the Apache Beam model as closely as possible. Instead of focusing on efficient pipeline execution, the Direct Runner performs additional checks to ensure that users do not rely on semantics that are not guaranteed by the model. Some of these checks include: * enforcing immutability of elements @@ -16,14 +24,20 @@ The Direct Runner executes pipelines on your machine and is designed to validate Using the Direct Runner for testing and development helps ensure that pipelines are robust across different Beam runners. In addition, debugging failed runs can be a non-trivial task when a pipeline executes on a remote cluster. Instead, it is often faster and simpler to perform local unit testing on your pipeline code. Unit testing your pipeline locally also allows you to use your preferred local debugging tools. Here are some resources with information about how to test your pipelines. -* [Testing Unbounded Pipelines in Apache Beam]({{ site.baseurl }}/blog/2016/10/20/test-stream.html) talks about the use of Java classes [`PAssert`]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/testing/PAssert.html) and [`TestStream`]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/testing/TestStream.html) to test your pipelines. 
-* The [Apache Beam WordCount Example]({{ site.baseurl }}/get-started/wordcount-example/) contains an example of logging and testing a pipeline with [`PAssert`]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/testing/PAssert.html). + ## Direct Runner prerequisites and setup -You must specify your dependency on the Direct Runner. +### Specify your dependency +When using Java, you must specify your dependency on the Direct Runner in your `pom.xml`. ```java org.apache.beam @@ -33,9 +47,11 @@ You must specify your dependency on the Direct Runner. ``` +When using Python, you do not need to specify your dependency on the Direct Runner. + ## Pipeline options for the Direct Runner -When executing your pipeline from the command-line, set `runner` to `direct`. The default values for the other pipeline options are generally sufficient. +When executing your pipeline from the command-line, set `runner` to `direct` or `DirectRunner`. The default values for the other pipeline options are generally sufficient. See the reference documentation for the [`DirectOptions`]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/runners/direct/DirectOptions.html)[`PipelineOptions`](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/pipeline_options.py) interface (and its subinterfaces) for defaults and the complete list of pipeline configuration options. 
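The `PAssert`/`assert_that` style of pipeline test referenced in the testing resources above can be mimicked without the SDK. In this sketch the helper names are borrowed from Beam's testing utilities, but the implementations are invented stand-ins that simply check a collection's final contents regardless of element order:

```python
# Toy stand-ins for Beam's assert_that/equal_to test helpers: verify a
# collection's contents while ignoring element order, as Beam's matcher does.
def equal_to(expected):
    def check(actual):
        assert sorted(actual) == sorted(expected), (actual, expected)
    return check

def assert_that(pcoll, matcher):
    matcher(pcoll)

output = [3, 1, 2]  # stand-in for a pipeline's output collection
assert_that(output, equal_to([1, 2, 3]))  # passes: same elements, any order
print("ok")
```

The real helpers live in the Beam Python SDK's testing utilities and operate on PCollections inside a pipeline, but the matching semantics are the same unordered comparison.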
From 3e13e291216f7c3b9f4a44b0bd900271abbde7cf Mon Sep 17 00:00:00 2001 From: melissa Date: Tue, 18 Apr 2017 11:04:29 -0700 Subject: [PATCH 2/4] Update with review feedback --- src/documentation/runners/dataflow.md | 12 +++++++----- src/documentation/runners/direct.md | 7 +++++-- 2 files changed, 12 insertions(+), 7 deletions(-) diff --git a/src/documentation/runners/dataflow.md b/src/documentation/runners/dataflow.md index dc5c0b096ab..f507311fcfc 100644 --- a/src/documentation/runners/dataflow.md +++ b/src/documentation/runners/dataflow.md @@ -58,7 +58,7 @@ For more information, see the *Before you begin* section of the [Cloud Dataflow ``` -When using Python, you do not need to specify your dependency on the Cloud Dataflow Runner. +This section is not applicable to the Beam SDK for Python. ### Authentication @@ -142,14 +142,17 @@ gcloud auth application-default login sdk_location - Override the default GitHub location from which the Cloud Dataflow SDK is downloaded. This value can be a URL, a Cloud Storage path, or a local path to an SDK tarball. Workflow submissions will download or copy the SDK tarball from this location. If set to the string default, a standard SDK location is used. If empty, no SDK is copied. + Override the default location from which the Cloud Dataflow SDK is downloaded. This value can be a URL, a Cloud Storage path, or a local path to an SDK tarball. Workflow submissions will download or copy the SDK tarball from this location. If set to the string default, a standard SDK location is used. If empty, no SDK is copied. default -See the reference documentation for the [DataflowPipelineOptions]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/runners/dataflow/options/DataflowPipelineOptions.html)[PipelineOptions](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/pipeline_options.py) interface (and its subinterfaces) for the complete list of pipeline configuration options. 
+See the reference documentation for the +[DataflowPipelineOptions]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/runners/dataflow/options/DataflowPipelineOptions.html) +[`PipelineOptions`]({{ site.baseurl }}/documentation/sdks/pydoc/{{ site.release_latest }}/apache_beam.utils.html#apache_beam.utils.pipeline_options.PipelineOptions) +interface (and any subinterfaces) for additional pipeline configuration options. ## Additional information and caveats @@ -159,8 +162,7 @@ While your pipeline executes, you can monitor the job's progress, view details o ### Blocking Execution -To connect to your job and block until it is completed, call `waitUntilFinish` (Java) or `wait_until_finish` (Python) on the `PipelineResult` returned from `pipeline.run()`. The Cloud Dataflow Runner prints job status updates and console messages while it waits. While the result is connected to the active job, note that pressing **Ctrl+C** from the command line does not cancel your job. To cancel the job, you can use the [Dataflow Monitoring Interface](https://cloud.google.com/dataflow/pipelines/dataflow-monitoring-intf) or the [Dataflow Command-line Interface](https://cloud.google.com/dataflow/pipelines/dataflow-command-line-intf). - +To block until your job completes, call `waitUntilFinish` (Java) or `wait_until_finish` (Python) on the `PipelineResult` returned from `pipeline.run()`. The Cloud Dataflow Runner prints job status updates and console messages while it waits. While the result is connected to the active job, note that pressing **Ctrl+C** from the command line does not cancel your job. To cancel the job, you can use the [Dataflow Monitoring Interface](https://cloud.google.com/dataflow/pipelines/dataflow-monitoring-intf) or the [Dataflow Command-line Interface](https://cloud.google.com/dataflow/pipelines/dataflow-command-line-intf). 
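Conceptually, the blocking call described above is just a poll loop over the job's state. A toy illustration follows; `FakeJob` and its state strings are invented stand-ins, not the Dataflow API:

```python
# Toy poll loop illustrating what a blocking wait does: repeatedly check
# the job's state until it reaches a terminal value.
class FakeJob:
    """Invented stand-in for the job result returned by pipeline.run()."""
    def __init__(self, states):
        self._states = iter(states)
        self.state = "RUNNING"

    def refresh(self):
        # Advance to the next reported state; fall back to DONE when exhausted.
        self.state = next(self._states, "DONE")

def wait_until_finish(job):
    while job.state not in ("DONE", "FAILED", "CANCELLED"):
        print("job state:", job.state)  # echo status updates while waiting
        job.refresh()
    return job.state

print(wait_until_finish(FakeJob(["RUNNING", "DONE"])))  # DONE
```

The real blocking call also surfaces worker log messages and, on some runners, accepts a timeout; this sketch only captures the poll-until-terminal shape.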
### Streaming Execution diff --git a/src/documentation/runners/direct.md b/src/documentation/runners/direct.md index cb45cd1a2ce..16f15426741 100644 --- a/src/documentation/runners/direct.md +++ b/src/documentation/runners/direct.md @@ -47,13 +47,16 @@ Here are some resources with information about how to test your pipelines. ``` -When using Python, you do not need to specify your dependency on the Direct Runner. +This section is not applicable to the Beam SDK for Python. ## Pipeline options for the Direct Runner When executing your pipeline from the command-line, set `runner` to `direct` or `DirectRunner`. The default values for the other pipeline options are generally sufficient. -See the reference documentation for the [`DirectOptions`]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/runners/direct/DirectOptions.html)[`PipelineOptions`](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/pipeline_options.py) interface (and its subinterfaces) for defaults and the complete list of pipeline configuration options. +See the reference documentation for the +[`DirectOptions`]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/runners/direct/DirectOptions.html) +[`DirectOptions`]({{ site.baseurl }}/documentation/sdks/pydoc/{{ site.release_latest }}/apache_beam.utils.html#apache_beam.utils.pipeline_options.DirectOptions) +interface for defaults and additional pipeline configuration options. 
## Additional information and caveats From 3af890fbd306908955928857abef786b7dbf924b Mon Sep 17 00:00:00 2001 From: melissa Date: Tue, 18 Apr 2017 11:40:57 -0700 Subject: [PATCH 3/4] put back reference to Beam --- src/documentation/runners/dataflow.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/documentation/runners/dataflow.md b/src/documentation/runners/dataflow.md index f507311fcfc..17228207785 100644 --- a/src/documentation/runners/dataflow.md +++ b/src/documentation/runners/dataflow.md @@ -142,7 +142,7 @@ gcloud auth application-default login sdk_location - Override the default location from where the Cloud Dataflow SDK is downloaded. This value can be an URL, a Cloud Storage path, or a local path to an SDK tarball. Workflow submissions will download or copy the SDK tarball from this location. If set to the string default, a standard SDK location is used. If empty, no SDK is copied. + Override the default location from where the Beam SDK is downloaded. This value can be an URL, a Cloud Storage path, or a local path to an SDK tarball. Workflow submissions will download or copy the SDK tarball from this location. If set to the string default, a standard SDK location is used. If empty, no SDK is copied. default From 3057851b16a6bc91deec1cd2acecffd1491c4767 Mon Sep 17 00:00:00 2001 From: melissa Date: Tue, 18 Apr 2017 14:51:58 -0700 Subject: [PATCH 4/4] Update all instances of language-python to language-py --- src/documentation/runners/dataflow.md | 24 ++++++++++++------------ src/documentation/runners/direct.md | 10 +++++----- 2 files changed, 17 insertions(+), 17 deletions(-) diff --git a/src/documentation/runners/dataflow.md b/src/documentation/runners/dataflow.md index 17228207785..5eb6b53ccd8 100644 --- a/src/documentation/runners/dataflow.md +++ b/src/documentation/runners/dataflow.md @@ -10,7 +10,7 @@ redirect_from: /learn/runners/dataflow/ Adapt for:
  • Java SDK
  • -
  • Python SDK
  • +
  • Python SDK
@@ -58,7 +58,7 @@ For more information, see the *Before you begin* section of the [Cloud Dataflow ``` -This section is not applicable to the Beam SDK for Python. +This section is not applicable to the Beam SDK for Python. ### Authentication @@ -71,7 +71,7 @@ gcloud auth application-default login ## Pipeline options for the Cloud Dataflow Runner When executing your pipeline with the Cloud Dataflow Runner (Java), consider these common pipeline options. -When executing your pipeline with the Cloud Dataflow Runner (Python), consider these common pipeline options. +When executing your pipeline with the Cloud Dataflow Runner (Python), consider these common pipeline options. @@ -102,11 +102,11 @@ gcloud auth application-default login @@ -123,24 +123,24 @@ gcloud auth application-default login - + - + @@ -151,7 +151,7 @@ gcloud auth application-default login See the reference documentation for the [DataflowPipelineOptions]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/runners/dataflow/options/DataflowPipelineOptions.html) -[`PipelineOptions`]({{ site.baseurl }}/documentation/sdks/pydoc/{{ site.release_latest }}/apache_beam.utils.html#apache_beam.utils.pipeline_options.PipelineOptions) +[`PipelineOptions`]({{ site.baseurl }}/documentation/sdks/pydoc/{{ site.release_latest }}/apache_beam.utils.html#apache_beam.utils.pipeline_options.PipelineOptions) interface (and any subinterfaces) for additional pipeline configuration options. ## Additional information and caveats @@ -162,10 +162,10 @@ While your pipeline executes, you can monitor the job's progress, view details o ### Blocking Execution -To block until your job completes, call waitToFinishwait_until_finish on the `PipelineResult` returned from `pipeline.run()`. The Cloud Dataflow Runner prints job status updates and console messages while it waits. 
While the result is connected to the active job, note that pressing **Ctrl+C** from the command line does not cancel your job. To cancel the job, you can use the [Dataflow Monitoring Interface](https://cloud.google.com/dataflow/pipelines/dataflow-monitoring-intf) or the [Dataflow Command-line Interface](https://cloud.google.com/dataflow/pipelines/dataflow-command-line-intf). +To block until your job completes, call `waitUntilFinish` (Java) or `wait_until_finish` (Python) on the `PipelineResult` returned from `pipeline.run()`. The Cloud Dataflow Runner prints job status updates and console messages while it waits. While the result is connected to the active job, note that pressing **Ctrl+C** from the command line does not cancel your job. To cancel the job, you can use the [Dataflow Monitoring Interface](https://cloud.google.com/dataflow/pipelines/dataflow-monitoring-intf) or the [Dataflow Command-line Interface](https://cloud.google.com/dataflow/pipelines/dataflow-command-line-intf). ### Streaming Execution If your pipeline uses an unbounded data source or sink, you must set the `streaming` option to `true`. -The Beam SDK for Python does not currently support streaming pipelines. +The Beam SDK for Python does not currently support streaming pipelines. diff --git a/src/documentation/runners/direct.md b/src/documentation/runners/direct.md index 16f15426741..0e01c5c9673 100644 --- a/src/documentation/runners/direct.md +++ b/src/documentation/runners/direct.md @@ -10,7 +10,7 @@ redirect_from: /learn/runners/direct/ Adapt for:
  • Java SDK
  • -
  • Python SDK
  • +
  • Python SDK
@@ -30,7 +30,7 @@ Here are some resources with information about how to test your pipelines.
  • The Apache Beam WordCount Example contains an example of logging and testing a pipeline with PAssert.
  • -
  • You can use assert_that to test your pipeline. The Python WordCount Debugging Example contains an example of logging and testing with assert_that.
  • +
  • You can use assert_that to test your pipeline. The Python WordCount Debugging Example contains an example of logging and testing with assert_that.
  • ## Direct Runner prerequisites and setup @@ -47,7 +47,7 @@ Here are some resources with information about how to test your pipelines. ``` -This section is not applicable to the Beam SDK for Python. +This section is not applicable to the Beam SDK for Python. ## Pipeline options for the Direct Runner @@ -55,10 +55,10 @@ When executing your pipeline from the command-line, set `runner` to `direct` or See the reference documentation for the [`DirectOptions`]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/runners/direct/DirectOptions.html) -[`DirectOptions`]({{ site.baseurl }}/documentation/sdks/pydoc/{{ site.release_latest }}/apache_beam.utils.html#apache_beam.utils.pipeline_options.DirectOptions) +[`DirectOptions`]({{ site.baseurl }}/documentation/sdks/pydoc/{{ site.release_latest }}/apache_beam.utils.html#apache_beam.utils.pipeline_options.DirectOptions) interface for defaults and additional pipeline configuration options. ## Additional information and caveats -Local execution is limited by the memory available in your local environment. It is highly recommended that you run your pipeline with data sets small enough to fit in local memory. You can create a small in-memory data set using a [`Create`]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/transforms/Create.html)[`Create`](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/core.py) transform, or you can use a [`Read`]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/io/Read.html)[`Read`](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py) transform to work with small local or remote files. +Local execution is limited by the memory available in your local environment. It is highly recommended that you run your pipeline with data sets small enough to fit in local memory. 
You can create a small in-memory data set using a [`Create`]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/transforms/Create.html)[`Create`](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/core.py) transform, or you can use a [`Read`]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/io/Read.html)[`Read`](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py) transform to work with small local or remote files.
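The in-memory pattern described above can be sketched in plain Python. The helper names here are invented mimics that materialize plain lists rather than real PCollections; with the Beam SDK installed this would be `beam.Create(...)` followed by, say, `beam.Map(...)`:

```python
# Invented mimics of Create + Map over a small in-memory data set;
# these build plain lists, not real PCollections.
def create(elements):
    # Materialize a small in-memory "collection" from any iterable.
    return list(elements)

def map_over(collection, fn):
    # Apply fn to every element, like a Map transform.
    return [fn(x) for x in collection]

words = create(["one", "two", "three"])
lengths = map_over(words, len)
print(lengths)  # [3, 3, 5]
```

Keeping test data this small is exactly the point of the caveat above: the Direct Runner holds the whole data set in local memory.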
    tempLocation - temp_location + temp_location Optional. - Required. + Required. Path for temporary files. Must be a valid Google Cloud Storage URL that begins with gs://. If set, tempLocation is used as the default value for gcpTempLocation.
    stagingLocation - staging_location + staging_location Optional. Cloud Storage bucket path for staging your binary and any temporary files. Must be a valid Cloud Storage URL that begins with gs://. If not set, defaults to a staging directory within gcpTempLocation. - If not set, defaults to a staging directory within temp_location. + If not set, defaults to a staging directory within temp_location.
    save_main_session Save the main session state so that pickled functions and classes defined in __main__ (e.g. interactive session) can be unpickled. Some workflows do not need the session state if, for instance, all of their functions/classes are defined in proper modules (not __main__) and the modules are importable in the worker. false
sdk_location Override the default location from which the Beam SDK is downloaded. This value can be a URL, a Cloud Storage path, or a local path to an SDK tarball. Workflow submissions will download or copy the SDK tarball from this location. If set to the string default, a standard SDK location is used. If empty, no SDK is copied. default @@ -151,7 +151,7 @@ gcloud auth application-default login See the reference documentation for the [DataflowPipelineOptions]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/runners/dataflow/options/DataflowPipelineOptions.html) -[`PipelineOptions`]({{ site.baseurl }}/documentation/sdks/pydoc/{{ site.release_latest }}/apache_beam.utils.html#apache_beam.utils.pipeline_options.PipelineOptions) +[`PipelineOptions`]({{ site.baseurl }}/documentation/sdks/pydoc/{{ site.release_latest }}/apache_beam.utils.html#apache_beam.utils.pipeline_options.PipelineOptions) interface (and any subinterfaces) for additional pipeline configuration options. ## Additional information and caveats @@ -162,10 +162,10 @@ While your pipeline executes, you can monitor the job's progress, view details o ### Blocking Execution -To block until your job completes, call `waitUntilFinish` (Java) or `wait_until_finish` (Python) on the `PipelineResult` returned from `pipeline.run()`. The Cloud Dataflow Runner prints job status updates and console messages while it waits. 