[BEAM-508] Fill in the documentation/runners/dataflow portion of the website #77
Refer to this link for build results (access rights to CI server needed): Jenkins built the site at commit id c84b49e with Jekyll and staged it here. Happy reviewing. Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again. |
Beautiful! Just a few minor comments ;-)
The Google Cloud Dataflow runner uses the [Cloud Dataflow managed service](https://cloud.google.com/dataflow/service/dataflow-service-desc). When you run your pipeline with the Cloud Dataflow service, the runner uploads your executable code and dependencies to a Google Cloud Storage bucket and creates a Cloud Dataflow job, which executes your pipeline on managed resources in Google Cloud Platform.

The Cloud Dataflow runner and service is suitable for large scale continuous jobs, and provides:
is -> are?
large scale, continuous (add comma)?
done
* a fully managed service
* [autoscaling](https://cloud.google.com/dataflow/service/dataflow-service-desc#autoscaling) of the number of VMs throughout the lifetime of the job
VMs -> workers
done
2. Enable billing for your project.

3. Enable APIs: Cloud Dataflow, Compute Engine, Cloud Logging, Cloud Storage, Cloud Storage JSON, BigQuery, Cloud Pub/Sub, and Cloud Datastore.
BigQuery, PubSub, and Datastore are optional, I think. Perhaps you can say: You may need to enable other APIs, such as (these three), if you use them in your pipeline code.
done. also changed the logging API name to match the name that shows up in the console.
4. Install the Cloud SDK.
Google Cloud SDK.
done
5. Create a Cloud Storage bucket.
   * In the Cloud Platform Console, go to the Cloud Storage browser.
Google Cloud Platform Console -- I think we should use the full name on the first reference to the term, but not afterwards.
done
## Pipeline options for the Cloud Dataflow runner

When executing your pipeline from the command-line, set these pipeline options.
This is needed even if not executing from the command line.
done
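To illustrate the point raised in this thread, here is a minimal sketch of why the options matter regardless of how the pipeline is launched: `--name=value` flags and values set in code end up as the same set of options. `OptionsSketch` and `fromArgs` are hypothetical names written for this illustration, not the Beam SDK's own option parsing, and treating a bare `--name` as `true` is an assumption of the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; this is not the Beam SDK's option parser. */
final class OptionsSketch {

    /** Parses "--name=value" arguments into a map; a bare "--name" is stored as "true". */
    static Map<String, String> fromArgs(String... args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String arg : args) {
            if (!arg.startsWith("--")) {
                throw new IllegalArgumentException("Expected --name=value, got: " + arg);
            }
            String body = arg.substring(2);
            int eq = body.indexOf('=');
            if (eq < 0) {
                options.put(body, "true");
            } else {
                options.put(body.substring(0, eq), body.substring(eq + 1));
            }
        }
        return options;
    }

    public static void main(String[] args) {
        // Options supplied as command-line flags.
        Map<String, String> fromCommandLine = fromArgs("--project=my-project", "--streaming");

        // The same options set directly in code.
        Map<String, String> setInCode = new LinkedHashMap<>();
        setInCode.put("project", "my-project");
        setInCode.put("streaming", "true");

        System.out.println(fromCommandLine.equals(setInCode)); // prints "true"
    }
}
```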
<tr>
<td><code>project</code></td>
<td>The project ID for your Google Cloud Project.</td>
<td>If not set, defaults to the default project of the current user.</td>
I think there's no such thing as a default project of the current user. I think there's a default project in the current environment, set via `gcloud`.
done
</tr>
<tr>
<td><code>streaming</code></td>
<td>Whether streaming mode is enabled or disabled; <code>true</code> if enabled.</td>
Set to true if running pipelines with unbounded PCollections?
Done. I followed the style of the programming guide, though it looks a bit strange because the code block font is so different from the non-code font. Other possible options would be "PCollection objects" or just "collections", if the visual ickiness is too much.
</tr>
<tr>
<td><code>stagingLocation</code></td>
<td>Cloud Storage bucket path for staging your binary and any temporary files. Must be a valid Cloud Storage URL that begins with <code>gs://</code>.</td>
Optional.
done
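For reference, the `gs://` requirement in the `stagingLocation` description could be checked along these lines. This is only a sketch: `StagingLocationCheck` and `looksLikeGcsPath` are hypothetical names, not part of the Beam SDK. It tests only the shape of the path (the prefix and a non-empty bucket name) and makes no claim about whether the bucket exists or how the runner itself validates the option.

```java
/** Hypothetical helper, not part of the Beam SDK: checks the shape of a staging path. */
final class StagingLocationCheck {

    private static final String GCS_PREFIX = "gs://";

    /** Returns true if the path starts with "gs://" and names a non-empty bucket. */
    static boolean looksLikeGcsPath(String path) {
        if (path == null || !path.startsWith(GCS_PREFIX)) {
            return false;
        }
        String rest = path.substring(GCS_PREFIX.length());
        int slash = rest.indexOf('/');
        // The bucket is everything up to the first slash, or the whole remainder.
        String bucket = slash < 0 ? rest : rest.substring(0, slash);
        return !bucket.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeGcsPath("gs://my-bucket/staging")); // prints "true"
        System.out.println(looksLikeGcsPath("/tmp/staging"));           // prints "false"
    }
}
```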
### Blocking Execution

To connect to your job and block until it is completed, call `waitToFinish` on the `PipelineResult` returned from `pipeline.run()`. The Cloud Dataflow runner prints job status updates and console messages while it waits. While the result is connected to the active job, note that typing **Ctrl+C** from the command line does not cancel your job. To cancel the job, you can use the [Dataflow Monitoring Interface](https://cloud.google.com/dataflow/pipelines/dataflow-monitoring-intf) or the [Dataflow Command-line Interface](https://cloud.google.com/dataflow/pipelines/dataflow-command-line-intf).
typing -> pressing
done
LGTM
<tr>
<td><code>tempLocation</code></td>
<td>Optional. Path for temporary files. If set to a valid Google Cloud Storage URL that begins with <code>gs://</code>, <code>tempLocation</code> is used as the default value for <code>gcpTempLocation</code>.</td>
<td>No default value</td>
. (dot) at the end of the sentence.
done
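The defaulting rule in the `tempLocation` row can be read as the following sketch. `TempLocationDefaults` and `resolveGcpTempLocation` are hypothetical names used only for illustration, and the assumption that an explicitly set `gcpTempLocation` takes precedence is inferred from the word "default" in the description rather than taken from the SDK source.

```java
import java.util.Optional;

/** Hypothetical illustration of the described defaulting rule; not Beam SDK code. */
final class TempLocationDefaults {

    /**
     * Resolves gcpTempLocation: an explicit value wins (assumed), otherwise
     * tempLocation is used if it begins with "gs://", otherwise there is no value.
     */
    static Optional<String> resolveGcpTempLocation(String gcpTempLocation, String tempLocation) {
        if (gcpTempLocation != null) {
            return Optional.of(gcpTempLocation);
        }
        if (tempLocation != null && tempLocation.startsWith("gs://")) {
            return Optional.of(tempLocation);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // tempLocation is a Cloud Storage URL, so it becomes the default.
        System.out.println(resolveGcpTempLocation(null, "gs://my-bucket/tmp"));
        // A local tempLocation does not qualify, so nothing is defaulted.
        System.out.println(resolveGcpTempLocation(null, "/tmp/beam"));
    }
}
```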
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
<version>0.3.0-incubating</version>
can you use jekyll variable here?
(sorry for forgetting that on a previous one. Update too?)
Added latest version variable. Will also update direct runner with this.
Refer to this link for build results (access rights to CI server needed): Jenkins built the site at commit id 664eded with Jekyll and staged it here. Happy reviewing. Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again. |
Refer to this link for build results (access rights to CI server needed): Jenkins built the site at commit id a8dec88 with Jekyll and staged it here. Happy reviewing. Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again. |
Merging.
Refer to this link for build results (access rights to CI server needed): Jenkins built the site at commit id a19af36 with Jekyll and staged it here. Happy reviewing. Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again. |
R: @davorbonaci @francesperry