[BEAM-2360] Add Beam Runner Guide #251
|
Refer to this link for build results (access rights to CI server needed): Jenkins built the site at commit id b0095ec with Jekyll and staged it here. Happy reviewing. Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again. |
|
Solved the code toggle issue and revised a little.
40f48d1 to df61b70
|
R: @melap if/when you have time, I wonder if you might make a pass over this.
melap left a comment
Took a high-level look and left a few comments. I'll do a full edit pass after merge and open a PR with the changes.
src/contribute/runner-guide.md
Outdated
| permalink: /contribute/runner-guide/
| ---
|
| # Runner Guide
Wondering if calling it "Runner Implementation Guide" or "Runner Authoring Guide" (or something similar) would be good, just for this page title? (I know it wouldn't fit well in the top bar menus.) It took me a minute to determine what this was (runner authoring, as opposed to how to use existing runners).
Done.
You have correctly surmised that I was trying to make it fit well. But when rendered, it actually seems to fit fine with the longer, clearer title.
src/contribute/runner-guide.md
Outdated
|
| # Runner Guide
|
| This document is aimed at someone who has a data processing system and wants
On a similar note, what about an initial sentence at the start here...
This guide walks through how to implement a new runner. It is aimed at someone who has a data processing system and wants to use it to execute a Beam pipeline. The guide starts ...
Done.
Took your words exactly.
src/contribute/runner-guide.md
Outdated
| ### Pipeline
|
| A pipeline in Beam is a graph of PTransforms operating on PCollections. In the
| picture, the boxes are PTransforms and the arrows represent the contents of the
I was going to paste in our intro picture, but forgot. I don't think it adds much. Elsewhere pictures might be useful, but I don't imagine that a runner author needs to be reminded that a pipeline is a graph of transforms.
src/contribute/runner-guide.md
Outdated
|
| Every element in a PCollection has a timestamp associated with it.
|
| Sources of data read elements from the outside world with timestamps provided.
I struggled with this first sentence. Is this still accurate?
I/O transforms read data elements with timestamps from the outside world.
Your phrasing is appropriate for pipeline authors, I think. For a runner author, what they need to know is that the Read.from(Source) primitive is responsible for figuring out the timestamps, and the runner's job is to propagate them. This section exists just to cement the idea that there is no such thing as data without a timestamp.
I've rephrased this now. What do you think?
src/contribute/runner-guide.md
Outdated
| #### The DoFn Lifecycle
|
| While each language's SDK is free to make different decisions, the Python and
| Java SDK's share an API with the following stages of a DoFn's lifecycle.
| graph of `PTransforms`.
|
| The entry point for this in Java is
| [`Pipeline.traverseTopologically`](https://beam.apache.org/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/Pipeline.html#traverseTopologically-org.apache.beam.sdk.Pipeline.PipelineVisitor-)
General comment for all the javadoc/pydoc links: should these use the latest-version variable instead of 2.0.0?
That seems reasonable...
I just tried it. It seems that we redirect https://beam.apache.org/documentation/sdks/javadoc/ to https://beam.apache.org/documentation/sdks/javadoc/2.0.0 at the root, but the redirect doesn't work for deeper links.
I also tried "latest", which is a pretty common convention, but that is also a 404.
So if we set something like that up, I would definitely use it.
src/contribute/runner-guide.md
Outdated
|
| ### Traversing a pipeline
|
| Something you will certain do is to traverse a pipeline, probably to
src/contribute/runner-guide.md
Outdated
| ### The Runner API
|
| The Runner API is an SDK-independent schema for a pipeline along with RPC
| interfaces (TBD) for launching a pipeline and checking the status of a job.
The "RPC" bit has not been implemented. I rephrased a little bit. I could also just drop that part or put it elsewhere.
src/contribute/runner-guide.md
Outdated
| API](https://github.com/apache/beam/blob/master/sdks/common/runner-api/src/main/proto/beam_runner_api.proto)
| refers to a specific manifestation of the concepts in the Beam model, as a
| protocol buffers schema. Even though you should not manipulate these messages
| direclty, it can be helpful to know the canonical data that makes up a
kennknowles left a comment
Thanks for the review, @melap!
|
R: @davorbonaci I think that now that we have had a pass over this, we should get it up and improve it as things change and as feedback comes in from runner authors.
|
LGTM. Please merge! This is exciting!
|
retest this please
Add Beam Runner Guide Add support for "no-toggle" code snippets
I did a quick port of the doc at https://s.apache.org/beam-runner-guide.
I wanted to open this early, because there are technical issues to address:
Beyond the draft doc, I've added a ton of links to API docs and code. There may still be comments on the GDoc and I'll incorporate them here.