Doc updates for Python streaming #410
| It is important to note that if, for example, you specify | ||
| <span class="language-java">`.elementCountAtLeast(50)`</span> | ||
| <span class="language-py">a count of 50</span> and only 32 elements arrive, | ||
| those 32 elements sit around forever. If the 32 elements are important to you, |
Use `AfterCount(50)` instead of "a count of 50".
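To make the behavior in the quoted paragraph concrete, here is a minimal plain-Python sketch. It is not Beam code: `CountTrigger` is a hypothetical stand-in for a count-based trigger such as `AfterCount(50)`, and it only models the buffering rule, assuming a single window and no other trigger.

```python
class CountTrigger:
    """Hypothetical stand-in for a count-based trigger; this is not the Beam API."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.buffer = []

    def add(self, element):
        """Buffer an element; return the fired pane once the threshold is reached."""
        self.buffer.append(element)
        if len(self.buffer) >= self.threshold:
            pane, self.buffer = self.buffer, []
            return pane
        return None


trigger = CountTrigger(threshold=50)

# Only 32 elements arrive, so the trigger never fires.
fired = [trigger.add(i) for i in range(32)]
panes = [pane for pane in fired if pane is not None]
pending = len(trigger.buffer)  # elements still waiting for the trigger
```

In this sketch `panes` stays empty and all 32 elements remain buffered, which is the situation the paragraph warns about.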
| trigger for a `PCollection`, which emits results one minute after the first | ||
| element in that window has been processed. The `accumulation_mode` parameter | ||
| sets the window's **accumulation mode**. | ||
|
|
This is not in the code sample, unlike the Java snippet.
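As a possible aid for the missing sample, here is a plain-Python sketch of what the two accumulation modes mean. It is not Beam code: `emit_panes` is a hypothetical helper, and the element values are made up for illustration. In the Python SDK the corresponding settings are `AccumulationMode.ACCUMULATING` and `AccumulationMode.DISCARDING`.

```python
def emit_panes(firings, accumulating):
    """Return the pane contents emitted at each trigger firing.

    firings: list of lists; the new elements that arrived before each firing.
    accumulating: True keeps earlier elements in later panes,
                  False emits only the elements new since the last firing.
    """
    panes, seen = [], []
    for new_elements in firings:
        if accumulating:
            seen = seen + new_elements
            panes.append(list(seen))
        else:
            panes.append(list(new_elements))
    return panes


# Illustrative values: three firings within one window.
firings = [[5, 8, 3], [15, 19], [23]]
accumulating_panes = emit_panes(firings, accumulating=True)
discarding_panes = emit_panes(firings, accumulating=False)
```

With accumulation, each pane repeats everything seen so far in the window; with discarding, each pane holds only the elements that arrived since the previous firing.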
| ``` | ||
| ```py | ||
| # The Beam SDK for Python does not support triggers. | ||
| pcollection | WindowInto(AfterWatermark(late=AfterProcessingTime(10 * 60))) |
This example does not match the Java example. I believe the example is about mixing triggers with allowed lateness, which is not supported in Python, as noted above. So perhaps we can skip this example for Python.
I think the section is more about AfterWatermark, so I'd like to leave something there for Python, but I changed the comment about the 2-day allowed lateness so that it is shown only when the language is set to Java.
| # Python Streaming Pipelines | ||
|
|
||
| Apache Beam SDK for Python supports streaming pipeline execution as of version | ||
| TBD. Currently, two Beam runners support Python streaming execution: |
These two Beam runners are the only two Beam Python runners anyway. I do not know if this is worth mentioning.
|
|
||
| # Python Streaming Pipelines | ||
|
|
||
| Apache Beam SDK for Python supports streaming pipeline execution as of version |
DirectRunner can do this since 2.1.0 and DataflowRunner will start in 2.5.0. Can we rephrase this to say that it is experimentally available, with some limitations, starting from 2.5.0?
| ## Why use streaming execution? | ||
|
|
||
| Beam creates an unbounded PCollection if your pipeline reads from a streaming or | ||
| continuously-updating data source (such as Cloud Pub/Sub or Kafka). A runner must |
Kafka is not supported, I will drop it from examples.
|
|
||
| ``` | ||
| ... | ||
| lines = p | beam.io.ReadStringsFromPubSub(topic=known_args.input_topic) |
ReadStringsFromPubSub is deprecated. (But it is fine to keep it in the doc for now.)
@udim Should we update the example?
| output | beam.io.WriteStringsToPubSub(known_args.output_topic) | ||
| ``` | ||
|
|
||
| ## Running a streaming pipeline |
Should we explain how to create input/output pubsub topics and how to publish messages to those? (These are not related to Beam.)
|
|
||
| ## Python Type Safety | ||
| ## Python streaming pipelines | ||
|
|
src/get-started/wordcount-example.md
Outdated
|
|
||
| ### Reading an unbounded data set | ||
|
|
||
| This example uses an unbounded data set as input. The code reads Pub/Sub |
|
In https://github.com/apache/beam-site/blob/asf-site/src/get-started/quickstart-py.md |
|
@mariapython installing packages using --user puts them (by default) under |
|
@aaltay: Where do the version requirements come from? For example, why do we need pip >= 7.0.0? |
|
File https://github.com/apache/beam-site/blob/asf-site/src/get-started/quickstart-py.md, section "Install pip":
|
|
@mariapython We have version requirements for pip and cython. Both are for fairly old versions. Typically there is an incompatibility with versions older than the ones we require. I do not think we kept track of what the issues were with even older versions. |
|
@udim: Is that a problem? What are the consequences of installing at |
|
File https://github.com/apache/beam-site/blob/asf-site/src/get-started/quickstart-py.md, section "Execute a pipeline locally":
|
|
File https://github.com/apache/beam-site/blob/asf-site/src/get-started/quickstart-py.md, section Download and install --> Extra Requirements:
|
|
5/1/2018 UPDATE: A decision needs to be made, so I will not make any more updates for now. You can merge as is.
|
|
@mariapython Installing packages under |
src/get-started/wordcount-example.md
Outdated
| messages from a Pub/Sub subscription or topic, and performs a frequency count on | ||
| the words in each message. Similar to WindowedWordCount, this example applies | ||
| fixed-time windowing, wherein each window represents a fixed time interval. The | ||
| fixed window size for this example is 15 minutes. The pipeline outputs the |
It should read "15 seconds"
(source at https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/streaming_wordcount.py#L77)
src/get-started/wordcount-example.md
Outdated
| the words in each message. Similar to WindowedWordCount, this example applies | ||
| fixed-time windowing, wherein each window represents a fixed time interval. The | ||
| fixed window size for this example is 15 minutes. The pipeline outputs the | ||
| frequency count of the words seen in each 15 minute window. |
Likewise, it should read "15 seconds."
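For reference, here is a plain-Python sketch of what fixed-time windowing with a 15-second size does to a word count. It is not the `streaming_wordcount.py` code and does not use Beam; `window_start`, `count_words_per_window`, and the sample messages are hypothetical, and timestamps are assumed to be whole seconds.

```python
from collections import Counter, defaultdict

WINDOW_SIZE_SECONDS = 15  # window size, per the review comments above


def window_start(timestamp, size=WINDOW_SIZE_SECONDS):
    """Return the start of the fixed window that contains the timestamp."""
    return timestamp - (timestamp % size)


def count_words_per_window(messages, size=WINDOW_SIZE_SECONDS):
    """Count words separately for each fixed window.

    messages: iterable of (timestamp_in_seconds, text) pairs.
    Returns a dict mapping window start to a dict of word counts.
    """
    counts = defaultdict(Counter)
    for timestamp, text in messages:
        counts[window_start(timestamp, size)].update(text.split())
    return {start: dict(words) for start, words in counts.items()}


# Hypothetical messages with made-up timestamps.
messages = [(0, "to be"), (14, "or not"), (15, "to be"), (31, "be")]
result = count_words_per_window(messages)
```

Messages at seconds 0 and 14 fall into the window starting at 0, the message at second 15 starts a new window, and the message at second 31 lands in the window starting at 30, so each word is counted only within its own 15-second interval.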
|
retest this please |
|
@melap what is the status of this PR? Do you need my help? |
|
Made a bunch of updates, PTAL. After it looks good, I'll squash. |
|
@asfgit merge |
In-progress draft for initial review. Still needs snippet URLs.