Doc updates for Python streaming #410

Merged

asfgit merged 1 commit into apache:asf-site from melap:streaming

May 7, 2018

melap commented Apr 3, 2018

In-progress draft for initial review. Still needs snippet URLs

melap assigned aaltay

melap requested a review from aaltay

April 3, 2018 21:40

Author

melap commented Apr 3, 2018

Staged: http://apache-beam-website-pull-requests.storage.googleapis.com/410/index.html

aaltay reviewed


src/documentation/programming-guide.md

+              It is important to note that if, for example, you specify
+              <span class="language-java">`.elementCountAtLeast(50)`</span>
+              <span class="language-py">a count of 50</span> and only 32 elements arrive,
+              those 32 elements sit around forever. If the 32 elements are important to you,

Member

aaltay Apr 9, 2018

AfterCount(50) instead of "a count of 50"
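
To illustrate why a count-based trigger such as AfterCount(50) can leave elements waiting, here is a minimal plain-Python simulation. It does not use the Beam API; `CountTrigger` and its attributes are hypothetical names invented for this sketch.

```python
class CountTrigger:
    """Toy stand-in for a count-based trigger (hypothetical, not the Beam API)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.buffer = []   # elements waiting for the trigger to fire
        self.panes = []    # emitted results, one list per firing

    def add(self, element):
        self.buffer.append(element)
        # Fire only once the buffer holds at least `threshold` elements.
        if len(self.buffer) >= self.threshold:
            self.panes.append(self.buffer)
            self.buffer = []


trigger = CountTrigger(threshold=50)
for i in range(32):
    trigger.add(i)

print(len(trigger.panes))   # number of panes that have fired
print(len(trigger.buffer))  # number of elements still buffered
```

With only 32 of the required 50 elements, no pane fires and every element stays in the buffer, which is the situation the diff text warns about.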

src/documentation/programming-guide.md

+              trigger for a `PCollection`, which emits results one minute after the first
+              element in that window has been processed. The `accumulation_mode` parameter
+              sets the window's **accumulation mode**.

Member

aaltay Apr 9, 2018

This is not in the code sample, unlike the Java snippet.
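
As a rough illustration of what an accumulation mode controls, the sketch below contrasts the two behaviours in plain Python. It is not Beam code: `emit_panes` and its arguments are hypothetical, and the element values are made up for the example.

```python
def emit_panes(firings, accumulating):
    """Return what each trigger firing emits for one window.

    `firings` is a list of lists: the new elements that arrived before each
    firing. This is a plain-Python illustration, not the Beam API.
    """
    emitted = []
    seen = []
    for new_elements in firings:
        if accumulating:
            # Accumulating mode: each pane repeats everything seen so far.
            seen = seen + new_elements
            emitted.append(list(seen))
        else:
            # Discarding mode: each pane holds only the new elements.
            emitted.append(list(new_elements))
    return emitted


firings = [[5, 8, 3], [15, 19, 23]]
accumulating_panes = emit_panes(firings, accumulating=True)
discarding_panes = emit_panes(firings, accumulating=False)
print(accumulating_panes)  # [[5, 8, 3], [5, 8, 3, 15, 19, 23]]
print(discarding_panes)    # [[5, 8, 3], [15, 19, 23]]
```

A Python snippet that actually passes `accumulation_mode` to the windowing transform would make the prose and the code sample agree, which is the gap this comment points out.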

src/documentation/programming-guide.md Outdated

               ```
               ```py
-                # The Beam SDK for Python does not support triggers.
+              pcollection | WindowInto(AfterWatermark(late=AfterProcessingTime(10 * 60)))

Member

aaltay Apr 9, 2018

This example does not match the Java example. I believe the example is about mixing triggers with allowed lateness and that is not supported in python as noted above. So perhaps, we can skip this example for python.

Author

melap May 2, 2018

I think the section is more about AfterWatermark, so I'd like to leave something there for Python, but I set the comment about the 2-day allowed lateness to be shown only when the language is set to Java.

src/documentation/sdks/python-streaming.md Outdated

+              # Python Streaming Pipelines
+              Apache Beam SDK for Python supports streaming pipeline execution as of version
+              TBD. Currently, two Beam runners support Python streaming execution:

Member

aaltay Apr 10, 2018

These two runners are the only two Beam Python runners anyway. I do not know if this is worth mentioning.

src/documentation/sdks/python-streaming.md Outdated


		# Python Streaming Pipelines

		Apache Beam SDK for Python supports streaming pipeline execution as of version

Member

aaltay Apr 10, 2018

DirectRunner can do this since 2.1.0 and DataflowRunner will start in 2.5.0. Can we try to rephrase this as "experimentally available, with some limitations, starting from 2.5.0"?

src/documentation/sdks/python-streaming.md Outdated

+              ## Why use streaming execution?
+              Beam creates an unbounded PCollection if your pipeline reads from a streaming or
+              continously-updating data source (such as Cloud Pub/Sub or Kafka). A runner must

Member

aaltay Apr 10, 2018

Kafka is not supported, I will drop it from examples.

src/documentation/sdks/python-streaming.md Outdated

+              ```
+                ...
+                lines = p | beam.io.ReadStringsFromPubSub(topic=known_args.input_topic)

Member

aaltay Apr 10, 2018

ReadStringsFromPubSub is deprecated. (But it is fine to keep it in the doc for now.)

@udim Should we update the example?

Member

udim Apr 10, 2018

Yes, using ReadFromPubSub.

src/documentation/sdks/python-streaming.md

+                output | beam.io.WriteStringsToPubSub(known_args.output_topic)
+              ```
+              ## Running a streaming pipeline

Member

aaltay Apr 10, 2018

Should we explain how to create input/output pubsub topics and how to publish messages to those? (These are not related to Beam.)

src/documentation/sdks/python.md


		## Python Type Safety
		## Python streaming pipelines

Member

aaltay Apr 10, 2018

TBD?

src/get-started/wordcount-example.md Outdated


		### Reading an unbounded data set

		This example uses an unbounded data set as input. THe code reads Pub/Sub

Member

aaltay Apr 10, 2018

THe -> The

mariapython commented Apr 12, 2018

In https://github.com/apache/beam-site/blob/asf-site/src/get-started/quickstart-py.md
the line pip install --upgrade virtualenv
should read pip install --upgrade --user virtualenv
to avoid messages of the kind:
OSError: [Errno 13] Permission denied: '/usr/local/bin/virtualenv'

Member

udim commented Apr 12, 2018

@mariapython installing packages using --user puts them (by default) under ~/.local/.
To install virtualenv systemwide, we could tell users to sudo pip install ....
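
For readers who want to check where `--user` installs land on their own machine, a minimal sketch using only the standard library follows. The printed paths depend on the platform and interpreter, so no particular output is assumed.

```python
import site

# Base directory that `pip install --user` targets for the current interpreter.
user_base = site.getuserbase()
# Directory under the user base from which --user packages are imported.
user_site = site.getusersitepackages()

print(user_base)
print(user_site)
```

On a typical Linux setup the first path is the `~/.local` directory mentioned above; scripts installed with `--user` usually go into a `bin` subdirectory of it.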

mariapython commented Apr 12, 2018

@aaltay: Where do the version requirements come from? For example, why do we need pip >= 7.0.0?

mariapython commented Apr 12, 2018

File https://github.com/apache/beam-site/blob/asf-site/src/get-started/quickstart-py.md, section "Install pip":

Reword similarly to the "Install Python virtual environment" section: "If you do not have... install it by running: pip install --upgrade --user pip"

Member

aaltay commented Apr 13, 2018

@mariapython We have version requirements for pip and cython. Both are for fairly old versions. Typically there is an incompatibility with versions older than the ones we require. I do not think we kept track of what the issues were with even older versions.

mariapython commented Apr 13, 2018

@udim: Is that a problem? What are the consequences of installing under ~/.local/?

mariapython commented Apr 13, 2018

File https://github.com/apache/beam-site/blob/asf-site/src/get-started/quickstart-py.md, section "Execute a pipeline locally":

It should read "Execute a pipeline"
"To run wordcount.py, run ..." sounds a bit repetitive; I would say "For example, run wordcount.py with the following command:"
Add tabs for other runners, like in beam-site/src/get-started/wordcount-example.md.
The command under "Direct" should be modified to match the convention above, that is:
python -m apache_beam.examples.wordcount --input path/to/given/input --output path/to/write/output
The command under "Dataflow" should have a comment with a link on how to authenticate. For the Python tab it should read:
# Make sure you are authenticated by following the instructions on https://cloud.google.com/dataflow/docs/quickstarts/quickstart-python.
Equivalently, the Java tab should have a comment linking both https://cloud.google.com/dataflow/docs/quickstarts/quickstart-java-maven and https://cloud.google.com/dataflow/docs/quickstarts/quickstart-java-eclipse.
(Side note: it is interesting that 1) the authentication is listed in every quickstart, instead of pointing just to the "all quickstarts", 2) the opening sentence for every one of the quickstarts is different.)
Add an explanation after the command: “After running the command, a file output-0000-of-0001 will be written to path/to/write/.”
(As a side note, I don't know why we have that notation instead of the more natural <file_name>-x-of-y with x in [1, y])
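
On the side note about shard naming, the sketch below shows one property of the zero-padded form. `shard_name` is a hypothetical helper, not a Beam function, and its `width` of 4 is chosen only to reproduce the file name quoted above. Whether this property is the actual reason for the convention is an assumption.

```python
def shard_name(prefix, index, total, width=4):
    """Build a shard file name such as 'output-0000-of-0001'.

    Hypothetical helper for illustration; `width` is set to 4 only to match
    the file name quoted in the comment above.
    """
    return f"{prefix}-{index:0{width}d}-of-{total:0{width}d}"


names = [shard_name("output", i, 12) for i in range(12)]
print(names[0])    # output-0000-of-0012
print(names[-1])   # output-0011-of-0012
# Zero padding keeps lexicographic order equal to numeric shard order.
print(sorted(names) == names)  # True
```

Without the padding, a plain directory listing would sort shard 10 before shard 2.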

mariapython commented Apr 13, 2018

File https://github.com/apache/beam-site/blob/asf-site/src/get-started/quickstart-py.md, section Download and install --> Extra Requirements:

Why should "requirements" be capitalized? ("install" is not)
pip install apache-beam[feature1, feature2] doesn't work. It should read pip install apache-beam[feature1,feature2] (no space after comma).
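
A likely reason the form with a space fails is that the shell splits the extras list into two separate arguments before pip ever sees it. The sketch below uses the standard library's `shlex` module to mimic POSIX-style word splitting; it does not invoke pip, and real shells may additionally treat the square brackets as glob characters.

```python
import shlex

# How a POSIX-style shell splits each command line into arguments.
with_space = shlex.split("pip install apache-beam[feature1, feature2]")
without_space = shlex.split("pip install apache-beam[feature1,feature2]")

print(with_space)     # ['pip', 'install', 'apache-beam[feature1,', 'feature2]']
print(without_space)  # ['pip', 'install', 'apache-beam[feature1,feature2]']
```

Quoting the whole requirement, as in `pip install 'apache-beam[feature1,feature2]'`, also keeps it as a single argument.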

mariapython commented Apr 13, 2018

5/1/2018 UPDATE: A decision needs to be made, so I will not make any more updates for now. You can merge AS IS.
(*** This section is incomplete, I will remove this note when I complete its review.)
File beam-site/src/get-started/wordcount-example.md:

Section "MinimalWordCount example":
- Since there is an "Adapt for: Python/Java" at the beginning of the website, the lines "To run this example in Java:" and "To run this example in Python:" are unnecessary, as the display should toggle them accordingly.
- In the case of "To run this example in Java:", there should be tabs for every runner, just like in Python. The command currently shown
  $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.MinimalWordCount
  is probably for the DirectRunner.
- The Python command for the DirectRunner is wrong, as it assumes there will be a README.md file where the command is executed, which won't generally be the case.
Section "WordCount example":
- To be completed
Section "DebuggingWordCount example":
- Under "Testing your pipeline via PAssert," python should read: "This feature is currently working for batch, but is under development for streaming."
Section "WindowedWordCount example":
- To be completed
Section "StreamingWordCount example":
- To be completed

mariapython commented Apr 16, 2018

5/1/2018 UPDATE: A decision needs to be made, so I will not make any more updates for now. You can merge AS IS.
(*** This section is incomplete, I will remove this note when I complete its review.)
File src/get-started/mobile-gaming-example.md:

Section "UserScore: Basic Score Processing in Batch":
- To be completed
Section "HourlyTeamScore: Advanced Processing in Batch with Windowing":
- To be completed
Section "LeaderBoard: Streaming Processing with Real-Time Game Data":
- To be completed
Section "GameStats: Abuse Detection and Usage Analysis":
- To be completed

Member

udim commented Apr 17, 2018

@mariapython Installing packages under ~/.local/ seems to make things more complicated, but not everyone can (or should) upgrade system packages (because dependencies might break). So perhaps the best way is to use --user when installing pip and virtualenv, and install the rest of the packages in a virtualenv to be consistent with our contribution and quickstart guides.

$ pip install --user --upgrade pip virtualenv
...
$ python -m pip --version
pip 10.0.0 from /home/user/.local/lib/python2.7/site-packages/pip (python 2.7)
$ $HOME/.local/bin/virtualenv --version
...

mariapython reviewed


src/get-started/wordcount-example.md Outdated

    
              messages from a Pub/Sub subscription or topic, and performs a frequency count on

              the words in each message. Similar to WindowedWordCount, this example applies

              fixed-time windowing, wherein each window represents a fixed time interval. The

              fixed window size for this example is 15 minutes. The pipeline outputs the

mariapython Apr 18, 2018

It should read "15 seconds"
(source at https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/streaming_wordcount.py#L77)
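
To make the 15-second fixed windowing concrete, here is a minimal plain-Python sketch of how a timestamp maps to its window. It is an illustration only, not Beam code; `fixed_window` is a hypothetical helper, and it assumes windows aligned to multiples of the window size.

```python
def fixed_window(timestamp, size=15):
    """Return the [start, end) fixed window, in seconds, containing `timestamp`.

    Plain-Python illustration; windows are aligned to multiples of `size`.
    """
    start = timestamp - (timestamp % size)
    return (start, start + size)


print(fixed_window(0))    # (0, 15)
print(fixed_window(14))   # (0, 15)
print(fixed_window(15))   # (15, 30)
print(fixed_window(37))   # (30, 45)
```

Each word count in the example would then be computed per such window rather than over the whole unbounded stream.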

mariapython reviewed


src/get-started/wordcount-example.md Outdated

    
              the words in each message. Similar to WindowedWordCount, this example applies

              fixed-time windowing, wherein each window represents a fixed time interval. The

              fixed window size for this example is 15 minutes. The pipeline outputs the

              frequency count of the words seen in each 15 minute window.

mariapython Apr 18, 2018

Likewise, it should read "15 seconds."

Author

melap commented Apr 24, 2018

retest this please


Member

aaltay commented Apr 27, 2018

@melap what is the status of this PR? Do you need my help?

Author

melap commented May 2, 2018

Made a bunch of updates, PTAL. After it looks good, I'll squash.

aaltay approved these changes



          Updates for Python streaming

6bff9dc

Author

melap commented May 7, 2018

asfgit pushed a commit that referenced this pull request


          This closes #410

4605bf7

asfgit merged commit 6bff9dc into apache:asf-site

robertwb pushed a commit to robertwb/incubator-beam that referenced this pull request


          This closes apache/beam-site#410

ff92f73

robertwb pushed a commit to robertwb/incubator-beam that referenced this pull request


          This closes apache/beam-site#410

1785b54

melap pushed a commit to apache/beam that referenced this pull request


          This closes apache/beam-site#410

3d7433d
