Doc updates for Python streaming #410

Merged
asfgit merged 1 commit into apache:asf-site from melap:streaming
May 7, 2018

melap commented Apr 3, 2018

In-progress draft for initial review. Still needs snippet URLs

melap requested a review from aaltay April 3, 2018 21:40
Author

melap commented Apr 3, 2018

It is important to note that if, for example, you specify
<span class="language-java">`.elementCountAtLeast(50)`</span>
<span class="language-py">a count of 50</span> and only 32 elements arrive,
those 32 elements sit around forever. If the 32 elements are important to you,
Member

AfterCount(50) instead of "a count of 50"
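
To make the behavior under discussion concrete, here is a toy model in plain Python of a count-based trigger that never fires when too few elements arrive. It is only a sketch of the idea described in the quoted doc text, not Beam code: `CountTrigger` is a hypothetical class, and it assumes the trigger fires only once the buffered element count reaches the threshold.

```python
class CountTrigger:
    """Toy stand-in for a count-based trigger; not Beam's implementation."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.buffer = []  # elements waiting for the trigger to fire
        self.panes = []   # each fired pane is a list of elements

    def add(self, element):
        self.buffer.append(element)
        if len(self.buffer) >= self.threshold:
            # Threshold reached: emit the buffered elements as one pane.
            self.panes.append(self.buffer)
            self.buffer = []


trigger = CountTrigger(threshold=50)
for i in range(32):
    trigger.add(i)

fired_panes = len(trigger.panes)  # panes emitted so far
stranded = len(trigger.buffer)    # elements still waiting in the buffer
print(fired_panes, stranded)
```

With only 32 of the 50 required elements, nothing is emitted and all 32 stay buffered, which is the situation the doc text warns about.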

trigger for a `PCollection`, which emits results one minute after the first
element in that window has been processed. The `accumulation_mode` parameter
sets the window's **accumulation mode**.

Member

This is not in the code sample, unlike the Java snippet.
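
As background for this comment, the accumulation mode decides whether a later pane repeats the elements of earlier panes or contains only the new ones. The following is a toy model in plain Python, not Beam code: `emit_panes` is a hypothetical helper and the element values are made up for illustration.

```python
def emit_panes(firings, accumulating):
    """Return the pane emitted at each trigger firing.

    firings: list of lists, the new elements that arrived before each firing.
    accumulating: if True, each pane also repeats all earlier elements.
    """
    panes, seen = [], []
    for new_elements in firings:
        if accumulating:
            # Accumulating: keep everything seen so far and emit it again.
            seen = seen + new_elements
            panes.append(list(seen))
        else:
            # Discarding: emit only what arrived since the last firing.
            panes.append(list(new_elements))
    return panes


firings = [[5, 8, 3], [15, 19, 23]]
accumulating_panes = emit_panes(firings, accumulating=True)
discarding_panes = emit_panes(firings, accumulating=False)
print(accumulating_panes)
print(discarding_panes)
```

In the accumulating case the second pane holds all six values; in the discarding case it holds only the three new ones.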

```py
# The Beam SDK for Python does not support triggers.
pcollection | WindowInto(AfterWatermark(late=AfterProcessingTime(10 * 60)))
```
Member

This example does not match the Java example. I believe the example is about mixing triggers with allowed lateness and that is not supported in python as noted above. So perhaps, we can skip this example for python.

Author

I think the section is more about AfterWatermark, so I'd like to leave something there for Python, but I set the comment about the 2-day allowed lateness to be shown only when the language is set to Java.

# Python Streaming Pipelines

Apache Beam SDK for Python supports streaming pipeline execution as of version
TBD. Currently, two Beam runners support Python streaming execution:
Member

These two Beam runners are the only two Beam Python runners anyway. I do not know if this is worth mentioning.


# Python Streaming Pipelines

Apache Beam SDK for Python supports streaming pipeline execution as of version
Member

DirectRunner can do this since 2.1.0 and DataflowRunner will start in 2.5.0. Can we rephrase this to say that it is experimentally available, with some limitations, starting from 2.5.0?

## Why use streaming execution?

Beam creates an unbounded PCollection if your pipeline reads from a streaming or
continously-updating data source (such as Cloud Pub/Sub or Kafka). A runner must
Member

Kafka is not supported, I will drop it from examples.


```
...
lines = p | beam.io.ReadStringsFromPubSub(topic=known_args.input_topic)
```

Member

ReadStringsFromPubSub is deprecated. (But it is fine to keep it in the doc for now.)

@udim Should we update the example?

Member

Yes, using ReadFromPubSub.

```
output | beam.io.WriteStringsToPubSub(known_args.output_topic)
```

## Running a streaming pipeline
Member

Should we explain how to create input/output pubsub topics and how to publish messages to those? (These are not related to Beam.)


## Python Type Safety
## Python streaming pipelines

Member

TBD?


### Reading an unbounded data set

This example uses an unbounded data set as input. THe code reads Pub/Sub
Member

THe -> The

@mariapython

In https://github.com/apache/beam-site/blob/asf-site/src/get-started/quickstart-py.md
the line pip install --upgrade virtualenv
should read pip install --upgrade --user virtualenv
to avoid messages of the kind:
OSError: [Errno 13] Permission denied: '/usr/local/bin/virtualenv'

Member

udim commented Apr 12, 2018

@mariapython installing packages using --user puts them (by default) under ~/.local/.
To install virtualenv systemwide, we could tell users to sudo pip install ....

@mariapython

@aaltay: Where do the version requirements come from? For example, why do we need pip >= 7.0.0?

mariapython commented Apr 12, 2018

File https://github.com/apache/beam-site/blob/asf-site/src/get-started/quickstart-py.md, section "Install pip":

  • Reword similarly to the "Install Python virtual environment" section: "If you do not have... install it by running: pip install --upgrade --user pip"

Member

aaltay commented Apr 13, 2018

@mariapython We have version requirements for pip and cython. Both are for fairly old versions. Typically there is an incompatibility with versions older than the ones we require. I do not think we kept track of what the issues were with even older versions.

@mariapython

@udim: Is that a problem? What are the consequences of installing at ~/.local/.?

mariapython commented Apr 13, 2018

File https://github.com/apache/beam-site/blob/asf-site/src/get-started/quickstart-py.md, section "Execute a pipeline locally":

  • It should read "Execute a pipeline"
  • "To run wordcount.py, run ..." sounds a bit repetitive; I would say "For example, run wordcount.py with the following command:"
  • Add tabs for other runners, like in beam-site/src/get-started/wordcount-example.md.
  • The command under "Direct" should be modified to match the convention above, that is:
    python -m apache_beam.examples.wordcount --input path/to/given/input --output path/to/write/output
  • The command under "Dataflow" should have a comment with a link on how to authenticate. For the Python tab it should read:
    # Make sure you are authenticated by following the instructions on https://cloud.google.com/dataflow/docs/quickstarts/quickstart-python.
    Equivalently, the Java tab should have a comment linking both https://cloud.google.com/dataflow/docs/quickstarts/quickstart-java-maven and https://cloud.google.com/dataflow/docs/quickstarts/quickstart-java-eclipse.
    (Side note: it is interesting that 1) the authentication is listed in every quickstart, instead of pointing just to the "all quickstarts", 2) the opening sentence for every one of the quickstarts is different.)
  • Add an explanation after the command: “After running the command, a file output-0000-of-0001 will be written to path/to/write/.”
    (As a side note, I don't know why we have that notation instead of the more natural <file_name>-x-of-y with x in [1, y])

@mariapython

File https://github.com/apache/beam-site/blob/asf-site/src/get-started/quickstart-py.md, section Download and install --> Extra Requirements:

  • Why should "requirements" be capitalized? ("install" is not)
  • pip install apache-beam[feature1, feature2] doesn't work. It should read pip install apache-beam[feature1,feature2] (no space after comma).
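
One likely reason the version with a space fails, assuming a POSIX-style shell, is that the shell splits the command line on whitespace, so pip receives two broken arguments instead of one requirement. The sketch below uses the standard library's `shlex.split` to mimic that splitting; `feature1` and `feature2` are the placeholders from the comment above, not real extras.

```python
import shlex

# How a POSIX shell would split each command line into arguments.
with_space = shlex.split("pip install apache-beam[feature1, feature2]")
without_space = shlex.split("pip install apache-beam[feature1,feature2]")

print(with_space)
print(without_space)
```

With the space, the requirement is cut into `apache-beam[feature1,` and `feature2]`; without it, the whole requirement arrives as a single argument. Quoting the requirement on the command line would also keep it in one piece.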

mariapython commented Apr 13, 2018

5/1/2018 UPDATE: A decision needs to be made, so I will not make any more updates for now. You can merge AS IS.
(*** This section is incomplete, I will remove this note when I complete its review.)
File beam-site/src/get-started/wordcount-example.md:

  • Section "MinimalWordCount example":
    • Since there is an "Adapt for: Python/Java" at the beginning of the website, the lines "To run this example in Java:" and "To run this example in Python:" are unnecessary, as the display should toggle them accordingly.
    • In the case of "To run this example in Java:", there should be tabs for every runner, just like in Python. The command currently shown
      $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.MinimalWordCount
      is probably for the DirectRunner.
    • The Python command for the DirectRunner is wrong, as it assumes there will be a README.md file where the command is executed, which won't generally be the case.
  • Section "WordCount example":
    • To be completed
  • Section "DebuggingWordCount example":
    • Under "Testing your pipeline via PAssert," python should read: "This feature is currently working for batch, but is under development for streaming."
  • Section "WindowedWordCount example":
    • To be completed
  • Section "StreamingWordCount example":
    • To be completed

mariapython commented Apr 16, 2018

5/1/2018 UPDATE: A decision needs to be made, so I will not make any more updates for now. You can merge AS IS.
(*** This section is incomplete, I will remove this note when I complete its review.)
File src/get-started/mobile-gaming-example.md:

  • Section "UserScore: Basic Score Processing in Batch":
    • To be completed
  • Section "HourlyTeamScore: Advanced Processing in Batch with Windowing":
    • To be completed
  • Section "LeaderBoard: Streaming Processing with Real-Time Game Data":
    • To be completed
  • Section "GameStats: Abuse Detection and Usage Analysis":
    • To be completed

Member

udim commented Apr 17, 2018

@mariapython Installing packages under ~/.local/ seems to make things more complicated, but not everyone can (or should) upgrade system packages (because dependencies might break). So perhaps the best way is to use --user when installing pip and virtualenv, and install the rest of the packages in a virtualenv to be consistent with our contribution and quickstart guides.

$ pip install --user --upgrade pip virtualenv
...
$ python -m pip --version
pip 10.0.0 from /home/user/.local/lib/python2.7/site-packages/pip (python 2.7)
$ $HOME/.local/bin/virtualenv --version
...

messages from a Pub/Sub subscription or topic, and performs a frequency count on
the words in each message. Similar to WindowedWordCount, this example applies
fixed-time windowing, wherein each window represents a fixed time interval. The
fixed window size for this example is 15 minutes. The pipeline outputs the

It should read "15 seconds"
(source at https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/streaming_wordcount.py#L77)

the words in each message. Similar to WindowedWordCount, this example applies
fixed-time windowing, wherein each window represents a fixed time interval. The
fixed window size for this example is 15 minutes. The pipeline outputs the
frequency count of the words seen in each 15 minute window.

Likewise, it should read "15 seconds."
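
To show what a 15-second fixed window means in practice, here is a minimal plain-Python sketch of how element timestamps map to windows. It is not Beam code: `fixed_window` is a hypothetical helper, and it assumes integer timestamps in seconds and windows aligned to zero.

```python
def fixed_window(timestamp, size):
    """Return the [start, end) bounds of the fixed window containing timestamp."""
    start = timestamp - (timestamp % size)
    return start, start + size


WINDOW_SIZE_SECONDS = 15

# Elements at 3 s and 14 s share a window; 15 s starts the next one.
windows = [fixed_window(t, WINDOW_SIZE_SECONDS) for t in (3, 14, 15, 31)]
print(windows)
```

Each word count in the example would then be computed separately for every such 15-second interval.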

Author

melap commented Apr 24, 2018

retest this please

Member

aaltay commented Apr 27, 2018

@melap what is the status of this PR? Do you need my help?

Author

melap commented May 2, 2018

Made a bunch of updates, PTAL. After it looks good, I'll squash.

Author

melap commented May 7, 2018

@asfgit merge

asfgit pushed a commit that referenced this pull request May 7, 2018
@asfgit asfgit merged commit 6bff9dc into apache:asf-site May 7, 2018
robertwb pushed a commit to robertwb/incubator-beam that referenced this pull request Jun 5, 2018
melap pushed a commit to apache/beam that referenced this pull request Jun 20, 2018