[BEAM-825] Fill in the documentation/runners/apex portion of the website #98
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,5 +5,46 @@ permalink: /documentation/runners/apex/ | |
--- | ||
# Using the Apache Apex Runner | ||
|
||
This page is under construction ([BEAM-825](https://issues.apache.org/jira/browse/BEAM-825)). | ||
The Apex Runner executes Apache Beam pipelines using [Apache Apex](http://apex.apache.org/) as an underlying engine. The runner has broad support for the [Beam model and supports streaming and batch pipelines]({{ site.baseurl }}/documentation/runners/capability-matrix/). | ||
Review comment: Grammar: a broad support |
||
|
||
[Apache Apex](http://apex.apache.org/) is a stream processing platform and framework for low-latency, high-throughput and fault-tolerant analytics applications on Apache Hadoop. Apex has a unified streaming architecture and can be used for real-time and batch processing. With its stateful stream processing architecture, Apex can support all of the concepts in the Beam model (event time, triggers, watermarks etc.). | ||
|
||
## Apex-Runner prerequisites and setup | ||
Review comment: Please remove the dash between Apex and Runner. |
||
|
||
You may set up your own Hadoop cluster, and [setup Apache Apex on top of it](http://apex.apache.org/docs/apex/apex_development_setup/) or choose any vendor-specific distribution that includes Hadoop and Apex pre-installed. Please see the [distribution information on the Apache Apex website](http://apex.apache.org/downloads.html). | ||
Review comment: Double space before "and". |
||
|
||
## Running wordcount using Apex-Runner | ||
Review comment: wordcount --> the WordCount example |
||
|
||
Download some data for processing and put it on HDFS | ||
Review comment: "Upload input data to a location in HDFS". Also, end with a period ('.'). |
||
``` | ||
curl http://www.gutenberg.org/cache/epub/1128/pg1128.txt > /tmp/kinglear.txt | ||
hdfs dfs -mkdir -p /tmp/input/ | ||
hdfs dfs -put /tmp/kinglear.txt /tmp/input/ | ||
``` | ||
|
||
The output directory should not exist on HDFS. Delete it if it exists. | ||
``` | ||
hdfs dfs -rm -r -f /tmp/output/ | ||
``` | ||
|
||
Run the wordcount example | ||
Review comment: WordCount |
||
``` | ||
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args="--inputFile=/tmp/input/ --output=/tmp/output/ --runner=ApexRunner --embeddedExecution=false --configFile=beam-runners-apex.properties" -Papex-runner | ||
Review comment: Where do the properties come from?
Review comment: One version, with limited documentation: https://github.com/apache/beam/blob/master/runners/apex/src/test/resources/beam-runners-apex.properties Would be better with a link to the list of properties in a comment.
Review comment: The possible attributes are here: https://github.com/apache/apex-core/blob/master/api/src/main/java/com/datatorrent/api/Context.java I think we should merge this PR and then add examples for the properties in a separate PR. The instructions are already useful and we validated that the YARN launcher works today. |
||
``` | ||
|
||
This will launch an Apex application. | ||
Review comment: Move this sentence into the step above, e.g.: "Run the WordCount example as an Apex application." |
||
|
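On the question raised in the thread about where `beam-runners-apex.properties` comes from: the file is supplied by the user and passed through `--configFile`. The following is a minimal sketch of creating one, not a verified configuration. `MEMORY_MB` is an operator attribute declared in the Apex `Context.java` linked above; the `apex.application.*.operator.*` key pattern and the value `1024` are assumptions for illustration and should be checked against the Apex release in use.

```shell
#!/bin/sh
# Sketch: write a small properties file to pass via --configFile.
# ASSUMPTION: the key pattern and the value are illustrative only and are
# not verified against a particular Apex release. MEMORY_MB is an operator
# attribute declared in Apex's Context.java.
CONFIG_FILE="${CONFIG_FILE:-beam-runners-apex.properties}"

# The quoted 'EOF' keeps the shell from expanding the '*' wildcards.
cat > "$CONFIG_FILE" <<'EOF'
# Memory (in MB) requested per operator (illustrative value).
apex.application.*.operator.*.attr.MEMORY_MB=1024
EOF

echo "Wrote $CONFIG_FILE"
```

Run this from the directory where the `mvn` command is launched so that the relative `--configFile=beam-runners-apex.properties` path resolves.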
||
## Checking output | ||
|
||
The sample program which is processing small amount of data would finish quickly. You can check contents on /tmp/output/ on HDFS | ||
Review comment: The first sentence should be obvious. I'd just remove it. Second sentence: end with a period ('.'). Also, no need for the /tmp/output in the prose -- it is obvious from the command below. You can just say: "Check the output of the pipeline from an HDFS location." |
||
``` | ||
hdfs dfs -ls /tmp/output/ | ||
``` | ||
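Beyond listing the directory, the results themselves can be inspected. The sketch below assumes the WordCount output lines have the form `word: count` and that the output was written under `/tmp/output/` as in the command above; `top_words` is a hypothetical helper name introduced here, not part of Beam or Apex.

```shell
#!/bin/sh
# Sketch: show the most frequent words from the WordCount output.
# ASSUMPTION: each output line has the form "word: count".

# top_words N: read "word: count" lines on stdin and print the N lines
# with the highest counts (default 10). Hypothetical helper.
top_words() {
  sort -t ':' -k2,2nr | head -n "${1:-10}"
}

# Only touch HDFS when the hdfs client is installed. The glob is quoted so
# that HDFS expands it rather than the local shell.
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -cat '/tmp/output/*' | top_words 10
fi
```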
|
||
## Monitoring progress of your job | ||
|
||
Depending on your installation, you may be able to monitor the progress of your job on the Hadoop cluster. Alternately, you have the following options: | ||
Review comment: Typo: alternatively |
||
|
||
* YARN : Using YARN web UI generally running on 8088 on the node running resource manager | ||
Review comment: End both sentences with a period. |
||
* Apex cli: [Using apex cli to get running application information](http://apex.apache.org/docs/apex/apex_cli/#apex-cli-commands) | ||
Review comment: Apex cli -> Apex command-line interface; apex -> Apex |
||
|
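The YARN option in the list above can also be exercised from a terminal. This sketch assumes the ResourceManager web port is 8088, as the bullet suggests; `RM_HOST` is a placeholder environment variable and `rm_apps_url` is a hypothetical helper, both introduced here for illustration.

```shell
#!/bin/sh
# Sketch: two ways to list running YARN applications.
# ASSUMPTION: the ResourceManager web port defaults to 8088, and RM_HOST is
# a placeholder for your ResourceManager host.

# rm_apps_url HOST [PORT]: build the ResourceManager REST URL that lists
# applications in the RUNNING state. Hypothetical helper.
rm_apps_url() {
  printf 'http://%s:%s/ws/v1/cluster/apps?states=RUNNING\n' "$1" "${2:-8088}"
}

# Option 1: the yarn command-line client, if it is installed here.
if command -v yarn >/dev/null 2>&1; then
  yarn application -list -appStates RUNNING
fi

# Option 2: the ResourceManager REST API, only when RM_HOST is set.
if [ -n "${RM_HOST:-}" ] && command -v curl >/dev/null 2>&1; then
  curl -s "$(rm_apps_url "$RM_HOST")"
fi
```

For example, `RM_HOST=rm.example.com sh list-apps.sh` would query that host, where both the host name and the script name are placeholders.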
Review comment: The runner does not currently have good support for batch pipelines; please remove this.