[BEAM-825] Fill in the documentation/runners/apex portion of the website #98

Closed
43 changes: 42 additions & 1 deletion src/documentation/runners/apex.md
@@ -5,5 +5,46 @@ permalink: /documentation/runners/apex/
---
# Using the Apache Apex Runner

This page is under construction ([BEAM-825](https://issues.apache.org/jira/browse/BEAM-825)).
The Apex Runner executes Apache Beam pipelines using [Apache Apex](http://apex.apache.org/) as an underlying engine. The runner has broad support for the [Beam model and supports streaming and batch pipelines]({{ site.baseurl }}/documentation/runners/capability-matrix/).

Review comment: The runner does not currently have good support for batch pipelines; please remove this.

Review comment: Grammar: a broad support


[Apache Apex](http://apex.apache.org/) is a stream processing platform and framework for low-latency, high-throughput and fault-tolerant analytics applications on Apache Hadoop. Apex has a unified streaming architecture and can be used for real-time and batch processing. With its stateful stream processing architecture, Apex can support all of the concepts in the Beam model (event time, triggers, watermarks etc.).

## Apex-Runner prerequisites and setup

Review comment: Please remove the dash between Apex and Runner.


You may set up your own Hadoop cluster, and [setup Apache Apex on top of it](http://apex.apache.org/docs/apex/apex_development_setup/) or choose any vendor-specific distribution that includes Hadoop and Apex pre-installed. Please see the [distribution information on the Apache Apex website](http://apex.apache.org/downloads.html).

Review comment: Double space before "and".


## Running wordcount using Apex-Runner

Review comment:
wordcount --> the WordCount example
Apex-Runner --> Apex Runner


Download some data for processing and put it on HDFS

Review comment: Upload input data to a location in HDFS

Also, end with a period ('.').

```
curl http://www.gutenberg.org/cache/epub/1128/pg1128.txt > /tmp/kinglear.txt
hdfs dfs -mkdir -p /tmp/input/
hdfs dfs -put /tmp/kinglear.txt /tmp/input/
```

The output directory should not exist on HDFS. Delete it if it exists.
```
hdfs dfs -rm -r -f /tmp/output/
```

Run the wordcount example

Review comment: WordCount

```
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args="--inputFile=/tmp/input/ --output=/tmp/output/ --runner=ApexRunner --embeddedExecution=false --configFile=beam-runners-apex.properties" -Papex-runner
```

Review comment: Where do the properties come from?

Review comment: One version, with limited documentation: https://github.com/apache/beam/blob/master/runners/apex/src/test/resources/beam-runners-apex.properties

It would be better to link to the list of properties in a comment.

Review comment: The possible attributes are here: https://github.com/apache/apex-core/blob/master/api/src/main/java/com/datatorrent/api/Context.java

I think we should merge this PR and then add examples for the properties in a separate PR. The instructions are already useful and we validated that the YARN launcher works today.

This will launch an Apex application.

Review comment: Move this sentence into the line above, e.g.:

Run the WordCount example as an Apex application.


## Checking output

The sample program which is processing small amount of data would finish quickly. You can check contents on /tmp/output/ on HDFS

Review comment: The first sentence should be obvious. I'd just remove it.

Second sentence: end with a period ('.').

Also, no need for the /tmp/output in the prose -- it is obvious from the command below. You can just say: "Check the output of the pipeline from an HDFS location."

```
hdfs dfs -ls /tmp/output/
```

## Montoring progress of your job

Depending on your installation, you may be able to monitor the progress of your job on the Hadoop cluster. Alternately, you have folloing optoins:

Review comment:
Typo: alternatively
Typo: following
Typo: options


* YARN : Using YARN web UI generally running on 8088 on the node running resource manager

Review comment: End both sentences with a period.

* Apex cli: [Using apex cli to get running application information](http://apex.apache.org/docs/apex/apex_cli/#apex-cli-commands)

Review comment:
Apex cli -> Apex command-line interface
apex -> Apex
cli -> CLI
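
For readers trying the YARN monitoring option from the diff, the following is a minimal shell sketch, not part of the PR. It assumes the ResourceManager web port is 8088, as the diff states, and that the `yarn` and `curl` commands are on the PATH. The function names and the host name `rm.example.com` are hypothetical placeholders.

```shell
#!/bin/sh
# Sketch only: function names and host names are illustrative, not from the PR.

# Build the ResourceManager REST URL that lists running applications.
# $1 = ResourceManager host, $2 = web port (assumed default: 8088).
rm_apps_url() {
  host="$1"
  port="${2:-8088}"
  printf 'http://%s:%s/ws/v1/cluster/apps?states=RUNNING\n' "$host" "$port"
}

# List running applications with the YARN command-line client.
list_running_apps() {
  yarn application -list -appStates RUNNING
}

# Fetch the same list as JSON from the ResourceManager REST API.
fetch_running_apps() {
  curl -s "$(rm_apps_url "$@")"
}
```

After sourcing the script on a cluster node, `list_running_apps` should show the launched application among the running ones, and `fetch_running_apps rm.example.com` returns the same list as JSON.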