[BEAM-825] Fill in the documentation/runners/apex portion of the website #98
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): Jenkins built the site at commit id d8ecc4e with Jekyll and staged it here. Happy reviewing. Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again. |
Refer to this link for build results (access rights to CI server needed): Jenkins built the site at commit id e47749d with Jekyll and staged it here. Happy reviewing. Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again. |
The build fails on the external-links check, but all the links are valid and working. Not sure of the root cause. |
Thanks @sandeepdeshmukh, this is a great start!
Left a few minor comments, but the high-level point of the page being somewhat vendor-specific should be discussed. What do you think?
CC: @francesperry
CC: @jbonofre
src/documentation/runners/apex.md
Outdated
@@ -2,8 +2,113 @@ | |||
layout: default | |||
title: "Apache Apex Runner" | |||
permalink: /documentation/runners/apex/ | |||
redirect_from: /learn/runners/apex/ |
Not necessary, I'd remove.
src/documentation/runners/apex.md
Outdated
--- | ||
# Using the Apache Apex Runner | ||
|
||
This page is under construction ([BEAM-825](https://issues.apache.org/jira/browse/BEAM-825)). | ||
Apex‐Runner is a Runner for Apache Beam which executes Beam pipelines with Apache Apex as underlying engine. The runner has broad support for the Beam model and supports streaming and batch pipelines. |
The Apex Runner executes Apache Beam pipelines using [Apache Apex](link)
as an underlying engine.
Done.
src/documentation/runners/apex.md
Outdated
--- | ||
# Using the Apache Apex Runner | ||
|
||
This page is under construction ([BEAM-825](https://issues.apache.org/jira/browse/BEAM-825)). | ||
Apex‐Runner is a Runner for Apache Beam which executes Beam pipelines with Apache Apex as underlying engine. The runner has broad support for the Beam model and supports streaming and batch pipelines. |
The second sentence could link to the capability matrix.
Done.
src/documentation/runners/apex.md
Outdated
|
||
[Apache Apex](http://apex.apache.org/) is a stream processing platform and framework for low-latency, high-throughput and fault-tolerant analytics applications on Apache Hadoop. Apex is Java based and also provides its own API for application development (native compositional and declarative Java API, SQL) with a comprehensive [operator library](https://github.com/apache/apex-malhar). Apex has a unified streaming architecture and can be used for real-time and batch processing. With its stateful stream processing architecture Apex can support all of the concepts in the Beam model (event time, triggers, watermarks etc.). |
I'd remove "Apex is Java based and also provides its own API for application development (native compositional and declarative Java API, SQL) with a comprehensive operator library." -- not applicable in this context.
Done.
src/documentation/runners/apex.md
Outdated
|
||
|
||
|
||
## Apex-Runner prerequisites and setup |
Would it be possible to make this vendor neutral?
Not the UI portion. It relies on the DataTorrent community edition.
We need a Hadoop cluster with YARN set up, so I was trying to make it easy for people to get started. Hence the idea of using a ready-made sandbox.
To monitor details of the application via a UI, the user can use the community edition.
Do share your thoughts on how to proceed.
Acknowledged. Let me think about it and try to figure out whether there are any policies or precedents here.
Also, CC: @francesperry and @jbonofre.
e47749d to ea2ba63
Refer to this link for build results (access rights to CI server needed): Jenkins built the site at commit id ea2ba63 with Jekyll and staged it here. Happy reviewing. Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again. |
@davorbonaci the YARN cluster is the user's choice. Apex runs on all distros. The DataTorrent sandbox is just one option; YARN via Dataproc could be another. The UI can be installed on either. @kennknowles may have further thoughts on this. |
After some thinking, I think Beam shouldn't be giving a preference to any Apex distributions or services. A Dataproc runner could do it, a DataTorrent runner could do it as well, but a generic Apache Apex runner should not. There's another level of indirection here and, when referring to the Apex runner, we should be Apex vendor-independent. Similarly, Flink runner should be referring to Apache Flink only. |
src/documentation/runners/apex.md
Outdated
|
||
## Apex-Runner prerequisites and setup | ||
|
||
To use Apache Apex Runner for Beam, you need to have a Hadoop cluster with YARN setup. You can setup your own Hadoop cluster or simply download DataTorrent Sandbox which comes with Hadoop and Apex pre-installed. |
You may set up your own Hadoop cluster, or choose any vendor-specific distribution that includes Hadoop and Apex pre-installed. Please see the [distribution information on the Apache Apex website](some link).
+1. This new wording gives users what they need to make an informed decision. Maybe a bit less telegraphic. I think it is OK to give generic encouragement to first try it with a distro and only later go a DIY route.
You can set up your own Hadoop+YARN cluster or use an existing cluster if you have one available. Otherwise, to get started quickly and try out Beam it may be easiest to use a vendor distribution that includes Hadoop and Apex pre-installed. Please see the [distribution information on the Apache Apex website](some link) to learn of options.
src/documentation/runners/apex.md
Outdated
hadoop/sbin/stop-yarn.sh && hadoop/sbin/start-yarn.sh | ||
``` | ||
|
||
#### 2. Download code and compile |
we should use releases, not running from head
src/documentation/runners/apex.md
Outdated
|
||
## Montoring progress of your job | ||
|
||
If you are using DataTorrent RTS Sandbox then you can monitor progress of the application on http://localhost:9090/ using your browser. The console provides you updates on the application progress, general statistics like number of operators, number of containers - planned and allcated, allocated memory to the application etc. |
Depending on your installation, you may be able to monitor the progress of your job on the Hadoop cluster. Please see the relevant information on the Apache Apex website. (add specific links)
@sandeepdeshmukh any update on incorporating the feedback? |
I am working on it. Trying to run the app on YARN and checking things around. |
ea2ba63 to d80cd54
Refer to this link for build results (access rights to CI server needed): Jenkins built the site at commit id d80cd54 with Jekyll and staged it here. Happy reviewing. Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again. |
Updated the PR to address the review comments. It is currently based on the Apex 3.5.0 and Beam 0.4-Incubating releases. |
Thanks @sandeepdeshmukh; left a few comments.
src/documentation/runners/apex.md
Outdated
|
||
Wordcount example to run on Apex-Runner | ||
``` | ||
git clone https://github.com/tweise/apex-samples.git |
Can we use Beam examples instead, similarly to other runners?
The Apex runner does not work as-is with the Beam examples, so we will maintain this example until the Apex runner works out of the box with them.
We should mention why this is currently needed here as well as in the README in https://github.com/tweise/apex-samples/tree/master/beam-apex-wordcount
Reasons include the ability to specify the resource constraints (BEAM-980) etc.
src/documentation/runners/apex.md
Outdated
|
||
Download some data for processing and put it on HDFS | ||
``` | ||
curl http://www.gutenberg.org/cache/epub/1128/pg1128.txt > /tmp/kinglear.txt |
wouldn't this be automatically handled by Beam's WordCount?
The default is to use gs:// but we are using HDFS based input. So added steps explicitly for this.
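The HDFS staging step described in this comment can be sketched roughly as below. This is an illustrative script, not text from the PR: the `run` and `stage_input` helpers and the dry-run default are assumptions made for the example, and only the Gutenberg URL and the `/tmp` paths come from the thread. It assumes `curl` and a configured `hdfs` client are on the PATH.

```shell
#!/bin/sh
# Sketch: download sample text and stage it in HDFS as WordCount input.
set -eu

# Hypothetical helper: with DRY_RUN=1 (the default here) commands are only
# printed; set DRY_RUN=0 on a machine with a configured Hadoop client.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# stage_input URL LOCAL_FILE HDFS_DIR
stage_input() {
  run curl -fsSL -o "$2" "$1"     # download the sample text locally
  run hdfs dfs -mkdir -p "$3"     # create the HDFS input directory if missing
  run hdfs dfs -put -f "$2" "$3"  # upload, overwriting any previous copy
}

stage_input http://www.gutenberg.org/cache/epub/1128/pg1128.txt /tmp/kinglear.txt /tmp/input/
```

The HDFS directory then matches the `--inputFile=/tmp/input/` argument used in the WordCount command quoted later in the review.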
src/documentation/runners/apex.md
Outdated
Depending on your installation, you may be able to monitor the progress of your job on the Hadoop cluster. Alternately, you have folloing optoins: | ||
|
||
* YARN : Using YARN web UI generally running on 8088 on the node running resource manager | ||
* Apex cli: [Using apex cli to get running application information](http://apex.apache.org/docs/apex/apex_cli/#apex-cli-commands) |
console.jpg
is not currently used? drop?
Done.
d80cd54 to 85fbcae
Refer to this link for build results (access rights to CI server needed): Jenkins built the site at commit id 85fbcae with Jekyll and staged it here. Happy reviewing. Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again. |
85fbcae to f126242
Refer to this link for build results (access rights to CI server needed): Jenkins built the site at commit id a6b435e with Jekyll and staged it here. Happy reviewing. Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again. |
@tweise : I have updated the PR with your suggestions. |
@@ -5,5 +5,72 @@ permalink: /documentation/runners/apex/ | |||
--- | |||
# Using the Apache Apex Runner | |||
|
|||
This page is under construction ([BEAM-825](https://issues.apache.org/jira/browse/BEAM-825)). | |||
The Apex Runner executes Apache Beam pipelines using [Apache Apex](http://apex.apache.org/) as an underlying engine. The runner has broad support for the [Beam model and supports streaming and batch pipelines]({{ site.baseurl }}/documentation/runners/capability-matrix/). |
The runner does not have good support for batch pipelines currently, please remove this.
src/documentation/runners/apex.md
Outdated
|
||
[Apache Apex](http://apex.apache.org/) is a stream processing platform and framework for low-latency, high-throughput and fault-tolerant analytics applications on Apache Hadoop. Apex has a unified streaming architecture and can be used for real-time and batch processing. With its stateful stream processing architecture, Apex can support all of the concepts in the Beam model (event time, triggers, watermarks etc.). | ||
|
||
ApexRunner does not implement launch on YARN yet, hence we move the Beam pipeline code into Apex StreamingApplication to translate it into the Apex DAG and then launch the application using apex CLI. Later, instead of implementing Apex StreamingApplication, the main method will call the runner. The current process is an intermedia step as of 0.4.0. |
There is support for launch on YARN now.
With https://issues.apache.org/jira/browse/BEAM-980 in master, it should be possible to come up with a configuration that allows running the existing example as is, without the Apex packaging workaround? |
I will take care of these comments in a day or two. |
I still have a problem with usage of the private repositories in our instructions, outside Apache's control -- we should fix this. Otherwise, no worries here. |
@tweise : How do you recommend addressing @davorbonaci's comment? |
See the earlier comment: there should be no need for this workaround. The example should work as is with the YARN launcher in master. |
a6b435e to 468d2fd
Refer to this link for build results (access rights to CI server needed): Jenkins built the site at commit id 468d2fd with Jekyll and staged it here. Happy reviewing. Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again. |
Thanks @sandeepdeshmukh. I like the content. This is nice.
I left just a few grammar/style/typo comments that I think we can resolve quickly and get this done! Once again, thanks.
|
||
[Apache Apex](http://apex.apache.org/) is a stream processing platform and framework for low-latency, high-throughput and fault-tolerant analytics applications on Apache Hadoop. Apex has a unified streaming architecture and can be used for real-time and batch processing. With its stateful stream processing architecture, Apex can support all of the concepts in the Beam model (event time, triggers, watermarks etc.). | ||
|
||
## Apex-Runner prerequisites and setup |
Please remove dash between Apex and Runner.
@@ -5,5 +5,46 @@ permalink: /documentation/runners/apex/ | |||
--- | |||
# Using the Apache Apex Runner | |||
|
|||
This page is under construction ([BEAM-825](https://issues.apache.org/jira/browse/BEAM-825)). | |||
The Apex Runner executes Apache Beam pipelines using [Apache Apex](http://apex.apache.org/) as an underlying engine. The runner has broad support for the [Beam model and supports streaming and batch pipelines]({{ site.baseurl }}/documentation/runners/capability-matrix/). |
Grammar: a broad support
|
||
## Apex-Runner prerequisites and setup | ||
|
||
You may set up your own Hadoop cluster, and [setup Apache Apex on top of it](http://apex.apache.org/docs/apex/apex_development_setup/) or choose any vendor-specific distribution that includes Hadoop and Apex pre-installed. Please see the [distribution information on the Apache Apex website](http://apex.apache.org/downloads.html). |
Double space before and.
|
||
You may set up your own Hadoop cluster, and [setup Apache Apex on top of it](http://apex.apache.org/docs/apex/apex_development_setup/) or choose any vendor-specific distribution that includes Hadoop and Apex pre-installed. Please see the [distribution information on the Apache Apex website](http://apex.apache.org/downloads.html). | ||
|
||
## Running wordcount using Apex-Runner |
wordcount --> the WordCount example
Apex-Runner --> Apex Runner
|
||
## Running wordcount using Apex-Runner | ||
|
||
Download some data for processing and put it on HDFS |
Upload input data to a location in HDFS
Also, end with a period ('.').
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args="--inputFile=/tmp/input/ --output=/tmp/output/ --runner=ApexRunner --embeddedExecution=false --configFile=beam-runners-apex.properties" -Papex-runner | ||
``` | ||
|
||
This will launch an Apex application. |
move this comment into above, e.g.:
Run the WordCount example as an Apex application.
|
||
## Checking output | ||
|
||
The sample program which is processing small amount of data would finish quickly. You can check contents on /tmp/output/ on HDFS |
The first sentence should be obvious. I'd just remove it.
Second sentence: end with a period ('.').
Also, no need for the /tmp/output in the prose -- it is obvious from the command below. You can just say: "Check the output of the pipeline from an HDFS location."
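The command referred to here is not reproduced in this thread. One possible way to inspect the results, assuming the pipeline wrote its shards under the `/tmp/output/` prefix used in the WordCount command, might look like the sketch below. The `run` and `check_output` helpers and the dry-run default are hypothetical and exist only for this example.

```shell
#!/bin/sh
# Sketch: inspect WordCount results written to HDFS.
set -eu

# Hypothetical helper: with DRY_RUN=1 (the default here) commands are only
# printed; set DRY_RUN=0 on a machine with a configured Hadoop client.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# check_output HDFS_DIR
check_output() {
  run hdfs dfs -ls "$1"            # list the output shards
  # The glob is passed through quoted so that HDFS, not the local shell,
  # expands it against the remote directory.
  run hdfs dfs -cat "${1%/}/*"     # print the word counts
}

check_output /tmp/output/
```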
|
||
## Montoring progress of your job | ||
|
||
Depending on your installation, you may be able to monitor the progress of your job on the Hadoop cluster. Alternately, you have folloing optoins: |
Typo: alternatively
Typo: following
Typo: options
|
||
Depending on your installation, you may be able to monitor the progress of your job on the Hadoop cluster. Alternately, you have folloing optoins: | ||
|
||
* YARN : Using YARN web UI generally running on 8088 on the node running resource manager |
End both sentences with a period.
Depending on your installation, you may be able to monitor the progress of your job on the Hadoop cluster. Alternately, you have folloing optoins: | ||
|
||
* YARN : Using YARN web UI generally running on 8088 on the node running resource manager | ||
* Apex cli: [Using apex cli to get running application information](http://apex.apache.org/docs/apex/apex_cli/#apex-cli-commands) |
Apex cli -> Apex command-line interface
apex -> Apex
cli -> CLI
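Besides the YARN web UI and the Apex CLI listed in the quoted text, the standard `yarn` command-line client can report on a running job. The sketch below is an illustration under assumptions, not part of the PR: the helper names and the dry-run default are hypothetical, and the application ID must be taken from the listing on your own cluster.

```shell
#!/bin/sh
# Sketch: monitor a pipeline running on YARN with the stock yarn client.
set -eu

# Hypothetical helper: with DRY_RUN=1 (the default here) commands are only
# printed; set DRY_RUN=0 on a machine with a configured Hadoop client.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# List running applications; the launched pipeline should be among them.
list_running_apps() {
  run yarn application -list -appStates RUNNING
}

# Show state, progress and tracking URL for one application ID.
app_status() {
  run yarn application -status "$1"
}

# Fetch aggregated container logs (requires YARN log aggregation).
app_logs() {
  run yarn logs -applicationId "$1"
}

list_running_apps
```

Once the listing shows the application ID, `app_status` and `app_logs` can be called with it.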
|
||
Run the wordcount example | ||
``` | ||
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args="--inputFile=/tmp/input/ --output=/tmp/output/ --runner=ApexRunner --embeddedExecution=false --configFile=beam-runners-apex.properties" -Papex-runner |
Where do the properties come from?
One version, with limited documentation: https://github.com/apache/beam/blob/master/runners/apex/src/test/resources/beam-runners-apex.properties
Would be better with a link to the list of properties in a comment
The possible attributes are here: https://github.com/apache/apex-core/blob/master/api/src/main/java/com/datatorrent/api/Context.java
I think we should merge this PR and then add examples for the properties in a separate PR. The instructions are already useful and we validated that the YARN launcher works today.
@sandeepdeshmukh, any update here perhaps? |
I am extremely sorry for the delay. I am on a long vacation with no compute access and hence requesting @chinmaykolhatkar to please take this up. |
@davorbonaci @dhalperi as mentioned above, I would suggest merging this and opening a follow-up JIRA to clarify the properties usage. |
@dhalperi the issue I run into when trying to run the example:
The command line was:
I tried it without the file scheme also, same result. |
The Apex CLI binaries are at https://github.com/atrato/apex-cli-package/releases. @dhalperi can probably add some notes about troubleshooting the pipeline when it runs on YARN. |
@tweise -- sounds great. Can you take over, fixup as desired, and merge? |
@tweise : Could you please review?