[BEAM-507] Fill in the documentation/runners/spark portion of the web… #103
R: @jbonofre
Refer to this link for build results (access rights to CI server needed): Jenkins built the site at commit id 5e5caf7 with Jekyll and staged it here. Happy reviewing. Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again.
```
<artifactId>beam-runners-spark</artifactId>
<version>{{ site.release_latest }}</version>
</dependency>
```
As the Spark runner doesn't provide a BoM (I created another Jira about that), I think end-users have to define the following additional dependencies:
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>${hadoop.version}</version>
  <exclusions>
    <!-- exclude old Jetty version of servlet API -->
    <exclusion>
      <groupId>org.mortbay.jetty</groupId>
      <artifactId>servlet-api</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>${hadoop.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-mapreduce-client-core</artifactId>
  <version>${hadoop.version}</version>
</dependency>
<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-core</artifactId>
  <version>${jackson.version}</version>
</dependency>
<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-annotations</artifactId>
  <version>${jackson.version}</version>
</dependency>
<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-databind</artifactId>
  <version>${jackson.version}</version>
</dependency>
<dependency>
  <groupId>com.fasterxml.jackson.module</groupId>
  <artifactId>jackson-module-scala_2.10</artifactId>
  <version>${jackson.version}</version>
</dependency>
Great! I'll use this instead.
Does this one work for you?
It's what I'm using in the spark-runner Maven profile in beam-samples. Not tested super recently (I will do it), but it worked fine.
I ran a very simple `Create` -> `Distinct` -> `TextIO.Write` pipeline, self-contained, on a Spark Standalone cluster (master + 1 executor, on my laptop), and it didn't require any additional dependencies (except for Spark itself, since the application is self-contained; Spark is not pre-deployed the way it sometimes is on YARN installations). I'm adding an example based on that. YARN examples had better wait for HDFS support.
OK, fair enough. I will do a test but it sounds good.
### Deploying Spark with your application

In some cases, such as running in local mode, your (self-contained) application would be required to pack Spark by explicitly adding the following dependencies in your pom.xml:
I would add a "clear" sentence like: Spark runner standalone/embedded mode.
I used local mode because standalone is a bit confusing as it is the name of Spark's own resource manager.
Yes, agreed, that's cleaner. My point was really just to let the user understand that Spark is "part" of the execution (it's not an external cluster).
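For readers following this thread, here is a hedged sketch of what packing Spark for local mode might look like in `pom.xml`. It is not the text of the PR: the artifact names (`spark-core_2.10`, `spark-streaming_2.10`) are assumed from the Scala 2.10 build that appears elsewhere in this conversation, and `${spark.version}` is a hypothetical property you would have to define yourself.

```xml
<!-- Sketch only: Spark packed with a self-contained application (local mode). -->
<!-- Assumption: a Scala 2.10 build of Spark; adjust the suffix to your build. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <!-- Hypothetical property; define it under <properties> in your pom.xml. -->
  <version>${spark.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.10</artifactId>
  <version>${spark.version}</version>
</dependency>
```

When the pipeline is instead submitted to a cluster that already ships Spark, these dependencies would presumably not need to be bundled, which matches the point made in the thread above.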
Deploying your Beam pipeline on a cluster that already has a Spark deployment does not require any additional dependencies.
For more details on the different deployment modes see: [Standalone](http://spark.apache.org/docs/latest/spark-standalone.html), [YARN](http://spark.apache.org/docs/latest/running-on-yarn.html), or [Mesos](http://spark.apache.org/docs/latest/running-on-mesos.html).
Maybe an example using `spark-submit` would help (even if it might look stupid).
Agree.
I'll add the example pom dependencies and a submit-to-YARN example.
@jbonofre added an example for packaging the application and submitting it to a Standalone cluster.
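As an illustration of the packaging step mentioned here, the following is a minimal sketch of a `maven-shade-plugin` configuration that bundles an application and its dependencies into a single jar. It is an assumption about what such an example could look like, not a copy of what was added in this PR; the `shaded` classifier name is an arbitrary choice.

```xml
<!-- Sketch only: build a self-contained ("shaded") jar during mvn package. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <createDependencyReducedPom>false</createDependencyReducedPom>
    <filters>
      <filter>
        <artifact>*:*</artifact>
        <excludes>
          <!-- Drop signature files so the merged jar is not rejected as tampered. -->
          <exclude>META-INF/*.SF</exclude>
          <exclude>META-INF/*.DSA</exclude>
          <exclude>META-INF/*.RSA</exclude>
        </excludes>
      </filter>
    </filters>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <!-- Attach the shaded jar alongside the regular one, with a classifier. -->
        <shadedArtifactAttached>true</shadedArtifactAttached>
        <shadedClassifierName>shaded</shadedClassifierName>
        <transformers>
          <!-- Merge META-INF/services entries from all bundled dependencies. -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```

With a configuration along these lines, `mvn package` should produce an extra jar carrying the `shaded` classifier under `target/`, which is the artifact one would then hand to `spark-submit`.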
Refer to this link for build results (access rights to CI server needed): Jenkins built the site at commit id d6938c5 with Jekyll and staged it here. Happy reviewing. Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again.
LGTM
LGTM. (Happy to merge @amitsela, if you don't have the website setup ready.)
@davorbonaci I'll happily accept your merge offer 😄 thanks!
Merged. Thanks @amitsela, this is great. (Separately, BEAM-900 would be an awesome improvement to the Quickstart, and probably very easy to do.)
…site.