[BEAM-507] Fill in the documentation/runners/spark portion of the web… #103

Closed
wants to merge 2 commits into from


Member

@amitsela amitsela commented Dec 8, 2016

…site.

@amitsela
Member Author

amitsela commented Dec 8, 2016

R: @jbonofre

@asfbot

asfbot commented Dec 8, 2016

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Website_Test/136/
--none--

@asfbot

asfbot commented Dec 8, 2016

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Website_Stage/179/

Jenkins built the site at commit id 5e5caf7 with Jekyll and staged it here. Happy reviewing.

Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again.

<artifactId>beam-runners-spark</artifactId>
<version>{{ site.release_latest }}</version>
</dependency>
```
Member

As the Spark runner doesn't provide a BoM (I created another Jira about that), I think end-users have to define the following additional dependencies:

                <dependency>
                    <groupId>org.apache.hadoop</groupId>
                    <artifactId>hadoop-common</artifactId>
                    <version>${hadoop.version}</version>
                    <exclusions>
                        <!-- exclude old Jetty version of servlet API -->
                        <exclusion>
                            <groupId>org.mortbay.jetty</groupId>
                            <artifactId>servlet-api</artifactId>
                        </exclusion>
                    </exclusions>
                </dependency>
                <dependency>
                    <groupId>org.apache.hadoop</groupId>
                    <artifactId>hadoop-client</artifactId>
                    <version>${hadoop.version}</version>
                </dependency>
                <dependency>
                    <groupId>org.apache.hadoop</groupId>
                    <artifactId>hadoop-mapreduce-client-core</artifactId>
                    <version>${hadoop.version}</version>
                </dependency>

                <dependency>
                    <groupId>com.fasterxml.jackson.core</groupId>
                    <artifactId>jackson-core</artifactId>
                    <version>${jackson.version}</version>
                </dependency>
                <dependency>
                    <groupId>com.fasterxml.jackson.core</groupId>
                    <artifactId>jackson-annotations</artifactId>
                    <version>${jackson.version}</version>
                </dependency>
                <dependency>
                    <groupId>com.fasterxml.jackson.core</groupId>
                    <artifactId>jackson-databind</artifactId>
                    <version>${jackson.version}</version>
                </dependency>
                <dependency>
                    <groupId>com.fasterxml.jackson.module</groupId>
                    <artifactId>jackson-module-scala_2.10</artifactId>
                    <version>${jackson.version}</version>
                </dependency>
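
Note: the dependencies above reference `${hadoop.version}` and `${jackson.version}`, which this thread never defines. A minimal sketch of a matching `<properties>` block for the same pom.xml follows. Both version numbers are illustrative assumptions, not values taken from the PR, and should be aligned with what your cluster and Spark distribution actually use.

```xml
<properties>
  <!-- Assumed example value: use the Hadoop version deployed on your cluster -->
  <hadoop.version>2.7.3</hadoop.version>
  <!-- Assumed example value: use a Jackson version compatible with your Spark build -->
  <jackson.version>2.7.2</jackson.version>
</properties>
```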

Member Author

Great! I'll use this instead.
Does this one work for you?

Member

It's what I'm using in the spark-runner Maven profile in beam-samples. I haven't tested it very recently (I will), but it worked fine.

Member Author

I ran a very simple, self-contained Create -> Distinct -> TextIO.Write pipeline on a Spark Standalone cluster (master + 1 executor, on my laptop), and it didn't require any additional dependencies (except for Spark, since the application is self-contained and Spark is not pre-deployed the way it sometimes is on YARN installations). I'm adding an example based on that. YARN examples had better wait for HDFS support.

Member

OK, fair enough. I will do a test but it sounds good.


### Deploying Spark with your application

In some cases, such as running in local mode, your (self-contained) application would be required to pack Spark by explicitly adding the following dependencies in your pom.xml:
Member

I would add a "clear" sentence like: Spark runner standalone/embedded mode.

Member Author

I used local mode because standalone is a bit confusing as it is the name of Spark's own resource manager.

Member

Yes, agreed, that's cleaner. My point was really just to let users understand that Spark is "part" of the execution (it's not an external cluster).
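
Note: the dependencies that the reviewed sentence refers to are not visible in this thread. As a rough sketch of what packing Spark into a self-contained application could look like in pom.xml, assuming Spark artifacts built for Scala 2.10 (to match the `jackson-module-scala_2.10` artifact mentioned earlier) and a `spark.version` property that you define yourself:

```xml
<!-- Sketch only: the _2.10 artifact suffix and ${spark.version} are assumptions -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>${spark.version}</version>
</dependency>
<!-- Included on the assumption that the runner needs Spark Streaming at runtime -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.10</artifactId>
  <version>${spark.version}</version>
</dependency>
```

Because these use the default `compile` scope, Spark ends up on the application's own classpath, which is the "Spark is part of the execution" situation discussed above.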


Deploying your Beam pipeline on a cluster that already has a Spark deployment does not require any additional dependencies.
For more details on the different deployment modes see: [Standalone](http://spark.apache.org/docs/latest/spark-standalone.html), [YARN](http://spark.apache.org/docs/latest/running-on-yarn.html), or [Mesos](http://spark.apache.org/docs/latest/running-on-mesos.html).

Member

Maybe an example using spark-submit would help (even if it might look stupid).

Member Author

Agree.

@amitsela
Member Author

amitsela commented Dec 8, 2016

I'll add the example pom dependencies and a submit-to-YARN example.

@amitsela
Member Author

amitsela commented Dec 8, 2016

@jbonofre added an example for packaging and submitting in a Standalone cluster.
~40% of Spark users use it so it should be useful.
As mentioned in my comments, I'd prefer to hold off on YARN examples until they can include HDFS.
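
Note: the example itself is not reproduced in this thread. Below is a hedged sketch of the packaging half, assuming the application is bundled into a shaded (fat) jar with maven-shade-plugin and that any Spark dependencies are marked `<scope>provided</scope>` because the cluster already supplies Spark. The classifier name `shaded` is an arbitrary choice for illustration.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <createDependencyReducedPom>false</createDependencyReducedPom>
    <filters>
      <filter>
        <artifact>*:*</artifact>
        <excludes>
          <!-- Drop signature files so the merged jar is not rejected as tampered -->
          <exclude>META-INF/*.SF</exclude>
          <exclude>META-INF/*.DSA</exclude>
          <exclude>META-INF/*.RSA</exclude>
        </excludes>
      </filter>
    </filters>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <!-- Attach the fat jar alongside the regular one, with a "-shaded" suffix -->
        <shadedArtifactAttached>true</shadedArtifactAttached>
        <shadedClassifierName>shaded</shadedClassifierName>
        <transformers>
          <!-- Merge META-INF/services files so ServiceLoader registrations survive -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```

With a build like this, `mvn package` should produce an additional `-shaded` jar, which could then be submitted with something like `spark-submit --class <your.main.Class> --master spark://HOST:PORT target/<your-app>-shaded.jar --runner=SparkRunner`. The class name, host, port, and jar name are placeholders.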

@asfbot

asfbot commented Dec 8, 2016

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Website_Stage/180/

Jenkins built the site at commit id d6938c5 with Jekyll and staged it here. Happy reviewing.

Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again.

@asfbot

asfbot commented Dec 8, 2016

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Website_Test/137/
--none--

Member

@jbonofre jbonofre left a comment

LGTM

@davorbonaci
Member

LGTM. (Happy to merge @amitsela, if you don't have the website setup ready.)

@amitsela
Member Author

amitsela commented Dec 9, 2016

@davorbonaci I'll happily accept your merge offer 😄 thanks!

@asfgit asfgit closed this in eb5397b Dec 9, 2016
@davorbonaci
Member

Merged. Thanks @amitsela, this is great.

(Separately, BEAM-900 would be an awesome improvement to the Quickstart, and probably very easy to do.)

@amitsela amitsela deleted the BEAM-507 branch December 9, 2016 20:53
robertwb pushed a commit to robertwb/incubator-beam that referenced this pull request Jun 5, 2018
melap pushed a commit to apache/beam that referenced this pull request Jun 20, 2018