Docs small fixes #8629

Closed · wants to merge 1 commit
23 changes: 11 additions & 12 deletions docs/building-spark.md
@@ -61,12 +61,13 @@ If you don't run this, you may see errors like the following:
You can fix this by setting the `MAVEN_OPTS` variable as discussed before.

**Note:**
* *For Java 8 and above this step is not required.*
* *If using `build/mvn` and `MAVEN_OPTS` were not already set, the script will automate this for you.*

* For Java 8 and above this step is not required.
* If using `build/mvn` with no `MAVEN_OPTS` set, the script will automate this for you.
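For reference, setting `MAVEN_OPTS` typically looks like the following (a sketch; the memory values are illustrative and depend on your JVM and machine):

```
# Give Maven enough heap and code-cache space for the Spark build (values are illustrative)
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
```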

# Specifying the Hadoop Version

Because HDFS is not protocol-compatible across versions, if you want to read from HDFS, you'll need to build Spark against the specific HDFS version in your environment. You can do this through the "hadoop.version" property. If unset, Spark will build against Hadoop 2.2.0 by default. Note that certain build profiles are required for particular Hadoop versions:
Because HDFS is not protocol-compatible across versions, if you want to read from HDFS, you'll need to build Spark against the specific HDFS version in your environment. You can do this through the `hadoop.version` property. If unset, Spark will build against Hadoop 2.2.0 by default. Note that certain build profiles are required for particular Hadoop versions:

<table class="table">
<thead>
@@ -91,7 +92,7 @@ mvn -Dhadoop.version=1.2.1 -Phadoop-1 -DskipTests clean package
mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Phadoop-1 -DskipTests clean package
{% endhighlight %}

You can enable the "yarn" profile and optionally set the "yarn.version" property if it is different from "hadoop.version". Spark only supports YARN versions 2.2.0 and later.
You can enable the `yarn` profile and optionally set the `yarn.version` property if it is different from `hadoop.version`. Spark only supports YARN versions 2.2.0 and later.

Examples:
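One hedged sketch, pinning a YARN version that differs from `hadoop.version` (the versions shown are illustrative):

```
# Enable the yarn profile and set yarn.version separately from hadoop.version
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Dyarn.version=2.5.0 -DskipTests clean package
```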

@@ -125,7 +126,7 @@ mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -Dskip
# Building for Scala 2.11
To produce a Spark package compiled with Scala 2.11, use the `-Dscala-2.11` property:

dev/change-scala-version.sh 2.11
./dev/change-scala-version.sh 2.11
mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package

Spark does not yet support its JDBC component for Scala 2.11.
@@ -163,11 +164,9 @@ the `spark-parent` module).

Thus, the full flow for running continuous-compilation of the `core` submodule may look more like:

```
$ mvn install
$ cd core
$ mvn scala:cc
```
    $ mvn install
    $ cd core
    $ mvn scala:cc

Member: What's the change here -- I assume it's still code-formatted, but it's indented now?

Contributor Author: It's merged already, but for the sake of completeness I'm answering now - yes, it's properly code-formatted.

# Building Spark with IntelliJ IDEA or Eclipse

@@ -193,11 +192,11 @@ then ship it over to the cluster. We are investigating the exact cause for this.

# Packaging without Hadoop Dependencies for YARN

The assembly jar produced by `mvn package` will, by default, include all of Spark's dependencies, including Hadoop and some of its ecosystem projects. On YARN deployments, this causes multiple versions of these to appear on executor classpaths: the version packaged in the Spark assembly and the version on each node, included with yarn.application.classpath. The `hadoop-provided` profile builds the assembly without including Hadoop-ecosystem projects, like ZooKeeper and Hadoop itself.
The assembly jar produced by `mvn package` will, by default, include all of Spark's dependencies, including Hadoop and some of its ecosystem projects. On YARN deployments, this causes multiple versions of these to appear on executor classpaths: the version packaged in the Spark assembly and the version on each node, included with `yarn.application.classpath`. The `hadoop-provided` profile builds the assembly without including Hadoop-ecosystem projects, like ZooKeeper and Hadoop itself.
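A hedged sketch of building with this profile (the Hadoop profile and version shown are illustrative):

```
# Exclude Hadoop and its ecosystem jars from the assembly; the cluster provides them at runtime
mvn -Pyarn -Phadoop-provided -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
```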

# Building with SBT

Maven is the official recommendation for packaging Spark, and is the "build of reference".
Maven is the official build tool recommended for packaging Spark, and is the *build of reference*.
But SBT is supported for day-to-day development since it can provide much faster iterative
compilation. More advanced developers may wish to use SBT.
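As an illustration, an SBT assembly build can reuse the same Maven profiles (a sketch, assuming the `build/sbt` launcher script shipped in the Spark source tree):

```
# Build the assembly with SBT, passing the same profile flags as the Maven build
build/sbt -Pyarn -Phadoop-2.4 assembly
```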

15 changes: 8 additions & 7 deletions docs/cluster-overview.md
@@ -5,18 +5,19 @@ title: Cluster Mode Overview

This document gives a short overview of how Spark runs on clusters, to make it easier to understand
the components involved. Read through the [application submission guide](submitting-applications.html)
to submit applications to a cluster.
to learn about launching applications on a cluster.
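As a quick illustration, applications are typically launched with the `spark-submit` script; the class name, master URL, and jar path below are placeholders:

```
# Submit an application jar to a standalone cluster (all names here are placeholders)
./bin/spark-submit \
  --class org.example.MyApp \
  --master spark://master-host:7077 \
  path/to/my-app.jar
```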

# Components

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext
Spark applications run as independent sets of processes on a cluster, coordinated by the `SparkContext`
object in your main program (called the _driver program_).

Specifically, to run on a cluster, the SparkContext can connect to several types of _cluster managers_
(either Spark's own standalone cluster manager or Mesos/YARN), which allocate resources across
(either Spark's own standalone cluster manager, Mesos or YARN), which allocate resources across
applications. Once connected, Spark acquires *executors* on nodes in the cluster, which are
processes that run computations and store data for your application.
Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to
the executors. Finally, SparkContext sends *tasks* for the executors to run.
the executors. Finally, SparkContext sends *tasks* to the executors to run.
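As a small illustration of that flow, an interactive shell can be pointed at a cluster manager (the master URL is a placeholder); its driver-side SparkContext then acquires executors on the cluster's worker nodes:

```
# The shell's SparkContext connects to the standalone master and acquires executors
./bin/spark-shell --master spark://master-host:7077
```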

<p style="text-align: center;">
<img src="img/cluster-overview.png" title="Spark cluster components" alt="Spark cluster components" />
@@ -33,9 +34,9 @@ There are several useful things to note about this architecture:
2. Spark is agnostic to the underlying cluster manager. As long as it can acquire executor
processes, and these communicate with each other, it is relatively easy to run it even on a
cluster manager that also supports other applications (e.g. Mesos/YARN).
3. The driver program must listen for and accept incoming connections from its executors throughout
its lifetime (e.g., see [spark.driver.port and spark.fileserver.port in the network config
section](configuration.html#networking)). As such, the driver program must be network
3. The driver program must listen for and accept incoming connections from its executors throughout
its lifetime (e.g., see [spark.driver.port and spark.fileserver.port in the network config
section](configuration.html#networking)). As such, the driver program must be network
addressable from the worker nodes.

Member: Also what's the change here? The rest LGTM.

Contributor Author: The additional spaces at the end. Thanks a lot for reviewing the change and accepting it!
4. Because the driver schedules tasks on the cluster, it should be run close to the worker
nodes, preferably on the same local area network. If you'd like to send requests to the