Add Spark Standalone cluster with README #424
Conversation
The Spark Standalone master provides an RPC endpoint for **drivers** and **workers** to communicate with. It also creates tasks out of a job's execution graph and submits them to **workers**.

There is generally only one master in a Spark Standalone cluster, but when multiple masters are up at the same time, one must be elected the leader.
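For context (not part of this PR's diff), leader election among multiple standalone masters is typically backed by ZooKeeper via Spark's recovery-mode settings. A sketch of the relevant `spark-env.sh` lines, where the ZooKeeper addresses are illustrative assumptions:

```sh
# spark-env.sh (illustrative; the ZooKeeper URLs and dir are assumptions)
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"
```

With a single master, filesystem-based recovery (`spark.deploy.recoveryMode=FILESYSTEM`) is the simpler option for surviving a master restart.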
It would be helpful to add a note in this paragraph saying that the UI will be available on port 8888.
Addressed in b79ebf2.
Following the README instructions worked perfectly. I was able to submit a job, kill the master, restart it, and have it recover state appropriately. Other than the worker restart behavior, everything else looks good 👍
Force-pushed `1a5bf0c` to `0752b03`.
I added a section to the `README`. Unfortunately, neither has been included in an official release yet (both targeted at 2.0.1 and 2.1.0).
scalaVersion := "2.11.1"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.2"
Should these versions be bumped?
scalaVersion := "2.11.8"
sparkVersion := "2.0.0"
Addressed in 00db762.
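Presumably the bumped build definition ends up along these lines. This is a sketch based on the review suggestion above; the exact settings in 00db762 may differ, and `spark-core` is simply the artifact quoted earlier:

```scala
scalaVersion := "2.11.8"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
```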
## Spark Components

Oftentimes Spark is deployed on top of a cluster resource manager, such as Apache Mesos or Apache Hadoop YARN (Yet Another Resource Negotiator). In addition to those options, Spark comes bundled with **Spark Standalone**, which is a built-in way to run Spark in a clustered environment.
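The choice of cluster manager surfaces in the `--master` URL passed to `spark-submit`; a sketch, where the hostnames and ports are illustrative:

```sh
spark-submit --master spark://host:7077 ...   # Spark Standalone
spark-submit --master mesos://host:5050 ...   # Apache Mesos
spark-submit --master yarn ...                # Hadoop YARN
```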
Is the idea that we are going to deploy Docker containers and rely on the standalone cluster manager to manage them? Is the reason for this to reduce complexity by not having a cluster manager such as Mesos or YARN, or are there other benefits?
I don't think the usage of Spark Standalone here is a commitment to using it beyond the development environment. It was the simplest to set up locally, and it provides a good base for identifying common terminology.

As far as what we use beyond the development environment: I think about it several times a week and still don't think I have a solid plan worthy of an ADR. Spark Standalone is easier to reason about, but it lacks support for enabling effective multitenancy. I don't think we'll be able to get away from running a cluster manager, at least if we want to support multiple jobs on the same cluster.
- "spark://spark.services.rf.internal:7077"

spark-driver:
  image: quay.io/azavea/spark:1.6.2
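For context, the `spark-driver` service above presumably sits alongside master and worker services roughly like the following. This is a sketch, not the PR's actual compose file: the service names, `command` entries, and port mappings are assumptions (the classes are Spark's standard standalone daemons, and the UI mapping follows the review note above about port 8888):

```yaml
spark-master:
  image: quay.io/azavea/spark:1.6.2
  command: bin/spark-class org.apache.spark.deploy.master.Master
  ports:
    - "7077:7077"
    - "8888:8080"   # assumption: master UI (default 8080) exposed on 8888

spark-worker:
  image: quay.io/azavea/spark:1.6.2
  command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark.services.rf.internal:7077
```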
Would you be able to use the geodocker spark, or is that different enough that we need to use an Azavea-hosted one?

https://github.com/geodocker/geodocker-spark

Might be good to consolidate efforts w.r.t. these types of containers.
My initial approach for this setup was to lay out a base Spark development environment for the Raster Foundry components that require it. Right now, the Spark setup consists of the OpenJDK JRE and Spark.

I hadn't really taken a close look at GeoDocker until today (at least geodocker-spark and everything it inherits from), but there are a few aspects that seem to make it a bad fit for what we're trying to achieve right now:

- Based on CentOS and Oracle JDK
- Has multiple layers of base images pinned to `latest`
- Installs some components with Apache Bigtop (Hadoop, HDFS), some not (Spark)
- Seems set up to be used with GeoDocker as a whole vs. components by themselves

If we use GeoDocker to deploy GeoTrellis for Raster Foundry, then I think we'd definitely use it for development as well, but it isn't clear that that's the route we're going to take.
* limitations under the License.
*/

// scalastyle:off println
Are we ok with these editor comments wrapping any Scala code? It's distracting IMO...
I don't have a problem with removing them. They came along for the ride when I brought `SparkPi` over from the official Spark examples.

Addressed in cc96ffb.
Comments addressed. Functionally, Spark 2.0.0 is being used now.
@lossyrob do you think this is good to merge, or do you still have outstanding questions/issues?
Lgtm
Add the components required for a local Spark Standalone cluster housed within Docker containers. In addition, add scaffolding required for a test `SparkPi` job. Lastly, start assembling a `README` to document specifics about the Spark environment so that we have a concise reference as we move toward the EC2 based deployment.
Force-pushed `d53d051` to `ccff280`.
Testing

Please see the included `README`.

I think it would be great if the reviewer could get the Spark Standalone cluster running locally and submit the `SparkPi` job to it. It would also be great to kill the `spark-master` while `SparkPi` is running and then bring it back up to see if it still recalls the previous cluster state (most easily viewable via the Spark UI).

Lastly, please suggest corrections to the README where things seem inaccurate or unclear.
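The submission step described above would look roughly like the following. The fully qualified class name is an assumption carried over from the upstream Spark examples (this PR's copy of `SparkPi` may live in a different package), and the jar path is a placeholder; the master URL matches the compose value quoted earlier:

```sh
spark-submit \
  --master spark://spark.services.rf.internal:7077 \
  --class org.apache.spark.examples.SparkPi \
  path/to/assembly.jar 1000
```

The trailing `1000` is the optional slices argument that `SparkPi` accepts; more slices spread the Monte Carlo sampling across more tasks.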