Add Spark Standalone cluster with README #424

Merged: 1 commit merged into develop on Oct 19, 2016

Conversation

@hectcastro (Contributor)

Add the components required for a local Spark Standalone cluster housed within Docker containers. In addition, add scaffolding required for a test SparkPi job.

Lastly, start assembling a README to document specifics about the Spark environment so that we have a concise reference as we move toward the EC2 based deployment.
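
For orientation, the topology looks roughly like the Compose sketch below. This is a hypothetical illustration, not the PR's actual docker-compose.spark.yml: the image tag is taken from the diff further down, while the service names, spark-class path, and port choices are assumptions.

```yaml
# Hypothetical sketch of a Standalone master/worker pair (v1 Compose syntax).
spark-master:
  image: quay.io/azavea/spark:1.6.2
  command: /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
  ports:
    - "7077:7077"   # master RPC endpoint for drivers and workers

spark-worker:
  image: quay.io/azavea/spark:1.6.2
  command: /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
  links:
    - spark-master
```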


Testing

Please see the included README.

I think it would be great if the reviewer could get the Spark Standalone cluster running locally and submit the SparkPi job to it. It would also be great to kill the spark-master while SparkPi is running and then bring it back up to see if it still recalls the previous cluster state (most easily viewable via the Spark UI).

Lastly, please suggest corrections to the README where things seem inaccurate or unclear.

@tnation14 (Contributor)

It looks like sbt is building the SparkPi JAR in the wrong location. As a result, I can't submit the job to the spark master.


$ vagrant up
$ vagrant ssh
$ docker-compose -f docker-compose.spark.yml run --rm --entrypoint ./sbt spark-driver package
...
[info] Compiling 1 Scala source to /var/lib/spark/.sbt/0.13/staging/6620e604e029e71ff11f/worker-tasks/target/scala-2.10/classes...
[info] 'compiler-interface' not yet compiled for Scala 2.10.6. Compiling...
[info]   Compilation completed in 11.065 s
[info] Packaging /var/lib/spark/.sbt/0.13/staging/6620e604e029e71ff11f/worker-tasks/target/scala-2.10/rf-worker_2.10-0.1.0.jar ...
[info] Done packaging.
[success] Total time: 66 s, completed Aug 26, 2016 4:36:56 PM
$ ls worker-tasks/target
ls: cannot access worker-tasks/target: No such file or directory

@hectcastro (Contributor, Author)

Changes to README in fc5a06b should help keep sbt from using a one-off target directory.
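
If that's right, re-running the packaging step should now leave the JAR in the project's own target directory. A quick check along these lines (the Scala version suffix and JAR name are taken from the log above and may have changed since):

```sh
# Re-package, then confirm the artifact lands under worker-tasks/target
# rather than under ~/.sbt/0.13/staging.
docker-compose -f docker-compose.spark.yml run --rm --entrypoint ./sbt spark-driver package
ls worker-tasks/target/scala-2.10/rf-worker_2.10-0.1.0.jar
```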

The Spark Standalone master provides an RPC endpoint for **drivers** and **workers** to communicate with. It also creates tasks out of a job's execution graph and submits them to **workers**.

There is generally only one master in a Spark Standalone cluster, but when multiple masters are up at the same time, one must be elected the leader.

Contributor

It would be helpful to add a note to this paragraph saying that the UI will be available on port 8888.

@hectcastro (Author)

Addressed in b79ebf2.
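
For reference, the note presumably corresponds to a port mapping along these lines; the 8080 container port is the Standalone master web UI default, and the exact mapping in the Compose file is an assumption:

```yaml
# Assumed mapping: Spark master web UI (8080 in the container) on host port 8888.
spark-master:
  ports:
    - "8888:8080"
```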

@tnation14 (Contributor) commented on Sep 2, 2016

Following the README instructions worked perfectly. I was able to submit a job, kill the master, restart it, and have it recover state appropriately. The state shows as RECOVERED in the logs, and the workers re-register, which is visible through the UI. One thing I did notice, though, is that if a worker dies, it re-registers when it comes back up, but the job it was working on doesn't get restarted; I had to resubmit the job from the driver. Should the master be restarting jobs, or will that recovery behavior happen elsewhere?

Other than the worker restart behavior, everything else looks good 👍
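
For context, Standalone master state recovery is controlled by spark.deploy.recoveryMode; the sketch below shows the FILESYSTEM variant as one possible setup. Whether this PR enables recovery this way, or via another mechanism, isn't shown here.

```yaml
# One way to enable Standalone master recovery (assumed, not this PR's exact
# config). With FILESYSTEM mode, worker and application state is persisted to
# the recovery directory, so a restarted master can reload it (the RECOVERED
# state noted above). The directory must survive container restarts.
spark-master:
  environment:
    - SPARK_DAEMON_JAVA_OPTS=-Dspark.deploy.recoveryMode=FILESYSTEM -Dspark.deploy.recoveryDirectory=/spark-recovery
```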

@hectcastro (Contributor, Author)

I added a Safety Testing section to the README that summarizes the behavior @tnation14 and I observed while testing failure scenarios. I suspect the communication failure between the driver and the worker/executor that we're seeing may be related to the following Spark issues:

Unfortunately, neither has been included in an official release yet (both targeted at 2.0.1 and 2.1.0).


scalaVersion := "2.11.1"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.2"
Contributor

Should these versions be bumped?

scalaVersion := "2.11.8"
sparkVersion := "2.0.0"

@hectcastro (Author)

Addressed in 00db762.
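
For completeness, the bumped build definition presumably ends up along these lines (a sketch; 00db762 is the authoritative change, and Spark 2.0.0 artifacts are published for Scala 2.11):

```scala
// Sketch of the updated build.sbt (assumed; see 00db762 for the real change).
scalaVersion := "2.11.8"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
```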


## Spark Components

Often, Spark is deployed on top of a cluster resource manager, such as Apache Mesos or Apache Hadoop YARN (Yet Another Resource Negotiator). In addition to those options, Spark comes bundled with **Spark Standalone**, a built-in way to run Spark in a clustered environment.
Contributor

Is the idea that we are going to deploy Docker containers and rely on the Standalone cluster manager to manage them? Is the reason for this to reduce complexity by not having a cluster manager such as Mesos or YARN, or are there other benefits?

@hectcastro (Author)

I don't think the usage of Spark Standalone here is a commitment to using it beyond the development environment. It was the simplest to set up locally, and it provides a good base for identifying common terminology.

As for what we use beyond the development environment, I think about it several times a week and still don't think I have a solid plan worthy of an ADR. Spark Standalone is easier to reason about, but it lacks support for effective multitenancy. I don't think we'll be able to get away from running a cluster manager, at least if we want to support multiple jobs on the same cluster.
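
For reference, the cluster manager choice mostly surfaces in the --master URL handed to spark-submit; an illustrative comparison (the JAR path and class name are placeholders):

```sh
# Spark Standalone (the setup in this PR)
spark-submit --master spark://spark.services.rf.internal:7077 --class SparkPi app.jar

# Hadoop YARN
spark-submit --master yarn --deploy-mode cluster --class SparkPi app.jar

# Apache Mesos
spark-submit --master mesos://zk://zk1:2181/mesos --class SparkPi app.jar
```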

- "spark://spark.services.rf.internal:7077"

spark-driver:
image: quay.io/azavea/spark:1.6.2
Contributor

Would you be able to use the GeoDocker Spark image, or is it different enough that we need to use an Azavea-hosted one?

https://github.com/geodocker/geodocker-spark

It might be good to consolidate efforts w.r.t. these types of containers.

@hectcastro (Author)

My initial approach for this setup was to lay out a base Spark development environment for the Raster Foundry components that require it. Right now, the Spark setup consists of the OpenJDK JRE and Spark.

I hadn't really taken a close look at GeoDocker until today (at least geodocker-spark and everything it inherits from), but there are a few aspects that seem to make it a bad fit for what we're trying to achieve right now:

  • Based on CentOS and Oracle JDK
  • Has multiple layers of base images pinned to latest
  • Installs some components with Apache Bigtop (Hadoop, HDFS), some not (Spark)
  • Seems set up to be used with GeoDocker as a whole rather than with its components by themselves

If we use GeoDocker to deploy GeoTrellis for Raster Foundry, then I think we'd definitely use it for development as well, but it isn't clear that that's the route we're going to take.
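
For a sense of scale, the kind of image described above can be as small as the Dockerfile sketch below; this is hypothetical and not the actual quay.io/azavea/spark build:

```dockerfile
# Hypothetical minimal Spark image: OpenJDK JRE base plus a Spark binary
# distribution (version and Hadoop profile are assumptions).
FROM openjdk:8-jre

ADD https://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz /tmp/spark.tgz
RUN tar -xzf /tmp/spark.tgz -C /opt \
 && ln -s /opt/spark-2.0.0-bin-hadoop2.7 /opt/spark \
 && rm /tmp/spark.tgz

ENV PATH $PATH:/opt/spark/bin:/opt/spark/sbin
```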

* limitations under the License.
*/

// scalastyle:off println
Contributor

Are we OK with these editor comments wrapping the Scala code? It's distracting, IMO...

@hectcastro (Author)

I don't have a problem with removing them. They came along for the ride when I brought SparkPi over from the official Spark examples.

Addressed in cc96ffb.

@hectcastro (Contributor, Author)

Comments addressed. Functionally, Spark 2.0.0 is being used now and sbt is packaged inside the container image.
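
With that in place, an end-to-end run presumably looks something like the following; the JAR path, the Scala 2.11 artifact name, and the main class are assumptions, while the master URL comes from the Compose file in the diff:

```sh
# Package the job with the container's bundled sbt, then submit it to the
# Standalone master (paths and class name are assumed).
docker-compose -f docker-compose.spark.yml run --rm --entrypoint ./sbt spark-driver package
docker-compose -f docker-compose.spark.yml run --rm spark-driver \
  spark-submit --master spark://spark.services.rf.internal:7077 \
               --class SparkPi \
               worker-tasks/target/scala-2.11/rf-worker_2.11-0.1.0.jar
```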

@notthatbreezy (Contributor)

@lossyrob do you think this is good to merge or do you still have outstanding questions/issues?

@lossyrob (Contributor)

Lgtm

The merged commit's message:

Add the components required for a local Spark Standalone cluster housed
within Docker containers. In addition, add scaffolding required for a
test `SparkPi` job.

Lastly, start assembling a `README` to document specifics about the
Spark environment so that we have a concise reference as we move toward
the EC2 based deployment.
@hectcastro merged commit fde9cd6 into develop on Oct 19, 2016
@hectcastro deleted the feature/hmc/sparky branch on October 19, 2016 at 14:36