Add Spark Standalone cluster with README #424

Merged: 1 commit merged into develop on Oct 19, 2016

Conversation

@hectcastro (Contributor)

Add the components required for a local Spark Standalone cluster housed within Docker containers. In addition, add scaffolding required for a test SparkPi job.

Lastly, start assembling a README to document specifics about the Spark environment so that we have a concise reference as we move toward the EC2 based deployment.
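
For orientation, the topology looks roughly like the Compose sketch below. This is a hypothetical illustration, not the PR's actual docker-compose.spark.yml: the image tag is taken from the diff further down, while the service names, spark-class path, and port choices are assumptions.

```yaml
# Hypothetical sketch of a Standalone master/worker pair (v1 Compose syntax).
spark-master:
  image: quay.io/azavea/spark:1.6.2
  command: /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
  ports:
    - "7077:7077"   # master RPC endpoint for drivers and workers

spark-worker:
  image: quay.io/azavea/spark:1.6.2
  command: /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
  links:
    - spark-master
```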


Testing

Please see the included README.

I think it would be great if the reviewer could get the Spark Standalone cluster running locally and submit the SparkPi job to it. It would also be great to kill the spark-master while SparkPi is running and then bring it back up to see if it still recalls the previous cluster state (most easily viewable via the Spark UI).

Lastly, please suggest corrections to the README where things seem inaccurate or unclear.

@tnation14 (Contributor)

It looks like sbt is building the SparkPi JAR in the wrong location. As a result, I can't submit the job to the spark master.


$ vagrant up
$ vagrant ssh
$ docker-compose -f docker-compose.spark.yml run --rm --entrypoint ./sbt spark-driver package
...
[info] Compiling 1 Scala source to /var/lib/spark/.sbt/0.13/staging/6620e604e029e71ff11f/worker-tasks/target/scala-2.10/classes...
[info] 'compiler-interface' not yet compiled for Scala 2.10.6. Compiling...
[info]   Compilation completed in 11.065 s
[info] Packaging /var/lib/spark/.sbt/0.13/staging/6620e604e029e71ff11f/worker-tasks/target/scala-2.10/rf-worker_2.10-0.1.0.jar ...
[info] Done packaging.
[success] Total time: 66 s, completed Aug 26, 2016 4:36:56 PM
$ ls worker-tasks/target
ls: cannot access worker-tasks/target: No such file or directory

@hectcastro (Contributor, Author)

Changes to README in fc5a06b should help keep sbt from using a one-off target directory.
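
If that's right, re-running the packaging step should now leave the JAR in the project's own target directory. A quick check along these lines (the Scala version suffix and JAR name are taken from the log above and may have changed since):

```sh
# Re-package, then confirm the artifact lands under worker-tasks/target
# rather than under ~/.sbt/0.13/staging.
docker-compose -f docker-compose.spark.yml run --rm --entrypoint ./sbt spark-driver package
ls worker-tasks/target/scala-2.10/rf-worker_2.10-0.1.0.jar
```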

The Spark Standalone master provides an RPC endpoint for **drivers** and **workers** to communicate with. It also creates tasks out of a job's execution graph and submits them to **workers**.

There is generally only one master in a Spark Standalone cluster, but when multiple masters are up at the same time, one must be elected the leader.

Contributor

It would be helpful to add a note to this paragraph saying that the UI will be available on port 8888.

@hectcastro (Author)

Addressed in b79ebf2.
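
For reference, the note presumably corresponds to a port mapping along these lines; the 8080 container port is the Standalone master web UI default, and the exact mapping in the Compose file is an assumption:

```yaml
# Assumed mapping: Spark master web UI (8080 in the container) on host port 8888.
spark-master:
  ports:
    - "8888:8080"
```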

@tnation14 (Contributor) commented on Sep 2, 2016

Following the README instructions worked perfectly. I was able to submit a job, kill the master, restart it, and have it recover state appropriately. The state shows as RECOVERED in the logs, and the workers re-register, which is visible through the UI. One thing I did notice, though, is that if a worker dies, it re-registers when it comes back up, but the job it was working on doesn't get restarted; I had to resubmit the job from the driver. Should the master be restarting jobs, or will that recovery behavior happen elsewhere?

Other than the worker restart behavior, everything else looks good 👍
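
For context, Standalone master state recovery is controlled by spark.deploy.recoveryMode; the sketch below shows the FILESYSTEM variant as one possible setup. Whether this PR enables recovery this way, or via another mechanism, isn't shown here.

```yaml
# One way to enable Standalone master recovery (assumed, not this PR's exact
# config). With FILESYSTEM mode, worker and application state is persisted to
# the recovery directory, so a restarted master can reload it (the RECOVERED
# state noted above). The directory must survive container restarts.
spark-master:
  environment:
    - SPARK_DAEMON_JAVA_OPTS=-Dspark.deploy.recoveryMode=FILESYSTEM -Dspark.deploy.recoveryDirectory=/spark-recovery
```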

@hectcastro (Contributor, Author)

I added a Safety Testing section to the README that summarizes the behavior @tnation14 and I observed while testing failure scenarios. I suspect the communication failure between the driver and the worker/executor that we're seeing may be related to the following Spark issues:

Unfortunately, neither has been included in an official release yet (both targeted at 2.0.1 and 2.1.0).


scalaVersion := "2.11.1"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.2"
Contributor

Should these versions be bumped?

scalaVersion := "2.11.8"
sparkVersion := "2.0.0"

@hectcastro (Author)

Addressed in 00db762.
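
For completeness, the bumped build definition presumably ends up along these lines (a sketch; 00db762 is the authoritative change, and Spark 2.0.0 artifacts are published for Scala 2.11):

```scala
// Sketch of the updated build.sbt (assumed; see 00db762 for the real change).
scalaVersion := "2.11.8"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
```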


## Spark Components

Often, Spark is deployed on top of a cluster resource manager, such as Apache Mesos or Apache Hadoop YARN (Yet Another Resource Negotiator). In addition to those options, Spark comes bundled with **Spark Standalone**, a built-in way to run Spark in a clustered environment.
Contributor

Is the idea that we are going to deploy Docker containers and rely on the Standalone cluster manager to manage them? Is the reason for this to reduce complexity by not having a cluster manager such as Mesos or YARN, or are there other benefits?

@hectcastro (Author)

I don't think the usage of Spark Standalone here is a commitment to using it beyond the development environment. It was the simplest to set up locally, and it provides a good base for identifying common terminology.

As for what we use beyond the development environment, I think about it several times a week and still don't think I have a solid plan worthy of an ADR. Spark Standalone is easier to reason about, but it lacks support for effective multitenancy. I don't think we'll be able to get away from running a cluster manager, at least if we want to support multiple jobs on the same cluster.
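
For reference, the cluster manager choice mostly surfaces in the --master URL handed to spark-submit; an illustrative comparison (the JAR path and class name are placeholders):

```sh
# Spark Standalone (the setup in this PR)
spark-submit --master spark://spark.services.rf.internal:7077 --class SparkPi app.jar

# Hadoop YARN
spark-submit --master yarn --deploy-mode cluster --class SparkPi app.jar

# Apache Mesos
spark-submit --master mesos://zk://zk1:2181/mesos --class SparkPi app.jar
```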

- "spark://spark.services.rf.internal:7077"

spark-driver:
image: quay.io/azavea/spark:1.6.2
Contributor

Would you be able to use the GeoDocker Spark image, or is it different enough that we need to use an Azavea-hosted one?

https://github.com/geodocker/geodocker-spark

It might be good to consolidate efforts w.r.t. these types of containers.

@hectcastro (Author)

My initial approach for this setup was to lay out a base Spark development environment for the Raster Foundry components that require it. Right now, the Spark setup consists of the OpenJDK JRE and Spark.

I hadn't really taken a close look at GeoDocker until today (at least geodocker-spark and everything it inherits from), but there are a few aspects that seem to make it a bad fit for what we're trying to achieve right now:

  • Based on CentOS and Oracle JDK
  • Has multiple layers of base images pinned to latest
  • Installs some components with Apache Bigtop (Hadoop, HDFS), some not (Spark)
  • Seems set up to be used with GeoDocker as a whole rather than with its components by themselves

If we use GeoDocker to deploy GeoTrellis for Raster Foundry, then I think we'd definitely use it for development as well, but it isn't clear that that's the route we're going to take.
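
For a sense of scale, the kind of image described above can be as small as the Dockerfile sketch below; this is hypothetical and not the actual quay.io/azavea/spark build:

```dockerfile
# Hypothetical minimal Spark image: OpenJDK JRE base plus a Spark binary
# distribution (version and Hadoop profile are assumptions).
FROM openjdk:8-jre

ADD https://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz /tmp/spark.tgz
RUN tar -xzf /tmp/spark.tgz -C /opt \
 && ln -s /opt/spark-2.0.0-bin-hadoop2.7 /opt/spark \
 && rm /tmp/spark.tgz

ENV PATH $PATH:/opt/spark/bin:/opt/spark/sbin
```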

* limitations under the License.
*/

// scalastyle:off println
Contributor

Are we OK with these editor comments wrapping the Scala code? It's distracting, IMO...

@hectcastro (Author)

I don't have a problem with removing them. They came along for the ride when I brought SparkPi over from the official Spark examples.

Addressed in cc96ffb.

@hectcastro (Contributor, Author)

Comments addressed. Functionally, Spark 2.0.0 is being used now and sbt is packaged inside the container image.
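
With that in place, an end-to-end run presumably looks something like the following; the JAR path, the Scala 2.11 artifact name, and the main class are assumptions, while the master URL comes from the Compose file in the diff:

```sh
# Package the job with the container's bundled sbt, then submit it to the
# Standalone master (paths and class name are assumed).
docker-compose -f docker-compose.spark.yml run --rm --entrypoint ./sbt spark-driver package
docker-compose -f docker-compose.spark.yml run --rm spark-driver \
  spark-submit --master spark://spark.services.rf.internal:7077 \
               --class SparkPi \
               worker-tasks/target/scala-2.11/rf-worker_2.11-0.1.0.jar
```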

@notthatbreezy (Contributor)

@lossyrob do you think this is good to merge or do you still have outstanding questions/issues?

@lossyrob (Contributor)

Lgtm

The merged commit's message:

Add the components required for a local Spark Standalone cluster housed
within Docker containers. In addition, add scaffolding required for a
test `SparkPi` job.

Lastly, start assembling a `README` to document specifics about the
Spark environment so that we have a concise reference as we move toward
the EC2 based deployment.
@hectcastro merged commit fde9cd6 into develop on Oct 19, 2016
@hectcastro deleted the feature/hmc/sparky branch on October 19, 2016 at 14:36