GitHub - fdavidcl/vagrant-spark: A Vagrant cluster with Spark and spark-notebook

A multi-machine Vagrant cluster to work with Spark, including:

Spark 1.5
spark-notebook

Includes code from fredcons/vagrant-spark, wangqiang8511/vagrant-salt-spark and H4ml3t/spark-scala-maven-boilerplate-project.

Usage

Clone this repository:

git clone git@github.com:fdavidcl/vagrant-spark.git
cd vagrant-spark

Configure the Vagrantfile to your liking with the number of compute nodes you want. If you want a single node, set SPARK_NODES to 0 and use the master node.

Installing Ansible

pip install --user ansible
export PATH=$HOME/.local/bin:$PATH

Provisioning with Vagrant

First of all, generate a pair of private/public RSA keys for the master node with ssh-keygen -t rsa and place them inside the provisioning/master_keys directory.

The source code can be used to provision the virtual machines with Vagrant and Ansible, with a simple

vagrant up

which shall take a bit of time.

Logging into the master node

Once the VM is launched, you can log in with:

vagrant ssh master
# gaining root access (required to launch the cluster)
sudo su -

Once connected as root, you can use the various libraries installed or submit a program to the cluster.

Example

In this example we will provide a text file to Spark in order to count the ocurrences of each word. First, start up the cluster with

$SPARK_HOME/sbin/start-all.sh

You can check the active nodes on the web GUI at localhost:8080.

Then, compile the project with Maven and submit it to the cluster:

cd /vagrant/examples/count_words
mvn clean package # This can take a while
spark-submit --class com.examples.MainExample \
  --master spark://spark-cluster-master:7077 \
  /vagrant/examples/count_words/target/spark-scala-maven-project-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
  /vagrant/examples/quijote.txt \
  /vagrant/examples/quijote_out

When the task is completed, you will be able to see the results inside the examples/quijote_out directory. Finally, stop the cluster with:

$SPARK_HOME/sbin/stop-all.sh

Spark

To access the Spark shell, just use the command

spark-shell

spark-notebook

cd /opt/spark-notebook/ && ./bin/spark-notebook

And then head to http://localhost:9000/.

Other resources

For a real Data Analytics-oriented cluster, maybe check out vagrant-cluster

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
examples		examples
provisioning		provisioning
.gitignore		.gitignore
README.md		README.md
Vagrantfile		Vagrantfile

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Usage

Installing Ansible

Provisioning with Vagrant

Logging into the master node

Example

Spark

spark-notebook

Other resources

About

Releases

Packages

fdavidcl/vagrant-spark

Folders and files

Latest commit

History

Repository files navigation

Usage

Installing Ansible

Provisioning with Vagrant

Logging into the master node

Example

Spark

spark-notebook

Other resources

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages