A multi-machine Vagrant cluster to work with Spark, including:
- Spark 1.5
- spark-notebook
Includes code from fredcons/vagrant-spark, wangqiang8511/vagrant-salt-spark and H4ml3t/spark-scala-maven-boilerplate-project.
Clone this repository:
git clone git@github.com:fdavidcl/vagrant-spark.git
cd vagrant-spark
Configure the Vagrantfile to your liking with the number of compute nodes you want. If you want a single node, set SPARK_NODES to 0 and use the master node.
pip install --user ansible
export PATH=$HOME/.local/bin:$PATH
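If the export line ends up in a shell startup file, it can help to make it idempotent so that PATH does not grow with every new shell. A minimal sketch, assuming pip placed the Ansible executables in $HOME/.local/bin (the usual --user location on Linux):

```shell
# Prepend ~/.local/bin to PATH only when it is not already there.
local_bin="$HOME/.local/bin"
case ":$PATH:" in
    *":$local_bin:"*) ;;   # already present, nothing to do
    *) PATH="$local_bin:$PATH"; export PATH ;;
esac

# Report whether the ansible executable can now be found.
if command -v ansible >/dev/null 2>&1; then
    echo "ansible found at $(command -v ansible)"
else
    echo "ansible not found on PATH; check the pip install step" >&2
fi
```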
First of all, generate a pair of private/public RSA keys for the master node with ssh-keygen -t rsa and place them inside the provisioning/master_keys directory.
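The key generation step could look like the sketch below. The file name id_rsa, the empty passphrase, the 4096-bit size and the generate_master_keys helper are assumptions made for illustration; check the provisioning playbooks for the file names they actually expect.

```shell
# Hypothetical helper: create an RSA key pair inside the given directory.
generate_master_keys() {
    key_dir="$1"
    mkdir -p "$key_dir" || return 1
    # Do not overwrite an existing key pair.
    if [ -f "$key_dir/id_rsa" ]; then
        echo "key already exists in $key_dir" >&2
        return 0
    fi
    # -N "" sets an empty passphrase; -f writes the pair straight to the
    # target path as id_rsa and id_rsa.pub (file names are an assumption).
    ssh-keygen -t rsa -b 4096 -N "" -f "$key_dir/id_rsa"
}

# Run from the repository root.
generate_master_keys provisioning/master_keys
```

Writing the keys directly with -f avoids generating them in ~/.ssh and copying them over afterwards.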
The source code can be used to provision the virtual machines with Vagrant and Ansible by running
vagrant up
This can take a while.
Once the VM is launched, you can log in with:
vagrant ssh master
# gaining root access (required to launch the cluster)
sudo su -
Once connected as root, you can use the various libraries installed or submit a program to the cluster.
In this example we will provide a text file to Spark in order to count the occurrences of each word. First, start up the cluster with:
$SPARK_HOME/sbin/start-all.sh
You can check the active nodes on the web GUI at localhost:8080.
Then, compile the project with Maven and submit it to the cluster:
cd /vagrant/examples/count_words
mvn clean package # This can take a while
spark-submit --class com.examples.MainExample \
--master spark://spark-cluster-master:7077 \
/vagrant/examples/count_words/target/spark-scala-maven-project-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
/vagrant/examples/quijote.txt \
/vagrant/examples/quijote_out
When the task is completed, you will be able to see the results inside the examples/quijote_out directory. Finally, stop the cluster with:
$SPARK_HOME/sbin/stop-all.sh
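To sanity-check what the job computes, the same kind of word count can be approximated locally with standard Unix tools. This is only a rough cross-check, not the Spark program: the count_words helper and the sample text are made up for illustration, and it splits on whitespace only, so its counts may differ from the job's output if the job tokenizes differently.

```shell
# Hypothetical helper: count whitespace-separated words in a file,
# most frequent first. A local cross-check, not the Spark job itself.
count_words() {
    tr -s '[:space:]' '\n' < "$1" | sed '/^$/d' | sort | uniq -c | sort -rn
}

# Tiny made-up sample; on the master node the input could instead be
# /vagrant/examples/quijote.txt.
sample=$(mktemp)
printf 'de la Mancha de cuyo nombre\nno quiero acordarme de\n' > "$sample"
count_words "$sample" | head -n 3
rm -f "$sample"
```

On the master node, the first lines of count_words /vagrant/examples/quijote.txt could then be compared against the files in examples/quijote_out.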
To access the Spark shell, just use the command
spark-shell
To launch spark-notebook, run:
cd /opt/spark-notebook/ && ./bin/spark-notebook
And then head to http://localhost:9000/.
For a real data-analytics-oriented cluster, check out vagrant-cluster.