Spark Vagrant VM
Vagrant VM box for Spark.
Preliminaries: VirtualBox and Vagrant
Vagrant is a tool to "create and configure lightweight, reproducible, and portable development environments." Vagrant itself creates and starts virtual instances on top of Oracle VirtualBox, which takes care of the virtualisation.
Download and install the Open Source Edition of VirtualBox from virtualbox.
Then download and install Vagrant from vagrant. The Linux packages install the
vagrant executable at
/opt/vagrant/bin and you will need to add this to
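As a rough sketch of making the executable reachable from a shell, the snippet below prepends the install directory to the shell search path. It assumes a Bourne-style shell such as bash and assumes the search path (PATH) is what needs extending. VAGRANT_BIN is an illustrative variable name, not something Vagrant itself defines.

```shell
# Assumption: the Linux package placed the executable in /opt/vagrant/bin.
# VAGRANT_BIN is a hypothetical helper variable used only in this sketch.
VAGRANT_BIN=/opt/vagrant/bin

# Prepend the directory only if it is not already on the search path.
case ":$PATH:" in
    *":$VAGRANT_BIN:"*) ;;                 # already present, nothing to do
    *) PATH="$VAGRANT_BIN:$PATH" ;;
esac
export PATH
```

Placing these lines in a shell startup file such as ~/.profile would make the change apply to new sessions.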
Building the VM
There is a
Rakefile with useful targets for creating and generating the Spark
Vagrant VM. To create a new VM run the default Rake target:
rake
This will create the Spark box in Vagrant and run the necessary Puppet provisioning. This step takes some time because it installs Java and Hadoop, downloads and compiles Spark, and so on.
When the box is complete, you will find it in
You will likely only need to do this once unless you want to adapt the VM and make it available to others. If you are the trusting type, there is a prebuilt VM at:
Copy the download to the
target directory if you are cheating and continue.
You can test the VM by using the Vagrant definition in
cd example
vagrant up
vagrant ssh
The Spark Web UI will be port forwarded to port 8080 on your host so you can open
http://localhost:8080 on your host computer to see some Spark details.
The HDFS Web UI is also port forwarded to ports 50070 and 50075 so you can browse
the HDFS on the VM by opening
http://localhost:50070 on your host.
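To confirm from a terminal that the forwarded ports are answering, something like the following may help. It is only a sketch: it assumes curl is installed on the host, and the function names ui_url and check_uis are made up for this example. The port numbers are the ones listed above.

```shell
# Build the host-side URL for a forwarded port.
ui_url() {
    echo "http://localhost:$1"
}

# Report whether each forwarded web UI answers an HTTP request.
# Ports from the text: 8080 (Spark), 50070 and 50075 (HDFS).
# Assumes curl is available on the host.
check_uis() {
    for port in 8080 50070 50075; do
        if curl --silent --output /dev/null --max-time 5 "$(ui_url "$port")"; then
            echo "port $port: reachable"
        else
            echo "port $port: not reachable"
        fi
    done
}
```

Run check_uis on the host after vagrant up has finished. A "not reachable" line usually just means the service behind that port has not started yet.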
When finished, you can destroy the VM using:
vagrant destroy
Note for the Paranoid
If you are inclined to paranoia, see
modules\spark\manifests\ssh.pp for notes
on changing the passwordless root SSH needed on the VM instance to start a
Note on the Versions
The VM uses Spark 0.7.2 and Hadoop 1.0.3. The slightly unusual Hadoop version was chosen to match the version in Elastic MapReduce, which this work originally targeted.
A more recent Hadoop 1 can be selected by changing the download in
/modules/spark/templates/root/spark.setup.erb and rebuilding.
To use the examples, you may also need to update the dependencies in
To run some sample applications, cd to
examples and compile a fat jar from
the SBT project there:
cd examples
./sbt012 assembly
The jar can be run on your host machine directly using e.g.:
java -cp target/scala-2.9.3/spark-assembly-1-SNAPSHOT.jar \
    org.boringtechiestuff.spark.TweetWordCount \
    --local \
    dev/sample.json output
To run it on the VM, first SSH to it and put the necessary files in HDFS:
vagrant ssh
hadoop fs -mkdir /lib
hadoop fs -put /vagrant/target/scala-2.9.3/spark-assembly-1-SNAPSHOT.jar /lib
hadoop fs -mkdir /input
hadoop fs -put /vagrant/dev/sample.json /input
The /vagrant directory is a convenience mount of the
examples directory onto
Run the same application as earlier but in cluster mode this time:
java -cp /vagrant/target/scala-2.9.3/spark-assembly-1-SNAPSHOT.jar \
    org.boringtechiestuff.spark.TweetWordCount \
    hdfs://localhost:9000/input \
    hdfs://localhost:9000/output
Check the Web UI on
localhost:8080 to prove it is doing something. When done,
the output can be checked using:
hadoop fs -ls /output
hadoop fs -text /output/part-*
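If you stage files into HDFS repeatedly, the mkdir and put steps used earlier can be wrapped in a small function. This is a sketch, not part of the project: stage_to_hdfs is a made-up name, and it assumes that hadoop fs -test -d reports an existing directory through a zero exit status.

```shell
# Hypothetical helper: copy one local file into an HDFS directory,
# creating the directory first if it does not exist yet.
stage_to_hdfs() {
    local_file=$1
    hdfs_dir=$2
    # Assumption: 'hadoop fs -test -d' exits 0 when the directory exists.
    hadoop fs -test -d "$hdfs_dir" || hadoop fs -mkdir "$hdfs_dir"
    hadoop fs -put "$local_file" "$hdfs_dir"
}

# Example calls, to be run inside the VM after 'vagrant ssh':
#   stage_to_hdfs /vagrant/target/scala-2.9.3/spark-assembly-1-SNAPSHOT.jar /lib
#   stage_to_hdfs /vagrant/dev/sample.json /input
```

The sketch does not handle a destination file that already exists. Remove the old copy first if you need to replace it.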
Spark also provides a streaming mode.
A streaming version of the previous example can be run on your host machine directly using:
java -cp target/scala-2.9.3/spark-assembly-1-SNAPSHOT.jar \
    org.boringtechiestuff.spark.StreamingTweetWordCount \
    --local \
    input output
In this case, new files added to
input are picked up and processed, and the results are left in
output by timestamp. For instance, copy the input file:
mkdir input
cp dev/sample.json input
After a few seconds, a new directory will be added in output with the results:
cd output
ls -alR
And look for the directory with a nonzero
The application runs until explicitly killed.
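Scanning the ls -alR listing by eye gets tedious once a few batches have run. As a sketch, find can list only the result files that actually contain data. The function name nonempty_parts is made up for this example, and it assumes the result files follow the part-* naming used in the earlier hadoop fs -text command.

```shell
# List result files under a local output directory that are not empty.
# Assumes results are written as files named part-*.
nonempty_parts() {
    # -size +0c matches files larger than zero bytes.
    find "$1" -type f -name 'part-*' -size +0c
}
```

Running nonempty_parts output from the examples directory prints the path of each non-empty part file, which also tells you which directory holds the results.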
As before, this works on the VM also:
vagrant ssh
hadoop fs -rmr /input
hadoop fs -mkdir /input
hadoop fs -rmr /output
java -cp /vagrant/target/scala-2.9.3/spark-assembly-1-SNAPSHOT.jar \
    org.boringtechiestuff.spark.StreamingTweetWordCount \
    hdfs://localhost:9000/input \
    hdfs://localhost:9000/output
In another console:
vagrant ssh
hadoop fs -put /vagrant/dev/sample.json /input/sample2.json
hadoop fs -lsr /output
And look for the nonempty
part files again.
Some useful Vagrant commands.
vagrant suspend: Disable the virtual instance. The allocated disc space for the instance is retained but the instance will not be available. The running state at suspend time is saved for resumption.
vagrant resume: Wake up a previously suspended virtual instance.
vagrant halt: Turn off the virtual instance. Calling
vagrant up after this is the equivalent of a reboot.
vagrant destroy: Hose your virtual instance, reclaiming the allocated disc space.
vagrant provision: Rerun Puppet or Chef provisioning on the virtual instance.
vagrant box list: List the VM definitions that Vagrant has imported.
vagrant box remove <name>: Remove the named VM definition from Vagrant, possibly to allow for an updated version to be imported.
Vagrant SSH X Forwarding
X applications on VMs can be displayed on the host machine by specifying a
Vagrant SSH connection with X11 forwarding in the
config.ssh.forward_x11 = true
On the host machine, add an
xhost for the Vagrant VM:
Then X applications started from the VM should display on the host machine.
To see more verbose output on any vagrant command, add a VAGRANT_LOG environment variable setting, e.g.:
VAGRANT_LOG=INFO /opt/vagrant/bin/vagrant up
Further help troubleshooting can be obtained by editing your
config.vm.boot_mode = :gui setting. This will pop up a VirtualBox
GUI window on boot.