Builds a data science work environment for Russell Jurney's book Agile Data Science.
You will need VirtualBox and Vagrant installed and working. If you are using a version of Vagrant older than 1.3.0, you will also need Salty Vagrant. Salty Vagrant requires the Ruby development libraries (ruby-dev on Ubuntu/Debian or ruby-devel on RHEL/Fedora/CentOS).
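The version requirement above can be checked with a small shell sketch. The function name needs_salty_vagrant is hypothetical, and the comparison assumes a sort that supports -V (GNU coreutils and recent BSD/macOS versions do). The sketch also assumes that the last word printed by vagrant --version is the version number.

```shell
# Hypothetical helper: succeeds (exit 0) when the given Vagrant version
# is older than 1.3.0, i.e. when Salty Vagrant is also needed.
needs_salty_vagrant() {
    version="$1"
    if [ "$version" = "1.3.0" ]; then
        return 1
    fi
    # sort -V orders version strings numerically; the first line is the older one.
    oldest=$(printf '%s\n%s\n' "$version" "1.3.0" | sort -V | head -n 1)
    [ "$oldest" = "$version" ]
}

# Only query Vagrant when it is actually installed on this machine.
if command -v vagrant >/dev/null 2>&1; then
    # Assumption: the version number is the last field of the output.
    installed=$(vagrant --version | awk '{print $NF}')
    if needs_salty_vagrant "$installed"; then
        echo "Vagrant $installed: Salty Vagrant is also required"
    else
        echo "Vagrant $installed: Salty Vagrant is not needed"
    fi
fi
```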
- Clone this repo and edit the Vagrantfile to customize your VM to taste.
See the Installation notes section below for comments on Java versions, operating systems, and details on a misleading error message you may receive.
The method for agreeing to the Oracle terms and downloading Java is based on the Chef Java Cookbook.
Initial run versus subsequent runs
During the initial run, the components are downloaded, installed, and in some cases built. During subsequent runs, only package/git updates (if any) are applied. On my machine, with two CPUs assigned to the VM, the initial run takes 21 minutes and subsequent runs take 1.5 minutes.
The VM environment includes the following major components:
Please be aware that Oracle JDK 6u45 is known to contain several security vulnerabilities, so be careful if you access the internet from the virtual machine. See the Java versions section below for further comments on choosing a different version.
The book Agile Data Science contains instructions for the tools. This section documents small differences between the book and this environment.
The default base directory is /home/vagrant/agiledata, which contains the following:
book-code: a clone of the Agile_Data_Code repo.
downloads: tarfiles downloaded during installation.
env.sh: source this script to set JAVA_HOME and add all tool binaries to your PATH.
linkjars.sh: see the Registering jarfiles in pig section below.
software: tools and libraries are installed in this directory.
venv: the python virtualenv used in the book.
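As a rough sketch of how env.sh might be loaded in a shell session: the helper name load_agiledata_env is hypothetical, and the default path assumes the base directory described above. The script is sourced rather than executed so that the variables it sets stay in the current shell.

```shell
# Hypothetical helper: source env.sh from an agiledata base directory.
load_agiledata_env() {
    base_dir="${1:-/home/vagrant/agiledata}"
    if [ ! -f "$base_dir/env.sh" ]; then
        echo "env.sh not found in $base_dir" >&2
        return 1
    fi
    # Source (not execute) the script so its variables persist in this shell.
    . "$base_dir/env.sh"
}

# Inside the VM, for example:
#   load_agiledata_env && echo "JAVA_HOME is $JAVA_HOME"
```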
Registering jarfiles in pig
The installation process creates and runs the script linkjars.sh. This script finds all jarfiles in the software directory and creates symlinks to them in software/lib. The symlinks make it easier to register jarfiles in pig scripts. For example, to register MongoDB jars in your pig script, you can use
REGISTER /home/vagrant/agiledata/software/mongo-hadoop/flume/target/mongo-flume-1.1.0-SNAPSHOT.jar
REGISTER /home/vagrant/agiledata/software/mongo-hadoop/target/mongo-hadoop-1.1.0-SNAPSHOT.jar
REGISTER /home/vagrant/agiledata/software/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
REGISTER /home/vagrant/agiledata/software/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar
REGISTER /home/vagrant/agiledata/software/lib/mongo-java-driver-2.11.1.jar
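The symlinking step described above could look roughly like the following. This is a minimal sketch and not the actual linkjars.sh: the function name link_jars is hypothetical, and skipping real files already present in lib is an assumption about the intended behavior.

```shell
# Hypothetical sketch: symlink every jar under a software directory into
# its lib subdirectory.
link_jars() {
    # Resolve to an absolute path so the symlink targets stay valid.
    software_dir=$(cd "$1" && pwd) || return 1
    lib_dir="$software_dir/lib"
    mkdir -p "$lib_dir"
    # Prune lib itself so jars and symlinks already there are not re-linked.
    find "$software_dir" -path "$lib_dir" -prune -o -type f -name '*.jar' -print |
    while IFS= read -r jar; do
        dest="$lib_dir/$(basename "$jar")"
        # Leave real files in lib untouched; only create or refresh symlinks.
        if [ -e "$dest" ] && [ ! -L "$dest" ]; then
            continue
        fi
        ln -sf "$jar" "$dest"
    done
}

# Example: link_jars /home/vagrant/agiledata/software
```

If two jars in different directories share a file name, the later one found wins in this sketch.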
The linkjars.sh script is run during installation and each time the VM is rebooted. It is unlikely you will need to run it manually, but the script is provided just in case. Please note that the following jarfiles are actual files rather than symlinks, and will not be affected by running the script:
Installation notes
Java versions
Many factors can influence your choice of Java version. Recommending a specific Java version is a dubious proposition, like providing health advice to strangers.
This project conservatively uses Oracle JDK 1.6, the version specified in Pig's Getting Started doc and historically used by enterprise Hadoop installations.
However, you do have other options:
- Pig has been compatible with 1.7 for a while
- CDH4 works with 1.7
- MapR and Hortonworks work with 1.7 and will even work with OpenJDK
You may need to consult your organization, your sysadmin, your vendor, and/or your conscience before making this decision.
Supported operating systems
This environment should work on any system that can run VirtualBox and Vagrant. If you experience problems installing on Windows related to changing file permissions (look for "Failed to change mode to 755" in the output from the installation process), you could try deleting line 13 in oracle_java.sls, the line that sets - mode: 755.
Windows does not have the same concept of file permissions as Unix-like and POSIX-compliant operating systems.
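To check for this error without scrolling through the provisioning output, the output can be saved to a file and searched. The sketch below assumes a POSIX shell on the host (for example Git Bash on Windows). The helper name has_mode_error and the file name provision.log are hypothetical.

```shell
# Hypothetical helper: succeeds when a saved provisioning log contains the
# permission error described above.
has_mode_error() {
    grep -q 'Failed to change mode to 755' "$1"
}

# On the host, for example:
#   vagrant up 2>&1 | tee provision.log
#   has_mode_error provision.log && echo "permission error found"
```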
The default VM (configured in the Vagrantfile) is Ubuntu Precise x64. I have also tested with Fedora 18. The environment may work using other Red Hat- or Debian-based distros as well.
Misleading error message
Salt 0.15.x is affected by issue saltstack/salt#4904, which causes it to exit with code 2 rather than code 0 on a successful run. Vagrant interprets this code as an error and displays the following message:
The following SSH command responded with a non-zero exit status.
Vagrant assumes that this means the command failed!
salt-call state.highstate -l debug
True errors in building the agiledata environment are much uglier than this. However, if you'd like to verify the installation, ssh into the VM with vagrant ssh and then run sudo salt-call state.highstate -l debug. This is a subsequent run, so it should take only a minute or two to complete. Since you are running the state directly rather than through Vagrant, you should see a true return code on success.
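When verifying by hand, it can help to print the exit status right after the command finishes. The following sketch uses a hypothetical helper, describe_highstate_exit, and the wording of its messages is illustrative. It treats exit code 2 as inconclusive rather than as a failure, based on the Salt 0.15.x issue described above.

```shell
# Hypothetical helper: describe the exit status of a highstate run.
describe_highstate_exit() {
    case "$1" in
        0) echo "highstate succeeded" ;;
        # Salt 0.15.x can exit with 2 even on success (saltstack/salt#4904).
        2) echo "exit status 2: may still be a successful run on Salt 0.15.x (saltstack/salt#4904)" ;;
        *) echo "highstate failed with exit status $1" ;;
    esac
}

# Inside the VM (after vagrant ssh), for example:
#   sudo salt-call state.highstate -l debug
#   describe_highstate_exit $?
```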