This repository contains a set of Ansible playbooks that automate the process of deploying the code for the Open Syllabus Project - both the metadata extraction rig and the public-facing web application. These playbooks can be used to configure a local development environment managed by Vagrant, or to deploy public-facing instances on EC2.
Set up a local development environment
Clone this repository, create a Python 2.x virtualenv, and install dependencies.
```
virtualenv env
. env/bin/activate
pip install -r requirements.txt
```
Check out the submodules:
git submodule update --init
In `vars/local.yml`, set a value for the `osp_db_pass` variable, which will be used as the password for the main Postgres database.
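For example, a minimal entry might look like this (the password value is just a placeholder — pick your own):

```yaml
# vars/local.yml -- local overrides for the playbooks.
osp_db_pass: changeme
```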
For `osp_sync_code` and `osp_sync_data`, enter paths to directories on the local filesystem that will be synced with the Vagrant VM. Eg, on a Mac, something like:
```
osp_sync_code: /Users/davidmcclure/Projects/osp-vagrant-code
osp_sync_data: /Users/davidmcclure/Projects/osp-vagrant-data
```
These directories don't need to exist - they'll be created automatically when the VM is started.
Install the Vai plugin for Vagrant with:
vagrant plugin install vai
Start the Vagrant box with:

vagrant up
Then, provision the box with:

vagrant provision
On the first run, this will take 20-30 minutes on most systems, since the pip install has to compile a number of very large packages.
Once the playbooks run, log into the VM with:

vagrant ssh
Then start the test Elasticsearch server:
sudo supervisorctl start es-test
Wait ~5s for Elasticsearch to start, then change into `/home/vagrant/osp` and run the test suite.
If this passes, the environment is fully configured and ready for work. Any changes made in the synced code directory (set by `osp_sync_code`) will be automatically propagated to the VM, and vice versa.
At this point, if you're working with one of the public data dumps, just move the dump into the synced `osp_sync_data` directory. Then, from the Vagrant VM, reset the database and use `pg_restore` to source in the data:
```
dropdb osp -U postgres
createdb osp -U postgres
pg_restore /osp/osp-public.sql -d osp -U postgres -v
```
This will take 10-20 minutes, since it has to rebuild a couple of pretty large indexes. Once it's complete, hop into `psql`, and you should be able to interact with the data:
```
> psql osp -U osp
psql (9.5.1)
Type "help" for help.
osp=> select count(*) from document;
  count
---------
 1415005
(1 row)
```
When you're finished working, the VM can be stopped with:

vagrant halt
This shuts down Ubuntu but preserves the configuration and data of the VM. To restart it later, just run `vagrant up` again.
Then, open the VirtualBox Manager application and find the listing for `osp-deploy_server_XXX`. Right click on it, and then click "Show" to launch a GUI for the VM. At a certain point during the boot process, Ubuntu will hang with:
The disk drive for /osp is not ready yet or not present
Press `s` to skip the message, which will allow the system to boot. (This isn't actually a problem, just a quirky interaction between Vagrant's directory syncing and Ubuntu.)
Make a wheelhouse to speed up deployments
The slowness of the pip install is a drag, especially when it comes time to put up a big set of workers on EC2. To get around this, we can build a "wheelhouse" from the local Vagrant VM, which contains pre-built binaries for Ubuntu. Then, when deploying new servers (either on EC2 or locally), we can tell the provisioning scripts to use these binaries, instead of recompiling everything from scratch.
Log into the Vagrant box with:

vagrant ssh
Change into `/home/vagrant/osp`, and run:
pip wheel -r requirements.txt -w wheelhouse
Tar up the wheelhouse:
tar -cvzf wheelhouse.tar.gz wheelhouse
Move the tarball into the synced directory:
mv wheelhouse.tar.gz /vagrant
Now, any future deployments will automatically detect the wheelhouse and deploy it to the remote server.
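The exact detection logic lives in the provisioning playbooks, but consuming a wheelhouse with pip generally looks like the sketch below (the `wheelhouse` path is illustrative):

```shell
# Unpack the pre-built wheels, then install from them,
# skipping PyPI entirely (--no-index) so nothing is recompiled.
tar -xzf wheelhouse.tar.gz
pip install --no-index --find-links=wheelhouse -r requirements.txt
```

Because every package resolves to a pre-built binary wheel, the install step takes seconds instead of the 20-30 minute compile on a fresh box.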
Deploying to EC2
Add your AWS credentials to `~/.boto`.
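If you don't already have one, a minimal `~/.boto` file looks like this (the values are placeholders for your own keys):

```ini
[Credentials]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
```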
Then, in the `osp-deploy` repo, symlink the EC2 Ansible config file:
ln -s config/ansible.ec2.cfg ansible.cfg
In `vars/local.yml`, fill in the EC2 variables:
`ec2_context`: use something to differentiate your instances, such as your name
`ec2_keypair`: the name of the SSH key, as listed in the AWS console
`ec2_subnet_id`: get this from David
`ec2_osp_snapshot`: get this from David
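For example (every value here is a placeholder — use your own context name and keypair, and the IDs from David):

```yaml
ec2_context: dave
ec2_keypair: osp-keypair
ec2_subnet_id: subnet-xxxxxxxx
ec2_osp_snapshot: snap-xxxxxxxx
```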
Run the `scraper-create` playbook to start an instance.
Once that's up, you sometimes have to explicitly refresh the EC2 inventory cache before Ansible can see the new instance.
Then run the `scraper-deploy` playbook, and it will automatically run against the new instance.