This software is designed to run genomics workflows on Google Compute Engine (GCE). Its distinguishing characteristic is that it automatically mounts and unmounts disk storage as needed over the course of a workflow, in contrast to relying on NFS storage or saving intermediate results to cloud storage. In certain use cases this strategy comes closer to optimal resource utilization.
- Get Setup with GCE
- Configure GCE Image
- Get Service Account Authentication Info
- Constructing Disk and Instance CSV Files
- Run Test
- Web Server Version
- Alternative Licensing
Get a GCE account and set up a Google Cloud Storage bucket. The bucket URI should look something like this: gs://bucketname/
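For example, if you have the Cloud SDK installed you can create the bucket from the command line (the bucket name below is a placeholder); it can also be created from the developers console.

gsutil mb gs://yourbucketname/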
In this section we will configure a GCE Image for use as the OS on both the Master and Worker instances.
From the GCE developers console boot a new instance. I've chosen CentOS 6.6 as the base image, but if you use a different base you may need to modify the software installation below. Make sure to enable full access to storage during setup; this is important because you will save your image to your Google Cloud Storage bucket.
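If you prefer the command line to the console, a roughly equivalent instance can be booted with gcloud. This is only a sketch: the instance name is a placeholder, and the exact CentOS 6 image name should be checked with gcloud compute images list.

gcloud compute instances create image-builder --zone us-central1-a --image centos-6 --scopes storage-full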
- SSH into the new instance.
- Install Git
sudo yum install git
- Clone this project and note the project location
git clone https://github.com/collinmelton/DynamicDiskCloudSoftware.git
- Install project-specific dependencies
** install development tools **
sudo yum install libevent-devel python-devel
sudo yum groupinstall "Development tools"
** install pip, apache-libcloud, PyCrypto, and httplib2 **
curl -o get-pip.py https://bootstrap.pypa.io/get-pip.py
sudo python get-pip.py
sudo pip install apache-libcloud
sudo pip install PyCrypto
sudo pip install httplib2
** additional commands to install Python 2.7, easy_install, pip, and get PyCrypto and the other dependencies under Python 2.7 **
sudo rpm -Uvh http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
wget http://www.python.org/ftp/python/2.7.6/Python-2.7.6.tgz
tar xvzf Python-2.7.6.tgz
sudo yum install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel
cd Python-2.7.6
sudo ./configure --prefix=/usr/local --enable-unicode=ucs4 --enable-shared LDFLAGS="-Wl,-rpath /usr/local/lib"
sudo make && sudo make altinstall
wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py
sudo /usr/local/bin/python2.7 ez_setup.py
sudo /usr/local/bin/easy_install-2.7 pip
sudo /usr/local/bin/pip2.7 install apache-libcloud
sudo /usr/local/bin/pip2.7 install PyCrypto
sudo /usr/local/bin/pip2.7 install httplib2
sudo /usr/local/bin/pip2.7 install psutil
sudo /usr/local/bin/pip2.7 install -U https://github.com/google/google-visualization-python/zipball/master
ssh-keygen -t rsa -b 4096 -C "yourname@yourdomain.com"
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa
Edit ~/.ssh/authorized_keys by adding the public key located in ~/.ssh/id_rsa.pub. In this setup the master running this image will have the same public key as the worker, and we want the master to be able to SSH into the worker, so we need to add the master's public key to the list of authorized keys.
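One way to do this, assuming the default key locations from the commands above:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys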
sudo gcimagebundle -d /dev/sda -o /tmp/ --log_file=/tmp/abc.log
** check the name of the image file in the output of the above command and edit the commands below with this name **
gsutil cp /tmp/imagename.image.tar.gz gs://yourbucketname/
** below you can name your image; I've named the image cloudtest2. This can also be done with gcloud **
~/google-cloud-sdk/bin/gcutil --project "your_project_name" addimage cloudtest2 gs://yourbucketname/imagename.image.tar.gz
or
gcloud compute images create ddtest --source-uri gs://yourbucketname/imagename.image.tar.gz
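You can optionally confirm that the new image is available before moving on (the image name will be whatever you chose above):

gcloud compute images list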
In order to run the software you need a service account email address and a .pem private key file. See instructions here: https://cloud.google.com/storage/docs/authentication#service_accounts
You should create a .pem file and note your service account email address, which will be in the format: numbersandletters@developer.gserviceaccount.com
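If the developers console gives you a .p12 key rather than a .pem file, it can usually be converted with openssl. The file names below are placeholders; Google-issued .p12 keys have historically used the passphrase notasecret.

openssl pkcs12 -in yourkey.p12 -nodes -nocerts -passin pass:notasecret -out yourkey.pem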
The workflow is specified by the disk and instance files. There is no internal check for the validity of your specified workflow, so be careful to map it out and avoid errors! Make sure the names of instances and disks comply with GCE standards (a name must start with a lowercase letter followed by up to 62 lowercase letters, numbers, or hyphens, and cannot end with a hyphen).
The disk file contains information including disk names, types, and sizes. A description of the columns to include in this file is as follows (an example file is sketched after the table):
| Column Name | Description |
|---|---|
| notes | Any notes you want to include about this row. |
| name | The name of the disk. $JOBID and $JOBMULT variables in the name will be replaced with job id and job multiplicity |
| size | The size of the disk in GB. |
| location | Disk zone. Default is us-central1-a. |
| snapshot | Disk snapshot to use. Default is None. |
| image | Disk image to use. Default is None. |
| job_multiplicity | This is a string with variable names separated by a pipe (\|). |
| job_id | The job id. Really this is just a scheme to replace the $JOBID variable in the name column with this value. |
| disk_type | The type of disk. Options are pd-standard or pd-ssd. |
| init_source | A location to copy to the disk when it is initialized. This can be a cloud storage bucket/folder combination or another disk mounted to the initial instance. |
| shutdown_dest | A location to save the disk contents to when the instance finishes. This can be a cloud storage bucket/folder combination or another disk mounted to the instance. |
Note: To use the default value for a column, write DEFAULT in place of a specific value.
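As a purely hypothetical illustration, a two-disk file might look something like the following, assuming a header row with the column names above; the disk names, sizes, and bucket paths are placeholders, and TestRuns/basictest_disks.csv is the authoritative example of the format.

notes,name,size,location,snapshot,image,job_multiplicity,job_id,disk_type,init_source,shutdown_dest
input data,data-disk-$JOBID,100,us-central1-a,DEFAULT,DEFAULT,DEFAULT,1,pd-standard,gs://yourbucketname/inputs,DEFAULT
scratch space,scratch-disk-$JOBID,200,us-central1-a,DEFAULT,DEFAULT,DEFAULT,1,pd-ssd,DEFAULT,gs://yourbucketname/results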
The instance file contains information including instance names, types, commands to run, instance dependencies, and disks to mount. A description of the columns to include in this file is as follows (an example file is sketched after the table):
| Column Name | Description |
|---|---|
| run | TRUE or FALSE whether to use or ignore this row. |
| notes | Any notes you want to include about this row. |
| name | The name of the instance. $JOBID and $JOBMULT variables in the name will be replaced with job id and job multiplicity |
| dependencies | Names of instances that must complete before this instance is launched, separated by a pipe (\|). |
| read_disks | Names of disks to mount in read-only mode, separated by a pipe (\|). |
| read_write_disks | Names of disks to mount in read/write mode, separated by a pipe (\|). |
| boot_disk | Name of the disk to use as the boot disk. |
| script | Newline-separated Linux commands to run on the instance. |
| size | GCE instance specification (default is n1-standard-1). |
| image | The image to use. I believe this is only used if no boot disk is specified in the disk file and otherwise the image on the boot disk is used. Default is None. |
| location | The zone in which to boot the instance. (default is us-central1-a) |
| ex_network | A network to specify (I've never ended up using this so it could be buggy.) Default is 'default'. |
| ex_tags | Similar to ex_network this hasn't been tested but in theory should add tags to the instance. |
| ex_metadata | Similar to ex_network this hasn't been tested but in theory should add metadata to the instance. |
| job_multiplicity | This is a string with variable names separated by a pipe (\|). Each variable will be added with a leading dash to replace $JOBMULT and without a dash to replace $JOBMULTNODASH. |
| var_multiplicity | This is an old option I haven't used in a while. I believe the idea was that you put in pipe-separated variables and it replaces $VARMULT with these variables pasted together in some useful fashion. I don't really like this option and I'm thinking of deprecating it. I believe I initially used it to get variable names formatted for combining files. |
| job_id | The job id. Really this is just a scheme to replace the $JOBID variable in the name column with this value. The variable will be added in place of $JOBID with a trailing dash. |
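Again as a hypothetical illustration, a two-step instance file might look roughly like the following; the instance names, disk names, commands, and handling of empty cells are placeholders, and TestRuns/basictest_instances.csv is the authoritative example of the format.

run,notes,name,dependencies,read_disks,read_write_disks,boot_disk,script,size,image,location,ex_network,ex_tags,ex_metadata,job_multiplicity,var_multiplicity,job_id
TRUE,first step,a-node-$JOBID-step1,,data-disk-$JOBID,scratch-disk-$JOBID,DEFAULT,echo step1 done > /tmp/step1.txt,n1-standard-1,DEFAULT,us-central1-a,DEFAULT,DEFAULT,DEFAULT,DEFAULT,DEFAULT,1
TRUE,second step,a-node-$JOBID-step2,a-node-$JOBID-step1,scratch-disk-$JOBID,,DEFAULT,echo step2 done > /tmp/step2.txt,n1-standard-1,DEFAULT,us-central1-a,DEFAULT,DEFAULT,DEFAULT,DEFAULT,DEFAULT,1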
** Use the GCE instance used to create the image above, or boot a new instance from the image you created above with access to compute and storage. **
** Navigate to the directory for this project then go to the Master folder. **
Instance and disk files for a simple test run are in TestRuns/ named basictest_instances.csv and basictest_disks.csv
python2.7 RunJobs.py --I basictest_instances.csv --D basictest_disks.csv --P yourprojectname --PM test.pem --E somelettersandnumbers@developer.gserviceaccount.com --RD /home/yourusername/ --SD ./
Instance command data are saved in pickle files in the 'storage directory' specified by the --SD option. You can view the results of a history file by using the following command:
python2.7 DynamicDiskCloudSoftware/Worker/printCommandHistory.py --H ~/a-node-1-step1.history.pickle
I am developing an updated version of the software that runs a webserver (https://github.com/collinmelton/DDCloudServer). This version allows the user to generate a workflow, launch a workflow, and view progress and performance of the workflow as it runs.
This project is licensed open source under GNU GPLv2. To inquire about alternative licensing options please contact the Stanford Office of Technology Licensing (www.otl.stanford.edu).