biobakery_internal

ljmciver edited this page Jun 10, 2022 · 15 revisions

bioBakery: Internal documentation

This page describes how the bioBakery team maintains bioBakery. The instructions are intended for team members; bioBakery users do not need to perform these steps. Users are nonetheless welcome to read this page to learn how new tools are added and new images are released.



1. Add a new tool

Each bioBakery tool has its own conda recipe and Docker image. All tools are also included in the biobakery VM and cloud images.

To add a new tool to the biobakery public release repositories, follow these steps:

  1. Write a conda recipe for your tool. Once built and tested, push the package to the conda biobakery repository. See the kneaddata conda recipe as an example: https://github.com/biobakery/conda-biobakery
  2. Write a Dockerfile for your tool using the conda package. Once built and tested, push the image to the biobakery Docker Hub repository. See the kneaddata Dockerfile as an example: https://github.com/biobakery/biobakery/tree/master/docker
  3. Update the biobakery VM and Google Cloud images to include the new tool.
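
The first two steps map roughly onto the commands below. This is a minimal sketch, not the team's release script: the recipe and Dockerfile locations, the `biobakery` Docker Hub namespace, and the version number are assumptions, and the `run` helper with its `DRY_RUN` switch is introduced only for illustration. Uploading the built conda package is left out because the package path depends on the build.

```shell
#!/bin/sh
# Sketch: print (or run) the commands that build a tool's conda package
# and Docker image. Paths and the Docker Hub namespace are assumptions.

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "+ $*"
    else
        "$@"
    fi
}

release_tool() (
    tool="$1"                             # tool name, e.g. kneaddata
    version="$2"                          # release version (placeholder)
    recipe_dir="conda-biobakery/$tool"    # assumed recipe location
    docker_dir="biobakery/docker/$tool"   # assumed Dockerfile location

    # Step 1: build the conda package from its recipe.
    run conda build "$recipe_dir"
    # Step 2: build the Docker image and push it once it has been tested.
    run docker build -t "biobakery/$tool:$version" "$docker_dir"
    run docker push "biobakery/$tool:$version"
)

# Example invocation with a placeholder version; prints the commands only.
release_tool kneaddata 0.0.0
```

With the default `DRY_RUN=1` the script only prints the commands, which makes it easy to review them before running with `DRY_RUN=0`.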

2. Add a new public Atlas Vagrant box

The bioBakery startup scripts download and install the bioBakery box hosted by Atlas.

To build and upload a new version of the box, use the instructions that follow. If you need to update the packages included in the box, first update the box provisioning scripts. See the section on provisioning for more information.

  1. Build a new GUI bioBakery box. Building the box installs the latest version of the bioBakery tool suite, which includes the latest versions of all of the bioBakery tools.

  2. Package the box

    a. Find the box name with VirtualBox: `$ vboxmanage list vms`

    b. Package the box with Vagrant (replace BOXNAME with the name of the box from the prior step): `$ vagrant package --base BOXNAME --output biobakery-gui.box --vagrantfile Vagrantfile.package`

  3. Test that the packaged box works as expected

    a. Add the box to your box list: `$ vagrant box add biobakery-gui.box --name biobakery-gui-test`

    b. Start a new box from the packaged box

    i.  `$ vagrant init biobakery-gui-test`
    ii. `$ vagrant up --provider=virtualbox`
    
  4. Add the new bioBakery box version to the bioBakery Atlas box set
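
Steps 2 and 3 can be chained in a small script. The sketch below reuses the commands listed above; the scratch directory name, the `run` helper, and the `DRY_RUN` switch are assumptions added for illustration.

```shell
#!/bin/sh
# Sketch: package a built box and start a test box from the package.
# DRY_RUN=1 (the default) only prints the commands.

DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "+ $*"
    else
        "$@"
    fi
}

package_and_test_box() (
    boxname="$1"                          # name shown by `vboxmanage list vms`
    testdir="${2:-biobakery-box-test}"    # hypothetical scratch directory

    # Step 2: package the box.
    run vagrant package --base "$boxname" --output biobakery-gui.box --vagrantfile Vagrantfile.package

    # Step 3: add the packaged box, then start it from a scratch directory
    # so the Vagrantfile used for the build is left untouched.
    run vagrant box add biobakery-gui.box --name biobakery-gui-test
    run mkdir -p "$testdir"
    run cd "$testdir" || exit 1
    run vagrant init biobakery-gui-test
    run vagrant up --provider=virtualbox
)

# Example invocation; replace BOXNAME with the real box name.
package_and_test_box BOXNAME
```

The function body runs in a subshell, so the `cd` into the scratch directory does not affect the calling shell.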

2.2. Provisioning scripts

Vagrant builds the bioBakery box by executing a series of Linux commands within a base Ubuntu box. See the Vagrantfile for the URL of the current base Ubuntu box. These commands are contained in provisioning scripts, which are bash scripts called from the Vagrantfile associated with each bioBakery box. All boxes use two provisioning scripts:

  • provision-biobakery-core.sh [common to all images]
  • provision-biobakery-(gui|nogui).sh [box-specific]

The first script, provision-biobakery-core.sh, handles all configuration common to the bioBakery boxes. Specifically, it (1) installs and removes packages in the base Ubuntu box and, more importantly, (2) installs the bioBakery tool suite with the Homebrew formula. A single Homebrew formula installs the full bioBakery tool suite.

The second script is specific to the type of box you are trying to build. For example, the version of bioBakery with a graphical user interface (GUI) is additionally configured by calling provision-biobakery-gui.sh. These second scripts install additional packages, configure the graphical environment (for the GUI version), set aliases in the .bashrc file, and so forth. Notably, any "cleanup" steps common to all box builds must be present at the end of these box-specific provisioning scripts (e.g. purging the apt-get cache).
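
As an illustration of what the tail end of a box-specific script might look like, here is a minimal sketch. It is not taken from the actual provisioning scripts: the function names and the example alias are hypothetical. It shows an alias being added to a bashrc-style file only once, so that re-running the provisioner does not duplicate it, followed by the apt cache cleanup that has to come last.

```shell
#!/bin/sh
# Sketch of the end of a box-specific provisioning script.
# Function names and the alias used in the usage note are hypothetical.

# Append an alias to a bashrc-style file unless that exact line is
# already present.
add_alias_once() (
    file="$1"; name="$2"; value="$3"
    line="alias $name='$value'"
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
)

# Cleanup common to all boxes. It is called last so that it also clears
# packages downloaded by the box-specific steps. Requires root.
cleanup_apt() {
    apt-get clean
}
```

A provisioning script would call, for example, `add_alias_once "$HOME/.bashrc" ll 'ls -l'` and finish with `cleanup_apt`.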


3. Add a new public Google Cloud image

The public Google Cloud image is hosted in the hutlab biobakery bucket. Follow these instructions to add a new public image to the bucket.

  1. Build a new bioBakery Google Cloud instance

    a. SKIP the step that installs tools with licenses.

  2. Stop the instance

  3. Go to Compute Engine -> Snapshot -> Create snapshot and create a snapshot of the stopped instance

  4. Delete the original instance

  5. Go to Compute Engine -> Disks -> Create disk

    a. Create a disk from the snapshot

    i.  Name the disk: `disk-biobakery-image`
    

    b. Create a temp disk

    i.  Name the disk: `disk-temp`
    ii. This disk is 50% larger than the snapshot disk.
    iii. This disk is blank.
    
  6. Go to Compute Engine -> VM instances -> Create new instance

    a. Create a new instance with the following:

    i.  Ubuntu 16.04 (10 GB disk, 1 core, 3.75 GB RAM)
    ii. `Identity and API Access -> Add Storage read / write`
    
  7. AFTER the instance has been created add the disks by editing the instance properties

    a. Add the snapshot and temp disk

    b. As of June 2016, there is an issue with the boot ordering of Google Cloud instances. Adding the additional disks when the instance is created causes errors in the remaining steps when exporting the image. Only add the additional disks after the instance has been created.

  8. SSH to the instance to run the script to package and export the image

    a. Clone the bioBakery repository

    i.  `$ sudo apt-get install git`
    ii. `$ git clone https://github.com/biobakery/biobakery`
    

    b. Run the script to package and export the bioBakery image

    i.  `$ cd biobakery/google_cloud`
    ii. `$ bash -x package_biobakery.sh $VERSION` (replace `$VERSION`
        with the version number, e.g. 1.1)
    iii. This script will take some time to run. It will shred files
        (following AWS security best practices), build the image,
        and then export it to the bioBakery bucket.
    
  9. Delete the instance

  10. Go to Storage -> Browser -> biobakery_bucket and click on the link to make the new image public

  11. Follow the basic user instructions to run bioBakery in Google Cloud to create a new instance.

  12. Test the bioBakery install.

  13. Delete the test instance.
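
Steps 2 through 7 use the web console, but the same operations are available in the `gcloud` CLI. The following is a sketch under several assumptions: the boot disk is assumed to have the same name as the instance (the Google Cloud default), the snapshot and packaging-instance names are hypothetical, and the packaging instance from step 6 is assumed to exist already. The `temp_disk_size_gb` helper computes the "50% larger" temp disk size, rounded up to a whole GB.

```shell
#!/bin/sh
# Sketch: snapshot a stopped instance and prepare the two disks used to
# export the image. DRY_RUN=1 (the default) only prints the commands.

DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "+ $*"
    else
        "$@"
    fi
}

# Temp disk size: 50% larger than the snapshot disk, rounded up.
temp_disk_size_gb() {
    echo $(( ($1 * 3 + 1) / 2 ))
}

prepare_export_disks() (
    instance="$1"; zone="$2"; size_gb="$3"  # source instance, zone, disk size in GB
    snapshot="snapshot-biobakery"           # hypothetical snapshot name
    builder="instance-biobakery-export"     # hypothetical packaging instance (step 6)

    run gcloud compute instances stop "$instance" --zone "$zone"
    # Assumes the boot disk carries the instance name.
    run gcloud compute disks snapshot "$instance" --snapshot-names "$snapshot" --zone "$zone"
    run gcloud compute instances delete "$instance" --zone "$zone"
    run gcloud compute disks create disk-biobakery-image --source-snapshot "$snapshot" --zone "$zone"
    run gcloud compute disks create disk-temp --size "$(temp_disk_size_gb "$size_gb")GB" --zone "$zone"
    # Attach the disks only after the packaging instance exists (step 7).
    run gcloud compute instances attach-disk "$builder" --disk disk-biobakery-image --zone "$zone"
    run gcloud compute instances attach-disk "$builder" --disk disk-temp --zone "$zone"
)

# Example invocation with placeholder names; prints the commands only.
prepare_export_disks INSTANCE ZONE 10
```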


4. Add a new demo to test suites

Follow these instructions to add a new demo to the biobakery test suite. To add a demo you only need to provide input files, output files, and a bash script; you do not need to edit the biobakery demos software. The software discovers any new demos added to its sub-folders and makes them available as new tool options.

  1. Make new data folders for your tool (replace NEWTOOL with tool name)

    a. `$ mkdir -p biobakery_tests/data/NEWTOOL/input`
    b. `$ mkdir -p biobakery_tests/data/NEWTOOL/output`
    
  2. Add the input files for the demo to the folder biobakery_tests/data/NEWTOOL/input

  3. Add the output files from running the demo to the folder biobakery_tests/data/NEWTOOL/output

  4. Create a bash script with demo commands for your tool (see biobakery_tests/demos/kneaddata.bash as an example)

    a.   This bash script should be added to the folder
        `biobakery_tests/demos/`
    b.   This bash script should be named `NEWTOOL.bash` (replace
        NEWTOOL with tool name)
    c.   Note in the bash script `$INPUT_FOLDER` and `$OUTPUT_FOLDER`
        will be replaced with the full paths to these folders.
    
  5. Reinstall biobakery_tests (this will add the new files to the install folder): `$ python setup.py install`

  6. Test running your new demo (replace NEWTOOL with tool name): `$ biobakery_tests --tool NEWTOOL --mode test`
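
Steps 1 and 4 can be scaffolded with a short script. This is a sketch only: the `scaffold_demo` function is hypothetical, and the body of the generated script is a placeholder rather than a real demo (see kneaddata.bash for the actual format). The quoted heredoc keeps `$INPUT_FOLDER` and `$OUTPUT_FOLDER` literal so that they can be substituted later.

```shell
#!/bin/sh
# Sketch: create the data folders and a starter demo script for a new tool.

scaffold_demo() (
    root="$1"    # path to the biobakery_tests source folder
    tool="$2"    # tool name, used for the folder and script names

    mkdir -p "$root/data/$tool/input" "$root/data/$tool/output" "$root/demos"

    script="$root/demos/$tool.bash"
    if [ ! -e "$script" ]; then
        # Quoted delimiter: the variables below are written out unexpanded.
        cat > "$script" <<'EOF'
#!/bin/bash
# Placeholder: replace the next line with the demo commands for the tool.
echo "demo input: $INPUT_FOLDER, demo output: $OUTPUT_FOLDER"
EOF
    fi
    # Print the script path so the caller knows which file to edit.
    echo "$script"
)
```

For example, `scaffold_demo biobakery_tests NEWTOOL` creates the folders and prints the path of the new script, which is left untouched if it already exists.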


5. Set up for a workshop

Students can connect to bioBakery Google Cloud instances for a workshop in three ways: web browser, VNC viewer, and SSH. The following instructions describe how to set up a workshop that supports all three methods.

  1. Log in to the Google Cloud console using the hutlab.public account: http://console.cloud.google.com

    a. If you are not prompted to log in, log out of your personal Google account and try the link again. Alternatively, log in through an incognito window if you would like to stay logged in to your personal Google Cloud account in other windows.

  2. In most cases a template has already been created for the workshop. If you have been given a template, please skip this step. If you need to create a template, go to Compute Engine -> Instance templates to create a new template that will capture the settings for all of the instances.

    a. In general, instances should have at least 1 core, 6.5 GB of memory, and an image that includes a desktop with vncserver installed and set to run on startup. Instances should also have the latest bioBakery tool suite installed. The bioBakery public image can be used for the instances after starting VNC to set the password and configuring VNC to run on startup. Small modifications can be made to the image for the workshop if needed, such as installing additional tools or changing the desktop configuration (e.g. new menu items, larger fonts).

    b. For Physalia, an instance with 4 cores and 15 GB of memory is used. This corresponds to n1-standard-4 in the legacy (first generation) n1 series. Access is limited to what each instance requires (all API access is either disabled or read-only). Traffic settings depend on the type of access required to the instance. The default settings are enough to allow students to access the instances through the Guacamole server by internal IP addresses.

  3. Go to Compute Engine -> Instance groups ( https://console.cloud.google.com/compute/instanceGroups ) to create one or more groups of instances for the workshop. Make sure to name it something with "biobakery" so the automated configuration script will pick up these instances. When creating the group, select a managed instance group (stateless), turn "off" auto-scaling, select the region + zone (see note), and enter the total number of instances required, then click "Create".

    a. NOTE: If a large number of instances is required (> 10), create several groups of instances, each in a different region, so that you do not exceed the quotas. For information on quotas, including limits and current usage, go to IAM & Admin -> Quotas ( https://console.cloud.google.com/iam-admin/quotas ). To select zones and regions for the new groups, filter the list by CPUs. Note that region limits and zone limits both apply.

    b. Once this step is complete, students can access the instances through SSH.

  4. Go to Compute Engine -> VM instances ( https://console.cloud.google.com/compute/instances ) and start the guacamole server (name instance-guacamole-server) and also start the reverse proxy server (name instance-reverse-proxy). To start an instance, click the box on the left of the instance name in the list and then click on "Start / Restart" instance from the menu at the top.

    a. NOTE: An automated script will run on the guacamole server to find the group of student instances and update its configuration to include their IPs. This script will run about four minutes after the machine is started. The file $HOME/update_config/config.log will contain the log of the runs for this script. The file $HOME/update_config/mysql_commands_run.txt will contain the list of the configuration commands run plus the mapping of the google cloud instance name to guacamole instance name.

    b. NOTE: Depending on the number of student instances you might want to increase the machine size for the guacamole server. In a class of 71 instances an 8 core, 30 GB machine was used. This worked well and appeared to be more than enough computing resources. A machine of 4 cores and 15 GB could possibly be used for a workshop of 71 instances.

    c. The proxy instance has a static external IP so it will always be at the IP of the huttenhower guacamole redirect.

  5. Log in to the hutlab guacamole server as student: http://huttenhower.sph.harvard.edu/guacamole . Check that all of the expected connections are up and running. If there are any issues, refer to the log on the guacamole server and, if needed, rerun the configuration script at $HOME/update_config/run_update.bash.

  6. (Optional) Depending on the number of student connections, you might need to add or remove access to the new/existing connections for the student account. If so, log out as admin and log in as student to make sure all of the instances are visible to this account.

  7. (Optional) To allow students to connect through VNC, add a tag vnc-access to each of the instances by editing the instance properties. Once the changes to the properties have been saved VNC access is available. It is not recommended to set this option by default in the group template if VNC access is not being used because it allows access to the instances through their external IPs.

NOTE: It is very important to delete the student instances when the workshop is complete. Go to Compute Engine -> Instance groups to delete all of the student instances, then go to Compute Engine -> VM instances and stop the guacamole server.
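
The group creation, server start, VNC tagging, and teardown described in this section can also be done with the `gcloud` CLI. The sketch below makes several assumptions: the group, template, and zone names are placeholders, the guacamole and reverse proxy servers are assumed to be in the same zone as the group, and the `run` helper with its `DRY_RUN` switch is added for illustration. It also checks that the group name contains "biobakery", since the configuration script relies on that.

```shell
#!/bin/sh
# Sketch: set up and tear down a workshop with the gcloud CLI.
# DRY_RUN=1 (the default) only prints the commands.

DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "+ $*"
    else
        "$@"
    fi
}

start_workshop() (
    group="$1"; template="$2"; size="$3"; zone="$4"
    # The configuration script only picks up groups named with "biobakery".
    case "$group" in
        *biobakery*) ;;
        *) echo "group name must contain 'biobakery': $group" >&2; exit 1 ;;
    esac
    run gcloud compute instance-groups managed create "$group" --template "$template" --size "$size" --zone "$zone"
    run gcloud compute instances start instance-guacamole-server instance-reverse-proxy --zone "$zone"
)

# Optional: allow VNC access to one student instance.
enable_vnc() (
    run gcloud compute instances add-tags "$1" --tags vnc-access --zone "$2"
)

stop_workshop() (
    group="$1"; zone="$2"
    run gcloud compute instance-groups managed delete "$group" --zone "$zone"
    run gcloud compute instances stop instance-guacamole-server --zone "$zone"
)

# Example invocation with placeholder names; prints the commands only.
start_workshop biobakery-workshop TEMPLATE 10 ZONE
```

Calling `stop_workshop biobakery-workshop ZONE` after the workshop deletes the student instances and stops the guacamole server.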
