
Publishing Indexes to Main

In order to avoid duplication, we currently build indexes (e.g. with Data Managers) on Test and then, once tested, "publish" them to Main. However, since the introduction of CVMFS, we are duplicating the data anyway.

Current Procedure

Adding New Data Managers

New DMs can be installed using the Test Installer. If the DM needs more memory than the default allocation (8 GB), be sure to modify job_conf.xml in the playbook; see the entries for existing DMs (the datamanager server should be the handler). Test will need to be restarted after the changes are made (ansible-env test config will do this for you).
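For reference, the restart step by itself looks like this (behavior per the note above):

$ ansible-env test config   # applies the playbook config changes and restarts Test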

See issue #31 for important details about paths that need to be fixed for newly installed DMs.

Once indexes are built and ready to publish (after the procedure below), you will need to update /cvmfs/data.galaxyproject.org/managed/bin/managed.py so that it understands the new index. The necessary changes include adding the new index to:

  1. the locmap dict. The value is the path to the DM's loc file, relative to /galaxy-repl. If the DM has multiple loc files, this can be a list.
  2. the loccols dict. The value is a dict (or list of dicts) with keys corresponding to the build, dbkey, and path data table columns. Not all are required (depending on which columns are present in the data table for that location file).
  3. the index_subdirs dict, if the index directory is a subdirectory of the genome build in /galaxy-repl/manageddata/data. The value is the name of the subdirectory in the genome directory.
  4. the index_dirs dict, if the index directory is a subdirectory of /galaxy-repl/manageddata/data.

Once these changes are complete, add the appropriate data table entry to /cvmfs/data.galaxyproject.org/managed/location/tool_data_table_conf.xml. See the existing entries for examples.

Updating Existing Data

Go to https://test-datamanager.galaxyproject.org/ and use the installed DMs to create new indexes. New data may not be available on Test until it has been restarted. Once the data has been verified, it can be published to CVMFS with:

$ ssh g2test@cvmfs0-psu0.galaxyproject.org                       # the CVMFS repository (stratum 0) server
$ cvmfs_server transaction data.galaxyproject.org                # open a transaction on the repository
$ /cvmfs/data.galaxyproject.org/managed/bin/managed.py <index>   # where <index> is a key in `locmap`
$ cvmfs_server publish -a <some_tag> -m <some_message> data.galaxyproject.org   # tag and publish the transaction

NOTE: If the transaction is too large, your SSH session might time out before the publish command finishes, aborting the publish. To prevent this, one can use something like tmux, screen, or nohup.
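For example, one way to keep the publish from being killed by a dropped connection is to run it inside tmux (the session name here is arbitrary):

$ tmux new -s cvmfs-publish      # start a named session on the CVMFS server
$ cvmfs_server publish -a <some_tag> -m <some_message> data.galaxyproject.org
# detach with Ctrl-b d; if the SSH session drops, log back in and reattach with:
$ tmux attach -t cvmfs-publish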

A bit about what's going on: managed.py is essentially rsyncing data from Corral @ TACC into CVMFS, and it modifies paths and location files as it does so. We probably should not modify the data paths, but I chose to do so to ease the CVMFS catalog size splits. DM data is installed on Corral at:

/galaxy-repl/manageddata/data/<genome_build>/<index_name>/<build_id>/...

I copy this to CVMFS as:

/cvmfs/data.galaxyproject.org/managed/<index_name>/<build_id>/...

Location files are copied from their DM paths (/galaxy-repl/test/tool_data/...) to /cvmfs/data.galaxyproject.org/managed/location, and the paths within them are updated to the correct locations in CVMFS.
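As a concrete illustration, using the rnastar_index2/hg19 example from the Bridges section below (the loc file name is an assumption):

# Corral (as installed by the DM):
#   /galaxy-repl/manageddata/data/hg19/rnastar_index2/hg19/...
# CVMFS (as published by managed.py):
#   /cvmfs/data.galaxyproject.org/managed/rnastar_index2/hg19/...
# The corresponding loc file lands in /cvmfs/data.galaxyproject.org/managed/location/
# (e.g. rnastar_index2.loc) with its paths rewritten to the CVMFS locations above.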

You can force these changes to be rapidly distributed by logging in to the stratum 1 CVMFS servers as g2test and running cvmfs_server snapshot data.galaxyproject.org && systemctl restart squid, then wiping the cache on the Main servers (galaxy-web-{01..04}) with /usr/local/bin/cvmfs_wipecache (sorry, no automated process yet). If you don't do this, the changes should propagate on their own over a few hours. Main may need to be restarted (gracefully) to pick up the new indexes on the web handlers (galaxy-web-{05,06}):

[g2main@galaxy-web-05 ~]$ galaxyctl graceful
galaxy_main_uwsgi:zergling0: started
[g2main@galaxy-web-06 ~]$ galaxyctl graceful
galaxy_main_uwsgi:zergling0: started

Protip

If you really want to watch the workers spin up and down while this is going on, you could use the command galaxyctl graceful && watch $HOME/bin/supervisorctl status

Jetstream

Jetstream mounts /cvmfs/data.galaxyproject.org, so no additional steps should be necessary to make the data available there.

Legacy, no longer required, documented for posterity:

With the Parrot connector, CVMFS is now available on Bridges, so the following section is for anyone curious.

Stampede and Bridges

Stampede and Bridges do not mount CVMFS, so it's necessary to copy the data. Stampede is currently behind and I need to change some things to update it, but it should be working fine for most data. At present, the only tool running on Bridges that needs reference data is (RNA) STAR, so I copied those indexes by hand. However, due to the path changes we make for Main, the following information is relevant:

On Test, we use the data as installed by the DMs, but on Main we modify the structure so that it lives under a per-index directory. For example, on Test:

hg19/rnastar_index2/hg19/...

On Main:

rnastar_index2/hg19/...

In order to make the data work for both, I copy from TACC using the Test layout and then create symlinks.

To copy data from TACC (any VM at TACC should work):

$ cd /galaxy-repl/manageddata/data
$ for d in $(find . -type d -name rnastar_index2); do rsync -avP --relative /galaxy-repl/manageddata/data/${d} xcgalaxy@br005.bridges.psc.edu:/pylon2/mc48nsp/xcgalaxy/data; done

The use of find . is critical: when using --relative, any path components before a /./ in the source path are stripped on the remote side.
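To illustrate with the loop above (the hg19 build is just an example):

# With d=./hg19/rnastar_index2 the source path expands to
#   /galaxy-repl/manageddata/data/./hg19/rnastar_index2
# so rsync --relative creates hg19/rnastar_index2/... under the destination,
# rather than galaxy-repl/manageddata/data/hg19/rnastar_index2/...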

To create symlinks for Main (as xcgalaxy on Bridges):

$ cd /pylon2/mc48nsp/xcgalaxy/data
$ mkdir -p rnastar_index2
$ for b in */rnastar_index2/*; do ln -s ../$b rnastar_index2/$(basename $b); done
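
For example, for an hg19 STAR index the loop above creates a symlink like:

# rnastar_index2/hg19 -> ../hg19/rnastar_index2/hg19
# (relative to rnastar_index2/, this resolves back to the Test-layout copy)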

Future Changes

The need for copying can be eliminated. The changes to support this are outlined in #30.