Create a standardized format for reference genome files and indexes

Refgenie: Reference Genome Indexer

Refgenie creates a standardized folder structure for reference genome files and indexes. You can download pre-built genomes or use the script to build your own for any genome you like. The minimal input is a single FASTA file. It also comes with an optional Docker container (nsheff/refgenie) so you don't have to install any software.

Rationale

NGS pipelines require reference genomes for alignment and other computation. Usually, each pipeline has its own way of organizing reference genomes and may require the assembly to follow that structure. As a result, you must either build a new reference genome folder layout for each pipeline or, at the very least, describe your layout to each pipeline individually. This makes it hard to share pipelines across environments and people, because each has a different method for organizing index files and passing them to the pipelines.

If everyone built pipelines to follow a standard structure for reference genomes, then a pipeline could take just a string describing the genome (e.g. "hg38") and would know how to find the indexes it needs for that genome.

Refgenie standardizes the reference genome index folder structure, so you can build your pipelines around that standardized format. This makes it easy to switch to a new reference genome, or to share pipelines and reference data among collaborators. Pipelines then need only a single variable, the genome name, and you can even use an environment variable to make all pipelines work seamlessly. The goal is similar to that of the iGenomes project, but iGenomes only provides finalized files for download and no software for building a reference from your own genomes. So you can't use it to produce a standardized format for your internal spike-in genomes or for species it doesn't provide.
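As a rough illustration of the "single variable" idea, a pipeline could resolve index locations from nothing but the genome name and the GENOMES environment variable. This is only a sketch: the helper names and the subfolder name `indexed_bowtie2` are assumptions made for the example, not a layout confirmed by this README.

```python
import os


def genome_folder(genome, genomes_root=None):
    """Return the folder holding all files for one genome, e.g. 'hg38'."""
    # Fall back to the GENOMES environment variable when no root is given.
    root = genomes_root or os.environ.get("GENOMES")
    if not root:
        raise ValueError("Set GENOMES or pass genomes_root explicitly")
    return os.path.join(root, genome)


def index_path(genome, index_name, genomes_root=None):
    """Return the path of one index folder inside the genome folder.

    index_name is a hypothetical subfolder name used for illustration.
    """
    return os.path.join(genome_folder(genome, genomes_root), index_name)


# The pipeline only needs the genome name; the root comes from the environment
# (passed explicitly here so the example runs without GENOMES being set).
print(index_path("hg38", "indexed_bowtie2", genomes_root="/data/genomes"))
```

With a helper like this, switching a pipeline to another assembly would mean changing only the genome string.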

You can download pre-computed tarballs of refgenie assemblies if you like, but more importantly, refgenie is a script, so you can produce the standard for whatever genome you want.

Download pre-built indexed reference genomes

These are pre-built indexes for common genomes. The downloads are gzipped tar archives, so you will need to extract them after downloading. The complete collection is listed at http://big.databio.org/refgenomes/:

Mirror 1:

Mirror 2 (use if mirror 1 is down):
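If you would rather unpack a downloaded archive from Python than from the shell, the standard-library `tarfile` module is enough. The sketch below assumes the archive has already been downloaded; the function name and the archive file name in the usage note are placeholders, not actual download names.

```python
import os
import tarfile


def unpack_genome(archive_path, genomes_root):
    """Extract a gzipped tar archive into genomes_root.

    Returns the sorted top-level names found in the archive.
    Only extract archives from a source you trust.
    """
    os.makedirs(genomes_root, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(genomes_root)
    # Report the top-level entries so the caller can see what was unpacked.
    return sorted({name.split("/")[0] for name in names})
```

A call might look like `unpack_genome("hg38.tgz", os.environ["GENOMES"])`, where `hg38.tgz` stands in for whichever archive you downloaded.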

Index list

Refgenie currently builds indexes for tools such as bowtie2, hisat2, and bismark (for DNA methylation), among others. You can find the complete list in the config file. All of them are optional; you only have to build indexes for the tools you intend to use, and you can add more later.

Indexing your own reference genome

  • Install Pypiper (pip install --user --upgrade https://github.com/epigen/pypiper/zipball/master) (Refgenie requires version >= 0.5)
  • Clone this repo (e.g. git clone git@github.com:databio/refgenie.git)
  • Install the software for each index you want to build, and either put it in your path (the default) or specify its location in your refgenie config file. Alternatively, use the Docker version, in which case you don't have to install anything but pypiper and Docker.

Run refgenie with: src/refgenie.py -i INPUT_FILE.fa (INPUT_FILE.fa is a FASTA file of your reference genome; it can be either a local file or a URL).

Optional

  • Set a shell environment variable called GENOMES to point to where you want your references saved.
  • Choose which indexes you want to include by toggling them in the config file.
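If you drive refgenie from another Python script, the invocation and the GENOMES setting above could be wrapped roughly as follows. This is a sketch under assumptions: the helper names are made up for the example, the script path assumes you run from the root of the cloned repo, and only the `-i` flag shown above is used.

```python
import os
import subprocess


def refgenie_command(input_fasta, script="src/refgenie.py"):
    """Build the argument list for indexing one FASTA file (local path or URL)."""
    return [script, "-i", input_fasta]


def refgenie_env(genomes_root):
    """Copy the current environment and point GENOMES at the output location."""
    return dict(os.environ, GENOMES=genomes_root)


def run_refgenie(input_fasta, genomes_root):
    """Run the indexer; raises CalledProcessError if it exits with an error."""
    return subprocess.run(
        refgenie_command(input_fasta),
        env=refgenie_env(genomes_root),
        check=True,
    )


# Show the command without running it.
print(" ".join(refgenie_command("INPUT_FILE.fa")))
```

Setting GENOMES only in the child process environment keeps the rest of your shell session unchanged.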

To build a standard reference for a popular genome, follow one of the recipes.

Docker

I have produced a docker image on DockerHub (nsheff/refgenie) that has all of these packages pre-installed, so you can run the complete indexer without worrying about paths and packages. Just clone this repo and run it with the -d flag. For example:

~/code/refgenie/src/refgenie.py -d --input rn6.fa --outfolder $HOME

Contributing

Pull requests welcome! Add an indexer if you like.