Common reference genomes hosted on AWS S3
In NGS bioinformatics, a typical analysis run involves aligning raw DNA sequencing reads against a known reference genome. A different reference is needed for every species, and many species have several references to choose from. Each tool then builds its own indices against these references. As such, one analysis run typically requires a number of different files. For example: the raw underlying DNA sequence, annotation (GTF files) and index files for use with the chosen alignment tool.
These files are quite large and take time to generate. Downloading and building them for each AWS run often takes a significant fraction of the total run time and resources, which is very wasteful. To help with this, we have created an AWS S3 bucket containing the illumina iGenomes references, with a few additional indices for extra tools on top of this base dataset. The iGenomes initiative aims to collect and standardise references and tool indices for a number of common species.
This data is hosted in an S3 bucket (~5TB) and crucially is uncompressed (unlike the .tar.gz files held on the illumina iGenomes FTP servers). AWS runs can pull just the required files to their local file storage before running. This has the advantage of being faster, cheaper and more reproducible.
To make usage easier, this repository contains a script (aws-igenomes.sh) which can sync the AWS-iGenomes for you. It requires the AWS command line tools to be installed and configured with authentication. Required references can be supplied on the command line or given through prompts when running the script.
This repository is hosted using GitHub pages, so the script can be run in a single command as follows:
curl -fsSL https://ewels.github.io/AWS-iGenomes/aws-igenomes.sh | bash
For more details, see https://ewels.github.io/AWS-iGenomes/
If you'd prefer to just get a sync command for the files you need, you can use the web-based command builder that's available at https://ewels.github.io/AWS-iGenomes/
The details of the S3 bucket are as follows:
- Bucket Name:
- Bucket ARN:
- Region: EU (Ireland)
Description of Files
A full list of available files can be seen in
The following species have reference builds available:
- Arabidopsis thaliana
- Bacillus cereus ATCC 10987
- Bacillus subtilis 168
- Bos taurus
- Caenorhabditis elegans
- Canis familiaris
- Danio rerio
- Drosophila melanogaster
- Enterobacteriophage lambda
- Equus caballus
- Escherichia coli K 12 DH10B
- Escherichia coli K 12 MG1655
- Gallus gallus
- Glycine max
- Homo sapiens
- Macaca mulatta
- Mus musculus
- Mycobacterium tuberculosis H37RV
- Oryza sativa japonica
- Pan troglodytes
- Pseudomonas aeruginosa PAO1
- Rattus norvegicus
- Rhodobacter sphaeroides 2.4.1
- Saccharomyces cerevisiae
- Schizosaccharomyces pombe
- Sorangium cellulosum So ce 56
- Sorghum bicolor
- Staphylococcus aureus NCTC 8325
- Sus scrofa
- Zea mays
Most of these species then have references from multiple sources and builds. For example, Mus musculus has the following:
Within each reference build, the following resources are typically available (with a few exceptions):
- Gene annotation in
- Whole genome files
- Separate chromosomes
- Abundant sequences
- Alignment indices for the following tools:
- For some genomes:
- smRNA (miRBase)
An additional special-case is the GATK bundles, available for Homo sapiens (
See Data origin below for more details of how these files were generated.
Costs, billing and authentication
The S3 bucket is currently set to be completely open access (there were problems with the previous Requester Pays policy). This will remain the case until the credits awarded to fund this project from Amazon run out or expire (hopefully stable for some time yet).
Note that if possible, it's best for us if you run in the same region as this S3 bucket (
Then there should be no data transfer fees and the resource should stay around for longer.
From the EC2 FAQ:
There is no Data Transfer charge between two Amazon Web Services within the same region (i.e. between Amazon EC2 US West and another AWS service in the US West). Data transferred between AWS services in different regions will be charged as Internet Data Transfer on both sides of the transfer.
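As a rough illustration of the point above, the following sketch compares the region you are running in against the bucket's region. It assumes the region code `eu-west-1` for EU (Ireland) and reads your region from the `AWS_DEFAULT_REGION` environment variable; `check_region` is a hypothetical helper name and is not part of this repository.

```shell
#!/usr/bin/env bash
# Sketch only: compare a region against the bucket's region.
BUCKET_REGION="eu-west-1"   # AWS region code for EU (Ireland)

# check_region is a hypothetical helper; pass it the region your instances run in.
check_region() {
    local my_region="$1"
    if [ "$my_region" = "$BUCKET_REGION" ]; then
        echo "same region: no inter-region transfer charge expected"
    else
        echo "different region ($my_region): transfers may be billed"
        return 1
    fi
}

# Falls back to the string "unset" when AWS_DEFAULT_REGION is not defined.
check_region "${AWS_DEFAULT_REGION:-unset}" || true
```

The function returns a non-zero status for a mismatch, so it could be used as a guard at the top of a setup script if you want to stop before any data is transferred.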
How you use this resource largely depends on how you're using AWS. Very generally however, you can retrieve your required data by using the AWS Command Line Interface.
For example, using the aws s3 sync command:
aws s3 sync s3://ngi-igenomes/igenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/ ./my_refs/
If the aws tool isn't installed, probably the easiest way to get it is using pip:
pip install --upgrade --user awscli
Remember that you must configure the tool with some kind of AWS authentication to access the contents of the s3 bucket.
For more information and help, see the AWS CLI user guide.
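If you need several references, it may help to generate the sync commands rather than type them by hand. The sketch below only prints a command and does not run it. It assumes that other references follow the same `<species>/<source>/<build>/<resource>/` layout as the example path above, and that the bucket is still open access so that `--no-sign-request` works. `igenomes_sync_cmd` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: build (but do not run) an `aws s3 sync` command for one resource.
IGENOMES_BASE="s3://ngi-igenomes/igenomes"   # prefix taken from the example command above

# igenomes_sync_cmd is a hypothetical helper, not part of this repository.
# Arguments: species, source, build, resource path, local destination.
igenomes_sync_cmd() {
    local species="$1" src="$2" build="$3" resource="$4" dest="$5"
    # --no-sign-request skips request signing; assumes the bucket is open access
    printf 'aws s3 sync --no-sign-request %s/%s/%s/%s/%s/ %s\n' \
        "$IGENOMES_BASE" "$species" "$src" "$build" "$resource" "$dest"
}

# Print the command for the STAR index used in the example above
igenomes_sync_cmd Homo_sapiens Ensembl GRCh37 Sequence/STARIndex ./my_refs/
```

Printing the command first lets you check the path before transferring anything. If you have configured credentials instead, drop `--no-sign-request` from the generated command.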
Usage with Nextflow
Nextflow is a powerful workflow manager allowing the creation of bioinformatics analysis pipelines. It was created to help the transition from traditional academic HPC systems to cloud computing. As such, it has extensive built-in support for a number of AWS features. One such feature is native integration with s3. This means that you can specify paths to required reference files in your pipeline which are stored in s3 and Nextflow will automatically retrieve them.
The repository contains an example Nextflow config file containing common paths and a suggested usage example:
For an example of this in action, see our NGI-RNAseq pipeline. The aws profile config contains s3 paths and our regular HPC config contains comparable regular file paths. This allows us to run the pipeline on either our HPC system or AWS with the same command and no extra setup.
Data origin
This resource is based on the illumina iGenomes references. These were downloaded and unpacked in April 2016.
A full list of available files can be seen in this repository:
module load star/2.5.1b
STAR --runMode genomeGenerate --runThreadN 8 --genomeDir ./ --genomeFastaFiles genome.fa --sjdbGTFfile genes.gtf --sjdbOverhang 100
(if no GTF file available, --sjdbGTFfile genes.gtf --sjdbOverhang 100 was not specified).
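The conditional handling of the GTF options described above can be sketched as follows. This is not the script that was actually used (see the UPPMAX repository for that). It is a minimal illustration that prints the STAR command with or without the annotation options, and `star_index_cmd` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: print the STAR genomeGenerate command, adding the GTF options
# only when an annotation file is given.

# star_index_cmd is a hypothetical helper. Arguments: FASTA file, optional GTF file.
star_index_cmd() {
    local fasta="$1" gtf="${2:-}"
    local cmd="STAR --runMode genomeGenerate --runThreadN 8 --genomeDir ./ --genomeFastaFiles $fasta"
    # Append the annotation options only when a GTF file was supplied
    if [ -n "$gtf" ]; then
        cmd="$cmd --sjdbGTFfile $gtf --sjdbOverhang 100"
    fi
    echo "$cmd"
}

star_index_cmd genome.fa genes.gtf   # annotation available
star_index_cmd genome.fa             # no GTF file available
```

The output is meant for inspection or logging; file names containing spaces would need quoting before the printed command could be executed.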
module load bowtie/1.1.2
module load bowtie2/2.2.6
module load bismark/0.14.5
bismark_genome_preparation ./
bismark_genome_preparation --bowtie2 ./
Please note that b37/CEUTrio.HiSeq.WGS.b37.NA12878.bam and associated files are not included. This file is ~355GB and, with the FTP download rate limiting from Broad, it was going to take nearly a year to transfer.
STAR, Bismark and BED12 additions were kindly done by the UPPMAX team. Full details and the exact scripts used for this can be found at github.com/UPPMAX/bio-data.
We are currently in discussion with the Open Data team at Amazon about making this into a Public Data resource. Until this happens, Amazon have been kind enough to provide us with a grant to cover the expenses of hosting this data on S3 for one year (April 2017 until April 2018).
Version v0.3 (dev)
- Made a web interface for generating aws s3 sync commands (not everyone likes random command line scripts..)
- Now that Amazon are taking the cost of the hosting, everything is fully public
- Added --no-sign-request to the commands so that they work without authentication
- Added new GRCh37 and GRCh38 builds for GATK
- Different to the existing hg18 and hg19 builds only in that the file organisation is cleaner and consistent with the rest of iGenomes (old builds left for backwards-compatibility)
- Contain new indexes for BWA. More to be added in the future.
Version v0.2 - 2016-05-25
- Added GATK bundles
hg38 from the Broad FTP download
- Minor download script updates
Version v0.1 - 2016-05-23
Initial release. Repository created with file-list of the iGenomes resource, with added BED12, STAR and Bismark indices. Download bash script written and basic website created at https://ewels.github.io/AWS-iGenomes/
The iGenomes resource was created by illumina. All credit for the collection and standardisation of this data should go to them!
This S3 resource was set up and documented by Phil Ewels (@ewels). The additional references not found in the base iGenomes resource were created with the help of Wesley Schaal (@wschaal) - a system administrator at UPPMAX (Uppsala Multidisciplinary Center for Advanced Computational Science).