Athena is a read cloud assembler for metagenomes.
Recent updates
- Conda: Athena is now available through bioconda. Please ensure channels are properly set up for bioconda before installing. Also, note that you must create a new environment with conda create before installing, as referenced in #14162.
- v1.3 release: Updates to command-line arguments and logging.
To run Athena through a Docker image with Athena and its prerequisites already installed, please skip to the section Docker (and example dataset).
To install Athena in your native environment, the following prerequisites must be installed:
- python 2.7 on Mac and Linux; Athena is not compatible with python 3.x
- idba_ud -- please use this version, which is modified both to handle longer short-read lengths and to locally assemble subsampled barcoded read clouds. Ensure all compiled binaries, including idba_subasm, are in your $PATH
- samtools and htslib -- version 1.3 or later of samtools must be in your $PATH
- bwa-mem
- flye -- version 2.3.1
We recommend setting up a virtualenv prior to installing Athena (or using virtualenvwrapper):
sudo pip install virtualenv
virtualenv athena_meta
Then, to install
cd /path/to/athena_meta
pip install .
To test that Athena is installed correctly, you can simply run athena-meta from the command line, which should show help text without any error messages.
Overview:
- Generate input seed contigs for Athena with metaspades/idba_ud. Align barcoded input reads to seed contigs with bwa.
- Set up a config.json file, which specifies inputs to Athena
- Run Athena
Input read clouds must be specified as an uncompressed paired-end interleaved FASTQ, with the following tag information as in the example read pair below:
@NS500418:354:H27G3BGXY:3:12612:25572:11380 RG:Z:rg-1 BC:Z:GCCAATTCAAGTTT-1
TTCCATGTGGAAGTAGTTGTATTTGACGTAGCCCGCCATACCGTTTTCTGACATGAAGCGGTAATTCTCCTCAGAACCGTAGCCGGATACGGCCACCACCGTATGGGCCAACCTGTCATATCTGCTTGAGAAGGATTG
+
EEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEE6EEAEEEEAEEEEEEEEEEEEEEAEEEEEEEEEEEEEAEEEEEE
@NS500418:354:H27G3BGXY:3:12612:25572:11380 RG:Z:rg-1 BC:Z:GCCAATTCAAGTTT-1
CACGTGGTCTGGCGGGTCTCGCGCCACCTCTGGTTCGCCGTGGCCCTAACGGACAAGGACGCTACTTTCATGAGAATGAAGGAGGATGCCATGCGTAACGGCCAGACAAAGCCCGGTTACAACCTCCAGAACGGCACCGAGAACCAGA
+
EEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEEEEEEEAEEE
The barcoded interleaved FASTQ must satisfy the following:
- For each barcoded read, there must be a tag (either BC or BX, but not both) specifying the barcode. Each barcode must also end with a '-' followed by an integer sample identifier.
- The query name line for each read can have multiple tags, but these must be tab-delimited to be compatible with bwa mem specifying -C.
- The input FASTQ file must be barcode-sorted such that all reads with the same attached barcode appear in a contiguous block.
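The requirements above can be checked before running an assembly. The following is a minimal Python sketch, not part of Athena: the function names read_barcode and check_read_clouds are hypothetical, mates are assumed to share a read name (optionally with /1 and /2 suffixes), and tags are split on any whitespace, so tab-delimiting itself is not enforced.

```python
import re


def read_barcode(header):
    """Return the barcode on a FASTQ header line, or None if the read has none."""
    tags = header.split()[1:]  # everything after the read name
    barcodes = [t[5:] for t in tags if t.startswith(("BC:Z:", "BX:Z:"))]
    if len(barcodes) > 1:
        raise ValueError("more than one barcode tag: " + header.strip())
    if not barcodes:
        return None
    if not re.match(r"^.+-\d+$", barcodes[0]):
        raise ValueError("barcode lacks a '-<integer>' suffix: " + barcodes[0])
    return barcodes[0]


def check_read_clouds(lines):
    """Validate an interleaved, barcode-sorted FASTQ given as an iterable of lines.

    Returns the number of distinct barcodes; raises ValueError on a violation.
    Unbarcoded reads are tolerated anywhere (an assumption of this sketch).
    """
    finished = set()  # barcodes whose contiguous block has already ended
    current = None
    prev_name = None
    n_records = 0
    for i, line in enumerate(lines):
        if i % 4 != 0:  # only header lines carry the tags
            continue
        if not line.startswith("@"):
            raise ValueError("expected a FASTQ header at line %d" % (i + 1))
        n_records += 1
        name = re.sub(r"/[12]$", "", line.split()[0])
        if n_records % 2 == 0 and name != prev_name:
            raise ValueError("mates are not interleaved at line %d" % (i + 1))
        prev_name = name
        barcode = read_barcode(line)
        if barcode is None or barcode == current:
            continue
        if barcode in finished:
            raise ValueError("barcode %s is not in one contiguous block" % barcode)
        if current is not None:
            finished.add(current)
        current = barcode
    if n_records % 2:
        raise ValueError("odd number of records; expected interleaved pairs")
    return len(finished) + (1 if current is not None else 0)
```

For a real file, pass the open handle directly, for example check_read_clouds(open("/path/to/reads")), so the FASTQ is streamed rather than loaded into memory.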
Run metaspades or idba_ud out of the box to assemble your input barcoded read clouds into seed contigs. An example using metaspades:
metaspades.py --12 /path/to/reads -o /path/to/metaspades/out
Create a bwa index for the output short-read draft assembly. Then run bwa mem specifying -C, to pass the FASTQ tags through to the BAM, and -p, to specify that the paired-end FASTQ is interleaved:
bwa index /path/to/metaspades/out/contigs.fasta
bwa mem -C -p /path/to/metaspades/out/contigs.fasta /path/to/reads | samtools sort -o align-reads.metaspades-contigs.bam -
samtools index align-reads.metaspades-contigs.bam
Note that the resulting BAM must be position sorted and indexed.
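One way to confirm that a BAM is position sorted and indexed is to inspect its header and look for an index file next to it. The sketch below is illustrative only and not part of Athena; the function names are hypothetical, and it assumes the index was written by samtools index with its default naming (the BAM path plus .bai or .csi).

```python
import os
import subprocess


def is_coordinate_sorted(sam_header):
    """True if the @HD line of a SAM/BAM header declares SO:coordinate."""
    for line in sam_header.splitlines():
        if line.startswith("@HD"):
            return "SO:coordinate" in line.split("\t")
    return False


def has_index(bam_path):
    """True if a .bai or .csi index file sits next to the BAM."""
    return any(os.path.isfile(bam_path + ext) for ext in (".bai", ".csi"))


def bam_header(bam_path):
    """Read the header text with `samtools view -H` (samtools must be in $PATH)."""
    return subprocess.check_output(["samtools", "view", "-H", bam_path]).decode()
```

A typical check would be is_coordinate_sorted(bam_header("align-reads.metaspades-contigs.bam")) and has_index("align-reads.metaspades-contigs.bam").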
The configuration file is in JSON format and contains the following parts:
- input_fqs: path to barcoded reads (FASTQ). Must be uncompressed interleaved paired-end reads, which specify barcodes with the BC tag as specified above.
- ctgfasta_path: path to seed contigs (FASTA), which must be bwa indexed
- reads_ctg_bam_path: alignments of barcoded input reads to seed contigs (alignments must have BC tag with barcode information per read).
- (optional) cluster_settings: cluster compute environment to be used to perform assembly if a batch queueing submission system is available. Athena manages the environment using ipython-cluster-helper
A minimal config.json file contains the following:
{
  "input_fqs": "/path/to/fq",
  "ctgfasta_path": "/path/to/seeds.fa",
  "reads_ctg_bam_path": "/path/to/reads_2_seeds.bam"
}
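Before launching a run, it can help to confirm that the config points at files that exist and that the seed contigs are bwa indexed. The following Python sketch is one possible pre-flight check, not something Athena provides: check_config is a hypothetical name, each of the three keys is assumed to hold a single path string as in the minimal example, and the index suffixes are the five files that bwa index writes next to the FASTA.

```python
import json
import os

REQUIRED_KEYS = ("input_fqs", "ctgfasta_path", "reads_ctg_bam_path")
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def check_config(config_path):
    """Return a list of human-readable problems found in a config.json."""
    with open(config_path) as handle:
        config = json.load(handle)
    problems = []
    for key in REQUIRED_KEYS:
        value = config.get(key)
        if value is None:
            problems.append("missing key: " + key)
        elif not isinstance(value, str) or not os.path.isfile(value):
            problems.append("%s: no such file: %s" % (key, value))
    fasta = config.get("ctgfasta_path")
    if isinstance(fasta, str):
        for suffix in BWA_INDEX_SUFFIXES:
            if not os.path.isfile(fasta + suffix):
                problems.append("missing bwa index file: " + fasta + suffix)
    fastq = config.get("input_fqs")
    if isinstance(fastq, str) and fastq.endswith(".gz"):
        problems.append("input_fqs must be uncompressed: " + fastq)
    return problems
```

An empty list means none of these checks failed; it does not guarantee that the FASTQ or BAM contents are valid.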
An example cluster_settings entry specifying a compute cluster contains the following:
{
  "cluster_settings": {
    "cluster_type": "IPCluster",
    "processes": 128,
    "cluster_options": {
      "scheduler": "slurm",
      "queue": "normal",
      "extra_params": {"mem":16}
    }
  }
}
scheduler may be any of the schedulers supported by ipython-cluster-helper. Currently, these are Platform LSF ("lsf"), Sun Grid Engine ("sge"), Torque ("torque"), and SLURM ("slurm").
processes specifies the size of the job array to be used.
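As an illustration, a cluster_settings entry can be generated and validated programmatically. This Python sketch is hypothetical and not part of Athena: with_cluster_settings is an invented helper, and it assumes that cluster_settings sits in the same JSON object as the three input keys. The mem value is passed through unchanged; its unit is not stated here and is left to ipython-cluster-helper.

```python
import json

# Scheduler names listed above as supported by ipython-cluster-helper.
SUPPORTED_SCHEDULERS = ("lsf", "sge", "torque", "slurm")


def with_cluster_settings(config, scheduler, queue, processes, mem=None):
    """Return a copy of config with a cluster_settings entry added."""
    if scheduler not in SUPPORTED_SCHEDULERS:
        raise ValueError("unsupported scheduler: %r" % (scheduler,))
    if processes < 1:
        raise ValueError("processes must be at least 1")
    options = {"scheduler": scheduler, "queue": queue}
    if mem is not None:
        options["extra_params"] = {"mem": mem}
    merged = dict(config)  # leave the caller's dictionary untouched
    merged["cluster_settings"] = {
        "cluster_type": "IPCluster",
        "processes": processes,
        "cluster_options": options,
    }
    return merged


if __name__ == "__main__":
    base = {
        "input_fqs": "/path/to/fq",
        "ctgfasta_path": "/path/to/seeds.fa",
        "reads_ctg_bam_path": "/path/to/reads_2_seeds.bam",
    }
    print(json.dumps(with_cluster_settings(base, "slurm", "normal", 128, mem=16), indent=2))
```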
To check that all prerequisites are installed, run athena-meta --check_prereqs.
To run a tiny test assembly to check that Athena is properly set up, run athena-meta --test.
To run Athena on an input dataset, run athena-meta --config /path/to/config.json.
Note that the athena-meta command will continue running until all steps have completed. athena-meta runs locally with a single thread by default, but can be run using multiple threads by specifying --threads. Please be aware that each thread can require up to 4 GB of memory during the subassembly step, so the number of threads should be adjusted accordingly. If the config file provided specifies a cluster environment, the athena-meta command can be run from a head node, as it is itself a lightweight process.
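The memory guidance above translates into a simple upper bound on the thread count. The sketch below is a rough rule of thumb rather than anything Athena computes itself; max_threads is a hypothetical helper, and the 4 GB per thread figure is the worst case quoted above for the subassembly step.

```python
def max_threads(available_gb, cpu_count, gb_per_thread=4):
    """Largest thread count that fits in memory, capped by cores, never below 1."""
    by_memory = int(available_gb // gb_per_thread)
    return max(1, min(cpu_count, by_memory))


if __name__ == "__main__":
    threads = max_threads(available_gb=64, cpu_count=32)
    print("athena-meta --config /path/to/config.json --threads %d" % threads)
```

With 64 GB of free memory and 32 cores, this suggests 16 threads, since memory rather than core count is the limit.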
The output assembled contigs will be placed in a subdirectory of the directory in which config.json resides (in this case /path/to/results/olc/athena.asm.fa). Logging output for each step will also be in the subdirectory logs (in this case /path/to/logs), which can be used to debug in the event of an error.
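For scripting around a finished run, the output locations can be derived from the config path. This is a small Python sketch under the layout described above (results/olc/athena.asm.fa and logs next to config.json); expected_outputs and summarize_fasta are hypothetical helper names, not Athena functions.

```python
import os


def expected_outputs(config_path):
    """Return (contigs_fasta, logs_dir) located next to the given config.json."""
    base = os.path.dirname(os.path.abspath(config_path))
    contigs = os.path.join(base, "results", "olc", "athena.asm.fa")
    return contigs, os.path.join(base, "logs")


def summarize_fasta(lines):
    """Return (number_of_contigs, total_bases) for FASTA text given as lines."""
    contigs = 0
    bases = 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            contigs += 1  # each header line starts a new contig
        else:
            bases += len(line)
    return contigs, bases
```

After a run, summarize_fasta(open(expected_outputs("/path/to/config.json")[0])) gives a quick sanity check that contigs were written.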
A docker image is available for Athena. To download and run athena-meta on the example read clouds (~46MB), you can run the following commands:
# use 'curl -O' if you're on a mac without wget
wget https://storage.googleapis.com/gbsc-gcp-lab-bhatt-public/readclouds-l-gasseri-example.tar.gz
tar -xzf readclouds-l-gasseri-example.tar.gz
Assuming docker is installed, the following command can be used to assemble the example read clouds from within docker (make sure you are in the same directory where you downloaded and extracted readclouds-l-gasseri-example.tar.gz):
docker run -v `pwd`:/data -w /data/readclouds-l-gasseri-example abishara/athena-meta-flye-docker athena-meta --config config.json
This requires ~16GB of memory to run (for overlap assembly) and will take ~20 minutes to complete. If you are running docker for Mac, please make sure that your docker client has access to at least 16GB of memory (you may need to set this in Preferences).
The output can be found in the native host directory readclouds-l-gasseri-example.
Please cite the following publication:
- A. Bishara and E. Moss, et al. High-quality genome sequences of uncultured microbes by assembly of read clouds. Nature Biotechnology 2018 (https://doi.org/10.1038/nbt.4266).
The athena-meta command may be run multiple times to resume from the last step successfully completed.
If you are having trouble installing or running Athena, the docker file (see above) may help you diagnose the issue.
If an error arises, the output from athena-meta or the log files may be informative.
ShortSequence: Sequence is too long. If you get this error during assembly, please make sure you are using the right fork of idba_ud.
Please submit issues on the GitHub page for Athena.