Benchmark data creation
Tamock simulates habitat-specific benchmark data for metagenomic samples.
Simulated metagenomic reads are widely used to benchmark software and workflows for metagenome interpretation. Scope and power of metagenomic benchmarks depend on the selection of their underlying communities. As a result, conclusions of benchmark studies are limited for distant communities towards the benchmark data used. Ideally, simulations are therefore based on genomes, which resemble metagenomic communities realistically.
Tamock facilitates the simulation of metagenomic reads according to a microbial community, derived from real metagenomic data. Thus, Tamock simulations enable an assessment of computational methods, workflows and parameters specific for a microbial habitat. Tamock automatically determines taxonomic profiles from shotgun metagenomic data, selects reference genomes accordingly and uses them to simulate metagenomic reads.
By default, the bacterial fraction of a community is simulated, however other domains such as Eukaryota, Archaea or Viruses can be simulated as well (-d/--domains option).
Classification is implemented using Centrifuge (Kim et al. 2016) and sequencing read simulation is done by ART (Huang et al. 2012).
- GNU Scientific Library (GSL) https://www.gnu.org/software/gsl/
- Perl >= v5.12.0
Debian/Ubuntu
apt install libgsl-dev
#older distributions
apt install libgsl0-dev
CentOS
yum install gsl-devel
MacOS
brew install gsl
git clone https://github.com/gerners/tamock.git
cd tamock
#if ART (version 20160605) and Centrifuge (>= v1.0.4) are available on the system, this step is not needed
./tamock --install-deps
./tamock --install-test
Centrifuge indexes have to be downloaded separately. Prebuild indexes are provided on the centrifuge homepage at http://www.ccb.jhu.edu/software/centrifuge/
Example:
#for details on provided indexes, please see documentation at the centrifuge homepage (link above)
mkdir /<tamock_install_dir>/centrifuge-index
cd /<tamock_install_dir>/centrifuge-index
wget ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p+h+v.tar.gz
tar -xzf p+h+v.tar.gz
#if there is already a present centrifuge installation with indexes, the index folder can also be linked as follows to be used by default
ln -s /path/to/centrifuge-index /<tamock_install_dir>/centrifuge-index
If a centrifuge p+h+v index is placed at the install directory under <install_dir/centrifuge-index/p+h+v*> this index is used by default if no other index is provided to tamock.
RefSeq reference genomes are all saved within one directory. If the directory /<tamock_install_dir>/refseq-genomes is created, this directory is used by default unless another directory is provided via the command line.
CAVEAT: If Eukaryotic genomes are simulated, downloaded genomes could potentially use >100 GB of file space
#optional
mkdir /<tamock_install_dir>/refseq-genomes
If a local copy of all RefSeq genomes is present on the system, softlinks to all genome files (e.g. GCF_000013425.1_ASM1342v1_genomic.fna.gz) can be placed in /<tamock_install_dir>/refseq-genomes to use the local mirror of RefSeq.
Paired end simulation with 4 threads
tamock -1 <paired_1.fastq> -2 <paired_2.fastq> -o <output directory> \
-R <reference genome directory> -x <centrifuge-index/p+h+v> -t 4
Paired end simulation with additional unpaired reads
tamock -1 <paired_1.fastq> -2 <paired_2.fastq> -U <single.fastq> \
-o <output directory> -R <reference genome directory> -x <centrifuge-index/p+h+v>
Single end simulation
tamock -U <single.fastq> -o <output directory> -R <reference genome directory> \
-x <centrifuge-index/p+h+v>
Classification is the most expensive task. Multiple threads speed up classification while RAM usage depends on the index used. This step can also be skipped by providing pre-computed centrifuge classification results.
CAVEAT: When using external centrifuge results, ensure that centrifuge version 1.0.4 or higher is used and that the same index is provided to tamock.
tamock -1 <paired_1.fastq> -2 <paired_2.fastq> --centrifuge-kreport -o <output directory> \
-R <reference genome directory> -x <centrifuge-index/p+h+v>
Simulate additional domains (only bacterial domain by default). All classified species with a corresponding reference genome will be simulated. Provide selected domains as a comma-separated selection via -d option, e.g. for (E)ukaryota, (B)acteria, (V)iruses, and (A)rchaea
tamock -1 <paired_1.fastq> -2 <paired_2.fastq> -o <output directory> -d E,B,V,A \
-R <reference genome directory> -x <centrifuge-index/p+h+v>
Tamock simulates all sequences which could be classified and do have a reference genome available. All simulated sequences will replace the corresponding classified sequences in the original sample, while the unknown and unclassified sequence fraction will be maintained. Tamock aims to replicate the original sample as close as possible therefore sequence depth for a benchmark sample will be the same as the original sample. However to enable additional experiments, read depth for the simulated sequence fraction can be altered (via --rn-sim) which will scale all classified sequences with a reference genome to the specified sequence depth which will be combined with the unknown/unclassified sequence fraction for the final benchmark.
tamock -1 <paired_1.fastq> -2 <paired_2.fastq> -o <output directory> --rn-sim <sequence depth for simulated sequences> \
-R <reference genome directory> -x <centrifuge-index/p+h+v>
If the separate sequence fraction (simulated, classified and subsequently replaced, unclassified and not replaced) should remain in the output of Tamock, use the -k/--keep option. All temporary read files will remain in the output directory under outdir/tmpreads like
- outdir/tmpreads/Sample_1.not_repl.fastq
Unclassified sequences which will not be replaced
- outdir/tmpreads/Sample_1.repl_by_sim.fastq
Sequences which could be classified and will be replaced by corresponding simulated sequences
- outdir/tmpreads/simulated_1.fq
Sequences simulated by ART. Combined with XX.not_repl.fastq they form the final Tamock benchmark sample.
tamock -1 <paired_1.fastq> -2 <paired_2.fastq> -o <output directory> --keep \
-R <reference genome directory> -x <centrifuge-index/p+h+v>
- -1/--forward
Forward paired reads. Can be used together with unpaired reads using -U option.
- -2/--reverse
Reverse paired reads. Can be used together with unpaired reads using -U option.
- -U/--unpaired
Unpaired or single end reads. Required if no paired reads are present.
- -o/--outdir
Output directory
- -R/--refgenomes
Directory to store all RefSeq reference genomes. Defaults to <install_dir/refseq-genomes> if present
- -x/--index
Centrifuge index (*.cf files). Only basename needed, e.g. /path/to/index.X.cf should be provided as '-x /path/to/index'. Defaults to <installdir/centrifuge-index/p+h+v> if present and no other option given
Index files can be downloaded from the centrifuge homepage at http://www.ccb.jhu.edu/software/centrifuge/
- -d/--domains
Select domains which should be simulated. Options are E(ukaryota), B(acteria), V(iruses), A(rchaea) For multiple selections, provide comma separated values e.g. <-d E,B>. Defaults to Bacteria (B).
- --gzip
Gzip all sequence files
-
-k/--keep Keep all intermediate ART and sequence files (sequence fraction with (repl_by_sim)/without (not_repl) reference genomes, simulated sequences (simulated*.fq)). Temporary files are stored in <output_dir/tmpreads>
-
-a/--assembly-summary
NCBI assembly summary table from ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt. Defaults to /installdir/assembly_summary.txt (downloaded during installation).
- --ncbi-sum-update
Update NCBI assembly summary to current version and replace old version at /installdir/assembly_summary_refseq.txt Alternatively manually download from ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt and provide via -a
- -t/--threads
Number of threads used by centrifuge. Defaults to 1.
- -v/--verbose
Verbose mode, print detailed information to screen
==== Benchmarks ====
- --centrifuge-kreport
Centrifuge kraken-like report. If centrifuge-kreport and centrifuge output files are provided, classification is skipped and these files are used instead. CAVEAT: The same index used for classification has to be provided to tamock via -x/--index. CAVEAT: Centrifuge version 1.0.4. or higher required
Centrifuge example command: centrifuge-kreport -x index centrifuge.out > centrifuge.kreport
- --centrifuge-out
Centrifuge tabbed output file (centrifuge -S ). If centrifuge-kreport and centrifuge output files are provided, classification is skipped and these files are used instead. CAVEAT: The same index used for classification has to be provided to tamock as well via -x/--index.
Centrifuge example command: centrifuge -x index -1 forwardpairs.fq -2 reversepairs.fq --out-fmt tab -S centrifuge.out --report-file centrifuge.report
- --no-reassign
No reassignment of read counts from strains without a reference genome to other classified strains/reference genomes of same species. Only simulate sequences classified directly to a reference. Default off.
==== ART sequence simulator ====
- -M/--illumina-model
Either 'custom' to calculate error profile of input reads or use precalculated ART Illumina error model. Available precalculated profiles are: GA1 - GenomeAnalyzer I (36bp,44bp), GA2 - GenomeAnalyzer II (50bp, 75bp), HS10 - HiSeq 1000 (100bp), HS20 - HiSeq 2000 (100bp), HS25 - HiSeq 2500 (125bp, 150bp), HSXn - HiSeqX PCR free (150bp), HSXt - HiSeqX TruSeq (150bp), MinS - MiniSeq TruSeq (50bp), MSv1 - MiSeq v1 (250bp), MSv3 - MiSeq v3 (250bp), NS50 - NextSeq500 v2 (75bp)
Defaults to sample specific error profile, calculated on-the-fly (custom).
- --qprof1
Pre-calculated forward-read quality profile for custom error profile in ART. Replaces calculation of error profiles of input sequences.
- --qprof2
Pre-calculated reverse-read quality profile for custom error profile in ART. Replaces calculation of error profiles of input sequences.
- -l/--length
Read length of simulated reads. Should be matched with a realistic error profile with according read lengths (-M). If not provided, the longest read length from the first 250 reads of input sequences are used.
- --rn-sim
Number of reads for simulated sequence fraction. By default, the same number of sequences from the input sample will be kept, this option alters number of sequences!
- --mean-fragment-length
Mean size of fragment length for paired-end simulations by ART. Defaults to 200.
- --sd-fragment-length
Standard deviation of fragment size for paired-end simulations. Defaults to 10.
==== Dependencies ====
- --install-deps
Install centrifuge and ART default versions into the installation directory of tamock. Depends on libgsl (See Installation).
- --install-test
Test installation.