Currently microbial community profiling sudies rutienly use 454 pyrosequencing to generate 10,000 - 100,000 reads from particular variable regions of the 16S rRNA gene. Unfortunately 454 pyrosequencing has a number of short falls such as homopolymer errors. Furthermore taxonomic resolution can be lost when using 454 pyrosequencing due to the smaller fragment of the 16S rRNA gene that is being analyzed. Amplishot combines amplification of the full 16S rRNA gene sequence with de novo reconstruction of full-length 16S rRNA genes from specially constructed "Amplishot" Illumina sequencing libraries or from metagenomes.
- Qiime - tested only on version 1.6.0 & 1.8.0
- bowtie2 - tested with version 2.0.5
- biom - version 1 required (most likely version 2 will break things; only tested with version 1)
OTU Clustering Dependancies
- cd-hit - tested with 4.5.4
- uclust - now packaged with Qiime if using version 1.8.0 or higher. NOTE: uclust is not the same as usearch although google searches for the former will return the latter
You must have one of the following
- phrap - tested with version 1.09518 (not currently recomented as it does not scale)
- Ray - tested with version 2.3.1
[sudo] python setup.py install
or if you do not have sudo on your computer use the
--prefix option to change
the installation directory.
Amplishot has a single executable called
amplishot; you can see some basic help
by running the command
The command-line options for Amplishot only offer a subset of the options that are
available. Most options are changed by using a configuration file. The Amplishot
configuration file is written in YAML, which is a simple markup
language; before you try to modify the configuration file it might be helpful to
read up on the YAML syntax.
Command-line options and a configuration file can be used in tandum. Any options
specified on the command-line will overwrite the corresponding value in the
configuration file. If changes have been made to a configuration by using
command-line options, a new configuration file will be outputted to the global
output directory with a datetime signature so that no previous configuration details
are lost. A new configuration file will not be outputted if there are no changes
to the current configuration set.
Configuration options and their values
threads: Sets the number of threads/processes to use in the Amplishot pipeline. The value should be a single integer number (default: 1)
log_level: Changes the verbosity of logging messages. The options from most verbose to least are: DEBUG, INFO, WARN, ERROR, CRITICAL (default: INFO)
output_directory: This is the directory where all results will be outputted.
By default it is the current directory, symbolized by a '.' character
input_raw_reads: This must be a list of files raw Illumina sequencing read files to input into Amplishot. The format of the input reads must be a list-of-lists, which can be added to the configuration in two ways:
input_raw_reads: - [/full/path/to/sample1.1.fq, /full/path/to/sample1.2.fq] - [/full/path/to/sample2.1.fq, /full/path/to/sample2.2.fq] input_raw_reads: - - /full/path/to/sample1.1.fq - /full/path/to/sample1.2.fq - - /full/path/to/sample2.1.fq - /full/path/to/sample2.2.fq
aliases: Use this option to set the sample names to be used in the output files. By default the filename is used without the file extension. The form of the values must be a YAML list specified by either:
aliases: - alias1 - alias2 aliases: [alias1, alias2]
skip_pairtigs: specify true or false whether you would like to assemble paired reads first before mapping onto the reference 16S rDNA database. This option is highly recomended for samples that are from full metagenomes that will likely be mostly from non-rDNA source
minimum_pairtig_length: Specify the minimum length that pairtigs must be. This option has no effect if the
pairtig_read_filesoption is set. (default: 350)
pair_overlap_length: The minimum number of nucleotides that two reads from a pair must overlap by to generate a pairtig. (default: 30)
mapper: The name of the short read aligner used in Amplishot. Currently only bowtie2 is implemented and therefore the only valid value for this option is bowtie
mapper_database: Give the full filepath to an index file generated by the short read mapper
taxonomy_file: the name of the file containing a mapping between the reference sequences and their taxon strings
mapping_similarity_cutoffs: A list of required similarity between a reference sequence and a pairtig. Reads will be segregated into a band of similarity and assembled separately in that band
taxon_coverage: list of two integer numbers that determine whether there are enough reads for assembly. The first number must be the minimum coverage (vertical read depth) for a taxon; the second number is the number of bases that must contain the minimum coverage. (default: [2, 1000])
assembly_method: de novo 16S reconstruction method. The only valid values are
minimum_reconstruction_length: minimum length of sequences that are defined as 'full length' and used in taxonomic assignment.
otu_clustering_method: Currently the only valid value is
otu_clustering_similarity: the similarity used for clustering full-length sequences from different samples into OTUs
neighbours_file: A file that calculate the phylogenetic distanse between two separate reference sequences
Program related blocks
Some of the underlying programs used in Amplishot can be controlled precisely by
specifying a block in the configuration file containing options specific
to that program. Each of these blocks is specified with a key that is identical
to the program name; within each block are program specific key-value pairs.
The program specific key-value pairs must be indented by 4 spaces ( not tabs ), this indentation must be consistent throughout the entire configuration file. Currently program related blocks are available for both the assembly and taxonomy assignment parts of Amplishot
Specify extra options
phrap key. Any of the command-line options available in phrap
(listed here) can be used as the keys
in the phrap block, however you must not add in the dash (-) prefix for the options.
For example to modify stringency of the assembly, you could change the scoring matrix:
phrap: penalty: -9 gap_ext: -11 gap_init: -12 minscore: 350
Just because you can do this does not mean that you should unless you know exactly what you are doing or are experimenting when Amplishot is producing sub-standard results. The scoring matrix and other assembly parameters used in phrap have already been altered to generate accurate 16S assemblies, so the default settings should work well.
Taxonomy assignment is handled in Amplishot after the reconstruction of full-length
16S sequences has occurred. There are a number of different methods for taxonomic
assignment that include some of those available in Qiime 1.6.0. The taxonomy
assignment method is determined from the Amplishot configuration file with the
assign_taxonomy_method key. By default the Bowtie2 taxonomy assigner is used.
The valid values for each classifier are shown below:
bowtiefor bowtie2 based assigner
blastfor Qiime blastall based assigner
rdpfor RDP classifier
mothurfor Mothur classifier
Configuration File options
For all taxonomy assigners a special block can be given in the configuration file
for specific options. The key to this block must be the same as the value of
assign_taxonomy_method key. For example to use the blast taxon assigner the
following code could be added into the configuration file:
assign_taxonomy_method: blast blast: evalue: 1e-50 blast_db: /full/path/to/blast/database
Options specific to all taxon assigners
id_to_taxonomy_fp: Full path to file containing a mapping between reference sequences and their respective taxon strings. By default all taxon assigners will use the value of the
taxonomy_filekey. This option should only be used if different reference sequence set is being used for taxonomy assignment
reference_sequences_fp: Full path to file containing reference sequences
index: Full file path to bowtie2 formatted index file for the reference sequences.
By default the bow tie taxonomic assigner will use the value of the
mapper_databasekey, however a different database file can be accessed here for taxonomic assignment
threads: specify the number of threads that bowtie can use
percentId: The minimum percent identity that a representative sequence must map with, any sequence below this threshold will not be given a taxonomy
blast_db: Full file path to the file containing the blast database that must be formatted using the
formatdbutility for nucleotide sequences. Do not add in the file extensions usually associated with blast databases. e.g.
evalue: The maximum allowable e-value allowed for a given match. If no match can be found below this score, then a representative sequence will not be given a taxonomic assignment.
Confidence: Minimum allowed confidence score for taxonomic assignment
Confidence: Minimum allowed confidence score allowed for taxonomic assignment
max_memory: Set the maximum memory allowed for the RDP java virtual machine
training_data_properties_fp: Full path to a file containing pre-compiled training data.
This option is overridden if both the
id_to_taxonomy_fpkeys are set.
Example Configuration file
--- threads: 5 log_level: INFO minimum_pairtig_length: 350 # minimum length of the overlapped pairs pair_overlap_length: 30 # mimimum length of the overlap mapper: bowtie # program used for read mapping mapping_similarity_cutoffs: [0.85, 0.90, 0.95, 0.98] # the sequence similarity required between the reference database and the reads taxon_coverage: [2, 1000] # list of two numbers. The first is the minimum coverage, the second is the number of bases that need to be covered assembly_method: ray # choose a genome assembler minimum_reconstruction_length: 1000 # minimum length of sequences that we define as 'full length' otu_clustering_method: cdhit otu_clustering_similarity: 0.97 # the similarity used for clustering full-length sequences from different samples into OTUs read_mapping_percent: 0.90 # the percent identity that individual reads have to map with to be considered part of the reference assign_taxonomy_method: blast minimum_taxon_similarity: 0.90 # sequences that fall below this cutoff will be listed as no taxonomy blast_db: '/srv/whitlam/bio/db/gg/from_www.secongenome.com/2012_10/gg_12_10_otus/rep_set/99_otus.fasta'
Tips for writing config files
Writing out the full file path names in the configuration file can be a
real pain. However you can reduce the burden on yourself by taking
advantage of some of the advanced features in the
vim text editor.
INSERT mode if you start typing a file path (like
then press CTRL-x CTRL-f, you'll get a popup menu of file paths!! You
can use this to quickly add in file names to your config file.