get Miniconda here: https://docs.conda.io/en/latest/miniconda.html, this pipeline is written in Python 2.7
A simple installation script has been written for installing this pipeline
source minibarcoder_install.sh <new_conda_env_name> <path_to_directory_for_cloning_minibarcoder>
Alternatively one can run the following commands
conda config --add channels bioconda
conda create -n mbconda python=2.7 mafft racon graphmap=0.5.2 blast seqtk git fasta3
conda activate mbconda
conda install -c anaconda biopython
git clone https://github.com/asrivathsan/miniBarcoder/
cd miniBarcoder
python setup.py install
In a few computers, we are experience issues with racon obtained from bioconda, if this happens: please compile directly from github
git clone --recursive https://github.com/isovic/racon.git racon
cd racon
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make
cp build/bin/racon ~/miniconda2/envs/mbconda/bin/racon
Alternatively, if dependencies are being installed separately, you can just use the scripts directly, or proceed from "git clone " onwards.
Before proceeding with standard use, Please run the test files for the pipeline.
To test the pipeline, with the recent updates use the following. This works if setup.py install has been done. Else call each as a python script, i.e. python mb_parallel_demultiplex.py ...
cd testfiles
mb_parallel_demultiplex.py -d demfile_2.txt -l 600 -o testout -f test.fasta
mb_parallel_consensus.py -i testout
mv testout/all_barcodes.fa test_mafft_barcode.fa
filter_by_Ns.py -n 6 -i test_mafft_barcode.fa
racon_consensus.py -i testout -d racon_out -o test_racon.fa -b test_mafft_barcode_Nfilter.fa
aacorrection.py -b test_mafft_barcode_Nfilter.fa -bo test_barcodes_megablast.txt -bf test_barcodes_megablast.fasta -o test_mafft_barcode_aacorr.fa
aacorrection.py -b test_racon.fa -bo test_barcodes_megablast.txt -bf test_barcodes_megablast.fasta -o test_racon_barcode_aacorr.fa
consolidate.py -m test_mafft_barcode_aacorr.fa -r test_racon_barcode_aacorr.fa -t con_temp -o test_consolidate.fa
If fasta file is not available:
seqtk seq -A fastqfile > fastafile
If BLAST output file not available for aacorrection.py
aacorrection.py -b test_mafft_barcode_Nfilter.fa -d /path/to/nt -o test_mafft_barcode_aacorr.fa
In a few computers, we are experience issues with racon obtained from bioconda, if this happens: please compile directly from github
git clone --recursive https://github.com/isovic/racon.git racon
cd racon
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make
cp build/bin/racon ~/miniconda2/envs/mbconda/bin/racon
Other details on scripts for quality assessments such as assess_corrbarcodes_wref.py are given below if you would like to compare minion barcodes with reference barcodes.
Given the multiple steps in the pipeline, we have written out the steps for obtaining barcodes for datasets A,B, and C for our manuscript. Easiest way to get a handle on the pipeline is to test it out on one of these datasets as described in Steps to operate the pipeline document.
In order to run Racon correction, both FASTQ and FASTA files are required as Racon requires quality score information: the order of the sequences must be same in the 2 files. This can be simply obtained if fasta is generated from FASTQ using tools such as fq2fa.
- Python2 (tested with 2.7)
- MAFFT v7
- glsearch36 (https://github.com/wrpearson/fasta36) This is part of fasta36 suite and one of the installed programs is "glsearch36"
- numpy
MAFFT and glsearch36 must be in path and callable by "mafft" and "glsearch36" on the terminal.
- Python2 (tested with 2.7)
- Biopython (tested with v 1.69)
- MAFFT v7
- BLAST+ (tested with 2.2+, 2.6+, 2.7+)
- (If BLAST output and corresponding accession fasta does not exist) Preferably: a local copy of the NT database. Alternatively, you can use blastn -remote option to run commandline blast on NCBI server, but this may not be as fast. If relying on -remote option, then ensure your BLAST is updated. Also note that for frequent use Entrez email should be provided.
“mafft”, “blastn” should be in path and callable from terminal. If local nt is used “blastdbcmd” should be callable from terminal, available as part of BLAST+
This worked in Ubuntu but we couldn’t get it to work in Mac OS 10.12.6 (we couldn’t install racon).
- graphmap (https://github.com/isovic/graphmap) (v0.5.2) (released currently with racon). v0.3 gave errors.
- racon (https://github.com/isovic/racon) : v1.3.1 Both can be installed by:
git clone https://github.com/isovic/racon.git && cd racon && make modules && make tools && make -j
export PATH=$PATH:$PWD/bin
export PATH=$PATH:$PWD/tools/graphmap/bin/Linux-x64
These must be installed and in path and calling “graphmap” and “racon” in terminal should not throw an error.
This script performs the various steps from primer finding, demultiplexing, read alignment and consensus calling.
For the purpose of the is pipeline, please avoid putting “;” in specimen name.
Output is stored in outpudirectory/
Basic usage for 658 bp barcode (100 reads used):
python miniBarcoder.py –f inputfasta –d demultiplexingfile –o outputdirectory –l 600
Basic usage for 658 bp barcode (all reads used):
python miniBarcoder.py –f inputfasta –d demultiplexingfile –o outputdirectory –l 600 –D 0
Output is stored in outpudirectory/all_barcodes.fa
Arguments:
-h, --help show this help message and exit
-f INFASTA, --infasta INFASTA
Path to input fasta file
-d DEMFILE, --demfile DEMFILE
Path to demultiplexing file
-o OUTDIR, --outdir OUTDIR
set output directory path
-l MINLEN, --minlen MINLEN
exclude barcode sequences identified that are shorter
than specified length
-m MODE, --mode MODE run with unique tag mode (1) or dual tag mode (2),
mode 1 ignores mismatch setting and allows only 1 bp
mismatch, default 2
-mm int, --mismatch int
number of mismatches allowed in tags, must be <=5,
default 2
-e EVALUE, --evalue EVALUE
evalue for primer search using glsearch36,default 1e+6
-g GAPS, --gaps GAPS number of gaps allowed for primer identification,
default 5
-D MAXDEPTH, --maxdepth MAXDEPTH
set max depth per coverage to improve speed, default
100, must be >2
-t THREADS, --threads THREADS
number of threads for glsearch and mafft
-bl BLEN, --barcodelength BLEN
estimated barcode length, used for unique tag mode
only, please keep it slightly shorter than actual barcode length, as insertion errors can make this run into primers, default 300
Basic usage for 658 bp COI
MODE 1: Existing BLAST output and accession fasta file
python aacorrection.py –b input_uncorrected_barcode_fasta –bf blast_accession_fasta_file –bo blastout_output file –o outputfilename
MODE 2: Conducting BLAST to local NT
python aacorrection.py –b input_uncorrected_barcode_fasta –d /path/to/nt/withprefix –o outputfilename
MODE 3: Conducting BLAST to remote NT (SLOW)
python aacorrection.py –b input_uncorrected_barcode_fasta –d “nt –remote” –o outputfilename
Commands
arguments:
-h, --help show this help message and exit
-b INFASTA, --barcodes INFASTA
Path to input barcode fasta file
-o OUTFILE, --outfile OUTFILE
outfile file name
-p THREADS, --threads THREADS
number of threads for BLAST,default=4
-bo BLASTOUTFILE, --blastout BLASTOUTFILE
Path to blast output file, outputformat 6
-bf BLASTACCFILE, --blastfasta BLASTACCFILE
Path to fasta file containing sequences of BLAST hits,
required if -bo or --blastout is given
-d PATH_TO_DB, --db PATH_TO_DB
Path to nucleotide database with database prefix, if
local copy is unavailable you can try typing 'nt
-remote'. note that remote has not been extensively
tested and is slower
-a NAMBS, --amb NAMBS
proportion of ambiguities allowed per barcode,
default=0.01
-l MINLEN, --minlen MINLEN
exclude barcodes shorter than this length, default=640
-L MAXLEN, --maxlen MAXLEN
exclude barcodes longer than this length, default=670
-c CONGAPS, --congaps CONGAPS
exclude sequences containing any gap of length >=
value, default=5
-n NAMINO, --namino NAMINO
number of flanking amino acids around the gap used for
correction, default=3
-g GENCODE, --gencode GENCODE
genetic code https://www.ncbi.nlm.nih.gov/Taxonomy/Uti
ls/wprintgc.cgi, default=5, invertebrate mitochondrial
-e EVALUE, --evalue EVALUE
e-value for BLAST search, default=1e-5
-H HPLEN, --hplen HPLEN
minimum homopolymer length, default=2
-s SUPPORT, --support SUPPORT
minimum support for indel in references. Reducing this
increases chances of errors and improper detection of
reading frames,default=2
Other scripts:
Usage:
sh racon_consensus.sh input_fastq_for_all_data input_fasta_for_all_data outputfolder_of_minibarcoder mafft_barcodes_obtained_by_minibarcoder outputdirectory
usage:
python consolidate.py –m mafft_aa_barcodefile –r racon_aa_barcodefile –o output_barcode_file
-h, --help show this help message and exit
-m MAFFTB, --mafft MAFFTB
Path to input mafft corrected barcode fasta file
-r RACONB, --racon RACONB
Path to input racon corrected barcode fasta file
-o OUTFILE, --outfile OUTFILE
Path to output corrected barcode fasta file
-o OUTDIR, --outdir OUTDIR
output directory
python filter_by_Ns.py -i input_barcode_fasta -n number of ambiguities
Output is stored in *Nfilter.fasta
-h, --help show this help message and exit
-i INFILE, --infile INFILE
Path to input fasta file
-n NAMBS, --nambs NAMBS
Number of ambiguities allowed
assess_corrbarcodes_wref.py: Compare corrected barcodes with reference barcodes to measure accuracy.
-h, --help show this help message and exit
-m MINIONFASTA, --minionfasta MINIONFASTA
Path to input mafft corrected barcode fasta file
-r REFFASTA, --reffasta REFFASTA
Path to input reference fasta file
-t OUTDIR, --tempdir OUTDIR
Path to temporary directory
-o OUTFILE, --outfile OUTFILE
Path to output file containing statistics
assess_corrbarcodes_wref.py: Compare uncorrected barcodes with reference barcodes to measure accuracy.
-h, --help show this help message and exit
-m MINIONFASTA, --minionfasta MINIONFASTA
Path to input mafft corrected barcode fasta file
-r REFFASTA, --reffasta REFFASTA
Path to input reference fasta file
-t OUTDIR, --tempdir OUTDIR
Path to temporary directory
-o OUTFILE, --outfile OUTFILE
Path to output file containing statistics
python measure_ambs.py input_barcode_fasta
For bugs, queries and other issues, please contact Amrita Srivathsan, asrivathsan@gmail.com
This project was funded by Southeast Asian Biodiversity Genomics Centre (SEABIG), NUS and ASTAR
The publication associated is: http://dx.doi.org/10.1111/1755-0998.12890