A pipeline written in bash and R to access and process mitochondrial reference library data from NCBI RefSeq. Currently supported 12S metabarcode markers are: 'tele02', 'mifish-u', 'elas02', and 'mifish-u-mod'.
- Install hmmer v3.1b2:
sudo apt install hmmer
- Make sure system utilities, including wget, curl, md5sum, and gzip, are available.
- Clone the repository:
git clone https://github.com/genner-lab/refseq-reflib.git
- Change directory:
cd refseq-reflib
- Make temp and output directories:
mkdir temp references
- Obtain correct R packages:
Rscript -e "renv::restore()"
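Before running the pipeline, it can help to confirm that the tools mentioned in the steps above are on the PATH. The following is a minimal sketch, not part of the repository: `check_tools` is a hypothetical helper, and `hmmsearch` is only assumed here as a stand-in for whichever HMMER binary the scripts actually call.

```shell
# Hypothetical helper (not part of the repository).
# Prints the name of each command that is missing from PATH, one per line,
# and returns 1 if any command is missing.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tools named in the steps above. hmmsearch is an assumption: it ships with
# HMMER, but the pipeline may use a different HMMER program.
if check_tools wget curl md5sum gzip git Rscript hmmsearch; then
    echo "all required tools found"
else
    echo "install the tools listed above before continuing" >&2
fi
```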
- Check the current RefSeq release number:
curl ftp://ftp.ncbi.nlm.nih.gov/refseq/release/RELEASE_NUMBER
- Download and unpack the full catalogue. This may take more than 10 minutes, depending on internet connection speed:
scripts/download.sh
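To record which release was downloaded, the output of the curl command above can be captured in a variable. This is a sketch that assumes the RELEASE_NUMBER file contains only the release number and surrounding whitespace; `clean_release` is a hypothetical helper and is not part of the repository.

```shell
# Hypothetical helper: strip whitespace and accept only a purely numeric
# release string. Prints the cleaned number, or returns 1 if the input is
# empty or contains non-digit characters.
clean_release() {
    local rel
    rel=$(printf '%s' "$1" | tr -d '[:space:]')
    case "$rel" in
        ''|*[!0-9]*) return 1 ;;
        *) printf '%s\n' "$rel" ;;
    esac
}

# Usage (requires network access, so it is left commented out here):
# release=$(clean_release "$(curl -s ftp://ftp.ncbi.nlm.nih.gov/refseq/release/RELEASE_NUMBER)") \
#     || echo "unexpected RELEASE_NUMBER content" >&2
```

Validating the value before using it guards against an error page or an empty response being mistaken for a release number.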
- Use hidden Markov models to extract homologous DNA sequences of the fragment of interest.
- The '-p' flag is the primer set. Currently supported sets for 12S are 'tele02', 'mifish-u', 'elas02', and 'mifish-u-mod'.
- The script removes all hybrid (e.g. 'Cyprinus carpio x Carassius auratus') and undescribed taxa (e.g. 'Corydoras sp.').
- The script converts all subspecies to species (e.g. 'Thunnus thynnus thynnus' becomes 'Thunnus thynnus').
- The script then taxonomically dereplicates the sequences, meaning that within species all duplicated haplotypes will be removed, but different species can share haplotypes.
scripts/extract.R -p tele02
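The name filtering and dereplication rules described above can be illustrated with a small awk sketch. This is not the code in scripts/extract.R: `clean_taxa` is a hypothetical function, the tab-separated input layout is an assumption, and "duplicated haplotype" is simplified here to mean an identical sequence string within a species.

```shell
# Hypothetical illustration of the cleaning rules (not the pipeline's code).
# Input on stdin: tab-separated lines of "scientific name<TAB>sequence".
# Output: cleaned and taxonomically dereplicated lines in the same layout.
clean_taxa() {
    awk -F '\t' '
        # Skip hybrids such as "Cyprinus carpio x Carassius auratus".
        $1 ~ / x / { next }
        # Skip undescribed taxa such as "Corydoras sp.".
        $1 ~ / sp\.( |$)/ { next }
        {
            # Keep genus and species epithet only, dropping any subspecies.
            n = split($1, w, " ")
            name = (n >= 2) ? (w[1] " " w[2]) : $1
            # Print each species/sequence pair once. The key includes the
            # species name, so different species may share a sequence.
            key = name SUBSEP $2
            if (!(key in seen)) { seen[key] = 1; print name "\t" $2 }
        }
    '
}
```

For example, `printf 'Thunnus thynnus thynnus\tACGT\nThunnus thynnus\tACGT\nThunnus albacares\tACGT\n' | clean_taxa` keeps one 'Thunnus thynnus' record and one 'Thunnus albacares' record, even though both species carry the same sequence.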
- Annotate the mtDNA data with taxonomic information from GBIF. This may take 10 minutes or more while the taxonomic database is retrieved, depending on internet connection speed.
- The script writes three files into 'references': (a) the annotated RefSeq sequences in tabular format ('refseqVERSION-annotated-PRIMER.csv'); (b) the annotated RefSeq sequences in fasta format for the sintax algorithm implemented in vsearch ('refseqVERSION-annotated-PRIMER.fasta'); and (c) the same sintax-formatted fasta sequences, but for one randomly selected species per genus ('refseqVERSION-annotated-genera-PRIMER.fasta').
- The '-s' flag is the random number seed for the per-genus subset.
- The '-p' flag is the primer set. Currently supported sets for 12S are 'tele02', 'mifish-u', 'elas02', and 'mifish-u-mod'.
scripts/annotate.R -s 42 -p tele02
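The role of the '-s' seed can be illustrated with a sketch of seeded per-genus sampling. This is not how scripts/annotate.R is implemented, and it will not reproduce the species that script selects: `pick_one_per_genus` is a hypothetical function, and the input is assumed to be one unique species name per line.

```shell
# Hypothetical illustration of seeded one-species-per-genus sampling.
# $1: integer seed. Input on stdin: one "Genus epithet" name per line.
# Output: one randomly chosen species per genus, sorted alphabetically.
pick_one_per_genus() {
    awk -v seed="$1" '
        BEGIN { srand(seed) }
        {
            split($0, w, " ")
            genus = w[1]
            count[genus]++
            # Reservoir sampling: the k-th species seen for a genus replaces
            # the current pick with probability 1/k, so every species in the
            # genus is equally likely to be the final pick.
            if (rand() * count[genus] < 1) { pick[genus] = $0 }
        }
        END { for (g in pick) print pick[g] }
    ' | sort
}
```

Running the function twice with the same seed and the same input on the same awk implementation gives the same subset, which is the property a fixed seed such as 42 is meant to provide. A different seed may select different species.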