
2. Metabarcoding workflow

Vasco Elbrecht edited this page Apr 12, 2018 · 1 revision

JAMP follows a Usearch-based workflow but allows additional filtering to be applied during OTU clustering. This filtering is optimized for metazoan bulk samples, in which specimens often vary in biomass (Elbrecht et al. 2017). The R package is optimized for Illumina amplicon data generated with a fusion primer system (for a wet lab guide, have a look at this recent preprint).

Make sure to check out the more extensive documentation in the Tutorial folder! Here we look at a small mock sample data set from Leray & Knowlton (2017).

1) Demultiplexing and Prefiltering

Usually, several samples are multiplexed on the same sequencing run. They have to be demultiplexed based on inline tags with Demultiplexing_shifted(), unless Illumina indexing was used, in which case your samples might already be demultiplexed and you can proceed directly to the paired-end merging step U_merge_PE(). Demultiplexing with JAMP requires a csv table containing the tags and an additional table indicating the primer combinations used. Note that PE merging currently requires your file names to end with r1.txt and r2.txt (this will be more flexible in future versions). All modules starting with U rely on Usearch (Edgar 2013).
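To illustrate the idea behind inline-tag demultiplexing, here is a minimal Python sketch. It is not how Demultiplexing_shifted() is implemented: it assumes a single tag at the start of each read (JAMP works with tag combinations), exact matching, and made-up tag sequences and sample names.

```python
def demultiplex(reads, tag_to_sample):
    """Assign each read to a sample by its leading inline tag and strip the tag.

    Assumes exact matches and that no tag is a prefix of another tag.
    """
    by_sample = {sample: [] for sample in tag_to_sample.values()}
    unassigned = []
    for read in reads:
        for tag, sample in tag_to_sample.items():
            if read.startswith(tag):
                by_sample[sample].append(read[len(tag):])
                break
        else:
            # No tag matched: keep the read aside instead of guessing a sample.
            unassigned.append(read)
    return by_sample, unassigned


# Hypothetical tags and reads, for illustration only.
tags = {"ACGT": "sample_1", "TGCA": "sample_2"}
reads = ["ACGTGGTCAACAAATC", "TGCAGGTCAACAAATC", "NNNNGGTCAACAAATC"]

by_sample, unassigned = demultiplex(reads, tags)
print(by_sample)
print(len(unassigned), "read(s) without a matching tag")
```

Reads without a recognizable tag are collected separately, so the share of unassigned reads can be checked before moving on.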

JAMP always looks in the _data folder of the previous processing step and processes the files found there. If you want to run files from somewhere else, you can specify them with list.files("path/to/_data", full.names=T). Currently, JAMP switches the working directory into the folder it is working in; this will be avoided in a future version for easier data processing!

Demultiplexing and PE merging are the first steps and are typical for any metabarcoding data set. Note that we merge as many reads as possible and apply quality filtering only after primers have been trimmed; otherwise, the primer sequences would influence the quality filtering step. Primers are trimmed with Cutadapt (Martin 2011) using the command Cutadapt().

Most commercial library preparation kits attach the Illumina adapter sequences in such a way that forward and reverse reads are sequenced simultaneously. In these cases, Cutadapt has to be applied twice, swapping the forward and reverse primers in the second run, reverse complementing those sequences with U_revcomp(), and then merging both files (with cat, see tutorial file). In a future script, this will be automated. If you are using fusion primers with parallel sequencing in forward and reverse direction, you should reverse complement the sequences that were sequenced in reverse direction before trimming primers.

Once primers are trimmed and all reads are in the same orientation, additional Phred quality score based filtering and sequence length filtering can be applied. Because Phred scores are logarithmic, filtering on the average score is never appropriate; maximum expected error (EE) filtering is highly recommended instead (Edgar & Flyvbjerg, 2015). With EE filtering U_max_ee(), each Phred score is converted into an error probability. These probabilities are summed over the read, and the read is discarded if the sum exceeds a given threshold (e.g. a sequence with EE=1.4 would be discarded if max_ee=1, as an estimated 1.4 of its bases are sequencing errors). After EE-based filtering, the fastq files are usually converted to fasta files, as sequence quality information is no longer needed.

Additionally, length filtering can be applied with Minmax() to remove reads that do not match the expected amplicon length. Here, +/- 10 bp around the expected amplicon length might be a good option, but one should verify the sequence length distribution. For example, eDNA samples from water often contain mostly bacteria, which can be coamplified due to the universality of the highly degenerate COI primers, leading to shorter or longer amplicons than expected. Sequence length can be visualized with e.g. Geneious (% expected length and histograms will be added to JAMP in the future).
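The expected error calculation and the length window can be sketched in a few lines of Python. This is a standalone illustration, not the code behind U_max_ee() or Minmax(): it assumes Phred+33 encoded quality strings, and the function names, parameter names, and example reads are made up.

```python
def expected_errors(quality, offset=33):
    """Sum of per-base error probabilities for a Phred+33 quality string."""
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in quality)


def mean_phred(quality, offset=33):
    """Arithmetic mean of the Phred scores (shown only to contrast with EE)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def passes_filters(seq, quality, max_ee=1.0, expected_len=None, tolerance=10):
    """True if the read is within the EE threshold and, optionally, the length window."""
    if expected_errors(quality) > max_ee:
        return False
    if expected_len is not None and abs(len(seq) - expected_len) > tolerance:
        return False
    return True


# Eight bases at Q40 ('I') plus two bases at Q2 ('#'):
# the mean Phred score looks fine, but the two poor bases dominate EE.
mixed = "I" * 8 + "#" * 2
print(mean_phred(mixed))                   # mean Phred score above 30
print(round(expected_errors(mixed), 2))    # EE above 1, so discarded at max_ee=1

# Ten bases at Q30 ('?'): each has error probability 0.001.
good = "?" * 10
print(passes_filters("A" * 10, good, max_ee=1.0, expected_len=10))
```

The mixed read shows why averaging a logarithmic score is misleading: its mean Phred score is above 30, yet more than one base is expected to be wrong.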

The prefiltered data can additionally be subsampled with U_subset() if sequencing depth differs massively between samples. Usually, however, I prefer to convert the read abundances in the final OTU table into relative abundances, which avoids the stochastic effects of subsampling.
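Converting an OTU table to relative abundance is simple arithmetic, as the following Python sketch shows. The table layout (a dictionary of samples, each mapping OTU names to read counts) and all names and counts are assumptions for illustration; JAMP's own OTU table format is not reproduced here.

```python
def to_relative_abundance(otu_table):
    """Convert per-sample read counts to percentages of that sample's total reads."""
    relative = {}
    for sample, counts in otu_table.items():
        total = sum(counts.values())
        relative[sample] = {
            otu: (100.0 * n / total if total else 0.0)
            for otu, n in counts.items()
        }
    return relative


# Two hypothetical samples with a 50-fold difference in sequencing depth
# but the same community composition.
otu_table = {
    "sample_A": {"OTU_1": 900, "OTU_2": 100},
    "sample_B": {"OTU_1": 45000, "OTU_2": 5000},
}

relative = to_relative_abundance(otu_table)
print(relative["sample_A"])
print(relative["sample_B"])
```

Both samples end up with the same percentages despite very different read totals, and no reads are discarded along the way.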

2) OTU clustering

Fig 1: Clustering approach in JAMP (U_cluster_otus())
