Fred's metabarcoding pipeline

From FASTQ files to OTU table with swarm

Welcome!

This wiki presents a complete example of a metabarcoding pipeline, or at least my version of it. The pipeline is written for bash v4 (released in Feb. 2009), and it will not work with older versions. Please check your bash version number with bash --version.
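
If you want to fail fast, a minimal version check can be placed at the top of your own scripts (BASH_VERSINFO is an array provided by bash itself):

# abort early if bash is too old for this pipeline
if [[ "${BASH_VERSINFO[0]}" -lt 4 ]] ; then
    echo "Error: bash 4 or newer is required" >&2
    exit 1
fi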

You can navigate the different steps using the sidebar on the right.

An important characteristic of my pipeline is that the main filtering step, including quality filtering, is performed after clustering. Swarm scales very well and there is no need to try to eliminate reads upstream to speed up the clustering step. Instead, my pipeline tries to preserve as much information as possible, as long as possible. It is only when producing the final OTU table that harsh filters are applied.

Please feel free to comment or to contact me to discuss the different steps of the pipeline. Sequencing technologies change rapidly and the metabarcoding community is still far from having established a set of good practices for OTU picking. We should definitely have an open discussion about it.

I intend to keep this page in sync with my own work, and to update it when new software versions or new publications rock the metabarcoding world. For example, future vsearch improvements (e.g., the ability to read from pipes) might simplify the pipeline by eliminating the need for temporary files.

Requirements and dependencies

You will need to install vsearch, cutadapt and swarm.

You will also need python 2.7 (not tested with python 3) and a recent version of bash. macOS users beware: Apple's Terminal provides an old version of bash (version 3). I'll try not to use bash version 4 specific code, but I might do so accidentally.
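
A quick sanity check, assuming the three tools (and python) are already installed and available in your PATH:

vsearch --version
cutadapt --version
swarm --version
python --version  # should report 2.7.x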

Environmental study

You will also need metabarcoding data. We are going to work with an Illumina HiSeq dataset of environmental sequences amplified from 29 samples, using the following 18S rRNA primers: FORWARD 5'-CCCTGCCHTTTGTACACAC-3' and REVERSE 5'-CCTTCYGCAGGTTCACCTAC-3' (to add to the confusion, we are going to search for the reverse-complement of the REVERSE primer: GTAGGTGAACCTGCRGAAGG). The sequencing company sent us two fastq files: R1.fq and R2.fq.
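
If you want to double-check that reverse-complement, a one-liner such as this one does the trick (the tr mapping covers the IUPAC ambiguity codes used in these primers):

echo "CCTTCYGCAGGTTCACCTAC" | rev | tr "ACGTURYKMBVDHN" "TGCAAYRMKVBHDN"
# prints GTAGGTGAACCTGCRGAAGG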

Merge paired-reads

The very first step is to check the encoding of the fastq files with vsearch:

VSEARCH=$(which vsearch)
FORWARD="R1.fq"
OUTPUT="coolproject.assembled.fastq"

# Check quality encoding (33 or 64?)
"${VSEARCH}" \
    --fastq_chars ${FORWARD} 2> ${OUTPUT/.fastq/.log}
Reading fastq file 100%  
Read 15466354 sequences.
Qmin 35, Qmax 74, Range 40
Guess: -fastq_qmin 2 -fastq_qmax 41 -fastq_ascii 33
Guess: Illumina 1.8+ format (phred+33)
...

Now that we know how quality is encoded (offset of 33), we can merge the paired reads (R1 and R2). Here is how to do it with vsearch:

VSEARCH=$(which vsearch)
THREADS=4
ENCODING=33
FORWARD="R1.fq.bz2"  # bz2, gz and uncompressed fastq files are allowed
REVERSE="R2.fq.bz2"
OUTPUT="coolproject.assembled.fastq"

# Merge read pairs
"${VSEARCH}" \
    --threads ${THREADS} \
    --fastq_mergepairs ${FORWARD} \
    --reverse ${REVERSE} \
    --fastq_ascii ${ENCODING} \
    --fastqout ${OUTPUT} \
    --fastq_allowmergestagger \
    --quiet 2>> ${OUTPUT/.fastq/.log}

The percentage of assembled pairs is approximately 90%. Good.
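
To double-check that figure, you can count the records in the merged file (a fastq record spans four lines) and compare with the 15,466,354 read pairs reported earlier by --fastq_chars:

echo "$(( $(wc -l < "${OUTPUT}") / 4 )) pairs merged out of 15466354"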

Demultiplexing, primer clipping, sample dereplication and quality extraction

Remember that our assembled fastq file contains data from 29 samples. We are going to separate them, an operation known as demultiplexing. To do that, we need a file tags.list listing our sample names and tag sequences:

S04 AGCACTGTAG
S14 CGAGAGATAC
S18 TCTACGTAGC
S20 ACGACTACAG
S21 CGTAGACTAG
S22 TACGAGTATG
S29 ACTGTACAGT
...

I chain cutadapt commands to select reads containing the tag I am looking for, then I select reads containing the forward primer, and finally reads containing the reverse primer. For each sample, I extract the quality information (after primer clipping), and I produce a dereplicated fasta file (identical sequences are merged).

Wow! That script does a lot of things.

# Define variables
INPUT="coolproject.assembled.fastq"
TAGS="tags.list"
PRIMER_F="CCCTGCCHTTTGTACACAC"
PRIMER_R="GTAGGTGAACCTGCRGAAGG"
MIN_LENGTH=32
MIN_F=$(( ${#PRIMER_F} * 2 / 3 ))  # primer match is >= 2/3 of primer length
MIN_R=$(( ${#PRIMER_R} * 2 / 3 ))
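# e.g., with the 19-nt forward primer above, MIN_F = 19 * 2 / 3 = 12 (bash integer arithmetic)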

# Define binaries, temporary files and output files
CUTADAPT="$(which cutadapt) --discard-untrimmed --minimum-length ${MIN_LENGTH}"
VSEARCH=$(which vsearch)
INPUT_REVCOMP=$(mktemp)
TMP_FASTQ=$(mktemp)
TMP_FASTQ2=$(mktemp)
TMP_FASTA=$(mktemp)
OUTPUT=$(mktemp)
QUALITY_FILE="${INPUT/.fastq/.qual}"
  
# Reverse complement fastq file
"${VSEARCH}" --quiet \
             --fastx_revcomp "${INPUT}" \
             --fastqout "${INPUT_REVCOMP}"

while read TAG_NAME TAG_SEQ ; do
    LOG="${TAG_NAME}.log"
    FINAL_FASTA="${TAG_NAME}.fas"
    
    # Trim tags, forward & reverse primers (search normal and antisense)
    cat "${INPUT}" "${INPUT_REVCOMP}" | \
        ${CUTADAPT} -g "${TAG_SEQ}" -O "${#TAG_SEQ}" - 2> "${LOG}" | \
        ${CUTADAPT} -g "${PRIMER_F}" -O "${MIN_F}" - 2>> "${LOG}" | \
        ${CUTADAPT} -a "${PRIMER_R}" -O "${MIN_R}" - 2>> "${LOG}" > "${TMP_FASTQ}"

    # Discard sequences containing Ns, add expected error rates
    "${VSEARCH}" \
        --quiet \
        --fastq_filter "${TMP_FASTQ}" \
        --fastq_maxns 0 \
        --relabel_sha1 \
        --eeout \
        --fastqout "${TMP_FASTQ2}" 2>> "${LOG}"

    # Discard sequences containing Ns, convert to fasta
    "${VSEARCH}" \
        --quiet \
        --fastq_filter "${TMP_FASTQ}" \
        --fastq_maxns 0 \
        --fastaout "${TMP_FASTA}" 2>> "${LOG}"
    
    # Dereplicate at the study level
    "${VSEARCH}" \
        --quiet \
        --derep_fulllength "${TMP_FASTA}" \
        --sizeout \
        --fasta_width 0 \
        --relabel_sha1 \
        --output "${FINAL_FASTA}" 2>> "${LOG}"

    # Discard quality lines, extract hash, expected error rates and read length
    sed 'n;n;N;d' "${TMP_FASTQ2}" | \
        awk 'BEGIN {FS = "[;=]"}
             {if (/^@/) {printf "%s\t%s\t", $1, $3} else {print length($1)}}' | \
        tr -d "@" >> "${OUTPUT}"
done < "${TAGS}"

# Produce the final quality file
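# (sorting by length, hash, then expected error puts identical hashes next to
# each other; uniq --check-chars=40 then keeps only the first entry, i.e. the
# lowest expected error, for each 40-character sha1 hash)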
sort -k3,3n -k1,1d -k2,2n "${OUTPUT}" | \
    uniq --check-chars=40 > "${QUALITY_FILE}"

# Clean
rm -f "${INPUT_REVCOMP}" "${TMP_FASTQ}" "${TMP_FASTA}" "${TMP_FASTQ2}" "${OUTPUT}"

Please note that I create a reverse-complemented version of the fastq file at the beginning of the script. In the past, I encountered fastq files where half the sequences were reverse-complemented. To avoid losing data, I systematically search for reads in both directions. Most of the time, it's a waste of time, but I like to be sure.

Warning: that script creates large temporary files. You might want to write those files locally (see mktemp --tmpdir .) to avoid filling up your /tmp partition.
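
For example, replacing the mktemp calls at the top of the script with something like this keeps the temporary files in the current directory:

INPUT_REVCOMP=$(mktemp --tmpdir=".")
TMP_FASTQ=$(mktemp --tmpdir=".")
TMP_FASTQ2=$(mktemp --tmpdir=".")
TMP_FASTA=$(mktemp --tmpdir=".")
OUTPUT=$(mktemp --tmpdir=".")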

After running that script, we end up with 29 fasta files looking like this:

>d2ffbe23a59225582bdaee6fa80f6fd9c9e69162;size=12843;
CGCCCGTCGCTACTACCGATTGAATGGCTTAGTGAGGTCTCCGGATTGACTTTGGCGCACCGGCAACGGCATGCTATTGCTGAGAAGTTGATCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC
>c0750ac8e18a3a62a5e6d738722da624e08a67a6;size=5236;
CGCCCGTCGCTACTACCGATTGAATGGCTTAGTGAGACCTCCGGATTGGCTTTGGGGAGTCGGAAACGACACCCTGTCGCTGAGAAGTTGGTCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC
>e5db261d66098584f72d8d0ad1e95125206cb891;size=4374;
CGCCCGTCGCTACTACCGATTGAATGGCTTAGTGAGGTCTCCGGATTGGCTTTGGGGAGCCGGCGACGGCACCCTATTGCTGAGAAGCTGATCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC
>62f623cf85737f7ee9d2709035045c62bc23b136;size=3089;
CGCCCGTCGCTACTACCGATTGAATGGCTTAGTGAGGCTTTCGGATTGGATTTTGGCAGCTGGCAACAGCAGCTAGAAACTGAAAAGTTATCCAAACTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC
...

We also obtain a quality file listing the lowest expected error observed for each unique sequence, and the length of that sequence (results are sorted by increasing sequence length):

33175da0b57959bd67b8db19b71d7cd8a9804c5f	0.0025	32
69fd186e3a43b31f3435d41065f982e573a9afdc	0.0025	32
c78fe6140524ae1757f7bab10e418f1e9b319d96	0.0025	32
d68ad93d01c18917e1fd9f746716ca6ecf0e0459	0.0027	32
e5cb168b1f6cc422278eea9166e7399967d729f7	0.0025	32
07796137b3acf6ce9ee3f57b89465ef6d72d880d	0.0026	33
...

As suggested by Tobias Guldberg Frøslev, if you have tags on both ends of your reads, you need to add a third column to the file tags.list containing the reverse-complemented version of your sequence tags. In the example below, each sample uses the same tag on both ends, so the third column is simply the reverse-complement of the second:

S001	AACCGA	TCGGTT
S002	CCGGAA	TTCCGG
S003	AGTGTT	AACACT
S004	CCGCTG	CAGCGG
S005	AACGCG	CGCGTT
...

The main script needs to be modified to read that third column and run cutadapt another time to trim the reverse tag:

while read TAG_NAME TAG_SEQ RTAG_SEQ; do
    LOG="${TAG_NAME}.log"
    FINAL_FASTA="${TAG_NAME}.fas"

    # Trim tags, forward & reverse primers (search normal and antisense)
    cat "${INPUT}" "${INPUT_REVCOMP}" | \
        ${CUTADAPT} -g "${TAG_SEQ}" -O "${#TAG_SEQ}" - 2> "${LOG}" | \
        ${CUTADAPT} -g "${PRIMER_F}" -O "${MIN_F}" - 2>> "${LOG}" | \
        ${CUTADAPT} -a "${RTAG_SEQ}" -O "${#RTAG_SEQ}" - 2>> "${LOG}" | \
        ${CUTADAPT} -a "${PRIMER_R}" -O "${MIN_R}" - 2>> "${LOG}" > "${TMP_FASTQ}"

Global dereplication, clustering and chimera detection

We pool the 29 fasta files (one per sample) and dereplicate the sequences with vsearch. The dereplicated dataset is clustered with swarm, and the OTU representatives are passed to vsearch to detect chimeras.

VSEARCH=$(which vsearch)
SWARM=$(which swarm)
TMP_FASTA=$(mktemp --tmpdir=".")
FINAL_FASTA="coolproject_29_samples.fas"

# Pool sequences
cat S[0-9][0-9]*.fas > "${TMP_FASTA}"

# Dereplicate (vsearch)
"${VSEARCH}" --derep_fulllength "${TMP_FASTA}" \
             --sizein \
             --sizeout \
             --fasta_width 0 \
             --output "${FINAL_FASTA}" > /dev/null

rm -f "${TMP_FASTA}"

# Clustering
THREADS=16
TMP_REPRESENTATIVES=$(mktemp --tmpdir=".")
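# swarm options: -d 1 (local clustering threshold of one difference),
# -f (fastidious post-processing of small OTUs), -z (usearch/vsearch-style
# abundance annotations ";size=")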
"${SWARM}" \
    -d 1 -f -t ${THREADS} -z \
    -i ${FINAL_FASTA/.fas/_1f.struct} \
    -s ${FINAL_FASTA/.fas/_1f.stats} \
    -w ${TMP_REPRESENTATIVES} \
    -o ${FINAL_FASTA/.fas/_1f.swarms} < ${FINAL_FASTA}

# Sort representatives
"${VSEARCH}" --fasta_width 0 \
             --sortbysize ${TMP_REPRESENTATIVES} \
             --output ${FINAL_FASTA/.fas/_1f_representatives.fas}
rm ${TMP_REPRESENTATIVES}
  
# Chimera checking
REPRESENTATIVES=${FINAL_FASTA/.fas/_1f_representatives.fas}
UCHIME=${REPRESENTATIVES/.fas/.uchime}
"${VSEARCH}" --uchime_denovo "${REPRESENTATIVES}" \
             --uchimeout "${UCHIME}"
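
To get a quick feel for the chimera detection results, you can tabulate the last column of the uchime output (Y for chimeric, N for non-chimeric, ? for borderline cases):

cut -f 18 "${UCHIME}" | sort | uniq -c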

At that point, the files obtained can be used to build an OTU table. One piece is still missing though: the taxonomic assignment of our environmental sequences.

Taxonomic assignment

To get the most out of your metabarcoding data, it is recommended to assign your sequences to taxa. If you have a set of reference sequences for your marker, you can perform a sort of in silico PCR to extract the reference fragments covered by your primers. You can use cutadapt and your forward and reverse primers to trim the reference sequences. Once the reference dataset is ready, you can use vsearch and a python script to produce a last-common-ancestor assignment table. The whole procedure is described here.
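
Here is a rough sketch of that in silico PCR step, under the assumption that your reference sequences are in a file named references.fasta (a placeholder for your own dataset) and using the same primers and minimum overlap thresholds as for the reads:

PRIMER_F="CCCTGCCHTTTGTACACAC"
PRIMER_R="GTAGGTGAACCTGCRGAAGG"
MIN_F=$(( ${#PRIMER_F} * 2 / 3 ))
MIN_R=$(( ${#PRIMER_R} * 2 / 3 ))

cutadapt --discard-untrimmed -g "${PRIMER_F}" -O "${MIN_F}" references.fasta 2> /dev/null | \
    cutadapt --discard-untrimmed -a "${PRIMER_R}" -O "${MIN_R}" - > references_trimmed.fasta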

If for some reason you don't need taxonomic assignment results, you can create a fake table like this:

grep "^>" coolproject_29_samples_1f_representatives.fas | \
    sed 's/^>//
         s/;size=/\t/
         s/;$/\t100.0\tNA\tNA/' > coolproject_29_samples_1f_representatives.results

Build the OTU table

Now we can merge the taxonomic assignment results, the clustering results, the quality values, the chimera detection results and the OTU occurrences in each sample. The output will be a big table:

FASTA="coolproject_29_samples.fas"
SCRIPT="OTU_contingency_table.py"
STATS="${FASTA/.fas/_1f.stats}"
SWARMS="${FASTA/.fas/_1f.swarms}"
REPRESENTATIVES="${FASTA/.fas/_1f_representatives.fas}"
UCHIME="${FASTA/.fas/_1f_representatives.uchime}"
ASSIGNMENTS="${FASTA/.fas/_1f_representatives.results}"
QUALITY="coolproject.assembled.qual"
OTU_TABLE="${FASTA/.fas/.OTU.table}"

python \
    "${SCRIPT}" \
    "${REPRESENTATIVES}" \
    "${STATS}" \
    "${SWARMS}" \
    "${UCHIME}" \
    "${QUALITY}" \
    "${ASSIGNMENTS}" \
    S[0-9]*.fas > "${OTU_TABLE}"

Note that you must adapt the expression S[0-9]*.fas to your actual sample names. The order of the arguments is important: the script is not robust to argument permutations. The OTU_contingency_table.py script is copied below:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
    Read all fasta files and build a sorted OTU contingency
    table. Usage: python OTU_contingency_table.py [input files]
"""

from __future__ import print_function

__author__ = "Frédéric Mahé <frederic.mahe@cirad.fr>"
__date__ = "2016/03/07"
__version__ = "$Revision: 5.0"

import os
import re
import sys
import operator

#*****************************************************************************#
#                                                                             #
#                                  Functions                                  #
#                                                                             #
#*****************************************************************************#


def representatives_parse():
    """
    Get seed sequences.
    """
    separator = ";size="
    representatives_file = sys.argv[1]
    representatives = dict()
    with open(representatives_file, "rU") as representatives_file:
        for line in representatives_file:
            if line.startswith(">"):
                amplicon = line.strip(">;\n").split(separator)[0]
            else:
                representatives[amplicon] = line.strip()

    return representatives


def stats_parse():
    """
    Map OTU seeds and stats.
    """
    separator = "\t"
    stats_file = sys.argv[2]
    stats = dict()
    seeds = dict()
    with open(stats_file, "rU") as stats_file:
        for line in stats_file:
            cloud, mass, seed, seed_abundance = line.strip().split(separator)[0:4]
            stats[seed] = int(mass)
            seeds[seed] = (int(seed_abundance), int(cloud))
    # Sort OTUs by decreasing mass
    sorted_stats = sorted(stats.iteritems(),
                          key=operator.itemgetter(1, 0))
    sorted_stats.reverse()

    return stats, sorted_stats, seeds


def swarms_parse():
    """
    Map OTUs.
    """
    separator = "_[0-9]+|;size=[0-9]+;?| "  # parsing of abundance annotations
    swarms_file = sys.argv[3]
    swarms = dict()
    with open(swarms_file, "rU") as swarms_file:
        for line in swarms_file:
            line = line.strip()
            amplicons = re.split(separator, line)[0::2]
            seed = amplicons[0]
            swarms[seed] = [amplicons]

    return swarms


def uchime_parse():
    """
    Map OTU's chimera status.
    """
    separator = " "
    uchime_file = sys.argv[4]
    uchime = dict()
    with open(uchime_file, "rU") as uchime_file:
        for line in uchime_file:
            OTU = line.strip().split("\t")
            try:
                seed = OTU[1].split(";")[0]
            except IndexError:  # deal with partial line (missing seed)
                continue
            try:
                status = OTU[17]
            except IndexError:  # deal with unfinished chimera detection runs
                status = "NA"
            uchime[seed] = status

    return uchime


def quality_parse():
    """
    List good amplicons.
    """
    quality_file = sys.argv[5]
    quality = dict()
    with open(quality_file, "rU") as quality_file:
        for line in quality_file:
            sha1, qual, length = line.strip().split()
            quality[sha1] = float(qual) / int(length)

    return quality


def stampa_parse():
    """
    Map amplicon ids and taxonomic assignments.
    """
    separator = "\t"
    stampa_file = sys.argv[6]
    stampa = dict()
    with open(stampa_file, "rU") as stampa_file:
        for line in stampa_file:
            amplicon, abundance, identity, taxonomy, references = line.strip().split(separator)
            stampa[amplicon] = (identity, taxonomy, references)

    return stampa


def fasta_parse():
    """
    Map amplicon ids, abundances and samples.
    """
    separator = ";size="
    fasta_files = sys.argv[7:]
    samples = dict()
    amplicons2samples = dict()
    for fasta_file in fasta_files:
        sample = os.path.basename(fasta_file)
        sample = os.path.splitext(sample)[0]
        samples[sample] = samples.get(sample, 0) + 1
        with open(fasta_file, "rU") as fasta_file:
            for line in fasta_file:
                if line.startswith(">"):
                    amplicon, abundance = line.strip(">;\n").split(separator)
                    abundance = int(abundance)
                    if amplicon not in amplicons2samples:
                        amplicons2samples[amplicon] = {sample: abundance}
                    else:
                        # deal with duplicated samples
                        amplicons2samples[amplicon][sample] = amplicons2samples[amplicon].get(sample, 0) + abundance
    # deal with duplicated samples
    duplicates = [sample for sample in samples if samples[sample] > 1]
    if duplicates:
        print("Warning: some samples are duplicated", file=sys.stderr)
        print("\n".join(duplicates), file=sys.stderr)
    samples = sorted(samples.keys())

    return amplicons2samples, samples


def print_table(representatives, stats, sorted_stats,
                swarms, uchime, amplicons2samples,
                samples, quality, seeds, stampa):
    """
    Export results.
    """
    # Print table header
    print("OTU", "total", "cloud",
          "amplicon", "length", "abundance",
          "chimera", "spread", "quality",
          "sequence", "identity", "taxonomy", "references",
          "\t".join(samples),
          sep="\t", file=sys.stdout)

    # Print table content
    i = 1
    for seed, abundance in sorted_stats:
        sequence = representatives[seed]
        occurrences = dict([(sample, 0) for sample in samples])
        for amplicons in swarms[seed]:
            for amplicon in amplicons:
                for sample in samples:
                    occurrences[sample] += amplicons2samples[amplicon].get(sample, 0)
        spread = len([occurrences[sample] for sample in samples if occurrences[sample] > 0])
        sequence_abundance, cloud = seeds[seed]

        # Quality
        if seed in quality:
            high_quality = quality[seed]
        else:
            high_quality = "NA"
        
        # Chimera checking (deal with incomplete cases. Is it useful?)
        if seed in uchime:
            chimera_status = uchime[seed]
        else:
            chimera_status = "NA"

        # Taxonomic assignment (deal with incomplete cases)
        if seed in stampa:
            identity, taxonomy, references = stampa[seed]
        else:
            identity, taxonomy, references = "NA", "NA", "NA"

        # output
        print(i, abundance, cloud,
              seed, len(sequence), sequence_abundance,
              chimera_status, spread, high_quality, sequence,
              identity, taxonomy, references,
              "\t".join([str(occurrences[sample]) for sample in samples]),
              sep="\t", file=sys.stdout)
        i += 1

    return


def main():
    """
    Read all fasta files and build a sorted OTU contingency table.
    """
    # Parse taxonomic results
    representatives = representatives_parse()

    # Parse stats
    stats, sorted_stats, seeds = stats_parse()

    # Parse swarms
    swarms = swarms_parse()

    # Parse uchime
    uchime = uchime_parse()

    # Parse quality
    quality = quality_parse()

    # Parse taxonomic assignment results
    stampa = stampa_parse()
    
    # Parse fasta files
    amplicons2samples, samples = fasta_parse()

    # Print table header
    print_table(representatives, stats, sorted_stats, swarms,
                uchime, amplicons2samples, samples, quality,
                seeds, stampa)

    return


#*****************************************************************************#
#                                                                             #
#                                     Body                                    #
#                                                                             #
#*****************************************************************************#

if __name__ == '__main__':

    main()

    sys.exit(0)

Filter the OTU table

Finally, you have an OTU table. So far, we've eliminated reads that we could not merge, reads without tags or primers, very short reads and reads with uncalled bases ("N"). Now is the time to discard OTUs using efficient, but somewhat arbitrary thresholds.

The OTU table starts with the following columns:

    1. OTU (each OTU is numbered)
    1. total (total number of reads in the OTU)
    1. cloud (total number of unique sequences in the OTU)
    1. amplicon (identifier of the OTU representative)
    1. length (length of the OTU representative)
    1. abundance (abundance of the OTU representative)
    1. chimera (is it a chimera? Yes, No, ?)
    1. spread (number of samples where the OTU occurs)
    1. quality (minimum expected error observed for the OTU representative, divided by sequence length)
    1. sequence (sequence of the OTU representative)
    1. identity (maximum similarity of the OTU representative with reference sequences)
    1. taxonomy (taxonomic assignment of the OTU representative)
    1. references (reference sequences closest to the OTU representative)

The rest of the columns represent OTU abundances in our 29 samples. Let's filter our results by taxonomic assignment, OTU size or spread, quality and chimeric status:

TABLE="coolproject_29_samples.OTU.table"
FILTERED="${TABLE/.table/.filtered.table}"
MY_FAVORITE_TAXA="pangolin"

head -n 1 "${TABLE}" > "${FILTERED}"
grep "${MY_FAVORITE_TAXA}" "${TABLE}" | \
    awk '$7 == "N" && $9 <= 0.0002 && ($2 >= 3 || $8 >= 2)' >> "${FILTERED}"

In the example above, the awk expression keeps OTUs that are not chimeric (column 7), have an expected error per base of at most 0.0002 (column 9), and either contain at least 3 reads (column 2) or occur in at least 2 samples (column 8). The table concentrates a lot of information, so feel free to go crazy with the filtering parameters (minimum and maximum sequence length, mandatory presence in replicates, taxonomic assignment, etc.). From there, I usually load the table in R and use vegan or other packages to analyze my results. If need be, it should be easy to write a script to convert the OTU table into the biom format used by Qiime.

That's it. This is how I prepare metabarcoding data with vsearch, cutadapt and swarm. Again, feel free to comment or email me if the above steps are too cursory.