
Here we walk through version 1.6 of the DADA2 pipeline on a small multi-sample dataset. Our starting point is a set of Illumina-sequenced paired-end fastq files that have been split (or "demultiplexed") by sample and from which the barcodes/adapters have already been removed. The end product is an **amplicon sequence variant (ASV) table**, a higher-resolution analogue of the traditional "OTU table", which records the number of times each amplicon sequence variant was observed in each sample. We also assign taxonomy to the output sequences, and demonstrate how the data can be imported into the popular [phyloseq](https://joey711.github.io/phyloseq/) R package for the analysis of microbiome data.

-----------------------

# Starting point

This workflow assumes that your sequencing data meets certain criteria:

* Samples have been demultiplexed, i.e. split into individual per-sample fastq files.
* Non-biological nucleotides have been removed, e.g. primers, adapters, linkers, etc.
* If paired-end sequencing data, the forward and reverse fastq files contain reads in matched order.

If these criteria are not true for your data (**are you sure there aren't any primers hanging around?**) you need to remedy those issues before beginning this workflow. See [the FAQ](faq.html) for some recommendations for common issues.

# Getting ready

First we load the `dada2` package. If you don't already have it, see the [dada2 installation instructions](dada-installation.html):


In [None]:
library(dada2); packageVersion("dada2")
library(readr)

library(phyloseq); packageVersion("phyloseq")
library(ggplot2); packageVersion("ggplot2")
theme_set(theme_bw())

*Older versions of this workflow associated with previous release versions of the dada2 R package are also available: [version 1.2](tutorial_1_2.html), [version 1.4](tutorial_1_4.html).*

The data we will work with are the same as those used in the [Mothur Miseq SOP](http://www.mothur.org/wiki/MiSeq_SOP). Download the [example data](http://www.mothur.org/w/images/d/d6/MiSeqSOPData.zip) and unzip. These fastq files were generated by amplicon sequencing (Illumina MiSeq, 2x250, V4 region of the 16S rRNA gene) of gut samples collected longitudinally from a mouse post-weaning, and one mock community control. For now just consider them paired-end fastq files to be processed. Define the following path variable so that it points to the extracted directory on **your** machine:

## Set Paths
We need to set up a "scratch" directory for saving files that we generate while running dada2, but don't need to save long term.

In [None]:
scratch.dir = path.expand("~/work/scratch/miseq_sop")
dir.create(scratch.dir, recursive = TRUE)

miseqsop.dir = file.path(scratch.dir,"MiSeq_SOP")

silva.url = "https://zenodo.org/record/1172783/files/silva_nr_v132_train_set.fa.gz"
silva.ref = file.path(scratch.dir, basename(silva.url))
# silva.species.ref = "/data/references/dada/silva_species_assignment_v128.fa.gz"

ps.rds = file.path(scratch.dir, "miseq_sop.rds")

## Download Data

In [None]:
miseq_sop_url = "http://www.mothur.org/w/images/d/d6/MiSeqSOPData.zip"
miseq_sop_data_zip = file.path(scratch.dir,basename(miseq_sop_url))
download.file(miseq_sop_url, destfile = miseq_sop_data_zip)

In [None]:
download.file(silva.url, destfile = silva.ref)

In [None]:
list.files(scratch.dir)
unzip(miseq_sop_data_zip, exdir=scratch.dir)
list.files(miseqsop.dir)
unlink(miseq_sop_data_zip)

In [None]:
list.files(scratch.dir)

# Filter and Trim

First we read in the names of the fastq files, and perform some string manipulation to get lists of the forward and reverse fastq files in matched order:

In [None]:
# Forward and reverse fastq filenames have format: SAMPLENAME_R1_001.fastq and SAMPLENAME_R2_001.fastq
fnFs <- sort(list.files(miseqsop.dir, pattern="_R1_001.fastq", full.names = TRUE))
fnRs <- sort(list.files(miseqsop.dir, pattern="_R2_001.fastq", full.names = TRUE))
# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)
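To make the string manipulation concrete, here is a small sketch on two made-up file paths (they are illustrative assumptions and are not read from disk). `strsplit` breaks each basename at every underscore, and `` `[` `` keeps the first piece:

```r
# Hypothetical paths following the SAMPLENAME_XXX.fastq convention
example.files <- c("/tmp/data/F3D0_S188_L001_R1_001.fastq",
                   "/tmp/data/Mock_S280_L001_R1_001.fastq")

# Split each basename on "_" and keep the first field
example.names <- sapply(strsplit(basename(example.files), "_"), `[`, 1)
print(example.names)
```

Note that this approach only works when the sample name itself contains no underscore; otherwise the name would be cut at its first underscore.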

## Examine quality profiles of forward and reverse reads

We start by visualizing the quality profiles of the forward reads:

In [None]:
plotQualityProfile(fnFs[1:2])

The forward reads are of good quality. We generally advise trimming the last few nucleotides to avoid less well-controlled errors that can arise there. These quality profiles do not suggest that any additional trimming is needed, so we will truncate the forward reads at position 240 (trimming the last 10 nucleotides).

Now we visualize the quality profile of the reverse reads:

In [None]:
plotQualityProfile(fnRs[1:2])

The reverse reads are of significantly worse quality, especially at the end, which is common in Illumina sequencing. This isn't too worrisome, as DADA2 incorporates quality information into its error model which makes the algorithm [robust to lower quality sequence](https://twitter.com/bejcal/status/771010634074820608), but trimming as the average qualities crash will improve the algorithm's sensitivity to rare sequence variants. Based on these profiles, we will truncate the reverse reads at position 160 where the quality distribution crashes.


## Perform filtering and trimming

Assign the filenames for the filtered fastq.gz files.

In [None]:
filt_path <- file.path(scratch.dir, "filtered") # Place filtered files in filtered/ subdirectory
filtFs <- file.path(filt_path, paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(filt_path, paste0(sample.names, "_R_filt.fastq.gz"))

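We'll use standard filtering parameters: `maxN=0` (DADA2 requires no Ns), `truncQ=2`, `rm.phix=TRUE` and `maxEE=2`. The `maxEE` parameter sets the maximum number of "expected errors" allowed in a read, which is [a better filter than simply averaging quality scores](http://www.drive5.com/usearch/manual/expected_errors.html). The call below also sets `trimLeft=10`, which removes the first 10 nucleotides of each read, so the filtered reads are `truncLen - trimLeft` nucleotides long (230 forward, 150 reverse).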
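As a rough illustration of why expected errors are more informative than a mean quality score, the sketch below computes the expected errors of two toy reads from the standard Phred relationship (error probability `10^(-Q/10)`). The function name `expected_errors` and the quality values are made up for this example; this is not the code `filterAndTrim` runs internally.

```r
# Expected errors of a read: sum of the per-base error probabilities 10^(-Q/10)
expected_errors <- function(q) sum(10^(-q / 10))

# Two toy reads (made-up Phred scores) with the same mean quality of 30
q.uniform <- rep(30, 10)
q.mixed   <- c(rep(35, 8), 10, 10)

mean(q.uniform); mean(q.mixed)
expected_errors(q.uniform)   # 10 bases * 0.001 each
expected_errors(q.mixed)     # roughly 0.2, dominated by the two Q10 bases
```

Both reads have the same average quality, yet the second carries about twenty times more expected errors, which is the distinction `maxEE` is designed to capture.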

**Filter the forward and reverse reads**

In [None]:
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, trimLeft=10, truncLen=c(240,160),
              maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE,
              compress=TRUE, multithread=FALSE) # On Windows set multithread=FALSE
head(out)

# Learn the Error Rates

The DADA2 algorithm depends on a parametric error model (`err`) and every amplicon dataset has a different set of error rates. The `learnErrors` method learns the error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many optimization problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors).

In [None]:
errF <- learnErrors(filtFs, multithread=TRUE)
errR <- learnErrors(filtRs, multithread=TRUE)

It is always worthwhile, as a sanity check if nothing else, to visualize the estimated error rates:

In [None]:
plotErrors(errF, nominalQ=TRUE)

The error rates for each possible transition (e.g. A->C, A->G, ...) are shown. Points are the observed error rates for each consensus quality score. The black line shows the estimated error rates after convergence. The red line shows the error rates expected under the nominal definition of the Q-value. Here the black line (the estimated rates) fits the observed rates well, and the error rates drop with increased quality as expected. Everything looks reasonable and we proceed with confidence.


# Dereplication

Dereplication combines all identical sequencing reads into "unique sequences" with a corresponding "abundance": the number of reads with that unique sequence. Dereplication substantially reduces computation time by eliminating redundant comparisons.

Dereplication in the DADA2 pipeline has one crucial addition from other pipelines: **DADA2 retains a summary of the quality information associated with each unique sequence**. The consensus quality profile of a unique sequence is the average of the positional qualities from the dereplicated reads. These quality profiles inform the error model of the subsequent denoising step, significantly increasing DADA2's accuracy.
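The idea can be sketched in base R on a handful of toy reads. The reads and quality values below are made up, and this is only a conceptual illustration of counting unique sequences and averaging their positional qualities, not the implementation inside `derepFastq`:

```r
# Toy reads and per-base Phred qualities (made-up values); one matrix row per read
reads <- c("ACGT", "ACGT", "ACGA", "ACGT")
quals <- rbind(c(30, 30, 30, 20),
               c(34, 32, 30, 10),
               c(30, 30, 30, 30),
               c(32, 34, 30, 30))

# Abundance of each unique sequence, most abundant first
abund <- sort(table(reads), decreasing = TRUE)

# Consensus quality: mean quality at each position over the reads
# that share a given unique sequence
consensus <- t(sapply(names(abund), function(s) {
  colMeans(quals[reads == s, , drop = FALSE])
}))

abund
consensus
```

Here `ACGT` has abundance 3, and its consensus quality at the last position is pulled down by the two low-quality reads.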

**Dereplicate the filtered fastq files**

In [None]:
derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)
# Name the derep-class objects by the sample names
names(derepFs) <- sample.names
names(derepRs) <- sample.names

# Sample Inference

We are now ready to apply the core sequence-variant inference algorithm to the dereplicated data. 

**Infer the sequence variants in each sample**

In [None]:
dadaFs <- dada(derepFs, err=errF, multithread=TRUE)
dadaRs <- dada(derepRs, err=errR, multithread=TRUE)

Inspecting the dada-class object returned by dada:

In [None]:
dadaFs[[1]]

In [None]:
cat("The DADA2 algorithm inferred", 
    length(dadaFs[[1]]$sequence), 
    "real sequence variants from the", 
    length(dadaFs[[1]]$map),
    "unique sequences in the first sample")

There is much more to the `dada-class` return object than this (see `help("dada-class")` for some info), including multiple diagnostics about the quality of each inferred sequence variant, but that is beyond the scope of an introductory tutorial.

# Merge paired reads

Spurious sequence variants are further reduced by merging overlapping reads. The core function here is `mergePairs`, which depends on the forward and reverse reads being in matching order at the time they were dereplicated.

**Merge the denoised forward and reverse reads**:

In [None]:
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)
# Inspect the merger data.frame from the first sample
head(mergers[[1]])

We now have a `data.frame` for each sample with the merged `$sequence`, its `$abundance`, and the indices of the merged `$forward` and `$reverse` denoised sequences. Paired reads that did not exactly overlap were removed by `mergePairs`.
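To give a feel for what "exactly overlap" means, here is a toy sketch that merges a forward read with the reverse complement of a reverse read when their ends agree exactly. The helper names `revcomp` and `merge_exact`, the `min_overlap` parameter, and the sequences are all hypothetical; `mergePairs` itself works on the denoised sequences and is more sophisticated than this.

```r
# Reverse-complement a DNA string (A/C/G/T only)
revcomp <- function(s) {
  paste(rev(strsplit(chartr("ACGT", "TGCA", s), "")[[1]]), collapse = "")
}

# Merge a forward read with the reverse complement of a reverse read,
# requiring an exact overlap of at least min_overlap bases
merge_exact <- function(fwd, rev_read, min_overlap = 4) {
  rc <- revcomp(rev_read)
  max_ov <- min(nchar(fwd), nchar(rc))
  if (max_ov < min_overlap) return(NA_character_)
  for (ov in seq(max_ov, min_overlap, by = -1)) {  # prefer the longest overlap
    if (substring(fwd, nchar(fwd) - ov + 1) == substring(rc, 1, ov)) {
      return(paste0(fwd, substring(rc, ov + 1)))
    }
  }
  NA_character_  # no exact overlap found: the pair would be discarded
}

# Toy amplicon, read from both ends so that the two reads share 6 bases
amplicon <- "ACGTTGCAAGGCTTAACCGG"
fwd      <- substring(amplicon, 1, 14)
rev_read <- revcomp(substring(amplicon, 9))

merged <- merge_exact(fwd, rev_read)
merged == amplicon
```

A single mismatch inside the overlap region makes `merge_exact` return `NA`, mirroring how pairs that do not overlap exactly are dropped.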

# Construct sequence table

We can now construct a sequence table of our mouse samples, a higher-resolution version of the OTU table produced by traditional methods.

In [None]:
seqtab <- makeSequenceTable(mergers)
dim(seqtab)
# Inspect distribution of sequence lengths
table(nchar(getSequences(seqtab)))

The sequence table is a `matrix` with rows corresponding to (and named by) the samples, and columns corresponding to (and named by) the sequence variants. The lengths of our merged sequences all fall within the expected range for this V4 amplicon.
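If the length table ever shows sequences far outside the expected range, one option is to drop those columns by subsetting on sequence length. The sketch below does this on a made-up toy table; the sequences, counts, and the 6 to 7 window are illustrative assumptions, and for real data the window would be chosen from the `table(nchar(...))` output.

```r
# Toy sequence table (made-up counts): samples in rows, sequences as column names
toy.seqtab <- matrix(c(10, 0, 3,
                        7, 2, 0),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("S1", "S2"),
                                     c("ACGTAC", "ACGTACG", "ACG")))
table(nchar(colnames(toy.seqtab)))

# Keep only columns whose sequence length lies in the chosen window
keep <- nchar(colnames(toy.seqtab)) %in% 6:7
toy.filtered <- toy.seqtab[, keep, drop = FALSE]  # drop = FALSE keeps it a matrix
dim(toy.filtered)
```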


# Remove chimeras

The core `dada` method removes substitution and indel errors, but chimeras remain. Fortunately, the accuracy of the sequences after denoising makes identifying chimeras simpler than it is when dealing with fuzzy OTUs: chimeras are identified as sequences that can be exactly reconstructed as a bimera (two-parent chimera) by combining a left segment and a right segment from two more abundant sequences.
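The notion of an exact bimera can be sketched with a toy check in base R. The function `is_exact_bimera` is hypothetical and heavily simplified: it assumes the candidate parents passed in are the more abundant sequences and that they have the same length as the query, and it is not how `removeBimeraDenovo` is implemented.

```r
# TRUE if `query` equals the left part of one parent joined to the
# right part of a different parent at some breakpoint
is_exact_bimera <- function(query, parents) {
  n <- nchar(query)
  parents <- parents[nchar(parents) == n]  # simplification: equal lengths only
  for (bp in seq_len(n - 1)) {
    left_ok  <- substring(parents, 1, bp) == substring(query, 1, bp)
    right_ok <- substring(parents, bp + 1) == substring(query, bp + 1)
    # A parent matching both sides is identical to the query, so not a chimera
    if (any(left_ok) && any(right_ok) && !any(left_ok & right_ok)) return(TRUE)
  }
  FALSE
}

# Toy parents and queries (made-up sequences)
parents <- c("AAAAAAAAAA", "CCCCCCCCCC")
is_exact_bimera("AAAAACCCCC", parents)  # left half of parent 1, right half of parent 2
is_exact_bimera("AAAAAGGGGG", parents)  # right half matches no parent
```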

**Remove chimeric sequences**:

In [None]:
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
dim(seqtab.nochim)
sum(seqtab.nochim)/sum(seqtab)

In [None]:
cat("The fraction of chimeras varies based on factors including experimental",
    "procedures and sample complexity, but can be substantial.", 
    "Here chimeras make up about", 
    round(100*(ncol(seqtab)-ncol(seqtab.nochim))/ncol(seqtab)), 
    "% of the inferred sequence variants, but those variants account for only about",
    round(100*(sum(seqtab)-sum(seqtab.nochim))/sum(seqtab)),
    "% of the total sequence reads.")

# Track reads through the pipeline

As a final check of our progress, we'll look at the number of reads that made it through each step in the pipeline:

In [None]:
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(mergers, getN), rowSums(seqtab), rowSums(seqtab.nochim))
# If processing a single sample, remove the sapply calls: e.g. replace sapply(dadaFs, getN) with getN(dadaFs)
colnames(track) <- c("input", "filtered", "denoised", "merged", "tabled", "nonchim")
rownames(track) <- sample.names
head(track)


Looks good: we kept the majority of our raw reads, and there is no overly large drop associated with any single step.
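Raw counts can be hard to compare across samples, so it may help to express the tracking table as percentages. The sketch below uses a made-up toy matrix with the same column layout as `track`; the counts are invented for illustration and are not results from this dataset.

```r
# Toy tracking matrix in the same column layout as `track` (made-up counts)
toy.track <- rbind(A = c(input = 8000, filtered = 7200, denoised = 7100,
                         merged = 6800, tabled = 6800, nonchim = 6700),
                   B = c(input = 6000, filtered = 5400, denoised = 5300,
                         merged = 5000, tabled = 5000, nonchim = 4500))

# Percent of input reads surviving up to each step
pct.retained <- round(100 * toy.track / toy.track[, "input"], 1)
pct.retained

# Percent lost at each step relative to the previous step
step.loss <- round(100 * (1 - toy.track[, -1] / toy.track[, -ncol(toy.track)]), 1)
step.loss
```

Applying the same two expressions to the real `track` matrix would highlight any single step that removes an unusually large share of reads.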


# Assign taxonomy

It is common at this point, especially in 16S/18S/ITS amplicon sequencing, to classify sequence variants taxonomically. The DADA2 package provides a native implementation of [the RDP's naive Bayesian classifier](http://www.ncbi.nlm.nih.gov/pubmed/17586664) for this purpose. The `assignTaxonomy` function takes a set of sequences and a training set of taxonomically classified sequences, and outputs the taxonomic assignments with at least `minBoot` bootstrap confidence. 

We maintain [formatted training fastas for the RDP training set, GreenGenes clustered at 97\% identity, and the Silva reference database](training.html). For fungal taxonomy, the General Fasta release files from the [UNITE ITS database](https://unite.ut.ee/repository.php) can be used as is. This workflow uses the `silva_nr_v132_train_set.fa.gz` training set, which was downloaded to the scratch directory above and is referenced by `silva.ref`.

In [None]:
taxa <- assignTaxonomy(seqtab.nochim, silva.ref, multithread=TRUE)


**Optional:** The dada2 package also implements a method to make [species level assignments based on **exact matching**](assign.html#species-assignment) between ASVs and sequenced reference strains. Currently species-assignment training fastas are available for the Silva and RDP 16S databases. To follow the optional species addition step, download the `silva_species_assignment_v128.fa.gz` file and set `silva.species.ref` to its path (its definition in the Set Paths section is commented out).

```r
taxa <- addSpecies(taxa, silva.species.ref)
```

Let's inspect the taxonomic assignments:

In [None]:
taxa.print <- taxa # Removing sequence rownames for display only
rownames(taxa.print) <- NULL
head(taxa.print)

Unsurprisingly, the Bacteroidetes are well represented among the most abundant taxa in these fecal samples. If you ran the optional species step, you will see that few species assignments were made, both because it is often not possible to make unambiguous species assignments from segments of the 16S gene, and because there is surprisingly little coverage of the indigenous mouse gut microbiota in reference databases.

# Bonus: Handoff to phyloseq

The [phyloseq R package is a powerful framework for further analysis of microbiome data](https://joey711.github.io/phyloseq/). We now demonstrate how to straightforwardly import the tables produced by the DADA2 pipeline into phyloseq. We'll also add the small amount of metadata we have -- the samples are named by the gender (G), mouse subject number (X) and the day post-weaning (Y) it was sampled (e.g. GXDY).

## Import into phyloseq

We can construct a simple sample data.frame based on the filenames. Usually this step would instead involve reading the sample data in from a file.


In [None]:
samples.out <- rownames(seqtab.nochim)
subject <- sapply(strsplit(samples.out, "D"), `[`, 1)
gender <- substr(subject,1,1)
subject <- substr(subject,2,999)
day <- as.integer(sapply(strsplit(samples.out, "D"), `[`, 2))
samdf <- data.frame(Subject=subject, Gender=gender, Day=day)
samdf$When <- "Early"
samdf$When[samdf$Day>100] <- "Late"
rownames(samdf) <- samples.out
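To see what the parsing above does, here is the same logic applied to a few illustrative names in the GXDY format, plus the mock community. The names are assumptions chosen for the example rather than read from `seqtab.nochim`:

```r
# Illustrative sample names in the GXDY format, plus the mock community
toy.samples <- c("F3D0", "F3D141", "M1D9", "Mock")

# Split at "D": the first piece is gender + subject, the second is the day
parts <- strsplit(toy.samples, "D")
toy.subject <- sapply(parts, `[`, 1)
toy.day <- as.integer(sapply(parts, `[`, 2))

data.frame(Sample  = toy.samples,
           Gender  = substr(toy.subject, 1, 1),
           Subject = substr(toy.subject, 2, 999),
           Day     = toy.day)
```

The `Mock` row ends up with a meaningless subject and a missing day, since its name does not follow the GXDY pattern; the mock sample is removed from the phyloseq object in the next step.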

We can now construct a phyloseq object directly from the dada2 outputs.

In [None]:
ps <- phyloseq(otu_table(seqtab.nochim, taxa_are_rows=FALSE), 
               sample_data(samdf), 
               tax_table(taxa))
ps <- prune_samples(sample_names(ps) != "Mock", ps) # Remove mock sample
ps

Any R object can be saved to an RDS file.  It is a good idea to do this for any object that is time consuming to generate and is reasonably small in size.  Even when the object was generated reproducibly, it can be frustrating to wait minutes or hours to regenerate when you are ready to perform downstream analyses.

We will save our phyloseq object to a file, since it is quite small (especially compared to the size of the input FASTQ files) and several time-consuming computational steps were required to generate it.

In [None]:
write_rds(ps, ps.rds)

We can now confirm that it worked!


In [None]:
loaded.ps = read_rds(ps.rds)
print(loaded.ps)

We are now ready to use phyloseq!

## Visualize alpha-diversity

In [None]:
plot_richness(ps, x="Day", measures=c("Shannon", "Simpson"), color="When")

No obvious systematic difference in alpha-diversity between early and late samples.
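For readers unfamiliar with the two measures, the sketch below computes them by hand on toy count vectors. It assumes the definitions used by the vegan package, on which phyloseq relies for these measures: Shannon with natural logarithms and Simpson reported as one minus the sum of squared proportions. The helper functions and the counts are made up for illustration.

```r
# Toy count vectors for two samples (made-up numbers)
even   <- c(10, 10, 10, 10)
skewed <- c(37, 1, 1, 1)

# Shannon index: -sum(p * log(p)) over the taxa that are present
shannon <- function(x) { p <- x[x > 0] / sum(x); -sum(p * log(p)) }
# Simpson index in the "1 - D" form: 1 - sum(p^2)
simpson <- function(x) { p <- x / sum(x); 1 - sum(p^2) }

c(even = shannon(even), skewed = shannon(skewed))
c(even = simpson(even), skewed = simpson(skewed))
```

Both indices are highest for the evenly distributed sample and drop when a single taxon dominates.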

## Ordinate

In [None]:
# Transform data to proportions as appropriate for Bray-Curtis distances
ps.prop <- transform_sample_counts(ps, function(otu) otu/sum(otu))
ord.nmds.bray <- ordinate(ps.prop, method="NMDS", distance="bray")
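The reason for converting to proportions first can be seen on a toy pair of samples. The sketch below uses the usual Bray-Curtis formula, `sum(|x - y|) / sum(x + y)`; the helper `bray` and the counts are made up for illustration and are not part of phyloseq.

```r
# Two toy samples as raw counts (made-up numbers) with very different depths
a <- c(50, 50, 0)
b <- c(10, 0, 10)

# Bray-Curtis dissimilarity between two abundance vectors
bray <- function(x, y) sum(abs(x - y)) / sum(x + y)

bray(a, b)                    # on raw counts: inflated by the difference in depth
bray(a / sum(a), b / sum(b))  # on proportions: reflects composition only
```

On raw counts the dissimilarity partly reflects that one sample was sequenced five times deeper than the other; on proportions it depends only on the relative composition.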

In [None]:
plot_ordination(ps.prop, ord.nmds.bray, color="When", title="Bray NMDS")

Ordination picks out a clear separation between the early and late samples.

## Bar plot

In [None]:
top20 <- names(sort(taxa_sums(ps), decreasing=TRUE))[1:20]
ps.top20 <- transform_sample_counts(ps, function(OTU) OTU/sum(OTU))
ps.top20 <- prune_taxa(top20, ps.top20)
plot_bar(ps.top20, x="Day", fill="Family") + facet_wrap(~When, scales="free_x")

Nothing glaringly obvious jumps out from the taxonomic distribution of the top 20 sequences to explain the early-late differentiation.

This was just a bare bones demonstration of how the data from DADA2 can be easily imported into phyloseq and interrogated. For further examples on the many analyses possible with phyloseq, see [the phyloseq web site](https://joey711.github.io/phyloseq/)!

# Session Info
Always print `sessionInfo` for reproducibility!

In [None]:
sessionInfo()

-------------------

This tutorial is based on the [Official DADA2 v1.6 Tutorial](https://raw.githubusercontent.com/benjjneb/dada2/gh-pages/tutorial_1_6.Rmd)