# Working through the "Atacama Soil Microbiome" Tutorial in Qiime2
- The original tutorial is here in the official Qiime2 docs: https://docs.qiime2.org/2021.2/tutorials/atacama-soils/
- This workshop was also adapted from a previous workshop hosted by Joslynn Lee: https://cyverse-jupyter-qiime2.readthedocs-hosted.com/en/latest/

### Learning objectives
- Understand the analysis steps and core concepts of the Qiime2 software
- Generate plots to visualize biodiversity results from metabarcoding data

### How this notebook works
- The code is fill-in-the-blank. Follow along with us as we work through it together!
- Stuck or lost? We have many points of redemption throughout; just download the output files and visualize them here: https://view.qiime2.org/

### Steps covered
0. Setting up the Jupyter notebook and input files
1. Demultiplexing paired-end sequences
2. Denoising using DADA2
3. Diversity analyses
4. Bonus: Taxonomic assignment

# 0. Setting up the Jupyter notebook and input files

This first step is needed only because we are using a Jupyter notebook for this workshop. You will not need to run these few lines of code when working with your own datasets on your own computer.

In [2]:
pip3 install git+https://github.com/regebro/tzlocal
qiime dev refresh-cache


Let's activate Qiime2: 

In [None]:
conda activate qiime2

Now, we will make a directory (folder) for our tutorial, as well as a directory for our raw, paired-end sequences. Here, mkdir stands for "make a directory".

In [7]:
mkdir qiime2-atacama-tutorial
cd qiime2-atacama-tutorial


Then, we are going to download the sample metadata file from the Qiime2 website. 

In [8]:
wget \
  -O "sample-metadata.tsv" \
  "https://data.qiime2.org/2021.2/tutorials/atacama-soils/sample_metadata.tsv"


Next, we will download the raw sequences and put them in their own directory. You will download three fastq.gz files, corresponding to the forward, reverse, and barcode (i.e., index) reads. These files contain a subset of the reads in the full data set generated for this study, which allows the following commands to run relatively quickly. We will also perform additional subsampling in this tutorial to further improve the run time.

In [9]:
mkdir emp-paired-end-sequences


In [10]:
wget \
  -O "emp-paired-end-sequences/forward.fastq.gz" \
  "https://data.qiime2.org/2021.2/tutorials/atacama-soils/10p/forward.fastq.gz"


In [11]:
wget \
  -O "emp-paired-end-sequences/reverse.fastq.gz" \
  "https://data.qiime2.org/2021.2/tutorials/atacama-soils/10p/reverse.fastq.gz"


In [12]:
wget \
  -O "emp-paired-end-sequences/barcodes.fastq.gz" \
  "https://data.qiime2.org/2021.2/tutorials/atacama-soils/10p/barcodes.fastq.gz"


# 1. Demultiplexing reads

To analyze these data, the sequences that you just downloaded must first be imported into an artifact of type EMPPairedEndSequences.

In [13]:
qiime tools import \
   --type EMPPairedEndSequences \
   --input-path emp-paired-end-sequences \
   --output-path emp-paired-end-sequences.qza


We will now demultiplex the sequence reads. This requires the sample metadata file, specifically the column in that file that contains the per-sample barcodes. Here, that column is called barcode-sequence. Because the barcode reads are the reverse complement of the barcodes listed in the sample metadata file, we include the --p-rev-comp-mapping-barcodes parameter. After demultiplexing, we can generate a summary of how many sequences were obtained per sample.

In [None]:
qiime demux emp-paired \
  --m-barcodes-file sample-metadata.tsv \
  --m-barcodes-column barcode-sequence \
  --p-rev-comp-mapping-barcodes \
  --i-seqs emp-paired-end-sequences.qza \
  --o-per-sample-sequences demux-full.qza \
  --o-error-correction-details demux-details.qza
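
To make the reverse-complement idea concrete, here is a minimal Python sketch of what "reverse complement" means for a barcode. It is only an illustration, not how Qiime2 implements the --p-rev-comp-mapping-barcodes option, and the example barcode is hypothetical rather than taken from sample-metadata.tsv.

```python
# Translation table mapping each base to its complement (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    # Complement every base, then reverse the string.
    return seq.upper().translate(_COMPLEMENT)[::-1]


# Hypothetical barcode, made up for illustration.
metadata_barcode = "AACCGT"
print(reverse_complement(metadata_barcode))  # ACGGTT
```

If the barcode reads match the metadata barcodes only after this transformation, the barcodes in the metadata file need to be reverse complemented before matching, which is what the parameter above requests.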

Let's subsample the data. This will shorten the run time of the tutorial and demonstrate the subsampling functionality. In your own analyses, subsample only when you have a clear justification for doing so.

In [None]:
qiime demux subsample-paired \
  --i-sequences demux-full.qza \
  --p-fraction 0.3 \
  --o-subsampled-sequences demux-subsample.qza

qiime demux summarize \
  --i-data demux-subsample.qza \
  --o-visualization demux-subsample.qzv

We can view the summary of our subsampled dataset to see how many sequences each sample contains. There are 75 samples in the data, and roughly the last 20 rows of the table have fewer than 100 reads each. We can filter those samples out of the data:

In [None]:
qiime tools export \
  --input-path demux-subsample.qzv \
  --output-path ./demux-subsample/

qiime demux filter-samples \
  --i-demux demux-subsample.qza \
  --m-metadata-file ./demux-subsample/per-sample-fastq-counts.tsv \
  --p-where 'CAST([forward sequence count] AS INT) > 100' \
  --o-filtered-demux demux.qza
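
The filter above keeps samples whose forward read count is greater than 100. As a rough sketch of the same logic in plain Python, assuming the exported counts file is tab-separated with a "forward sequence count" column (as used in the --p-where clause) and a "sample ID" column (an assumption), with made-up sample IDs and counts:

```python
import csv
import io

# Hypothetical excerpt in the assumed layout of per-sample-fastq-counts.tsv.
# Sample IDs and counts are invented for illustration.
COUNTS_TSV = (
    "sample ID\tforward sequence count\treverse sequence count\n"
    "sampleA\t5120\t5120\n"
    "sampleB\t100\t100\n"
    "sampleC\t37\t37\n"
    "sampleD\t101\t101\n"
)


def samples_above(tsv_text, minimum):
    """Return IDs of samples with strictly more than `minimum` forward reads."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row["sample ID"]
        for row in reader
        if int(row["forward sequence count"]) > minimum
    ]


kept = samples_above(COUNTS_TSV, 100)
print(kept)  # ['sampleA', 'sampleD']
```

Note that the comparison is strict, so a sample with exactly 100 forward reads is dropped.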

# 2. Denoising
We will look at the sequence quality based on 10,000 randomly selected reads from the subsampled and filtered data, then denoise it based on the quality plots. There will be two plots showing quality scores - one for the forward reads and one for the reverse reads. These plots will help us determine what trimming parameters we want to denoise with, using DADA2. 

In this example, we have 150-base forward and reverse reads. We need to keep in mind that the paired reads need to be long enough to overlap so they (the forward and reverse sequences) can merge. We will trim off the first 13 bases of the forward and reverse reads, but will not trim the end of the sequences, to avoid reducing the read length too much. You do not have to keep the trimming or truncation lengths the same for your forward and reverse sequences, but we are doing that here:

In [14]:
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f 13 \
  --p-trim-left-r 13 \
  --p-trunc-len-f 150 \
  --p-trunc-len-r 150 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza
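
To see why the truncation lengths matter for merging, the arithmetic can be sketched in Python. This is a simplified model, assuming each read starts at one end of the amplicon and the amplicon is longer than either read; the amplicon length of 253 bases is a hypothetical value chosen for illustration, not a number from this tutorial.

```python
def merge_geometry(amplicon_len, trunc_len_f, trunc_len_r,
                   trim_left_f, trim_left_r):
    """Return (overlap, merged_length) for a pair of reads.

    Trimming from the left removes bases at the outer ends of the
    amplicon, so it shortens the merged sequence but leaves the overlap
    in the middle unchanged. Truncation shortens the inner ends and
    therefore reduces the overlap.
    """
    overlap = trunc_len_f + trunc_len_r - amplicon_len
    merged_len = amplicon_len - trim_left_f - trim_left_r
    return overlap, merged_len


ASSUMED_AMPLICON_LEN = 253  # hypothetical amplicon length

overlap, merged_len = merge_geometry(ASSUMED_AMPLICON_LEN, 150, 150, 13, 13)
print(overlap, merged_len)  # 47 227
```

Under these assumptions, truncating both reads to 120 bases would give a negative overlap (120 + 120 - 253 = -13), meaning the reads could no longer be merged.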


Now, we have artifacts containing the feature table, corresponding feature sequences, and DADA2 denoising statistics. We can generate summaries of these data:

In [15]:
qiime feature-table summarize \
  --i-table table.qza \
  --o-visualization table.qzv \
  --m-sample-metadata-file sample-metadata.tsv

qiime feature-table tabulate-seqs \
  --i-data rep-seqs.qza \
  --o-visualization rep-seqs.qzv

qiime metadata tabulate \
  --m-input-file denoising-stats.qza \
  --o-visualization denoising-stats.qzv


# 3. Alpha and beta diversity analysis
QIIME2 offers diversity analyses through the q2-diversity plugin, which supports computing alpha and beta diversity metrics, applying related statistical tests, and generating interactive visualizations. We will first use the core-metrics-phylogenetic method, which rarefies a FeatureTable[Frequency] to a user-specified depth, computes several alpha and beta diversity metrics, and generates principal coordinates analysis (PCoA) plots using Emperor for each of the beta diversity metrics. The metrics computed by default are:

Alpha diversity

   - Shannon’s diversity index (a quantitative measure of community richness)
   - Observed Features (a qualitative measure of community richness)
   - Faith’s Phylogenetic Diversity (a qualitative measure of community richness that incorporates phylogenetic relationships between the features)
   - Evenness (or Pielou’s Evenness; a measure of community evenness)
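
The non-phylogenetic alpha diversity metrics in the list above can be sketched in a few lines of Python. This is a simplified illustration on made-up counts, not the q2-diversity implementation; the choice of log base 2 for Shannon's index is an assumption here, and Faith's Phylogenetic Diversity is omitted because it needs a tree.

```python
import math


def observed_features(counts):
    """Number of features with a non-zero count."""
    return sum(1 for c in counts if c > 0)


def shannon(counts, base=2.0):
    """Shannon's index: -sum(p * log(p)) over features that are present."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p, base) for p in props)


def pielou_evenness(counts):
    """Shannon's index divided by its maximum possible value, log(S)."""
    s = observed_features(counts)
    if s < 2:
        return float("nan")  # evenness is undefined for fewer than 2 features
    return shannon(counts, base=math.e) / math.log(s)


# Hypothetical feature counts for two samples.
even_sample = [10, 10, 10, 10]
uneven_sample = [97, 1, 1, 1]

print(observed_features(even_sample),
      round(shannon(even_sample), 3),
      round(pielou_evenness(even_sample), 3))  # 4 2.0 1.0
print(pielou_evenness(uneven_sample) < pielou_evenness(even_sample))  # True
```

Both samples have the same number of observed features, but the sample dominated by one feature has a much lower evenness.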

Beta diversity

   - Jaccard distance (a qualitative measure of community dissimilarity)
   - Bray-Curtis distance (a quantitative measure of community dissimilarity)
   - unweighted UniFrac distance (a qualitative measure of community dissimilarity that incorporates phylogenetic relationships between the features)
   - weighted UniFrac distance (a quantitative measure of community dissimilarity that incorporates phylogenetic relationships between the features)
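
The difference between a qualitative and a quantitative beta diversity metric can be illustrated with a short Python sketch of the Jaccard and Bray-Curtis distances. The count vectors are hypothetical, and this is not the q2-diversity implementation; the UniFrac distances are left out because they require a phylogenetic tree.

```python
def jaccard_distance(a, b):
    """Qualitative: uses only presence/absence of each feature."""
    present_a = {i for i, c in enumerate(a) if c > 0}
    present_b = {i for i, c in enumerate(b) if c > 0}
    union = present_a | present_b
    if not union:
        return 0.0  # two empty samples are treated as identical
    return 1 - len(present_a & present_b) / len(union)


def bray_curtis_distance(a, b):
    """Quantitative: uses the abundance of each feature."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total


# Hypothetical counts of the same three features in two samples.
sample_1 = [1, 0, 3]
sample_2 = [2, 2, 0]

print(round(jaccard_distance(sample_1, sample_2), 3),
      bray_curtis_distance(sample_1, sample_2))  # 0.667 0.75
```

Here the two samples share one of three features (Jaccard distance 1 - 1/3), while Bray-Curtis also accounts for how different the counts are (6 / 8).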

An important parameter that needs to be provided to this method is --p-sampling-depth, which is the even sampling (i.e. rarefaction) depth. Because most diversity metrics are sensitive to different sampling depths across different samples, this method will randomly subsample the counts from each sample to the value provided for this parameter. For example, if you provide --p-sampling-depth 500, this step will subsample the counts in each sample without replacement so that each sample in the resulting table has a total count of 500. If the total count for any sample is smaller than this value, that sample will be dropped from the diversity analysis. Choosing this value is tricky. We recommend making your choice by reviewing the information presented in the table.qzv file that was created above. Choose a value that is as high as possible (so you retain more sequences per sample) while excluding as few samples as possible.
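
The rarefaction step described above can be sketched in Python as follows. This is a minimal illustration of subsampling without replacement on a made-up table, not the code Qiime2 runs; the function name, the fixed random seed, and the sample and feature names are all hypothetical.

```python
import random
from collections import Counter


def rarefy(table, depth, seed=0):
    """Subsample each sample to `depth` counts without replacement.

    `table` maps sample ID -> {feature ID: count}. Samples whose total
    count is below `depth` are dropped.
    """
    rng = random.Random(seed)
    rarefied = {}
    for sample, counts in table.items():
        if sum(counts.values()) < depth:
            continue  # too few sequences: sample is excluded
        # Expand counts into one entry per sequence, then draw `depth` of them.
        pool = [feature for feature, n in counts.items() for _ in range(n)]
        rarefied[sample] = dict(Counter(rng.sample(pool, depth)))
    return rarefied


# Hypothetical feature table.
table = {
    "s1": {"f1": 400, "f2": 300, "f3": 50},  # total 750
    "s2": {"f1": 20, "f2": 10},              # total 30
}

result = rarefy(table, depth=500)
print({sample: sum(counts.values()) for sample, counts in result.items()})
# {'s1': 500}
```

Sample s2 has only 30 counts, so it is dropped at a depth of 500, while s1 is reduced from 750 to exactly 500 counts.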

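Look at the table.qzv visualization we just generated, in particular the 'Interactive Sample Detail' tab. What value would you choose to pass for --p-sampling-depth? How many samples will be excluded from your analysis based on this choice? How many total sequences will you be analyzing in the core-metrics-phylogenetic command? Note that this command also needs a rooted phylogenetic tree (rooted-tree.qza here), which must be built from rep-seqs.qza beforehand; the official tutorial does this with qiime phylogeny align-to-tree-mafft-fasttree.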
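
One way to reason about these questions is to tabulate the trade-off for a few candidate depths. The Python sketch below does this for hypothetical per-sample totals; the real values should be read from the table.qzv visualization, and the function name is made up for illustration.

```python
def depth_tradeoff(sample_totals, depth):
    """Return (samples retained, total sequences analyzed) at a given depth."""
    retained = sum(1 for total in sample_totals if total >= depth)
    # After rarefaction every retained sample has exactly `depth` sequences.
    return retained, retained * depth


# Hypothetical per-sample sequence totals.
totals = [2500, 1800, 1200, 1103, 900, 40]

for depth in (900, 1103, 1800):
    retained, analyzed = depth_tradeoff(totals, depth)
    print(depth, retained, len(totals) - retained, analyzed)
# 900 5 1 4500
# 1103 4 2 4412
# 1800 2 4 3600
```

In this toy example, raising the depth keeps more sequences per sample but excludes more samples, so the total number of sequences analyzed does not necessarily increase.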

In [None]:
qiime diversity core-metrics-phylogenetic \
  --i-phylogeny rooted-tree.qza \
  --i-table table.qza \
  --p-sampling-depth 1103 \
  --m-metadata-file sample-metadata.tsv \
  --output-dir core-metrics-results