# 🦠 Amplicon Sequencing Data Analysis with QIIME 2

## Environment Setup

In [1]:
# Outbound network (proxy) settings
import os
os.environ['http_proxy'] = "socks5://icpnp148:12345" 
os.environ['https_proxy'] = "socks5://icpnp148:12345"

In [2]:
# Executable search path settings
import os
from pathlib import Path
HOME = str(Path.home())
# Extra directories to append to PATH
add_binary_path = HOME + '/.local/bin:/usr/local/bin'
os.environ['PATH'] = os.environ['PATH'] + ':' + add_binary_path

In [3]:
# QIIME2 environment settings
os.environ['MPLCONFIGDIR'] = "/tmp/mplconfigdir"
os.environ['NUMBA_CACHE_DIR'] = "/tmp/numbacache"
os.environ['XDG_CONFIG_HOME'] = "/home/c00cjz00"
os.environ['CONDA_PREFIX'] = "/home/c00cjz00/.conda/qiime2-amplicon-2024.5"

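# Let's Get Started!

Now for the fun part. Let's first take a look at our data. In the _data_ folder you'll find eight FASTQ files, a manifest, and a metadata file. We'll start with the manifest. This file lists every sample name together with the path to its file, and we'll need it later when we use QIIME2 📝.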

In [4]:
import pandas as pd
manifest = pd.read_csv('data/manifest.tsv', sep = '\t')
manifest

| Unnamed: 0 | sample-id | absolute-filepath |
| --- | --- | --- |
| 0 | ERR1883195 | $PWD/data/ERR1883195.fastq.gz |
| 1 | ERR1883207 | $PWD/data/ERR1883207.fastq.gz |
| 2 | ERR1883212 | $PWD/data/ERR1883212.fastq.gz |
| 3 | ERR1883214 | $PWD/data/ERR1883214.fastq.gz |
| 4 | ERR1883225 | $PWD/data/ERR1883225.fastq.gz |
| 5 | ERR1883240 | $PWD/data/ERR1883240.fastq.gz |
| 6 | ERR1883250 | $PWD/data/ERR1883250.fastq.gz |
| 7 | ERR1883294 | $PWD/data/ERR1883294.fastq.gz |


In [5]:
metadata = pd.read_csv('data/metadata.tsv', sep = '\t')
metadata

| Unnamed: 0 | sample-id | collection_timestamp | day_relative_to_fmt | description | disease_state | host_age | host_age_units | host_body_mass_index | host_height | host_height_units | host_subject_id | host_weight | host_weight_units | race | sex |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ERR1883195 | 2011-10-24 | 26 | Donor 11 | healthy | Restricted access | years | Restricted access | Restricted access | m | Donor | Restricted access | kg | Restricted access | Restricted access |
| 1 | ERR1883207 | 2012-01-12 | 44 | Donor 12 | healthy | Restricted access | years | Restricted access | Restricted access | m | Donor | Restricted access | kg | Restricted access | Restricted access |
| 2 | ERR1883212 | 2012-10-10 | 135 | Donor 14 | healthy | Restricted access | years | Restricted access | Restricted access | m | Donor | Restricted access | kg | Restricted access | Restricted access |
| 3 | ERR1883214 | 2011-07-26 | 0 | Day 0 CD1 | Pre-FMT | 39 | years | 29.3 | 165.1 | m | CD1 | 80.1 | kg | white | female |
| 4 | ERR1883225 | 2011-07-26 | 54 | Donor CD1 | healthy | Restricted access | years | Restricted access | Restricted access | m | Donor | Restricted access | kg | Restricted access | Restricted access |
| 5 | ERR1883240 | 2012-02-14 | pre-FMT | CD9 pre-FMT | Pre-FMT | 47 | years | 35.5 | 1.55 | m | CD9 | 85.1 | kg | white | female |
| 6 | ERR1883250 | 2011-12-23 | pre-FMT | CD13 pre-FMT | Pre-FMT | 53 | years | 34.4 | 1.56 | m | CD13 | 83.9 | kg | white | female |
| 7 | ERR1883294 | 2011-09-29 | 0 | Day 0 CD3 | Pre-FMT | 61 | years | 32.5 | 1.727 | m | CD3 | 97.3 | kg | white | male |


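Looks good: all eight FASTQ files are accounted for, four from healthy samples and four from recurrent CDI samples. We can now use the manifest to import our files into QIIME2.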
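
The same sanity check can be scripted. The sketch below uses only the standard library and assumes both files are tab-separated with a `sample-id` column, and that the manifest has an `absolute-filepath` column as shown above. The helper name `check_manifest` is hypothetical and not part of QIIME2.

```python
import csv
import os

def check_manifest(manifest_path, metadata_path):
    """Return (samples_with_missing_files, samples_without_metadata)."""
    with open(manifest_path, newline="") as fh:
        manifest = list(csv.DictReader(fh, delimiter="\t"))
    with open(metadata_path, newline="") as fh:
        metadata_ids = {row["sample-id"] for row in csv.DictReader(fh, delimiter="\t")}

    # Expand $PWD and other environment variables before checking the path.
    missing_files = [
        row["sample-id"]
        for row in manifest
        if not os.path.exists(os.path.expandvars(row["absolute-filepath"]))
    ]
    # Samples listed in the manifest that have no row in the metadata.
    without_metadata = [
        row["sample-id"] for row in manifest if row["sample-id"] not in metadata_ids
    ]
    return missing_files, without_metadata
```

Calling `check_manifest('data/manifest.tsv', 'data/metadata.tsv')` should return two empty lists when every file exists and every sample has metadata. Note that `$PWD` is only expanded if that environment variable is set in the notebook process.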

## The QIIME2 Workflow

Let's review what the QIIME2 workflow will do:
![our workflow](https://github.com/Gibbons-Lab/isb_course_2023/raw/main/docs/16S/assets/steps.png)

To use sequencing data in QIIME2, we first need to turn the FASTQ files containing our data into QIIME artifacts. Using the manifest we just checked out, let's run our first command:

As a reminder, adding ```!``` before a command tells the notebook to run it as a shell command rather than as Python.

In [None]:
# Convert the FASTQ files into a .qza artifact
!mkdir -p output
!qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path data/manifest.tsv \
  --output-path output/sequences.qza \
  --input-format SingleEndFastqManifestPhred33V2

## Inspecting the .qza Artifact

In [None]:
# Inspect the contents of the .qza artifact
!qiime tools peek output/sequences.qza
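
A `.qza` artifact is a zip archive whose top-level directory is named after the artifact's UUID and contains a `metadata.yaml` file. As a rough illustration of where the information reported by `qiime tools peek` comes from, the sketch below reads that file with the standard library. The helper name `peek_artifact` is hypothetical, and the line-by-line parsing assumes the file holds simple `key: value` pairs.

```python
import zipfile

def peek_artifact(path):
    """Return the key/value pairs from an artifact's top-level metadata.yaml."""
    with zipfile.ZipFile(path) as archive:
        # The top-level metadata file lives at <uuid>/metadata.yaml;
        # deeper copies (e.g. under provenance/) are ignored.
        name = next(
            n for n in archive.namelist()
            if n.count("/") == 1 and n.endswith("/metadata.yaml")
        )
        text = archive.read(name).decode("utf-8")
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info
```

For example, `peek_artifact('output/sequences.qza')` would be expected to return a dictionary with `uuid`, `type`, and `format` entries.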

## Visualizing our Data 🔎

Before we move on, let's use QIIME2 to visualize our sequencing data.

In [None]:
!cp output/sequences.qza output/demux.qza
!qiime demux summarize \
--i-data output/demux.qza \
--o-visualization output/demux.qzv

.qzv files like the one we just produced are visualizations. You can view the plot by downloading the file and opening it at http://view.qiime2.org. To download the file, click on the folder symbol to the left, open the `output` folder, and choose download from the dot menu next to the `output/demux.qzv` file.

---

## Quality Filtering

Before we can use our sequencing data, we need to "denoise" it. To do this, we'll use a plugin called DADA2. This involves four things.

1. filter and trim the reads
2. find the most likely set of unique sequences in the sample (ASVs)
3. remove chimeras
4. count the abundances of each ASV


This command will take a little time - let's run it, and head back to the presentation to discuss what's happening.

In [None]:
!qiime dada2 denoise-single \
  --i-demultiplexed-seqs output/demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 150 \
  --p-n-threads 2 \
  --o-representative-sequences output/rep-seqs.qza \
  --o-table output/table.qza \
  --o-denoising-stats output/stats.qza
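
To make the effect of `--p-trim-left` and `--p-trunc-len` concrete, here is a simplified sketch of the length handling only. It assumes that reads shorter than the truncation length are discarded, that a truncation length of 0 disables truncation, and that kept reads end up with length `trunc_len - trim_left`. It does not model DADA2's quality-based filtering or denoising, and the function name `trim_and_truncate` is hypothetical.

```python
def trim_and_truncate(read, trim_left=0, trunc_len=150):
    """Return the trimmed read, or None if it is too short to keep."""
    if trunc_len and len(read) < trunc_len:
        return None  # shorter than the truncation length: discarded
    end = trunc_len if trunc_len else len(read)
    # Cut at trunc_len, then drop trim_left bases from the 5' end.
    return read[trim_left:end]

reads = ["A" * 200, "C" * 150, "G" * 120]
kept = [r for r in (trim_and_truncate(read) for read in reads) if r is not None]
print(f"{len(kept)} of {len(reads)} reads kept, lengths: {[len(r) for r in kept]}")
```

With the settings used above (`--p-trim-left 0`, `--p-trunc-len 150`), every retained read is exactly 150 bases long, which is why choosing the truncation length from the quality plot matters.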

Let's check to see how that went. One good way to tell whether the identified ASVs are representative of the sample is to see how many reads were retained throughout the pipeline. The most common issues and their solutions are:

**Large fraction of reads is lost during merging (only paired-end)**

![read overlap](https://gibbons-lab.github.io/isb_course_2023/16S/assets/read_overlap.png)

To merge the forward and reverse reads, DADA2 requires an overlap of at least 12 bases by default. Thus, your reads must allow for sufficient overlap *after* trimming. For example, if your amplified region is 450 bp long, you have 2x250 bp reads, and you trim the last 30 bases of each read (truncating them to 220 bp), the total length of covered sequence is 2x220 = 440 bp, which is shorter than 450 bp, so there will be no overlap. To solve this issue, trim less of the reads or lower the `--p-min-overlap` parameter (but not too low).
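
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the length bookkeeping: the function name `merge_overlap` is hypothetical, and the 12-base minimum is the default quoted in the text.

```python
def merge_overlap(amplicon_len, trunc_len_f, trunc_len_r):
    """Bases shared by forward and reverse reads after truncation.

    A negative value means the reads do not even meet in the middle.
    """
    return trunc_len_f + trunc_len_r - amplicon_len

MIN_OVERLAP = 12  # default minimum overlap quoted in the text

for trunc in (220, 231, 250):
    overlap = merge_overlap(450, trunc, trunc)
    status = "can merge" if overlap >= MIN_OVERLAP else "cannot merge"
    print(f"truncate to {trunc} bp: overlap {overlap} bp, {status}")
```

For the 450 bp example, truncating both reads to 220 bp leaves a 10-base gap, while 231 bp per read is the shortest symmetric truncation that still reaches a 12-base overlap.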

<br>

**Most of the reads are lost as chimeric**

![chimeras](https://gibbons-lab.github.io/isb_course_2023/16S/assets/chimera.png)

This is usually an experimental issue, as chimeras are introduced during amplification. If you can adjust your PCR, try to run fewer cycles. Chimeras can also be introduced by incorrect merging: if your minimum overlap is too small, reads may be merged randomly. Possible fixes are to increase the `--p-min-overlap` parameter or to run the analysis on the forward reads only (in our empirical observations, chimeras are more likely to be introduced in the joined reads). *However, losing between 5-25% of your reads to chimeras is normal and does not require any adjustments.*


Our denoising stats are contained in an artifact. To convert it to a visualization we can use `qiime metadata tabulate`.

In [None]:
!qiime feature-table tabulate-seqs \
  --i-data output/rep-seqs.qza \
  --o-visualization output/rep-seqs.qzv

!qiime feature-table summarize \
  --i-table output/table.qza \
  --m-sample-metadata-file data/metadata.tsv \
  --o-visualization output/table.qzv

!qiime metadata tabulate \
    --m-input-file output/stats.qza \
    --o-visualization output/stats.qzv

Like before, we can download the .qzv file and visualize the results using the [QIIME2 Viewer](https://view.qiime2.org/).

It's important to understand what this output tells us. For instance, what percent of reads in our data pass the filtering step? What percent of reads were non-chimeric? Differences in these metrics between samples can affect diversity metrics.
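
To answer those questions for a single sample, the percentages can be derived from the read counts in the denoising stats. The sketch below uses made-up counts, not values from this dataset, and the function name `retention` is hypothetical.

```python
def retention(n_input, n_filtered, n_non_chimeric):
    """Percent of input reads that pass filtering and that are non-chimeric."""
    if n_input == 0:
        return 0.0, 0.0
    return 100.0 * n_filtered / n_input, 100.0 * n_non_chimeric / n_input

# Hypothetical counts for one sample, for illustration only.
passed, non_chimeric = retention(n_input=20000, n_filtered=18000, n_non_chimeric=15000)
print(f"passed filter: {passed:.1f}%, non-chimeric: {non_chimeric:.1f}%")
```

Comparing these two percentages across samples is a quick way to spot a sample that lost far more reads than the others.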

---

## Diversity and Phylogenetics
An important metric to consider when studying microbial ecology is __diversity__. Diversity comes in two flavors: ⍺ (alpha) and β (beta).

Alpha diversity is pretty simple - how diverse is a single sample? You might consider measures like richness and evenness.

![alpha diversity](https://gibbons-lab.github.io/isb_course_2023/16S/assets/alpha_diversity.png)
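
As a minimal sketch of what richness and evenness mean numerically, the functions below compute observed richness, the Shannon index, and Pielou's evenness from a list of per-taxon counts. The function names are hypothetical, and the use of log base 2 for Shannon is an assumption about the convention, not something stated in this tutorial.

```python
import math

def richness(counts):
    """Number of taxa observed at least once."""
    return sum(1 for c in counts if c > 0)

def shannon(counts, base=2):
    """Shannon index: higher when taxa are more numerous and more even."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p, base) for p in props)

def pielou_evenness(counts):
    """Shannon index divided by its maximum for the observed richness."""
    s = richness(counts)
    if s < 2:
        return 0.0  # evenness is not meaningful for a single taxon
    return shannon(counts, base=math.e) / math.log(s)

even_sample = [25, 25, 25, 25]
skewed_sample = [97, 1, 1, 1]
for name, sample in (("even", even_sample), ("skewed", skewed_sample)):
    print(name, richness(sample), round(shannon(sample), 3), round(pielou_evenness(sample), 3))
```

Both toy samples have the same richness, but the skewed one has a much lower Shannon index and evenness because one taxon dominates.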

Beta diversity instead looks at how different two samples are from each other - what taxa are shared, and how their abundances differ.

![beta diversity](https://gibbons-lab.github.io/isb_course_2023/16S/assets/beta_diversity.png)
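
Two simple, non-phylogenetic distances illustrate both aspects: Jaccard looks only at which taxa are shared, while Bray-Curtis also accounts for abundances. The sketch below is for intuition only; the function names are hypothetical, and the UniFrac distances used later in this notebook additionally take the phylogenetic tree into account.

```python
def jaccard_distance(a, b):
    """1 minus the fraction of observed taxa that are shared by both samples."""
    present_a = {i for i, c in enumerate(a) if c > 0}
    present_b = {i for i, c in enumerate(b) if c > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1 - len(present_a & present_b) / len(union)

def bray_curtis(a, b):
    """Sum of absolute count differences divided by the total count."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total

# Two toy samples over the same three taxa (hypothetical counts).
sample_1 = [10, 5, 0]
sample_2 = [2, 5, 8]
print(jaccard_distance(sample_1, sample_2), bray_curtis(sample_1, sample_2))
```

Both distances are 0 for identical samples and 1 for samples with no taxa in common.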


### Starting our Tree

Let's start by building a phylogenetic tree for our sequences using the following command. This time, we call the _phylogeny_ plugin in QIIME2.

In [None]:
!qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences output/rep-seqs.qza \
  --o-alignment output/aligned-rep-seqs.qza \
  --o-masked-alignment output/masked-aligned-rep-seqs.qza \
  --o-tree output/unrooted-tree.qza \
  --o-rooted-tree output/rooted-tree.qza


## Calculating Diversity
Using the Diversity plugin, we can use our table and tree to calculate several diversity metrics. To account for variations in sampling depth, we'll provide QIIME2 with a cutoff at which to rarefy all our samples. Since this randomly selects sequences, your results might look a little different. We'll also pass in our metadata file, so we can keep track of which samples come from each group.

In [None]:
!qiime diversity core-metrics-phylogenetic \
    --i-table output/table.qza \
    --i-phylogeny output/rooted-tree.qza \
    --p-sampling-depth 8000 \
    --m-metadata-file data/metadata.tsv \
    --output-dir diversity
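
The rarefaction behind `--p-sampling-depth` can be sketched as random subsampling without replacement. This is a simplified illustration, not QIIME2's implementation: the function name `rarefy` and the counts are hypothetical, and it assumes that samples with fewer reads than the chosen depth are dropped.

```python
import random

def rarefy(counts, depth, seed=None):
    """Subsample a {feature: count} dict to exactly `depth` reads.

    Returns None when the sample has fewer than `depth` reads.
    """
    total = sum(counts.values())
    if total < depth:
        return None  # too shallow for this depth: the sample is dropped
    # One list entry per read, then draw `depth` reads without replacement.
    pool = [feature for feature, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    rarefied = {}
    for feature in rng.sample(pool, depth):
        rarefied[feature] = rarefied.get(feature, 0) + 1
    return rarefied

# Hypothetical ASV counts for a single sample.
sample = {"asv_1": 6000, "asv_2": 3000, "asv_3": 1000}
print(rarefy(sample, depth=8000, seed=1))
```

Because the draw is random, rare features can receive different counts (or disappear) from run to run unless a seed is fixed, which is why results may differ slightly between runs.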

## Alpha Diversity

We get a bunch of outputs from the previous command: measures of both alpha and beta diversity. To start, let's use the Shannon vector in the output directory to create a visualization of alpha diversity across samples. Generally, healthy, long-living individuals have balanced, diverse microbiomes. However, diversity isn't necessarily a direct indicator of health or disease. Let's see how it looks in our samples.

In [None]:
!qiime diversity alpha-group-significance \
    --i-alpha-diversity diversity/shannon_vector.qza \
    --m-metadata-file data/metadata.tsv \
    --o-visualization diversity/alpha_groups-shannon_vector.qzv

In [None]:
!qiime diversity alpha-group-significance \
    --i-alpha-diversity diversity/faith_pd_vector.qza \
    --m-metadata-file data/metadata.tsv \
    --o-visualization diversity/alpha_groups-faith_pd_vector.qzv

In [None]:
!qiime diversity alpha-group-significance \
    --i-alpha-diversity diversity/evenness_vector.qza \
    --m-metadata-file data/metadata.tsv \
    --o-visualization diversity/alpha_groups-evenness_vector.qzv

Like before, we can download the visualization and open it with the QIIME2 viewer.

## Beta Diversity

Let's visualize the beta diversity and see how the samples separate. For this we'll look at weighted UniFrac. This time, we'll have to download the file ⬅️

<br>

We can check for 'significant' separation between samples using PERMANOVA. We can do this with the diversity plugin in QIIME2.

In [None]:
!qiime diversity adonis \
    --i-distance-matrix diversity/weighted_unifrac_distance_matrix.qza \
    --m-metadata-file data/metadata.tsv \
    --p-formula "disease_state" \
    --p-n-jobs 2 \
    --o-visualization diversity/permanova.qzv

In [None]:
!qiime diversity beta-group-significance \
    --i-distance-matrix diversity/weighted_unifrac_distance_matrix.qza \
    --m-metadata-file data/metadata.tsv \
    --m-metadata-column disease_state \
    --o-visualization diversity/beta_groups-weighted_unifrac_distance_matrix.qzv \
    --p-pairwise
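
For intuition about what a PERMANOVA test does, here is a minimal standard-library sketch: it computes a pseudo-F statistic from a distance matrix and group labels, then compares it against the values obtained after shuffling the labels. This is an illustration under stated assumptions, not QIIME2's implementation: the function names, the toy distance matrix, and the choice of 999 permutations are all illustrative.

```python
import random
from itertools import combinations

def pseudo_f(dist, groups):
    """PERMANOVA pseudo-F for a square distance matrix and one label per sample."""
    n = len(groups)
    labels = sorted(set(groups))
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    ss_within = 0.0
    for label in labels:
        members = [i for i, g in enumerate(groups) if g == label]
        ss_within += sum(dist[i][j] ** 2 for i, j in combinations(members, 2)) / len(members)
    ss_between = ss_total - ss_within
    return (ss_between / (len(labels) - 1)) / (ss_within / (n - len(labels)))

def permanova(dist, groups, permutations=999, seed=0):
    """Return (observed pseudo-F, permutation p-value)."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, groups)
    threshold = observed - 1e-9 * abs(observed)  # tolerate rounding noise
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= threshold:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)

# Toy data: eight samples placed on a line, four per group (hypothetical).
positions = [0, 1, 2, 3, 10, 11, 12, 13]
dist = [[abs(a - b) for b in positions] for a in positions]
groups = ["healthy"] * 4 + ["Pre-FMT"] * 4
f_stat, p_value = permanova(dist, groups)
print(f"pseudo-F = {f_stat:.1f}, p = {p_value:.3f}")
```

With only four samples per group there are few distinct ways to split the labels, so the smallest achievable p-value is limited no matter how well the groups separate. Keep that in mind when interpreting the result for this dataset.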


In [None]:
# Paths point at the outputs of the core-metrics step above. The custom-axes
# option is omitted because day_relative_to_fmt contains non-numeric values
# such as "pre-FMT".
!qiime emperor plot \
  --i-pcoa diversity/unweighted_unifrac_pcoa_results.qza \
  --m-metadata-file data/metadata.tsv \
  --o-visualization diversity/unweighted-unifrac-emperor.qzv

---

## Taxonomic Classification

We can learn a lot from diversity metrics, both alpha and beta. But to really dig into the data, we need to know what microbes are in each sample 🦠. To do this, we'll classify the representative sequences in QIIME2 using a naive Bayes classifier. Several pre-trained classifiers are available at https://docs.qiime2.org/2023.7/data-resources

In [None]:
!curl -sL \
  "https://data.qiime2.org/classifiers/sklearn-1.4.2/greengenes/gg-13-8-99-515-806-nb-classifier.qza" > \
  "output/gg-13-8-99-515-806-nb-classifier.qza"

In [None]:
!qiime feature-classifier classify-sklearn \
    --i-reads output/rep-seqs.qza \
    --i-classifier output/gg-13-8-99-515-806-nb-classifier.qza \
    --p-n-jobs 2 \
    --o-classification output/taxonomy.qza

In [None]:
!qiime metadata tabulate \
  --m-input-file output/taxonomy.qza \
  --o-visualization output/taxonomy.qzv

Now that we've classified the sequences, we can visualize the taxonomic breakdown of our samples.

In [None]:
!qiime taxa barplot \
    --i-table output/table.qza \
    --i-taxonomy output/taxonomy.qza \
    --m-metadata-file data/metadata.tsv \
    --o-visualization output/taxa_barplot.qzv

Now, we can use ```table.qza```, which contains the ASV counts for each sample, and ```taxonomy.qza```, which contains the taxonomic classification of each ASV, and collapse the data onto the genus level.

In [None]:
!qiime taxa collapse \
    --i-table output/table.qza \
    --i-taxonomy output/taxonomy.qza \
    --p-level 6 \
    --o-collapsed-table output/genus.qza
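
Conceptually, collapsing to level 6 sums the counts of all ASVs that share the same first six taxonomic ranks. The sketch below illustrates the idea on toy data. The function name `collapse`, the feature IDs, and the counts are hypothetical, and padding short annotations with `__` is an assumption about how unassigned ranks are represented.

```python
from collections import defaultdict

def collapse(table, taxonomy, level=6):
    """Sum counts of features that share the same first `level` ranks."""
    collapsed = defaultdict(lambda: defaultdict(int))
    for feature, sample_counts in table.items():
        ranks = [r.strip() for r in taxonomy[feature].split(";")]
        # Pad short annotations so every key has exactly `level` ranks.
        ranks = (ranks + ["__"] * level)[:level]
        key = ";".join(ranks)
        for sample, count in sample_counts.items():
            collapsed[key][sample] += count
    return {key: dict(counts) for key, counts in collapsed.items()}

# Toy feature table and Greengenes-style taxonomy strings (hypothetical).
table = {
    "asv_1": {"S1": 5, "S2": 0},
    "asv_2": {"S1": 3, "S2": 2},
    "asv_3": {"S1": 1, "S2": 1},
}
taxonomy = {
    "asv_1": "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae; g__Blautia; s__",
    "asv_2": "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae; g__Blautia; s__producta",
    "asv_3": "k__Bacteria; p__Bacteroidetes",
}
for genus, counts in collapse(table, taxonomy).items():
    print(genus, counts)
```

Here `asv_1` and `asv_2` differ only at the species rank, so they end up in the same genus-level row.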

We'll export this as a .tsv, which will be more usable for the next portion of the course.

In [None]:
!qiime tools export \
    --input-path output/genus.qza \
    --output-path exported

In [None]:
!biom convert -i exported/feature-table.biom -o exported/genus.tsv --to-tsv

In [None]:
abundances = pd.read_table("exported/genus.tsv", skiprows=1, index_col=0)
abundances