# 🦠 Amplicon Sequencing Data Analysis with QIIME 2

## Environment Setup

In [1]:
# Outbound network proxy settings
import os
tmp=!echo $(hostname)
HOSTNAME=tmp[0]
os.environ['http_proxy'] = "socks5://"+HOSTNAME+":12345"
os.environ['https_proxy'] = "socks5://"+HOSTNAME+":12345"

In [2]:
# Executable path settings
import os
from pathlib import Path
HOME = str(Path.home())
Add_Binarry_Path=HOME+'/.local/bin:/usr/local/bin'
os.environ['PATH']=os.environ['PATH']+':'+Add_Binarry_Path

# If you use Conda, also add the following lines
#Add_Binarry_Path=HOME+'/.conda/envs/qiime2-amplicon-2024.10/bin'
#os.environ['PATH']=os.environ['PATH']+':'+Add_Binarry_Path

In [3]:
# QIIME 2 initial environment setup
os.environ['MPLCONFIGDIR'] = "/tmp/mplconfigdir"
os.environ['NUMBA_CACHE_DIR'] = "/tmp/numbacache"
os.environ['XDG_CONFIG_HOME']=os.environ['HOME']
os.environ['CONDA_PREFIX']=os.environ['CONDA_PREFIX'].replace("/home/qiime2", os.environ['HOME'])
newpath=os.environ['CONDA_PREFIX']
if not os.path.exists(newpath):
    os.makedirs(newpath)

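# Let's get started!

Now for the fun part. Let's first take a look at our data. In the _data_ folder you will find eight FASTQ files, a manifest, and a metadata file. We start with the manifest: a file listing every sample name and file path, which we will need later when working with QIIME 2 📝.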

In [4]:
import pandas as pd
manifest = pd.read_csv('data/manifest.tsv', sep = '\t')
manifest

Unnamed: 0,sample-id,absolute-filepath
0,ERR1883195,$PWD/data/ERR1883195.fastq.gz
1,ERR1883207,$PWD/data/ERR1883207.fastq.gz
2,ERR1883212,$PWD/data/ERR1883212.fastq.gz
3,ERR1883214,$PWD/data/ERR1883214.fastq.gz
4,ERR1883225,$PWD/data/ERR1883225.fastq.gz
5,ERR1883240,$PWD/data/ERR1883240.fastq.gz
6,ERR1883250,$PWD/data/ERR1883250.fastq.gz
7,ERR1883294,$PWD/data/ERR1883294.fastq.gz


In [5]:
metadata = pd.read_csv('data/metadata.tsv', sep = '\t')
metadata

Unnamed: 0,sample-id,collection_timestamp,day_relative_to_fmt,description,disease_state,host_age,host_age_units,host_body_mass_index,host_height,host_height_units,host_subject_id,host_weight,host_weight_units,race,sex
0,ERR1883195,2011-10-24,26,Donor 11,healthy,Restricted access,years,Restricted access,Restricted access,m,Donor,Restricted access,kg,Restricted access,Restricted access
1,ERR1883207,2012-01-12,44,Donor 12,healthy,Restricted access,years,Restricted access,Restricted access,m,Donor,Restricted access,kg,Restricted access,Restricted access
2,ERR1883212,2012-10-10,135,Donor 14,healthy,Restricted access,years,Restricted access,Restricted access,m,Donor,Restricted access,kg,Restricted access,Restricted access
3,ERR1883214,2011-07-26,0,Day 0 CD1,Pre-FMT,39,years,29.3,165.1,m,CD1,80.1,kg,white,female
4,ERR1883225,2011-07-26,54,Donor CD1,healthy,Restricted access,years,Restricted access,Restricted access,m,Donor,Restricted access,kg,Restricted access,Restricted access
5,ERR1883240,2012-02-14,pre-FMT,CD9 pre-FMT,Pre-FMT,47,years,35.5,1.55,m,CD9,85.1,kg,white,female
6,ERR1883250,2011-12-23,pre-FMT,CD13 pre-FMT,Pre-FMT,53,years,34.4,1.56,m,CD13,83.9,kg,white,female
7,ERR1883294,2011-09-29,0,Day 0 CD3,Pre-FMT,61,years,32.5,1.727,m,CD3,97.3,kg,white,male


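Looks good: all eight FASTQ files are accounted for, four from healthy samples and four from recurrent CDI samples. We can now use the manifest to import our files into QIIME 2.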
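
The group sizes can be confirmed by counting samples per `disease_state`. The sketch below copies only the `sample-id` and `disease_state` columns from the metadata output above into an inline table (named `metadata_subset` here purely for illustration) so that it runs on its own; in the notebook the same call would presumably work on `metadata` directly.

```python
import pandas as pd

# Two columns copied from the metadata table shown above.
metadata_subset = pd.DataFrame({
    "sample-id": ["ERR1883195", "ERR1883207", "ERR1883212", "ERR1883214",
                  "ERR1883225", "ERR1883240", "ERR1883250", "ERR1883294"],
    "disease_state": ["healthy", "healthy", "healthy", "Pre-FMT",
                      "healthy", "Pre-FMT", "Pre-FMT", "Pre-FMT"],
})

# Count how many samples fall into each group.
group_sizes = metadata_subset["disease_state"].value_counts()
print(group_sizes)
```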

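## QIIME 2 Workflow

Let's review what the QIIME 2 workflow will do:
![our workflow](https://github.com/Gibbons-Lab/isb_course_2023/raw/main/docs/16S/assets/steps.png)

To use sequencing data in QIIME 2, we first need to convert the FASTQ files holding our data into a QIIME artifact. Using the manifest we just inspected, let's run our first command:

-- As a reminder, prefixing a command with ```!``` marks it as a bash command rather than Python.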

In [6]:
# Convert the FASTQ files into a .qza artifact
!mkdir -p output
!qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path data/manifest.tsv \
  --output-path output/sequences.qza \
  --input-format SingleEndFastqManifestPhred33V2

Imported data/manifest.tsv as SingleEndFastqManifestPhred33V2 to output/sequences.qza

## Inspecting the .qza File

In [7]:
# Inspect the contents of the .qza file
!qiime tools peek output/sequences.qza

UUID:        d6b8db49-5596-4976-a33b-03b48d6702ad
Type:        SampleData[SequencesWithQuality]
Data format: SingleLanePerSampleSingleEndFastqDirFmt

## Visualizing Our Data 🔎

Before we go any further, let's use QIIME 2 to visualize our sequencing data.

In [8]:
!cp output/sequences.qza output/demux.qza
!qiime demux summarize \
--i-data output/demux.qza \
--o-visualization output/demux.qzv

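Saved Visualization to: output/demux.qzv

.qzv files, like the one we just generated, are visualization files. You can download the file and open it at http://view.qiime2.org to see the plots. To download it, click the folder icon on the left, open the `output` folder, and choose Download from the dot menu next to `output/demux.qzv`.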

---

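## Quality Filtering

Before we can use our sequencing data, we need to "denoise" it. To do so, we will use a plugin called DADA2. The process involves four steps:

1. Filter and trim the reads
2. Find the most likely set of unique sequences in the samples (ASVs)
3. Remove chimeras
4. Count the abundance of each ASV

We discuss what is happening at each step in the slides.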

In [9]:
!qiime dada2 denoise-single \
  --i-demultiplexed-seqs output/demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 150 \
  --p-n-threads 8 \
  --o-representative-sequences output/rep-seqs.qza \
  --o-table output/table.qza \
  --o-denoising-stats output/stats.qza

Saved FeatureTable[Frequency] to: output/table.qza
Saved FeatureData[Sequence] to: output/rep-seqs.qza
Saved SampleData[DADA2Stats] to: output/stats.qza

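Let's check how that went. A good way to judge whether the identified ASVs are representative of the samples is to check how many reads were retained throughout the pipeline. Here are common problems and their solutions:

**Large fraction of reads lost during merging (paired-end only)**

![read overlap](https://gibbons-lab.github.io/isb_course_2023/16S/assets/read_overlap.png)

When merging ASVs, DADA2 by default requires an overlap of 12 bases between the forward and reverse reads. The reads must therefore still overlap sufficiently after trimming. If the amplified region is 450 bp long and you have 2x250 bp reads but trim the last 30 bases of each, shortening them to 220 bp, the reads cover only 2x220 = 440 bp in total. That is less than 450 bp, so there is no overlap. To fix this, trim less of the reads or lower the `--p-min-overlap` parameter (but not too far).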
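
The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the helper `expected_overlap` is a name made up for this example, and it assumes every amplicon has exactly the stated length and that primers are not part of the reads.

```python
def expected_overlap(amplicon_len, fwd_len, rev_len):
    """Bases shared by the forward and reverse read after truncation.

    A negative value means the two reads leave a gap instead of overlapping.
    """
    return fwd_len + rev_len - amplicon_len

MIN_OVERLAP = 12  # default minimum overlap mentioned in the text

for trunc_len in (250, 220):
    overlap = expected_overlap(450, trunc_len, trunc_len)
    status = "can merge" if overlap >= MIN_OVERLAP else "cannot merge"
    print(f"reads truncated to {trunc_len} bp: overlap = {overlap} bp -> {status}")
```

With 250 bp reads the overlap is 50 bases, whereas truncating to 220 bp leaves a 10-base gap, which is the situation described above.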

<br>

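**Most of the reads are lost as chimeras**

![chimeras](https://gibbons-lab.github.io/isb_course_2023/16S/assets/chimera.png)

This is usually an experimental problem, because chimeras are introduced during amplification. If you can adjust your PCR, try running fewer cycles. Chimeras can also result from incorrect merging: if the minimum overlap is too small, ASVs may be merged randomly. Possible fixes are to increase the `--p-min-overlap` parameter or to analyze the forward reads only (in our experience, chimeras are more likely to arise in merged reads). *However, losing 5-25% of your reads to chimeras is normal and requires no adjustment.*

Our denoising statistics are stored in an artifact. To turn them into a visualization, we can use `qiime metadata tabulate`.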

In [10]:
!qiime feature-table tabulate-seqs \
  --i-data output/rep-seqs.qza \
  --o-visualization output/rep-seqs.qzv

!qiime feature-table summarize \
  --i-table output/table.qza \
  --m-sample-metadata-file data/metadata.tsv \
  --o-visualization output/table.qzv

!qiime metadata tabulate \
    --m-input-file output/stats.qza \
    --o-visualization output/stats.qzv

Saved Visualization to: output/rep-seqs.qzv
Saved Visualization to: output/table.qzv
Saved Visualization to: output/stats.qzv

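As before, we can download the .qzv files and view the results with the [QIIME2 Viewer](https://view.qiime2.org/).

It is important to understand these outputs. For example, what percentage of the reads passed the filtering step? What percentage were non-chimeric? Differences in these metrics between samples can influence diversity metrics.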
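
Those percentages are simple ratios against the number of input reads. The sketch below uses made-up counts and hypothetical sample names, so the numbers are not results from this dataset. The column names `input`, `filtered`, and `non-chimeric` are assumed to mirror the DADA2 statistics table, which may be labeled differently in your QIIME 2 version.

```python
import pandas as pd

# Illustrative read counts only; these are NOT results from this dataset.
stats = pd.DataFrame(
    {
        "input": [10000, 20000],
        "filtered": [9000, 15000],
        "non-chimeric": [8100, 9000],
    },
    index=["sample_A", "sample_B"],
)

# Share of the input reads that survive each stage, in percent.
stats["pct_passed_filter"] = 100 * stats["filtered"] / stats["input"]
stats["pct_non_chimeric"] = 100 * stats["non-chimeric"] / stats["input"]
print(stats)
```

A sample such as the hypothetical `sample_B`, which keeps less than half of its reads, is worth a closer look before comparing its diversity with that of the other samples.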

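## Diversity and Phylogenetics
An important metric in microbial ecology is __diversity__. There are two main kinds: ⍺ (alpha) diversity and β (beta) diversity.

Alpha diversity is relatively simple: it is the diversity within a single sample. Think of metrics such as species richness and evenness.

![alpha diversity](https://gibbons-lab.github.io/isb_course_2023/16S/assets/alpha_diversity.png)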
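
To make richness and evenness concrete, here is a minimal sketch that computes both, plus Shannon entropy, for a single sample. The function name `alpha_metrics` and the two example communities are invented for illustration. The choice of log base 2 is an assumption of this sketch and may not match what a given tool reports.

```python
import numpy as np

def alpha_metrics(counts):
    """Return (richness, Shannon entropy in bits, Pielou evenness) for one sample."""
    counts = np.asarray(counts, dtype=float)
    present = counts[counts > 0]           # ignore taxa that were not observed
    richness = int(present.size)           # number of observed taxa
    p = present / present.sum()            # relative abundances
    shannon = float(-(p * np.log2(p)).sum())
    # Evenness rescales Shannon entropy by its maximum for this richness.
    evenness = shannon / np.log2(richness) if richness > 1 else 0.0
    return richness, shannon, evenness

# Hypothetical communities with the same richness but different evenness.
even_community = alpha_metrics([25, 25, 25, 25])
skewed_community = alpha_metrics([97, 1, 1, 1])
print(even_community)
print(skewed_community)
```

Both communities contain four taxa, but the second is dominated by a single taxon and therefore scores much lower on Shannon entropy and evenness.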

Beta diversity looks at the differences between two samples: which taxa are shared, and how their abundances differ.

![beta diversity](https://gibbons-lab.github.io/isb_course_2023/16S/assets/beta_diversity.png)
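
The two questions, which taxa are shared and how abundances differ, correspond to two common distance measures. The sketch below implements the Jaccard distance (presence/absence only) and the Bray-Curtis dissimilarity (abundance-aware) for a pair of hypothetical samples. It assumes each sample contains at least one taxon and is not meant to reproduce any tool's implementation.

```python
import numpy as np

def jaccard_distance(a, b):
    """1 - (shared taxa / taxa present in either sample); ignores abundance."""
    a_present = np.asarray(a) > 0
    b_present = np.asarray(b) > 0
    shared = np.logical_and(a_present, b_present).sum()
    union = np.logical_or(a_present, b_present).sum()
    return 1.0 - shared / union

def bray_curtis_distance(a, b):
    """Sum of absolute count differences divided by the total count of both samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.abs(a - b).sum() / (a + b).sum()

# Hypothetical counts for four taxa in two samples.
sample_1 = [10, 10, 0, 0]
sample_2 = [10, 0, 10, 0]
print(jaccard_distance(sample_1, sample_2))
print(bray_curtis_distance(sample_1, sample_2))
```

Here one of three observed taxa is shared, so the Jaccard distance is 2/3, while half of the total counts differ, so the Bray-Curtis dissimilarity is 0.5.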


### Starting Our Tree
Next, we will build a phylogenetic tree of our sequences with the command below. This time we call the _phylogeny_ plugin in QIIME 2.

In [11]:
!qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences output/rep-seqs.qza \
  --o-alignment output/aligned-rep-seqs.qza \
  --o-masked-alignment output/masked-aligned-rep-seqs.qza \
  --o-tree output/unrooted-tree.qza \
  --o-rooted-tree output/rooted-tree.qza

Saved FeatureData[AlignedSequence] to: output/aligned-rep-seqs.qza
Saved FeatureData[AlignedSequence] to: output/masked-aligned-rep-seqs.qza
Saved Phylogeny[Unrooted] to: output/unrooted-tree.qza
Saved Phylogeny[Rooted] to: output/rooted-tree.qza

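## Calculating Diversity
Using the diversity plugin, we can compute several diversity metrics from the table and the tree. To account for variation in sampling depth, we set a cutoff in QIIME 2 and rarefy every sample to that depth. Because sequences are picked at random, results may vary between runs. We also pass in the metadata file so we can keep track of which group each sample belongs to.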
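
Rarefying amounts to drawing a fixed number of reads from each sample without replacement. The sketch below shows the idea with NumPy; the function `rarefy` and the example counts are invented for illustration, and this is not the code QIIME 2 runs internally. Samples with fewer reads than the chosen depth cannot be rarefied and are dropped.

```python
import numpy as np

def rarefy(counts, depth, rng):
    """Subsample one sample's counts to exactly `depth` reads, without replacement.

    Returns None when the sample has fewer than `depth` reads.
    """
    counts = np.asarray(counts, dtype=int)
    if counts.sum() < depth:
        return None  # too shallow; the sample would be excluded
    # Expand counts into one taxon label per read, then pick `depth` reads.
    reads = np.repeat(np.arange(counts.size), counts)
    picked = rng.choice(reads, size=depth, replace=False)
    return np.bincount(picked, minlength=counts.size)

rng = np.random.default_rng(42)  # fixed seed so this sketch is repeatable
deep_sample = rarefy([9000, 3000, 0, 500], depth=8000, rng=rng)
shallow_sample = rarefy([3000, 100, 5, 0], depth=8000, rng=rng)
print(deep_sample, shallow_sample)
```

Changing the seed changes which reads are drawn, which is why repeated runs can give slightly different diversity values.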

In [12]:
!qiime diversity core-metrics-phylogenetic \
    --i-table output/table.qza \
    --i-phylogeny output/rooted-tree.qza \
    --p-sampling-depth 8000 \
    --m-metadata-file data/metadata.tsv \
    --output-dir diversity

Saved FeatureTable[Frequency] to: diversity/rarefied_table.qza
Saved SampleData[AlphaDiversity] to: diversity/faith_pd_vector.qza
Saved SampleData[AlphaDiversity] to: diversity/observed_features_vector.qza
Saved SampleData[AlphaDiversity] to: diversity/shannon_vector.qza
Saved SampleData[AlphaDiversity] to: diversity/evenness_vector.qza
Saved DistanceMatrix to: diversity/unweighted_unifrac_distance_matrix.qza
Saved DistanceMatrix to: diversity/weighted_unifrac_distance_matrix.qza
Saved DistanceMatrix to: diversity/jaccard_distance_matrix.qza
Saved DistanceMatrix to: diversity/bray_curtis_distance_matrix.qza
Saved PCoAResults to: diversity/unweighted_unifrac_pcoa_results.qza
Saved PCoAResults to: diversity/weighted_unifrac_pcoa_results.qza
Saved PCoAResults to: diversity/jaccard_pcoa_results.qza
Saved PCoAResults to: diversity/bray_curtis_pcoa_results.qza
Saved Visua

## Alpha Diversity

The previous command produces many outputs, covering both alpha and beta diversity. To start, let's use the Shannon vector in the output directory to create a visualization of alpha diversity across samples. Generally, healthy, long-living individuals have balanced, diverse microbiomes. However, diversity is not necessarily a direct indicator of health or disease. Let's see how it looks in our samples.

In [13]:
!qiime diversity alpha-group-significance \
    --i-alpha-diversity diversity/shannon_vector.qza \
    --m-metadata-file data/metadata.tsv \
    --o-visualization diversity/alpha_groups-shannon_vector.qzv

Saved Visualization to: diversity/alpha_groups-shannon_vector.qzv

In [14]:
!qiime diversity alpha-group-significance \
    --i-alpha-diversity diversity/faith_pd_vector.qza \
    --m-metadata-file data/metadata.tsv \
    --o-visualization diversity/alpha_groups-faith_pd_vector.qzv

Saved Visualization to: diversity/alpha_groups-faith_pd_vector.qzv

In [15]:
!qiime diversity alpha-group-significance \
    --i-alpha-diversity diversity/evenness_vector.qza \
    --m-metadata-file data/metadata.tsv \
    --o-visualization diversity/alpha_groups-evenness_vector.qzv

Saved Visualization to: diversity/alpha_groups-evenness_vector.qzv

As before, we can download the visualizations and open them with the QIIME 2 viewer.

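## Beta Diversity

Let's visualize beta diversity and see how the samples separate. This time we will use weighted UniFrac. We need to download this file ⬅️

<br>

We can use PERMANOVA to check whether the separation between samples is "significant". This can be done with the diversity plugin in QIIME 2.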
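
For intuition, PERMANOVA compares a pseudo-F statistic computed on the real group labels against the same statistic computed on shuffled labels. The following is a minimal NumPy sketch of that idea, not the implementation QIIME 2 calls. The function names, the one-dimensional toy "communities", and the 999 permutations are all assumptions made for the example.

```python
import numpy as np

def pseudo_f(dist, groups):
    """PERMANOVA pseudo-F for a square distance matrix and one label per sample."""
    dist = np.asarray(dist, dtype=float)
    groups = np.asarray(groups)
    n = len(groups)
    labels = np.unique(groups)
    # Total sum of squares from all pairwise distances.
    ss_total = (dist[np.triu_indices(n, k=1)] ** 2).sum() / n
    # Within-group sum of squares, one term per group.
    ss_within = 0.0
    for label in labels:
        idx = np.where(groups == label)[0]
        sub = dist[np.ix_(idx, idx)]
        ss_within += (np.triu(sub, k=1) ** 2).sum() / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (len(labels) - 1)) / (ss_within / (n - len(labels)))

def permanova(dist, groups, n_perm=999, seed=0):
    """Return (observed pseudo-F, permutation p-value)."""
    rng = np.random.default_rng(seed)
    observed = pseudo_f(dist, groups)
    hits = sum(pseudo_f(dist, rng.permutation(groups)) >= observed
               for _ in range(n_perm))
    return observed, (hits + 1) / (n_perm + 1)

# Toy data: two tight, well-separated groups of four samples each.
points = np.array([0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3])
dist = np.abs(points[:, None] - points[None, :])
groups = ["healthy"] * 4 + ["Pre-FMT"] * 4
f_stat, p_value = permanova(dist, groups)
print(f_stat, p_value)
```

With only four samples per group there are just 70 distinct ways to split the labels, so the smallest achievable p-value is limited no matter how clean the separation is. Keep this in mind when reading the results for our eight samples.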

In [16]:
!qiime diversity adonis \
    --i-distance-matrix diversity/weighted_unifrac_distance_matrix.qza \
    --m-metadata-file data/metadata.tsv \
    --p-formula "disease_state" \
    --p-n-jobs 2 \
    --o-visualization diversity/permanova.qzv

Saved Visualization to: diversity/permanova.qzv

In [17]:
!qiime diversity beta-group-significance \
    --i-distance-matrix diversity/weighted_unifrac_distance_matrix.qza \
    --m-metadata-file data/metadata.tsv \
    --m-metadata-column disease_state \
    --o-visualization diversity/beta_groups-weighted_unifrac_distance_matrix.qzv \
    --p-pairwise


Saved Visualization to: diversity/beta_groups-weighted_unifrac_distance_matrix.qzv

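## Taxonomy

We can learn a lot from diversity metrics, both alpha and beta. But to really dig into the data, we need to know which microbes are in each sample 🦠. To find out, we will classify the reads in QIIME 2 with a Bayesian classifier. The page https://docs.qiime2.org/2425.7/data-resources lists several such classifiers.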

In [19]:
!curl -sL \
  "https://data.qiime2.org/classifiers/sklearn-1.4.2/greengenes/gg-13-8-99-515-806-nb-classifier.qza" > \
  "output/gg-13-8-99-515-806-nb-classifier.qza"

In [20]:
!qiime feature-classifier classify-sklearn \
    --i-reads output/rep-seqs.qza \
    --i-classifier output/gg-13-8-99-515-806-nb-classifier.qza \
    --p-n-jobs 2 \
    --o-classification output/taxonomy.qza

Saved FeatureData[Taxonomy] to: output/taxonomy.qza

In [21]:
!qiime metadata tabulate \
  --m-input-file output/taxonomy.qza \
  --o-visualization output/taxonomy.qzv

Saved Visualization to: output/taxonomy.qzv

Now that our reads are classified, we can visualize the taxonomic composition of our samples.

In [22]:
!qiime taxa barplot \
    --i-table output/table.qza \
    --i-taxonomy output/taxonomy.qza \
    --m-metadata-file data/metadata.tsv \
    --o-visualization output/taxa_barplot.qzv

Saved Visualization to: output/taxa_barplot.qzv

Now we can use ```table.qza```, which holds our read counts, and ```taxonomy.qza```, which holds the taxonomic classification of those reads, to collapse the data to the genus level.
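
Conceptually, collapsing to the genus level means truncating each lineage to its first six ranks and summing the counts of all features that end up with the same truncated lineage. The pandas sketch below illustrates this on a hypothetical three-ASV table. The helper `collapse_to_level` is invented for this example and, unlike a full implementation, does not handle lineages that have fewer ranks than requested.

```python
import pandas as pd

def collapse_to_level(table, taxonomy, level):
    """Sum feature counts that share the same first `level` taxonomic ranks."""
    lineage = taxonomy.loc[table.index].map(
        lambda text: ";".join(rank.strip() for rank in text.split(";")[:level])
    )
    return table.groupby(lineage).sum()

# Hypothetical ASV counts and Greengenes-style lineages.
table = pd.DataFrame(
    {"sample_A": [10, 5, 3], "sample_B": [0, 2, 7]},
    index=["asv1", "asv2", "asv3"],
)
bacteroides = ("k__Bacteria; p__Bacteroidetes; c__Bacteroidia; "
               "o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides")
escherichia = ("k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; "
               "o__Enterobacteriales; f__Enterobacteriaceae; g__Escherichia")
taxonomy = pd.Series({
    "asv1": bacteroides + "; s__fragilis",
    "asv2": bacteroides + "; s__ovatus",
    "asv3": escherichia + "; s__coli",
})

genus = collapse_to_level(table, taxonomy, level=6)
print(genus)
```

The two hypothetical *Bacteroides* ASVs are merged into a single row, while the *Escherichia* ASV keeps a row of its own.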

In [23]:
!qiime taxa collapse \
    --i-table output/table.qza \
    --i-taxonomy output/taxonomy.qza \
    --p-level 6 \
    --o-collapsed-table output/genus.qza

Saved FeatureTable[Frequency] to: output/genus.qza

We will export this to .tsv format, which is more convenient for the next part of the course.

In [24]:
!qiime tools export \
    --input-path output/genus.qza \
    --output-path exported

Exported output/genus.qza as BIOMV210DirFmt to directory exported

In [25]:
!biom convert -i exported/feature-table.biom -o exported/genus.tsv --to-tsv

In [26]:
abundances = pd.read_table("exported/genus.tsv", skiprows=1, index_col=0)
abundances

Unnamed: 0_level_0,ERR1883195,ERR1883207,ERR1883212,ERR1883214,ERR1883225,ERR1883240,ERR1883250,ERR1883294
#OTU ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Escherichia,23.0,37.0,14.0,51036.0,0.0,1779.0,172.0,9014.0
k__Bacteria;p__Verrucomicrobia;c__Verrucomicrobiae;o__Verrucomicrobiales;f__Verrucomicrobiaceae;g__Akkermansia,6.0,598.0,0.0,54728.0,0.0,35554.0,11467.0,0.0
k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides,24784.0,23969.0,11679.0,59.0,5209.0,39.0,46.0,17.0
k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;__,0.0,0.0,27.0,1553.0,0.0,15863.0,26.0,9230.0
k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Veillonellaceae;g__Dialister,5.0,3.0,17.0,2650.0,0.0,15826.0,0.0,1199.0
...,...,...,...,...,...,...,...,...
k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Dehalobacteriaceae;g__Dehalobacterium,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
k__Bacteria;p__Cyanobacteria;c__Chloroplast;o__Streptophyta;f__;g__,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0
k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__[Tissierellaceae];g__WAL_1855D,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0
k__Bacteria;p__TM7;c__TM7-3;o__Blgi18;f__;g__,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0
