# Data retrieval and denoising

In [2]:
# importing required packages, setting data and results directories

import os
import qiime2
from qiime2 import Visualization
import pandas as pd

data_dir = 'Data'
results_dir = "Results"

# create the data and results directories if they do not exist yet
os.makedirs(data_dir, exist_ok=True)
os.makedirs(results_dir, exist_ok=True)

## 1. Data retrieval

We use the QIIME 2 fondue plugin to download the sequence data and the corresponding metadata for study ID PRJEB19491.

In [None]:
!echo -e "id\nPRJEB19491" > $data_dir/0-study-id.tsv
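
The same one-column table can also be written from Python, which avoids relying on how the shell's `echo` handles `-e` and `\n`. This is only a minimal sketch; the helper name `write_accession_tsv` is hypothetical and not part of q2-fondue.

```python
import csv
import os


def write_accession_tsv(accessions, path):
    """Write accession IDs as a one-column TSV with an 'id' header."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["id"])
        for accession in accessions:
            writer.writerow([accession])


# Same study ID and output location as the shell command above
write_accession_tsv(["PRJEB19491"], os.path.join("Data", "0-study-id.tsv"))
```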

In [None]:
fondue_env = '/home/jovyan/.conda/envs/fondue/bin'

In [None]:
%%script env fondue_env="$fondue_env" data_dir="$data_dir" bash
# prepend the env location of fondue to PATH so that qiime can find all required executables

export PATH=$fondue_env:$PATH
    
$fondue_env/qiime tools import \
      --type NCBIAccessionIDs \
      --input-path $data_dir/0-study-id.tsv \
      --output-path $data_dir/0-study-id.qza

$fondue_env/qiime fondue get-all \
    --i-accession-ids $data_dir/0-study-id.qza \
    --p-email zakirul.islam@usys.ethz.ch \
    --output-dir $data_dir/0-fondue-output

## 2. Data export

**The 16S rRNA gene was amplified using two primer pairs, `27F-308R` and `Arch349-Arch806`. We export the sequence data and metadata from the qza files and build a manifest for each of the two data sets.**

In [None]:
!qiime tools export \
    --input-path $data_dir/0-fondue-output/paired_reads.qza \
    --output-path $data_dir/0-paired_reads

In [None]:
! gunzip Data/0-paired_reads/*.fastq.gz

In [None]:
%%script env fondue_env="$fondue_env" data_dir="$data_dir" bash
export PATH=$fondue_env:$PATH
$fondue_env/qiime tools export \
    --input-path $data_dir/0-fondue-output/metadata.qza \
    --output-path $data_dir/0-exported-metadata

In [None]:
metadata = pd.read_csv(f'{data_dir}/0-exported-metadata/sra-metadata.tsv', sep = '\t')

In [None]:
metadata['Description [sample]'].value_counts()

In [None]:
metadata[['Phase', "Diets"]] = metadata['Description [sample]'].str.rsplit(" ",expand=True, n = 1)
metadata = metadata[['ID', 'Phase', 'Diets']]
metadata.head()

**The first 27 samples listed in the metadata were amplified with 27F-308R and the others with Arch349-Arch806R. We distinguished the two groups by inspecting the primers in sequences sampled from each sample.**

In [None]:
metadata_bac = metadata[0:27]
metadata_arc = metadata[27:54]
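
The primer inspection mentioned above could be sketched as follows. The primer sequences are assumptions (commonly published variants of 27F and Arch349F, not taken from this study), and the function names are hypothetical. Degenerate IUPAC bases are expanded to regular-expression character classes, and the forward primer is searched for near the start of the first reads of a FASTQ file.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

# ASSUMPTION: commonly published forward primer sequences, not confirmed by the study
PRIMERS = {
    "bacteria_27F": "AGAGTTTGATCMTGGCTCAG",
    "archaea_Arch349F": "GYGCASCAGKCGMGAAW",
}


def primer_regex(primer):
    """Compile a degenerate primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer))


def read_sequences(fastq_lines, limit=100):
    """Yield up to `limit` sequences from an iterable of FASTQ lines."""
    for i, line in enumerate(fastq_lines):
        if i // 4 >= limit:
            break
        if i % 4 == 1:  # the second line of each 4-line record is the sequence
            yield line.strip().upper()


def guess_primer(fastq_lines, window=60, limit=100):
    """Return (best primer name or None, hit counts) for the first `limit` reads."""
    patterns = {name: primer_regex(seq) for name, seq in PRIMERS.items()}
    hits = dict.fromkeys(patterns, 0)
    for seq in read_sequences(fastq_lines, limit):
        for name, pattern in patterns.items():
            if pattern.search(seq[:window]):
                hits[name] += 1
    best = max(hits, key=hits.get)
    return (best if hits[best] > 0 else None), hits


# Toy record: 8 bases of barcode/linker, then the assumed 27F primer, then 4 bases
toy = ["@read1", "ACGTACGT" + "AGAGTTTGATCATGGCTCAG" + "TTTT", "+", "I" * 32]
print(guess_primer(toy))
```

For real data, an open file handle on one of the decompressed forward-read FASTQ files can be passed in place of `toy`.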

In [None]:
metadata_bac.to_csv(f'{data_dir}/0-metadata_bac.tsv', sep = '\t', index=False)
metadata_arc.to_csv(f'{data_dir}/0-metadata_arc.tsv', sep = '\t', index=False)

## 3. Data import

Paired-end sequences in the FASTQ files are imported into QIIME 2 artifacts again, separately for each primer set, for downstream analysis. Manifest files are generated from the location of each FASTQ file.

In [None]:
manifest = pd.read_csv(f'{data_dir}/0-paired_reads/MANIFEST')

In [None]:
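# keep the first 31 characters of each file name (this appears to drop the '.gz' suffix of the files decompressed above)
manifest['filename'] = '$PWD/Data/0-paired_reads/' + manifest['filename'].str.slice(0,31)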

In [None]:
manifest = pd.pivot(manifest, columns= 'direction', values='filename', index = ['sample-id'])

In [None]:
manifest.reset_index(inplace=True)

In [None]:
manifest.rename(columns={"forward": "forward-absolute-filepath", "reverse": "reverse-absolute-filepath"}, inplace=True)

In [None]:
manifest_bac = manifest[0:27]
manifest_arc = manifest[27:54]

In [None]:
manifest_bac.to_csv(f'{data_dir}/0-manifest_bac', sep = '\t', index=False)
manifest_arc.to_csv(f'{data_dir}/0-manifest_arc', sep = '\t', index=False)
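
Slicing by position assumes that the manifest rows are in the same order as the metadata rows, and `pd.pivot` can reorder rows because its result is indexed by `sample-id`. A hedged alternative is to select manifest rows by the IDs in each metadata subset. The sketch below uses made-up toy data, the helper `select_manifest` is hypothetical, and it assumes that `ID` in the metadata and `sample-id` in the manifest hold the same run accessions.

```python
import pandas as pd

# Toy stand-ins for the notebook's `metadata` and `manifest` (all values are made up)
metadata = pd.DataFrame({"ID": ["S3", "S1", "S2", "S4"],
                         "Phase": ["a", "a", "b", "b"]})
manifest = pd.DataFrame({
    "sample-id": ["S1", "S2", "S3", "S4"],
    "forward-absolute-filepath": [f"$PWD/Data/0-paired_reads/S{i}_R1.fastq" for i in range(1, 5)],
    "reverse-absolute-filepath": [f"$PWD/Data/0-paired_reads/S{i}_R2.fastq" for i in range(1, 5)],
})


def select_manifest(manifest, ids):
    """Return manifest rows whose sample-id is in `ids`; fail if any ID is missing."""
    ids = list(ids)
    missing = set(ids) - set(manifest["sample-id"])
    if missing:
        raise ValueError(f"IDs absent from manifest: {sorted(missing)}")
    return manifest[manifest["sample-id"].isin(ids)].reset_index(drop=True)


metadata_first = metadata[0:2]  # plays the role of metadata_bac
manifest_first = select_manifest(manifest, metadata_first["ID"])
print(manifest_first["sample-id"].tolist())
```

In this toy case a positional slice `manifest[0:2]` would pick `S1` and `S2`, whereas the first two metadata rows are `S3` and `S1`; selecting by ID keeps the two tables consistent.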

In [None]:
!head -n 3 Data/0-manifest_arc

In [None]:
! qiime tools import \
    --type "SampleData[PairedEndSequencesWithQuality]" \
    --input-format PairedEndFastqManifestPhred33V2 \
    --input-path Data/0-manifest_arc \
    --output-path Data/1-seqs_arc.qza

In [None]:
! qiime tools import \
    --type "SampleData[PairedEndSequencesWithQuality]" \
    --input-format PairedEndFastqManifestPhred33V2 \
    --input-path Data/0-manifest_bac \
    --output-path Data/1-seqs_bac.qza

## 4. Denoising-Bacteria

Paired-end sequences from the two PCR libraries were denoised separately with q2-dada2. The per-base quality scores are inspected first.

In [None]:
!qiime demux summarize \
      --i-data Data/1-seqs_bac.qza \
      --o-visualization Results/1-seqs_bac.qzv

In [3]:
Visualization.load('Results/1-seqs_bac.qzv')

Barcode, linker, and primer were trimmed by setting the trimming parameters for forward sequences (p-trim-left-f, 28) and reverse sequences (p-trim-left-r, 19). To discard bases with low quality scores while maintaining sufficient overlap for read merging, forward and reverse sequences were truncated to 207 and 199 bp, respectively. All other parameters were left at the q2-dada2 defaults.

In [None]:
!qiime dada2 denoise-paired \
    --i-demultiplexed-seqs Data/1-seqs_bac.qza \
    --p-trim-left-f 28 \
    --p-trim-left-r 19 \
    --p-trunc-len-f 207 \
    --p-trunc-len-r 199 \
    --p-n-threads 3 \
    --o-table Data/1-feature-table_bac.qza \
    --o-representative-sequences Data/1-rep-seqs_bac.qza \
    --o-denoising-stats Data/1-dada2-stats_bac.qza
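
To make the trade-off between truncation and overlap concrete: DADA2 cuts each read at `trunc-len` and removes `trim-left` bases from its 5' end, so each retained read is `trunc-len - trim-left` bases long. The overlap left for merging then depends on the length of the amplified region between the primers, which is not stated in this notebook. The sketch below therefore takes that length as a hypothetical input (`insert_len = 250` is a placeholder, not a measured value), and the 12-base minimum overlap is assumed to be the default of the installed DADA2 version.

```python
def retained_length(trunc_len, trim_left):
    """Length of a read after truncation at `trunc_len` and 5' trimming."""
    return trunc_len - trim_left


def expected_overlap(trunc_f, trim_f, trunc_r, trim_r, insert_len):
    """Bases shared by forward and reverse reads for an insert of `insert_len` bp."""
    return (retained_length(trunc_f, trim_f)
            + retained_length(trunc_r, trim_r)
            - insert_len)


fwd = retained_length(207, 28)   # 179 bases kept from each forward read
rev = retained_length(199, 19)   # 180 bases kept from each reverse read

insert_len = 250                 # HYPOTHETICAL insert length, for illustration only
MIN_OVERLAP = 12                 # ASSUMED DADA2 default minimum overlap

overlap = expected_overlap(207, 28, 199, 19, insert_len)
print(fwd, rev, overlap, overlap >= MIN_OVERLAP)
```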

Among the 1,617,074 high-quality reads obtained after denoising, 8,128 amplicon sequence variants (ASVs) were identified for bacteria in 27 samples.

In [None]:
!qiime feature-table summarize \
    --i-table Data/1-feature-table_bac.qza \
    --m-sample-metadata-file Data/0-metadata_bac.tsv \
    --o-visualization Results/1-feature-table_bac.qzv

In [3]:
Visualization.load('Results/1-feature-table_bac.qzv')

A relatively high proportion (48.95% to 66.53%) of sequences was retained after denoising.

In [None]:
!qiime metadata tabulate \
    --m-input-file Data/1-dada2-stats_bac.qza \
    --o-visualization Results/1-dada2-stats_bac.qzv

In [4]:
Visualization.load('Results/1-dada2-stats_bac.qzv')
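
A per-sample retained proportion can be derived from the denoising statistics as the final read count divided by the input read count. The sketch below uses made-up counts; the column names `input` and `non-chimeric` are assumed to match the q2-dada2 stats table, and it is an assumption that the range quoted above was computed this way.

```python
def percent_retained(rows):
    """Map sample ID to the percentage of input reads kept as non-chimeric reads."""
    return {
        row["sample-id"]: round(100 * row["non-chimeric"] / row["input"], 2)
        for row in rows
    }


# Toy counts for illustration only, not taken from the study
toy_stats = [
    {"sample-id": "S1", "input": 1000, "non-chimeric": 650},
    {"sample-id": "S2", "input": 2000, "non-chimeric": 1000},
]

retained = percent_retained(toy_stats)
print(min(retained.values()), max(retained.values()))
```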

## 5. Denoising-Archaea

Barcode, linker, and primer were trimmed by setting the trimming parameters for forward sequences (p-trim-left-f, 25) and reverse sequences (p-trim-left-r, 20). To discard bases with low quality scores while maintaining sufficient overlap for read merging, forward and reverse sequences were truncated to 247 and 200 bp, respectively. All other parameters were left at the q2-dada2 defaults.

In [None]:
!qiime demux summarize \
      --i-data Data/1-seqs_arc.qza \
      --o-visualization Results/1-seqs_arc.qzv

In [4]:
Visualization.load('Results/1-seqs_arc.qzv')

In [None]:
!qiime dada2 denoise-paired \
    --i-demultiplexed-seqs Data/1-seqs_arc.qza \
    --p-trim-left-f 25 \
    --p-trim-left-r 20 \
    --p-trunc-len-f 247 \
    --p-trunc-len-r 200 \
    --p-n-threads 3 \
    --o-table Data/1-feature-table_arc.qza \
    --o-representative-sequences Data/1-rep-seqs_arc.qza \
    --o-denoising-stats Data/1-dada2-stats_arc.qza

Among the 570,769 high-quality reads obtained after denoising, 815 amplicon sequence variants (ASVs) were identified for archaea in 27 samples.

In [None]:
!qiime feature-table summarize \
    --i-table Data/1-feature-table_arc.qza \
    --m-sample-metadata-file Data/0-metadata_arc.tsv \
    --o-visualization Results/1-feature-table_arc.qzv

In [7]:
Visualization.load('Results/1-feature-table_arc.qzv')

In [None]:
!qiime metadata tabulate \
    --m-input-file Data/1-dada2-stats_arc.qza \
    --o-visualization Results/1-dada2-stats_arc.qzv

In [None]:
Visualization.load('Results/1-dada2-stats_arc.qzv')