### 0. Setup

Import packages and create folder for data

In [19]:
import os
import pandas as pd
from qiime2 import Visualization
import matplotlib.pyplot as plt

%matplotlib inline

In [20]:
# location of this week's data and all the results produced by this notebook 
# - this should be a path relative to your working directory
data_dir = 'seq_data_new'

# create the folder if it does not exist yet
os.makedirs(data_dir, exist_ok=True)

### 1. Data Import

Import Metadata 

In [21]:
metadata_df = pd.read_csv(f'{data_dir}/sample_meta_data.tsv', sep='\t', index_col=0)

In [22]:
metadata_df.head()

sampleid,GEN_age_cat,GEN_age_corrected,GEN_bmi_cat,GEN_bmi_corrected,GEN_cat,GEN_collection_timestamp,GEN_country,GEN_dog,GEN_elevation,GEN_geo_loc_name,...,NUT_probiotic_frequency,NUT_red_meat_frequency,NUT_salted_snacks_frequency,NUT_seafood_frequency,NUT_sugary_sweets_frequency,NUT_vegetable_frequency,NUT_vitamin_b_supplement_frequency,NUT_vitamin_d_supplement_frequency,NUT_whole_eggs,NUT_whole_grain_frequency
10317.000046,20s,20.0,Normal,23.75,False,2016-08-25 18:30:00,USA,True,1919.3,USA:CO,...,Rarely,Regularly,Occasionally,Rarely,Occasionally,Occasionally,Never,Never,Daily,Daily
10317.00005,Not provided,,Overweight,25.61,False,2016-07-06 09:00:00,United Kingdom,False,65.5,United Kingdom:England,...,Rarely,Rarely,Regularly,Occasionally,Regularly,Regularly,Never,Never,Rarely,Occasionally
10317.000038,30s,39.0,Overweight,27.67,False,2016-06-29 09:30:00,United Kingdom,False,44.5,United Kingdom:England,...,Never,Occasionally,Daily,Occasionally,Rarely,Occasionally,Never,Never,Regularly,Occasionally
10317.000047,50s,56.0,Normal,19.71,False,2016-07-12 17:30:00,Germany,False,8.7,Germany:HH,...,Daily,Occasionally,Rarely,Not provided,Rarely,Regularly,Daily,Daily,Rarely,Regularly
10317.000046,40s,45.0,Normal,23.15,False,2016-05-24 19:00:00,United Kingdom,True,68.8,United Kingdom:Unspecified,...,Regularly,Never,Never,Occasionally,Never,Daily,Rarely,Occasionally,Regularly,Daily
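
The sample IDs in the preview above are rendered like floating-point numbers (for example `10317.00005`, and `10317.000046` shows up twice), which suggests that the index column may have been parsed as a float. The sketch below uses only the standard library and a made-up two-row table to illustrate why that can matter: converting an ID to a float can drop trailing zeros, so the value no longer matches the ID string used in other files. The IDs and the column in the toy table are invented for illustration. With pandas, an analogous safeguard would be to read the ID column with `dtype=str` before setting it as the index.

```python
import csv
import io

# Toy TSV with invented sample IDs (not taken from the real metadata file).
toy_tsv = "sampleid\tGEN_country\n10317.000050\tUSA\n10317.000046868\tGermany\n"

rows = list(csv.DictReader(io.StringIO(toy_tsv), delimiter="\t"))

# csv keeps every field as a string, so the IDs are preserved exactly.
ids_as_text = [row["sampleid"] for row in rows]

# Round-tripping through float drops the trailing zero of the first ID.
ids_as_float = [str(float(sample_id)) for sample_id in ids_as_text]

print(ids_as_text)
print(ids_as_float)
```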


Import sequence data

In [23]:
! wget -nv -O $data_dir/seq_data.qza 'https://polybox.ethz.ch/index.php/s/AsLORlvUbwgBWTq/download'

2022-10-20 19:11:46 URL:https://polybox.ethz.ch/index.php/s/AsLORlvUbwgBWTq/download [1506379068/1506379068] -> "seq_data_new/seq_data.qza" [1]


### 2. Sequence loading and summary visualization

Our data had already been demultiplexed: sequencing barcodes were removed and reads were assigned to sample IDs. The data is provided as a QIIME 2 artifact, so we do not need to import the sequences with a manifest file and can visualize them directly. The data was produced on a MiSeq system, which can generate 2 × 300 bp paired-end reads in a single run. As our reads are only 150 bp long, we assume they were preprocessed and trimmed to 150 bp in some way.

In [24]:
! qiime tools peek $data_dir/seq_data.qza

UUID:        32a1795b-d6fb-4ecc-9166-4fe29fb8206a
Type:        SampleData[PairedEndSequencesWithQuality]
Data format: SingleLanePerSamplePairedEndFastqDirFmt
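
For intuition about what `qiime tools peek` reports: a `.qza` artifact is a zip archive whose root folder contains a `metadata.yaml` file with the UUID, type, and format. The sketch below reads these fields with the standard library only. It runs on a tiny stand-in archive built in memory (the `toy-uuid` name and its content are hypothetical), and the helper `peek_archive` is our own illustration, not part of the QIIME 2 API.

```python
import io
import zipfile


def peek_archive(file_obj):
    """Return the key/value pairs from the archive's top-level metadata.yaml."""
    with zipfile.ZipFile(file_obj) as archive:
        # All files live under one root folder; metadata.yaml sits directly inside it.
        name = next(n for n in archive.namelist()
                    if n.count("/") == 1 and n.endswith("/metadata.yaml"))
        text = archive.read(name).decode("utf-8")
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


# Build a tiny stand-in archive in memory (hypothetical, not the real seq_data.qza).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "toy-uuid/metadata.yaml",
        "uuid: toy-uuid\n"
        "type: SampleData[PairedEndSequencesWithQuality]\n"
        "format: SingleLanePerSamplePairedEndFastqDirFmt\n",
    )
buffer.seek(0)

info = peek_archive(buffer)
print(info["type"])
```

Passing the path of a real artifact instead of `buffer` should work the same way, assuming the usual archive layout.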


In [25]:
! ls $data_dir/seq_data.qza

seq_data_new/seq_data.qza


In [26]:
! qiime demux summarize \
    --i-data $data_dir/seq_data.qza \
    --o-visualization $data_dir/seq_data.qzv

Saved Visualization to: seq_data_new/seq_data.qzv

In [27]:
Visualization.load(f'{data_dir}/seq_data.qzv')

## Denoise

As our sequences come from the V4 region of the 16S rRNA gene (~254 bp), the forward and reverse reads together have to span the whole amplicon, so we need at least about 2 × 130 bp in order to reconstruct a full-length sequence. In addition, the two reads must overlap by at least 12 nucleotides. Furthermore, we inspected our initial forward and reverse sequences: their quality did not drop towards the end, and all reads are 150 bp long. Thus, we set trunc-len to 0, so no truncation or length filtering will be performed.

Alternative: to leave some margin, we could set the truncation length to 145. All reads shorter than this would then be discarded, while 2 × 145 bp still leaves enough overlap.

As we know that no read is shorter than 150 bp, setting trunc-len to 0 seems the simpler choice.
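
The arithmetic behind these choices can be checked with a few lines of Python. This is a minimal sketch that assumes the amplicon is exactly 254 bp long and that both reads start at opposite ends of it. The function names are our own.

```python
import math

AMPLICON_LEN = 254   # approximate V4 amplicon length from the text
MIN_OVERLAP = 12     # minimum overlap between forward and reverse read


def expected_overlap(len_forward, len_reverse, amplicon_len=AMPLICON_LEN):
    """Overlap (in nt) between two reads that start at opposite ends of the amplicon."""
    return len_forward + len_reverse - amplicon_len


def min_symmetric_read_len(amplicon_len=AMPLICON_LEN, min_overlap=MIN_OVERLAP):
    """Shortest equal read length that still leaves min_overlap nt of overlap."""
    return math.ceil((amplicon_len + min_overlap) / 2)


print(expected_overlap(150, 150))   # untruncated reads
print(expected_overlap(145, 145))   # the alternative truncation length
print(min_symmetric_read_len())
```

Under these assumptions, both the untruncated 150 bp reads and the 145 bp alternative stay well above the 12-nucleotide minimum.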

In [28]:
!qiime dada2 denoise-paired --help

Usage: qiime dada2 denoise-paired [OPTIONS]

  This method denoises paired-end sequences, dereplicates them, and filters
  chimeras.

Inputs:
  --i-demultiplexed-seqs ARTIFACT SampleData[PairedEndSequencesWithQuality]
                         The paired-end demultiplexed sequences to be
                         denoised.                                  [required]
Parameters:
  --p-trunc-len-f INTEGER
                         Position at which forward read sequences should be
                         truncated due to decrease in quality. This truncates
                         the 3' end of the of the input sequences, which will
                         be the bases that were sequenced in the last cycles.
                         Reads that are shorter than this value will be
                         discarded. After this parameter is applied there must
                         still be at least a 12 nucleotide overla

In [29]:
! qiime dada2 denoise-paired \
    --i-demultiplexed-seqs $data_dir/seq_data.qza \
    --p-trunc-len-f 0 \
    --p-trunc-len-r 0 \
    --p-n-threads 3 \
    --o-table $data_dir/dada2_table.qza \
    --o-representative-sequences $data_dir/dada2_rep_set.qza \
    --o-denoising-stats $data_dir/dada2_stats.qza

Saved FeatureTable[Frequency] to: seq_data_new/dada2_table.qza
Saved FeatureData[Sequence] to: seq_data_new/dada2_rep_set.qza
Saved SampleData[DADA2Stats] to: seq_data_new/dada2_stats.qza

In [30]:
! qiime metadata tabulate \
    --m-input-file $data_dir/dada2_stats.qza \
    --o-visualization $data_dir/dada2_stats.qzv

Saved Visualization to: seq_data_new/dada2_stats.qzv

In [31]:
Visualization.load(f'{data_dir}/dada2_stats.qzv')

In [32]:
! qiime feature-table tabulate-seqs \
    --i-data $data_dir/dada2_rep_set.qza \
    --o-visualization $data_dir/dada2_rep_set.qzv

Saved Visualization to: seq_data_new/dada2_rep_set.qzv

In the following visualization we can see that almost all sequences are around the expected length for the V4 region (~254 nt), which indicates successful denoising:

In [33]:
Visualization.load(f'{data_dir}/dada2_rep_set.qzv')
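
The same length check can also be done programmatically on a FASTA file, for example one obtained by exporting the representative sequences with `qiime tools export`. The sketch below is a minimal, standard-library illustration on a toy FASTA. The feature IDs and sequences are invented, and `read_fasta_lengths` is our own helper.

```python
import io
from collections import Counter


def read_fasta_lengths(handle):
    """Yield the length of every sequence in a FASTA file-like object."""
    length = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


# Toy FASTA with invented feature IDs and sequences, for illustration only.
toy_fasta = io.StringIO(
    ">feature_1\n" + "A" * 253 + "\n"
    ">feature_2\n" + "A" * 200 + "\n" + "A" * 53 + "\n"
    ">feature_3\n" + "A" * 254 + "\n"
)

# Count how many sequences have each length.
length_counts = Counter(read_fasta_lengths(toy_fasta))
print(sorted(length_counts.items()))
```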

In [34]:
! qiime feature-table summarize \
    --i-table $data_dir/dada2_table.qza \
    --m-sample-metadata-file $data_dir/sample_meta_data.tsv \
    --o-visualization $data_dir/dada2_table.qzv

Saved Visualization to: seq_data_new/dada2_table.qzv

In [35]:
Visualization.load(f'{data_dir}/dada2_table.qzv')

## Clustering

### Join the reads

DADA2 covers the equivalent of the quality filtering and clustering steps in one go, so its output can be used directly. If we want to use the clustering approach instead, we first need to join the reads!

In [36]:
!qiime vsearch join-pairs \
    --i-demultiplexed-seqs $data_dir/seq_data.qza \
    --p-minovlen 5 \
    --o-joined-sequences $data_dir/demux-joined.qza

Saved SampleData[JoinedSequencesWithQuality] to: seq_data_new/demux-joined.qza
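
To make the idea of joining read pairs concrete, here is a toy sketch: the reverse read is reverse-complemented and merged with the forward read on their longest exact overlap. This is only an illustration of the principle, not vsearch's algorithm, which also handles mismatches and quality scores. The sequences are invented, and `min_overlap` merely mirrors the role of `--p-minovlen`.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def join_pair(forward, reverse, min_overlap=5):
    """Join a read pair on the longest exact overlap, or return None."""
    rev_rc = reverse_complement(reverse)
    max_len = min(len(forward), len(rev_rc))
    # Try the longest possible overlap first, then shrink.
    for overlap in range(max_len, min_overlap - 1, -1):
        if forward[-overlap:] == rev_rc[:overlap]:
            return forward + rev_rc[overlap:]
    return None


# Invented 20-nt "amplicon", sequenced from both ends with 13-nt reads.
amplicon = "ACGTTGCAAGGCTTACGATC"
forward = amplicon[:13]
reverse = reverse_complement(amplicon[7:])

joined = join_pair(forward, reverse)
print(joined)
```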

In [None]:
!qiime demux summarize \
    --i-data $data_dir/demux-joined.qza \
    --o-visualization $data_dir/demux-joined.qzv

In [None]:
Visualization.load(f'{data_dir}/demux-joined.qzv')

### Quality filtering

In [15]:
! qiime quality-filter q-score \
    --i-demux $data_dir/demux-joined.qza \
    --p-min-quality 25 \
    --p-min-length-fraction 0.75 \
    --o-filtered-sequences $data_dir/demux_seqs_qc.qza \
    --o-filter-stats $data_dir/demux_seqs_qc_stats.qza

Saved SampleData[JoinedSequencesWithQuality] to: seq_data_new/demux_seqs_qc.qza
Saved QualityFilterStats to: seq_data_new/demux_seqs_qc_stats.qza
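
The two parameters used above can be illustrated with a simplified sketch: a read is truncated at the first base whose Phred score falls below the minimum quality, and it is dropped if less than the given fraction of its original length remains. This is a deliberate simplification. The actual plugin has additional parameters that are ignored here, the Phred+33 encoding is an assumption, and the reads are invented.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(char) - offset for char in quality_string]


def quality_truncate(seq, quality_string, min_quality=25, min_length_fraction=0.75):
    """Truncate at the first low-quality base; return None if too little is left."""
    scores = phred_scores(quality_string)
    keep = len(seq)
    for position, score in enumerate(scores):
        if score < min_quality:
            keep = position
            break
    if keep < min_length_fraction * len(seq):
        return None
    return seq[:keep]


# 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
kept = quality_truncate("ACGTACGTAC", "IIIIIIIII#")
dropped = quality_truncate("ACGTACGTAC", "IIII#IIIII")
print(kept, dropped)
```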

In [16]:
! qiime metadata tabulate \
    --m-input-file $data_dir/demux_seqs_qc_stats.qza \
    --o-visualization $data_dir/demux_seqs_qc_stats.qzv

Saved Visualization to: seq_data_new/demux_seqs_qc_stats.qzv

In [17]:
Visualization.load(f'{data_dir}/demux_seqs_qc_stats.qzv')

In [19]:
! qiime demux summarize \
    --i-data $data_dir/demux_seqs_qc.qza \
    --o-visualization $data_dir/demux_seqs_qc.qzv

Saved Visualization to: seq_data_new/demux_seqs_qc.qzv

In [20]:
Visualization.load(f'{data_dir}/demux_seqs_qc.qzv')

### Dereplication and Chimera removal

In [None]:
# Dereplication

! qiime vsearch dereplicate-sequences \
    --i-sequences $data_dir/demux_seqs_qc.qza \
    --o-dereplicated-sequences $data_dir/demux_seqs_derep.qza \
    --o-dereplicated-table $data_dir/demux_table_derep.qza
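
Conceptually, dereplication collapses identical reads into unique sequences and records how often each one occurs in each sample, which is what the dereplicated sequences and table represent. The following toy sketch shows the idea with the standard library. The sample names and reads are invented, and the real implementation works on full-length joined reads.

```python
from collections import Counter, defaultdict

# Invented reads per sample, for illustration only.
reads_by_sample = {
    "sample_A": ["ACGT", "ACGT", "TTGA"],
    "sample_B": ["ACGT", "GGCC", "GGCC", "GGCC"],
}

# Feature table: unique sequence -> {sample: count}.
table = defaultdict(Counter)
for sample, reads in reads_by_sample.items():
    for read in reads:
        table[read][sample] += 1

# Total abundance of each unique sequence across all samples.
total_abundance = {seq: sum(counts.values()) for seq, counts in table.items()}
print(total_abundance)
```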

In [None]:
# Chimera removal

! qiime vsearch uchime-denovo \
    --i-sequences $data_dir/demux_seqs_derep.qza \
    --i-table $data_dir/demux_table_derep.qza \
    --o-chimeras $data_dir/demux_chimeras.qza \
    --o-nonchimeras $data_dir/demux_nonchimeras.qza \
    --o-stats $data_dir/demux_chimera_stats.qza


In [None]:
! qiime feature-table filter-features \
    --i-table $data_dir/demux_table_derep.qza \
    --m-metadata-file $data_dir/demux_nonchimeras.qza \
    --o-filtered-table $data_dir/demux_table_filtered.qza

! qiime feature-table filter-seqs \
    --i-data $data_dir/demux_seqs_derep.qza \
    --m-metadata-file $data_dir/demux_nonchimeras.qza \
    --o-filtered-data $data_dir/demux_seqs_filtered.qza

! qiime feature-table summarize \
    --i-table $data_dir/demux_table_filtered.qza \
    --o-visualization $data_dir/demux_table_filtered.qzv

In [None]:
Visualization.load(f'{data_dir}/demux_table_filtered.qzv')

### Clustering

a) De novo clustering

In [None]:
! qiime vsearch cluster-features-de-novo \
    --i-table $data_dir/demux_table_filtered.qza \
    --i-sequences $data_dir/demux_seqs_filtered.qza \
    --p-perc-identity 0.91 \
    --p-threads 3 \
    --o-clustered-table $data_dir/demux_table_de_novo_91.qza \
    --o-clustered-sequences $data_dir/demux_rep_set_de_novo_91.qza
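
The principle of de novo clustering at a 91% identity threshold can be sketched as a greedy procedure: sequences are processed from most to least abundant, and each one either joins the first existing centroid it matches at or above the threshold or becomes a new centroid. The code below is a toy illustration under strong assumptions (equal-length sequences, identity as the fraction of matching positions, invented data). vsearch computes identity from alignments and differs in many details.

```python
def identity(seq_a, seq_b):
    """Fraction of matching positions; a stand-in for alignment-based identity."""
    if len(seq_a) != len(seq_b):
        return 0.0
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


def greedy_cluster(abundances, threshold=0.91):
    """Assign each sequence to the first centroid it matches at >= threshold identity."""
    centroids = {}  # centroid sequence -> summed abundance
    # The most abundant sequences are processed first and become centroids.
    for seq, count in sorted(abundances.items(), key=lambda item: -item[1]):
        for centroid in centroids:
            if identity(seq, centroid) >= threshold:
                centroids[centroid] += count
                break
        else:
            centroids[seq] = count
    return centroids


# Invented 20-nt sequences: one mismatch gives 95% identity, two give 90%.
base = "ACGTACGTACGTACGTACGT"
one_off = "ACGTACGTACGTACGTACGA"
two_off = "TCGTACGTACGTACGTACGA"

clusters = greedy_cluster({base: 10, one_off: 3, two_off: 2})
print(clusters)
```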

In [None]:
! qiime feature-table tabulate-seqs \
    --i-data $data_dir/demux_rep_set_de_novo_91.qza \
    --o-visualization $data_dir/demux_rep_set_de_novo_91.qzv

! qiime feature-table summarize \
    --i-table $data_dir/demux_table_de_novo_91.qza \
    --m-sample-metadata-file $data_dir/sample_meta_data.tsv \
    --o-visualization $data_dir/demux_table_de_novo_91.qzv

In [None]:
Visualization.load(f'{data_dir}/demux_rep_set_de_novo_91.qzv')

In [None]:
Visualization.load(f'{data_dir}/demux_table_de_novo_91.qzv')

b) Open reference clustering

In [None]:
! qiime tools import \
    --type 'FeatureData[Sequence]' \
    --input-path $data_dir/91_otus.fasta \
    --output-path $data_dir/91_otus.qza

In [None]:
! qiime vsearch cluster-features-open-reference \
    --i-table $data_dir/demux_table_filtered.qza \
    --i-sequences $data_dir/demux_seqs_filtered.qza \
    --i-reference-sequences $data_dir/91_otus.qza \
    --p-perc-identity 0.91 \
    --p-threads 3 \
    --o-clustered-table $data_dir/demux_table_open_ref_91.qza \
    --o-clustered-sequences $data_dir/demux_seqs_open_ref_91.qza \
    --o-new-reference-sequences $data_dir/demux_seqs_open_ref_new_91.qza

In [None]:
! qiime feature-table tabulate-seqs \
    --i-data $data_dir/demux_seqs_open_ref_91.qza \
    --o-visualization $data_dir/demux_seqs_open_ref_91.qzv

! qiime feature-table summarize \
    --i-table $data_dir/demux_table_open_ref_91.qza \
    --m-sample-metadata-file $data_dir/sample_meta_data.tsv \
    --o-visualization $data_dir/demux_table_open_ref_91.qzv

In [None]:
Visualization.load(f'{data_dir}/demux_seqs_open_ref_91.qzv')

In [None]:
Visualization.load(f'{data_dir}/demux_table_open_ref_91.qzv')