### 0. Setup

Import packages and create a folder for the data

In [2]:
import os
import pandas as pd
from qiime2 import Visualization
import matplotlib.pyplot as plt

%matplotlib inline

In [3]:
# location of this week's data and all the results produced by this notebook 
# - this should be a path relative to your working directory
data_dir = 'project_data'

if not os.path.isdir(data_dir):
    os.makedirs(data_dir)

### 1. Data Import

Import Metadata 

In [25]:
! wget -nv -O $data_dir/cleaned_sample_meta_data.tsv 'https://polybox.ethz.ch/index.php/s/MBLSUQXzglnn66u/download?path=%2F&files=cleaned_sample_meta_data.tsv'

2022-11-12 13:41:37 URL:https://polybox.ethz.ch/index.php/s/MBLSUQXzglnn66u/download?path=%2F&files=cleaned_sample_meta_data.tsv [295152/295152] -> "project_data/cleaned_sample_meta_data.tsv" [1]


In [9]:
metadata_df = pd.read_csv(f'{data_dir}/cleaned_sample_meta_data.tsv', sep='\t', index_col=0)

In [10]:
metadata_df.shape

(523, 56)

Import sequence data

In [28]:
! wget -nv -O $data_dir/seq_data.qza 'https://polybox.ethz.ch/index.php/s/AsLORlvUbwgBWTq/download'

2022-11-12 13:41:43 URL:https://polybox.ethz.ch/index.php/s/AsLORlvUbwgBWTq/download [1506379068/1506379068] -> "project_data/seq_data.qza" [1]


### 2. Sequence loading and summary visualization

Our data had already been demultiplexed: sequencing barcodes were removed and reads were assigned to sample IDs. The data is provided as a QIIME 2 artifact, so we do not need to import the sequences with a manifest file and can visualize the data directly. Because the data was produced on a MiSeq system, which can generate 2 × 300 bp paired-end reads in a single run, we assume the reads were preprocessed and trimmed to 150 bp in some way.

In [29]:
! qiime tools peek $data_dir/seq_data.qza

UUID:        32a1795b-d6fb-4ecc-9166-4fe29fb8206a
Type:        SampleData[PairedEndSequencesWithQuality]
Data format: SingleLanePerSamplePairedEndFastqDirFmt


In [30]:
! ls $data_dir/seq_data.qza

project_data/seq_data.qza


In [59]:
! qiime tools export \
    --input-path $data_dir/seq_data.qzv \
    --output-path $data_dir

Exported project_data/seq_data.qzv as Visualization to directory project_data


In [13]:
! qiime demux filter-samples \
    --i-demux $data_dir/seq_data.qza \
    --m-metadata-file $data_dir/cleaned_sample_meta_data.tsv \
    --o-filtered-demux $data_dir/seq_data_cleaned.qza

Plugin error from demux:

  '10317.00004738' is not a sample present in the demultiplexed data.

Debug info has been saved to /tmp/qiime2-q2cli-err-5y3c38ot.log

In [63]:
! qiime tools peek $data_dir/seq_data_cleaned.qza

UUID:        76b3ef42-c2c1-452a-a9fa-ddb0d5d3fd94
Type:        SampleData[PairedEndSequencesWithQuality]
Data format: SingleLanePerSamplePairedEndFastqDirFmt


In [64]:
! qiime demux summarize \
    --i-data $data_dir/seq_data_cleaned.qza \
    --o-visualization $data_dir/seq_data_cleaned.qzv

Saved Visualization to: project_data/seq_data_cleaned.qzv

In [5]:
Visualization.load(f'{data_dir}/seq_data_cleaned.qzv')

In [36]:
! qiime demux summarize \
    --i-data $data_dir/seq_data.qza \
    --o-visualization $data_dir/seq_data.qzv

Saved Visualization to: project_data/seq_data.qzv

In [4]:
Visualization.load(f'{data_dir}/seq_data.qzv')

## Denoise

As we have sequences from the V4 region of the 16S rRNA gene (~254 bp), we need at least 2 × 130 bp in order to generate a full-length read. In addition, the two reads must overlap by at least 12 nucleotides. We also inspected our initial forward and reverse reads: their quality did not drop towards the end, and all reads are 150 bp long. Thus, we set trunc-len to 0, so no truncation or length filtering is performed.
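The length argument above can be checked with a few lines of arithmetic. This is only a sketch: the amplicon length (~254 bp), the read length (150 bp) and the 12 nt minimum overlap are taken from the text and the `denoise-paired` help, while the function names are made up for illustration. It assumes both reads span the amplicon from opposite ends and ignores any primer or adapter sequence.

```python
import math

AMPLICON_LEN = 254  # approximate V4 amplicon length stated in the text
MIN_OVERLAP = 12    # minimum overlap quoted in the denoise-paired help
READ_LEN = 150      # observed read length, untruncated (trunc-len 0)


def expected_overlap(len_f, len_r, amplicon_len):
    """Overlap (nt) of a forward and a reverse read covering one amplicon."""
    return len_f + len_r - amplicon_len


def min_read_length(amplicon_len, min_overlap):
    """Shortest equal-length reads that still reach the required overlap."""
    return math.ceil((amplicon_len + min_overlap) / 2)


overlap = expected_overlap(READ_LEN, READ_LEN, AMPLICON_LEN)
print(f"expected overlap: {overlap} nt")
print(f"shortest symmetric read length: {min_read_length(AMPLICON_LEN, MIN_OVERLAP)} nt")
```

Under these assumptions, 2 × 150 bp reads leave a comfortable margin above the required overlap, which is consistent with the decision not to truncate.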

In [38]:
!qiime dada2 denoise-paired --help

Usage: qiime dada2 denoise-paired [OPTIONS]

  This method denoises paired-end sequences, dereplicates them, and filters
  chimeras.

Inputs:
  --i-demultiplexed-seqs ARTIFACT SampleData[PairedEndSequencesWithQuality]
                         The paired-end demultiplexed sequences to be
                         denoised.                                  [required]
Parameters:
  --p-trunc-len-f INTEGER
                         Position at which forward read sequences should be
                         truncated due to decrease in quality. This truncates
                         the 3' end of the of the input sequences, which will
                         be the bases that were sequenced in the last cycles.
                         Reads that are shorter than this value will be
                         discarded. After this parameter is applied there must
                         still be at least a 12 nucleotide overla

In [40]:
! qiime dada2 denoise-paired \
    --i-demultiplexed-seqs $data_dir/seq_data_cleaned.qza \
    --p-trunc-len-f 0 \
    --p-trunc-len-r 0 \
    --p-n-threads 3 \
    --o-table $data_dir/dada2_table_cleaned.qza \
    --o-representative-sequences $data_dir/dada2_rep_set_cleaned.qza \
    --o-denoising-stats $data_dir/dada2_stats_cleaned.qza

Saved FeatureTable[Frequency] to: project_data/dada2_table_cleaned.qza
Saved FeatureData[Sequence] to: project_data/dada2_rep_set_cleaned.qza
Saved SampleData[DADA2Stats] to: project_data/dada2_stats_cleaned.qza

In [41]:
! qiime metadata tabulate \
    --m-input-file $data_dir/dada2_stats_cleaned.qza \
    --o-visualization $data_dir/dada2_stats_cleaned.qzv

Saved Visualization to: project_data/dada2_stats_cleaned.qzv

In [42]:
Visualization.load(f'{data_dir}/dada2_stats_cleaned.qzv')

In [43]:
! qiime feature-table tabulate-seqs \
    --i-data $data_dir/dada2_rep_set_cleaned.qza \
    --o-visualization $data_dir/dada2_rep_set_cleaned.qzv

Saved Visualization to: project_data/dada2_rep_set_cleaned.qzv

In the following visualization, we can see that almost all sequences are around the expected length for the V4 region (~254 nt), which indicates successful denoising:

In [44]:
Visualization.load(f'{data_dir}/dada2_rep_set_cleaned.qzv')
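The same length check can also be done numerically instead of by eye. The sketch below is not part of the original analysis: it assumes the representative sequences have been exported to a FASTA file (for example with `qiime tools export`, which typically writes `dna-sequences.fasta`), and the helper names and the ±2 nt tolerance are arbitrary choices for illustration. Toy sequences are used here so the block runs on its own.

```python
def fasta_lengths(lines):
    """Return the length of every record in an iterable of FASTA lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)  # sequences may be wrapped over several lines
    if current is not None:
        lengths.append(current)
    return lengths


def fraction_near(lengths, target=254, tolerance=2):
    """Fraction of sequences whose length is within `tolerance` of `target`."""
    if not lengths:
        return 0.0
    hits = sum(1 for n in lengths if abs(n - target) <= tolerance)
    return hits / len(lengths)


# Toy input standing in for an exported FASTA file (hypothetical data).
toy_fasta = [">seq1", "A" * 253, ">seq2", "A" * 200, "A" * 54, ">seq3", "A" * 120]
toy_lengths = fasta_lengths(toy_fasta)
print(toy_lengths)
print(f"{fraction_near(toy_lengths):.2f} of sequences are within 2 nt of 254")
```

With a real export, the list of lines would come from something like `open(path)` on the exported FASTA file; the path depends on where the export was written.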

In [45]:
! qiime feature-table summarize \
    --i-table $data_dir/dada2_table_cleaned.qza \
    --m-sample-metadata-file $data_dir/cleaned_sample_meta_data.tsv \
    --o-visualization $data_dir/dada2_table_cleaned.qzv

Plugin error from feature-table:

  The following IDs are not present in the metadata: '10317.000002930', '10317.000027920', '10317.000028654', '10317.000032650', '10317.000036170', '10317.000036950', '10317.000037960', '10317.000039980', '10317.000040490', '10317.000041730', '10317.000042590', '10317.000042660', '10317.000044340', '10317.000044550', '10317.000046270', '10317.000046290', '10317.000046336', '10317.000047140', '10317.000047141', '10317.000047220', '10317.000047230', '10317.000047370', '10317.000047380', '10317.000047610', '10317.000047620', '10317.000047680', '10317.000048326', '10317.000050240', '10317.000050273', '10317.000050290', '10317.000051100', '10317.000051130', '10317.000051160', '10317.000051180', '10317.000051210', '10317.000051560', '10317.000052030', '10317.000052055', '10317.000052260', '10317.000052280', '10317.000052370', '10317.000052380', '10317.000052430', '10317.000052450', '10317.000053310', '10317.000053410', '10317.000053430', '10317.0000

In [46]:
Visualization.load(f'{data_dir}/dada2_table_cleaned.qzv')

ValueError: project_data/dada2_table_cleaned.qzv does not exist.
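Both plugin errors in this notebook (a metadata ID missing from the demultiplexed data, and table IDs missing from the metadata) point to sample IDs that exist on one side only. A minimal sketch for listing such mismatches before calling QIIME 2 is shown below. The function name and the toy IDs are hypothetical, and how the real IDs are obtained (from `metadata_df.index`, from the feature table, and so on) is left open.

```python
def compare_sample_ids(metadata_ids, data_ids):
    """Report which sample IDs are shared and which occur on one side only."""
    # Compare as strings so the comparison is on the literal ID text.
    metadata_ids = {str(i) for i in metadata_ids}
    data_ids = {str(i) for i in data_ids}
    return {
        "shared": sorted(metadata_ids & data_ids),
        "only_in_metadata": sorted(metadata_ids - data_ids),
        "only_in_data": sorted(data_ids - metadata_ids),
    }


# Toy IDs for illustration only, not taken from the project data.
toy_metadata_ids = ["S1", "S2", "S3"]
toy_table_ids = ["S2", "S3", "S4"]

report = compare_sample_ids(toy_metadata_ids, toy_table_ids)
for key, ids in report.items():
    print(f"{key}: {ids}")
```

One thing worth checking with real data is how the IDs were parsed: numeric-looking IDs such as `10317.000002930` lose their trailing zeros if they are read as floats, so it may be safer to load the ID column as text (for example with `dtype=str` in `pd.read_csv`) before comparing.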

## Clustering

We tried out both denoising and clustering to compare the methods, and decided to use denoising. The following code is commented out and is provided only for completeness.

### Join the reads

DADA2 performs the equivalent of quality filtering and clustering in a single step, so we can simply use the DADA2 output. If we want to use the clustering approach instead, we first need to join the reads.

### Quality filtering

### Dereplication and Chimera removal

### Clustering

a) De novo clustering

b) Open reference clustering