In [31]:
import os
import pandas as pd
from qiime2 import Visualization
import matplotlib.pyplot as plt

%matplotlib inline

our_data = 'poop_data/Denoising'
data_dir = 'poop_data'

In [30]:
pwd

'/home/jovyan/Poop-Power'

In the following step we inspect our sequences and their data format. They turn out to be paired-end sequences, which is important because it determines which type of denoising command we select.
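The link between the data type and the denoising command can be written down as a small lookup. This is only a sketch: the helper name `pick_denoise_action` is hypothetical, and the mapping lists the usual choice for each semantic type rather than every input a DADA2 action accepts.

```python
# Usual DADA2 action for each QIIME 2 semantic type, as printed on the
# "Type" line of `qiime tools peek`. The mapping is a simplification.
DENOISE_ACTION_BY_TYPE = {
    "SampleData[PairedEndSequencesWithQuality]": "denoise-paired",
    "SampleData[SequencesWithQuality]": "denoise-single",
}


def pick_denoise_action(semantic_type: str) -> str:
    """Return the DADA2 action name usually used for this semantic type."""
    try:
        return DENOISE_ACTION_BY_TYPE[semantic_type]
    except KeyError:
        raise ValueError(f"No DADA2 action known for type {semantic_type!r}")


print(pick_denoise_action("SampleData[PairedEndSequencesWithQuality]"))
```

For paired-end data such as ours, this points to `qiime dada2 denoise-paired`.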

In [33]:
! qiime tools peek $data_dir/sequences.qza

UUID:        32a1795b-d6fb-4ecc-9166-4fe29fb8206a
Type:        SampleData[PairedEndSequencesWithQuality]
Data format: SingleLanePerSamplePairedEndFastqDirFmt


In [34]:
! qiime demux summarize \
    --i-data $data_dir/sequences.qza \
    --o-visualization $data_dir/sequences.qzv

Saved Visualization to: poop_data/sequences.qzv

In [35]:
Visualization.load(f'{data_dir}/sequences.qzv')

### Denoising - Amplicon Sequence Variants (ASV)
First we denoise the data to create a feature table of ASVs. Denoising (= removing noise) involves the following steps:
1. Find the cleanest sequences (and/or model error probabilities based on quality scores).
2. Correct and/or discard very noisy sequences (high expected errors).
3. DADA2: DADA2 builds an error model that can identify differences between sequences, filters out noisy sequences, and generates a feature table of error-corrected sequences.
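To make "high expected errors" in step 2 concrete, here is a minimal sketch of how the expected number of errors in a read follows from its Phred quality scores (error probability per base is 10^(-Q/10)). The function names, the toy reads, and the threshold value `max_ee=2.0` are assumptions for illustration. This is not DADA2's actual implementation, which additionally learns an error model from the data.

```python
def expected_errors(phred_scores):
    """Sum of per-base error probabilities, p = 10 ** (-Q / 10)."""
    return sum(10 ** (-q / 10) for q in phred_scores)


def passes_filter(phred_scores, max_ee=2.0):
    """Keep a read only if its expected errors do not exceed max_ee."""
    return expected_errors(phred_scores) <= max_ee


# Two made-up reads of 100 bases each.
clean_read = [40] * 100  # each base has error probability 1e-4
noisy_read = [10] * 100  # each base has error probability 0.1

print(round(expected_errors(clean_read), 4))  # 0.01
print(round(expected_errors(noisy_read), 4))  # 10.0
print(passes_filter(clean_read), passes_filter(noisy_read))  # True False
```

A read with uniformly high quality contributes almost no expected errors, while a low-quality read of the same length is discarded.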


In [36]:
! qiime dada2 denoise-paired \
    --i-demultiplexed-seqs $data_dir/sequences.qza \
    --p-trunc-len-f 131 \
    --p-trunc-len-r 145 \
    --p-n-threads 3 \
    --o-table $our_data/dada2_table.qza \
    --o-representative-sequences $our_data/dada2_rep_set.qza \
    --o-denoising-stats $our_data/dada2_stats.qza

Saved FeatureTable[Frequency] to: poop_data/Denoising/dada2_table.qza
Saved FeatureData[Sequence] to: poop_data/Denoising/dada2_rep_set.qza
Saved SampleData[DADA2Stats] to: poop_data/Denoising/dada2_stats.qza

In [37]:
! qiime metadata tabulate \
    --m-input-file $our_data/dada2_stats.qza \
    --o-visualization $our_data/dada2_stats.qzv

Saved Visualization to: poop_data/Denoising/dada2_stats.qzv

In [38]:
Visualization.load(f'{our_data}/dada2_stats.qzv')

In [39]:
! qiime feature-table tabulate-seqs \
    --i-data $our_data/dada2_rep_set.qza \
    --o-visualization $our_data/dada2_rep_set.qzv

Saved Visualization to: poop_data/Denoising/dada2_rep_set.qzv

In [40]:
Visualization.load(f'{our_data}/dada2_rep_set.qzv')

### Clustering - Operational Taxonomic Units (OTU)
Clustering removes noisy sequences and reduces the number of sequences to process. It groups sequences based on a given similarity threshold, e.g. 97% similarity. It is less accurate than the denoising method, so it is not needed here.
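For comparison, the idea of threshold-based clustering can be sketched as a toy greedy procedure. Everything here is an assumption for illustration: the function names are hypothetical, the sequences are made up, and identity is measured by simple position-wise matching on equal-length sequences, whereas real OTU-picking tools align the sequences and usually process them in order of abundance.

```python
def identity(a, b):
    """Fraction of matching positions (toy measure, equal lengths only)."""
    if len(a) != len(b):
        raise ValueError("toy identity needs sequences of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold=0.97):
    """Assign each sequence to the first centroid it matches at >= threshold."""
    clusters = {}  # centroid sequence -> list of member sequences
    for seq in seqs:
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid].append(seq)
                break
        else:
            # No centroid was similar enough, so this sequence starts a new OTU.
            clusters[seq] = [seq]
    return clusters


base = "A" * 100
one_off = "C" + "A" * 99       # 99% identical to base
five_off = "C" * 5 + "A" * 95  # 95% identical to base

clusters = greedy_cluster([base, one_off, five_off])
print(len(clusters))  # 2
```

The sequence with one mismatch is absorbed into the first OTU, so a real single-nucleotide variant would be lost at this threshold. That loss of resolution is why denoising into ASVs is preferred here.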