In [1]:
import os
import pandas as pd
from qiime2 import Visualization
import matplotlib.pyplot as plt

%matplotlib inline

our_data = 'poop_data'

In [3]:
pwd

'/home/jovyan/Poop-Power'

In [4]:
! qiime tools peek $our_data/sequences.qza

UUID:        32a1795b-d6fb-4ecc-9166-4fe29fb8206a
Type:        SampleData[PairedEndSequencesWithQuality]
Data format: SingleLanePerSamplePairedEndFastqDirFmt


In [5]:
! qiime demux summarize \
    --i-data $our_data/sequences.qza \
    --o-visualization $our_data/sequences.qzv

Saved Visualization to: poop_data/sequences.qzv

In [6]:
Visualization.load(f'{our_data}/sequences.qzv')

### Denoising - Amplicon Sequence Variants (ASV)
First, we denoise the data to create a feature table of ASVs. Denoising (i.e., removing noise) involves the following steps:
1. Find the cleanest sequences (and/or model error probabilities based on quality scores).
2. Correct and/or discard very noisy sequences (those with high expected errors).
3. DADA2: DADA2 builds an error model that distinguishes true differences between sequences from sequencing errors, filters out noisy sequences, and generates a feature table of error-corrected sequences.
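To make "expected errors" in step 2 concrete, here is a minimal sketch of how the expected number of errors in a read follows from its PHRED quality scores (each score Q corresponds to an error probability of 10^(-Q/10)). The function names and the threshold of 2.0 are our own choices for illustration; this is not DADA2's actual implementation.

```python
def expected_errors(phred_scores):
    """Sum of per-base error probabilities, with p = 10 ** (-Q / 10)."""
    return sum(10 ** (-q / 10) for q in phred_scores)


def keep_read(phred_scores, max_ee=2.0):
    """Keep a read only if its expected errors do not exceed max_ee.

    max_ee=2.0 is an assumed threshold used for illustration.
    """
    return expected_errors(phred_scores) <= max_ee


# Hypothetical reads: 100 bases at Q40 vs. a read whose second half drops to Q10.
high_quality = [40] * 100
noisy = [40] * 50 + [10] * 50

print(keep_read(high_quality))  # expected errors are about 0.01
print(keep_read(noisy))         # expected errors are about 5.0
```

A read with uniformly high scores passes easily, while a few dozen low-quality bases are enough to push a read over the threshold.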


In [14]:
# Truncate reads at 131 bp; quality scores drop beyond this position.
# Open question: is 131 bp a good enough cutoff?
! qiime dada2 denoise-single \
    --i-demultiplexed-seqs $our_data/sequences.qza \
    --p-trunc-len 131 \
    --p-n-threads 3 \
    --o-table $our_data/dada2_table.qza \
    --o-representative-sequences $our_data/dada2_rep_set.qza \
    --o-denoising-stats $our_data/dada2_stats.qza

Saved FeatureTable[Frequency] to: poop_data/dada2_table.qza
Saved FeatureData[Sequence] to: poop_data/dada2_rep_set.qza
Saved SampleData[DADA2Stats] to: poop_data/dada2_stats.qza
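As a rough illustration of what `--p-trunc-len 131` does to individual reads, the sketch below cuts every read to a fixed length and drops reads that are shorter than that length (the behaviour described in the DADA2 plugin help). The helper name and the toy reads are hypothetical; the real work is done inside DADA2.

```python
def truncate_reads(reads, trunc_len):
    """Cut each read to trunc_len bases and drop reads shorter than trunc_len."""
    return [read[:trunc_len] for read in reads if len(read) >= trunc_len]


# Toy example with a short truncation length so the effect is easy to see.
toy_reads = ["ACGTACGT", "ACG", "ACGTA"]
print(truncate_reads(toy_reads, 5))
```

Note the trade-off: a longer truncation length keeps more low-quality bases, while a shorter one discards fewer reads but keeps less sequence per read.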

In [15]:
! qiime metadata tabulate \
    --m-input-file $our_data/dada2_stats.qza \
    --o-visualization $our_data/dada2_stats.qzv

Saved Visualization to: poop_data/dada2_stats.qzv

In [16]:
Visualization.load(f'{our_data}/dada2_stats.qzv')

In [17]:
! qiime feature-table tabulate-seqs \
    --i-data $our_data/dada2_rep_set.qza \
    --o-visualization $our_data/dada2_rep_set.qzv

Saved Visualization to: poop_data/dada2_rep_set.qzv

In [18]:
Visualization.load(f'{our_data}/dada2_rep_set.qzv')

### Clustering - Operational Taxonomic Units (OTU)
With clustering, we remove noisy sequences and reduce the number of sequences to process. Sequences are grouped based on a given similarity threshold, e.g. 97% similarity. This approach is less accurate than the denoising method.
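The idea of threshold-based clustering can be sketched with a toy greedy algorithm: each sequence joins the first cluster whose centroid it matches at 97% identity or more, otherwise it starts a new cluster. This is a simplified illustration under strong assumptions (equal-length, already aligned sequences; position-wise identity). Real OTU pickers use alignment-based identity and additional heuristics, and all names below are our own.

```python
def identity(a, b):
    """Fraction of matching positions; assumes equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length in this toy example")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold=0.97):
    """Assign each sequence to the first centroid it matches at >= threshold."""
    clusters = {}  # centroid sequence -> list of member sequences
    for seq in seqs:
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid].append(seq)
                break
        else:
            # No centroid was similar enough: seq becomes a new centroid.
            clusters[seq] = [seq]
    return clusters


ref = "ACGT" * 25            # 100 bp toy sequence
near = "T" + ref[1:]         # 1 mismatch  -> 99% identity to ref
far = "TTTTT" + ref[5:]      # 4 mismatches -> 96% identity to ref

otus = greedy_cluster([ref, near, far])
print(len(otus))  # number of clusters found
```

Here `near` is absorbed into the cluster of `ref`, while `far` falls just below the threshold and forms its own cluster. This also shows why clustering is coarser than denoising: real variants that differ by less than 3% are merged.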

In [19]:
! qiime quality-filter q-score --help

Usage: qiime quality-filter q-score [OPTIONS]

  This method filters sequence based on quality scores and the presence of
  ambiguous base calls.

Inputs:
  --i-demux ARTIFACT SampleData[SequencesWithQuality |
    PairedEndSequencesWithQuality]¹ | SampleData[JoinedSequencesWithQuality]²
                       The demultiplexed sequence data to be quality
                       filtered.                                    [required]
Parameters:
  --p-min-quality INTEGER
                       The minimum acceptable PHRED score. All PHRED scores
                       less that this value are considered to be low PHRED
                       scores.                                    [default: 4]
  --p-quality-window INTEGER
                       The maximum number of low PHRED scores that can be
                       observed in direct succession before truncating a
                       seque