In [1]:
import os
import pandas as pd
from qiime2 import Visualization
import matplotlib.pyplot as plt
import numpy as np

import qiime2 as q2

%matplotlib inline

data_dir = 'poop_data/Taxonomy'

In [None]:
! qiime rescript get-silva-data \
    --p-version '138' \
    --p-target 'SSURef_NR99' \
    --p-include-species-labels \
    --o-silva-sequences $data_dir/silva-138-ssu-nr99-seqs.qza \
    --o-silva-taxonomy $data_dir/silva-138-ssu-nr99-tax.qza

In [10]:
# Using the cleaned database from the week 4 exercise to save time and compute.
# Remove sequences shorter than a per-domain threshold (Archaea, Bacteria, Eukaryota).
# Open question: do these thresholds also apply to our data?
# Option prefixes: --i = input, --p = parameter, --o = output.

! qiime rescript filter-seqs-length-by-taxon \
    --i-sequences $data_dir/silva-138-ssu-nr99-seqs-cleaned.qza \
    --i-taxonomy $data_dir/silva-138-ssu-nr99-tax.qza \
    --p-labels Archaea Bacteria Eukaryota \
    --p-min-lens 900 1200 1400 \
    --o-filtered-seqs $data_dir/silva-138-ssu-nr99-seqs-filt.qza \
    --o-discarded-seqs $data_dir/silva-138-ssu-nr99-seqs-discard.qza

Saved FeatureData[Sequence] to: poop_data/Taxonomy/silva-138-ssu-nr99-seqs-filt.qza
Saved FeatureData[Sequence] to: poop_data/Taxonomy/silva-138-ssu-nr99-seqs-discard.qza
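The length filter above can be sketched in plain Python to make the thresholds concrete. This is a toy illustration, not RESCRIPt's implementation: the function name `filter_by_taxon_length`, the toy sequences, and the choice to keep sequences that match none of the labels are all assumptions of the sketch.

```python
# Minimum lengths per label, as passed to --p-labels / --p-min-lens above.
MIN_LENS = {"Archaea": 900, "Bacteria": 1200, "Eukaryota": 1400}


def filter_by_taxon_length(seqs, taxonomy, min_lens):
    """Split sequences into (kept, discarded) using a per-label minimum length.

    Assumption of this sketch: a sequence whose taxonomy matches no label is kept.
    """
    kept, discarded = {}, {}
    for seq_id, seq in seqs.items():
        taxon = taxonomy.get(seq_id, "")
        # The first label found in the taxonomy string decides the threshold.
        threshold = next(
            (n for label, n in min_lens.items() if label in taxon), None
        )
        if threshold is not None and len(seq) < threshold:
            discarded[seq_id] = seq
        else:
            kept[seq_id] = seq
    return kept, discarded


# Hypothetical toy records (IDs, lengths and lineages are made up).
toy_seqs = {"a1": "A" * 950, "b1": "A" * 1000, "b2": "A" * 1300, "e1": "A" * 1399}
toy_tax = {
    "a1": "d__Archaea; p__Euryarchaeota",
    "b1": "d__Bacteria; p__Firmicutes",
    "b2": "d__Bacteria; p__Bacteroidota",
    "e1": "d__Eukaryota; p__Ascomycota",
}
kept, discarded = filter_by_taxon_length(toy_seqs, toy_tax, MIN_LENS)
print(sorted(kept), sorted(discarded))
```

A 1000 bp bacterial sequence is discarded while a 950 bp archaeal one is kept, which is the point of setting the threshold per domain.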

As the SILVA database may contain multiple identical sequences with the same or different taxonomies, we also dereplicate it. Identical sequence records are kept only when their taxonomies differ (`uniq` mode).
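A minimal sketch of what `uniq` mode means, in plain Python. It is an illustration under assumptions, not how RESCRIPt works internally: the function name `dereplicate_uniq` and the toy records are hypothetical, and the real plugin may pick a different representative ID for each group.

```python
def dereplicate_uniq(seqs, taxonomy):
    """Keep one record per distinct (sequence, taxonomy) pair.

    Identical sequences survive more than once only if their taxonomies differ.
    """
    seen = set()
    kept_seqs, kept_tax = {}, {}
    for seq_id, seq in seqs.items():
        key = (seq, taxonomy[seq_id])
        if key in seen:
            continue  # same sequence and same taxonomy: a true duplicate
        seen.add(key)
        kept_seqs[seq_id] = seq
        kept_tax[seq_id] = taxonomy[seq_id]
    return kept_seqs, kept_tax


# Hypothetical toy records.
toy_seqs = {"s1": "ACGT", "s2": "ACGT", "s3": "ACGT", "s4": "GGCC"}
toy_tax = {
    "s1": "d__Bacteria; g__Escherichia",
    "s2": "d__Bacteria; g__Escherichia",  # duplicate of s1, dropped
    "s3": "d__Bacteria; g__Shigella",     # same sequence, other taxonomy, kept
    "s4": "d__Bacteria; g__Escherichia",  # different sequence, kept
}
derep_seqs, derep_tax = dereplicate_uniq(toy_seqs, toy_tax)
print(sorted(derep_seqs))
```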

In [11]:
! qiime rescript dereplicate \
    --i-sequences $data_dir/silva-138-ssu-nr99-seqs-filt.qza  \
    --i-taxa $data_dir/silva-138-ssu-nr99-tax.qza \
    --p-rank-handles 'silva' \
    --p-mode 'uniq' \
    --p-threads 3 \
    --o-dereplicated-sequences $data_dir/silva-138-ssu-nr99-seqs-derep-uniq.qza \
    --o-dereplicated-taxa $data_dir/silva-138-ssu-nr99-tax-derep-uniq.qza

Saved FeatureData[Sequence] to: poop_data/Taxonomy/silva-138-ssu-nr99-seqs-derep-uniq.qza
Saved FeatureData[Taxonomy] to: poop_data/Taxonomy/silva-138-ssu-nr99-tax-derep-uniq.qza



#the forward and reverse primers used in this experiment:

    FWD: GTGYCAGCMGCCGCGGTAA
    REV: GGACTACNVGGGTWTCTAAT
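Both primers contain degenerate IUPAC bases (Y, M, N, V, W). The sketch below shows how such primers can be turned into regular expressions to cut an amplicon out of a reference sequence in silico. It is a simplified illustration: it requires exact matches and returns the region between the primers, whereas `extract-reads` tolerates mismatches. The helper names and the toy template are hypothetical.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement of each IUPAC code (degenerate codes map to degenerate codes).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

FWD = "GTGYCAGCMGCCGCGGTAA"
REV = "GGACTACNVGGGTWTCTAAT"


def reverse_complement(seq):
    """Reverse-complement a sequence that may contain IUPAC codes."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Expand a degenerate primer into a regular expression."""
    return "".join(IUPAC[base] for base in primer)


def extract_amplicon(seq, fwd=FWD, rev=REV):
    """Return the region between the primer sites, or None if either is missing."""
    # On the forward strand the reverse primer appears as its reverse complement.
    pattern = primer_regex(fwd) + "(.*?)" + primer_regex(reverse_complement(rev))
    match = re.search(pattern, seq)
    return match.group(1) if match else None


# Hypothetical toy template: flank + forward site + insert + reverse site + flank.
template = (
    "TTTT" + "GTGCCAGCAGCCGCGGTAA" + "ACGTACGTAC" + "ATTAGATACCCTAGTAGTCC" + "GGGG"
)
amplicon = extract_amplicon(template)
print(reverse_complement(REV), amplicon)
```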


In [12]:
! qiime feature-classifier extract-reads \
    --i-sequences $data_dir/silva-138-ssu-nr99-seqs-derep-uniq.qza \
    --p-f-primer GTGYCAGCMGCCGCGGTAA \
    --p-r-primer GGACTACNVGGGTWTCTAAT \
    --p-n-jobs 3 \
    --p-read-orientation 'forward' \
    --o-reads $data_dir/silva-our-primers.qza

Saved FeatureData[Sequence] to: poop_data/Taxonomy/silva-our-primers.qza

Since the extracted sequences are significantly shorter than the ones we started with, we need to dereplicate the database again: sequences that were distinct at full length may now be identical. Some of these identical sequences may also point to different taxonomies, so we need to handle those too.

In [13]:
! qiime rescript dereplicate \
    --i-sequences $data_dir/silva-our-primers.qza \
    --i-taxa $data_dir/silva-138-ssu-nr99-tax-derep-uniq.qza \
    --p-rank-handles 'silva' \
    --p-mode 'uniq' \
    --p-threads 3 \
    --o-dereplicated-sequences $data_dir/silva-our-primers-derep-uniq.qza \
    --o-dereplicated-taxa  $data_dir/silva-138-ssu-nr99-tax-515f-806r-derep-uniq.qza

Saved FeatureData[Sequence] to: poop_data/Taxonomy/silva-our-primers-derep-uniq.qza
Saved FeatureData[Taxonomy] to: poop_data/Taxonomy/silva-138-ssu-nr99-tax-515f-806r-derep-uniq.qza

**Training classifier:** a pretrained classifier will probably not work if we have to change the primer sequences.

In [2]:
 ! qiime feature-classifier fit-classifier-naive-bayes \
     --i-reference-reads $data_dir/silva-our-primers-derep-uniq.qza \
     --i-reference-taxonomy $data_dir/silva-138-ssu-nr99-tax-515f-806r-derep-uniq.qza \
     --o-classifier $data_dir/trained-classifier.qza --verbose



We first tried a pretrained classifier built on full-length sequences, because our primers do not match the primers of the region-specific pretrained classifiers.
We also tried the human stool classifier, but it is still too big; every SILVA-based classifier is too large for us. The only one that works is the weighted Greengenes 515F/806R classifier. We do not yet know how the weighted version differs from the first classifier used in the exercise. The output taxonomy file is twice as big (?).
Using the 515F/806R classifier is acceptable because our primers align to the same positions in the sequence, which we checked with BLAST against the E. coli 16S sequence.

In [3]:
! wget -nv -O $data_dir/weighted-greengenes-515f-806r-classifier.qza https://data.qiime2.org/2022.8/common/gg-13-8-99-515-806-nb-weighted-classifier.qza

2022-11-02 13:48:38 URL:https://s3-us-west-2.amazonaws.com/qiime2-data/2022.8/common/gg-13-8-99-515-806-nb-weighted-classifier.qza [28738550/28738550] -> "poop_data/Taxonomy/weighted-greengenes-515f-806r-classifier.qza" [1]


**Assigning Taxonomy:** is Greengenes good enough, or should we use a better reference? Who can run this on their computer?

In [4]:
! qiime feature-classifier classify-sklearn \
    --i-classifier $data_dir/weighted-greengenes-515f-806r-classifier.qza \
    --i-reads poop_data/Denoising/dada2_rep_set.qza \
    --o-classification $data_dir/taxonomy_new.qza

Saved FeatureData[Taxonomy] to: poop_data/Taxonomy/taxonomy_new.qza

**Visualization**

In [5]:
! qiime metadata tabulate \
    --m-input-file $data_dir/taxonomy_new.qza \
    --o-visualization $data_dir/taxonomy_new.qzv

Saved Visualization to: poop_data/Taxonomy/taxonomy_new.qzv

In [6]:
Visualization.load(f'{data_dir}/taxonomy.qzv')

In [7]:
Visualization.load(f'{data_dir}/taxonomy_new.qzv')

**Filtering out mitochondria and chloroplasts**

In [8]:
! qiime taxa filter-table \
    --i-table poop_data/Denoising/dada2_table.qza \
    --i-taxonomy $data_dir/taxonomy_new.qza \
    --p-exclude mitochondria,chloroplast \
    --o-filtered-table $data_dir/table-filtered_new.qza

# Use the same exclusion terms as for the table so both artifacts stay in sync.
! qiime taxa filter-seqs \
    --i-sequences poop_data/Denoising/dada2_rep_set.qza \
    --i-taxonomy $data_dir/taxonomy_new.qza \
    --p-exclude mitochondria,chloroplast \
    --o-filtered-sequences $data_dir/rep-seqs-filtered_new.qza

# removes 7 ASVs
# removes 10 ASVs with the new taxonomy

Saved FeatureTable[Frequency] to: poop_data/Taxonomy/table-filtered_new.qza
Saved FeatureData[Sequence] to: poop_data/Taxonomy/rep-seqs-filtered_new.qza
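To see which features an exclusion list like `mitochondria,chloroplast` would remove, the matching can be sketched in plain Python. This sketch assumes case-insensitive substring matching on the taxonomy string, which is my understanding of the default `contains` mode. The function names and the toy Greengenes-style lineages are hypothetical.

```python
def is_excluded(taxon, exclude_terms):
    """True if any exclusion term occurs in the taxonomy string (case-insensitive)."""
    taxon_lower = taxon.lower()
    return any(term.lower() in taxon_lower for term in exclude_terms)


def filter_features(taxonomy, exclude="mitochondria,chloroplast"):
    """Return the taxonomy entries that survive the exclusion list."""
    terms = [t.strip() for t in exclude.split(",") if t.strip()]
    return {
        fid: tax for fid, tax in taxonomy.items() if not is_excluded(tax, terms)
    }


# Hypothetical toy annotations.
toy_taxonomy = {
    "asv1": "k__Bacteria; p__Cyanobacteria; c__Chloroplast; o__Streptophyta",
    "asv2": "k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; "
            "o__Rickettsiales; f__mitochondria",
    "asv3": "k__Bacteria; p__Firmicutes; c__Clostridia",
}
surviving = filter_features(toy_taxonomy)
removed = sorted(set(toy_taxonomy) - set(surviving))
print(sorted(surviving), removed)
```

Applying one term list to both the feature table and the representative sequences keeps the two artifacts describing the same set of ASVs.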

In [9]:
! qiime metadata tabulate \
    --m-input-file $data_dir/rep-seqs-filtered_new.qza \
    --o-visualization $data_dir/rep-seqs-filtered_new.qzv

Saved Visualization to: poop_data/Taxonomy/rep-seqs-filtered_new.qzv

In [10]:
# Visualization of the filtered sequences. This view has no BLAST links; it is not the same table.
Visualization.load(f'{data_dir}/rep-seqs-filtered_new.qzv')

In [11]:
! qiime metadata tabulate \
    --m-input-file $data_dir/table-filtered_new.qza \
    --o-visualization $data_dir/table-filtered_new.qzv

Saved Visualization to: poop_data/Taxonomy/table-filtered_new.qzv

In [None]:
! qiime metadata tabulate \
    --m-input-file poop_data/dada2_rep_set.qza \
    --o-visualization $data_dir/test2.qzv

In [None]:
Visualization.load(f'{data_dir}/test2.qzv')

In [15]:
! qiime taxa barplot \
    --i-table $data_dir/table-filtered_new.qza \
    --i-taxonomy $data_dir/taxonomy_new.qza \
    --m-metadata-file poop_data/metadata.tsv \
    --o-visualization $data_dir/table-filtered_new.qzv

Saved Visualization to: poop_data/Taxonomy/table-filtered_new.qzv

In [17]:
Visualization.load(f'{data_dir}/table-filtered.qzv')

**Loading it into pandas**

In [None]:
pd.set_option('display.max_colwidth', 150)

In [None]:
# note: QIIME 2 artifact files can be loaded as python objects! This is how.
taxa = q2.Artifact.load(f'{data_dir}/taxonomy_new.qza')
# view as a `pandas.DataFrame`. Note: Only some Artifact types can be transformed to DataFrames
taxa = taxa.view(pd.DataFrame)

In [None]:
taxa.head()
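Once the taxonomy is in Python, the semicolon-separated lineage strings are easier to work with when split into ranks. The sketch below does this for Greengenes-style prefixes (`k__`, `p__`, ...) using only the standard library. The function name `split_taxon` and the example lineages are hypothetical, and applying it to the DataFrame assumes the lineage column is called `Taxon`.

```python
# Greengenes-style rank prefixes.
RANKS = {
    "k": "kingdom", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def split_taxon(taxon):
    """Split a 'k__...; p__...' lineage into a {rank: name} dict.

    Ranks with an empty name (e.g. 'g__') are skipped.
    """
    ranks = {}
    for part in taxon.split(";"):
        prefix, sep, name = part.strip().partition("__")
        if sep and prefix in RANKS and name:
            ranks[RANKS[prefix]] = name
    return ranks


# Hypothetical example lineages.
full = split_taxon(
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; "
    "f__Ruminococcaceae; g__Faecalibacterium; s__prausnitzii"
)
partial = split_taxon("k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__; f__; g__; s__")
print(full["genus"], sorted(partial))
```

With the DataFrame from above, something like `taxa['Taxon'].apply(split_taxon).apply(pd.Series)` should give one column per rank, assuming that column name.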

Need to do something to produce a change and stage it.