In [17]:
import os
import pandas as pd
from qiime2 import Visualization
import matplotlib.pyplot as plt
import numpy as np

import qiime2 as q2

%matplotlib inline

data_dir = 'poop_data/Taxonomy'

In [4]:
! qiime rescript get-silva-data \
    --p-version '138' \
    --p-target 'SSURef_NR99' \
    --p-include-species-labels \
    --o-silva-sequences $data_dir/silva-138-ssu-nr99-seqs.qza \
    --o-silva-taxonomy $data_dir/silva-138-ssu-nr99-tax.qza

[32mSaved FeatureData[RNASequence] to: poop_data/silva-138-ssu-nr99-seqs.qza[0m
[32mSaved FeatureData[Taxonomy] to: poop_data/silva-138-ssu-nr99-tax.qza[0m
[0m



**I don't know how to find the primers used. not in the metadata. for now I am taking the same as in w4 exercise**

#To do this, we will require sequences of both, the forward and reverse, primers used in this experiment - you can look those up in the metadata of this experiment using the SRA Run Selector. We see the following sequences:

    forward: GTGCCAGCMGCCGCGGTAA
    reverse: GGACTACHVGGGTWTCTAAT


In [13]:
! qiime feature-classifier extract-reads \
    --i-sequences $data_dir/silva-138-ssu-nr99-seqs-derep-uniq.qza \
    --p-f-primer GTGCCAGCMGCCGCGGTAA \
    --p-r-primer GGACTACHVGGGTWTCTAAT \
    --p-n-jobs 3 \
    --p-read-orientation 'forward' \
    --o-reads $data_dir/silva-138-ssu-nr99-seqs-515f-806r.qza

[32mSaved FeatureData[Sequence] to: poop_data/silva-138-ssu-nr99-seqs-515f-806r.qza[0m
[0m

Since we now have significantly shorter sequences than with started with, we need to dereplicate the database again. Conversely, after extraction some unique sequences may point to different taxonomies so we need to handle those too.

In [14]:
! qiime rescript dereplicate \
    --i-sequences $data_dir/silva-138-ssu-nr99-seqs-515f-806r.qza \
    --i-taxa $data_dir/silva-138-ssu-nr99-tax-derep-uniq.qza \
    --p-rank-handles 'silva' \
    --p-mode 'uniq' \
    --p-threads 3 \
    --o-dereplicated-sequences $data_dir/silva-138-ssu-nr99-seqs-515f-806r-uniq.qza \
    --o-dereplicated-taxa  $data_dir/silva-138-ssu-nr99-tax-515f-806r-derep-uniq.qza

[32mSaved FeatureData[Sequence] to: poop_data/silva-138-ssu-nr99-seqs-515f-806r-uniq.qza[0m
[32mSaved FeatureData[Taxonomy] to: poop_data/silva-138-ssu-nr99-tax-515f-806r-derep-uniq.qza[0m
[0m

**Training classifier:** using pretrained classifier will probably not work if we have to change the primer sequneces

In [15]:
! wget -nv -O $data_dir/515f-806r-classifier.qza https://data.qiime2.org/2021.4/common/gg-13-8-99-515-806-nb-classifier.qza

2022-10-13 09:08:03 URL:https://s3-us-west-2.amazonaws.com/qiime2-data/2021.4/common/gg-13-8-99-515-806-nb-classifier.qza [28289645/28289645] -> "poop_data/515f-806r-classifier.qza" [1]


**Assigning Taxonomy** greengenes is okay or should we use better one? who can run this on their computer?

In [21]:
! qiime feature-classifier classify-sklearn \
    --i-classifier $data_dir/515f-806r-classifier.qza \
    --i-reads $poop_data/Denoising/dada2_rep_set.qza \
    --o-classification $data_dir/taxonomy.qza

Usage: [94mqiime feature-classifier classify-sklearn[0m [OPTIONS]

  Classify reads by taxon using a fitted classifier.

[1mInputs[0m:
  [94m[4m--i-reads[0m ARTIFACT [32mFeatureData[Sequence][0m
                         The feature data to be classified.         [35m[required][0m
  [94m[4m--i-classifier[0m ARTIFACT
    [32mTaxonomicClassifier[0m  The taxonomic classifier for classifying the reads.
                                                                    [35m[required][0m
[1mParameters[0m:
  [94m--p-reads-per-batch[0m VALUE [32mInt % Range(1, None) | Str % Choices('auto')[0m
                         Number of reads to process in each batch. If "auto",
                         this parameter is autoscaled to min( number of query
                         sequences / [4mn-jobs[0m, 20000).         [35m[default: 'auto'][0m
  [94m--p-n-jobs[0m INTEGER     The maximum number of concurrently worker processes.
                         If -1 all CPUs are u

In [3]:
! qiime tools peek $data_dir/taxonomy.qza

[32mUUID[0m:        92749e22-0fbc-45d0-96f7-40f11acb3083
[32mType[0m:        FeatureData[Taxonomy]
[32mData format[0m: TSVTaxonomyDirectoryFormat


**Visualization**

In [4]:
! qiime metadata tabulate \
    --m-input-file $data_dir/taxonomy.qza \
    --o-visualization $data_dir/taxonomy.qzv

[32mSaved Visualization to: poop_data/taxonomy.qzv[0m
[0m

In [18]:
Visualization.load(f'{data_dir}/taxonomy.qzv')

**filtering out mitochondria and chloroplasts**

In [12]:
! qiime taxa filter-table \
    --i-table $data_dir/dada2_table.qza \
    --i-taxonomy $data_dir/taxonomy.qza \
    --p-exclude mitochondria,chloroplast \
    --o-filtered-table $data_dir/table-filtered.qza

! qiime taxa filter-seqs \
    --i-sequences $data_dir/dada2_rep_set.qza \
    --i-taxonomy $data_dir/taxonomy.qza \
    --p-exclude mitochondria \
    --o-filtered-sequences $data_dir/rep-seqs-filtered.qza

#removes 7 ASVs

[32mSaved FeatureTable[Frequency] to: poop_data/table-filtered.qza[0m
[0m[32mSaved FeatureData[Sequence] to: poop_data/rep-seqs-filtered.qza[0m
[0m

In [16]:
! qiime metadata tabulate \
    --m-input-file $data_dir/rep-seqs-filtered.qza \
    --o-visualization $data_dir/test.qzv

[32mSaved Visualization to: poop_data/test.qzv[0m
[0m

In [3]:
#vis of filtered sequences. don't have blast links, not the same table
Visualization.load(f'{data_dir}/test.qzv')

In [4]:
#why does using the sequences befor filtering yield a completely dif table?
Visualization.load(f'{data_dir}/dada2_rep_set.qzv')

In [25]:
! qiime metadata tabulate \
    --m-input-file $'poop_data'/dada2_rep_set.qza \
    --o-visualization $data_dir/test2.qzv

[32mSaved Visualization to: poop_data/Taxonomy/test2.qzv[0m
[0m

In [26]:
Visualization.load(f'{data_dir}/test2.qzv')

In [17]:
! qiime taxa barplot \
    --i-table $data_dir/table-filtered.qza \
    --i-taxonomy $data_dir/taxonomy.qza \
    --m-metadata-file $data_dir/metadata.tsv \
    --o-visualization $data_dir/table-filtered.qzv

[32mSaved Visualization to: poop_data/table-filtered.qzv[0m
[0m

In [8]:
Visualization.load(f'{data_dir}/table-filtered.qzv')

**putting it in Panda**

In [21]:
pd.set_option('max_colwidth', 150)

In [19]:
# note: QIIME 2 artifact files can be loaded as python objects! This is how.
taxa = q2.Artifact.load(f'{data_dir}/taxonomy.qza')
# view as a `pandas.DataFrame`. Note: Only some Artifact types can be transformed to DataFrames
taxa = taxa.view(pd.DataFrame)

In [22]:
taxa.head()

Unnamed: 0_level_0,Taxon,Confidence
Feature ID,Unnamed: 1_level_1,Unnamed: 2_level_1
722e762a907f370c61fcc4ab5cc1578a,k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacteriales; f__Enterobacteriaceae,0.9999978196043772
482906834375950714dba091f15bd1b8,k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__,0.9980128570362364
5d6693622c5103b6bf234ba83b4b2440,k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Ruminococcaceae; g__Faecalibacterium; s__prausnitzii,0.9916722342146834
e3744bbda8b2e44422065cb607997451,k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae; g__Roseburia; s__faecis,0.9198694779450922
e0fabeb6364beec636c116ed5eb93fbc,k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__,0.9923577523873522


need to do sth to produce change and stage