Taxonomy
---
In this file we will assign taxonomy labels to the ASVs obtained from denoising. To do that we will use the `classify-sklearn` action from the `feature-classifier` plugin, for which we fetch:
- a pre-trained classifier (from Greengenes, as the SILVA one needs too much RAM)
- the sequences which we want to have classified (from the denoising step: dada2_rep_set.qza)

**Note:** As we don't know which primers were used, we strictly speaking cannot just take a pre-trained classifier, because such a classifier corresponds to a specific primer pair/rRNA region. But we do know that the sequences (101bp) originate from the V4 region (300bp) of the 16S rRNA gene. As the most common primer pair (515f/806r) fully covers the V4 region, we think it should be possible to use a classifier that was trained on 515f/806r sequences. Why not use a full-length classifier? It was reported that species-level classification of simulated 16S rRNA gene reads was slightly less accurate with full-length sequences than with the V1–3 and V4 subdomains (from https://doi.org/10.1186/s40168-018-0470-z).

In [1]:
import os
import pandas as pd
from qiime2 import Visualization
import matplotlib.pyplot as plt
import numpy as np

import qiime2 as q2

%matplotlib inline
data_dir = 'CE'

In [2]:
#fetch the pre-trained classifier 
! wget -nv -O $data_dir/515f-806r-classifier.qza https://data.qiime2.org/2021.4/common/gg-13-8-99-515-806-nb-classifier.qza

2022-11-02 10:06:54 URL:https://s3-us-west-2.amazonaws.com/qiime2-data/2021.4/common/gg-13-8-99-515-806-nb-classifier.qza [28289645/28289645] -> "CE/515f-806r-classifier.qza" [1]
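
Since a truncated download would only surface later as a confusing classifier error, it can help to check the file size first. The following is a minimal sketch, assuming the byte count printed by `wget` in the log above (28289645) is the expected size; the helper name `download_is_complete` is hypothetical and not part of QIIME 2.

```python
from pathlib import Path


def download_is_complete(path, expected_bytes):
    """Return True if `path` is an existing file of exactly `expected_bytes` bytes."""
    p = Path(path)
    return p.is_file() and p.stat().st_size == expected_bytes


# Size reported by wget in the log above (assumed to be the full file).
EXPECTED_CLASSIFIER_BYTES = 28289645

# Example call, commented out so the sketch also runs without the file:
# download_is_complete('CE/515f-806r-classifier.qza', EXPECTED_CLASSIFIER_BYTES)
```

A size check only catches truncated files; it does not verify the content the way a checksum would.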


In [3]:
! qiime feature-classifier classify-sklearn \
    --i-classifier $data_dir/515f-806r-classifier.qza \
    --i-reads $data_dir/dada2_rep_set.qza \
    --o-classification $data_dir/taxonomy_v4.qza

Saved FeatureData[Taxonomy] to: CE/taxonomy_v4.qza

This should have created a new `taxonomy_v4.qza` artifact (containing the taxonomic assignment for each feature).

In [4]:
! qiime tools peek $data_dir/taxonomy_v4.qza

UUID:        681c3a56-4a2a-4db6-8bd9-ad30c3a5b7bb
Type:        FeatureData[Taxonomy]
Data format: TSVTaxonomyDirectoryFormat


Visualizations
---
We can make a tabular representation of all the features labeled with their corresponding taxonomy:

In [5]:
! qiime metadata tabulate \
    --m-input-file $data_dir/taxonomy_v4.qza \
    --o-visualization $data_dir/taxonomy_v4.qzv

Saved Visualization to: CE/taxonomy_v4.qzv

In [15]:
Visualization.load(f'{data_dir}/taxonomy_v4.qzv')
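
Each row of the table above pairs a feature with a Greengenes-style label string such as `k__Bacteria; p__Firmicutes; ...`. The sketch below shows one way to split such strings into named ranks and to find the deepest rank that was actually assigned. It assumes a tab-separated layout with the columns `Feature ID`, `Taxon` and `Confidence`; the two rows in `EXAMPLE_TSV` are made up for illustration and are not results from our data.

```python
import csv
import io

# Greengenes rank prefixes mapped to readable rank names.
RANK_PREFIXES = {
    'k': 'kingdom', 'p': 'phylum', 'c': 'class', 'o': 'order',
    'f': 'family', 'g': 'genus', 's': 'species',
}


def parse_taxon(taxon):
    """Split 'k__A; p__B; ...' into {'kingdom': 'A', 'phylum': 'B', ...}.

    Ranks with an empty name (e.g. 's__') are treated as unassigned and skipped.
    """
    ranks = {}
    for part in taxon.split(';'):
        prefix, sep, name = part.strip().partition('__')
        if sep and prefix in RANK_PREFIXES and name:
            ranks[RANK_PREFIXES[prefix]] = name
    return ranks


def deepest_rank(taxon):
    """Return the name of the most specific assigned rank, or None."""
    ranks = parse_taxon(taxon)
    return list(ranks)[-1] if ranks else None


# Hypothetical example rows (not taken from our results).
EXAMPLE_TSV = (
    "Feature ID\tTaxon\tConfidence\n"
    "asv1\tk__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
    "f__Lactobacillaceae; g__Lactobacillus; s__\t0.98\n"
    "asv2\tk__Bacteria; p__Proteobacteria\t0.75\n"
)

rows = list(csv.DictReader(io.StringIO(EXAMPLE_TSV), delimiter='\t'))
depths = {row['Feature ID']: deepest_rank(row['Taxon']) for row in rows}
```

Counting how many features reach genus or species level in this way gives a quick impression of how far the classifier could resolve our short reads.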



By combining the taxonomic information with the metadata of our samples, we can also get a hint of how the species are distributed across the different samples!
**Note:** dada2_table is a feature table from the denoising step which includes all ASVs (with error-corrected sequences)

In [7]:
#filter the feature table by excluding samples not present in metadata
! qiime feature-table filter-samples \
    --i-table $data_dir/dada2_table.qza \
    --m-metadata-file $data_dir/food-metadata.tsv \
    --o-filtered-table $data_dir/dada2_table_aligned.qza

Saved FeatureTable[Frequency] to: CE/dada2_table_aligned.qza

In [8]:
! qiime feature-table summarize \
    --i-table $data_dir/dada2_table_aligned.qza \
    --m-sample-metadata-file $data_dir/food-metadata.tsv \
    --o-visualization $data_dir/dada2_table_aligned.qzv

Saved Visualization to: CE/dada2_table_aligned.qzv

In [9]:
Visualization.load(f'{data_dir}/dada2_table_aligned.qzv')
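
The sample filtering above boils down to keeping only those samples of the feature table whose ID also appears in the metadata file. A minimal sketch of that idea in plain Python follows; the table is modelled as a nested dict, all sample and feature IDs are hypothetical toy values, and other options of `filter-samples` are ignored.

```python
def filter_samples(table, metadata_ids):
    """Keep only samples (outer keys of `table`) whose ID appears in the metadata."""
    keep = set(metadata_ids)
    return {sample: counts for sample, counts in table.items() if sample in keep}


# Hypothetical toy data: sample ID -> {feature ID -> count}
table = {
    'S1': {'asv1': 10, 'asv2': 0},
    'S2': {'asv1': 3, 'asv2': 7},
    'S3': {'asv1': 1, 'asv2': 1},
}
metadata_ids = ['S1', 'S3', 'S4']  # S4 has metadata but no sequencing data

filtered = filter_samples(table, metadata_ids)
dropped = sorted(set(table) - set(filtered))  # samples without metadata
```

Listing the dropped sample IDs is a useful sanity check: a long list usually points to mismatched IDs between the table and the metadata rather than to samples that are really missing.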

In [10]:
! qiime taxa barplot \
    --i-table $data_dir/dada2_table_aligned.qza \
    --i-taxonomy $data_dir/taxonomy_v4.qza \
    --m-metadata-file $data_dir/food-metadata.tsv \
    --o-visualization $data_dir/taxa-bar-plots_v4.qzv

Saved Visualization to: CE/taxa-bar-plots_v4.qzv

In [11]:
Visualization.load(f'{data_dir}/taxa-bar-plots_v4.qzv')
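
Conceptually, each bar in the plot above is obtained by summing the counts of all features that share the same taxonomy down to the chosen level, then converting the sums to relative abundances. The sketch below illustrates this for a single sample. It is a simplification under stated assumptions: the counts and labels are hypothetical, and unlike the real plugin it does not pad labels that are shallower than the requested level.

```python
from collections import defaultdict


def collapse_to_level(counts, taxonomy, level):
    """Sum feature counts that share the same taxonomy prefix up to `level` ranks."""
    collapsed = defaultdict(int)
    for feature, count in counts.items():
        ranks = [r.strip() for r in taxonomy[feature].split(';')]
        collapsed[';'.join(ranks[:level])] += count
    return dict(collapsed)


def relative_abundance(counts):
    """Convert absolute counts to fractions of the sample total."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


# Hypothetical counts for one sample and hypothetical labels.
sample_counts = {'asv1': 6, 'asv2': 2, 'asv3': 2}
taxonomy = {
    'asv1': 'k__Bacteria; p__Firmicutes; c__Bacilli',
    'asv2': 'k__Bacteria; p__Firmicutes; c__Clostridia',
    'asv3': 'k__Bacteria; p__Proteobacteria',
}

phylum_counts = collapse_to_level(sample_counts, taxonomy, level=2)
phylum_fractions = relative_abundance(phylum_counts)
```

Here the two Firmicutes features are merged at level 2 (phylum), so the sample is split between Firmicutes and Proteobacteria.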

To investigate how our taxonomic assignments compare to e.g. BLAST, we can open the tabular view of our ASVs and look up some features manually (click on a sequence).

**de novo tree**

In [12]:
#Sequence alignment
! qiime alignment mafft \
    --i-sequences $data_dir/rep-seqs-filtered.qza \
    --o-alignment $data_dir/aligned-rep-seqs.qza

In [None]:
#Alignment masking
! qiime alignment mask \
    --i-alignment $data_dir/aligned-rep-seqs.qza \
    --o-masked-alignment $data_dir/masked-aligned-rep-seqs.qza

In [None]:
#Tree construction
! qiime phylogeny fasttree \
    --i-alignment $data_dir/masked-aligned-rep-seqs.qza \
    --o-tree $data_dir/fasttree-tree.qza

! qiime phylogeny midpoint-root \
    --i-tree $data_dir/fasttree-tree.qza \
    --o-rooted-tree $data_dir/fasttree-tree-rooted.qza
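
FastTree produces an unrooted tree, and midpoint rooting places the root halfway along the longest tip-to-tip path. The sketch below illustrates that idea on a tiny hand-made tree; it is not the plugin's implementation. The tree is modelled as an undirected adjacency dict with branch lengths, and the node names and lengths are hypothetical.

```python
def farthest(adj, start):
    """Return (distance, path) for the node farthest from `start` in a tree."""
    best = (0.0, [start])
    stack = [(start, None, 0.0, [start])]
    while stack:
        node, parent, dist, path = stack.pop()
        if dist > best[0]:
            best = (dist, path)
        for nxt, length in adj[node].items():
            if nxt != parent:  # never walk back along the edge we came from
                stack.append((nxt, node, dist + length, path + [nxt]))
    return best


def midpoint(adj):
    """Locate the midpoint of the longest tip-to-tip path.

    Returns (u, v, offset): the root lies on edge u-v, `offset` away from u.
    Returns None for a tree without edges.
    """
    any_node = next(iter(adj))
    _, path_a = farthest(adj, any_node)       # one end of the longest path
    total, path = farthest(adj, path_a[-1])   # the longest path itself
    half, walked = total / 2, 0.0
    for u, v in zip(path, path[1:]):
        length = adj[u][v]
        if walked + length >= half:
            return u, v, half - walked
        walked += length
    return None


# Hypothetical unrooted tree: tips A, B, C joined at internal node X.
adj = {
    'A': {'X': 1.0},
    'B': {'X': 3.0},
    'C': {'X': 2.0},
    'X': {'A': 1.0, 'B': 3.0, 'C': 2.0},
}

root_position = midpoint(adj)
```

In this toy tree the longest path runs from B to C (length 5.0), so the root ends up on the B-X branch, 2.5 away from B and therefore also 2.5 away from C.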

In [None]:
#Tree visualization
! qiime empress tree-plot \
    --i-tree $data_dir/fasttree-tree-rooted.qza \
    --m-feature-metadata-file $data_dir/taxonomy.qza \
    --o-visualization $data_dir/fasttree-tree-rooted.qzv

In [None]:
Visualization.load(f'{data_dir}/fasttree-tree-rooted.qzv')

**fragment insertion**

In [None]:
! wget -nv -O $data_dir/sepp-refs-gg-13-8.qza https://data.qiime2.org/2021.4/common/sepp-refs-gg-13-8.qza

In [None]:
! qiime fragment-insertion sepp \
    --i-representative-sequences $data_dir/rep-seqs-filtered.qza \
    --i-reference-database $data_dir/sepp-refs-gg-13-8.qza \
    --p-threads 2 \
    --o-tree $data_dir/sepp-tree.qza \
    --o-placements $data_dir/sepp-tree-placements.qza

In [None]:
! qiime empress tree-plot \
    --i-tree $data_dir/sepp-tree.qza \
    --m-feature-metadata-file $data_dir/taxonomy.qza \
    --o-visualization $data_dir/sepp-tree-placements-tree.qzv

In [None]:
Visualization.load(f'{data_dir}/sepp-tree-placements-tree.qzv')