# Taxonomy & Phylogeny 
---
In this file we assign taxonomy labels to the ASVs obtained from denoising. To do that we use the `classify-sklearn` action from the `feature-classifier` plugin, which needs:
- a pre-trained classifier (from Greengenes, as the SILVA classifier needs too much RAM)
- the sequences we want to classify (from the denoising step: dada2_rep_set.qza)

**Note:** As we don't know which primers were used, we strictly speaking cannot just take a pre-trained classifier, as such classifiers correspond to a certain primer pair/rRNA region. But we do know that the sequences (101bp) originate from the V4 region (300bp) of the 16S rRNA. As the most common primer pair (515f/806r) fully covers the V4 region, we think it should be possible to use a classifier that was trained on 515f/806r sequences. Why not use a full-length classifier? It was reported that species-level classification of simulated 16S rRNA gene reads had slightly lower accuracy with full-length sequences than with the V1–3 and V4 subdomains (from https://doi.org/10.1186/s40168-018-0470-z).

In [None]:
import os
import pandas as pd
from qiime2 import Visualization
import matplotlib.pyplot as plt
import numpy as np

import qiime2 as q2

%matplotlib inline
data_dir = 'CE'

## Taxonomy
So let's start with the taxonomic classification of our ASVs!

In [None]:
#fetch the pre-trained classifier 
! wget -nv -O $data_dir/515f-806r-classifier.qza https://data.qiime2.org/2021.4/common/gg-13-8-99-515-806-nb-classifier.qza

In [None]:
! qiime feature-classifier classify-sklearn \
    --i-classifier $data_dir/515f-806r-classifier.qza \
    --i-reads $data_dir/dada2_rep_set.qza \
    --o-classification $data_dir/taxonomy_v4.qza

This should have created a new taxonomy_v4.qza artifact (containing our taxonomic assignments per feature).

In [None]:
! qiime tools peek $data_dir/taxonomy_v4.qza

### Visualizations

We can make a tabular representation of all the features labeled with their corresponding taxonomy:

In [None]:
! qiime metadata tabulate \
    --m-input-file $data_dir/taxonomy_v4.qza \
    --o-visualization $data_dir/taxonomy_v4.qzv

In [None]:
Visualization.load(f'{data_dir}/taxonomy_v4.qzv')
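
When browsing the table it helps to know how the Greengenes-style labels are built: each rank carries a one-letter prefix (`k__`, `p__`, ...) and ranks are separated by `;`. The following is only a minimal sketch of how such a label can be split into its ranks. The helper `split_taxonomy` and the example label are hypothetical and not part of QIIME 2 or of our data.

```python
# Rank prefixes used in Greengenes-style taxonomy strings.
RANKS = {"k": "kingdom", "p": "phylum", "c": "class", "o": "order",
         "f": "family", "g": "genus", "s": "species"}


def split_taxonomy(taxon):
    """Split a Greengenes-style label into a {rank: name} dict.

    Ranks with an empty name (e.g. 'g__') and labels without a rank
    prefix (e.g. 'Unassigned') are left out.
    """
    ranks = {}
    for part in taxon.split(";"):
        prefix, sep, name = part.strip().partition("__")
        if sep and prefix in RANKS and name:
            ranks[RANKS[prefix]] = name
    return ranks


# Hypothetical example label, not taken from our data.
example = "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__; g__; s__"
print(split_taxonomy(example))
print(split_taxonomy("Unassigned"))
```

The number of entries in the returned dict shows directly how deep a feature was classified.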

By combining the taxonomic information with the metadata of our samples, we can also get a hint of how the species are distributed across the different samples!

**Note:** dada2_table is the feature table from the denoising step, which includes all ASVs (with error-corrected sequences).

In [None]:
#filter the feature table by excluding samples not present in metadata
! qiime feature-table filter-samples \
    --i-table $data_dir/dada2_table.qza \
    --m-metadata-file $data_dir/food-metadata.tsv \
    --o-filtered-table $data_dir/dada2_table_aligned.qza

In [None]:
! qiime feature-table summarize \
    --i-table $data_dir/dada2_table_aligned.qza \
    --m-sample-metadata-file $data_dir/food-metadata.tsv \
    --o-visualization $data_dir/dada2_table_aligned.qzv

In [None]:
Visualization.load(f'{data_dir}/dada2_table_aligned.qzv')
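
The sample filtering above keeps only those samples whose ID also appears in the metadata file. As a rough illustration, the same idea on a plain dictionary could look like the sketch below. The function `filter_samples`, the sample IDs and the counts are made up for this example and have nothing to do with the actual QIIME 2 implementation.

```python
def filter_samples(table, metadata_ids):
    """Keep only the samples whose id appears in the metadata."""
    wanted = set(metadata_ids)
    return {sample: counts for sample, counts in table.items() if sample in wanted}


# Hypothetical feature table: sample id -> {feature id: count}.
table = {
    "sample_A": {"asv1": 12, "asv2": 3},
    "sample_B": {"asv1": 0, "asv2": 7},
    "sample_X": {"asv1": 5, "asv2": 5},  # not described in the metadata
}
metadata_ids = ["sample_A", "sample_B", "sample_C"]

filtered = filter_samples(table, metadata_ids)
print(sorted(filtered))
```

Note that a metadata entry without a matching sample (here `sample_C`) simply has no effect.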

In [None]:
! qiime taxa barplot \
    --i-table $data_dir/dada2_table_aligned.qza \
    --i-taxonomy $data_dir/taxonomy_v4.qza \
    --m-metadata-file $data_dir/food-metadata.tsv \
    --o-visualization $data_dir/taxa-bar-plots_v4.qzv

In [None]:
Visualization.load(f'{data_dir}/taxa-bar-plots_v4.qzv')
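
The bar plot lets us switch between taxonomic levels. At each level the counts of all ASVs sharing the same label up to that rank are summed. The sketch below shows this grouping for a single sample, assuming labels that are shorter than the requested level are padded with `__` (this is how such features show up in the legend, e.g. `k__Bacteria;__`). The function `collapse_to_level` and all ids and counts are hypothetical.

```python
from collections import defaultdict


def collapse_to_level(counts, taxonomy, level):
    """Sum the ASV counts of one sample by the first `level` taxonomy ranks."""
    collapsed = defaultdict(int)
    for feature_id, count in counts.items():
        ranks = [r.strip() for r in taxonomy[feature_id].split(";")]
        # Pad labels that stop above the requested level with '__'.
        ranks += ["__"] * (level - len(ranks))
        collapsed[";".join(ranks[:level])] += count
    return dict(collapsed)


# Hypothetical ASV ids, labels and counts for a single sample.
taxonomy = {
    "asv1": "k__Bacteria; p__Firmicutes; c__Bacilli",
    "asv2": "k__Bacteria; p__Firmicutes; c__Clostridia",
    "asv3": "k__Bacteria",
}
counts = {"asv1": 10, "asv2": 5, "asv3": 2}

phylum_counts = collapse_to_level(counts, taxonomy, level=2)
print(phylum_counts)
```

At level 2 (phylum) the two Firmicutes ASVs end up in one group, while the ASV that was only classified to kingdom level forms its own `k__Bacteria;__` group.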

To investigate how our taxonomic analysis compares to e.g. BLAST, we can open the tabular view of our ASVs and search for some features manually.

### Filtering the feature table

In [None]:
#filter feature table and exclude mitochondria,chloroplast
! qiime taxa filter-table \
--i-table $data_dir/dada2_table_aligned.qza \
--i-taxonomy $data_dir/taxonomy_v4.qza \
--p-exclude mitochondria,chloroplast \
--o-filtered-table $data_dir/dada2_table_align_filtered.qza

In [None]:
#filter sequences and exclude mitochondria,chloroplast
! qiime taxa filter-seqs \
--i-sequences $data_dir/dada2_rep_set.qza \
--i-taxonomy $data_dir/taxonomy_v4.qza \
--p-exclude mitochondria,chloroplast \
--o-filtered-sequences $data_dir/dada2_rep_set_filtered.qza

In [None]:
#this is the new barplot with the filtered feature table and sequences: NO MITOCHONDRIA AND CHLOROPLAST VISIBLE
! qiime taxa barplot \
--i-table $data_dir/dada2_table_align_filtered.qza \
--i-taxonomy $data_dir/taxonomy_v4.qza \
--m-metadata-file $data_dir/food-metadata.tsv \
--o-visualization $data_dir/taxa-bar-plots_v4_filtered.qzv

In [None]:
Visualization.load(f'{data_dir}/taxa-bar-plots_v4_filtered.qzv')

We are not done with filtering yet: next we filter the feature table and the sequences down to features that are **assigned at least to phylum level**, which also excludes the **unassigned** ones. When exploring the bar plot we discovered some features labeled k__Bacteria;__;__;__;__;__;__ or Unassigned;__;__;__;__;__;__. Do these two kinds of features give us any additional information if we keep them in our data?
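
To make clear why `--p-include p__` removes exactly these features, and why `--p-exclude mitochondria,chloroplast` removed the organelle sequences before, here is a simplified model of the search-term matching. It assumes that a term matches when it is contained anywhere in the taxonomy label and that matching ignores case. This is our understanding of the default behaviour and not a copy of the QIIME 2 code. The function `keep_feature` and the example labels are hypothetical.

```python
def keep_feature(taxon, include=None, exclude=None):
    """Decide whether a feature survives substring-based taxonomy filtering.

    include / exclude are comma-separated search terms, mirroring
    --p-include / --p-exclude. A term matches if it is a substring of the
    label (assumption: case-insensitive).
    """
    label = taxon.lower()
    if include is not None:
        if not any(term.lower() in label for term in include.split(",")):
            return False
    if exclude is not None:
        if any(term.lower() in label for term in exclude.split(",")):
            return False
    return True


# Hypothetical labels in the Greengenes style.
labels = [
    "k__Bacteria; p__Firmicutes; c__Bacilli",
    "k__Bacteria; p__Cyanobacteria; c__Chloroplast",
    "k__Bacteria",
    "Unassigned",
]

for label in labels:
    no_organelles = keep_feature(label, exclude="mitochondria,chloroplast")
    has_phylum = keep_feature(label, include="p__")
    print(f"{label!r}: no_organelles={no_organelles}, has_phylum={has_phylum}")
```

Labels that stop at kingdom level or are `Unassigned` never contain `p__`, so they do not pass the include filter.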

In [None]:
! qiime taxa filter-table \
--i-table $data_dir/dada2_table_align_filtered.qza \
--i-taxonomy $data_dir/taxonomy_v4.qza \
--p-include p__ \
--o-filtered-table $data_dir/dada2_table_align_filtered2.qza

In [None]:
#filter sequences and exclude the unassigned
#edit from future milena: we did not continue with this qza, as we decided to keep the unassigned sequences because they may be the key to the differences between the microbiomes
! qiime taxa filter-seqs \
--i-sequences $data_dir/dada2_rep_set.qza \
--i-taxonomy $data_dir/taxonomy_v4.qza \
--p-include p__ \
--o-filtered-sequences $data_dir/dada2_rep_set_filtered2.qza

In [None]:
#this is the new barplot with the filtered Feature table and sequences ONLY TO PHYLUM LEVEL
! qiime taxa barplot \
--i-table $data_dir/dada2_table_align_filtered2.qza \
--i-taxonomy $data_dir/taxonomy_v4.qza \
--m-metadata-file $data_dir/food-metadata.tsv \
--o-visualization $data_dir/taxa-bar-plots_v4_filtered2.qzv

In [None]:
Visualization.load(f'{data_dir}/taxa-bar-plots_v4_filtered2.qzv')

## Phylogeny
We will use the two main phylogeny reconstruction approaches:

1. de novo reconstruction
2. reference-based fragment insertion


### 1. de novo reconstruction

In [None]:
#Sequence alignment
! qiime alignment mafft \
     --i-sequences $data_dir/dada2_rep_set_filtered.qza \
     --o-alignment $data_dir/aligned-rep-seqs.qza

In [None]:
#Alignment masking
#removing regions that are phylogenetically uninformative, e.g. due to alignment errors
! qiime alignment mask \
    --i-alignment $data_dir/aligned-rep-seqs.qza \
    --o-masked-alignment $data_dir/masked-aligned-rep-seqs.qza
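
To get a feeling for what masking does, the sketch below removes alignment columns that consist mostly of gaps. It is only an illustration: the function `mask_alignment`, the toy alignment and the threshold of 0.5 are made up, and the real `qiime alignment mask` additionally filters columns by their conservation, which is left out here.

```python
def mask_alignment(aligned, max_gap_frequency):
    """Drop alignment columns whose gap fraction exceeds the threshold."""
    n_seqs = len(aligned)
    length = len(next(iter(aligned.values())))
    keep = []
    for col in range(length):
        gaps = sum(1 for seq in aligned.values() if seq[col] == "-")
        if gaps / n_seqs <= max_gap_frequency:
            keep.append(col)
    # Rebuild every sequence from the retained columns only.
    return {name: "".join(seq[c] for c in keep) for name, seq in aligned.items()}


# Hypothetical toy alignment (all sequences have the same length).
aligned = {
    "asv1": "ACG-TA",
    "asv2": "ACGATA",
    "asv3": "AC--TA",
}

masked = mask_alignment(aligned, max_gap_frequency=0.5)
for name, seq in masked.items():
    print(name, seq)
```

In this toy alignment only the fourth column is dropped, because two of its three positions are gaps.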

In [None]:
#Tree construction
! qiime phylogeny fasttree \
    --i-alignment $data_dir/masked-aligned-rep-seqs.qza \
    --o-tree $data_dir/fasttree-tree.qza

! qiime phylogeny midpoint-root \
    --i-tree $data_dir/fasttree-tree.qza \
    --o-rooted-tree $data_dir/fasttree-tree-rooted.qza

In [None]:
#Tree visualization
! qiime empress tree-plot \
     --i-tree $data_dir/fasttree-tree-rooted.qza \
     --m-feature-metadata-file $data_dir/taxonomy_v4.qza \
     --o-visualization $data_dir/fasttree-tree-rooted.qzv

In [None]:
Visualization.load(f'{data_dir}/fasttree-tree-rooted.qzv')

### 2. reference-based fragment insertion
Now we use a tree that was already constructed and only insert our sequences into the existing tree.

In [None]:
# fetch the tree that was built from the Greengenes 13_8 database at 99% identity
! wget -nv -O $data_dir/sepp-refs-gg-13-8.qza https://data.qiime2.org/2021.4/common/sepp-refs-gg-13-8.qza

In [None]:
#insert our sequences
#note: this is an already rooted tree
! qiime fragment-insertion sepp \
    --i-representative-sequences $data_dir/dada2_rep_set_filtered.qza \
    --i-reference-database $data_dir/sepp-refs-gg-13-8.qza \
    --p-threads 2 \
    --o-tree $data_dir/sepp-tree.qza \
    --o-placements $data_dir/sepp-tree-placements.qza

In [None]:
#tree visualization
! qiime empress tree-plot \
      --i-tree $data_dir/sepp-tree.qza \
      --m-feature-metadata-file $data_dir/taxonomy_v4.qza \
      --o-visualization $data_dir/sepp-tree-placements-tree.qzv

In [None]:
Visualization.load(f'{data_dir}/sepp-tree-placements-tree.qzv')