# Taxonomy classification

**Overview:**<br>
[1. Setup](#setup)<br>
[2. Taxonomy assignment](#tax_assignment_main)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[2.1 Taxonomy classifier](#ref_db)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[2.2 Taxonomy assignment](#tax_assignment)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[2.3 Taxonomy visualization](#tax_visualization)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[2.4 Evaluation classifier](#evaluation_classifier)<br>

<a id='setup'></a>

## 1. Setup

The cell below will import all the packages required in the downstream analyses as well as set all the necessary variables and data paths.

In [1]:
import os
import pandas as pd
from qiime2 import Visualization
import matplotlib.pyplot as plt
import numpy as np

import qiime2 as q2

%matplotlib inline

# location of the data and all the results produced by this notebook 
data_dir = 'project_data'

os.makedirs(data_dir, exist_ok=True)

Download the `FeatureData[Sequence]` artifact with the representative sequences of our data, which was created in FirstLook.ipynb:

In [None]:
! wget -nv -O $data_dir/rep-seqs.qza 'https://polybox.ethz.ch/index.php/s/MBLSUQXzglnn66u/download?path=%2F&files=Sequences_rep_set.qza'

Download the `FeatureTable[Frequency]` artifact, which maps the dereplicated sequences to the samples of our data and was also created in FirstLook.ipynb:

In [None]:
! wget -nv -O $data_dir/table.qza 'https://polybox.ethz.ch/index.php/s/MBLSUQXzglnn66u/download?path=%2F&files=Feature_table.qza'

<a id='tax_assignment_main'></a>

## 2. Taxonomy assignment

To classify the sequences into bacterial species, one option is a BLAST search of the sequences against a database of known sequences. Another method is a machine learning classifier that is trained on a reference database to recognize the bacterial species in the samples.
Unfortunately, the memory available in JupyterLab was not sufficient to train our own classifier.
The approach for training our own classifier can be found in the additional D_Taxonomy Python notebooks.

Here, pre-trained classifiers are downloaded and applied to the project dataset to assign taxonomy to the ASVs.
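To make the machine learning approach more concrete, here is a toy sketch of a multinomial naive Bayes classifier over k-mer counts in plain Python. It only illustrates the idea and is not the q2-feature-classifier implementation; the reference sequences, the taxon labels, the k-mer size, and all function names are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def kmers(seq, k=3):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def train(references, k=3):
    """Count k-mers per taxon from (sequence, taxon) pairs."""
    kmer_counts = defaultdict(Counter)
    for seq, taxon in references:
        kmer_counts[taxon].update(kmers(seq, k))
    vocabulary = {kmer for c in kmer_counts.values() for kmer in c}
    return kmer_counts, vocabulary

def classify(seq, kmer_counts, vocabulary, k=3):
    """Return the taxon with the highest log-likelihood.

    Assumes a uniform prior over taxa and add-one smoothing.
    """
    scores = {}
    for taxon, c in kmer_counts.items():
        total = sum(c.values()) + len(vocabulary)
        scores[taxon] = sum(
            math.log((c[kmer] + 1) / total) for kmer in kmers(seq, k)
        )
    return max(scores, key=scores.get)

# Hypothetical toy reference "database" (not real 16S sequences).
toy_references = [
    ("ACGTACGTACGTACGT", "g__ToyGenusA"),
    ("ACGTACGAACGTACGT", "g__ToyGenusA"),
    ("TTGGCCAATTGGCCAA", "g__ToyGenusB"),
    ("TTGGCCTATTGGCCAA", "g__ToyGenusB"),
]
kmer_counts, vocabulary = train(toy_references)
prediction_a = classify("ACGTACGTAC", kmer_counts, vocabulary)
prediction_b = classify("GGCCAATTGG", kmer_counts, vocabulary)
print(prediction_a, prediction_b)
```

Training such a model on a full reference database means holding k-mer counts for every reference taxon in memory, which gives an intuition for why training was not feasible in our environment.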


<a id='ref_db'></a>

### 2.1 Taxonomy classifier
Since training our own classifier is not possible, we need a classifier that was trained on data very similar to ours. Therefore, a classifier trained on human stool or gut samples is chosen.


In the following cells the classifiers are downloaded.
These are uniform and weighted naive Bayes classifiers trained on Silva 138.1 data for use with the QIIME 2 q2-feature-classifier plugin.

full-length-average-classifier.qza and 515f-806r-average-classifier.qza are classifiers using weights averaged across 14 EMPO 3 habitat types. If in doubt, use one of these.

Original weights derived from Qiita, scripts used to derive them, and additional information available at https://github.com/BenKaehler/readytowear.

Classifiers trained on full-length 16S or 515F/806R region as labelled.

Full-length Silva 138.1 reference sequences and corresponding taxonomies are in ref-seqs.qza and ref-tax.qza.

If you use any of the weighted classifiers, please cite

Kaehler BD, Bokulich NA, McDonald D, Knight R, Caporaso JG, Huttley GA. (2019). Species-level microbial sequence classification is improved by source-environment information. Nature Communications 10: 4643. doi: https://doi.org/10.1038/s41467-019-12669-6

If you use any of the classifiers (weighted or otherwise), please cite

Bokulich, N.A., Kaehler, B.D., Rideout, J.R. et al. (2018). Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2’s q2-feature-classifier plugin. Microbiome 6, 90. doi: https://doi.org/10.1186/s40168-018-0470-z

If you use any file from here, please cite:

Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glöckner FO (2013) The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucl. Acids Res. 41 (D1): D590-D596

Robeson, M. S., O’Rourke, D. R., Kaehler, B. D., Ziemski, M., Dillon, M. R., Foster, J. T., & Bokulich, N. A. (2021). RESCRIPt: Reproducible sequence taxonomy reference database management. PLoS Comp. Bio., 17(11). doi: https://doi.org/10.1371/journal.pcbi.1009581

Warning: Pre-trained classifiers that can be used with q2-feature-classifier currently present a security risk. If using a pre-trained classifier such as the ones provided here, you should trust the person who trained the classifier and the person who provided you with the qza file.

In [2]:
! wget -nv -O $data_dir/515f-806r-human-stool-classifier.qza https://zenodo.org/record/6395539/files/515f-806r-human-stool-classifier.qza?download=1

2022-11-06 12:58:55 URL:https://zenodo.org/record/6395539/files/515f-806r-human-stool-classifier.qza?download=1 [152194741/152194741] -> "project_data/515f-806r-human-stool-classifier.qza" [1]


In [3]:
! wget -nv -O $data_dir/full-length-human-stool-classifier.qza https://zenodo.org/record/6395539/files/full-length-human-stool-classifier.qza?download=1

2022-11-06 12:59:03 URL:https://zenodo.org/record/6395539/files/full-length-human-stool-classifier.qza?download=1 [532863962/532863962] -> "project_data/full-length-human-stool-classifier.qza" [1]


In [4]:
! wget -nv -O $data_dir/515f-806r-average-classifier.qza https://zenodo.org/record/6395539/files/515f-806r-average-classifier.qza?download=1

2022-11-06 12:59:06 URL:https://zenodo.org/record/6395539/files/515f-806r-average-classifier.qza?download=1 [152982184/152982184] -> "project_data/515f-806r-average-classifier.qza" [1]


In [None]:
! wget -nv -O $data_dir/full-length-average-classifier.qza https://zenodo.org/record/6395539/files/full-length-average-classifier.qza?download=1

The reference sequences used for the classifiers are downloaded:

In [5]:
! wget -nv -O $data_dir/ref-seqs.qza https://zenodo.org/record/6395539/files/ref-seqs.qza?download=1

2022-11-06 12:59:09 URL:https://zenodo.org/record/6395539/files/ref-seqs.qza?download=1 [159181191/159181191] -> "project_data/ref-seqs.qza" [1]


The reference taxonomy used for the classifiers is downloaded:

In [6]:
! wget -nv -O $data_dir/ref-tax.qza https://zenodo.org/record/6395539/files/ref-tax.qza?download=1

2022-11-06 12:59:11 URL:https://zenodo.org/record/6395539/files/ref-tax.qza?download=1 [11614482/11614482] -> "project_data/ref-tax.qza" [1]


<a id='tax_assignment'></a>

### 2.2 Taxonomy assignment

After all the preprocessing steps, it is time to assign taxonomy labels to the ASVs from the project data. The `classify-sklearn` action from the `feature-classifier` plugin needs two inputs:
- the classifier which was downloaded
- the sequences to be classified

This step requires the `FeatureData[Sequence]` artifact (containing our ASVs) that was generated beforehand.
Running the following cells requires at least 10 GB of available RAM.

In [8]:
! qiime feature-classifier classify-sklearn \
    --i-classifier $data_dir/515f-806r-human-stool-classifier.qza \
    --i-reads $data_dir/rep-seqs.qza \
    --o-classification $data_dir/515f-806r-human-stool-taxonomy.qza

^C

Aborted!


In [7]:
! qiime feature-classifier classify-sklearn \
    --i-classifier $data_dir/full-length-human-stool-classifier.qza \
    --i-reads $data_dir/rep-seqs.qza \
    --o-classification $data_dir/full-length-human-stool-taxonomy.qza

In [None]:
! qiime feature-classifier classify-sklearn \
    --i-classifier $data_dir/515f-806r-average-classifier.qza \
    --i-reads $data_dir/rep-seqs.qza \
    --o-classification $data_dir/515f-806r-average-taxonomy.qza

In [None]:
! qiime feature-classifier classify-sklearn \
    --i-classifier $data_dir/full-length-average-classifier.qza \
    --i-reads $data_dir/rep-seqs.qza \
    --o-classification $data_dir/full-length-average-taxonomy.qza
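The four cells above differ only in the classifier name. As a sketch, assuming the file naming used in this notebook, the commands could be generated in a loop. `build_classify_command` is a hypothetical helper, and the commands are only printed here, not executed.

```python
import os

data_dir = 'project_data'

# Classifier name stems as downloaded in section 2.1.
classifier_names = [
    '515f-806r-human-stool',
    'full-length-human-stool',
    '515f-806r-average',
    'full-length-average',
]

def build_classify_command(name, data_dir=data_dir):
    """Build the classify-sklearn call for one classifier (hypothetical helper)."""
    return [
        'qiime', 'feature-classifier', 'classify-sklearn',
        '--i-classifier', os.path.join(data_dir, f'{name}-classifier.qza'),
        '--i-reads', os.path.join(data_dir, 'rep-seqs.qza'),
        '--o-classification', os.path.join(data_dir, f'{name}-taxonomy.qza'),
    ]

commands = [build_classify_command(name) for name in classifier_names]
for command in commands:
    print(' '.join(command))
    # To actually run a command: subprocess.run(command, check=True)
```

Running the classifiers one after another this way keeps only one of them in memory at a time, which matters given the RAM requirement mentioned above.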

Each run should create a new `FeatureData[Taxonomy]` artifact containing the taxonomic assignment of every feature.

In [None]:
! qiime tools peek $data_dir/515f-806r-human-stool-taxonomy.qza

! qiime tools peek $data_dir/full-length-human-stool-taxonomy.qza

! qiime tools peek $data_dir/515f-806r-average-taxonomy.qza

! qiime tools peek $data_dir/full-length-average-taxonomy.qza

<a id='tax_visualization'></a>

### 2.3 Taxonomy visualization

In this section, the composition of the project samples is analyzed. First, a tabular representation of all features labeled with their corresponding taxonomy is created:

In [None]:
! qiime metadata tabulate \
    --m-input-file $data_dir/515f-806r-human-stool-taxonomy.qza \
    --o-visualization $data_dir/515f-806r-human-stool-taxonomy.qzv

In [None]:
! qiime metadata tabulate \
    --m-input-file $data_dir/full-length-human-stool-taxonomy.qza \
    --o-visualization $data_dir/full-length-human-stool-taxonomy.qzv

In [None]:
! qiime metadata tabulate \
    --m-input-file $data_dir/515f-806r-average-taxonomy.qza \
    --o-visualization $data_dir/515f-806r-average-taxonomy.qzv

In [None]:
! qiime metadata tabulate \
    --m-input-file $data_dir/full-length-average-taxonomy.qza \
    --o-visualization $data_dir/full-length-average-taxonomy.qzv

The table lists the ID of every ASV together with its taxonomic assignment and the prediction confidence.

In [None]:
Visualization.load(f'{data_dir}/515f-806r-human-stool-taxonomy.qzv')

In [None]:
Visualization.load(f'{data_dir}/full-length-human-stool-taxonomy.qzv')

In [None]:
Visualization.load(f'{data_dir}/515f-806r-average-taxonomy.qzv')

In [None]:
Visualization.load(f'{data_dir}/full-length-average-taxonomy.qzv')
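The same kind of table can also be inspected programmatically. The sketch below assumes a tab-separated export with the columns `Feature ID`, `Taxon`, and `Confidence`; the rows are invented for illustration, and the 0.9 threshold is an arbitrary choice rather than a recommendation.

```python
import csv
import io

# Hypothetical rows in the layout assumed above (not real project data).
toy_tsv = (
    "Feature ID\tTaxon\tConfidence\n"
    "asv1\td__Bacteria; p__Firmicutes; c__Clostridia\t0.99\n"
    "asv2\td__Bacteria; p__Bacteroidota\t0.72\n"
    "asv3\tUnassigned\t0.55\n"
)

def low_confidence_features(handle, threshold=0.9):
    """Return (feature ID, taxon, confidence) for rows below the threshold."""
    reader = csv.DictReader(handle, delimiter='\t')
    return [
        (row['Feature ID'], row['Taxon'], float(row['Confidence']))
        for row in reader
        if float(row['Confidence']) < threshold
    ]

flagged = low_confidence_features(io.StringIO(toy_tsv))
for feature_id, taxon, confidence in flagged:
    print(feature_id, taxon, confidence)
```

Comparing the flagged ASVs between the four classifiers is one way to see where the choice of classifier changes the assignment.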

The taxonomic information per feature can be combined with the sample information to get an idea of how taxa are distributed across the samples. The data can be visualized in a bar plot, in which each bar represents a single sample and is broken down proportionally to the counts of every taxon.

In [None]:
! qiime taxa barplot \
    --i-table $data_dir/table.qza \
    --i-taxonomy $data_dir/515f-806r-human-stool-taxonomy.qza \
    --m-metadata-file $data_dir/cleand_sample_meta_data.tsv \
    --o-visualization $data_dir/515f-806r-human-stool-barplot.qzv

In [None]:
Visualization.load(f'{data_dir}/515f-806r-human-stool-barplot.qzv')

In [None]:
! qiime taxa barplot \
    --i-table $data_dir/table.qza \
    --i-taxonomy $data_dir/515f-806r-average-taxonomy.qza \
    --m-metadata-file $data_dir/cleand_sample_meta_data.tsv \
    --o-visualization $data_dir/515f-806r-average-barplot.qzv

In [None]:
Visualization.load(f'{data_dir}/515f-806r-average-barplot.qzv')

In [None]:
! qiime taxa barplot \
    --i-table $data_dir/table.qza \
    --i-taxonomy $data_dir/full-length-human-stool-taxonomy.qza \
    --m-metadata-file $data_dir/cleand_sample_meta_data.tsv \
    --o-visualization $data_dir/full-length-human-stool-barplot.qzv

In [None]:
Visualization.load(f'{data_dir}/full-length-human-stool-barplot.qzv')

In [None]:
! qiime taxa barplot \
    --i-table $data_dir/table.qza \
    --i-taxonomy $data_dir/full-length-average-taxonomy.qza \
    --m-metadata-file $data_dir/cleand_sample_meta_data.tsv \
    --o-visualization $data_dir/full-length-average-barplot.qzv

In [None]:
Visualization.load(f'{data_dir}/full-length-average-barplot.qzv')
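What each bar encodes can be sketched in a few lines: feature counts are summed per taxon within a sample and divided by the sample total. The counts and taxon labels below are invented, and `relative_abundance` is a hypothetical helper; the visualization performs a comparable collapse for each taxonomic level.

```python
from collections import defaultdict

# Hypothetical feature counts per sample and taxon labels per feature.
toy_counts = {
    'sample1': {'asv1': 30, 'asv2': 10, 'asv3': 60},
    'sample2': {'asv1': 5, 'asv2': 15, 'asv3': 0},
}
toy_taxonomy = {
    'asv1': 'p__Firmicutes',
    'asv2': 'p__Bacteroidota',
    'asv3': 'p__Firmicutes',
}

def relative_abundance(sample_counts, taxonomy):
    """Sum counts per taxon and convert them to fractions of the sample total.

    Assumes the sample has at least one non-zero count.
    """
    per_taxon = defaultdict(int)
    for feature_id, count in sample_counts.items():
        per_taxon[taxonomy[feature_id]] += count
    total = sum(per_taxon.values())
    return {taxon: count / total for taxon, count in per_taxon.items()}

proportions = {
    sample: relative_abundance(sample_counts, toy_taxonomy)
    for sample, sample_counts in toy_counts.items()
}
for sample, fractions in proportions.items():
    print(sample, fractions)
```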

Some of the taxonomic assignments are compared against BLAST for validation. The visualization of the generated ASVs provides a BLAST link for every sequence.

In [None]:
Visualization.load(f'{data_dir}/rep-seqs.qzv')

Does BLAST report the same taxonomies as q2-feature-classifier?

Mitochondrial (and chloroplast) sequences may have to be filtered out of the feature table and the sequences. For this, the `filter-table` and `filter-seqs` actions from the `taxa` plugin are used. To exclude features whose taxonomy matches certain search terms, we can use the `--p-exclude` parameter as follows:

In [None]:
#! qiime taxa filter-table \
#    --i-table $data_dir/table.qza \
#    --i-taxonomy $data_dir/silva-taxonomy.qza \
#    --p-exclude mitochondria,chloroplast \
#    --o-filtered-table $data_dir/table-filtered.qza

#! qiime taxa filter-seqs \
#    --i-sequences $data_dir/rep-seqs.qza \
#    --i-taxonomy $data_dir/silva-taxonomy.qza \
#    --p-exclude mitochondria,chloroplast \
#    --o-filtered-sequences $data_dir/rep-seqs-filtered.qza
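The effect of the exclusion can be pictured as a substring filter over the taxonomy strings. The sketch below assumes case-insensitive "contains" matching, which is our understanding of the default behaviour and should be checked against the plugin documentation; the feature IDs, the taxon strings, and `exclude_features` are invented for illustration.

```python
def exclude_features(taxonomy, terms):
    """Keep features whose taxon string contains none of the search terms."""
    lowered = [term.lower() for term in terms]
    return {
        feature_id: taxon
        for feature_id, taxon in taxonomy.items()
        if not any(term in taxon.lower() for term in lowered)
    }

# Hypothetical taxonomy assignments.
toy_assignments = {
    'asv1': 'd__Bacteria; p__Firmicutes; c__Clostridia',
    'asv2': 'd__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; '
            'o__Rickettsiales; f__Mitochondria',
    'asv3': 'd__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Chloroplast',
}

kept = exclude_features(toy_assignments, ['mitochondria', 'chloroplast'])
print(sorted(kept))
```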

The taxa bar plot is regenerated from the filtered sequences and compared to the previous visualization to check whether the distribution at the different taxonomic levels has changed between samples.

In [None]:
#! qiime feature-classifier classify-sklearn \
#    --i-classifier $data_dir/515f-806r-classifier.qza \
#    --i-reads $data_dir/rep-seqs-filtered.qza \
#    --o-classification $data_dir/silva-taxonomy-filtered.qza

In [None]:
#! qiime metadata tabulate \
#    --m-input-file $data_dir/silva-taxonomy-filtered.qza \
#    --o-visualization $data_dir/silva-taxonomy-filtered.qzv

In [None]:
#Visualization.load(f'{data_dir}/silva-taxonomy-filtered.qzv')

In [None]:
#! qiime taxa barplot \
#    --i-table $data_dir/table-filtered.qza \
#    --i-taxonomy $data_dir/silva-taxonomy-filtered.qza \
#    --m-metadata-file $data_dir/cleaned_sample_metadata.tsv \
#    --o-visualization $data_dir/taxa-bar-plots-filtered.qzv

In [None]:
#Visualization.load(f'{data_dir}/taxa-bar-plots-filtered.qzv')

Not all ASVs may be annotated at species level. Some annotations are cut off at a higher rank because short sequences provide insufficient taxonomic resolution (i.e., they match multiple clades).

This still needs to be checked for our data.
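One way to check this is to count, for every ASV, the deepest rank that received an annotation. The sketch assumes Silva-style taxon strings with rank prefixes (`d__` to `s__`) separated by semicolons; the example assignments and the helper `deepest_rank` are invented for illustration.

```python
from collections import Counter

# Rank prefixes as assumed for Silva-style taxon strings.
RANKS = {'d': 'domain', 'p': 'phylum', 'c': 'class', 'o': 'order',
         'f': 'family', 'g': 'genus', 's': 'species'}

def deepest_rank(taxon):
    """Return the name of the last annotated rank, or 'unassigned'."""
    deepest = 'unassigned'
    for part in taxon.split(';'):
        prefix, sep, name = part.strip().partition('__')
        if sep and name and prefix in RANKS:
            deepest = RANKS[prefix]
    return deepest

# Hypothetical assignments.
toy_taxa = {
    'asv1': 'd__Bacteria; p__Firmicutes; c__Clostridia; o__Oscillospirales; '
            'f__Ruminococcaceae; g__Faecalibacterium; '
            's__Faecalibacterium_prausnitzii',
    'asv2': 'd__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; '
            'f__Bacteroidaceae; g__Bacteroides',
    'asv3': 'Unassigned',
}

depth_counts = Counter(deepest_rank(taxon) for taxon in toy_taxa.values())
print(depth_counts)
```

Applied to the exported taxonomy of each classifier, such a count would show how many ASVs stop at genus level or above.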

<a id='evaluation_classifier'></a>


### 2.4 Evaluation classifier

In [None]:
! qiime rescript evaluate-classifications \
    --i-expected-taxonomies $data_dir/ref-tax.qza \
    --i-observed-taxonomies $data_dir/515f-806r-human-stool-taxonomy.qza \
    --o-evaluation $data_dir/515f-806r-human-stool-classifier-evaluation.qzv