# Taxonomy classification

**Overview:**<br>
[1. Setup](#setup)<br>
[2. Taxonomy assignment](#tax_assignment_main)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[2.1 Reference database construction](#ref_db)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[2.2 Training taxonomy classifier](#train_classifier)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[2.3 Taxonomy assignment](#tax_assignment)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[2.4 Taxonomy visualization](#tax_visualization)<br>

<a id='setup'></a>

## 1. Setup

The cell below will import all the packages required in the downstream analyses as well as set all the necessary variables and data paths.

In [1]:
import os
import pandas as pd
from qiime2 import Visualization
import matplotlib.pyplot as plt
import numpy as np

import qiime2 as q2

%matplotlib inline

# location of the data and all the results produced by this notebook
data_dir = 'project_data'

# create the data directory if it does not exist yet
os.makedirs(data_dir, exist_ok=True)

Download the `FeatureData[Sequence]` artifact containing the representative sequences of our data, which was created in FirstLook.ipynb:

In [3]:
! wget -nv -O $data_dir/rep-seqs.qza 'https://polybox.ethz.ch/index.php/s/MBLSUQXzglnn66u/download?path=%2F&files=Sequences_rep_set.qza'

2022-11-03 11:47:11 URL:https://polybox.ethz.ch/index.php/s/MBLSUQXzglnn66u/download?path=%2F&files=Sequences_rep_set.qza [390624/390624] -> "project_data/rep-seqs.qza" [1]


Download the `FeatureTable[Frequency]` artifact, which maps the dereplicated sequences to the samples of our data and was also created in FirstLook.ipynb:



In [4]:
! wget -nv -O $data_dir/table.qza 'https://polybox.ethz.ch/index.php/s/MBLSUQXzglnn66u/download?path=%2F&files=Feature_table.qza'

2022-11-03 11:47:13 URL:https://polybox.ethz.ch/index.php/s/MBLSUQXzglnn66u/download?path=%2F&files=Feature_table.qza [504534/504534] -> "project_data/table.qza" [1]



<a id='tax_assignment_main'></a>

## 2. Taxonomy assignment

To classify the sequences into bacterial species, one option is a BLAST search of the sequences against a database of known sequences. The method used here instead is a machine learning classifier, which is trained on a reference database to recognize the bacterial species in the samples.


<a id='ref_db'></a>

### 2.1 Reference database construction

The SSU SILVA database (version 138) is downloaded using the `RESCRIPt` plugin, and lower-quality sequences are subsequently removed from it. The region of interest lies within the 16S rRNA gene, so it is extracted based on the primers used to amplify the NGS library.

In a further step, [a custom database from NCBI sequences using a custom entrez query](https://forum.qiime2.org/t/using-rescript-to-compile-sequence-databases-and-taxonomy-classifiers-from-ncbi-genbank/15947) is built and cleaned of lower-quality sequences. This is done in Taxonomy_NCBI.ipynb.

**Citation:** Michael S Robeson II, Devon R O'Rourke, Benjamin D Kaehler, Michal Ziemski, Matthew R Dillon, Jeffrey T Foster, Nicholas A Bokulich. 2021. "RESCRIPt: Reproducible sequence taxonomy reference database management". PLoS Computational Biology 17 (11): e1009581. [doi: 10.1371/journal.pcbi.1009581](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009581)

#### 2.1.1 Data download

The sequences and corresponding taxonomies of the SSU SILVA database are downloaded and saved into their respective QIIME 2 artifacts.

In [2]:
! qiime rescript get-silva-data \
    --p-version '138' \
    --p-target 'SSURef_NR99' \
    --p-include-species-labels \
    --o-silva-sequences $data_dir/silva-138-ssu-nr99-seqs.qza \
    --o-silva-taxonomy $data_dir/silva-138-ssu-nr99-tax.qza

Saved FeatureData[RNASequence] to: project_data/silva-138-ssu-nr99-seqs.qza
Saved FeatureData[Taxonomy] to: project_data/silva-138-ssu-nr99-tax.qza

#### 2.1.2 Database curation

To clean up the database, sequences that contain 5 or more ambiguous bases or a homopolymer of at least 8 bases are removed from the SILVA database.
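
The two criteria can be illustrated with a minimal sketch in plain Python. It is not the `RESCRIPt` implementation; the function name `keep_sequence` and the toy sequences are assumptions made for illustration, and only the two thresholds are taken from the command below.

```python
import re

# IUPAC ambiguity codes (every nucleotide symbol except A, C, G, T/U)
DEGENERATE_BASES = set("RYSWKMBDHVN")


def keep_sequence(seq, num_degenerates=5, homopolymer_length=8):
    """Hypothetical helper: True if `seq` passes both quality criteria."""
    seq = seq.upper()
    # criterion 1: fewer than `num_degenerates` ambiguous bases
    n_degenerate = sum(base in DEGENERATE_BASES for base in seq)
    if n_degenerate >= num_degenerates:
        return False
    # criterion 2: no run of one identical base that is
    # `homopolymer_length` bases or longer
    pattern = r"([ACGTU])\1{%d,}" % (homopolymer_length - 1)
    return re.search(pattern, seq) is None


# toy sequences, invented for illustration
examples = {
    "clean": "ACGTACGTACGTACGT",
    "ambiguous": "ACGT" + "N" * 5 + "ACGT",
    "homopolymer": "ACGT" + "A" * 8 + "CGT",
}
results = {name: keep_sequence(seq) for name, seq in examples.items()}
print(results)
```

Only the first toy sequence is kept, since the other two each violate one of the criteria.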

In [2]:
! qiime rescript cull-seqs \
     --i-sequences $data_dir/silva-138-ssu-nr99-seqs.qza \
     --p-num-degenerates 5 \
     --p-homopolymer-length 8 \
     --p-n-jobs 3 \
     --o-clean-sequences $data_dir/silva-138-ssu-nr99-seqs-cleaned.qza

Saved FeatureData[Sequence] to: project_data/silva-138-ssu-nr99-seqs-cleaned.qza

Sequences shorter than a certain threshold are removed from the database as well. The threshold depends on whether the sequence belongs to Archaea (900 bp), Bacteria (1200 bp) or Eukaryota (1400 bp).
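
As a rough illustration of this taxon-dependent length filter, the sketch below applies the three thresholds to a few invented records. The helper `passes_length_filter`, the record IDs and the taxonomy strings are hypothetical. Keeping sequences that match none of the labels is an assumption of this sketch, not a statement about how `RESCRIPt` behaves.

```python
# minimum lengths per label, mirroring --p-labels and --p-min-lens
MIN_LENGTHS = {"Archaea": 900, "Bacteria": 1200, "Eukaryota": 1400}


def passes_length_filter(taxonomy, seq_length):
    """Hypothetical helper: apply the threshold of the first matching label."""
    for label, min_length in MIN_LENGTHS.items():
        if label in taxonomy:
            return seq_length >= min_length
    # assumption of this sketch: sequences matching no label are kept
    return True


# toy records (id, taxonomy, length), invented for illustration
records = [
    ("ref1", "d__Archaea; p__Euryarchaeota", 950),
    ("ref2", "d__Bacteria; p__Firmicutes", 1100),
    ("ref3", "d__Eukaryota; p__Ascomycota", 1400),
]
kept = [rec_id for rec_id, tax, length in records
        if passes_length_filter(tax, length)]
print(kept)
```

Here the bacterial record is discarded because it is shorter than 1200 bp, while the archaeal and eukaryotic records pass their respective thresholds.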

In [3]:
# SILVA
! qiime rescript filter-seqs-length-by-taxon \
    --i-sequences $data_dir/silva-138-ssu-nr99-seqs-cleaned.qza \
    --i-taxonomy $data_dir/silva-138-ssu-nr99-tax.qza \
    --p-labels Archaea Bacteria Eukaryota \
    --p-min-lens 900 1200 1400 \
    --o-filtered-seqs $data_dir/silva-138-ssu-nr99-seqs-filt.qza \
    --o-discarded-seqs $data_dir/silva-138-ssu-nr99-seqs-discard.qza

Saved FeatureData[Sequence] to: project_data/silva-138-ssu-nr99-seqs-filt.qza
Saved FeatureData[Sequence] to: project_data/silva-138-ssu-nr99-seqs-discard.qza


To remove identical sequences that have the same taxonomy, the database is dereplicated in `uniq` mode. Identical sequences that refer to different taxonomies are not discarded in this mode.
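
The idea behind this mode can be sketched as follows. This is a simplified stand-in, not the plugin's code: the function `dereplicate_uniq` and the toy records are assumptions, and the sketch simply keeps the first record of every unique (sequence, taxonomy) pair.

```python
def dereplicate_uniq(records):
    """Hypothetical helper: keep one record per unique (sequence, taxonomy) pair.

    `records` is an iterable of (seq_id, sequence, taxonomy) tuples.
    """
    seen = set()
    kept = []
    for seq_id, sequence, taxonomy in records:
        key = (sequence, taxonomy)
        if key not in seen:
            seen.add(key)
            kept.append((seq_id, sequence, taxonomy))
    return kept


# toy records, invented for illustration
records = [
    ("ref1", "ACGTACGT", "g__Bacteroides"),
    ("ref2", "ACGTACGT", "g__Bacteroides"),   # same sequence, same taxonomy
    ("ref3", "ACGTACGT", "g__Prevotella"),    # same sequence, other taxonomy
]
kept_ids = [seq_id for seq_id, _, _ in dereplicate_uniq(records)]
print(kept_ids)
```

The second record is dropped as a true duplicate, whereas the third one survives because its taxonomy differs.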

In [4]:
! qiime rescript dereplicate \
    --i-sequences $data_dir/silva-138-ssu-nr99-seqs-filt.qza  \
    --i-taxa $data_dir/silva-138-ssu-nr99-tax.qza \
    --p-rank-handles 'silva' \
    --p-mode 'uniq' \
    --p-threads 3 \
    --o-dereplicated-sequences $data_dir/silva-138-ssu-nr99-seqs-derep-uniq.qza \
    --o-dereplicated-taxa $data_dir/silva-138-ssu-nr99-tax-derep-uniq.qza

Saved FeatureData[Sequence] to: project_data/silva-138-ssu-nr99-seqs-derep-uniq.qza
Saved FeatureData[Taxonomy] to: project_data/silva-138-ssu-nr99-tax-derep-uniq.qza

#### 2.1.3 PCR-region extraction

Classifiers have to be trained on the same type of data as the data analyzed in the project in order to give the most accurate classification results. Therefore, as in class, the region amplified during NGS library creation is extracted from the full-length sequences contained in the database.

The forward and reverse primers are the same as in the experiments from the exercise. They are listed in the metadata of this experiment, which can be viewed with the [SRA Run Selector](https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=ERP021896&o=acc_s%3Aa), where we see the following sequences:

- forward: `GTGCCAGCMGCCGCGGTAA`
- reverse: `GGACTACHVGGGTWTCTAAT`
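
Both primers contain degenerate IUPAC symbols (`M`, `H`, `V`, `W`), so each of them stands for several concrete sequences. The following sketch shows one way to picture the extraction: the primers are translated into regular expressions and the region between them is returned. The helper names, the toy template and its insert are assumptions for illustration. Unlike the real plugin, this sketch requires an exact match to the degenerate pattern and does not tolerate mismatches.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

F_PRIMER = "GTGCCAGCMGCCGCGGTAA"
R_PRIMER = "GGACTACHVGGGTWTCTAAT"


def primer_to_regex(primer):
    """Translate a degenerate primer into a regular expression."""
    return "".join(IUPAC[base] for base in primer)


def reverse_complement(seq):
    """Reverse complement that also handles IUPAC ambiguity codes."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_amplicon(seq, f_primer=F_PRIMER, r_primer=R_PRIMER):
    """Hypothetical helper: region between the primers, or None if not found."""
    pattern = (primer_to_regex(f_primer) + "(.*?)"
               + primer_to_regex(reverse_complement(r_primer)))
    match = re.search(pattern, seq)
    return match.group(1) if match else None


# toy template: flank + forward primer site + insert + reverse primer site + flank
insert = "TACGTAGGGTGCAAGCGTTAATCGG"
template = ("CCCC" + "GTGCCAGCAGCCGCGGTAA" + insert
            + "ATTAGATACCCTGGTAGTCC" + "GGGG")
print(extract_amplicon(template))
```

On the forward strand of a reference sequence the reverse primer appears as its reverse complement, which is why it is converted before the search.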


In [5]:
! qiime feature-classifier extract-reads \
    --i-sequences $data_dir/silva-138-ssu-nr99-seqs-derep-uniq.qza \
    --p-f-primer GTGCCAGCMGCCGCGGTAA \
    --p-r-primer GGACTACHVGGGTWTCTAAT \
    --p-n-jobs 3 \
    --p-read-orientation 'forward' \
    --o-reads $data_dir/silva-138-ssu-nr99-seqs-515f-806r.qza

Saved FeatureData[Sequence] to: project_data/silva-138-ssu-nr99-seqs-515f-806r.qza

The database has to be dereplicated again, because the extracted reads are significantly shorter than the original sequences and may therefore have become identical. The `uniq` mode is chosen again, as some identical sequences are annotated with different taxonomies.

In [6]:
! qiime rescript dereplicate \
    --i-sequences $data_dir/silva-138-ssu-nr99-seqs-515f-806r.qza \
    --i-taxa $data_dir/silva-138-ssu-nr99-tax-derep-uniq.qza \
    --p-rank-handles 'silva' \
    --p-mode 'uniq' \
    --p-threads 3 \
    --o-dereplicated-sequences $data_dir/silva-138-ssu-nr99-seqs-515f-806r-uniq.qza \
    --o-dereplicated-taxa $data_dir/silva-138-ssu-nr99-tax-515f-806r-derep-uniq.qza

Saved FeatureData[Sequence] to: project_data/silva-138-ssu-nr99-seqs-515f-806r-uniq.qza
Saved FeatureData[Taxonomy] to: project_data/silva-138-ssu-nr99-tax-515f-806r-derep-uniq.qza

<a id='train_classifier'></a>

### 2.2 Training taxonomy classifier

The processed database with the extracted, dereplicated reads is used together with the corresponding taxonomies to train the classifier. A Naive Bayes classifier is used, as it has shown very good classification results while not being too computationally expensive.

The classifier is trained on data with known taxonomies so that, in a further step, it can predict the taxonomy of unknown sequences.
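
To give an intuition for what such a classifier does, here is a toy Naive Bayes sketch over k-mer counts in plain Python. It is not the plugin's implementation, which relies on scikit-learn. The function names, the taxon labels, the reference sequences, `k=4` and the smoothing value `alpha=1.0` are all illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def kmers(seq, k=4):
    """Split a sequence into its overlapping k-mers."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train(references, k=4):
    """Count k-mers per taxon; `references` is a list of (sequence, taxon)."""
    counts = defaultdict(Counter)
    for seq, taxon in references:
        counts[taxon].update(kmers(seq, k))
    vocabulary = {kmer for c in counts.values() for kmer in c}
    return counts, vocabulary


def classify(seq, model, k=4, alpha=1.0):
    """Return the taxon with the highest smoothed log-likelihood."""
    counts, vocabulary = model
    scores = {}
    for taxon, kmer_counts in counts.items():
        total = sum(kmer_counts.values()) + alpha * len(vocabulary)
        # uniform prior over taxa; k-mers never seen in training are ignored
        scores[taxon] = sum(
            math.log((kmer_counts[kmer] + alpha) / total)
            for kmer in kmers(seq, k) if kmer in vocabulary
        )
    return max(scores, key=scores.get)


# toy reference data, invented for illustration
references = [
    ("ACGTACGTACGTACGT", "Taxon_A"),
    ("GGGGCCCCGGGGCCCC", "Taxon_B"),
]
model = train(references)
predictions = [classify("ACGTACGTAC", model), classify("GGCCCCGGGG", model)]
print(predictions)
```

Each query is assigned to the taxon whose reference k-mer profile makes the query's k-mers most probable.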

(As only 8 GB of RAM are available in the JupyterLab, it is not possible to train this particular classifier there.)

In [8]:
! qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads $data_dir/silva-138-ssu-nr99-seqs-515f-806r-uniq.qza \
    --i-reference-taxonomy $data_dir/silva-138-ssu-nr99-tax-515f-806r-derep-uniq.qza \
    --o-classifier $data_dir/515f-806r-classifier.qza

In [9]:
! qiime rescript evaluate-fit-classifier \
    --i-sequences $data_dir/silva-138-ssu-nr99-seqs-515f-806r-uniq.qza \
    --i-taxonomy $data_dir/silva-138-ssu-nr99-tax-515f-806r-derep-uniq.qza \
    --o-classifier $data_dir/silva-refseqs-classifier.qza \
    --o-evaluation $data_dir/silva-refseqs-classifier-evaluation.qzv \
    --o-observed-taxonomy $data_dir/silva-taxonomy.qza

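Because the classifier cannot be trained here, the pre-trained classifier that we already used in class is downloaded in the following cell. According to its file name, it was trained on the Greengenes 13_8 database (99% OTUs, 515F/806R region) rather than on the SILVA database prepared above.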

In [11]:
! wget -nv -O $data_dir/515f-806r-classifier.qza https://data.qiime2.org/2021.4/common/gg-13-8-99-515-806-nb-classifier.qza

2022-11-03 09:57:11 URL:https://s3-us-west-2.amazonaws.com/qiime2-data/2021.4/common/gg-13-8-99-515-806-nb-classifier.qza [28289645/28289645] -> "project_data/515f-806r-classifier.qza" [1]


<a id='tax_assignment'></a>

### 2.3 Taxonomy assignment

After all the preprocessing steps, it is time to assign taxonomy labels to the ASVs from the project data. The `classify-sklearn` action from the `feature-classifier` plugin needs two things:
- the classifier obtained in the previous step
- the sequences to be classified

This step requires the `FeatureData[Sequence]` artifact (containing our ASVs) that was generated beforehand.
Running the following cell requires at least 10 GB of available RAM.

In [12]:
! qiime feature-classifier classify-sklearn \
    --i-classifier $data_dir/515f-806r-classifier.qza \
    --i-reads $data_dir/rep-seqs.qza \
    --o-classification $data_dir/silva-taxonomy.qza

Saved FeatureData[Taxonomy] to: project_data/silva-taxonomy.qza

A new `FeatureData[Taxonomy]` artifact should be created, containing our taxonomic assignments per feature.

In [2]:
! qiime tools peek $data_dir/silva-taxonomy.qza

UUID:        fc291b0b-2fa9-421f-ac94-c1aaad987464
Type:        FeatureData[Taxonomy]
Data format: TSVTaxonomyDirectoryFormat


<a id='tax_visualization'></a>

### 2.4 Taxonomy visualization

In this section, the composition of the project samples is analyzed. First, a tabular representation of all the features labeled with their corresponding taxonomy is created:

In [3]:
! qiime metadata tabulate \
    --m-input-file $data_dir/silva-taxonomy.qza \
    --o-visualization $data_dir/silva-taxonomy.qzv

Saved Visualization to: project_data/silva-taxonomy.qzv

In the table, the ID of every ASV is listed with its corresponding taxonomic assignment and the prediction confidence.

In [4]:
Visualization.load(f'{data_dir}/silva-taxonomy.qzv')

The taxonomic information per feature can be combined with the information about the samples to get an idea of how the taxa are distributed across the different samples. The data can be visualized in a bar plot, in which each bar represents a single sample and is broken down proportionally to the counts of every taxon.
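
What each bar encodes can be sketched with a small calculation: feature counts are summed per taxon within a sample and then divided by the sample total. The function `relative_abundance`, the sample and feature IDs, the counts and the taxon labels below are invented for illustration and are not taken from the project data.

```python
from collections import defaultdict


def relative_abundance(table, taxonomy):
    """Hypothetical helper: per-sample proportion of every taxon.

    `table` maps sample -> {feature_id: count};
    `taxonomy` maps feature_id -> taxon label.
    """
    result = {}
    for sample, counts in table.items():
        per_taxon = defaultdict(int)
        for feature_id, count in counts.items():
            per_taxon[taxonomy.get(feature_id, "Unassigned")] += count
        total = sum(per_taxon.values())
        result[sample] = (
            {taxon: count / total for taxon, count in per_taxon.items()}
            if total else {}
        )
    return result


# toy data, invented for illustration
table = {
    "sample_1": {"asv1": 30, "asv2": 10, "asv3": 60},
    "sample_2": {"asv1": 5, "asv3": 15},
}
taxonomy = {
    "asv1": "g__Bacteroides",
    "asv2": "g__Bacteroides",
    "asv3": "g__Faecalibacterium",
}
proportions = relative_abundance(table, taxonomy)
print(proportions)
```

In this toy example the two ASVs assigned to the same genus are merged into one segment of the bar for `sample_1`.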

In [7]:
! qiime taxa barplot \
    --i-table $data_dir/table.qza \
    --i-taxonomy $data_dir/silva-taxonomy.qza \
    --m-metadata-file $data_dir/cleaned_sample_meta_data.tsv \
    --o-visualization $data_dir/taxa-bar-plots.qzv

There was an issue with loading the file project_data/cleaned_sample_meta_data.tsv as metadata:

  Found unrecognized ID column name '' while searching for header. The first column name in the header defines the ID column, and must be one of these values:

  Case-insensitive: 'feature id', 'feature-id', 'featureid', 'id', 'sample id', 'sample-id', 'sampleid'

  Case-sensitive: '#OTU ID', '#OTUID', '#Sample ID', '#SampleID', 'sample_name'

  NOTE: Metadata files must contain tab-separated values.

  There may be more errors present in the metadata file. To get a full report, sample/feature metadata files can be validated with Keemei: https://keemei.qiime2.org

  Find details on QIIME 2 metadata requirements here: https://docs.qiime2.org/2022.2/tutorials/metadata/

In [6]:
Visualization.load(f'{data_dir}/taxa-bar-plots.qzv')

ValueError: project_data/taxa-bar-plots.qzv does not exist.
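
The error message above says that the first column of the metadata header is empty, so no bar plot was written and the load fails. One possible repair is sketched below. It assumes that the empty header is the only problem and that the first column really holds the sample IDs. The helper `fix_id_header` and the toy table are hypothetical, while `sample-id` is one of the accepted names listed in the error message.

```python
import csv
import io


def fix_id_header(tsv_text, id_name="sample-id"):
    """Hypothetical helper: name an empty first header field of a TSV."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if rows and rows[0] and rows[0][0] == "":
        rows[0][0] = id_name
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()


# toy metadata with an empty ID column name, invented for illustration
broken = "\tgroup\tage\nS1\tcontrol\t34\nS2\ttreated\t29\n"
fixed = fix_id_header(broken)
print(fixed.splitlines()[0])
```

For the real file, the same function could be applied to the text of `cleaned_sample_meta_data.tsv` and the result written back before rerunning the bar plot cell.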

Some of the taxonomic assignments are compared to BLAST results for validation. The visualization of the generated ASVs loaded below provides the BLAST links for this.

In [6]:
Visualization.load(f'{data_dir}/rep-seqs.qzv')

When BLAST is used as the taxonomic identifier, are the same taxonomies observed as with q2-feature-classifier?

Mitochondrial and chloroplast sequences may have to be filtered out of the feature table and the sequences. For this, the `filter-table` and `filter-seqs` actions from the `taxa` plugin are used. To exclude features meeting certain criteria, we can use the `--p-exclude` parameter as follows:

In [8]:
! qiime taxa filter-table \
    --i-table $data_dir/table.qza \
    --i-taxonomy $data_dir/silva-taxonomy.qza \
    --p-exclude mitochondria,chloroplast \
    --o-filtered-table $data_dir/table-filtered.qza

! qiime taxa filter-seqs \
    --i-sequences $data_dir/rep-seqs.qza \
    --i-taxonomy $data_dir/silva-taxonomy.qza \
    --p-exclude mitochondria,chloroplast \
    --o-filtered-sequences $data_dir/rep-seqs-filtered.qza

Saved FeatureTable[Frequency] to: project_data/table-filtered.qza
Saved FeatureData[Sequence] to: project_data/rep-seqs-filtered.qza

The filtered sequences are classified again and the taxa bar plot is regenerated, so that it can be compared with the previous visualization to check whether the distribution at the different taxonomic levels has changed between samples.

In [9]:
! qiime feature-classifier classify-sklearn \
    --i-classifier $data_dir/515f-806r-classifier.qza \
    --i-reads $data_dir/rep-seqs-filtered.qza \
    --o-classification $data_dir/silva-taxonomy-filtered.qza

Saved FeatureData[Taxonomy] to: project_data/silva-taxonomy-filtered.qza

In [10]:
! qiime metadata tabulate \
    --m-input-file $data_dir/silva-taxonomy-filtered.qza \
    --o-visualization $data_dir/silva-taxonomy-filtered.qzv

Saved Visualization to: project_data/silva-taxonomy-filtered.qzv

In [11]:
Visualization.load(f'{data_dir}/silva-taxonomy-filtered.qzv')

In [16]:
! qiime taxa barplot \
    --i-table $data_dir/table-filtered.qza \
    --i-taxonomy $data_dir/silva-taxonomy-filtered.qza \
    --m-metadata-file $data_dir/cleaned_sample_meta_data.tsv \
    --o-visualization $data_dir/taxa-bar-plots-filtered.qzv

Saved Visualization to: w4_data/taxa-bar-plots-filtered.qzv

In [17]:
Visualization.load(f'{data_dir}/taxa-bar-plots-filtered.qzv')

Not all ASVs may be annotated at species level. Some annotations are cut off at a higher rank due to insufficient taxonomic resolution when classifying short sequences (i.e., they have matches to multiple clades).

This still needs to be checked.
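
One way to run this check is to determine, for every ASV, the deepest rank that carries a name and to count how often each rank occurs. The sketch below assumes Greengenes-style taxonomy strings with rank prefixes such as `g__` and `s__`. The helper `deepest_rank`, the feature IDs and the taxonomy strings are invented for illustration.

```python
from collections import Counter

# rank prefixes as used in Greengenes-style ('k__') and SILVA-style ('d__') strings
RANK_NAMES = {
    "k": "kingdom", "d": "domain", "p": "phylum", "c": "class",
    "o": "order", "f": "family", "g": "genus", "s": "species",
}


def deepest_rank(taxon):
    """Hypothetical helper: deepest rank with a non-empty name."""
    deepest = "unassigned"
    for part in taxon.split(";"):
        prefix, _, name = part.strip().partition("__")
        if name and prefix in RANK_NAMES:
            deepest = RANK_NAMES[prefix]
    return deepest


# toy assignments, invented for illustration
taxonomy = {
    "asv1": "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; "
            "f__Lachnospiraceae; g__Blautia; s__producta",
    "asv2": "k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; "
            "f__Bacteroidaceae; g__Bacteroides; s__",
    "asv3": "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; "
            "f__; g__; s__",
    "asv4": "Unassigned",
}
rank_counts = Counter(deepest_rank(taxon) for taxon in taxonomy.values())
print(rank_counts)
```

For the project data, the taxonomy strings could be taken from the `Taxon` column obtained with `q2.Artifact.load(f'{data_dir}/silva-taxonomy.qza').view(pd.DataFrame)` and passed through the same function.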