## CoDA methods
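Sequencing counts are compositional: they only carry relative information, so applying traditional statistical tests to them directly can lead to spurious results. It is therefore recommended to transform or normalize your ASV table (the output of ASV clustering) before computing alpha diversity, beta diversity (distance calculation), differential abundance and ordinations (unsupervised clustering).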
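
To make the "spurious results" point concrete, here is a minimal sketch with made-up numbers (the abundances below are hypothetical, not from any dataset). Taxon A never changes in absolute terms, yet its relative abundance appears to drop simply because taxon B grows.

```python
import numpy as np

# Hypothetical absolute abundances: rows are samples, columns are taxa A and B.
# Taxon A is constant; only taxon B grows.
absolute = np.array([
    [100.0, 100.0],
    [100.0, 200.0],
    [100.0, 400.0],
])

# Sequencing only reports relative abundances (each sample sums to 1).
relative = absolute / absolute.sum(axis=1, keepdims=True)

taxon_a_absolute = absolute[:, 0]
taxon_a_relative = relative[:, 0]

print("Taxon A, absolute:", taxon_a_absolute)
print("Taxon A, relative:", np.round(taxon_a_relative, 3))
```

A test run on the relative values would report a decrease in taxon A that does not exist in the underlying community, which is the kind of artifact the transformations in this notebook are meant to avoid.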

Outside of QIIME2 using Python 3

The first thing you will need to do is export your feature table from QIIME2 to a .tsv file.

In your terminal:

1. Activate your QIIME2 environment (whichever version you use)
    * `conda activate qiime2.2020.11`
2. Move to the working directory that contains your files
    * `cd FILEPATH`
3. Clone the repository from GitHub
    * `git clone https://github.com/dianahaider/normalization_pipeline`
4. Move to the cloned directory and run this command to make the scripts executable (it gives the files permission to be executed)
    * `chmod a+x ./*`
5. Run `~/normalization_pipeline/export-asv-results.sh`
    * On a Mac, you might get an error about the shell name; just make sure your QIIME2 environment is active.
    * On Linux, if you get an error, make sure the QIIME2 version in the script is the same as the one on your computer (I use 2020.11; just change it to whichever you are using).

This should output two files in a new directory, `exported_table`:
* feature-table.biom.tsv
* feature-table.biom

Now go back to the directory cloned from GitHub, run `jupyter notebook`, open this file in Jupyter, and follow along with the code below.

In [None]:
#import the relevant packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.decomposition import PCA
from skbio.diversity import alpha_diversity
from skbio.stats.distance import permanova
from skbio import DistanceMatrix
from scipy.spatial.distance import cdist
from skbio.stats.composition import clr
from skbio.stats.composition import alr
from skbio.stats.composition import ilr
from skbio.diversity.alpha import chao1

In [None]:
#First you have to import your tsv table to the notebook
feature_table_sparse = pd.read_csv('~/normalization_pipeline/test/exported_table/feature-table.biom.tsv', sep='\t',skiprows=1,index_col=0)

In [None]:
#Preview your table to make sure it was correctly exported from QIIME2 and imported into Jupyter
feature_table_sparse.head()
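
If the import looks wrong, it helps to know why `skiprows=1` is there. The sketch below assumes the TSV was produced by `biom convert`, which typically writes a comment line before the real header; the in-memory file and its values are made up for illustration.

```python
import io
import pandas as pd

# Stand-in for feature-table.biom.tsv; the contents are made up for illustration.
tsv_text = (
    "# Constructed from biom file\n"
    "#OTU ID\tsampleA\tsampleB\n"
    "asv1\t10.0\t0.0\n"
    "asv2\t3.0\t7.0\n"
)

# skiprows=1 drops the comment line so that '#OTU ID ...' becomes the header,
# and index_col=0 uses the ASV identifiers as the row index.
toy_table = pd.read_csv(io.StringIO(tsv_text), sep="\t", skiprows=1, index_col=0)

print(toy_table.shape)
print(toy_table.index.name)
print(list(toy_table.columns))
```

If your own file has no leading comment line, `skiprows=1` would swallow the header instead, so it is worth opening the file in a text editor once to check.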

## Sparse feature table
If your table contains 0s (it most likely does), you need to replace them with a pseudo-count (a small value), because the log of 0 is undefined. The <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5682008/pdf/fmicb-08-02114.pdf" target="_blank">literature</a> suggests other methods, but this one is widely used and accepted. If you want to dive into it, there are three types of 0s: <br>
* Real 0s (structural 0s)
    <br> i.e. taxa not supposed to be present, and not detected
* False 0s (either outlier 0s or sampling 0s)
    1. outlier 0
    <br> i.e. taxa not sampled for unknown reasons
    2. sampling 0
    <br> i.e. taxa not sampled because of sampling depth
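
As a minimal sketch of the pseudo-count idea, the toy table below (made-up ASV and sample names) has its zeros replaced by 0.1 while every observed count is left untouched.

```python
import pandas as pd

# Made-up counts: rows are ASVs, columns are samples.
toy_counts = pd.DataFrame(
    {"sampleA": [10.0, 0.0, 5.0], "sampleB": [0.0, 7.0, 3.0]},
    index=["asv1", "asv2", "asv3"],
)

pseudo_count = 0.1  # same value as used in this notebook

# Replace only the zeros; observed counts stay untouched.
toy_filled = toy_counts.mask(toy_counts == 0, pseudo_count)

print(toy_filled)
```

The pseudo-count does not distinguish between the three kinds of 0s listed above; it treats them all the same way, which is the simplification this approach accepts.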

In [None]:
#Replace all 0s with a 0.1 pseudo-count so that the log can be computed
feature_table=feature_table_sparse.mask(feature_table_sparse==0).fillna(0.1)

In [None]:
#The head function shows the first 5 rows; the 0s should now read 0.1
feature_table.head()

In [None]:
#skbio's clr treats each row as one composition, so we transpose the table to put the
#samples in rows, then transpose back to keep ASVs in rows and samples in columns
#clr outputs an array, so we will store the result and reconstruct the dataframe
#We could have used alr or ilr too
clr_transformed_array = clr(feature_table.T.to_numpy()).T

In [None]:
#storing the sample and asv names from the original dataframe
samples = feature_table.columns
asvs = feature_table.index

In [None]:
#Creating the dataframe with the clr transformed data, and assigning the sample names
clr_transformed = pd.DataFrame(clr_transformed_array, columns=samples)

In [None]:
#Assigning the asv names
clr_transformed['asvs'] = asvs
clr_transformed = clr_transformed.set_index('asvs')

In [None]:
clr_transformed.head()
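
To check what the clr transformation does, it can be computed by hand. The sketch below uses NumPy only and made-up counts; it assumes the same orientation as the feature table here (ASVs in rows, samples in columns), so the log mean is taken within each column.

```python
import numpy as np

# Made-up, zero-free counts: rows are ASVs, columns are samples,
# the same orientation as the feature table in this notebook.
toy = np.array([
    [10.0, 1.0],
    [20.0, 4.0],
    [40.0, 16.0],
])

# clr(x) = log(x) - mean(log(x)), computed within each sample (each column here).
log_toy = np.log(toy)
toy_clr = log_toy - log_toy.mean(axis=0, keepdims=True)

# Two sanity checks that follow from the definition:
# 1. the clr values of every sample sum to zero;
# 2. rescaling a sample (e.g. sequencing it 5x deeper) leaves its clr values unchanged.
column_sums = toy_clr.sum(axis=0)
rescaled = toy * np.array([1.0, 5.0])
rescaled_clr = np.log(rescaled) - np.log(rescaled).mean(axis=0, keepdims=True)

print(np.round(toy_clr, 3))
print(np.allclose(column_sums, 0.0), np.allclose(toy_clr, rescaled_clr))
```

Running the same two checks on `clr_transformed` (column sums close to zero) is a quick way to confirm that the transformation was applied per sample and not per ASV.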

### Alpha diversity (not completed)
The CoDA method for alpha diversity is to compute Chao1 (richness) or Shannon (richness and evenness). <br>
The Chao1 index takes non-rarefied, non-transformed data (i.e. raw counts) as input, whereas the Shannon diversity index takes normalized, rarefied or transformed data.
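
As a rough illustration of what these two indices measure, here is a NumPy-only sketch for a single made-up sample. It uses the bias-corrected Chao1 formula and the natural-log Shannon index; libraries differ in their defaults (for example in the log base), so values from scikit-bio may differ by a constant factor.

```python
import numpy as np

# Made-up raw counts for a single sample (one entry per ASV).
counts = np.array([10, 5, 1, 1, 2, 0])

observed = int((counts > 0).sum())      # ASVs seen at least once
singletons = int((counts == 1).sum())   # ASVs seen exactly once (F1)
doubletons = int((counts == 2).sum())   # ASVs seen exactly twice (F2)

# Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))
chao1_estimate = observed + singletons * (singletons - 1) / (2 * (doubletons + 1))

# Shannon index with the natural log: H = -sum(p * ln p) over ASVs with p > 0
proportions = counts[counts > 0] / counts.sum()
shannon_index = float(-(proportions * np.log(proportions)).sum())

print(observed, singletons, doubletons)
print(chao1_estimate)
print(round(shannon_index, 3))
```

Chao1 depends on the number of singletons and doubletons, which is why it needs raw counts: adding a pseudo-count or transforming the table removes exactly the information the estimator relies on.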

In [None]:
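#This will calculate the number of ASVs observed in each sample
#alpha_diversity expects samples in rows and integer counts, hence the transpose and the cast
obs_otus = alpha_diversity('observed_otus', feature_table_sparse.T.astype(int), feature_table_sparse.columns)
obs_otus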

In [None]:
#Chao1 richness estimate for each sample, computed from the raw counts
alpha_diversity('chao1', feature_table_sparse.T.astype(int), feature_table_sparse.columns)

In [None]:
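#adiv_faith_pd is not defined yet: Faith's PD needs a phylogenetic tree, which this notebook does not load
#adiv_faith_pd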

### Beta diversity
The CoDA method for beta diversity calculations is the Aitchison distance, which is simply the Euclidean distance between samples after clr transformation.
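
The definition can be sketched in a few lines of NumPy. The helper names `clr_rows` and `aitchison` are hypothetical, and the counts are made up; the point is only to show that the distance ignores sequencing depth.

```python
import numpy as np

def clr_rows(matrix):
    """Centred log-ratio of each row (one row per sample, no zeros)."""
    logged = np.log(matrix)
    return logged - logged.mean(axis=1, keepdims=True)

def aitchison(x, y):
    """Euclidean distance between two clr-transformed compositions."""
    x_clr, y_clr = clr_rows(np.vstack([x, y]))
    return float(np.sqrt(((x_clr - y_clr) ** 2).sum()))

# Made-up, zero-free counts for two samples over the same four ASVs.
sample_1 = np.array([10.0, 20.0, 30.0, 40.0])
sample_2 = np.array([40.0, 30.0, 20.0, 10.0])

d_original = aitchison(sample_1, sample_2)
# Sequencing sample_2 ten times deeper does not change the distance.
d_rescaled = aitchison(sample_1, sample_2 * 10)
# A sample is at distance zero from itself.
d_self = aitchison(sample_1, sample_1)

print(round(d_original, 4), round(d_rescaled, 4), d_self)
```

This scale invariance is the main reason to prefer the Aitchison distance over a Euclidean distance on raw counts, where the deeper-sequenced sample would look further away.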

In [None]:
#You need to use the transpose of the table because our columns are samples, and cdist computes the distance
#between pairs of rows, not pairs of columns
dist = cdist(clr_transformed.T, clr_transformed.T, 'euclid')

In [None]:
#Just as we previously did with clr, we need to reconstruct the table after applying cdist
distance_matrix = pd.DataFrame(dist, columns=samples)

In [None]:
distance_matrix['samples'] = samples
distance_matrix = distance_matrix.set_index('samples')

In [None]:
distance_matrix

In [None]:
#Plot the distance matrix as a heatmap to visualize the distance between each pair of samples
fig = px.imshow(distance_matrix)
fig.show()

In [None]:
#Let's import the metadata to make sense of the PCA
#skiprows is now in brackets, meaning we remove row 1 (the second row), not the first row (Python counts 0,1,2...)
mtda = pd.read_csv('~/normalization_pipeline/test/sample-metadata.tsv', sep='\t',skiprows=[1])

In [None]:
mtda.head()
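
The tests and plots that follow pair each metadata row with a sample by position, so the metadata rows need to be in the same order as the feature table columns. The sketch below shows one way to check and enforce that; the file contents, the `sample-id` column name and the sample IDs are assumptions made for illustration.

```python
import io
import pandas as pd

# Stand-in for sample-metadata.tsv; IDs, columns and values are made up.
metadata_text = (
    "sample-id\tbody-site\n"
    "#q2:types\tcategorical\n"
    "S3\tgut\n"
    "S1\ttongue\n"
    "S2\tgut\n"
)

# skiprows=[1] removes only the '#q2:types' line and keeps the header (row 0).
toy_mtda = pd.read_csv(io.StringIO(metadata_text), sep="\t", skiprows=[1])

# Order of the samples in the (hypothetical) feature table columns.
table_samples = ["S1", "S2", "S3"]

# Reorder the metadata rows to follow the feature table.
aligned = toy_mtda.set_index("sample-id").loc[table_samples]

print(list(aligned.index))
print(list(aligned["body-site"]))
```

With real data, `table_samples` would be the columns of the feature table; `.loc` raises a `KeyError` if a sample is missing from the metadata, which is a useful early warning.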

In [None]:
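#In order to compute the permanova test, the distance_matrix has to be converted to a skbio
#DistanceMatrix; passing the index as ids keeps the sample names
dm = DistanceMatrix(distance_matrix.to_numpy(), ids=distance_matrix.index)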

A PERMANOVA test determines statistically whether groups of samples differ, i.e. whether their centres in geometrical space are further apart than expected by chance. Are the samples different between body sites? <a href="https://onlinelibrary.wiley.com/doi/pdf/10.1002/9781118445112.stat07841" target="_blank">Further reading</a>
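
To show what the test computes, here is a minimal NumPy sketch of the usual pseudo-F statistic with a permutation p-value. It follows the standard formulation but is not the scikit-bio implementation; the function names, the toy data and the number of permutations are all assumptions made for illustration.

```python
import numpy as np

def pseudo_f(dist, labels):
    """PERMANOVA pseudo-F from a square distance matrix and group labels."""
    labels = np.asarray(labels)
    n = len(labels)
    groups = np.unique(labels)
    sq = dist ** 2
    # Total sum of squares: squared distances over all pairs, divided by n.
    ss_total = sq[np.triu_indices(n, k=1)].sum() / n
    # Within-group sum of squares: same quantity computed inside each group.
    ss_within = 0.0
    for g in groups:
        idx = np.flatnonzero(labels == g)
        sub = sq[np.ix_(idx, idx)]
        ss_within += sub[np.triu_indices(len(idx), k=1)].sum() / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))

def permanova_sketch(dist, labels, permutations=199, seed=0):
    """Return (pseudo-F, permutation p-value)."""
    rng = np.random.default_rng(seed)
    observed = pseudo_f(dist, labels)
    hits = 0
    for _ in range(permutations):
        # Shuffle the group labels and count how often F is at least as large.
        if pseudo_f(dist, rng.permutation(labels)) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)

# Made-up data: two groups of six samples, shifted apart in a 3-D space.
rng = np.random.default_rng(42)
points = np.vstack([rng.normal(0.0, 1.0, (6, 3)), rng.normal(5.0, 1.0, (6, 3))])
labels = np.array(["gut"] * 6 + ["tongue"] * 6)
diff = points[:, None, :] - points[None, :, :]
toy_dist = np.sqrt((diff ** 2).sum(axis=2))

f_stat, p_value = permanova_sketch(toy_dist, labels)
print(round(f_stat, 2), p_value)
```

A large pseudo-F with a small p-value means the labels explain more of the distances than random relabelling does. PERMANOVA is also sensitive to differences in within-group spread, so a significant result does not by itself prove that the group centres differ.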

In [None]:
#PERMANOVA tests the association of the microbiome composition with a covariate of
#interest; 'body-site' is the column name of that covariate in your metadata
permanova(dm, grouping=mtda['body-site'])

### Make a PCA plot to visualize your data

In [None]:
#Relationship between samples with the distance matrix
pca = PCA(n_components=2)
components = pca.fit_transform(distance_matrix)

fig = px.scatter(components, x=0, y=1, color=mtda['body-site'])
fig.show()

In [None]:
#Relationship between samples with the clr transformed data
pca = PCA(n_components=2)
components = pca.fit_transform(clr_transformed.T)

fig = px.scatter(components, x=0, y=1, color=mtda['body-site'])
fig.show()
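
When reading these PCA plots, it helps to know how much of the total variance the plotted axes carry. The sketch below reproduces PCA with a plain SVD on made-up data (samples in rows) and reports the share of variance per axis; the random matrix stands in for the transposed clr table.

```python
import numpy as np

# Made-up clr-like matrix: rows are samples, columns are ASVs.
rng = np.random.default_rng(0)
toy_samples = rng.normal(size=(8, 5))

# PCA = SVD of the column-centred data.
centred = toy_samples - toy_samples.mean(axis=0)
u, singular_values, vt = np.linalg.svd(centred, full_matrices=False)

# Sample coordinates on the principal axes (what gets plotted).
scores = u * singular_values

# Share of the total variance carried by each axis.
explained_ratio = singular_values ** 2 / (singular_values ** 2).sum()

print(np.round(explained_ratio[:2], 3))
```

With scikit-learn, the same ratios are available as `pca.explained_variance_ratio_` after fitting, and adding them to the axis labels makes the plots easier to interpret.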

In [None]:
#Here is the PCA with the raw data
pca = PCA(n_components=2)
components = pca.fit_transform(feature_table_sparse.T)

fig = px.scatter(components, x=0, y=1,color=mtda['body-site'])
fig.show()

In [None]:
#Relationship between samples in 3d, with the clr transformed data
pca = PCA(n_components=3)
components = pca.fit_transform(clr_transformed.T)

fig = px.scatter_3d(components, x=0, y=1, z=2, color=mtda['body-site'])
fig.show()

In [None]:
#Relationship between samples in 3d, with the raw data
pca = PCA(n_components=3)
components = pca.fit_transform(feature_table_sparse.T)

fig = px.scatter_3d(components, x=0, y=1, z=2, color=mtda['body-site'])
fig.show()