# Diversity analysis


In [1]:
import os
import pandas as pd
from qiime2 import Visualization
from skbio import OrdinationResults
import matplotlib.pyplot as plt
from seaborn import scatterplot
import numpy as np

import qiime2 as q2

%matplotlib inline
data_dir = 'CE'

Input artifacts for the alpha-diversity analysis:
- dada2_table_align_filtered.qza
- fasttree-tree-rooted.qza
- sepp-tree.qza

## Alpha rarefaction

Which tree should be used for alpha rarefaction: the de novo tree or the fragment-insertion tree?
-> The fragment-insertion tree was chosen, as we have rather short sequences, which might not carry enough information.
Both trees are visualized here:

In [2]:
#denovo
Visualization.load(f'{data_dir}/fasttree-tree-rooted.qzv')

ValueError: CE/fasttree-tree-rooted.qzv does not exist.

In [None]:
#fragment-insertion
Visualization.load(f'{data_dir}/sepp-tree-placements-tree.qzv')

Using alpha rarefaction we can decide which rarefying threshold is best suited to our data.
1. Download the metadata (for some reason it was missing from my CE directory). If you already have it, run the notebook normally; if not, remove the hashtags (#) in the following two cells.
2. Visualize diversity alpha-rarefaction (--p-max-depth 100000) to find out which sampling depth to use -> 1500. No difference was noticed between the de novo and the fragment-insertion tree; in the end we used the insertion tree.
3. Run the core-metrics-phylogenetic pipeline.
4. Use diversity alpha-group-significance to run some statistical tests.
5. Use diversity alpha-correlation to check for correlations between numeric metadata columns and the richness of the microbial community.
6. Some fun with pandas: nice boxplots.
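
To make the idea behind the rarefaction plot concrete, here is a minimal pure-Python sketch, not the QIIME 2 implementation. It repeatedly subsamples one sample's reads without replacement at several depths and averages the number of observed features. The helper names (`rarefy`, `rarefaction_curve`) and the toy counts are made up for illustration.

```python
import random


def rarefy(counts, depth, rng):
    """Draw `depth` reads without replacement from a {feature: count} dict.

    Returns None if the sample has fewer than `depth` reads, which is how a
    sample drops out of the analysis at that depth.
    """
    reads = [feature for feature, n in counts.items() for _ in range(n)]
    if len(reads) < depth:
        return None
    rarefied = {}
    for feature in rng.sample(reads, depth):
        rarefied[feature] = rarefied.get(feature, 0) + 1
    return rarefied


def rarefaction_curve(counts, depths, iterations=10, seed=0):
    """Mean number of observed features at each depth (None if too shallow)."""
    rng = random.Random(seed)
    curve = {}
    for depth in depths:
        richness = []
        for _ in range(iterations):
            sub = rarefy(counts, depth, rng)
            if sub is None:
                break
            richness.append(len(sub))
        curve[depth] = sum(richness) / len(richness) if richness else None
    return curve


# Toy sample (made-up counts): two dominant features and three rare ones.
toy_sample = {'f1': 500, 'f2': 300, 'f3': 5, 'f4': 3, 'f5': 2}
curve = rarefaction_curve(toy_sample, depths=[10, 100, 810, 2000])
print(curve)
```

The sample holds 810 reads in total, so the curve reaches all five features at depth 810 and has no value at depth 2000. A plateau in such a curve suggests that deeper sampling would add few new features.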

In [None]:
! wget -nv -O $data_dir/food-metadata.tsv 'https://polybox.ethz.ch/index.php/s/nEd4l5CWGWGEtae/download'

In [None]:
df_meta = pd.read_csv(f'{data_dir}/food-metadata.tsv', sep='\t', index_col=0)

In [None]:
! qiime diversity alpha-rarefaction \
    --i-table $data_dir/dada2_table_align_filtered.qza \
    --i-phylogeny $data_dir/fasttree-tree-rooted.qza \
    --p-max-depth 100000 \
    --m-metadata-file $data_dir/food-metadata.tsv \
    --o-visualization $data_dir/alpha-rarefaction_denovo.qzv

In [None]:
# de novo tree: number of samples plateaus at around 7000 for rind-type
# --p-max-depth could be lowered to 7000
Visualization.load(f'{data_dir}/alpha-rarefaction_denovo.qzv')

## Diversity analysis

In [None]:
! qiime diversity alpha-rarefaction \
    --i-table $data_dir/dada2_table_align_filtered.qza \
    --i-phylogeny $data_dir/sepp-tree.qza \
    --p-max-depth 100000 \
    --m-metadata-file $data_dir/food-metadata.tsv \
    --o-visualization $data_dir/alpha-rarefaction_insertion.qzv

In [None]:
# fragment-insertion tree: number of samples plateaus at around 7000 for rind-type
# --p-max-depth could be lowered to 7000
Visualization.load(f'{data_dir}/alpha-rarefaction_insertion.qzv')

Observation: no difference was found between the de novo and the fragment-insertion tree.
I would cut at 1500 and choose that as the sampling depth. By then the plateau has already been reached for almost all factors, and the number of samples starts to decrease. With 1500 we don't lose too much. It could even be set lower, at around 1250.

At 1500: X samples are left (we lose X samples)
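
The retained and lost sample counts for a candidate depth can be worked out from the per-sample read totals. The sketch below uses made-up totals, and `samples_retained` is a hypothetical helper. In this notebook the totals could presumably be obtained from the feature table, for example by viewing dada2_table_align_filtered.qza as a pandas DataFrame and summing each sample's counts. That step is an assumption and is not shown here.

```python
def samples_retained(reads_per_sample, depth):
    """Return the IDs of samples that have at least `depth` reads.

    `reads_per_sample` can be a dict or a pandas Series (both offer .items()).
    """
    return [sample for sample, total in reads_per_sample.items() if total >= depth]


# Made-up per-sample read totals, for illustration only.
toy_totals = {'s1': 900, 's2': 1300, 's3': 1500, 's4': 4200, 's5': 12000}

summary = {}
for depth in (1250, 1500):
    kept = samples_retained(toy_totals, depth)
    summary[depth] = (len(kept), len(toy_totals) - len(kept))
    print(f'depth {depth}: keep {len(kept)}, lose {len(toy_totals) - len(kept)}')
```

Running the same comparison on the real totals would fill in the X placeholders above.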

## Alpha diversity

A sampling depth of 1500 is used for rarefaction. Now let's have a look at the within-sample diversity (= alpha diversity), computed with the fragment-insertion tree.
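
The alpha-diversity metric tested below is Faith's PD, which sums the branch lengths of the tree that connect the features present in a sample. Here is a minimal sketch on a toy tree, assuming the rooted definition in which the path up to the root is included; whether this matches the QIIME 2 implementation in every detail is an assumption. The tree, its node names and `faith_pd` are made up for illustration.

```python
def faith_pd(tree, present_tips):
    """Sum the lengths of all branches on the paths from the given tips to the root.

    `tree` maps each node to (parent, branch_length); the root has parent None.
    Each branch is counted once, even when several tips share it.
    """
    visited = set()
    total = 0.0
    for tip in present_tips:
        node = tip
        while node is not None and node not in visited:
            visited.add(node)
            parent, length = tree[node]
            total += length
            node = parent
    return total


# Toy rooted tree: root -> A (internal) -> t1, t2 and root -> B (tip).
toy_tree = {
    'root': (None, 0.0),
    'A': ('root', 1.0),
    'B': ('root', 2.0),
    't1': ('A', 0.5),
    't2': ('A', 0.25),
}

print(faith_pd(toy_tree, ['t1']))             # 0.5 + 1.0
print(faith_pd(toy_tree, ['t1', 't2']))       # adds only the 0.25 branch of t2
print(faith_pd(toy_tree, ['t1', 't2', 'B']))  # adds the 2.0 branch of B
```

Adding a close relative (t2) increases the value only slightly, while adding a distant lineage (B) increases it a lot. This is why the choice of tree can matter for this metric.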

In [None]:
! qiime diversity core-metrics-phylogenetic \
  --i-table $data_dir/dada2_table_align_filtered.qza \
  --i-phylogeny $data_dir/sepp-tree.qza \
  --m-metadata-file $data_dir/food-metadata.tsv \
  --p-sampling-depth 1500 \
  --output-dir $data_dir/core-metrics-results_insertion_1500

In [None]:
! qiime diversity alpha-group-significance \
  --i-alpha-diversity $data_dir/core-metrics-results_insertion_1500/faith_pd_vector.qza \
  --m-metadata-file $data_dir/food-metadata.tsv \
  --o-visualization $data_dir/core-metrics-results_insertion_1500/faith-pd-group-significance.qzv

In [None]:
Visualization.load(f'{data_dir}/core-metrics-results_insertion_1500/faith-pd-group-significance.qzv')

In [None]:
! qiime diversity alpha-correlation \
  --i-alpha-diversity $data_dir/core-metrics-results_insertion_1500/faith_pd_vector.qza \
  --m-metadata-file $data_dir/food-metadata.tsv \
  --o-visualization $data_dir/core-metrics-results_insertion_1500/faith-pd-group-significance-numeric.qzv

In [None]:
Visualization.load(f'{data_dir}/core-metrics-results_insertion_1500/faith-pd-group-significance-numeric.qzv')

I feel like this correlation analysis isn't necessary...
->> Should we try ANOVA with q2-longitudinal? Are its assumptions met, though?
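
On the question of assumptions: to my understanding, alpha-group-significance relies on the Kruskal-Wallis test, a rank-based alternative to one-way ANOVA that does not assume normally distributed values. The following is a minimal sketch of its H statistic with average ranks for ties and no tie-correction factor. The function names and toy values are made up, and a real analysis would use an established statistics library.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average rank of the tie block
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_h(values, labels):
    """Kruskal-Wallis H statistic (without tie correction)."""
    n = len(values)
    rank_sums, sizes = {}, {}
    for rank, group in zip(average_ranks(values), labels):
        rank_sums[group] = rank_sums.get(group, 0.0) + rank
        sizes[group] = sizes.get(group, 0) + 1
    spread = sum(rank_sums[g] ** 2 / sizes[g] for g in sizes)
    return 12 / (n * (n + 1)) * spread - 3 * (n + 1)


# Made-up Faith PD values for two hypothetical rind types.
toy_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
toy_labels = ['natural', 'natural', 'natural', 'washed', 'washed', 'washed']
h = kruskal_h(toy_values, toy_labels)
print(h)
```

Because only ranks enter the statistic, outliers and skewed distributions affect it less than they would affect ANOVA.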

#### PANDAS fun

In [None]:
art = q2.Artifact.load(os.path.join(data_dir, 'core-metrics-results_insertion_1500/faith_pd_vector.qza')).view(pd.Series)
md = pd.read_csv(os.path.join(data_dir, 'food-metadata.tsv'), sep='\t', index_col=0)['rindtype']

In [None]:
pd.concat([art, md], join = 'inner', axis = 1)

In [None]:
# dropna() changes nothing here: same rows and columns as before
artmd = pd.concat([art, md], join = 'inner', axis = 1).dropna()

In [None]:
artmd.boxplot(by = 'rindtype', rot=90, grid = False)
plt.ylabel('Faith PD')
plt.xlabel('Rind Type')

## Beta diversity

Beta diversity measures how similar or dissimilar samples, or groups of samples, are to each other (between-sample diversity).
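
Two of the non-phylogenetic metrics shown below can be written down in a few lines. This is a minimal sketch on made-up counts, not the QIIME 2 code path. Jaccard only looks at presence/absence, while Bray-Curtis also uses abundances.

```python
def jaccard_distance(a, b):
    """1 - |shared features| / |features present in either sample|."""
    present_a = {f for f, n in a.items() if n > 0}
    present_b = {f for f, n in b.items() if n > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1 - len(present_a & present_b) / len(union)


def bray_curtis(a, b):
    """Sum of absolute count differences divided by the total count."""
    features = set(a) | set(b)
    diff = sum(abs(a.get(f, 0) - b.get(f, 0)) for f in features)
    total = sum(a.get(f, 0) + b.get(f, 0) for f in features)
    return diff / total if total else 0.0


# Two toy samples as {feature: count} dicts (made-up values).
s1 = {'f1': 10, 'f2': 5, 'f3': 0}
s2 = {'f1': 2, 'f2': 5, 'f4': 8}

print(jaccard_distance(s1, s2))  # 2 shared of 3 present features -> 1 - 2/3
print(bray_curtis(s1, s2))       # 16 / 30
```

The UniFrac metrics follow the same unweighted/weighted split but additionally take the phylogenetic tree into account.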

Analyse the beta-diversity results from the core-metrics-phylogenetic action. Inspect unweighted_unifrac_emperor.qzv:

Some clustering according to rindtype is visible, but there is no clear separation.

#### Unweighted UniFrac

In [None]:
Visualization.load(f'{data_dir}/core-metrics-results_insertion_1500/unweighted_unifrac_emperor.qzv')

#### Jaccard

In [None]:
Visualization.load(f'{data_dir}/core-metrics-results_insertion_1500/jaccard_emperor.qzv')

#### Weighted UniFrac

In [None]:
Visualization.load(f'{data_dir}/core-metrics-results_insertion_1500/weighted_unifrac_emperor.qzv')

#### Bray-Curtis

In [None]:
Visualization.load(f'{data_dir}/core-metrics-results_insertion_1500/bray_curtis_emperor.qzv')

### PERMANOVA test
Statistical testing of associations between beta diversity and categorical variables.
We use a PERMANOVA test, run in QIIME 2 with the qiime diversity beta-group-significance method, to check whether samples group significantly by the observed categories.
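
The logic of PERMANOVA can be sketched as follows: compute a pseudo-F statistic from the distance matrix, then shuffle the group labels many times to see how often a value at least as large arises by chance. This is a minimal illustration on a toy distance matrix, not the implementation QIIME 2 uses; the function names and data are made up.

```python
import random


def pseudo_f(dist, labels):
    """PERMANOVA pseudo-F from a square distance matrix and group labels."""
    n = len(labels)
    groups = sorted(set(labels))
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for g in groups:
        members = [i for i, lab in enumerate(labels) if lab == g]
        ss = sum(dist[i][j] ** 2 for k, i in enumerate(members) for j in members[k + 1:])
        ss_within += ss / len(members)
    ss_among = ss_total - ss_within
    return (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))


def permanova(dist, labels, permutations=999, seed=0):
    """Return (observed pseudo-F, permutation p-value)."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy data: six samples on a line, two well-separated groups of three.
positions = [0, 1, 2, 10, 11, 12]
toy_dist = [[abs(p - q) for q in positions] for p in positions]
toy_labels = ['a', 'a', 'a', 'b', 'b', 'b']

f_stat, p_value = permanova(toy_dist, toy_labels)
print(f_stat, p_value)
```

With 999 permutations the smallest attainable p-value is 1/1000 = 0.001, so a reported 0.001 means that no permutation produced a pseudo-F as large as the observed one.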

Inspect the beta diversity metrics of rindtype groupings:

Result: distances between samples within a group differ significantly from distances to samples in the other groups; the p-values are all 0.001.

In [None]:
! qiime diversity beta-group-significance \
    --i-distance-matrix $data_dir/core-metrics-results_insertion_1500/unweighted_unifrac_distance_matrix.qza \
    --m-metadata-file $data_dir/food-metadata.tsv \
    --m-metadata-column rindtype \
    --p-pairwise \
    --o-visualization $data_dir/core-metrics-results_insertion_1500/uw_unifrac-rindtype-significance.qzv

In [None]:
Visualization.load(f'{data_dir}/core-metrics-results_insertion_1500/uw_unifrac-rindtype-significance.qzv')

Inspect the beta diversity metrics of continent groupings:

Result: 

In [None]:
! qiime diversity beta-group-significance \
    --i-distance-matrix $data_dir/core-metrics-results_insertion_1500/unweighted_unifrac_distance_matrix.qza \
    --m-metadata-file $data_dir/food-metadata.tsv \
    --m-metadata-column continent \
    --p-pairwise \
    --o-visualization $data_dir/core-metrics-results_insertion_1500/uw_unifrac-continent-significance.qzv

In [None]:
Visualization.load(f'{data_dir}/core-metrics-results_insertion_1500/uw_unifrac-continent-significance.qzv')

Inspect the beta diversity metrics of animal source groupings:

In [None]:
! qiime diversity beta-group-significance \
    --i-distance-matrix $data_dir/core-metrics-results_insertion_1500/unweighted_unifrac_distance_matrix.qza \
    --m-metadata-file $data_dir/food-metadata.tsv \
    --m-metadata-column animal_source \
    --p-pairwise \
    --o-visualization $data_dir/core-metrics-results_insertion_1500/uw_unifrac-animal-significance.qzv

In [None]:
Visualization.load(f'{data_dir}/core-metrics-results_insertion_1500/uw_unifrac-animal-significance.qzv')

#### ADONIS

The PERMANOVA test can also be performed with the adonis implementation, which lets us assess which covariates explain the most variation in our dataset.

The order of the terms in the formula can make a difference to the outcome, so we try different orders.
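
Trying different orders by hand gets tedious quickly. A small sketch for generating the candidate formula strings is shown below, using three of the covariates from this notebook to keep the list short (all six would give 720 orders). It only builds the strings that would be passed to --p-formula and does not run adonis.

```python
from itertools import permutations

# Subset of the metadata columns used in the formulas in this notebook.
terms = ['rindtype', 'continent', 'country']

# One formula string per ordering of the terms.
formulas = ['+'.join(order) for order in permutations(terms)]

for formula in formulas:
    print(formula)
```

If the terms are fitted sequentially, as I understand adonis to do, each covariate is only credited with the variation not already explained by the terms before it. This would explain why correlated covariates such as country and continent can swap importance when reordered.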

In [None]:
! qiime diversity adonis \
    --i-distance-matrix $data_dir/core-metrics-results_insertion_1500/unweighted_unifrac_distance_matrix.qza \
    --m-metadata-file $data_dir/food-metadata.tsv \
    --p-formula "rindtype+continent+country+region+animal_source+pasteurized" \
    --o-visualization $data_dir/core-metrics-results_insertion_1500/adonis.qzv

In [None]:
Visualization.load(f'{data_dir}/core-metrics-results_insertion_1500/adonis.qzv')

Sort the covariates by R2, highest first.

Result: The region explains the most variation in our dataset (highest R2 value). --> But where is "continent"? :(

In [None]:
! qiime diversity adonis \
    --i-distance-matrix $data_dir/core-metrics-results_insertion_1500/unweighted_unifrac_distance_matrix.qza \
    --m-metadata-file $data_dir/food-metadata.tsv \
    --p-formula "region+rindtype+country+animal_source+continent+pasteurized" \
    --o-visualization $data_dir/core-metrics-results_insertion_1500/adonis_neworder.qzv

In [None]:
Visualization.load(f'{data_dir}/core-metrics-results_insertion_1500/adonis_neworder.qzv')

In [None]:
! qiime diversity adonis \
    --i-distance-matrix $data_dir/core-metrics-results_insertion_1500/unweighted_unifrac_distance_matrix.qza \
    --m-metadata-file $data_dir/food-metadata.tsv \
    --p-formula "rindtype+continent+country+animal_source+pasteurized" \
    --o-visualization $data_dir/core-metrics-results_insertion_1500/adonis_3.qzv

In [None]:
Visualization.load(f'{data_dir}/core-metrics-results_insertion_1500/adonis_3.qzv')

In [None]:
! qiime diversity adonis \
    --i-distance-matrix $data_dir/core-metrics-results_insertion_1500/unweighted_unifrac_distance_matrix.qza \
    --m-metadata-file $data_dir/food-metadata.tsv \
    --p-formula "rindtype+country+animal_source+continent+pasteurized" \
    --o-visualization $data_dir/core-metrics-results_insertion_1500/adonis_4.qzv

In [None]:
Visualization.load(f'{data_dir}/core-metrics-results_insertion_1500/adonis_4.qzv')

#### PANDAS fun

In [None]:
unw_UF_pcoa_res = q2.Artifact.load(os.path.join(data_dir, 'core-metrics-results_insertion_1500/unweighted_unifrac_pcoa_results.qza'))
unw_UF_pcoa_res = unw_UF_pcoa_res.view(OrdinationResults)
# let's just take the first 3 columns (i.e., first 3 PCoA axes)
unw_UF_pcoa_res_data = unw_UF_pcoa_res.samples.iloc[:,:3]
# rename the columns for clarity
unw_UF_pcoa_res_data.columns = ['Axis 1', 'Axis 2', 'Axis 3']

In [None]:
unw_UF_pcoa_res_data.head(3)

Join this dataframe with metadata column:

In [None]:
unw_UF_pcoa_res_data_with_rindtype = unw_UF_pcoa_res_data.join(df_meta['rindtype'])
unw_UF_pcoa_res_data_with_rindtype.head()

Visualize data using Python visualization library seaborn:

In [None]:
scatterplot(data=unw_UF_pcoa_res_data_with_rindtype,
            x='Axis 1',
            y='Axis 2',
            hue='rindtype',
            palette='viridis')