# Beta Diversity Analysis, Rarefaction and Significance Tests

        

## 0. Setup

In [22]:
import os
import pandas as pd
import qiime2 as q2
from skbio import OrdinationResults
from qiime2 import Visualization
import matplotlib.pyplot as plt
from seaborn import scatterplot

%matplotlib inline

In [4]:
data_dir ='project_data'

In [3]:
ls

A_FirstLook.ipynb                figures/
B_Eating_habits.ipynb            G_Beta_Diversity.ipynb
C_Sequence_import.ipynb          H_DifferentialAbundance.ipynb
D_Taxonomy_NCBI.ipynb            I_Metagenomics.ipynb
D_Taxonomy_pre-classifier.ipynb  J_Functional_Redoundancy_and_Stability.ipynb
D_Taxonomy_SILVA.ipynb           project_data/
E_Phylogeny.ipynb                Z_GanttChart.ipynb
F_Alpha_Diversity.ipynb


To run this notebook, you need the results produced by two earlier notebooks:
- the `metadata` gathered in file A
- the diversity metrics computed in file F

<a id='sec1'></a>

## 1. Import metadata

Metadata refers to the additional information about the students, which we collected in the file `sample_meta_data.tsv`.

In [14]:
df_meta = pd.read_csv(f'{data_dir}/sample_meta_data.tsv', sep='\t')

## 2. Compute diversity

Beta diversity measures how similar or dissimilar samples, or groups of samples, are to each other.
To inspect how samples group across metadata categories, we start from the principal coordinates analysis (PCoA) plots created with the `qiime diversity core-metrics-phylogenetic` method in file F (directory `core-metrics-results`).
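As a concrete illustration of a beta diversity metric, the sketch below computes the Bray-Curtis dissimilarity (the metric behind the `bray_curtis_distance_matrix.qza` file used later) for a pair of samples. The function name `bray_curtis` and the count vectors are made up for the example; QIIME 2 computes the full matrix itself, so this is only meant to show what a single entry of that matrix represents.

```python
def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two samples' feature counts."""
    if len(counts_a) != len(counts_b):
        raise ValueError("both samples must cover the same features")
    # Sum of absolute count differences, feature by feature.
    abs_diff = sum(abs(a - b) for a, b in zip(counts_a, counts_b))
    total = sum(counts_a) + sum(counts_b)
    # 0 means identical composition, 1 means no shared features.
    return abs_diff / total if total else 0.0


# Hypothetical counts for four features (not taken from the project data).
sample_1 = [10, 0, 5, 5]
sample_2 = [5, 5, 0, 10]

identical = bray_curtis(sample_1, sample_1)
different = bray_curtis(sample_1, sample_2)
disjoint = bray_curtis([4, 0], [0, 6])
print(identical, different, disjoint)
```

Each pair of samples gets one such value, and the resulting square matrix is what the significance tests in this notebook take as input.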

#### **Selection of categorical variables of interest**

In [23]:
df_cat = df_meta.select_dtypes(include=['object'])
df_cat.describe()

statistic,GEN_age_cat,GEN_bmi_cat,GEN_collection_timestamp,GEN_country,GEN_geo_loc_name,GEN_host_common_name,GEN_last_move,GEN_last_travel,GEN_level_of_education,GEN_race,...,NUT_probiotic_frequency,NUT_red_meat_frequency,NUT_salted_snacks_frequency,NUT_seafood_frequency,NUT_sugary_sweets_frequency,NUT_vegetable_frequency,NUT_vitamin_b_supplement_frequency,NUT_vitamin_d_supplement_frequency,NUT_whole_eggs,NUT_whole_grain_frequency
count,523,523,523,523,523,523,523,523,523,523,...,523,523,523,523,523,523,523,523,523,523
unique,9,5,505,17,62,1,6,6,8,6,...,6,6,6,6,6,6,6,6,6,6
top,50s,Normal,2016-08-01 08:00:00,United Kingdom,United Kingdom:England,human,I have lived in my current state of residence ...,I have not been outside of my country of resid...,Graduate or Professional degree,Caucasian,...,Never,Occasionally,Rarely,Occasionally,Rarely,Daily,Never,Never,Occasionally,Regularly
freq,121,274,5,255,169,523,475,168,232,474,...,216,200,198,232,150,260,328,268,215,147


In [26]:
for col in df_cat.columns:
    print(col)

GEN_age_cat
GEN_bmi_cat
GEN_collection_timestamp
GEN_country
GEN_geo_loc_name
GEN_host_common_name
GEN_last_move
GEN_last_travel
GEN_level_of_education
GEN_race
GEN_sample_type
GEN_sex
NUT_alcohol_frequency
NUT_artificial_sweeteners
NUT_diet_type
NUT_drinks_per_session
NUT_fed_as_infant
NUT_fermented_plant_frequency
NUT_frozen_dessert_frequency
NUT_fruit_frequency
NUT_gluten
NUT_high_fat_red_meat_frequency
NUT_homecooked_meals_frequency
NUT_meat_eggs_frequency
NUT_milk_cheese_frequency
NUT_milk_substitute_frequency
NUT_olive_oil
NUT_poultry_frequency
NUT_prepared_meals_frequency
NUT_probiotic_frequency
NUT_red_meat_frequency
NUT_salted_snacks_frequency
NUT_seafood_frequency
NUT_sugary_sweets_frequency
NUT_vegetable_frequency
NUT_vitamin_b_supplement_frequency
NUT_vitamin_d_supplement_frequency
NUT_whole_eggs
NUT_whole_grain_frequency


In [73]:
md = q2.Metadata.load(data_dir + '/sample_meta_data.tsv').to_dataframe()
pd.DataFrame([str(sorted(md[col].astype(str).unique())) for col in md.columns],
             index=pd.Index(md.columns, name='Column'), columns=['Values'])

Column,Values
GEN_age_cat,"['20s', '30s', '40s', '50s', '60s', '70+', 'No..."
GEN_age_corrected,"['11.0', '14.0', '15.0', '16.0', '17.0', '18.0..."
GEN_bmi_cat,"['Normal', 'Not provided', 'Obese', 'Overweigh..."
GEN_bmi_corrected,"['11.2', '11.57', '14.34', '14.62', '14.79', '..."
GEN_cat,"['False', 'True']"
GEN_collection_timestamp,"['2014-05-10 12:00:00', '2015-01-01 09:00:00',..."
GEN_country,"['Australia', 'Belgium', 'Canada', 'Georgia', ..."
GEN_dog,"['False', 'True']"
GEN_elevation,"['-0.9', '0.0', '1.4', '10.4', '10.8', '10.9',..."
GEN_geo_loc_name,"['Australia:ACT', 'Australia:QLD', 'Australia:..."


Almost all the variables in the dataframe `df_cat` could be interesting to test in a beta diversity analysis. However, given our research goals, we focus here on age, BMI, race, and level of education.

#### **PERMANOVA testing of associations with categorical variables**

Associations between beta diversity and categorical variables can be tested statistically with PERMANOVA, a non-parametric, permutation-based test. Its null hypothesis is that the distances between samples of one group are equivalent to the distances to samples of another group. If this null hypothesis is rejected, we can infer that the distances between samples of one group differ significantly from the distances to samples in at least one other group. In QIIME 2, the `qiime diversity beta-group-significance` method performs this test and checks whether the observed categories are significantly grouped.
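To make the logic of the test concrete, here is a minimal plain-Python sketch of a one-way PERMANOVA: a pseudo-F ratio of among-group to within-group sums of squared distances, with a p-value obtained by shuffling the group labels. The names `pseudo_f` and `permanova` and the toy data are assumptions for illustration only; this is not the code QIIME 2 runs.

```python
import random
from itertools import combinations


def pseudo_f(dist, groups):
    """Pseudo-F statistic from a square distance matrix and group labels."""
    n = len(groups)
    labels = sorted(set(groups))
    # Total sum of squares: all pairwise squared distances divided by n.
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    # Within-group sum of squares, accumulated group by group.
    ss_within = 0.0
    for label in labels:
        members = [i for i, g in enumerate(groups) if g == label]
        ss_within += sum(
            dist[i][j] ** 2 for i, j in combinations(members, 2)
        ) / len(members)
    ss_among = ss_total - ss_within
    return (ss_among / (len(labels) - 1)) / (ss_within / (n - len(labels)))


def permanova(dist, groups, permutations=999, seed=0):
    """Return the observed pseudo-F and its permutation p-value."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)  # break any link between labels and distances
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy data: two well-separated clusters of five points on a line.
points = [0.0, 0.1, 0.2, 0.3, 0.4, 5.0, 5.1, 5.2, 5.3, 5.4]
toy_dist = [[abs(p - q) for q in points] for p in points]
toy_groups = ['A'] * 5 + ['B'] * 5

f_stat, p_value = permanova(toy_dist, toy_groups)
print(f_stat, p_value)
```

Because the two toy groups are clearly separated, very few label shuffles reach the observed pseudo-F, which is what a small p-value expresses.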

In [40]:
# Create the output directory relative to data_dir; do nothing if it already exists.
os.makedirs(f'{data_dir}/core-metrics-results-bd', exist_ok=True)

**Testing differences between samples according to BMI: not significant**

**Testing differences between samples according to age: significant**

In [53]:
! qiime diversity beta-group-significance \
    --i-distance-matrix $data_dir/core-metrics-results/bray_curtis_distance_matrix.qza \
    --m-metadata-file $data_dir/sample_meta_data.tsv \
    --m-metadata-column GEN_age_cat \
    --p-pairwise \
    --o-visualization $data_dir/core-metrics-results-bd/bc-age-significance.qzv

Saved Visualization to: project_data/core-metrics-results-bd/bc-age-significance.qzv

In [54]:
Visualization.load(f'{data_dir}/core-metrics-results-bd/bc-age-significance.qzv')

In [55]:
! qiime diversity beta-group-significance \
    --i-distance-matrix $data_dir/core-metrics-results/weighted_unifrac_distance_matrix.qza \
    --m-metadata-file $data_dir/sample_meta_data.tsv \
    --m-metadata-column GEN_age_cat \
    --p-pairwise \
    --o-visualization $data_dir/core-metrics-results-bd/wu_age-significance.qzv

Saved Visualization to: project_data/core-metrics-results-bd/wu_age-significance.qzv

In [57]:
Visualization.load(f'{data_dir}/core-metrics-results-bd/wu_age-significance.qzv')

**Testing differences between samples according to race: significant**

In [60]:
! qiime diversity beta-group-significance \
    --i-distance-matrix $data_dir/core-metrics-results/bray_curtis_distance_matrix.qza \
    --m-metadata-file $data_dir/sample_meta_data.tsv \
    --m-metadata-column GEN_race \
    --p-pairwise \
    --o-visualization $data_dir/core-metrics-results-bd/bc-race-significance.qzv

Saved Visualization to: project_data/core-metrics-results-bd/bc-race-significance.qzv

In [61]:
Visualization.load(f'{data_dir}/core-metrics-results-bd/bc-race-significance.qzv')

**Testing differences between samples according to level of education: not significant**

In [62]:
! qiime diversity beta-group-significance \
    --i-distance-matrix $data_dir/core-metrics-results/bray_curtis_distance_matrix.qza \
    --m-metadata-file $data_dir/sample_meta_data.tsv \
    --m-metadata-column GEN_level_of_education \
    --p-pairwise \
    --o-visualization $data_dir/core-metrics-results-bd/bc-loe-significance.qzv

Saved Visualization to: project_data/core-metrics-results-bd/bc-loe-significance.qzv

In [63]:
Visualization.load(f'{data_dir}/core-metrics-results-bd/bc-loe-significance.qzv')
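The cells above repeat the same command with a different metadata column each time. One possible way to avoid the copy-paste is to build the argument list in Python, as sketched below. The helper name `beta_group_significance_cmd` and the output file naming scheme are assumptions (they differ from the file names used above); the flags are the ones already used in this notebook. The sketch only prints the commands and leaves the actual call commented out.

```python
def beta_group_significance_cmd(column, matrix='bray_curtis',
                                data_dir='project_data'):
    """Build the CLI arguments for one beta-group-significance run."""
    out_dir = f'{data_dir}/core-metrics-results-bd'
    return [
        'qiime', 'diversity', 'beta-group-significance',
        '--i-distance-matrix',
        f'{data_dir}/core-metrics-results/{matrix}_distance_matrix.qza',
        '--m-metadata-file', f'{data_dir}/sample_meta_data.tsv',
        '--m-metadata-column', column,
        '--p-pairwise',
        # Hypothetical naming scheme: <matrix>-<column>-significance.qzv
        '--o-visualization', f'{out_dir}/{matrix}-{column}-significance.qzv',
    ]


columns = ['GEN_bmi_cat', 'GEN_age_cat', 'GEN_race', 'GEN_level_of_education']
commands = [beta_group_significance_cmd(col) for col in columns]

for cmd in commands:
    print(' '.join(cmd))
    # To actually run it where QIIME 2 is installed:
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing `matrix='weighted_unifrac'` would point the same helper at the weighted UniFrac matrix used for the age test.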

#### **Adonis implementation of PERMANOVA tests**

The `adonis` implementation of PERMANOVA (part of the r-vegan package) accepts a formula as input, which can consist of one or more independent terms. This might be useful for testing which covariates explain the most variation in our datasets.
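In R formula syntax, `a*b` is shorthand for `a + b + a:b`, so crossing variables with `*` requests every interaction as well as the main effects. The sketch below enumerates the terms such a formula expands into for the four variables of interest and, for comparison, builds an additive formula with `+` that tests main effects only. The helper name `expand_crossed_terms` is an assumption made for this example.

```python
from itertools import combinations


def expand_crossed_terms(variables):
    """List the terms that an R formula 'a*b*...' expands into."""
    terms = []
    for order in range(1, len(variables) + 1):
        # order 1 = main effects, order 2 = two-way interactions, and so on.
        for combo in combinations(variables, order):
            terms.append(':'.join(combo))
    return terms


variables = ['GEN_bmi_cat', 'GEN_age_cat', 'GEN_race', 'GEN_level_of_education']
crossed_terms = expand_crossed_terms(variables)
additive_formula = '+'.join(variables)

print(len(crossed_terms), 'terms in the fully crossed model')
print('additive alternative:', additive_formula)
```

With four categorical variables the crossed model has 15 terms, many of which may be backed by very few samples per combination of levels, so the additive form can be a more readable starting point.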

In [71]:
! qiime diversity adonis \
    --i-distance-matrix $data_dir/core-metrics-results/bray_curtis_distance_matrix.qza \
    --m-metadata-file $data_dir/sample_meta_data.tsv \
    --p-formula "GEN_bmi_cat*GEN_age_cat*GEN_race*GEN_level_of_education" \
    --o-visualization $data_dir/core-metrics-results-bd/adonis-bc-bmiageraceloe.qzv

Saved Visualization to: project_data/core-metrics-results-bd/adonis-bc-bmiageraceloe.qzv

<a id='sec3'></a>

In [72]:
Visualization.load(f'{data_dir}/core-metrics-results-bd/adonis-bc-bmiageraceloe.qzv')

To do: automatically generate the list of categories we consider of interest (check with the rest of the team whether they want to keep all categories, but propose a list first). That way we run a single adonis test for everything, screen only the terms with a p-value below 0.05, and are sure not to have missed anything.
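A possible starting point for this to-do item is sketched below: keep only the columns with a testable number of levels, then join them into one additive formula for a single adonis run. The helper `candidate_columns`, its thresholds, and the toy values are all assumptions made for illustration; they would need to be agreed on and applied to `df_cat`.

```python
def candidate_columns(metadata, min_levels=2, max_levels=10,
                      missing='Not provided'):
    """Keep columns whose number of informative levels is testable."""
    keep = []
    for column, values in metadata.items():
        # Ignore the placeholder for missing answers when counting levels.
        levels = {v for v in values if v != missing}
        if min_levels <= len(levels) <= max_levels:
            keep.append(column)
    return keep


# Made-up values standing in for df_cat; with pandas, df_cat.nunique()
# gives the raw per-column level counts directly.
toy_metadata = {
    'GEN_bmi_cat': ['Normal', 'Obese', 'Not provided', 'Normal'],
    'GEN_host_common_name': ['human', 'human', 'human', 'human'],
    'GEN_collection_timestamp': ['t1', 't2', 't3', 't4'],
    'GEN_sex': ['level_a', 'level_b', 'level_a', 'level_b'],
}

selected = candidate_columns(toy_metadata, max_levels=3)
formula = '+'.join(selected)
print(formula)
```

Constant columns such as `GEN_host_common_name` and near-unique ones such as `GEN_collection_timestamp` are dropped, since neither can define meaningful groups.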