# Beta Diversity Analysis, Rarefaction and Significance Tests

        

#### Notebook overview 

[1. Setup](#sus)<br>
[2. Metadata](#mdata)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[2.1 _Importing metadata in pandas_](#import)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[2.2 _Selection of categorical variables of interest_](#selcatvar)<br>
[3. Visual inspection](#visui)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[3.1 _3D PCoA plots inspection_](#3d)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[3.2 _2D plot of main PCoA axis_](#2d)<br>
[4. Statistical analysis](#statistics)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[4.1 _Non-parametric multivariate analysis of variance : PERMANOVA testing of variable_](#permanova)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[4.2 _Adonis implementation of PERMANOVA tests : Multifactor testing_](#adonis)<br>

<a id='sus'></a>
## 1. Setup

In [None]:
import os
import math
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import qiime2 as q2
import seaborn as sns
from skbio import OrdinationResults
from qiime2 import Visualization
from seaborn import scatterplot
from matplotlib.patches import Ellipse
import matplotlib.transforms as transforms

%matplotlib inline

In [None]:
data_dir = 'project_data'

In [None]:
def confidence_ellipse(x, y, ax, n_std=3.0, facecolor='none', **kwargs):
    """
    Create a plot of the covariance confidence ellipse of *x* and *y*.

    Parameters
    ----------
    x, y : array-like, shape (n, )
        Input data.

    ax : matplotlib.axes.Axes
        The axes object to draw the ellipse into.

    n_std : float
        The number of standard deviations to determine the ellipse's radii.

    **kwargs
        Forwarded to `~matplotlib.patches.Ellipse`

    Returns
    -------
    matplotlib.patches.Ellipse
    """
    if x.size != y.size:
        raise ValueError("x and y must be the same size")

    cov = np.cov(x, y)
    pearson = cov[0, 1]/np.sqrt(cov[0, 0] * cov[1, 1])
    # Using a special case to obtain the eigenvalues of this
    # two-dimensional dataset.
    ell_radius_x = np.sqrt(1 + pearson)
    ell_radius_y = np.sqrt(1 - pearson)
    ellipse = Ellipse((0, 0), width=ell_radius_x * 2, height=ell_radius_y * 2,
                      facecolor=facecolor, **kwargs)

    # Calculating the standard deviation of x from
    # the squareroot of the variance and multiplying
    # with the given number of standard deviations.
    scale_x = np.sqrt(cov[0, 0]) * n_std
    mean_x = np.mean(x)

    # calculating the standard deviation of y ...
    scale_y = np.sqrt(cov[1, 1]) * n_std
    mean_y = np.mean(y)

    transf = transforms.Affine2D() \
        .rotate_deg(45) \
        .scale(scale_x, scale_y) \
        .translate(mean_x, mean_y)

    ellipse.set_transform(transf + ax.transData)
    return ax.add_patch(ellipse)
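
The "special case" mentioned in the comments of `confidence_ellipse` can be checked numerically. The sketch below uses synthetic, positively correlated data (an assumption made for illustration; it is not project data) and confirms that the eigenvalues of the 2x2 correlation matrix are `1 + pearson` and `1 - pearson`, with the major axis at 45 degrees. It also computes the share of points that an ellipse with `n_std=3` would cover if the data were bivariate normal, which PCoA scores need not be.

```python
import numpy as np

# Synthetic, positively correlated data (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.6 * x + rng.normal(scale=0.5, size=500)

cov = np.cov(x, y)
pearson = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

# Correlation matrix of the standardized data and its eigendecomposition
corr = np.array([[1.0, pearson], [pearson, 1.0]])
eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order

# Radii from the eigenvalues versus the closed-form shortcut used in the function
radii_from_eig = np.sqrt(eigvals[::-1])
radii_shortcut = np.array([np.sqrt(1 + pearson), np.sqrt(1 - pearson)])

# Direction of the major axis, in degrees, folded into [0, 180)
major_axis = eigvecs[:, -1]
angle_deg = np.degrees(np.arctan2(major_axis[1], major_axis[0])) % 180

# Coverage of an n_std ellipse under a bivariate normal assumption
n_std = 3.0
coverage = 1 - np.exp(-n_std ** 2 / 2)

print(radii_from_eig, radii_shortcut, angle_deg, round(coverage, 4))
```

This is why the function can build a unit ellipse from the Pearson coefficient alone, rotate it by 45 degrees, and only then scale it by the standard deviations of `x` and `y`.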

Running this notebook requires outputs produced in earlier notebooks:
- the `metadata` is gathered in file A
- the `diversity computation` is done in file F

<a id='sec1'></a>

<a id='mdata'></a>
## 2. Metadata

<a id='import'></a>
### 2.1. Importing metadata in pandas

Metadata refers to the additional information on the students, which we collected in the file "sample_meta_data.tsv".

In [None]:
df_meta = pd.read_csv(f'{data_dir}/sample_meta_data.tsv', sep='\t')

<a id='selcatvar'></a>
### 2.2. Selection of categorical variables of interest

In [None]:
df_cat = df_meta.select_dtypes(include=['object'])
df_cat.describe()

In [None]:
for col in df_cat.columns:
    print(col)

In [None]:
md = q2.Metadata.load(data_dir + '/sample_meta_data.tsv').to_dataframe()
pd.DataFrame([str(sorted(md[col].astype(str).unique())) for col in md.columns],
             index=pd.Index(md.columns, name='Column'), columns=['Values'])

Almost all the variables in the dataframe `df_cat` could be interesting to test in a beta diversity analysis. For our research goals, however, we focus on age, BMI, country and level of education.
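
Before choosing variables for group tests, it can help to check how many levels each categorical column has and how small its smallest group is. The sketch below does this on a toy table; the values are invented for illustration, and `toy_meta` and `summary` are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd

# Toy metadata table (invented values, illustrative only)
toy_meta = pd.DataFrame({
    'GEN_sex': pd.Series(['female', 'male', 'female', 'male', 'female', 'Not provided'],
                         dtype=object),
    'NUT_vegetable_frequency': pd.Series(['Daily', 'Daily', 'Rarely', 'Regularly',
                                          'Daily', 'Never'], dtype=object),
    'GEN_age': [23, 31, 27, 45, 38, 29],
})

# Same selection as above: keep only the object (string) columns
toy_cat = toy_meta.select_dtypes(include=['object'])

# Number of distinct levels and size of the smallest group per column
summary = pd.DataFrame({
    'n_levels': toy_cat.nunique(),
    'smallest_group': toy_cat.apply(lambda s: s.value_counts().min()),
})
print(summary)
```

Columns with a single level, or with groups of only one or two samples, are usually poor candidates for group comparisons.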

<a id='visui'></a>
## 3. Visual inspection

Beta diversity measures how similar or dissimilar samples, or groups of samples, are to one another.
To inspect how beta diversity metrics group across metadata categories, we start with the principal coordinates analysis (PCoA) plots created with the `qiime diversity core-metrics-phylogenetic` method in file F (directory `core-metrics-results`).
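
To make the notion of a beta diversity metric concrete, here is a minimal sketch of the Bray-Curtis dissimilarity computed by hand on a toy count table. The counts are invented and the `bray_curtis` helper is written only for this illustration; the notebook itself relies on the distance matrices computed by QIIME 2.

```python
import numpy as np

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.abs(u - v).sum() / (u + v).sum()

# Rows are samples, columns are features (invented counts)
counts = np.array([
    [10, 0, 5],   # sample A
    [10, 0, 5],   # sample B: identical to A
    [0, 8, 0],    # sample C: shares no feature with A
    [5, 5, 5],    # sample D: partly overlaps with A
])

# Symmetric pairwise dissimilarity matrix
n = counts.shape[0]
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = bray_curtis(counts[i], counts[j])

print(np.round(dist, 3))
```

Identical samples get a dissimilarity of 0 and samples with no shared features get 1, which is the scale the Bray-Curtis PCoA plots are built on.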

<a id='3d'></a>
### 3.1. 3D PCoA plots inspection

In [None]:
Visualization.load(f'{data_dir}/core-metrics-results/weighted_unifrac_emperor.qzv')

In [None]:
Visualization.load(f'{data_dir}/core-metrics-results/bray_curtis_emperor.qzv')

<a id='2d'></a>
### 3.2. 2D plot of main PCoA axis

In [None]:
pcs = q2.Artifact.load(os.path.join(data_dir, 'core-metrics-results/bray_curtis_pcoa_results.qza'))
pcs = pcs.view(OrdinationResults)
pcs_data = pcs.samples.iloc[:,:3]
pcs_data.columns = ['Axis 1', 'Axis 2', 'Axis 3']

In [None]:
pcs_data.head()
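
The axes loaded above come from a principal coordinates analysis of the distance matrix. As a rough illustration of what that step does, the sketch below implements classical PCoA with NumPy (double-centering of the squared distances followed by an eigendecomposition). `pcoa_sketch` is a hypothetical helper and is not claimed to reproduce the QIIME 2 or scikit-bio output exactly, for example in how negative eigenvalues are handled.

```python
import numpy as np

def pcoa_sketch(dist, n_axes=2):
    """Classical PCoA: return sample coordinates and proportion explained."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-center the squared distances (Gower centering)
    centering = np.eye(n) - np.ones((n, n)) / n
    gower = -0.5 * centering @ (dist ** 2) @ centering
    # Eigendecomposition, sorted from largest to smallest eigenvalue
    eigvals, eigvecs = np.linalg.eigh(gower)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Negative eigenvalues (non-Euclidean distances, rounding) are set to zero
    positive = np.clip(eigvals, 0, None)
    coords = eigvecs[:, :n_axes] * np.sqrt(positive[:n_axes])
    explained = positive[:n_axes] / positive.sum()
    return coords, explained

# Toy example: four points at the corners of a 3 x 4 rectangle
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords, explained = pcoa_sketch(dist, n_axes=2)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
print(np.round(explained, 2))
```

For Euclidean input distances the coordinates reproduce the original distances exactly; for Bray-Curtis or UniFrac distances the first axes only approximate them, which is why the proportion explained per axis matters when reading the plots.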

In [None]:
pcs_data_with_md = pcs_data.join(md['NUT_vegetable_frequency'])

In [None]:
# Subset the samples by vegetable-consumption frequency (one subset per ellipse)
selNe = pcs_data_with_md.loc[pcs_data_with_md['NUT_vegetable_frequency'] == 'Never']
selRa = pcs_data_with_md.loc[pcs_data_with_md['NUT_vegetable_frequency'] == 'Rarely']
selO = pcs_data_with_md.loc[pcs_data_with_md['NUT_vegetable_frequency'] == 'Occasionally']
selRe = pcs_data_with_md.loc[pcs_data_with_md['NUT_vegetable_frequency'] == 'Regularly']
selD = pcs_data_with_md.loc[pcs_data_with_md['NUT_vegetable_frequency'] == 'Daily']
selNp = pcs_data_with_md.loc[pcs_data_with_md['NUT_vegetable_frequency'] == 'Not provided']

# Order the categories so that the legend and palette follow the frequency scale
pcs_data_with_md['NUT_vegetable_frequency'] = pd.Categorical(
    pcs_data_with_md['NUT_vegetable_frequency'],
    categories=['Never', 'Rarely', 'Occasionally', 'Regularly', 'Daily', 'Not provided'],
    ordered=True)

with sns.axes_style('whitegrid'):
    # Set the size when the figure is created so that it applies to this plot
    fig, ax = plt.subplots(figsize=(10, 10))
    sns.scatterplot(data=pcs_data_with_md, x='Axis 1', y='Axis 2', ax=ax,
                    hue='NUT_vegetable_frequency', palette='rocket')

    ax.legend(loc='center left', bbox_to_anchor=(1, 0.5), fontsize=14, title_fontsize=14)
    ax.get_legend().set_title('NUT_vegetable_frequency')

    ax.tick_params(axis='x', which='major', bottom=True)
    ax.tick_params(axis='y', which='major', left=True)

    # Pass x then y in the same order as the scatterplot (Axis 1, Axis 2)
    confidence_ellipse(selRe['Axis 1'], selRe['Axis 2'], ax, edgecolor='red')
    confidence_ellipse(selNp['Axis 1'], selNp['Axis 2'], ax, edgecolor='lightsalmon')
    confidence_ellipse(selD['Axis 1'], selD['Axis 2'], ax, edgecolor='orangered')
    confidence_ellipse(selNe['Axis 1'], selNe['Axis 2'], ax, edgecolor='black')
    confidence_ellipse(selO['Axis 1'], selO['Axis 2'], ax, edgecolor='mediumvioletred')
    confidence_ellipse(selRa['Axis 1'], selRa['Axis 2'], ax, edgecolor='purple')

    plt.savefig('spveg2.png', bbox_inches='tight', dpi=300)


<a id='statistics'></a>
## 4. Statistical analysis

<a id='permanova'></a>
### 4.1. Non-parametric multivariate analysis of variance : PERMANOVA testing of variable

Associations between beta diversity and categorical variables can be tested statistically with PERMANOVA, a non-parametric, permutation-based test. Its null hypothesis is that the distances between samples of one group are equivalent to the distances to samples of another group. If this null hypothesis is rejected, we can infer that the distances between samples of one group differ significantly from the distances to samples in at least one other group. In QIIME 2, the `qiime diversity beta-group-significance` method runs a PERMANOVA test of whether the observed categories are significantly grouped:

In [None]:
# Create the output directory under data_dir; do not fail if it already exists
os.makedirs(os.path.join(data_dir, 'core-metrics-results-bd'), exist_ok=True)

**a)** Example : variable "NUT_prepared_meals_frequency"

##### *with weighted UniFrac distance matrix*

In [None]:
! qiime diversity beta-group-significance \
    --i-distance-matrix $data_dir/core-metrics-results/weighted_unifrac_distance_matrix.qza \
    --m-metadata-file $data_dir/sample_meta_data.tsv \
    --m-metadata-column NUT_prepared_meals_frequency \
    --p-pairwise \
    --o-visualization $data_dir/core-metrics-results-bd/wu_pmf-significance.qzv

In [None]:
Visualization.load(f'{data_dir}/core-metrics-results-bd/wu_pmf-significance.qzv')

##### *with Bray-Curtis distance matrix (variable "GEN_sex")*

In [None]:
! qiime diversity beta-group-significance \
    --i-distance-matrix $data_dir/core-metrics-results/bray_curtis_distance_matrix.qza \
    --m-metadata-file $data_dir/sample_meta_data.tsv \
    --m-metadata-column GEN_sex \
    --p-pairwise \
    --o-visualization $data_dir/core-metrics-results-bd/bc_sex-significance.qzv

In [None]:
Visualization.load(f'{data_dir}/core-metrics-results-bd/bc_sex-significance.qzv')
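
To show what the PERMANOVA test computes, here is a minimal NumPy sketch of the pseudo-F statistic and its permutation p-value, run on synthetic data. `pseudo_f` and `permanova_sketch` are hypothetical helpers written for this illustration; they are not the QIIME 2 implementation, and the two point clouds are invented.

```python
import numpy as np

def pseudo_f(dist, labels):
    """Pseudo-F statistic from a square distance matrix and group labels."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    n = len(labels)
    groups = np.unique(labels)
    sq = dist ** 2
    # Total sum of squares from all pairwise squared distances
    ss_total = sq[np.triu_indices(n, k=1)].sum() / n
    # Within-group sum of squares, one contribution per group
    ss_within = 0.0
    for g in groups:
        idx = np.flatnonzero(labels == g)
        sub = sq[np.ix_(idx, idx)]
        ss_within += sub[np.triu_indices(len(idx), k=1)].sum() / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))

def permanova_sketch(dist, labels, permutations=999, seed=0):
    """Observed pseudo-F and permutation p-value (labels shuffled at random)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    observed = pseudo_f(dist, labels)
    hits = sum(pseudo_f(dist, rng.permutation(labels)) >= observed
               for _ in range(permutations))
    return observed, (hits + 1) / (permutations + 1)

# Synthetic example: two well-separated clouds of 10 points each
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0.0, 0.1, size=(10, 2)),
                    rng.normal(1.0, 0.1, size=(10, 2))])
labels = np.array(['A'] * 10 + ['B'] * 10)
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

f_obs, p_value = permanova_sketch(dist, labels)
print(f_obs, p_value)
```

With 999 permutations the smallest attainable p-value is 0.001. A significant result can also reflect differences in within-group dispersion rather than in group location, so it is worth reading alongside the PCoA plots.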

<a id='adonis'></a>
### 4.2. Adonis implementation of PERMANOVA tests : Multifactor testing

The `adonis` implementation of PERMANOVA (part of the r-vegan package) accepts a formula as input, which can consist of one or more independent terms. This might be useful for testing which covariates explain the most variation in our datasets.
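
In R formula syntax, `A*B` is shorthand for the main effects plus their interaction, `A + B + A:B`. The sketch below lists the terms implied by a formula made only of `*` operators; `expand_crossing` is a hypothetical helper written for illustration and does not parse general R formulas.

```python
from itertools import combinations

def expand_crossing(formula):
    """List the model terms implied by a formula of the form 'A*B*C'."""
    variables = [v.strip() for v in formula.split('*')]
    terms = []
    # Main effects first, then interactions of increasing order
    for order in range(1, len(variables) + 1):
        for combo in combinations(variables, order):
            terms.append(':'.join(combo))
    return terms

formula = ("NUT_milk_cheese_frequency*NUT_milk_substitute_frequency"
           "*NUT_vitamin_d_supplement_frequency")
terms = expand_crossing(formula)
for term in terms:
    print(term)
print(len(terms), 'terms')
```

Crossing k variables yields 2^k - 1 terms, and each categorical term uses several degrees of freedom, so long `*` formulas can exhaust the degrees of freedom available in a small cohort. Joining the variables with `+` instead fits main effects only.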

##### Hypothesis 1 : Milk products 

In [None]:
# testing with bray curtis matrix
! qiime diversity adonis \
    --i-distance-matrix $data_dir/core-metrics-results/bray_curtis_distance_matrix.qza \
    --m-metadata-file $data_dir/sample_meta_data.tsv \
    --p-formula "NUT_milk_cheese_frequency*NUT_milk_substitute_frequency*NUT_vitamin_d_supplement_frequency" \
    --o-visualization $data_dir/core-metrics-results-bd/adonis-bc-H1.qzv
Visualization.load(f'{data_dir}/core-metrics-results-bd/adonis-bc-H1.qzv')

In [None]:
# testing with weighted UniFrac matrix
! qiime diversity adonis \
    --i-distance-matrix $data_dir/core-metrics-results/weighted_unifrac_distance_matrix.qza \
    --m-metadata-file $data_dir/sample_meta_data.tsv \
    --p-formula "NUT_milk_cheese_frequency*NUT_milk_substitute_frequency*NUT_vitamin_d_supplement_frequency" \
    --o-visualization $data_dir/core-metrics-results-bd/adonis-wu-H1.qzv
Visualization.load(f'{data_dir}/core-metrics-results-bd/adonis-wu-H1.qzv')

#####  Hypothesis 2 : Mediterranean diet

In [None]:
# testing with bray curtis matrix
! qiime diversity adonis \
    --i-distance-matrix $data_dir/core-metrics-results/bray_curtis_distance_matrix.qza \
    --m-metadata-file $data_dir/sample_meta_data.tsv \
    --p-formula "NUT_olive_oil*NUT_seafood_frequency*NUT_vegetable_frequency*NUT_fruit_frequency*NUT_whole_grain_frequency" \
    --o-visualization $data_dir/core-metrics-results-bd/adonis-bc-H2.qzv
Visualization.load(f'{data_dir}/core-metrics-results-bd/adonis-bc-H2.qzv')

In [None]:
# testing with weighted unifrac matrix
! qiime diversity adonis \
    --i-distance-matrix $data_dir/core-metrics-results/weighted_unifrac_distance_matrix.qza \
    --m-metadata-file $data_dir/sample_meta_data.tsv \
    --p-formula "NUT_olive_oil*NUT_seafood_frequency*NUT_vegetable_frequency*NUT_fruit_frequency*NUT_whole_grain_frequency" \
    --o-visualization $data_dir/core-metrics-results-bd/adonis-wu-H2.qzv
Visualization.load(f'{data_dir}/core-metrics-results-bd/adonis-wu-H2.qzv')

#####  Hypothesis 3 : Poultry and Meat

In [None]:
# testing with bray curtis matrix
! qiime diversity adonis \
    --i-distance-matrix $data_dir/core-metrics-results/bray_curtis_distance_matrix.qza \
    --m-metadata-file $data_dir/sample_meta_data.tsv \
    --p-formula "NUT_whole_eggs*NUT_poultry_frequency*NUT_high_fat_red_meat_frequency*NUT_meat_eggs_frequency*NUT_red_meat_frequency*NUT_vitamin_b_supplement_frequency" \
    --o-visualization $data_dir/core-metrics-results-bd/adonis-bc-H3.qzv
Visualization.load(f'{data_dir}/core-metrics-results-bd/adonis-bc-H3.qzv')

In [None]:
# testing with weighted unifrac matrix
! qiime diversity adonis \
    --i-distance-matrix $data_dir/core-metrics-results/weighted_unifrac_distance_matrix.qza \
    --m-metadata-file $data_dir/sample_meta_data.tsv \
    --p-formula "NUT_whole_eggs*NUT_poultry_frequency*NUT_high_fat_red_meat_frequency*NUT_meat_eggs_frequency*NUT_red_meat_frequency*NUT_vitamin_b_supplement_frequency" \
    --o-visualization $data_dir/core-metrics-results-bd/adonis-wu-H3.qzv
Visualization.load(f'{data_dir}/core-metrics-results-bd/adonis-wu-H3.qzv')

##### Hypothesis 4 : Alcohol

In [None]:
# testing with bray curtis matrix
! qiime diversity adonis \
    --i-distance-matrix $data_dir/core-metrics-results/bray_curtis_distance_matrix.qza \
    --m-metadata-file $data_dir/sample_meta_data.tsv \
    --p-formula "NUT_alcohol_frequency*NUT_drinks_per_session" \
    --o-visualization $data_dir/core-metrics-results-bd/adonis-bc-H4.qzv
Visualization.load(f'{data_dir}/core-metrics-results-bd/adonis-bc-H4.qzv')

In [None]:
# testing with weighted unifrac matrix
! qiime diversity adonis \
    --i-distance-matrix $data_dir/core-metrics-results/weighted_unifrac_distance_matrix.qza \
    --m-metadata-file $data_dir/sample_meta_data.tsv \
    --p-formula "NUT_alcohol_frequency*NUT_drinks_per_session" \
    --o-visualization $data_dir/core-metrics-results-bd/adonis-wu-H4.qzv
Visualization.load(f'{data_dir}/core-metrics-results-bd/adonis-wu-H4.qzv')