**Functional redundancy and resilience**

Examination of functional redundancy and resilience in the context of microbial biodiversity measurements:
1. use alpha rarefaction and other diagnostics to select an appropriate normalization depth
2. interpret microbial alpha and beta diversity results to compare samples based on taxonomic and phylogenetic diversity
3. compare taxonomic and functional diversity in these same samples to evaluate _functional redundancy_ in microbial communities.
4. examine longitudinal change in these biodiversity metrics to measure _resilience_ in microbial communities

* Dataset used: (loaded in file A)
* Metadata: (loaded in file A)
* Taxonomy: (loaded in file D)
* Diversity analysis: (loaded in files F and G)
* Metagenome content predicted by PICRUSt2: (loaded in file I)


<a id='setup'></a>
## 0. Setup

In [2]:
import os
import qiime2 as q2
import pandas as pd
from qiime2 import Visualization

data_dir = 'w10_data'
# create the data directory if it does not exist yet
os.makedirs(data_dir, exist_ok=True)
    
# do not increase this value!
n_jobs = 3
    
%matplotlib inline

In [2]:
# grab this week's dataset
data_url = 'https://polybox.ethz.ch/index.php/s/a0VspvcHpqINwzO/download'
! wget -nv -O $data_dir/w10_data.zip $data_url
! unzip -jq $data_dir/w10_data.zip 'W10_resilience/*' -d $data_dir
! rm $data_dir/w10_data.zip

2022-11-18 10:12:34 URL:https://polybox.ethz.ch/index.php/s/a0VspvcHpqINwzO/download [6153727] -> "w10_data/w10_data.zip" [1]


These are the files present in the `w10_data` directory that you just downloaded:
1. `child-table.qza` is your main feature table, of ASV counts per sample.
2. `metadata.tsv` contains sample metadata. You will use this below for diversity analysis and group comparisons.
3. `pathway_abundance.qza` is your feature table of predicted functional pathways.
4. `insertion-tree.qza` is a phylogenetic tree of your ASVs in case you would like to use it.
5. `filtered-table-deblur.qza` is a feature table containing both children and their mothers.

And here is a description of the sample metadata columns:
* **abx_exposure**: Recent exposure to antibiotics (`yes`/`no`, `nan`=unknown)
* **day_of_life**: Day of life of the infant (days since birth)
* **delivery**: delivery mode (`Cesarean` section or `Vaginal` birth)
* **diet**: predominant diet during first 3 months of life (`bd` = `breastmilk`, `fd` = `formula`)
* **host_subject_id**: unique ID for each infant/mother pair. Infants and their mothers share the same ID.
* **mom_or_child**: is the sample from a child (`C`) or its mother (`M`)?
* **sex**: `Female` or `Male`
* **month**: Month of life. Mothers all have the value `-1` because their stool samples were collected shortly pre-partum.
* **month_category**: same as `month` but encoded as text so that some tests interpret this as categorical metadata.

This is a summary of the feature table (including counts of samples per group) if you care to look:

**[1. Feature table summary visualization](https://view.qiime2.org/visualization/?type=html&src=https%3A%2F%2Fdl.dropbox.com%2Fs%2Fpssfwswqous1pn0%2F1.%2520child-table.qzv%3Fdl%3D1)**


I have summarized the sample metadata for you below:

In [3]:
md = q2.Metadata.load(data_dir + '/metadata.tsv').to_dataframe()
pd.DataFrame([str(sorted(md[col].astype(str).unique())) for col in md.columns],
             index=pd.Index(md.columns, name='Column'), columns=['Values'])

| Column | Values |
|---|---|
| abx_exposure | ['nan', 'no', 'yes'] |
| day_of_life | ['0.0', '1.0', '10.0', '11.0', '13.0', '166.0'... |
| delivery | ['Cesarean', 'Vaginal'] |
| diet | ['bd', 'fd', 'nan'] |
| host_subject_id | ['S1', 'S10', 'S11', 'S12', 'S14', 'S16', 'S17... |
| mom_or_child | ['C', 'M'] |
| sex | ['Female', 'Male'] |
| month | ['-1.0', '0.0', '12.0', '24.0', '6.0'] |
| month_category | ['-1.0', '0.0', '12.0', '24.0', '6.0'] |


<a id='normalize'></a>
## 1. Select an appropriate rarefaction depth

Your first challenge is to select an appropriate sequencing depth to use for normalizing your data. You should proceed as we did in week 7, examining the impact of sequencing depth on alpha diversity in each sample. (Hint: you can set `--p-iterations` and `--p-steps` to lower values than the default settings to speed up this step.)

Now it's your turn! Work as a group to select an appropriate normalization depth based on your data characteristics.

**Checkpoint 1: What do you consider an appropriate normalization depth (# of sequences per sample) for even subsampling of this dataset?**

In [None]:
# your turn: insert commands to 
# choose an appropriate normalization depth and 
# visualize the outputs

In [5]:
! qiime diversity alpha-rarefaction \
    --i-table $data_dir/child-table.qza \
    --i-phylogeny $data_dir/insertion-tree.qza \
    --p-max-depth 10000 \
    --m-metadata-file $data_dir/metadata.tsv \
    --o-visualization $data_dir/alpha-rarefaction.qzv

Saved Visualization to: w10_data/alpha-rarefaction.qzv

In [2]:
Visualization.load(f'{data_dir}/alpha-rarefaction.qzv')

An appropriate sampling depth would be around 1500: at that depth a satisfactory number of observed features is reached, with little or no loss of samples.
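The trade-off behind this choice (deeper subsampling keeps more sequences per sample but drops shallow samples) can be illustrated with a minimal sketch. The sample totals and ASV names below are hypothetical and are not taken from the ECAM data; `samples_retained` and `rarefy` are illustrative helpers, not QIIME 2 functions.

```python
import random

def samples_retained(sample_totals, depth):
    """Return the IDs of samples that have at least `depth` sequences."""
    return [sid for sid, total in sample_totals.items() if total >= depth]

def rarefy(counts, depth, seed=0):
    """Subsample one sample's feature counts to `depth`, without replacement."""
    # Expand counts into one entry per sequence, then draw `depth` of them.
    pool = [feature for feature, n in counts.items() for _ in range(n)]
    if len(pool) < depth:
        raise ValueError("sample has fewer sequences than the requested depth")
    picked = random.Random(seed).sample(pool, depth)
    rarefied = {}
    for feature in picked:
        rarefied[feature] = rarefied.get(feature, 0) + 1
    return rarefied

# Hypothetical sequencing depths per sample (assumption, for illustration only).
toy_totals = {"s1": 900, "s2": 1600, "s3": 4200, "s4": 12000}
for depth in (500, 1500, 5000):
    kept = samples_retained(toy_totals, depth)
    print(f"depth={depth}: {len(kept)} of {len(toy_totals)} samples kept")

# Hypothetical feature counts for a single sample.
toy_sample = {"asv_a": 6, "asv_b": 3, "asv_c": 1}
print(rarefy(toy_sample, 5))
```

In practice the per-sample totals would come from the feature table summary, and the depth is picked where the rarefaction curves flatten while most samples are still retained.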

<a id='diversity'></a>
## 2. Diversity analysis

Now compute the core alpha and beta diversity metrics on your data (selecting a `sampling-depth` based on your findings above) and answer the following questions (using appropriate statistical tests and visualizations):
1. _Qualitatively_ do weighted or unweighted beta diversity metrics lead to clearer differences between age groups? How do you interpret this difference? (look at PCoA plots and examine the `month` category.) 
2. Is there a significant difference in unweighted UniFrac distances between diet and delivery mode groups over time? How do you interpret this result? (run an adonis test with `month`, `delivery`, and `diet` as independent, interacting factors. See `qiime diversity adonis --help` for usage details). 

Which factor(s) are most significantly associated with differences in unweighted UniFrac distance? How do you interpret this finding?
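As a reminder of what the non-phylogenetic alpha diversity metrics measure, here is a minimal sketch of observed features, Shannon entropy, and Pielou evenness on toy count vectors. The counts are hypothetical, the use of log base 2 is an assumption of this sketch, and returning 0.0 for single-feature samples is a convention chosen here rather than a statement about any library's behavior.

```python
import math

def observed_features(counts):
    """Richness: number of features with a non-zero count."""
    return sum(1 for n in counts if n > 0)

def shannon_entropy(counts, base=2):
    """Shannon entropy of the relative abundances (log base 2 assumed here)."""
    total = sum(counts)
    proportions = [n / total for n in counts if n > 0]
    return -sum(p * math.log(p, base) for p in proportions)

def pielou_evenness(counts, base=2):
    """Shannon entropy divided by its maximum possible value, log(richness)."""
    richness = observed_features(counts)
    if richness < 2:
        return 0.0  # convention chosen for this sketch
    return shannon_entropy(counts, base) / math.log(richness, base)

# Hypothetical samples: same richness, very different evenness.
even_sample = [10, 10, 10, 10]
skewed_sample = [37, 1, 1, 1]
for name, counts in (("even", even_sample), ("skewed", skewed_sample)):
    print(name, observed_features(counts),
          round(shannon_entropy(counts), 3),
          round(pielou_evenness(counts), 3))
```

Both toy samples contain four features, so richness alone cannot tell them apart; Shannon entropy and evenness drop sharply for the sample dominated by one feature.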


In [None]:
# your turn: compute phylogenetic diversity metrics! Then view the outputs.
# Use this as the output directory name, to avoid changing some paths below: $data_dir/core-metrics-results/

In [4]:
! qiime diversity core-metrics-phylogenetic \
  --i-table $data_dir/child-table.qza \
  --i-phylogeny $data_dir/insertion-tree.qza \
  --m-metadata-file $data_dir/metadata.tsv \
  --p-sampling-depth 1500 \
  --output-dir $data_dir/core-metrics-results1500

Saved FeatureTable[Frequency] to: w10_data/core-metrics-results1500/rarefied_table.qza
Saved SampleData[AlphaDiversity] to: w10_data/core-metrics-results1500/faith_pd_vector.qza
Saved SampleData[AlphaDiversity] to: w10_data/core-metrics-results1500/observed_features_vector.qza
Saved SampleData[AlphaDiversity] to: w10_data/core-metrics-results1500/shannon_vector.qza
Saved SampleData[AlphaDiversity] to: w10_data/core-metrics-results1500/evenness_vector.qza
Saved DistanceMatrix to: w10_data/core-metrics-results1500/unweighted_unifrac_distance_matrix.qza
Saved DistanceMatrix to: w10_data/core-metrics-results1500/weighted_unifrac_distance_matrix.qza
Saved DistanceMatrix to: w10_data/core-metrics-results1500/jaccard_distance_matrix.qza
Saved DistanceMatrix to: w10_data/core-metrics-results1500/bray_curtis_distance_matrix.qza
Saved PCoAResults to: w10_data/core-metrics-results1500/unweighted_unifrac_pcoa_res

In [3]:
Visualization.load(f'{data_dir}/core-metrics-results1500/unweighted_unifrac_emperor.qzv')

In [5]:
Visualization.load(f'{data_dir}/core-metrics-results1500/weighted_unifrac_emperor.qzv')

Even though neither the weighted nor the unweighted PCoA plot shows a clear-cut separation between age groups (`month` or `month_category`), the beginnings of a segregation pattern are qualitatively clearer with the unweighted metric.
Interestingly, this suggests that weighting the beta diversity metric by feature abundance tends to blur the distinction between ages. In the unweighted PCoA plot, the older the age group, the more tightly defined and circumscribed it is (possibly reflecting resilience after different delivery modes or diets?). By contrast, the youngest age groups are the most scattered and unpredictable. One could be tempted to conclude that samples from children aged one year and over are more homogeneous in their diversity than samples from newborns. However, there also appear to be many more samples from children under 6 months old than from the other age groups. A cautious interpretation is therefore that, given the samples at our disposal, the younger age groups are more heavily represented and appear to have greater beta-diversity dispersion, and that weighting beta diversity metrics by feature abundance does not help in observing differences between age groups.
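One possible reason for a weighted metric to blur differences that an unweighted metric picks up is sketched below. It uses Jaccard (presence/absence) and Bray-Curtis (abundance-weighted) distances as simple non-phylogenetic stand-ins for unweighted and weighted UniFrac; the two communities are hypothetical and only illustrate the mechanism, they are not drawn from the dataset.

```python
def jaccard_distance(a, b):
    """Presence/absence distance: 1 - (shared features / all features)."""
    present_a = {f for f, n in a.items() if n > 0}
    present_b = {f for f, n in b.items() if n > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1 - len(present_a & present_b) / len(union)

def bray_curtis_distance(a, b):
    """Abundance-weighted distance: 1 - 2 * (shared counts) / (total counts)."""
    features = set(a) | set(b)
    shared = sum(min(a.get(f, 0), b.get(f, 0)) for f in features)
    total = sum(a.values()) + sum(b.values())
    return 1 - 2 * shared / total

# Hypothetical communities: one dominant ASV in both, but the second
# community has gained two rare ASVs.
younger = {"asv_1": 95, "asv_2": 5}
older = {"asv_1": 90, "asv_2": 4, "asv_3": 3, "asv_4": 3}

print("Jaccard:", round(jaccard_distance(younger, older), 3))
print("Bray-Curtis:", round(bray_curtis_distance(younger, older), 3))
```

When the communities differ mostly in which rare members are present, the presence/absence distance is large while the abundance-weighted distance stays small, because the shared dominant feature accounts for most of the counts.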

In [6]:
! qiime diversity adonis \
    --i-distance-matrix $data_dir/core-metrics-results1500/unweighted_unifrac_distance_matrix.qza \
    --m-metadata-file $data_dir/metadata.tsv \
    --p-formula 'month*delivery*diet' \
    --o-visualization $data_dir/AD-unweighted-mdd.qzv

Saved Visualization to: w10_data/AD-unweighted-mdd.qzv

In [7]:
Visualization.load(f'{data_dir}/AD-unweighted-mdd.qzv')

The p-value for the `month:delivery:diet` term is 0.394 > 0.05, so there is no significant difference between diet and delivery mode groups over time. My interpretation is that, with time, diet and delivery mode tend to have less influence on microbial diversity, since the child is exposed to many other sources of microbes. `month:diet` is significant, which makes sense since children's diets are often adjusted as they grow. `month:delivery` is borderline significant (p-value = 0.05), which I cannot readily explain.
The largest R2 value is observed for `month`, consistent with what we saw earlier in the PCoA plots. The residuals (variation not explained by any of the tested variables) account for 87% of the variation in the distances.
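To make the meaning of R2 and of the residual fraction concrete, here is a minimal sketch of how R2 is obtained from a distance matrix for a single grouping factor, following the usual PERMANOVA partition of sums of squared distances. It is a simplification: the adonis call above fits several interacting terms and derives p-values by permutation, neither of which is reproduced here. The distance matrix and group labels are hypothetical.

```python
from itertools import combinations

def sum_sq_distances(dist, members):
    """Sum of squared pairwise distances among `members`, divided by their count."""
    pairs = combinations(members, 2)
    return sum(dist[i][j] ** 2 for i, j in pairs) / len(members)

def permanova_r2(dist, groups):
    """R2 for one factor: share of total variation not left within groups."""
    n = len(groups)
    ss_total = sum_sq_distances(dist, range(n))
    ss_within = 0.0
    for label in sorted(set(groups)):
        members = [i for i, g in enumerate(groups) if g == label]
        ss_within += sum_sq_distances(dist, members)
    return 1 - ss_within / ss_total

# Hypothetical 4-sample distance matrix: small distances within groups,
# large distances between groups.
toy_dist = [
    [0, 1, 3, 3],
    [1, 0, 3, 3],
    [3, 3, 0, 1],
    [3, 3, 1, 0],
]
toy_groups = ["g1", "g1", "g2", "g2"]

r2 = permanova_r2(toy_dist, toy_groups)
print("R2:", round(r2, 3), "residual fraction:", round(1 - r2, 3))
```

A residual fraction close to 1, as in the adonis output discussed above, means that most of the variation in distances lies within the groups defined by the tested variables.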

<a id='function'></a>
## 3. Functional redundancy

**Note:** *from this point on I have entered all of the commands for you, as many of the commands will be new to you. Run all cells below this point and inspect the outputs; if you did everything above correctly (e.g., setting filepaths), the commands below should run. Your job will be to (a) interpret the outputs and discuss as a group in class; and (b) study these commands to understand what I have done and how you can use these in your group projects.*

Next we will look at predicted gene pathway information to compare taxonomic vs. functional diversity patterns. This time we have samples from both the infants and their mothers, but we will examine the infants' microbiota first; in section 4 we will compare them to their mothers.

We will use the `core-metrics` pipeline on the `pathway_abundance.qza` table, which consists of PICRUSt2-predicted gene pathway counts. Why don't we input a phylogeny and calculate phylogenetic metrics?

Run the commands below to examine functional diversity patterns across the infants only (we will compare to their mothers later). **Do you see the same relationships between diet, age, and beta diversity that you observed based on sequence variant abundances? Why or why not?**

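Pathways are not tips of the ASV phylogenetic tree, so phylogenetic metrics cannot be computed on the pathway table. By using PICRUSt2-predicted gene pathway counts instead, we look directly at predicted functional potential, which is our main focus here (a functional perspective on the samples).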

In [4]:
# We will look first at only the infants, not their mothers!
! qiime feature-table filter-samples \
    --i-table $data_dir/pathway_abundance.qza \
    --m-metadata-file $data_dir/metadata.tsv \
    --p-where "[mom_or_child]='C'" \
    --o-filtered-table $data_dir/pathway_abundance_child.qza

! qiime diversity core-metrics \
  --i-table $data_dir/pathway_abundance_child.qza \
  --m-metadata-file $data_dir/metadata.tsv \
  --p-sampling-depth 100000 \
  --p-n-jobs $n_jobs \
  --output-dir $data_dir/core-metrics-picrust2

! qiime diversity adonis \
  --i-distance-matrix $data_dir/core-metrics-picrust2/jaccard_distance_matrix.qza \
  --m-metadata-file $data_dir/metadata.tsv \
  --p-formula 'month*diet*delivery' \
  --o-visualization $data_dir/core-metrics-picrust2/adonis-results.qzv

Saved FeatureTable[Frequency] to: w10_data/pathway_abundance_child.qza
Saved FeatureTable[Frequency] to: w10_data/core-metrics-picrust2/rarefied_table.qza
Saved SampleData[AlphaDiversity] to: w10_data/core-metrics-picrust2/observed_features_vector.qza
Saved SampleData[AlphaDiversity] to: w10_data/core-metrics-picrust2/shannon_vector.qza
Saved SampleData[AlphaDiversity] to: w10_data/core-metrics-picrust2/evenness_vector.qza
Saved DistanceMatrix to: w10_data/core-metrics-picrust2/jaccard_distance_matrix.qza
Saved DistanceMatrix to: w10_data/core-metrics-picrust2/bray_curtis_distance_matrix.qza
Saved PCoAResults to: w10_data/core-metrics-picrust2/jaccard_pcoa_results.qza
Saved PCoAResults to: w10_data/core-metrics-picrust2/bray_curtis_pcoa_results.qza
Saved Visualization to: w10_data/core-metrics-picrust2/jaccard_emperor.qzv
Saved Visualization to: w10_data/core-metrics-picrust2/bray_curtis_

In [4]:
Visualization.load(f'{data_dir}/core-metrics-picrust2/jaccard_emperor.qzv')

In [5]:
Visualization.load(f'{data_dir}/core-metrics-picrust2/bray_curtis_emperor.qzv')

In [6]:
Visualization.load(f'{data_dir}/core-metrics-picrust2/adonis-results.qzv')

Results are very similar to what we previously got: the PCoA plots hint at distinctions between age groups without showing entirely clear differences, and the older age groups occupy more clearly defined regions than the younger ones. The adonis test shows similar results, with p-values for `month` and `month:diet` below 0.05. This time `month:delivery` is clearly non-significant (0.244), which is consistent with my earlier difficulty in explaining that result.
We could expect a relationship between beta diversity based on sequence variant abundances and predicted functional profiles, so it is no surprise to see similar yet not identical results.
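The reason taxonomic and functional results need not be identical is functional redundancy: different taxa can carry the same pathways. A minimal sketch of the idea follows; the taxon names, pathway names, and the taxon-to-pathway mapping are entirely hypothetical and are not PICRUSt2 output.

```python
def jaccard_distance(set_a, set_b):
    """Presence/absence distance between two sets of items."""
    union = set_a | set_b
    if not union:
        return 0.0
    return 1 - len(set_a & set_b) / len(union)

# Hypothetical mapping from taxa to the pathways they encode.
taxon_pathways = {
    "taxon_1": {"glycolysis", "butyrate_synthesis"},
    "taxon_2": {"folate_synthesis"},
    "taxon_3": {"glycolysis", "butyrate_synthesis"},
    "taxon_4": {"folate_synthesis", "glycolysis"},
}

def pathways_present(community):
    """Union of the pathways encoded by every taxon in the community."""
    return set().union(*(taxon_pathways[t] for t in community))

# Two hypothetical communities with no taxa in common.
community_a = {"taxon_1", "taxon_2"}
community_b = {"taxon_3", "taxon_4"}

taxonomic = jaccard_distance(community_a, community_b)
functional = jaccard_distance(pathways_present(community_a),
                              pathways_present(community_b))
print("taxonomic distance:", taxonomic)
print("functional distance:", functional)
```

In this extreme toy case the communities are maximally different taxonomically yet identical in pathway content; real communities fall somewhere in between, which is why the two ordinations look related but not the same.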

<a id='procrustes'></a>
## 3.1 Comparing ordinations

One way to compare beta diversity ordination results directly is with [Procrustes analysis](https://en.wikipedia.org/wiki/Procrustes_analysis). This method rotates and scales two ordinations to align them as best as possible. We can view the transformed PCoA coordinates together in a single plot to visually compare the ordinations. Run the code block to view the visualization and answer the following question:

1. How good does this fit look to you? Do you think that this indicates that ASV and pathway abundances are very similar or dissimilar?
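To make "rotates and scales two ordinations to align them" concrete, here is a minimal two-dimensional sketch. It is a simplification under stated assumptions: only the first two axes are used, only rotations (no reflections) are considered, and the function names are illustrative rather than part of QIIME 2. The coordinates are hypothetical.

```python
import math

def standardize(points):
    """Center the points on the origin and scale them to unit sum of squares."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    centered = [(x - cx, y - cy) for x, y in points]
    # Assumes the points are not all identical (norm would be zero).
    norm = math.sqrt(sum(x * x + y * y for x, y in centered))
    return [(x / norm, y / norm) for x, y in centered]

def procrustes_2d(reference, other):
    """Rotate and rescale `other` onto `reference`; return both and the disparity."""
    ref = standardize(reference)
    oth = standardize(other)
    dot = sum(rx * ox + ry * oy for (rx, ry), (ox, oy) in zip(ref, oth))
    cross = sum(ry * ox - rx * oy for (rx, ry), (ox, oy) in zip(ref, oth))
    theta = math.atan2(cross, dot)   # rotation angle that best aligns `oth`
    scale = math.hypot(dot, cross)   # least-squares scaling after rotation
    c, s = math.cos(theta), math.sin(theta)
    aligned = [(scale * (c * x - s * y), scale * (s * x + c * y)) for x, y in oth]
    disparity = sum((rx - ax) ** 2 + (ry - ay) ** 2
                    for (rx, ry), (ax, ay) in zip(ref, aligned))
    return ref, aligned, disparity

# Hypothetical ordination, and a copy that is rotated by 90 degrees,
# doubled in size, and shifted: the same configuration in disguise.
toy_reference = [(0, 0), (1, 0), (1, 1), (0, 2)]
toy_other = [(5 - 2 * y, 2 * x - 3) for x, y in toy_reference]

_, _, toy_disparity = procrustes_2d(toy_reference, toy_other)
print("disparity:", toy_disparity)  # close to 0 for a perfect fit
```

The disparity (sum of squared distances between matched points after alignment) is near 0 when the two ordinations share the same shape and approaches 1 when they are unrelated; in the Emperor plot the same information appears as the length of the lines joining each sample's two positions.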

In [6]:
# NOTE: you might need to change the "reference" filepath name in the first command,
# depending on the name of the output directory that you used for the core-metrics
# pipeline in Section 2.

! qiime diversity procrustes-analysis \
  --i-reference $data_dir/core-metrics-results1500/jaccard_pcoa_results.qza \
  --i-other $data_dir/core-metrics-picrust2/jaccard_pcoa_results.qza \
  --output-dir $data_dir/core-metrics-picrust2/procrustes/

! qiime emperor procrustes-plot \
  --i-reference-pcoa $data_dir/core-metrics-picrust2/procrustes/transformed_reference.qza \
  --i-other-pcoa $data_dir/core-metrics-picrust2/procrustes/transformed_other.qza \
  --m-metadata-file $data_dir/metadata.tsv \
  --o-visualization $data_dir/core-metrics-picrust2/procrustes-pcoa-plot.qzv

Saved PCoAResults to: w10_data/core-metrics-picrust2/procrustes/transformed_reference.qza
Saved PCoAResults to: w10_data/core-metrics-picrust2/procrustes/transformed_other.qza
Saved ProcrustesStatistics to: w10_data/core-metrics-picrust2/procrustes/disparity_results.qza
Saved Visualization to: w10_data/core-metrics-picrust2/procrustes-pcoa-plot.qzv

I would say that the fit looks reasonably good: the general distribution (the overall 'shape' of the PCoA plot) is conserved and, even though many samples are plotted, the differences between the two ordinations are not hard to follow in this plot (I can imagine comparison plots being much messier, with lines everywhere). For some age categories, such as the 24-month-old infants, all the data even seem to follow the same pattern.
Still, the fit could have been clearer overall. I think this indicates that ASV and pathway abundances are broadly quite similar, and that the agreement tends to improve as the age of the child increases.

In [7]:
Visualization.load(f'{data_dir}/core-metrics-picrust2/procrustes-pcoa-plot.qzv')

<a id='longitudinal'></a>
## 4. Longitudinal resilience analysis

Now let's examine resilience. As you learned from your reading, resilience measures the ability of an ecosystem to recover from disturbance, e.g., by repopulation of the species and ecosystem functions necessary in that system.

The ECAM dataset does not explicitly look at ecological disturbance, but we can examine the rate of microbial colonization after birth in different infants, as a phenomenon similar to resilience (i.e., how quickly the microbiome can stabilize after colonizing a new ecosystem, as opposed to re-stabilize in a disturbed ecosystem).

We will use the `q2-longitudinal` plugin to examine temporal dynamics in the ECAM infants, in relation to the mothers of those same infants. The mothers' microbiota serve as a baseline of stabilized adult microbiota to which the infant microbiota can be compared (to measure resilience as the rate of (re-)stabilization).

You have not used this plugin before! So I have given you the commands below, and run them for you. Your job is to inspect the output visualizations to answer the following questions. (use the `--help` option to read documentation inline.)

**Note**: Mothers can be distinguished from their children by the `mom_or_child` metadata column. Additionally, the value assigned to all mothers for `month` is `-1.0` (because their stool samples were collected shortly pre-partum) and can be used to distinguish maternal samples when looking at PCoA or line plots.

The diet metadata are interpreted as follows: `bd` means "breastmilk-dominant" and `fd` means "formula-dominant".

Run the commands below and then answer the following questions about the outputs.

**Questions:**
1. Inspect the `jaccard_emperor.qzv` plot and color points according to the `month` category. Do you see a clear pattern/trend?
2. Look at the `volatility.qzv` line plot. Do you see (qualitative) differences in observed features, Shannon diversity, and Jaccard distance PCoA `Axis 1` between `diet` groups or `delivery` groups over time? 
3. If yes, are these differences consistent over time?

In [38]:
# Run the following commands and inspect the results.
# We will look at infants and their mothers' microbiota just prior to birth
! qiime feature-table filter-samples \
    --i-table $data_dir/filtered-table-deblur.qza \
    --m-metadata-file $data_dir/metadata.tsv \
    --o-filtered-table $data_dir/filtered-table-deblur-with-mothers.qza

# Repeat our core-metrics diversity analysis from before, this time with mothers
! qiime diversity core-metrics \
  --i-table $data_dir/filtered-table-deblur-with-mothers.qza \
  --m-metadata-file $data_dir/metadata.tsv \
  --p-sampling-depth 1800 \
  --p-n-jobs $n_jobs \
  --output-dir $data_dir/core-metrics-with-mothers

# this creates an interactive line plot — useful for looking at changes in alpha and beta diversity across time
! qiime longitudinal volatility \
  --m-metadata-file $data_dir/metadata.tsv \
  --m-metadata-file $data_dir/core-metrics-with-mothers/observed_features_vector.qza \
  --m-metadata-file $data_dir/core-metrics-with-mothers/shannon_vector.qza \
  --m-metadata-file $data_dir/core-metrics-with-mothers/jaccard_pcoa_results.qza \
  --p-default-group-column 'diet' \
  --p-default-metric 'observed_features' \
  --p-state-column 'month' \
  --p-individual-id-column 'host_subject_id' \
  --o-visualization $data_dir/core-metrics-with-mothers/volatility.qzv

Saved FeatureTable[Frequency] to: w10_data/filtered-table-deblur-with-mothers.qza
Saved FeatureTable[Frequency] to: w10_data/core-metrics-with-mothers/rarefied_table.qza
Saved SampleData[AlphaDiversity] to: w10_data/core-metrics-with-mothers/observed_features_vector.qza
Saved SampleData[AlphaDiversity] to: w10_data/core-metrics-with-mothers/shannon_vector.qza
Saved SampleData[AlphaDiversity] to: w10_data/core-metrics-with-mothers/evenness_vector.qza
Saved DistanceMatrix to: w10_data/core-metrics-with-mothers/jaccard_distance_matrix.qza
Saved DistanceMatrix to: w10_data/core-metrics-with-mothers/bray_curtis_distance_matrix.qza
Saved PCoAResults to: w10_data/core-metrics-with-mothers/jaccard_pcoa_results.qza
Saved PCoAResults to: w10_data/core-metrics-with-mothers/bray_curtis_pcoa_results.qza
Saved Visualization to: w10_data/core-metrics-with-mothers/jaccard_emperor.qzv
Saved Visualization 

Visualization.load(f'{data_dir}/core-metrics-with-mothers/jaccard_emperor.qzv')

**1)** Yes, this time we can see a very clear pattern separating the different age groups in the Jaccard PCoA plot.

In [9]:
Visualization.load(f'{data_dir}/core-metrics-with-mothers/volatility.qzv')

**2)** Yes, there are differences in these metrics and variable groups over time.

**3)** These differences fluctuate with time. Even if there are differences during the first month after birth, the group lines eventually cross after several months: the diet groups reach equal values for every metric between 11 and 13 months of age, and the delivery mode groups between 3 and 8 months of age.
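The "lines crossing" reading of a volatility plot amounts to comparing group means at each time point. A minimal sketch of that computation is shown below; the records are hypothetical toy values, not values read from `volatility.qzv`, and `group_means_by_state` is an illustrative helper.

```python
from collections import defaultdict
from statistics import mean

def group_means_by_state(records, state_key, group_key, metric_key):
    """Average `metric_key` for every (state, group) combination."""
    buckets = defaultdict(list)
    for row in records:
        buckets[(row[state_key], row[group_key])].append(row[metric_key])
    return {key: mean(values) for key, values in buckets.items()}

# Hypothetical records: one row per sample.
toy_records = [
    {"month": 0, "diet": "bd", "observed_features": 10},
    {"month": 0, "diet": "bd", "observed_features": 14},
    {"month": 0, "diet": "fd", "observed_features": 20},
    {"month": 12, "diet": "bd", "observed_features": 40},
    {"month": 12, "diet": "fd", "observed_features": 38},
    {"month": 12, "diet": "fd", "observed_features": 42},
]

means = group_means_by_state(toy_records, "month", "diet", "observed_features")
for month in sorted({m for m, _ in means}):
    gap = means[(month, "bd")] - means[(month, "fd")]
    print(f"month {month}: bd - fd = {gap}")
```

A gap that shrinks toward zero (or changes sign) between consecutive time points corresponds to the group lines meeting or crossing in the plot; whether such a difference is meaningful still depends on the spread within each group.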

<a id='lme'></a>
## 4.1 Applying a statistical test to longitudinal data

We will probably not get to this in class, but this section will show you how to apply a statistical test to quantitatively answer some of the questions posed above regarding longitudinal variation and resilience. You only need to run the code below and examine the results; there are no questions for you to answer.

From our analysis above, it looks like there is an initial "disruption" in the composition of the microbiota following birth (i.e., the microbiota of infants are very dissimilar to their mothers during an initial chaotic period), and an eventual "return" to normalcy (i.e., the microbiota form a stable community that better resembles an adult gut). However, the rate of return differs between some groups, which we can view here as an indication that some groups are more "resilient" than others (or more properly they develop an adult-like microbiome and stabilize more quickly). 

Here we will use a [linear mixed effects model](https://en.wikipedia.org/wiki/Mixed_model) as a statistical test to examine individual infants' trajectories of development as a comparison of resilience. We will examine developmental trajectories in infants only, not in their mothers, so we will use our initial `core-metrics` results. We specify a formula consisting of a dependent variable (here `shannon_entropy`, i.e., Shannon diversity) and several independent variables and their interactions (`month*diet*delivery`) to test their association with variation in the dependent variable. We also specify random effects to incorporate in the model to account for individual variation in baseline and slope: a random intercept for each individual is included by default, and we specify `month` to include a random slope for each individual.
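To build intuition for what "random intercept" and "random slope" refer to, the sketch below fits a separate ordinary least-squares line to each individual's trajectory. This is deliberately not a mixed-effects model (it does no pooling across individuals and estimates no fixed effects); it only shows the per-individual baseline and rate of change that the random effects are meant to absorb. The subject IDs and values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical Shannon diversity trajectories: (months, values) per infant.
toy_trajectories = {
    "infant_1": ([0, 6, 12, 24], [1.0, 2.0, 3.0, 5.0]),
    "infant_2": ([0, 6, 12, 24], [2.0, 2.5, 3.0, 4.0]),
}

fits = {subject: fit_line(months, values)
        for subject, (months, values) in toy_trajectories.items()}
for subject, (intercept, slope) in fits.items():
    print(f"{subject}: baseline={intercept:.2f}, change per month={slope:.3f}")
```

In the mixed-effects model, these individual baselines and slopes are treated as random deviations around the population-level trend described by the fixed effects, rather than being estimated independently as done here.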

You can read more about the QIIME 2 implementation of this test, and its interpretation, [here](https://docs.qiime2.org/2022.8/tutorials/longitudinal/#linear-mixed-effect-models).

Run the code below. The results should indicate that delivery mode has a significant impact on Shannon diversity (see the `delivery[T.Vaginal]` row in the model results section), and that there is a significant interaction between delivery mode and age on Shannon diversity (see the `month:delivery[T.Vaginal]` row in the model results).


In [40]:
# NOTE: you might need to change the filepaths below, depending on the name of the
# output directory that you used for the core-metrics pipeline in Section 2. The
# filepaths to change will be the directory name `core-metrics-results`.

! qiime longitudinal linear-mixed-effects \
  --m-metadata-file $data_dir/metadata.tsv \
  --m-metadata-file $data_dir/core-metrics-results1500/shannon_vector.qza \
  --p-random-effects 'month' \
  --p-formula 'shannon_entropy~month*diet*delivery' \
  --p-state-column 'month' \
  --p-individual-id-column 'host_subject_id' \
  --o-visualization $data_dir/core-metrics-results1500/lme-shannon.qzv

Saved Visualization to: w10_data/core-metrics-results1500/lme-shannon.qzv

In [9]:
Visualization.load(f'{data_dir}/core-metrics-results1500/lme-shannon.qzv')