Skip to content

Strain Sharing Inference

Vitor Heidrich edited this page Feb 21, 2024 · 8 revisions

Assessment of person-to-person microbiome strain sharing with StrainPhlAn 4.0

This tutorial provides a step-by-step guide to assess person-to-person microbiome strain sharing from metagenomic samples as performed in the Valles-Colomer et al. (2023) publication.

We will use 3 species level genome bins (SGBs) as examples:

  • Bifidobacterium bifidum (SGB17256)

  • Sutterella wadsworthensis (SGB9283)

  • Ruminococcaceae SGB14311 (SGB14311)

The tutorial covers the process starting from shotgun metagenomic reads (fastq files), performing species-level profiling with MetaPhlAn, strain-level profiling with StrainPhlAn, setting species-specific operational definitions of strain, and assessment of person-to-person strain transmission.

1. Taxonomic profiling with MetaPhlAn

The first step is to profile all the metagenomic samples (N=7,646 stool samples) using MetaPhlAn. Reads of each sample are mapped to a database of marker genes: roughly, genes that are both present in each of the genomes of a species (core genes) and that are not present in genomes of other species (specific genes).

It is important here to keep the intermediate mapping results of MetaPhlAn using the -s parameter. This will provide the mapping results as a .sam file as output.

As the input used in Valles-Colomer et al. (2023) is of several terabytes, we here created a reduced input containing only the reads of the 3 SGBs that we will target in this tutorial (~15Gb).

Let's first download and uncompress the data:

wget http://cmprod1.cibio.unitn.it/strainphlan/strain_transmission_wiki/dummy_input_fastq_clean.tar.gz
tar -xvzf dummy_input_fastq_clean.tar.gz

Warning: some of the steps in this tutorial are time-consuming. Considering the number of samples to profile, we advise using HPC or GNU parallel if possible. We also provide intermediate results in the next sections in case you want to skip some of the steps.

You can install MetaPhlAn using conda as specified here: https://github.com/biobakery/MetaPhlAn/wiki/MetaPhlAn-3.0. To know more on MetaPhlan, have a look at the MetaPhlAn Wiki, which also includes a dedicated MetaPhlAn tutorial. Note that in order to be able to profile species without cultured representatives (uSGBs) as we did in the study, MetaPhlAn version 4+ is needed.

Let's now run MetaPhlAn on all the read files:

mkdir -p output_sams
mkdir -p output_bowtie2
mkdir -p output_profiles

for f in dummy_input_fastq_clean/*fastq
do
    echo "Running MetaPhlAn on ${f}"
    bn=$(basename ${f})
    id=${bn%.fastq}
    metaphlan ${f} --input_type fastq -s output_sams/${id}.sam.bz2 --bowtie2out output_bowtie2/${id}.bowtie2.bz2 -o output_profiles/${id}_profiled.tsv
done

2. Strain-level profiling with StrainPhlAn

The second step consists in running StrainPhlAn on the results of the previous step. You can also find more details on StrainPhlAn and a general tutorial in the StrainPhlAn Wiki. StrainPhlAn is included with MetaPhlAn, so it should have been installed when following the instructions to install MetaPhlAn.

2.1 Extracting the samples markers

First, markers need to be extracted from the metagenomic samples, which will be used for SNV profiling. Using the following command, a marker file containing the consensus of unique marker genes for each of the species found in the sample will be created, in the .pkl file format.

mkdir -p consensus_markers
sample2markers.py -i *.sam.bz2 -o consensus_markers -n 8

The following steps are specific to each species, so they need to be performed for every SGB that you aim to profile at the strain level.

2.2 Extracting the species marker genes

We then need to build a database containing the marker genes of our species of interest. We specify their SGB identifier with the t__ prefix.

mkdir -p db_markers
extract_markers.py -c t__SGB17256 -o db_markers/
extract_markers.py -c t__SGB9283 -o db_markers/
extract_markers.py -c t__SGB14311 -o db_markers/

The example above will generate a .fna file containing marker genes sequences for the three SGBs we are profiling at the strain level in this tutorial.

2.3 Strain-level profiling

In order to assess strain sharing in: 1- as many samples as possible, 2- in a reasonable time frame, 3- building an accurate phylogenetic tree, StrainPhlAn was called with the following parameters:

strainphlan -s consensus_markers/*.pkl \
            -m db_markers/t__SGB17256.fna \
            -o output \
            -n 8 \
            -c t__SGB17256 \
            --mutation_rates \
            --marker_in_n_samples 1 \
            --sample_with_n_markers 10 \
            --phylophlan_mode accurate
  • -s: The markers extracted for the sample of interest
  • -m: The file with the species marker genes, generated in the previous step
  • -c: The SGB ID
  • --mutation_rates: When specified, pairwise SNV rates are computed
  • --marker_in_n_samples: Minimum number of samples in which a given SGB marker should be present in order to be included in the analysis
  • --sample_with_n_markers: Minimum number of markers that a given sample should have in order to be included in the analysis
  • --phylophlan_mode: Wrapper for PhyloPhlan parameters. For details see the PhyloPhlan wiki

StrainPhlAn will get the markers of the targeted SGB from each sample, align them and build a phylogenetic tree using RAxML. We are interested in the RAxML_bestTree.[SGB_ID].StrainPhlAn4.tre Newick file, from which transmission events will be inferred based on the pairwise phylogenetic distances between samples.

To profile t__SGB9283 and t__SGB14311, just run the same command replacing t__SGB17256 after the -c argument.

Alternatively, you can download the phylogenetic trees for these 3 SGBs from the following link.

2.4 Extraction of pairwise phylogenetic distances

We will next extract the sample pairwise distances from the phylogenetic tree into a tsv file. We will use the tree_pairwisedists.py script in PyPhlAn.

tree_pairwisedists.py -n RAxML_bestTree.t__SGB17256.StrainPhlAn4.tre t__SGB17256_nGD.tsv
tree_pairwisedists.py -n RAxML_bestTree.t__SGB9283.StrainPhlAn4.tre t__SGB9283_nGD.tsv
tree_pairwisedists.py -n RAxML_bestTree.t__SGB14311.StrainPhlAn4.tre t__SGB14311_nGD.tsv

The -n parameter normalizes the distances by the total branch length, creating what we call nGD, i.e. normalized phylogenetic distance.

You can also download the pairwise distances for these 3 SGBs from the following link.

3. Setting species-specific strain identity thresholds

The process of setting species-specific strain identity thresholds based on comparison of inter- and intraindividual phylogenetic distances is explained in detail in the Methods section of Valles-Colomer et al. (2023). To summarize:

  • A total of 5 datasets of healthy subjects with longitudinal sampling was used as "training data".
  • To assess intraindividual phylogenetic distances (same strains), the pair of samples profiled at the strain level that were taken closest in time and at most 6 months apart (based on previous estimates of strain persistence) were selected for each individual.
  • To assess interindividual phylogenetic distances (different strains), pairs of samples from different datasets were used (in order to exclude pairs of samples with potential person-to-person transmission events). For a balanced set, only one sample was included for each individual.

The metadata of the samples used in this study can be retrieved from here along with the pairwise genetic distances. It includes relevant information such as sample, subject, family, and household identifiers.

In studies with other smaller datasets, the strain identity thresholds do not need to be re-defined but can be used as implemented in StrainPhlAn, so part 3 of this tutorial can be skipped.

For highly-prevalent SGBs:

For highly-prevalent SGBs (at least 50 same-individual comparisons available in the training data), the intra- and interindividual phylogenetic distance distributions can be used to identify the threshold that maximises the Youden's Index. The Youden's J statistic is a metric that captures the performance of a classification problem.

As a bound on the false discovery rate, that is, samples from unrelated individuals harboring the same strain, the 5th percentile of the interindividual phylogenetic distance distribution was used as a maximum threshold.

For lower-prevalence SGBs:

For less prevalent SGBs (less than 50 same-individual comparisons available in the training data) in which specific thresholds based on comparison of phylogenetic distances cannot be reliably computed, we set as threshold the median of the interindividual phylogenetic distance percentiles that we obtained for the highly-prevalent SGBs, which corresponds to 3%.

3.1. Example of threshold definition

We provide here an example with R code with the readr, dplyr and ggplot2 packages from the tidyverse and the cutpointr package for Youden's index computation as used in the manuscript.

# loading packages
library(readr)
library(dplyr)
library(ggplot2)
library(cutpointr)

# Reading the metadata
md <- read_tsv(file = "metadata_tutorial.tsv", col_names = T, show_col_types = F)

Ruminococcaceae SGB14311 (SGB14311)

# reading the tsv table of pairwise distances
nGD <- read_tsv(file = "t__SGB14311_nGD.tsv", col_names = F, show_col_types = F)

# Adding the metadata to the table
nGD <- left_join(nGD %>% select(sampleID_1 = X1, everything()),
                 md %>% select(sampleID_1 = sampleID,
                               subjectID_1 = subjectID,
                               family1 = family,
                               days_from_first_collection_1 = days_from_first_collection,
                               Dataset_1 = Dataset))
nGD <- left_join(nGD %>% select(sampleID_2 = X2, everything()),
                 md %>% select(sampleID_2 = sampleID,
                               subjectID_2 = subjectID,
                               family2 = family,
                               days_from_first_collection_2 = days_from_first_collection,
                               Dataset_2 = Dataset))

# Computing time difference between sample (important for longitudinal samples)
nGD$time_day_diff <- abs(nGD$days_from_first_collection_1 - nGD$days_from_first_collection_2)

# Annotating pairs of samples. Are they related? Are they from the same individual?
nGD$same_individual <- ifelse(nGD$subjectID_1 == nGD$subjectID_2, "same_individual", "different_individual")
nGD$related <- ifelse(nGD$family1 == nGD$family2, "related", "unrelated")
nGD$same_dataset <- ifelse(nGD$Dataset_1 == nGD$Dataset_2, "same_dtset", "different_dtset")

Here, following the Detection of strain sharing events part in the Methods of Valles-Colomer et al. (2023), for the intraindividual comparisons we pick one pair of samples per individual taken closest in time and no more then 6 months apart. For interindividual distances we consider a single pair of samples for each pair of individuals not belonging to the same dataset.

# Keeping only the training data
nGD_training <- nGD %>% filter(Dataset_1 %in% c("LouisS_2016","MehtaRS_2018","NielsenHB_2014_ESP","CosteaPI_2017_KAZ","HMP_2019_ibdmdb")) %>%
    filter(Dataset_2 %in% c("LouisS_2016","MehtaRS_2018","NielsenHB_2014_ESP","CosteaPI_2017_KAZ","HMP_2019_ibdmdb"))

nGD_training <- rbind(nGD_training %>% filter(same_individual == "same_individual") %>%
                 filter(time_day_diff <= 180) %>%
                 group_by(subjectID_1) %>% arrange(subjectID_1, time_day_diff) %>% slice_head(n = 1) %>% ungroup(),
             nGD_training %>% filter(same_individual != "same_individual",  related == "unrelated", same_dataset == "different_dtset" ) %>%
                 group_by(subjectID_1, subjectID_2) %>% slice_head(n = 1) %>% ungroup())

We now keep only the same-individual pairs and the unrelated pairs for Youden's index computation.

We first check if there are enough same-individual sample pairs:

table(nGD_training$same_individual)
# different_individual      same_individual
#                 5150                   74

We are here in the first case described above (highly-prevalent SGBs). We compute both the Youden's index and the 5th percentile of the unrelated individual distances.

res_youden <- cutpointr(data = nGD_training, x = X3, class = same_individual, method = maximize_metric, metric = youden)
quantile_5pc <- nGD_training %>% filter(related == "unrelated") %>% pull(X3) %>% quantile(0.05)

We can then visualize the distributions and thresholds:

library(ggplot2)
ggplot(data = nGD_training) +
    geom_density(aes(x = X3, fill = same_individual), alpha = 0.5 ) +
    geom_vline(aes(xintercept = res_youden$optimal_cutpoint, color = "youden")) +
    geom_vline(aes(xintercept = quantile_5pc, color = "quantile"), linetype = "dotted", lwd = 1) +
    theme_minimal() + xlab("StrainPhlAn nGD") + ylab("") +
    ggtitle("") +
    theme(legend.title =  element_blank(),
          legend.position = "bottom") +
    scale_color_manual(name = "statistics", values = c(youden = "blue", quantile = "red"))

nGD distribution SGB14311

In this first example, the Youden's index is lower than the 5th quantile of unrelated individual so we keep the first one as threshold (0.00237295).

Bifidobaterium bifidum (SGB17256)

We repeat the steps from the previous example, just changing the input:

nGD <- read_tsv(file = "t__SGB17256_nGD.tsv", col_names = F, show_col_types = F)

However here the number of same-individual sample pairs is less than 50:

table(nGD_training$same_individual)
# different_individual      same_individual
#                 1420                   30

In this case we take the 3rd percentile of the unrelated individual distribution:

quantile_3pc <- nGD_training %>% filter(related == "unrelated") %>% pull(X3) %>% quantile(0.03)

We get the following:

nGD distribution SGB17256

Sutterella wadsworthensis (SGB9283)

For the last example:

nGD <- read_tsv(file = "t__SGB9283_nGD.tsv", col_names = F, show_col_types = F)

We check the number of pairs of same-individual samples:

table(nGD_training$same_individual)
# different_individual      same_individual
#                 1657                   61

We have enough samples to compute the Youden's index: nGD distribution SGB9283

As the resulting value is above the 5th percentile of the phylogenetic distance distribution of the samples from unrelated individuals, we use the 5th percentile, thus taking 0.05 as a threshold.

4. Detection of strain sharing events:

Strain sharing events are then defined as pairs of samples with phylogenetic distance below the strain identity threshold for a certain SGB. We provide a script within StrainPhlAn to do this: strain_transmission.py.

The script uses the phylogenetic trees produced with StrainPhlAn together with samples metadata containing information specially on family relationships, household identifiers, and datasets.

python strain_transmission.py \
        --tree RAxML_bestTree.t__SGB17256.StrainPhlAn4.tre \
        --metadata metadata_tutorial.tsv \
        --output_dir . \
        --threshold 0.03

As strain identity threshold (--threshold parameter) we specify the optimal values we identified above, expressed as percentiles of the phylogenetic distance distribution of the samples from unrelated individuals. For SGB17256, we estimated this threshold to be the 3rd percentile, while for SGB9283 this was the 5th percentile (see above).

In the case of SGB14311, the Youden's index max defined as the best threshold. We have to convert first this index into the corresponding value in terms of quantile of pariwise distances:

prop.table(table(nGD_training %>% filter(related == "unrelated") %>% pull(X3) >= res_youden$optimal_cutpoint))
#      FALSE       TRUE
# 0.02038835 0.97961165

Therefore, 2.038835% of the pariwise distances are lower than the index. We will thus use 0.02038835 (~roughly the 2nd percentile) as a --threshold value in the script's call above for the case of SGB14311.

The output of this script is a list of all sample pairs in which strain sharing was detected. If you did not run the previous steps, here is an example of the result files for these 3 SGBs.

Final note

This tutorial followed the workflow in Valles-Colomer et al. (2023) to show how assessement of strain sharing was conducted for 3 example SGBs out of the 646 SGBs that were assessed in stool samples in the study. Person-to-person strain sharing rates between pairs of individuals were then computed, the number of strain sharing events identified between two individuals divided by the potential number of strain sharing events (computed as the number of SGBs profiled in both samples by StrainPhlAn).

The strain identity thresholds that were identified in the study are provided in a branch of StrainPhlAn. When specifying the SGB identifier with the --sgb_id parameter, pre-computed strain identity thresholds are used if available for this SGB.