
Tutorial version: Metagenomic analysis of the Ethiopian cohort

Katarina Mladenovic edited this page Jun 14, 2024 · 26 revisions

Metagenomic analysis of the Ethiopian cohort

Note: This page uses a reduced-size set of MAGs and a correspondingly reduced database so that it runs within a normal tutorial window. For the full version of the commands, see the main PhyloPhlAn tutorial, specifically: 3. Metagenomic analysis of the Ethiopian cohort; 4. High-resolution phylogeny of genomes and MAGs of a known species (E. coli); 5. Phylogenetic characterization of an unknown SGB from the Proteobacteria phylum.


This tutorial will show you how to phylogenetically characterize newly assembled genomes from metagenomes in the context of Species-level Genome Bins (SGBs).

To do this we use the 50 metagenomes of the Ethiopian cohort, from which 369 MAGs were reconstructed (with >50% completeness and <5% contamination, based on CheckM).
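As a rough illustration of the quality thresholds above, the sketch below filters a small quality table with awk. The table layout (bin name, completeness, contamination) and all values are made up for this example; a real CheckM report has more columns, so the column numbers would need adjusting.

```shell
# Toy quality table (made-up values): bin name, completeness (%), contamination (%)
workdir=$(mktemp -d)
{
    printf 'bin\tcompleteness\tcontamination\n'
    printf 'bin_A\t92.1\t1.3\n'
    printf 'bin_B\t48.0\t0.5\n'
    printf 'bin_C\t75.4\t6.2\n'
} > "$workdir/quality.tsv"

# Keep bins with completeness > 50 and contamination < 5 (skip the header line)
passing_bins=$(awk -F '\t' 'NR > 1 && $2 > 50 && $3 < 5 { print $1 }' "$workdir/quality.tsv")
echo "$passing_bins"
```

Only bins passing both thresholds are printed; the same pattern can be pointed at a real quality table once its column positions are known.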

Disclaimer: For this tutorial the Ethiopian MAGs were reduced to 181, to fit the limited tutorial time and to lower the load on the tutorial VMs.

Note: Before starting, make sure to have PhyloPhlAn 3 installed.

1. Setup for PhyloPhlAn metagenomic run

Ingredients you will need to run PhyloPhlAn metagenomic include:

  1. A directory with contigs (genome bins / MAGs) from your metagenomic study (for a tutorial on the basics of assembly see here)
  2. A database of SGBs to pull annotations from (latest release from the Segata lab = SGB.Jan19)

Ingredients you will need to run PhyloPhlAn include:

  1. Reference genomes
  2. Genome bins / MAGs assigned to each phylogeny of interest
  3. Database with annotated marker genes (see Database setup)
  4. Configuration file (How to make a configuration file)

1.1 Download the Ethiopian MAGs setup files

Download the setup script from the bioBakery GitHub release:

wget https://github.com/biobakery/biobakery/releases/download/1.8/setup.sh

View the setup.sh file

less -S setup.sh

# database and data download
wget https://www.dropbox.com/s/z2v7nmosua9ty19/tutorial_ethiopia__mag2meta.tsv
wget https://www.dropbox.com/s/ktwviuvwmrf0u2l/tutorial_ethiopia__mags.tar.bz2
tar -xjf tutorial_ethiopia__mags.tar.bz2

mkdir -p phylophlan_databases/
cd phylophlan_databases/
wget https://www.dropbox.com/s/tik9yubeerq4t37/tutorial_ethiopia.md5
wget https://www.dropbox.com/s/9oey75prd2v7lfs/tutorial_ethiopia.txt.bz2
mkdir -p s__Escherichia_coli phylophlan_chlamydiae
cd s__Escherichia_coli/
wget https://www.dropbox.com/s/8quyu04fucl3dwj/s__Escherichia_coli.faa
cd ../phylophlan_chlamydiae/
wget https://www.dropbox.com/s/b1ykd7gh98n8fry/phylophlan_chlamydiae.faa

cd ..
cd ..
mkdir phylophlan_configs/
cd phylophlan_configs
wget https://github.com/biobakery/biobakery/releases/download/1.8/d_aa__i_nt.cfg

cd ..
mkdir ecoli chlamydiae
cd ecoli/
wget https://github.com/biobakery/biobakery/releases/download/1.8/ecoli_refgen.tar
tar -xf ecoli_refgen.tar
cd ../chlamydiae/
wget https://github.com/biobakery/biobakery/releases/download/1.8/chlamydiae_refgen.tar
tar -xf chlamydiae_refgen.tar


What does this code do?


Let's run the setup.sh to set up our environment to run PhyloPhlAn on metagenomic samples.

sh setup.sh

2. Running PhyloPhlAn metagenomic

2.1 Assign a taxonomic label to each bin

With the following command, we will use the SGB release of January 2020 to assign to each genome bin its closest SGB.

Reminder: this is a reduced-size database. If you want to run against the full database, please use the latest full-size edition located here.

# create the log directory used by tee below, if it does not exist yet
mkdir -p logs
phylophlan_assign_sgbs \
    -i tutorial_ethiopia/ethiopian_mags \
    -o tutorial_ethiopia/ethiopian_mags \
    --nproc 4 \
    -n 1 \
    -d ethiopia_tutorial \
    --database_folder ethiopia_tutorial_db \
    --verbose 2>&1 | tee logs/phylophlan_metagenomic.log

In this case, for each genome bin, we are interested in only the closest SGB (-n 1), which is reported in the output. If the genome bin has a Mash distance <2% from the reported SGB, we can consider that bin as part of it and transfer the SGB's taxonomic label.
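The threshold rule described above can be sketched with awk. This is only an illustration on made-up rows: it assumes that the second column has the form SGB_ID:level:taxonomy:distance, with the distance as the last colon-separated field, and that header lines start with #. Check both assumptions against your own output file before reusing it.

```shell
# Toy assignment table (made-up rows, assumed layout: bin <TAB> id:level:taxonomy:distance)
workdir=$(mktemp -d)
assignments="$workdir/toy_assignments.tsv"
{
    printf '#input_bin\tassignment\n'
    printf 'bin_A\tkSGB_10068:Species:toy_taxonomy:0.0131\n'
    printf 'bin_B\tuSGB_19436:Species:toy_taxonomy:0.0458\n'
    printf 'bin_C\tkSGB_10068:Species:toy_taxonomy:0.0199\n'
} > "$assignments"

# Print bin and SGB ID for bins whose distance is below the threshold (0.02 = 2%)
within_threshold=$(awk -F '\t' -v max_dist=0.02 '
    /^#/ { next }                       # skip header lines
    {
        n = split($2, parts, ":")       # last part is assumed to be the distance
        if (parts[n] + 0 < max_dist) print $1 "\t" parts[1]
    }' "$assignments")
echo "$within_threshold"
```

Bins listed by this filter are the ones whose SGB label could be transferred under the 2% rule; the rest would need a closer look.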

What does the output of this command look like?

less -S tutorial_ethiopia/ethiopian_mags/ethiopian_mags.tsv
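Besides paging through the table, it can help to summarize it. The sketch below counts how many bins were assigned to each SGB, most frequent first. It runs on made-up rows and assumes the SGB ID is the first colon-separated field of the second column and that header lines start with #.

```shell
# Toy assignment table (made-up rows)
workdir=$(mktemp -d)
assignments="$workdir/toy_assignments.tsv"
{
    printf '#input_bin\tassignment\n'
    printf 'bin_A\tkSGB_10068:Species:toy_taxonomy:0.0131\n'
    printf 'bin_B\tuSGB_19436:Species:toy_taxonomy:0.0458\n'
    printf 'bin_C\tkSGB_10068:Species:toy_taxonomy:0.0199\n'
} > "$assignments"

# Count bins per SGB ID, sort by count (descending), keep at most 21 rows
sgb_counts=$(grep -v '^#' "$assignments" | cut -f2 | cut -d: -f1 | sort | uniq -c | sort -rn | head -n 21)
echo "$sgb_counts"
```

To try this on the real table, replace the toy file with tutorial_ethiopia/ethiopian_mags/ethiopian_mags.tsv.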

2.2 Heatmaps of the top 21 SGBs found in the Ethiopian metagenomes

This step allows you to visualize the top 21 SGBs found in the Ethiopian metagenomes.

To be able to do this, you need to provide a mapping file that maps each genome bin to the metagenome it was assembled from. The mapping file should be a tab-separated text file where the genome bins / MAGs are listed in the first column and the corresponding metagenome in the second column.
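A quick sanity check of a mapping file in this format might look like the sketch below. The file and its bin and metagenome names are made up for illustration; the check only verifies that every line has exactly two non-empty tab-separated fields and then counts bins per metagenome.

```shell
# Toy mapping file (made-up names): bin <TAB> metagenome
workdir=$(mktemp -d)
mapping="$workdir/toy_mag2meta.tsv"
{
    printf 'bin_A\tmetagenome_1\n'
    printf 'bin_B\tmetagenome_1\n'
    printf 'bin_C\tmetagenome_2\n'
} > "$mapping"

# Count lines that do not have exactly two non-empty tab-separated fields
bad_lines=$(awk -F '\t' 'NF != 2 || $1 == "" || $2 == "" { bad++ } END { print bad + 0 }' "$mapping")
echo "malformed lines: $bad_lines"

# Number of bins per metagenome
bins_per_metagenome=$(cut -f2 "$mapping" | sort | uniq -c)
echo "$bins_per_metagenome"
```

A non-zero count of malformed lines usually points to a wrong separator (for example commas or spaces instead of tabs).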

For this example, we provide the mapping file tutorial_ethiopia__mag2meta.tsv, located inside the example folder. To view this file, run column -t -s $'\t' tutorial_ethiopia/tutorial_ethiopia__mag2meta.tsv | less -S, then press q to exit.

phylophlan_draw_metagenomic \
-i tutorial_ethiopia/ethiopian_mags/ethiopian_mags.tsv \
--map tutorial_ethiopia/tutorial_ethiopia__mag2meta.tsv \
-f png \
--verbose 2>&1 | tee phylophlan_draw_metagenomic.log

This will produce two heatmaps:

  1. The first heatmap shows, for each metagenome, the presence/absence profile of the top 21 SGBs found in the Ethiopian cohort
  2. The second heatmap shows how many uSGBs, kSGBs, and unassigned bins / MAGs are present in each metagenome
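To get a feel for the numbers behind the second heatmap, the per-metagenome counts can be approximated by joining the assignment table with the mapping file. This is a sketch on made-up data, not how phylophlan_draw_metagenomic computes its figures: it assumes SGB IDs start with kSGB or uSGB and groups anything else under "other".

```shell
# Toy inputs (made-up rows): mapping file and assignment table
workdir=$(mktemp -d)
mapping="$workdir/toy_mag2meta.tsv"
assignments="$workdir/toy_assignments.tsv"
{
    printf 'bin_A\tmetagenome_1\n'
    printf 'bin_B\tmetagenome_1\n'
    printf 'bin_C\tmetagenome_2\n'
} > "$mapping"
{
    printf '#input_bin\tassignment\n'
    printf 'bin_A\tkSGB_10068:Species:toy_taxonomy:0.0131\n'
    printf 'bin_B\tuSGB_19436:Species:toy_taxonomy:0.0458\n'
    printf 'bin_C\tkSGB_10068:Species:toy_taxonomy:0.0199\n'
} > "$assignments"

# Join on bin name and count kSGB / uSGB / other assignments per metagenome
per_metagenome=$(awk -F '\t' '
    NR == FNR { meta[$1] = $2; next }         # first file: bin -> metagenome
    /^#/ || !($1 in meta) { next }            # skip headers and unmapped bins
    {
        split($2, parts, ":")
        kind = substr(parts[1], 1, 4)         # first four characters of the SGB ID
        if (kind != "kSGB" && kind != "uSGB") kind = "other"
        count[meta[$1] "\t" kind]++
    }
    END { for (key in count) print key "\t" count[key] }
' "$mapping" "$assignments" | sort)
echo "$per_metagenome"
```

Each output row holds a metagenome, an assignment type, and the number of bins of that type.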

PhyloPhlAn 3: Example 03: Metagenomic application: presence / absence heatmap

PhyloPhlAn 3: Example 03: Metagenomic application: counter uSGBs, kSGBs, unassigned heatmap

Where do we go from here?

The SGB profiles of the Ethiopian cohort can be further analyzed by focusing on specific known and/or unknown SGBs.

For instance, if we focus on the common gut commensal Escherichia coli, we can put into phylogenetic context the 8 Ethiopian MAGs falling into kSGB 10068, as shown in 4. High-resolution phylogeny of genomes and MAGs of a known species (E. coli) or a reduced version below.

Moreover, if we focus on the most prevalent unknown SGB in the Ethiopian cohort (uSGB 19436), we can further phylogenetically characterize the 13 Ethiopian MAGs in the context of the reference genomes of the Proteobacteria phylum and the MAGs from Pasolli, E et al. Cell (2019) belonging to the same uSGB 19436, as shown in 5. Phylogenetic characterization of an unknown SGB from the Proteobacteria phylum or a reduced version below.

3. PhyloPhlAn to phylogenetically place MAGs

The configuration file for the following analyses can be easily generated with:

phylophlan_write_config_file  \
    -d a \
    -o tutorial_ethiopia/phylophlan_configs/reference_config.cfg \
    --db_aa diamond \
    --map_dna diamond \
    --map_aa diamond \
    --msa mafft \
    --trim trimal \
    --tree1 fasttree \
    --tree2 raxml \
    --verbose 2>&1 | tee phylophlan_write_config_file.log
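The configuration above refers to several external tools (diamond, mafft, trimal, fasttree, raxml). Before running the analyses it may be worth checking that they are on your PATH. The helper below is a hypothetical convenience function, and the executable names passed to it are assumptions that can differ between installations.

```shell
# Report which of the given executables are missing from PATH.
# Prints one missing name per line; prints nothing if all are found.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Executable names are assumptions; adjust them to your installation
# (for example FastTreeMP or raxmlHPC-PTHREADS-SSE3).
missing=$(missing_tools diamond mafft trimal FastTree raxmlHPC)
if [ -n "$missing" ]; then
    echo "Not found on PATH:"
    echo "$missing"
else
    echo "All listed tools were found."
fi
```

If a tool is reported as missing but is installed under another name, pass that name to missing_tools instead.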

3.1 E. coli in the Ethiopian MAGs

Study the 8 Ethiopian MAGs assigned to the common gut commensal Escherichia coli (kSGB 10068) together with 42 E. coli reference genomes.

kSGB 10068: Escherichia coli

Retrieve the genomes that phylophlan_assign_sgbs has assigned to the E. coli SGB (ID 10068) and copy them into the input folder: tutorial_ethiopia/ecoli/. To copy just those genomes into the E. coli directory, run the following command:


for b in $(grep kSGB_10068 tutorial_ethiopia/ethiopian_mags/ethiopian_mags.tsv | cut -f1); do cp tutorial_ethiopia/ethiopian_mags/${b}.fna tutorial_ethiopia/ecoli/; done
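A slightly more defensive variant of this selection is sketched below on made-up data. It compares the SGB ID exactly (so that a hypothetical ID such as kSGB_100680 is not picked up by accident) and warns about bins whose .fna file is missing instead of failing silently. It assumes the same colon-separated layout of the second column and bin names without whitespace.

```shell
# Toy setup (made-up rows and empty placeholder genome files)
workdir=$(mktemp -d)
mkdir -p "$workdir/mags" "$workdir/ecoli"
assignments="$workdir/toy_assignments.tsv"
{
    printf 'bin_A\tkSGB_10068:Species:toy_taxonomy:0.0131\n'
    printf 'bin_B\tkSGB_100680:Species:toy_taxonomy:0.0150\n'
    printf 'bin_C\tkSGB_10068:Species:toy_taxonomy:0.0199\n'
} > "$assignments"
: > "$workdir/mags/bin_A.fna"
: > "$workdir/mags/bin_B.fna"
# bin_C.fna is deliberately absent to trigger the warning

# Select bins whose SGB ID matches exactly, then copy the files that exist
awk -F '\t' -v sgb="kSGB_10068" '{ split($2, p, ":"); if (p[1] == sgb) print $1 }' "$assignments" |
while IFS= read -r bin; do
    src="$workdir/mags/$bin.fna"
    if [ -f "$src" ]; then
        cp "$src" "$workdir/ecoli/"
    else
        echo "warning: $src not found, skipping" >&2
    fi
done
copied=$(ls "$workdir/ecoli")
echo "$copied"
```

On the tutorial data, the toy paths would be replaced by tutorial_ethiopia/ethiopian_mags/ and tutorial_ethiopia/ecoli/.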

What does this directory contain?

ls -lthr tutorial_ethiopia/ecoli/

Run PhyloPhlAn to build a phylogenetic tree and check where the Ethiopian E. coli MAGs fall within the phylogeny.

phylophlan \
-i tutorial_ethiopia/ecoli/ \
-d tutorial_ethiopia/phylophlan_databases/s__Escherichia_coli \
--diversity low \
--fast \
--force_nucleotides \
-f tutorial_ethiopia/phylophlan_configs/reference_config.cfg \
-t a \
--subsample tenpercent \
--trim greedy \
--nproc 2 \
--verbose 2>&1 | tee phylophlan2_ecoli.log

You can try visualizing the tree using the ggtree script (the same way that we did in StrainPhlAn).

cd ecoli_s__Escherichia_coli
./phylophlan_ggtree.R RAxML_bestTree.ecoli.tre ecoli_concatenated.aln ecoli_tree1.png ecoli_tree2.png 

3.2 Proteobacteria phylum in the Ethiopian MAGs

Study the 10 Ethiopian MAGs assigned to the most prevalent unknown SGB (uSGB 19436) to phylogenetically characterize them in the context of all species in the Proteobacteria phylum.

uSGB 19436: Proteobacteria phylum


for b in $(grep uSGB_19436 tutorial_ethiopia/ethiopian_mags/ethiopian_mags.tsv | cut -f1); do cp tutorial_ethiopia/ethiopian_mags/${b}.fna tutorial_ethiopia/chlamydiae/; done


phylophlan \
-i tutorial_ethiopia/chlamydiae/ \
-d tutorial_ethiopia/phylophlan_databases/phylophlan_chlamydiae \
--diversity high \
--fast \
--force_nucleotides \
-f tutorial_ethiopia/phylophlan_configs/reference_config.cfg \
-t a \
--subsample tenpercent \
--trim greedy \
--nproc 2 \
--verbose 2>&1 | tee phylophlan_chlamydiae.log

You can try visualizing the tree using the ggtree script (the same way that we did in StrainPhlAn).

cd chlamydiae_phylophlan_chlamydiae
./phylophlan_ggtree.R RAxML_bestTree.chlamydiae.tre chlamydiae_concatenated.aln chlamydiae_tree1.png chlamydiae_tree2.png 
