PhyloPhlAn 3: Example 04: E. coli

Katarina Mladenovic edited this page Apr 24, 2024 · 3 revisions

High-resolution phylogeny of genomes and MAGs of a known species (E. coli)


Before starting, make sure that PhyloPhlAn 3 is installed and that you have completed the first two steps of example 3 (Metagenomic analysis of the Ethiopian cohort).

  • Make sure the PhyloPhlAn 3 scripts are executable and available from your command line
  • The commands in this tutorial assume that you are inside the tutorial folder examples/04_ecoli
  • All the steps below are reported in the run_04.sh script

By following these steps, you will build a high-resolution phylogeny of the E. coli genome bins found in the metagenomic analysis performed in example 3 (Metagenomic analysis of the Ethiopian cohort).

Step 1. Retrieve the E. coli genome bins

Retrieve the genomes that phylophlan_metagenomic has assigned to the E. coli SGB (ID 10068) and copy them into the input_references folder. To see which genome bins were assigned to each SGB, inspect the output_metagenomic.tsv file produced in the first step of example 3 (Metagenomic analysis of the Ethiopian cohort).

mkdir -p input_references
# The first column of output_metagenomic.tsv holds the bin name
for i in $(grep kSGB_10068 ../03_metagenomic/output_metagenomic.tsv | cut -f1); do
    cp -a "../03_metagenomic/input_metagenomic/$i.fna" input_references/
done
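
As a sanity check on the loop above, the following minimal sketch counts how many bins assigned to a given SGB have no FASTA file in a folder. The function name check_sgb_bins is hypothetical (it is not part of PhyloPhlAn), and the sketch assumes that the bin name is the first tab-separated column of the table and that bin names contain no spaces.

```shell
# Hypothetical helper, not part of PhyloPhlAn.
# $1 = SGB label (e.g. kSGB_10068), $2 = metagenomic output table, $3 = folder with the .fna bins
# Prints the number of assigned bins whose .fna file is missing from the folder.
check_sgb_bins() {
    missing=0
    # -w avoids matching longer IDs such as kSGB_100680; -F treats the label literally
    for bin in $(grep -wF "$1" "$2" | cut -f1); do
        if [ ! -f "$3/$bin.fna" ]; then
            echo "missing: $3/$bin.fna" >&2
            missing=$((missing + 1))
        fi
    done
    echo "$missing"
}
```

For this example, the call would be check_sgb_bins kSGB_10068 ../03_metagenomic/output_metagenomic.tsv input_references, and a result of 0 would mean that every assigned bin was copied.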

Step 2. Generate a custom marker database from E. coli UniRef90 proteins

We will use phylophlan_setup_database to automatically retrieve the core set of UniRef90 proteins for the E. coli species.

# Create the logs folder if it does not already exist, so that tee can write to it
mkdir -p logs
phylophlan_setup_database \
    -g s__Escherichia_coli \
    --verbose 2>&1 | tee logs/phylophlan_setup_database.log

Step 3. Add E. coli reference genomes

To place our Ethiopian E. coli MAGs in the context of other E. coli genomes deposited in public databases, we will use phylophlan_get_reference to automatically download 200 reference genomes from GenBank.

phylophlan_get_reference \
    -g s__Escherichia_coli \
    -o input_references/ \
    -n 200 \
    --verbose 2>&1 | tee logs/phylophlan_get_reference.log
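
Before moving on, it can help to confirm how many genomes ended up in the input folder. The sketch below is a hypothetical helper (count_genomes is not a PhyloPhlAn command) and assumes that genome files end in .fna or .fna.gz; adjust the patterns if your files are named differently.

```shell
# Hypothetical helper, not part of PhyloPhlAn.
# $1 = folder to inspect. Prints the number of files ending in .fna or .fna.gz
# (the extensions are an assumption about how the genomes are stored).
count_genomes() {
    find "$1" -maxdepth 1 -type f \( -name '*.fna' -o -name '*.fna.gz' \) | wc -l | tr -d ' '
}
```

Running count_genomes input_references should report the MAGs copied in Step 1 plus the downloaded references: 208 if all 8 MAGs mentioned later on this page are present and all 200 downloads succeeded.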

Step 4. Generate the configuration file

The configuration file for this analysis can be easily generated with:

phylophlan_write_config_file \
    -o references_config.cfg \
    -d a \
    --db_aa diamond \
    --map_aa diamond \
    --map_dna diamond \
    --msa mafft \
    --trim trimal \
    --tree1 fasttree \
    --tree2 raxml
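
To check what was written, the generated file can be inspected with standard tools. The sketch below assumes that references_config.cfg uses an INI-style layout with one bracketed section per configured step; the helper name list_config_sections is hypothetical.

```shell
# Hypothetical helper, not part of PhyloPhlAn.
# $1 = configuration file. Prints the name of every [section] header, one per line,
# assuming an INI-style layout.
list_config_sections() {
    sed -n 's/^\[\(.*\)\]$/\1/p' "$1"
}
```

Calling list_config_sections references_config.cfg should then print one section name for each tool option passed to phylophlan_write_config_file above, if the layout assumption holds.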

Step 5. Build the phylogeny

Build the phylogenetic tree with:

phylophlan \
    -i input_references \
    -o output_references \
    -d s__Escherichia_coli \
    -t a \
    -f references_config.cfg \
    --nproc 4 \
    --diversity low \
    --fast \
    --verbose 2>&1 | tee logs/phylophlan__s__Escherichia_coli.log

The output files will be available in the output_references folder, and the output phylogeny is: RAxML_bestTree.output_isolates_refined.tre.

Note: we use the --diversity low option because we are building a phylogenetic tree with genomes coming from a single bacterial species. If you would like to run PhyloPhlAn 3 with a different number of CPUs, modify the option --nproc as explained in the Parallel computations section.
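
If you prefer not to hard-code the CPU count, one option is to read it from the system. This is a minimal sketch, not a recommendation from the PhyloPhlAn documentation: the variable name NPROC is arbitrary, and using every online CPU may not suit a shared machine.

```shell
# Number of online CPUs; falls back to 4 (the value used on this page)
# if getconf is unavailable on this system.
NPROC=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 4)
echo "Using $NPROC CPUs"
```

The value can then be passed to the command in Step 5 as --nproc "$NPROC" in place of --nproc 4.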

The output phylogeny will look like this:

[Figure: PhyloPhlAn 3, Example 04, complete E. coli phylogeny]

However, one reference genome (GCA_000529265) appears to be phylogenetically distant from the other E. coli reference genomes and MAGs. Removing it makes the phylogenetic relationships between our Ethiopian E. coli MAGs and the remaining 199 reference genomes retrieved from GenBank easier to see. The 8 assembled genomes from the Ethiopian cohort are highlighted in purple.

[Figure: PhyloPhlAn 3, Example 04, E. coli phylogeny after removing GCA_000529265]
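
This page does not say how the outlier was removed. One possible approach, sketched here under that caveat, is to set the genome file aside before building the tree again. The helper name set_aside_genome and the destination folder are hypothetical, and the sketch assumes that the file name of the reference genome starts with its accession.

```shell
# Hypothetical helper, not part of PhyloPhlAn.
# $1 = accession prefix, $2 = input folder, $3 = folder for excluded genomes
# Moves every file in $2 whose name starts with $1 into $3 and prints how many were moved.
set_aside_genome() {
    mkdir -p "$3"
    moved=0
    for f in "$2"/"$1"*; do
        [ -e "$f" ] || continue   # the pattern matched nothing
        mv "$f" "$3"/
        moved=$((moved + 1))
    done
    echo "$moved"
}
```

For example, set_aside_genome GCA_000529265 input_references excluded_references would move the outlier out of the input folder. Whether Step 5 can then simply be rerun into the same output folder, or needs a fresh one, is not covered on this page.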
