PhyloPhlAn 3: Example 04: E. coli
Before starting, make sure that PhyloPhlAn 3 is installed and that you have already completed the first two steps of the tutorial 3. Metagenomic analysis of the Ethiopian cohort.
- Make sure PhyloPhlAn 3 scripts are executable and available in your command line
- The commands in this tutorial assume that you are inside the tutorial folder examples/04_ecoli
- All the steps below are reported in the run_04.sh script
By following these steps, the user will be able to build a high-resolution phylogeny of the E. coli genome bins found in the metagenomic analysis performed in 3. Metagenomic analysis of the Ethiopian cohort.
Retrieve the genomes that phylophlan_metagenomic has assigned to the E. coli SGB (ID 10068) and copy them into the input_references folder.
To see which genome bins have been assigned to each SGB, inspect the output_metagenomic.tsv file obtained in the first step of the example 3. Metagenomic analysis of the Ethiopian cohort.
mkdir -p input_references
for i in $(grep kSGB_10068 ../03_metagenomic/output_metagenomic.tsv | cut -f1); do
    cp -a ../03_metagenomic/input_metagenomic/"$i".fna input_references/
done
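Before moving on, it can help to confirm that the copy step picked up the genome bins. The following is a minimal sketch, not part of run_04.sh: count_fna is a hypothetical helper introduced here for illustration, and creating the logs folder is an assumption (the later commands write their logs under logs/, which may already exist in the tutorial folder).

```shell
# Hypothetical helper (not part of PhyloPhlAn): count the FASTA files
# (*.fna) found directly inside the folder given as first argument.
count_fna() {
    find "$1" -maxdepth 1 -type f -name '*.fna' | wc -l | tr -d ' '
}

# Assumption: the logs folder used by the later "tee logs/..." commands
# might not exist yet, so create it if needed (harmless if it does).
mkdir -p input_references logs

echo "Genome bins in input_references: $(count_fna input_references)"
```

The tutorial later refers to 8 assembled genomes from the Ethiopian cohort, so a count of 8 would be consistent with that if all of them come from this step.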
We will use phylophlan_setup_database to automatically retrieve the core set of UniRef90 proteins for the E. coli species.
phylophlan_setup_database \
-g s__Escherichia_coli \
--verbose 2>&1 | tee logs/phylophlan_setup_database.log
To place our Ethiopian E. coli MAGs in the context of other E. coli genomes deposited in public databases, we will use phylophlan_get_reference to automatically download 200 reference genomes from GenBank.
phylophlan_get_reference \
-g s__Escherichia_coli \
-o input_references/ \
-n 200 \
--verbose 2>&1 | tee logs/phylophlan_get_reference.log
The configuration file for this analysis can be easily generated with:
phylophlan_write_config_file \
-o references_config.cfg \
-d a \
--db_aa diamond \
--map_aa diamond \
--map_dna diamond \
--msa mafft \
--trim trimal \
--tree1 fasttree \
--tree2 raxml
Build the phylogenetic tree with:
phylophlan \
-i input_references \
-o output_references \
-d s__Escherichia_coli \
-t a \
-f references_config.cfg \
--nproc 4 \
--diversity low \
--fast \
--verbose 2>&1 | tee logs/phylophlan__s__Escherichia_coli.log
The output files will be available in the output_references folder, and the output phylogeny is the file RAxML_bestTree.output_isolates_refined.tre
Note: we use the --diversity low option because we are building a phylogenetic tree with genomes coming from a single bacterial species. To run PhyloPhlAn 3 with a different number of CPUs, modify the --nproc option as explained in the Parallel computations section.
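As an illustration of adjusting --nproc, the sketch below picks the CPU count reported by the system and falls back to 4, the value used in this tutorial. The NPROC variable is a name introduced here for illustration, and using every available CPU is an assumption; on a shared machine you may prefer a smaller value.

```shell
# Ask the system for the number of available CPUs: try nproc first
# (GNU coreutils), then getconf, and keep only the first line of output.
NPROC=$( (nproc || getconf _NPROCESSORS_ONLN) 2>/dev/null | head -n 1)

# Fall back to the tutorial's value if neither command produced output.
NPROC=${NPROC:-4}

echo "Suggested option for phylophlan: --nproc ${NPROC}"
```

The resulting value can then replace the 4 in the --nproc option of the phylophlan command above.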
The output phylogeny will look like this:
It appears, though, that one reference genome (GCA_000529265) is phylogenetically distant from the other E. coli reference genomes and MAGs.
By removing it, we can better appreciate the phylogenetic relationships between our Ethiopian E. coli MAGs and the 199 reference genomes retrieved from GenBank.
The 8 assembled genomes from the Ethiopian cohort are highlighted in purple.
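The tutorial does not state how the distant genome was removed (it may simply have been pruned from the tree). One possible approach, sketched below under that caveat, is to move the genome out of the input folder and then re-run the phylophlan command shown earlier. exclude_genome and the excluded_references folder are hypothetical names introduced here, and the assumption is that the downloaded file name starts with the accession GCA_000529265.

```shell
# Hypothetical helper (not part of PhyloPhlAn): move every file whose name
# starts with a given accession out of the input folder.
#   $1: input folder
#   $2: accession prefix
#   $3: folder where excluded genomes are kept
exclude_genome() {
    mkdir -p "$3"
    find "$1" -maxdepth 1 -type f -name "$2*" -exec mv {} "$3"/ \;
}

# Only act if the tutorial's input folder is present.
if [ -d input_references ]; then
    exclude_genome input_references GCA_000529265 excluded_references
fi
```

After this step, re-running phylophlan on input_references would build the phylogeny without the excluded genome; writing to a new output folder keeps the first result available for comparison.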