- [MetaPhlAn] Fix NameError in conditional rounding
- [MetaPhlAn] Fix metaphlan2krona.py to work with v4 PR #239
- [MetaPhlAn] Bug when downloading newick tree #500ce07
- [MetaPhlAn] Minor bug that didn’t allow merge_vsc_tables.py to be retrocompatible with older releases #9235c65
- [MetaPhlAn] Minor bug in parameter checks when using subsampling option #b82924f
- [MetaPhlAn] When no additional taxonomy exists add empty value to avoid broken profile #5b73b7c
- [MetaPhlAn] Remove fasta file after installation and avoid viral database check if --profile_vsc is not used #8b3f425
- [MetaPhlAn] UNCLASSIFIED==100 when the profile is empty even when the --skip_unclassified_estimation option is used #5b73b7c
- [MetaPhlAn] Choose the number of digits used to round the relative abundance so that very low-abundance species with non-zero coverage are not reported with a relative abundance of zero #27e60f3
- [MetaPhlAn] metaphlan --version now also reports locally available databases PR #238
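The rounding fix above (#27e60f3) can be illustrated with a minimal sketch. It is not MetaPhlAn's actual code: the function name `digits_to_keep` and the default of 5 decimal places are assumptions made for the example. The idea is to pick enough decimal places that the smallest positive abundance survives rounding.

```python
import math


def digits_to_keep(values, min_digits=5):
    """Return a number of decimal places that keeps every positive value non-zero.

    Hypothetical helper: `min_digits` is an assumed default, not a MetaPhlAn setting.
    """
    positive = [v for v in values if v > 0]
    if not positive:
        return min_digits
    smallest = min(positive)
    # Decimal places needed so that the smallest value is at least 10**-needed.
    needed = max(0, -math.floor(math.log10(smallest)))
    return max(min_digits, needed)


abundances = [12.5, 0.000003]
digits = digits_to_keep(abundances)
rounded = [round(v, digits) for v in abundances]
```

With a fixed 5 decimal places, the second value would be rounded to zero; with the computed number of digits it stays positive.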
A new ChocoPhlAn database (“vJan25”) for MetaPhlAn and related tools has been released with MetaPhlAn 4.2. This new database includes:
- Addition of ~63k isolate genomes from NCBI, leading to a total of ~219k isolate genomes
- Addition of ~416k new MAGs, leading to a total of ~1.3M MAGs
- Expansion of the markers database with 58,331 SGBs (21,509 more than vJun23)
- Improved clustering of SGBs using skANI
- Addition of taxonomy assignment for Viral Sequence Clusters (VSCs) database performed with GeNomad
- Update SGB2GTDBtk file to r220 - allowing direct comparison between MetaPhlAn and GTDB results
- [MetaPhlAn] Support for long-read sequencing datasets:
Added “Long reads arguments” group of options:
- --long_reads to profile a long-read dataset
- --split_reads to profile long reads by splitting them and treating them as short reads
- --split_readlen to specify the length of the splits when --split_reads is selected (default is 150)
- --max_gcsd to set the threshold on gap-compressed sequence divergence to use when filtering the Minimap2 mapping results (default is 0.10)
- --minimap2_exe to set a path to the Minimap2 executable
- --minimap2_ps to change the mapping options for Minimap2 (defaults are '-x asm20 -B 3 -O 3,12')
- --nbases required to specify the total number of bases in the sample when the input is a SAM file
- [MetaPhlAn] Added information about database used to run MetaPhlAn in SAM header as @CO index:<database_index>
- [MetaPhlAn] Added implementation of RPKM to virus for short reads
- [MetaPhlAn] Added compatibility of --profile_vsc with --input_type sam
- [MetaPhlAn] Added --verbose option for debugging issues with database download
- [StrainPhlAn] Added in sample2markers.py an automatic detection of the mapper used for the alignment (Bowtie2 or Minimap2) to set the correct default values
- [StrainPhlAn] Added --max_gcsd option in sample2markers.py to set the threshold on gap-compressed sequence divergence to use when filtering markers from the Minimap2 mapping results (default is 0.10)
- [StrainPhlAn] Added --polymorphism_perc option in strainphlan.py to regulate the maximum percentage of polymorphic sites in a marker to be considered (default is None)
- [MetaPhlAn] --unclassified_estimation is now the default behaviour: the relative abundance of detected taxa is rescaled to take into account the estimated portion of unclassified reads. Use --skip_unclassified_estimation to exclude the estimate of unclassified reads from the relative abundance profile
- [MetaPhlAn] Changed parameter for specifying the folder containing the MetaPhlAn database from --bowtie2db to --db_dir
- [MetaPhlAn] Changed parameter for specifying the file in which the intermediate mapping output is saved from --bowtie2out to --mapout
- [MetaPhlAn] Changed value of --input_type from bowtie2out to mapout to specify the input is an intermediate mapping file from a previous run with --mapout (old --bowtie2out)
- [MetaPhlAn] Changed parameter for specifying output format from --biom to --biom_format_output. Now the parameter is a flag that allows the user to obtain the profiling in biom format only, instead of writing a separate output file. Also, --mdelim is now --biom_mdelim.
- [MetaPhlAn] Changed some default parameters for long reads:
- --min_mapq_val has default 5 for short reads and 50 for long reads
- --vsc_breadth has default 0.75 for short reads and 0.5 for long reads
- --subsampling takes as input the number of bases to be considered from the input metagenomes (instead of number of reads) if --long_reads is selected
- [MetaPhlAn] The MetaPhlAn species Newick trees for computing UniFrac distances in calculate_diversity.R are no longer stored in GitHub, but are downloaded with the database (or can be found here: http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/ )
- [StrainPhlAn] Changed some default parameters for long reads in sample2markers.py, depending on the mapper detected (Bowtie2 or Minimap2):
- --min_reads_aligning has default 8 for short reads and 1 for long reads
- --min_mapping_quality has default 10 for short reads and 50 for long reads
- Fixed bugs in 4.2.0 and 4.2.1
- [MetaPhlAn] Removed backward compatibility with --mpa3 which includes removal of --add_viruses
- [MetaPhlAn] Removed MetaPhlAn2 style output format --legacy-output
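For scripts written against earlier releases, the renamed options listed above could be translated with a small helper like the following. This is only a sketch: `migrate_args` is a hypothetical function, not part of MetaPhlAn, and it handles space-separated options only. Options that were removed or changed meaning (--biom, --mpa3, --add_viruses, --legacy-output) are reported instead of rewritten, since they need a manual decision.

```python
# Simple renames taken from the notes above.
RENAMED_FLAGS = {
    "--bowtie2db": "--db_dir",
    "--bowtie2out": "--mapout",
    "--mdelim": "--biom_mdelim",
}

# Options that were removed or changed meaning; left untouched and reported.
NEEDS_REVIEW = {"--biom", "--mpa3", "--add_viruses", "--legacy-output"}


def migrate_args(argv):
    """Return (migrated_argv, flags_needing_review) for an old command line.

    Hypothetical helper; the "--flag=value" form is not handled.
    """
    migrated, review = [], []
    previous = None
    for token in argv:
        if token in NEEDS_REVIEW:
            review.append(token)
            migrated.append(token)
        elif token in RENAMED_FLAGS:
            migrated.append(RENAMED_FLAGS[token])
        elif previous == "--input_type" and token == "bowtie2out":
            # The value of --input_type was renamed as well.
            migrated.append("mapout")
        else:
            migrated.append(token)
        previous = token
    return migrated, review


old = ["metaphlan", "in.fq", "--bowtie2db", "db", "--bowtie2out", "x.bz2",
       "--input_type", "bowtie2out", "--biom", "o.biom"]
new, review = migrate_args(old)
```

Here `review` lists the options that could not be translated mechanically, so the caller can decide how to replace them.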
- We just released the new vJun23_202403 and vOct22_202403 databases
- Same SGBs as for the vJun23_202307 and vOct22_202212 versions, respectively, but the NCBI taxonomy assignment has been fixed to keep the taxa consistent across the MetaPhlAn taxonomic tree, allowing accurate relative abundance estimation also at higher taxonomic levels
- [MetaPhlAn] The new fix_relab_mpa4.py script makes it possible to fix errors in the relative abundances in profiles generated with previous databases
- [MetaPhlAn] Implementation of the option --subsampling_paired [N_PAIRED_READS] to subsample paired-end input reads. It needs to be used in conjunction with -1 [FORWARD_READS_FILE] and -2 [REVERSE_READS_FILE]
- [MetaPhlAn] Fixed a bug that would halt MetaPhlAn execution when the option --profile_vsc was used but had no viral hits
- [MetaPhlAn] Fixed a bug that would halt MetaPhlAn execution when the number of reads to map was zero
- [StrainPhlAn] Fixed a bug in the new implementation (since v4.1) of --print_clades_only
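The reason --subsampling_paired needs both -1 and -2 is that mates must be kept together. A minimal sketch of paired subsampling is shown below. It is an illustration only, not MetaPhlAn's implementation: `subsample_pairs` is a hypothetical function, reads are assumed to be already parsed into two lists in matching order, and the `seed` argument is an assumption for reproducibility.

```python
import random


def subsample_pairs(forward, reverse, n_pairs, seed=None):
    """Pick the same n_pairs positions from both mate lists.

    Hypothetical helper: assumes forward[i] and reverse[i] are mates.
    """
    if len(forward) != len(reverse):
        raise ValueError("forward and reverse inputs must contain the same number of reads")
    if n_pairs >= len(forward):
        return list(forward), list(reverse)
    rng = random.Random(seed)
    # Sample positions once, then apply them to both files so pairs stay in sync.
    keep = sorted(rng.sample(range(len(forward)), n_pairs))
    return [forward[i] for i in keep], [reverse[i] for i in keep]


fwd = [f"read{i}/1" for i in range(10)]
rev = [f"read{i}/2" for i in range(10)]
sub_fwd, sub_rev = subsample_pairs(fwd, rev, 4, seed=1)
```

Sampling positions rather than reads is what guarantees that each selected forward read keeps its reverse mate.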
- We just released the new vJun23_202307 database
- Addition of ~45k reference genomes from NCBI
- Addition of ~50k MAGs from ocean, ~40k MAGs from soil, ~30k MAGs from domestic animals and non-human primates, ~4k MAGs from giant turtles, ~7.5k MAGs from skin microbiome, ~20k MAGs from dental plaque, ~15k MAGs from Asian populations, ~2.7k MAGs from ancient and modern Bolivians and other small datasets from diverse sources
- Expansion of the markers database with 36,822 SGBs (6,272 more SGBs than in vOct22)
- Inclusion of the new Viral Sequence Clusters (VSCs) database
- Containing 3,944 VSCs clustered into 1,345 Viral Sequence Groups (VSGs).
- Including a total of 45,872 representative VSGs sequences.
- Each cluster/group is labeled as known (kVSG) or unknown (uVSG) depending on the presence of at least one viral RefSeq reference genome within the cluster/group.
- [MetaPhlAn] The new --profile_vsc parameter (together with --vsc_out and --vsc_breadth) enables the profiling of viral sequence clusters.
- [MetaPhlAn] The --subsampling parameter now subsamples the FASTQ files and not the mapping results
- [MetaPhlAn] The new --mapping_subsampling parameter enables the previous mapping subsampling behaviour
- [MetaPhlAn] The new --subsampling_output parameter saves the subsampled FASTQ file
- [MetaPhlAn] The new create_toy_database.py script enables the custom filtering of the MetaPhlAn databases
- [MetaPhlAn] The average read length is included in the output header with the -t rel_ab_w_read_stats parameter
- [StrainPhlAn] Quasi-markers behaviour in line with that of MetaPhlAn
- [StrainPhlAn] sample2markers.py output is now in JSON format
- [StrainPhlAn] Simplified sample and marker filtering parameters, integrated with primary/secondary samples
- [StrainPhlAn] Faster inference of small and medium phylogenies
- [StrainPhlAn] Faster execution of the parameter --print_clades_only
- [MetaPhlAn] The GTDB taxonomic assignment for the vOct22 database is now available.
- We just released the new vOct22 database
- Addition of ~200k new genomes
- 3,580 more SGBs than the vJan21
- 2,548 genomes considered reference genomes in vJan21 were relabelled as MAGs in NCBI -> 1,550 kSGBs in vJan21 are now uSGBs in vOct22
- Removed redundant reference genomes from the vJan21 genomic database using a MASH distance threshold at 0.1%
- Local reclustering to improve SGB definitions of oversized or too-close SGBs
- Improved GGB and FGB definitions by reclustering SGB centroids from scratch
- Improved phylum assignment of SGBs with no reference genomes at FGB level using MASH distances on amino acids to find the closest kSGB
- [StrainPhlAn] Improved StrainPhlAn's speed when running with the --print_clades_only option
- [MetaPhlAn] The GTDB taxonomic assignment for the vOct22 database is not available yet (expected release: end of Feb 2023)
- [MetaPhlAn] The phylogenetic tree of life for the vOct22 database is not available yet (expected release: TBD)
- [MetaPhlAn] Download of the pre-computed Bowtie2 database is now the default option during installation
- [StrainPhlAn] Improved StrainPhlAn's sample2markers.py script performance and speed
- [StrainPhlAn] Fixes error when using --abs_n_samples_threshold in the PhyloPhlAn call
- [MetaPhlAn] Removal of the NCBI taxID from the merged profiles produced by the merge_metaphlan_profiles.py script
- [StrainPhlAn] Improved StrainPhlAn's performance in the markers/samples filtering step
- [MetaPhlAn] -t rel_ab_w_read_stats now produces the reads stats also at the SGB level
- [MetaPhlAn] Fixes overestimation of reads aligned to known clades
- [MetaPhlAn] Fixes error when not providing the number of reads using SAM files as input
- [StrainPhlAn] Fixes "No markers were found for the clade" error while executing StrainPhlAn without providing the clade markers FASTA file
- [MetaPhlAn] The new --subsampling parameter allows subsampling of reads on the fly
- [MetaPhlAn] The new --subsampling_seed parameter enables a deterministic or randomized subsampling of the reads
- [MetaPhlAn] The new --gtdb_profiles parameter of the merge_metaphlan_profiles.py script allows merging GTDB-based MetaPhlAn profiles
- [StrainPhlAn] The new --breadth_thres parameter allows StrainPhlAn to filter the consensus marker sequences after the execution of sample2markers.py
- [StrainPhlAn] Interactive selection of the available SGBs when the clade is specified at the species level
- [StrainPhlAn] The new --non_interactive parameter disables user interaction when running StrainPhlAn
- [StrainPhlAn] The new --abs_n_markers_thres and --abs_n_samples_thres parameters enable the specification of the samples/markers filtering thresholds in absolute numbers
- [StrainPhlAn] The new --treeshrink parameter enables StrainPhlAn to run TreeShrink for outlier removal in the tree
- [StrainPhlAn] Addition of the VallesColomerM_2022_Jan21_thresholds.tsv for compatibility with the mpa_vJan21 database
- [StrainPhlAn] The new --clades parameter enables sample2markers.py to restrict the reconstruction of markers to the specified clades
- [StrainPhlAn] The -c parameter of the extract_markers.py script now allows the specification of multiple clades
- [StrainPhlAn] The --print_clades_only parameter now produces an output print_clades_only.tsv report
- [StrainPhlAn] Compatibility with clade markers compressed in bz2 format
- [StrainPhlAn] The strain_transmission.py script now uses the VallesColomerM_2022_Jan21_thresholds.tsv thresholds by default
- [MetaPhlAn] metaphlan2krona.py and hclust2 have been added to the bioconda recipe
- [MetaPhlAn] The new --offline parameter stops MetaPhlAn from automatically checking for updates
- [StrainPhlAn] Improved StrainPhlAn's gaps management with the newest version of PhyloPhlAn (version 3.0.3)
- [StrainPhlAn] Improved set of colors for the plot_tree_graphlan.py script
- [MetaPhlAn] Fixes "KeyError: 't'" error when running MetaPhlAn with the --CAMI_format_output parameter
- Adoption of the species-level genome bins system (SGBs)
- New MetaPhlAn marker genes identified from ~1M microbial genomes
- Ability to profile 21,978 known (kSGBs) and 4,992 unknown (uSGBs) microbial species
- Better representation not only of the human gut microbiome but also of many other animal and ecological environments
- [MetaPhlAn] Compatibility with MetaPhlAn 3 databases with parameter --mpa3
- [MetaPhlAn] Estimation of metagenome composed by microbes not included in the database with parameter --unclassified_estimation
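One way to read the effect of --unclassified_estimation on a profile is sketched below: the abundances of detected taxa are scaled down so that, together with the unclassified share, they sum to 100. This is an illustration under stated assumptions, not MetaPhlAn's code. The function `rescale_with_unclassified` is hypothetical, and how the unclassified fraction itself is estimated is not covered here and is simply passed in.

```python
def rescale_with_unclassified(rel_abundances, unclassified_fraction):
    """Rescale percentages of detected taxa to leave room for unclassified reads.

    Hypothetical helper: `rel_abundances` maps taxon -> percentage (summing to 100),
    `unclassified_fraction` is an externally estimated value between 0 and 1.
    """
    if not 0.0 <= unclassified_fraction <= 1.0:
        raise ValueError("unclassified_fraction must be between 0 and 1")
    scale = 1.0 - unclassified_fraction
    rescaled = {taxon: value * scale for taxon, value in rel_abundances.items()}
    rescaled["UNCLASSIFIED"] = 100.0 * unclassified_fraction
    return rescaled


profile = {"taxon_A": 60.0, "taxon_B": 40.0}
rescaled = rescale_with_unclassified(profile, 0.25)
```

In this toy profile, a quarter of the sample is treated as unclassified, so the two detected taxa shrink proportionally while keeping their ratio.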
- 433 low-quality species were removed from the MetaPhlAn 3.1 marker database and 2,680 species were added (for a new total of 15,766; a 17% increase)
- Marker genes for a subset of existing bioBakery 3 species were also revised
- Most existing bioBakery 3 species pangenomes were updated with revised or expanded gene content
- [MetaPhlAn] MetaPhlAn 3.1 software has been updated to work with revised marker database
- New MetaPhlAn marker genes extracted with a newer version of ChocoPhlAn based on UniRef
- Estimation of metagenome composed by unknown microbes with parameter --unknown_estimation
- Automatic retrieval and installation of the latest MetaPhlAn database with parameter --index latest
- Virus profiling with --add_viruses
- Calculation of metagenome size for improved estimation of reads mapped to a given clade
- Inclusion of NCBI taxonomy ID in the output file
- CAMI (Taxonomic) Profiling Output Format included
- Removal of reads with low MAPQ values
- added option "marker_counts" (by Nicola)
- added min_alignment_len option to filter out short alignments in local mode. For long reads (>150) it is now recommended to use local mapping together with "--min_alignment_len 100" to filter out very short alignments. (by Tin)
- added "--samout" option to store the mapping file in SAM format (the SAM will be compressed if the extension of the specified output file ends with ".bz2") (by Tin)
- fix: MetaPhlAn2 now ignores ~300 markers that were non-specific (thanks to Eric)
- fix: Biom >= 2.0.0 has the clade IDs second and the sample ids third
- added extract_markers.py
- fix: #5; revamp biom generation; set clade IDs as enumeration
- added utils/metaphlan2krona.py
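The extension-based compression described for "--samout" above can be sketched as follows. This is a minimal illustration using Python's standard bz2 module, not the code MetaPhlAn uses; `open_sam_output` is a hypothetical helper name.

```python
import bz2


def open_sam_output(path):
    """Open `path` for text writing, compressing with bzip2 if it ends in '.bz2'.

    Hypothetical helper illustrating the extension check.
    """
    if path.endswith(".bz2"):
        return bz2.open(path, "wt")
    return open(path, "w")
```

A caller would use it as `with open_sam_output("sample.sam.bz2") as handle: handle.write(...)`, and the same call with a plain ".sam" name writes uncompressed text.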