HyAsP (Hybrid Assember for Plasmids) is a tool for the extracting plasmids from WGS assemblies in an automatic way. It combines ideas from both reference-based and depth-based methods to identify plasmids in a greedy algorithm, using information on the occurrences of known plasmid genes and considering characteristics of the contigs such as read depth and GC content.
HyAsp/ contains the source code of HyAsP, which can be installed as a package through
setup.py (see below).
databases/ provides exemplary files that can be used to construct a gene database for HyAsP,
results/ contains the results of a comparison of HyAsP with plasmidSPAdes, MOB-recon and Recycler.
HyAsP was developed and tested with the following software dependencies:
- Python (
python, version 3.5.4; packages: Bio, math, numpy, os, pandas, random, subprocess, sys)
- BLAST+ (
blastn; version 2.6.0)
- standard UNIX tools (
BLAST+ is only required for the
In order to use HyAsP as part of the provided pipeline starting from FASTQ reads, the following requirements have to be satisfied in addition:
- FastQC (
fastqc, version 0.11.5)
- sickle (
sickle, version 1.33)
- cutadapt (
cutadapt, version 1.16)
- Trim Galore (
trim_galore, version 0.4.5_dev)
- Unicycler (
unicycler, version 0.4.5)
Python and the other tools have to be in the
PATH or specified through their path options.
Get the source code from GitHub and, optionally, install HyAsP as a package:
git clone https://github.com/cchauve/hyasp.git cd hyasp python setup.py sdist pip install dist/HyAsP-1.0.0.tar.gz
Installing HyAsP as a package makes
fastq_to_plasmids.py available as
We strongly recommend installing HyAsP in a virtual environment which uses Python 3.
Subsequently, a default gene database could be built as follows:
cd databases hyasp create ncbi_database_genes.fasta -p plasmids.csv -b ncbi_blacklist.txt -d -l 500 -m 100 -t GenBank -r 2015-12-19T00:00:00Z
databases/README.md for more information on this database and a download link.
However, the gene database does not have to be created using the
create command of HyAsP or with above parameters.
Alternative: Singularity containers
In order to simplify the installation of HyAsP, especially with respect to the dependencies of the pipeline, we have packaged HyAsP (and the pipeline) also as application containers, using Singularity (version 3).
We provide two containers in
hyasp_only.img: HyAsP (and its direct dependencies) but not the pipeline.
hyasp_pipeline.img: HyAsP and all the dependencies necessary to run the pipeline.
They can also be built from the provided definition files using the following commands (assuming root privileges and that Singularity is available through the command
sudo singularity build hyasp_pipeline.img hyasp_pipeline.def sudo singularity build hyasp_only.img hyasp_only.def
The usage of HyAsP (explained below) is the same, regardless of using the tool directly or through the containers.
The commands remain the same, except that
hyasp_pipeline are replaced by
singularity run hyasp_only.img and
singularity run hyasp_pipeline.img, respectively:
# direct usage hyasp create genes.fasta -a accessions.txt # container singularity run hyasp_only.img create genes.fasta -a accessions.txt`
# direct usage hyasp_pipeline output_dir genes.fasta -s reads.fastq # pipeline singularity run hyasp_pipeline.img output_dir genes.fasta -s reads.fastq
HyAsP provides the functions to find plasmids and create necessary inputs through different commands. Below example show simple uses of the commands. See section Parameters for lists of options to change the behaviour of each command.
We also provide a minimal working example in
The notebook inside it guides the user through the different steps, either using HyAsP and the pipeline directly or using the Singularity containers.
1) Create a gene database from a collection of plasmids
Requires a list of plasmids with annotated genes or a plasmid table obtained from NCBI (with all columns).
hyasp create genes.fasta -a accessions.txt
Here, the gene database (
genes.fasta) is created by downloading the GenBank files of the plasmids given by accession number in
accessions.txt (one per line)
and extracting the genes from them.
--from_accession, -a Path to file containing one (plasmid) accession number per line OR list of accession numbers (see --from_command_line). (default: (empty string), i.e. not used) --from_genbank, -g Path to file containing path to a GenBank file per line OR list of paths (see --from_command_line). (default: (empty string), i.e. not used) --from_plasmid_table, -p Path to plasmid table downloaded from NCBI. (default: (empty string), i.e. not used) --keep_plasmids, -k Stores the plasmids underlying the gene database in FASTA format if a file is specified. (default: (empty string), i.e. deactivated) --dereplicate, -d Removes duplicate genes from database if activated. (default: False) --from_command_line, -c Instead of a file containing the accession numbers (file paths), the options -a (-g) expect a comma-separated list of accession numbers (file paths). Cannot be combined from -p. (default: False) --extend, -e Genes (and plasmids) are added to an existing database instead of overwriting it. (default: False) --released_before, -r Consider only plasmids released before the specified date. Can only be combined with -p. Date format: YYYY-MM-DDTHH:MM:SSZ, e.g. 2005-07-31T00:00:00Z. (default: (empty string), i.e. deactivated) --type, -t Build the databases from the RefSeq accession numbers (RefSeq), GenBank accession numbers (GenBank) or both (both). Affects only option -p. (default: both) --blacklist, -b Comma-separated list of accession numbers of plasmids not to be included in the databases. Cannot be combined with -e. (default: (empty string), i.e. deactivated) --min_length, -l Minimum length of plasmids to be considered for the database. (default: 0) --max_length, -L Maximum length of plasmids to be considered for the database. (default: infinity) --min_gene_length, -m Minimum length of genes to be considered for the database. (default: 0) --num_attempts, -n Maximum number of attempts to properly download a GenBank file from NCBI. (default: 25) --verbose, -v Flag activating detailed logging. (default: False)
The database is created from either accession numbers or (already downloaded) GenBank files or the NCBI plasmid table, i.e. the options
-p cannot be combined.
2) Map a collection of genes to the contigs of an assembly
Requires a gene database and a collection of contigs.
hyasp map genes.fasta gcm.csv -g assembly.gfa
Here, the gene-contig mapping
gcm.csv is determined by mapping the genes in
genes.fasta to the contigs of the assembly
--from_fasta, -f Path to the file containing the contigs (in FASTA format) to which the genes should be matched. (default: (empty string), i.e. not used) --from_gfa, -g Path to the file containing the contigs (as part of an assembly graph in GFA format) to which the genes should be matched. (default: (empty string), i.e. not used) --clean, -c Remove temporary files after the mapping has been created. (default: False) --verbose, -v Flag activating detailed logging. (default: False) --makeblastdb Path to the makeblastdb executable. (default: makeblastdb) --blastn Path to the blastn executable. (default: blastn)
The contigs are read from either a FASTA or a GFA file, i.e. either
-g have to be used.
3) Filter a gene-contig mapping
Requires a gene database and a gene-contig mapping (created with
hyasp filter genes.fasta gcm.csv filtered_gcm.csv
Here, the gene-contig mapping
gcm.csv (based on the genes in
genes.fasta) will be filtered using default thresholds
(for identity and length) and the remaining mapping is stored in
--identity_threshold, -i Minimum identity of hits retained in the mapping. (default: 0.95) --length_threshold, -l Minimum fraction of query (gene) that has be matched to keep a hit. (default: 0.95) --find_fragmented, -f Search for fragmented hits, i.e. several short high-identity hits that together satisfy the length threshold. (default: False) --verbose, -v Flag activating detailed logging. (default: False)
4) Find plasmids in an assembly graph
Requires an assembly graph (in GFA v1 format),
a gene database (see
create) and a (filtered) gene-contig mapping (
hyasp find assembly.gfa genes.fasta gcm.csv output_dir
Here, the plasmids are predicted for the assembly provided in
assembly.gfa, based on the gene-contig mapping
and the gene database provided in
--min_gene_density, -g Minimum gene density of a putative plasmid. Plasmids with a lower gene density are marked as questionable. (default: 0.3) --min_seed_gene_density, -k Minimum gene density necessary for a contig to be considered as a seed. (default: 1.5 * min_gene_density) --min_length, -l Minimum length of a putative plasmid. Shorter plasmids are marked as questionable. (default: 1500) --max_length, -L Maximum length of a putative plasmid. Gene-containing contigs longer than max_length are not used as seeds. A contig is excluded from list of potential extensions, if the combined length of the contig and the plasmid is larger than max_length. (default: 1750000) --min_read_depth, -r Minimum read depth of a contig to be able to participate in a plasmid. (default: 0.75 * min_plasmid_read_depth) --min_plasmid_read_depth, -d Minimum average read depth of a putative plasmid. Plasmids with a lower average read depth are marked as questionable. (default: 0.4 * (median read depth of input assembly graph)) --max_gc_diff, -G Maximum difference in GC content between a plasmid and a potentially added contig. (default: 0.15) --max_intermediate_contigs, -c Maximum number of gene-free contigs between two gene-containing contigs in a plasmid. A contig is excluded from the list of potential extensions, if its addition would violate this threshold. (default: 2) --max_intermediate_nt, -n Maximum total length of any consecutive sequence of gene-free contigs in a plasmid. A contig is excluded from the list of potential extensions, if its addition would violate this threshold. (default: 2000) --max_score, -s Maximum score of a potential extension. Possible extensions with a higher score are discarded. (default: infinity) --score_weights, -w Weights of the different components of the function used to score extensions. Comma-separated list of entries of the form <name>=<value>. The weight of a component is determined automatically if the question mark (?) is used as <value>. (default: depth_diff=1,gene_density=1,gc_diff=1) --keep_subplasmids, -q Do not mark plasmids whose underlying set of contigs is contained by others as questionable. If several plasmids have the same underlying set of contigs, one of them will remain in the collection of putative plasmids. (default: False) --overlap_ends, -o Minimum overlap between the two ends of a plasmid in order to mark it as circular. (default: infinity) --binning, -b Factor determining how many standard deviations the read depth and GC content of plasmids are allowed to differ from the 'centre' of their bin. Binning is activated by setting the parameter to any value different from NaN. (default: NaN, i.e. deactivated) --fanout, -f Maximum number of predecessors / successors of any contig in a 'plasmid' (or rather contig collection). Setting this parameter to any value > 1 leads to non-linear / branching contig chains. Changes which files are generated as output. Cannot be used together with probabilistic. (default: 1) --probabilistic, -p Flag changing the behaviour of the extension step to a probabilistic choice. The probability of an extension is the share of involved contig of the total read depth of all extensions. Cannot be used together with fanout. (default: False) --use_node_based, -N Flag changing the behaviour of the extension step to a node-based loop avoidance (instead of link-based). (default: False) --use_median, -u Flag activating the use of median (instead of mean) in order to compute the average read depth of a plasmid. (default: False) --verbose, -v Flag activating detailed logging (of the extension procedure). (default: False)
Lists contigs and their orientation as they appear in the linear contig chain of all plasmids (both putative and questionable). Only for
fanout = 1.
<plasmid id>;<comma-separated list of contigs with orientation>
Stores the plasmid sequences (concatenations of the (orientated) contigs) in FASTA format. The plasmid identifier is also used as the identifier of the FASTA entry. The additional information on each plasmid are provided in the deflines (e.g. seed_contig and gene_density. Only for
fanout = 1.
Format: FASTA with additional information in defline (as tab-separated list of
Lists name and orientation of all contigs for each putative resp. questionable plasmid. Only for
fanout > 1.
<plasmid id>;<comma-separated list of contigs with orientation>
Stores the contigs underlying plasmids in FASTA format. If a contig is used in negative orientation, the reverse complement of its sequence is stored in the output file. The identifier of a FASTA entry consists of the contig name and the contig identifier (separated by the
Tagged assembly graph
Stores a copy of the input assembly graph and adds colour and label information to contigs used in putative and questionable plasmids. Contigs occurring in at least one putative plasmids are blue, while those occurring one or more questionable plasmids (but no putative plasmid) are light blue. Each contig is labelled with the identifiers of the plasmids it occurs in and seed contigs also contain a
* in their label.
Format: GFA with additional (optional) tags for contigs
Lists the plasmid identifiers (of putative plasmids resp. all plasmids) grouped into the different bins. Only for
binning != NaN.
<comma-separated list of plasmid identifiers>
Pipeline from FASTQ reads to plasmids
The pipeline takes FASTQ reads and a collection of (plasmid) genes as input. The reads are preprocessed (sickle, Trim Galore) and assembled (Unicycler). The preprocessing and an analysis of the read data (FastQC) are optional. Subsequently, the genes are mapped to the assembly contigs using BLAST (blastn / megablast) and plasmids are predicted in the assembly graph using HyAsP.
The pipeline can be used as a stand-alone script or imported as a module.
The simplest usage only requires the output directory, a gene database and (unpaired) short FASTQ reads.
hyasp_pipeline output_dir genes.fasta -s reads.fastq
Paired short-read data can be used by using the
-2 options, e.g.
hyasp_pipeline output_dir genes.fasta -1 first_reads.fastq -2 second_reads.fastq
Long reads can be added (to unpaired and / or paired short-read data) through the
-l option, e.g.
hyasp_pipeline output_dir genes.fasta -s short_reads.fastq -l long_reads.fastq
Unicycler's assembly mode can be changed from
normal via the
-u option and another gene database can be specified (
hyasp_pipeline output_dir genes.fasta -s reads.fastq -u conservative
The gene-contig mapping obtained from BLAST is filtered before the greedy algorithm is used. The filtering can be influenced by changing the length and / or identity threshold, e.g.
hyasp_pipeline output_dir genes.fasta -s reads.fastq --identity_threshold 0.9 --length_threshold 0.92
The default value for both is 0.95.
In addition, the options of the greedy algorithm can be given to
hyasp_pipeline -h provides a list of the possible options.
The final output of the pipeline corresponds to the output of HyAsP described above and is stored in the
plasmids/ subdirectory of the output directory.
The results of the preprocessing and assembly step are found in the
assembly/ subdirectory of the output directory, respectively.