Skip to content

Methods of Operation

Bruce Walker edited this page Jun 15, 2016 · 10 revisions

Methods of Operation

The following description of Pilon's methods of operation is an excerpt from a Pilon paper in PloS ONE (Bruce J. Walker, Thomas Abeel, Terrance Shea, Margaret Priest, Amr Abouelliel, Sharadha Sakthikumar, Christina A. Cuomo, Qiandong Zeng, Jennifer Wortman, Sarah K. Young, Ashlee M. Earl (2014) Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement. PLoS ONE 9(11): e112963. doi:10.1371/journal.pone.0112963), revised to reflect changes incorporated in more recent versions of Pilon.

Input requirements

Pilon requires an input genome in FASTA format and one or more BAM files containing sequencing reads aligned to the input genome. The BAM files must be sorted in coordinate order and indexed. For Illumina data, these BAM files are usually produced by an aligner such as BWA or Bowtie 2. Pilon can use three types of BAM files:

  • Fragments: paired read data of short insert size, typically <1 Kbp;
  • Jumps: paired read data of longer insert size, typically >1 Kbp;
  • Unpaired: unpaired sequencing read data.

To use Pilon with default arguments, read length is recommended to be 75 bases or longer and total sequence coverage should be 50x or greater, though deeper total coverage of >100x is beneficial. Pilon can also make use of longer reads, such as those from Sanger capillary sequencing, or circular-consensus or error-corrected reads from Pacific Biosciences (PacBio) sequencing. However, Pilon is not currently tuned to the error model of raw PacBio reads, and their use may introduce false corrections.

Pilon makes extensive use of pairing information when it is available, so paired libraries are highly recommended. Pilon is capable of using paired libraries of any insert size, as it scans the BAMs to compute statistics, including insert size distribution.

Improving local base accuracy

Pilon improves the local base accuracy of the contigs through analysis of the read alignment information. First, Pilon parses the alignment information from the input data and summarizes the evidence from all the reads covering each base position. Alignments can be less trustworthy near the ends of reads, especially in differentiating between indels and base changes, so Pilon ignores the alignments from a small number of bases at each end of the read, which is configurable at run time. For each base position in the genome, Pilon builds a pileup structure which records both a count and a measure of the weighted evidence for each possible base (A,C,G,T) from the read alignments. The contribution of base information from each read is weighted by the base quality reported by the sequencing instrument as well as the mapping quality computed by the aligner.

When Pilon is building pileups from paired alignment data, only reads from “valid” pairs (i.e., those with the PROPER_PAIR flag set in the BAM by the aligner, indicating the reads of a pair align in proper orientation with a plausible separation) contribute evidence to the pileups. A count of non-valid alignments covering each position is also kept to help identify areas of possible misassembly. Pilon also keeps track of “soft clipping” in the alignments, which exclude sub-sections of a read which aligned poorly. A tally of soft-clip transitions is kept at each genomic location as another aid in identifying possible local misassemblies.

From the pileup evidence, Pilon classifies each base in the input genome into one of four categories: Confirmed: the vast majority of evidence supports the base in the input genome; Changed: the vast majority of evidence supports a change of the base in the input genome to another allele; Ambiguous: the evidence supports more than one alternative at this position; Unconfirmed: there is insufficient evidence to make a determination at this position due to insufficient depth of coverage by valid reads.

Ambiguous bases can occur for several reasons. If the genome is diploid, this is expected at heterozygous polymorphic locations. Difficult-to-sequence regions may result in a large enough fraction of sequencing errors to result in an ambiguous call. Finally, if the input genome has a smaller number of copies of a repeated genomic structure than occurs in the true genome (a “collapsed repeat”), the aligned reads may have originated from more than one instance of the repeat structure; where there are differences in the true instances of the repeat, the alignments can show mixed evidence.

Paired read information, especially jumping libraries which span a longer distance, is extremely valuable in helping resolve ambiguous locations due to collapsed repeats. Pairs for which one read lands inside a repeat element, but the other lands in unique anchoring sequence on the flanks of the repeat help to resolve the true base content of the repeat structure. Such jump pairs will typically have a higher alignment mapping quality than short-range fragment pairs which lie completely within the repeat because the fragment pairs may not be able to be placed uniquely among the repeats. Since Pilon uses mapping quality to weigh the evidence from each read, the jumping pairs can often pick the correct haplotype variations of the repeat structure.

Pilon can include corrections to single-base errors in its output genome, and optionally, it can also change ambiguous bases to the allele with the preponderance of evidence or use IUPAC nucleotide codes to encode the alternative alleles.

Finding and fixing small indels

While recording the base-by-base pileup evidence, Pilon also records the location and content of indels present in the alignments. Indel alignments which represent equivalent edits to the input genome may appear at different coordinates in the alignments. For instance, if the input genome has the sequence ACCCCT, but the read evidence suggests one of the Cs should be deleted (ACCCT), each individual read alignment may show a deletion at any of the four C coordinates. Pilon shifts alignment indels to their leftmost equivalent edit in the input genome, so that the evidence from all the equivalent edits is combined into evidence for a single event at a one location.

Pilon makes an insertion or deletion call if a majority of the valid reads support the change, though that threshold is lowered somewhat for longer events, as it is typically more difficult for aligners to identify longer indels in short read data. Called indels from the input genome are fixed in Pilon’s output genome.

Fixing misassemblies, detecting large indels, and gap filling

Pilon is capable of reassembling local regions of the genome when there is sufficient evidence from the alignments that the contiguity of the input genome does not match the sequencing data. For assembly improvement applications, this could be an indication of a local misassembly. For variant calling applications, this could be caused by insertions or deletions too large to be reflected in the short read alignments.

Pilon tries to identify areas of potential local read alignment discontiguity in the contigs of the input genome by employing four heuristics: (i) a large percentage of reads containing a soft-clipped alignment at a given base position, (ii) a large ratio of invalid pairs to valid pairs spanning a location, (iii) areas of extremely low coverage and (iv) rapid drops in alignment coverage over a distance on the order of a read length. Once Pilon has identified an area for local reassembly, it treats the suspicious region (which may be a single base or a larger region) as untrusted, using alignments to the trusted flanks on both sides to identify a collection of reads that might contribute evidence for the true sequence in the suspicious region.

Unpaired reads with partial alignments to the flanks are included in the collection. For paired data, Pilon identifies pairs in which one of the reads is anchored by proper alignment to one of the flanks (e.g., with forward orientation on the left flank, or reverse orientation on the right flank), but whose mate is either unmapped or improperly mapped (e.g., to a remote location in the genome). For fragment pairs, both reads of such pairs are included in the collection; for jump pairs, only the unanchored read is included in the collection.

From the collected reads, Pilon builds a De Bruijn assembly graph (default K=47). For each k-mer in the reads, it uses the same pileup structure to record the bases which follow that k-mer, including weighting by base quality. Then, the pileups are “called” to determine the link(s) to the next k-mer(s); the calls are either a single base call, resulting in one forward link to the next k-mer, or an ambiguous call, resulting in two links forward and a branch in the assembly graph. This process automatically prunes the assembly graph of most sequencing errors, as infrequent base differences are unlikely to present enough evidence to affect the forward links. A minimum coverage cutoff of five for each forward link also prunes the assembly graph of many false links which could appear because of sequencing errors.

Pilon then tries k-mers from the trusted flanks as starting points to walk into the untrusted region from each side, building all possible extensions with up to five branching points (25 possible extensions). Tandem repeats with combined length > K cause loops in the local assembly graph, and they are detected by noting when the assembly walk reaches an already-incorporated k-mer. Pilon currently does not attempt to determine the copy number of such tandem repeats; instead, it will report the length of the repeat structure encountered in its standard output, and it will not attempt to close the two sides.

When no tandem repeat is detected, the resulting extensions from each side are combinatorially matched for possible perfect overlaps of sufficient length (2K+1) to be considered for closure. If there is exactly one such closure and it differs from the input genome, the assembled flank-to-flank sequence will replace the corresponding sequence in the input genome. Since the default k-mer size is 47, an overlap of 95 bases is required for closure.

If there are no closures or more than one possible closure, Pilon will identify a consensus extension from each flank. If an optional argument is set to allow opening of new gaps, Pilon will replace the suspicious region with the consensus extensions from each flank and create a gap between them; otherwise, it simply reports that it was unable to find a solution. These reports identify areas which an assembly analyst might wish to investigate manually.

Pilon also attempts to fill gaps between contigs in a scaffold (“captured gaps”) in the input genome. In order to fill captured gaps, Pilon employs the same local reassembly technique described above, treating the gap itself as the “untrusted” region. If there is a unique closure, the gap is filled; otherwise, consensus extensions from each flank are used to reduce the size of the gap. Pilon does not currently attempt to join or break scaffolds.

Large collapsed repeat (segmental duplication) detection

Pilon includes heuristics which attempt to flag areas indicative of large (>10 Kbp) collapsed repeats with respect to the input genome. These areas are identified by looking for large contiguous areas which appear to have double (or higher) read coverage compared to the rest of the genomic element being analyzed. Jump pairs are excluded from this computation, as we have found jump coverage to be far more variable across some genomes.

In assembly improvement mode, Pilon does not attempt to fix these potentially collapsed regions, but it does report them in its standard output for further investigation. In variant calling applications, large segmental duplications in the sequenced strain with respect to the reference have the same signature as large collapsed repeats in a draft assembly; a duplicated region of the genome will result in double the number of reads covering that sequence. Pilon’s reporting of large collapsed repeat regions can be used to identify candidate segmental duplications.

Output files

Pilon generates a modified genome as a FASTA file, including all single-base, small indel, gap filling, misassembly and large-event corrections from the input genome. In the assembly improvement case, this is the improved assembly consensus. In variant detection mode, this is the reference sequence which has been edited to represent the consensus of the given sample more closely.

Pilon can optionally generate a Variant Call Format (VCF) [http://samtools.github.io/hts-specs/VCFv4.1.pdf] file, which lists detailed information about the base and indel evidence at every base position in the genome. Changes generated by local reassembly, often triggered by larger polymorphisms in variant calling applications, are included as structural variant records (SVTYPE=INS and SVTYPE=DEL). Pilon can also, optionally, generate a “changes” file which lists the edits applied from input to output genome in tabular form, including source and destination coordinates and source and destination sequence. Finally, Pilon will optionally output a series of visualization tracks (“bed” and “wig” files) suitable for viewing in genome browsers such as IGV (17) and GenomeView (18). Tracks include basic metrics across the genome, such as sequence coverage and physical coverage, as well as some of the calculated metrics Pilon uses in its heuristics for finding potential areas of misassembly, such as percentage of valid read pairs covering every location.

Pilon’s standard output also contains useful information, including coverage levels, percentage of the input genome confirmed, a summary of the changes made, as well as some specifically flagged issues which were not corrected, such as potentially large collapsed repeat regions, potential regions of misassembly which were not able to be corrected, and detected tandem repeats that were not resolved.