Output File Descriptions

Bruce Walker edited this page Jun 15, 2016 · 20 revisions

Output Files

Pilon produces a set of output files named pilon.* by default. If the user specifies an --output <prefix> argument, the output files are named <prefix>.* instead. If the --outdir <dir> option is used, all output files are placed in the specified directory.

FASTA ("Improved" assembly)

Pilon will normally output a fasta file (pilon.fasta by default) which contains a version of the assembly in which errors are fixed as specified by the --fix option. Pilon renames the sequence headers by appending _pilon to each FASTA element name.

If the --iupac argument is given, Pilon will use IUPAC nucleotide codes in the output FASTA file to represent ambiguous bases and/or heterozygous SNPs.
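
When reading a FASTA file written with --iupac, it can help to have the two-base ambiguity codes at hand. The sketch below uses the standard IUPAC nucleotide table; the function name `iupac_code` is hypothetical, and this is not a description of how Pilon itself chooses the code.

```python
# Standard IUPAC codes for positions where two different bases are possible.
IUPAC_TWO_BASE = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def iupac_code(allele_a, allele_b):
    """Return the IUPAC symbol covering two alleles (hypothetical helper)."""
    a, b = allele_a.upper(), allele_b.upper()
    if a == b:
        # Identical alleles need no ambiguity code.
        return a
    return IUPAC_TWO_BASE[frozenset((a, b))]
```

For instance, a heterozygous A/G site corresponds to the code R under this table.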

Changes

If run with the --changes argument, Pilon produces a file (pilon.changes by default) containing a space-delimited record of every change made in the assembly as instructed by the --fix option. The format for the file is as follows:

<Original Scaffold Coordinate> <New Scaffold Coordinate> <Original Sequence> <New Sequence>

These headers are further described in the table below:

| Header | Description | Example |
| --- | --- | --- |
| Original Scaffold Coordinate | The coordinate of sequence in the original fasta that was flagged by pilon for a change | scaffold00003:690754 |
| New Scaffold Coordinate | The coordinate of the changed sequence in the pilon fasta file | scaffold00003_pilon:690809-690853 |
| Original Sequence | The sequence in the original fasta file that was flagged by pilon for a change | . |
| New Sequence | The sequence in the pilon fasta file that was inserted or deleted by pilon to fix the assembly | GGCCAGTCCACAACAAGGCAAACATACCAACGCCCACGGCTATCT |

A period (.) will be used to represent blank fields for original or new sequence. Using the examples in the table above, the pilon change file record would appear as follows:

scaffold00003:690754 scaffold00003_pilon:690809-690853 . GGCCAGTCCACAACAAGGCAAACATACCAACGCCCACGGCTATCT

In this case, pilon did not remove any sequence at position 690754 in scaffold00003 of the original assembly, but inserted 45 bases, which appear in the pilon fasta file at coordinates 690809-690853 in scaffold00003_pilon.
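
A record in this format can be split into its four fields with a few lines of Python. This is a minimal sketch, not part of Pilon: the names `PilonChange`, `parse_coordinate`, and `parse_change` are hypothetical, and it assumes that a coordinate without a `-end` part refers to a single position.

```python
from typing import NamedTuple


class PilonChange(NamedTuple):
    orig_name: str
    orig_start: int
    orig_end: int
    new_name: str
    new_start: int
    new_end: int
    orig_seq: str  # empty string when the field is "."
    new_seq: str   # empty string when the field is "."


def parse_coordinate(text):
    """Split 'name:start' or 'name:start-end' into (name, start, end)."""
    name, _, span = text.rpartition(":")
    start, _, end = span.partition("-")
    # Assumption: a bare position means start == end.
    return name, int(start), int(end) if end else int(start)


def parse_change(line):
    """Parse one space-delimited line of a .changes file."""
    orig_coord, new_coord, orig_seq, new_seq = line.split()
    orig_name, orig_start, orig_end = parse_coordinate(orig_coord)
    new_name, new_start, new_end = parse_coordinate(new_coord)
    return PilonChange(
        orig_name, orig_start, orig_end,
        new_name, new_start, new_end,
        "" if orig_seq == "." else orig_seq,
        "" if new_seq == "." else new_seq,
    )
```

Applied to the example record above, the parsed new-sequence span (690809 to 690853, inclusive) covers 45 positions, matching the length of the inserted sequence.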

VCF

If the --vcf option is specified, Pilon variant output is stored in a file named pilon.vcf by default (if --output <prefix> is specified, the file will be named <prefix>.vcf).

VCF Pileups

Calls are classified by a small number of VCF FILTER tags:

| Filter Tag | Description |
| --- | --- |
| PASS | A passing call, either reference confirmation or difference |
| Amb | Ambiguous; significant evidence for more than one allele at this position. Meant for haploid genomes, this filter tag is suppressed by the --diploid argument, as it will result in a heterozygous call for a diploid genome |
| LowCov | Valid read coverage less than the threshold controlled by the --mindepth argument |
| Del | Provides pileup information for loci which were removed by a variation in another line; this gives a sense of the alignment evidence at that locus had the larger variation not been called |
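
A quick way to get a feel for a Pilon VCF is to tally how many records carry each FILTER tag. The sketch below relies only on the VCF column layout (FILTER is the seventh tab-separated column); the function name `count_filters` is hypothetical.

```python
from collections import Counter


def count_filters(vcf_lines):
    """Count FILTER tags over an iterable of VCF text lines."""
    counts = Counter()
    for line in vcf_lines:
        # Skip blank lines and header lines ("##..." and "#CHROM ...").
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # FILTER is column 7; multiple tags, if present, are ";"-separated.
        for tag in fields[6].split(";"):
            counts[tag] += 1
    return counts
```

It can be used as `count_filters(open("pilon.vcf"))`, assuming the default output name.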

Pilon includes many computed values in the VCF INFO field; here is an example along with a description of the values:

DP=38;TD=55;BQ=25;MQ=19;QD=2;BC=0,22,2,14;QP=0,72,1,27;PC=958;IC=0;DC=0;XC=0;AC=1;AF=0.27

| Example | Description |
| --- | --- |
| DP=38 | Depth of valid reads in pileup (not invalid pair; not soft-clipped) |
| TD=55 | Total Depth, including reads excluded from pileups |
| BQ=25 | Mean base quality in pileup |
| MQ=19 | Mean mapping quality in pileup |
| QD=2 | Quality normalized to depth; meant to give a sense of how confident the base call is. |
| BC=0,22,2,14 | Base count in pileups (order A,C,G,T) |
| QP=0,72,1,27 | Percentage of weighted evidence for each base (order A,C,G,T) |
| PC=958 | Physical coverage of valid reads or pairs spanning this locus |
| IC=0 | Number of reads in pileup calling an insertion at this locus |
| DC=0 | Number of reads in pileup calling a deletion at this locus |
| XC=0 | Number of reads in pileup soft-clipped at this locus |
| AC=1 | Alternate allele count, as defined in VCF spec (0=reference call, 1=heterozygous/ambiguous, 2=alternate call) |
| AF=0.27 | Fraction of evidence in support of alternate allele |
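
The INFO string is a semicolon-separated list of key=value pairs, some of which hold comma-separated lists. The following sketch turns it into a Python dictionary; `parse_info` and `_convert` are hypothetical helper names, and the numeric conversion is a convenience rather than anything Pilon prescribes.

```python
def _convert(text):
    """Try int, then float; fall back to the original string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def parse_info(info):
    """Parse a VCF INFO string into a dict (hypothetical helper)."""
    parsed = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if not sep:
            # Flag without a value, e.g. IMPRECISE.
            parsed[key] = True
            continue
        parts = [_convert(p) for p in value.split(",")]
        # Keep single values as scalars, multi-valued fields as lists.
        parsed[key] = parts if len(parts) > 1 else parts[0]
    return parsed
```

With the example string above, `dict(zip("ACGT", parse_info(...)["BC"]))` pairs each base with its count (A=0, C=22, G=2, T=14); in this example the four counts sum to the DP value of 38.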

Larger events

Potentially larger events resulting from local reassembly of suspicious regions are represented by VCF Structural Variant records. Pilon assigns SVTYPE=INS if the variant contains more bases than the reference region, otherwise SVTYPE=DEL, even if the events are in the form of block substitutions (not pure insertions or deletions). Example:

gi|395136682|gb|CP003248.1| 2133481 . T TGCCGTCACCTCGCAT . PASS SVTYPE=INS;SVLEN=15;END=2133481 GT 1/1

If there are unknown (N) bases in the resulting event, the INFO tag IMPRECISE will be included. This can happen when Pilon partially assembles an event, such as a large insertion, but it cannot resolve the complete event by joining the two flanks. This will only happen if the --fix +breaks option is turned on, which is implied by the --variant option.
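
The classification rule described in this section can be restated in a few lines of Python. This is a sketch of the rule as worded here, not Pilon's own code; the function name `describe_event` is hypothetical.

```python
def describe_event(ref, alt):
    """Return (svtype, imprecise) for a REF/ALT pair, per the rule above.

    svtype is "INS" when the variant has more bases than the reference
    region and "DEL" otherwise (including block substitutions).
    imprecise is True when the variant contains unknown (N) bases.
    """
    svtype = "INS" if len(alt) > len(ref) else "DEL"
    imprecise = "N" in alt.upper()
    return svtype, imprecise
```

For the example record above, REF is `T` and ALT is `TGCCGTCACCTCGCAT`, so the event is classified as INS, and the length difference of 15 bases matches the SVLEN=15 shown.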

Segmental duplications

Pilon will use SVTYPE=DUP records to indicate possible large segmental duplications, meaning there is read evidence for more copies of this region than appear in the input genome. This feature should be considered experimental and advisory, and is not meant to give a definitive call of such events. Example:

gi|395136682|gb|CP003248.1| 3494060 . T <DUP> . PASS SVTYPE=DUP;SVLEN=218228;END=3712288;IMPRECISE GT ./.
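
To see how the fields of such a record relate, the sketch below extracts POS, END, and SVLEN from a DUP line. The helper name `dup_span` is hypothetical; the code splits on any whitespace so that it works both on tab-separated VCF lines and on the example as printed here.

```python
def dup_span(record):
    """Return (pos, end, svlen) from a VCF record line with SVTYPE=DUP."""
    fields = record.split()
    pos = int(fields[1])  # column 2: POS
    # Column 8: INFO. Flags such as IMPRECISE map to an empty string.
    info = dict(item.partition("=")[::2] for item in fields[7].split(";"))
    return pos, int(info["END"]), int(info["SVLEN"])
```

In the example above, END minus POS (3712288 - 3494060) equals the SVLEN of 218228.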

VCF Spec

For more information on the VCF file format specification, see Variant Call Format version 4.1

Tracks

If run with the --tracks argument, Pilon produces .bed and .wig files that may be viewed in genome browsers such as IGV, GenomeView, and other applications that support these formats. The tracks produced by Pilon are as follows:

| Track Name | Filename | Description |
| --- | --- | --- |
| Pilon | pilonPilon.bed | Several classes of issue found by Pilon in a compact format |
| Changes | pilonChanges.wig | Changes made by Pilon |
| Unconfirmed | pilonUnconfirmed.wig | Non-zero in regions where the input genome was not confirmed |
| Copy Number | pilonCopyNumber.wig | The copy number in the genome of the sequence at a given location |
| Coverage | pilonCoverage.wig | The sequence coverage at a given position |
| Bad Coverage | pilonBadCoverage.wig | The coverage of reads that do not map logically |
| Delta Coverage | pilonDeltaCoverage.wig | A measure of local rate-of-change of valid coverage |
| Dip Coverage | pilonDipCoverage.wig | A metric designed to identify local dips in coverage, often indicating a contiguity break |
| Frag Coverage | pilonFragCoverage.wig | The fragment read coverage at a given position |
| Physical Coverage | pilonPhysicalCoverage.wig | The physical coverage at a given position |
| GC track | pilonGC.wig | The percent GC of sequence in a 100bp window centered on the given location |
| Pct Bad | pilonPctBad.wig | Percentage of invalid reads (usually bad pairing) compared to total depth |
| Weighted Qual | pilonWeightedQual.wig | The weighted base quality of reads at a given position |
| Weighted MQ | pilonWeightedMq.wig | The weighted mapping quality of reads at a given position |
| Clipped Alignments | pilonClippedAlignments.wig | A count of how many soft-clipping events started at this locus |

The SD tracks express a metric as standard deviations from the mean of the metric across a given input fasta element. The values are integers representing 0.1 sigma, i.e., a value of -21 means 2.1 standard deviations below the mean.
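
The integer encoding of the SD tracks can be converted back to standard deviations by dividing by ten. A minimal sketch, with hypothetical function names and an arbitrary cutoff chosen only for illustration:

```python
def sd_track_to_sigma(value):
    """Convert an SD track integer (units of 0.1 sigma) to sigma."""
    return value / 10.0


def outliers(values, cutoff_sigma=2.0):
    """Return indices whose value is at least cutoff_sigma from the mean.

    The default cutoff of 2.0 sigma is an illustrative assumption,
    not a threshold used by Pilon.
    """
    return [
        i for i, v in enumerate(values)
        if abs(sd_track_to_sigma(v)) >= cutoff_sigma
    ]
```

For example, `sd_track_to_sigma(-21)` gives -2.1, i.e. 2.1 standard deviations below the mean.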

Many of these tracks were primarily of use in developing Pilon's heuristics for making calls and identifying regions of possible misassembly; they may be removed in a future release.