Skip to content

douglasgscofield/PacBio-utilities

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PacBio-utilities

A collection of scripts useful for working with PacBio-based assemblies.

The pacbio-util script requires BioPerl and samtools >= 0.1.19.

Correction of small indels in polished assemblies

We have noticed that PacBio assemblies can show single-base deletions, in most cases, in proximity to even fairly short homopolymer runs. This does not occur at all homopolymer runs, and in our limited samples incidence rate is perhaps 1 in 10kbp. Single-base insertions and longer insertions may occur but are much rarer.

There are three steps to applying this correction. First, map a set of Illumina reads to the PacBio assembly to generate sorted BAM file(s). Reads from multiple BAMs will be combined. These reads should be from the same individual/strain from which the assembly was created. Reads from other sequencing technologies, or from another individual/strain, will reduce the effectiveness of the correction and may introduce further errors. In all subsequent steps it is required that the order of the sequences in the assembly and the BAMs is identical.

Second, use the pacbio-util script here to generate a set of indel targets for correction. The command is pacbio-util indel-targets, and the criteria for determining targets may be modified via options. Use the -h option to see available options and current defaults.

The third and final step is to use the command pacbio-util indel-apply to apply the targets generated in the second step to the assembly to create a corrected assembly. Options may also be used here to restrict the set of targets applied; use the -h option to see available options and current defaults.

For an assembly pacbio_assembly.fasta, and mappings of Illumina paired-end and single-end reads to this assembly in BAM files pe.sorted.bam and se.sorted.bam, the following will generated the list of indel targets to targets.txt and a corrected assembly to corrected_assembly.fasta.

pacbio-util indel-targets -f pacbio_assembly.fasta pe.sorted.bam se.sorted.bam > targets.txt
pacbio-util indel-apply -f pacbio_assembly.fasta -t targets.txt > corrected_assembly.fasta

The indel-apply command can find the assembly used for determining the targets by using the first line of the list of targets. This provides a useful shortcut for combining both commands into a one-line piped command. The tee is in place to save the list of targets as they pass through to indel-apply.

pacbio-util indel-targets -f pacbio_assembly.fasta pe.sorted.bam se.sorted.bam \
| tee targets.txt \
| pacbio-util indel-apply > corrected_assembly.fasta

Here is example tool output (with -v) and scaffold size results before and after correction.

$ pacbio-util indel-targets -v -f pacbio_assembly.fasta pe.sorted.bam se.sorted.bam | tee targets.txt | pacbio-util indel-apply -v > corrected_assembly.fasta
samtools pipe command: 'samtools mpileup -s -B -d 10000 -L 10000 --ff 1280 -q 1 -Q 13 -f pacbio_assembly.fasta pe.sorted.bam se.sorted.bam |'
[mpileup] 2 samples in 2 input files
Merging columns from 2 BAM files ...
pacbio-util indel-targets: 1281 targets generated for assembly pacbio_assembly.fasta
pacbio-util indel-apply: Opening target assembly 'pacbio_assembly.fasta'
pacbio-util indel-apply: unitig_7, no targets
pacbio-util indel-apply: unitig_52, no targets
pacbio-util indel-apply: unitig_46, no targets
pacbio-util indel-apply: unitig_14, no targets
pacbio-util indel-apply: unitig_54, no targets
pacbio-util indel-apply: unitig_44, no targets
pacbio-util indel-apply: unitig_13, no targets
pacbio-util indel-apply: unitig_56, no targets
pacbio-util indel-apply: unitig_55, no targets
pacbio-util indel-apply: unitig_43, no targets
pacbio-util indel-apply: unitig_48, no targets
pacbio-util indel-apply: unitig_45, no targets
pacbio-util indel-apply: unitig_15, no targets
pacbio-util indel-apply: unitig_47, no targets
pacbio-util indel-apply: unitig_60, no targets
pacbio-util indel-apply: unitig_61, no targets
pacbio-util indel-apply: unitig_51, no targets
pacbio-util indel-apply: unitig_59, no targets
pacbio-util indel-apply: unitig_4, no targets
pacbio-util indel-apply: 1281 targets applied to assembly pacbio_assembly.fasta
$
Scaffold bp before bp after
unitig_1 6250549 6250752
unitig_10 1811588 1811655
unitig_7 15299 15299
unitig_52 7932 7932
unitig_46 8606 8606
unitig_14 1884 1884
unitig_0 10120325 10120600
unitig_8 3934972 3935082
unitig_12 240123 240143
unitig_3 16252 16253
unitig_54 22065 22065
unitig_44 12520 12520
unitig_53 5951772 5951911
unitig_13 86224 86224
unitig_56 30475 30475
unitig_55 15275 15275
unitig_41 19231 19232
unitig_43 9344 9344
unitig_58 6608738 6608943
unitig_9 2363594 2363680
unitig_48 30325 30325
unitig_45 14763 14763
unitig_57 18691 18685
unitig_15 2004 2004
unitig_47 15558 15558
unitig_6 4486498 4486633
unitig_60 26006 26006
unitig_61 16086 16086
unitig_51 9410 9410
unitig_59 14740 14740
unitig_4 19898 19898

pacbio-util indel-targets

Generate indel targets for correction. Uses samtools mpileup for indel detection. Targets are written to standard output.

Targets will only be generated where all read mappings containing an indel agree on the size and sequence of the indel. Further criteria for minimum indel fraction, minimum coverage, and maximum indel size may be adjusted via options.

pacbio-util indel-targets -f FASTA FILE1.bam [ FILE2.bam ... ]

OPTIONS

    -f FILE, --fasta FILE    PacBio assembly in Fasta format

    FILE1.bam [ FILE2.bam ]  BAM files of reads mapped to

    -                        Read mpileup from stdin rather than running
                             samtools. Note that mapping qualities
                             (option `-s`) are expected in the mpileup input.

    --indel-frac FLOAT       Do not report indels at a position if the
                             fraction of reads containing them is below FLOAT
                             [default 0.8]
    --indel-min-cov INT      Minimum read coverage to consider an indel
                             [default 10]
    --include-bad-indels     Include indels in target set that do not pass
                             our criteria, these lines are marked 'bad' in
                             the sixth column
    --max-indel-size INT     Maximum indel size to accept, beyond this the
                             position is considered in error [default 1000]
    --mapping-quality FLOAT  Minimum mean mapping quality of reads covering a
                             target site. Note that a value of 0 means no
                             minimum applied; reads with 0 mapQ are still
                             excluded by default (see --all-reads) [default 0]
    --all-reads              Include all reads.  Default is to exclude
                             multiply-mapped reads (mapQ < 1), alignments
                             that are not the primary (SAM flag 256), and
                             duplicate alignments (flag 1024).
    --all-bases              Include all bases.  Default is to exclude bases
                             below quality 13.
    --samtools FILE          Location of samtools executable
                             [default samtools]

    --verbose                Print informational messages
    --help, -?               help message

Below is an example subset of targets generated with this command. The first line of the targets output is the name of the assembly Fasta file against which the targets were generated. Following that, the columns are scaffold name, target position, reference base, total read coverage, type of target (always indel for this command), whether the target passed quality criteria (good or bad), the frequency of coverage supporting the indel, the size of the indel (positive for insertion, negative for deletion), a string describing the indel size and sequence, the number of reads supporting the indel, and the mean mapping quality of reads.

#assembly:pacbio_assembly.fasta
unitig_1	101522	C	72	indel	good	0.9444	1	+1A	68	60
unitig_1	119319	G	127	indel	good	0.8110	1	+1A	103	60
unitig_1	120891	A	104	indel	good	0.9231	1	+1T	96	60
unitig_1	122801	A	93	indel	good	0.9032	1	+1C	84	60
unitig_1	125303	G	84	indel	good	0.9286	1	+1A	78	60
unitig_1	131734	T	67	indel	good	0.8657	1	+1A	58	60
unitig_1	136434	A	28	indel	good	0.8214	1	+1C	23	60
unitig_1	170949	A	40	indel	good	0.9500	1	+1G	38	60
unitig_1	171182	A	80	indel	good	0.9500	1	+1G	76	60
unitig_1	172016	A	67	indel	good	0.9403	1	+1C	63	60
unitig_1	190068	A	18	indel	good	1.0000	1	+1G	18	60
unitig_1	190154	T	11	indel	good	0.9091	1	+1G	10	60
unitig_1	191027	C	104	indel	good	0.9231	1	+1G	96	60
unitig_1	191173	C	96	indel	good	0.9167	1	+1T	88	60
unitig_1	230981	T	63	indel	good	0.9683	1	+1C	61	60
unitig_1	270949	G	100	indel	good	0.9800	1	+1C	98	60
unitig_1	307836	A	38	indel	good	0.8947	1	+1C	34	60
unitig_1	310152	C	105	indel	good	0.8857	1	+1A	93	60
unitig_1	326536	G	105	indel	good	0.9143	1	+1A	96	60
unitig_1	360786	G	39	indel	good	0.8205	1	+1A	32	58.5
unitig_1	369367	C	98	indel	good	0.9592	1	+1A	94	60
unitig_1	371657	T	48	indel	good	0.9375	1	+1G	45	60
unitig_1	371776	G	43	indel	good	1.0000	1	+1T	43	60
unitig_1	373159	C	113	indel	good	0.9469	1	+1A	107	60

pacbio-util indel-apply

Apply indel targets to correct an assembly. The corrected assembly is written to standard output. The order of sequences in the assembly must be the same as that in the targets.

pacbio-util indel-apply [ -f FASTA ] [ -t TARGETS ]

OPTIONS

    -f | --fasta FILE        PacBio assembly in Fasta format, to be corrected.
                             Must be the same assembly used for
                             './pacbio-util indel-targets'.
                             If the assembly is not specified, it is determined
                             from the first line of the target list.
    -t | --targets TARGETS   List of indel targets to apply, standard output
                             from './pacbio-util indel-targets -f FILE ...'.  If no
                             targets are specified, they are read from stdin.

    --include-bad-indels     Apply indels in target set that are marked 'bad'
    --max-indel-size INT     Maximum indel size to apply [default 1000]
    --min-indel-frac FLOAT   Minimim indel fraction to apply, a value of 0
                             means apply all indels in target set [default 0]
    --skip-deletion-verify   Do not verify that a deletion matches the
                             sequence at that position in the assembly
                             [default 0]
    --skip-softmasked        Do not apply targets to softmasked regions of the
                             assembly, as determined by use of lowercase
                             [default 0]

    --verbose                Print informational messages
    --help, -?               help message

About

Collection of utilities for working with PacBio-based assemblies

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages