FSFix uses Illumina short reads to fix frameshift (indel) errors in a PacBio assembly. It fixes many of the insertions and deletions that remain in homopolymer regions, as well as indels caused by reads derived from heterozygous haplotigs, even after the assembly has been polished with Arrow, Quiver, and Pilon. This program was used in a durian genome assembly project (doi:10.1038/ng.3972), where it fixed 1255 indels in a 720 Mb assembly.
This program was written by Chern Han Yong. Please contact me at chernycherny@hotmail.com if you have any questions.
CITATION
If you use this program, please cite: Bin Tean Teh, Kevin Lim, Chern Han Yong, Cedric Chuan Young Ng, Sushma Ramesh Rao, Vikneswari Rajasegaran, Weng Khong Lim, Choon Kiat Ong, Ki Chan, Vincent Kin Yuen Cheng, Poh Sheng Soh, Sanjay Swarup, Steven G Rozen, Niranjan Nagarajan & Patrick Tan. The draft genome of tropical fruit durian (Durio zibethinus). Nature Genetics 2017. doi:10.1038/ng.3972
REQUIREMENTS
- Samtools
- Perl interpreter
INPUT
- unmasked.bed - BED file of the unmasked (i.e. non-repeat) regions of the genome to fix. Fixing only the unmasked regions requires much less time and disk space. This file can be generated, e.g., from RepeatMasker's output. To fix the entire genome, create a BED file covering the entire genome.
- alignment.bam - alignment of Illumina short reads onto the PacBio assembly.
- genome.fasta - PacBio assembly.
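For the whole-genome case mentioned above, one possible way to build the BED file is from a FASTA index (.fai), such as the one written by samtools faidx. This is a minimal sketch only: the function name fai_to_bed and the output file name whole_genome.bed are illustrative and not part of FSFix.

```shell
# Hypothetical helper (not part of FSFix): convert a FASTA index (.fai)
# into a BED file with one interval per sequence, spanning its full length.
# Columns 1 and 2 of a .fai file are the sequence name and its length.
fai_to_bed() {
    awk 'BEGIN { OFS = "\t" } { print $1, 0, $2 }' "$1"
}

# Example usage (assumes samtools is installed and genome.fasta exists):
#   samtools faidx genome.fasta              # writes genome.fasta.fai
#   fai_to_bed genome.fasta.fai > whole_genome.bed
```

The resulting file would then be passed to samtools in place of unmasked.bed.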
USAGE
- Extract properly paired short-read alignments (-f 0x2) in the unmasked region with mapping quality of at least 20 (-q 20)
samtools view -f 0x2 -q 20 -L unmasked.bed alignment.bam > dna_reads_unmasked_tmp.sam
- Remove short-read alignments that are not uniquely mapped (those carrying an XA:Z tag, which aligners such as BWA use to list alternative hits)
grep -v "XA:Z" dna_reads_unmasked_tmp.sam > dna_reads_unmasked.sam
- Analyze DNA read alignments for frameshifts
perl find_frameshifts.pl -s dna_reads_unmasked.sam > out_find_frameshifts
- Fix frameshifts
perl fix_frameshifts.pl -i out_find_frameshifts -s genome.fasta -o genome_FSfix.fasta > out_fix_frameshifts
DETAILS
FSFix corrects many of the insertions and deletions (indels) that remain in homopolymer regions of a PacBio assembly, as well as indels caused by reads derived from heterozygous haplotigs, even after assembly polishing by Arrow, Quiver, and Pilon.
First, FSFix aligns the Illumina short reads onto the assembly, keeping only unique alignments with mapping quality >20. Next, FSFix looks for indels in the alignment that are supported by at least 90% of the aligned reads and by at least 5 supporting reads. Reads for which the indel under consideration falls within the first 5 or last 5 bases are ignored when calculating this support. Finally, FSFix corrects the assembly using these supported indels: supported deletions in the aligned reads are used to delete base(s) from the assembly, while supported insertions in the aligned reads are used to insert base(s) into the assembly.
FSFix also looks for a special case of indels likely derived from reads assembled from heterozygous haplotigs, where the assembly contains both alleles from the heterozygous haplotigs. This manifests as two nearby deletions in the aligned reads. The general case above does not detect it, because each deletion is supported by only about 50% of reads. Thus, if FSFix finds two deletions, X1 and X2, that are within 3bp of each other, and X1’s deletion-supporting reads are distinct from X2’s deletion-supporting reads, FSFix treats X1 and X2 as the same deletion event, takes its supporting reads as the union of X1 and X2’s supporting reads, and calculates whether the deletion is supported as before. If so, it deletes only one of X1 or X2. Please see (link to publication xxx) for full details.
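The support rule described above can be illustrated with a small sketch. This is not FSFix's own code: the function name is_supported and the three-column input format (indel ID, supporting reads, total reads) are assumptions made for illustration. Both counts are assumed to already exclude reads in which the indel falls within the first or last 5 bases.

```shell
# Hypothetical sketch of the support rule (not FSFix's implementation).
# Reads tab-separated lines from stdin: indel_id, supporting_reads, total_reads.
# Prints the IDs of indels with at least 5 supporting reads that make up
# at least 90% of the reads. Integer arithmetic avoids rounding issues.
is_supported() {
    awk -F '\t' -v min_reads=5 -v min_pct=90 \
        '$3 > 0 && $2 >= min_reads && $2 * 100 >= $3 * min_pct { print $1 }'
}

# Example: two nearby deletions, each supported by 6 of 12 reads, fail
# on their own. If their supporting reads are distinct, the merged event
# (assumed here to be precomputed as X1+X2) has 12 of 12 reads and passes.
printf 'X1\t6\t12\nX2\t6\t12\nX1+X2\t12\t12\n' | is_supported
```

In this example only the merged event X1+X2 is printed, which mirrors why the heterozygous haplotig case needs the union of the two supporting read sets.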