a string to graph aligner
C++ Other
Latest commit 700da97 Jun 1, 2016 @ekg get things compiling
Permalink
Failed to load latest commit information.
bamtools @ 2d7685d fix build problems Oct 2, 2013
fastahack @ c68cebb update repos, resolve empty target DAG issue Oct 2, 2013
gssw @ e8d597e update gssw Apr 19, 2016
json fixed bugs in jsoncpp amalgam Feb 19, 2013
rapidjson @ 52d4fc5 updated vcfl,b some cleanup, addition of rapidjson Nov 15, 2013
vcflib @ 1e2c464 first steps in gssw integration Feb 20, 2014
.gitignore updated .gitignore Apr 3, 2013
.gitmodules added gssw submodule Feb 18, 2014
LICENSE MIT license Sep 25, 2014
Makefile get things compiling Jun 1, 2016
README.md updated readme Jul 5, 2014
alignmentstats.cpp check for empty cigar --- implies unmapped Mar 12, 2014
alignmentstats.h flattened alignments, GSSW integration Mar 9, 2014
cigar.cpp resolve flattening bug related to mishandling of graph-relative softc… Jun 20, 2014
cigar.h resolve flattening bug related to mishandling of graph-relative softc… Jun 20, 2014
construct.cpp fix two construction bugs Jun 17, 2014
construct.h major cleanup, now usable Apr 14, 2014
convert.h we have alignment Feb 28, 2013
dump.cpp first commit Apr 27, 2012
dump.h first commit Apr 27, 2012
examples.cpp DAG construction from VCF inputs Mar 1, 2013
examples.h first commit Apr 27, 2012
fastq.cpp Seed Improvements Jul 31, 2012
gliamodels.cpp various updates, fix coordinate issue when using whole sequence target Mar 7, 2013
gliamodels.h first steps in gssw integration Feb 20, 2014
gsw.cpp updated vcfl,b some cleanup, addition of rapidjson Nov 15, 2013
gsw.h vcf-dag construction now working Mar 1, 2013
json-forwards.h Add failed last time - adding json files. Feb 19, 2013
jsoncpp.cpp getting things to compile with g++ on linux Feb 28, 2013
jsreader.cpp delete json fwd Feb 19, 2013
jsreader.h getting things to compile with g++ on linux Feb 28, 2013
main.cpp get things compiling Jun 1, 2016
main.h various updates, fix coordinate issue when using whole sequence target Mar 7, 2013
matrices.cpp first commit Apr 27, 2012
matrices.h first commit Apr 27, 2012
nodealign.cpp updated vcfl,b some cleanup, addition of rapidjson Nov 15, 2013
nodealign.h updated vcfl,b some cleanup, addition of rapidjson Nov 15, 2013
parameters.cpp get things compiling Jun 1, 2016
parameters.h added option to not decompose variants in DAG Apr 14, 2014
seqtools.cpp minor improvements and commenting Feb 13, 2013
seqtools.h remove sparsehash and other cruft Dec 18, 2013
show.cpp various updates, fix coordinate issue when using whole sequence target Mar 7, 2013
show.h various updates, fix coordinate issue when using whole sequence target Mar 7, 2013
split.cpp add split.{cpp,h} Feb 28, 2013
split.h add split.{cpp,h} Feb 28, 2013
traceback.cpp update repos, resolve empty target DAG issue Oct 2, 2013
traceback.h full realignment and flattening Mar 23, 2013
utility.cpp average quality for inserted sequence Apr 11, 2014
utility.h average quality for inserted sequence Apr 11, 2014

README.md

glia

a Graph/Smith-Waterman (partial order) aligner/realigner

Erik Garrison erik.garrison@bc.edu

Deniz Kural deniz.kural@gmail.com

Installation Instructions

Make sure you download the submodules by adding the recursive flag to the clone command:

git clone --recursive https://github.com/ekg/glia.git

Then build with a call to make:

cd glia
make

You'll need cmake to compile bamtools, which is a submoduled dependency. ZLIB is also required, but is typically installed system-wide on development servers.

Example usage

glia consumes a VCF, a target region, and a string to generate an alignment of any string against the variant graph defined by the VCF. For instance calling:

glia -f ~/human_g1k_v37.fasta -t 20:62900077-62902077 -v variants.vcf.gz \
     -s AAATGTAAACATTTTATAGGGGATTCCCCTAAAAACAAAAAAACTTTCTGGGAAAGATTTTTCAAAAAATAAAA

Will report the alignment of this string against the variant graph in the region 20:62900077-62902077. This mode is provided for debugging purposes.

Streaming local realignment

glia's main use is as a local realigner. It will realign reads to a set of known (or putative) variants in a VCF, both consuming and producing an ordered stream of BAM alignments. As such, it can be used immediately prior to variant calling with standard methods. In this case, we use freebayes:

<sample.bam glia -Rr -w 1500 -S 200 -Q 200 -G 4 -f ref.fasta -v known_variants.vcf.gz \
    | freebayes -f ref.fasta --stdin >sample.vcf

Here, glia is running reads that have a soft-clip ('S' cigar element) quality sum of >200 (-S 200), or a mismatch quality sum >200 (-Q 200) or any indel (default), and only accepting alignments that have fewer than 4 embedded gaps and decrease the overall quality score sum of soft-clips and mismatches. So as to match unaligned mates to their correct alignments, this invocation of glia will align reads to the variant graph constructed across a window of 1500bp semi-centered around the read alignment position (-w 1500). These settings are typically sensible for realigning Illumina 70bp reads as found in the 1000 Genomes Project.

Why?

The alignment of reads can be frustrated by variants, both large (e.g. structural variants) and small (e.g. SNPs, small indels). Observations of larger variants, where the variant scale is of a large fraction of a read length or greater, can be rescued by using unaligned mates from paired sequencing. For smaller variants, more typically contemporary aligners will soft-clip starts and ends of a read when the read contains multiple mismatches or gaps. Soft-clipping makes sense in that the tails of reads often contain many sequencing errors, but this filtering practice can reduce sensitivity of resequencing even to small variants.

glia realigns poorly-aligned reads against a local variant graph in order to resolve these biases, provided we know about, or have a good hint that a variant potentially exists in our dataset. Suitable input could be either a known variant set such as dbSNP or a union set of variants such as might be produced by an ensemble variant detection process).

How?

glia uses an independently-developed (by Deniz Kural) version of partial order alignment algorithm to realign reads to a local "variant graph," "sequence graph," or "partial order." This structure was originally applied to the problem of multiple sequence alignment due to its favorable time and space complexity, which is approximately O(NM) where N is the length of the query and M is the total length of the partial-order graph against which we are aligning. In glia, we apply this method to read alignment and variant detection in the context of the sequence reads generated by 2nd and 3rd-generation sequencing platforms.

Considering a multiple sequence alignment, we can construct a partially-ordered graph in which all of the input sequences are paths:

partial order graph construction

(Note that these figures are drawn from Christopher Lee's original paper on partial order alignment)

This is the structure which glia aligns reads against. Provided a VCF file encodes the same structure and does not include internal conflicts, we can use it for the generation of partially-ordered, directed acyclic graphs.

In partial order (or graph-Smith-Waterman) alignment, the standard dynamic-programming alignment algorithm is generalized to allow alignment across edges of this graph. This can be understood as a modification of the recurrence relation used to define the score distribution across the alignment matrix to account for the score on the upstream side of inbound links. When we initialize a new matrix, we take the score and identity of the node with the maximum alignment score in the same position in the previous node matrix.

partial order alignment

This then allows us to trace back the alignment across the edges of the graph. In practice, glia uses this information to determine if the graph can provide a better alignment of a given read. When the new alignment has better metrics in terms of quality/mismatch or soft-clip, then it is accepted, "flattened" from the graph back into the reference space, and emitted as a BAM record for downstream processing. If the new alignment is not an improvement, or if there are other problems, then the original alignment is provided as-is.

Help!!!?

Email erik.garrison@bc.edu with any questions or concerns. Please report bugs via the bugtracking system in github.