ekg/dozyg

dozyg

"dozey-g": a sequence to variation graph, dozeu-based yeeter

overview

dozyg is a sequence to graph mapper that exploits several properties common to most genome variation graphs to achieve runtime similar to best-of-class methods for read alignment to reference genomes.

  1. Most pangenome variation graphs have a "manifold" linear property. That is, they may have large scale structural variation globally, but they are locally usually linear or partially orderable.
  2. We can sort these graphs so that their collinear regions are represented contiguously within a sort order (e.g. using odgi sort).
  3. This lets us use efficient collinear chaining methods to find target mappings, and apply POA locally (with dozeu) to obtain a base-level alignment.
  4. A hierarchical chaining model lets us split the alignment across different regions of the graph, while maximizing coverage of the mapped sequence.

process

We first build a k-mer index over all k-mers in the graph (pruning complex regions to remove redundant k-mers), along with a transformation of the graph designed to support efficient processing during read mapping. The k-mers are hashed (this simplifies index construction and allows them to be of any length) and indexed with a minimal perfect hash function. We record their positions in the graph, and store linearized versions of the forward and reverse complement of the graph.
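The indexing idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not dozyg's implementation: a plain dictionary keyed by a CRC32 hash stands in for the minimal perfect hash function, a single string stands in for the linearized graph, and pruning of complex regions is not modeled. The names `build_kmer_index` and `lookup` are hypothetical.

```python
import zlib
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_hash(kmer):
    # Assumption: CRC32 stands in for whatever hash the real index uses.
    # Hashing lets k be any length, at the cost of possible collisions.
    return zlib.crc32(kmer.encode("ascii"))


def build_kmer_index(linear_seq, k):
    """Map hash(k-mer) -> list of (offset, is_reverse) positions.

    Offsets are relative to the forward or reverse-complement
    linearization, mirroring the two stored orientations.
    """
    index = defaultdict(list)
    strands = ((False, linear_seq), (True, reverse_complement(linear_seq)))
    for is_reverse, seq in strands:
        for i in range(len(seq) - k + 1):
            index[kmer_hash(seq[i:i + k])].append((i, is_reverse))
    return index


def lookup(index, kmer):
    """Return all recorded positions whose k-mer hash matches."""
    return index.get(kmer_hash(kmer), [])
```

For example, `lookup(build_kmer_index("ACGTTG", 3), "ACG")` reports a hit at offset 0 on the forward strand and one at offset 2 on the reverse-complement strand.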

To align a sequence to the graph, we apply a two-stage clustering method. The first stage is similar to that used in minimap2, and chains with respect to the target sequence (our linearized graph). Because our graph can contain structural rearrangements, we then add a second pass that combines the target-relative chains into "superchains" in a similar banded process that proceeds over the query sequence.
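A rough sketch of the two-stage idea follows. It is a greedy simplification written for illustration, not the banded procedure dozyg uses and not minimap2's dynamic program. Anchors are assumed to be `(query_pos, target_pos)` pairs, and the function names and the `max_gap` parameter are hypothetical.

```python
def chain_anchors(anchors, max_gap):
    """Stage 1: group anchors into chains that are collinear in the target.

    An anchor joins the first chain whose last anchor precedes it in both
    query and target by at most max_gap; otherwise it starts a new chain.
    """
    chains = []
    for q, t in sorted(anchors, key=lambda a: (a[1], a[0])):
        for chain in chains:
            last_q, last_t = chain[-1]
            if 0 < q - last_q <= max_gap and 0 < t - last_t <= max_gap:
                chain.append((q, t))
                break
        else:
            chains.append([(q, t)])
    return chains


def superchain(chains):
    """Stage 2: order chains along the query, ignoring target order.

    Chains are visited by query start (longer first on ties) and kept
    when they begin after the query span already covered.
    """
    ordered = sorted(chains, key=lambda c: (c[0][0], -len(c)))
    result = []
    query_end = -1
    for chain in ordered:
        if chain[0][0] > query_end:
            result.append(chain)
            query_end = chain[-1][0]
    return result


# Toy input: the query's first half hits target offsets near 100 and its
# second half hits offsets near 10, as a rearrangement might produce.
anchors = [(0, 100), (5, 105), (10, 110), (20, 10), (25, 15)]
chains = chain_anchors(anchors, max_gap=10)
ordered = superchain(chains)
```

Here stage 1 yields two target-relative chains, and stage 2 puts them in query order even though their target positions are reversed, which is the property that lets a superchain span a structural rearrangement.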

Finally, we align each sequence to the graph by progressing through the chains in each superchain and locally aligning them. The final alignment is derived by applying dozeu partial order alignment. This process does not respect local complex structures like small inversions and cycles. However, because of its two-stage chaining process, dozyg is able to align to graphs with complex large structural variation of all types.

operation

dozyg reads odgi graphs, indexes their k-mers, and then maps reads from FASTA or FASTQ into a subset of the GAF graph alignment format. The input graph must be sorted using a process that ensures that collinear chains of paths in the graph are represented contiguously in its id ordering (this is achieved with odgi sort).

odgi build -g g.gfa -o - | odgi sort -i - -o g.odgi -p Ygs -t 16 -P
dozyg index -i g.odgi -k 15 -e 3 -t 16
dozyg map -t 16 -i g.odgi -f reads.fq >aln.gaf

Downstream processing of GAF records is enabled by numerous algorithms in vg, including vg pack and vg call. Other methods working on GAF alignments can be applied, such as gaffy, which can project these into various graph matrix formats.
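For quick inspection outside of those tools, a GAF line can be split by hand. The sketch below assumes the 12 standard tab-separated GAF columns and an oriented path such as `>1>2<3`. Since dozyg emits a subset of GAF, some columns may be absent or filled with placeholders in its actual output. The function name `parse_gaf_line` and the sample record are made up for illustration.

```python
import re

STEP = re.compile(r"([<>])([^<>]+)")


def parse_gaf_line(line):
    """Parse the mandatory columns of one GAF record into a dict.

    Assumes the standard 12-column layout; optional tags after
    column 12 are ignored.
    """
    f = line.rstrip("\n").split("\t")
    # Each path step is (node_name, is_reverse); '<' marks reverse.
    path = [(m.group(2), m.group(1) == "<") for m in STEP.finditer(f[5])]
    return {
        "query_name": f[0],
        "query_length": int(f[1]),
        "query_start": int(f[2]),
        "query_end": int(f[3]),
        "strand": f[4],
        "path": path,
        "path_length": int(f[6]),
        "path_start": int(f[7]),
        "path_end": int(f[8]),
        "matches": int(f[9]),
        "block_length": int(f[10]),
        "mapq": int(f[11]),
    }


# Hypothetical record, not real dozyg output.
record = parse_gaf_line(
    "read1\t100\t0\t100\t+\t>1>2<3\t150\t10\t110\t95\t100\t60"
)
```

The `path` entry keeps node names as strings, so the node-name translation mentioned below can be applied later without changing the parser.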

Future ergonomic improvements will allow the direct indexing of any pangenome graph in GFAv1 format, and indexing only k-mers that occur in actual genomes. We will record the translation of node names from the input graph to that of the internally sorted graph.

scope

dozyg is designed to map sequences of any length, both short and long. Different indexing patterns benefit shorter versus longer reads.

considerations

This is currently bleeding-edge research software.

dozyg doesn't map. It yeets your sequence against the graph and hopes that it sticks. It's designed to go fast, and not ask hard questions. It will not get trapped in weird universal graph motifs. It will blast through them like they don't exist. Its hierarchical chaining model means that it is not afraid of complex pangenome graphs.

author

Erik Garrison

license

MIT
