HLA variation graphs
This is a collection of different renderings of a set of variation graphs made from different versions of the GRCh38 alts.
These are meant for testing variation graph methods.
These are ALT sequences of HLA genes from GRCh38, collected in the Human Genome Variation Map project.
gene            alts    size
A-3105            11  147718
B-3106             9   30751
C-3107            10   33810
DMA-3108          11   49624
DMB-3109          10   64515
DOA-3111          10   54307
DOB-3112          10   42837
DPA1-3113         11  178303
DPB1-3115         11  151390
DQA1-3117         10   73280
DQB1-3119         10   73913
DRA-3122          11   57216
DRB1-3123         12  163416
DRB3-3125          4   52271
DRB4-3126          5   64760
DRB5-3127          3   38568
E-3133             9   43200
F-3134            10   49951
G-3135            11   45653
H-3136            10   33111
J-3137            10   39844
K-3138             9   28660
L-3139             8   59036
MICA-100507436     8  116989
MICB-4277         11  168093
TAP1-6890         11   96391
TAP2-6891         11  185580
V-352962          10    9865
A variation graph represents a set of alignments between sequences and the sequences themselves in a graph.
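As a minimal sketch of what such a graph looks like on disk, the toy GFA below (hypothetical data, not one of the HLA graphs) encodes two sequences, ACGTACC and ACGTTCC, that differ at one base. The helper names toy_gfa and gfa_stats are made up for this illustration; the awk script only counts the S (segment), L (link), and P (path) records defined by GFA v1.

```shell
#!/usr/bin/env bash

# Emit a toy GFA v1 graph (hypothetical data): a shared prefix, a SNP bubble, a shared suffix.
toy_gfa() {
    printf 'H\tVN:Z:1.0\n'
    printf 'S\t1\tACGT\n'
    printf 'S\t2\tA\n'
    printf 'S\t3\tT\n'
    printf 'S\t4\tCC\n'
    printf 'L\t1\t+\t2\t+\t0M\n'
    printf 'L\t1\t+\t3\t+\t0M\n'
    printf 'L\t2\t+\t4\t+\t0M\n'
    printf 'L\t3\t+\t4\t+\t0M\n'
    # Each input sequence is stored as a path (a walk) through the nodes.
    printf 'P\tref\t1+,2+,4+\t*\n'
    printf 'P\talt\t1+,3+,4+\t*\n'
}

# Count nodes, edges, paths, and total node sequence length in a GFA (file argument or stdin).
gfa_stats() {
    awk -F'\t' '
        $1 == "S" { nodes++; bp += length($3) }
        $1 == "L" { edges++ }
        $1 == "P" { paths++ }
        END { printf "nodes=%d edges=%d paths=%d bp=%d\n", nodes, edges, paths, bp }
    ' "$@"
}

toy_gfa | gfa_stats
```

The same gfa_stats function can be pointed at any of the GFA files under graphs/ to compare how large the graphs from the different methods are.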
A number of methods are applied to build variation graphs from these sequences.
spoa implements the classic partial order alignment (POA) algorithm, sped up with SIMD instructions. Here its result is written in GFA format rather than as an MSA, followed by some light post-processing with odgi.
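# build one spoa graph per FASTA file in seqs/
ls seqs/*fa | sort | while read f
do
    echo "$f"
    in="$f"
    out="graphs/spoa/$(basename "$f" .fa).gfa"
    # spoa writes its graph to a temporary file; odgi then unchops, sorts, and re-exports it as GFA
    spoa -GR "$in" >"$out.1" \
        && odgi build -g "$out.1" -o - \
        | odgi unchop -i - -o - \
        | odgi sort -i - -o - \
        | odgi view -i - -g >"$out"
    rm -f "$out.1"
done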
(Some of the spoa graphs require a lot of memory to compute, so not all of them were built.)
vg msga is a progressive graph construction method built on the vg map algorithm.
It is similar to spoa, but it allows for structural variation in the graph by "chunking" the alignment problem, threading the results back together using dynamic programming, and then cleaning up the result by locally aligning any remaining unaligned fragments of the sequence.
This heuristic approach works well enough to build these graphs, but it does not scale efficiently to large problems.
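# build one vg msga graph per FASTA file in seqs/
ls seqs/*fa | sort | while read f
do
    echo "$f"
    in="$f"
    out="graphs/vg-msga/$(basename "$f" .fa).gfa"
    # convert the resulting vg graph to GFA with vg view
    vg msga -f "$in" | vg view - >"$out"
done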
seqwish reads a set of alignments and sequences and renders the variation graph that they imply.
It is the WYSIWYG of variation graph construction methods, in that the resulting graph perfectly reflects the input alignment set.
To change the graph, we simply change the input alignments.
seqwish depends on an independent alignment process to produce a set of alignments (in PAF format).
Two alignment approaches have been validated.
minimap2 is a popular and efficient aligner for long sequences. It uses minimizer seeding and chaining, followed by adaptive banded alignment, to derive base-level alignments.
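# align each FASTA file against itself with minimap2 and induce a seqwish graph
ls seqs/*fa | sort | while read f
do
    echo "$f"
    in="$f"
    out="graphs/seqwish/minimap2/$(basename "$f" .fa)"
    # align, drop short alignments with fpa, induce the graph, then sort it with odgi
    minimap2 -c -x asm20 "$in" "$in" >"$out.paf" \
        && seqwish -s "$in" -p <(fpa drop -l 3000 <"$out.paf") -g "$out.gfa" -t 16 \
        && odgi build -g "$out.gfa" -o - \
        | odgi sort -i - -o - -p sYgYs -k 1000 -G 1 -A -t 16 -P \
        | odgi view -i - -g >"$out.sort.gfa"
done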
This builds the graph and then sorts it.
seqwish graphs are not sorted by default; their nodes occur in the order of first appearance in the input sequence set.
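The fpa drop -l 3000 step in the loop above removes short alignments before seqwish induces the graph. A rough awk stand-in is sketched below, assuming that length means the alignment block length in PAF column 11 (fpa's own definition may differ) and using made-up records; paf_min_length and toy_paf are hypothetical helper names.

```shell
#!/usr/bin/env bash

# Keep PAF records whose alignment block length (column 11) is at least $1.
paf_min_length() {
    awk -F'\t' -v min="$1" '$11 + 0 >= min + 0'
}

# Toy PAF records (hypothetical names and coordinates), 12 mandatory columns each:
# qname qlen qstart qend strand tname tlen tstart tend matches blocklen mapq
toy_paf() {
    printf 'alt1\t10000\t0\t5000\t+\talt2\t12000\t100\t5100\t4900\t5000\t60\n'
    printf 'alt1\t10000\t6000\t8999\t+\talt3\t9000\t0\t2999\t2900\t2999\t60\n'
    printf 'alt2\t12000\t0\t3000\t-\talt3\t9000\t500\t3500\t2950\t3000\t60\n'
}

# Show query name, target name, and block length of the records that survive the filter.
toy_paf | paf_min_length 3000 | cut -f1,6,11
```

Raising the threshold removes more short, often repeat-induced, alignments and so yields a simpler graph; lowering it keeps more of them. Since seqwish renders exactly the alignments it is given, this filter is one of the main ways to shape the result.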
mashmap uses mash k-mer similarity estimates to find matching blocks of long sequences.
These are then chained to produce alignments.
The resulting alignments can be recomputed with edlib to obtain base-exact descriptions of them, which allows the method to be used as an input to seqwish.
In contrast to minimap2, mashmap does not suffer from problems with repetitive minimizer seeds or low sequence complexity, and in general runs extremely fast, particularly in its approximate mapping mode.
Here, we'll use a fork of mashmap that supports output in PAF format and multithreaded alignment.
# map and align each FASTA file against itself with mashmap, then induce a seqwish graph
ls seqs/*fa | sort | while read f
do
    echo "$f"
    in="$f"
    # separate output directory (name assumed) so the minimap2 graphs are not overwritten
    out="graphs/seqwish/mashmap/$(basename "$f" .fa)"
    mashmap -r "$in" -q "$in" -o "$out.mashmap.paf" --pi 70 -k 11 -s 500 -t 16 -n 10 \
        && mashmap-align -s "$in" -q "$in" --mappingFile "$out.mashmap.paf" --pi 0 -t 16 -o "$out.paf" \
        && seqwish -s "$in" -p <(fpa drop -l 3000 <"$out.paf") -g "$out.gfa" -t 16 \
        && odgi build -g "$out.gfa" -o - \
        | odgi sort -i - -o - -p sYgYs -k 1000 -G 1 -A -t 16 -P \
        | odgi view -i - -g >"$out.sort.gfa"
done
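To see which alignments feed the graph, it can help to list their identities. The sketch below estimates a BLAST-like identity as matching bases divided by alignment block length (PAF columns 10 and 11), which is only meaningful when the PAF comes from base-level alignment. The records and the helper names paf_identity and toy_paf are made up for illustration.

```shell
#!/usr/bin/env bash

# Print query name, target name, and percent identity (100 * matches / block length).
paf_identity() {
    awk -F'\t' '$11 > 0 { printf "%s\t%s\t%.1f\n", $1, $6, 100 * $10 / $11 }'
}

# Toy PAF records (hypothetical names and coordinates), 12 mandatory columns each.
toy_paf() {
    printf 'alt1\t1000\t0\t1000\t+\talt2\t1000\t0\t1000\t900\t1000\t60\n'
    printf 'alt1\t1000\t0\t1000\t+\talt3\t1000\t0\t1000\t700\t1000\t60\n'
}

toy_paf | paf_identity
```

Running paf_identity on the PAF written by the alignment step, before and after the length filter, gives a quick view of how divergent the retained alignments are.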
The PanGenome Graph Builder, pggb, combines wfmash, seqwish, and smoothxg to build a normalized graph in which the local sequence representation is partially ordered.
The graph itself can represent any kind of variation detectable by the alignment parameters.
Instructions for rebuilding these graphs are in graphs/pggb/README.md.