Skip to content

Reading GLnexus pVCFs

Mike Lin edited this page Oct 30, 2019 · 14 revisions

One of GLnexus' main functions is to generate a population-wide "project" VCF (pVCF) based on the input gVCFs for each individual sample. To that end, a key challenge is to derive a unified representation of the sites and alleles discovered across the population, navigating the numerous "degrees of freedom" VCF admits in expressing multi-allelic sites (further reading). Here we'll explain how GLnexus approaches this, important information for interpreting generated pVCFs.

Background

The pVCF can be viewed as a matrix with variant sites down the rows, each having two or more alleles, and samples across the columns, with the matrix filled in with all the individual genotypes. Ideally, this matrix should provide a partitioned representation of all the discovered genetic variation, where all the sites' genome positions are mutually non-overlapping, and all the alleles at each site are mutually exclusive of the others. Such partitioning facilitates downstream analysis, e.g. population genetic statistics or haplotype inference, because they can simply tally the independent events reported in the matrix. If, on the other hand, sites are overlapping and/or their alleles entangled, then the event space is trickier to reason about downstream (or such reasoning is skipped, with undefined results).

The nicely partitioned representation is easy for SNPs, but complications arise for indels and MNPs, which can overlap other alleles found in the population at slightly different positions. As a result, it isn't strictly possible to present a partitioned pVCF to represent all high-quality variants.

GLnexus approach

GLnexus first unifies as many alleles as possible into non-overlapping multiallelic sites, calling each sample's diploid genotype as usual. The unification can complicate the representation of variants; for example, if the alternate alleles at a site include both a SNP and an MNP, the SNP is "padded" with reference bases so that it covers the same reference genome range as the MNP. This can be inconvenient, but it follows from VCF's edit representation and the desire to keep sites partitioned wherever possible. And, it's much easier to break apart the multiallelic sites afterwards than to reassemble the diploid genotypes once broken apart.

For the remaining alleles that don't unify nicely, usually because they overlap multiple other sites, we also generate a second tier of "monoallelic" sites to represent each one. These monoallelic sites are marked in the FILTER field for easy recognition, and they have a slightly different format and interpretation. Their GT (genotype) fields indicate the called copy number of the ALT allele only (./. for zero copies, ./1 for one copy, or 1/1 for two copies). They don't assert presence/absence of the REF allele or any other, which avoids duplicating information conveyed in the overlapping multiallelic sites.

In this hypothetical example, Alice, Bob, and Carol carry alleles that unify into two non-overlapping, multiallelic sites, some with reference-padded representations. Dave carries a deletion allele that overlaps both of those sites, and thus doesn't unify with them cleanly. Instead, there's a separate monoallelic record indicating its copy number in each sample. Also note Dave's genotype entries for the two simpler sites, which we must be careful not to call 0/0. (The "half-calls" like ./0 are not too often seen with other variant calling methods, but they're useful to convey the most precise assertions possible.)

Empirically, monoallelic sites tend to comprise <1% of the pVCF and most of them arise in homopolymer and tandem repeat sequences, which can be suspect anyway. But they'll typically also include some high-quality variants, especially deletions.

Let's briefly mention possible alternatives to the approach we selected. Instead of generating monoallelic sites only from alleles that don't cleanly unify, we could represent all deletions and MNPs in monoallelic sites. This would treat them more uniformly, and avoid complicating the simpler alleles, but it would reduce the completeness of the conveniently partitioned sites and genotypes. Second, we could take this to the extreme and simply represent all alleles, including the reference, in the monoallelic style (further reading). This more revolutionary approach wouldn't be palatable to our downstream users for the time being.

Spirited debate about how to represent the tricky overlapping cases continues in the community.

"Minimal" allele IDs

To ameliorate the inconvenience of the padded allele representation discussed above, GLnexus sets the ID (column 3) in the pVCF output to a semicolon-separated list of allele IDs with the form "CHROM_minPOS_minREF_minALT", providing the minimal (unpadded) form of each of the site's ALT alleles. For example,

#CHROM  POS             ID                                              REF     ALT
chr12   111814329       chr12_111814329_AGAGT_A;chr12_111814332_G_A     AGAGT   A,AGAAT

The second ALT allele is a simple G>A SNV, but it's padded to AGAGT>AGAAT in the joint site because of the overlapping deletion. The corresponding ID provides the minimal representation, which is much easier to recognize as a SNV and search for in the file. Note that the minimal POS, REF, and ALT can all differ from the padded form.

Reasons for No Call

All entries in the GLnexus pVCF include a Reason for No Call (RNC) field, which is filled in if either GT entry is not called (.). The possible values of RNC include:

Code Reason Explanation
. N/A corresponding allele is called
M Missing data input (gVCF) had no data at this genome position
P Partial data input only partially covered this genome position
D Depth read depth too low to call
deletion sample carries a deletion allele that couldn't be unified into this site; there may be more information in overlapping monoallelic site(s).
L Lost allele ^ but other than deletion allele
U Unphased variants sample carries multiple non-overlapping variants at this position*, whose phase is not known, so the diploid genotype cannot be called safely. There may be more information in overlapping monoallelic site(s).
O Overlapping variants sample carries multiple overlapping variants at this position, so the diploid genotype cannot be called safely. (GLnexus does deal with several common, but not all, cases of overlapping variants output by gVCF callers.) There may be more information in overlapping monoallelic site(s).
1 monoallelic this is a monoallelic site; no assertion about presence/absence of any allele here