-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Truth set high-confidence regions and shiftable indels #29
Comments
This is a great point that we've been thinking about in Platinum Genomes (and I know @jzook is too). PG has tried to tackle this problem in the upcoming 2017.1 release through filtering partially-covered STRs: any STR (as defined in the hipSTR reference beds) will be masked if their coverage with confidence regions is <1. e.g. the confidence blocks in 4 are now removed: However it's not a complete fix as your third example has a longer repeat unit (>6 bp) so is left as-is. We've also added an extra confidence padding base after insertions so example 1 is covered and so any right-shifted call should be counted by hap.py as a TP. Using UPS-indel to mask confidence blocks around failed indels is a great idea — I'll try this for PG 2017.2, I think @pkrusche is playing with it too. Thanks for the script! |
Interesting! I also have a script that computes maximal "wobble" for each
variant (prefix or suffix) based on the reference sequence. The major
limitation is that the maximum extents in practice are based on shifting
variants across the potential alternative sequences. So sometimes the
reference based extents are too small. For a trivial example of this,
consider an STR expansion or contraction where the reference sequence has
an impure repeat unit in the middle.
1 2
012345678901234567890123456789
REF: AGAGAGAGAGAGTTAGAGAGAGAGAGAGAG
ALTa: AG
ALTb: --
Using only the reference sequence, the left extent of the STR contraction
(ALTb) would be left of the TT bases (position 14). However, if the TT->AG
substitution (ALTa) was known to be in cis with the contraction, then the
left extent would be at position 0.
…-Kevin
On Thu, Jun 1, 2017 at 11:04 PM, Len Trigg ***@***.***> wrote:
Gold standard benchmarking datasets such as the GIAB/NIST NA12878 and the
Illumina Platinum Genomes truth sets provide both a VCF containing the
variant records themselves, plus a BED file of "high confident regions".
The confident regions are used to indicate regions where there are no
variants present other than those contained within the associated VCF.
There may be several reasons why some parts of the genome are regarded as
low-confidence, but one category is regions where variant calls are made
that are inconsistent between callers or inconsistent with pedigree.
The actual variant comparison is usually carried out using haplotype-aware
tools like vcfeval/hap.py which can identify equivalence between variants
that use alternative representations. Note that there can sometimes be
large positional differences between equivalent variants. These tools are
smart enough to know that if a query variant outside of the confident
regions matches a truth variant inside a confident region, this should be
regarded as a true positive. However, we would ideally want the opposite to
occur -- that is if a query variant inside a confident region is equivalent
to a truth variant outside a confident region it should not be considered a
false positive (otherwise this would bias the accuracy metrics based on
which representation they elected to use). However, we can't generally do
this, because the truth sets rarely include variants outside of the high
confident regions, and secondly, by the nature of low confidence regions an
exact match is unlikely anyway. (Some discussion is at #11
<#11>)
Ideally, confident regions should not be defined such that this type of
thing is likely to occur. It seems like you would not want to create a
narrow low-confidence region around a variant if that variant could equally
be placed at an alternative location -- the low-confidence region should
cover the range of positions at which the variant could have been
represented.
I mentioned this potential issue on one of the GA4GH benchmarking calls
and considered and experiment to right-align variants to see when this
would shift them across confident region boundaries. However, a tool was
recently published that makes this experiment easier. UPS-Indel
<http://biorxiv.org/content/early/2017/05/03/133553> aims to provide a
universal positioning coordinate for indels regardless of their
representation, in order to permit fast detection of redundant/equivalent
indels by exact comparison of this coordinate. Essentially the
UPS-Coordinate contains the reference span across which the indel could be
placed, plus the sequence of the indel (there is also some light
decomposition that is carried out). While the UPS-Indel comparison method
is not capable of identifying complex equivalences involving multiple
variants, for one-to-one variants the authors report it working well. As a
side note, their preprint reports vcfeval performing poorly in comparison
to ups-indel, however, it turns out that UPS-indel looks only at ALTs (i.e.
sample-free), while their vcfeval runs were requiring diploid sample
genotype matches, so obviously it would report fewer matches. I obtained
the VCFs from the authors and running vcfeval in sample-free mode did yield
very similar results. (Additionally, their dataset was a subset containing
deletions only, which limited the likelihood of encountering situations
requiring more than one-to-one matching ability).
Anyway, the UPS-Coordinates provide an easy way to look for cases where
variants can shift across confident region boundaries, allowing us to
identify potentially problematic sites. To start off with, we create a BED
file corresponding to the UPS-coordinates of all indels in the variant set,
and then intersect this with a BED file corresponding only to the borders
of the confident regions.
Some interesting sites that turned up
I ran the intersection process on both GIAB 3.3.2 and PG 2016-1.0 and had
a browse around. Since the inputs were truth VCFs rather than
low-confidence indels, most of the examples were cases of a confident
variant that could also be represented as a variant in a low-confidence
region (i.e. the case already handled by our comparison tools). This would
be better done using a VCF corresponding to the low-confidence indels that
Justin or Illumina use when creating their truth sets, but there are still
some interesting cases.
In the following screenshots we have the GIAB truth VCF at the top, the
UVCF BED regions showing the indel variant spans, the high confidence
regions, and the intersection between UVCF and confident region boundaries
("unsafe"). This sequence of tracks is repeated for the PG set in the lower
half of the display.
[image: 1_5 869 462_5 869 535]
<https://cloud.githubusercontent.com/assets/282098/26712320/7d754640-47ba-11e7-957f-dc8f3fd20a33.png>
1:5,869,472-5,869,526 - PG contains an indel which is just outside a
confident region. The indel is capable of sliding into the downstream
region, so a caller that made the same call but positioned it here to the
right would be assigned a FP.
[image: 1_6 187 990_6 188 099]
<https://cloud.githubusercontent.com/assets/282098/26712330/869a18b8-47ba-11e7-8624-e2a0ef2c57b7.png>
1:6,187,990-6,188,099 - PG contains an indel inside a two-base confident
region which is inside roughly 50bp of low confidence. The indel can be
repositioned inside the low-confidence zone. How confident are those two
bases really, given the context?
[image: 1_12 141 963_12 142 110]
<https://cloud.githubusercontent.com/assets/282098/26712458/12961402-47bb-11e7-8b46-9d22df6b599d.png>
1:12,141,963-12,142,110 - GIAB looks to have included a call that PG has
spliced out. The call itself partially overlaps the GIAB high confidence
regions. Should the GIAB high confidence region end mid-call? In addition,
the call also has enough wiggle that it could be repositioned entirely
within or entirely outside the PG high-conf regions. Can PG really claim to
be be confident about that adjacent region?
[image: 1_50 938 017_50 938 164]
<https://cloud.githubusercontent.com/assets/282098/26712350/9920c8ce-47ba-11e7-9d43-add5d10a6398.png>
1:50,938,017-50,938,164 - GIAB looks to have excluded a call that PG has
also spliced out. However, the call has enough wiggle that it could be
repositioned to overlap the GIAB high confidence region, and it could
happily shift through some PG high-conf regions.
A simple way of making the high-confidence regions a little more robust to
these issues would be to compute the union of the UPS-coordinates of
rejected indels and subtract this from the existing high-confidence regions.
script.zip
<https://github.com/ga4gh/benchmarking-tools/files/1046781/script.zip>
containing a script as a starting point if someone wants to try this out
themselves.
@pkrusche <https://github.com/pkrusche> @jzook <https://github.com/jzook>
@RebeccaTruty <https://github.com/rebeccatruty>
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#29>, or mute the
thread
<https://github.com/notifications/unsubscribe-auth/ABM-v4QOwmIn1iuDQ8gl72U88Ht1oiuFks5r_6X0gaJpZM4Nt3cW>
.
|
Gold standard benchmarking datasets such as the GIAB/NIST NA12878 and the Illumina Platinum Genomes truth sets provide both a VCF containing the variant records themselves, plus a BED file of "high confident regions". The confident regions are used to indicate regions where there are no variants present other than those contained within the associated VCF. There may be several reasons why some parts of the genome are regarded as low-confidence, but one category is regions where variant calls are made that are inconsistent between callers or inconsistent with pedigree.
The actual variant comparison is usually carried out using haplotype-aware tools like vcfeval/hap.py which can identify equivalence between variants that use alternative representations. Note that there can sometimes be large positional differences between equivalent variants. These tools are smart enough to know that if a query variant outside of the confident regions matches a truth variant inside a confident region, this should be regarded as a true positive. However, we would ideally want the opposite to occur -- that is if a query variant inside a confident region is equivalent to a truth variant outside a confident region it should not be considered a false positive (otherwise this would bias the accuracy metrics based on which representation they elected to use). However, we can't generally do this, because the truth sets rarely include variants outside of the high confident regions, and secondly, by the nature of low confidence regions an exact match is unlikely anyway. (Some discussion is at #11)
Ideally, confident regions should not be defined such that this type of thing is likely to occur. It seems like you would not want to create a narrow low-confidence region around a variant if that variant could equally be placed at an alternative location -- the low-confidence region should cover the range of positions at which the variant could have been represented.
I mentioned this potential issue on one of the GA4GH benchmarking calls and considered and experiment to right-align variants to see when this would shift them across confident region boundaries. However, a tool was recently published that makes this experiment easier. UPS-Indel aims to provide a universal positioning coordinate for indels regardless of their representation, in order to permit fast detection of redundant/equivalent indels by exact comparison of this coordinate. Essentially the UPS-Coordinate contains the reference span across which the indel could be placed, plus the sequence of the indel (there is also some light decomposition that is carried out). While the UPS-Indel comparison method is not capable of identifying complex equivalences involving multiple variants, for one-to-one variants the authors report it working well. As a side note, their preprint reports vcfeval performing poorly in comparison to ups-indel, however, it turns out that UPS-indel looks only at ALTs (i.e. sample-free), while their vcfeval runs were requiring diploid sample genotype matches, so obviously it would report fewer matches. I obtained the VCFs from the authors and running vcfeval in sample-free mode did yield very similar results. (Additionally, their dataset was a subset containing deletions only, which limited the likelihood of encountering situations requiring more than one-to-one matching ability).
Anyway, the UPS-Coordinates provide an easy way to look for cases where variants can shift across confident region boundaries, allowing us to identify potentially problematic sites. To start off with, we create a BED file corresponding to the UPS-coordinates of all indels in the variant set, and then intersect this with a BED file corresponding only to the borders of the confident regions.
Some interesting sites that turned up
I ran the intersection process on both GIAB 3.3.2 and PG 2016-1.0 and had a browse around. Since the inputs were truth VCFs rather than low-confidence indels, most of the examples were cases of a confident variant that could also be represented as a variant in a low-confidence region (i.e. the case already handled by our comparison tools). This would be better done using a VCF corresponding to the low-confidence indels that Justin or Illumina use when creating their truth sets, but there are still some interesting cases.
In the following screenshots we have the GIAB truth VCF at the top, the UVCF BED regions showing the indel variant spans, the high confidence regions, and the intersection between UVCF and confident region boundaries ("unsafe"). This sequence of tracks is repeated for the PG set in the lower half of the display.
1:5,869,472-5,869,526 - PG contains an indel which is just outside a confident region. The indel is capable of sliding into the downstream region, so a caller that made the same call but positioned it here to the right would be assigned a FP.
1:6,187,990-6,188,099 - PG contains an indel inside a two-base confident region which is inside roughly 50bp of low confidence. The indel can be repositioned inside the low-confidence zone. How confident are those two bases really, given the context?
1:12,141,963-12,142,110 - GIAB looks to have included a call that PG has spliced out. The call itself partially overlaps the GIAB high confidence regions. Should the GIAB high confidence region end mid-call? In addition, the call also has enough wiggle that it could be repositioned entirely within or entirely outside the PG high-conf regions. Can PG really claim to be be confident about that adjacent region?
1:50,938,017-50,938,164 - GIAB looks to have excluded a call that PG has also spliced out. However, the call has enough wiggle that it could be repositioned to overlap the GIAB high confidence region, and it could happily shift through some PG high-conf regions.
A simple way of making the high-confidence regions a little more robust to these issues would be to compute the union of the UPS-coordinates of rejected indels and subtract this from the existing high-confidence regions.
script.zip containing a script as a starting point if someone wants to try this out themselves.
@pkrusche @jzook @RebeccaTruty
The text was updated successfully, but these errors were encountered: