Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intermediate VCF, decision sub-types (BK) #4

Closed
Lenbok opened this issue Nov 4, 2015 · 30 comments
Closed

Intermediate VCF, decision sub-types (BK) #4

Lenbok opened this issue Nov 4, 2015 · 30 comments

Comments

@Lenbok
Copy link
Member

Lenbok commented Nov 4, 2015

For decision subtype, I am not clear what values to use for vcfeval results, as the terminology seems to be more oriented toward calls that have been through normalization.

I would probably put any TP as complex match since vcfeval doesn't have any notion of a "direct" match. I guess we could try to determine after-the-fact consider direct to be where the GT fields are the same, although this seems a bit pointless (and would have to be extra careful to handle cases where the input VCFs had multiple records at the same position). Slightly more natural (and useful) would be to count the number of TRUTH/QUERY variants in the sync region and take any 1:1 cases as "direct", even though they may actually be at different coordinates.

For non-TP the vcfeval semantics are really "this variant could not be included in a matching haplotype pair", which I would probably denote as complex mismatch. Our latest version can automatically run haploid matching on the FN/FP in order to find calls that have a common allele (e.g. typically matching calls that differ due to zygosity errors or half-calls), the closest semantics for any of these "haploid TP" here may be genotype mismatch?

I'm also not clear what BK=miss means (does it just mean that the call in this cell is '.'?)

@pkrusche
Copy link
Member

pkrusche commented Nov 5, 2015

Ultimately, I think the usefulness of this field would depend on the application. The reason this field is in here is to support quantifying cases where we get the alleles right but the genotypes wrong. I think this is probably more common than actually having different representations and has been useful for us when developing models for genotyping.

The decision of which BK values to use/support would be up to the comparison engine -- using only cm in vcfeval would probably be ok, too.

@pkrusche
Copy link
Member

pkrusche commented Nov 5, 2015

BK=miss would indicate that the call is present in only one of the datasets. I think in terms of vcfeval, this can be directly extracted from the truth / query decision types: Anything that is not TP in truth or in query could have BK=miss.

@Lenbok
Copy link
Member Author

Lenbok commented Nov 10, 2015

Can we include a more precise definition of what each of the possible values means?

Or perhaps it would make sense to say that it is up to the comparison engine to define/use its own set of distinct sub-types, and that the downstream quantification stage will perform grouping by whatever subtypes are actually used?

The latter approach would also allow optional sub-typing of the 'N' values (why did the engine decide not to assign a truth status).

@pkrusche
Copy link
Member

I think optional sub-typing of the N values is a good idea, different comparison engines will have different reason to not compare certain variants.

I guess for the current types, we could supply examples:

Direct match (m):

CHROM   POS     REF     ALT FORMAT      TRUTH       QUERY   
chrQ    1       A       T   GT:BK:BD    0/1:m:TP    0/1:m:TP

Complex match (cm):

Simple left-shifting case (more complicated things are possible):

CHROM   POS     REF     ALT FORMAT      TRUTH       QUERY
chrQ    1       AA      A   GT:BK:BD    ./.:.:.     0/1:cm:TP
chrQ    2       AA      A   GT:BK:BD    0/1:cm:TP   ./.:.:.

GT mismatch (gtmm):

Direct genotype mismatch, no other variants interfering in superlocus.

CHROM   POS     REF     ALT FORMAT      TRUTH       QUERY   
chrQ    1       A       T   GT:BK:BD    0/1:gtmm:FN 1/1:gtmm:FP

Allele mismatch (almm):

Direct allele mismatch, no other variants interfering in superlocus.

CHROM   POS     REF     ALT     FORMAT      TRUTH       QUERY   
chrQ    1       A       T,C     GT:BK:BD    0/1:almm:FN 0/2:almm:FP

Missing variants (miss):

Variant present in only one input dataset.

CHROM   POS     REF     ALT     FORMAT      TRUTH       QUERY   
chrQ    1       A       T   GT:BK:BD        0/1:miss:FN ./.:.:.
chrQ    2       A       G   GT:BK:BD        .:.:.       0/1:miss:FP

Complex mismatch (pm):

One goal here would be to highlight / stratify out difficult regions.

CHROM   POS     REF     ALT FORMAT      TRUTH       QUERY
chrQ    1       A       AT  GT:BK:BD    1:pm:TP     .:.:.
chrQ    2       T       TA  GT:BK:BD    1:pm:TP     .:.:.
chrQ    3       A       T   GT:BK:BD    .:.:.       1:pm:TP
chrQ    3       A       TT  GT:BK:BD    1:miss:FN   .:.:.
chrQ    9       AAT     A   GT:BK:BD    1:pm:TP     .:.:.
chrQ    11      T       A   GT:BK:BD    .:.:.       1:pm:TP
REF  ATA-AAAAAAAAAT
     ||x xx|||||||x
T    ATTATTAAAAAAAA
     ||||xx |||||||
Q    ATTAA-AAAAAAAA

Do these make sense?

@Lenbok
Copy link
Member Author

Lenbok commented Nov 12, 2015

So, the important thing is what is the quantification stage going to do with the sub-types (if anything), and are the sub-types expected to be comparable across the comparison engines.

My gut feel is that it would be hard to have the sub-types be exactly comparable across engines, as each engine has different notions of which distinctions are natural to calculate. (@bioinformed It would be good to hear what categorizations are natural to vgraph)

For example, vcfeval doesn't distinguish between direct and complex matches, so flagging all its matches as one or other wouldn't be strictly correct according to the current m/cm sub-type definitions. It would be easy to identify "fairly-non-complex" matches based on the number of variants within the vcfeval sync region, yet not natural for xcmp to compute. Similarly, the vcfeval notion of common-allele mismatches is similar to, but not quite the same as gtmm (any gtmm would fall into this category, plus non-direct versions)

Some examples of useful vcfeval sub-types for matches and non-matches:

Match:

This example is the same as your complex mismatch example (I am not sure why you labelled those TP as pm rather than cm, maybe just a typo?)

CHROM   POS     REF     ALT FORMAT      TRUTH       QUERY
chrQ    1       A       AT  GT:BK:BD    1:M:TP      .:.:.
chrQ    2       T       TA  GT:BK:BD    1:M:TP      .:.:.
chrQ    3       A       T   GT:BK:BD    .:.:.       1:M:TP
chrQ    3       A       TT  GT:BK:BD    1:MISS:FN   .:.:.
chrQ    9       AAT     A   GT:BK:BD    1:M:TP      .:.:.
chrQ    11      T       A   GT:BK:BD    .:.:.       1:M:TP

Simple match:

No other variants interfering in superlocus, but there may be representation differences (not implemented, so these would currently just appear as the default match sub-type).

CHROM   POS     REF     ALT FORMAT      TRUTH       QUERY
chrQ    1       AA      A   GT:BK:BD    0/1:SM:TP   0/1:SM:TP
CHROM   POS     REF     ALT FORMAT      TRUTH       QUERY
chrQ    1       AA      A   GT:BK:BD    ./.:.:.     0/1:SM:TP
chrQ    2       AA      A   GT:BK:BD    0/1:SM:TP   ./.:.:.

The utility of this sub-type would be to allow easy separation of the quite complicated matches for investigation (alternatively this could be just done by superlocus postprocessing).

Mismatch:

The default FN/FP category.

CHROM   POS     REF     ALT FORMAT      TRUTH       QUERY
chrQ    1       AA      T   GT:BK:BD    ./.:.:.     0/1:MISS:FP
chrQ    2       AA      A   GT:BK:BD    0/1:MISS:FN ./.:.:.
CHROM   POS     REF     ALT FORMAT      TRUTH       QUERY
chrQ    1       AA      A,T GT:BK:BD    0/1:MISS:FN 0/2:MISS:FP

Common-allele mismatch:

There is a path which has a common allele. Includes zygosity errors (independent of representation).

CHROM   POS     REF     ALT FORMAT      TRUTH       QUERY
chrQ    1       AA      A   GT:BK:BD    0/1:CAMM:FN 1/1:CAMM:FP
CHROM   POS     REF     ALT FORMAT      TRUTH       QUERY
chrQ    1       AA      A   GT:BK:BD    ./.:.:.     1/1:CAMM:FP
chrQ    2       AA      A   GT:BK:BD    0/1:CAMM:FN ./.:.:.
CHROM   POS     REF     ALT FORMAT      TRUTH       QUERY
chrQ    1       AA      A,T GT:BK:BD    0/1:CAMM:FN 1/2:CAMM:FP

Highlighting / stratifying out difficult regions would be done by looking the the number of variants sharing the same superlocus id.

The point is that there isn't really a 1:1 correspondence between the xcmp and vcfeval sub-types, so I think that allowing the BK sub-types to be engine-defined seems a better way to go.

@jzook
Copy link
Contributor

jzook commented Nov 18, 2015

I think it is probably important to standardize some of these BK sub-types as much as possible in order to allow the counting step to output our standardized definitions of performance metrics (e.g., TP, FP, FN, etc) at different stringencies (see the different comparison methods at https://github.com/ga4gh/benchmarking-tools/blob/master/doc/standards/GA4GHBenchmarkingPerformanceMetricsDefinitions.md). While standardizing complex match and complex mismatch probably aren't needed for this, standard definitions for GT mismatch (gtmm) and Allele mismatch (almm) are likely needed for some of the comparison methods (in particular, #1 and #2). I think Peter's definitions fit these comparison methods pretty well (if for #1 the "region" is +-0bp). If a tool isn't able to fit these definitions, then perhaps it shouldn't output gtmm or almm? Alternatively, we could modify the comparison method definitions if we think they aren't useful.

@Lenbok
Copy link
Member Author

Lenbok commented Nov 19, 2015

My interpretation of the comparison methods in the document (and indeed the goals of the project) was that they should be independent of the variant representation. It should therefore not matter whether a match is direct vs non-direct (whether that be for a allele-only match or a genotype match). Distinguishing on the basis of direct vs non-direct only serves to provide a measure of the degree to which the same representational choices were made.

In fact, as long as the engine outputs an overall status as genotype-match / Allele-match (ideally indicating which allele was the "common" allele, to allow disambiguating pure zygosity errors vs multiallelic cases) / Mismatch, then the determination of whether there is a "direct" match seems to be a syntactic notion that could be determined during quantification (should you want to). (This would avoid needing multiple, possibly conflicting, implementations).

Perhaps the engine should output sub-types (corresponding to those in the comparison document) such as: gtmatch / almatch / miss.

@jzook
Copy link
Contributor

jzook commented Nov 23, 2015

I agree that the direct/indirect distinction is not really important, and
so it might be an interesting subcategory but not required or important to
standardize. Perhaps direct/indirect could be a separate optional field,
since it sounds like hap.py can generate it but it may not make sense for
other tools?

In terms of subtypes, in addition to gtmatch, almatch, and miss, I think we
need a fourth category for Comparison Method #1 (when a test variant is a
TP if it is within x-bp of a benchmark variant) that would be something
like regmatch (for "regional match"). Again, it wouldn't be required if
the comparison tool can't output it, but it is useful in certain
circumstances (e.g., if calling a variant triggers a manual inspection of
the region, or if you want to diagnose the source of FPs/FNs). What do you
think?

On Thu, Nov 19, 2015 at 6:24 PM, Len Trigg notifications@github.com wrote:

My interpretation of the comparison methods in the document (and indeed
the goals of the project) was that they should be independent of the
variant representation. It should therefore not matter whether a match is
direct vs non-direct (whether that be for a allele-only match or a genotype
match). Distinguishing on the basis of direct vs non-direct only serves to
provide a measure of the degree to which the same representational choices
were made.

In fact, as long as the engine outputs an overall status as genotype-match
/ Allele-match (ideally indicating which allele was the "common" allele, to
allow disambiguating pure zygosity errors vs multiallelic cases) /
Mismatch, then the determination of whether there is a "direct" match seems
to be a syntactic notion that could be determined during quantification
(should you want to). (This would avoid needing multiple, possibly
conflicting, implementations).

Perhaps the engine should output sub-types (corresponding to those in the
comparison document) such as: gtmatch / almatch / miss.


Reply to this email directly or view it on GitHub
#4 (comment)
.

@pkrusche
Copy link
Member

I agree that the direct / indirect match category is something that could also live in a separate optional field -- BT comes to mind? This would also clarify #3 .

I do think it is useful to keep track of the cases which have a different representation, mainly because convincing people that different representations are an important thing to handle is also one of our goals.

@Lenbok
Copy link
Member Author

Lenbok commented Dec 9, 2015

Justin said in #8:

  1. You have the categories posed negatively (e.g., "allele mismatch" instead of "allele match" as Len proposed). I think they end up giving the same information (as long as my additional regional match category is added to Len's), but it would be good to hear from Len whether he thinks the positive notation has advantages. I don't see any major advantages or disadvantages to either.

I quite like the positive category labels, as then the BK seems more consistent across the different comparison types (in fact the quantification breakdowns can be probably even be derived from the BK depending on whether you want to do produce statistics according to method 1/2/3). It seems quite natural to assign:

  • BK in {gtmatch, almatch, loosematch} -> Loose TP
  • BK in {gtmatch, almatch} -> Allele TP
  • BK in {gtmatch} -> Genotype TP
    and any non BD==N variant is either FP/FN depending on which set it came from.

I have a test branch which implements most of this if you want to try it (doesn't support loosematch, or splitting records to handle the multi-BS edge case mentioned in #5). Use --output-mode ga4gh to produce the new format output file.

@jzook
Copy link
Contributor

jzook commented Dec 9, 2015

@pkrusche I agree that BT seems like a good home for the direct/indirect distinction. Does that make sense to you @Lenbok ?

@Lenbok I'm inclined to agree that the positive categories for BK seem to have some advantages. @pkrusche, do you see any advantages to the negative categories?

@pkrusche
Copy link
Member

pkrusche commented Dec 9, 2015

I like positive categories and the fact that they are easy to explain for TPs.

I think the main application of negative classification would be stratifying FPs. E.g. looking at het/hom mismatches or allele mismatches (e.g. deletions/insertions of slightly different length) is useful.

@pkrusche
Copy link
Member

Maybe to make some progress on this, we could split this into two fields:

A BK field which says either

  • match -- within the current superlocus, variants produce the same/compatible haplotypes
  • mismatch -- both inputs contain variants within the current superlocus and this one in particular is not compatible
  • miss -- only one input contains variants within the current superlocus

We can introduce a separate field BX (benchmarking extended match information?) which can then (optionally, depending on the benchmarking type / engine) be used to encode things like

  • unspecified match / mismatch
  • gtmatch (TP) / gtmismatch (FP)
  • almatch (TP) / almismatch (FP)
  • other things like length expansions (STR deletion/insertion that is shorter / longer by n repeat units)

@jzook
Copy link
Contributor

jzook commented Dec 16, 2015

Thanks for keeping this moving. Does having 2 fields make it easier to do counting? As far as I understand, BX is just more specific than BK, and the BK status could be inferred from BX. Is that right? Also, wouldn't TP's always be gtmatch?

@pkrusche
Copy link
Member

I guess we could construct more complicated het / hetalt / phasing examples where GTs for TPs can be slightly different.

The advantage I can think of for doing two fields rather than one is that it would be easier to transparently describe how they affect different parts of the counting procedure:

  • BK would determine where FPs become UNKs (i.e. are miss outside the coverage of the truth set)
  • BX would be used only when we use comparison methods that can mark GT mismatches / compare alleles / STR calls.

@Lenbok
Copy link
Member Author

Lenbok commented Dec 16, 2015

Regarding the BK field, I had assumed that the surrounding framework would take care of that (e.g. by pre-filtering any calls outside the truth confident regions before passing to the engine), and I think it's good to keep such filtering separate from the matching semantics as much as possible.

The BX field sounds kind of unwieldy, as it seems like it would become a multi-valued field (e.g. a call could be both a gtmismatch but an almatch, and possibly other values too). With the superset approach of loose > almatch > gtmatch only one value is needed, and this seems to be the primary categorisation in the comparison document.

I agree that additional annotations for subcategories like STR calls could be useful though.

@jzook
Copy link
Contributor

jzook commented Dec 16, 2015

I think the superset approach of loose > almatch > gtmatch would simplify
things as well, though STRs might not fit easily in this. Maybe between
loose and almatch?

In terms of prefiltering, my one concern with this is where part of a
complex variant is excluded by the bed file in one representation and
included in another representation. I've found this to be quite common in
FPs/FNs, and it can be helped by doing the filtering after comparison, but
it gets complicated - maybe this should be a different issue?

On Wed, Dec 16, 2015 at 2:07 PM Len Trigg notifications@github.com wrote:

Regarding the BK field, I had assumed that the surrounding framework would
take care of that (e.g. by pre-filtering any calls outside the truth
confident regions before passing to the engine), and I think it's good to
keep such filtering separate from the matching semantics as much as
possible.

The BX field sounds kind of unwieldy, as it seems like it would become a
multi-valued field (e.g. a call could be both a gtmismatch but an almatch,
and possibly other values too). With the superset approach of loose >
almatch > gtmatch only one value is needed, and this seems to be the
primary categorisation in the comparison document.

I agree that additional annotations for subcategories like STR calls could
be useful though.


Reply to this email directly or view it on GitHub
#4 (comment)
.

@pkrusche
Copy link
Member

Using this approach, maybe we could define BK to have the following values and do without BX:

missing; lose match [there is a variant nearby] > approx. almatch > almatch > gtmatch > hapmatch [haplotypes can be resolved as the same]

We could then define TP/FP/FN on top of this for different comparison methods as follows:

Lose comparison:

TP: BK in {lose match / approx / almatch / gtmatch / hapmatch}
FP: query BK == missing
FN: truth BK == missing

Exact allele comparison:

TP: BK in {almatch / gtmatch / hapmatch}
FP: query BK in { missing, lose, approx }
FN: truth BK in { missing, lose, approx }

Exact GT comparison:

TP: BK in {gtmatch / hapmatch}
FP-al: query BK in { missing, lose, approx }
FP-gt: query BK in { almatch } // Alleles match but GT is wrong
FP = FP-gt + FP-al
FN: truth BK in { missing, lose, approx, almatch }

W.r.t. prefiltering I also agree with @jzook -- the main reasoning for doing the UNK regions after comparison in hap.py is that this allows us to be more accurate at the boundaries of confident regions -- when variants have different representations in the query, they might not necessary fall into the confident call regions, so filtering before comparison would wrongly give FNs.

@Lenbok
Copy link
Member Author

Lenbok commented Dec 22, 2015

@pkrusche I am unsure of the semantics of the additional BK values you've added (particularly loose vs approx, and gtmatch vs hapmatch), can we tighten/define them?

I think this is what you mean, please correct if I'm wrong:

  • missing = no match at any level
  • loose match = the truth/query variant is nearby to one in the query/truth
  • approx = ??
  • almatch = the variant forms (part of) an allele match (independent of representation, i.e. one-sided haplotype match)
  • gtmatch = there is a direct GT match (i.e. requires the same representation be used)
  • hapmatch = diploid haplotypes were resolved to be the same (independent of representation)

(Previously I was using "gtmatch" to correspond to a representation independent diploid haplotype match).

So gtmatch seems to be a more specific match than hapmatch in that the haplotypes need to match and the same representation be used? Or are you thinking they would be done independently according to different matching techniques? (and since vcfeval only matches haplotypes, it would output almatch/hapmatch, never gtmatch)

I'll open a separate issue for discussion around confident regions.

@RebeccaTruty
Copy link

I agree with @Lenbok that the way he's defined gtmatch doesn't make a lot of sense. I feel like that's getting back into the direct/indirect comparison which we've agreed isn't very useful.

Also, I'm guessing that @pkrusche intended the approx category to address things like STR's (@pkrusche correct me if I'm wrong). I agree that's useful information, but since it's not included in any of our comparison methods I'm not sure it belongs in BK. I would propose putting that information into a separate optional (since not all engines will compute it) field. Unless we all think we should add an additional comparison method that addresses STRs?

Anyway, I would propose the following definitions:

  • missing = no match at any level
  • loose match = the truth/query variant is nearby to one in the query/truth
  • almatch = the variant forms (part of) an allele match (independent of representation, i.e. one-sided haplotype match)
  • gtmatch = diploid haplotypes were resolved to be the same (independent of representation)

Also, stop me if this is a terrible idea (and it may be since different comparison engines will define superloci differently), but it seems like a in our comparison method #1 we could define a loose match as "there is a variant in both the truth and the query within this superlocus" rather than an arbitrary 50bp (or whatever length) threshold.

@pkrusche
Copy link
Member

pkrusche commented Jan 6, 2016

Yes, these definitions look good to me. I also agree that leaving STRs for later and linking loose matches to superloci seem like good ideas.

@jzook
Copy link
Contributor

jzook commented Jan 6, 2016

I also think the definitions proposed by @RebeccaTruty make sense, so it sounds like everyone may agree on these?

I'm not sure I like the idea of linking loose matches to superloci because I suspect that it will be hard for people to interpret what the output means, especially when each tool has a different, complicated, definition for superloci. 50bp is arbitrary though, so it's likely useful for 50bp to be a parameter that could be changed. Right now, the hapdip tool from @lh3 is the only tool I know that does loose matching, and it allows the user to specify the "looseness".

@RebeccaTruty
Copy link

Perhaps we shouldn't drop the x-bp version, but if it's easy to compute a superloci-match (I suspect it is) then maybe we can add it? But I would consider it low priority.

@jzook
Copy link
Contributor

jzook commented Jan 7, 2016

I'm fine with keeping the superloci match as well. How about these as definitions:

  • missing = no match at any level tested by the comparison tool
  • superlocus match = the truth/query variant is within the same superlocus as a variant in the query/truth
  • loose match = the truth/query variant is nearby a variant in the query/truth
  • almatch = the variant forms (part of) an allele match (independent of representation, i.e. one-sided haplotype match)
  • gtmatch = diploid haplotypes (and genotypes) were resolved to be the same (independent of representation)

@RebeccaTruty
Copy link

+1

@Lenbok
Copy link
Member Author

Lenbok commented Jan 7, 2016

Since superloci can vary in size (potentially depending both on the algorithm and the complexity of the region) it's not true that the superlocus match category is either more general or more specific than loose match. Thus you can no longer get away with supplying a single BK value. Rather than making BK be multi-valued, I would opt for simplicity by leaving out the superlocus match unless there you feel there is a clear need for it.

@jzook
Copy link
Contributor

jzook commented Jan 7, 2016

This is a good point though I was not thinking of making it multivalued. I
was thinking that tools would likely report one or the other, and if they
do calculate both then the loose match would take precedence since it is
easier to interpret. However I'm ok with eliminating super locus match
also. Any strong preference between these options?
On Thu, Jan 7, 2016 at 3:43 PM Len Trigg notifications@github.com wrote:

Since superloci can vary in size (potentially depending both on the
algorithm and the complexity of the region) it's not true that the
superlocus match category is either more general or more specific than
loose match. Thus you can no longer get away with supplying a single BK
value. Rather than making BK be multi-valued, I would opt for simplicity by
leaving out the superlocus match unless there you feel there is a clear
need for it.


Reply to this email directly or view it on GitHub
#4 (comment)
.

@RebeccaTruty
Copy link

Good point Len, I hadn't thought of that. In that case I agree with Justin that we can just drop the superlocus match.

@pkrusche
Copy link
Member

pkrusche commented Jan 8, 2016

I added some examples in #12 -- please have a look and see if this makes sense -- here is a link to the updated file:

https://github.com/ga4gh/benchmarking-tools/blob/issue-4-clarify-bk/doc/ref-impl/intermediate.md

@RebeccaTruty
Copy link

Looks good to me. Thanks Peter!

pkrusche added a commit that referenced this issue Jan 11, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants