How do we make Categorical Variants work programmatically for all flavors of Constraints & Recipes? #110

larrybabb · 2025-02-18T16:32:24Z

larrybabb
Feb 18, 2025
Maintainer

The concept of a Categorical Variant originated with the acknowledgement that genomic knowledge associated with variants requires the ability to associate that knowledge with one or more contextual variants (i.e. a specific variant change at a specific location on a specific reference sequence).

A typical case is the knowledge in ClinVar, where pathogenic classifications are associated with a specific change on a genomic variant. In actuality, the rare disease community that uses ClinVar considers all the contextual variants that originate from a single defining allele context. In ClinVar, the policy is to define all variants on the genomic build 38 (GRCh38, hg38) contextual location and change whenever possible. From that defining build 38 context, ClinVar lifts over to build 37 (and sometimes 36) to infer that those contextual variant representations are also associated with the variant classification. Additionally, ClinVar derives all the RefSeq transcript forms that align with the defining allele to provide additional contextual members of its representation of a ClinVar variant.

This typical case is also used in other resources to provide what we refer to as a Canonical Allele, a pattern or policy for defining the membership of possible contextual variants that satisfy the constraints of that policy.

As Cat-VRS has started to dig into the concept of a Categorical Variant, the idea of creating filters or constraints has become the primary principle behind how the Cat-VRS spec will provide a standard for all types of Categorical Variants needed by the community. Each constraint is analogous to a function that filters down the membership within a Categorical Variant, so that we can provide a set of ingredients from which implementers create a recipe that results in a Categorical Variant design that satisfies their needs. While there may be some standard or common patterns whereby recipes can be prebuilt into out-of-the-box Categorical Variants, the option to custom build is a requirement due to the dynamic nature of how the community needs to represent Categorical Variants for the purpose of associating and sharing discoveries and knowledge.
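The "constraint as a filter function, recipe as a set of ingredients" analogy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Variant`, `on_sequence`, `has_state` and `apply_recipe` are hypothetical names invented here, not classes or functions from Cat-VRS or VRS.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


# Hypothetical stand-in for a contextual variant; NOT a VRS or Cat-VRS class.
@dataclass(frozen=True)
class Variant:
    sequence: str  # reference sequence identifier
    start: int     # start coordinate on that sequence
    end: int       # end coordinate on that sequence
    state: str     # residues observed at the location


# A constraint is just a predicate over a single variant.
Constraint = Callable[[Variant], bool]


def on_sequence(sequence: str) -> Constraint:
    """Ingredient: keep only variants on the given reference sequence."""
    return lambda v: v.sequence == sequence


def has_state(state: str) -> Constraint:
    """Ingredient: keep only variants with the given residues."""
    return lambda v: v.state == state


def apply_recipe(candidates: Iterable[Variant], recipe: List[Constraint]) -> List[Variant]:
    """A recipe is a list of constraints; a member must satisfy all of them."""
    return [v for v in candidates if all(c(v) for c in recipe)]
```

Each ingredient stays trivial on its own, and a custom recipe is just a different list, e.g. `apply_recipe(candidates, [on_sequence("seqA"), has_state("T")])`.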

So how do we make this flexible design programmatically pragmatic and adoptable?

I suggest that we define how Constraints, or functions to filter members, should be applied in a system. In my experience, a system that constrains or filters a set of data must first have a basis, or starting set of data, on which to apply those constraints. Therefore, I'd like to clarify what the starting set is on which constraints are applied within the Categorical Variant model.

I think the vast majority of Categorical Variants are going to have a known sequence location as a starting point. I will also acknowledge that there may be different kinds of categorical variants on which sequence locations may not make sense. I would start by saying that if a sequence location can be used to set the basis or foundation of membership for a categorical variant then that categorical variant would only have members that are Molecular Variants in it, since SequenceLocation is tightly coupled with the concept of Molecular Variation. I would need domain experts to assist in defining other kinds of Categorical Variants, but I can imagine there will be others.

Taking this notion of a basis that is foundational to all Categorical Variants and applying it to our current Canonical Allele type of Categorical Variant I would re-design it along the following lines:

1. Break out the SequenceLocation from the DefiningAlleleConstraint and make it the basis for a subtype of CategoricalVariant; let's call it CategoricalLocationVariant (or something more appropriate as we get a clearer picture of this design).
2. Within the CategoricalLocationVariant, add a definingLocation attribute that takes a SequenceLocation. This attribute would scope the region in which all eventual members must exist and which all recipe Constraints use to filter the list of members down to the final result.
3. Create a new constraint from the DefiningAlleleConstraint, called AlleleStateConstraint, that is based only on the state of the variant at the definingLocation. It would be a super simple constraint (as all constraints should be, IMO) that specifies the state of the residues at the definingLocation.
4. Redefine the CanonicalAllele recipe to be based on this new CategoricalLocationVariant with a single AlleleStateConstraint.

This leaves the question of how the original relations concept should be handled in this new design. Is it a foundational aspect of how the overarching CategoricalVariant behaves, or is it a constraint? One could argue that it isn't really a constraint, since it doesn't limit the set of data but instead defines the scope of variants that the constraints can then limit. In other words, having both the definingLocation and relations as part of the CategoricalLocationVariant would give the constraints the starting dataset on which to begin constraining.
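To make the proposed shape concrete, here is a rough data-model sketch in Python. It is an assumption-laden illustration, not the Cat-VRS schema: the class layouts, the snake_case field names, the `canonical_allele` helper and the relation labels are all hypothetical placeholders chosen for this example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SequenceLocation:
    """Simplified stand-in for a VRS SequenceLocation (illustration only)."""
    sequence: str
    start: int
    end: int


@dataclass(frozen=True)
class AlleleStateConstraint:
    """Hypothetical constraint: only the residue state at the defining location."""
    state: str


@dataclass(frozen=True)
class CategoricalLocationVariant:
    """Hypothetical subtype whose scope is set by a defining location."""
    defining_location: SequenceLocation
    # Relations widen the scope (they are not filters); labels are placeholders.
    relations: Tuple[str, ...] = ()
    constraints: Tuple[AlleleStateConstraint, ...] = ()


def canonical_allele(
    location: SequenceLocation,
    state: str,
    relations: Tuple[str, ...] = ("liftover", "transcript"),
) -> CategoricalLocationVariant:
    """Prebuilt recipe: one defining location plus a single state constraint."""
    return CategoricalLocationVariant(
        defining_location=location,
        relations=relations,
        constraints=(AlleleStateConstraint(state),),
    )
```

In this sketch `relations` sits on the variant rather than among the constraints, which mirrors the argument that relations define the starting dataset instead of narrowing it.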

To test this proposal out we would need to apply this to the notion of other types of CategoricalVariants that we've found in practice from our registered implementers. For example, Categorical CNVs, Protein Sequence Expressions, ....

The main idea to focus on when testing this design is to verify whether it makes sense to define subclasses of CategoricalVariant that let users define the scope, or starting dataset, to which the one or more Constraints in the categorical variant recipe are applied to determine the final result set of variants (or members). With this approach, the programming would be able to emulate the functions needed to first define a region (sequence, cytoband, gene, etc.) that is then used as input to the constraints' functional filters to arrive at the final outcome.

While it is understood that the purpose of a Categorical Variant is not to always precisely define every member in a set, it should be possible to always take one or more variants and test it against a CategoricalVariant recipe to determine if it meets the scope and constraints it defines.
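The "scope first, then constraints" flow and the membership test described above might look roughly like the following. Everything here is a simplified assumption for illustration: `Location`, `Variant`, `Recipe`, `is_member` and `check_variants` are hypothetical names, and the scope check uses only literal containment on one sequence (it does not model relations such as liftover).

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Location:
    sequence: str
    start: int
    end: int

    def contains(self, other: "Location") -> bool:
        """True if `other` lies on the same sequence and within this interval."""
        return (
            self.sequence == other.sequence
            and self.start <= other.start
            and other.end <= self.end
        )


@dataclass(frozen=True)
class Variant:
    location: Location
    state: str


@dataclass(frozen=True)
class Recipe:
    scope: Location                                    # defines the starting set
    constraints: Sequence[Callable[[Variant], bool]]   # each one narrows it


def is_member(variant: Variant, recipe: Recipe) -> bool:
    """Step 1: the variant must fall in scope. Step 2: it must pass every constraint."""
    if not recipe.scope.contains(variant.location):
        return False
    return all(constraint(variant) for constraint in recipe.constraints)


def check_variants(variants: Sequence[Variant], recipe: Recipe) -> List[bool]:
    """Test one or more variants against the same recipe."""
    return [is_member(v, recipe) for v in variants]
```

Note that the recipe never enumerates its members; it only answers whether a given variant belongs.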

Finally, I would like to promote the idea that all Constraints are designed to be as simplistic as possible. Contriving complex constraints early on is an indicator that the design may be suboptimal and would also lead to a greater barrier for adoption in crafting the functions needed to share in a standard way.

DanielPuthawala · 2025-03-04T15:51:52Z

DanielPuthawala
Mar 4, 2025
Maintainer

Thanks for this well-thought-out post, Larry. It took me a while to get around to responding to this amid the PRC panic, but I don't want you to think that we're ignoring this. I agree with your characterization of categorical variation in the opening four paragraphs. On track with you there. In fact, I think I agree with most of what you lay out here, particularly your conclusions:

> While it is understood that the purpose of a Categorical Variant is not to always precisely define every member in a set, it should be possible to always take one or more variants and test it against a CategoricalVariant recipe to determine if it meets the scope and constraints it defines.

Yes, 100%.

> Finally, I would like to promote the idea that all Constraints are designed to be as simplistic as possible. Contriving complex constraints early on is an indicator that the design may be suboptimal and would also lead to a greater barrier for adoption in crafting the functions needed to share in a standard way.

Quite possibly, yes. I agree!

So, digging into the meat of your proposal: if I understand you correctly, you are making two (not necessarily related) proposals here.

The first proposal is to explicitly introduce top-down subtypes of CategoricalVariant. I won’t say (and do not mean to say) that this is necessarily a bad idea, but the reason we currently have one top-level CategoricalVariant class, and then capture the variation bottom-up via constraints, is that such a structure lets us guarantee that we’ll be able to describe anything people can come up with within the broad domain of categorical variation. Adding in subtypes can probably be done, but I would want to make sure that doing so does not silo the categorical variation domain by making it impractical to compare CatVars between subtypes, and that subtyping doesn’t introduce undesirable gaps in coverage.

The second proposal is, in essence, to break up the allele constraint into two separate constraints: one that simply handles the sequence location, and another that handles the sequence state. Something sort of like this.

This makes the constraints maximally simple, and is actually precisely what I originally suggested, as seen in The Daniel (reinserted here for convenience), where I differentiated CatVars based on sequence (state) and location.

The reason we didn’t go this route, if memory serves, was that we wanted to build in compatibility with VRS, where there is already a location vs. location + state class distinction in the form of vrs:Location and vrs:Allele. So if this is an accurate summary of your proposal (along with the associated recipe changes, etc.), Larry (and please correct me if it’s not), I guess my follow-up questions would be: (1) is there a reason that upstream compatibility with VRS is no longer as much of a priority? And (2) is this redesign worth pausing PRC for? (These are still only Trial-Use classes, so making this (breaking) change later on would only require a minor version release.)
