Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure reactome go-cams validate against main GO shex schema #69

Closed
1 of 2 tasks
goodb opened this issue Aug 23, 2019 · 142 comments
Closed
1 of 2 tasks

Ensure reactome go-cams validate against main GO shex schema #69

goodb opened this issue Aug 23, 2019 · 142 comments
Assignees
Labels
knowledge representation work here is for modelers and ontologists

Comments

@goodb
Copy link
Contributor

goodb commented Aug 23, 2019

Revisit and rework conversion rules or shex schema to bring them into alignment. Close when everything passes. Re-open individual tickets for future failures.

  • Implement re-usable software framework for testing models that is integrated with this repository's travis system
  • Complete a clean validation run using this framework
@goodb goodb self-assigned this Aug 23, 2019
@ukemi
Copy link

ukemi commented Aug 23, 2019

Fingers crossed!

@goodb
Copy link
Contributor Author

goodb commented Aug 23, 2019

Hit two issues so far.

  1. Need to ensure that protein complexes end up being subclasses of chemical entity. I added an assertion that molecular entity is a subclass of chemical entity to the Reactome Entity Ontology (RE). Complexes are subclasses of molecular entity. With that assertion or a chebi import (that contains the same assertion) they thus get to be chemical entities as well so models that use them as inputs outputs etc. can validate. This seems okay right now.
  2. [UPDATE - this is fixed in the schema now] The conversion uses 'causally upstream of' to link reactions/functions that are related by the 'nextEvent' relation in the biopax. In some cases this is subsequently specified (e.g. to directly positively regulates) but in many cases it is not. This breaks the shex model on the MF constraint:
    causally_upstream_of: BiologicalProcess *;
    as MFs aren't BPs.. I think this is an error in the shex schema - it looks like a current proposal would fix this: Proposed updates to Molecular Function shape go-shapes#137

Here is an example of the latter problem, which is very common.
Screen Shot 2019-08-23 at 2 59 14 PM

goodb pushed a commit that referenced this issue Aug 23, 2019
@goodb
Copy link
Contributor Author

goodb commented Sep 24, 2019

@vanaukenk in relation to geneontology/go-shapes#160 I am running all of the current conversions through the validator. It looks like about 25% of them do not pass as things stand today. I will catalogue failures here as I diagnose them.

Example 1 http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-8852276
There are cases in that one where a function is asserted to 'directly_provide_input_for' a process. Note that the process in this case is not specified (this is the current resolution for transport) e.g. look for "GTSE1 promotes translocation of TP53 to the cytosol"

Same problem in http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-168638
'K63 polyubiquitinated RIP2 associates with the TAK1 complex' 'directly_provide_input_for' 'TAK1 is activated'

Another similar example: http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-8941333 contains positive_regulates and causally_upstream of links from functions to untyped processes.
more examples:
http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-983231 same again.
http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-877300
http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-5660668
http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-9026290

@goodb
Copy link
Contributor Author

goodb commented Sep 24, 2019

Another problem case. http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-1482788
It looks like at least one problem here is the constraint that functions be enabled by InformationBiomacromolecules (or complexes). We have Set entities filling in the enabled_by slot that are only typed as 'chemical entities' (root). Schema is fine here, need to make the typing for Sets better.

In this case the Set is a mix of proteins and one complex. https://reactome.org/content/detail/R-HSA-1524112 maybe goes as a curation problem for Reactome - @deustp01 ?

goodb pushed a commit that referenced this issue Sep 24, 2019
#69 (comment)

In this version, if a set contains proteins and complexes (but not
anything else) it is typed as an informationbiomacromolecule .
@goodb
Copy link
Contributor Author

goodb commented Sep 25, 2019

The conversion sometimes results in functions with more than one enabler, e.g. "PH receptors autophosphorylate" in http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-2682334
The schema currently allows only 0 or 1 enablers.

Same for 'Phosphorylation of IRF-3/IRF7 and their release from the activated TLR3 complex' in http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-9013973

@ukemi
Copy link

ukemi commented Sep 25, 2019

Interesting again! We are faithfully representing the info.

In the first case, the function is enabled by EPHRINAs or EPHRINBs. Biologically this is correct. Any set will do this, but the restirction is hidden because the set is a single entity.
Spot checking the annotations for some of the EPHRIN genes in AmiGO shows that current annotations support these individual gene products enabling the molecular functions.

In the second example, it is also correct. Either entity can enable the kinase activity. (I'm not going to go down the path as to whether this actually represents a process of phosphorylation that is made up of two molecular functions. :) )

However, I thought it was against Reactome curation policy to allow more than one active unit in a reaction. @deustp01? Would it have been more consistent to create a set in these cases: p-S172-TBK1 [plasma membrane] and p-S172-IKBKE [plasma membrane]?

I was actually hoping we could go the route of more than one active subunit to solve the BMP receptor complex issue. If we could enumerate all of the catalytic subunits of the complex when we ditch the contributes_to rule, we could get more info. This skirts the whole issue of 'emergent functions', but do our users really care about that? When they ask for a list of genes that can carry out a function, don't they want all the members of a catalytic complex? @vanaukenk Complexes Complexes Complexes. Another alternative would be to propose that Reactome (and GO curators) somehow capture this info. Then how do we serve it up?

@goodb
Copy link
Contributor Author

goodb commented Sep 30, 2019

@ukemi @deustp01 it looks like recently introduced changes to the conversion have cured most of the problems above. Out of a sample of 650 converted pathways, only 51 now fail the shex validation test. (So about 92% success rate). I checked 10 of the failures. Of these 10, 3 failures were caused by having multiple enablers on a reaction node and untyped gene product nodes cause 7. Examples of reactions with untyped gene product nodes enabling reactions include:
Asymmetric localization of PCP proteins - unknown kinase
CS DS degradation.ttl - "beta-xylosidase" R-HSA-2247521
Metabolism of ingested H2SeO4 and H2SeO3 into H2Se - R-HSA-5359054
Glyoxylate metabolism and glycine degradation - R-HSA-6784434
Signal transduction by L1 - Phosphorylation of VAV2
Aflatoxin activation and detoxification - Unknown NAT transfers COCH3 to AFXBO-C, AFNBO-C
Vitamin B1 (thiamin) metabolism - TDPK (looks like reaction has been changed in release 70 but still untyped)

When the converter encounters a node without any information beyond "physical entity", as in the cases above, it types it as a chebi 'chemical entity' (which is the root of all physical entities in our system). The shex schema does not like functions to be enabled by anything aside from gene products or complexes.

Thoughts on this problem? I could infer that if an untyped node is enabling a reaction, it must be an information macromolecule and make that assignment when the models are created. Would that be okay?

@ukemi
Copy link

ukemi commented Sep 30, 2019

Interesting, Peter? It might be a better strategy to type these in Reactome. Some look like black-box reactions.

  1. Asymmetric localization of PCP proteins----Clearly the unknown kinase should be an enzyme/gene product or complex
  2. CS DS degradation.ttl - "beta-xylosidase" R-HSA-2247521----is a Genes and Transcripts [GenomeEncodedEntity] this should certainly be an information biomacromolecule.
  3. Metabolism of ingested H2SeO4 and H2SeO3 into H2Se - R-HSA-5359054----is a Genes and Transcripts [GenomeEncodedEntity] (PAPSe reductase) this should certainly be an information biomacromolecule.
  4. Glyoxylate metabolism and glycine degradation - R-HSA-6784434----- This is a black-box reaction. Is it safe to assume it is an information biomacromolecule? peroxisomal glyoxylate carrier [peroxisomal membrane]
  5. Signal transduction by L1 - Phosphorylation of VAV2-----Another black box unknown kinase. Safe to assume it is a biomacromolecule.
  6. Aflatoxin activation and detoxification - Unknown NAT transfers COCH3 to AFXBO-C, AFNBO-C--------Another unknown NAT [cytosol] black box. Safe to assume it is a biomacromolecule?
  7. Vitamin B1 (thiamin) metabolism - TDPK--------Looks like another enzyme where we know it exists, but don't know what it is. Should be a biomacromolecule.

@ukemi
Copy link

ukemi commented Sep 30, 2019

PS. would like Peter to comment too.

@ukemi
Copy link

ukemi commented Sep 30, 2019

If the ratio stays the same, it only looks like there are ~15 pathways with multiple enablers that will require new sets. I predict all the others are black-box reactions.

@deustp01
Copy link
Collaborator

Yes, David's predictions seem plausible. I will look carefully.

@deustp01
Copy link
Collaborator

Out of a sample of 650 converted pathways, only 51 now fail the shex validation test. (So about 92% success rate)
@goodb could you send me the list of 51 so I can check?

@ukemi
Copy link

ukemi commented Sep 30, 2019

@goodb. Welcome to the world of the in-the-weeds curators! :)

@goodb
Copy link
Contributor Author

goodb commented Sep 30, 2019

Examples in your emails.

@ukemi
Copy link

ukemi commented Sep 30, 2019

Thanks @goodb. @deustp01 do you want to look yourself, split them up or look together?

@deustp01
Copy link
Collaborator

deustp01 commented Sep 30, 2019

I've done the first seven from the list that Ben added to this ticket earlier today.

Two conclusions. First, in all cases the problem is that we have asserted that a protein with no UniProt ID mediates a function defined by a proper GO molecular function term. Our goal is to associate functions with gene products, so this kind of annotation is rare - we should do it only when a reaction is known to happen even though the protein that mediates it has not been characterized enough to allow it to be identified, and we need the reaction to let us connect all the parts of a process together. We in fact have a separate class of physical entity, genomeEncodedEntity, used to annotate such unidentified proteins. So could this class distinction be made into a test to tell the ShEX QA that it's looking at an instance with missing information rather than an incorrect one?

EntityWithAccessionedSequence - the class that holds proteins that do have UniProt IDs - is a child of genomeEncodedEntity. We strictly enforce the rule that an EWAS must have a UniProt attribute, so any member of the GEE class that lacks a UniProt identifier must be a GEE.

Second conclusion / question. As you'll see from the details below, sometimes it looks like a catalystActivity instance is the target of the ShEX xomplaint. This is what I'd expect because this is the class that combines a specific entity with a specific molecular function so this is where unexpected kinds of physical entities should cause problems. But in some cases, reactions or whole pathways seem to have gotten flagged. This is not a big practical problem - as you've already noticed drilling down to the problem entity is easy. I'm just surprized by the irregulasr granularity.

Now the weeds.

  1. Asymmetric localization of PCP proteins----Clearly the unknown kinase should be an enzyme/gene product or complex
    Yes. In the same way that there is a reactionlike event class that has specific types of reactions as children (so GO-CAM could point to the parent when the exact reaction type is unknown or has no associated GO molecular function term, there is a genome-encoded entity class which has entities with accessioned sequences (EWAS) (i.e., canonical forms of proteins with UniProt IDs) as its major child class. However, it is also legal to create instances of the parent genome-encoded entity class when there is evidence that a protein (alone or as part of a complex) does something but not enough evidence to identify the protein uniquely, as here. How to detect this circumstance? All members of the EWAS class must have associated UniProt IDs – this is rigidly enforced – so any annotation of a macromolecule with no identifier must be an instance of the genome-encoded entity class.
    Something I don’t understand about this instance is why the pathway is flagged while in problem cases 2 and 3 it’s the individual reaction. It looks like the ShEX issue is at the catalystActivity level in all cases.

  2. CS DS degradation.ttl - "beta-xylosidase" R-HSA-2247521----is a Genes and Transcripts [GenomeEncodedEntity] this should certainly be an information biomacromolecule.
    Yes, exactly like the kinase in example 1.

  3. Metabolism of ingested H2SeO4 and H2SeO3 into H2Se - R-HSA-5359054----is a Genes and Transcripts [GenomeEncodedEntity] (PAPSe reductase) this should certainly be an information biomacromolecule.
    Yes, exactly like the kinase in example 1.

  4. Glyoxylate metabolism and glycine degradation - R-HSA-6784434----- This is a black-box reaction. Is it safe to assume it is an information biomacromolecule? peroxisomal glyoxylate carrier [peroxisomal membrane]
    Yes, exactly like the kinase in example 1. Another granularity oddity. In examples 2 and 3 it looks like the catalystActivity instance (the class that juxtaposes an entity and a molecular function to assert the entity enables the function) appears to be what was flagged by the ShEX script, and that’s right. In example 1, a whole pathway involving multiple binding reactions and one catalyzed reaction got flagged. Here (example 4) the whole reaction got flagged.

  5. Signal transduction by L1 - Phosphorylation of VAV2-----Another black box unknown kinase. Safe to assume it is a biomacromolecule.
    Yes, exactly like the kinase in example 1.

  6. Aflatoxin activation and detoxification - Unknown NAT transfers COCH3 to AFXBO-C, AFNBO-C--------Another unknown NAT [cytosol] black box. Safe to assume it is a biomacromolecule?
    Yes, exactly like the kinase in example 1.

  7. Vitamin B1 (thiamin) metabolism - TDPK--------Looks like another enzyme where we know it exists, but don't know what it is. Should be a biomacromolecule.
    Yes, exactly like the kinase in example 1.

@deustp01
Copy link
Collaborator

Thanks @goodb. @deustp01 do you want to look yourself, split them up or look together?

I'll start by myself. If the pattern of the first seven holds up, we may already have all the information there is to get.

@goodb
Copy link
Contributor Author

goodb commented Sep 30, 2019

Apologies @deustp01 I must always remember to be more precise when sending spreadsheets to you.. The notes column is not a specific indication of what aspect of the model failed the shex, it was just a where I was jotting things down as I looked through them so that I had a trail to find my way back.

All of the untyped gene product node cases are molecular function nodes failing against the rule that they can only be enabled by proteins or complexes. Those failures have cascading effects on other nodes connected to the problem case by causal relations.

When I look at the BioPAX for these pathways, I don't see any signs of the genomeEncodedEntity class. Unless there is an accession number, all I see in there is that it is a Physical Entity.

@deustp01
Copy link
Collaborator

I'll ask Guanming about how / whether kinds of macromolecule physical entities are distinguished in our BioPax export. (No problem about the granularity

@goodb
Copy link
Contributor Author

goodb commented Sep 30, 2019

If you are looking at the models on noctua-dev we have a new toy for you. When examining a model that you think may be invalid, turn on the reasoner and look for the red boxes. Red boxes indicate nodes that are failing validation. Clicking on the nodes will give you a geeky json representation of what went wrong. (it will be blank if its the 2 enablers problem - known bug). Try it on this one: http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-4608870
The reaction in the middle is the culprit. Notice how it also invalidates the reactions with causal relations leading into it.

Screen Shot 2019-09-30 at 1 52 33 PM
Screen Shot 2019-09-30 at 1 52 42 PM

@goodb
Copy link
Contributor Author

goodb commented Oct 1, 2019

@deustp01 here is what I believe is a complete list of pathways in Reactome release 69 that contain reactions that are enabled by more than one entity.

reactome_r69_multi_enabler_reactions.txt

@deustp01
Copy link
Collaborator

deustp01 commented Oct 1, 2019

I'll ask Guanming about how / whether kinds of macromolecule physical entities are distinguished in our BioPax export.

And the answer is, "For GEE, we hacked BioPAX a little bit by converting a GEE into a PhysicalEntity directly. This is the best approach I could find.The same is used for any Reactome PhysicalEntity instances that don’t have ReferenceEntities assigned."

So I guess for Reactome to GO-CAM conversion via BioPAX, these instances can be found by testing each incoming monomer physicalEntity to determine whether it has a UniProt or ChEBI reference. If not, then it is an empty placeholder, to be handled (? or not?) by the physical entity equivalent of the reactionLikeEvent placeholder for transformations that are out of scope for GO like dissociations.

@goodb
Copy link
Contributor Author

goodb commented Oct 1, 2019

If they have references, they are typed correctly in the BioPAX and the conversion, so that is handled.

Its a shame BioPax lumps SmallMolecule in with the biological macromolecules as direct children of physical entity. Otherwise we could equate physical entity with information biomacromolecule CHEBI:33695 which I think is the placeholder we are looking for.

I suppose we could hack the biopax again with a comment for the genome encoded entities, but in this case it seems that it is probably not worthwhile to do so as the overall information gained is low.

Could we use the information that these entities are catalyzing reactions to infer that they are biomacromolecules ? Are we safe to assume that Reactome would not record an untyped small molecule catalyzing a reaction?

@ukemi
Copy link

ukemi commented Oct 1, 2019

It's a shame that we can't just equateor create a relationship between a GEE and a biomacromolecule. Would it be safe to do this in the BioPax itself?

@deustp01
Copy link
Collaborator

deustp01 commented Oct 1, 2019

Could we use the information that these entities are catalyzing reactions to infer that they are biomacromolecules ? Are we safe to assume that Reactome would not record an untyped small molecule catalyzing a reaction?

The answer is almost always, yes. Here is the one exception I know of: we have overloaded the concept of catalysis to assert that a photon mediates the activation of rhodopsin.

If abuses like this are really rare (I think they are), then they could be eliminated on the Reactome side by re-annotating this reaction and any other where we have assigned an enzyme role to a small molecule. Even if small molecules do catalyze reactions in the body (I think that's really rare), Reactome and GO both want to capture the functions of gene products so such a catalysis is out of scope for both. If either Reactome or GO needs the reaction for connectivity in a process, we can find another way to annotate it.

@goodb
Copy link
Contributor Author

goodb commented Oct 1, 2019

I don't see a way to do it without going beyond what is in the standard. When we extended it to handle active units, we did so in comments.. Not really ideal.

@goodb
Copy link
Contributor Author

goodb commented Jan 27, 2020

One thing @deustp01 's comment suggests to me is that it its going to be increasingly important to think about annotating at the gene family level rather than the individual gene level when that is the level of evidence. FWIW..

Here are the only entities that are currently completely unclassified and thus schema breakers:
R-HBV-8982481 viral double-stranded RNA
R-HCV-8982462 viral double-stranded RNA
R-HSA-5357857 p-T357,S358-MLKL oligomer
R-MMU-9670591 Igkc
R-HSA-446069 Keratin 5/14
R-HSA-9660029 Poly-vimentin
R-HSA-9660015 Poly-vimentin
R-HSA-9657899 Poly-vimentin
R-HIV-173644 Multimeric capsid coat
R-HIV-175338 Multimeric matrix layer
R-HSA-2514779 Cleaved fibrillin-2
R-CPE-9661737 UBGNR
R-CPE-9661715 BILR
R-HSA-9661701 UBGNO
R-HSA-9663470 BGET
R-ALL-9638756 RNA(n+1)

@deustp01
Copy link
Collaborator

One thing @deustp01 's comment suggests to me is that it its going to be increasingly important to think about annotating at the gene family level rather than the individual gene level when that is the level of evidence. FWIW..

Immediate practical patch question - I haven't looked up any of these instances yet to see what's going on - some look like things I should have fixed in the xref cleanup late in the fall. Would adding a ChEBI:protein (or ChEBI:RNA) xref to each of these fix the schema-breaking problem?

Longer-term (version 2) technical issue: Reactome has a class "polymer" that up to now we have been able to evade in exports to GO-CAM. Poly-vimentin is an example. For version 1, I can xref it to ChEBI:protein, and we can discuss later how to handle a polymer instance whose monomers are a UniProt protein.

Ben's point about classes. As an editorial issue, we don't handle classes of macromolecules (e.g., "collagens", or "kinases") well because our goal is to annotate individual gene products. Our work-around to date has been the creation of sets, with defined and candidate set types as a way of handling groups with more or less evidence to support the grouping. The PRO ontology structure provides a way to identify protein families and to associate specific proteins with those families that may be a better approach to this problem. A version 2 discussion item, I think.

@goodb
Copy link
Contributor Author

goodb commented Jan 27, 2020

Yes, adding the CHEBI xrefs would solve the schema problem right now.

@deustp01
Copy link
Collaborator

deustp01 commented Jan 31, 2020

Here are the only entities that are currently completely unclassified and thus schema breakers:

All now fixed with ChEBI cross-references:
R-HBV-8982481 viral double-stranded RNA ChEBI 67208 dsRNA
R-HCV-8982462 viral double-stranded RNA ChEBI 67208 dsRNA
R-HSA-5357857 p-T357,S358-MLKL oligomer ChEBI 36080 protein
R-MMU-9670591 Igkc curation mistake
R-HSA-446069 Keratin 5/14 ChEBI 36080 protein
R-HSA-9660029 Poly-vimentin ChEBI 36080 protein
R-HSA-9660015 Poly-vimentin ChEBI 36080 protein
R-HSA-9657899 Poly-vimentin ChEBI 36080 protein
R-HIV-9657899 Multimeric capsid coat wrong ID in GO-CAM?
R-HIV-173644 Multimeric capsid coat ChEBI 36080 protein
R-HIV-175338 Multimeric matrix layer ChEBI 36080 protein
R-HSA-2514779 Cleaved fibrillin-2 ChEBI 36080 protein
R-CPE-9661737 UBGNR ChEBI 36080 protein
R-CPE-9661715 BILR ChEBI 36080 protein
R-HSA-9661701 UBGNO ChEBI 36080 protein
R-HSA-9663470 BGET ChEBI 36080 protein
R-ALL-9638756 RNA(n+1) ChEBI 83400 RNA polyanion

@goodb
Copy link
Contributor Author

goodb commented Jan 31, 2020

Wish I could rerun now, but will either have to wait for this to percolate through the release or ask for another custom export from the curator site (which I hesitate to do).

@deustp01
Copy link
Collaborator

Let me see whether I can make any progress on the list of multiple enablers before we decide what to do here.

@deustp01
Copy link
Collaborator

@goodb Here's an interesting multiple-enablers case:
Activated JNKs phosphorylate c-JUN R-HSA-168136

We're asserting here that each of the three members of a set of proteins is capable by itself of enabling the molecular transformation annotated in the reaction. To do that, we've made the set the physical entity attribute of the reaction's catalystActivity, and made each of the set members an activeUnit attribute of that catalystActivity. If we can do this annotation consistently, could it be parsed correctly at the GO-CAM end, i.e., if the physical entity associated with catalysis is a set and the active units associated with that catalysis are all the members of the set, can this be interpreted to mean that set member 1 enables the annotated GO molecular function, set member 1 enables the annotated GO molecular function, ..., set member n enables the annotated GO molecular function?

R-HSA-168162 is another example of this.

@goodb
Copy link
Contributor Author

goodb commented Jan 31, 2020

@deustp01 erm.. This seems like an inappropriate use of the active-unit slot in your schema. That should really be used to specify which member of a complex is doing the business. In this case, there is no complex, and every member of the Set performs the action directly. If there is some difference between the members in terms of what they are doing there, then they shouldn't be in a Set object.

I would suggest that what you should do is just remove the active unit declaration here. Given the interpretation that a set is basically a shorthand way of saying the same things about each of its components, we would end up saying that each protein in that Set performs the catalysis. Without the active unit declaration, the Set would be processed as you say: MAPK10 enables JUN kinase activity, MAPK9 enables JUN kinase activity etc.

@deustp01
Copy link
Collaborator

@goodb That's better - if the set notation itself forces the correct parsing, I'll remove the redundant / inappropriate activeUnit annotations.

@goodb
Copy link
Contributor Author

goodb commented Jan 31, 2020

If it doesn't, then its my problem! Either way you are done ;).

@deustp01
Copy link
Collaborator

deustp01 commented Feb 3, 2020

@goodb - cleanup of events with multiple active units. If I want to assert that either of two members of a protein family M1 and M2 is capable of joining a given complex with some other proteins A, B, and C and functioning as the active unit of a molecular function enabled by the complex, will it work to make a set [M1 or M2], annotate a complex whose components are A and B and C and [M1 or M2], and then annotate the set [M1 or M2] as the active unit?

If that will work, I will use that strategy to patch at least one reaction so far. If that strategy will not work or may come with other risks or difficulties (it looks close to the request you disallowed on Friday), I will simply annotate the complex as mediating the molecular function without specifying an active unit, and leave the issue of capturing the lost bit of information for version 2.

@goodb
Copy link
Contributor Author

goodb commented Feb 3, 2020

@deustp01 I think that will work, please go for it.

@deustp01
Copy link
Collaborator

deustp01 commented Feb 8, 2020

Here is a list of all of the reactions with the multiple enabler problem (based on curator-version BioPAX from Jan 8, 2020):
Multiple Enablers.txt

I have gone through the list and resolved most of them, mostly by findng a reason to delete all or all but one of the entities listed as an active unit. In a few cases I've had to refer an instance to a curator for action because I was uncomfortable unilaterally resolving it. The Excel file is Ben's list with green color indicating fully resolved (I hope) instances and yellow indicating pending ones. The number in the last column refers to entries in the Word document attached here that give the rationales for what I did in each case. I will now try to turn those notes into a better organized proposal for how to use the active unit slot in Reactome catalyst activity instances for discussion both among ourselves and with Reactome curators.
events_with_mutiple_enablers_20200127.xlsx
events_with_mutiple_enablers_20200131.docx

@goodb
Copy link
Contributor Author

goodb commented Feb 17, 2020

@deustp01 it looks like we are pretty close here. When do you think I will have access to a BioPAX export that contains all of these updates?

@ukemi
Copy link

ukemi commented Feb 17, 2020

@deustp01, when I look at the annotations coming from the reactome gaf in AmiGO2, I only see NSF1 annotated to GO:0031071. NSF1, FXN, ISD1 (LYRM4) and ISCU are annotated to mitochondrial matrix from this pathway. Do you know how NSF1 is chosen from the R-HSA-1362401 complex? Am I looking at the pathway correctly? It seems to me that the complex has the catalyst activity and somehow in the gaf you have identified NSF1 catalytic subunit.

@deustp01
Copy link
Collaborator

deustp01 commented Feb 17, 2020

@ukemi The released version of this reaction has two catalystActivities, which I think causes the multiple-enablers issue. It has now been re-annotated to give it one catalyst activity (2Iron:FXN:NFS1:ISD11:ISCU [mitochondrial matrix] complex, with activeUnit PXLP-K258-NFS1-1 [mitochondrial matrix] (NFS1 protein covalently modified with pyridoxal phosphate), mediates GO:0031071 "cysteine desulfurase activity"), positively regulated by the physical entity that was previously annotated as the second catalyst. Bruce May, who curated the original event and the patch says it is a pretty good fit to the biology.

That activeUnit annotation for NSF1 is already present in the public two-catalyst version of the event and is correct. The fatal problem for GO-CAM is the second catalystActivity, now removed.

You can see the revised event on our internal site here. It will propagate into our public database, GAF, and BioPax export naturally with our next release in mid-March and should be available as BioPax sooner (@goodb) depending on Guanming's workload. I will ask about that.

@ukemi
Copy link

ukemi commented Feb 17, 2020

Got it! Thanks @deustp01. I'm still curious as the the origin of the conventional annotation of NFS1 to cysteine desulfurase activity (GO:0031071) and none of the other members of the complex though. This is off topic for this ticket, but I am curious.

@goodb
Copy link
Contributor Author

goodb commented Apr 18, 2020

Since the March release, recent updates to the conversion code and to the shex schema, we are now down to 8 models that don't validate. All models pass OWL consistency check.

5 of them fail because they have molecular function nodes with more than one enabling entity: http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-6782135
http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-6807505
http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-4615885
http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-6785470
http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-8868766

3 of them fail because they have a Photon as the enabler of a molecular function:
http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-2187335
http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-2485179
http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-2453902

For the latter, the march release seems to have Photon classified as a chemical entity (resulting in the shex violation as chemical entities can't be enablers). However, this seems to have been changed to the real entry for Photon in CHEBI https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:30212 which will result in it being typed as a subatomic particle - and likely continue to fail the validation for the same reason.
 
(note that current (April 18, 2020) noctua-dev has an outdated reacto driving it that is making photons into proteins and thus not showing the shex fail).

@ukemi
Copy link

ukemi commented Apr 20, 2020

I've been in discussion with @deustp01 about the photons. Let's bring it up on the next call, April 29. As for the multiple enablers, we will have to have a closer look a these. Is it because a complex has more than one catalytic subunit? Considering the number of models we have, this really isn't that bad!

@deustp01
Copy link
Collaborator

I'll look today - multiple-enablers should be fix-able on the Reactome side. Photons - as far as the Reactome data model is concerned, photons can be inputs and in fact should generally (maybe always) not be enablers. (Even input has a problem with materiality, but there's no need to compound it by treating them as catalysts where GO-CAM expects genome-encoded informational macromolecules.

@deustp01
Copy link
Collaborator

OK. The three enabling photons have been converted from catalysts to input physical entities.
Events with multiple enablers - for local data model reasons, four of the five here may be hard to fix. One looks like a curation mistake that may be quite hard to track down and fix.
So, @goodb , would it be possible for the GO-CAM building process to check for flawed enablers (the only surviving flaw should be multiple ones, but new curation could create new kinds), and discard the flawed enabler attribute, note its identifier so we can check it on our side, and build a GO-CAM with what's left? If that check + complain approach is easy to implement, that looks like it would be a fairly sturdy fix.

@goodb
Copy link
Contributor Author

goodb commented Apr 20, 2020

@deustp01 I think its possible, but not sure on the overall strategy. It seems like a deeper modification of the incoming model than we have done so far. An alternative is simply to report any disagreement with the schema (including this enabler problem and many other potential conflicts that could arise) and the reason, and stop with that. I kind of like that approach as its generally applicable.

Following down that road, we might imagine an option on the converter to delete any relations violating the schema. That would only be important if we had software that cared deeply about the schema and an intention to use these models in that software. I'm not sure that either of those conditions apply right now.

@deustp01
Copy link
Collaborator

If I understand right, the number of remaining problem cases is small and most of the ones we know about require fixes on the Reactome side that are hard to do. We can also expect that new content added to Reactome will come with occasional items that violate GO-CAM rules. I was imagining the remaining problems were a major block to using the models in current applications, and if that's not true then the reports you describe are all we need now.

@ukemi
Copy link

ukemi commented Apr 21, 2020

"Hard to do" says to me "Maybe not wrong" and we might want to have a look at our rules. But you are correct. it's a small number of cases.

@deustp01
Copy link
Collaborator

"Hard to do" is one of those context-dependent terms. Of the five non-photon cases that Ben flagged on April 18 above, one was a curation mistake (the active unit of a catalytic complex was not a component of that complex - now fixed; visible in June). The other four were black box events each of which describe multiple catalytic steps with known catalysts that occur in unknown order, so multiple enablers are crammed into a single event. This tactic preserves all the gene product - molecular function associations but breaks our logic (harmlessly) and GO-CAM's (not so harmlessly). Even if it were possible, it seems unlikely that a rule change on the GO / GO-CAM side is good here. On our side, we can go back to the biology and look for ways to salvage as much information as possible in a rule-compliant way.

The photon cases look like a nice example of "maybe not wrong" - as long as we don't call them (or anything else that's not a gene product) enablers, but use them as inputs, outputs, and the non-gene-product physical entities that mediate regulation instances*, the remaining bit of Reactome - GO disconnect can be ignored safely.
(*If a small molecule like fructose-2,6-bisphosphate can regulate without breaking anything, then perhaps a photon could do it as well.)

@ukemi
Copy link

ukemi commented Apr 21, 2020

As we move forward though, we might want to devise a way to get annotations from those black boxes.

@deustp01
Copy link
Collaborator

Definitely - part of the larger discussion of aligning data models better for version 2.

@goodb
Copy link
Contributor Author

goodb commented May 29, 2020

From the conversion code side, I believe we are not planning to make any more changes. There will be future changes on the reactome side, but no need for this issue as these will be ongoing with each release I imagine.

@goodb goodb closed this as completed May 29, 2020
DONE 2020-05 (Paris) Pathways2GO Version 1.0 automation moved this from In progress to Done May 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
knowledge representation work here is for modelers and ontologists
Projects
No open projects
Development

No branches or pull requests

4 participants