-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure reactome go-cams validate against main GO shex schema #69
Comments
Fingers crossed! |
Hit two issues so far.
Here is an example of the latter problem, which is very common. |
@vanaukenk in relation to geneontology/go-shapes#160 I am running all of the current conversions through the validator. It looks like about 25% of them do not pass as things stand today. I will catalogue failures here as I diagnose them. Example 1 http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-8852276 Same problem in http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-168638 Another similar example: http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-8941333 contains positive_regulates and causally_upstream of links from functions to untyped processes. |
Another problem case. http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-1482788 In this case the Set is a mix of proteins and one complex. https://reactome.org/content/detail/R-HSA-1524112 maybe goes as a curation problem for Reactome - @deustp01 ? |
#69 (comment) In this version, if a set contains proteins and complexes (but not anything else) it is typed as an informationbiomacromolecule .
The conversion sometimes results in functions with more than one enabler, e.g. "PH receptors autophosphorylate" in http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-2682334 Same for 'Phosphorylation of IRF-3/IRF7 and their release from the activated TLR3 complex' in http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-9013973 |
Interesting again! We are faithfully representing the info. In the first case, the function is enabled by EPHRINAs or EPHRINBs. Biologically this is correct. Any set will do this, but the restirction is hidden because the set is a single entity. In the second example, it is also correct. Either entity can enable the kinase activity. (I'm not going to go down the path as to whether this actually represents a process of phosphorylation that is made up of two molecular functions. :) ) However, I thought it was against Reactome curation policy to allow more than one active unit in a reaction. @deustp01? Would it have been more consistent to create a set in these cases: p-S172-TBK1 [plasma membrane] and p-S172-IKBKE [plasma membrane]? I was actually hoping we could go the route of more than one active subunit to solve the BMP receptor complex issue. If we could enumerate all of the catalytic subunits of the complex when we ditch the contributes_to rule, we could get more info. This skirts the whole issue of 'emergent functions', but do our users really care about that? When they ask for a list of genes that can carry out a function, don't they want all the members of a catalytic complex? @vanaukenk Complexes Complexes Complexes. Another alternative would be to propose that Reactome (and GO curators) somehow capture this info. Then how do we serve it up? |
@ukemi @deustp01 it looks like recently introduced changes to the conversion have cured most of the problems above. Out of a sample of 650 converted pathways, only 51 now fail the shex validation test. (So about 92% success rate). I checked 10 of the failures. Of these 10, 3 failures were caused by having multiple enablers on a reaction node and untyped gene product nodes cause 7. Examples of reactions with untyped gene product nodes enabling reactions include: When the converter encounters a node without any information beyond "physical entity", as in the cases above, it types it as a chebi 'chemical entity' (which is the root of all physical entities in our system). The shex schema does not like functions to be enabled by anything aside from gene products or complexes. Thoughts on this problem? I could infer that if an untyped node is enabling a reaction, it must be an information macromolecule and make that assignment when the models are created. Would that be okay? |
Interesting, Peter? It might be a better strategy to type these in Reactome. Some look like black-box reactions.
|
PS. would like Peter to comment too. |
If the ratio stays the same, it only looks like there are ~15 pathways with multiple enablers that will require new sets. I predict all the others are black-box reactions. |
Yes, David's predictions seem plausible. I will look carefully. |
|
@goodb. Welcome to the world of the in-the-weeds curators! :) |
Examples in your emails. |
I've done the first seven from the list that Ben added to this ticket earlier today. Two conclusions. First, in all cases the problem is that we have asserted that a protein with no UniProt ID mediates a function defined by a proper GO molecular function term. Our goal is to associate functions with gene products, so this kind of annotation is rare - we should do it only when a reaction is known to happen even though the protein that mediates it has not been characterized enough to allow it to be identified, and we need the reaction to let us connect all the parts of a process together. We in fact have a separate class of physical entity, genomeEncodedEntity, used to annotate such unidentified proteins. So could this class distinction be made into a test to tell the ShEX QA that it's looking at an instance with missing information rather than an incorrect one? EntityWithAccessionedSequence - the class that holds proteins that do have UniProt IDs - is a child of genomeEncodedEntity. We strictly enforce the rule that an EWAS must have a UniProt attribute, so any member of the GEE class that lacks a UniProt identifier must be a GEE. Second conclusion / question. As you'll see from the details below, sometimes it looks like a catalystActivity instance is the target of the ShEX xomplaint. This is what I'd expect because this is the class that combines a specific entity with a specific molecular function so this is where unexpected kinds of physical entities should cause problems. But in some cases, reactions or whole pathways seem to have gotten flagged. This is not a big practical problem - as you've already noticed drilling down to the problem entity is easy. I'm just surprized by the irregulasr granularity. Now the weeds.
|
Apologies @deustp01 I must always remember to be more precise when sending spreadsheets to you.. The notes column is not a specific indication of what aspect of the model failed the shex, it was just a where I was jotting things down as I looked through them so that I had a trail to find my way back. All of the untyped gene product node cases are molecular function nodes failing against the rule that they can only be enabled by proteins or complexes. Those failures have cascading effects on other nodes connected to the problem case by causal relations. When I look at the BioPAX for these pathways, I don't see any signs of the genomeEncodedEntity class. Unless there is an accession number, all I see in there is that it is a Physical Entity. |
I'll ask Guanming about how / whether kinds of macromolecule physical entities are distinguished in our BioPax export. (No problem about the granularity |
If you are looking at the models on noctua-dev we have a new toy for you. When examining a model that you think may be invalid, turn on the reasoner and look for the red boxes. Red boxes indicate nodes that are failing validation. Clicking on the nodes will give you a geeky json representation of what went wrong. (it will be blank if its the 2 enablers problem - known bug). Try it on this one: http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-4608870 |
@deustp01 here is what I believe is a complete list of pathways in Reactome release 69 that contain reactions that are enabled by more than one entity. |
And the answer is, "For GEE, we hacked BioPAX a little bit by converting a GEE into a PhysicalEntity directly. This is the best approach I could find.The same is used for any Reactome PhysicalEntity instances that don’t have ReferenceEntities assigned." So I guess for Reactome to GO-CAM conversion via BioPAX, these instances can be found by testing each incoming monomer physicalEntity to determine whether it has a UniProt or ChEBI reference. If not, then it is an empty placeholder, to be handled (? or not?) by the physical entity equivalent of the reactionLikeEvent placeholder for transformations that are out of scope for GO like dissociations. |
If they have references, they are typed correctly in the BioPAX and the conversion, so that is handled. Its a shame BioPax lumps SmallMolecule in with the biological macromolecules as direct children of physical entity. Otherwise we could equate physical entity with information biomacromolecule CHEBI:33695 which I think is the placeholder we are looking for. I suppose we could hack the biopax again with a comment for the genome encoded entities, but in this case it seems that it is probably not worthwhile to do so as the overall information gained is low. Could we use the information that these entities are catalyzing reactions to infer that they are biomacromolecules ? Are we safe to assume that Reactome would not record an untyped small molecule catalyzing a reaction? |
It's a shame that we can't just equateor create a relationship between a GEE and a biomacromolecule. Would it be safe to do this in the BioPax itself? |
The answer is almost always, yes. Here is the one exception I know of: we have overloaded the concept of catalysis to assert that a photon mediates the activation of rhodopsin. If abuses like this are really rare (I think they are), then they could be eliminated on the Reactome side by re-annotating this reaction and any other where we have assigned an enzyme role to a small molecule. Even if small molecules do catalyze reactions in the body (I think that's really rare), Reactome and GO both want to capture the functions of gene products so such a catalysis is out of scope for both. If either Reactome or GO needs the reaction for connectivity in a process, we can find another way to annotate it. |
I don't see a way to do it without going beyond what is in the standard. When we extended it to handle active units, we did so in comments.. Not really ideal. |
One thing @deustp01 's comment suggests to me is that it its going to be increasingly important to think about annotating at the gene family level rather than the individual gene level when that is the level of evidence. FWIW.. Here are the only entities that are currently completely unclassified and thus schema breakers: |
Immediate practical patch question - I haven't looked up any of these instances yet to see what's going on - some look like things I should have fixed in the xref cleanup late in the fall. Would adding a ChEBI:protein (or ChEBI:RNA) xref to each of these fix the schema-breaking problem? Longer-term (version 2) technical issue: Reactome has a class "polymer" that up to now we have been able to evade in exports to GO-CAM. Poly-vimentin is an example. For version 1, I can xref it to ChEBI:protein, and we can discuss later how to handle a polymer instance whose monomers are a UniProt protein. Ben's point about classes. As an editorial issue, we don't handle classes of macromolecules (e.g., "collagens", or "kinases") well because our goal is to annotate individual gene products. Our work-around to date has been the creation of sets, with defined and candidate set types as a way of handling groups with more or less evidence to support the grouping. The PRO ontology structure provides a way to identify protein families and to associate specific proteins with those families that may be a better approach to this problem. A version 2 discussion item, I think. |
Yes, adding the CHEBI xrefs would solve the schema problem right now. |
All now fixed with ChEBI cross-references: |
Wish I could rerun now, but will either have to wait for this to percolate through the release or ask for another custom export from the curator site (which I hesitate to do). |
Let me see whether I can make any progress on the list of multiple enablers before we decide what to do here. |
@goodb Here's an interesting multiple-enablers case: We're asserting here that each of the three members of a set of proteins is capable by itself of enabling the molecular transformation annotated in the reaction. To do that, we've made the set the physical entity attribute of the reaction's catalystActivity, and made each of the set members an activeUnit attribute of that catalystActivity. If we can do this annotation consistently, could it be parsed correctly at the GO-CAM end, i.e., if the physical entity associated with catalysis is a set and the active units associated with that catalysis are all the members of the set, can this be interpreted to mean that set member 1 enables the annotated GO molecular function, set member 1 enables the annotated GO molecular function, ..., set member n enables the annotated GO molecular function? R-HSA-168162 is another example of this. |
@deustp01 erm.. This seems like an inappropriate use of the active-unit slot in your schema. That should really be used to specify which member of a complex is doing the business. In this case, there is no complex, and every member of the Set performs the action directly. If there is some difference between the members in terms of what they are doing there, then they shouldn't be in a Set object. I would suggest that what you should do is just remove the active unit declaration here. Given the interpretation that a set is basically a shorthand way of saying the same things about each of its components, we would end up saying that each protein in that Set performs the catalysis. Without the active unit declaration, the Set would be processed as you say: MAPK10 enables JUN kinase activity, MAPK9 enables JUN kinase activity etc. |
@goodb That's better - if the set notation itself forces the correct parsing, I'll remove the redundant / inappropriate activeUnit annotations. |
If it doesn't, then its my problem! Either way you are done ;). |
@goodb - cleanup of events with multiple active units. If I want to assert that either of two members of a protein family M1 and M2 is capable of joining a given complex with some other proteins A, B, and C and functioning as the active unit of a molecular function enabled by the complex, will it work to make a set [M1 or M2], annotate a complex whose components are A and B and C and [M1 or M2], and then annotate the set [M1 or M2] as the active unit? If that will work, I will use that strategy to patch at least one reaction so far. If that strategy will not work or may come with other risks or difficulties (it looks close to the request you disallowed on Friday), I will simply annotate the complex as mediating the molecular function without specifying an active unit, and leave the issue of capturing the lost bit of information for version 2. |
@deustp01 I think that will work, please go for it. |
I have gone through the list and resolved most of them, mostly by findng a reason to delete all or all but one of the entities listed as an active unit. In a few cases I've had to refer an instance to a curator for action because I was uncomfortable unilaterally resolving it. The Excel file is Ben's list with green color indicating fully resolved (I hope) instances and yellow indicating pending ones. The number in the last column refers to entries in the Word document attached here that give the rationales for what I did in each case. I will now try to turn those notes into a better organized proposal for how to use the active unit slot in Reactome catalyst activity instances for discussion both among ourselves and with Reactome curators. |
@deustp01 it looks like we are pretty close here. When do you think I will have access to a BioPAX export that contains all of these updates? |
@deustp01, when I look at the annotations coming from the reactome gaf in AmiGO2, I only see NSF1 annotated to GO:0031071. NSF1, FXN, ISD1 (LYRM4) and ISCU are annotated to mitochondrial matrix from this pathway. Do you know how NSF1 is chosen from the R-HSA-1362401 complex? Am I looking at the pathway correctly? It seems to me that the complex has the catalyst activity and somehow in the gaf you have identified NSF1 catalytic subunit. |
@ukemi The released version of this reaction has two catalystActivities, which I think causes the multiple-enablers issue. It has now been re-annotated to give it one catalyst activity (2Iron:FXN:NFS1:ISD11:ISCU [mitochondrial matrix] complex, with activeUnit PXLP-K258-NFS1-1 [mitochondrial matrix] (NFS1 protein covalently modified with pyridoxal phosphate), mediates GO:0031071 "cysteine desulfurase activity"), positively regulated by the physical entity that was previously annotated as the second catalyst. Bruce May, who curated the original event and the patch says it is a pretty good fit to the biology. That activeUnit annotation for NSF1 is already present in the public two-catalyst version of the event and is correct. The fatal problem for GO-CAM is the second catalystActivity, now removed. You can see the revised event on our internal site here. It will propagate into our public database, GAF, and BioPax export naturally with our next release in mid-March and should be available as BioPax sooner (@goodb) depending on Guanming's workload. I will ask about that. |
Got it! Thanks @deustp01. I'm still curious as the the origin of the conventional annotation of NFS1 to cysteine desulfurase activity (GO:0031071) and none of the other members of the complex though. This is off topic for this ticket, but I am curious. |
Since the March release, recent updates to the conversion code and to the shex schema, we are now down to 8 models that don't validate. All models pass OWL consistency check. 5 of them fail because they have molecular function nodes with more than one enabling entity: http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-6782135 3 of them fail because they have a Photon as the enabler of a molecular function: For the latter, the march release seems to have Photon classified as a chemical entity (resulting in the shex violation as chemical entities can't be enablers). However, this seems to have been changed to the real entry for Photon in CHEBI https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:30212 which will result in it being typed as a subatomic particle - and likely continue to fail the validation for the same reason. |
I've been in discussion with @deustp01 about the photons. Let's bring it up on the next call, April 29. As for the multiple enablers, we will have to have a closer look a these. Is it because a complex has more than one catalytic subunit? Considering the number of models we have, this really isn't that bad! |
I'll look today - multiple-enablers should be fix-able on the Reactome side. Photons - as far as the Reactome data model is concerned, photons can be inputs and in fact should generally (maybe always) not be enablers. (Even input has a problem with materiality, but there's no need to compound it by treating them as catalysts where GO-CAM expects genome-encoded informational macromolecules. |
OK. The three enabling photons have been converted from catalysts to input physical entities. |
@deustp01 I think its possible, but not sure on the overall strategy. It seems like a deeper modification of the incoming model than we have done so far. An alternative is simply to report any disagreement with the schema (including this enabler problem and many other potential conflicts that could arise) and the reason, and stop with that. I kind of like that approach as its generally applicable. Following down that road, we might imagine an option on the converter to delete any relations violating the schema. That would only be important if we had software that cared deeply about the schema and an intention to use these models in that software. I'm not sure that either of those conditions apply right now. |
If I understand right, the number of remaining problem cases is small and most of the ones we know about require fixes on the Reactome side that are hard to do. We can also expect that new content added to Reactome will come with occasional items that violate GO-CAM rules. I was imagining the remaining problems were a major block to using the models in current applications, and if that's not true then the reports you describe are all we need now. |
"Hard to do" says to me "Maybe not wrong" and we might want to have a look at our rules. But you are correct. it's a small number of cases. |
"Hard to do" is one of those context-dependent terms. Of the five non-photon cases that Ben flagged on April 18 above, one was a curation mistake (the active unit of a catalytic complex was not a component of that complex - now fixed; visible in June). The other four were black box events each of which describe multiple catalytic steps with known catalysts that occur in unknown order, so multiple enablers are crammed into a single event. This tactic preserves all the gene product - molecular function associations but breaks our logic (harmlessly) and GO-CAM's (not so harmlessly). Even if it were possible, it seems unlikely that a rule change on the GO / GO-CAM side is good here. On our side, we can go back to the biology and look for ways to salvage as much information as possible in a rule-compliant way. The photon cases look like a nice example of "maybe not wrong" - as long as we don't call them (or anything else that's not a gene product) enablers, but use them as inputs, outputs, and the non-gene-product physical entities that mediate regulation instances*, the remaining bit of Reactome - GO disconnect can be ignored safely. |
As we move forward though, we might want to devise a way to get annotations from those black boxes. |
Definitely - part of the larger discussion of aligning data models better for version 2. |
From the conversion code side, I believe we are not planning to make any more changes. There will be future changes on the reactome side, but no need for this issue as these will be ongoing with each release I imagine. |
Revisit and rework conversion rules or shex schema to bring them into alignment. Close when everything passes. Re-open individual tickets for future failures.
The text was updated successfully, but these errors were encountered: