How to handle location information when complex function enabler is itself in multiple locations #59

Closed
goodb opened this issue Mar 27, 2019 · 39 comments


@goodb
Contributor

goodb commented Mar 27, 2019

@deustp01 I'm curious what you think of this one. For the pathway 'HATs acetylate histones', given the current conversion (not what is up on dev as of right now), the reasoner infers a logical inconsistency based on the reaction 'Elongator complex acetylates replicative histone H3, H4', its enabler the 'Elongator complex' and its location in the cytosol.

As you can see here (with the occurs_in cytosol statement removed), the Elongator complex is inferred to be a histone acetyltransferase complex based on the fact that it enables histone acetyltransferase activity.
Screen Shot 2019-03-27 at 2 27 20 PM

When the occurs_in cytosol is added back in, the model is inconsistent
Screen Shot 2019-03-27 at 2 27 55 PM

This is because the inferred 'histone acetyltransferase complex' is_a nucleoplasm part, which is_a nuclear part, and it's apparently impossible for the function to happen in the cytoplasm while being enabled by something in the nucleoplasm.
https://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0000123

Is there a problem with the conversion, or is there perhaps a problem with the Reactome model?

@deustp01
Collaborator

I don't know much about this process but I would have expected it to be purely nuclear. A curation error is possible. I will look more at the annotations, check with the curator, and let you know.

@goodb
Contributor Author

goodb commented Mar 28, 2019

It looks to me like I have a rather strange bug. I don't see why cytosol is getting asserted there; all I see is nucleoplasm in Reactome. Odd, but almost certainly my fault.

@goodb
Contributor Author

goodb commented Mar 28, 2019

Found the source in Reactome. One protein IKBKAP in the elongator complex is tagged as cytosolic.

Screen Shot 2019-03-27 at 5 04 27 PM

This still indicates a bug, but not one as crazy as I thought.

@goodb
Contributor Author

goodb commented Mar 28, 2019

This indicates an error in the implementation of the rule for asserting occurs_in in situations like this (the located_in statements shown come straight from Reactome, before they are trimmed out of the model):

Screen Shot 2019-03-27 at 5 18 51 PM

When a complex enables a function and the complex has parts in different locations, where should the function be said to occur? I guess in this case we are not supposed to add any occurs_in statement. I think right now the code is just picking one of the possibilities at random (weighted by the number of elements in a particular location).
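
A minimal sketch of the "add no occurs_in on conflict" behaviour discussed here, written in Python for illustration only; it is not the converter's code. The function name, the dictionary shape, and the ELP2/ELP3 entries are assumptions; only the cytosol tag on IKBKAP comes from this thread.

```python
from typing import Dict, Optional


def consensus_location(part_locations: Dict[str, str]) -> Optional[str]:
    """Return the one location shared by all parts, or None if they disagree."""
    distinct = set(part_locations.values())
    if len(distinct) == 1:
        return next(iter(distinct))
    # Conflicting (or missing) locations: assert no occurs_in at all,
    # instead of picking one location by weighted chance.
    return None


# Illustrative data: part names other than IKBKAP are placeholders.
elongator = {"IKBKAP": "cytosol", "ELP2": "nucleoplasm", "ELP3": "nucleoplasm"}
print(consensus_location(elongator))
```

With this rule the reaction would simply carry no location whenever the parts disagree, which is the information loss the rest of the thread goes on to discuss.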

@deustp01
Collaborator

> Found the source in Reactome. One protein IKBKAP in the elongator complex is tagged as cytosolic.

The annotation mistake is now fixed in Reactome - visible on our internal site now; will propagate to public site with June 2019 release.

@deustp01
Collaborator

deustp01 commented Mar 28, 2019

> When a complex enables a function and the complex has parts in different locations, where should the function be said to occur?

Right now, when the complex has one of its parts identified as the active unit, the location of that part is a good location for the function. In the future, when spatial relations are available, they can probably be used to provide some sanity checks on this and maybe support some reasoning?

@goodb
Contributor Author

goodb commented Mar 28, 2019

@deustp01 I assume there are some situations where a complex actually does have some members oriented in one location and others in another, e.g. transmembrane complexes. If a complex like that is said to enable a particular function in the GO-CAM world, would we simply not put any occurs_in annotations on the function node? I think that is most in keeping with the rules we established (#51 (comment)), though I find it really unsatisfying when we are dropping information that could somehow be captured.

On a separate note: although it was an accident, this seems like a nice example of how the OWL reasoning made possible by the GO-CAM conversion led to the detection of a curation error in Reactome. It's worth thinking through how we could better take advantage of this from the perspective of improving Reactome. We might actually want different conversion rules for the Reactome curation case. In the situation above, if I adapt the code as I predict I will be instructed to (dropping occurs_in in cases like these), we would end up with a consistent model for the GO but would not detect the error in Reactome. Perhaps this is something to consider formalizing in #48.

@deustp01
Collaborator

On the loss-of-information part, a report that listed the items that got dropped and the reasons could be used for Reactome QA.
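
One way such a report could be produced is sketched below in Python. This is an assumption-laden illustration, not part of the converter: the record fields, function names, the reaction identifier, and the part names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DroppedLocation:
    reaction_id: str
    entity: str
    location: str
    reason: str


def dropped_locations(reaction_id: str, part_locations: Dict[str, str]) -> List[DroppedLocation]:
    """List every part-level location that was not turned into occurs_in."""
    distinct = sorted(set(part_locations.values()))
    if len(distinct) <= 1:
        return []  # the parts agree, so nothing was dropped
    reason = "enabler parts span %d locations: %s" % (len(distinct), ", ".join(distinct))
    return [
        DroppedLocation(reaction_id, part, loc, reason)
        for part, loc in sorted(part_locations.items())
    ]


def to_tsv(rows: List[DroppedLocation]) -> str:
    """Render the report as tab-separated text with a header row."""
    header = "reaction_id\tentity\tlocation\treason"
    lines = ["\t".join((r.reaction_id, r.entity, r.location, r.reason)) for r in rows]
    return "\n".join([header] + lines)


# Hypothetical reaction ID and part names, for illustration only.
rows = dropped_locations("EXAMPLE-REACTION", {"part_a": "cytosol", "part_b": "nucleoplasm"})
print(to_tsv(rows))
```

Keeping the reason next to each dropped item is what would let a Reactome curator tell a probable annotation mistake from a genuinely multi-location complex.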

goodb pushed a commit that referenced this issue Mar 28, 2019
Committing to note this point.  Right now we can get multiple occurs_in
edges when an enabling entity is annotated in multiple locations.  Next
step is to make this not happen.  But see discussion in issue #59 about
what this loses and how we might not want to do this.
@goodb
Contributor Author

goodb commented Mar 28, 2019

@deustp01 R-HSA-420883 (Opsins act as GEFs for G alpha-t) is an example of a reaction where the catalytic complex has components in different locations, but I suspect it's not really a curation error. Opsins:photon has members in the photoreceptor disk membrane but also in the extracellular space.

Another example is R-HSA-1250498 where the catalyst EGF:p-EGFR:p-ERBB2:GRB2:SOS1 has components in the cytosol and in the plasma membrane.

When something like this is a catalyst, how should location information be shown in the GO-CAM? Right now, when there are conflicts like this, the reaction/function node does not get an occurs_in relation. Assuming this is desired, I can flag things like this for review by Reactome, but I'd like to make a specific plan about what would be useful in that regard across the whole process.

@goodb
Contributor Author

goodb commented Mar 28, 2019

One more example, because I think it is interesting, from the pathway 'tRNA modification in the nucleus and cytosol'. The catalyzing complex METTL1:WDR4 for reaction R-HSA-6782286 is inferred to be a tRNA methyltransferase complex because it catalyzes tRNA methyltransferase activity. We end up with an inconsistency because tRNA methyltransferase complexes are cytosolic parts according to the GO and the reaction is annotated to happen in the nucleoplasm.

@ukemi

ukemi commented Mar 29, 2019

Since there seems to be some concern about information loss, would it be possible to capture the location in the GO-CAM with just a part_of relationship to the cellular component? It would be similar to the manual curation process where a curator knows that a gene product is found somewhere but isn't confident enough to make the leap and say that its function occurs there.

@deustp01
Collaborator

> R-HSA-420883 (Opsins act as GEFs for G alpha-t).

As annotated, and given Reactome's current view of spatial relationships among cell components, the locations here are complicated and possibly wrong:
R-HSA-420883
We are currently asserting that photoreceptor disc membrane is a part_of cilium and Antonio has it within the cilium. That whole structure is part_of cytosol, surrounded by plasma membrane. I hope that the real geometry is that photoreceptor disc membrane is a specialized kind of plasma membrane, so a photon-activated opsin located there can interact with a nearby plasma membrane-located G protein complex, enabling it to exchange its bound GDP molecule for a cytosolic GTP molecule (possible because the G protein complex in the membrane has access to the adjacent cytosol).

Saying all of this correctly in both Reactome and GO, and reasoning in GO to check plausibility of the annotations, will require a correct set of spatial relationships implemented in both GO and Reactome. But once that is done, this case should sort itself out, I think, so trying to get it to work in GO-CAM now is probably not worthwhile.

But this will be a great use case for the Reactome talk in Cambridge, to show a hard problem that becomes straightforward to deal with through collaborative work.

@deustp01
Collaborator

> Another example is R-HSA-1250498 where the catalyst EGF:p-EGFR:p-ERBB2:GRB2:SOS1 has components in the cytosol and in the plasma membrane.

R-HSA-1250498

This one is more straightforward. We're asserting that a complex located in the plasma membrane acts on p21-RAS protein also in the plasma membrane to catalyze the exchange of bound GDP for GTP on the latter. (This may be a misuse of "catalyst activity", but that is a different issue.) Again, the exchange is possible because p21-RAS has physical access to the cytosol.

In this case, as soon as spatial relationships (and some rules about what adjacency means: does p21's location next to the cytosol mean that it has access to all cytosolic entities by default?) are sorted out, bringing this case into GO-CAM should work OK, I think.

If the protein entities here were assigned to multiple locations (e.g., the G protein complex had some parts in the membrane, some in the extracellular space, and some in the cytosol; the p21-RAS was located on the cytosolic face of the plasma membrane; and the cytosolic piece of the G protein complex was identified as the active unit for the GDP-GTP exchange), then the whole problem could be handled with current tools. The biology problem is that we generally don't have sufficiently precise location information to make such precise location annotations, so we end up simply putting things in the plasma membrane and inferring, from the fact that guanine nucleotides are polar and intracellular, that the proteins must somehow have access to them even though we don't know exactly where the proteins are sitting when they get this access. David's last comment also addresses this and sounds like a good approach.

@goodb
Contributor Author

goodb commented Mar 29, 2019

Okay, trying to synthesize here. It sounds like the most important pertinent unit of work right now is to finish the logical encoding of the spatial relationships in the GO CC. Ping @cmungall (let me know if you want me to participate there). Before that is solved, it is probably better not to try too hard to adapt models based on current location-based inferences.

Regarding David's idea. I'm personally in favor and did something very close to this in my first cut at the translation, except I used 'located_in' instead of 'part_of'. This formulation met with strong resistance and has not been a part of the conversion for a long time now, though it is still part of how the code works. I made a demo model to show how I think this would look for the elongator complex reaction:
http://noctua.geneontology.org/editor/graph/gomodel:5c4605cc00000822

Screen Shot 2019-03-29 at 9 45 16 AM

As I understand it, the current idea is a further specification of the rules for establishing occurs_in. Basically, if no occurs_in can be established because either the reaction involves entities in multiple locations or the enabling complex for the reaction itself has parts in different locations, then add in the part_of location statements for each of the physical entities involved in the reaction. From the perspective of exporting models to gpad (yuck just in general..) this would have the benefit of preserving the CC annotations for the relevant proteins that would otherwise be lost in translation.

Thoughts?

@deustp01
Collaborator

> One more example because I think it is interesting from the pathway tRNA modification in the nucleus and cytosol. The catalyzing complex METTL1:WDR4 for reaction R-HSA-6782286 is inferred to be a tRNA methyltransferase complex because it catalyzes tRNA methyltransferase activity. We end up with an inconsistency because tRNA methyltransferase complexes are cytosolic parts according to the GO and the reaction is annotated to happen in the nucleoplasm.

This is a different problem. The Reactome localization is based on experimental evidence from PMID:15861136 (supplemental Figure 3), which shows nuclear localization of an overexpressed GFP-tagged version of the protein. Consistent with this conclusion, some sort of high throughput screen in S. cerevisiae located the yeast ortholog of the protein to the nucleus (PMID:22932476). The three annotations based on direct experimental assays that use GO:0043527 tRNA methyltransferase complex, all describe the structure of the yeast complex and say nothing about its location within the cell - nuclear or cytosolic.

So here there is a contradiction between the arguably limited and poor-quality evidence to assign a location to the complex that we relied on and the location assigned by the parentage of the GO term that appears to be based on no evidence at all. @ukemi is this worth raising as a questionable-parentage ontology issue?

@ukemi

ukemi commented Mar 29, 2019

@hdrabkin is there a problem in GO with the tRNA methyltransferase complex being a part_of the cytoplasm?

If so, can you open a ticket and fix it? This would be an example of an error in GO discovered by this project.

@hdrabkin

Depending on organism, tRNA methylation can occur in different places:
Prok: cyto
Euk: nuclear, mitochondrial, but also sometimes cytoplasmic (Dmt2, zebrafish)

As for this complex, it might not just be in cytoplasm:
Anderson J., Phan L., Cuesta R., Carlson B. A., Pak M., Asano K., et al. (1998). The essential Gcd10p-Gcd14p nuclear complex is required for 1-methyladenosine modification and maturation of initiator methionyl-tRNA. Genes Dev. 12, 3650–3652. doi:10.1101/gad.12.23.3650
Will need to read more to get a consensus.

@goodb
Contributor Author

goodb commented Mar 29, 2019

If I may, this seems like an oddity of the cellular component ontology in general. I would have assumed that the main axis of differentia for this ontology was location, not function. That is really the way Reactome is using the ontology (exclusively for location, not entity function classification) and the way I would have expected it to be before looking into this. The branch of the ontology where protein complexes are defined based on what they are capable of seems like it might better be placed in an entirely separate ontology. I know that this is part of a very long debate. But if a fresh perspective is useful, it just looks weird to me as it is...

@deustp01
Collaborator

It is also a use case for the argument that the scope of GO should end at protein-containing complex and terms for more specific children should be retired and handled instead as annotations (a really good fit to the strengths of GO-CAM) or by specialized projects like PRO or IntAct / Complex Portal.

@deustp01
Collaborator

deustp01 commented Mar 29, 2019

> Depending on organism tRNA methylation can occur
> Prok: cyto
> Euk: nuclear, mitochondrial, but also sometimes cytoplasmic (Dmt2, zebrafish)

From what I saw poking around this morning, it’s really hard to find cellular location data for the various kinds of covalent modifications of residues in tRNAs (as opposed to splicing and cutting the poly-RNA itself, which all appears to be nuclear), and I guess there’s no reason a priori why different kinds of residue modifications couldn’t occur in different places in the cell, nor why a given modification couldn't occur in different places in cells from different taxa.

Maybe that's the ontology fix - based on the very limited data, and the suggestions of diverse locations from the bit of information that is available, remove all parents that restrict tRNA methylation complexes to specific subcellular locations - handle that as a separate annotation.

@goodb goodb changed the title from "bug or feature ? HATs acetylate histones" to "How to handle location information when complex function enabler is itself in multiple locations" Apr 5, 2019
@goodb
Contributor Author

goodb commented Apr 19, 2019

Unless there are any objections, I'm going to implement the addition of the gene_product part_of location relations as @ukemi suggested and summarized in #59 (comment) . That will close this issue - (hope to see the other ontology problems/suggestions attended to elsewhere).

@goodb goodb moved this from To do to In progress in DONE 2020-05 (Paris) Pathways2GO Version 1.0 Apr 19, 2019
@goodb
Contributor Author

goodb commented Apr 19, 2019

Noting that, as it stands, if a complex is part of a location A, and the complex has_part protein P with location B (as in the picture above for ELP1), the GPAD export will say that protein P is part of both locations A and B.
Screen Shot 2019-04-19 at 2 15 59 PM

It's a little surprising that this doesn't cause an inconsistency. I suppose one way to ameliorate this would be to take the location annotation off of the complex when it is on its members.

@goodb
Contributor Author

goodb commented Apr 22, 2019

I have implemented the rule as follows:
If a reaction is not assigned an occurs_in relation per the other rules for establishing that, then add entity part_of location for the entity (and its components) that enables the reaction. Questions:

  1. Should this apply to all entities involved (e.g. as inputs, outputs) or just to entities that enable the reaction?
  2. As noted above, this can result in a degree of redundancy for complexes - especially when the models are turned into gpad. Is this a problem?
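
The rule as stated can be sketched roughly as follows. This is an illustrative Python rendering under stated assumptions, not the actual implementation: the function name, the triple representation, and the example entity names are hypothetical, and relations are written as plain strings rather than ontology IRIs.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]


def location_triples(
    reaction: str,
    occurs_in: Optional[str],
    enabler: str,
    enabler_location: Optional[str],
    part_locations: Dict[str, str],
) -> List[Triple]:
    """Emit occurs_in when it was established, else fall back to part_of."""
    if occurs_in is not None:
        # The other rules assigned a location: keep only the reaction-level edge.
        return [(reaction, "occurs_in", occurs_in)]
    triples: List[Triple] = []
    if enabler_location is not None:
        triples.append((enabler, "part_of", enabler_location))
    # Preserve the location of each component of the enabler.
    for part, location in sorted(part_locations.items()):
        triples.append((part, "part_of", location))
    return triples


# Hypothetical names, for illustration only.
fallback = location_triples(
    "reaction_1", None, "complex_1", "nucleoplasm",
    {"protein_a": "nucleoplasm", "protein_b": "cytosol"},
)
for t in fallback:
    print(t)
```

The redundancy raised in question 2 is visible here: the complex and each of its parts get their own part_of statement, so an export that propagates the complex's location to its members would list two locations for a part like protein_b.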

Though I understand this particular pathway will not exhibit this structure in the next release, here is what the current one that started this thread looks like now:
Screen Shot 2019-04-22 at 1 41 18 PM

@cmungall
Member

cmungall commented Apr 22, 2019 via email

@deustp01
Collaborator

For discussion on Wednesday? How much value is added by keeping location information when it's non-redundant? How easy / safe is it to do this? Can we account for complexes that sprawl across locations, like a receptor-ligand complex that has extracellular, membrane-associated, and cytosolic bits whose distinct locations are central to the signal-transducing function of the complex?

@goodb goodb added the question label Apr 23, 2019
@goodb
Contributor Author

goodb commented Apr 24, 2019

make a list for Peter..

@goodb
Contributor Author

goodb commented Apr 25, 2019

@deustp01 here is a file containing a table of pathways and reactions that are enabled by entities or sets of entities that are annotated with more than one location.
multi_location_enablers.txt

It looks like some of these have sets in the active unit slot and this is getting skipped in the code that detects active units. Fixing that ought to reduce the size of the list as the location of the reaction will be determined by the active unit annotation and the rest of the locations will disappear in the generated go-cam. I will work on that in combination with #61

@goodb
Contributor Author

goodb commented Apr 26, 2019

I believe the sets in the active unit slot problem has been solved. With the additional reaction locations inferred from these entities, the total size of the table is reduced, but only slightly. Here is the new version.

multi_location_enablers_now_with_active_unit_sets.txt

Noting for posterity the query used to generate that table - running on a blazegraph instance loaded with all of the reactome models (and nothing else).

prefix owl: <http://www.w3.org/2002/07/owl#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
prefix obo: <http://purl.obolibrary.org/obo/>
prefix biopax: <http://www.biopax.org/release/biopax-level3.owl#>
prefix skos: <http://www.w3.org/2004/02/skos/core#>

#find reactions with uncertain location so we can add location to the physical entities
select distinct ?pathway_id ?pathway_label ?reaction_id ?reaction_label ?enabler_complex_label (COUNT(distinct ?location_type) AS ?n_locations) # ?location_type ?location_label    
where { 
  ?reaction obo:BFO_0000050 ?pathway . 
  ?pathway <http://www.geneontology.org/formats/oboInOwl#hasDbXref>	?pathway_id . 
  ?reaction <http://www.geneontology.org/formats/oboInOwl#hasDbXref> ?reaction_id . 
  ?reaction rdfs:label ?reaction_label . 
  ?pathway rdfs:label ?pathway_label .  
  #find reactions that have no occurs_in information 
  optional { 
     ?reaction obo:BFO_0000066 ?reaction_location .    
   }
  FILTER (!BOUND(?reaction_location)) .  
  # get location information for the physical entities 
  ?reaction obo:RO_0002333 ?enabler_complex . #enabled by|output|input   |obo:RO_0002234:obo:RO_0002233  
  ?enabler_complex rdfs:label ?enabler_complex_label . 
  ?enabler_complex obo:BFO_0000051 ?enabler . 
  ?enabler obo:BFO_0000050 ?location .
  ?location rdf:type ?location_type . 
  ?location_type rdfs:label ?location_label . 
  FILTER (?location_type != owl:NamedIndividual) .  
}
group by ?pathway_id ?pathway_label ?reaction_id ?reaction_label ?enabler_complex_label order by ?pathway_id ?reaction_id ?n_locations 

@deustp01
Collaborator

Reviewing the table multi_location_enablers_now_with_active_unit_sets.txt and this issue and also #51 to figure out what I'm looking for in the table has led to a question about #51. There, you made a rule:

> For reactions with multiple entity locations, that are enabled by something, the reaction occurs_in the location of the enabler. Other location information is dropped.

Today (should have thought about this at the time), I see the exception that needs to be accommodated: an enzyme located in a membrane but accessible from the adjoining cytosol enables a transformation of something in the cytosol as in this case from a month ago #59 (comment)

@deustp01
Collaborator

Looking first at the cases with the most locations in the multi_location_enablers_now_with_active_unit_sets.txt table, almost all the 4-location cases are curation oddities in Reactome. Some look like plain mistakes. Others are more subtle, for example different specialized types of plasma membrane get counted as different locations but for the purpose of the geometry of a GO-CAM model they can all be collapsed into one - a problem that can be detected and maybe automatically handled in a future when spatial relationships are available.

@goodb
Contributor Author

goodb commented May 7, 2019

@deustp01 could you help me formalize the exception you describe as "an enzyme located in a membrane but accessible from the adjoining cytosol enables a transformation of something in the cytosol as in this case from a month ago"?

Is that a situation that can be captured directly from the explicit information in the pathway? I'm not sure how to detect the "accessible from the adjoining cytosol" aspect.

@deustp01
Collaborator

deustp01 commented May 7, 2019

@goodb the right way to do it would be for us to use GO:0009898 cytoplasmic side of plasma membrane as the location for membrane-associated entities that are located so that they can interact with entities in the cytosol. There are two lethal practical problems: 1) a 20-year legacy clean-up which 2) would entail the curation hack of reasoning that since the membrane-associated entity is observed experimentally to enable something involving cytosolic entities or vice versa therefore the membrane associated entity must be at the cytoplasmic face of the membrane even though no one has produced any evidence that explicitly, precisely puts it there.

So, can you implement that hack in your logic: if entities are located in adjacent compartments (we provide a list of adjacent compartment pairs, starting with plasma membrane : cytosol and plasma membrane : extracellular region), you allow the locations? Getting a comprehensive list of adjacent compartments / locations will be a natural part of getting spatial relations among GO cell_component terms sorted out. This will take a while, but we can immediately construct a reliable lookup list as we come on location combinations that need to be sorted out.

I guess this would need to be accompanied by some documentation that spells out the curator-biologist reasoning that says this is OK.
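
A minimal sketch of the adjacency lookup proposed above, in Python for illustration. The two pairs are the ones named in this comment; the function name, the use of plain label strings, and the choice to require every pair of distinct locations to be adjacent are assumptions, and a different policy (for example, adjacency to the enabler only) would be equally consistent with the proposal.

```python
from itertools import combinations
from typing import Iterable

# Seed list of adjacent compartment pairs, to be extended by curators.
ADJACENT = {
    frozenset({"plasma membrane", "cytosol"}),
    frozenset({"plasma membrane", "extracellular region"}),
}


def locations_compatible(locations: Iterable[str]) -> bool:
    """True if all locations are identical or pairwise adjacent."""
    distinct = set(locations)
    if len(distinct) <= 1:
        return True
    # Assumption: every pair of distinct locations must be in the lookup list.
    return all(frozenset(pair) in ADJACENT for pair in combinations(distinct, 2))


print(locations_compatible(["plasma membrane", "cytosol"]))
print(locations_compatible(["cytosol", "nucleoplasm"]))
```

Under the pairwise assumption, a complex spanning extracellular region, plasma membrane, and cytosol is rejected because extracellular region and cytosol are not listed as adjacent.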

@goodb
Contributor Author

goodb commented May 7, 2019

@deustp01 I can detect entities in adjacent compartments, given the adjacency list, but then what is the desired output in the go-cam model?

If enzyme E is located in the plasma membrane (never mind for now if it is a complex with members in different locations), and catalyzes a reaction R that has inputs and outputs in the cytosol,
Then what should we see in the GO-CAM? e.g. R occurs_in cytosol ?

Have a look at the model http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-1963640 and attached image to see the current go-cam version of the reaction you linked to above. (With the manual addition of location information on the inputs and outputs). Let me know what you think it should look like.

Screen Shot 2019-05-07 at 3 39 27 PM

@ukemi

ukemi commented May 8, 2019

We talked a bit about this while you were away, @goodb. Whatever we do here, we have to make it consistent with what we tell curators and have in the manual curation documentation. If I remember correctly, @vanaukenk, we decided that for things like receptors that are embedded in a membrane but execute their catalytic function on one side of the membrane, we would tell curators to annotate the function as occurring in the membrane even though it is not precise. The rationale was that no biologist would balk at the statement that the FGF receptor functions in the plasma membrane.

@deustp01
Collaborator

deustp01 commented May 8, 2019

@goodb @ukemi Right - told the story; left off the conclusion. That is also the Reactome view - locate the activity where the active unit is and, consistent with David's rationale, users accept that annotation style.

@goodb
Contributor Author

goodb commented May 8, 2019

In this case, we do not have an active unit annotated within the complex. When we do have active units identified, we avoid this multi-location problem. Here, the whole complex enables the reaction taking place in the cytosol. The complex is directly assigned the location of plasma membrane, but then it has components in plasma membrane, extracellular, and cytosol.

Maybe in this case, we should simply ignore the locations of the members of the complex and use the direct assertion of the complex's location (here plasma membrane) to establish the occurs_in relation on the reaction node?

We can keep the extra location information on the parts (via part_of) in cases where there are multiple locations per complex.

@deustp01
Collaborator

deustp01 commented May 8, 2019

Reviewing the list that Ben generated, I see a lot of cases (almost all?) where we have not assigned an active unit (it is optional) but could, which will cut the size of the problem way down.

@goodb
Contributor Author

goodb commented May 8, 2019

In the case I'm worried about here, we do not have a specific active unit assigned for the complex. Based on the data I have, the whole complex enables the reaction. The problem occurs when the complex has components in different locations - here plasma membrane, cytosol, and extracellular.

One thing I can do that, in this case, would be consistent with what @ukemi and @deustp01 seem to be saying, is to ignore the location annotations for the components of the complex and just use the location that is directly asserted for the complex itself. In this case that would result in the addition of the statement reaction - occurs in - plasma membrane. We could keep the multi-location information on the complex components via part_of statements (as things stand now).

@goodb
Contributor Author

goodb commented May 8, 2019

The resolution from the meeting today is to ignore the location information on the components of complexes. @thomaspd thinks they complicate the models beyond the scope he wants for go-cam. We will simply look only at the directly asserted location of the enabling complex itself to determine what CC the reaction occurs_in. As @deustp01 and his team work through the addition of missing active unit annotations, this problem will shrink.
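
The agreed resolution can be summarized in a short Python sketch. It is an illustration under assumptions, not the committed code: the function name and parameters are hypothetical, and the precedence of an annotated active unit over the complex's own location is inferred from the earlier comments in this thread.

```python
from typing import Iterable, Optional


def reaction_occurs_in(
    complex_location: Optional[str],
    active_unit_location: Optional[str] = None,
    component_locations: Iterable[str] = (),
) -> Optional[str]:
    """Pick the occurs_in location for a reaction enabled by a complex."""
    # Component locations are accepted but deliberately ignored.
    del component_locations
    if active_unit_location is not None:
        # An annotated active unit determines where the activity occurs.
        return active_unit_location
    # Otherwise use only the location asserted directly on the complex.
    return complex_location


# The case discussed above: complex asserted in the plasma membrane,
# with components in plasma membrane, extracellular region, and cytosol.
print(reaction_occurs_in(
    "plasma membrane",
    component_locations=["plasma membrane", "extracellular region", "cytosol"],
))
```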

@goodb goodb closed this as completed in 48cfc54 May 9, 2019
DONE 2020-05 (Paris) Pathways2GO Version 1.0 automation moved this from In progress to Done May 9, 2019