ID / name collisions in ecocyc and goa_sars-cov-2 against uniprot_reviewed_virus_bacteria causing problem in NEO pipeline #80
Comments
@kltm Is this for Ecocyc to fix, or for PAINT? |
@pgaudet Filtering IBAs is what happens for the "main" GO pipeline as part of applying the GO rules, not for this NEO pipeline, which essentially just takes a set of files, OBO-ifies them, and turns them into an ontology for the autocomplete to run off of--there are no rules or filters run on this input. NEO is all annotatable entities, so we likely do not want to filter things for violations in the same way that the "main" pipeline does. Since this was introduced recently and seems to be an actual issue, I feel that this is something that we'd want the upstream to take care of (unsure if this would be in their processing or in PAINT). If necessary, we could start trying to filter things, but I'd be rather uneasy about that. Alternatively, if ecocyc had a GPI available, we could switch over to that (essentially what we're doing by wringing out the GAF). |
@pgaudet Just wanted to follow up on this in a little more detail. The actual issue here is identifier collision and what owltools is doing with the OBO, not in the GAF directly, so IBA or not doesn't really matter. The problematic stanza in obo in ecocyc (not really, see below) is:
What I think might actually be going on here is that there is a conflict with incompatible tags (name) appearing in
From this, I think the solution is to drop either file. Doing a little experimentation (below), I think that there may be thousands of other issues in using both of these files at the same time that we just haven't had the chance to run into yet. Find identifier intersection:
|
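The identifier intersection mentioned above can be sketched in Python as follows. This is a minimal sketch, not the commands actually run in this thread: the function names `obo_ids` and `shared_ids` are hypothetical, and it assumes each stanza identifier sits on its own line starting with `id:`.

```python
import re

# Matches OBO stanza identifier lines such as "id: UniProtKB:P0DTC1".
ID_LINE = re.compile(r"^id:\s*(\S+)")


def obo_ids(path):
    """Return the set of identifiers found on `id:` lines of an OBO file."""
    ids = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            match = ID_LINE.match(line)
            if match:
                ids.add(match.group(1))
    return ids


def shared_ids(path_a, path_b):
    """Return the sorted identifiers that appear in both OBO files."""
    return sorted(obo_ids(path_a) & obo_ids(path_b))
```

Calling `shared_ids` on two of the generated NEO OBO files would list every identifier that both files define, which is the set of frames at risk of being merged when the files are loaded together.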
let's just drop the ecocyc GPI from neo |
Removed (noting that ecocyc was a GAF, not a GPI). Now testing. |
@cmungall I think you can see where this is going...
The shared IDs between
Our source for sars-cov-2 is https://raw.githubusercontent.com/Knowledge-Graph-Hub/kg-covid-19/master/curated/ORFs/uniprot_sars-cov-2.gpi. Should we just go in and pop those out, ask uniprot upstream to remove, or something else? |
We have a separate ticket on that one. I still think the hand-curated GPI that Marcin did is better for curators, but if the virus group is happy to switch and has conventions to magically choose the right protein, I'm OK.
On Thu, Feb 3, 2022 at 4:19 PM kltm wrote:
@cmungall I think you can see where this is going...
grep "id: " neo-goa_sars-cov-2.obo | sort > /tmp/ids_cov.txt
comm -12 /tmp/ids_rev.txt /tmp/ids_cov.txt | wc -l
14
The shared IDs between neo-goa_sars-cov-2.obo and
neo-uniprot_reviewed_virus_bacteria.obo are:
id: UniProtKB:P0DTC1
id: UniProtKB:P0DTC2
id: UniProtKB:P0DTC3
id: UniProtKB:P0DTC4
id: UniProtKB:P0DTC5
id: UniProtKB:P0DTC6
id: UniProtKB:P0DTC7
id: UniProtKB:P0DTC8
id: UniProtKB:P0DTC9
id: UniProtKB:P0DTD1
id: UniProtKB:P0DTD2
id: UniProtKB:P0DTD3
id: UniProtKB:P0DTD8
Our source for sars-cov-2 is
https://raw.githubusercontent.com/Knowledge-Graph-Hub/kg-covid-19/master/curated/ORFs/uniprot_sars-cov-2.gpi.
Should we just go in and pop those out, ask uniprot upstream to remove, or
something else?
|
I believe you're referring to geneontology/go-site#1431 ? |
@cmungall Specifically from there geneontology/go-site#1431 (comment) . |
Remove colliding IDs geneontology/neo#80
@cmungall For example Knowledge-Graph-Hub/kg-covid-19#440 (feel free to close). |
No editing of the hand-curated file; it's good, and it's used elsewhere.
Just remove it from the load for now; we can revisit later. Just let Patrick know when it's done.
On Thu, Feb 3, 2022 at 4:54 PM kltm wrote:
@cmungall Specifically from there geneontology/go-site#1431 (comment).
Okay, as a simple workaround for the moment, should we edit the hand-curated kg-covid file to remove those 14 items and revisit this later on as part of geneontology/go-site#1431?
|
For SARS-CoV-2 we'd like to keep the old file, as fixed by @cmungall. Is that OK? Where do we specify this? Should we create a virus.yaml file for this (and other viruses that we might fix in the future)? Thanks, Pascale |
Okay, to be honest, I'm a little confused about the current state of desires here. As of this moment, we are loading: What file is to be loaded in addition to this? And this file has been fixed so that it no longer collides with what we're currently loading? |
Currently not a problem anymore if we don't load the viruses and bacteria-reviewed file (#77) |
The NEO pipeline is now failing with the following error:
"multiple name tags not allowed"
Originating error:
10:49:24 Exception in thread "main" org.semanticweb.owlapi.model.OWLOntologyStorageException: org.obolibrary.oboformat.model.FrameStructureException: multiple name tags not allowed. in frame:Frame(UniProtKB:P17846 id( UniProtKB:P17846)synonym( cysI RELATED)synonym( cysI BROAD)synonym( P17846 RELATED)synonym( b2763 RELATED)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/MacromolecularMachine)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/GeneProduct)name( cysI NCBITaxon:83333)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/Protein)synonym( cysI BROAD[NCBITaxon:83333 ])name( cysI ecocyc)synonym( sulfite reductase hemoprotein subunit ecocyc EXACT)synonym( JW2733 RELATED)is_a( CHEBI:36080)relationship( in_taxon NCBITaxon:83333)is_a( CHEBI:33695))
In gene_association.ecocyc.gz, the triggering line seems to be:
Tagging @pgaudet @vanaukenk @balhoff
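One way to look for frames that would trigger "multiple name tags not allowed" before running the pipeline is to collect the `name:` values per identifier across all input OBO files. The sketch below is illustrative only: the helper names `names_by_id` and `multiple_names` are hypothetical, it assumes `id:` and `name:` each sit on their own line within a stanza, and it does not reproduce how owltools merges frames. It flags any identifier with more than one `name:` line, even if the values are identical.

```python
from collections import defaultdict


def names_by_id(lines):
    """Map each stanza identifier to the `name:` values listed under it."""
    names = defaultdict(list)
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):  # stanza header such as [Term]
            current = None
        elif line.startswith("id:"):
            current = line[len("id:"):].strip()
        elif line.startswith("name:") and current is not None:
            names[current].append(line[len("name:"):].strip())
    return names


def multiple_names(*sources):
    """Return identifiers carrying more than one name across all sources.

    Each source is an iterable of text lines, for example an open file.
    """
    merged = defaultdict(list)
    for lines in sources:
        for ident, values in names_by_id(lines).items():
            merged[ident].extend(values)
    return {ident: vals for ident, vals in merged.items() if len(vals) > 1}
```

Passing the open file handles of every OBO file in the load to `multiple_names` would report identifiers such as UniProtKB:P17846 if they receive a name from more than one source.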