ID / name collisions in ecocyc and goa_sars-cov-2 against uniprot_reviewed_virus_bacteria causing problem in NEO pipeline #80

Closed
kltm opened this issue Feb 2, 2022 · 16 comments

kltm commented Feb 2, 2022

The NEO pipeline is now failing with the following error:

"multiple name tags not allowed"

Originating error: 10:49:24 Exception in thread "main" org.semanticweb.owlapi.model.OWLOntologyStorageException: org.obolibrary.oboformat.model.FrameStructureException: multiple name tags not allowed. in frame:Frame(UniProtKB:P17846 id( UniProtKB:P17846)synonym( cysI RELATED)synonym( cysI BROAD)synonym( P17846 RELATED)synonym( b2763 RELATED)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/MacromolecularMachine)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/GeneProduct)name( cysI NCBITaxon:83333)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/Protein)synonym( cysI BROAD[NCBITaxon:83333 ])name( cysI ecocyc)synonym( sulfite reductase hemoprotein subunit ecocyc EXACT)synonym( JW2733 RELATED)is_a( CHEBI:36080)relationship( in_taxon NCBITaxon:83333)is_a( CHEBI:33695))

In gene_association.ecocyc.gz, the triggering line seems to be:

UniProtKB	P17846	cysI	part_of	GO:0009337	PMID:21873635	IBA	PANTHER:PTN001353165|SGD:S000003898|UniProtKB:P17846	C	"sulfite reductase, hemoprotein subunit"	cysI|b2763|ECK2758	protein	taxon:83333	20200213	GO_Central		PR:P17846

Tagging @pgaudet @vanaukenk @balhoff

pgaudet commented Feb 3, 2022

@kltm
I thought you were filtering IBAs coming from upstreams?

Is this for Ecocyc to fix, or for PAINT?

kltm commented Feb 3, 2022

@pgaudet Filtering IBAs is what happens for the "main" GO pipeline as part of applying the GO rules, not for this NEO pipeline, which essentially just takes a set of files, OBO-ifies them, and turns them into an ontology for the autocomplete to run off of--there are no rules or filters run on this input. NEO is all annotatable entities, so we likely do not want to filter things for violations in the same way that the "main" pipeline does.

Since this was introduced recently and seems to be an actual issue, I feel that this is something that we'd want the upstream to take care of (unsure if this would be in their processing or in PAINT). If necessary, we could start trying to filter things, but I'd be rather uneasy about that.

Alternatively, if ecocyc had a GPI available, we could switch over to that (essentially what we're doing by wringing out the GAF).
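
As a rough illustration of what "wringing out the GAF" amounts to, here is a minimal Python sketch that pulls the entity-level fields (the ones a GPI would carry) out of one GAF row. It is not the NEO pipeline's code; the `Entity` type and `entity_from_gaf_line` function are hypothetical names, and the column positions follow the standard GAF 2.x layout. The example row reuses the values from the line quoted in the first comment, with the column boundaries assumed.

```python
from typing import List, NamedTuple, Optional


class Entity(NamedTuple):
    """Entity-level fields of a GAF row (hypothetical helper type)."""
    curie: str
    symbol: str
    name: str
    synonyms: List[str]
    entity_type: str
    taxon: str


def entity_from_gaf_line(line: str) -> Optional[Entity]:
    """Return the annotated entity for one GAF row, or None for comments/short rows."""
    if line.startswith("!"):
        return None  # header or comment line
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 15:
        return None  # not a complete GAF 2.x row
    return Entity(
        curie=f"{cols[0]}:{cols[1]}",                    # DB + DB Object ID
        symbol=cols[2],                                   # DB Object Symbol
        name=cols[9],                                     # DB Object Name
        synonyms=[s for s in cols[10].split("|") if s],   # DB Object Synonym(s)
        entity_type=cols[11],                             # DB Object Type
        taxon=cols[12],                                   # Taxon
    )


# Values taken from the row quoted in this issue; column boundaries are assumed.
EXAMPLE_ROW = "\t".join([
    "UniProtKB", "P17846", "cysI", "part_of", "GO:0009337", "PMID:21873635",
    "IBA", "PANTHER:PTN001353165|SGD:S000003898|UniProtKB:P17846", "C",
    '"sulfite reductase, hemoprotein subunit"', "cysI|b2763|ECK2758",
    "protein", "taxon:83333", "20200213", "GO_Central", "", "PR:P17846",
])

print(entity_from_gaf_line(EXAMPLE_ROW))
```

Because every annotation row for the same gene product repeats these fields, a real conversion would also need to deduplicate entities by identifier.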

kltm commented Feb 3, 2022

@pgaudet Just wanted to follow up on this in a little more detail. The actual issue here is an identifier collision and what owltools does with the OBO, not anything in the GAF directly, so whether the annotation is an IBA doesn't really matter. The problematic OBO stanza in ecocyc (not really, see below) is:

[Term]
id: UniProtKB:P17846
name: cysI ecocyc
synonym: "sulfite reductase hemoprotein subunit ecocyc" EXACT []
synonym: "cysI" BROAD [NCBITaxon:83333]
is_a: CHEBI:33695 ! information biomacromolecule
property_value: https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/MacromolecularMachine
property_value: https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/GeneProduct
relationship: in_taxon NCBITaxon:83333

What I think might actually be going on here is that there is a conflict with incompatible tags (name) appearing in neo-uniprot_reviewed_virus_bacteria.obo as well:

id: UniProtKB:P17846
name: cysI NCBITaxon:83333
synonym: "cysI" BROAD []
synonym: "cysI" RELATED []
synonym: "JW2733" RELATED []
synonym: "b2763" RELATED []
synonym: "P17846" RELATED []
is_a: CHEBI:36080 ! protein
relationship: in_taxon NCBITaxon:83333
property_value: https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/MacromolecularMachine
property_value: https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/Protein
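
To make the suspected failure mode concrete, here is a minimal Python sketch that groups stanzas from several OBO texts by `id` and reports any identifier that ends up with more than one distinct `name`. It is an illustration only, not what owltools does internally; `parse_stanzas` and `conflicting_names` are hypothetical helpers, and the parser assumes simple `tag: value` lines with stanzas separated by blank lines or `[...]` headers.

```python
from collections import defaultdict


def parse_stanzas(obo_text):
    """Split OBO text into stanzas, each a {tag: [values]} mapping."""
    stanzas = []
    current = None
    for raw in obo_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("["):
            # A blank line or a header such as [Term] closes the open stanza.
            if current:
                stanzas.append(current)
            current = None
            continue
        tag, sep, value = line.partition(": ")
        if not sep:
            continue  # not a tag-value line
        if current is None:
            current = defaultdict(list)
        current[tag].append(value)
    if current:
        stanzas.append(current)
    return stanzas


def conflicting_names(*obo_texts):
    """Map each id to its sorted names when the merged inputs give it more than one."""
    names = defaultdict(set)
    for text in obo_texts:
        for stanza in parse_stanzas(text):
            for term_id in stanza.get("id", []):
                names[term_id].update(stanza.get("name", []))
    return {tid: sorted(ns) for tid, ns in names.items() if len(ns) > 1}


# Trimmed copies of the two stanzas quoted above.
ECOCYC = """[Term]
id: UniProtKB:P17846
name: cysI ecocyc
"""
REVIEWED = """[Term]
id: UniProtKB:P17846
name: cysI NCBITaxon:83333
"""

print(conflicting_names(ECOCYC, REVIEWED))
```

Any identifier reported here would carry two `name` tags once the files are merged into one frame, which matches the "multiple name tags not allowed" message.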

From this, I think the solution is to drop one of the two files. Doing a little experimentation (below), I think that there may be thousands of other collisions from using both of these files at the same time that we just haven't had the chance to run into yet.

Find identifier intersection:

grep "id: " neo-uniprot_reviewed_virus_bacteria.obo | sort > /tmp/ids_rev.txt
grep "id: " neo-ecocyc.obo | sort > /tmp/ids_eco.txt
comm -12 /tmp/ids_eco.txt  /tmp/ids_rev.txt | wc -l
3895
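
The same intersection can be sketched in Python. This is a rough equivalent rather than a replacement for the commands above: the helper names `ids_from_lines` and `shared_ids` are hypothetical, and unlike `grep "id: "` it only accepts lines that begin with `id: ` and deduplicates identifiers, so counts could differ slightly from the shell version.

```python
def ids_from_lines(lines):
    """Collect identifiers from lines that start with the OBO 'id: ' tag."""
    prefix = "id: "
    return {line[len(prefix):].strip() for line in lines if line.startswith(prefix)}


def shared_ids(lines_a, lines_b):
    """Return the sorted identifiers present in both inputs."""
    return sorted(ids_from_lines(lines_a) & ids_from_lines(lines_b))


# Usage with the files from this thread (open file handles are iterables of lines):
#
#   with open("neo-ecocyc.obo") as eco, \
#        open("neo-uniprot_reviewed_virus_bacteria.obo") as rev:
#       print(len(shared_ids(eco, rev)))
```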

kltm changed the title from "GAF annotation in ecocyc causing error in NEO pipeline" to "ID / name collision in ecocyc and uniprot_reviewed_virus_bacteria causing problem in NEO pipeline" on Feb 3, 2022

kltm commented Feb 3, 2022

Also tagging in @cmungall , as we are now getting back into looking at (removal|inclusion|filtering|merging) sources. Assuming that I'm reading this right, this may just be an extension of #77 .

cmungall commented Feb 3, 2022

let's just drop the ecocyc GPI from neo

kltm added a commit that referenced this issue Feb 3, 2022

kltm commented Feb 3, 2022

Removed (noting that ecocyc was a GAF, not a GPI). Now testing.

kltm commented Feb 4, 2022

@cmungall I think you can see where this is going...

grep "id: " neo-goa_sars-cov-2.obo | sort > /tmp/ids_cov.txt
comm -12 /tmp/ids_rev.txt /tmp/ids_cov.txt | wc -l
14

The shared IDs between neo-goa_sars-cov-2.obo and neo-uniprot_reviewed_virus_bacteria.obo are:

id: UniProtKB:P0DTC1
id: UniProtKB:P0DTC2
id: UniProtKB:P0DTC3
id: UniProtKB:P0DTC4
id: UniProtKB:P0DTC5
id: UniProtKB:P0DTC6
id: UniProtKB:P0DTC7
id: UniProtKB:P0DTC8
id: UniProtKB:P0DTC9
id: UniProtKB:P0DTD1
id: UniProtKB:P0DTD2
id: UniProtKB:P0DTD3
id: UniProtKB:P0DTD8

Our source for sars-cov-2 is https://raw.githubusercontent.com/Knowledge-Graph-Hub/kg-covid-19/master/curated/ORFs/uniprot_sars-cov-2.gpi. Should we just go in and pop those out, ask uniprot upstream to remove, or something else?

cmungall commented Feb 4, 2022 via email

kltm commented Feb 4, 2022

I believe you're referring to geneontology/go-site#1431 ?

kltm commented Feb 4, 2022

@cmungall Specifically from there geneontology/go-site#1431 (comment) .
Okay, as a simple workaround for the moment, should we edit the hand-curated kg-covid file to remove those 14 items and revisit this later on as part of geneontology/go-site#1431 ?
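
A minimal Python sketch of that workaround, assuming the hand-curated file uses a GPI 1.x style layout with the database in column 1 and the object ID in column 2 (that layout is an assumption, as are the hypothetical function name `drop_colliding_rows` and the placeholder rows in the usage comment):

```python
def drop_colliding_rows(gpi_lines, colliding_ids):
    """Return the GPI lines whose DB:ID is not in colliding_ids.

    Header/comment lines (starting with '!') are always kept.
    Assumes column 1 is the database and column 2 is the object ID.
    """
    kept = []
    for line in gpi_lines:
        if line.startswith("!"):
            kept.append(line)
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2 and f"{cols[0]}:{cols[1]}" in colliding_ids:
            continue  # this entity is already provided by the other source
        kept.append(line)
    return kept


# Usage with placeholder rows (not real file contents):
#
#   rows = ["!gpi-version: 1.2\n",
#           "UniProtKB\tP0DTC1\tsymbolA\n",
#           "UniProtKB\tHYPOTHETICAL_ID\tsymbolB\n"]
#   drop_colliding_rows(rows, {"UniProtKB:P0DTC1"})
```

The set of colliding identifiers would come from the intersection computed against neo-uniprot_reviewed_virus_bacteria.obo, with the `id: ` prefix stripped.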

kltm added a commit to Knowledge-Graph-Hub/kg-covid-19 that referenced this issue Feb 4, 2022

kltm commented Feb 4, 2022

@cmungall For example Knowledge-Graph-Hub/kg-covid-19#440 (feel free to close).
Basically, it's all the "normal" IDs in that file.

cmungall commented Feb 4, 2022 via email

kltm commented Feb 4, 2022

Per the thread with @cmungall and @pgaudet, we are switching to uniprot_reviewed_virus_bacteria over kg-covid.

kltm changed the title from "ID / name collision in ecocyc and uniprot_reviewed_virus_bacteria causing problem in NEO pipeline" to "ID / name collisions in ecocyc and goa_sars-cov-2 against uniprot_reviewed_virus_bacteria causing problem in NEO pipeline" on Feb 4, 2022
pgaudet commented Feb 22, 2022

For SARS-CoV-2 we'd like to keep the old file, as fixed by @cmungall.

Is that OK? Where do we specify this? Should we create a virus.yaml file for this (and other viruses that we might fix in the future)?

Thanks, Pascale

kltm commented Feb 22, 2022

Okay, to be honest, I'm a little confused about the current state of desires here.

As of this moment, we are loading:
sgd pombase mgi zfin rgd dictybase fb tair wb goa_human goa_human_complex goa_human_rna goa_human_isoform goa_pig xenbase pseudocap ecocyc

What file is to be loaded in addition to this? And has that file been fixed so that it no longer collides with what we're currently loading?


This is currently not a problem anymore if we don't load the reviewed viruses and bacteria file (#77).
