Should GPAD association writer in ontobio, use GPI files and isoform protein identifiers in associations to modify the subject of annotation in GPAD output? #36

Closed
sierra-moxon opened this issue Feb 26, 2024 · 7 comments

@sierra-moxon
Member

sierra-moxon commented Feb 26, 2024

From Li's comments here: geneontology/go-site#2043

It looks like we should add code to ontobio so that we can produce GPADs with protein subject identifiers when GAF annotations have isoform identifiers that match ids in the associated GPI file. This is a medium-ish change to the GPAD association writer and would result in GPAD and GAF annotation files with different subjects.

tagging @kltm

snipped from the other ticket for ease of understanding:

==========================================
in the GAF file I produce:

SMoxon@SMoxon-M82 ontobio % grep "MGI:87961" mgi_022624.gaf | grep "A2ASQ1-2"
MGI	MGI:87961	Agrn	enables	GO:0005201	PMID:22159717	RCA		F	Agrin		protein	taxon:10090	20180725	BHF-UCL	occurs_in(UBERON:0002048)	PR:A2ASQ1-2
MGI	MGI:87961	Agrn	located_in	GO:0062023	PMID:22159717	HDA		C	Agrin		protein	taxon:10090	20180725	BHF-UCL	part_of(UBERON:0002048)	PR:A2ASQ1-2

Per David above:

Hi @sierra-moxon The only issue that I see with these gaf lines is the last column. If you switch to:
MGI MGI:87961 Agrn enables GO:0005201 PMID:22159717 RCA F Agrin protein taxon:10090 20180725 BHF-UCL occurs_in(UBERON:0002048) PR:A2ASQ1-2
MGI MGI:87961 Agrn located_in GO:0062023 PMID:22159717 HDA C Agrin protein taxon:10090 20180725 BHF-UCL part_of(UBERON:0002048) PR:A2ASQ1-2

I think this will work. We can look together at 3/noon.

this is what I think it looks like in the final GPAD:

MGI:MGI:87961		RO:0002327	GO:0005201	PMID:22159717	ECO:0000245			2018-07-25	BHF-UCL	BFO:0000066(UBERON:0002048)

Thanks Sierra @sierra-moxon
Gaf file looks good!
My understanding is that when there is isoform information in the GAF (last column of the GAF), we will use the isoform PR:ID as the DB Object ID in the first column of the GPAD. Am I right, @ukemi?
So the final GPAD will look like:
PR:A2ASQ1-2 RO:0002327 GO:0005201 PMID:22159717 ECO:0000245

@sierra-moxon sierra-moxon self-assigned this Feb 26, 2024
@sierra-moxon sierra-moxon changed the title Should GPAD association writer in ontobio, use GPI files and isoform protein identifiers in GAF files to modify the subject of annotation in GPAD output? Should GPAD association writer in ontobio, use GPI files and isoform protein identifiers in associations to modify the subject of annotation in GPAD output? Feb 26, 2024
@kltm
Member

kltm commented Feb 27, 2024

@kltm is wondering whether this use case can be covered by an extension or a property.
To summarize: the isoform is found in the GAF but not in the GPAD. This is a total loss of information in the GPAD, because the GPI file has the mapping but not the reverse mapping for any given annotation (i.e. it is many-to-one).
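
A minimal sketch of the many-to-one problem described above. It assumes a GPI-derived lookup from isoform ID to parent gene; the second isoform ID is made up for illustration, and `recover_isoform` is a hypothetical helper, not part of ontobio.

```python
from collections import defaultdict

# Assumed GPI-derived mapping: isoform ID -> parent gene ID.
isoform_to_gene = {
    "PR:A2ASQ1-2": "MGI:MGI:87961",              # pairing seen in the GAF lines above
    "PR:HYPOTHETICAL-ISOFORM": "MGI:MGI:87961",  # made-up second isoform, illustration only
}

# Invert the mapping: one gene can have several isoforms.
gene_to_isoforms = defaultdict(set)
for isoform, gene in isoform_to_gene.items():
    gene_to_isoforms[gene].add(isoform)


def recover_isoform(gpad_subject):
    """Return the isoform for a gene-level GPAD subject, or None if ambiguous or unknown."""
    candidates = gene_to_isoforms.get(gpad_subject, set())
    return next(iter(candidates)) if len(candidates) == 1 else None


# Going isoform -> gene is always possible; going gene -> isoform is not.
print(isoform_to_gene["PR:A2ASQ1-2"])
print(recover_isoform("MGI:MGI:87961"))
```

With more than one isoform per gene, a GPAD row whose subject is only the gene ID cannot be traced back to the isoform that was annotated in the GAF.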

@sierra-moxon
Member Author

sierra-moxon commented Feb 29, 2024

from managers call:
action: pull the column 17 isoform ID into the subject ID field of the GPAD (for all species).
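
The action above could be sketched roughly as follows. This is not the ontobio implementation: `gpad_subject` and its `gpi_ids` argument are hypothetical names, and the optional GPI check is an assumption based on the first comment in this ticket.

```python
def gpad_subject(gaf_line, gpi_ids=None):
    """Pick the GPAD subject for one tab-separated GAF line.

    Uses the column 17 isoform ID when it is present (and, if gpi_ids is
    given, also listed there); otherwise falls back to DB:DB_Object_ID.
    """
    cols = gaf_line.rstrip("\n").split("\t")
    default_subject = f"{cols[0]}:{cols[1]}"  # GAF columns 1 and 2
    isoform = cols[16].strip() if len(cols) > 16 else ""  # GAF column 17
    if not isoform:
        return default_subject
    if gpi_ids is not None and isoform not in gpi_ids:
        return default_subject  # assumption: unknown isoforms keep the gene subject
    return isoform


# The first GAF line from this ticket, rebuilt column by column.
fields = [
    "MGI", "MGI:87961", "Agrn", "enables", "GO:0005201", "PMID:22159717",
    "RCA", "", "F", "Agrin", "", "protein", "taxon:10090", "20180725",
    "BHF-UCL", "occurs_in(UBERON:0002048)", "PR:A2ASQ1-2",
]
line = "\t".join(fields)

print(gpad_subject(line, {"PR:A2ASQ1-2"}))  # PR:A2ASQ1-2
print(gpad_subject(line, set()))            # MGI:MGI:87961
```

Whether an isoform that is missing from the GPI should fall back to the gene subject or be reported as an error is a design choice this sketch does not settle.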

discussion:
original protein2GO annotation:

SMoxon@SMoxon-M82 GOA_taxon_10090_ISOFORM % grep "A2ASQ1-2" goa_mouse_isoform.gaf 
UniProtKB	A2ASQ1	Agrn	enables	GO:0005201	PMID:22159717	RCA		F	Agrin	Agrn|Agrin	protein	taxon:10090	20180725	BHF-UCL	occurs_in(UBERON:0002048)	UniProtKB:A2ASQ1-2
UniProtKB	A2ASQ1	Agrn	located_in	GO:0062023	PMID:22159717	HDA		C	Agrin	Agrn|Agrin	protein	taxon:10090	20180725	BHF-UCL	part_of(UBERON:0002048)	UniProtKB:A2ASQ1-2
SMoxon@SMoxon-M82 GOA_taxon_10090_ISOFORM % 

Sierra is already checking the GPI in the original conversion from UniProt -> MGI + PR
some discussion of the utility of taking TrEMBL annotations that can't be mapped to GCRP (MODs handle the protein -> GCRP mapping)

@sierra-moxon
Member Author

sierra-moxon commented Feb 29, 2024

examples from Lori (PAINT annotations still have UniProt as the subject). Sierra does not handle PAINT validation against the GPI.

UniProtKB:P03985
UniProtKB:P18530

this should be another issue somewhere else, not in this "upstream remainders" project. - @LiNiMGI will handle this :)

@pgaudet

pgaudet commented Mar 13, 2024

@sierra-moxon What is the action here?

@sierra-moxon
Member Author

Hi @pgaudet - this was the action for this ticket, and the fix is in the works in my branch of ontobio used for this project:

from managers call: action: pull in the column 17 isoform id, into the subject id field of the GPAD. (for all species. yes.)

The UniProt comment "(PAINT still have UniProt as the subject)" was another topic that came up tangentially while we were talking about this ticket, so I captured it as an aside. I do not know where this is going to be handled, but Li will hopefully have filed it as an issue elsewhere.

@LiNiMGI
Collaborator

LiNiMGI commented Mar 13, 2024

At the moment MGI filters out those PAINT annotations (the ones that still have UniProt as the subject). @pgaudet we can talk more tomorrow and see what we can do about it. According to Dustin, the conversion from UniProt to MGI for PAINT annotations is done on the PANTHER side and is likely tied to older data (Reference proteome/QfO releases) than what is current in MGI.

@sierra-moxon
Member Author

sierra-moxon commented Mar 19, 2024

new file generated with fixes: http://skyhook.berkeleybop.org/full-issue-325-gopreprocess/annotations/mgi.gpad.gz

SMoxon@SMoxon-M82 pipeline % grep "A2ASQ1-2" ~/Downloads/mgi_0318_24.gpad
PR:A2ASQ1-2 RO:0002327 GO:0005201 PMID:22159717 ECO:0000245 2024-03-18 BHF-UCL BFO:0000066(UBERON:0002048)
PR:A2ASQ1-2 RO:0001025 GO:0062023 PMID:22159717 ECO:0007005 2024-03-18 BHF-UCL BFO:0000050(UBERON:0002048)
