WormBase GPI is splitting some lines causing neo pipeline to break #595
Comments
Okay, thanks @cmungall |
This issue has been fixed for the next WB release, which should be available on our ftp site later next week. |
Let's close it when it percolates through - looks like it's still there |
Another issue is the variable number of columns. There should always be 10, even if the last one or two are null. E.g. this one has 8:
9:
7:
|
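For the column-count problem above, here is a minimal sketch of how a consumer could pad short rows. It assumes tab-separated rows, "!"-prefixed header lines, and the count of 10 stated above; `normalize_gpi_line` is a hypothetical helper, not part of any existing pipeline.

```python
EXPECTED_COLUMNS = 10  # taken from the comment above, not read from the spec


def normalize_gpi_line(line: str, expected: int = EXPECTED_COLUMNS) -> str:
    """Pad a tab-separated GPI row with empty trailing columns.

    Hypothetical helper: header/comment lines (assumed to start with "!")
    are returned unchanged; data rows are returned without a trailing newline.
    """
    if line.startswith("!"):
        return line
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) > expected:
        raise ValueError(f"expected at most {expected} columns, got {len(fields)}")
    # Append empty strings so every row has exactly `expected` fields.
    fields += [""] * (expected - len(fields))
    return "\t".join(fields)
```

Padding on the consumer side only hides the problem; the producer emitting all 10 columns is still the real fix.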
Another issue: UniProtKB_GCRP is not a prefix we have registered in db-xrefs.yaml:
The xrefs should just be |
@cmungall |
@tonysawfordebi - any suggestions on indicating GCRP membership? |
Another issue I'm afraid: each entity ID should only be present once. The following has a dupe with a different symbol in each:
|
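A rough sketch of how the duplicate check could be run over a file. It assumes columns 1 to 3 are the DB, the object ID, and the symbol, and that header lines start with "!"; `find_duplicate_ids` is a hypothetical name.

```python
from collections import defaultdict


def find_duplicate_ids(lines):
    """Return {entity_id: [symbols]} for every entity that appears more than once.

    Assumes tab-separated rows where column 1 is the DB, column 2 the
    object ID and column 3 the symbol (an assumption for this sketch).
    """
    symbols_by_id = defaultdict(list)
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment and blank lines
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # too short to carry an ID and a symbol
        entity_id = f"{fields[0]}:{fields[1]}"
        symbols_by_id[entity_id].append(fields[2])
    return {eid: syms for eid, syms in symbols_by_id.items() if len(syms) > 1}
```

Passing an open file handle works, since the function only iterates over lines.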
@cmungall - in the gpi file that we generate for indexing protein metadata in QuickGO, we have a property - reference_proteome - in the properties column to indicate whether the protein is part of the reference proteome or not (actually, the value of the property is the internal identifier of the proteome, rather than a simple boolean flag). If the protein is part of the reference proteome for the species, we also have another property - is_isoform - that indicates whether the protein is an isoform or the canonical form. Another property that we set is db_subset, which indicates whether the entry is Swiss-Prot or TrEMBL. |
@cmungall - the duplicate WP:CE52235 protein ids are actually cases where two genes encode the same protein sequence. We have other cases like this, e.g. histones. For annotation, depending on the data being annotated, the curator could either select the unique gene ID or, if the specific genetic locus is not known, select the protein.
@tonysawfordebi - yes, I had remembered these existing properties over the weekend :-). Here are the gpi properties (and values that I'm aware of) relating to sequences, including the ones you mention above:
Wrt the gpi files submitted by MODs, we (WB) thought it might be useful to indicate which of the UniProtKB accessions we reference are part of the GCRP. Looking at these property tags, I'm not sure which one we should use or what makes most sense for the value. Would something like this work for the MOD files: uniprot_gcrp=YES?
Also, for properties like db_subset and reference_proteome, is it implicit that these properties refer to the source in column 1, or should we make the property names more explicit wrt the database?
Will need to update: https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-1_2.md |
@vanaukenk - I'd forgotten about the gpi file that we generate for WB (and FB and dicty and SGD!). Unlike the one that we generate for QuickGO indexing purposes, they don't include the reference_proteome and is_isoform properties, but they could if you feel that information would be useful. And yes, if we do include such properties in these files, then we should probably do it by having a property called uniprot_gcrp that takes the values 'canonical' or 'isoform', and is omitted if the protein is not in the GCRP. Or something like that. |
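To make the uniprot_gcrp proposal concrete, here is a small sketch of how a consumer might read it. It assumes the properties column holds pipe-separated key=value pairs; `parse_properties` and `gcrp_status` are hypothetical helpers, and uniprot_gcrp is only the property name floated in this thread, not an agreed tag.

```python
def parse_properties(column: str) -> dict[str, list[str]]:
    """Parse a properties column, assumed to be pipe-separated key=value pairs."""
    props: dict[str, list[str]] = {}
    for item in column.split("|"):
        if not item:
            continue  # tolerate an empty column or stray separators
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"property without '=': {item!r}")
        props.setdefault(key, []).append(value)
    return props


def gcrp_status(column: str):
    """Return 'canonical' or 'isoform' if the proposed uniprot_gcrp
    property is present, or None if it is omitted (protein not in the GCRP)."""
    values = parse_properties(column).get("uniprot_gcrp")
    return values[0] if values else None
```

Omitting the property for non-GCRP proteins, as suggested above, means absence maps naturally to `None` and no separate boolean flag is needed.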
You could specify multiple gene parents, as the parent field has cardinality>1.
Aside from this ticket, I'm wondering what our annotation policy is for histone genes and other analogous cases. I assume we just duplicate annotations to the identical genes?
Btw, the db-xrefs yaml file seems to not have a way of resolving protein entries, and I can't find this in WormBase: https://wormbase.org/search/protein/CE52235 |
I will make sure all these issues get fixed. For the resolution of protein entries, the correct local id is actually WP:CE52235 (https://wormbase.org/search/protein/WP:CE52235). These additional prefixes are an anachronism and confuse things when forming CURIEs (i.e. should the CURIE be WB:WP:CE52235? Or is "WP" a resource?). In the next release of WB (which we will start preparing in a few weeks), we will drop these prefixes. The local id for the above will then be simply CE52235. As for the global_id / CURIE, that depends on how we choose to solve the bigger picture of making sure that all front-line ids in WormBase resolve. One way of doing this (proposed by @cmungall ) is for us to write our own resolver that will recognise all of our local ids and resolve them to the correct page. This would allow us to make all of our CURIEs have the form "WB:XXXXX". For a small number of specific data types though, this will not be possible (e.g. "JC8.10a" is an identifier for distinct CDS and Transcript objects in WB). |
@cmungall @tonysawfordebi Regarding the duplication issue, the GPI format spec states that the Parent field has cardinality "0 or 1". Also, we are trying to represent the central dogma using the Parent column, GFF3-style. That is, for a protein line, we are populating the Parent column with the id of the transcript from which it is translated. Reading the spec though, it seems that this is not really what the Parent field was intended to represent. It seems very UniProt-centric (perhaps unsurprisingly). |
Thanks @khowe @cmungall. The way our WB protein IDs work right now, though, if we ever needed to specifically indicate, for example, the histone protein encoded by the his-2 locus, I don't believe we could do it, since that histone protein sequence is shared amongst 15 different loci. In practice, this hasn't happened that much for WB GO curation, but maybe it's worth thinking about if/how that could be handled in the future. @khowe what are your thoughts? |
Which GPI docs are you referring to? I regard the formal spec in markdown as canonical: https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-1_2.md
Multiple parents are allowed. There is shockingly little documentation on the semantics of this field, but the intent was analogous to GFF3/Chado.
Somehow the docs also spread to the wiki and Drupal; these may be out of sync. We have not done a good job of coordinating this. |
Thanks. We need to unify these |
@cmungall |
When I wrote our requirements doc for the GPI file, I used this: http://www.geneontology.org/page/gene-product-information-gpi-format
But it appears that we also updated this page: http://wiki.geneontology.org/index.php/Proposed_GPI1.2_format |
Okay, most of these issues have already been fixed in the latest WormBase GPI: ftp://ftp.ebi.ac.uk/pub/databases/wormbase/releases/WS265/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WS265.gene_product_info.gpi.gz This will be propagated to ftp.wormbase.org with a release-neutral URL in the next few weeks. The duplication issue is still present. I hear @cmungall 's assertion that the github version of the spec is authoritative, and will make the change to have one line for each protein, with multiple Parents where appropriate. However, I am somewhat confused by multiple (different) versions of the spec floating around that all call themselves version "1.2". |
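As an illustration of "one line for each protein, with multiple Parents", here is a hedged sketch of collapsing duplicate rows. It assumes rows are already split into 10 fields, that the Parent column is the 8th, and that multiple parents are pipe-separated; `merge_duplicate_rows` is hypothetical and simply keeps the symbol and other fields from the first row seen.

```python
PARENT_INDEX = 7  # assumed 0-based position of the Parent column


def merge_duplicate_rows(rows):
    """Collapse rows sharing (DB, object ID) into one row with merged parents.

    `rows` is an iterable of 10-field lists. Parents are assumed to be
    pipe-separated; the first row's other fields win (a simplification).
    """
    merged = {}
    for fields in rows:
        key = (fields[0], fields[1])
        if key not in merged:
            merged[key] = list(fields)  # copy so the input is not mutated
            continue
        parents = [p for p in merged[key][PARENT_INDEX].split("|") if p]
        for parent in fields[PARENT_INDEX].split("|"):
            if parent and parent not in parents:
                parents.append(parent)
        merged[key][PARENT_INDEX] = "|".join(parents)
    return list(merged.values())
```

Which symbol to keep when the duplicates disagree is a curation decision this sketch does not attempt to make.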
Can this be closed? |
I have no memory of this. Closing. |
File:
ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans/PRJNA13758/annotation/gene_product_info/c_elegans.PRJNA13758.current.gene_product_info.gpi.gz
Some of the description lines include newlines, causing a single line to be split over two lines, breaking parsing. For example:
cc @rankishore
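For the embedded-newline problem described in this report, one producer-side approach is to sanitize each field before writing the row. This is only a sketch under that assumption; `sanitize_field` and `format_row` are hypothetical names, not functions from the WormBase build.

```python
import re

# Any run of tabs, carriage returns or newlines inside a field
# would break a tab-separated, one-record-per-line format.
_BREAKING_WHITESPACE = re.compile(r"[\t\r\n]+")


def sanitize_field(value: str) -> str:
    """Replace embedded tabs/newlines with a single space and trim the ends."""
    return _BREAKING_WHITESPACE.sub(" ", value).strip()


def format_row(fields) -> str:
    """Join sanitized fields into one tab-separated line ending in a newline."""
    return "\t".join(sanitize_field(f) for f in fields) + "\n"
```

Sanitizing at write time avoids having to guess, on the consumer side, which short lines are fragments of a split record.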