WormBase GPI is splitting some lines causing neo pipeline to break #595

Closed
cmungall opened this issue Apr 9, 2018 · 24 comments

@cmungall
Member

cmungall commented Apr 9, 2018

File:
ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans/PRJNA13758/annotation/gene_product_info/c_elegans.PRJNA13758.current.gene_product_info.gpi.gz

Some of the description lines include newlines, causing a single line to be split over two lines, breaking parsing. For example:

$ gzip -dc mirror/c_elegans.PRJNA13758.current.gene_product_info.gpi.gz | grep -n -B1 CELE_C33A11 
24676-WB        WBGene00007877  nfki-1  Nuclear Factor of Kappa light polypeptide gene enhancer in b(B)-cells 
24677:Inhibitor, delta and zeta related nbid-1|CELE_C33A11.1    protein_coding_gene     taxon:6239              UniProtKB:G5EDE9
--
24678-WB        C33A11.1        nfki-1  Nuclear Factor of Kappa light polypeptide gene enhancer in b(B)-cells 
24679:Inhibitor, delta and zeta related nbid-1|CELE_C33A11.1    transcript      taxon:6239      WB:WBGene00007877       
--
24680-WB        WP:CE24824      NFKI-1  Nuclear Factor of Kappa light polypeptide gene enhancer in b(B)-cells 
24681:Inhibitor, delta and zeta related nbid-1|CELE_C33A11.1    protein taxon:6239      WB:C33A11.1     UniProtKB_GCRP:G5EDE9|UniProtKB:G5EDE9
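Until the file is fixed upstream, a consumer could glue the split records back together before parsing. The following is a minimal sketch, not what the neo pipeline actually does. It assumes that columns 1-7 (DB through Taxon) are present on every record, so any physical line with fewer than seven tab-separated fields is treated as incomplete and joined with the line that follows. The function name `rejoin_split_records` and the `MIN_FIELDS` threshold are hypothetical. A newline embedded in a column after the taxon would not be caught by this heuristic.

```python
MIN_FIELDS = 7  # assumption: columns 1-7 (DB .. Taxon) are always present


def rejoin_split_records(lines):
    """Yield logical GPI records, gluing physical lines split by a stray newline."""
    buffer = ""
    for raw in lines:
        line = raw.rstrip("\n")
        # Header/comment lines start with "!" and are passed through untouched.
        if not buffer and line.startswith("!"):
            yield line
            continue
        # The embedded newline is simply dropped when the pieces are joined.
        buffer = buffer + line if buffer else line
        if buffer.count("\t") + 1 >= MIN_FIELDS:
            yield buffer
            buffer = ""
    if buffer:
        # Trailing fragment that never reached MIN_FIELDS; emit it as-is.
        yield buffer
```

It could be fed the decompressed file line by line, for example from `gzip.open(path, "rt")`.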

cc @rankishore

@vanaukenk
Contributor

Okay, thanks @cmungall
I'll pass this along to the Hinxton group who generate the file with each WB release.

@vanaukenk
Contributor

This issue has been fixed for the next WB release, which should be available on our ftp site later next week.
@cmungall - shall I close this ticket?

@cmungall
Member Author

Let's close it when it percolates through - looks like it's still there

@cmungall
Member Author

cmungall commented Apr 19, 2018

Another issue is the variable number of columns. There should always be 10, even if the last one or two are null.

e.g. this one has 8:

WB       ZC247.1 ZC247.1         CELE_ZC247.1    transcript      taxon:6239      WB:WBGene00013859

9:

WB       WP:CE43614      ZC247.1         CELE_ZC247.1    protein taxon:6239      WB:ZC247.1      UniProtKB_GCRP:G5EBP5|UniProtKB:G5EBP5

7:

WBGene00271791  W03D2.15                CELE_W03D2.15   ncRNA_gene      taxon:6239 
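As a stopgap on the consumer side, short rows can be padded out to the expected width. This is a minimal sketch under the assumption stated above that every record should have exactly 10 tab-separated columns; `pad_columns` and `GPI_COLUMNS` are hypothetical names, not part of any existing tool.

```python
GPI_COLUMNS = 10  # expected column count, per the comment above


def pad_columns(line, width=GPI_COLUMNS):
    """Return the record with empty trailing columns added up to `width`."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) > width:
        # Too many columns is a different problem; do not silently truncate.
        raise ValueError(f"expected at most {width} columns, got {len(fields)}")
    return "\t".join(fields + [""] * (width - len(fields)))
```

Rows that already have 10 columns pass through unchanged, so the function can be applied to every record.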

@cmungall
Member Author

cmungall commented Apr 19, 2018

Another issue

UniProtKB_GCRP is not a prefix we have registered in db-xrefs.yaml:

WB      WP:CE10938      F53F1.4         CELE_F53F1.4    protein taxon:6239      WB:F53F1.4      UniProtKB_GCRP:Q9XVM6|UniProtKB:Q9XVM6

The xrefs should just be UniProtKB:Q9XVM6
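One way to tolerate the unregistered prefix while the file is being fixed is to filter the xref column against the set of registered prefixes. A minimal sketch follows; the `registered` default here is a one-element stand-in for illustration, not the real contents of db-xrefs.yaml, and `clean_xrefs` is a hypothetical helper.

```python
def clean_xrefs(xref_field, registered=frozenset({"UniProtKB"})):
    """Keep only pipe-separated xrefs whose prefix is registered, without repeats."""
    kept = []
    for xref in xref_field.split("|"):
        prefix, _, local_id = xref.partition(":")
        if prefix in registered and local_id and xref not in kept:
            kept.append(xref)
    return "|".join(kept)
```

Applied to the xref column of the line above, this would leave only `UniProtKB:Q9XVM6`.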

@vanaukenk
Contributor

@cmungall
We'll fix the columns issue.
Wrt the GCRP, we had thought it might be useful to indicate in the file which UniProtKB accessions corresponded to the GCRP for a given WB gene. The different prefix might not have been the best approach, but perhaps we could indicate this information some way in the properties field, i.e. column 10.
I'm not sure what the best property name and value would be, maybe something like:
UniProtKB_accession_type:GCRP
I'm open to suggestions on that part.

@cmungall
Member Author

@tonysawfordebi - any suggestions on indicating GCRP membership?

@cmungall
Member Author

Another issue I'm afraid:

Each entity ID should be present only once. The following ID is duplicated, with a different symbol on each line:

WB      WP:CE52235      C08E8.6         CELE_C08E8.6    protein taxon:6239      WB:C08E8.6      UniProtKB_GCRP:Q7YX97|UniProtKB:Q7YX97
WB      WP:CE52235      C08E8.9         CELE_C08E8.9    protein taxon:6239      WB:C08E8.9      UniProtKB_GCRP:Q7YX97|UniProtKB:Q7YX97
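A quick check for this kind of duplication can be sketched as follows. It assumes tab-separated records with the DB in column 1, the object ID in column 2 and the symbol in column 3; `find_duplicate_ids` is a hypothetical name.

```python
from collections import defaultdict


def find_duplicate_ids(lines):
    """Map each (DB, ID) seen on more than one record to the symbols it appeared with."""
    symbols_by_id = defaultdict(list)
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # too short to carry an ID and a symbol
        symbols_by_id[(fields[0], fields[1])].append(fields[2])
    return {key: syms for key, syms in symbols_by_id.items() if len(syms) > 1}
```

For the two lines above it would report `("WB", "WP:CE52235")` with the symbols `C08E8.6` and `C08E8.9`.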

cmungall added a commit to geneontology/neo that referenced this issue Apr 21, 2018
@tonysawfordebi
Contributor

@cmungall - in the gpi file that we generate for indexing protein metadata in QuickGO, we have a property - reference_proteome - in the properties column to indicate whether the protein is part of the reference proteome or not (actually, the value of the property is the internal identifier of the proteome, rather than a simple boolean flag). If the protein is part of the reference proteome for the species, we also have another property - is_isoform - that indicates whether the protein is an isoform or the canonical form. Another property that we set is db_subset, which indicates whether the entry is Swiss-Prot or TrEMBL.

@vanaukenk
Contributor

@cmungall - the duplicate WP:CE52235 protein ids are actually cases where two genes encode the same protein sequence. We have other cases like this, e.g. histones. For annotation, depending on the data being annotated, the curator could either select the unique gene ID for annotation or might need to select the protein if the specific genetic locus was not known.

@tonysawfordebi - yes, I had remembered these existing properties over the weekend :-). Here are the gpi properties (and values that I'm aware of) relating to sequences, including the ones you mention above:

 db_subset=TrEMBL or Swiss-Prot
 uniprot_proteome=UP000001940 (C. elegans, for example)
 is_isoform=?
 reference_proteome=?

Wrt the gpi files submitted by MODs, we (WB) thought it might be useful to indicate which of the UniProtKB accessions we reference were part of the GCRP. Looking at these property tags I'm not sure which one we should use and what makes most sense for the value. Would something like this work for the MOD files:

uniprot_gcrp=YES

Also, for properties like db_subset and reference_proteome, is it implicit that these properties refer to the source in column 1 or should we make the property names more explicit wrt the database?

Will need to update: https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-1_2.md

@tonysawfordebi
Contributor

@vanaukenk - I'd forgotten about the gpi file that we generate for WB (and FB and dicty and SGD!)

Unlike the one that we generate for QuickGO indexing purposes, they don't include the reference_proteome and is_isoform properties, but they could if you feel that information would be useful. And yes, if we do include such properties in these files then we should probably do it by having a property called uniprot_gcrp that takes the values 'canonical' or 'isoform', and is omitted if the protein is not in the GCRP. Or something like that.
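To make the proposal concrete, here is a minimal sketch of how a consumer might read such a property from column 10. It assumes the properties column is a pipe-separated list of `name=value` pairs, as in the examples earlier in the thread. Note that `uniprot_gcrp` is only the property name floated in this discussion, not an agreed part of the format, and `parse_properties` and `gcrp_status` are hypothetical helpers.

```python
def parse_properties(field):
    """Parse 'a=1|b=2' into {'a': ['1'], 'b': ['2']}; a name may repeat."""
    props = {}
    for item in field.split("|"):
        key, sep, value = item.partition("=")
        if sep:  # ignore empty or malformed items that have no '='
            props.setdefault(key.strip(), []).append(value.strip())
    return props


def gcrp_status(field):
    """Return 'canonical' or 'isoform' if the proposed property is set, else None."""
    values = parse_properties(field).get("uniprot_gcrp", [])
    return values[0] if values else None
```

Omitting the property for proteins outside the GCRP, as suggested above, means a consumer only needs to test for `None`.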

@cmungall
Member Author

> the duplicate WP:CE52235 protein ids are actually cases where two genes encode the same protein sequence. We have other cases like this, e.g. histones. For annotation, depending on the data being annotated, the curator could either select the unique gene ID for annotation or might need to select the protein if the specific genetic locus was not known.

You could specify multiple gene parents, as the parent field allows a cardinality greater than 1.

Aside from this ticket, I'm wondering what our annotation policy is for histone genes and other analogous cases. I assume we just duplicate annotations to the identical genes?

btw, the db-xrefs yaml file seems to not have a way of resolving protein entries, and I can't find this in wormbase https://wormbase.org/search/protein/CE52235

@khowe

khowe commented Apr 26, 2018

I will make sure all these issues get fixed.

For the resolution of protein entries, the correct local id is actually WP:CE52235 (https://wormbase.org/search/protein/WP:CE52235). These additional prefixes are an anachronism and confuse things when forming CURIEs (i.e. should the CURIE be WB:WP:CE52235? Or is "WP" a resource?).

In the next release of WB (which we will start preparing in a few weeks), we will drop these prefixes. The local id for the above will then be simply CE52235.

As for the global_id / CURIE, that depends on how we choose to solve the bigger picture of making sure that all front-line ids in WormBase resolve. One way of doing this (proposed by @cmungall ) is for us to write our own resolver that will recognise all of our local ids and resolve them to the correct page. This would allow us to make all of our CURIEs have the form "WB:XXXXX". For a small number of specific data types though, this will not be possible (e.g. "JC8.10a" is an identifier for distinct CDS and Transcript objects in WB).
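For illustration only, the prefix-dropping described above could look like the sketch below. It assumes the eventual CURIE form really is "WB:" plus the bare local id, which the comment presents as one option rather than a decision. `to_wb_curie` and `LEGACY_PREFIXES` are hypothetical names, and the sketch does nothing about ids such as "JC8.10a" that are ambiguous between data types.

```python
LEGACY_PREFIXES = ("WP:",)  # assumption: only the legacy prefix named above


def to_wb_curie(local_id):
    """Strip a legacy prefix, if present, and return a 'WB:<local id>' CURIE."""
    for prefix in LEGACY_PREFIXES:
        if local_id.startswith(prefix):
            local_id = local_id[len(prefix):]
            break
    return f"WB:{local_id}"
```

With this, `WP:CE52235` would map to `WB:CE52235` rather than `WB:WP:CE52235`.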

@khowe

khowe commented Apr 26, 2018

@cmungall @tonysawfordebi Regarding the duplication issue, the GPI format spec states that the Parent field is cardinality "0 or 1".

Also, we are trying to represent the central dogma using the Parent column, GFF3-style. That is, for a protein line, we are populating the Parent column with id of the transcript from which it is translated. Reading the spec though, it seems that this is not really what the Parent field was intended to represent. It seems very UniProt-centric (perhaps unsurprisingly).

@vanaukenk
Contributor

Thanks @khowe

@cmungall
Wrt annotation, in WB at least, we typically associate GO annotations with WBGenes and many of our experiments are genetically based, so that works fairly well. However, there may certainly be cases where an experiment demonstrates something about a protein sequence that is shared amongst different genes. In that case, we probably would not annotate anything to a WBGene ID because we couldn't be certain that the annotation would be correct for all of the genes.

The way our WB protein IDs work right now, though, if we ever needed to specifically indicate, for example, the histone protein encoded by the his-2 locus, I don't believe we could do it since that histone protein sequence is shared amongst 15 different loci. In practice, this hasn't happened that much for WB GO curation, but maybe it's worth thinking about if/how that could be handled in the future. @khowe what are your thoughts?

@cmungall
Member Author

Which GPI docs are you referring to? I regard the formal spec in markdown as canonical: https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-1_2.md

Multiple parents are allowed. There is shockingly little documentation on the semantics of this field, but the intent was analogous to GFF3/Chado.

Somehow the docs also spread to the wiki and Drupal, and these may be out of sync. We have not done a good job of coordinating this.

@khowe

khowe commented Apr 26, 2018

@cmungall
Member Author

Thanks. We need to unify these

@vanaukenk
Contributor

@cmungall
Unfortunately, I'm not sure if any groups have referred to the md version of the gpad/gpi documentation as the official documentation.
We really do need to sort this all out before onboarding more groups.
Let me know if/how I can help.

@ukemi
Contributor

ukemi commented Apr 26, 2018

When I wrote our requirements doc for the GPI file, I used this: http://www.geneontology.org/page/gene-product-information-gpi-format
I'm not even sure the md file was available at that point. At any rate, we assumed the one on the GO web site was the official specs.

@ukemi
Contributor

ukemi commented Apr 26, 2018

But it appears that we also updated this page:

http://wiki.geneontology.org/index.php/Proposed_GPI1.2_format

@khowe

khowe commented Apr 26, 2018

Okay, most of these issues have already been fixed in the latest WormBase GPI:

ftp://ftp.ebi.ac.uk/pub/databases/wormbase/releases/WS265/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WS265.gene_product_info.gpi.gz

This will be propagated to ftp.wormbase.org with a release-neutral URL in the next few weeks.

The duplication issue is still present. I hear @cmungall 's assertion that the github version of the spec is authoritative, and will make the change to have one line for each protein, with multiple Parents where appropriate. However, I am somewhat confused by multiple (different) versions of the spec floating around that all call themselves version "1.2".
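The planned change, one line per protein with multiple Parents, could be sketched as below. This is not the WormBase build code. It assumes each record is already split into a list of at least 8 fields, with synonyms in column 5 and parents in column 8, both pipe-separated. Folding the differing symbols of later lines into the synonyms column is a choice made for this sketch only. `merge_duplicates` and `union_pipe` are hypothetical names.

```python
SYMBOL, SYNONYMS, PARENT = 2, 4, 7  # zero-based indexes of columns 3, 5 and 8


def union_pipe(*fields):
    """Merge pipe-separated fields, keeping first-seen order and dropping repeats."""
    items = []
    for field in fields:
        for value in field.split("|"):
            if value and value not in items:
                items.append(value)
    return "|".join(items)


def merge_duplicates(rows):
    """Collapse rows sharing (DB, ID) into one row with all parents listed."""
    merged = {}
    for row in rows:
        key = (row[0], row[1])
        if key not in merged:
            merged[key] = list(row)  # first occurrence supplies the base row
            continue
        kept = merged[key]
        kept[PARENT] = union_pipe(kept[PARENT], row[PARENT])
        # Sketch-only choice: a differing symbol is kept as a synonym.
        extra_symbol = row[SYMBOL] if row[SYMBOL] != kept[SYMBOL] else ""
        kept[SYNONYMS] = union_pipe(kept[SYNONYMS], extra_symbol, row[SYNONYMS])
    return list(merged.values())
```

For the WP:CE52235 pair reported earlier, this would yield a single row whose Parent column lists both `WB:C08E8.6` and `WB:C08E8.9`.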

@pgaudet
Contributor

pgaudet commented Feb 12, 2020

Can this be closed?

@kltm
Member

kltm commented Feb 12, 2020

I have no memory of this. Closing.

@kltm kltm closed this as completed Feb 12, 2020