WormBase GPI is splitting some lines causing neo pipeline to break #595

cmungall · 2018-04-09T23:03:13Z

File:
ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans/PRJNA13758/annotation/gene_product_info/c_elegans.PRJNA13758.current.gene_product_info.gpi.gz

Some of the description fields contain embedded newlines, so a single record is split over two lines, which breaks parsing. For example:

$ gzip -dc mirror/c_elegans.PRJNA13758.current.gene_product_info.gpi.gz | grep -n -B1 CELE_C33A11 
24676-WB        WBGene00007877  nfki-1  Nuclear Factor of Kappa light polypeptide gene enhancer in b(B)-cells 
24677:Inhibitor, delta and zeta related nbid-1|CELE_C33A11.1    protein_coding_gene     taxon:6239              UniProtKB:G5EDE9
--
24678-WB        C33A11.1        nfki-1  Nuclear Factor of Kappa light polypeptide gene enhancer in b(B)-cells 
24679:Inhibitor, delta and zeta related nbid-1|CELE_C33A11.1    transcript      taxon:6239      WB:WBGene00007877       
--
24680-WB        WP:CE24824      NFKI-1  Nuclear Factor of Kappa light polypeptide gene enhancer in b(B)-cells 
24681:Inhibitor, delta and zeta related nbid-1|CELE_C33A11.1    protein taxon:6239      WB:C33A11.1     UniProtKB_GCRP:G5EDE9|UniProtKB:G5EDE9
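
A minimal sketch of how a consumer might rejoin records split this way. The function name `rejoin_split_lines` and the heuristic are assumptions for illustration, not what the neo pipeline actually does: it treats any non-header line that does not start with the expected database prefix plus a tab as the tail of the previous record.

```python
def rejoin_split_lines(lines, db_prefix="WB"):
    """Merge physical lines that look like continuations of the previous record.

    Heuristic (an assumption, not part of any spec): every data record starts
    with `db_prefix` followed by a tab, and header lines start with "!".
    Anything else is treated as the tail of the previous data record.
    """
    records = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        starts_record = line.startswith("!") or line.startswith(db_prefix + "\t")
        if starts_record or not records or records[-1].startswith("!"):
            records.append(line)
        else:
            # Rejoin with a single space so the description stays readable.
            records[-1] = records[-1].rstrip(" ") + " " + line.lstrip(" ")
    return records
```

The heuristic fails if a description continuation happens to begin with the prefix and a tab, so it is only a stopgap until the file itself is fixed.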

cc @rankishore


vanaukenk · 2018-04-10T00:22:38Z

Okay, thanks @cmungall
I'll pass this along to the Hinxton group who generate the file with each WB release.

vanaukenk · 2018-04-10T12:38:45Z

This issue has been fixed for the next WB release, which should be available on our ftp site later next week.
@cmungall - shall I close this ticket?

cmungall · 2018-04-18T23:58:37Z

Let's close it when it percolates through - looks like it's still there

cmungall · 2018-04-19T04:51:29Z

Another issue is the variable number of columns. There should always be 10, even if the last one or two are null.

e.g. this one has 8:

WB       ZC247.1 ZC247.1         CELE_ZC247.1    transcript      taxon:6239      WB:WBGene00013859

9:

WB       WP:CE43614      ZC247.1         CELE_ZC247.1    protein taxon:6239      WB:ZC247.1      UniProtKB_GCRP:G5EBP5|UniProtKB:G5EBP5

7:

WBGene00271791  W03D2.15                CELE_W03D2.15   ncRNA_gene      taxon:6239
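
As a rough sketch of the check, a parser could count tab-separated fields and pad short lines out to 10. The names `GPI_COLUMNS` and `normalize_columns` are hypothetical, and padding on the right assumes that only trailing columns were omitted, which is what the examples above suggest.

```python
GPI_COLUMNS = 10  # the column count stated above


def normalize_columns(line, expected=GPI_COLUMNS):
    """Split a GPI data line on tabs and pad missing trailing columns with "".

    Assumption: short lines are only missing columns at the end, so padding
    on the right keeps every existing value in its correct position.
    """
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) > expected:
        raise ValueError(f"expected at most {expected} columns, got {len(fields)}")
    return fields + [""] * (expected - len(fields))
```

Lines with too many fields raise an error rather than being truncated, since those usually point to a stray tab inside a value.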

cmungall · 2018-04-19T04:55:46Z

Another issue

UniProtKB_GCRP is not a prefix we have registered in db-xrefs.yaml:

WB      WP:CE10938      F53F1.4         CELE_F53F1.4    protein taxon:6239      WB:F53F1.4      UniProtKB_GCRP:Q9XVM6|UniProtKB:Q9XVM6

The xrefs should just be UniProtKB:Q9XVM6
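
A small sketch of how such prefixes could be flagged. The `registered` set here is a hand-written stand-in for whatever is loaded from db-xrefs.yaml, and the function name is hypothetical; it assumes the xref column is pipe-separated, as in the line above.

```python
def unregistered_prefixes(xref_field, registered):
    """Return the sorted prefixes in a pipe-separated xref field that are not registered.

    `registered` is any collection of known prefixes (a stand-in for the
    contents of db-xrefs.yaml, which this sketch does not read).
    """
    prefixes = {
        xref.split(":", 1)[0]
        for xref in xref_field.split("|")
        if xref  # ignore empty entries from stray pipes
    }
    return sorted(prefixes - set(registered))
```

For the line above, `unregistered_prefixes("UniProtKB_GCRP:Q9XVM6|UniProtKB:Q9XVM6", {"UniProtKB", "WB"})` returns `["UniProtKB_GCRP"]`.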

vanaukenk · 2018-04-19T13:32:20Z

@cmungall
We'll fix the columns issue.
Wrt the GCRP, we had thought it might be useful to indicate in the file which UniProtKB accessions corresponded to the GCRP for a given WB gene. The different prefix might not have been the best approach, but perhaps we could indicate this information some way in the properties field, i.e. column 10.
I'm not sure what the best property name and value would be, maybe something like:
UniProtKB_accession_type:GCRP
I'm open to suggestions on that part.

cmungall · 2018-04-21T01:08:25Z

@tonysawfordebi - any suggestions on indicating GCRP membership?

cmungall · 2018-04-21T01:10:02Z

Another issue I'm afraid:

each entity ID should only be present once. The following has a dupe with a different symbol in each:

WB      WP:CE52235      C08E8.6         CELE_C08E8.6    protein taxon:6239      WB:C08E8.6      UniProtKB_GCRP:Q7YX97|UniProtKB:Q7YX97
WB      WP:CE52235      C08E8.9         CELE_C08E8.9    protein taxon:6239      WB:C08E8.9      UniProtKB_GCRP:Q7YX97|UniProtKB:Q7YX97
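
A sketch of a check for this, assuming the entity is identified by the first two columns and the symbol is the third. The function name `duplicate_ids` is hypothetical.

```python
from collections import defaultdict


def duplicate_ids(lines):
    """Map each (db, id) pair that occurs more than once to the symbols seen for it."""
    symbols_by_id = defaultdict(list)
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # too short to carry an ID and a symbol
        symbols_by_id[(fields[0], fields[1])].append(fields[2])
    return {key: syms for key, syms in symbols_by_id.items() if len(syms) > 1}
```

Run over the two lines above, this would report `("WB", "WP:CE52235")` with the symbols `C08E8.6` and `C08E8.9`.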

tonysawfordebi · 2018-04-23T08:58:21Z

@cmungall - in the gpi file that we generate for indexing protein metadata in QuickGO, we have a property - reference_proteome - in the properties column to indicate whether the protein is part of the reference proteome or not (actually, the value of the property is the internal identifier of the proteome, rather than a simple boolean flag). If the protein is part of the reference proteome for the species, we also have another property - is_isoform - that indicates whether the protein is an isoform or the canonical form. Another property that we set is db_subset, which indicates whether the entry is Swiss-Prot or TrEMBL.

vanaukenk · 2018-04-23T13:36:28Z

@cmungall - the duplicate WP:CE52235 protein ids are actually cases where two genes encode the same protein sequence. We have other cases like this, e.g. histones. For annotation, depending on the data being annotated, the curator could either select the unique gene ID for annotation or might need to select the protein if the specific genetic locus was not known.

@tonysawfordebi - yes, I had remembered these existing properties over the weekend :-). Here are the gpi properties (and values that I'm aware of) relating to sequences, including the ones you mention above:

 db_subset=TrEMBL or Swiss-Prot
 uniprot_proteome=UP000001940 (C. elegans, for example)
 is_isoform=?
 reference_proteome=?

Wrt the gpi files submitted by MODs, we (WB) thought it might be useful to indicate which of the UniProtKB accessions we reference were part of the GCRP. Looking at these property tags I'm not sure which one we should use and what makes most sense for the value. Would something like this work for the MOD files:

uniprot_gcrp=YES

Also, for properties like db_subset and reference_proteome, is it implicit that these properties refer to the source in column 1 or should we make the property names more explicit wrt the database?

Will need to update: https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-1_2.md

tonysawfordebi · 2018-04-23T13:50:51Z

@vanaukenk - I'd forgotten about the gpi file that we generate for WB (and FB and dicty and SGD!)

Unlike the one that we generate for QuickGO indexing purposes, they don't include the reference_proteome and is_isoform properties, but they could if you feel that information would be useful. And yes, if we do include such properties in these files then we should probably do it by having a property called uniprot_gcrp that takes the values 'canonical' or 'isoform', and is omitted if the protein is not in the GCRP. Or something like that.
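
To make the proposal concrete, here is a hedged sketch of reading such a property from column 10. It assumes the properties column is a pipe-separated list of `key=value` pairs, as the examples in this thread suggest. `uniprot_gcrp` is only the name being proposed here, not a property in any published spec, and the function names are hypothetical.

```python
def parse_properties(field):
    """Parse a properties column of the assumed form 'key=value|key=value'."""
    props = {}
    for item in field.split("|"):
        if not item:
            continue  # tolerate an empty column or stray pipes
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"property without '=': {item!r}")
        # A key may repeat, so collect values in a list.
        props.setdefault(key.strip(), []).append(value.strip())
    return props


def gcrp_status(field):
    """Return 'canonical' or 'isoform' if the proposed uniprot_gcrp property is set, else None."""
    values = parse_properties(field).get("uniprot_gcrp", [])
    return values[0] if values else None
```

With this shape, a protein outside the GCRP simply has no `uniprot_gcrp` entry, so `gcrp_status("db_subset=TrEMBL")` returns `None`.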

cmungall · 2018-04-23T19:02:40Z

> the duplicate WP:CE52235 protein ids are actually cases where two genes encode the same protein sequence. We have other cases like this, e.g. histones. For annotation, depending on the data being annotated, the curator could either select the unique gene ID for annotation or might need to select the protein if the specific genetic locus was not known.

You could specify multiple gene parents, as the parent field has cardinality>1.

Aside from this ticket, I'm wondering what our annotation policy is for histone genes and other analogous cases. I assume we just duplicate annotations to the identical genes?

btw, the db-xrefs yaml file seems to not have a way of resolving protein entries, and I can't find this in wormbase https://wormbase.org/search/protein/CE52235

khowe · 2018-04-26T16:04:26Z

I will make sure all these issues get fixed.

For the resolution of protein entries, the correct local id is actually WP:CE52235 (https://wormbase.org/search/protein/WP:CE52235). These additional prefixes are an anachronism and confuse things when forming CURIEs (i.e. should the CURIE be WB:WP:CE52235? Or is "WP" a resource?).

In the next release of WB (which we will start preparing in a few weeks), we will drop these prefixes. The local id for the above will then be simply CE52235.

As for the global_id / CURIE, that depends on how we choose to solve the bigger picture of making sure that all front-line ids in WormBase resolve. One way of doing this (proposed by @cmungall ) is for us to write our own resolver that will recognise all of our local ids and resolve them to the correct page. This would allow us to make all of our CURIEs have the form "WB:XXXXX". For a small number of specific data types though, this will not be possible (e.g. "JC8.10a" is an identifier for distinct CDS and Transcript objects in WB).
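
As an illustration only, under the resolver approach described above a consumer could normalise ids along these lines. The function names and the `LEGACY_PREFIXES` tuple are hypothetical, and the "WB:XXXXX" form is the proposal under discussion, not a settled convention.

```python
LEGACY_PREFIXES = ("WP:",)  # prefixes to be dropped from local ids; only WP is mentioned above


def strip_legacy_prefix(local_id):
    """Remove a legacy prefix such as 'WP:' from a local id, if present."""
    for prefix in LEGACY_PREFIXES:
        if local_id.startswith(prefix):
            return local_id[len(prefix):]
    return local_id


def to_curie(local_id, namespace="WB"):
    """Form a CURIE of the proposed shape 'WB:XXXXX' from a local id."""
    return f"{namespace}:{strip_legacy_prefix(local_id)}"
```

This does nothing for ids such as "JC8.10a" that name more than one object type; those would still need separate handling.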

khowe · 2018-04-26T16:23:49Z

@cmungall @tonysawfordebi Regarding the duplication issue, the GPI format spec states that the Parent field is cardinality "0 or 1".

Also, we are trying to represent the central dogma using the Parent column, GFF3-style. That is, for a protein line, we are populating the Parent column with id of the transcript from which it is translated. Reading the spec though, it seems that this is not really what the Parent field was intended to represent. It seems very UniProt-centric (perhaps unsurprisingly).

vanaukenk · 2018-04-26T17:00:24Z

Thanks @khowe

@cmungall
Wrt annotation, in WB at least, we typically associate GO annotations with WBGenes and many of our experiments are genetically based, so that works fairly well. However, there may certainly be cases where an experiment demonstrates something about a protein sequence that is shared amongst different genes. In that case, we probably would not annotate anything to a WBGene ID because we couldn't be certain that the annotation would be correct for all of the genes.

The way our WB protein IDs work right now, though, if we ever needed to specifically indicate, for example, the histone protein encoded by the his-2 locus, I don't believe we could do it since that histone protein sequence is shared amongst 15 different loci. In practice, this hasn't happened that much for WB GO curation, but maybe it's worth thinking about if/how that could be handled in the future. @khowe what are your thoughts?

cmungall · 2018-04-26T17:21:47Z

Which GPI docs are you referring to? I regard the formal spec in markdown as canonical: https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-1_2.md

Multiple parents are allowed. There is shockingly little documentation on the semantics of this field, but the intent was analogous to GFF3/Chado.

Somehow the docs also spread to the wiki and Drupal; these may be out of sync. We have not done a good job of coordinating this.

khowe · 2018-04-26T17:38:11Z

@cmungall this one: http://www.geneontology.org/page/gene-product-information-gpi-format

cmungall · 2018-04-26T18:01:43Z

Thanks. We need to unify these

vanaukenk · 2018-04-26T18:03:02Z

@cmungall
Unfortunately, I'm not sure if any groups have referred to the md version of the gpad/gpi documentation as the official documentation.
We really do need to sort this all out before onboarding more groups.
Let me know if/how I can help.

ukemi · 2018-04-26T18:41:15Z

When I wrote our requirements doc for the GPI file, I used this: http://www.geneontology.org/page/gene-product-information-gpi-format
I'm not even sure the md file was available at that point. At any rate, we assumed the one on the GO web site was the official specs.

ukemi · 2018-04-26T18:46:50Z

But it appears that we also updated this page:

http://wiki.geneontology.org/index.php/Proposed_GPI1.2_format

khowe · 2018-04-26T22:23:58Z

Okay, most of these issues have already been fixed in the latest WormBase GPI:

ftp://ftp.ebi.ac.uk/pub/databases/wormbase/releases/WS265/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WS265.gene_product_info.gpi.gz

This will be propagated to ftp.wormbase.org with a release-neutral URL in the next few weeks.

The duplication issue is still present. I hear @cmungall 's assertion that the github version of the spec is authoritative, and will make the change to have one line for each protein, with multiple Parents where appropriate. However, I am somewhat confused by multiple (different) versions of the spec floating around that all call themselves version "1.2".
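
A rough sketch of that change, collapsing rows that share an id into one row with multiple Parents. Several things here are assumptions: the 0-based column positions follow my reading of the 10-column layout used in the examples in this thread, multi-valued fields are joined with "|", and folding the extra symbols into the synonym column is just one possible choice. The function names are hypothetical.

```python
SYMBOL, SYNONYM, PARENT, XREF = 2, 4, 7, 8  # assumed 0-based column positions


def union_pipe(a, b):
    """Union two pipe-separated value lists, keeping first-seen order."""
    seen = []
    for item in (a + "|" + b).split("|"):
        if item and item not in seen:
            seen.append(item)
    return "|".join(seen)


def merge_duplicates(rows):
    """Collapse rows (lists of 10 fields) sharing (db, id) into a single row.

    The first row's symbol is kept; symbols from later rows are added to the
    synonym column so that no label is lost (a design choice, not a rule).
    """
    merged = {}
    for row in rows:
        key = (row[0], row[1])
        if key not in merged:
            merged[key] = list(row)  # copy so the input is not modified
            continue
        kept = merged[key]
        for idx in (SYNONYM, PARENT, XREF):
            kept[idx] = union_pipe(kept[idx], row[idx])
        if row[SYMBOL] != kept[SYMBOL]:
            kept[SYNONYM] = union_pipe(kept[SYNONYM], row[SYMBOL])
    return list(merged.values())
```

For the two WP:CE52235 rows quoted earlier, this would yield a single row whose Parent column is `WB:C08E8.6|WB:C08E8.9`.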

pgaudet · 2020-02-12T17:34:08Z

Can this be closed ?

kltm · 2020-02-12T17:35:33Z

I have no memory of this. Closing.

cmungall assigned vanaukenk Apr 9, 2018

cmungall mentioned this issue Apr 20, 2018

skip suspicious lines geneontology/neo#27

Merged

cmungall added a commit to geneontology/neo that referenced this issue Apr 21, 2018

temp fix for geneontology/go-site#595

7ab68c9

vanaukenk mentioned this issue Aug 16, 2018

Pull noctua entities from GPIs, allowing contributors greater control over their entities geneontology/neo#10

Open

vanaukenk mentioned this issue Sep 5, 2018

Loss of entity (label) in autocomplete / NEO / display geneontology/noctua#580

Closed

cmungall mentioned this issue Jan 3, 2019

Some genes not available in autocomplete, symbol and id issues geneontology/neo#38

Closed

kltm mentioned this issue Jan 3, 2019

Revert build-neo-makefile.py geneontology/neo#39

Merged

kltm closed this as completed Feb 12, 2020
