Add available coronavirus data to the pipeline #1431
Comments
@kltm do you want a GPI file with only the one viral species? |
We previously discussed just the SARS-CoV-2 genome, but we could extend to the other coronavirus genomes. But let's do the SARS-CoV-2 genome first |
ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_reviewed_sars-cov-2.gpi |
Great! @thomaspd indicated we may need the isoforms also - is this something that will require further upstream protein sequence curation in uniprot? |
@cmungall I don't think we have this data in uniprot private release yet. |
ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpa |
@cmungall Your neo with geneontology/neo#55 changes are failing with:
|
ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gaf |
@cmungall I'm thinking now that the issues are around here:
and
Given this, without starting the rewrite of the
|
@kltm - recall that my changes to the Makefile hardcoded the URLs for the virus |
from @lpalbou on gitter (better to report here than on gitter, where it will get lost). FAO @alexsign looking at the GPAD of covid: ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpa
But one is specified with a PRO and the other is not. Is it legit? Does it mean the second one should be treated as a different isoform, maybe? My answer: no, this looks like a bug. |
@cmungall Yes, but the generation of the neo-* targets seems to be entirely done through the datasets.json metadata ball, which seems to be the origin of the error when doing the full build. |
@cmungall it's a legit manual annotation done by Patrick.Masson@isb-sib.ch at SIB. |
Hi, that's not a bug and you might see several of these discrepancies. That's the problem we have for viral polyproteins. For example the coronavirus: https://www.uniprot.org/uniprot/P0C6X7.txt |
It would be important to be able to capture that. |
Important: Everyone reading this thread should be aware that UniProt PRO IDs have nothing to do with the PRO ontology.

Thus far we have managed the different protein end products from a GCRP protein using identifiers that use the dash nomenclature, e.g. P0DTC2-n. I would prefer to use these here, but I don't know the UniProt rules for when these get created. Even though these are the products of post-translational cleavage, from a user point of view they are still distinct gene products made by the same gene, so I don't see why they should be treated differently.

We can extend this to use the UniProt chain/PRO IDs, but there are challenges.

First, these don't seem to be resolvable. Using our existing registered prefixes, the prefixed ID UniProtKB:P0DTC2:PRO_0000449647 would be resolved as https://www.uniprot.org/uniprot/P0DTC2:PRO_0000449647 but this is a 404.

I would prefer to avoid double-barreled prefixes. Should the prefixed ID not be UniProtKB:PRO_0000449647? But this also fails to resolve: https://www.uniprot.org/uniprot/PRO_0000449647

There is also the problem that, because some organisms use the PRO ontology, even if we use well-behaved prefixes, we will cause massive confusion in our community by using the UniProt PRO IDs and the PRO ontology at the same time. |
Also remember every distinct entry in the gpa should be in the gpi, I don't see the PRO ID in the gpi... |
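The consistency rule above (every object in the GPAD must also appear in the GPI) can be checked mechanically. Below is a minimal sketch, assuming both files are tab-separated with the DB in column 1 and the DB_Object_ID in column 2, and that header lines start with "!". The function names are hypothetical and not part of any GO pipeline.

```python
from typing import Iterable, List, Set


def object_ids(lines: Iterable[str]) -> Set[str]:
    """Collect 'DB:DB_Object_ID' keys from tab-separated GPAD or GPI lines.

    Assumes column 1 is the DB and column 2 is the object ID;
    lines starting with '!' are treated as headers and skipped.
    """
    ids: Set[str] = set()
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 2:
            continue  # malformed row; ignore rather than guess
        ids.add(f"{cols[0]}:{cols[1]}")
    return ids


def missing_from_gpi(gpad_lines: Iterable[str], gpi_lines: Iterable[str]) -> List[str]:
    """Return annotated objects (from the GPAD) that have no row in the GPI."""
    return sorted(object_ids(gpad_lines) - object_ids(gpi_lines))
```

In practice one would pass open file handles for the downloaded .gpa and .gpi files; any ID returned (for example a parent:PRO_... chain ID) is an annotated entity that the GPI does not describe.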
I agree that it would be ideal if there were a way to resolve the UniProt PRO IDs, and we can ask the UniProt team (Maria, maybe?) how that might be done. But even if it's not done yet, it would be very helpful to have a GPI file that has the UniProt PRO ID, and lists the parent ID of the UniProt polyprotein record. From what Patrick had told me, the PRO ID (the chain within the polyprotein) has a name associated with it, so we could also get that in column 4 of the GPI file: @alexsign, would adding a line to the GPI file for each chain be doable? |
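To make the request concrete, here is a rough sketch of what one GPI line per chain could look like. It assumes a 10-column GPI layout with the chain name in column 4 and the parent polyprotein record in column 8, uses "protein" as the object type, and joins parent accession and PRO ID with a configurable delimiter. The helper name, the delimiter default, and the choice of symbol are illustrative assumptions, not what GOA actually emits.

```python
from typing import Optional


def chain_gpi_row(
    parent_acc: str,
    pro_id: str,
    name: str,
    taxon_id: int,
    symbol: Optional[str] = None,
    db: str = "UniProtKB",
    delimiter: str = ":",
) -> str:
    """Build one tab-separated GPI row for a polyprotein chain (hypothetical helper)."""
    cols = [
        db,                                    # 1: DB
        f"{parent_acc}{delimiter}{pro_id}",    # 2: object ID (parent + chain)
        symbol or name,                        # 3: symbol (falls back to the name)
        name,                                  # 4: chain name
        "",                                    # 5: synonyms (left empty here)
        "protein",                             # 6: object type (assumed)
        f"taxon:{taxon_id}",                   # 7: taxon
        f"{db}:{parent_acc}",                  # 8: parent polyprotein record
        "",                                    # 9: xrefs
        "",                                    # 10: properties
    ]
    return "\t".join(cols)


# 2697049 is the NCBI Taxonomy ID for SARS-CoV-2.
print(chain_gpi_row("P0DTC2", "PRO_0000449648", "Spike protein S2", 2697049))
```

Keeping the parent in its own column means the chain ID format can change later without losing the link back to the polyprotein entry.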
My preference would be to avoid the MGI-style double-barreled delimiter, and have the global ID be simply UniProtKB:PRO_nnnnn (i.e. col2 of the GAF is |
@alexsign any comments on these suggestions? |
@cmungall Now about the links on the UniProt website. Please keep in mind this is PRE-release data, so even simple link like If links are an absolute must, you have to strip the :PRO... IDs from the link (this is how it's done in QuickGO), or replace the ":" with the "#" symbol. For the time being you have to use https://covid-19.uniprot.org/uniprotkb/ before the actual ID instead of https://www.uniprot.org/uniprot/. Making links like Unfortunately, your preferred link We can discuss changes we can make from both sides to make this data public ASAP without changing too much in our pipelines. |
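The two linking workarounds described in this comment (strip the :PRO... part, or replace ":" with "#") can be sketched as a small helper. This is only an illustration of the rule as stated: the function name and the keep_chain flag are hypothetical, and the pre-release base URL is the one quoted in the comment, which may no longer be live.

```python
PRERELEASE_BASE = "https://covid-19.uniprot.org/uniprotkb/"


def uniprot_link(object_id: str, base: str = PRERELEASE_BASE, keep_chain: bool = False) -> str:
    """Turn an ID such as 'P0DTC2:PRO_0000449647' into a link.

    By default the ':PRO_...' suffix is stripped; with keep_chain=True
    the ':' is replaced by '#' so the chain ID becomes a URL fragment.
    IDs without a ':' are linked as-is.
    """
    accession, sep, chain = object_id.partition(":")
    if sep and keep_chain:
        return f"{base}{accession}#{chain}"
    return f"{base}{accession}"


print(uniprot_link("P0DTC2:PRO_0000449647"))
print(uniprot_link("P0DTC2:PRO_0000449647", keep_chain=True))
```

Passing base="https://www.uniprot.org/uniprot/" gives the public-site form of the same link.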
It was not missed, I have not yet run another pre-release. I have a TC on Wed with Alex and Chris and was hoping to be enlightened about how GOA determines the DB_Object_Name. I found this page http://geneontology.org/docs/go-annotation-file-gaf-format-2.1/#db-object-symbol-column-3, which seems to say that the gene/ORF name is taken if available and gene product name if there are several. So I don't get why the ORF name was not used for P0DTC6, but in any case Patrick has now assigned a "Short" protein name for the "Recommended" protein name of all SARS2 entries that will go into the next pre-release. Note that the old confusing duplicated names will stay as "Alternative" protein names, because this is what they unfortunately had been called initially. |
Update: there are simply too many problems with the GPI coming out of the current pipeline. @realmarcin and I are making a copy of the GPI and re-curating it. You can find our copy here: https://github.com/Knowledge-Graph-Hub/kg-covid-19/blob/master/curated/ORFs/uniprot_sars-cov-2.gpi We are working with @pmasson55 and @alexsign to feed these changes back to UniProt and get them into the GOA GPI, but we cannot wait for this any longer. |
Note that we are going with @pmasson55 suggestion to use P0DTD1 as the arbitrary parent to generate reference nsp IDs:
This is consistent with what @bmeldal and IntAct are doing. However, PDB are using P0DTC1 as the xref for nsps such as nsp3: https://www.rcsb.org/structure/6W9C Should they instead use P0DTD1-PRO_0000449621? It may well be the case that the expression system they used made use of the shorter pp... but it seems the structure is the same regardless of which pp? |
Thank you, @cmungall I will ask John Beresford to get in touch with you regarding the PDB choices. We had conversations about it before. |
I thought we were loading https://github.com/Knowledge-Graph-Hub/kg-covid-19/blob/master/curated/ORFs/uniprot_sars-cov-2.gpi as the source for neo, but it appears not |
Hi Chris, Concerning the spike protein, there are no short names in SwissProt: Do you think it might be the reason why the display is showing the numbers instead of S2. |
yep, we're using my manually curated GPI for now
…On Wed, Apr 28, 2021 at 4:42 AM pmasson55 ***@***.***> wrote:
Hi Chris,
Concerning the spike protein, there are no short names in SwissProt:
DE RecName: Full=Spike glycoprotein {ECO:0000255|HAMAP-Rule:MF_04099};
DE Contains:
DE RecName: Full=Spike protein S1 {ECO:0000255|HAMAP-Rule:MF_04099};
DE Contains:
DE RecName: Full=Spike protein S2 {ECO:0000255|HAMAP-Rule:MF_04099};
DE Contains:
DE RecName: Full=Spike protein S2' {ECO:0000255|HAMAP-Rule:MF_04099};
Do you think it might be the reason why the display is showing the numbers
instead of S2.
The GPI shows
UniProtKB P0DTC2-PRO_0000449648 Spike protein S2 Spike protein S2
Maybe it cannot find the shortname and thus replace it by aminoacid
numbers?
If it's the case, I can fix that in SwissProt.
|
Due to id collisions, we'll be temporarily switching over to using uniprot_reviewed_virus_bacteria. See geneontology/neo#80 |
@kltm and all, I know I gave the go-ahead to use the uniprot reviewed file, but when I look again, I see we should definitely NOT. Sorry for the bad advice. Let's recap where we are. There are 3 alternate files:
1. goa/uniprot_reviewed_sars-cov-2.gpi

We should definitely NOT use 1; it has many, many problems. Many of these are things that were previously mentioned in this ticket, but have been reverted.

First of all, this file is missing all the polyprotein chains. There are only top-level UniProt entries in here, so, for example, we are missing all of the nsps (nsp1-16). Recall that the nsps do not get bona fide UniProt entries; they get PRO IDs. So, for example, the reviewed file misses nsp3, the papain-like protease, which is of huge importance.

Second, this file misses key synonyms and actually has incorrect synonyms. For example:

The correct name for this is ORF6. Also, it will be very easy for a curator to mistake this for nsp6, which is a completely different protein (not represented in the reviewed file). Note this was all noted back in early 2020, and I thought we had fixed this; see @rachhuntley's comments:

I strongly recommend that the URL for uniprot_reviewed_sars-cov-2.gpi is removed, as the file is not useful in itself because it misses so many important proteins.

2. goa/uniprot_sars-cov-2.gpi

This is better in that it does not remove the key proteins. However, it has the duplicate problem we have suffered from for 2 years now. For example, for nsp3:

Recall that there is a loose convention that PRO_0000449621 is "canonical", as discussed in this ticket, but there is nothing in the file to suggest this is the case.

3. kg-covid-19/master/curated/ORFs/uniprot_sars-cov-2.gpi

I defer to @pmasson55, but I still think this curated file is the best representation of the SARS2 proteome and we should continue to use it. It correctly has a single entry for nsps like nsp3:

It also correctly names proteins like ORF6. |
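The duplicate problem and the "loose convention" described above can be expressed as a small filter: when the same chain appears under more than one polyprotein, keep the one whose parent is P0DTD1. This is a sketch only. It assumes IDs of the form parent-PRO_nnn and that duplicate rows share a symbol, neither of which is guaranteed by the GOA file, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

CANONICAL_PARENT = "P0DTD1"  # arbitrary reference parent agreed in this thread


def pick_canonical(rows: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each symbol to a single object ID.

    rows are (object_id, symbol) pairs, e.g. ('P0DTD1-PRO_0000449621', 'nsp3').
    If several IDs share a symbol, the one under CANONICAL_PARENT wins;
    otherwise the first ID seen for that symbol is kept.
    """
    by_symbol: Dict[str, List[str]] = defaultdict(list)
    for object_id, symbol in rows:
        by_symbol[symbol].append(object_id)

    chosen: Dict[str, str] = {}
    for symbol, ids in by_symbol.items():
        preferred = [i for i in ids if i.startswith(CANONICAL_PARENT + "-")]
        chosen[symbol] = (preferred or ids)[0]
    return chosen
```

A filter like this only hides the ambiguity downstream; the thread's point stands that the source file itself carries nothing that marks which entry is canonical.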
Okay, when we revisit this, we should be using |
Is this done? |
No progress noted and a bunch of TODO boxes at the top, so I don't think so. |
Everything that was requested by Swiss-Prot/ViralZone could be done. |
Add available SARS-CoV-2 data to the pipeline
Tagging @pgaudet @cmungall
Questions: