Add available coronavirus data to the pipeline #1431

Closed
3 of 8 tasks
kltm opened this issue Mar 23, 2020 · 123 comments
Comments

@kltm
Member

kltm commented Mar 23, 2020

Add available SARS-CoV-2 data to the pipeline

  • @alexsign to produce GPI file
  • add to NEO (change config line in Makefile)
  • @alexsign to produce GAF/GPAD (this will be mostly interpro2go etc to start with)
  • Add GAF/GPAD to yaml, so can be loaded into amigo, added to release files
  • Patrick/ViralZone will do a GO-CAM for SARS-CoV-2
  • This should naturally flow into GO-CAM site. @lpalbou look into a way to highlight
  • @alexsign to load GPADs emanating from GO-CAMs back into GOA
  • UPDATE 2020-05-13 do the same for SARS-CoV

Tagging @pgaudet @cmungall

Questions:

  • @alexsign is it easy for you to give us the GPI in advance of the proteins going in to uniprot main release? If not, it is trivial for us to parse the xml from ftp://ftp.uniprot.org/pub/databases/uniprot/pre_release/
  • @alexsign will the GPI include all of the isoforms? From coronavirus.xml on the EBI FTP site, it looks like at the moment there are only accessions for the GCRP proteins
@pgaudet
Contributor

pgaudet commented Mar 24, 2020

@kltm you want a GPI file with only the one viral species ?

@cmungall
Member

We previously discussed just the SARS-CoV-2 genome, but we could extend to the other coronavirus genomes. But let's do the SARS-CoV-2 genome first

@alexsign
Contributor

ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_reviewed_sars-cov-2.gpi
ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpi

@cmungall
Member

Great! @thomaspd indicated we may need the isoforms also - is this something that will require further upstream protein sequence curation in uniprot?

@alexsign
Contributor

@cmungall I don't think we have this data in uniprot private release yet.

@alexsign
Contributor

ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpa

@kltm
Member Author

kltm commented Mar 26, 2020

@cmungall Your neo changes in geneontology/neo#55 are failing with:

18:18:26  make: *** No rule to make target 'target/neo-goa_sars-cov-2.obo', needed by 'all_obo'.  Stop.

@alexsign
Contributor

ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gaf

@kltm
Member Author

kltm commented Mar 26, 2020

@cmungall I'm thinking now that the issues are around here:

datasets.json: trigger
	wget http://s3.amazonaws.com/go-public/metadata/datasets.json -O $@ && touch $@

and

Makefile-gafs: datasets.json
	./build-neo-makefile.py -i $< > $@.tmp && mv $@.tmp $@

Given this, without starting the rewrite of the

  • remove your changes from the Makefile
  • get datasets.json into the main pipeline (maybe under different name)
  • point this Makefile at the new correct upstream
  • add the COVID-19 GPI metadata
  • rerun (and generate new Makefile-gafs) once upstream dataset is updated
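
A note on the failure above: make stops because no rule exists for `target/neo-goa_sars-cov-2.obo`, which is what one would expect if the neo-* rules are generated only for datasets listed in datasets.json. Below is a minimal sketch of that relationship. The metadata layout, the `id` key and the helper names are hypothetical; neither the real datasets.json schema nor the build-neo-makefile.py logic is shown in this thread.

```python
# Hypothetical sketch: which neo-* targets get a rule when rules are
# generated only from dataset metadata. The metadata layout is an assumption.

def generated_targets(datasets):
    """Return the set of target paths that would get a generated rule."""
    return {f"target/neo-{d['id']}.obo" for d in datasets}


def missing_rules(requested, datasets):
    """Return the requested targets that have no generated rule, sorted."""
    return sorted(set(requested) - generated_targets(datasets))


# Assumed metadata in which the new dataset is not yet registered upstream.
datasets = [{"id": "mgi"}, {"id": "goa_human"}]
requested = ["target/neo-mgi.obo", "target/neo-goa_sars-cov-2.obo"]

print(missing_rules(requested, datasets))
```

Under these assumptions, hardcoding URLs in the Makefile would not help, because the target list and the rule list come from different places.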

cmungall changed the title from "Add available COVID-19 data to the pipeline" to "Add available coronavirus data to the pipeline" on Mar 27, 2020
@cmungall
Member

@kltm - recall that my changes to the Makefile hardcoded the URLs for the virus

@cmungall
Member

From @lpalbou on gitter (better to report here than on gitter, so it won't get lost). FAO @alexsign

looking at the GPAD of covid: ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpa
we have two annotations for the same gene:

UniProtKB    P0DTC2    part_of    GO:0055036    GO_REF:0000044    ECO:0000322    UniProtKB-SubCell:SL-0275        20200321    UniProt        go_evidence=IEA
UniProtKB    P0DTC2:PRO_0000449647    enables    GO:0005515    PMID:32132184    ECO:0000353    UniProtKB:Q9BYF1        20200320    UniProt        go_evidence=IPI

But one is specified with a PRO ID and the other is not. Is that legit? Does it mean the second one should be treated as a different isoform, maybe?

My answer: no this looks like a bug.
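
For reference, the two rows differ only in whether column 2 carries a `:PRO_...` suffix. Here is a minimal sketch of separating the accession from that suffix, assuming tab-separated GPAD 1.1 columns in the order shown above. The helper names are illustrative and not part of any pipeline.

```python
def split_object_id(object_id):
    """Split 'P0DTC2:PRO_0000449647' into ('P0DTC2', 'PRO_0000449647').

    A plain accession such as 'P0DTC2' yields ('P0DTC2', None).
    """
    accession, _, feature = object_id.partition(":")
    return accession, feature or None


def parse_gpad_line(line):
    """Parse the leading columns of one tab-separated GPAD line."""
    cols = line.rstrip("\n").split("\t")
    accession, chain = split_object_id(cols[1])
    return {
        "db": cols[0],
        "accession": accession,
        "chain": chain,  # None when the whole entry is annotated
        "qualifier": cols[2],
        "go_id": cols[3],
    }


# The two rows quoted above, rebuilt with explicit tabs (empty columns kept).
rows = [
    "\t".join(["UniProtKB", "P0DTC2", "part_of", "GO:0055036",
               "GO_REF:0000044", "ECO:0000322", "UniProtKB-SubCell:SL-0275",
               "", "20200321", "UniProt", "", "go_evidence=IEA"]),
    "\t".join(["UniProtKB", "P0DTC2:PRO_0000449647", "enables", "GO:0005515",
               "PMID:32132184", "ECO:0000353", "UniProtKB:Q9BYF1",
               "", "20200320", "UniProt", "", "go_evidence=IPI"]),
]

for row in rows:
    print(parse_gpad_line(row))
```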

@kltm
Member Author

kltm commented Mar 27, 2020

@cmungall Yes, but the generation of the neo-* targets seems to be entirely done through the datasets.json metadata ball, which seems to be the origin of the error when doing the full build.

@alexsign
Contributor

@cmungall it's a legit manual annotation done by Patrick.Masson@isb-sib.ch at SIB.

@pmasson55

Hi,

That's not a bug, and you might see several of these discrepancies. That's the problem we have for viral polyproteins. For example the coronavirus: https://www.uniprot.org/uniprot/P0C6X7.txt
These polyproteins are cleaved once synthesized, generating 10 to 15 viral proteins. Since these are post-translational cleavage products, they are represented by a single accession number (AC:P0C6X7 in this case). The problem is that if we annotate GO terms to this accession (as we did at the beginning of GO annotation), the polyprotein ends up carrying all the annotations of its 15 viral products, which is not meaningful. Take cellular component as an example: if half of the proteins are cytoplasmic and half are nuclear, then the polyprotein entry will have both cytoplasm and nucleus annotations, which doesn't tell us much. Instead, we started using PRO IDs to tag specific components of the polyprotein, so we can assign terms much more precisely. In addition, we can still assign terms to the full polyprotein (using just AC:P0C6X7) when necessary, as we sometimes have information about the uncleaved polyprotein (function or localization before cleavage). So you can have both: the accession number alone, and the accession number with a PRO ID. Hope this is clearer now; I'll follow up if you have more questions.

@pgaudet
Contributor

pgaudet commented Mar 28, 2020

It would be important to be able to capture that.

@cmungall
Member

Important: everyone reading this thread should be aware that UniProt PRO IDs have nothing to do with the PRO ontology

Thus far we have managed the different protein end products from a GCRP protein using identifiers that use the dash nomenclature, e.g. P0DTC2-n. I would prefer to use these here, but I don't know the uniprot rules for when these get created. Even though these are the products of post-translational cleavage, from a user point of view they are still distinct gene products made by the same gene, so I don't see why they should be treated differently.

We can extend this to use the uniprot chain/pro IDs, but there are challenges

First, these don't seem to be resolvable. Using our existing registered prefixes, the prefixed ID UniProtKB:P0DTC2:PRO_0000449647 would be resolved as:

https://www.uniprot.org/uniprot/P0DTC2:PRO_0000449647

but this is a 404

I would prefer to avoid double-barreled prefixes. Should the prefixed ID not be UniProtKB:PRO_0000449647

but this also fails to resolve:

https://www.uniprot.org/uniprot/PRO_0000449647

There is also the problem that, because some organisms use the PRO ontology, even if we use well-behaved prefixes we will cause massive confusion in our community by using the uniprot PRO and the PRO ontology at the same time.

@cmungall
Member

Also remember every distinct entry in the gpa should be in the gpi, I don't see the PRO ID in the gpi...
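
The consistency rule stated here (every entity annotated in the GPA must also be listed in the GPI) can be checked mechanically. A minimal sketch follows, assuming both files are tab-separated with the DB in column 1 and the object ID in column 2, and that header lines start with `!`. The function names are illustrative.

```python
def object_ids(lines):
    """Collect 'DB:ID' keys from the first two tab-separated columns."""
    ids = set()
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and header/comment lines
        cols = line.rstrip("\n").split("\t")
        ids.add(f"{cols[0]}:{cols[1]}")
    return ids


def missing_from_gpi(gpa_lines, gpi_lines):
    """Return entities annotated in the GPA but absent from the GPI."""
    return sorted(object_ids(gpa_lines) - object_ids(gpi_lines))


# Toy input mirroring the situation described in this thread.
gpi = ["!gpi-version: 1.2", "UniProtKB\tP0DTC2\tS\tSpike glycoprotein"]
gpa = [
    "UniProtKB\tP0DTC2\tpart_of\tGO:0055036",
    "UniProtKB\tP0DTC2:PRO_0000449647\tenables\tGO:0005515",
]

print(missing_from_gpi(gpa, gpi))
```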

@thomaspd

I agree that it would be ideal if there were a way to resolve the UniProt PRO IDs, and we can ask the UniProt team (Maria, maybe?) how that might be done. But even if it's not done yet, it would be very helpful to have a GPI file that has the UniProt PRO ID, and lists the parent ID of the UniProt polyprotein record. From what Patrick had told me, the PRO ID (the chain within the polyprotein) has a name associated with it, so we could also get that in column 4 of the GPI file:
FT CHAIN 1..180
FT /note="Host translation inhibitor nsp1"
FT /evidence="ECO:0000250"
FT /id="PRO_0000037309"

@alexsign, would adding a line to the GPI file for each chain be doable?
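
To illustrate the idea of one GPI line per chain, here is a rough sketch of pulling the chain ID, range and name out of FT CHAIN blocks like the one quoted above. It assumes each qualifier fits on a single line (wrapped notes are not handled), and the function name is hypothetical.

```python
import re

CHAIN_RE = re.compile(r"FT\s+CHAIN\s+(\d+)\.\.(\d+)")
QUALIFIER_RE = re.compile(r'FT\s+/(\w+)="([^"]*)"')


def parse_chains(flat_text):
    """Return one dict per FT CHAIN feature with its range and qualifiers."""
    chains, current = [], None
    for line in flat_text.splitlines():
        chain = CHAIN_RE.match(line)
        qualifier = QUALIFIER_RE.match(line)
        if chain:
            current = {"start": int(chain.group(1)), "end": int(chain.group(2))}
            chains.append(current)
        elif qualifier and current is not None:
            # /note, /evidence and /id belong to the most recent CHAIN
            current[qualifier.group(1)] = qualifier.group(2)
        elif not qualifier:
            current = None  # any other line ends the current CHAIN block
    return chains


sample = (
    "FT CHAIN 1..180\n"
    'FT /note="Host translation inhibitor nsp1"\n'
    'FT /evidence="ECO:0000250"\n'
    'FT /id="PRO_0000037309"\n'
)

for chain in parse_chains(sample):
    print(chain["id"], chain["note"], chain["start"], chain["end"])
```

The `note` value is what would go into the name column of the GPI, and `id` combined with the parent accession would form the object ID.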

@cmungall
Member

My preference would be to avoid the MGI-style double-barreled delimiter, and have the global ID be simply UniProtKB:PRO_nnnnn (i.e. col2 of the GAF is PRO_nnnnn), and have the uniprot resolver resolve https://www.uniprot.org/uniprot/PRO_0000449647

@pgaudet
Contributor

pgaudet commented Mar 31, 2020

@alexsign any comments on these suggestions ?

@alexsign
Contributor

@cmungall
Hi Chris,
The annotations of the UniProt "PRO" IDs (chains, polypeptides and so on) are not a new thing. We have been producing and publishing this kind of data for a while.
For my part, I see no issue with including an extra line in the GPI file if this will really help. I assume you need something like this:
UniProtKB P0DTC2 S Spike glycoprotein S|2 protein taxon:2697049
UniProtKB P0DTC2 :PRO_0000449647 S Spike glycoprotein S|2 protein taxon:2697049

Now about the links on the UniProt website. Please keep in mind this is PRE-release data, so even a simple link like
https://www.uniprot.org/uniprot/P0DTC2
will not work. The data is simply not there in the current public release of the website. The next UniProt website public release, which will have sars-cov-2 data, is on April 22nd. The link should work after this date.
The UniProt consortium understands the importance of this data and has created a data portal for everyone to use. Please try the following link to see it.
https://covid-19.uniprot.org/uniprotkb/P0DTC2

If links are an absolute must, you have to strip the :PRO... IDs from the link (this is how it's done in QuickGO), or replace the ":" with a "#" symbol. For the time being you have to use https://covid-19.uniprot.org/uniprotkb/ before the actual ID instead of https://www.uniprot.org/uniprot/.
Both links
https://covid-19.uniprot.org/uniprotkb/P0DTC2
and
https://covid-19.uniprot.org/uniprotkb/P0DTC2#PRO_0000449647
work the same way right now.
I have an issue open with our web development team to make the P0DTC2#PRO_0000449647 link scroll to the PRO ID information as well.
I understand it's not ideal, and I am open to alternative suggestions.
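
The two link options described above (strip the PRO suffix, or swap ":" for "#") could be sketched as follows. The base URL is the portal address quoted in this comment; the function name and the `keep_chain` flag are illustrative only.

```python
PORTAL = "https://covid-19.uniprot.org/uniprotkb/"


def portal_link(object_id, keep_chain=False):
    """Build a portal link for 'ACC' or 'ACC:PRO_nnnnnnnnnn'.

    keep_chain=False strips the PRO suffix (the QuickGO-style option);
    keep_chain=True replaces ':' with '#' so the chain becomes a fragment.
    """
    accession, _, chain = object_id.partition(":")
    if keep_chain and chain:
        return f"{PORTAL}{accession}#{chain}"
    return f"{PORTAL}{accession}"


print(portal_link("P0DTC2:PRO_0000449647"))
print(portal_link("P0DTC2:PRO_0000449647", keep_chain=True))
```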

Making links like
https://www.uniprot.org/uniprot/P0DTC2:PRO_0000449647
work might be possible in the future as well, but it needs to be discussed with the web developers.

Unfortunately, your preferred link
https://www.uniprot.org/uniprot/PRO_0000449647
might never work for UniProt because "PRO" IDs are part of the protein record and not a separate entity, and not indexed as such. They simply provide you with extra information.

We can discuss changes we can make from both sides to make this data public ASAP without changing too much in our pipelines.
The UniProtKB syntax from db-xrefs.yaml for now is:
id_syntax: ([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z]([0-9][A-Z][A-Z0-9]{2}){1,2}[0-9])((-[0-9]+)|:PRO_[0-9]{10}|:VAR_[0-9]{6}){0,1}
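
As a quick way to see which ID forms discussed in this thread that syntax accepts, here is a small sketch. The pattern is written out with its character classes as an assumption about the db-xrefs.yaml entry, so check it against the current file before relying on it.

```python
import re

# Assumed form of the UniProtKB id_syntax; verify against db-xrefs.yaml.
ID_SYNTAX = re.compile(
    r"([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z]([0-9][A-Z][A-Z0-9]{2}){1,2}[0-9])"
    r"((-[0-9]+)|:PRO_[0-9]{10}|:VAR_[0-9]{6}){0,1}"
)


def is_valid_local_id(local_id):
    """True when the whole local ID (without the 'UniProtKB:' prefix) matches."""
    return ID_SYNTAX.fullmatch(local_id) is not None


for candidate in [
    "P0DTC2",                 # plain accession
    "P0DTC2-1",               # dash nomenclature (isoform-style)
    "P0DTC2:PRO_0000449647",  # accession plus chain ID
    "P0DTC2-PRO_0000449647",  # hyphenated chain form
    "PRO_0000449647",         # bare chain ID
]:
    print(candidate, is_valid_local_id(candidate))
```

Under this pattern only the first three forms validate; the bare `PRO_...` form and the hyphenated chain form would each need a syntax change.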

@redaschi

It was not missed, I have not yet run another pre-release. I have a TC on Wed with Alex and Chris and was hoping to be enlightened about how GOA determines the DB_Object_Name. I found this page http://geneontology.org/docs/go-annotation-file-gaf-format-2.1/#db-object-symbol-column-3, which seems to say that the gene/ORF name is taken if available and gene product name if there are several. So I don't get why the ORF name was not used for P0DTC6, but in any case Patrick has now assigned a "Short" protein name for the "Recommended" protein name of all SARS2 entries that will go into the next pre-release. Note that the old confusing duplicated names will stay as "Alternative" protein names, because this is what they unfortunately had been called initially.

@rachhuntley
Contributor

Thanks Nicole and Birgit. @redaschi @bmeldal

@cmungall
Member

cmungall commented Jun 26, 2020

Update:

there are simply too many problems with the GPI coming out of the current pipeline

@realmarcin and I are making a copy of the GPI and re-curating it. you can find our copy here: https://github.com/Knowledge-Graph-Hub/kg-covid-19/blob/master/curated/ORFs/uniprot_sars-cov-2.gpi

We are working with @pmasson55 and @alexsign to feed back these changes to uniprot and get them in the goa gpi, but we cannot wait for this any longer

@lpalbou @kltm we should load this gpi for neo

cmungall added a commit to Knowledge-Graph-Hub/kg-covid-19 that referenced this issue Jun 26, 2020
@cmungall
Member

Note that we are going with @pmasson55's suggestion to use P0DTD1 as the arbitrary parent to generate reference nsp IDs:

nsp1 P0DTD1:PRO_0000449619
nsp2 P0DTD1:PRO_0000449620
nsp3 P0DTD1:PRO_0000449621
nsp4 P0DTD1:PRO_0000449622
nsp5 P0DTD1:PRO_0000449623
nsp6 P0DTD1:PRO_0000449624
nsp7 P0DTD1:PRO_0000449625
nsp8 P0DTD1:PRO_0000449626
nsp9 P0DTD1:PRO_0000449627
nsp10 P0DTD1:PRO_0000449628
nsp12 (Pol) P0DTD1:PRO_0000449629
nsp13 (Hel) P0DTD1:PRO_0000449630
nsp14 (exoN) P0DTD1:PRO_0000449631
nsp15 P0DTD1:PRO_0000449632
nsp16 P0DTD1:PRO_0000449633

and FROM R1A_SARS2 (P0DTC1):
unique nsp11 P0DTC1:PRO_0000449645

This is consistent with what @bmeldal and IntAct are doing

However, PDB are using P0DTC1 as the xref for nsps such as nsp3:

https://www.rcsb.org/structure/6W9C

Should they instead use P0DTD1-PRO_0000449621?

It may well be the case the expression system they used made use of the shorter pp... but it seems the structure is the same regardless of which pp?

@bmeldal
Contributor

bmeldal commented Jul 1, 2020

Thank you, @cmungall

I will ask John Beresford to get in touch with you regarding the PDB choices. We had conversations about it before.

@cmungall
Member

I thought we were loading https://github.com/Knowledge-Graph-Hub/kg-covid-19/blob/master/curated/ORFs/uniprot_sars-cov-2.gpi as the source for neo, but it appears not

@pmasson55

Hi Chris,

Concerning the spike protein, there are no short names in SwissProt:
DE RecName: Full=Spike glycoprotein {ECO:0000255|HAMAP-Rule:MF_04099};
DE Contains:
DE RecName: Full=Spike protein S1 {ECO:0000255|HAMAP-Rule:MF_04099};
DE Contains:
DE RecName: Full=Spike protein S2 {ECO:0000255|HAMAP-Rule:MF_04099};
DE Contains:
DE RecName: Full=Spike protein S2' {ECO:0000255|HAMAP-Rule:MF_04099};

Do you think that might be the reason why the display is showing the numbers instead of S2?
The GPI shows
UniProtKB P0DTC2-PRO_0000449648 Spike protein S2 Spike protein S2
Maybe it cannot find the short name and thus replaces it with amino acid numbers?
If that's the case, I can fix that in SwissProt.

@cmungall
Member

cmungall commented Apr 30, 2021 via email

@kltm
Member Author

kltm commented Feb 4, 2022

Due to id collisions, we'll be temporarily switching over to using uniprot_reviewed_virus_bacteria. See geneontology/neo#80

@cmungall
Member

@kltm and all, I know I gave the go-ahead to use the uniprot reviewed file, but when I look again, I see we should definitely NOT. Sorry for the bad advice.

Let's recap where we are. There are 3 alternate files:

  1. ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_reviewed_sars-cov-2.gpi (13 entries)
  2. ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_sars-cov-2.gpi (55 entries)
  3. https://raw.githubusercontent.com/Knowledge-Graph-Hub/kg-covid-19/master/curated/ORFs/uniprot_sars-cov-2.gpi (32 entries)
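
Entry counts like the ones quoted above can be reproduced by counting data lines. This is a minimal sketch, assuming header and comment lines start with `!` as they do in GPI files; the function name is illustrative.

```python
def count_entries(lines):
    """Count data lines, skipping blank lines and '!' header lines."""
    return sum(1 for line in lines if line.strip() and not line.startswith("!"))


# Toy input; with a downloaded copy one could pass an open file handle instead.
example = [
    "!gpi-version: 1.2\n",
    "UniProtKB\tP0DTC2\tS\tSpike glycoprotein\n",
    "\n",
    "UniProtKB\tP0DTC6\t6\tNon-structural protein 6\n",
]
print(count_entries(example))
```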

goa/uniprot_reviewed_sars-cov-2.gpi

We should definitely NOT use 1; it has many, many problems. Many of these are things that were previously mentioned in this ticket, but have since been reverted.

First of all, this file is missing all the polyprotein cleavage products. There are only top-level uniprot entries in here, so for example we are missing all of the nsps (nsp1-16). Recall that the nsps do not get bona fide uniprot entries, they get PRO IDs.

So for example, the reviewed file misses nsp3, which contains the papain-like protease (PL-PRO) and is of huge importance

Second, this file misses key synonyms and actually has incorrect synonyms. For example:

UniProtKB P0DTC6 6 Non-structural protein 6 6 protein taxon:2697049

The correct name for this is ORF6. Also it will be very easy for a curator to mistake this for nsp6, which is a completely different protein (not represented in the reviewed file).

Note this was all noted back in early 2020, I thought we had fixed this, see @rachhuntley's comments:
#1431 (comment)

I strongly recommend the URL for uniprot_reviewed_sars-cov-2.gpi is removed, as it is not useful in itself because it misses so many important proteins.

goa/uniprot_sars-cov-2.gpi

This is better in that it does not remove the key proteins. However, it has the duplicate problem we have suffered from for 2 years now:

For example, for nsp3:

UniProtKB P0DTC1-PRO_0000449637 nsp3 Non-structural protein 3 nsp3|PL-PRO|P0DTC1(819-2763) protein taxon:2697049 UniProtKB:P0DTC1
UniProtKB P0DTD1-PRO_0000449621 nsp3 Non-structural protein 3 nsp3|PL-PRO|P0DTD1(819-2763)|rep|1a-1b protein taxon:2697049 UniProtKB:P0DTD1

Recall that there is a loose convention that PRO_0000449621 is "canonical" as discussed in this ticket, but there is nothing in the file to suggest this is the case
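
The loose convention mentioned here could be made explicit when de-duplicating. A minimal sketch follows, assuming the convention from earlier in this ticket (prefer the chain whose parent is P0DTD1, and fall back to whatever parent exists, as for nsp11). The function name and the input shape are hypothetical.

```python
import re

PREFERRED_PARENT = "P0DTD1"  # the "arbitrary parent" agreed earlier in this ticket


def pick_canonical(rows):
    """Map each symbol to one object ID, preferring the P0DTD1 chain.

    rows: iterable of (object_id, symbol) pairs, e.g. taken from columns
    2 and 3 of a GPI file. Accepts both 'ACC-PRO_...' and 'ACC:PRO_...'.
    """
    chosen = {}
    for object_id, symbol in rows:
        parent = re.split(r"[-:]", object_id, maxsplit=1)[0]
        if symbol not in chosen or parent == PREFERRED_PARENT:
            chosen[symbol] = object_id
    return chosen


rows = [
    ("P0DTC1-PRO_0000449637", "nsp3"),
    ("P0DTD1-PRO_0000449621", "nsp3"),
    ("P0DTC1-PRO_0000449645", "nsp11"),  # only exists on P0DTC1
]
print(pick_canonical(rows))
```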

kg-covid-19/master/curated/ORFs/uniprot_sars-cov-2.gpi

I defer to @pmasson55 but I still think this curated file is the best representation of the SARS2 proteome and we should continue to use this.

It correctly has a single entry for nsps like nsp3:

✗ curl -L -s https://raw.githubusercontent.com/Knowledge-Graph-Hub/kg-covid-19/master/curated/ORFs/uniprot_sars-cov-2.gpi | grep nsp3
UniProtKB P0DTD1-PRO_0000449621 nsp3 Non-structural protein 3 PL-PRO|P0DTD1(819-2763)|UniProtKB:P0DTD1, 819-2763|PLpro (SARS2)|nsp3 (SARS2)|rep/Clv:nsp3 (SARS2)|main proteinase (SARS2)|papain-like proteinase (SARS2)|UniProtKB:P0DTC1, 819-2763|PRO_0000449637|nsp-3|ns3|ns-3|Papain-like proteinase|Papain-like protease|Papain-like protease 2|Papain-like proteinase 2|PL-PRO|PL2-PRO|PL2pro|PLpro|Severe acute respiratory syndrome (SARS) coronavirus nonstructural protein 3|819-2763|SARS-CoV-2-PLpro|2764-3263|ADRP protein taxon:2697049 UniProtKB:P0DTD1 PR:000050272|UniProtKB:P0DTD1-PRO_0000449637|PRO_0000449621

It also correctly names proteins like ORF6

@kltm
Member Author

kltm commented Feb 19, 2022

Okay, when we revisit this, we should be using kg-covid-19/master/curated/ORFs/uniprot_sars-cov-2.gpi from GitHub: https://raw.githubusercontent.com/Knowledge-Graph-Hub/kg-covid-19/master/curated/ORFs/uniprot_sars-cov-2.gpi

@pgaudet
Contributor

pgaudet commented May 23, 2023

is this done ?

@kltm
Member Author

kltm commented May 23, 2023

No progress noted and a bunch of TODO boxes at the top, so I don't think so.

@pgaudet
Contributor

pgaudet commented Sep 8, 2023

Everything that was requested by Swiss-Prot/ViralZone could be done.

@pgaudet pgaudet closed this as completed Sep 8, 2023