SwissProt and TrEMBL vs. UniProt Knowledgebase #17

Closed
ariutta opened this issue Oct 1, 2015 · 14 comments

@ariutta
Contributor

ariutta commented Oct 1, 2015

Our datasources.txt official names (last column) don't always match the recommended Miriam names. For example, datasources.txt has "UniProt/Trembl" where Miriam has "UniProt Knowledgebase". (Update: not an exact match. See later comments.)

@ariutta ariutta added the bug label Oct 1, 2015
@egonw
Member

egonw commented Dec 12, 2015

@ariutta, @Chris-Evelo, @AlexanderPico the MIRIAM name does not reflect the subset... same applies for the Swissprot subset. Is that OK? Or should the proper names be "UniProt Knowledgebase/TrEMBL" (etc)?

@egonw egonw added this to the 2.1.2 milestone Dec 12, 2015
@Christian-B
Contributor

The way I have always understood it (but I may be wrong) is this:
Things start out as Trembl when they are automatically added, and are given a Trembl URL.
When they are later manually verified, they are added to Swissprot and given a DIFFERENT Swissprot URL (different in both the datasource part and the ID part).
However, most people, including OPS, continue to use the Trembl URLs but include in their data only the objects which have been added to the Swissprot dataset.
So in BridgeDb/IMS terms the correct term is Trembl and NOT Swissprot, as it is still the Trembl URLs that are being used. In fact, BridgeDb would have no way of knowing automatically, looking just at the ID, which is which.

Identifiers.org/Miriam also ONLY do regex pattern matching to see if a URL is valid. There are no regex rules to split Trembl-only URLs from Trembl URLs for items that are included in the Swissprot subset, so they will never be able to make the distinction.
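To illustrate the point about pattern matching, here is a minimal sketch in Python. It assumes the accession regular expression published in UniProt's help pages; the exact pattern identifiers.org uses for its uniprot collection may differ in detail. `is_valid_accession` is a hypothetical helper, P62158 is the accession quoted from Ensembl elsewhere in this thread, and A0A024R161 is only an illustrative 10-character accession.

```python
import re

# Accession pattern as published in UniProt's help pages (assumption: the
# identifiers.org pattern for the uniprot collection is equivalent or similar).
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_valid_accession(identifier: str) -> bool:
    """Return True if the whole string has the shape of a UniProt accession."""
    return UNIPROT_ACCESSION.fullmatch(identifier) is not None


# The check only says whether the string is well formed. Nothing in the
# pattern encodes whether the record is reviewed (Swissprot) or not (Trembl).
for candidate in ("P62158", "A0A024R161", "CALM_HUMAN"):
    print(candidate, is_valid_accession(candidate))
```

Both accession shapes pass the same check, while an entry name such as CALM_HUMAN does not. Telling reviewed from unreviewed records would need a lookup against UniProt itself.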

@ariutta
Contributor Author

ariutta commented Dec 14, 2015

@egonw, you're right -- "UniProt Knowledgebase" refers to both Swissprot and TrEMBL. When I opened this issue I incorrectly thought "http://identifiers.org/uniprot" only referred to TrEMBL, but here's part of an email reply from Nick Juty:

... I've updated the record for TrEMBL synonyms as requested. Our record does actually refer to SwissProt as well. I guess you want those added too? We go by what is listed on the UniProt website in this case: http://www.ebi.ac.uk/uniprot. It states 1) 'The UniProt Knowledgebase (UniProtKB) ...consists of two sections:
UniProtKB/Swiss-Prot which is manually annotated and is reviewed and
UniProtKB/TrEMBL which is automatically annotated and is not reviewed.'

Since "http://identifiers.org/uniprot" refers to both TrEMBL and Swissprot, I don't know how we should indicate the official name and the Miriam URN in datasources.txt. I'm open to opinions on whether "UniProt Knowledgebase/TrEMBL" is best and whether "urn:miriam:uniprot" should be listed for 1) both TrEMBL and Swissprot or 2) just TrEMBL.

@ariutta ariutta changed the title Incorrect official names in datasources.txt Swissprot and TrEMBL vs. UniProt Knowledgebase Dec 14, 2015
@ariutta ariutta changed the title Swissprot and TrEMBL vs. UniProt Knowledgebase SwissProt and TrEMBL vs. UniProt Knowledgebase Dec 14, 2015
@Chris-Evelo

For me the important question is whether we can come up with a method to use URNs that are both technically correct and that allow biologists to judge immediately from the URN whether data is evaluated (as it is when from UniProt) or not. We might have to go back to identifiers.org or the UniProt team for that.

@Christian-B
Contributor

Hi Chris,

To the best of my knowledge there is no way to tell, by looking at just the URL/URN (even the UniProt one), whether the object has been added to Swissprot or is just Trembl, especially as this could change over time. The biologist would have to look the data up at UniProt.

identifiers.org does not keep individual data records. It only stores ID regex patterns, passing all ID-level calls down to the underlying data source (in this case UniProt).

@Chris-Evelo

Just thinking out loud now, but things that are in SwissProt actually have a SwissProt ID too, right? And the UniProt entry should (and probably will) link to that. Can we use that somehow?

@Christian-B
Contributor

Yes, as far as I know things in Swissprot have a second, completely different SwissProt ID. Having a linkset to that would be nice, but it was never done in OPS.
Note that the SwissProt URN is different in both the base and the ID part, so no semi-automated system could be used.

@ariutta
Contributor Author

ariutta commented Dec 16, 2015

Another comment from Nick Juty:

I had a very quick look. I would map both swiss-prot and trembl to our Uniprot collection, as that is what it reflects. I'm not sure of how this causes issues, aside from not knowing from the URI itself which is a manually curated record, and which is not. I've also added UniProtKB/Swiss-Prot as a synonym in case that helps you a bit?

@ariutta
Contributor Author

ariutta commented Dec 16, 2015

Currently identifiers.org has "UniProt Knowledgebase" as the preferred name and additionally has these alternative names listed for the http://identifiers.org/uniprot/ data collection:

  • UniProtKB
  • UniProt
  • Protein Knowledgebase
  • UniProt-TrEMBL
  • UniProt/TrEMBL
  • UniProtKB/Swiss-Prot

If you want me to suggest any additions or changes, just let me know.
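As a rough sketch of what mapping every one of these names to the single identifiers.org collection could look like, assuming a simple case-insensitive lookup: `resolve_urn` and the lookup table are hypothetical and not part of BridgeDb or identifiers.org. The URN is the "urn:miriam:uniprot" value discussed earlier in this thread.

```python
from typing import Optional

MIRIAM_URN = "urn:miriam:uniprot"

# Preferred name plus the alternative names listed above.
KNOWN_NAMES = [
    "UniProt Knowledgebase",
    "UniProtKB",
    "UniProt",
    "Protein Knowledgebase",
    "UniProt-TrEMBL",
    "UniProt/TrEMBL",
    "UniProtKB/Swiss-Prot",
]

# Case-folded lookup so "UniProt/Trembl" and "UniProt/TrEMBL" resolve alike.
_LOOKUP = {name.casefold(): MIRIAM_URN for name in KNOWN_NAMES}


def resolve_urn(name: str) -> Optional[str]:
    """Return the Miriam URN for a known UniProt name, or None if unknown."""
    return _LOOKUP.get(name.strip().casefold())


print(resolve_urn("UniProt/Trembl"))
print(resolve_urn("Ensembl"))
```

With a table like this, the Swissprot and TrEMBL names both resolve to the same URN, which matches Nick's suggestion but also means the URN alone cannot say which subset a record belongs to.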

@ariutta
Contributor Author

ariutta commented Dec 16, 2015

Followup comment from Nick:

I think that the gene name is CALM, and in humans it is CALM_HUMAN. But we use the identifier provided by UniProt for the record, not for the gene or protein. The identifier for the record is the stable identifier.

So is datasources.txt wrong to list CALM_HUMAN as a sample identifier for SwissProt?

(Moved this to issue #25.)

@AlexanderPico
Contributor

Just to throw another wrench into this discussion… The datasource names and identifiers used by BridgeDb are not only reliant on our collective best sense of what they should be, nor on what identifiers.org does, but also critically dependent on what the primary resources such as Ensembl decide to do. The BridgeDb database build has been simplified over many years to depend on resources like Ensembl to make these decisions, since they dedicate a ton of time to the problem and represent a widely used community resource. In many ways this relieves us of having to figure this out and make the necessarily compromising decisions.

In other words, let’s just do what Ensembl does and it simplifies BOTH the decision making and the build process.

If something about the source data from Ensembl is really, really offensive, or causes specific data integrity issues, or breaks a critical analysis workflow, well then we should take it up with Ensembl and ask if they can change it. The more we drift from them, the more we have to maintain these differences in code and in practice.

My perspective on keeping things simple :)

  • Alex

@ariutta
Contributor Author

ariutta commented Dec 17, 2015

@AlexanderPico, keeping things simple sounds good. I took a look at the Ensembl entry for CALM2, and here's how they refer to Uniprot:

UniProtKB This gene has proteins that correspond to the following Uniprot identifiers: P62158

They don't appear to distinguish between SwissProt and TrEMBL.

@AlexanderPico
Contributor

Ah, not the Ensembl website, but rather their databases. Specifically the data tables we use to make the bridgedb database. This is different from their singular representation. It covers all the alias representations, including uniprot.

  • Alex


@egonw
Member

egonw commented Mar 31, 2018

Closed by 192a18f


5 participants