ProteinAlphabet to ThreeLetterProtein converter #238

Open
eleyine opened this Issue Sep 17, 2013 · 9 comments

eleyine commented Sep 17, 2013

This is a small issue but I haven't found a way to easily convert from the single-letter Protein alphabet to the three-letter alphabet and vice-versa.

It would be nice to have the possibility to specify the protein alphabet to translate to as such:

   from Bio.Seq import Seq
   from Bio.Alphabet import IUPAC, ThreeLetterProtein
   coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
   protein = coding_dna.translate(alphabet=ThreeLetterProtein)

or define a converter function from one alphabet to the other.
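For illustration only, a converter of the kind requested could be sketched in plain Python as below. The mapping table and the names `one_to_three` and `three_to_one` are hypothetical, written by hand for this example; they are not part of Biopython.

```python
# Hand-written table for the 20 standard amino acids plus the stop symbol.
# This is an illustrative assumption, not Biopython's own data.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "*": "Ter",
}
THREE_TO_ONE = {three: one for one, three in ONE_TO_THREE.items()}


def one_to_three(protein):
    """Convert a one-letter protein string to a three-letter string."""
    return "".join(ONE_TO_THREE[residue] for residue in protein.upper())


def three_to_one(protein3):
    """Convert a three-letter protein string back to one-letter codes."""
    if len(protein3) % 3 != 0:
        raise ValueError("length of a three-letter sequence must be a multiple of 3")
    # Cut the string into 3-character codes and normalise the case ("TYR" -> "Tyr").
    codes = [protein3[i:i + 3].capitalize() for i in range(0, len(protein3), 3)]
    return "".join(THREE_TO_ONE[code] for code in codes)


print(one_to_three("YEEI"))          # TyrGluGluIle
print(three_to_one("TyrGluGluIle"))  # YEEI
```

Residues outside the table (ambiguity codes, for example) raise a KeyError in this sketch; a real converter would need a policy for them.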

Member

bow commented Sep 17, 2013

Bio.SeqUtils has a function, seq3, that converts one-letter protein codes into three-letter codes. However, it works on plain strings rather than on Seq or Alphabet objects.

e.g.

   from Bio.Seq import Seq
   from Bio.Alphabet import IUPAC
   from Bio.SeqUtils import seq3
   coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
   protein_one = coding_dna.translate()
   protein_three = seq3(str(protein_one))

Note that it returns the three-letter protein codes as a string, not as a Seq object.

Contributor

lennax commented Sep 22, 2013

Protein strings can be converted in both directions:

from Bio.SeqUtils import seq1, seq3
from Bio.Seq import Seq
from Bio.Alphabet import ProteinAlphabet, ThreeLetterProtein

one_lett_str = "YEEI"
one_lett_seq = Seq(one_lett_str, ProteinAlphabet())

three_lett_str = seq3(str(one_lett_seq))
# Kept for completeness but not recommended
# three_lett_seq = Seq(three_lett_str, ThreeLetterProtein())
print(three_lett_str)
# TyrGluGluIle

new_one_lett_str = seq1(three_lett_str)
print(new_one_lett_str)
# YEEI

P.S. Hello @openhatch (task 70)

Member

peterjc commented Sep 23, 2013

One reason the seq3 function gives you a string is that using the Seq objects with anything other than a one-letter alphabet is not really defined.

@lennax I'm not comfortable with how you've used a Seq object with the three-letter alphabet: the length is wrong, and it will fail the planned alphabet checks (https://redmine.open-bio.org/issues/2597). In other words, I don't think you should do this:

>>> from Bio.Alphabet import ThreeLetterProtein
>>> from Bio.Seq import Seq
>>> s3 = Seq("TyrGluGluIle", ThreeLetterProtein())
>>> len(s3)
12

You can do this with a MutableSeq which uses an array:

>>> from Bio.Alphabet import ThreeLetterProtein
>>> from Bio.Seq import MutableSeq
>>> m3 = MutableSeq(["Tyr", "Glu", "Glu", "Ile"], ThreeLetterProtein())
>>> m3
MutableSeq('TyrGluGluIle', ThreeLetterProtein())
>>> len(m3)
4

Separately note that ThreeLetterProtein is a class, and you need ThreeLetterProtein() for an instance of the Alphabet class - something else which the Seq object could/should check.

Contributor

lennax commented Sep 23, 2013

Ahh, I missed the instance-required bit. Thanks for pointing that out.

So does this suggest that seq3 should be modified to output a list? Or should we just strongly discourage passing a three-letter protein string to any Seq object, i.e. tell users to call seq1 on it and use ProteinAlphabet()?

Member

peterjc commented Sep 23, 2013

I'm not sure what the use cases are for seq3, but perhaps an option for a list of 3-letter strings makes sense?

I would like to strongly discourage passing strings like "TyrGluGluIle" to the Seq object, and the proposed alphabet letter check should achieve this...
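As a rough sketch of the list-of-codes idea, in plain Python and without touching seq3 itself: the helper name `split_three_letter` is made up for this example and assumes the input contains only three-character codes.

```python
def split_three_letter(protein3):
    """Split a three-letter protein string into a list of residue codes."""
    if len(protein3) % 3 != 0:
        raise ValueError("expected a length that is a multiple of 3")
    return [protein3[i:i + 3] for i in range(0, len(protein3), 3)]


three_lett_str = "TyrGluGluIle"
residues = split_three_letter(three_lett_str)

print(len(three_lett_str))  # 12 (characters)
print(residues)             # ['Tyr', 'Glu', 'Glu', 'Ile']
print(len(residues))        # 4 (residues)
```

With a list, len() counts residues rather than characters, which is the behaviour the MutableSeq example above also gives.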

Contributor

MarkusPiotrowski commented Mar 21, 2014

Actually, I was thinking that specifying the alphabet as ThreeLetterProtein is the way to get correct results (e.g., with len()) for sequences like "TyrGluGluIle". Isn't this the obvious assumption? What else is it good for, other than telling the respective Seq methods to behave differently?
I just ran into a problem with seq1 when passing it a Seq object with a ThreeLetterProtein alphabet. The problem is the use of upper, which ends up in _upper() in Alphabet/__init__.py, where upper is called on the alphabet's letters. Unfortunately, in the three-letter alphabet the letters are a list, hence an AttributeError.
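The failure mode can be reproduced in plain Python without Biopython. The two classes below are hypothetical stand-ins that only mimic how the letters are stored (a string versus a list); they are not the real Alphabet classes, and `upper_letters_safe` is just one possible guard, not what Biopython does.

```python
class OneLetterSketch:
    # Hypothetical stand-in: letters stored as a single string.
    letters = "ACDEFGHIKLMNPQRSTVWY"


class ThreeLetterSketch:
    # Hypothetical stand-in: letters stored as a list of codes.
    letters = ["Ala", "Cys", "Asp", "Glu"]


def upper_letters(alphabet):
    """Upper-case the letters, assuming they are a string."""
    return alphabet.letters.upper()


def upper_letters_safe(alphabet):
    """Upper-case the letters whether they are a string or a list of codes."""
    letters = alphabet.letters
    if isinstance(letters, str):
        return letters.upper()
    return [code.upper() for code in letters]


print(upper_letters(OneLetterSketch))  # fine: str has an upper() method

try:
    upper_letters(ThreeLetterSketch)   # list has no upper() method
except AttributeError as err:
    print("AttributeError:", err)

print(upper_letters_safe(ThreeLetterSketch))  # ['ALA', 'CYS', 'ASP', 'GLU']
```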

Member

bow commented Mar 27, 2014

Hmm... I was thinking maybe we should disallow / deprecate ThreeLetterProtein completely?

Internally, three-letter protein codes mean the same thing as their one-letter counterparts. It's only the display that's different. Three-letter protein codes may be useful for illustrating DNA-amino acid alignments (as in translated DNA), but even here it's inaccurate to store the three-letter code, since each three-letter code represents only a single residue.

Add to that the fact that our Seq objects do not use the ThreeLetterProtein alphabet at all, and deprecating / removing it seems more sensible.
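To make the "display only" point concrete, here is a minimal sketch that keeps the protein as one-letter codes and produces three-letter codes only when rendering it under its codons. The table and the function name `show_translation` are assumptions made for this example, not Biopython API.

```python
# Hand-written table (illustrative, not Biopython's own data).
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "*": "Ter",
}


def show_translation(dna, protein):
    """Return two lines: the codons, and the three-letter code under each."""
    if len(dna) != 3 * len(protein):
        raise ValueError("expected exactly one codon per residue")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    names = [ONE_TO_THREE[residue] for residue in protein]
    return " ".join(codons) + "\n" + " ".join(names)


# The stored data stays one-letter; three-letter codes appear only in the output.
print(show_translation("ATGGCCATTGTA", "MAIV"))
# ATG GCC ATT GTA
# Met Ala Ile Val
```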

Contributor

MarkusPiotrowski commented Mar 27, 2014

I don't know if three-letter representations of a protein sequence are widely used. Actually, I doubt it. So it may be OK to deprecate and remove it.
I just wanted to illustrate the view of a naive Biopython user (like me) who sees: "Ah, there is a three-letter protein alphabet, so I can use my three-letter protein string without converting and everything is fine!" Thus, if there is a three-letter protein alphabet and I attach this alphabet to my sequence, then I would expect all Seq methods to be able to deal with it, e.g. len() or upper() (e.g. by calling seq1). Isn't this one reason for having an alphabet attached to a sequence? Not only to define the letters, but also to tell the respective methods to behave differently for different sequence types? (Peter asked me to do so with the molecular_weight method in SeqUtils.)

peterjc added a commit that referenced this issue Nov 13, 2014

We don't have upper/lower case variants of ThreeLetterProtein
This was brought up in discussion on issue #238.
Member

peterjc commented Jul 4, 2018

Cross reference #1681