Skip to content
Browse files

Use SeqUtils.seq1() instead of SCOP.to_one_letter_code().

  • Loading branch information...
1 parent 87a3a3a commit ca7762e8247dbcd6b3cbc6d0a7082eaf36dceb46 @cbrueffer cbrueffer committed with peterjc Dec 10, 2012
Showing with 6 additions and 8 deletions.
  1. +6 −8 Bio/SeqIO/
14 Bio/SeqIO/
@@ -22,9 +22,8 @@ def PdbSeqresIterator(handle):
- # Late-binding import to avoid circular dependency on SeqIO in Bio.SCOP
- # TODO - swap in Bow's SeqUtils.seq1 once that's merged
- from Bio.SCOP.three_to_one_dict import to_one_letter_code
+ # Late-binding import to avoid circular dependency on SeqIO in Bio.SeqUtils
+ from Bio.SeqUtils import seq1
chains = collections.defaultdict(list)
metadata = collections.defaultdict(list)
@@ -42,8 +41,7 @@ def PdbSeqresIterator(handle):
chn_id = line[11]
# Number of residues in the chain (repeated on every record)
# num_res = int(line[13:17])
- residues = [to_one_letter_code.get(res, 'X')
- for res in line[19:].split()]
+ residues = [seq1(res) for res in line[19:].split()]
elif rec_name == 'DBREF':
# ID code of this entry (PDB ID)
@@ -124,14 +122,14 @@ def PdbAtomIterator(handle):
# Only import PDB when needed, to avoid/delay NumPy dependency in SeqIO
from Bio.PDB import PDBParser
- from Bio.SCOP.three_to_one_dict import to_one_letter_code
+ from Bio.SeqUtils import seq1
def restype(residue):
"""Return a residue's type as a one-letter code.
Non-standard residues (e.g. CSD, ANP) are returned as 'X'.
- return to_one_letter_code.get(residue.resname, 'X')
+ return seq1(residue.resname)
# Deduce the PDB ID from the PDB header
# ENH: or filename?
@@ -149,7 +147,7 @@ def restype(residue):
for chn_id, chain in sorted(model.child_dict.iteritems()):
# HETATM mod. res. policy: remove mod if in sequence, else discard
residues = [res for res in chain.get_unpacked_list()
- if res.get_resname().upper() in to_one_letter_code]
+ if seq1(res.get_resname().upper()) != "X"]
if not residues:
# Identify missing residues in the structure

10 comments on commit ca7762e

Biopython Project member

Hi Peter,

this was a change I intentionally didn't create a pull request for back then. First of all, as far as I remember it breaks the regression test (or maybe it was a similar change to another file). The bigger problem here is that seq1() accepts a lot fewer three letter codes than to_one_letter_code(). So, from my point of view this has the possibility to break previously working functionality.

I think I raised this problem in some earlier pull request but never did so on the mailing list. What do you think?



Biopython Project member

Thanks - this comment reached me via email before I noticed the failure on TravisCI, reverted:

Either I was too hasty and didn't run the full test suite, or the Python I used didn't have NumPy.

I don't recall the previous discussion on a pull request, but we need to think about the weird 3 letter codes used in the PDB and other structures for modified amino acids, and if that belongs in Bio.SeqUtils or not. Moving this discussion to the dev mailing list might be best.

etal commented on ca7762e Apr 2, 2013

Hmm. Is it possible to use the "custom_map" argument of seq1 to accommodate an extended alphabet? If not, I see two reasonable solutions:

  1. Modify SeqUtils.seq1 to use the extended protein alphabet (perhaps optionally). Is there a reason seq1 uses a more limited alphabet currently? I don't see an issue with performance. (My preference.)
  2. Copy SeqUtils.seq1 into SeqIO/ as a private function, and modify it to use the extended alphabet and drop any unneeded functionality.

Thoughts? Shall we take it to the dev list?

Biopython Project member
peterjc commented on ca7762e Apr 2, 2013

Using the custom_map argument sound good - anyone want to try that on a new branch?

Also perhaps (after some background reading on SCOP and PDB naming of modified residues) the dict can be moved from Bio.SCOP to Bio.Data and given an informative module docstring?

Biopython Project member
bow commented on ca7762e Apr 2, 2013

I tried the custom_map approach here. It seems to be going ok (tests are being run here).

Also, moving SCOP.three_to_one_dict into Bio.Data seems sensible. Perhaps we can extend the CodonTable class to output a dict as well, so that we don't have to define another dict for seq1 or seq3 inside Bio.SeqUtils?

By the way, when I wrote seq1, I simply thought reversing the dictionary used by seq3 would be enough and the simplest way to go (so no particular reason).

etal commented on ca7762e Apr 2, 2013

Bow's patch looks good to me, although it still depends on Bio.SCOP. I support the motion to move SCOP.three_to_one_dict to Bio.Data, probably under a new name, and then add "from Bio.Data import pdb_three_to_one_dict as three_to_one_dict" or similar.

According to git-grep, the module itself is only used in Bio.SCOP.Raf, Bio.SeqIO.PdbIO, and the script Scripts/PDB/ (which generates the module).

Biopython Project member

I like the "moving SCOP.three_to_one_dict to Bio.Data" solution as well.

Biopython Project member
peterjc commented on ca7762e Apr 4, 2013

I've applied @Bow's patch here: 843447a (thanks).

A further update from either of you to move the dictionary would be welcome.

Biopython Project member
bow commented on ca7762e Apr 4, 2013

Thanks :). I'll play around with the Bio.SCOP dict into Bio.Data and post an update soon.

Biopython Project member
bow commented on ca7762e Apr 4, 2013

Pull request posted here: #174, with a link back to this thread.

Please sign in to comment.
Something went wrong with that request. Please try again.