support translation of ambiguity codes #30

kyuhas · 2021-04-13T14:48:35Z

No description provided.

kyuhas · 2021-04-13T14:55:36Z

this is linked to this hgvs package issue: biocommons/hgvs#595

cassiemk · 2021-04-13T15:05:32Z

@reece we need a solution for this bug immediately, please review ASAP

reece · 2021-04-13T18:02:01Z

Will look today. Thanks for the contribution.

kyuhas · 2021-04-13T19:18:01Z

Also, assuming this PR is good to merge, could you publish a new version of bioutils to pypi? Thanks!

reece

Thanks for the contribution, Kaylee.

This implementation translates any codon containing an ambiguity code as X. However, it's often possible to translate codons with ambiguity codes where the ambiguity is irrelevant to the outcome. Since we're adding ambiguity support, I think we should strive for fuller support eventually. I'll follow up on Slack to discuss options for how to proceed.

src/bioutils/cytobands.py

reece · 2021-04-14T14:24:29Z

src/bioutils/sequences.py

@@ -441,7 +453,12 @@ def translate_cds(seq, full_codons=True, ter_symbol="*"):
    protein_seq = list()
    for i in range(0, len(seq) - len(seq) % 3, 3):
        try:
-            aa = dna_to_aa1_lut[seq[i:i + 3]]
+            the_seq = seq[i:i + 3]


Can we rename the_seq to codon for clarity?

sure thing!

reece · 2021-04-14T14:32:13Z

src/bioutils/sequences.py

-            aa = dna_to_aa1_lut[seq[i:i + 3]]
+            the_seq = seq[i:i + 3]
+            wildcard_nucleotides = ["B", "D", "H", "V", "N", "U", "W", "S", "M", "K", "R", "Y", "Z"]
+            if any([wildcard_base in the_seq for wildcard_base in wildcard_nucleotides]):


As written, any wildcard causes the AA to be X. I think we can do better than that. For example, in a standard translation table, CUN ⇒ Leu, GCN ⇒ Ala, GGN ⇒ Gly, AAY ⇒ Asn, etc. See overall comments for discussion.
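
For illustration, here is a minimal sketch of one way the fuller behavior could work: expand each ambiguity code into the bases it can stand for, translate every resulting codon, and return X only when the results disagree. This is not the bioutils implementation. The names `CODON_TABLE`, `IUPAC_EXPANSIONS`, and `translate_degenerate_codon` are hypothetical, and the table assumes the standard genetic code.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# IUPAC nucleotide codes mapped to the bases each one can stand for.
IUPAC_EXPANSIONS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate_degenerate_codon(codon, ambiguous_aa="X"):
    """Translate one codon, resolving ambiguity codes where possible.

    Hypothetical helper: returns a single amino acid when every expansion
    of the codon agrees, otherwise `ambiguous_aa`.
    """
    if len(codon) != 3:
        raise ValueError("expected a 3-base codon")
    codon = codon.upper().replace("U", "T")  # accept RNA input
    try:
        choices = [IUPAC_EXPANSIONS[base] for base in codon]
    except KeyError:
        # Unrecognized symbol: treated as ambiguous (an assumption).
        return ambiguous_aa
    amino_acids = {CODON_TABLE["".join(bases)] for bases in product(*choices)}
    return amino_acids.pop() if len(amino_acids) == 1 else ambiguous_aa
```

With this approach `translate_degenerate_codon("CUN")` gives `"L"` and `translate_degenerate_codon("AAY")` gives `"N"`, while `"AAN"` still falls back to `"X"` because it covers both Asn and Lys. A codon has at most 64 expansions, so the per-codon cost stays small.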

reece · 2021-04-14T14:38:32Z

src/bioutils/sequences.py

@@ -441,7 +453,12 @@ def translate_cds(seq, full_codons=True, ter_symbol="*"):
    protein_seq = list()
    for i in range(0, len(seq) - len(seq) % 3, 3):
        try:
-            aa = dna_to_aa1_lut[seq[i:i + 3]]
+            the_seq = seq[i:i + 3]
+            wildcard_nucleotides = ["B", "D", "H", "V", "N", "U", "W", "S", "M", "K", "R", "Y", "Z"]


IUPAC calls these "ambiguity codes". Please use that name so that the intent is clearer.
(e.g., something like iupac_ambiguity_codes = "BDHVNUWSMKRYZ")

Also, a list of chars is better written as a string for readability. (Lists and strings are both Sequences and have the same interface for lookup, length, iterability, etc.)
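
A small sketch of the suggested form, using the string from the comment above. The helper name `has_ambiguity_code` is hypothetical, and a generator expression replaces the intermediate list passed to `any` in the diff.

```python
# Codes copied from the suggestion above; a string supports `in` just like a list.
iupac_ambiguity_codes = "BDHVNUWSMKRYZ"


def has_ambiguity_code(codon):
    """Return True if any base in `codon` is one of the listed codes."""
    return any(base in iupac_ambiguity_codes for base in codon)
```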

kayleeyuhas added 2 commits April 13, 2021 10:47

issue 595: support wildcard nucleotides

669a653

fix failing test and reformat

7cc5ebb

reece requested changes Apr 14, 2021

improve variable names and use string instead of list

5d7484b

reece changed the title ~~595 support wildcards~~ hgvs#595 support wildcards Apr 14, 2021

reece changed the title ~~hgvs#595 support wildcards~~ support translation of wildcards Apr 14, 2021

reece changed the title ~~support translation of wildcards~~ support translation of ambiguity codes Apr 14, 2021

reece mentioned this pull request Apr 14, 2021

Improve support for degenerate codons #31

Closed

reece merged commit c5ea54e into biocommons:main Apr 14, 2021
