support translation of ambiguity codes #30
this is linked to this hgvs package issue: biocommons/hgvs#595
@reece we need a solution for this bug immediately, please review ASAP
Will look today. Thanks for the contribution.
Also, assuming this PR is good to merge, could you publish a new version of bioutils to pypi? Thanks!
Thanks for the contribution, Kaylee.
This implementation translates any codon that contains an ambiguity code as X. However, it's often possible to translate codons with ambiguity codes when the ambiguity does not affect the outcome. Since we're adding ambiguity support, I think we should eventually strive for fuller support. I'll follow up on Slack to discuss some options for how to proceed.
src/bioutils/sequences.py
Outdated
@@ -441,7 +453,12 @@ def translate_cds(seq, full_codons=True, ter_symbol="*"):
    protein_seq = list()
    for i in range(0, len(seq) - len(seq) % 3, 3):
        try:
            aa = dna_to_aa1_lut[seq[i:i + 3]]
            the_seq = seq[i:i + 3]
Can we rename the_seq to codon for clarity?
sure thing!
src/bioutils/sequences.py
Outdated
            aa = dna_to_aa1_lut[seq[i:i + 3]]
            the_seq = seq[i:i + 3]
            wildcard_nucleotides = ["B", "D", "H", "V", "N", "U", "W", "S", "M", "K", "R", "Y", "Z"]
            if any([wildcard_base in the_seq for wildcard_base in wildcard_nucleotides]):
As written, any wildcard causes the AA to be X. I think we can do better than that. For example, in a standard translation table, CUN ⇒ Leu, GCN ⇒ Ala, GGN ⇒ Gly, AAY ⇒ Asn, etc. See overall comments for discussion.
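One way to get there, sketched under some assumptions: expand each ambiguity code into the bases it can stand for, translate every concrete codon that results, and only fall back to X when the results disagree. The names below (IUPAC_EXPANSIONS, CODON_TABLE, translate_ambiguous_codon) are hypothetical and not part of bioutils; the sketch also assumes the standard genetic code and treats U as T. It is meant as an illustration of the idea, not as the implementation this PR should adopt.

```python
from itertools import product

# IUPAC nucleotide codes mapped to the concrete bases they can represent.
# (Hypothetical helper table, not taken from bioutils.)
IUPAC_EXPANSIONS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code (NCBI translation table 1), listed in TCAG order.
_BASES = "TCAG"
_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(bases): aa for bases, aa in zip(product(_BASES, repeat=3), _AAS)
}


def translate_ambiguous_codon(codon, unknown="X"):
    """Translate one codon, resolving ambiguity codes when they don't matter.

    Returns the amino acid if every concrete codon matching `codon` encodes
    the same residue, otherwise `unknown`.
    """
    # Assumption: U is handled as the RNA spelling of T.
    codon = codon.upper().replace("U", "T")
    try:
        choices = [IUPAC_EXPANSIONS[base] for base in codon]
    except KeyError:
        # Unrecognized symbol (e.g. "Z"): nothing sensible to expand.
        return unknown
    amino_acids = {CODON_TABLE["".join(bases)] for bases in product(*choices)}
    return amino_acids.pop() if len(amino_acids) == 1 else unknown


# The examples from the comment above, plus one that stays ambiguous.
for codon in ("CUN", "GCN", "GGN", "AAY", "NNN"):
    print(codon, translate_ambiguous_codon(codon))
```

At most 4 × 4 × 4 = 64 concrete codons are checked per input codon, so the expansion is cheap; a precomputed lookup table over all ambiguous codons would be an alternative if this ends up in a hot loop.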
src/bioutils/sequences.py
Outdated
@@ -441,7 +453,12 @@ def translate_cds(seq, full_codons=True, ter_symbol="*"):
    protein_seq = list()
    for i in range(0, len(seq) - len(seq) % 3, 3):
        try:
            aa = dna_to_aa1_lut[seq[i:i + 3]]
            the_seq = seq[i:i + 3]
            wildcard_nucleotides = ["B", "D", "H", "V", "N", "U", "W", "S", "M", "K", "R", "Y", "Z"]
IUPAC calls these "ambiguity codes". Please use that name so that the intent is clearer.
(e.g., something like iupac_ambiguity_codes = "BDHVNUWSMKRYZ")
Also, a list of chars is better written as a string for readability. (Lists and strings are both Sequences and have the same interface for lookup, length, iterability, etc.)
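To illustrate the point about strings and lists, here is a small sketch. The string is copied from the suggestion above; the helper name has_ambiguity_code is hypothetical and only here for demonstration. It also passes a generator to any() rather than building an intermediate list.

```python
# String form suggested in the review, and the equivalent list from the PR.
iupac_ambiguity_codes = "BDHVNUWSMKRYZ"
codes_as_list = ["B", "D", "H", "V", "N", "U", "W", "S", "M", "K", "R", "Y", "Z"]


def has_ambiguity_code(codon, codes=iupac_ambiguity_codes):
    """Return True if any single base in `codon` is one of `codes`."""
    # `base in codes` works the same for a string or a list here because
    # each `base` is a single character.
    return any(base in codes for base in codon)


# Both containers give the same answer for single-character membership.
for codon in ("CTN", "ATG", "AAY"):
    print(codon, has_ambiguity_code(codon), has_ambiguity_code(codon, codes_as_list))
```

One caveat with the string form: `in` on a string is a substring test, so it is only equivalent to list membership when the thing being looked up is a single character, as it is in this loop.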
No description provided.