-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Method for codon optimization of sequences #4368
Conversation
…DS files containing irregularities
…use of CDS files containing irregularities" This reverts commit 920659e. Code was good but inappropriate for anticipated merge request.
remote contains addition of name as contributor. local contains fixes to syntax required for pull request.
Looks like this replaced #4367 - GitHub lets you update your branch and it updates the pull request to match. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've not looked at the details of codon optimisation scoring, but found several generic points that I'd like you to address.
Hopefully someone else can look at the core calculation...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Sorry - I clicked the wrong button - should be changes requested, not approved as it is)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks OK to me (without understanding the core algorithm being implemented)
GitHub says there is a conflict to resolve - not sure where, we might have to fix that at the command line if the GitHub interface won't let me do it here. |
Ah never mind - we can do a squash-and-merge, meaning all your changes would become a single commit. GitHub is saying we can't do a rebase-and-merge, probably because your branch history has several merges in it. |
When I have used this method, I use a CAI generated from an organism's entire .cds file. In this case it is very unlikely that any two codons would be equally preferred (both have value 1). If someone did use the One solution to this issue could be to provide a warning that multiple codons are equivalently preferred, and a message saying which one is being (arbitrarily) used for the optimized DNA sequence. Alternatively, it could refuse to run. Or alternate between the equally-preferred codons. Thoughts? |
Can you make a test case for that corner case? My inclination would be pick something arbitrary but consistent fir a tie-break (e.g. alphabetical sort) and issue a warning ( A random choice is also sensible, it makes the test case harder ever so slightly harder to write ;) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took me two goes, but addressed my remaining concern. I am inclinded to squash-and-merge this pull request.
I need to run black style - at times like this I am reminded we could have that happen automatically on pull requests, e.g. see #4322 |
By hand...
Is there anything I can do to help move this pull request along? Are we waiting on another review? |
@crockeraw Can you add some more explanation to the docstring? It's not clear to me from the docstring what the input and output is, and when one would use this function. |
Bio/SeqUtils/__init__.py
Outdated
if self[codon] == 1.0: | ||
if aminoacid in pref_codons: | ||
message = f"{pref_codons[aminoacid]} and {codon} are equally preferred. Using {codon}" | ||
warnings.warn(message, RuntimeWarning) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that by default, the warning will be shown only once in a single Python session for a specific aminoacid and codon. Calling optimize twice with the same sequence would show the warning only the first time. Also, if a subsequent sequence contains the same codon, then again the warning will not be shown.
A better way is to have strict=True|False argument that specifies if an Exception should be raised if equally preferred codons are found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't object to the suggested strict mode, but personally am content with the proposed warning behaviour.
+1 on @mdehoon's suggestion to expand the docstring. Ideally this would include a (short) doctest, but given the |
I could add a doctest for |
@crockeraw A separate follow up pull request adding doctests to the module would be more welcome. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job on the docstring.
Bio/SeqUtils/__init__.py
Outdated
msg = f"{pref_codons[aminoacid]} and {codon} are equally preferred." | ||
if strict: | ||
raise Exception(msg) | ||
warnings.warn(f"{msg} Using {codon}", RuntimeWarning) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you still need this warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Debatable.
If the default was strict, then non-strict mode wouldn't really need the warning.
But given the default is non-strict, I would keep the warning. I don't have a feel for the way this code might usually be used, but in short batches or interactive coding, the warning would be very helpful - and in large scale analysis can easily be silenced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use case I imagine is evaluating codon preferences from an organisms coding sequences or ribosomal genes, then using those codon preferences to codon-optimize a DNA sequence or set of sequences for expression in the organism of interest.
Equally preferred codons would only be likely when using a single, or a small number of sequences to generate the CAI table. If I were using this I would always want to be made aware of the fact that there are equally preferred codons and that one was used over the other in generating the optimized sequence. I can imagine that someone might want to throw an exception instead, but I do like keeping the warning as default behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with warnings is that they are unreliable. By default, they get triggered only once, and many users won't be aware of this. It gives a false sense of security.
If I were using this I would always want to be made aware of the fact that there are equally preferred codons and that one was used over the other in generating the optimized sequence.
You can do this by setting strict=True
, and if the exception is raised, you can decide whether you want to repeat the calculation with strict=False
. Effectively it's the same as using a warning, but this way it is reliable and reproducible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So @mdehoon would you favour strict=True by default with an exception, and strict=False with no warnings?
Bio/SeqUtils/__init__.py
Outdated
if aminoacid in pref_codons: | ||
msg = f"{pref_codons[aminoacid]} and {codon} are equally preferred." | ||
if strict: | ||
raise Exception(msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ValueError
is more appropriate here than a general Exception
You'll need to update the tests for the change to the default. |
…ched this file otherwise.
Black is failing here but passing locally... Sorry, I'm not sure what is going on. |
Is pre-commit not passing because of issues on the master branch? Due to #4428 |
Yes, the blackend-docs failure is from #4428, you can ignore that. |
Thank you all, that's merged now. |
I hereby agree to dual licence this and any previous contributions under both
the Biopython License Agreement AND the BSD 3-Clause License.
I have read the
CONTRIBUTING.rst
file, have runpre-commit
locally, and understand that continuous integration checks will be used to
confirm the Biopython unit tests and style checks pass with these changes.
I have added my name to the alphabetical contributors listings in the files
NEWS.rst
andCONTRIB.rst
as part of this pull request, am listedalready, or do not wish to be listed. (This acknowledgement is optional.)
I added a method to the CodonAdaptationIndex class in biopython/Bio/SeqUtils/init.py. This method, optimize(), accepts a DNA or protein sequence and returns a DNA sequence coding for the same amino acids, but using only the preferred codons from the CodonAdaptationIndex instance.
I added a couple of lines in biopython/Tests/test_SeqUtils.py to apply the method to a sequence and assert that the calculated CAI after optimizing is equal to 1.