Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Method for codon optimization of sequences #4368

Merged
merged 34 commits into from
Aug 22, 2023
Merged

Conversation

crockeraw
Copy link
Contributor

  • I hereby agree to dual licence this and any previous contributions under both
    the Biopython License Agreement AND the BSD 3-Clause License.

  • I have read the CONTRIBUTING.rst file, have run pre-commit
    locally, and understand that continuous integration checks will be used to
    confirm the Biopython unit tests and style checks pass with these changes.

  • I have added my name to the alphabetical contributors listings in the files
    NEWS.rst and CONTRIB.rst as part of this pull request, am listed
    already, or do not wish to be listed. (This acknowledgement is optional.)

I added a method to the CodonAdaptationIndex class in biopython/Bio/SeqUtils/init.py. This method, optimize(), accepts a DNA or protein sequence and returns a DNA sequence coding for the same amino acids, but using only the preferred codons from the CodonAdaptationIndex instance.

I added a couple of lines in biopython/Tests/test_SeqUtils.py to apply the method to a sequence and assert that the calculated CAI after optimizing is equal to 1.

@crockeraw crockeraw requested a review from peterjc as a code owner July 20, 2023 21:17
@peterjc
Copy link
Member

peterjc commented Jul 21, 2023

Looks like this replaced #4367 - GitHub lets you update your branch and it updates the pull request to match.

Bio/SeqUtils/__init__.py Outdated Show resolved Hide resolved
Bio/SeqUtils/__init__.py Outdated Show resolved Hide resolved
Bio/SeqUtils/__init__.py Outdated Show resolved Hide resolved
Bio/SeqUtils/__init__.py Outdated Show resolved Hide resolved
Copy link
Member

@peterjc peterjc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've not looked at the details of codon optimisation scoring, but found several generic points that I'd like you to address.

Hopefully someone else can look at the core calculation...

@peterjc peterjc self-requested a review July 21, 2023 08:48
Copy link
Member

@peterjc peterjc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Sorry - I clicked the wrong button - should be changes requested, not approved as it is)

Copy link
Member

@peterjc peterjc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK to me (without understanding the core algorithm being implemented)

@peterjc
Copy link
Member

peterjc commented Jul 21, 2023

GitHub says there is a conflict to resolve - not sure where, we might have to fix that at the command line if the GitHub interface won't let me do it here.

@peterjc
Copy link
Member

peterjc commented Jul 21, 2023

Ah never mind - we can do a squash-and-merge, meaning all your changes would become a single commit.

GitHub is saying we can't do a rebase-and-merge, probably because your branch history has several merges in it.

@crockeraw
Copy link
Contributor Author

When I have used this method, I use a CAI generated from an organism's entire .cds file. In this case it is very unlikely that any two codons would be equally preferred (both have value 1). If someone did use the optimize method with multiple equally-preferred codons (for instance if they were relying only on a ribosomal protein sequence) , it would only use the one which came later in standard_dna_table.forward_table.items().

One solution to this issue could be to provide a warning that multiple codons are equivalently preferred, and a message saying which one is being (arbitrarily) used for the optimized DNA sequence. Alternatively, it could refuse to run. Or alternate between the equally-preferred codons. Thoughts?

@peterjc
Copy link
Member

peterjc commented Jul 21, 2023

Can you make a test case for that corner case? My inclination would be pick something arbitrary but consistent fir a tie-break (e.g. alphabetical sort) and issue a warning (import warnings etc).

A random choice is also sensible, it makes the test case harder ever so slightly harder to write ;)

Copy link
Member

@peterjc peterjc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took me two goes, but addressed my remaining concern. I am inclinded to squash-and-merge this pull request.

@peterjc
Copy link
Member

peterjc commented Jul 31, 2023

I need to run black style - at times like this I am reminded we could have that happen automatically on pull requests, e.g. see #4322

By hand...
@crockeraw
Copy link
Contributor Author

Is there anything I can do to help move this pull request along? Are we waiting on another review?

@mdehoon
Copy link
Contributor

mdehoon commented Aug 15, 2023

@crockeraw Can you add some more explanation to the docstring? It's not clear to me from the docstring what the input and output is, and when one would use this function.

if self[codon] == 1.0:
if aminoacid in pref_codons:
message = f"{pref_codons[aminoacid]} and {codon} are equally preferred. Using {codon}"
warnings.warn(message, RuntimeWarning)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that by default, the warning will be shown only once in a single Python session for a specific aminoacid and codon. Calling optimize twice with the same sequence would show the warning only the first time. Also, if a subsequent sequence contains the same codon, then again the warning will not be shown.
A better way is to have strict=True|False argument that specifies if an Exception should be raised if equally preferred codons are found.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't object to the suggested strict mode, but personally am content with the proposed warning behaviour.

@peterjc
Copy link
Member

peterjc commented Aug 16, 2023

+1 on @mdehoon's suggestion to expand the docstring.

Ideally this would include a (short) doctest, but given the CodonAdaptationIndex class doesn't seem to have any at all, that would be a bigger ask.

@crockeraw
Copy link
Contributor Author

I could add a doctest for CodonAdaptationIndex and for each of its functions in a separate pull request if that that would be helpful and cleaner than tacking on here.

@peterjc
Copy link
Member

peterjc commented Aug 17, 2023

@crockeraw A separate follow up pull request adding doctests to the module would be more welcome.

Copy link
Member

@peterjc peterjc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job on the docstring.

msg = f"{pref_codons[aminoacid]} and {codon} are equally preferred."
if strict:
raise Exception(msg)
warnings.warn(f"{msg} Using {codon}", RuntimeWarning)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you still need this warning?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debatable.

If the default was strict, then non-strict mode wouldn't really need the warning.

But given the default is non-strict, I would keep the warning. I don't have a feel for the way this code might usually be used, but in short batches or interactive coding, the warning would be very helpful - and in large scale analysis can easily be silenced.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use case I imagine is evaluating codon preferences from an organisms coding sequences or ribosomal genes, then using those codon preferences to codon-optimize a DNA sequence or set of sequences for expression in the organism of interest.

Equally preferred codons would only be likely when using a single, or a small number of sequences to generate the CAI table. If I were using this I would always want to be made aware of the fact that there are equally preferred codons and that one was used over the other in generating the optimized sequence. I can imagine that someone might want to throw an exception instead, but I do like keeping the warning as default behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with warnings is that they are unreliable. By default, they get triggered only once, and many users won't be aware of this. It gives a false sense of security.

If I were using this I would always want to be made aware of the fact that there are equally preferred codons and that one was used over the other in generating the optimized sequence.

You can do this by setting strict=True, and if the exception is raised, you can decide whether you want to repeat the calculation with strict=False. Effectively it's the same as using a warning, but this way it is reliable and reproducible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So @mdehoon would you favour strict=True by default with an exception, and strict=False with no warnings?

if aminoacid in pref_codons:
msg = f"{pref_codons[aminoacid]} and {codon} are equally preferred."
if strict:
raise Exception(msg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ValueError is more appropriate here than a general Exception

@peterjc
Copy link
Member

peterjc commented Aug 21, 2023

You'll need to update the tests for the change to the default.

@crockeraw
Copy link
Contributor Author

Black is failing here but passing locally... Sorry, I'm not sure what is going on.

@crockeraw
Copy link
Contributor Author

Is pre-commit not passing because of issues on the master branch? Due to #4428
Or is this a separate issue?

@peterjc
Copy link
Member

peterjc commented Aug 21, 2023

Yes, the blackend-docs failure is from #4428, you can ignore that.

@peterjc peterjc merged commit e02bd4e into biopython:master Aug 22, 2023
32 of 33 checks passed
@peterjc
Copy link
Member

peterjc commented Aug 22, 2023

Thank you all, that's merged now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants