Method for codon optimization of sequences #4368

crockeraw · 2023-07-20T21:17:18Z

I hereby agree to dual licence this and any previous contributions under both
the Biopython License Agreement AND the BSD 3-Clause License.
I have read the CONTRIBUTING.rst file, have run pre-commit
locally, and understand that continuous integration checks will be used to
confirm the Biopython unit tests and style checks pass with these changes.
I have added my name to the alphabetical contributors listings in the files
NEWS.rst and CONTRIB.rst as part of this pull request, am listed
already, or do not wish to be listed. (This acknowledgement is optional.)

I added a method to the CodonAdaptationIndex class in biopython/Bio/SeqUtils/__init__.py. This method, optimize(), accepts a DNA or protein sequence and returns a DNA sequence coding for the same amino acids, but using only the preferred codons from the CodonAdaptationIndex instance.

I added a couple of lines in biopython/Tests/test_SeqUtils.py to apply the method to a sequence and assert that the calculated CAI after optimizing is equal to 1.
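For illustration only (this is not the code in the pull request), here is a minimal pure-Python sketch of the idea described above: replace every codon with the preferred codon for the same amino acid, then check that the CAI of the result is 1. The three-amino-acid codon table, the weights, and the function names are all hypothetical stand-ins, and the CAI is computed as the geometric mean of the codon weights.

```python
import math

# Hypothetical, tiny codon table covering only three amino acids.
CODON_TO_AA = {
    "GCT": "A", "GCC": "A",
    "AAA": "K", "AAG": "K",
    "GAA": "E", "GAG": "E",
}

# Hypothetical relative adaptiveness values; 1.0 marks the preferred codon.
WEIGHTS = {
    "GCT": 1.0, "GCC": 0.4,
    "AAA": 0.25, "AAG": 1.0,
    "GAA": 1.0, "GAG": 0.5,
}


def split_codons(dna):
    """Split a DNA string into consecutive three-letter codons."""
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]


def optimize_dna(dna, weights, codon_to_aa):
    """Swap each codon for the highest-weight codon of the same amino acid."""
    best = {}
    for codon, amino_acid in codon_to_aa.items():
        if amino_acid not in best or weights[codon] > weights[best[amino_acid]]:
            best[amino_acid] = codon
    return "".join(best[codon_to_aa[codon]] for codon in split_codons(dna))


def cai(dna, weights):
    """Geometric mean of the codon weights along the sequence."""
    codons = split_codons(dna)
    log_sum = sum(math.log(weights[codon]) for codon in codons)
    return math.exp(log_sum / len(codons))


original = "GCCAAAGAG"  # encodes A, K, E with non-preferred codons
optimized = optimize_dna(original, WEIGHTS, CODON_TO_AA)
```

With these made-up weights, the optimized sequence uses only codons of weight 1.0, so its CAI is 1, while the original scores lower.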

peterjc · 2023-07-21T08:38:01Z

Looks like this replaced #4367 - GitHub lets you update your branch and it updates the pull request to match.

Bio/SeqUtils/__init__.py

Tests/test_SeqUtils.py

peterjc

I've not looked at the details of codon optimisation scoring, but found several generic points that I'd like you to address.

Hopefully someone else can look at the core calculation...

peterjc

(Sorry - I clicked the wrong button - should be changes requested, not approved as it is)

peterjc

Looks OK to me (without understanding the core algorithm being implemented)

peterjc · 2023-07-21T19:03:01Z

GitHub says there is a conflict to resolve - not sure where, we might have to fix that at the command line if the GitHub interface won't let me do it here.

peterjc · 2023-07-21T19:04:20Z

Ah never mind - we can do a squash-and-merge, meaning all your changes would become a single commit.

GitHub is saying we can't do a rebase-and-merge, probably because your branch history has several merges in it.

crockeraw · 2023-07-21T20:25:01Z

When I have used this method, I have used a CAI generated from an organism's entire .cds file. In that case it is very unlikely that any two codons would be equally preferred (both having value 1). If someone did use the optimize method with multiple equally preferred codons (for instance, if they were relying on only a ribosomal protein sequence), it would use only the one that comes later in standard_dna_table.forward_table.items().

One solution to this issue could be to provide a warning that multiple codons are equivalently preferred, and a message saying which one is being (arbitrarily) used for the optimized DNA sequence. Alternatively, it could refuse to run. Or alternate between the equally-preferred codons. Thoughts?
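To make the corner case concrete, here is a small sketch of why the later codon wins when the preferred codons are collected in a single pass over a codon table. The dictionaries are hypothetical stand-ins for standard_dna_table.forward_table and the CAI weights; the loop is an assumption about the general shape of the logic, not a copy of the pull request.

```python
# Hypothetical stand-in for a codon -> amino acid forward table.
forward_table = {"GCT": "A", "GCC": "A", "AAG": "K"}

# Hypothetical weights in which both alanine codons are equally preferred.
weights = {"GCT": 1.0, "GCC": 1.0, "AAG": 1.0}

pref_codons = {}
for codon, amino_acid in forward_table.items():
    if weights[codon] == 1.0:
        # A second codon with weight 1.0 silently overwrites the first.
        pref_codons[amino_acid] = codon
```

Because dictionaries iterate in insertion order, "GCC" replaces "GCT" for alanine here without any indication that a tie occurred.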

peterjc · 2023-07-21T20:31:02Z

Can you make a test case for that corner case? My inclination would be to pick something arbitrary but consistent for a tie-break (e.g. alphabetical sort) and issue a warning (import warnings etc).

A random choice is also sensible, it just makes the test case ever so slightly harder to write ;)
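A rough sketch of the alphabetical tie-break with a warning suggested above, together with the kind of check a test case could make. The function name pick_preferred, the toy tables, and the message wording are hypothetical; only the warnings module calls are standard library.

```python
import warnings


def pick_preferred(weights, codon_to_aa):
    """Choose one preferred codon per amino acid, breaking ties alphabetically."""
    tied = {}
    for codon, amino_acid in codon_to_aa.items():
        if weights[codon] == 1.0:
            tied.setdefault(amino_acid, []).append(codon)
    chosen = {}
    for amino_acid, codons in tied.items():
        codons.sort()  # arbitrary but consistent tie-break
        chosen[amino_acid] = codons[0]
        if len(codons) > 1:
            warnings.warn(
                f"{', '.join(codons)} are equally preferred for {amino_acid}. "
                f"Using {codons[0]}",
                RuntimeWarning,
            )
    return chosen


# Hypothetical inputs with a tie between the two alanine codons.
codon_to_aa = {"GCT": "A", "GCC": "A", "AAG": "K"}
weights = {"GCT": 1.0, "GCC": 1.0, "AAG": 1.0}

# Record warnings instead of printing them, as a test case would.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    chosen = pick_preferred(weights, codon_to_aa)
```

Sorting makes the result independent of the table's iteration order, which is what keeps the test deterministic.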

peterjc

Took me two goes, but addressed my remaining concern. I am inclined to squash-and-merge this pull request.

peterjc · 2023-07-31T18:22:57Z

I need to run black style - at times like this I am reminded we could have that happen automatically on pull requests, e.g. see #4322

By hand...

crockeraw · 2023-08-15T15:55:51Z

Is there anything I can do to help move this pull request along? Are we waiting on another review?

mdehoon · 2023-08-15T23:29:30Z

@crockeraw Can you add some more explanation to the docstring? It's not clear to me from the docstring what the input and output are, and when one would use this function.

mdehoon · 2023-08-16T08:03:34Z

Bio/SeqUtils/__init__.py

+            if self[codon] == 1.0:
+                if aminoacid in pref_codons:
+                    message = f"{pref_codons[aminoacid]} and {codon} are equally preferred. Using {codon}"
+                    warnings.warn(message, RuntimeWarning)


Note that by default, the warning will be shown only once in a single Python session for a specific aminoacid and codon. Calling optimize twice with the same sequence would show the warning only the first time. Also, if a subsequent sequence contains the same codon, then again the warning will not be shown.
A better way is to have a strict=True|False argument that specifies whether an exception should be raised if equally preferred codons are found.

I wouldn't object to the suggested strict mode, but personally am content with the proposed warning behaviour.
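The once-only behaviour described in the review comment above can be reproduced with the standard library alone. In this sketch warn_tie is a hypothetical stand-in for the warning call in the diff; the "default" filter action suppresses a repeat of the same message from the same line, whereas "always" does not.

```python
import warnings


def warn_tie(first, second):
    """Hypothetical stand-in for the warning issued on a codon tie."""
    warnings.warn(
        f"{first} and {second} are equally preferred. Using {second}",
        RuntimeWarning,
    )


# "default": one report per unique message text and source line.
with warnings.catch_warnings(record=True) as default_log:
    warnings.simplefilter("default")
    warn_tie("GCT", "GCC")
    warn_tie("GCT", "GCC")  # same text from the same line: suppressed
    warn_tie("AAA", "AAG")  # different text: reported

# "always": every call is reported.
with warnings.catch_warnings(record=True) as always_log:
    warnings.simplefilter("always")
    warn_tie("GCT", "GCC")
    warn_tie("GCT", "GCC")
```

The repeated tie goes unreported under the default action, which is the reliability concern raised in the review.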

peterjc · 2023-08-16T09:42:24Z

+1 on @mdehoon's suggestion to expand the docstring.

Ideally this would include a (short) doctest, but given the CodonAdaptationIndex class doesn't seem to have any at all, that would be a bigger ask.

crockeraw · 2023-08-16T20:05:09Z

I could add a doctest for CodonAdaptationIndex and for each of its functions in a separate pull request if that would be helpful and cleaner than tacking it on here.

Bio/SeqUtils/__init__.py

peterjc · 2023-08-17T09:55:47Z

@crockeraw A separate follow up pull request adding doctests to the module would be more welcome.

peterjc

Good job on the docstring.

mdehoon · 2023-08-21T04:17:26Z

Bio/SeqUtils/__init__.py

+                    msg = f"{pref_codons[aminoacid]} and {codon} are equally preferred."
+                    if strict:
+                        raise Exception(msg)
+                    warnings.warn(f"{msg} Using {codon}", RuntimeWarning)


Do you still need this warning?

Debatable.

If the default was strict, then non-strict mode wouldn't really need the warning.

But given the default is non-strict, I would keep the warning. I don't have a feel for the way this code might usually be used, but in short batches or interactive coding, the warning would be very helpful - and in large scale analysis can easily be silenced.
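As a sketch of how such a warning could be silenced in a large-scale run, the standard warnings filters can ignore it by message pattern while leaving other warnings visible. The warn_tie helper and the message text are hypothetical stand-ins for the warning in the diff.

```python
import warnings


def warn_tie(first, second):
    """Hypothetical stand-in for the warning issued on a codon tie."""
    warnings.warn(
        f"{first} and {second} are equally preferred. Using {second}",
        RuntimeWarning,
    )


with warnings.catch_warnings(record=True) as recorded:
    warnings.simplefilter("always")
    # Ignore only the tie warning; the pattern is matched against the
    # start of the message text.
    warnings.filterwarnings(
        "ignore",
        message=r".* are equally preferred",
        category=RuntimeWarning,
    )
    warn_tie("GCT", "GCC")  # silenced
    warnings.warn("unrelated problem", RuntimeWarning)  # still recorded
```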

The use case I imagine is evaluating codon preferences from an organism's coding sequences or ribosomal genes, then using those codon preferences to codon-optimize a DNA sequence or set of sequences for expression in the organism of interest.

Equally preferred codons would only be likely when using a single, or a small number of sequences to generate the CAI table. If I were using this I would always want to be made aware of the fact that there are equally preferred codons and that one was used over the other in generating the optimized sequence. I can imagine that someone might want to throw an exception instead, but I do like keeping the warning as default behavior.

The problem with warnings is that they are unreliable. By default, they get triggered only once, and many users won't be aware of this. It gives a false sense of security.

> If I were using this I would always want to be made aware of the fact that there are equally preferred codons and that one was used over the other in generating the optimized sequence.

You can do this by setting strict=True, and if the exception is raised, you can decide whether you want to repeat the calculation with strict=False. Effectively it's the same as using a warning, but this way it is reliable and reproducible.

So @mdehoon would you favour strict=True by default with an exception, and strict=False with no warnings?

mdehoon · 2023-08-21T04:18:12Z

Bio/SeqUtils/__init__.py

+                if aminoacid in pref_codons:
+                    msg = f"{pref_codons[aminoacid]} and {codon} are equally preferred."
+                    if strict:
+                        raise Exception(msg)


ValueError is more appropriate here than a general Exception
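A minimal sketch of the strict mode discussed in this thread, raising ValueError as suggested, plus the "try strict first, then decide" pattern for callers. The helper name preferred_codons and the toy tables are hypothetical, and whether the non-strict path should also warn was still under discussion, so this sketch stays silent there.

```python
def preferred_codons(weights, codon_to_aa, strict=True):
    """Map each amino acid to a codon of weight 1.0 (hypothetical helper)."""
    preferred = {}
    for codon, amino_acid in codon_to_aa.items():
        if weights[codon] != 1.0:
            continue
        if amino_acid in preferred and strict:
            raise ValueError(
                f"{preferred[amino_acid]} and {codon} are equally preferred."
            )
        preferred[amino_acid] = codon  # non-strict: the later codon wins
    return preferred


# Hypothetical inputs with a tie between the two alanine codons.
codon_to_aa = {"GCT": "A", "GCC": "A", "AAG": "K"}
weights = {"GCT": 1.0, "GCC": 1.0, "AAG": 1.0}

try:
    chosen = preferred_codons(weights, codon_to_aa)
    tie_message = None
except ValueError as error:
    # The caller learns about the tie every time, then opts in explicitly.
    tie_message = str(error)
    chosen = preferred_codons(weights, codon_to_aa, strict=False)
```

Catching ValueError specifically lets the caller handle the tie without also swallowing unrelated errors, which a bare Exception would invite.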

peterjc · 2023-08-21T15:40:38Z

You'll need to update the tests for the change to the default.

crockeraw · 2023-08-21T16:49:47Z

Black is failing here but passing locally... Sorry, I'm not sure what is going on.

crockeraw · 2023-08-21T17:31:32Z

Is pre-commit not passing because of issues on the master branch? Due to #4428
Or is this a separate issue?

peterjc · 2023-08-21T18:24:10Z

Yes, the blacken-docs failure is from #4428, you can ignore that.

peterjc · 2023-08-22T10:03:30Z

Thank you all, that's merged now.

crockeraw and others added 10 commits July 20, 2023 10:33

added method CodonAdaptationIndex.optimize() for codon optimization o…

b05be43

…f sequences

added option to ignore illegal codons in CAI. Enables direct use of C…

920659e

…DS files containing irregularities

added test for CodonAdaptationIndex.optimize() in tests_SeqUtils.py

b21517c

minor change to doc string

ebc7236

Revert "added option to ignore illegal codons in CAI. Enables direct …

b4839d7

…use of CDS files containing irregularities" This reverts commit 920659e. Code was good but inappropriate for anticipated merge request.

Update NEWS.rst

db3c0a9

Update CONTRIB.rst

c90deb5

Merge branch 'master' into master

489989b

fixed convention issues identified by pre-commit

3558884

Merge branch 'master' of github.com:crockeraw/biopython-codonOptimizer

8bb2d98

remote contains addition of name as contributor. local contains fixes to syntax required for pull request.

crockeraw requested a review from peterjc as a code owner July 20, 2023 21:17