4.3.6 Evolution
hcwang and qxdu edited on Aug 4, 2023, 1 version
Vaishnav et al. introduced two types of evolutionary scenarios in their work [1]: random genetic drift and strong-selection weak-mutation (SSWM). In each case, they first simulated the scenario, using their convolutional model to predict the expression of each sequence, and then experimentally tested the evolved sequences proposed by the model where possible.
We provide a simplified algorithm for both Genetic Drift and SSWM. Their workflows are described below [1].
These algorithms do not require a generative model. There is no sample_number parameter, since we mutate the input sequences directly and record the entire trajectory.
They first simulated random genetic drift of regulatory sequences, with no selection on expression levels. They randomly introduced a single mutation in each random starting sequence, repeated this process for multiple consecutive generations and used their convolutional model to predict the difference in expression between the mutated sequences in each trajectory relative to the corresponding starting sequence. Expression levels diverged as the number of mutations increased.
The advantage of this algorithm is that each step changes at most one base, which keeps the edit distance controllable. In addition, users can specify variable regions of the sequence, so that only the parts they choose are modified.
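The drift procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not GPro's implementation: `mutate_once`, `drift_trajectory`, and the `start`/`end` arguments that mark the variable region are hypothetical names, and `toy_predictor` (GC fraction) is only a stand-in for a trained expression predictor.

```python
import random

BASES = "ACGT"

def mutate_once(seq, start, end, rng):
    """Substitute one base at a random position in seq[start:end] with a different base."""
    pos = rng.randrange(start, end)
    new_base = rng.choice([b for b in BASES if b != seq[pos]])
    return seq[:pos] + new_base + seq[pos + 1:]

def drift_trajectory(seq, n_steps, start=0, end=None, seed=0):
    """Return [seq_0, ..., seq_n]: one random substitution per generation, no selection."""
    rng = random.Random(seed)
    end = len(seq) if end is None else end
    trajectory = [seq]
    for _ in range(n_steps):
        trajectory.append(mutate_once(trajectory[-1], start, end, rng))
    return trajectory

def toy_predictor(seq):
    """Hypothetical stand-in for a trained predictor: GC fraction of the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

start_seq = "ACGTACGTACGTACGTACGT"
# Only positions 5..14 are allowed to change (the "variable region" in this sketch).
traj = drift_trajectory(start_seq, n_steps=5, start=5, end=15)
# Predicted expression change of each generation relative to the starting sequence.
deltas = [toy_predictor(s) - toy_predictor(start_seq) for s in traj]
```

Because every generation differs from the previous one by exactly one base, the sequence at step k is at most k substitutions away from the starting sequence, which is what keeps the edit distance bounded.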
Input parameters are as follows:
Initialization params:
params | description | default value |
---|---|---|
predictor | predictor model class | None |
predictor_modelpath | trained model path of predictor | None |
natural_datapath | natural sequences datapath | None |
savepath | final results saving directory | None |
Running params:
params | description | default value |
---|---|---|
MaxIter | number of mutation steps (iterations) to run | 5 |
flanking | controls the variable region of the sequences | |
Before running the optimizer, you should have a trained predictor. No generator is needed for this algorithm.
A simple demo looks like this:
from gpro.predictor.cnn_k15.cnn_k15 import CNN_K15_language  # import path assumed from GPro's predictor package
from gpro.optimizer.evolution.drift import Drift
project_path = "your project path"
predictor = CNN_K15_language(length=50)
predictor_modelpath = project_path + '/checkpoints/cnn_k15/checkpoint.pth'
natural_datapath = project_path + '/data/diffusion_prediction/seq.txt'
tmp = Drift(predictor=predictor, predictor_modelpath=predictor_modelpath,
            natural_datapath=natural_datapath, savepath="./optimization/Drift")
tmp.run()
After multiple drift steps, the predictor is used to select the sequences with the best predicted expression across the entire trajectory.
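One plausible reading of this filtering step is sketched below: score every sequence in a trajectory and keep the step with the highest prediction. The function name `best_in_trajectory` and the GC-fraction `toy_predictor` are hypothetical and stand in for GPro's internal logic and a trained model.

```python
def toy_predictor(seq):
    """Hypothetical stand-in for a trained predictor: GC fraction of the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def best_in_trajectory(trajectory, predict):
    """Return (step, sequence, score) of the highest-scoring sequence in a trajectory."""
    scores = [predict(s) for s in trajectory]
    # On ties, max() keeps the earliest step, i.e. the smallest edit distance.
    best_step = max(range(len(trajectory)), key=scores.__getitem__)
    return best_step, trajectory[best_step], scores[best_step]

# A hand-written four-step trajectory; each sequence differs from the previous by one base.
trajectory = ["AAAA", "AACA", "GACA", "GATA"]
step, seq, score = best_in_trajectory(trajectory, toy_predictor)
print(step, seq, score)  # 2 GACA 0.5
```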
The resulting files consist of compared_with_natural.pdf, ExpIter.txt, ExpIter.csv, random_walking_result.csv and each_iter_distribution.pdf.
files | description |
---|---|
compared_with_natural.pdf | Box plot comparing model-generated results with natural results |
ExpIter.txt | FASTA file of the final result sequences |
ExpIter.csv | Sequences and predictions for the final result sequences |
each_iter_distribution.pdf | Box plot of the predicted results at each drift step |
random_walking_result.csv | CSV file recording the predicted expression at each edit distance |
A compared_with_natural box plot is shown below:
In the strong-selection weak-mutation (SSWM) regime, each mutation is either beneficial or deleterious (strong selection, with mutations surviving drift and fixing in an asexual population), and mutation rates are low enough that only single-base substitutions need to be considered during adaptive walks (weak mutation). Starting from a set of native promoter sequences, the authors of [1] considered, at each iteration (generation) and for a given starting sequence of length L, all 3L of its single-base mutational neighbours, used their convolutional model to predict the expression of each, and took the neighbour with the largest increase (or, separately, decrease) in expression as the starting sequence for the next generation.
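The greedy walk described above can be sketched as follows. This is an illustrative sketch rather than GPro's code: `single_base_neighbours`, `sswm_walk`, and the `start`/`end` variable-region arguments are hypothetical names, and `toy_predictor` (GC fraction) replaces a trained model. As a simplifying assumption, the sketch always moves to the best neighbour, even when no neighbour improves on the current sequence.

```python
BASES = "ACGT"

def single_base_neighbours(seq, start=0, end=None):
    """Yield every sequence that differs from seq by one substitution in seq[start:end]."""
    end = len(seq) if end is None else end
    for pos in range(start, end):
        for base in BASES:
            if base != seq[pos]:
                yield seq[:pos] + base + seq[pos + 1:]

def sswm_walk(seq, predict, n_steps, mode="max", start=0, end=None):
    """Greedy adaptive walk: at each generation move to the best single-base neighbour."""
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    pick = max if mode == "max" else min
    trajectory = [seq]
    for _ in range(n_steps):
        # Assumption: always take the best neighbour, even if it does not improve.
        neighbours = single_base_neighbours(trajectory[-1], start, end)
        trajectory.append(pick(neighbours, key=predict))
    return trajectory

def toy_predictor(seq):
    """Hypothetical stand-in for a trained predictor: GC fraction of the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

up = sswm_walk("ATAT", toy_predictor, n_steps=2, mode="max")
down = sswm_walk("GCGC", toy_predictor, n_steps=1, mode="min")
print(up)    # ['ATAT', 'CTAT', 'CCAT']
print(down)  # ['GCGC', 'ACGC']
```

Each generation evaluates 3L candidates for a fully variable sequence of length L, so the cost of a walk is roughly 3L predictor calls per step; batching the neighbours through the model would be the natural optimization in practice.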
Input parameters are as follows:
Initialization params:
params | description | default value |
---|---|---|
predictor | predictor model class | None |
predictor_modelpath | trained model path of predictor | None |
natural_datapath | natural sequences datapath | None |
savepath | final results saving directory | None |
Running params:
params | description | default value |
---|---|---|
MaxIter | number of mutation steps (iterations) to run | 10 |
flanking | controls the variable region of the sequences | None |
mode | whether to maximize or minimize the predicted expression | "max" |
Before running the optimizer, you should have a trained predictor. No generator is needed for this algorithm.
A simple demo looks like this:
from gpro.predictor.cnn_k15.cnn_k15 import CNN_K15_language  # import path assumed from GPro's predictor package
from gpro.optimizer.evolution.sswm import SSWM
project_path = "your project path"
predictor = CNN_K15_language(length=50)
predictor_modelpath = project_path + '/checkpoints/cnn_k15/checkpoint.pth'
natural_datapath = project_path + '/data/diffusion_prediction/seq.txt'
tmp = SSWM(predictor=predictor, predictor_modelpath=predictor_modelpath,
           natural_datapath=natural_datapath, savepath="./optimization/SSWM")
tmp.run()
The resulting files consist of compared_with_natural.pdf, ExpIter.txt, ExpIter.csv, directed_evolution.csv and each_iter_distribution.pdf. The file names generally carry suffixes indicating the mode and the number of iterations.
files | description |
---|---|
compared_with_natural.pdf | Box plot comparing model-generated results with natural results |
ExpIter.txt | FASTA file of the final result sequences |
ExpIter.csv | Sequences and predictions for the final result sequences |
each_iter_distribution.pdf | Box plot of the predicted results at each iteration |
directed_evolution.csv | CSV file recording the predicted expression at each edit distance |
A compared_with_natural_max box plot is shown below (after 10 MaxIter):
A compared_with_natural_min box plot is shown below (after 2 MaxIter):
[1] Vaishnav, E.D., de Boer, C.G., Molinet, J. et al. The evolution, evolvability and engineering of gene regulatory DNA. Nature 603, 455–463 (2022). https://doi.org/10.1038/s41586-022-04506-6