4.3.6 Evolution

hcwang and qxdu edited on Aug 4, 2023, 1 version

Introduction

Vaishnav et al. studied two types of evolutionary scenarios in their work [1]: random genetic drift and strong-selection weak-mutation (SSWM). In each case, they first simulated the scenario, using their convolutional model to predict the expression of each sequence, and then tested the model-evolved sequences experimentally where possible.

We provide a simplified algorithm for both genetic drift and SSWM. Their workflow is described below [1]:

This type of algorithm does not require a generative model. There is no sample_number parameter, because the algorithm mutates the input sequences directly and records the entire trajectory.

Genetic Drift

They first simulated random genetic drift of regulatory sequences, with no selection on expression levels. They randomly introduced a single mutation in each random starting sequence, repeated this process for multiple consecutive generations and used their convolutional model to predict the difference in expression between the mutated sequences in each trajectory relative to the corresponding starting sequence. Expression levels diverged as the number of mutations increased.
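The drift procedure above can be sketched in a few lines. This is a minimal illustration, not GPro's implementation: `mutate_once`, `drift_trajectory` and `toy_predictor` are hypothetical names, and the toy predictor (GC fraction) only stands in for a trained convolutional model.

```python
import random

BASES = "ACGT"

def mutate_once(seq, rng):
    """Substitute one randomly chosen position with a different base."""
    pos = rng.randrange(len(seq))
    new_base = rng.choice([b for b in BASES if b != seq[pos]])
    return seq[:pos] + new_base + seq[pos + 1:]

def toy_predictor(seq):
    """Hypothetical stand-in for a trained model: GC fraction of the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def drift_trajectory(start, n_generations, predict, seed=0):
    """Return [(sequence, predicted expression change vs. start)] for each generation."""
    rng = random.Random(seed)
    baseline = predict(start)
    trajectory = [(start, 0.0)]
    current = start
    for _ in range(n_generations):
        # No selection: every mutation is accepted, whatever its predicted effect.
        current = mutate_once(current, rng)
        trajectory.append((current, predict(current) - baseline))
    return trajectory

trajectory = drift_trajectory("ACGTACGTAC", n_generations=5, predict=toy_predictor)
```

Each entry in `trajectory` differs from the previous one by exactly one base, so the position in the list is an upper bound on the edit distance from the starting sequence.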

The advantage of this algorithm is that each step changes at most one base, which keeps the edit distance controllable. In addition, users can specify variable regions of the sequence, so that only the parts they intend to change are modified.
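The sketch below illustrates both properties: mutations are confined to a variable region, and the Hamming distance to the starting sequence never exceeds the number of steps taken. It assumes the variable region is given as a half-open index interval `(start, end)`; the actual format of GPro's `flanking` parameter may differ, and `mutate_in_region` and `hamming` are hypothetical helpers.

```python
import random

BASES = "ACGT"

def mutate_in_region(seq, region, rng):
    """Substitute one base at a random position inside region = (start, end), end exclusive."""
    start, end = region
    pos = rng.randrange(start, end)
    new_base = rng.choice([b for b in BASES if b != seq[pos]])
    return seq[:pos] + new_base + seq[pos + 1:]

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(1)
start_seq = "AAAAACCCCCGGGGGTTTTT"
region = (5, 15)  # assumed convention: only positions 5..14 may change

seq = start_seq
distances = []
for step in range(1, 9):
    seq = mutate_in_region(seq, region, rng)
    distances.append(hamming(start_seq, seq))
```

The distance can be smaller than the step count, because a later mutation may hit an already mutated position or revert it.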

Input parameters are as follows:

Initialization params:

params	description	default value
predictor	predictor model class	None
predictor_modelpath	trained model path of predictor	None
natural_datapath	natural sequences datapath	None
savepath	final results saving directory	None

Running params:

params	description	default value
MaxIter	number of iterations (generations) to run	5
flanking	control the variable region of the sequences

Before running the optimizer, you should have trained a predictor.

A simple demo looks like this:

from gpro.predictor.cnn_k15.cnn_k15 import CNN_K15_language
from gpro.optimizer.evolution.drift import Drift

project_path = "your project path"

# Predictor architecture and the checkpoint holding its trained weights
predictor = CNN_K15_language(length=50)
predictor_modelpath = project_path + '/checkpoints/cnn_k15/checkpoint.pth'
# Natural sequences used as the starting points for drift
natural_datapath = project_path + '/data/diffusion_prediction/seq.txt'

tmp = Drift(predictor=predictor, predictor_modelpath=predictor_modelpath,
            natural_datapath=natural_datapath, savepath="./optimization/Drift")

tmp.run()

After multiple steps of drift, the predictor is used to select the sequences with the best predicted expression across the entire trajectory.

The resulting files are compared_with_natural.pdf, ExpIter.txt, ExpIter.csv, random_walking_result.csv and each_iter_distribution.pdf.
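One way to read this filtering step is to pool every sequence visited along any trajectory, score it, and keep the top few. The sketch below shows that reading under stated assumptions: `top_k_across_trajectories` and `toy_predictor` are hypothetical names, and the selection rule GPro actually applies may differ.

```python
def top_k_across_trajectories(trajectories, predict, k=3):
    """Pool all sequences visited in any trajectory and keep the k highest-scoring ones."""
    pooled = {seq for traj in trajectories for seq in traj}  # de-duplicate revisited sequences
    scored = sorted(((predict(s), s) for s in pooled), reverse=True)
    return [(s, score) for score, s in scored[:k]]

def toy_predictor(seq):
    """Hypothetical stand-in for a trained model: count of G and C bases."""
    return seq.count("G") + seq.count("C")

trajectories = [
    ["AAAA", "AAGA", "CAGA"],  # trajectory of the first starting sequence
    ["TTTT", "TTTG", "TTTT"],  # second trajectory, which drifts back to its start
]
best = top_k_across_trajectories(trajectories, toy_predictor, k=2)
```

Because drift applies no selection, the best sequence is not necessarily the last one in a trajectory, which is why the whole trajectory is searched.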

files	description
compared_with_natural.pdf	Box plot comparing predictions for the evolved sequences with those for natural sequences
ExpIter.txt	FASTA file of the final result sequences
ExpIter.csv	Final result sequences with their predicted expression
each_iter_distribution.pdf	Box plot of the predictions at each drift iteration
random_walking_result.csv	CSV file recording the predicted expression at each edit distance

A compared_with_natural box plot is shown below:

SSWM (strong-selection weak-mutation)

In the SSWM regime, each mutation is either beneficial or deleterious (strong selection, with mutations surviving drift and fixing in an asexual population), and mutation rates are low enough that only single-base substitutions need to be considered during adaptive walks (weak mutation). Starting with a set of native promoter sequences, at each iteration (generation), for a given starting sequence of length L, the authors considered all of its 3L single-base mutational neighbours, used their convolutional model to predict the neighbours' expression, and took the sequence with the largest increase (or, separately, decrease) in expression as the starting sequence for the next generation [1].
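The adaptive walk described above amounts to a greedy search over the 3L single-base neighbours. The sketch below is a minimal illustration under stated assumptions: `single_base_neighbours`, `sswm_walk` and `toy_predictor` are hypothetical names, the toy predictor stands in for a trained model, and stopping early at a local optimum is a choice of this sketch rather than something the source specifies.

```python
BASES = "ACGT"

def single_base_neighbours(seq):
    """Yield all 3L sequences that differ from seq by exactly one substitution."""
    for pos, old in enumerate(seq):
        for base in BASES:
            if base != old:
                yield seq[:pos] + base + seq[pos + 1:]

def sswm_walk(start, predict, max_iter=10, mode="max"):
    """Greedy adaptive walk: at each generation, move to the best-scoring neighbour."""
    sign = 1 if mode == "max" else -1  # flip scores so that "min" also uses max()
    path = [start]
    current = start
    for _ in range(max_iter):
        best = max(single_base_neighbours(current), key=lambda s: sign * predict(s))
        if sign * predict(best) <= sign * predict(current):
            break  # local optimum: no neighbour improves on the current sequence
        current = best
        path.append(current)
    return path

def toy_predictor(seq):
    """Hypothetical stand-in for a trained model: count of G and C bases."""
    return seq.count("G") + seq.count("C")

up = sswm_walk("AATT", toy_predictor, max_iter=10, mode="max")
down = sswm_walk("GCGC", toy_predictor, max_iter=2, mode="min")
```

Each generation costs 3L predictor calls, so in practice the neighbours of a generation would usually be scored in one batch.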

Input parameters are as follows:

Initialization params:

params	description	default value
predictor	predictor model class	None
predictor_modelpath	trained model path of predictor	None
natural_datapath	natural sequences datapath	None
savepath	final results saving directory	None

Running params:

params	description	default value
MaxIter	number of iterations (generations) to run	10
flanking	control the variable region of the sequences	None
mode	whether to maximize or minimize the predicted expression	"max"

Before running the optimizer, you should have trained a predictor.

A simple demo looks like this:

from gpro.predictor.cnn_k15.cnn_k15 import CNN_K15_language
from gpro.optimizer.evolution.sswm import SSWM

project_path = "your project path"

# Predictor architecture and the checkpoint holding its trained weights
predictor = CNN_K15_language(length=50)
predictor_modelpath = project_path + '/checkpoints/cnn_k15/checkpoint.pth'
# Natural sequences used as the starting points for the adaptive walk
natural_datapath = project_path + '/data/diffusion_prediction/seq.txt'

tmp = SSWM(predictor=predictor, predictor_modelpath=predictor_modelpath,
           natural_datapath=natural_datapath, savepath="./optimization/SSWM")

tmp.run()

The resulting files are compared_with_natural.pdf, ExpIter.txt, ExpIter.csv, directed_evolution.csv and each_iter_distribution.pdf. The file names generally carry suffixes indicating the mode and the number of iterations.

files	description
compared_with_natural.pdf	Box plot comparing predictions for the evolved sequences with those for natural sequences
ExpIter.txt	FASTA file of the final result sequences
ExpIter.csv	Final result sequences with their predicted expression
each_iter_distribution.pdf	Box plot of the predictions at each iteration
directed_evolution.csv	CSV file recording the predicted expression at each edit distance

A compared_with_natural_max box plot is shown below (after 10 MaxIter):

A compared_with_natural_min box plot is shown below (after 2 MaxIter):

Citations

[1] Vaishnav, E.D., de Boer, C.G., Molinet, J. et al. The evolution, evolvability and engineering of gene regulatory DNA. Nature 603, 455–463 (2022). https://doi.org/10.1038/s41586-022-04506-6
