# Introduction

The multievolve package uses four modules to prepare data and deploy models for machine learning-guided directed evolution.

1. ```Splitters``` are used to split the dataset into training, validation, and test sets.

2. ```Featurizers``` are used to featurize the sequences.

3. ```Predictors``` are used to train and deploy models to perform property prediction.

4. ```Proposers``` are used to propose and evaluate new sequences given a list of trained models.

For developers, additional classes can be added to each module to implement custom functionality.

In [1]:
import pandas as pd  # used below for pd.read_csv and pd.DataFrame
from multievolve.splitters import *
from multievolve.featurizers import *
from multievolve.predictors import *
from multievolve.proposers import *

## Setting up

First, define the following variables:
- ```experiment_name```: the name of the experiment

- ```protein_name```: the name of the protein

- ```wt_file```: the path to the wildtype sequence

- ```training_dataset_fname```: the path to the training dataset

In [2]:
experiment_name = "example_experiment"
protein_name = "example_protein"
wt_file = "../../data/example_protein/apex.fasta"
training_dataset_fname = '../../data/example_protein/example_dataset.csv'

Datasets should be in CSV format with the following columns:

- ```mutation```: the mutation, formatted as ```A123V```, wherein multi-mutants are separated by forward slashes (```/```). If there is no mutation, the value should be ```WT```.

- ```property_value```: the property value

- ```evolution_round```: the evolution round in which the variant was measured (optional)
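To make the mutation format concrete, here is a minimal sketch that parses a mutation string and applies it to a wildtype sequence. The helper names ```parse_mutations``` and ```apply_mutations``` are hypothetical and not part of multievolve, and the sketch assumes positions are 1-indexed against the wildtype sequence.

```python
import re

# One substitution such as "A123V": wildtype residue, position, new residue
MUTATION_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutations(mutation_string):
    """Split a string like 'A123V/T45K' into (wt_aa, position, mut_aa) tuples."""
    if mutation_string == "WT":
        return []
    parsed = []
    for token in mutation_string.split("/"):
        match = MUTATION_PATTERN.match(token)
        if match is None:
            raise ValueError(f"Unrecognized mutation: {token!r}")
        wt_aa, position, mut_aa = match.groups()
        parsed.append((wt_aa, int(position), mut_aa))
    return parsed


def apply_mutations(wt_seq, mutation_string):
    """Return the variant sequence (positions assumed to be 1-indexed)."""
    residues = list(wt_seq)
    for wt_aa, position, mut_aa in parse_mutations(mutation_string):
        # Check the stated wildtype residue against the wildtype sequence
        if wt_seq[position - 1] != wt_aa:
            raise ValueError(f"Expected {wt_aa} at position {position}")
        residues[position - 1] = mut_aa
    return "".join(residues)


variant = apply_mutations("MGKS", "G2A/S4T")
```

A check like the one inside ```apply_mutations``` is a cheap way to catch off-by-one numbering in a dataset before training.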

In [None]:
df = pd.read_csv(training_dataset_fname)
df.head()

# Splitters

Several splitters are available in the ```splitters``` module. Each splitter can split the dataset into training, validation, and test sets using different strategies. To learn more about the splitters, check out the ```splitters.ipynb``` notebook.

Each splitter has the following parameters:

- ```protein_name```: the name of the protein

- ```training_dataset_fname```: the path to the training dataset

- ```wt_file```: the path to the wildtype sequence

- ```csv_has_header```: whether the CSV has a header

- ```use_cache```: whether to cache the processed dataset for later use (default: ```False```)

- ```y_scaling```: whether to scale the property values between 0 and 1 (default: ```False```)

- ```val_split```: the proportion of the dataset to include in the validation set (default: ```None```). The validation set is only used when training neural network models.
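As an illustration of what scaling property values between 0 and 1 can look like, the sketch below uses plain min-max scaling. This is an assumption about the general technique, not a statement of how multievolve implements the scaling, and the function name ```min_max_scale``` is hypothetical.

```python
def min_max_scale(values):
    """Linearly rescale values so the minimum maps to 0 and the maximum to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All values are identical, so avoid dividing by zero
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


scaled = min_max_scale([2.0, 4.0, 6.0])
```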

We will initialize two splitters: one for non-neural network models and one for neural network models. For the neural network models, we will hold out 15% of the data as a validation set.

In [4]:
splitter = RandomProteinSplitter(protein_name, training_dataset_fname, wt_file, csv_has_header=True, use_cache=True, y_scaling=False, val_split=None)
splitter_nn = RandomProteinSplitter(protein_name, training_dataset_fname, wt_file, csv_has_header=True, use_cache=True, y_scaling=False, val_split=0.15)

Use the ```data``` attribute of the splitter to view the loaded dataset.

In [None]:
splitter.data.head()

After initializing the splitter, run ```splitter.split_data()``` to split the data. For ```RandomProteinSplitter```, the ```split_data()``` method takes the following parameters:

- ```test_size```: the proportion of the dataset to include in the test set

In [None]:
splitter.split_data(test_size=0.15)
splitter_nn.split_data(test_size=0.15)

And that's it! The dataset has now been split and can be fed into the ```Predictors``` module to train and deploy models. The ```splits``` attribute of the splitter stores the resulting training, validation, and test sets as a dictionary.

In [None]:
splitter.splits.keys()

In [None]:
print(splitter.splits['X_train'][:3])
print(splitter.splits['y_train'][:3])
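To make the roles of ```test_size``` and ```val_split``` concrete, here is a minimal sketch of a random three-way split using only the standard library. It assumes both fractions are taken relative to the full dataset; the function ```random_split``` and its dictionary keys are hypothetical and do not mirror the internals of ```RandomProteinSplitter```.

```python
import random


def random_split(n_samples, test_size, val_split=None, seed=0):
    """Shuffle sample indices and partition them into train/val/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility

    n_test = round(n_samples * test_size)
    n_val = round(n_samples * val_split) if val_split else 0

    return {
        "test": indices[:n_test],
        "val": indices[n_test:n_test + n_val],
        "train": indices[n_test + n_val:],
    }


example_splits = random_split(100, test_size=0.15, val_split=0.15)
no_val_splits = random_split(100, test_size=0.15)
```

With ```val_split=None``` the validation list is empty, which matches the setup used above for non-neural network models.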

# Featurizers

Featurizers are used to featurize the sequences. To learn more about the different featurizers, check out the ```featurizers.ipynb``` notebook.

Each featurizer has the following parameters:

- ```protein```: the name of the protein for caching

- ```use_cache```: whether to cache the features for later use (default: ```False```)

- ```flatten_features```: whether to flatten the feature vectors (default: ```False```)


In [9]:
featurizer = OneHotFeaturizer(protein=protein_name, use_cache=True, flatten_features=False)
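For readers new to one-hot featurization, the sketch below shows the general idea and what flattening does to the output shape. It is a standalone illustration, not the ```OneHotFeaturizer``` implementation; the 20-letter alphabet ordering and the function name ```one_hot_encode``` are assumptions.

```python
# Assumed alphabet: the 20 canonical amino acids in alphabetical order
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, flatten=False):
    """Encode a sequence as a (sequence_length, 20) matrix of 0s and 1s."""
    matrix = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[AA_INDEX[aa]] = 1
        matrix.append(row)
    if flatten:
        # Concatenate the rows into one vector of length sequence_length * 20
        return [value for row in matrix for value in row]
    return matrix


features = one_hot_encode("MGK")
flat_features = one_hot_encode("MGK", flatten=True)
```

The unflattened form keeps the position axis, which is what a convolutional model scans across; the flattened form is a single vector per sequence.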

# Predictors

Predictors are used to train and deploy models to perform property prediction. To learn more about the different predictors, check out the documentation.

Each predictor has the following parameters:

- ```splitter```: the splitter to use

- ```featurizer```: the featurizer to use

- ```use_cache```: whether to cache the model for later use (default: ```False```)

- ```show_plots```: whether to show matplotlib plots (default: ```True```)

## Training non-neural network models

There are several models available in the ```predictors``` module. To learn more about the different models, check out the documentation.

- ```RidgeRegressor```: a ridge regression model

- ```RandomForestRegressor```: a random forest regression model

- ```GPLinearRegressor```: a Gaussian process linear regression model

- ```GPQuadRegressor```: a Gaussian process quadratic regression model

- ```GPRBFRegressor```: a Gaussian process radial basis function regression model

In [None]:
predictor = RidgeRegressor(splitter, featurizer, use_cache=True, show_plots=True)

After initializing the predictor, run ```predictor.run_model()``` to train and deploy the model. This command returns a dictionary of performance statistics and displays a plot of the model's performance on the test set.

In [None]:
stats = predictor.run_model()

In [None]:
pd.DataFrame([stats]).transpose().rename(columns={0: 'Value'})

That's it! The model has now been trained and deployed and can be used to predict the property of new sequences.
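The exact contents of the statistics dictionary are not listed here. As a rough illustration of the kind of test-set metrics such a dictionary might hold, the sketch below computes a Pearson correlation and a mean squared error by hand. The function name ```regression_stats``` and the keys ```pearson_r``` and ```mse``` are hypothetical and may not match what ```run_model()``` returns.

```python
import math


def regression_stats(y_true, y_pred):
    """Compute Pearson correlation and mean squared error for two lists."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n

    covariance = sum((t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred))
    var_true = sum((t - mean_true) ** 2 for t in y_true)
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred)

    pearson_r = covariance / math.sqrt(var_true * var_pred)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    return {"pearson_r": pearson_r, "mse": mse}


example_stats = regression_stats([1.0, 2.0, 3.0], [1.5, 2.5, 3.5])
```

In this toy case the predictions are perfectly correlated with the measurements but consistently offset, which is why a correlation metric and an error metric are worth reading together.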

## Training neural network models

Training neural network models is similar to training non-neural network models. The only major difference is that neural network models require a ```config``` dictionary to specify the network architecture and training settings.

In the multievolve package, there are two simple neural network models available: ```Fcn``` and ```Cnn```. ```Fcn``` is a fully connected neural network and ```Cnn``` is a convolutional neural network.

First, we will train a fully connected neural network.

### Fully connected neural network

The config dictionary for the fully connected neural network has the following parameters:

- ```layer_size```: the number of neurons in the hidden layers

- ```num_layers```: the number of hidden layers

- ```learning_rate```: the learning rate for the optimizer

- ```batch_size```: the batch size for training

- ```optimizer```: the optimizer to use (default: ```adam```)

- ```epochs```: the number of epochs to train for

In [None]:
# config
config = {
          'layer_size': 100,
          'num_layers' : 2,
          'learning_rate': 0.001,
          'batch_size': 32,
          'optimizer': 'adam',
          'epochs': 300
}

fcn_model = Fcn(splitter_nn, featurizer, config=config, use_cache=True, show_plots=True)
stats = fcn_model.run_model()
pd.DataFrame([stats]).transpose().rename(columns={0: 'Value'})
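To build intuition for how ```layer_size``` and ```num_layers``` shape a fully connected network, the sketch below lists the weight matrix shapes and counts trainable parameters. It assumes the featurized input is flattened before the first dense layer and that the network ends in a single scalar output; both are assumptions about a generic architecture, not a description of the ```Fcn``` class. The helper names and the 250-residue input are hypothetical.

```python
def fcn_layer_shapes(input_dim, config):
    """Return (n_in, n_out) for each dense layer, ending in one scalar output."""
    sizes = [input_dim] + [config["layer_size"]] * config["num_layers"] + [1]
    return list(zip(sizes[:-1], sizes[1:]))


def count_parameters(shapes):
    """Count weights plus biases across all dense layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)


example_config = {"layer_size": 100, "num_layers": 2}
input_dim = 250 * 20  # hypothetical 250-residue protein, one-hot encoded

shapes = fcn_layer_shapes(input_dim, example_config)
n_params = count_parameters(shapes)
```

Under these assumptions almost all parameters sit in the first layer, so the input dimension matters more for model size than the number of hidden layers.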

### Convolutional neural network

The convolutional neural network is a 2D convolutional neural network that scans across the featurized protein sequences with dimensions of ```(sequence_length, feature_length)``` with filter size ```(kernel_size, feature_length)```. When using the ```Cnn``` class, make sure to set ```flatten_features=False``` in the ```Featurizer``` class.

The config dictionary for the convolutional neural network has the following parameters:

- ```layersize_filtersize```: the number of hidden layers and the number of filters separated by a dash (```-```)

- ```kernel_size```: the kernel size for the convolutional layer

- ```learning_rate```: the learning rate for the optimizer

- ```batch_size```: the batch size for training

- ```optimizer```: the optimizer to use (default: ```adam```)

- ```epochs```: the number of epochs to train for

In [None]:
config = {
          'layersize_filtersize': "1-12",
          'kernel_size' : 17,
          'learning_rate':0.001,
          'batch_size': 32,
          'optimizer': 'adam',
          'epochs': 10
}

cnn_model = Cnn(splitter_nn, featurizer, config=config, use_cache=True)
stats = cnn_model.run_model()
pd.DataFrame([stats]).transpose().rename(columns={0: 'Value'})
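The sketch below shows how a ```layersize_filtersize``` string can be split into its two numbers, and how many positions a kernel of a given size visits along the sequence. It assumes a stride of 1 and no padding, which may differ from what the ```Cnn``` class does; the helper names and the 250-residue sequence length are hypothetical.

```python
def parse_layersize_filtersize(value):
    """Split a string like '1-12' into (number of layers, number of filters)."""
    num_layers, num_filters = value.split("-")
    return int(num_layers), int(num_filters)


def conv_output_length(sequence_length, kernel_size, stride=1, padding=0):
    """Standard output-length formula for a convolution along one axis."""
    return (sequence_length + 2 * padding - kernel_size) // stride + 1


num_layers, num_filters = parse_layersize_filtersize("1-12")

# A kernel spanning the full feature axis only slides along the sequence axis
output_length = conv_output_length(sequence_length=250, kernel_size=17)
```

Because the filter covers the whole feature axis, each filter produces one value per window of ```kernel_size``` consecutive residues.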

# Proposers

Proposers are used to propose new sequences given a list of trained models. To learn more about the different proposers, check out the documentation. Generally, we use the ```CombinatorialProposer``` to propose new sequences.

Each proposer has the following parameters:

- ```start_seq```: the starting sequence to mutate, generally the wildtype sequence

- ```models```: the list of trained models

- ```trust_radius```: the maximum number of mutations allowed in the proposed variant

- ```num_seeds```: the maximum number of sequences to propose for evaluation; ```-1``` means all possible variants are tested

- ```mutation_pool```: the list of allowed mutations for generating the proposed variants

In [14]:
wt_seq = 'MGKSYPTVSADYQDAVEKAKKKLRGFIAEKRCAPLMLRLAFHSAGTFDKGTKTGGPFGTIKHPAELAHSANNGLDIAVRLLEPLKAEFPILSYADFYQLAGVVAVEVTGGPKVPFHPGREDKPEPPPEGRLPDATKGSDHLRDVFGKAMGLTDQDIVALSGGHTIGAAHKERSGFEGPWTSNPLIFDNSYFTELLSGEKEGLLQLPSDKALLSDPVFRPLVDKYAADEDAFFADYAEAHQKLSELGFADA'
mutations = ['T192V', 'T192K', 'A167R', 'N72A', 'D222E', 'A148Q', 'D229A', 'S138A', 'K61R', 'S196A', 'I185V', 'L84V', 'E87Q', 'G50R', 'L80M']

proposer = CombinatorialProposer(
    start_seq=wt_seq,
    models=[fcn_model], 
    trust_radius=10, 
    num_seeds=-1, 
    # num_seeds=20, 
    mutation_pool=mutations)
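To illustrate how a mutation pool and a trust radius define a combinatorial search space, here is a minimal sketch built on ```itertools.combinations```. It assumes that two mutations at the same position (such as T192V and T192K) cannot be combined. The function ```enumerate_variants``` is hypothetical and is not the ```CombinatorialProposer``` implementation.

```python
from itertools import combinations


def position_of(mutation):
    """Extract the position from a mutation such as 'T192V'."""
    return int(mutation[1:-1])


def enumerate_variants(mutation_pool, trust_radius):
    """List all combinations of up to trust_radius mutations at distinct positions."""
    variants = []
    max_size = min(trust_radius, len(mutation_pool))
    for size in range(1, max_size + 1):
        for combo in combinations(mutation_pool, size):
            positions = [position_of(m) for m in combo]
            # Skip combinations that mutate the same position twice
            if len(set(positions)) == len(positions):
                variants.append("/".join(combo))
    return variants


toy_variants = enumerate_variants(["A1V", "A1K", "G2R"], trust_radius=2)
```

The number of combinations grows quickly with the pool size and trust radius, which is the practical reason to cap the number of proposed sequences with a positive ```num_seeds``` value.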

After initializing the proposer, run ```proposer.propose()``` to propose new sequences. This command returns a dataframe of the proposed sequences.

In [None]:
proposal_results = proposer.propose()
proposal_results.head()

After proposing new sequences, run ```proposer.evaluate_proposals()``` to evaluate the proposed sequences. This command returns a dataframe of proposed sequences and their evaluations.

In [None]:
proposer.evaluate_proposals()

The ```proposals``` dataframe now contains the proposed sequences and their evaluations.

In [None]:
proposer.proposals.head()

The results can now be saved to a CSV file using ```proposer.save_proposals()```. Results will be saved in the following folder: ```destination_folder/proposers/results/```

In [19]:
proposer.save_proposals(f'{experiment_name}_proposals') 

An alternative proposer is the ```DeepMutationalScanningProposer```, which generates every possible single amino acid substitution and predicts the property of each proposed sequence.

In [None]:
dms_proposer = DeepMutationalScanningProposer(
    start_seq=wt_seq, 
    models=[predictor]
    )
dms_proposer.propose()
dms_proposer.evaluate_proposals()
dms_proposer.save_proposals(f'{experiment_name}_dms_proposals') 

In [None]:
dms_proposer.proposals.head()
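For a sense of scale, a deep mutational scan over a sequence of length L contains 19 × L single substitutions. The sketch below enumerates them for a toy sequence. The function name ```single_substitutions``` and the 20-letter alphabet are assumptions for illustration and are not taken from the ```DeepMutationalScanningProposer``` implementation.

```python
# Assumed alphabet: the 20 canonical amino acids
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_substitutions(sequence):
    """List every single amino acid substitution, formatted like 'M1A'."""
    mutations = []
    for position, wt_aa in enumerate(sequence, start=1):  # 1-indexed positions
        for mut_aa in AMINO_ACIDS:
            if mut_aa != wt_aa:
                mutations.append(f"{wt_aa}{position}{mut_aa}")
    return mutations


dms_mutations = single_substitutions("MGK")
```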

# (Aside) Cache save locations

A new directory named ```proteins``` will be created. Inside it, a folder named after ```protein_name``` holds the cache folders for splitters, featurizers, predictors, and proposers. The cache folder organization will look like this:

```
example_protein/
├── example_dataset.csv
├── feature_cache/
│   └── onehot/
├── model_cache/
│   └── example_dataset/
│       ├── objects/
│       └── results/
├── proposers/
│   └── results/
└── split_cache/
    └── example_dataset/
```

The ```feature_cache``` folder contains the featurized sequences separated based on featurizer type.

The ```model_cache``` folder contains the predictor objects separated by dataset. The ```objects``` folder contains the saved models and the ```results``` folder contains results generated when comparing multiple models (seen later in the ```Part_2_comparing_multiple_models.ipynb``` notebook).

The ```proposers``` folder contains the results of the evaluated proposed sequences.

The ```split_cache``` folder contains the splitter objects separated by dataset. 

If you check the ```file_attrs``` attribute of the splitter or predictor, you will see the cache save locations of the objects.

In [None]:
splitter.file_attrs

In [None]:
predictor.file_attrs

# Summary

Overall, we have seen how to use the ```splitters```, ```featurizers```, ```predictors```, and ```proposers``` modules to train and deploy models for property prediction in a few lines of code. The code is separated into these four modules so that different data splits, featurizations, and models can be compared. A full example that trains a simple ridge regression model and proposes new sequences is shown below. By changing a single line of code, you can swap in a different data split method, featurization, or model.

For streamlined comparison of multiple methods of data splits, featurizations, and models, head over to ```Part_2_comparing_multiple_models.ipynb```.

Again, to learn more about the different modules, check out the ```splitters.ipynb``` and ```featurizers.ipynb``` notebooks.

## Full Example

In [None]:
# Define variables
experiment_name = "example_experiment"
protein_name = "example_protein"
wt_file = "../../data/example_protein/apex.fasta"
training_dataset_fname = '../../data/example_protein/example_dataset.csv'

# Initialize splitter
splitter = RandomProteinSplitter(protein_name, training_dataset_fname, wt_file, csv_has_header=True, use_cache=True, y_scaling=False, val_split=None)
splitter.split_data(test_size=0.15)

# Initialize featurizer
featurizer = OneHotFeaturizer(protein=protein_name, use_cache=True, flatten_features=False)

# Initialize predictor
predictor = RidgeRegressor(splitter, featurizer, use_cache=True)
stats = predictor.run_model()

# Initialize proposer
wt_seq = 'MGKSYPTVSADYQDAVEKAKKKLRGFIAEKRCAPLMLRLAFHSAGTFDKGTKTGGPFGTIKHPAELAHSANNGLDIAVRLLEPLKAEFPILSYADFYQLAGVVAVEVTGGPKVPFHPGREDKPEPPPEGRLPDATKGSDHLRDVFGKAMGLTDQDIVALSGGHTIGAAHKERSGFEGPWTSNPLIFDNSYFTELLSGEKEGLLQLPSDKALLSDPVFRPLVDKYAADEDAFFADYAEAHQKLSELGFADA'
mutations = ['T192V', 'T192K', 'A167R', 'N72A', 'D222E', 'A148Q', 'D229A', 'S138A', 'K61R', 'S196A', 'I185V', 'L84V', 'E87Q', 'G50R', 'L80M']
proposer = CombinatorialProposer(start_seq=wt_seq, models=[predictor], trust_radius=10, num_seeds=20, mutation_pool=mutations)
proposal_results = proposer.propose()
proposer.evaluate_proposals()
proposer.save_proposals(f'{experiment_name}_proposals') 