# Introduction

In the previous notebook, we saw how to use the ```splitters```, ```featurizers```, ```predictors```, and ```proposers``` modules to train and deploy models for property prediction. In this notebook, we will see how to compare multiple data splits, featurizations, and models in a single run. We will initialize several different ```splitters```, ```featurizers```, and ```predictors``` and test them all at once.

In [1]:
from multievolve.splitters import *
from multievolve.featurizers import *
from multievolve.predictors import *
from multievolve.proposers import *

## Setting up

First, define the following variables as before:
- ```experiment_name```: the name of the experiment

- ```protein_name```: the name of the protein

- ```wt_file```: the path to the wildtype sequence

- ```training_dataset_fname```: the path to the training dataset

In [2]:
experiment_name = "example_experiment"
protein_name = "example_protein"
wt_file = "../../data/example_protein/apex.fasta"
training_dataset_fname = '../../data/example_protein/example_dataset.csv'

## Non-neural network models

We start with an example for non-neural network models.

First, we initialize our desired splitters and save them in a list.

In [None]:
random_splitter = RandomProteinSplitter(protein_name, training_dataset_fname, wt_file, csv_has_header=True, use_cache=True, y_scaling=False, val_split=None)
round_splitter = RoundProteinSplitter(protein_name, training_dataset_fname, wt_file, csv_has_header=True, use_cache=True, y_scaling=False, val_split=None)

random_splitter.split_data(test_size=0.2)
round_splitter.split_data(min_test_round=1, max_train_round=0)

splitters = [random_splitter, round_splitter]
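To make the difference between the two splitting strategies concrete, here is a minimal sketch in plain Python. It does not use the library's splitter classes. The toy records, the helper names `random_split` and `round_split`, and the reading of `max_train_round` and `min_test_round` as inclusive bounds on the experimental round are all assumptions made for illustration.

```python
import random

# Toy records: (variant, round, activity). All values are hypothetical.
records = [
    ("A10G", 0, 1.2), ("L25F", 0, 0.8), ("K41R", 0, 1.0), ("S69A", 0, 0.5),
    ("D14N", 0, 1.1),
    ("A10G/L25F", 1, 1.9), ("K41R/S69A", 1, 1.4), ("A10G/K41R", 1, 2.1),
    ("L25F/D14N", 1, 0.9), ("A10G/L25F/K41R", 2, 2.6),
]

def random_split(rows, test_size, seed=0):
    """Shuffle the rows and hold out a fraction `test_size` as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

def round_split(rows, max_train_round, min_test_round):
    """Train on early rounds and test on later rounds (assumed semantics)."""
    train = [r for r in rows if r[1] <= max_train_round]
    test = [r for r in rows if r[1] >= min_test_round]
    return train, test

random_train, random_test = random_split(records, test_size=0.2)
round_train, round_test = round_split(records, max_train_round=0, min_test_round=1)

print(len(random_train), len(random_test))  # 8 2
print(len(round_train), len(round_test))    # 5 5
```

Under these assumptions, a random split measures interpolation within the pooled data, while a round-based split asks whether a model trained on earlier rounds generalizes to variants from later rounds.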

Next, we initialize our desired featurizers.

In [4]:
onehot = OneHotFeaturizer(protein=protein_name, use_cache=True)
georgiev = GeorgievFeaturizer(protein=protein_name, use_cache=True)

featurizers = [onehot, georgiev]
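As a reminder of what a one-hot featurization produces, the sketch below encodes a short sequence in plain Python. It is not the library's `OneHotFeaturizer`: the alphabet ordering, the flattened output layout, and the helper name `one_hot_encode` are assumptions for illustration, and the Georgiev featurization is not shown.

```python
# 20 canonical amino acids; this ordering is an assumption for illustration.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence):
    """Return a flat list of length len(sequence) * 20 with a single 1 per residue."""
    vector = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[AA_INDEX[residue]] = 1
        vector.extend(row)
    return vector

encoded = one_hot_encode("MKV")
print(len(encoded))  # 60
print(sum(encoded))  # 3
```

Each residue contributes one block of 20 entries, so the feature length grows linearly with the sequence length.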

Next, we initialize our desired predictors in the form of a list.

In [5]:
predictors = [RidgeRegressor, RandomForestRegressor]

Finally, we run the function ```run_model_experiments``` to train and test all the models. The function returns a pandas dataframe with the results. The results are also saved in the folder ```model_cache/dataset_name/results```, located in the directory that contains the training dataset.

```run_model_experiments``` takes the following arguments:

- ```splitters```: A list of splitters to use for the experiment.

- ```featurizers```: A list of featurizers to use for the experiment.

- ```predictors```: A list of predictors to use for the experiment.

- ```experiment_name```: The name of the experiment.

- ```use_cache```: Whether to use the cache for the splitters, featurizers, and predictors.
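The size of such an experiment can be estimated before running it. The sketch below assumes that every splitter is paired with every featurizer and every predictor, which this notebook does not state explicitly. The string labels stand in for the objects defined above and are purely illustrative.

```python
from itertools import product

# Labels standing in for the splitter, featurizer, and predictor objects.
splitter_names = ["random", "round"]
featurizer_names = ["onehot", "georgiev"]
predictor_names = ["RidgeRegressor", "RandomForestRegressor"]

# Full cross product: one trained model per (splitter, featurizer, predictor).
combinations = list(product(splitter_names, featurizer_names, predictor_names))

for split, feat, pred in combinations:
    print(f"{split} split | {feat} features | {pred}")

print(len(combinations))  # 8
```

With two options in each list, a full cross product gives 2 x 2 x 2 = 8 models, and the count grows multiplicatively as options are added.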


In [None]:
results = run_model_experiments(splitters, featurizers, predictors, experiment_name, use_cache=False)

In [None]:
results

## Neural network models

Now, we will show an example for neural network models. You will need a wandb account to run this example.

For neural network models, we use wandb to perform hyperparameter sweeps, which also compare different methods of splitting and featurizing.

We will initialize the same splitters and featurizers as before, and train various fully connected neural networks with different architectures and hyperparameters.

In [None]:
random_splitter = RandomProteinSplitter(protein_name, training_dataset_fname, wt_file, csv_has_header=True, use_cache=True, y_scaling=False, val_split=0.15)
round_splitter = RoundProteinSplitter(protein_name, training_dataset_fname, wt_file, csv_has_header=True, use_cache=True, y_scaling=False, val_split=0.15)

random_splitter.split_data(test_size=0.2)
round_splitter.split_data(min_test_round=1, max_train_round=0)

splitters = [random_splitter, round_splitter]

In [10]:
onehot = OneHotFeaturizer(protein=protein_name, use_cache=True) # as an example, we use only the one-hot featurizer

featurizers = [onehot]

In [11]:
models = [Fcn]

For neural network models, we use the function ```run_nn_model_experiments``` to train and test all the models. Unlike ```run_model_experiments```, this function saves the results of the model training on the wandb server.

```run_nn_model_experiments``` takes the following arguments:

- ```splitters```: A list of splitters to use for the experiment.

- ```featurizers```: A list of featurizers to use for the experiment.

- ```models```: A list of models to use for the experiment.

- ```experiment_name```: The name of the experiment.

- ```use_cache```: Whether to use the cache for the splitters, featurizers, and predictors.

- ```sweep_depth```: The depth of the hyperparameter sweep. Options are 'test', 'standard', and 'custom'. 'test' trains on a single hyperparameter configuration to check that training runs. 'standard' tests a standard set of hyperparameter configurations. 'custom' is a version of 'standard' that can be edited to test a subset of hyperparameter configurations.

- ```search_method```: The method of hyperparameter search. Options are 'test', 'grid', and 'bayes'. 'test' trains on a single hyperparameter configuration. 'grid' trains on every possible hyperparameter configuration. 'bayes' uses a Bayesian search to choose which hyperparameter configurations to train on.

- ```count```: The number of different hyperparameter configurations to test. This is only used if ```search_method``` is 'bayes'.

To modify the hyperparameter sweeps to test other hyperparameters, edit the sweep configuration files found in ```multievolve/predictors/sweep_configs```.
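For orientation, a wandb sweep configuration is a dictionary with a `method`, a `metric`, and a `parameters` section. The sketch below builds such dictionaries in plain Python and counts how many runs a grid search would need. The hyperparameter names, their values, and the metric name `val_loss` are hypothetical and are not taken from the library's sweep config files.

```python
from itertools import product

# Hypothetical hyperparameters; the library's actual sweep configs may differ.
parameters = {
    "learning_rate": {"values": [1e-3, 1e-4]},
    "hidden_size": {"values": [128, 256, 512]},
    "num_layers": {"values": [1, 2]},
}

def make_sweep_config(method):
    """Build a wandb-style sweep configuration for the given search method."""
    return {
        "method": method,  # wandb accepts "grid", "random", or "bayes"
        "metric": {"name": "val_loss", "goal": "minimize"},  # assumed metric name
        "parameters": parameters,
    }

grid_config = make_sweep_config("grid")
bayes_config = make_sweep_config("bayes")

# A grid search visits every combination of the listed values.
grid_runs = list(product(*(spec["values"] for spec in parameters.values())))
print(len(grid_runs))  # 12
```

In wandb itself, a configuration like this is registered with `wandb.sweep` and executed with `wandb.agent`, whose `count` argument caps the number of runs. Whether `run_nn_model_experiments` forwards its `count` argument in exactly this way is an assumption.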

In [None]:
run_nn_model_experiments(splitters, 
                         featurizers, 
                         models, 
                         experiment_name=experiment_name,
                         use_cache=False,
                         sweep_depth='test', 
                         search_method='test',
                         count=1
                         )