# Introduction

This notebook shows how to use the featurizers in the ```multievolve``` package.

In [4]:
from multievolve.splitters import *
from multievolve.featurizers import *

## Setting up

First, define the following variables:

- ```protein_name```: the name of the protein

- ```wt_file```: the path to the wildtype sequence

- ```training_dataset_fname```: the path to the training dataset

In [5]:
protein_name = "example_protein"
wt_file = "../../data/example_protein/apex.fasta"
training_dataset_fname = '../../data/example_protein/example_dataset.csv'
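The relative paths above resolve against the notebook's working directory, so it can help to confirm that they point at real files before building anything on top of them. The check below is a minimal sketch using only the standard library; `find_missing` is a hypothetical helper written for this example, not part of ```multievolve```.

```python
from pathlib import Path


def find_missing(paths):
    """Return the subset of paths that do not point at an existing file."""
    return [p for p in paths if not Path(p).is_file()]


# Same paths as defined in the cell above.
input_paths = [
    "../../data/example_protein/apex.fasta",
    "../../data/example_protein/example_dataset.csv",
]

missing = find_missing(input_paths)
if missing:
    print("Missing input files:", missing)
else:
    print("All input files found.")
```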

Define a splitter object. Here it is used only to load the dataset, so that sequences can be pulled from it and passed to a featurizer later.

In [6]:
splitter = RandomProteinSplitter(
    protein_name,
    training_dataset_fname,
    wt_file,
    csv_has_header=True,
    use_cache=True,
    y_scaling=False,
    val_split=None,
)

## Featurizers

There are many featurizers available in the ```multievolve``` package. We discuss a few of the most common ones below. 

- ```OneHotFeaturizer```: one-hot encoding of the protein sequence

- ```GeorgievFeaturizer```: Georgiev et al. (2022) featurizer

- ```AAIdxFeaturizer```: amino acid index featurizer    

- ```ESMLogitsFeaturizer```: ESM-2 logits featurizer

- ```ESM2EmbedFeaturizer```: ESM-2 embedding featurizer
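As a rough illustration of the first item in the list above, one-hot encoding can be sketched in plain NumPy. This is not the package's implementation: the alphabet ordering, the `one_hot_encode` name, and the toy sequence are all assumptions made for this example.

```python
import numpy as np

# Assumption: the 20 canonical amino acids in alphabetical one-letter order.
# The ordering used inside multievolve may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return a (len(sequence), 20) array with a single 1 in each row."""
    encoded = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=np.float32)
    for position, residue in enumerate(sequence):
        encoded[position, AA_TO_INDEX[residue]] = 1.0
    return encoded


toy_encoding = one_hot_encode("MKV")
print(toy_encoding.shape)  # (3, 20)
```

Each row corresponds to one sequence position and each column to one amino acid, so a sequence of length 3 yields a 3 x 20 array.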

There are also combinatorial featurizers, which combine the outputs of multiple featurizers.

- ```ESMAugmentedFeaturizer```: one-hot encoding augmented with likelihood scores from the ESM-1/ESM-2 models

- ```OnehotAndGeorgievFeaturizer```: one-hot encoding combined with the Georgiev et al. (2022) featurizer, with the encodings stacked along the last axis (i.e. by position)

- ```OnehotAndAAIdxFeaturizer```: one-hot encoding combined with the amino acid index featurizer, with the encodings stacked along the last axis (i.e. by position)

- ```OnehotAndESMLogitsFeaturizer```: one-hot encoding combined with the ESM-2 logits featurizer, with the encodings stacked along the last axis (i.e. by position)
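To make "stacked along the last axis (i.e. by position)" concrete, the sketch below concatenates two per-position feature arrays with NumPy. Both arrays are stand-ins invented for this example (the residue indices and the second encoding's width of 5 are arbitrary), and the package's own stacking code may differ in detail.

```python
import numpy as np

seq_len = 4
rng = np.random.default_rng(0)

# Stand-in for a one-hot encoding: one row per position, 20 columns.
# The residue indices below are arbitrary.
onehot_features = np.eye(20, dtype=np.float32)[[10, 8, 17, 0]]

# Stand-in for a second per-position encoding; 5 columns chosen arbitrarily.
extra_features = rng.normal(size=(seq_len, 5)).astype(np.float32)

# Concatenating on the last axis keeps one row per position and
# places the two feature sets side by side within each row.
stacked = np.concatenate([onehot_features, extra_features], axis=-1)
print(stacked.shape)  # (4, 25)
```

The number of positions is unchanged, while the per-position feature width becomes the sum of the two input widths.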

In [7]:
# Base Featurizers
onehot = OneHotFeaturizer(protein=protein_name, use_cache=True)
georgiev = GeorgievFeaturizer(protein=protein_name, use_cache=True)
aa_idx = AAIdxFeaturizer(protein=protein_name, use_cache=True)
esm_logits = ESMLogitsFeaturizer(protein=protein_name, use_cache=True)
esm_embed = ESM2EmbedFeaturizer(protein=protein_name, use_cache=True)

# Combinatorial Featurizers
esm_augmented = ESMAugmentedFeaturizer(protein=protein_name, use_cache=True, wt_file=wt_file)
onehotgeorgiev = OnehotAndGeorgievFeaturizer(protein=protein_name, use_cache=True)
onehotaaidx = OnehotAndAAIdxFeaturizer(protein=protein_name, use_cache=True)
onehotesmlogits = OnehotAndESMLogitsFeaturizer(protein=protein_name, use_cache=True)
onehotesmmsalogits = OnehotAndESMMSALogitsFeaturizer(protein=protein_name, use_cache=True)

Each featurizer has a ```featurize``` method, which takes a list of sequences and returns their featurized representations.

In [None]:
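# Pull the first five sequences from the dataset loaded by the splitter
example_sequences = splitter.data[0][:5].tolist()

# Featurize them with the one-hot featurizer
onehot.featurize(example_sequences)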