# **Designing protein scaffolds using generative models - EvoDiff**
Please make a copy of this notebook first!
---


**Protein generation with evolutionary diffusion: sequence is all you need**
Sarah Alamdari, Nitya Thakkar, Rianne van den Berg, Neil Tenenholtz, Robert Strome, Alan M. Moses, Alex X. Lu, Nicolo Fusi, Ava P. Amini, Kevin K. Yang


---
EvoDiff is a method for **sequence** generation, with or without conditional information (a motif, an MSA, etc.).
EvoDiff can perform the following tasks:

*   (unconditional) monomer generation
*   motif scaffolding
*   binder design
*   MSA-informed monomer generation

This notebook is based on a notebook by the EvoDiff authors:
https://github.com/microsoft/evodiff/blob/main/examples/evodiff.ipynb

 Fusion + modification for RosettaCon Copenhagen Workshop: M. Ertelt




This notebook does not include AF2 validation in order to keep it CPU only, but you can use the [AlphaFold3 server](https://golgi.sandbox.google.com/) to check out the predicted fold of your generated sequences.

In [None]:
#@title #Installing Evodiff (this takes a few minutes, ignore warnings)
import sys
# for some reason all functions only work in this install combination and only god knows why
!pip uninstall evodiff -y -q
!{sys.executable} -m pip install evodiff -q
!pip uninstall evodiff -y -q
!pip install git+https://github.com/microsoft/evodiff.git@main -q
!pip install biopython -q
##!{sys.executable} -m pip install --upgrade git+https://github.com/microsoft/evodiff.git@main -q
#!pip install torch_geometric -q
#!pip install biotite -q
#import torch
#!pip install torch-scatter -f https://data.pyg.org/whl/torch-{torch.__version__}.html -q


## Unconditional sequence generation

### Generate a sequence with EvoDiff-Seq-OADM 38M
For demonstration purposes, we show an example using the smaller 38M model here, and generation on a CPU. For production runs you should use the model EvoDiff-Seq-OADM 640M.

In [None]:
#@title Downloading the model
from evodiff.pretrained import OA_DM_38M
#from evodiff.pretrained import OA_DM_640M # this would be the larger model that you could use interchangeably


checkpoint = OA_DM_38M()
model, collater, tokenizer, scheme = checkpoint

Downloading: "https://zenodo.org/record/8045076/files/oaar-38M.tar?download=1" to /root/.cache/torch/hub/checkpoints/oaar-38M.tar
100%|██████████| 434M/434M [00:24<00:00, 18.6MB/s]


To generate one sequence, run the cell below. The only thing you need to define is the desired sequence length, via the `seq_len` input.

In [None]:
from evodiff.generate import generate_oaardm

seq_len = 150
tokenized_sample, generated_sequence = generate_oaardm(model, tokenizer, seq_len, batch_size=1, device='cpu')
print("Generated sequence:", generated_sequence)

100%|██████████| 150/150 [00:52<00:00,  2.87it/s]

Generated sequence: ['MTERTTPRRVSIGQHIRDASEAAISEALRLRRWSGSDALAPAAALPPELSASHLARFAETIDPLWTRKAAYLAKCLLLSADGAPAAQRYIYMQRALARHRAAGELSLTAAGRLAEANDAVVLSNLPSDLSKFCDTVVRARMESPPDLQAC']





Now that you've created a sequence, it's time to validate it. This time around we will go for AF3 instead of AF2, which you can [access here](https://golgi.sandbox.google.com/). How does it compare to the unconditional design of RFdiff/AF2 hallucination?
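
If you want to keep your designs around or upload several at once, it can help to write them out in FASTA format first. The helper below is a minimal sketch in plain Python; the function name `to_fasta`, the record prefix, and the stand-in sequence are made up for illustration and are not part of EvoDiff.

```python
def to_fasta(sequences, prefix="evodiff_design", width=60):
    """Format a list of amino-acid strings as FASTA text.

    `prefix` and `width` are illustrative defaults, not EvoDiff settings.
    """
    records = []
    for i, seq in enumerate(sequences, start=1):
        # Wrap each sequence to `width` characters per line.
        lines = [seq[j:j + width] for j in range(0, len(seq), width)]
        records.append(f">{prefix}_{i}\n" + "\n".join(lines))
    return "\n".join(records) + "\n"


# Stand-in sequence; in the notebook you would pass `generated_sequence`,
# which is printed above as a list of strings.
example = ["MTERTTPRRVSIGQHIRDASEAAISEALRLRRW"]
fasta_text = to_fasta(example)
print(fasta_text)
```

You can then save `fasta_text` with `open("designs.fasta", "w").write(fasta_text)` and copy individual records into the server.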

### Generate a sequence with EvoDiff-D3PM-Uniform 38M

Again, we show an example here using the smaller model weights. For D3PM models we need additional inputs for inference, so we download checkpoints with `return_all=True`. If you are using a BLOSUM model, make sure to download the BLOSUM matrix file in `data/` to your local files.

In [None]:
#@title Downloading the D3PM model
from evodiff.pretrained import D3PM_UNIFORM_38M

checkpoint = D3PM_UNIFORM_38M(return_all=True)
model, collater, tokenizer, scheme, timestep, Q_bar, Q = checkpoint

sohl-dickstein


Downloading: "https://zenodo.org/record/8045076/files/d3pm-uniform-38M.tar?download=1" to /root/.cache/torch/hub/checkpoints/d3pm-uniform-38M.tar
100%|██████████| 434M/434M [00:26<00:00, 17.0MB/s]


We can then generate 1 sequence via the following, where again only `seq_len` needs to be defined:

In [None]:
from evodiff.generate import generate_d3pm

seq_len = 150

tokenized_sample, generated_sequence = generate_d3pm(model, tokenizer, Q, Q_bar, timestep, seq_len, batch_size=1, device='cpu')

100%|██████████| 499/499 [02:48<00:00,  2.97it/s]

final seq ['MVASRMILMWVGLIVFGAILRSISSNDTVTGLWLAAFSALSSISFVNQFRQLGGSPMGLVMGSCWYQVFAASRNREYPLQLLRWTSLLFLSVFFLFNYLARALPWVWLPQEDMLAARWMCALLAIEVALMVVVGIGERVLELERGLDHRF']





Now that you've created a sequence, it's time to validate it. This time around we will go for AF3 instead of AF2, which you can [access here](https://golgi.sandbox.google.com/). How does it compare to the unconditional design of RFdiff/AF2 hallucination?

## Conditional generation
*Note: All conditional generation uses OADM models*


### Inpainting intrinsically disordered regions (IDRs) with EvoDiff-Seq



Using an example input `sequence`, we will show you how to inpaint a new region (from `start_idx` to `end_idx`) of that sequence using EvoDiff-Seq.

In [None]:
from evodiff.conditional_generation import inpaint_simple
from evodiff.pretrained import OA_DM_38M
checkpoint = OA_DM_38M()
model, collater, tokenizer, scheme = checkpoint

sequence = 'DQTERTVRSFEGRRTAPYLDSRNVLTIGYGHLLNRPGANKSWEGRLTSALPREFKQRLTELAASQLHETDVRLATARAQALYGSGAYFESVPVSLNDLWFDSVFNLGERKLLNWSGLRTKLESRDWGAAAKDLGRHTFGREPVSRRMAESMRMRRGIDLNHYNI'
start_idx = 20
end_idx = 50


sample, entire_sequence, generated_idr = inpaint_simple(model, sequence, start_idx, end_idx, tokenizer=tokenizer, device='cpu')

print("original sequence:", sequence)
print("generated sequence:", entire_sequence)


print("\noriginal region:", sequence[start_idx:end_idx])
print("generated region:", generated_idr)
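
A quick sanity check after inpainting is to confirm that everything outside the inpainted window is unchanged, and to see how far the new region drifted from the original. The sketch below assumes both sequences are plain strings of equal length and that the window is the Python slice `[start_idx:end_idx]`, as in the print statements above; `compare_inpainting` is a hypothetical helper, not an EvoDiff function.

```python
def compare_inpainting(original, inpainted, start_idx, end_idx):
    """Return (flanks_unchanged, identity_in_region) for two equal-length strings."""
    if len(original) != len(inpainted):
        raise ValueError("sequences must have equal length")
    # Everything before start_idx and from end_idx onward should be identical.
    flanks_unchanged = (
        original[:start_idx] == inpainted[:start_idx]
        and original[end_idx:] == inpainted[end_idx:]
    )
    region_old = original[start_idx:end_idx]
    region_new = inpainted[start_idx:end_idx]
    # Fraction of positions in the window that kept the original residue.
    matches = sum(a == b for a, b in zip(region_old, region_new))
    identity = matches / len(region_old) if region_old else 0.0
    return flanks_unchanged, identity


# Toy strings for illustration only.
print(compare_inpainting("MKTAYIAKQR", "MKTGGGAKQR", 3, 6))  # (True, 0.0)
```

In the notebook you could call it as `compare_inpainting(sequence, entire_sequence, start_idx, end_idx)`, provided `entire_sequence` is a string.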

Now it's time to apply EvoDiff inpainting to our myoglobin, as we did with RFdiff and AF2, but this time in *sequence space*.

As a reminder, the [PDB ID is 3RGK](https://www.rcsb.org/structure/3RGK) and you want to sample the following regions:
*   Residues 1-9
*   Residues 47-59
*   Residues 73-88
*   Residues 141-149
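
The residue ranges above are 1-based and inclusive, while the inpainting example uses `start_idx` and `end_idx` like a Python slice (see `sequence[start_idx:end_idx]` in the print statement). A small conversion helper avoids off-by-one mistakes. This is only a sketch: it assumes residue 1 is the first character of the sequence string you pass in, which you should check against the actual 3RGK sequence, and `residue_range_to_slice` is a hypothetical name.

```python
def residue_range_to_slice(first_res, last_res, first_res_number=1):
    """Convert an inclusive residue range to 0-based, half-open slice indices.

    `first_res_number` is the residue number of the first character in your
    sequence string (assumed to be 1 here).
    """
    if first_res > last_res:
        raise ValueError("first_res must not exceed last_res")
    start_idx = first_res - first_res_number
    end_idx = last_res - first_res_number + 1
    return start_idx, end_idx


regions = [(1, 9), (47, 59), (73, 88), (141, 149)]
for first, last in regions:
    start_idx, end_idx = residue_range_to_slice(first, last)
    print(f"residues {first}-{last} -> start_idx={start_idx}, end_idx={end_idx}")
```

Each `(start_idx, end_idx)` pair can then be used for one inpainting call on the sequence.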

In [None]:
# Fill in your solution here



Now that you've created some sequences, it's time to validate them. This time around we will go for AF3 instead of AF2, which you can [access here](https://golgi.sandbox.google.com/). Make sure to add your generated protein sequence, as well as a "Heme" ligand, to check whether AF3 picks up on our binding site!

### Scaffolding functional motifs with EvoDiff-Seq

Below we provide an example of scaffolding the PDB-ID: 1PRW.

EvoDiff-Seq will search for a PDB file, given the 4-character PDB code, and attempt to download it if it cannot find it in the local directory. Raw PDB files can be missing residues or have extra residues, so we urge extreme caution when using this code to ensure that you are scaffolding the correct indices.

First, provide the start (`start_idx`) and end (`end_idx`) indices for the motif you are interested in scaffolding (the end index is inclusive). If there are multiple domains, make sure these are indexed in numerical order. For the example below, `start_idx=[51,15]`, `end_idx=[70,34]` would not be acceptable. For multiple domains, we retain the original spacing between motifs, extract the entire domain from `start_idx[0]` to `end_idx[-1]`, and fill the non-motif regions with a `MASK` token.

Next, we can specify what `scaffold_length` we want to generate. The code will randomly sample the location of the motif within the specified scaffold length.
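
To make the indexing rules concrete, here is a toy sketch of the idea in plain Python: validate that the motif segments are in ascending order, cut out the span from `start_idx[0]` to `end_idx[-1]` (inclusive), mask the non-motif positions, and drop the span at a random offset inside a scaffold of the requested length. This is an illustration under our own assumptions, not EvoDiff's implementation; the `#` mask character and both function names are hypothetical.

```python
import random

MASK = "#"  # placeholder character for illustration; not EvoDiff's actual token


def validate_motif_indices(start_idx, end_idx):
    """Check that segments are well-formed, ascending, and non-overlapping."""
    if len(start_idx) != len(end_idx):
        raise ValueError("start_idx and end_idx must have the same length")
    previous_end = -1
    for s, e in zip(start_idx, end_idx):
        if s > e:
            raise ValueError(f"segment start {s} is after its end {e}")
        if s <= previous_end:
            raise ValueError("segments must be in ascending order without overlap")
        previous_end = e


def build_masked_template(sequence, start_idx, end_idx, scaffold_length, rng=None):
    """Place the masked motif span at a random offset in a scaffold of MASKs."""
    validate_motif_indices(start_idx, end_idx)
    rng = rng or random.Random()
    first = start_idx[0]
    span = list(sequence[first:end_idx[-1] + 1])  # end index is inclusive
    keep = set()
    for s, e in zip(start_idx, end_idx):
        keep.update(range(s - first, e - first + 1))
    # Keep motif residues, mask the linker positions between segments.
    span = [aa if i in keep else MASK for i, aa in enumerate(span)]
    if scaffold_length < len(span):
        raise ValueError("scaffold_length is shorter than the motif span")
    offset = rng.randint(0, scaffold_length - len(span))
    tail = scaffold_length - len(span) - offset
    template = [MASK] * offset + span + [MASK] * tail
    new_start = [s - first + offset for s in start_idx]
    new_end = [e - first + offset for e in end_idx]
    return "".join(template), new_start, new_end


# Toy sequence for illustration only.
template, new_start, new_end = build_masked_template(
    "ABCDEFGHIJ", [1, 6], [2, 7], scaffold_length=12, rng=random.Random(0)
)
print(template, new_start, new_end)
```

Passing the segments in the wrong order, for example `[51, 15]` and `[70, 34]`, makes `validate_motif_indices` raise a `ValueError` instead of silently producing a wrong template.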

In [None]:
from evodiff.conditional_generation import generate_scaffold
import os
# exist_ok=True lets you re-run this cell without a FileExistsError
os.makedirs('./scaffolding-pdbs', exist_ok=True)
data_top_dir = './' # Change this filepath to point to where scaffolding-msas and scaffolding-pdbs exist; this should be the same folder as this notebook


In [None]:
pdb_code = '1prw'

start_idx = [15, 51]
end_idx = [34, 70]

num_seqs = 1


scaffold_length = 75

generated_sequence, new_start_idx, new_end_idx = generate_scaffold(model, pdb_code, start_idx, end_idx, scaffold_length, data_top_dir, tokenizer, device='cpu')

print("motif start indices", new_start_idx)
print("motif end indices", new_end_idx)

Now let's go back to our heme and try to scaffold just the binding site. How well does scaffolding a minimal motif work in sequence space? Again, use AlphaFold3 and include a heme ligand to check the diffused sequence.

In [None]:
# Start your solution here