# Enzyme tutorial 5: De novo PLAase design

This Colab is inspired by code from Samee Ullah and Indrek Kalvet. This code directly uses a LigandMPNN fork by Samee Ullah.

The functionality provided here is tailored for demonstration; the code is not suited for production runs or for diverse systems (multiple ligands or multiple chains). At the same time, it is almost the same code you would use in production, so the skills you gain here in writing LigandMPNN and RFdiffusion all-atom inputs transfer directly to real projects.

In [None]:
#@markdown ### Download RFdiffusion All-atom

!git clone https://github.com/baker-laboratory/rf_diffusion_all_atom.git
%cd rf_diffusion_all_atom
!wget http://files.ipd.uw.edu/pub/RF-All-Atom/weights/RFDiffusionAA_paper_weights.pt
!git submodule init
!git submodule update
%cd ../

Cloning into 'rf_diffusion_all_atom'...
remote: Enumerating objects: 49, done.
remote: Counting objects: 100% (23/23), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 49 (delta 5), reused 3 (delta 3), pack-reused 26 (from 1)
Receiving objects: 100% (49/49), 3.63 MiB | 27.73 MiB/s, done.
Resolving deltas: 100% (6/6), done.
/content/rf_diffusion_all_atom
--2025-10-16 12:13:42--  http://files.ipd.uw.edu/pub/RF-All-Atom/weights/RFDiffusionAA_paper_weights.pt
Resolving files.ipd.uw.edu (files.ipd.uw.edu)... 128.95.160.135, 128.95.160.134, 2607:4000:406::160:135, ...
Connecting to files.ipd.uw.edu (files.ipd.uw.edu)|128.95.160.135|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1337064321 (1.2G) [application/octet-stream]
Saving to: ‘RFDiffusionAA_paper_weights.pt’


2025-10-16 12:13:52 (127 MB/s) - ‘RFDiffusionAA_paper_weights.pt’ saved [1337064321/1337064321]

Submodule 'lib/rf2aa' (https://github.com/baker-laboratory/RoseTTAFold-Al

In [None]:
#@markdown ### Install micromamba

%%bash
ARCH=$(uname -m)
if [ "$ARCH" = "x86_64" ]; then
  URL="https://micro.mamba.pm/api/micromamba/linux-64/latest"
elif [ "$ARCH" = "aarch64" ]; then
  URL="https://micro.mamba.pm/api/micromamba/linux-aarch64/latest"
else
  echo "Unsupported arch: $ARCH" >&2
  exit 1
fi

# Stream-download and extract only bin/micromamba
wget -qO- "$URL" \
  | tar -xj --strip-components=1 bin/micromamba

chmod +x micromamba

In [None]:
#@markdown ### Specify python environment for LigandMPNN and RFdiffusionAA

ikalvet_env = '''name: diffusion_allatom
channels:
  - pytorch
  - dglteam/label/cu118
  - nvidia
  - conda-forge

dependencies:
  - assertpy=1.1
  - python=3.9.18
  - pytorch=2.2.1
  - pytorch-cuda=11.8
  - prody=2.4.1
  - dgl=2.1.0.cu118
  - deepdiff=6.7.1
  - e3nn=0.5.1
  - icecream=2.1.3
  - fire=0.5.0
  - hydra-core=1.3.2
  - openbabel=3.1.1
  - pandas=2.2.1
  - pydantic=2.6.3
  - numpy=1.26.4
  - scipy=1.12.0
  - torchdata=0.7.1
  - torchtriton=2.2.0
  - tqdm=4.66.2'''

with open('environment.yml', 'w') as fh:
  fh.write(ikalvet_env)

In [None]:
#@markdown ### Create environment

!./micromamba create -y -n env -f environment.yml -q --log-level error

In [None]:
#@markdown ### Download LigandMPNN

!git clone https://github.com/ullahsamee/LigandMPNN.git
%cd LigandMPNN
!bash get_model_params.sh "./model_params"
%cd /content

Cloning into 'LigandMPNN'...
remote: Enumerating objects: 201, done.
remote: Counting objects: 100% (81/81), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 201 (delta 66), reused 56 (delta 56), pack-reused 120 (from 1)
Receiving objects: 100% (201/201), 998.90 KiB | 12.18 MiB/s, done.
Resolving deltas: 100% (77/77), done.
/content/LigandMPNN
/content


In [None]:
#@markdown ### Install helper packages

%pip install -q MDanalysis

import MDAnalysis as mda
print('Ignore the WARNING statement, everything is fine')

from google.colab import files





## First run everything up to this cell, then return to the tutorial Google Doc

When you are back, hopefully everything is installed.

## You are back? Great, let's design!

Have your PDB file at the ready. Actually, de novo design is much simpler than our previous tasks in a technical sense -- since we have nothing at the start, we don't need to bother about what to fix and what to design. Design everything!

## RFDiffusionAA

### Contig definition

The main option in RFdiffusionAA is the `contig`, the same as in RFdiffusion. It is a recipe for the design: which original parts to keep, how many new parts to add and of what length, and the order in which these parts appear in the sequence.

Imagine we have a starting file with the motif. Its residues are 20, 60, and 100, and the chain is always A.

Consider this line: `10-30,A20-20,30-70,A60-60,30-70,A100-100,10-30`

It tells RFdiffusionAA to, naturally, take the motif residues from the input PDB, and hallucinate linking sequences between them, to end up with a -- hopefully -- well-folded globule.

Just to reiterate -- the only difference between "take" and "create" is a **chain letter** in front of the contig section. So be extremely accurate!

*It also means that your input PDB files must always have explicitly specified chains.*
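A quick way to check this before running diffusion is to scan the coordinate records for a blank chain column. The sketch below assumes standard fixed-column PDB formatting, where the chain ID sits in column 22; the function name `count_records_without_chain` and the two example lines are made up for illustration.

```python
def count_records_without_chain(pdb_lines):
    """Count ATOM/HETATM records whose chain ID (PDB column 22) is blank."""
    count = 0
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            # Pad short lines so the fixed-column lookup never fails
            padded = line.rstrip("\n").ljust(22)
            if padded[21] == " ":
                count += 1
    return count

# Hypothetical records: the first has chain A, the second has no chain ID
example = [
    "ATOM      1  CA  SER A  81      10.000  10.000  10.000",
    "ATOM      2  CA  HIS    82      12.000  10.000  10.000",
]
print(count_records_without_chain(example))
```

Any non-zero count means the file needs chain IDs assigned before its residues can be referenced in a contig.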

---

So, we need to write a contig. For our task it might look like this:
`40-60,A81-82,50-70,A144-145,50-70,A203-203,50-70,A257-257,20-40`

Because the motif in the PDB file consists of protein residues 81, 82, 144, 145, 203, 257.
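To see what a contig implies before spending GPU time on it, it can help to parse it. The following is a minimal sketch, not part of RFdiffusionAA: the helper `summarize_contig` is hypothetical, and it assumes every comma-separated segment has the `start-end` form used in this tutorial, with a single-letter chain ID in front of motif segments. Other contig syntax is not handled.

```python
def summarize_contig(contig):
    """Split a contig into motif and generated segments and total the length range."""
    motif_residues = []      # (chain, residue number) pairs taken from the input PDB
    min_len = max_len = 0    # running totals for the length of the designed chain
    for segment in contig.split(","):
        segment = segment.strip()
        if segment[0].isalpha():
            # Chain letter in front: these residues are taken from the input
            chain, span = segment[0], segment[1:]
            start, end = (int(x) for x in span.split("-"))
            motif_residues += [(chain, i) for i in range(start, end + 1)]
            min_len += end - start + 1
            max_len += end - start + 1
        else:
            # No chain letter: a new stretch with a length in this range is created
            low, high = (int(x) for x in segment.split("-"))
            min_len += low
            max_len += high
    return motif_residues, min_len, max_len

motif, shortest, longest = summarize_contig(
    "40-60,A81-82,50-70,A144-145,50-70,A203-203,50-70,A257-257,20-40")
print(motif)
print(shortest, longest)
```

For the contig above this lists the six motif residues (A81, A82, A144, A145, A203, A257) and a total length between 216 and 316 residues.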

In [None]:
#@markdown ### Input Options

%cd /content/

task_name='PLAase_de_novo' #@param {type:"string"}
upload_dict = files.upload()
pdb_string = upload_dict[list(upload_dict.keys())[0]]
ligand_name = "56S" #@param {type:"string"}
input_pdb = f"{task_name}.pdb"
with open(input_pdb,"wb") as out: out.write(pdb_string)

contig = "40-60,A81-82,50-70,A144-145,50-70,A203-203,50-70,A257-257,20-40" #@param {type:"string"}

num_designs = 5 #@param {type:"number"}

#@markdown RFdiffusionAA is quite demanding. We will decrease the number of diffusion steps (`diffuser.T=20`), but it will still require minutes per design on GPU. The first time you run a design, it will take longer due to preparation and caching of important files. So don't put a large number here.

#@markdown Look below, there should be now a button to upload your input PDB file. Click on it!

/content


Saving pla_start_motif.pdb to pla_start_motif.pdb


The command below will run RFdiffusionAA.

In [None]:
%cd /content/rf_diffusion_all_atom/
!../micromamba run -n env python -u run_inference.py diffuser.T=20 inference.output_prefix=output/{task_name} inference.input_pdb=../{task_name}.pdb contigmap.contigs=[\'{contig}\'] inference.num_designs={num_designs} inference.design_startnum=1 inference.ligand={ligand_name}

/content/rf_diffusion_all_atom
Please either pass the dim explicitly or simply use torch.linalg.cross.
The default value of dim will change to agree with that of linalg.cross in a future release. (Triggered internally at /opt/conda/conda-bld/pytorch_1708025831482/work/aten/src/ATen/native/Cross.cpp:63.)
  Z = torch.cross(Xn,Yn)
DGL backend not selected or invalid.  Assuming PyTorch for now.
Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable.  Valid options are: pytorch, mxnet, tensorflow (all lowercase)
[2025-10-16 12:23:58,262][inference.model_runners][INFO] - Reading checkpoint from RFDiffusionAA_paper_weights.pt
loading RFDiffusionAA_paper_weights.pt
loaded RFDiffusionAA_paper_weights.pt
OVERRIDING: You are changing diffuser.T from the value this model was trained with.
[2025-10-16 12:24:13,185][inference.model_runners][INFO] - Loading checkpoint.
[2025-10-16 12:24:13,453][diffusion][INFO] - No IGS

Use the menu to the left to download your results. They are located in `rf_diffusion_all_atom/output/<task_name>_<number>.pdb`. Download them and open them in PyMOL. Analyze the new backbones and pick one to take into sequence design with LigandMPNN.

- For example, if I generated 5 designs with `task_name: Alex`, I will download them, look at them, and decide that I will -- for now -- proceed with `/content/rf_diffusion_all_atom/output/Alex_2.pdb`.
- In a real-life scenario, we would take more than one structure to the next step.

It is also fun to have a look at the diffusion trajectories. In the folder `rf_diffusion_all_atom/output`, locate the subfolder `traj`. It contains files named like this:

`sample_0_Xt-1_traj.pdb` - This is the trajectory of inputs to the denoiser at each step of the diffusion process.

`sample_0_X0-1_traj.pdb` - This is the trajectory of denoiser outputs at each step. This is how the network tries to reconstruct the final solution, given a particular input.

At each diffusion step, we generate the X0 prediction and then use it to drive the diffusion to the next step. So, we don't immediately accept this prediction, but we make a move based on what it gave us. This continues iteratively, hopefully refining the quality of the X0 prediction.

You can check whether it actually stabilizes at one solution, and how different the actual diffusion process looks from these "final" predictions.
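The relationship between the two trajectories can be illustrated with a one-dimensional toy. This is only a sketch of the idea "predict X0, then take a partial step toward it"; it is not the RFdiffusionAA update rule. The "denoiser" here is a hypothetical stand-in that guesses halfway between its input and a known answer, and the `1/t` step size is an assumption chosen for simplicity.

```python
def toy_denoiser(x_t, target=1.0):
    # Hypothetical stand-in for the network: guesses halfway between input and answer
    return x_t + 0.5 * (target - x_t)

def run_toy_diffusion(x_start, T=20):
    """Return the Xt trajectory, the X0 trajectory, and the final value."""
    xt_traj, x0_traj = [], []
    x_t = x_start
    for t in range(T, 0, -1):
        xt_traj.append(x_t)           # input to the denoiser at this step
        x0_pred = toy_denoiser(x_t)   # the denoiser's guess at the final answer
        x0_traj.append(x0_pred)
        # Move 1/t of the way toward the prediction instead of accepting it outright
        x_t = x_t + (x0_pred - x_t) / t
    return xt_traj, x0_traj, x_t

xt_traj, x0_traj, final = run_toy_diffusion(0.0, T=20)
print(len(xt_traj), len(x0_traj))
print(final)
```

In this toy, the X0 predictions run ahead of the Xt inputs at every step, and only at the last step does the state land on the prediction itself.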

## Interface to write calls to LigandMPNN

After you choose a diffusion result you are happy with, you need to spend some time in PyMOL writing down residue numbers. Because the linkers have variable lengths, the original numbering is gone, and we don't automatically know, for example, the new residue number of the catalytic Ser. We need these numbers to write the set of immutable residues.

Important: of the initial six residues, not all are equally immutable. Three -- Ser, His, Asp -- are definitely immutable. For the other three, you can decide for yourself.

You will also notice that we don't do any distance-based selection of mutable residues anymore. This is because we need to design the whole protein, we don't have any reasonable starting point that is better left unperturbed.
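To shortlist candidate positions before inspecting them in PyMOL, one option is to list the protein residues that sit close to the ligand. The sketch below is an assumption-laden helper, not part of either tool: it assumes standard fixed-column PDB records, treats every `HETATM` atom as ligand (so waters or ions would count too), and the name `residues_near_ligand` and the 5 Å default cutoff are made up for illustration.

```python
import math

def residues_near_ligand(pdb_lines, cutoff=5.0):
    """List (chain, number, name) of ATOM residues with any atom within `cutoff` Å of a HETATM atom."""
    protein_atoms, ligand_atoms = [], []
    for line in pdb_lines:
        record = line[:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue
        # Fixed PDB columns: x = 31-38, y = 39-46, z = 47-54
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        if record == "ATOM":
            # Chain = column 22, residue number = 23-26, residue name = 18-20
            residue = (line[21], int(line[22:26]), line[17:20].strip())
            protein_atoms.append((residue, xyz))
        else:
            ligand_atoms.append(xyz)
    close = set()
    for residue, xyz in protein_atoms:
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_atoms):
            close.add(residue)
    return sorted(close)
```

It accepts any iterable of lines, so an open file handle works: `with open(path) as fh: print(residues_near_ligand(fh))`. The result is only a shortlist; confirm which positions are the catalytic Ser, His, and Asp by eye.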

In [None]:
#@markdown ### Input Options

%cd /content/

task_name='all_atom_design' #@param {type:"string"}
upload_dict = files.upload()
pdb_string = upload_dict[list(upload_dict.keys())[0]]
input_pdb = f"{task_name}.pdb"
with open(input_pdb,"wb") as out: out.write(pdb_string)

enzyme_chain = "A" #@param {type:"string"}
ligand_chain = "B" #@param {type:"string"}

#@markdown ---

#@markdown Numbers of positions within protein chain that will not be redesigned -- put catalytic residues here (leaving just `1,10,100` is a bad idea):
immutable_positions = "42,110,224,167" #@param {type: "string"}

#@markdown ---

#@markdown ### Design Options
number_of_batches = 10 #@param {type:"number"}
batch_size = 3 #@param {type:"number"}
#@markdown Number of designs = number_of_batches * batch_size

#@markdown If you are running on GPU, keep batch_size at 3 or even increase it

#@markdown On a CPU, put 1

#@markdown On a GPU runtime, it takes ~6 seconds per design, so you might want to crank up these numbers

#@markdown ---
#@markdown Look below, there should be now a button to upload your input PDB file. Click on it!

/content


Saving PLAase_de_novo_1.pdb to PLAase_de_novo_1.pdb


Because in this task we have far more redesigned residues than immutable (fixed) ones, it makes more sense to structure our call around the fixed ones. We will now use the `--fixed_residues` flag instead of `--redesigned_residues`.

In [None]:
#@markdown ### Run parameters for LigandMPNN

%cd /content/

site = sorted(list(map(int, immutable_positions.split(','))))
fixed_string = ' '.join([f"{enzyme_chain}{x}" for x in site])

runstring = f'''python run.py \
--pdb_path "/content/{input_pdb}" \
--out_folder "/content/LigandMPNN/outputs/{task_name}" \
--number_of_batches {number_of_batches} \
--batch_size {batch_size} \
--model_type "ligand_mpnn" \
--fixed_residues "{fixed_string}"'''

print("\nThis is how the instruction for LigandMPNN would look like (just for your information):")
print(f"{runstring}")

/content

This is how the instruction for LigandMPNN would look like (just for your information):
python run.py --pdb_path "/content/PLAase_design.pdb" --out_folder "/content/LigandMPNN/outputs/PLAase_design" --number_of_batches 10 --batch_size 3 --model_type "ligand_mpnn" --fixed_residues "A42 A110 A167 A224"


### Run LigandMPNN -- simply run the cell below

In [None]:
%cd /content/LigandMPNN
!../micromamba run -n env {runstring}

/content/LigandMPNN
  import pkg_resources
@> ProDy is configured: verbosity='none'
Designing protein from this path: /content/PLAase_design.pdb
These residues will be redesigned:  ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A11', 'A12', 'A13', 'A14', 'A15', 'A16', 'A17', 'A18', 'A19', 'A20', 'A21', 'A22', 'A23', 'A24', 'A25', 'A26', 'A27', 'A28', 'A29', 'A30', 'A31', 'A32', 'A33', 'A34', 'A35', 'A36', 'A37', 'A38', 'A39', 'A40', 'A41', 'A43', 'A44', 'A45', 'A46', 'A47', 'A48', 'A49', 'A50', 'A51', 'A52', 'A53', 'A54', 'A55', 'A56', 'A57', 'A58', 'A59', 'A60', 'A61', 'A62', 'A63', 'A64', 'A65', 'A66', 'A67', 'A68', 'A69', 'A70', 'A71', 'A72', 'A73', 'A74', 'A75', 'A76', 'A77', 'A78', 'A79', 'A80', 'A81', 'A82', 'A83', 'A84', 'A85', 'A86', 'A87', 'A88', 'A89', 'A90', 'A91', 'A92', 'A93', 'A94', 'A95', 'A96', 'A97', 'A98', 'A99', 'A100', 'A101', 'A102', 'A103', 'A104', 'A105', 'A106', 'A107', 'A108', 'A109', 'A111', 'A112', 'A113', 'A114', 'A115', 'A116', 'A117', 'A118

In [None]:
#@markdown ### Output up to top-10 designs based on ligand confidence score

def top_10_fa(file_path):
  """Return up to 10 (title, sequence) pairs, highest ligand_confidence first."""
  results = []
  title, lc, seq = None, None, ''
  seq_num = 0
  with open(file_path, 'r') as fh:
    for line in fh:
      line = line.strip()
      if line.startswith('>'):
        seq_num += 1
        # Store the previous design before starting a new record
        if title is not None:
          results.append((lc, title, seq))
        # The first record is the input sequence, not a design, so skip it
        if seq_num >= 2:
          title = line
          # Header fields are ', '-separated; the sixth is ligand_confidence=<value>
          lc = float(line.split(', ')[5].split('=')[1])
          seq = ''
      elif seq_num >= 2 and line:
        seq = line
  # Store the last design (nothing is stored if the file has no designs)
  if title is not None:
    results.append((lc, title, seq))
  results.sort(key=lambda x: x[0], reverse=True)
  return [x[1:] for x in results[:10]]

top_designs = top_10_fa(f"/content/LigandMPNN/outputs/{task_name}/seqs/{task_name}.fa")
for x in top_designs:
  print(x[0])
  print(x[1])
  print('\n')

>PLAase_design, id=22, T=0.1, seed=68747, overall_confidence=0.3924, ligand_confidence=0.4622, seq_rec=0.2070
MKKPIFIGVKVTPEDPTAGVLAATEALKKIPFKLKRVYLCGSVNKEQAEAIAANLQAAGVQFDAIIYFDFDPAAFAKFTPEQLAALEAAARVLIATLLELVEGGIISGCSMSARMLIAGLRDDSEGVTFMTPDPAYAAGLRKLAAEAGSKMEVVAARPTPPEPVAIDAVATGTAAKTPAYLNTDDSGCTVAKGRSPSASGALCNIADQLIKEDPSLDKDGYVDHALSPAQFKELLARAEEAGAHAVACVGRPGSDRHPTA


>PLAase_design, id=12, T=0.1, seed=68747, overall_confidence=0.3971, ligand_confidence=0.4430, seq_rec=0.1719
MKKPIIIGVKITPDDPTAGFLAATEALKKVPEKAKRVYLCGSVNKEQAEAIAENLQKVGVKFDVIIYIEFDPAAFAKFTPEQLAALYAAARELIATLRELVKGGIISGCSLSARLLIAGLRDDDEGVTIMAPDPAYKAGLEKWCAEEGSRMEVVAANDTPLEPRAVDAVATGTAAGLPGYLNTTDDKCKLLKGISPSCSGPLCNIGDGLKQADPRLDKKGYDDHAYSPRQWKALLDQAEKAGAHAVACCGRPGSDKHPTA


>PLAase_design, id=17, T=0.1, seed=68747, overall_confidence=0.3839, ligand_confidence=0.4416, seq_rec=0.2148
MEPPIFISVVVTPDDPTAGINAATEALKKVPKQAKRIYLAGSVTAEQARAIAAQLAAAGVQFDIIINFSYDPAALPGFTPEQLAALEAATRKLIAELLKLNKGGIISGCSTSAAALIAGLKDDSDGITAMAVDPAAAAGLRALA

Here are your designs! Time to pick your favourite and predict its structure with third-party tools like [Chai](https://lab.chaidiscovery.com/dashboard) or [Boltz](https://build.nvidia.com/mit/boltz2)!

## Just to remind you that you have more control

As before, you can play around yourself, by considering various input flags [here](https://github.com/dauparas/LigandMPNN).

You can code everything yourself, including file uploads, or just use the panel to the left to upload the file to whatever location you want. Just remember that when writing calls with absolute paths, the root is `"/content/..."`.

Now, populate the script call yourself:

In [None]:
runstring = f'''python run.py \
  --pdb_path <FILL_ME> \
  --out_folder <FILL_ME> \
  --number_of_batches <FILL_ME> \
  --batch_size <FILL_ME> \
  --model_type "ligand_mpnn" \
  --fixed_residues <FILL_ME>'''

And then call it:

In [None]:
%cd /content/LigandMPNN
!../micromamba run -n env {runstring}

The output is located in `out_folder/seqs/pdb_name.fa`.

It is a FASTA file. You may already have a favourite way to work with these. If not, here is a simple script that quickly outputs the top 10 designs based on `ligand_confidence`. Feel free to rewrite it.

In [None]:
def top_10_fa(file_path):
  """Return up to 10 (title, sequence) pairs, highest ligand_confidence first."""
  results = []
  title, lc, seq = None, None, ''
  seq_num = 0
  with open(file_path, 'r') as fh:
    for line in fh:
      line = line.strip()
      if line.startswith('>'):
        seq_num += 1
        # Store the previous design before starting a new record
        if title is not None:
          results.append((lc, title, seq))
        # The first record is the input sequence, not a design, so skip it
        if seq_num >= 2:
          title = line
          # Header fields are ', '-separated; the sixth is ligand_confidence=<value>
          lc = float(line.split(', ')[5].split('=')[1])
          seq = ''
      elif seq_num >= 2 and line:
        seq = line
  # Store the last design (nothing is stored if the file has no designs)
  if title is not None:
    results.append((lc, title, seq))
  results.sort(key=lambda x: x[0], reverse=True)
  return [x[1:] for x in results[:10]]
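
The script above picks `ligand_confidence` by its position in the header line, which breaks if the field order ever changes. A keyed parse is one possible alternative. This is a minimal sketch with a hypothetical helper, `parse_design_header`; it assumes every field after the design name has the `key=value` form seen in the sample output of this tutorial.

```python
def parse_design_header(title):
    """Turn '>name, key=value, ...' into (name, {key: value}); values stay strings."""
    fields = title.lstrip(">").strip().split(", ")
    name = fields[0]
    # Split each remaining field at its first '=' only
    info = dict(field.split("=", 1) for field in fields[1:])
    return name, info

name, info = parse_design_header(
    ">PLAase_design, id=22, T=0.1, seed=68747, overall_confidence=0.3924, "
    "ligand_confidence=0.4622, seq_rec=0.2070")
print(name, float(info["ligand_confidence"]))
```

With the fields in a dictionary, ranking by a different score such as `overall_confidence` only requires changing the key.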