# Protein Preparation

Unlike traditional physics-based docking methods, DiffDock does not take a specially prepared receptor as input. It takes just the protein and ligand coordinates and can find the binding site quite effectively on its own. The only (optional) preparation step is pre-generating ESM2 embeddings of the protein, which reduces runtime. In our case, however, we have to clean up the PDB file a bit.

DiffDock expects the residues in the PDB file to be numbered sequentially. The PDB file from the database, however, uses chymotrypsinogen numbering, so we renumber the standard residues of each chain starting from 1.

In [1]:
from Bio.PDB import PDBParser, PDBIO, Model, Chain, Residue

parser = PDBParser()
structure = parser.get_structure("prt", "../vina/1ppb_cleaned.pdb")

model = next(structure.get_models())  # take the first model (usually the only one)
io = PDBIO()

# Create a new model to store renumbered residues
new_model = Model.Model(0)

for chain in model:
    new_chain = Chain.Chain(chain.id)
    new_res_id = 1
    for residue in chain:
        hetflag, resseq, icode = residue.id
        if hetflag == " ":  # standard residue (blank hetero flag): renumber it
            new_residue = Residue.Residue(
                (" ", new_res_id, " "), residue.resname, residue.segid
            )
            # copy atoms to the new residue
            for atom in residue:
                new_residue.add(atom.copy())
            new_chain.add(new_residue)
            new_res_id += 1
        else:
            # hetero residues (ligands, water) are copied with their original IDs
            new_chain.add(residue.copy())
    new_model.add(new_chain)

# Save new structure
io.set_structure(new_model)
io.save("1ppb.pdb")

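As a quick sanity check on the renumbering, the sketch below reads `ATOM` records by their fixed PDB columns and tests whether each chain is numbered 1, 2, 3, ... with no insertion codes. It is a minimal, standard-library-only illustration; the helper names `residue_numbers_by_chain`, `is_sequential`, and `check_pdb` are hypothetical and are not part of DiffDock or Biopython.

```python
def residue_numbers_by_chain(pdb_lines):
    """Collect the ordered (resSeq, iCode) pairs of each chain from ATOM records."""
    chains = {}
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue  # ignore HETATM, TER, REMARK, etc.
        chain_id = line[21]       # PDB column 22
        resseq = int(line[22:26])  # PDB columns 23-26
        icode = line[26]          # PDB column 27 (insertion code)
        key = (resseq, icode)
        ids = chains.setdefault(chain_id, [])
        # Record a residue once, when its first atom is seen
        if not ids or ids[-1] != key:
            ids.append(key)
    return chains


def is_sequential(ids):
    """True if ids are (1, ' '), (2, ' '), ... with no gaps or insertion codes."""
    return ids == [(i + 1, " ") for i in range(len(ids))]


def check_pdb(path):
    """Return {chain_id: bool} telling whether each chain is sequentially numbered."""
    with open(path) as handle:
        chains = residue_numbers_by_chain(handle)
    return {chain_id: is_sequential(ids) for chain_id, ids in chains.items()}
```

Calling `check_pdb("1ppb.pdb")` after the cell above should report `True` for every chain if the renumbering worked as intended. Running it on the original chymotrypsinogen-numbered file should report `False` for any chain that does not start at residue 1 or that contains gaps or insertion codes.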
