# Module 2: Run Protein Structure Design and Protein Structure Prediction


NOTE: The authors recommend running this notebook in Amazon SageMaker Studio with the following environment settings:  
* **PyTorch 1.13 Python 3.9 GPU-optimized** image  
* **Python 3** kernel  
* **ml.g4dn.xlarge** instance type  

---

Analyzing large macromolecules is an essential step in therapeutic design. Over the past few years, a number of deep-learning-based approaches have added computational complexity to this task. There are several types of these structure-based workloads, and a few common ones are shown in the figure below. In this second module, we show how you can combine protein structure design and protein structure prediction.

![A diagram of common protein structure design steps](img/01.png)


In this notebook, we will design **new** variants of the heavy chain for the structure of Herceptin (Trastuzumab). The steps for the pipeline are as follows:

* [RFDiffusion](https://github.com/RosettaCommons/RFdiffusion) is used to generate a small number of variant designs. We will only attempt to redesign parts of the `variable region`.

* [ProteinMPNN](https://github.com/dauparas/ProteinMPNN) is then used to discover novel sequences that are expected to fold to the novel structure.

* ESMFold is then used to score each of the candidate proteins. ESMFold returns the average predicted local distance difference test (pLDDT) score, which represents the confidence (averaged over all residues) in the predicted structure. We use this score to assess whether the predicted structure is likely to be correct.

For running ESMFold, we will use the ESMFold endpoint deployed in Module 1, so please ensure that you have run that module **before** running this one.

This notebook shows you how to run protein structure design (using the RFDiffusion tool) then inverse protein folding (ProteinMPNN) along with protein structure prediction (ESMFold).  

Here is the structure of the heavy chain, which we are going to redesign. The residues we are going to redesign are highlighted in green.  
![image of heavy chain](img/04.png)

## 1. Setup and installation

Install RFdiffusion and its dependencies

In [None]:
%pip install -U -q -r protein-design-requirements.txt --disable-pip-version-check

Download and extract the RFDiffusion model weights (This will take several minutes)

In [None]:
%%bash
mkdir -p "data/weights"
aws s3 cp --no-sign-request "s3://aws-batch-architecture-for-alphafold-public-artifacts/compressed/rfdiffusion_parameters_220407.tar.gz" "weights.tar.gz"
tar --extract -z --file="weights.tar.gz" --directory="data/weights" --no-same-owner
rm "weights.tar.gz"

In [None]:
# %%bash
# #perform the actual installation
# pip -q install --upgrade pip
# cd RFdiffusion/env/SE3Transformer
# pip -q install --no-cache-dir -r requirements.txt
# python setup.py install
# cd ../.. # change into the root directory of the repository
# pip install -e . # install the rfdiffusion module from the root of the repository

# pip -q install jedi omegaconf hydra-core icecream
# pip -q install dgl==1.0.2+cu116 -f https://data.dgl.ai/wheels/cu116/repo.html
# pip -q install pyrsistent

## 2. Design new structures with RFdiffusion
First we will run RFDiffusion to design novel protein structures. We will pass to it the protein structures and which parts to design. Specifically, we want it to redesign **only** the residues 98-109 on the B chain, while keeping the rest of the structure the same.

Let's take a look at the regions of interest. In the following display, the B chain is blue while the target is in green.

In [12]:
import py3Dmol

view = py3Dmol.view(width=600, height=400)
with open("data/1n8z.pdb") as ifile:
    experimental_structure = ifile.read()
view.addModel(experimental_structure)
view.setStyle({"chain": "A"}, {"opacity": 0})
view.setStyle({"chain": "B"}, {"cartoon": {"color": "blue", 'opacity': 0.4}})
view.addStyle({"chain": "B", "resi":'98-109'},{'cartoon':{'color':'#57C4F8', 'opacity': 1.0}})
view.setStyle({"chain": "C"}, {"cartoon": {"color": "green", "opacity": 0.4}})
view.setStyle({"chain": "C", "resi":'540-580'}, {"cartoon": {"color": "#37F20E", "opacity": 1.0}})

view.zoomTo()
view.show()

Next we ask RFdiffusion to create some new structures for us. In this case, we provide some design constraints:

- Look at residues 540-580 of the target molecule (green structure in image above)
- Create a new structure that includes residues 1-97 of the heavy chain (blue above), then generate 13 new residues, then residues 110-120 from the heavy chain.
- Use residues 570 and 573 from the target molecule as "hotspots", meaning we want to make sure that the new structure interacts with these specific amino acids.
- Create 4 designs in total
- Leave the rest of the design parameters as the default values
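
To make the contig constraint easier to read, here is a minimal sketch that splits a contig string into its fixed and designed segments and reports their lengths. The `parse_contig` helper is hypothetical and is not part of RFdiffusion or `prothelpers`. It assumes the simple syntax used in this notebook: a chain letter plus a residue range for fixed segments, a bare integer for designed residues, and `/0` followed by a space for a chain break. Length ranges and other contig features are not handled.

```python
import re


def parse_contig(contig: str):
    """Split a contig string into chains of (kind, label, length) segments.

    kind is "fixed" (copied from the input PDB) or "designed" (generated).
    Hypothetical helper for illustration only.
    """
    chains = []
    for chain_spec in contig.strip("[]").split():
        segments = []
        for part in chain_spec.split("/"):
            if part == "0":
                continue  # "/0" marks a chain break, not residues
            fixed = re.fullmatch(r"([A-Za-z])(\d+)-(\d+)", part)
            if fixed:
                start, end = int(fixed.group(2)), int(fixed.group(3))
                segments.append(("fixed", part, end - start + 1))
            elif part.isdigit():
                segments.append(("designed", part, int(part)))
            else:
                raise ValueError(f"Unsupported contig segment: {part!r}")
        chains.append(segments)
    return chains


chains = parse_contig("[C540-580/0 B1-97/13/B110-120]")
for segments in chains:
    total = sum(length for _, _, length in segments)
    print(total, segments)
```

Under these assumptions, the first chain is the fixed target fragment and the second chain is the heavy chain with a generated segment between its two fixed parts.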

In [13]:
from prothelpers.rfdiffusion import create_structures

create_structures(
    overrides = [
    'inference.input_pdb=data/1n8z.pdb',
    'inference.output_prefix=data/results/result',
    'inference.model_directory_path=data/weights',
    'contigmap.contigs=[C540-580/0 B1-97/13/B110-120]', 
    'ppi.hotspot_res=[C570,C573]',
    'inference.num_designs=4',
    # 'denoiser.noise_scale_ca=0',
    # 'denoiser.noise_scale_frame=0'
])

Reading models from data/weights
This is inf_conf.ckpt_path
data/weights/Complex_base_ckpt.pt
Assembling -model, -diffuser and -preprocess configs from checkpoint
USING MODEL CONFIG: self._conf[model][n_extra_block] = 4
USING MODEL CONFIG: self._conf[model][n_main_block] = 32
USING MODEL CONFIG: self._conf[model][n_ref_block] = 4
USING MODEL CONFIG: self._conf[model][d_msa] = 256
USING MODEL CONFIG: self._conf[model][d_msa_full] = 64
USING MODEL CONFIG: self._conf[model][d_pair] = 128
USING MODEL CONFIG: self._conf[model][d_templ] = 64
USING MODEL CONFIG: self._conf[model][n_head_msa] = 8
USING MODEL CONFIG: self._conf[model][n_head_pair] = 4
USING MODEL CONFIG: self._conf[model][n_head_templ] = 4
USING MODEL CONFIG: self._conf[model][d_hidden] = 32
USING MODEL CONFIG: self._conf[model][d_hidden_templ] = 32
USING MODEL CONFIG: self._conf[model][p_drop] = 0.15
USING MODEL CONFIG: self._conf[model][SE3_param_full] = {'num_layers': 1, 'num_channels': 32, 'num_degrees': 2, 'n_heads': 4, 'd

Our new designs are in the `data/results` folder. Let's take a look at them.

In [17]:
import os
from math import ceil

rfdiffusion_results_dir = "data/results"
structures = []
for obj in os.listdir(rfdiffusion_results_dir):
    if ".pdb" in obj:
        p = os.path.join(rfdiffusion_results_dir, obj)
        with open(p, "r") as f:
            structures.append(f.read())

total_cols = 2
total_rows = ceil(len(structures) / total_cols)
view = py3Dmol.view(viewergrid=(total_rows, total_cols), width=500, height=500)
view.removeAllModels()
k = 0
# py3Dmol addresses grid viewers as (row, column)
for i in range(total_rows):
    for j in range(total_cols):
        if k < len(structures):
            view.addModel(structures[k], "pdb", viewer=(i, j))
            k += 1
            
view.setStyle({"chain": "A"}, {"cartoon": {"color": "blue", 'opacity': 0.4}})
view.setStyle({"chain": "A", "resi":'98-109'},{'cartoon':{'color':'#57C4F8', 'opacity': 1.0}})
view.setStyle({"chain": "B"}, {"cartoon": {"color": "green"}})

view.zoomTo()
view.show()

We can also view an animation of the diffusion process for one of the designs.

In [None]:
with open(os.path.join(rfdiffusion_results_dir, "traj/result_0_pX0_traj.pdb")) as t:
    trajectory = t.read()
view = py3Dmol.view(width=600, height=600)
view.addModelsAsFrames(trajectory)
view.setStyle({"chain": "A"}, {"cartoon": {"color": "blue", 'opacity': 0.4}})
view.setStyle({"chain": "A", "resi":'98-109'},{'cartoon':{'color':'#57C4F8', 'opacity': 1.0}})
view.setStyle({"chain": "B"}, {"cartoon": {"color": "green"}})
view.animate({"loop": "backward"})
view.zoomTo()
view.show()

## 3. Install and run ProteinMPNN
ProteinMPNN is a tool for **inverse protein folding**. In inverse protein folding, the input is a protein tertiary structure, while the output is a sequence (or sequences) predicted to fold into the specified structure. Here is a schematic of how it works:
<div style="text-align: left;">
    <img src="img/06.png" alt="A diagram of inverse protein folding" width="700" />
</div>
                        
*image credit: https://huggingface.co/spaces/simonduerr/ProteinMPNN.*        

                                 
 
 
ProteinMPNN then returns new files in [FASTA format](https://software.broadinstitute.org/software/igv/FASTA). A FASTA file stores each protein sequence as a header line followed by the sequence itself, in the following format.

\>HEADER|CHAIN X|FURTHER DETAILS AND INFORMATION ABOUT THE SEQUENCE \
XXXXXXXXXXXX #The actual Sequence


For instance, the FASTA sequence of the Herceptin Fab antibody heavy chain that we want to redesign is:

\>1N8Z_2|Chain B|Herceptin Fab (antibody) - heavy chain|Mus musculus (10090)
EVQLVESGGGLVQPGGSLRLSCAASGFNIKDTYIHWVRQAPGKGLEWVARIYPTNGYTRYADSVKGRFTISADTSKN
TAYLQMNSLRAEDTAVYYC**SRWGGDGFYAMDY**WGQGTLVTVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVK
DYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYICNVNHKPSNTKVDKKVEP
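
As a rough illustration of the format, the sketch below parses FASTA text into header and sequence pairs, joining sequences that wrap across several lines. The `read_fasta` helper and the two example records are made up for this illustration. They are not ProteinMPNN output.

```python
def read_fasta(text: str):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical records, used only to exercise the parser
example = """>design_1|Chain A|hypothetical record
EVQLVESGGG
LVQPGGSLRL
>design_2|Chain A|hypothetical record
EVQLVESGGA
"""
records = read_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```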



In [None]:
!git clone https://github.com/dauparas/ProteinMPNN.git
!cd ProteinMPNN && git checkout be1d37b6699dcd2283ab5b6fc8cc88774e2c80e9 && rm -rf colab_notebooks examples inputs outputs training .git

Next, we will run ProteinMPNN on the RFdiffusion output structures.

In [None]:
import os

# Collect the RFdiffusion designs that were written to data/results in section 2
rf_dir = "data/results"

rfdiffusion_candidates = []
for filename in sorted(os.listdir(rf_dir)):
    if filename.endswith(".pdb"):
        rfdiffusion_candidates.append(os.path.abspath(os.path.join(rf_dir, filename)))


In [None]:
import os
import subprocess

output_dir = os.path.expanduser("~/bioit_ai_v2/ProteinMPNN/outputs/example_outputs")
for candidate_pdb in rfdiffusion_candidates:
    # Pass the arguments as a list so that no shell line continuations are needed
    mpnn_cmd = [
        "python", "ProteinMPNN/protein_mpnn_run.py",
        "--pdb_path", candidate_pdb,
        "--out_folder", output_dir,
        "--num_seq_per_target", "3",
        "--pdb_path_chains", "A",
        "--sampling_temp", "0.1",
        "--seed", "37",
        "--batch_size", "1",
    ]
    subprocess.run(mpnn_cmd, check=True)


In [None]:
# Collect the FASTA files that ProteinMPNN wrote to the seqs subfolder
mpnn_candidates = []
mpnn_dir = os.path.join(output_dir, "seqs")
for filename in os.listdir(mpnn_dir):
    if filename.endswith(".fa"):
        mpnn_candidates.append(os.path.abspath(os.path.join(mpnn_dir, filename)))


In [None]:
# Collect the sequences generated by ProteinMPNN so we can pass them to ESMFold
import itertools


def extract_pmpnn_sequences(fasta_file):
    """Return the designed sequences from a ProteinMPNN FASTA file.

    ProteinMPNN writes the sequence of the input structure as the first
    record, so that record is skipped and only the sampled designs are kept.
    """
    with open(fasta_file, "r") as input_file:
        sequence_lines = [line.strip() for line in input_file if not line.startswith(">")]
    return sequence_lines[1:]


# Flatten the per-file lists into a single list of candidate sequences
mpnn_sequences = list(
    itertools.chain.from_iterable(extract_pmpnn_sequences(f) for f in mpnn_candidates)
)
print(mpnn_sequences)


## 4. Run inference on ESMFold
Next, we will check whether the designed sequences are actually predicted to fold into the intended structure. ESMFold scores each candidate protein and returns the average predicted local distance difference test (pLDDT) score, which represents the confidence (averaged over all residues) in the predicted structure.

In [None]:
endpoint_name="emsfold-v1-rt-endpoint-202305021683040535"  # Enter the name of the ESMFold endpoint you created in Module 1


In [None]:
# Pass the sequences to the ESMFold endpoint and write the results to disk
import json
import os

import sagemaker
from sagemaker.deserializers import StringDeserializer
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer

os.makedirs("esmfold_predictions", exist_ok=True)

predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker.Session(),
    serializer=CSVSerializer(),  # Set the serializer depending on your input data format
    deserializer=StringDeserializer(),
)

esmfold_file_list = []
for i, sequence in enumerate(mpnn_sequences):
    esm_output = predictor.predict(sequence)
    output_path = f"esmfold_predictions/prediction_{i}.pdb"
    # The endpoint returns a JSON list whose first element is the PDB text
    with open(output_path, "w") as f:
        f.write(json.loads(esm_output)[0])
    esmfold_file_list.append(output_path)

In [None]:
# For each predicted structure, compute the average pLDDT score
from statistics import mean


def get_average_plddt(input_file):
    """Average the B-factor column, which holds the pLDDT, over all atom records."""
    plddts = []
    with open(input_file, "r") as in_file:
        for line in in_file:
            if line.startswith(("ATOM", "HETATM")):
                plddts.append(float(line[60:66]))
    return mean(plddts)


all_plddts = [get_average_plddt(path) for path in esmfold_file_list]
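
The cell above averages the B-factor column over every atom record, so residues with more atoms carry more weight. As a hedged alternative, the sketch below keeps one value per residue by reading only the alpha-carbon (`CA`) atoms, which is closer to an average over residues. It assumes the standard fixed-width PDB columns and that the pLDDT is stored in the B-factor column, as the cell above also assumes. The `per_residue_plddt` and `atom_line` helpers and the four-atom example are hypothetical.

```python
from statistics import mean


def per_residue_plddt(pdb_text: str):
    """Return {(chain, residue_number): pLDDT} using one CA atom per residue."""
    scores = {}
    for line in pdb_text.splitlines():
        # Fixed-width PDB columns (1-based): atom name 13-16, chain 22,
        # residue number 23-26, B-factor 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            key = (line[21], int(line[22:26]))
            scores[key] = float(line[60:66])
    return scores


def atom_line(serial, name, res, chain, resseq, bfactor):
    """Build a synthetic ATOM record with dummy coordinates (illustration only)."""
    return (
        f"ATOM  {serial:>5} {name:<4} {res:>3} {chain}{resseq:>4}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{bfactor:6.2f}"
    )


# Two made-up residues with two atoms each
example_pdb = "\n".join([
    atom_line(1, "N", "GLU", "A", 1, 0.50),
    atom_line(2, "CA", "GLU", "A", 1, 0.90),
    atom_line(3, "N", "VAL", "A", 2, 0.70),
    atom_line(4, "CA", "VAL", "A", 2, 0.80),
])

residue_scores = per_residue_plddt(example_pdb)
residue_mean = mean(residue_scores.values())
print(residue_scores, residue_mean)
```

For structures where every atom of a residue carries the same pLDDT value, the two averages differ only through the number of atoms per residue.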

In [None]:
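import matplotlib.pyplot as plt

# Plot the average pLDDT of each ESMFold prediction
plt.scatter(range(len(all_plddts)), all_plddts)
plt.xlabel("Prediction index")
plt.ylabel("Average pLDDT")
plt.show()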

You can see from the result above that most of the designed proteins have a pLDDT of 0.8 or greater, representing high confidence in most of the structures.
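
To put a number on "most", a short sketch like the following counts how many designs clear a threshold and ranks them by score. The `summarize_plddts` helper and the example scores are hypothetical. The default threshold assumes scores on a 0 to 1 scale, as in the text above. If your endpoint reports pLDDT on a 0 to 100 scale, pass `threshold=80` instead.

```python
def summarize_plddts(plddts, threshold=0.8):
    """Return (count_passing, fraction_passing, indices_ranked_by_score)."""
    passing = [score for score in plddts if score >= threshold]
    # Indices of the designs, best score first
    ranked = sorted(range(len(plddts)), key=lambda i: plddts[i], reverse=True)
    fraction = len(passing) / len(plddts) if plddts else 0.0
    return len(passing), fraction, ranked


example_scores = [0.91, 0.78, 0.85, 0.88]  # made-up values for illustration
count, fraction, ranked = summarize_plddts(example_scores)
print(count, fraction, ranked)
```

In the notebook, the same call could be applied to `all_plddts` to choose which predictions to inspect first.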

Here is a screenshot of one example of the designed antibody (blue) superimposed on the original antibody (green). The orange and red regions correspond to the extracellular domain of HER2. Note that the structure of the designed antibody is similar, but not identical, to the original.

![Picture of designed protein](img/03.png)
