# Step 1: Predict Target Protein Structure with `MSA-Search` and `OpenFold2`

### RORc Nuclear Receptor Antagonist Case-Study
Rene, O.; *et al*. "[Minor Structural Change to Tertiary Sulfonamide RORc Ligands Led to Opposite Mechanisms of Action.](https://pubs.acs.org/doi/10.1021/ml500420y)" *ACS Med. Chem. Lett.* **2015**, *6*, 276-281.

**Abstract:** A minor structural change to tertiary sulfonamide RORc ligands led to distinct mechanisms of action. Co-crystal structures of two compounds revealed mechanistically consistent protein conformational changes. Optimized phenylsulfonamides were identified as RORc agonists while benzylsulfonamides exhibited potent inverse agonist activity. Compounds behaving as agonists in our biochemical assay also gave rise to an increased production of IL-17 in human PBMCs whereas inverse agonists led to significant suppression of IL-17 under the same assay conditions. The most potent inverse agonist compound showed >180-fold selectivity over the ROR isoforms as well as all other nuclear receptors that were profiled.

https://pubs.acs.org/cms/10.1021/amclct.2015.6.issue-3/asset/amclct.2015.6.issue-3.xlargecover.jpg

[MSA-Search](https://docs.nvidia.com/nim/bionemo/msa-search/latest/overview.html) performs Multiple Sequence Alignment (MSA): it compares a query amino acid sequence against protein databases and aligns similar sequences to identify conserved regions despite differences in length or motifs. The resulting alignments improve structure prediction models such as AlphaFold2 and OpenFold, which exploit the structural similarity of homologous sequences.
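To make the idea of conserved regions concrete, the sketch below scores each column of a small, made-up alignment by the fraction of rows that share the most common residue. The toy sequences and the `column_conservation` helper are illustrative assumptions and are not part of MSA-Search.

```python
from collections import Counter

def column_conservation(aligned_rows):
    """Return, per column, the fraction of rows sharing the most common residue.

    Gap characters ('-') are not counted as residues; a column of only gaps scores 0.0.
    """
    if not aligned_rows:
        return []
    width = len(aligned_rows[0])
    if any(len(row) != width for row in aligned_rows):
        raise ValueError("all aligned rows must have the same length")
    scores = []
    for column in zip(*aligned_rows):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(aligned_rows))
    return scores

# Hypothetical toy alignment (not real RORc homologs)
toy_alignment = [
    "MKTAY",
    "MKSAY",
    "MRTA-",
    "MKTAF",
]
conservation = column_conservation(toy_alignment)
print(conservation)  # [1.0, 0.75, 0.75, 1.0, 0.5]
```

Columns scoring close to 1.0 are the conserved positions that structure prediction models can lean on.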

[OpenFold2](https://docs.nvidia.com/nim/bionemo/openfold2/latest/overview.html) is a protein structure prediction model from the [OpenFold Consortium](https://openfold.io/) and the [AlQuraishi Laboratory](https://www.aqlab.io/). The model is a PyTorch re-implementation of Google DeepMind’s [AlphaFold2](https://github.com/google-deepmind/alphafold), with support for both training and inference. OpenFold2 demonstrates accuracy on par with AlphaFold2 at improved speed; see the project home at [github.com/aqlaboratory/openfold](https://github.com/aqlaboratory/openfold) for more detail.

**Note:** Although we already have the crystal structure of the protein-ligand complex [PDB: 4WQP](https://www.rcsb.org/structure/4WQP), we will still use [MSA-Search](https://docs.nvidia.com/nim/bionemo/msa-search/latest/overview.html) and [OpenFold2](https://docs.nvidia.com/nim/bionemo/openfold2/latest/overview.html) to predict the protein structure for demonstration purposes.

## Protein Sequence for RORc

[UniProt ID P51449, RORG_HUMAN](https://www.uniprot.org/uniprotkb/P51449/entry):

```
>4WQP_1, Nuclear receptor ROR-gamma, Homo sapiens
MHHHHHHGENLYFQGSAPYASLTEIEHLVQSVCKSYRETCQLRLEDLLRQRSNIFSREEVTGYQRKSMWEMWERCAHHLTEAIQYVVEFAKRLSGFMELCQN
DQIVLLKAGAMEVVLVRMCRAYNADNRTVFFEGKYGGMELFRALGCSELISSIFDFSHSLSALHFSEDEIALYTALVLINAHRPGLQEKRKVEQLQYNLELA
FHHHLCKTHRQSILAKLPPKGKLRSLCSQHVERLQIFQHLHPIVVQAAFPPLYKELFSGNS
```
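The FASTA record above wraps the sequence over several lines, while the API calls later in this notebook expect one continuous string. A minimal sketch of flattening and sanity-checking such a record follows; `read_single_fasta` and the toy record are hypothetical helpers for illustration only.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def read_single_fasta(fasta_text):
    """Return (header, sequence) from a single-record FASTA string."""
    lines = [ln.strip() for ln in fasta_text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA header line starting with '>'")
    header = lines[0][1:]
    # Join the wrapped sequence lines into one continuous string
    sequence = "".join(lines[1:]).upper()
    unexpected = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    if unexpected:
        raise ValueError(f"non-standard residue codes found: {unexpected}")
    return header, sequence

# Toy record for illustration (a short fragment, not the full construct)
toy_fasta = """>toy_record, illustrative only
MHHHHHHGENLYFQ
GSAPYASLTE
"""
toy_header, toy_sequence = read_single_fasta(toy_fasta)
print(toy_header, len(toy_sequence))
```

Applied to the full record, the returned string should match the `sequence` variable defined later in this step.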


## 1.1 Set Up the Environment

In [1]:
!pip install pandas numpy matplotlib loguru py3dmol httpx "fastapi[standard]"

Collecting loguru
  Downloading loguru-0.7.3-py3-none-any.whl.metadata (22 kB)
Collecting py3dmol
  Downloading py3Dmol-2.4.2-py2.py3-none-any.whl.metadata (1.9 kB)
Collecting fastapi[standard]
  Downloading fastapi-0.115.12-py3-none-any.whl.metadata (27 kB)
Collecting starlette<0.47.0,>=0.40.0 (from fastapi[standard])
  Downloading starlette-0.46.2-py3-none-any.whl.metadata (6.2 kB)
Collecting fastapi-cli>=0.0.5 (from fastapi-cli[standard]>=0.0.5; extra == "standard"->fastapi[standard])
  Downloading fastapi_cli-0.0.7-py3-none-any.whl.metadata (6.2 kB)
Collecting python-multipart>=0.0.18 (from fastapi[standard])
  Downloading python_multipart-0.0.20-py3-none-any.whl.metadata (1.8 kB)
Collecting email-validator>=2.0.0 (from fastapi[standard])
  Downloading email_validator-2.2.0-py3-none-any.whl.metadata (25 kB)
Collecting uvicorn>=0.12.0 (from uvicorn[standard]>=0.12.0; extra == "standard"->fastapi[standard])
  Downloading uvicorn-0.34.2-py3-none-any.whl.metadata (6.5 kB)
Collecting dn

In [2]:
import os, shutil, re, json, logging
import asyncio
from pathlib import Path
from typing import Any, Dict, Optional

import httpx
import py3Dmol
import requests
from fastapi import HTTPException
from google.colab import userdata

In [5]:
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

STATUS_URL = "https://api.nvcf.nvidia.com/v2/nvcf/pexec/status/{task_id}"

PUBLIC_URL = "https://health.api.nvidia.com/v1/biology/colabfold/msa-search/predict"

## 1.2 Set Up `output` Directory and `API_KEY`

NOTE: Be sure to follow the steps in the [README](https://github.com/bf-nv/bionemo_tutorials/blob/main/README.md) to embed your NVIDIA `API_KEY` into your Google Colab environment.

In [6]:
def prepare_output_directory(output):
    """
    Prepare the output directory
    output: str, the output directory
    return: None
    """
    # overwrite the output directory
    if os.path.exists(output):
        shutil.rmtree(output)
    os.makedirs(output)

In [9]:
API_KEY = userdata.get('API_KEY')

output_dir = "/content/output/openfold_result"
prepare_output_directory(output_dir)
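`userdata.get` only works inside Google Colab. If the notebook is run elsewhere, one possible approach is to fall back to an environment variable. The `get_api_key` helper below is a hypothetical sketch; it assumes the key is stored under the same name, `API_KEY`, in either place.

```python
import os

def get_api_key(name="API_KEY"):
    """Read the key from Colab secrets when available, else from an environment variable."""
    try:
        from google.colab import userdata  # only importable inside Google Colab
    except ImportError:
        userdata = None

    key = None
    if userdata is not None:
        try:
            key = userdata.get(name)
        except Exception:
            # Secret missing or notebook access not granted; fall through to the environment
            key = None
    if not key:
        key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"No value found for {name!r} in Colab secrets or the environment")
    return key
```

With this helper, the assignment above could be written as `API_KEY = get_api_key()`.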

## 1.3 Set Up and Run `MSA-Search`

In [31]:
async def make_nvcf_call(function_url: str,
                         api_key: str,
                         data: Dict[str, Any],
                         additional_headers: Optional[Dict[str, Any]] = None,
                         NVCF_POLL_SECONDS: int = 10,
                         MANUAL_TIMEOUT_SECONDS: int = 20) -> tuple[int, httpx.Response]:
    """
    Make a call to NVIDIA Cloud Functions using long-polling,
    which allows the request to patiently wait if there are many requests in the queue.
    """
    async with httpx.AsyncClient() as client:
        headers = {
            "Authorization": f"Bearer {api_key}",
            "NVCF-POLL-SECONDS": f"{NVCF_POLL_SECONDS}",
            "Content-Type": "application/json"
            }
        if additional_headers is not None:
            headers.update(additional_headers)
        logger.debug(f"Headers: {dict(**{h: v for h, v  in headers.items() if 'Authorization' not in h})}")
        # TIMEOUT must be greater than NVCF-POLL-SECONDS
        logger.debug(f"Making NVCF call to {function_url}")
        logger.debug(f"Data: {data}")
        response = await client.post(function_url,
                                     json=data,
                                     headers=headers,
                                     timeout=MANUAL_TIMEOUT_SECONDS)
        logger.debug(f"NVCF response: {response.status_code, response.headers}")

        if response.status_code == 202:
            # Handle 202 Accepted response
            task_id = response.headers.get("nvcf-reqid")
            while True:
                ## Each status poll should return within NVCF-POLL-SECONDS; the manual timeout adds a safety margin
                status_response = await client.get(STATUS_URL.format(task_id=task_id),
                                                   headers=headers,
                                                   timeout=MANUAL_TIMEOUT_SECONDS)
                if status_response.status_code == 200:
                    return status_response.status_code, status_response
                elif status_response.status_code in [400, 401, 404, 422, 500]:
                    raise HTTPException(status_code=status_response.status_code,
                                        detail=f"Error while waiting for function:\n{status_response.text}")
        elif response.status_code == 200:
            return response.status_code, response
        else:
            raise HTTPException(status_code=response.status_code, detail=response.text)
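The `while True` loop in `make_nvcf_call` keeps polling for as long as the status endpoint answers 202. The sketch below shows one way to bound such a loop with a maximum number of polls. It uses no network: `poll_until_complete` and `make_fake_status` are hypothetical stand-ins, and the fake endpoint simply answers 202 a few times before returning 200.

```python
import asyncio

async def poll_until_complete(fetch_status, max_polls=30, pause_seconds=0.0):
    """Call `fetch_status` until it reports HTTP 200, giving up after `max_polls` attempts.

    `fetch_status` is any coroutine function returning (status_code, body).
    """
    for attempt in range(1, max_polls + 1):
        status_code, body = await fetch_status()
        if status_code == 200:
            return attempt, body
        if status_code != 202:
            raise RuntimeError(f"polling failed with HTTP {status_code}: {body}")
        await asyncio.sleep(pause_seconds)
    raise TimeoutError(f"no result after {max_polls} polls")

def make_fake_status(pending_polls):
    """Build a stand-in for a status endpoint: 202 for `pending_polls` calls, then 200."""
    remaining = {"count": pending_polls}

    async def fake_status():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            return 202, "pending"
        return 200, "done"

    return fake_status

attempts, result = asyncio.run(poll_until_complete(make_fake_status(3)))
print(attempts, result)  # 4 done
```

Inside a notebook, where an event loop is already running, the last call would be written as `await poll_until_complete(...)` instead of using `asyncio.run`.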

### Query amino acid sequence

In [32]:
# sequence of the protein
sequence = "MHHHHHHGENLYFQGSAPYASLTEIEHLVQSVCKSYRETCQLRLEDLLRQRSNIFSREEVTGYQRKSMWEMWERCAHHLTEAIQYVVEFAKRLSGFMELCQNDQIVLLKAGAMEVVLVRMCRAYNADNRTVFFEGKYGGMELFRALGCSELISSIFDFSHSLSALHFSEDEIALYTALVLINAHRPGLQEKRKVEQLQYNLELAFHHHLCKTHRQSILAKLPPKGKLRSLCSQHVERLQIFQHLHPIVVQAAFPPLYKELFSGNS"

### Run `MSA-Search`

In [33]:
# MSA-Search output file
output_file = Path("msa_output.json")

msa_sequence = sequence

# Initial request
## Note: headers are set in make_nvcf_call function
data = {
    "sequence": msa_sequence,
    "e_value": 0.0001,
    "iterations": 1,
    "databases": ["Uniref30_2302"],
    "output_alignment_formats" : ["a3m", "fasta"]
}

print("Making request...")
code, response = await make_nvcf_call(function_url=PUBLIC_URL,
                                      api_key=API_KEY,
                                      data=data)

if code == 200:
    print(f"Request succeeded, returned {code}")
    response_dict = response.json()
    output_file.write_text(json.dumps(response_dict, indent=4))
    ## print the dictionaries in the alignments portion of the response:
    print(f"The returned databases were: {list(response_dict['alignments'].keys())} .")
    ## print the file formats returned:
    print(f"The returned formats were: {list(response_dict['alignments']['Uniref30_2302'].keys())} .")
    ## print the length of the FASTA-formatted alignment:
    print(f"The returned FASTA contained {len(response_dict['alignments']['Uniref30_2302']['fasta']['alignment'])} characters.")

Making request...
Request succeeded, returned 200
The returned databases were: ['Uniref30_2302'] .
The returned formats were: ['a3m', 'fasta'] .
The returned FASTA contained 33708 characters.
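A character count says little about how many homologs were found. As a rough sketch, the number of records in FASTA- or a3m-formatted text can be estimated by counting header lines; `count_alignment_records` and the toy text are illustrative assumptions.

```python
def count_alignment_records(alignment_text):
    """Count header lines (those starting with '>') in FASTA- or a3m-formatted text."""
    return sum(1 for line in alignment_text.splitlines() if line.startswith(">"))

# Toy alignment text for illustration
toy_alignment_text = ">query\nMKTAY\n>hit_1\nMKSAY\n>hit_2\nMRTA-\n"
n_records = count_alignment_records(toy_alignment_text)
print(n_records)  # 3
```

The same function could be applied to `response_dict['alignments']['Uniref30_2302']['a3m']['alignment']` to see how many records came back.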


## 1.4 Process `MSA-Search` Result

`OpenFold2` requires the alignment in `a3m` format; see the [OpenFold2 example requests](https://docs.nvidia.com/nim/bionemo/openfold2/latest/example-requests.html).

In [66]:
import re

def extract_uniref_sequences(s, n, query_seq):
    """
    Build a trimmed a3m string from an MSA-Search alignment.
    Keeps up to the first n records (2 * n lines, assuming one header line and one
    sequence line per record) starting from the first '>UniRef' occurrence,
    removes everything after the first tab on each line, and prepends the query
    as a '>query_sequence' header followed by query_seq.
    Returns None if no '>UniRef' header is found.
    """
    n_lines = n * 2
    match = re.search(r'>UniRef', s)
    if match is None:
        return None
    # Keep everything from the first UniRef header onward, dropping blank lines
    lines = s[match.start():].split('\n')
    cleaned_lines = [line.split('\t')[0] for line in lines if line.strip() != '']
    # Add the query sequence as the first record
    output_lines = [f">query_sequence\n{query_seq}"] + cleaned_lines[:n_lines]
    return '\n'.join(output_lines)

In [70]:
a3m_dict_msa_search = response_dict['alignments']['Uniref30_2302']['a3m']['alignment']
a3m_dict_msa_search = extract_uniref_sequences(a3m_dict_msa_search, 4, sequence)
a3m_dict_msa_search

'>query_sequence\nMHHHHHHGENLYFQGSAPYASLTEIEHLVQSVCKSYRETCQLRLEDLLRQRSNIFSREEVTGYQRKSMWEMWERCAHHLTEAIQYVVEFAKRLSGFMELCQNDQIVLLKAGAMEVVLVRMCRAYNADNRTVFFEGKYGGMELFRALGCSELISSIFDFSHSLSALHFSEDEIALYTALVLINAHRPGLQEKRKVEQLQYNLELAFHHHLCKTHRQSILAKLPPKGKLRSLCSQHVERLQIFQHLHPIVVQAAFPPLYKELFSGNS\n>UniRef100_A0A6J0U6P0\n----------------SPNSSLTEIEHLTQNVLKSYRETCQLRLEDLQLLRWETFTKEEVGSYQKKSMEEMWERCACHITEAIQYVVEFAKRMSGFMDLCQNDQIVLLKAGAMEVVLVRMCRAFNSENRTVFFEGKYASPELFKSLGCNELINSIFDFAHSLSSLHFSENEIALFTALVLINSNRPWLQEKSKVERLQNNLELAFKHMLRKNHREGILAKLPPKGKLRNLCYQHMEKLRSFRQMYPIIVHAVFPPLYKELFSSD-\n>UniRef100_UPI002303DBCD\n----------------SPNSSLTEIEHLTQNVLKSYRETCQLRLEDLQLLRWEAFTREEVGSYQKKSIEEMWERCACRVTEAIQYVVEFAKRMGGFMELCQNDQIVLLKAGAMEVVLVRMCRAFNSENRTVFFEGKYAGPELFKSLGCNELINSIFDFAHSLCSLHFSENEIALFTALVLINSNRPWLQERSKVERLQNNLELAFKHMLRKNHREGILAKLPPKGKLRSLCYQHMEKLRSFRQMYPIIVHAVFPPLYKELFSSD-\n>UniRef100_A0A1U7SQV7\n----------------SPHSSVTEIEHLTQNVLKSYRETCQLRLEDLQLLRWETFSREEVSTYQKKSMEEMWERCACRITEAIQYVVEFAKRMGGFMELCQNDQIVLLK
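In the a3m format, lowercase letters mark insertions relative to the query and `-` marks deletions, so removing the lowercase characters from any row should leave a string as long as the query row. The sketch below checks that property on a toy alignment; `parse_a3m`, `match_columns`, and `check_a3m_against_query` are hypothetical helper names.

```python
def parse_a3m(a3m_text):
    """Split a3m text into a list of (header, row) pairs."""
    records, header, chunks = [], None, []
    for line in a3m_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def match_columns(row):
    """Drop lowercase insertion characters, keeping match states and gaps."""
    return "".join(c for c in row if not c.islower())

def check_a3m_against_query(a3m_text):
    """Return headers of rows whose match-column length differs from the query row."""
    records = parse_a3m(a3m_text)
    if not records:
        return []
    query_length = len(match_columns(records[0][1]))
    return [h for h, row in records[1:] if len(match_columns(row)) != query_length]

# Toy a3m: 'hit_ok' has two lowercase insertions, 'hit_short' is one column short
toy_a3m = ">query_sequence\nMKTAY\n>hit_ok\nMKtsSAY\n>hit_short\nMK-A\n"
bad_rows = check_a3m_against_query(toy_a3m)
print(bad_rows)  # ['hit_short']
```

Running `check_a3m_against_query(a3m_dict_msa_search)` on the trimmed alignment would be expected to return an empty list if every row lines up with the query.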

## 1.4 Use `OpenFold2` to Predict the Protein Structure

In [71]:
output_file = Path(f"{output_dir}/prediction_output.json")
selected_models = [1, 2]

print(f"length of protein: {len(sequence)} aa")

length of protein: 265 aa


In [72]:
invoke_url = "https://health.api.nvidia.com/v1/biology/openfold/openfold2/predict-structure-from-msa-and-template"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "content-type": "application/json",
    "NVCF-POLL-SECONDS": "300",
}

payload = {
    "sequence": sequence,
    "selected_models": [1, 2],
    "relax_prediction": False,
    "alignments": {
        "uniref90": {
            "a3m": {
                "alignment": a3m_dict_msa_search,
                "format": "a3m",
            }
        },
    },
}

In [75]:
# initiate a session to re-use connections
session = requests.Session()
print("Making request, this may take several seconds...")
response = session.post(invoke_url, headers=headers, json=payload)

if response.status_code != 200:
    # Stop here so later cells do not run against a failed request
    raise RuntimeError(f"Unexpected HTTP status {response.status_code}: {response.text}")

output_file.write_text(response.text)
print(f"Response output to file: {output_file}")

# Take the first entry of "structures_in_ranked_order" as the prediction to keep
result = response.json()
pred_pdb_file = result["structures_in_ranked_order"][0]["structure"]

Making request, this may take several seconds...
Response output to file: /content/output/openfold_result/prediction_output.json


## 1.5 Save the `PDB` Result

In [None]:
# Write PDB file
fp = os.path.join(output_dir, "predicted_protein.pdb")
with open(fp, "w") as f:
    f.write(pred_pdb_file) # the str representation of the PDB file

In [None]:
# Download the predicted protein PDB file, which must be uploaded in Step 3 (DiffDock inference)
from google.colab import files
# Use a separate name for the path so the PDB string in pred_pdb_file is not overwritten
pred_pdb_path = '/content/output/openfold_result/predicted_protein.pdb'
files.download(pred_pdb_path)

## 1.6 Visualize the Result

We can examine the predicted protein structure along with its pLDDT confidence scores. These scores are stored in the B-factor field of the returned structure data.

To maintain consistency with the [EBI/AlphaFold Database](https://alphafold.ebi.ac.uk/), we’ll use pLDDT scores to indicate confidence in the predicted protein structure. These scores will be visualized using `py3Dmol`, with color coding to highlight regions of varying reliability.

<DIV style="width: 600px;">
<IMG src="https://res.cloudinary.com/dpfqlyh21/image/upload/v1705026011/obsidian/izrfmiepbzpnzm2aoqwh.png">
</DIV>




In [None]:
def load_protein(pdb_file_path, width=800, height=600):

    """
    Load a protein structure from a PDB file and display it using py3Dmol.
    pdb_file_path: str, path to the PDB file
    width: int, width of the viewer in pixels
    height: int, height of the viewer in pixels
    return: py3Dmol.view object
    """

    with open(pdb_file_path) as ifile:
        pdb_data = "".join([x for x in ifile])

    view = py3Dmol.view(width=width, height=height)
    view.addModelsAsFrames(pdb_data)

    for line in pdb_data.split("\n"):
        # Only ATOM records carry the per-atom pLDDT values
        if not line.startswith("ATOM"):
            continue
        # PDB is a fixed-width format: the B-factor (pLDDT here) occupies columns 61-66.
        # Splitting on whitespace is unreliable because adjacent fields can run together.
        b_factor = float(line[60:66])
        if b_factor > 90:
            color = "blue"
        elif 70 <= b_factor <= 90:
            color = "cyan"
        elif 50 <= b_factor < 70:
            color = "yellow"
        else:
            color = "orange"

        # The atom serial number occupies columns 7-11
        idx = int(line[6:11])

        # Style is set per atom serial number
        view.setStyle({'model': -1, 'serial': idx}, {"cartoon": {'color': color}})
    view.zoomTo()
    return view

In [None]:
view = load_protein(
    pdb_file_path = '/content/output/openfold_result/predicted_protein.pdb',
    width=800,
    height=500
)
view.show()
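Beyond the colored view, the confidence can also be summarized numerically. The sketch below reads the B-factor columns of the C-alpha `ATOM` records, assuming they hold pLDDT as described in this section, and reports the mean together with the residue count per confidence band. `plddt_per_residue`, `summarize_plddt`, and the generated toy records are illustrative assumptions.

```python
def plddt_per_residue(pdb_text):
    """Read pLDDT from the B-factor columns (61-66) of each C-alpha ATOM record."""
    scores = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return scores

def summarize_plddt(scores):
    """Return the mean pLDDT and the residue count per confidence band."""
    bands = {"very high (>90)": 0, "confident (70-90)": 0,
             "low (50-70)": 0, "very low (<50)": 0}
    for score in scores:
        if score > 90:
            bands["very high (>90)"] += 1
        elif score >= 70:
            bands["confident (70-90)"] += 1
        elif score >= 50:
            bands["low (50-70)"] += 1
        else:
            bands["very low (<50)"] += 1
    mean = sum(scores) / len(scores) if scores else float("nan")
    return mean, bands

def make_toy_atom_line(serial, res_seq, b_factor):
    """Build a fixed-width C-alpha ATOM record with placeholder coordinates."""
    return (f"ATOM  {serial:>5d}  CA  ALA A{res_seq:>4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{b_factor:6.2f}")

# Toy records, one per confidence band
toy_pdb = "\n".join(make_toy_atom_line(i, i, b)
                    for i, b in enumerate([95.0, 80.0, 60.0, 40.0], start=1))
toy_scores = plddt_per_residue(toy_pdb)
mean_plddt, band_counts = summarize_plddt(toy_scores)
print(mean_plddt, band_counts)
```

For the real prediction, the text of `predicted_protein.pdb` could be passed to `plddt_per_residue` in place of `toy_pdb`.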

## 1.7 Visualize the Superimposed Structure

To show how closely the `OpenFold2` prediction matches the experimentally determined structure, we superimpose the two. [PDB: 4WQP](https://www.rcsb.org/structure/4WQP) is deposited in RCSB as a dimer, but we need the monomer, so we use a local PDB file instead.

Monomer of 4WQP: [4wqp_monomer.pdb](4wqp_monomer.pdb)

### Establish the `PyMOL` Environment for Structure Alignment

`PyMOL` will be used to align the `OpenFold2` predicted protein structure with the ground truth PDB file.

1) Install `PyMOL` in your Google Colab environment.

2) Verify the installation by following the tutorial provided here: https://github.com/MooersLab/colabOpenSourcePyMOLpySnips?tab=readme-ov-file.



In [None]:
from IPython.utils import io
import tqdm.notebook
import os
"""The PyMOL installation is done inside two nested context managers. This approach
was inspired by Dr. Christopher Schlick's (of the Phenix group at
Lawrence Berkeley National Laboratory) method for installing cctbx
in a Colab Notebook. He presented his work on September 1, 2021 at the IUCr
Crystallographic Computing School. Blaine Mooers adapted Chris's approach here. This revised
approach was presented at the SciPy2021 conference in July 2021 and published in the
[proceedings](http://conference.scipy.org/proceedings/scipy2021/blaine_mooers.html).
The new approach is easier for beginners to use. The old approach is easier to debug
and could be used as a back-up approach.

Thank you to Professor David Oppenheimer of the University of Florida for suggesting the use of mamba and of Open Source PyMOL.
"""
total = 100
with tqdm.notebook.tqdm(total=total) as pbar:
    with io.capture_output() as captured:

        !pip install -q condacolab
        import condacolab
        condacolab.install()
        pbar.update(10)

        import sys
        sys.path.append('/usr/local/lib/python3.10/site-packages/')
        pbar.update(20)

        # Install PyMOL
        %shell mamba install pymol-open-source --yes

        # update() is incremental: 10 + 20 + 70 reaches the total of 100
        pbar.update(70)

### The code below tests whether the `PyMOL` installation succeeded

If the test fails, rerun the installation cell above.

In [None]:
from pymol import cmd
from IPython.display import Image

# download the ground truth pdb "4wqp_monomer.pdb"
!wget https://raw.githubusercontent.com/bf-nv/bionemo_tutorials/refs/heads/main/4wqp_monomer.pdb

true_pdb_file = '/content/4wqp_monomer.pdb' # true crystal structure

def truncate_string(input_string):
    # Find the last occurrence of '/'
    last_slash_index = input_string.rfind('/')
    return input_string[last_slash_index + 1:]

# Using local PDB file since 4WQP is a dimer in RCSB, but we need the monomer
# Load the local PDB file
cmd.load(true_pdb_file)
obj_name = truncate_string(true_pdb_file).split(".")[0]
cmd.orient(obj_name)
cmd.set_view((-0.292365283,    0.898809373,    0.326595277,
               0.506281018,   -0.144253701,    0.850216091,
               0.811294436,    0.413922817,   -0.412875712,
               0.000079364,   -0.000037231, -176.867355347,
              19.699285507,  -33.096054077,    0.763063431,
             139.444869995,  214.293136597,  -20.000000000 ))
cmd.png("test.png")
Image(filename="test.png", unconfined=True)

### Function to align ground-truth PDB with the `OpenFold2` predicted structure

In [None]:
import py3Dmol
from pymol import cmd
import os

def align_protein(pred_pdb_file, true_pdb_file, output_dir, pred_color = 'cyan', true_color = 'green', width=800, height=600):
    """
    Align two protein structures and display them using py3Dmol.
    pred_pdb_file: str, path to the PDB file of the predicted protein
    true_pdb_file: str, path to the PDB file of the true protein
    output_dir: str, path to the output directory to save the aligned PDB files
    pred_color: str, color of the predicted protein in the viewer
    true_color: str, color of the true protein in the viewer
    width: int, width of the viewer in pixels
    height: int, height of the viewer in pixels
    return: py3Dmol.view object
    """

    # always reinitialize!
    cmd.reinitialize()

    cmd.load(pred_pdb_file, 'predicted_protein')
    cmd.load(true_pdb_file, 'true_protein')

    # Align the structures on the backbone atoms
    cmd.align('predicted_protein & backbone', 'true_protein & backbone', quiet=0)

    # Save the aligned structures
    pred_aligned_fp = os.path.join(output_dir, 'aligned_predicted_protein.pdb')
    true_aligned_fp = os.path.join(output_dir, 'aligned_true_protein.pdb')
    cmd.save(pred_aligned_fp, 'predicted_protein')
    cmd.save(true_aligned_fp, 'true_protein')

    # Read the aligned PDB files back from output_dir as strings
    with open(pred_aligned_fp, 'r') as file:
        pred_pdb_data = file.read()

    with open(true_aligned_fp, 'r') as file:
        true_pdb_data = file.read()

    # Create a py3Dmol view
    view = py3Dmol.view(width=width, height=height)

    # Add the predicted protein model
    view.addModel(pred_pdb_data, 'pdb')
    # Color the predicted protein (cyan by default)
    view.setStyle({'model': 0}, {'cartoon': {'color': pred_color}})

    # Add the true protein model
    view.addModel(true_pdb_data, 'pdb')
    # Color the true protein (green by default)
    view.setStyle({'model': 1}, {'cartoon': {'color': true_color}})

    view.zoomTo()
    return view

### Visualize the Results

In [None]:
view = align_protein(
    pred_pdb_file = '/content/output/openfold_result/predicted_protein.pdb', # prediction
    true_pdb_file = '/content/4wqp_monomer.pdb', # true crystal structure
    output_dir = '/content/output/openfold_result', # output directory to store the aligned poses so that py3dmol can load back
    pred_color = 'cyan', # color for the predicted structure
    true_color = 'green', # color for the true structure
    width=800,
    height=500
)
view.show()
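A visual overlay is usually paired with a number such as the root-mean-square deviation (RMSD) over matched atoms. The sketch below computes RMSD for two equally long coordinate lists that are assumed to be already superimposed and paired atom by atom; it does not perform the superposition itself. The `rmsd` helper and the toy coordinates are illustrative assumptions.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two paired lists of (x, y, z) coordinates."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Toy coordinates: the second set is shifted by 1 along z
toy_pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
toy_true = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)]
toy_rmsd = rmsd(toy_pred, toy_true)
print(toy_rmsd)  # 1.0
```

PyMOL's `cmd.align` also returns alignment statistics, with the RMSD reported as the first element of its result, so capturing that return value inside `align_protein` would be another way to obtain the figure.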