# Boltz on Vertex AI 

---

This guide walks you through **deploying Boltz-2** to Google Cloud's Vertex AI. We'll cover:

1. **Quick introduction to Boltz-2** and its scientific background  
2. **Local setup** using Docker  
3. **Pushing the Docker image** to Google’s Artifact Registry  
4. **Spinning up and using** the container locally  
5. **Submitting a Vertex AI job** for scalable cloud execution

Check out the [Boltz-2 notebook](./boltz2.ipynb) for interactive examples.

## Introduction to Boltz-2's Protein-Folding Journey and Cloud Compute

---

Predicting 3D protein structures has been a "grand challenge" since Anfinsen's classic folding experiments in the early 1960s [1]. Major landmarks like CASP (1994) [2], Rosetta (2000s) [3,4,5], AlphaFold 2 (2020) [6,7], and RoseTTAFold (2021) [8] have shaped the field. Boltz-2 joins this lineage as an open-source model that predicts both biomolecular structures and binding affinities.

Deploying Boltz-2 on Vertex AI gives you:
- **Reproducible local development** via Docker  
- **Scalable training & inference** on managed cloud infrastructure  
- **Smooth integration** into protein-folding pipelines  

Whether you work in academia or industry, this setup lets you run Boltz-2 reproducibly on your own machine and at scale in the cloud.


## Notebook Roadmap

---

### Sections
- [Pushing Boltz2 Docker Image to GCP Artifact Registry](#pushing-boltz2-docker-image-to-gcp-artifact-registry)
- [Building and Running the GCP Docker Container](#building-and-running-the-gcp-docker-container)
- [Submitting Job to Vertex AI](#submitting-job-to-vertex-ai)

### Prerequisites

Before you begin, ensure you have the following installed on your local machine:

- Docker: [Install Docker](https://docs.docker.com/get-docker/)
- GCP Credentials: Select the appropriate [authentication method for your use case](https://cloud.google.com/docs/authentication#auth-decision-tree). This guide uses [authentication with user credentials](https://cloud.google.com/docs/authentication/set-up-adc-local-dev-environment#local-user-cred). Make sure your user has the `Vertex AI User` role, which grants access to Vertex AI resources.

See the documentation on [setting up your environment](./local-env.md) for more information and tips.


## Pushing Boltz2 Docker Image to GCP Artifact Registry

---

To build a GCP-compatible Boltz2 Docker image, follow these steps:

1. **Clone the Repository**: Clone the `MLContainerLab` repository to your local machine.

```bash
git clone https://github.com/gabenavarro/MLContainerLab.git
cd MLContainerLab
```

2. **Build the Docker Image**: Use the provided Dockerfile to build the Docker image.

```bash
docker build -f ./assets/build/Dockerfile.boltz2.cu126cp310 -t boltz2:126-310 .
```

3. **Tag the Docker Image**: Add a GCP Artifact Registry tag to the image.

```bash
docker tag boltz2:126-310 ${ARTIFACT_REGISTRY_HOST}/${PROJECT_ID}/${REPOSITORY_NAME}/boltz2:126-310
```

Replace `${ARTIFACT_REGISTRY_HOST}` with your regional Artifact Registry host, `${PROJECT_ID}` with your project ID, and `${REPOSITORY_NAME}` with your repository name, or export them as shell variables first. For example, with the `us-west1-docker.pkg.dev` host:

* `docker tag boltz2:126-310 us-west1-docker.pkg.dev/my-project/my-repository/boltz2:126-310`

4. **Push to Registry**:

```bash
docker push ${ARTIFACT_REGISTRY_HOST}/${PROJECT_ID}/${REPOSITORY_NAME}/boltz2:126-310
```

Make sure Docker is authorized to push images to Artifact Registry. For more information, see the documentation on [setting up your environment](./local-env.md).
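
The image reference used in steps 3 and 4 always has the same parts: registry host, project ID, repository name, and image tag. As a minimal sketch (the helper name `artifact_image_uri` and the example values are illustrative and not part of the repository), the reference can be assembled and sanity-checked like this:

```python
def artifact_image_uri(
    registry_host: str,
    project_id: str,
    repository: str,
    image: str = "boltz2",
    tag: str = "126-310",
) -> str:
    """Join the parts of an Artifact Registry image reference."""
    for part in (registry_host, project_id, repository):
        # Each component must be non-empty and must not contain a slash itself.
        if not part or "/" in part:
            raise ValueError(f"Invalid image URI component: {part!r}")
    return f"{registry_host}/{project_id}/{repository}/{image}:{tag}"


# Example values only; substitute your own host, project, and repository.
print(artifact_image_uri("us-west1-docker.pkg.dev", "my-project", "my-repository"))
```

Building the string in one place makes it harder for the `docker tag` and `docker push` commands to drift apart.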


## Building and Running the GCP Docker Container

---

The next image lets you submit jobs to Google Vertex AI. To keep things simple for computational scientists, it runs as an interactive, detached container that you can connect to from VS Code (or your favorite IDE). Follow the steps below:

1. **Clone the Repository**: Clone the `MLContainerLab` repository to your local machine if you haven't already.

```bash
git clone https://github.com/gabenavarro/MLContainerLab.git
cd MLContainerLab
```

2. **Build the Docker Image**: Use the provided Dockerfile to build the Docker image.

```bash
docker build -f ./assets/build/Dockerfile.gcpvertexiai.cp312 -t vertexai:312 .
```

3. **Run the Docker Container**: Start the container in detached mode on your local machine. This is the recommended way to work during development. To scale up, you submit jobs to Vertex AI from inside this container, as described in the final section.

```bash
# Run the container detached, with the current directory mounted at /workspace
docker run -dt \
   -v "$(pwd):/workspace" \
   --name vertexai \
   vertexai:312
```
> Note: The `-v "$(pwd):/workspace"` option mounts the current directory to `/workspace` in the container, allowing you to access your local files from within the container. The `-dt` flags start the container detached with a pseudo-terminal, which keeps it running so you can attach to it later.

4. **Access the Container with an IDE**: This example uses Visual Studio Code to attach to the container. You can use any IDE of your choice.

```bash
# In a scriptable manner
CONTAINER_NAME=vertexai
FOLDER=/workspace
HEX_CONFIG=$(printf {\"containerName\":\"/$CONTAINER_NAME\"} | od -A n -t x1 | tr -d '[\n\t ]')
code --folder-uri "vscode-remote://attached-container+$HEX_CONFIG$FOLDER"
```
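
The shell snippet above hex-encodes a small JSON object and appends it to a `vscode-remote://` URI. The same string can be built in Python, which may be easier to read and reuse. This is a sketch under the assumption that the container name and folder match the values used above; the helper name `attached_container_uri` is hypothetical:

```python
import json


def attached_container_uri(container_name: str, folder: str = "/workspace") -> str:
    """Build the folder URI that the shell snippet passes to `code --folder-uri`."""
    # Compact separators reproduce the shell string {"containerName":"/name"} exactly.
    config = json.dumps({"containerName": f"/{container_name}"}, separators=(",", ":"))
    # Lowercase hex of the UTF-8 bytes, matching the `od -t x1` output with whitespace removed.
    hex_config = config.encode("utf-8").hex()
    return f"vscode-remote://attached-container+{hex_config}{folder}"


uri = attached_container_uri("vertexai")
print(uri)
```

Pass the printed value to `code --folder-uri` to open the container folder.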


[1]: https://aklectures.com/lecture/structure-of-proteins/anfinsens-experiment-of-protein-folding "Anfinsen's Experiment of Protein Folding - AK Lectures"
[2]: https://en.wikipedia.org/wiki/CASP "CASP - Wikipedia"
[3]: https://docs.rosettacommons.org/docs/latest/meta/Rosetta-Timeline "History of Rosetta"
[4]: https://pmc.ncbi.nlm.nih.gov/articles/PMC7603796 "Macromolecular modeling and design in Rosetta: recent methods ..."
[5]: https://en.wikipedia.org/wiki/Rosetta%40home "Rosetta@home"
[6]: https://www.nature.com/articles/s41586-021-03819-2 "Highly accurate protein structure prediction with AlphaFold - Nature"
[7]: https://www.wired.com/story/deepmind-alphafold-protein-diseases "DeepMind wants to use its AI to cure neglected diseases"
[8]: https://www.bakerlab.org/2021/07/15/accurate-protein-structure-prediction-accessible "Accurate protein structure prediction accessible to all - Baker Lab"
[9]: https://www.lemonde.fr/en/science/article/2024/10/09/nobel-prize-for-chemistry-2024-artificial-intelligence-garners-more-recognition_6728828_10.html "Nobel Prize for Chemistry 2024: Artificial intelligence garners more recognition"



## Submitting Job to Vertex AI

---

Now, let's go through an example of submitting the Boltz-2 Docker container as a managed job.


1. **Authentication**

Start by authenticating your session with user credentials in bash. If you are working in a containerized notebook through VS Code, open the terminal inside the container and run the following commands there.

```bash
# Authenticate credentials
gcloud auth login
# Setup configuration
# Most likely you will want to use option 1 when prompted
gcloud init 
# Lastly, set up ADC authentication
gcloud auth application-default login 
```
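
After the last command, the Application Default Credentials (ADC) are stored in a local file that the Google client libraries look up. The sketch below checks whether such a file is present. It assumes a Linux or macOS environment (such as the container), where the default location is `~/.config/gcloud/application_default_credentials.json`; the helper name `find_adc_file` is hypothetical, and the check only looks for the file, it does not validate the credentials:

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_adc_file(
    env: Mapping[str, str] = os.environ,
    home: Optional[Path] = None,
) -> Optional[Path]:
    """Return the credentials file that ADC would likely pick up, or None."""
    # An explicit GOOGLE_APPLICATION_CREDENTIALS path takes precedence.
    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit and Path(explicit).is_file():
        return Path(explicit)
    # Default location written by `gcloud auth application-default login` on Linux/macOS.
    default = (home or Path.home()) / ".config" / "gcloud" / "application_default_credentials.json"
    return default if default.is_file() else None


adc = find_adc_file()
print(f"ADC file: {adc}" if adc else "No ADC file found; run `gcloud auth application-default login`.")
```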

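2. **Build Files**

Next, set the project configuration and define the helper functions below. They read a file of molecules to bind against your protein of interest, write one Boltz-2 YAML input per molecule, upload the inputs to Cloud Storage, and submit one Vertex AI custom job per input.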

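```python
# Modify to fit your project
PROJECT = "YOUR-PROJECT-NAME"
REGION = "YOUR-REGION"
# Fills the repository slot of the image URI: <host>/<project>/<repository>/boltz2:126-310
ARTIFACT_PROJECT = "ARTIFACT-REGISTRY-PROJECT"
# Artifact Registry host, for example "us-west1-docker.pkg.dev"
REGISTRY_REGION = "YOUR-ARTIFACT-REGION"
BUCKET = "YOUR-BUCKET-NAME"
# Vertex AI machine and accelerator types, for example "a2-highgpu-1g" and "NVIDIA_TESLA_A100"
MACHINE_TYPE = ""
ACCELERATOR_TYPE = ""
IMAGE = f"{REGISTRY_REGION}/{PROJECT}/{ARTIFACT_PROJECT}/boltz2:126-310"
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
# Machine types that should be submitted with the FLEX_START scheduling strategy
FLEX_START_MACHINES = []
```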
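
Before submitting anything, it can help to confirm that no setting was left empty or still holds a placeholder. A minimal sketch, assuming the placeholder spellings used in the configuration above (the helper `unfilled_settings` and the example values are hypothetical):

```python
from typing import Dict, List


def unfilled_settings(settings: Dict[str, str]) -> List[str]:
    """Return the names of settings that are empty or still hold a placeholder value."""
    placeholder_prefixes = ("YOUR-", "ARTIFACT-REGISTRY-")
    return sorted(
        name
        for name, value in settings.items()
        if not value or value.startswith(placeholder_prefixes)
    )


# Example values only; in the notebook you would pass the real configuration.
example = {
    "PROJECT": "my-project",
    "REGION": "us-west1",
    "BUCKET": "YOUR-BUCKET-NAME",
    "MACHINE_TYPE": "",
}
print(unfilled_settings(example))
```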

```python
import hashlib
import os
import subprocess
from datetime import datetime
from time import sleep
from typing import List

from google.cloud import aiplatform, storage
from google.cloud.aiplatform_v1.types import custom_job as gca_custom_job_compat
from pandas import read_csv
from rdkit import Chem
```

```python
def check_gcs_file_exists(bucket_name: str, blob_name: str) -> bool:
    """Checks if a file exists in a GCS bucket.

    Args:
        bucket_name (str): The name of the GCS bucket.
        blob_name (str): The name of the file (object) within the bucket.

    Returns:
        bool: True if the file exists, False otherwise.
    """
    storage_client = None
    try:
        storage_client = storage.Client()
        bucket = storage_client.bucket(bucket_name)
        blob = bucket.blob(blob_name)
        blob_exists = blob.exists()
    except Exception:
        # Any lookup failure (missing bucket, permissions, no credentials) counts as "not found"
        blob_exists = False
    finally:
        # The client is only closed if it was created successfully
        if storage_client is not None:
            storage_client.close()

    return blob_exists
```


```python
def molecules_for_boltz_from_file(
    filename: str
) -> List[str]:
    '''Molecules
    ---

    Reads every molecule from an input file and returns them as SMILES strings. Accepts
        * `.csv` : files with a `smiles` column (matched case-insensitively).
        * `.sdf`

    ### Arg
        * filename (str): path to file with encoded molecules.

    ### Return
        * smiles (list): list of molecules in SMILES format, required for Boltz-2.
    '''
    # Make sure it is a supported file type
    assert filename.endswith((".csv", ".sdf")), "Not .csv or .sdf file"
    if filename.endswith(".sdf"):
        try:
            # SDMolSupplier yields None for records RDKit cannot parse, so skip those
            smiles = [Chem.MolToSmiles(mol) for mol in Chem.SDMolSupplier(filename) if mol is not None]
        except Exception as err:
            raise ValueError(f"Error opening `.sdf` file: {filename}") from err
    else:
        # For csv input, make sure it has a `smiles` column
        df = read_csv(filename)
        df.columns = [col.upper() for col in df.columns]
        assert "SMILES" in df.columns, "No SMILES column found in csv"
        smiles = df["SMILES"].to_list()
    return smiles
```
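
To illustrate the case-insensitive `smiles` column lookup without installing pandas or RDKit, here is a dependency-free sketch using only the standard library. It is not the loader used in this guide; the helper name `smiles_from_csv` and the sample rows are made up for the example, and unlike the pandas version it skips rows with an empty SMILES cell:

```python
import csv
import io
from typing import List, TextIO


def smiles_from_csv(handle: TextIO) -> List[str]:
    """Read the SMILES column (matched case-insensitively) from an open CSV file."""
    reader = csv.DictReader(handle)
    # Map upper-cased header names back to their original spelling.
    columns = {name.strip().upper(): name for name in (reader.fieldnames or [])}
    if "SMILES" not in columns:
        raise ValueError("No SMILES column found in csv")
    key = columns["SMILES"]
    # Skip rows where the SMILES cell is missing or blank.
    return [row[key].strip() for row in reader if row[key] and row[key].strip()]


# In-memory stand-in for a small input file.
example_csv = io.StringIO("name,Smiles\naspirin,CC(=O)OC1=CC=CC=C1C(=O)O\nethanol,CCO\nblank,\n")
print(smiles_from_csv(example_csv))
```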


```python
def build_yaml_file(
    molecules: List[str],
    protein: str,
    experiment_name: str | None = None,
    cache: str = "../datasets/boltz2-yaml"
) -> List[str]:
    """Write one protein-ligand affinity YAML per molecule, upload them, and return their /gcs paths."""

    # Simple handling
    assert molecules, "No molecules provided"
    assert protein and protein.isupper() and not (set(protein) - VALID_AMINO_ACIDS), \
        "Protein sequence must not be empty, be all uppercase, and contain only standard amino acid characters."
    experiment_name = experiment_name if experiment_name is not None else datetime.today().strftime("%d-%m-%Y--%H-%M-%S")
    os.makedirs(f"{cache}/{experiment_name}", exist_ok=True)

    # Write one file per molecule, named by the SHA-256 hash of its SMILES string
    yaml_jobs = []
    for mol in molecules:
        yaml_str = f"""# Boltz-2 input: protein A with ligand B, affinity predicted for B
version: 1
sequences:
    - protein:
        id: A
        sequence: {protein}
    - ligand:
        id: B
        smiles: '{mol}'
properties:
    - affinity:
        binder: B
"""
        hash_name = hashlib.sha256(mol.encode('utf-8')).hexdigest()
        with open(f"{cache}/{experiment_name}/{hash_name}.yaml", "w") as f:
            f.write(yaml_str)
        # Path as seen inside the Vertex AI job, matching the upload destination below
        yaml_jobs.append(f"/gcs/{BUCKET}/boltz2/jobs/{experiment_name}/{hash_name}.yaml")

    # Move them to GCP, will only work if authenticated
    cmd = ["gsutil", "-m", "cp", "-r", f"{cache}/{experiment_name}", f"gs://{BUCKET}/boltz2/jobs"]
    subprocess.run(cmd, check=True)
    return yaml_jobs
```
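
Each job file lives in three places: the local cache, the bucket (as a `gs://` URI), and the `/gcs/` path under which Vertex AI custom jobs see the bucket through Cloud Storage FUSE. The sketch below shows how the three names relate for a single SMILES string. The helper name `job_paths`, the bucket name, and the cache directory are hypothetical values chosen for the example:

```python
import hashlib
from typing import Dict


def job_paths(smiles: str, experiment_name: str, bucket: str, cache: str) -> Dict[str, str]:
    """Derive the local, gs://, and /gcs/ locations of one ligand's job file."""
    # The file name is the SHA-256 hex digest of the SMILES string.
    hash_name = hashlib.sha256(smiles.encode("utf-8")).hexdigest()
    relative = f"{experiment_name}/{hash_name}.yaml"
    return {
        "local": f"{cache}/{relative}",
        "gcs_uri": f"gs://{bucket}/boltz2/jobs/{relative}",
        "fuse_path": f"/gcs/{bucket}/boltz2/jobs/{relative}",
    }


# Example values only.
paths = job_paths("CCO", experiment_name="demo", bucket="my-bucket", cache="./yaml-cache")
for label, path in paths.items():
    print(f"{label}: {path}")
```

Hashing the SMILES string gives every molecule a stable, filesystem-safe file name, so rerunning the same molecule in the same experiment maps to the same result location.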


```python
def build_protein_protein_yaml_file(
    protein_a: str,
    protein_b: str,
    experiment_name: str | None = None,
    cache: str = "../datasets/boltz2-yaml"
) -> List[str]:
    """Write a single protein-protein YAML input, upload it, and return its /gcs path in a list."""

    # Simple handling
    assert protein_a and protein_a.isupper() and not (set(protein_a) - VALID_AMINO_ACIDS), \
        "Protein A sequence must not be empty, be all uppercase, and contain only standard amino acid characters."
    assert protein_b and protein_b.isupper() and not (set(protein_b) - VALID_AMINO_ACIDS), \
        "Protein B sequence must not be empty, be all uppercase, and contain only standard amino acid characters."
    experiment_name = experiment_name if experiment_name is not None else datetime.today().strftime("%d-%m-%Y--%H-%M-%S")
    os.makedirs(f"{cache}/{experiment_name}", exist_ok=True)

    # Write file
    yaml_str = f"""# Boltz-2 input: complex of protein A and protein B
version: 1
sequences:
    - protein:
        id: A
        sequence: {protein_a}
    - protein:
        id: B
        sequence: {protein_b}
"""
    # The file name is the SHA-256 hash of the two concatenated sequences
    hash_name = hashlib.sha256((protein_a + protein_b).encode('utf-8')).hexdigest()
    with open(f"{cache}/{experiment_name}/{hash_name}.yaml", "w") as f:
        f.write(yaml_str)
    yaml_jobs = [f"/gcs/{BUCKET}/boltz2/jobs/{experiment_name}/{hash_name}.yaml"]

    # Move them to GCP, will only work if authenticated
    cmd = ["gsutil", "-m", "cp", "-r", f"{cache}/{experiment_name}", f"gs://{BUCKET}/boltz2/jobs"]
    subprocess.run(cmd, check=True)
    return yaml_jobs
```
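
Both builders reject sequences that contain anything other than the 20 standard one-letter amino acid codes. When a long sequence fails that check, it helps to see which characters caused the failure. A small sketch of such a diagnostic follows; the helper name `invalid_residues` and the example sequences are hypothetical:

```python
from typing import List

VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def invalid_residues(sequence: str) -> List[str]:
    """Return the sorted characters that are not standard one-letter amino acid codes."""
    return sorted(set(sequence) - VALID_AMINO_ACIDS)


# Made-up sequences: the first is valid, the second has lowercase and non-standard letters.
print(invalid_residues("MKTAYIAKQR"))
print(invalid_residues("MKtAYXB"))
```

Because every valid code is an uppercase letter, a sequence that passes this check is uppercase by construction, and lowercase input is reported character by character.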


```python
def submit_vertex_custom_job(yaml_jobs: List[str]):
    """Submit one Vertex AI custom job per YAML input, skipping inputs that already have a result."""

    aiplatform.init(
        project=PROJECT,
        location=REGION
    )

    for yaml_path in yaml_jobs:
        hash_name = yaml_path.split("/")[-1].replace(".yaml", "").split("_")[-1]
        experiment_name = yaml_path.split("/")[-2]

        # Skip inputs whose first predicted structure is already in the bucket
        if check_gcs_file_exists(
            BUCKET,
            f"boltz2/results/{experiment_name}/boltz_results_{hash_name}/predictions/{hash_name}/{hash_name}_model_0.cif"
        ):
            continue

        cmd = [
            "boltz",
            "predict", yaml_path,
            "--recycling_steps", "10",
            "--diffusion_samples", "25",
            "--accelerator", "gpu",
            "--out_dir", f"/gcs/{BUCKET}/boltz2/results/{experiment_name}",
            "--cache", f"/gcs/{BUCKET}/boltz2/cache",
            "--use_msa_server"
        ]

        worker_pool_specs = [
            {
                "replica_count": 1,
                "machine_spec": {
                    "machine_type": MACHINE_TYPE,
                    "accelerator_count": 1,
                    "accelerator_type": ACCELERATOR_TYPE,
                },
                "container_spec": {
                    "image_uri": IMAGE,
                    "command": cmd,
                },
            }
        ]

        # Use a separate name so the loop variable is not shadowed
        custom_job = aiplatform.CustomJob(
            display_name=f"Boltz-2 {experiment_name}",
            worker_pool_specs=worker_pool_specs,
            staging_bucket=f"gs://{BUCKET}/vertex_staging",
        )

        if MACHINE_TYPE in FLEX_START_MACHINES:
            custom_job.submit(
                max_wait_duration=7200,
                scheduling_strategy=gca_custom_job_compat.Scheduling.Strategy.FLEX_START
            )
        else:
            custom_job.submit()

        # Brief pause between submissions
        sleep(5)
```
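
The submission logic depends on two pieces of plain string and dictionary handling: deriving the expected result file from a job path, and assembling the worker pool specification. The sketch below isolates both so they can be inspected without any Google Cloud dependency. The helper names, the bucket, the job path, and the machine and accelerator values are all example assumptions, not values prescribed by this guide:

```python
from typing import Dict, List, Tuple


def parse_job_path(job_path: str) -> Tuple[str, str]:
    """Split '.../jobs/<experiment>/<hash>.yaml' into (experiment, hash)."""
    parts = job_path.split("/")
    return parts[-2], parts[-1].replace(".yaml", "")


def expected_result_blob(job_path: str) -> str:
    """Name of the blob whose presence marks the job as already completed."""
    experiment, hash_name = parse_job_path(job_path)
    return (
        f"boltz2/results/{experiment}/boltz_results_{hash_name}"
        f"/predictions/{hash_name}/{hash_name}_model_0.cif"
    )


def build_worker_pool_specs(
    image: str, command: List[str], machine_type: str, accelerator_type: str
) -> List[Dict]:
    """Single-replica, single-accelerator worker pool as a plain dictionary."""
    return [
        {
            "replica_count": 1,
            "machine_spec": {
                "machine_type": machine_type,
                "accelerator_count": 1,
                "accelerator_type": accelerator_type,
            },
            "container_spec": {"image_uri": image, "command": command},
        }
    ]


# Example values only.
job_path = "/gcs/my-bucket/boltz2/jobs/demo/abc123.yaml"
specs = build_worker_pool_specs(
    image="us-west1-docker.pkg.dev/my-project/my-repository/boltz2:126-310",
    command=["boltz", "predict", job_path],
    machine_type="a2-highgpu-1g",
    accelerator_type="NVIDIA_TESLA_A100",
)
print(parse_job_path(job_path))
print(expected_result_blob(job_path))
print(specs[0]["machine_spec"])
```

Printing the specification before submitting is a cheap way to catch an empty machine or accelerator type.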



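```python
# Path to a `.csv` file (with a `smiles` column) or an `.sdf` file
molecule_path = "my-mol-path"
molecules = molecules_for_boltz_from_file(molecule_path)
# Uppercase one-letter amino acid sequence of the target protein
protein = "YOUR-AA-SEQUENCE"

vertex_jobs = build_yaml_file(
    molecules=molecules,
    protein=protein
)

submit_vertex_custom_job(vertex_jobs)
```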