# DrugFlow on Vertex AI 🚀

---

This guide walks you through **deploying DrugFlow** on Google Cloud’s Vertex AI. We’ll cover:

1. **Quick introduction to DrugFlow** and its scientific background  
2. **Local setup** using Docker  
3. **Pushing the Docker image** to Google Artifact Registry  
4. **Spinning up and using** the container locally  
5. **Submitting a Vertex AI job** for scalable execution in the cloud  

Check out the interactive [DrugFlow notebook](./drugflow.ipynb) for hands-on examples.

---

## Introduction to DrugFlow & the Rise of Generative AI in Drug Discovery

Drug discovery has evolved dramatically—from the chance discovery of **penicillin** to today's data-driven molecular design. Recent advances in **generative AI** are now enabling *inverse drug design*, where models generate molecules tailored to specific protein targets.

### 🧬 Key Milestones in Drug Discovery

| Era             | Highlight                                              | Significance                                                                 |
|-----------------|--------------------------------------------------------|-------------------------------------------------------------------------------|
| 1900s–1970s     | Natural product discovery (e.g., penicillin)          | Serendipitous findings shaped early pharmacology                             |
| 1980s–1990s     | Rational design via docking & crystallography          | Led to HIV-protease and ACE inhibitor drugs                                 |
| 2000s           | High-throughput screening; QSAR                       | Enabled large-scale similarity-based screening                              |
| 2010s           | ML enters bench (DeepChem, ChemBERTa)                 | Models began predicting molecular properties                                 |
| 2020s           | Emergence of generative diffusion models              | From image generation (DDPM) and protein design (RFdiffusion) to drug design with DrugFlow (ICLR '25) [[1]](#refs-df) |

### 🔬 Why Diffusion-Based Generative Models Matter

Generative models shift drug discovery from screening to designing:

- **SMILES RNNs / VAEs** → early molecule creation, but often invalid or repetitive  
- **GNN-VAEs** (e.g., JT-VAE, GraphAF) → better molecular validity  
- **Diffusion models** (e.g., DDPM, RFdiffusion [ProteinDesign][], DiffSBDD [DiffSBDDGitHub][]) → scaffold-aware, pocket-conditioned generation with improved structural realism  

**DrugFlow** introduces a novel *flow-matching* generative model that jointly learns molecular and protein conformation distributions, providing atom-level uncertainty estimates and enabling preference-based sampling across modalities [[1]](#refs-df).

[ProteinDesign]: https://www.nature.com/articles/s41586-023-06415-8 "De novo protein design with RFdiffusion (Nature 2023)"  
[DiffSBDDGitHub]: https://github.com/arneschneuing/DiffSBDD "DiffSBDD: Equivariant diffusion for structure-based drug design"  
<a id="refs-df"></a>[1] [Multi-domain Distribution Learning for De Novo Drug Design (DrugFlow, ICLR 2025)](https://openreview.net/forum?id=g3VCIM94ke)

---

## Notebook Roadmap

---

### Sections

- [Building & Pushing the Docker Image to Artifact Registry](#pushing-drugflow-image-to-gcp-artifact-registry)  
- [Building & Running the Container Locally](#building--running-the-docker-container-locally)  
- [Submitting a Vertex AI Custom Job](#submitting-job-to-vertex-ai)

### Prerequisites

Make sure you have:

- [Docker](https://docs.docker.com/get-docker/)  
- Google Cloud credentials [configured locally](https://cloud.google.com/docs/authentication#set_up)  
- A user with the `Vertex AI User` role in your project  
- NVIDIA Container Toolkit (if using GPU)  
- (Optional) [VS Code](https://code.visualstudio.com/) with the Dev Containers extension (formerly Remote - Containers)

See [local-env.md](./local-env.md) for more setup tips.

---

## Pushing DrugFlow Image to GCP Artifact Registry

1. **Clone the repository**

    ```bash
    git clone https://github.com/gabenavarro/MLContainerLab.git
    cd MLContainerLab
    ```

2. **Build the Docker image**

    ```bash
    docker build -f ./assets/build/Dockerfile.drugflow.cu121cp311 \
      -t drugflow:121-311 .
    ```
    > On Apple silicon Macs (M1/M2 and later), add `--platform linux/amd64` to the build command.

3. **Tag the image for GCP Artifact Registry**

    ```bash
    docker tag drugflow:121-311 \
      ${DEFAULT_ARTIFACT_REGISTRY}/${PROJECT_ID}/${REPO}/drugflow:121-311
    ```

4. **Push to Artifact Registry**

    ```bash
    docker push ${DEFAULT_ARTIFACT_REGISTRY}/${PROJECT_ID}/${REPO}/drugflow:121-311
    ```

> Make sure Docker is authenticated to push to Artifact Registry. See [local-env.md](./local-env.md) for details.
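
The tag used in steps 3 and 4 follows Artifact Registry's Docker naming scheme, `LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG`. As a minimal sketch, assuming `${DEFAULT_ARTIFACT_REGISTRY}` holds the registry host (for example `us-central1-docker.pkg.dev`), the helper below assembles such a URI and rejects values that look unset. The function name and the example project and repository values are hypothetical.

```python
def artifact_image_uri(location: str, project: str, repo: str,
                       image: str = "drugflow", tag: str = "121-311") -> str:
    """Build an Artifact Registry Docker image URI (hypothetical helper)."""
    parts = {"location": location, "project": project, "repo": repo}
    for name, value in parts.items():
        # An empty value or a leftover "$VAR" usually means a shell variable was never set.
        if not value or "/" in value or "$" in value:
            raise ValueError(f"{name} looks unset or malformed: {value!r}")
    host = f"{location}-docker.pkg.dev"
    return f"{host}/{project}/{repo}/{image}:{tag}"


# Example values are placeholders, not real resources.
uri = artifact_image_uri("us-central1", "my-project", "my-repo")
print(uri)  # us-central1-docker.pkg.dev/my-project/my-repo/drugflow:121-311
```

Checking the pieces before tagging avoids pushing to a malformed name such as `//drugflow:121-311` when an environment variable is empty.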

---

## Building & Running the Docker Container Locally

1. **Clone the repo** (if not already done)

    ```bash
    git clone https://github.com/gabenavarro/MLContainerLab.git && cd MLContainerLab
    ```

2. **Build the GCP Vertex AI image**

    ```bash
    docker build -f ./assets/build/Dockerfile.gcpvertexiai.cp312 \
      -t vertexai:312 .
    ```

3. **Start the container in the background**

    ```bash
    docker run -dt \
      --gpus all \
      -v "$(pwd):/workspace" \
      --name vertexai \
      vertexai:312
    ```

4. **(Optional) Attach VS Code**

    ```bash
    CONTAINER=vertexai; FOLDER=/workspace
    HEX=$(printf "{\"containerName\":\"/$CONTAINER\"}" | od -A n -t x1 | tr -d ' \n')
    code --folder-uri "vscode-remote://attached-container+$HEX$FOLDER"
    ```

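> The `-v` flag mounts the current directory at `/workspace`, and `--gpus all` exposes the host GPUs to the container. This command does not pass in GCP credentials; authenticate from a terminal inside the container, as described in the next section.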
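
The VS Code command in step 4 works by hex-encoding a small JSON document that names the container and embedding it in a `vscode-remote://` URI. As an illustration, the same URI can be built in Python; the function name is hypothetical, and the URI layout is taken directly from the shell snippet above.

```python
import json


def attached_container_uri(container: str, folder: str = "/workspace") -> str:
    """Build the vscode-remote URI for a running container (sketch)."""
    # Compact separators reproduce the exact JSON that printf emits in the shell version.
    payload = json.dumps({"containerName": f"/{container}"}, separators=(",", ":"))
    hex_payload = payload.encode("utf-8").hex()
    return f"vscode-remote://attached-container+{hex_payload}{folder}"


uri = attached_container_uri("vertexai")
print(uri)
```

The result can be passed to `code --folder-uri`, just like the value assembled in bash.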

---


## Submitting Job to Vertex AI

Now, let's walk through an example of submitting the Docker container as a managed job.


1. **Authentication**

Start by authenticating your session with user credentials. If you are running the notebook inside a container through VS Code, open a terminal in that container and run the commands there.

```bash
# Authenticate with your user credentials
gcloud auth login
# Set up the gcloud configuration
# Most likely you will want to use option 1 when prompted
gcloud init
# Lastly, set up Application Default Credentials (ADC)
gcloud auth application-default login
```
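
`gcloud auth application-default login` writes Application Default Credentials to a well-known file, and Google client libraries also honor the `GOOGLE_APPLICATION_CREDENTIALS` environment variable. The sketch below checks from Python whether either is present before anything is submitted. It assumes the default Linux/macOS location `~/.config/gcloud/application_default_credentials.json` and ignores overrides such as `CLOUDSDK_CONFIG`; the function name is hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_adc_file(env: Mapping[str, str] = os.environ,
                  home: Optional[Path] = None) -> Optional[Path]:
    """Return the path of a discoverable ADC file, or None if there is none."""
    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit and Path(explicit).is_file():
        return Path(explicit)
    # Assumed default location on Linux/macOS; Windows keeps it under %APPDATA% instead.
    default = (home or Path.home()) / ".config" / "gcloud" / "application_default_credentials.json"
    return default if default.is_file() else None


adc = find_adc_file()
print(f"ADC found at {adc}" if adc else "No ADC file found")
```

This only confirms that a credentials file exists; it does not verify that the credentials are valid or carry the required roles.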

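2. **Build Files**

Set the project configuration below, then define the helper functions. They upload the protein structure (PDB) and the reference ligand (SDF) to Cloud Storage, build one job description per sampling step count, and submit each job to Vertex AI.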

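```python
# Modify to fit your project
PROJECT = "YOUR-PROJECT-NAME"
REGION = "YOUR-REGION"
ARTIFACT_PROJECT = "ARTIFACT-REGISTRY-PROJECT"
REGISTRY_REGION = "YOUR-ARTIFACT-REGION"
BUCKET = "YOUR-BUCKET-NAME"
MACHINE_TYPE = ""  # Vertex AI machine type, e.g. "g2-standard-8"
ACCELERATOR_TYPE = ""  # matching accelerator, e.g. "NVIDIA_L4"
# Must match the tag that was pushed to Artifact Registry
IMAGE = f"{REGISTRY_REGION}/{PROJECT}/{ARTIFACT_PROJECT}/drugflow:121-311"
FLEX_START_MACHINES = []  # machine types to submit with the FLEX_START strategy
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
```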
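
Empty or placeholder settings tend to surface as opaque API errors at submission time, so it may help to check them first. A minimal sketch, assuming the placeholder conventions shown above (values that are empty, start with `YOUR-`, or equal `ARTIFACT-REGISTRY-PROJECT`); the helper name `missing_settings` is hypothetical.

```python
from typing import Dict, List


def missing_settings(settings: Dict[str, str]) -> List[str]:
    """Return the names of settings that are empty or still hold a placeholder."""
    placeholders = ("YOUR-", "ARTIFACT-REGISTRY-PROJECT")
    return [
        name for name, value in settings.items()
        if not value or value.startswith(placeholders)
    ]


# Example with two unfilled values; in the notebook you would pass the real constants.
example = {"PROJECT": "my-project", "REGION": "YOUR-REGION", "MACHINE_TYPE": ""}
print(missing_settings(example))  # ['REGION', 'MACHINE_TYPE']
```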

```python
from datetime import datetime
import hashlib
import os
import subprocess
from time import sleep
from typing import Any, Dict, List

from Bio import SeqIO
from google.cloud import aiplatform, storage
from google.cloud.aiplatform_v1.types import custom_job as gca_custom_job_compat
from rdkit import Chem


def check_gcs_file_exists(bucket_name: str, blob_name: str) -> bool:
    """Check whether an object exists in a GCS bucket.

    Args:
        bucket_name (str): The name of the GCS bucket.
        blob_name (str): The object name within the bucket (no leading slash).

    Returns:
        bool: True if the object exists, False otherwise (including on errors).
    """
    try:
        storage_client = storage.Client()
    except Exception:
        # No client could be created (e.g. missing credentials)
        return False
    try:
        return storage_client.bucket(bucket_name).blob(blob_name).exists()
    except Exception:
        return False
    finally:
        storage_client.close()


def build_drugflow_jobs(
    pdb_file: str,
    sdf_file: str,
    experiment_name: str | None = None,
    samples: int = 1000,
    steps: List[int] | None = None,
) -> List[Dict[str, Any]]:
    """Upload the input files to GCS and build one job description per step count."""
    if experiment_name is None:
        experiment_name = datetime.today().strftime("%d-%m-%Y--%H-%M-%S")

    # Paths of the pdb and sdf files as seen from inside the Vertex AI job (/gcs mount)
    job_prefix = f"{BUCKET}/drugflow/jobs/{experiment_name}"
    gcs_pdb = f"/gcs/{job_prefix}/{os.path.basename(pdb_file)}"
    gcs_sdf = f"/gcs/{job_prefix}/{os.path.basename(sdf_file)}"

    # Upload the local files; this only works in an authenticated session.
    # The trailing slash makes gsutil treat the destination as a folder.
    for local_file in (pdb_file, sdf_file):
        subprocess.run(["gsutil", "cp", local_file, f"gs://{job_prefix}/"], check=True)

    # Get the amino-acid sequence for the hash
    pdb_seq = None
    try:
        seqs = [
            record.seq
            for record in SeqIO.parse(pdb_file, "pdb-seqres")
            if len(record.seq) > 3 and not (set(str(record.seq)) - VALID_AMINO_ACIDS)
        ]
        assert len(seqs) == 1, f"Provided pdb file {pdb_file} does not have a single sequence, please review"
        pdb_seq = str(seqs[0])
    except Exception as err:
        print(f"Problem reading pdb file: {pdb_file} ({err})")

    # Get the ligand SMILES for the hash (first molecule in the SDF file)
    mol_smiles = None
    try:
        mol = Chem.SDMolSupplier(sdf_file)[0]
        mol_smiles = str(Chem.MolToSmiles(mol))
    except Exception as err:
        print(f"Problem reading sdf file: {sdf_file} ({err})")

    # Make sure the provided files meet the minimum specification
    if mol_smiles is None or pdb_seq is None:
        raise ValueError("Provided files are not acceptable")

    # SMILES is case-sensitive (aromatic "c" vs aliphatic "C"), so only the
    # protein sequence is upper-cased before hashing.
    hash_name = hashlib.sha256((pdb_seq.upper() + mol_smiles).encode("utf-8")).hexdigest()

    # Create one job per number of sampling steps
    if steps is None:
        steps = [5, 10, 20, 40, 80]

    return [
        {
            "hash_name": hash_name,
            "pdb_file": gcs_pdb,
            "sdf_file": gcs_sdf,
            "experiment_name": experiment_name,
            "n_steps": step,
            "n_samples": samples,
        }
        for step in steps
    ]


def submit_vertex_custom_job(yaml_jobs: List[Dict[str, Any]]) -> None:
    """Submit one Vertex AI custom job per job description, skipping finished ones."""
    aiplatform.init(project=PROJECT, location=REGION)

    for job in yaml_jobs:
        pdb_file = job.get("pdb_file", None)
        sdf_file = job.get("sdf_file", None)
        experiment_name = job.get("experiment_name", None)
        n_steps = job.get("n_steps", 10)
        n_samples = job.get("n_samples", 1000)
        hash_name = job.get("hash_name", None)

        results_prefix = f"drugflow/results/{experiment_name}"
        run_name = f"{hash_name}-{n_steps}steps-{n_samples}samples"

        # Skip jobs whose output file already exists in the bucket
        if check_gcs_file_exists(BUCKET, f"{results_prefix}/{run_name}.sdf"):
            continue

        CMD = [
            "python", "/workspace/src/generate.py",
            "--protein", f"{pdb_file}",
            "--ref_ligand", f"{sdf_file}",
            "--checkpoint", "/models/drugflow/drugflow.ckpt",
            "--output", f"/gcs/{BUCKET}/{results_prefix}/{run_name}.sdf",
            "--metrics_output", f"/gcs/{BUCKET}/{results_prefix}/{run_name}-metrics.csv",
            "--n_steps", f"{n_steps}",
            "--n_samples", f"{n_samples}",
            "--reduce", "/miniconda/envs/drugflow/bin/reduce",
        ]

        WORKER_POOL_SPECS = [
            {
                "replica_count": 1,
                "machine_spec": {
                    "machine_type": MACHINE_TYPE,
                    "accelerator_count": 1,
                    "accelerator_type": ACCELERATOR_TYPE,
                },
                "container_spec": {
                    "image_uri": IMAGE,
                    "command": CMD,
                },
            }
        ]

        custom_job = aiplatform.CustomJob(
            display_name=f"DrugFlow {experiment_name}",
            worker_pool_specs=WORKER_POOL_SPECS,
            staging_bucket=f"gs://{BUCKET}/vertex_staging",
        )

        if MACHINE_TYPE in FLEX_START_MACHINES:
            # Wait up to 2 hours for capacity before the job is cancelled
            custom_job.submit(
                max_wait_duration=7200,
                scheduling_strategy=gca_custom_job_compat.Scheduling.Strategy.FLEX_START,
            )
        else:
            custom_job.submit()

        # Short pause between submissions
        sleep(5)
```


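3. **Submit the jobs**

Pass the local paths of the protein PDB file and the reference ligand SDF file, then submit the resulting jobs.

```python
vertex_jobs = build_drugflow_jobs(
    "",  # local path to the protein .pdb file
    "",  # local path to the reference ligand .sdf file
)

submit_vertex_custom_job(vertex_jobs)
```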
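
Each job writes its samples to `<hash>-<n_steps>steps-<n_samples>samples.sdf` (plus a `-metrics.csv` companion) under `drugflow/results/<experiment_name>/`. When comparing a sweep over step counts, a small parser can recover the settings from the file names. This is a sketch that assumes the hash is the 64-character SHA-256 hex digest computed when the jobs are built; the function name is hypothetical.

```python
import re
from typing import Dict, Optional

RESULT_NAME = re.compile(
    r"^(?P<hash>[0-9a-f]{64})-(?P<n_steps>\d+)steps-(?P<n_samples>\d+)samples"
    r"(?P<metrics>-metrics)?\.(?:sdf|csv)$"
)


def parse_result_name(filename: str) -> Optional[Dict[str, object]]:
    """Extract hash, step count, and sample count from a result file name."""
    match = RESULT_NAME.match(filename)
    if match is None:
        return None
    return {
        "hash": match["hash"],
        "n_steps": int(match["n_steps"]),
        "n_samples": int(match["n_samples"]),
        "is_metrics": match["metrics"] is not None,
    }


# A made-up digest of 64 zeros stands in for a real hash.
info = parse_result_name("0" * 64 + "-20steps-1000samples.sdf")
print(info["n_steps"], info["n_samples"], info["is_metrics"])  # 20 1000 False
```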