# DrugFlow with Docker for Local Development and Cloud Deployment

---

This documentation provides a guide on DrugFlow implemented in Python, designed for both local development and cloud deployment using Docker. It covers the following topics:

1. **Introduction to DrugFlow**: Overview of DrugFlow and its applications.
2. **Setting Up the Development Environment**: Step-by-step instructions for setting up a local development environment using Docker.
3. **Building and Running the Docker Container**: Instructions for building the Docker image and running the container.
4. **Deploying to the Cloud**: Guidelines for deploying DrugFlow to a cloud platform using Docker.
5. **Best Practices**: Tips and best practices for working with DrugFlow and Docker.



## Introduction to DrugFlow and the Evolution of AI in Drug Discovery

---

The search for new medicines has always been a race against time, complexity, and cost. From the chance discovery of **penicillin** in 1928 to today's **AI-generated molecules**, drug discovery has evolved through multiple revolutions—each catalyzed by breakthroughs in biology, chemistry, and computation.

In recent years, **generative AI** has emerged not just as a tool for prediction, but as a creative partner—capable of designing novel drug-like molecules tailored to specific protein targets. At the forefront of this revolution is **DrugFlow**, an AI framework that integrates **diffusion models** with protein structural data to unlock inverse drug design at unprecedented scale and precision.

### 🧬 Historical Milestones in Drug Discovery

| Era             | Milestone                                       | Impact                                                                                                                                                          |
| --------------- | ----------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **1900s–1970s** | **Serendipity & Pharmacognosy**                 | Drugs like aspirin, morphine, and penicillin were discovered empirically from natural sources. SAR (Structure–Activity Relationship) understanding was minimal. |
| **1980s–1990s** | **Rational Drug Design**                        | Enabled by advances in X-ray crystallography, molecular docking, and enzyme kinetics. Notable examples: HIV protease inhibitors, ACE inhibitors.                |
| **2000s**       | **High-Throughput Screening & Cheminformatics** | Millions of compounds screened using robotic systems. Virtual screening and QSAR modeling matured.                                                              |
| **2010s**       | **Machine Learning Enters**                     | DeepChem, MoleculeNet, and ChemBERTa showed that machine learning could outperform traditional QSAR in property prediction.                                     |
| **2020s**       | **Generative AI & Diffusion Models**            | Inspired by AlphaFold2 and RFdiffusion, AI began to create drug candidates directly from protein structures.                                                    |

### 🔬 The Rise of Generative Models in Drug Design

Generative models have transformed molecular discovery from a **screening problem** into an **inverse design challenge**: instead of searching a vast database, we train models to propose molecules that already fit our criteria.

#### Early Architectures

* **SMILES-based RNNs / VAEs**: Modeled molecules as text strings. While effective, they often generated invalid molecules or ones lacking novelty.
* **Graph Neural Networks (GNNs)**: Treated molecules as graphs, capturing chemical connectivity. JT-VAE and GraphAF offered better scaffold control and validity.

#### Diffusion Models: The Game Changer

Diffusion models—originally developed for image synthesis (e.g., **DDPM**) and later adapted to protein backbone generation (e.g., **RFdiffusion**)—have recently demonstrated **state-of-the-art performance in molecule generation**. They iteratively "denoise" random noise into structured, valid molecules while conditioning on protein features, enabling:

* Precise scaffold placement
* Flexible ligand generation
* Incorporation of 3D geometric and spatial constraints

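As a rough illustration of the iterative denoising described above, the toy sketch below runs a DDPM-style reverse loop in plain Python. It is not DrugFlow's sampler: the trained network is replaced by an oracle that already knows a single hypothetical target point, and the function name, schedule endpoints, and step count are all assumptions chosen for readability.

```python
import math
import random

def toy_reverse_diffusion(target, n_steps=50, seed=0):
    """Denoise Gaussian noise toward `target` with DDPM-style ancestral sampling."""
    rng = random.Random(seed)
    # Linear noise schedule; the endpoints are illustrative assumptions.
    betas = [1e-4 + (0.2 - 1e-4) * i / (n_steps - 1) for i in range(n_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)

    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise
    for t in reversed(range(n_steps)):
        beta, alpha_bar = betas[t], alpha_bars[t]
        new_x = []
        for x_i, target_i in zip(x, target):
            # Oracle stand-in for the trained network: with one known target,
            # the noise contained in x_i can be computed exactly.
            eps_hat = (x_i - math.sqrt(alpha_bar) * target_i) / math.sqrt(1.0 - alpha_bar)
            # Mean of the previous, less noisy state.
            mean = (x_i - beta / math.sqrt(1.0 - alpha_bar) * eps_hat) / math.sqrt(1.0 - beta)
            # Fresh noise is added on every step except the last one.
            noise = math.sqrt(beta) * rng.gauss(0.0, 1.0) if t > 0 else 0.0
            new_x.append(mean + noise)
        x = new_x
    return x

print(toy_reverse_diffusion([1.0, -2.0, 0.5]))
```

In a real structure-conditioned model, the oracle line would be a neural network call that also receives the protein pocket features.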
### 🚀 What is DrugFlow?

**DrugFlow** is an open, modular AI framework for **structure-conditioned molecular generation** using **diffusion-based generative modeling**. It is built for rapid prototyping, scalable training, and cloud-ready deployment. Designed for computational chemists, bioinformaticians, and ML researchers alike, DrugFlow provides:

* 🧠 **Generative Backbones**: Includes graph- and coordinate-based diffusion models
* 🧬 **Protein Context Conditioning**: Accepts 3D binding pockets, AlphaFold2 outputs, or sequence embeddings
* ⚙️ **Training & Inference Pipelines**: Supports mixed-precision training, checkpointing, and evaluation
* ☁️ **Cloud Scalability**: Dockerized environments for seamless transition from local dev to GPU clusters

### 💡 Why DrugFlow?

Unlike classical docking or virtual screening, which evaluate billions of pre-enumerated compounds **after** they have been generated, DrugFlow focuses on **generating the right molecules** in the first place, guided by structure, logic, and AI-learned priors. Whether you're targeting kinases, GPCRs, or designing novel scaffolds, DrugFlow empowers you to:

* **Design rather than search**
* **Learn from structure, not just data**
* **Scale effortlessly from laptop to cloud**

In the following sections, we’ll guide you through setting up DrugFlow with Docker, running sample tasks on targets like **GSK3-β**, and extending the system to your own proteins or datasets.

## Notebook Roadmap

---

### Sections
- [Building and Running the Docker Container](#building-and-running-the-docker-container)
- [Using DrugFlow](#using-drugflow)
- [Small GSK3B-FRAT1 Study](#small-gsk3b-frat1-study)

### Prerequisites

Before you begin, ensure you have the following installed on your local machine:

- Docker: [Install Docker](https://docs.docker.com/get-docker/)
- A compatible GPU (for DrugFlow)
- NVIDIA drivers (if using GPU)

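A quick way to confirm the prerequisites above is to check that the relevant executables are on `PATH`. The sketch below is a minimal, assumed helper (the function name is hypothetical); it only tests for presence and does not verify driver or Docker versions.

```python
import shutil

def check_prerequisites(tools=("docker", "nvidia-smi")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}

if __name__ == "__main__":
    for tool, path in check_prerequisites().items():
        print(f"{tool}: {path or 'NOT FOUND'}")
```

If `nvidia-smi` is reported as missing, the NVIDIA driver is most likely not installed on the host.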

## Building and Running the Docker Container

---

To build and run the Docker container for DrugFlow, follow these steps:

1. **Clone the Repository**: Clone the DrugFlow repository to your local machine.

```bash
git clone https://github.com/gabenavarro/MLContainerLab.git
cd MLContainerLab
```

2. **Build the Docker Image**: Use the provided Dockerfile to build the Docker image.

```bash
# You can choose any tag you want for the image
# Feel free to play around with the base image, just make sure the host has the same or higher CUDA version
docker build -f ./assets/build/Dockerfile.drugflow.cu121cp311 -t drugflow:121-311 .
```
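The comment in the build command asks for a host CUDA version that is the same or higher than the image's. One way to check this, sketched below, is to parse the `CUDA Version: X.Y` field that `nvidia-smi` prints in its header. The required version `(12, 1)` is an assumption inferred from the `cu121` suffix of the Dockerfile name, and both function names are hypothetical.

```python
import re

def parse_cuda_version(smi_output):
    """Extract (major, minor) from the 'CUDA Version: X.Y' field of nvidia-smi."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    return (int(match.group(1)), int(match.group(2))) if match else None

def host_supports(smi_output, required=(12, 1)):
    """Return True if the reported CUDA version is at least `required`."""
    version = parse_cuda_version(smi_output)
    return version is not None and version >= required

# Usage on a host with NVIDIA drivers installed:
#   import subprocess
#   text = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
#   print(host_supports(text))
```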
3. **Run the Docker Container**: Run the Docker container with the necessary configuration. The first example runs the container locally with GPU support, which is the recommended setup during development. For scaling up, a second example runs the container in the cloud.

```bash
# Run the container with GPU support
docker run -dt \
   --gpus all \
   --shm-size=64g \
   -v "$(pwd)/assets:/workspace/assets" \
   -v "$(pwd)/documentation:/workspace/documentation" \
   -v "$(pwd)/datasets:/workspace/datasets" \
   --name drugflow \
   --env NVIDIA_VISIBLE_DEVICES=all \
   --env GOOGLE_APPLICATION_CREDENTIALS=/workspace/assets/secrets/gcp-key.json \
   --entrypoint /bin/bash \
   drugflow:121-311
```
> Note: The three `-v` options mount the local `assets`, `documentation`, and `datasets` directories under `/workspace` in the container, so those files are accessible from inside it. The `--env` options set environment variables for GPU visibility and Google Cloud credentials.<br>
> Note: The `--gpus all` option allows the container to use all available GPUs.<br>

4. **Access the Container with IDE**: In this example, we will use Visual Studio Code to access the container. You can use any IDE of your choice.

```bash
# In a scriptable manner
CONTAINER_NAME=drugflow
FOLDER=/workspace
HEX_CONFIG=$(printf '%s' "{\"containerName\":\"/$CONTAINER_NAME\"}" | od -A n -t x1 | tr -d '[\n\t ]')
code --folder-uri "vscode-remote://attached-container+$HEX_CONFIG$FOLDER"
```
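The shell snippet above hex-encodes a small JSON object and appends the folder path to form the URI. For readers who find the `od`/`tr` pipeline opaque, here is an equivalent sketch in Python; the function name is hypothetical, and the container name and folder are the values used above.

```python
import json

def attached_container_uri(container_name, folder="/workspace"):
    """Build the vscode-remote URI for attaching to a running container."""
    # Compact JSON (no spaces) matches the string produced by the printf call.
    config = json.dumps({"containerName": f"/{container_name}"}, separators=(",", ":"))
    hex_config = config.encode("utf-8").hex()
    return f"vscode-remote://attached-container+{hex_config}{folder}"

print(attached_container_uri("drugflow"))
```

The printed string can be passed to `code --folder-uri`.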

> Note: The `code` command opens Visual Studio Code. Make sure the Remote - Containers extension (now published as Dev Containers) is installed in VS Code so that it can attach to the container directly.<br>


## Using DrugFlow

---

### DrugFlow example

First, let's print the help text of the DrugFlow repository's `generate.py` script (`generate.py --help`) to see which options are available.

```bash
usage: generate.py [-h] --protein PROTEIN --ref_ligand REF_LIGAND --checkpoint
                   CHECKPOINT [--molecule_size MOLECULE_SIZE]
                   [--output OUTPUT] [--n_samples N_SAMPLES]
                   [--batch_size BATCH_SIZE]
                   [--pocket_distance_cutoff POCKET_DISTANCE_CUTOFF]
                   [--n_steps N_STEPS] [--device DEVICE] [--datadir DATADIR]
                   [--seed SEED] [--filter] [--metrics_output METRICS_OUTPUT]
                   [--gnina GNINA] [--reduce REDUCE]

options:
  -h, --help            show this help message and exit
  --protein PROTEIN     Input PDB file.
  --ref_ligand REF_LIGAND
                        SDF file with reference ligand used to define the
                        pocket.
  --checkpoint CHECKPOINT
                        Model checkpoint file.
  --molecule_size MOLECULE_SIZE
                        Maximum number of atoms in the sampled molecules. Can
                        be a single number or a range, e.g. '15,20'. If None,
                        size will be sampled.
  --output OUTPUT       Output file.
  --n_samples N_SAMPLES
                        Number of sampled molecules.
  --batch_size BATCH_SIZE
                        Batch size.
  --pocket_distance_cutoff POCKET_DISTANCE_CUTOFF
                        Distance cutoff to define the pocket around the
                        reference ligand.
  --n_steps N_STEPS     Number of denoising steps.
  --device DEVICE       Device to use.
  --datadir DATADIR     Needs to be specified to sample molecule sizes.
  --seed SEED           Random seed.
  --filter              Apply basic filters and keep sampling until
                        `n_samples` molecules passing these filters are found.
  --metrics_output METRICS_OUTPUT
                        If provided, metrics will be computed and saved in csv
                        format at this location.
  --gnina GNINA         Path to a gnina executable. Required for computing
                        docking scores.
  --reduce REDUCE       Path to a reduce executable. Required for computing
                        interactions.
```
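When running many targets, it can help to assemble the `generate.py` call programmatically. The sketch below is a hypothetical helper, not part of DrugFlow: it only uses flags listed in the help text above, treats `True` values as switches such as `--filter`, and assumes the script lives at `/workspace/src/generate.py`.

```python
import shlex

def build_generate_command(protein, ref_ligand, checkpoint,
                           script="/workspace/src/generate.py", **options):
    """Assemble the argument list for generate.py from keyword options."""
    cmd = ["python", script,
           "--protein", protein,
           "--ref_ligand", ref_ligand,
           "--checkpoint", checkpoint]
    for name, value in options.items():
        if value is True:
            cmd.append(f"--{name}")            # switch without a value, e.g. --filter
        elif value is not None and value is not False:
            cmd += [f"--{name}", str(value)]   # option that takes a value
    return cmd

cmd = build_generate_command(
    "/workspace/examples/kras.pdb",
    "/workspace/examples/kras_ref_ligand.sdf",
    "/models/drugflow/drugflow.ckpt",
    n_steps=20, n_samples=10, filter=True,
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` inside the container.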


Next, let's run the example KRAS protein target and reference ligand.

![kras-ligand](https://storage.googleapis.com/gn-portfolio/images/kras_drugflow_example.gif)

```python
!python /workspace/src/generate.py \
  --protein /workspace/examples/kras.pdb \
  --ref_ligand /workspace/examples/kras_ref_ligand.sdf \
  --checkpoint /models/drugflow/drugflow.ckpt \
  --output /workspace/examples/samples.sdf \
  --metrics_output /workspace/examples/metrics.csv \
  --n_steps 20 \
  --n_samples 10
```

```text
MDAnalysis.topology.tables has been moved to MDAnalysis.guesser.tables. This import point will be removed in MDAnalysis version 3.0.0
Marginal Prior MB
Will generate 10 samples
Evaluator [gnina] is not included
Evaluator [interactions] is not included
Will use the following evaluators:
- [representation]
- [mol_props]
- [posebusters]
- [medchem]
- [clashes]
- [geometry]
- [ring_count]
- [energy]
- [chembl_ring_systems]
- [reos]
  0%|                                                    | 0/10 [00:00<?, ?it/s]Failed to prepare molecule: Explicit valence for atom # 19 N, 4, is greater than permitted
Failed to prepare molecule: Can't kekulize mol.  Unkekulized atoms: 5 21 24
Failed to prepare molecule: Can't kekulize mol.  Unkekulized atoms: 1 4 7
Failed to prepare molecule: non-ring atom 3 marked aromatic
Failed to prepare molecule: Can't kekulize mol.  Unkekulized atoms: 3 12 17 20 35
100%|███████████████████████████████████████████| 10/10 [00:05<00:00,  1.82it/s]
```

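The run above also writes a metrics file via `--metrics_output`. Its column names are not shown in this notebook, so the sketch below stays schema-agnostic: it averages every column whose values all parse as numbers and skips the rest. The function name is hypothetical and only the standard library is used.

```python
import csv
import statistics

def summarize_numeric_columns(csv_path):
    """Return {column: mean} for columns whose non-empty values all parse as floats."""
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    summary = {}
    for column in (rows[0].keys() if rows else []):
        try:
            values = [float(row[column]) for row in rows if row[column] not in ("", None)]
        except ValueError:
            continue  # at least one value is not numeric, so skip this column
        if values:
            summary[column] = statistics.fmean(values)
    return summary

# Usage inside the container, with the path from the command above:
#   print(summarize_numeric_columns("/workspace/examples/metrics.csv"))
```

With only 10 samples the averages are noisy, so treat them as a sanity check rather than a benchmark.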

Below is an example of the output generated by DrugFlow.

![DrugFlow Example Output](../assets/images/drugflow/state-sweep-example.gif)


### Benchmarking

Let's push the limits of DrugFlow's generative capability and see how well it works as part of a drug discovery pipeline. The steps are as follows:

1. Define the binding pocket of a target that has known orthosteric drugs, using its complex with a native protein partner.
2. Use the native protein partner as the reference ligand to generate small-molecule binders for the target with DrugFlow.
3. Rank the generated small molecules using Boltz2 affinity prediction.
4. From a library of FDA-approved drugs, identify the compounds most similar to the top-ranked Boltz2 molecules.

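Step 4 needs a similarity measure, which the steps above do not specify. A common choice is Tanimoto similarity over fingerprint bits, sketched below in plain Python. This is an assumption rather than the method used in this notebook: the function names are hypothetical, and fingerprints are represented as sets of on-bit indices that would be computed elsewhere (for example with a cheminformatics toolkit).

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as on-bit index sets."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def most_similar(query_bits, library):
    """Return (name, similarity) of the library entry closest to the query.

    `library` maps a compound name to its set of on-bit indices.
    """
    scored = ((name, tanimoto(query_bits, bits)) for name, bits in library.items())
    return max(scored, key=lambda pair: pair[1])

# Hypothetical toy fingerprints, for illustration only.
library = {"drug_a": {1, 2, 3, 4}, "drug_b": {7, 8}}
print(most_similar({1, 2, 3}, library))
```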
#### Generate Binding Pocket

Below, we use the BCL2-BAX complex as the starting complex. We exported BAX, acting as the reference ligand, to an SDF file and kept BCL2 in PDB format so that it works with `generate.py`.

```python
# Example
!python /workspace/src/generate.py \
  --protein /workspace/datasets/boltz2/predict2/boltz_results_boltz2-example/bcl2_target.pdb \
  --ref_ligand /workspace/datasets/boltz2/predict2/boltz_results_boltz2-example/bax_ligand.sdf \
  --checkpoint /models/drugflow/drugflow.ckpt \
  --output /workspace/datasets/boltz2/predict2/boltz_results_boltz2-example/samples.sdf \
  --n_steps 10 \
  --n_samples 500 \
  --reduce /miniconda/envs/drugflow/bin/reduce
```


```text
@> ProDy is configured: verbosity='none'
MDAnalysis.topology.tables has been moved to MDAnalysis.guesser.tables. This import point will be removed in MDAnalysis version 3.0.0
Marginal Prior MB
Will generate 500 samples
Evaluator [gnina] is not included
Will use the following evaluators:
- [representation]
- [mol_props]
- [posebusters]
- [medchem]
- [clashes]
- [geometry]
- [ring_count]
- [energy]
- [chembl_ring_systems]
- [reos]
- [interactions]
512it [03:18,  2.58it/s]
```


Next, let's filter out any generated samples that consist of multiple disconnected fragments.

```python
from rdkit import Chem
from rdkit.Chem import SDWriter

def filter_single_molecule_states(input_sdf: str, output_sdf: str) -> int:
    """
    Filters and writes only single-molecule states from an SDF file.

    Parameters
    ----------
    input_sdf : str
        Path to the input SDF file containing multiple states.
    output_sdf : str
        Path to save the filtered SDF file containing only single-molecule entries.

    Returns
    -------
    int
        Number of single-molecule entries written to the output file.
    """
    suppl = Chem.SDMolSupplier(input_sdf, removeHs=False)
    writer = SDWriter(output_sdf)
    count = 0

    for mol in suppl:
        if mol is None:
            continue  # Skip entries RDKit could not read or sanitize
        # A single-molecule state has exactly one connected fragment
        frags = Chem.GetMolFrags(mol, asMols=True)
        if len(frags) == 1:
            writer.write(mol)
            count += 1

    writer.close()
    return count

filter_single_molecule_states(
    input_sdf="/workspace/datasets/boltz2/predict2/boltz_results_boltz2-example/samples.sdf",
    output_sdf="/workspace/datasets/boltz2/predict2/boltz_results_boltz2-example/samples_filtered.sdf"
)
```

```text
[08:43:20] Explicit valence for atom # 16 O, 3, is greater than permitted
[08:43:20] ERROR: Could not sanitize molecule ending on line 84
[08:43:20] ERROR: Explicit valence for atom # 16 O, 3, is greater than permitted
[08:43:20] Explicit valence for atom # 28 C, 5, is greater than permitted
[08:43:20] ERROR: Could not sanitize molecule ending on line 476
[08:43:20] ERROR: Explicit valence for atom # 28 C, 5, is greater than permitted
[08:43:20] Explicit valence for atom # 13 O, 3, is greater than permitted
[08:43:20] ERROR: Could not sanitize molecule ending on line 625
[08:43:20] ERROR: Explicit valence for atom # 13 O, 3, is greater than permitted
[08:43:20] Explicit valence for atom # 29 C, 5, is greater than permitted
[08:43:20] ERROR: Could not sanitize molecule ending on line 783
[08:43:20] ERROR: Explicit valence for atom # 29 C, 5, is greater than permitted
[08:43:20] Explicit valence for atom # 20 O, 3, is greater than permitted
[08:43:20] ERROR: Could not sanitize molecule e

15
```

After filtering out multi-fragment molecules, we end up with only 11, as shown in the animation below.

![11-fragment-animation](https://storage.googleapis.com/gn-portfolio/images/drugflow_bcl2_generated_structures.gif)

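To see how many samples survive the filter, it can be useful to compare raw record counts in the input and output SDF files. The sketch below is a hypothetical standard-library helper that counts records by their `$$$$` terminator lines; it does not parse any chemistry, so it also counts records that RDKit fails to sanitize.

```python
def count_sdf_records(sdf_path):
    """Count SDF records by their '$$$$' terminator lines."""
    with open(sdf_path) as handle:
        return sum(1 for line in handle if line.strip() == "$$$$")

# Usage inside the container, with the paths from the filtering step:
#   base = "/workspace/datasets/boltz2/predict2/boltz_results_boltz2-example"
#   print(count_sdf_records(f"{base}/samples.sdf"),
#         count_sdf_records(f"{base}/samples_filtered.sdf"))
```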
Let's move on to ranking these 11 samples using Boltz2.