# AI-PSCI-016: AI-Powered Docking (DiffDock & GNINA)

**AI in Pharmaceutical Sciences: Bench to Bedside**  
VCU School of Pharmacy | VIP Program | Spring 2026

---

**Week 9 | Module: Testing & Evaluation | Estimated Time: 90-120 minutes**

**Prerequisites**: AI-PSCI-001 through AI-PSCI-015 (especially AI-PSCI-015: AutoDock Vina)

---

## 🎯 Learning Objectives

After completing this talktorial, you will be able to:

1. Run DiffDock for blind docking (no predefined binding site)
2. Run GNINA for CNN-enhanced docking with neural network scoring
3. Compare diffusion-based vs CNN-based vs physics-based docking approaches
4. Interpret confidence scores from different methods
5. Decide which docking method is most appropriate for different scenarios

---

## 📚 Background

### The Evolution of Molecular Docking

In AI-PSCI-015, we used **AutoDock Vina** — a physics-based docking method that uses empirical scoring functions. While effective, these methods have limitations:

- **Require a predefined binding site** (you must know where to dock)
- **Rigid protein assumption** (protein doesn't move during docking)
- **Scoring function limitations** (empirical formulas miss some interactions)

AI-powered docking methods aim to overcome these limitations:

### DiffDock: Diffusion Models for Docking

**DiffDock** (Corso et al., 2023) treats molecular docking as a generative modeling problem:

- Uses **diffusion models** (the generative approach behind image generators such as DALL-E 2 and Stable Diffusion, applied here to ligand poses rather than pixels)
- Performs **blind docking** — finds the binding site automatically
- Learns from thousands of experimental protein-ligand structures
- Returns **confidence scores** for each predicted pose

**How it works:**
1. Start with ligand in random position/orientation
2. Iteratively "denoise" to find the true binding pose
3. Repeat from many random starting points; the resulting poses can land in different candidate binding sites, and a separate confidence model ranks them
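
The denoising idea can be sketched with a toy example. This is not DiffDock: it only moves ligand *positions* (DiffDock also updates rotation and torsion angles), and the learned neural network is replaced by an analytic stand-in that already knows where the pocket is. The function name, pocket coordinates, and noise schedule are all made up for illustration.

```python
import numpy as np

def toy_reverse_diffusion(pocket_center, n_samples=8, n_steps=50, seed=0):
    """Toy annealed denoising of ligand positions toward a known pocket.

    Stand-in "score model": the exact gradient of a Gaussian centered on the
    pocket. A real diffusion model learns this direction from data.
    """
    rng = np.random.default_rng(seed)
    pocket_center = np.asarray(pocket_center, dtype=float)
    # Step 1: start every sample at a random position
    positions = rng.normal(loc=0.0, scale=20.0, size=(n_samples, 3))
    # Noise level shrinks from large (exploration) to small (refinement)
    sigmas = np.geomspace(20.0, 0.1, n_steps)
    for sigma in sigmas:
        # Step 2: move along the score, the direction of higher probability.
        # With this step size each update moves a sample halfway to the pocket.
        score = (pocket_center - positions) / sigma**2
        positions = positions + 0.5 * sigma**2 * score
        # Add a little fresh noise, scaled down as sigma decreases
        positions = positions + 0.1 * sigma * rng.normal(size=positions.shape)
    return positions

pocket = np.array([10.0, -5.0, 3.0])  # made-up pocket center (angstroms)
final = toy_reverse_diffusion(pocket)
# Step 3: several independent samples, each with its own distance to the pocket
print(np.linalg.norm(final - pocket, axis=1).round(2))
```

Because every sample starts from a different random position, running many samples is what lets a real diffusion model propose poses in more than one site.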

### GNINA: CNN-Enhanced Scoring

**GNINA** (McNutt et al., 2021) enhances traditional docking with neural networks:

- Built on AutoDock Vina's sampling algorithm
- Uses **3D convolutional neural networks** (CNNs) for scoring
- Trained on experimental binding affinity data
- Provides two scores: **CNN_score** (pose quality, 0-1) and **CNN_affinity** (predicted binding affinity, in pK units); GNINA's output labels them `CNNscore` and `CNNaffinity`

**Key advantage:** Learns complex patterns from data that physics-based functions miss.
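
To make "3D CNN scoring" more concrete, the sketch below shows one way atom coordinates can be turned into a voxel grid, the kind of image-like input a 3D CNN consumes. It is a simplified illustration under stated assumptions: a single density channel, 1 Å spacing, and a hypothetical `voxelize` helper. GNINA's real grids are richer (for example, separate channels for different atom types).

```python
import numpy as np

def voxelize(coords, center, box_size=24.0, resolution=1.0, sigma=1.0):
    """Turn atom coordinates (N, 3) into a single-channel 3D density grid."""
    coords = np.asarray(coords, dtype=float)
    center = np.asarray(center, dtype=float)
    n = int(round(box_size / resolution))
    # Coordinates of voxel centers along one axis, relative to the box center
    axis = (np.arange(n) + 0.5) * resolution - box_size / 2.0
    gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
    grid = np.zeros((n, n, n), dtype=np.float32)
    for x, y, z in coords - center:
        # Each atom contributes a Gaussian blob of density around its position
        d2 = (gx - x) ** 2 + (gy - y) ** 2 + (gz - z) ** 2
        grid += np.exp(-d2 / (2.0 * sigma**2)).astype(np.float32)
    return grid

# One made-up atom placed half a voxel away from the box center
demo_grid = voxelize([[0.5, 0.5, 0.5]], center=[0.0, 0.0, 0.0])
print(demo_grid.shape, np.unravel_index(demo_grid.argmax(), demo_grid.shape))
```

A CNN trained on many such grids can pick up spatial patterns of protein and ligand atoms without anyone writing down an explicit energy term for them.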

### When to Use Each Method

| Method | Best For | Limitations |
|--------|----------|-------------|
| **AutoDock Vina** | Fast screening, known binding sites | Rigid protein, requires binding box |
| **DiffDock** | Blind docking, novel targets, cryptic sites | Slower; GPU strongly recommended |
| **GNINA** | Rescoring poses, improved ranking | Still requires binding box |

### Key Concepts

- **Blind docking**: Finding the binding site without prior knowledge
- **Diffusion models**: Generative AI that learns to reverse a noising process
- **CNN scoring**: Using convolutional neural networks to evaluate poses
- **Confidence score**: Model's estimate of prediction quality (higher = more confident)

---

## 🛠️ Setup

Run this cell to install required packages:

In [None]:
#@title 🛠️ Install required packages (click ▶ to run)
!pip install rdkit -q
!pip install py3Dmol -q
!pip install biopython -q
!pip install requests -q
!pip install pandas -q
!pip install scipy -q
!pip install gradio_client -q  # For DiffDock API
print("✅ All packages installed!")

Import required libraries:

In [None]:
# Core libraries
import os
import time
import requests
import numpy as np
import pandas as pd
from pathlib import Path
import tempfile
import zipfile
import shutil

# Molecular handling
from rdkit import Chem
from rdkit.Chem import AllChem, Draw

# Visualization
import py3Dmol
import matplotlib.pyplot as plt

# Structure analysis
from Bio.PDB import PDBParser, PDBIO, Select
from scipy.spatial.distance import cdist

# DiffDock API
from gradio_client import Client, handle_file

print("✅ All libraries imported successfully!")

---

## 🎯 Target Configuration

Select your drug target. This selection will be used throughout this talktorial.

**Important**: Use the SAME target you used in AI-PSCI-015 so we can compare results!

In [None]:
#@title 🎯 Select Your Drug Target
TARGET = "DHFR" #@param ["DHFR", "ABL1", "EGFR", "AChE", "COX-2", "DPP-4"]

# Complete target configuration with all identifiers
TARGET_CONFIG = {
    "DHFR": {
        "pdb": "2W9S",
        "uniprot": "P0ABQ4",
        "chembl": "CHEMBL202",
        "drug": "Trimethoprim",
        "drug_smiles": "COc1cc(Cc2cnc(N)nc2N)cc(OC)c1OC",
        "ligand_code": "TOP",
        "description": "Dihydrofolate reductase - antibiotic target"
    },
    "ABL1": {
        "pdb": "1IEP",
        "uniprot": "P00519",
        "chembl": "CHEMBL1862",
        "drug": "Imatinib",
        "drug_smiles": "Cc1ccc(NC(=O)c2ccc(CN3CCN(C)CC3)cc2)cc1Nc1nccc(-c2cccnc2)n1",
        "ligand_code": "STI",
        "description": "Tyrosine kinase - cancer target (CML)"
    },
    "EGFR": {
        "pdb": "1M17",
        "uniprot": "P00533",
        "chembl": "CHEMBL203",
        "drug": "Erlotinib",
        "drug_smiles": "COCCOc1cc2ncnc(Nc3cccc(C#C)c3)c2cc1OCCOC",
        "ligand_code": "AQ4",
        "description": "Receptor tyrosine kinase - cancer target (NSCLC)"
    },
    "AChE": {
        "pdb": "4EY7",
        "uniprot": "P22303",
        "chembl": "CHEMBL220",
        "drug": "Donepezil",
        "drug_smiles": "COc1cc2CC(CC3CCN(Cc4ccccc4)CC3)C(=O)c2cc1OC",
        "ligand_code": "E20",
        "description": "Acetylcholinesterase - Alzheimer's target"
    },
    "COX-2": {
        "pdb": "3LN1",
        "uniprot": "P35354",
        "chembl": "CHEMBL230",
        "drug": "Celecoxib",
        "drug_smiles": "Cc1ccc(-c2cc(C(F)(F)F)nn2-c2ccc(S(N)(=O)=O)cc2)cc1",
        "ligand_code": "CEL",
        "description": "Cyclooxygenase-2 - inflammation target"
    },
    "DPP-4": {
        "pdb": "1X70",
        "uniprot": "P27487",
        "chembl": "CHEMBL284",
        "drug": "Sitagliptin",
        "drug_smiles": "N[C@@H](CC(=O)N1CCn2c(nnc2C(F)(F)F)C1)Cc1cc(F)c(F)cc1F",
        "ligand_code": "715",
        "description": "Dipeptidyl peptidase-4 - diabetes target"
    }
}

# Get configuration for selected target
config = TARGET_CONFIG[TARGET]

print(f"✅ Target: {TARGET}")
print(f"   Description: {config['description']}")
print(f"   PDB: {config['pdb']} | UniProt: {config['uniprot']}")
print(f"   Reference Drug: {config['drug']}")

---

## 🔬 Guided Inquiry 1: Preparing Structures for AI Docking

### Context

Both DiffDock and GNINA require input structures in specific formats:

- **Protein**: PDB format (clean, single chain preferred)
- **Ligand**: SDF or MOL format for DiffDock; GNINA reads SDF and PDB directly, as well as the PDBQT files prepared for Vina

We'll prepare our protein and ligand, similar to AI-PSCI-015 but adapted for these tools.

### Your Task

Using your AI assistant, write code to:
1. Download the PDB structure for your target
2. Extract and save the protein (chain A only)
3. Generate the ligand 3D structure from SMILES and save as SDF
4. Display both structures

💡 **Prompting Tips**:
- Ask: "How do I save an RDKit molecule as an SDF file?"
- The SDWriter class in RDKit handles SDF output
- Make sure to add hydrogens before generating 3D coordinates
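
As a starting point for step 3, here is a minimal sketch of going from SMILES to a 3D SDF file with RDKit. The helper name `smiles_to_sdf`, the output file name, and the random seed are illustrative choices, and MMFF optimization is one reasonable way to clean up the embedded geometry, not the only one. The SMILES is the trimethoprim entry from `TARGET_CONFIG`; swap in `config["drug_smiles"]` for your own target.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def smiles_to_sdf(smiles, sdf_path, seed=42):
    """Build a 3D conformer from SMILES and write it to an SDF file."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    # Hydrogens must be present before generating 3D coordinates
    mol = Chem.AddHs(mol)
    # EmbedMolecule returns 0 on success and -1 on failure
    if AllChem.EmbedMolecule(mol, randomSeed=seed) != 0:
        raise RuntimeError("3D embedding failed")
    # Relax the geometry with the MMFF force field
    AllChem.MMFFOptimizeMolecule(mol)
    writer = Chem.SDWriter(str(sdf_path))
    writer.write(mol)
    writer.close()
    return mol

ligand = smiles_to_sdf("COc1cc(Cc2cnc(N)nc2N)cc(OC)c1OC", "ligand.sdf")
print(ligand.GetNumAtoms(), "atoms including hydrogens")
```

Checking for a `None` molecule right after parsing is worth keeping: a malformed SMILES otherwise fails later with a much less helpful error.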

### Verification

After running your code, confirm:
- [ ] Protein PDB file saved
- [ ] Ligand SDF file saved
- [ ] 3D visualization displays both

📓 **Lab Notebook**: Record the number of atoms in your prepared ligand.

In [None]:
# Your code here



## 🔬 Guided Inquiry 2: Running DiffDock for Blind Docking

### Context

DiffDock is a **blind docking** method — it finds the binding site automatically! This is revolutionary because:

- No need to define a binding box
- Can discover unexpected binding sites
- Learns pose preferences from experimental complexes rather than a hand-tuned scoring function (the protein itself is still held rigid)

We'll use the DiffDock web API (via Hugging Face) to run docking.

### Your Task

Using your AI assistant, write code to:
1. Connect to the DiffDock API
2. Submit your protein and ligand for docking
3. Retrieve and parse the results
4. Display the top poses with confidence scores

💡 **Prompting Tips**:
- Ask: "How do I use the gradio_client to call the DiffDock API?"
- The API returns multiple poses ranked by confidence
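- Confidence scores are unbounded model outputs, not probabilities, and are often negative (higher = better); the DiffDock authors suggest reading values above 0 as high confidence, between -1.5 and 0 as moderate, and below -1.5 as low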
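
Once results are downloaded, ranking and labeling the poses is mostly bookkeeping. The sketch below assumes the poses arrive as files named like `rank1_confidence-0.85.sdf`, which is how the DiffDock code names its outputs as far as I know; check the files you actually receive, since the web API may package them differently. The thresholds follow the rough guidance in the DiffDock README (above 0 high, -1.5 to 0 moderate, below -1.5 low). The example file names are invented.

```python
import re

# Assumed file-name pattern for DiffDock poses, e.g. "rank1_confidence-0.85.sdf"
POSE_NAME = re.compile(r"rank(\d+)_confidence(-?\d+(?:\.\d+)?)\.sdf$")

def parse_pose_name(filename):
    """Return (rank, confidence) from a pose file name, or None if no match."""
    match = POSE_NAME.search(filename)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))

def confidence_label(confidence):
    """Map a DiffDock confidence value to a rough qualitative label."""
    if confidence > 0:
        return "high"
    if confidence > -1.5:
        return "moderate"
    return "low"

# Invented file names, for illustration only
example_files = [
    "rank1_confidence0.42.sdf",
    "rank2_confidence-0.85.sdf",
    "rank3_confidence-2.10.sdf",
]
for name in example_files:
    rank, conf = parse_pose_name(name)
    print(rank, conf, confidence_label(conf))
```

Keep in mind that the confidence value estimates pose quality, not binding affinity, so it should not be compared numerically with a Vina score.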

### Verification

After running your code, confirm:
- [ ] DiffDock returned multiple poses
- [ ] Confidence scores are displayed
- [ ] Best pose is visualized with the protein

📓 **Lab Notebook**: Record the confidence score of the top DiffDock pose.

In [None]:
# Your code here



## 🔬 Guided Inquiry 3: Installing and Running GNINA

### Context

GNINA enhances AutoDock Vina with deep learning:

- Uses the **same sampling algorithm** as Vina
- By default keeps Vina-style scoring during sampling and uses a **3D CNN** to rescore and re-rank the final poses
- Trained on thousands of experimental structures
- Provides **CNN_score** (0-1, estimated pose quality) and **CNN_affinity** (predicted affinity in pK units, not kcal/mol)

### Your Task

Using your AI assistant, write code to:
1. Download the GNINA binary for Linux
2. Configure the docking parameters (same box as Vina)
3. Run GNINA docking
4. Parse and display the results

💡 **Prompting Tips**:
- Ask: "How do I run GNINA from the command line in Colab?"
- GNINA uses similar parameters to Vina
- The output includes CNN scores in addition to Vina-like scores
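
Since GNINA is a command-line program, it helps to assemble the command as a list before running it. The sketch below uses flags that GNINA shares with Vina and smina, plus `--cnn_scoring`; the helper name, file names, binary location (`./gnina`), and box values are placeholders. Use the box center and size you recorded in AI-PSCI-015, and download the prebuilt Linux binary from the GNINA GitHub releases page.

```python
import shlex

def build_gnina_command(receptor, ligand, out_path, center, size,
                        exhaustiveness=8, num_modes=9, seed=0,
                        gnina_path="./gnina"):
    """Assemble a GNINA docking command as an argument list."""
    cx, cy, cz = center
    sx, sy, sz = size
    return [
        gnina_path,
        "-r", str(receptor),            # receptor (protein) file
        "-l", str(ligand),              # ligand file
        "--center_x", f"{cx:.3f}",      # box center, same as the Vina run
        "--center_y", f"{cy:.3f}",
        "--center_z", f"{cz:.3f}",
        "--size_x", f"{sx:.3f}",        # box edge lengths in angstroms
        "--size_y", f"{sy:.3f}",
        "--size_z", f"{sz:.3f}",
        "--exhaustiveness", str(exhaustiveness),
        "--num_modes", str(num_modes),
        "--seed", str(seed),            # fixed seed for repeatable runs
        "--cnn_scoring", "rescore",     # CNN re-ranks the final poses
        "-o", str(out_path),            # docked poses with scores
    ]

# Placeholder box values; replace with your own
cmd = build_gnina_command("protein.pdb", "ligand.sdf", "gnina_poses.sdf",
                          center=(10.0, 20.0, 30.0), size=(20.0, 20.0, 20.0))
print(shlex.join(cmd))
# To execute in Colab:
# subprocess.run(cmd, check=True, capture_output=True, text=True)
```

Passing a list to `subprocess.run` avoids shell-quoting problems with file names, and reusing the Vina box keeps the later method comparison fair.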

### Verification

After running your code, confirm:
- [ ] GNINA binary installed successfully
- [ ] Docking completed with multiple poses
- [ ] Both CNN_score and CNN_affinity are displayed

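📓 **Lab Notebook**: Compare GNINA's results to Vina's score from AI-PSCI-015. CNN_affinity is reported in pK units while Vina's score is in kcal/mol, so compare the Vina-style affinity that GNINA also reports (kcal/mol) directly, and record CNN_affinity as a separate scale.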
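
To put the two scales side by side, a pK value can be converted to an approximate free energy with ΔG = -RT ln(10) · pK, about -1.36 kcal/mol per pK unit at 298 K. The sketch below does that conversion and reads the score tags GNINA writes into its output SDF (`minimizedAffinity`, `CNNscore`, `CNNaffinity`). The example record and its values are invented, and treating a predicted pK as a dissociation constant is itself an approximation.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def pk_to_kcal_per_mol(pk, temperature=298.15):
    """Approximate binding free energy (kcal/mol) for a pK value."""
    return -R_KCAL * temperature * math.log(10) * pk

def read_sd_tags(sdf_text):
    """Return one {tag: value} dict per molecule record in SDF text."""
    records = []
    for block in sdf_text.split("$$$$"):
        lines = block.splitlines()
        tags = {}
        for i, line in enumerate(lines):
            # Tag headers look like ">  <CNNscore>"; the value is on the next line
            if line.startswith(">") and "<" in line and i + 1 < len(lines):
                name = line[line.index("<") + 1 : line.rindex(">")]
                tags[name] = lines[i + 1].strip()
        if tags:
            records.append(tags)
    return records

# Invented tag section of one pose (atom block omitted for brevity)
example = """\
>  <minimizedAffinity>
-8.10

>  <CNNscore>
0.85

>  <CNNaffinity>
6.20

$$$$
"""
records = read_sd_tags(example)
for rec in records:
    pk = float(rec["CNNaffinity"])
    print(rec["minimizedAffinity"], rec["CNNscore"], pk,
          round(pk_to_kcal_per_mol(pk), 2))
```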

In [None]:
# Your code here



In [None]:
# Your code here



## 🔬 Guided Inquiry 4: Visualizing and Comparing Poses

### Context

Now let's visualize the docking results from both AI methods and compare them to what we'd expect from Vina.

Key questions to address:
- Do DiffDock and GNINA find similar binding sites?
- How do the poses compare to the crystal structure?
- Which method gives higher confidence?

### Your Task

Using your AI assistant, write code to:
1. Load the docked poses from each method
2. Visualize them overlaid with the protein
3. Compare to the crystal ligand position
4. Create a summary visualization

💡 **Prompting Tips**:
- Ask: "How do I overlay multiple molecules in py3Dmol with different colors?"
- Use different colors for each method's poses
- The crystal ligand is in the original PDB file
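
Pulling the crystal ligand out of the PDB file only needs the fixed-column `HETATM` records and the `ligand_code` from `TARGET_CONFIG`. The sketch below is a plain-Python version under stated assumptions: the helper names are hypothetical, and the three demo records are fabricated just to exercise the parser (use the text of your downloaded PDB file instead).

```python
def extract_ligand_coords(pdb_text, ligand_code, chain=None):
    """Collect (x, y, z) for HETATM records of one ligand residue name."""
    coords = []
    for line in pdb_text.splitlines():
        if not line.startswith("HETATM"):
            continue
        if line[17:20].strip() != ligand_code:   # residue name, columns 18-20
            continue
        if chain is not None and line[21] != chain:  # chain ID, column 22
            continue
        coords.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return coords

def centroid(coords):
    """Average position of a list of (x, y, z) tuples."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))

def pdb_hetatm(serial, name, res, chain, resseq, x, y, z):
    """Format one fixed-column HETATM record (only used to build the demo)."""
    return (f"HETATM{serial:5d} {name:<4s} {res:>3s} {chain}{resseq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")

# Fabricated records: two ligand atoms and one water
demo_pdb = "\n".join([
    pdb_hetatm(1, "C1", "TOP", "A", 200, 1.0, 2.0, 3.0),
    pdb_hetatm(2, "N1", "TOP", "A", 200, 3.0, 4.0, 5.0),
    pdb_hetatm(3, "O", "HOH", "A", 301, 9.0, 9.0, 9.0),
])
ligand_xyz = extract_ligand_coords(demo_pdb, "TOP", chain="A")
print(len(ligand_xyz), centroid(ligand_xyz))
```

The centroid doubles as a quick sanity check: a docked pose whose centroid is far from the crystal ligand's centroid has landed in a different site.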

### Verification

After running your code, confirm:
- [ ] Poses from both methods are visualized
- [ ] Crystal ligand is shown for comparison
- [ ] Colors clearly distinguish each method

📓 **Lab Notebook**: Note whether the AI methods found the same binding site as the crystal structure.

In [None]:
# Your code here



## 🔬 Guided Inquiry 5: Quantitative Method Comparison

### Context

Let's compare all three docking methods quantitatively:

1. **AutoDock Vina** (from AI-PSCI-015)
2. **DiffDock** (diffusion-based)
3. **GNINA** (CNN-enhanced)

We'll compare:
- Score/confidence values
- Pose similarity (RMSD to crystal)
- Computational requirements

### Your Task

Using your AI assistant, write code to:
1. Create a comparison table of all methods
2. Calculate RMSD for each method's best pose vs crystal
3. Analyze score correlations
4. Generate a summary visualization (bar chart or similar)

💡 **Prompting Tips**:
- Ask: "How do I create a pandas DataFrame to compare docking methods?"
- Use matplotlib to create comparison charts
- Include both score quality and speed metrics
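
For docking, RMSD is usually computed "in place", without superimposing the pose onto the reference, because both already sit in the protein's coordinate frame. The sketch below is a minimal NumPy version that assumes both coordinate arrays contain the same heavy atoms in the same order; it ignores molecular symmetry, which RDKit's `rdMolAlign.CalcRMS` can account for. The coordinates are made up, and the 2 Å cutoff is the commonly used success criterion.

```python
import numpy as np

def in_place_rmsd(pose_xyz, reference_xyz):
    """RMSD between matched atoms, with no alignment step."""
    pose = np.asarray(pose_xyz, dtype=float)
    ref = np.asarray(reference_xyz, dtype=float)
    if pose.shape != ref.shape:
        raise ValueError("Pose and reference need the same atoms in the same order")
    # Mean of squared per-atom distances, then square root
    return float(np.sqrt(((pose - ref) ** 2).sum(axis=1).mean()))

# Made-up three-atom "ligand" and a copy shifted by 1 angstrom along x
reference = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0]])
shifted = reference + np.array([1.0, 0.0, 0.0])

for label, pose in [("identical", reference), ("shifted", shifted)]:
    rmsd = in_place_rmsd(pose, reference)
    print(label, round(rmsd, 3), "success" if rmsd <= 2.0 else "failure")
```

If the atom order differs between files (common when mixing SDF and PDB sources), match atoms first; otherwise the RMSD will be inflated for reasons unrelated to pose quality.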

### Verification

After running your code, confirm:
- [ ] Comparison table created with all methods
- [ ] RMSD values calculated (or estimated)
- [ ] Visualization clearly shows method differences

📓 **Lab Notebook**: Based on the comparison, which method would you recommend for your target?

In [None]:
# Your code here



In [None]:
# Your code here



## 🔬 Guided Inquiry 6: Making Method Recommendations

### Context

Based on your analysis, you now need to make evidence-based recommendations for which docking method to use. This is a critical skill in computational drug discovery!

### Your Task

Using your AI assistant, write code to:
1. Create a decision framework based on use case
2. Generate a final summary report for your target
3. Make specific recommendations with justification

💡 **Prompting Tips**:
- Ask: "How do I create a decision tree for method selection?"
- Consider factors: speed, accuracy, resources, use case
- Include pros and cons for each method
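
One simple way to express a decision framework is a small rule-based function. The sketch below mirrors the "When to Use Each Method" table from the Background section; the function name, the inputs, and the no-GPU fallback rule are illustrative assumptions, and your own framework should reflect the evidence you collected for your target.

```python
def recommend_method(binding_site_known, has_gpu, goal="pose"):
    """Rule-of-thumb method choice; returns (method, reason).

    goal: "screen" for fast screening of many compounds,
          "pose" for the best possible ranking of poses.
    """
    if not binding_site_known:
        if has_gpu:
            return "DiffDock", "blind docking: no binding box needed"
        # Assumed fallback: Vina with a box that covers the whole protein
        return "AutoDock Vina", "no GPU: dock with a box covering the whole protein"
    if goal == "screen":
        return "AutoDock Vina", "fast screening with a known binding site"
    return "GNINA", "CNN rescoring to improve pose ranking in a known site"

scenarios = [
    {"binding_site_known": False, "has_gpu": True, "goal": "pose"},
    {"binding_site_known": True, "has_gpu": False, "goal": "screen"},
    {"binding_site_known": True, "has_gpu": True, "goal": "pose"},
]
for scenario in scenarios:
    method, reason = recommend_method(**scenario)
    print(scenario, "->", method, "|", reason)
```

Returning the reason alongside the method keeps the recommendation auditable, which is what the final report needs.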

### Verification

After running your code, confirm:
- [ ] Decision framework is clear and logical
- [ ] Summary includes all key metrics
- [ ] Recommendation is justified with evidence

📓 **Lab Notebook**: Document your recommendation and the reasoning behind it.

In [None]:
# Your code here



In [None]:
# Your code here



---

## ✅ Checkpoint

Before moving on to the next talktorial, confirm you can:

- [ ] Explain the difference between physics-based (Vina) and AI-based (DiffDock, GNINA) docking
- [ ] Run DiffDock for blind docking without specifying a binding site
- [ ] Run GNINA with CNN scoring to improve pose ranking
- [ ] Compare docking methods using RMSD, confidence scores, and timing
- [ ] Make evidence-based recommendations for method selection

### Your lab notebook should include:

- [ ] Target and ligand information
- [ ] Scores from each docking method
- [ ] RMSD comparison table
- [ ] Visualization of docked poses
- [ ] Your method recommendation with justification

---

## 🤔 Reflection Questions

Answer these in your lab notebook:

1. **AI vs Physics**: DiffDock uses diffusion models while Vina uses physics-based scoring. What are the fundamental differences in how they approach docking? What are the implications for accuracy and interpretability?

2. **Blind Docking**: DiffDock can find binding sites without being told where to look. How might this capability change drug discovery for novel targets? What are potential pitfalls?

3. **Method Integration**: You've now used three docking methods. Design a practical workflow that combines all three for a drug discovery campaign. When would you use each?

4. **Resistance Mutations**: For your target, how might AI-powered docking help predict or understand drug resistance mutations? What experiments would you design?

---

## 📖 Further Reading

### DiffDock
- [DiffDock Paper](https://arxiv.org/abs/2210.01776) - Corso et al., ICLR 2023
- [DiffDock GitHub](https://github.com/gcorso/DiffDock) - Code and documentation
- [DiffDock Web Server](https://huggingface.co/spaces/simonduerr/diffdock) - Try it online

### GNINA
- [GNINA Paper](https://doi.org/10.1186/s13321-021-00522-2) - McNutt et al., J Cheminform 2021
- [GNINA GitHub](https://github.com/gnina/gnina) - Installation and usage
- [GNINA Tutorial](https://gnina.github.io/gnina/) - Official documentation

### AI in Drug Discovery
- [AI for Drug Discovery Review](https://doi.org/10.1038/s41573-021-00361-0) - Schneider et al., Nat Rev Drug Discov 2021
- [Geometric Deep Learning for Drugs](https://arxiv.org/abs/2106.10234) - Stärk et al., 2022

---

## 🔗 Connection to Research

AI-powered docking is transforming drug discovery:

**Industry Adoption**: Computational drug discovery companies such as Schrödinger, Atomwise, and Insilico Medicine use AI-based methods, including learned docking and scoring, in their discovery pipelines.

**Speed Revolution**: Screening runs that once took weeks of compute time can, in favorable cases, finish in hours, enabling exploration of larger chemical spaces.

**Novel Discoveries**: Blind-docking methods can propose binding sites outside the known active site, including candidate allosteric sites and cryptic pockets, though such predictions still need experimental confirmation.

**Your Target**: The methods you've learned apply directly to your research:
- Screen compound libraries against your target
- Predict how mutations affect drug binding
- Design modified drugs with improved properties

**Next Steps**: In AI-PSCI-017, we'll learn how to rigorously validate these predictions and assess method performance.

---

*AI-PSCI-016 Complete. Proceed to AI-PSCI-017: Model Validation & Performance Metrics.*