# AI-PSCI-011: AlphaFold2 for Structure Prediction

**AI in Pharmaceutical Sciences: Bench to Bedside**  
VCU School of Pharmacy | VIP Program | Spring 2026

---

**Week 6 | Module: AI Tools | Estimated Time: 90-120 minutes**

**Prerequisites**: AI-PSCI-001 through AI-PSCI-010

**GPU Required**: This talktorial requires a GPU runtime. Go to Runtime → Change runtime type → Select GPU (preferably A100).

---

## 🎯 Learning Objectives

After completing this talktorial, you will be able to:

1. Explain how AlphaFold2 revolutionized protein structure prediction
2. Run AlphaFold2 predictions using ColabFold on custom sequences
3. Interpret pLDDT confidence scores to assess prediction reliability
4. Analyze Predicted Aligned Error (PAE) to understand domain relationships
5. Compare AlphaFold2 predictions to experimental structures using RMSD

---

## 📚 Background

### The Protein Folding Problem

For over 50 years, scientists struggled with a fundamental question: **how does a protein's amino acid sequence determine its 3D structure?** This "protein folding problem" was considered one of the grand challenges in biology. Experimental structure determination (X-ray crystallography, cryo-EM, NMR) is slow, expensive, and doesn't work for all proteins.

### AlphaFold2: A Revolution

In 2020, DeepMind's **AlphaFold2** achieved near-experimental accuracy for most targets in the CASP14 assessment, a breakthrough for predicting the structures of single protein chains. Key innovations:

1. **Evoformer**: Processes multiple sequence alignments (MSAs) to extract evolutionary information
2. **Structure Module**: Iteratively refines 3D coordinates using attention mechanisms
3. **Recycling**: Multiple passes through the network improve accuracy

### ColabFold: AlphaFold2 for Everyone

**ColabFold** makes AlphaFold2 accessible by:
- Running efficiently on Google Colab GPUs
- Using MMseqs2 for faster MSA generation
- Providing easy-to-use interfaces

### Why This Matters for Drug Discovery

AlphaFold2 enables:
- **Predicting mutant structures**: How do resistance mutations change drug binding?
- **Modeling novel targets**: Structure-based drug design for proteins without crystal structures
- **Understanding mechanism**: Visualizing how proteins function

### Key Concepts

- **pLDDT (predicted Local Distance Difference Test)**: Confidence score 0-100 for each residue
  - >90: Very high confidence (typically accurate)
  - 70-90: Confident (backbone usually correct)
  - 50-70: Low confidence (may be disordered or incorrect)
  - <50: Very low confidence (likely disordered)

- **PAE (Predicted Aligned Error)**: Matrix of the expected position error (in Å) at one residue when the predicted and true structures are aligned on another residue
  - Low PAE between domains = AlphaFold2 is confident in their relative position and orientation
  - High PAE = uncertain relative positioning (flexible linkers, multi-domain proteins)

- **MSA (Multiple Sequence Alignment)**: Evolutionarily related sequences that inform structure prediction

- **RMSD (Root Mean Square Deviation)**: Root-mean-square distance (in Å) between matched atoms of two superimposed structures (lower = more similar)
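
To make two of these definitions concrete, here is a minimal sketch in plain Python. The function names `plddt_band` and `rmsd` are illustrative and not part of any library, and the handling of the exact boundary values (70 and 90) is an assumption, since the bands above do not say which side they belong to.

```python
import math

def plddt_band(score):
    """Map a pLDDT value (0-100) to the confidence bands listed above.

    Assumption: boundary values 70 and 90 fall in the 'confident' band.
    """
    if not 0 <= score <= 100:
        raise ValueError(f"pLDDT must be between 0 and 100, got {score}")
    if score > 90:
        return "very high"
    if score >= 70:
        return "confident"
    if score >= 50:
        return "low"
    return "very low"

def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) points.

    Assumes the structures are already superimposed and that point i in
    coords_a corresponds to point i in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))
```

For example, `plddt_band(95.2)` falls in the "very high" band, and two atom pairs that are each 1 Å apart give an RMSD of 1 Å.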

---

## 🛠️ Setup

Run this cell to install required packages. **This takes 2-3 minutes.**

In [None]:
#@title 🛠️ Install Packages (2-3 min)
import os
import sys

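# Check for GPU. Use the exit code rather than the output text: some GPU names
# (e.g. "Tesla T4") lack "NVIDIA", while the driver error message contains it.
import subprocess
try:
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=name', '--format=csv,noheader'],
        capture_output=True, text=True, timeout=10
    )
    gpu_name = result.stdout.strip()
    if result.returncode == 0 and gpu_name:
        print(f"✅ GPU detected: {gpu_name}")
    else:
        print("⚠️ No GPU detected! Go to Runtime → Change runtime type → GPU")
        print("   AlphaFold2 predictions require GPU and will be very slow without one.")
except (OSError, subprocess.TimeoutExpired):
    print("⚠️ Could not detect GPU. Make sure GPU runtime is enabled.")
    print("   Go to Runtime → Change runtime type → GPU (preferably A100)")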

# Install ColabFold and dependencies
print("\n📦 Installing ColabFold (this takes 2-3 minutes)...")
!pip install -q "colabfold[alphafold] @ git+https://github.com/sokrypton/ColabFold"
!pip install -q biopython py3Dmol matplotlib seaborn

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

print("\n✅ All packages installed!")

Import required libraries:

In [None]:
#@title 📦 Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from Bio import SeqIO
from Bio.PDB import PDBParser, Superimposer, PDBIO
from Bio.PDB.Polypeptide import is_aa
import py3Dmol
import requests
import json
import os
from pathlib import Path
import pickle

# Set display options
pd.set_option('display.max_columns', 10)
plt.rcParams['figure.figsize'] = [10, 8]
plt.rcParams['figure.dpi'] = 100

print("✅ All libraries imported!")

### 📊 Progress Monitor Utility

AlphaFold2 predictions can take 15-30+ minutes. During this time, ColabFold doesn't output much information, which can make it seem like the cell is stuck.

The `AlphaFoldProgressMonitor` below provides **real-time feedback** by:
1. **Tracking elapsed time** - so you know how long the prediction has been running
2. **Monitoring GPU usage** - proof that computation is actively happening
3. **Detecting prediction phases** - so you understand what's happening at each stage

**AlphaFold2 Prediction Phases:**
| Phase | What's Happening | Typical Time |
|-------|------------------|--------------|
| 🔍 MSA Search | Finding similar sequences in databases | 3-8 min |
| 🧠 Model Inference | Running the neural network | 5-15 min |
| ✨ Relaxation | Energy minimization with AMBER | 2-5 min |

In [None]:
#@title 📊 Progress Monitor for Long-Running Predictions
import threading
import subprocess
import time as time_module  # Avoid conflict with time variable
from pathlib import Path

class AlphaFoldProgressMonitor:
    """
    Monitors AlphaFold2/ColabFold prediction progress.

    Provides real-time feedback during long-running predictions by:
    - Tracking elapsed time
    - Monitoring GPU utilization
    - Detecting prediction phases from output files

    Usage:
        monitor = AlphaFoldProgressMonitor(output_dir, sequence_name)
        monitor.start()
        # ... run prediction ...
        monitor.stop()
    """

    def __init__(self, output_dir, sequence_name="prediction", update_interval=30):
        self.output_dir = Path(output_dir)
        self.sequence_name = sequence_name
        self.update_interval = update_interval
        self.stop_event = threading.Event()
        self.thread = None
        self.start_time = None

    def _get_gpu_stats(self):
        try:
            result = subprocess.run(
                ['nvidia-smi', '--query-gpu=utilization.gpu,memory.used,memory.total',
                 '--format=csv,noheader,nounits'],
                capture_output=True, text=True, timeout=5
            )
            if result.returncode == 0:
                parts = result.stdout.strip().split(', ')
                if len(parts) >= 3:
                    return int(parts[0]), int(parts[1]), int(parts[2])
        except Exception:
            pass
        return None, None, None

    def _detect_phase(self):
        try:
            files = list(self.output_dir.glob("*"))
            file_names = [f.name.lower() for f in files]
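            # 'unrelaxed' contains the substring 'relaxed', so exclude it explicitly;
            # otherwise the first unrelaxed model would be reported as "Complete".
            if any('relaxed' in f and 'unrelaxed' not in f and f.endswith('.pdb')
                   for f in file_names):
                return "✅ Complete", "Prediction finished!"
            if any('unrelaxed' in f for f in file_names):
                return "✨ Relaxation", "Model written; remaining models or AMBER relaxation in progress"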
            if any('.a3m' in f for f in file_names):
                return "🧠 Model Inference", "Running AlphaFold2 neural network"
            return "🔍 MSA Search", "Finding homologous sequences"
        except Exception:
            return "🔄 Running", "Processing..."

    def _monitor_loop(self):
        last_phase = None
        while not self.stop_event.is_set():
            elapsed = time_module.time() - self.start_time
            mins, secs = int(elapsed // 60), int(elapsed % 60)
            gpu_util, mem_used, mem_total = self._get_gpu_stats()
            phase, phase_desc = self._detect_phase()
            time_str = f"{mins:02d}:{secs:02d}"
            if gpu_util is not None:
                # nvidia-smi reports MiB; use true division so the decimal is kept
                gpu_str = f"GPU: {gpu_util:3d}% | VRAM: {mem_used/1024:.1f}/{mem_total/1024:.1f} GB"
            else:
                gpu_str = "GPU: monitoring unavailable"
            print(f"\r⏳ {time_str} | {gpu_str} | {phase}", end='', flush=True)
            if phase != last_phase:
                print(f"\n   └─ {phase_desc}")
                last_phase = phase
            if phase == "✅ Complete":
                break
            self.stop_event.wait(self.update_interval)

    def start(self):
        self.start_time = time_module.time()
        self.stop_event.clear()
        print(f"📊 Progress monitor started for: {self.sequence_name}")
        print(f"   Updates every {self.update_interval} seconds\n")
        self.thread = threading.Thread(target=self._monitor_loop, daemon=True)
        self.thread.start()

    def stop(self):
        self.stop_event.set()
        if self.thread:
            self.thread.join(timeout=2)
        if self.start_time is None:  # stop() was called before start()
            return
        elapsed = time_module.time() - self.start_time
        print(f"\n\n📊 Monitor stopped. Total time: {int(elapsed//60):02d}:{int(elapsed%60):02d}")

print("✅ AlphaFoldProgressMonitor loaded!")
print("   Use: monitor = AlphaFoldProgressMonitor(output_dir, sequence_name)")
print("   Then: monitor.start() before prediction, monitor.stop() after")

---

## 🎯 Target Configuration

Select your drug target. This selection determines which protein you'll predict.

In [None]:
#@title 🎯 Select Your Drug Target
TARGET = "DHFR" #@param ["DHFR", "ABL1", "EGFR", "AChE", "COX-2", "DPP-4"]

TARGET_CONFIG = {
    "DHFR": {"pdb": "1RX1", "uniprot": "P0ABQ4", "drug": "Trimethoprim", "organism": "E. coli",
             "mutation": "L28R", "mutation_context": "Resistance mutation that reduces trimethoprim binding", "expected_length": 159},
    "ABL1": {"pdb": "1IEP", "uniprot": "P00519", "drug": "Imatinib", "organism": "Human",
             "mutation": "T315I", "mutation_context": "Gatekeeper mutation causing imatinib resistance in CML",
             "expected_length": 1130, "domain_start": 242, "domain_end": 493},
    "EGFR": {"pdb": "1M17", "uniprot": "P00533", "drug": "Erlotinib", "organism": "Human",
             "mutation": "T790M", "mutation_context": "Secondary resistance mutation after erlotinib treatment",
             "expected_length": 1210, "domain_start": 712, "domain_end": 979},
    "AChE": {"pdb": "4EY7", "uniprot": "P22303", "drug": "Donepezil", "organism": "Human",
             "mutation": "Y337A", "mutation_context": "Mutation in catalytic gorge affecting substrate binding", "expected_length": 614},
    "COX-2": {"pdb": "3LN1", "uniprot": "P35354", "drug": "Celecoxib", "organism": "Human",
              "mutation": "V523I", "mutation_context": "Mutation mimicking COX-1, affects selectivity", "expected_length": 604},
    "DPP-4": {"pdb": "1X70", "uniprot": "P27487", "drug": "Sitagliptin", "organism": "Human",
              "mutation": "S630A", "mutation_context": "Catalytic serine mutation abolishing enzyme activity", "expected_length": 766}
}

config = TARGET_CONFIG[TARGET]
print(f"✅ Target: {TARGET}")
print(f"   PDB: {config['pdb']} | UniProt: {config['uniprot']}")
print(f"   Reference Drug: {config['drug']}")
print(f"   Organism: {config['organism']}")
print(f"\n🧬 Mutation to predict: {config['mutation']}")
print(f"   Context: {config['mutation_context']}")

---

## 🔬 Guided Inquiry 1: Fetching and Preparing Your Sequence

### Context

Before running AlphaFold2, we need the protein sequence. We'll fetch the wild-type sequence from UniProt, then introduce a clinically relevant mutation to study how it affects the predicted structure.

### Your Task

Using your AI assistant, write code to:

1. Fetch your target's protein sequence from UniProt
2. Create a mutant sequence by introducing the specified mutation
3. Save both sequences as FASTA files for AlphaFold2 input

💡 **Prompting Tips**:
- Ask: "How do I fetch a protein sequence from UniProt using Python?"
- Ask: "How do I introduce a point mutation into a protein sequence?"
- For kinases (ABL1, EGFR), consider using just the kinase domain (`domain_start` to `domain_end` in TARGET_CONFIG) to speed up prediction
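
One possible shape for these three steps is sketched below, using only the standard library. The helper names (`fetch_uniprot_sequence`, `apply_point_mutation`, `write_fasta`) and the `offset` parameter are hypothetical; the URL pattern is the UniProt REST endpoint for a single FASTA entry. Treat this as a starting point for your own code, not a reference solution.

```python
import re
import urllib.request

UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"

def fetch_uniprot_sequence(accession, timeout=30):
    """Download one UniProt entry as FASTA and return the bare sequence."""
    url = UNIPROT_FASTA_URL.format(accession=accession)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        fasta = response.read().decode("utf-8")
    # Drop the header line (starts with '>') and join the sequence lines.
    return "".join(
        line.strip() for line in fasta.splitlines() if not line.startswith(">")
    )

def apply_point_mutation(sequence, mutation, offset=0):
    """Apply a mutation such as 'L28R' (wild-type, 1-based position, mutant).

    offset is subtracted from the position; use it when the sequence is a
    trimmed domain (e.g. offset=241 if the sequence starts at residue 242).
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"cannot parse mutation {mutation!r}")
    wild_type, position, mutant = match.group(1), int(match.group(2)), match.group(3)
    index = position - offset - 1  # convert to a 0-based index
    if not 0 <= index < len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[index] != wild_type:
        raise ValueError(
            f"expected {wild_type} at position {position}, "
            f"found {sequence[index]}; check the numbering scheme"
        )
    return sequence[:index] + mutant + sequence[index + 1:]

def write_fasta(path, name, sequence, width=60):
    """Write a single-record FASTA file, wrapping the sequence at `width`."""
    with open(path, "w") as handle:
        handle.write(f">{name}\n")
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")
```

The wild-type check matters: published mutation names sometimes follow a numbering scheme that differs from the UniProt precursor sequence (for example, mature-protein numbering after signal peptide removal). If the check raises, verify the numbering before forcing the substitution.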

### Verification

After running your code, confirm:
- [ ] Wild-type sequence length matches expected (see TARGET_CONFIG)
- [ ] Mutant sequence differs at exactly one position
- [ ] Both FASTA files are saved

📓 **Lab Notebook**: Record your target's sequence length. For kinases, note which domain you're using and why.

In [None]:
# Your code here



---

## 🔬 Guided Inquiry 2: Running AlphaFold2 with ColabFold

### Context

Now we'll run AlphaFold2 on our mutant sequence using ColabFold. This is a computationally intensive process that:
1. Searches for homologous sequences (MSA generation)
2. Runs the AlphaFold2 neural network
3. Refines the structure through multiple "recycles"

**Expected runtime:** 10-20 minutes for DHFR (159 aa), longer for larger proteins.

### Understanding the Progress Monitor

During prediction, we use the `AlphaFoldProgressMonitor` class (defined above) to provide real-time feedback. The monitor shows:

```
⏳ 05:30 | GPU: 89% | VRAM: 34.2/80.0 GB | 🧠 Model Inference
   └─ Running AlphaFold2 neural network
```

- **Time elapsed** - how long the prediction has been running
- **GPU utilization** - proof that computation is actively happening (high % = working)
- **VRAM usage** - memory being used on the GPU
- **Current phase** - what AlphaFold2 is doing right now

### Your Task

Using your AI assistant, write code to:

1. Configure ColabFold prediction parameters
2. Start the progress monitor before the prediction
3. Run AlphaFold2 on your mutant sequence
4. Stop the monitor when prediction completes

💡 **Prompting Tips**:
- Ask: "How do I run ColabFold for structure prediction?"
- Ask: "What parameters affect AlphaFold2 prediction speed and accuracy?"
- Consider using `num_recycles=3` (default) for accuracy, fewer for speed
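
As a rough sketch of how the four steps might fit together, the snippet below builds a `colabfold_batch` command line and runs it with `subprocess`. The wrapper names `build_colabfold_command` and `run_colabfold` are hypothetical, as are the file paths in the usage comment. The flags (`--num-recycle`, `--num-models`, `--amber`) are the ones I understand recent ColabFold releases to accept; confirm them with `colabfold_batch --help` for your installed version.

```python
import shutil
import subprocess

def build_colabfold_command(fasta_path, output_dir, num_recycles=3,
                            num_models=1, amber_relax=False):
    """Assemble a colabfold_batch command line as a list of arguments."""
    command = [
        "colabfold_batch",
        "--num-recycle", str(num_recycles),  # fewer recycles = faster, less refined
        "--num-models", str(num_models),     # ColabFold can run up to 5 models
    ]
    if amber_relax:
        command.append("--amber")            # optional AMBER relaxation step
    command += [str(fasta_path), str(output_dir)]
    return command

def run_colabfold(fasta_path, output_dir, **options):
    """Run ColabFold and raise if the tool is missing or exits with an error."""
    if shutil.which("colabfold_batch") is None:
        raise RuntimeError("colabfold_batch not found; run the install cell first")
    command = build_colabfold_command(fasta_path, output_dir, **options)
    return subprocess.run(command, check=True)

# Usage with the progress monitor defined earlier (paths are placeholders):
#
#   monitor = AlphaFoldProgressMonitor("predictions/mutant", "mutant")
#   monitor.start()
#   try:
#       run_colabfold("mutant.fasta", "predictions/mutant", amber_relax=True)
#   finally:
#       monitor.stop()   # always stop the monitor, even if prediction fails
```

Wrapping the run in `try`/`finally` keeps the monitor thread from printing indefinitely if the prediction raises an error.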

### Verification

After running your code, confirm:
- [ ] Progress monitor showed regular updates during prediction
- [ ] Prediction completes without errors
- [ ] Output directory contains PDB files
- [ ] Prediction took approximately expected time

📓 **Lab Notebook**: Record the prediction runtime and any warnings. Note the GPU type used and which phases you observed.

In [None]:
# Your code here



---

## 🔬 Guided Inquiry 3: Understanding pLDDT Confidence Scores

### Context

The **pLDDT (predicted Local Distance Difference Test)** score tells us how confident AlphaFold2 is in each residue's position. This is crucial for knowing which parts of the structure we can trust.

pLDDT is stored in the B-factor column of the PDB file. Each confidence band has a conventional display color:
- **>90**: Very high confidence (blue) - trustworthy
- **70-90**: Confident (cyan) - backbone reliable
- **50-70**: Low confidence (yellow) - use caution
- **<50**: Very low (orange/red) - likely disordered

### Your Task

Using your AI assistant, write code to:

1. Load the predicted structure
2. Extract pLDDT scores from B-factor column
3. Create a plot of pLDDT vs residue number
4. Visualize the structure colored by pLDDT

💡 **Prompting Tips**:
- Ask: "How are pLDDT scores stored in AlphaFold2 PDB files?"
- Ask: "How do I color a protein structure by B-factor in py3Dmol?"
- Look for regions with low pLDDT - these may be flexible loops
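
The extraction step can be sketched without any dependencies by reading the fixed-width PDB columns directly. The function names `read_ca_plddt` and `low_confidence_regions` are illustrative, and the sketch assumes a single-chain prediction. With Biopython, `atom.get_bfactor()` on each CA atom should give the same values.

```python
def read_ca_plddt(pdb_path):
    """Return [(residue_number, pLDDT), ...] from the CA atoms of a PDB file."""
    scores = []
    with open(pdb_path) as handle:
        for line in handle:
            # Fixed-width PDB columns: atom name 13-16, residue number 23-26,
            # B-factor 61-66 (where AlphaFold2 stores pLDDT).
            if line.startswith("ATOM") and line[12:16].strip() == "CA":
                scores.append((int(line[22:26]), float(line[60:66])))
    return scores

def low_confidence_regions(scores, threshold=70.0):
    """Group consecutive residues with pLDDT below `threshold` into ranges.

    Returns a list of (first_residue, last_residue) tuples. Assumes `scores`
    is in sequence order, as returned by read_ca_plddt.
    """
    regions, start, previous = [], None, None
    for residue, plddt in scores:
        if plddt < threshold:
            if start is None:
                start = residue       # open a new low-confidence stretch
            previous = residue
        elif start is not None:
            regions.append((start, previous))  # close the current stretch
            start = None
    if start is not None:             # stretch runs to the end of the chain
        regions.append((start, previous))
    return regions
```

The mean pLDDT for the lab notebook is then `sum(p for _, p in scores) / len(scores)`, and the residue numbers and scores can be passed straight to `plt.plot` for the per-residue plot.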

### Verification

After running your code, confirm:
- [ ] All pLDDT scores fall between 0 and 100
- [ ] Most residues have pLDDT > 70 (for well-structured proteins)
- [ ] Can identify high and low confidence regions

📓 **Lab Notebook**: Record the mean pLDDT and identify any regions with low confidence. What might cause low confidence?

In [None]:
# Your code here



---

## 🔬 Guided Inquiry 4: Analyzing Predicted Aligned Error (PAE)

### Context

The **PAE (Predicted Aligned Error)** matrix tells us how confident AlphaFold2 is in the relative positions of residue pairs. While pLDDT gives local confidence, PAE reveals:

- **Domain boundaries**: High PAE between domains suggests uncertain relative positioning
- **Rigid units**: Low PAE within domains indicates confident relative positions
- **Flexible linkers**: High PAE rows/columns indicate flexible regions

### Your Task

Using your AI assistant, write code to:

1. Load the PAE data from prediction output
2. Create a heatmap visualization of the PAE matrix
3. Identify domain boundaries from the PAE pattern
4. Assess the confidence around the mutation site

💡 **Prompting Tips**:
- Ask: "How do I read PAE data from AlphaFold2 output?"
- Ask: "What do block patterns in PAE matrices mean?"
- Low PAE (blue) = confident, High PAE (red) = uncertain
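
A minimal sketch of loading and summarizing the PAE matrix follows. It assumes the matrix is stored under the key `"pae"` (as in ColabFold's scores JSON) or `"predicted_aligned_error"` (as in AlphaFold-style PAE files, which may be wrapped in a one-element list), and that rows index the aligned residue while columns index the scored residue. Check both assumptions against your own output files. The function names are illustrative.

```python
import json
import numpy as np

def load_pae(json_path):
    """Load a PAE matrix from a prediction JSON file as an N x N array."""
    with open(json_path) as handle:
        data = json.load(handle)
    if isinstance(data, list):        # some files wrap the dict in a list
        data = data[0]
    # Assumed key names; adjust if your file uses a different one.
    for key in ("pae", "predicted_aligned_error"):
        if key in data:
            pae = np.asarray(data[key], dtype=float)
            break
    else:
        raise KeyError("no PAE matrix found under the expected keys")
    if pae.ndim != 2 or pae.shape[0] != pae.shape[1]:
        raise ValueError(f"expected a square matrix, got shape {pae.shape}")
    return pae

def mean_block_pae(pae, rows, cols):
    """Mean PAE for a block; `rows` and `cols` are 1-based inclusive ranges."""
    (r0, r1), (c0, c1) = rows, cols
    return float(pae[r0 - 1:r1, c0 - 1:c1].mean())

def site_pae(pae, position):
    """Mean PAE along the row and the column of one residue (1-based)."""
    i = position - 1
    return float(pae[i, :].mean()), float(pae[:, i].mean())
```

A heatmap can then be drawn with `plt.imshow(pae, cmap="bwr", vmin=0)` plus a colorbar. Comparing `mean_block_pae` within a candidate domain against the value between two candidate domains gives a quick numeric check of the block pattern you see in the plot.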

### Verification

After running your code, confirm:
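- [ ] PAE matrix is square (N × N for N residues) and roughly, but not exactly, symmetric (the error at residue j when aligned on residue i can differ from the reverse)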
- [ ] Diagonal has lowest PAE values
- [ ] Can identify any domain structure from block patterns

📓 **Lab Notebook**: Describe the PAE pattern. Are there distinct domains? What is the PAE around your mutation site?

In [None]:
# Your code here



---

## 🔬 Guided Inquiry 5: Comparing to Experimental Structure

### Context

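Comparison to experimental structures is the most direct validation of an AlphaFold2 prediction. Because we superimpose a predicted mutant onto an experimental wild-type structure, any deviation combines prediction error with the effect of the mutation, so interpret small differences with caution. The comparison lets us: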

1. **Validate the prediction**: How accurate is AlphaFold2 for our target?
2. **Identify mutation effects**: Does the mutation cause structural changes?
3. **Assess binding site integrity**: Is the drug binding site preserved?

### Your Task

Using your AI assistant, write code to:

1. Download the experimental structure from PDB
2. Superpose the predicted structure onto experimental
3. Calculate RMSD (Root Mean Square Deviation)
4. Visualize both structures overlaid

💡 **Prompting Tips**:
- Ask: "How do I superpose two protein structures using BioPython?"
- Ask: "What is a good RMSD value for structure comparison?"
- Typical RMSD: <1Å = excellent, 1-2Å = good, 2-3Å = acceptable, >3Å = poor
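
To show what superposition does under the hood, here is a sketch of the Kabsch algorithm with NumPy. It assumes you have already extracted two N x 3 arrays of matched atoms (for example, CA atoms of residues present in both structures); `kabsch_rmsd` is an illustrative name. `Bio.PDB.Superimposer` (via `set_atoms` and its `rms` attribute) offers the same calculation on Biopython atom objects.

```python
import numpy as np

def kabsch_rmsd(mobile, reference):
    """Optimally superimpose `mobile` onto `reference` and return the fit.

    Both inputs are N x 3 arrays where row i of each refers to the same atom.
    Returns (overall RMSD, per-atom distances after superposition).
    """
    mobile = np.asarray(mobile, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if mobile.shape != reference.shape or mobile.ndim != 2 or mobile.shape[1] != 3:
        raise ValueError("both inputs must be N x 3 arrays of matched atoms")

    # Remove translation by centring both sets on their centroids.
    p = mobile - mobile.mean(axis=0)
    q = reference - reference.mean(axis=0)

    # The optimal rotation comes from the SVD of the 3 x 3 covariance matrix.
    u, _, vt = np.linalg.svd(p.T @ q)
    if np.linalg.det(u @ vt) < 0:
        u[:, -1] *= -1            # avoid a reflection (improper rotation)
    rotation = u @ vt

    per_atom = np.linalg.norm(p @ rotation - q, axis=1)
    return float(np.sqrt((per_atom ** 2).mean())), per_atom
```

The per-atom distances are useful for the lab notebook: plotting them against residue number shows where the prediction and the experimental structure diverge, including near the mutation site.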

### Verification

After running your code, confirm:
- [ ] RMSD is calculated and reported
- [ ] Structures are properly superimposed
- [ ] Can identify regions of high/low deviation

📓 **Lab Notebook**: Record the overall RMSD and RMSD specifically at the mutation site. Does the mutation appear to change the local structure?

In [None]:
# Your code here



---

## 🔬 Guided Inquiry 6: Assessing Mutation Effects

### Context

Now let's synthesize what we've learned to assess how the mutation might affect protein function and drug binding. This integrates:
- pLDDT confidence at mutation site
- PAE relationships with binding site
- Structural comparison to wild-type

### Your Task

Using your AI assistant, write code to:

1. Summarize all quality metrics for the prediction
2. Analyze the mutation site in detail (local environment, contacts)
3. Predict potential functional impact based on structural analysis
4. Create a final summary report

💡 **Prompting Tips**:
- Ask: "How can I identify amino acid contacts around a mutation?"
- Ask: "What structural features predict mutation impact on drug binding?"
- Consider: Is the mutation in the binding site? Near a catalytic residue?
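
One simple way to characterize the local environment is to list residues with any atom within a distance cutoff of the mutated residue. The sketch below parses the PDB file with the standard library only; `read_atoms` and `residues_near` are illustrative names, the 4.0 Å default cutoff is an assumption rather than a fixed standard, and a single-chain structure is assumed.

```python
import math

def read_atoms(pdb_path):
    """Return [(residue_number, residue_name, (x, y, z)), ...] for ATOM records."""
    atoms = []
    with open(pdb_path) as handle:
        for line in handle:
            if line.startswith("ATOM"):
                # Fixed-width PDB columns: x 31-38, y 39-46, z 47-54.
                xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
                atoms.append((int(line[22:26]), line[17:20].strip(), xyz))
    return atoms

def residues_near(atoms, position, cutoff=4.0):
    """List residues with any atom within `cutoff` Å of residue `position`.

    Returns [(residue_number, residue_name, shortest_distance), ...] sorted
    by residue number.
    """
    site = [xyz for number, _, xyz in atoms if number == position]
    if not site:
        raise ValueError(f"residue {position} not found in structure")
    closest = {}  # residue number -> (residue name, shortest distance)
    for number, name, xyz in atoms:
        if number == position:
            continue
        distance = min(math.dist(xyz, s) for s in site)
        if distance <= cutoff and (number not in closest
                                   or distance < closest[number][1]):
            closest[number] = (name, distance)
    return [(number, name, round(distance, 2))
            for number, (name, distance) in sorted(closest.items())]
```

Running `residues_near` on both the predicted mutant and the experimental wild-type, then comparing the two contact lists, is one way to see whether the substitution gains or loses neighbors. Sequence-adjacent residues will always appear, so focus on contacts that are distant in sequence.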

### Verification

After running your code, confirm:
- [ ] Summary includes all key metrics
- [ ] Mutation environment is characterized
- [ ] Potential impact is assessed

📓 **Lab Notebook**: Based on your analysis, would you expect this mutation to affect drug binding? Why or why not?

In [None]:
# Your code here



---

## ✅ Checkpoint

Before moving on to the next talktorial, confirm you can:

- [ ] Explain what AlphaFold2 is and why it was revolutionary
- [ ] Prepare sequences for AlphaFold2 prediction (including mutations)
- [ ] Run ColabFold for structure prediction
- [ ] Interpret pLDDT scores to assess prediction confidence
- [ ] Analyze PAE matrices to understand domain relationships
- [ ] Compare predictions to experimental structures using RMSD
- [ ] Assess potential mutation effects based on structural analysis

### Your lab notebook should include:

- [ ] pLDDT plot for your mutant prediction
- [ ] PAE heatmap
- [ ] Per-residue RMSD comparison
- [ ] Summary report with quality metrics
- [ ] Notes on mutation impact prediction

---

## 🤔 Reflection Questions

Answer these in your lab notebook:

1. **Method Limitations**: Your prediction was for a single point mutation. What additional factors might affect the actual structure in a cell (temperature, pH, binding partners, post-translational modifications)?

2. **Drug Discovery Application**: How would you use AlphaFold2 predictions in a drug discovery pipeline? What would you trust, and what would require experimental validation?

3. **Comparative Analysis**: How does your target's AlphaFold2 accuracy compare to what you'd expect? Consider: Is it a well-studied protein? Are there close homologs? Is it flexible?

---

## 📖 Further Reading

- [AlphaFold2 Paper (Jumper et al., 2021)](https://www.nature.com/articles/s41586-021-03819-2) - Original Nature paper
- [ColabFold Paper (Mirdita et al., 2022)](https://www.nature.com/articles/s41592-022-01488-1) - Making AF2 accessible
- [AlphaFold Protein Structure Database](https://alphafold.ebi.ac.uk/) - Pre-computed structures
- [Understanding pLDDT and PAE](https://www.ebi.ac.uk/training/online/courses/alphafold/inputs-and-outputs/evaluating-alphafolds-predicted-structures-using-confidence-scores/) - EBI guide

---

## 🔗 Connection to Research

AlphaFold2 has transformed structural biology and drug discovery:

- **Resistance prediction**: Understanding how mutations affect drug binding before they emerge clinically
- **Orphan targets**: Enabling structure-based drug design for proteins without crystal structures
- **Protein design**: Informing the design of new proteins with desired functions
- **Variant interpretation**: Predicting effects of human genetic variants on protein structure

In the next talktorial (AI-PSCI-012), you'll learn about **ESMFold** — a faster alternative to AlphaFold2 that doesn't require MSA generation. This enables rapid screening of many variants.

---

*AI-PSCI-011 Complete. Proceed to AI-PSCI-012: ESMFold for Rapid Structure Prediction.*