# Module 1: RFDiffusion - Introduction

**📍 Notebook 1 of 8**

## 💻 GPU Requirements
**✅ No GPU needed for this notebook!** All code runs on CPU.

---

## 🎯 Learning Objectives

By the end of this notebook, you will:

1. Understand what RFDiffusion is and why it's revolutionary
2. Know the key components of protein design
3. Understand the difference between structure prediction and structure design
4. Set up your environment and verify dependencies
5. Load and visualize protein structures

## 📚 Paper Information

**Title:** De novo design of protein structure and function with RFdiffusion  
**Authors:** Joseph L. Watson, David Juergens, Nathaniel R. Bennett, et al.  
**Journal:** Nature (2023)  
**DOI:** [10.1038/s41586-023-06415-8](https://www.nature.com/articles/s41586-023-06415-8)  
**GitHub:** [RosettaCommons/RFdiffusion](https://github.com/RosettaCommons/RFdiffusion)

---

## 🌟 What is Protein Design?

**Goal:** Create NEW protein sequences that fold into DESIRED structures with SPECIFIC functions.

### Traditional Approaches (Pre-Deep Learning)
- **Rosetta:** Physics-based energy minimization
- **Time:** Hours to days per design
- **Success rate:** roughly 20-30% (approximate; varies by task)

### RFDiffusion Approach (2023)
- **Method:** Generative diffusion model
- **Time:** Minutes per design
- **Success rate:** >50% in favorable cases (approximate; reported rates vary considerably by design task)
- **Innovation:** Learn from nature's designs, then create entirely new ones

## 🧬 The Challenge: Structure Prediction vs. Design

### Structure Prediction (AlphaFold)
**Given:** Amino acid sequence  
**Predict:** 3D structure

```
Sequence → [AlphaFold] → Structure
MKTII...               → α-helix bundle
```

### Structure Design (RFDiffusion)
**Given:** Desired structure or function  
**Generate:** A new backbone structure, then (via ProteinMPNN) an amino acid sequence that folds into it

```
Desired Function → [RFDiffusion] → Structure → [ProteinMPNN] → Sequence
"Bind to X"                      → Novel fold                → MARVL...
```

**Key Insight:** We can now DESIGN proteins that have never existed in nature!

## 🔥 Why RFDiffusion is Revolutionary

### 1. **High Success Rate**
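- >50% of designs fold as intended in favorable cases (approximate; experimentally reported success rates vary considerably by design task)
- Previous methods: roughly 20-30% (approximate)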

### 2. **Speed**
- Minutes instead of hours/days
- Can generate hundreds of candidates quickly

### 3. **Flexibility**
- **Unconditional:** Generate any protein
- **Motif scaffolding:** Design around functional regions
- **Symmetric:** Create oligomers (dimers, trimers, etc.)
- **Binders:** Design proteins that bind to targets

### 4. **Novel Designs**
- Creates proteins unlike anything in nature
- Not limited to known fold families

### 5. **Applications**
- Drug design (enzymes, binders)
- Biomaterials (structural proteins)
- Biosensors
- Vaccines

## 🏗️ How RFDiffusion Works (High Level)

Think of it like image generation (DALL-E), but for proteins:

### Step 1: Training
1. Take experimentally determined protein structures (from the Protein Data Bank)
2. Gradually add random noise (forward diffusion)
3. Train a neural network to remove the noise (reverse diffusion)

### Step 2: Generation
1. Start with complete random noise
2. Network gradually removes noise
3. Result: A valid protein structure!

```
Random Noise → [Denoise] → [Denoise] → [Denoise] → Protein!
  ⚡⚡⚡         ⚡⚡        ⚡          ✨ Structure
```
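The noising half of the process above can be sketched on toy data. The example below is a minimal NumPy illustration of forward diffusion only, using a generic variance-preserving (DDPM-style) schedule on made-up 3D points. The function name `forward_noise`, the schedule values, and the random "backbone" are assumptions for illustration; RFDiffusion's actual noising scheme, which also perturbs residue orientations, is not reproduced here.

```python
import numpy as np

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0) for a variance-preserving diffusion."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

rng = np.random.default_rng(0)

# Hypothetical linear beta schedule (illustrative values, not RFDiffusion's).
T = 200
betas = np.linspace(1e-4, 0.05, T)
alpha_bar = np.cumprod(1.0 - betas)  # fraction of signal variance kept at each step

# Toy "backbone": 30 random points in 3D, centered at the origin.
x0 = rng.standard_normal((30, 3)) * 5.0
x0 -= x0.mean(axis=0)

for t in (0, T // 2, T - 1):
    x_t, _ = forward_noise(x0, t, alpha_bar, rng)
    # Correlation with the clean coordinates fades as t grows.
    corr = np.corrcoef(x0.ravel(), x_t.ravel())[0, 1]
    print(f"t={t:3d}  signal weight={np.sqrt(alpha_bar[t]):.3f}  corr with x0={corr:.3f}")
```

A network trained for reverse diffusion would take `x_t` and `t` as input and try to recover the clean coordinates (or, equivalently, the added noise).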

### Key Components:
- **Diffusion Process:** How we add/remove noise
- **SE(3) Equivariance:** Network respects 3D rotations/translations
- **Structure Module:** The denoising neural network (RFDiffusion fine-tunes the RoseTTAFold structure prediction network for this role)
- **Conditioning:** Guide generation toward specific properties
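To make the SE(3) idea concrete, here is a small NumPy sketch. It checks two things on made-up coordinates: distances between consecutive points do not change under a rotation plus translation (invariance), and a simple hand-written update rule commutes with that transformation (equivariance). The function `toy_update` is a hypothetical stand-in for a network layer, not part of RFDiffusion.

```python
import numpy as np

def random_rotation(rng):
    """Draw a proper rotation matrix (det = +1) via QR decomposition."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q = q * np.sign(np.diag(r))  # fix column signs from the factorization
    if np.linalg.det(q) < 0:     # turn a reflection into a rotation
        q[:, 0] = -q[:, 0]
    return q

def consecutive_distances(coords):
    """Distance from each point to the next one along the chain."""
    return np.linalg.norm(np.diff(coords, axis=0), axis=1)

def toy_update(coords):
    """Hypothetical equivariant map: pull each point 10% toward the centroid."""
    return coords + 0.1 * (coords.mean(axis=0) - coords)

rng = np.random.default_rng(1)
coords = rng.standard_normal((20, 3)) * 4.0
R = random_rotation(rng)
shift = rng.standard_normal(3) * 10.0
moved = coords @ R.T + shift  # apply the same rigid motion to every point

# Invariance: internal geometry is unchanged by the rigid motion.
print(np.allclose(consecutive_distances(coords), consecutive_distances(moved)))

# Equivariance: transforming then updating equals updating then transforming.
print(np.allclose(toy_update(moved), toy_update(coords) @ R.T + shift))
```

An equivariant network gives the same design regardless of how the input happens to be oriented in space, so it does not need to learn every orientation separately.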

## 🔧 Setup: Check Prerequisites

Let's verify your environment is ready!

```python
# Check Python and core libraries
import os
import sys

print(f"Python version: {sys.version}")
print(f"Python executable: {sys.executable}")

# Check if we're in the right environment
print(f"\nCurrent directory: {os.getcwd()}")
print("Expected to be in: .../rfdiffusion_tutorial/")

# Import core scientific libraries
try:
    import numpy as np
    print(f"\n✅ NumPy {np.__version__}")
except ImportError:
    print("\n❌ NumPy not installed")

try:
    import torch
    print(f"✅ PyTorch {torch.__version__}")
    print("   CPU available: ✅")
    print(f"   CUDA available: {'✅' if torch.cuda.is_available() else '❌ (not needed for this notebook)'}")
    # torch.backends.mps only exists in newer PyTorch releases, so look it up defensively
    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is not None and mps_backend.is_available():
        print("   MPS (Apple Silicon) available: ✅")
except ImportError:
    print("❌ PyTorch not installed")

try:
    from Bio import PDB
    import Bio
    print(f"✅ BioPython {Bio.__version__}")
except ImportError:
    print("❌ BioPython not installed")

try:
    import matplotlib
    print(f"✅ Matplotlib {matplotlib.__version__}")
except ImportError:
    print("❌ Matplotlib not installed")

print("\n" + "="*50)
print("If you see any ❌ above, install missing packages:")
print("pip install numpy torch biopython matplotlib")
print("="*50)
```
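If you would rather get the list of missing packages programmatically, the check above can be condensed with the standard library. This is a minimal sketch: the helper name `missing_packages` and the import-name-to-pip-name mapping are assumptions made for this example.

```python
import importlib.util

def missing_packages(import_names):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]

# Import name -> pip package name (they differ for BioPython).
required = {"numpy": "numpy", "torch": "torch", "Bio": "biopython", "matplotlib": "matplotlib"}

missing = missing_packages(required)
if missing:
    print("pip install " + " ".join(required[name] for name in missing))
else:
    print("All required packages found.")
```

`find_spec` only locates a package without importing it, so this runs quickly even when PyTorch is installed.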

## 🧪 Hands-On: Load a Real Protein Structure

Let's load and visualize a real protein structure to understand what we're working with.

```python
# Download a small protein structure from PDB
import os

import numpy as np
from Bio.PDB import PDBList, PDBParser

# Create data directory if it doesn't exist
os.makedirs("../data/examples", exist_ok=True)

# Download a simple protein (Villin headpiece - very small!)
print("Downloading protein 1VII from PDB database...")
pdbl = PDBList()
pdb_file = pdbl.retrieve_pdb_file('1VII', file_format='pdb', pdir='../data/examples')
print(f"✅ Downloaded to: {pdb_file}")

# Parse the structure
parser = PDBParser(QUIET=True)
structure = parser.get_structure('1VII', pdb_file)

# Get basic information
model = structure[0]  # First model (NMR entries can contain several)
chain = list(model.get_chains())[0]
residues = list(chain.get_residues())

print("\n📊 Protein Information:")
print("   PDB ID: 1VII (Villin Headpiece)")
print(f"   Number of residues: {len(residues)}")
print(f"   Chain ID: {chain.id}")

# Extract CA coordinates as a simplified backbone trace
backbone_atoms = []
for residue in residues:
    if residue.has_id('CA'):  # CA = alpha carbon (backbone)
        ca_atom = residue['CA']
        backbone_atoms.append(ca_atom.get_coord())

backbone_atoms = np.array(backbone_atoms)
print(f"   Backbone atoms extracted: {len(backbone_atoms)}")
print(f"   Shape: {backbone_atoms.shape}")  # (N_residues, 3) for x,y,z coordinates

print("\n💡 Key Point: RFDiffusion works with backbone coordinates like these!")
```
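The CA trace is a simplification. As the RFdiffusion paper describes it, each residue is treated as a rigid frame: a position (the CA coordinate) plus an orientation defined by the N, CA, and C atoms. The sketch below builds such a frame with Gram-Schmidt orthogonalization. The atom coordinates are approximate, illustrative values in Å, and the axis convention (x along CA to C) is an assumption for this example rather than RFDiffusion's own code.

```python
import numpy as np

def residue_frame(n, ca, c):
    """Build a right-handed orthonormal frame (rotation, origin) from N, CA, C."""
    e1 = (c - ca) / np.linalg.norm(c - ca)  # x-axis: CA -> C
    u = n - ca
    e2 = u - np.dot(u, e1) * e1             # remove the component along e1
    e2 = e2 / np.linalg.norm(e2)            # y-axis: in the N-CA-C plane
    e3 = np.cross(e1, e2)                   # z-axis: normal to that plane
    rotation = np.stack([e1, e2, e3], axis=1)  # columns are the frame axes
    return rotation, ca

# Approximate backbone geometry for one residue (illustrative values, in Angstroms).
n_atom = np.array([-0.525, 1.363, 0.0])
ca_atom = np.array([0.0, 0.0, 0.0])
c_atom = np.array([1.526, 0.0, 0.0])

rotation, origin = residue_frame(n_atom, ca_atom, c_atom)
print("Rotation matrix:\n", rotation)
print("Origin (CA position):", origin)

# N expressed in the residue's own frame lies in the local x-y plane.
n_local = rotation.T @ (n_atom - origin)
print("N in local frame:", n_local)
```

A protein of N residues then becomes N rotations plus N translations, which is why the later notebooks discuss rotations as well as coordinates.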

```python
# Visualize the backbone
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older Matplotlib

fig = plt.figure(figsize=(12, 5))

# 3D plot
ax1 = fig.add_subplot(121, projection='3d')
ax1.plot(backbone_atoms[:, 0], backbone_atoms[:, 1], backbone_atoms[:, 2],
         'o-', markersize=4, linewidth=2, color='#2E86AB')
ax1.set_xlabel('X (Å)')
ax1.set_ylabel('Y (Å)')
ax1.set_zlabel('Z (Å)')
ax1.set_title('Protein Backbone (3D)', fontsize=12, fontweight='bold')
ax1.grid(True, alpha=0.3)

# Distance between consecutive residues
ax2 = fig.add_subplot(122)
distances = np.linalg.norm(np.diff(backbone_atoms, axis=0), axis=1)
ax2.plot(distances, linewidth=2, color='#A23B72')
ax2.axhline(y=3.8, color='red', linestyle='--', label='Expected ~3.8Å')
ax2.set_xlabel('Residue Index', fontsize=11)
ax2.set_ylabel('Distance to Next CA (Å)', fontsize=11)
ax2.set_title('CA-CA Distances', fontsize=12, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("📌 Notice:")
print("   - Backbone forms a connected chain in 3D space")
print("   - CA-CA distance is ~3.8Å (constraint RFDiffusion must respect!)")
print("   - This is the representation RFDiffusion generates!")
```
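The ~3.8 Å spacing can also be turned into a quick sanity check for any CA trace, generated or real. The following is a minimal sketch on a synthetic chain; the function name `ca_spacing_violations` and the 0.3 Å tolerance are assumptions chosen for illustration.

```python
import numpy as np

def ca_spacing_violations(ca_coords, expected=3.8, tol=0.3):
    """Return indices i where the CA(i) to CA(i+1) distance is off by more than tol."""
    d = np.linalg.norm(np.diff(ca_coords, axis=0), axis=1)
    return np.flatnonzero(np.abs(d - expected) > tol)

# Synthetic chain: 10 points on a line, 3.8 apart, with one artificial break.
chain_coords = np.zeros((10, 3))
chain_coords[:, 0] = np.arange(10) * 3.8
chain_coords[6:, 0] += 4.0  # open a gap between positions 5 and 6

violations = ca_spacing_violations(chain_coords)
print("Chain breaks after positions:", violations.tolist())
```

Running the same function on `backbone_atoms` from the cell above should flag few or no positions for a well-formed structure.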

## 📚 Module Roadmap

Here's what we'll cover across all eight notebooks, including this one:

| Notebook | Title | GPU Needed? | What You'll Learn |
|----------|-------|-------------|-------------------|
| **01** | Introduction | ❌ No | Overview, setup, basic concepts |
| **02** | Diffusion Basics | ❌ No | Theory, math, simple 1D examples |
| **03** | Protein Representation | ❌ No | How to encode proteins |
| **04** | SE(3) Equivariance | ❌ No | Geometric constraints |
| **05** | Unconditional Generation | ⚠️ Optional | Generate basic proteins |
| **06** | Motif Scaffolding | ⚠️ Optional | Design around functional regions |
| **07** | Symmetric Design | ✅ Yes | Create symmetric assemblies |
| **08** | Evaluation | ❌ No | Assess design quality |

**Legend:**
- ❌ No GPU needed - runs on CPU
- ⚠️ Optional - better with GPU but works on CPU
- ✅ GPU recommended - much faster with GPU
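For the notebooks where a GPU is optional or recommended, a small helper can pick the best available device and fall back to CPU. This is a sketch under assumptions: the name `pick_device` is hypothetical, and whether a given RFDiffusion installation actually runs on Apple's MPS backend is not something this snippet checks.

```python
def pick_device():
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # without PyTorch there is nothing to accelerate
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is absent in older PyTorch releases.
    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is not None and mps_backend.is_available():
        return "mps"
    return "cpu"

device = pick_device()
print(f"Using device: {device}")
```

The returned string can be passed to `torch.device(...)` or `tensor.to(...)` in later notebooks.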

## 🎓 Key Takeaways

1. **RFDiffusion is a breakthrough** in protein design, with success rates above 50% in favorable cases (rates vary by design task)
2. **Diffusion models** learn to gradually denoise random inputs into valid structures
3. **We work with backbone coordinates** (simplified to CA atoms in this notebook; RFDiffusion itself tracks a rigid frame per residue built from N, CA, and C), not full side-chain detail
4. **No GPU needed yet** - we'll build up theory first, then implement
5. **Applications are vast** - drugs, materials, biosensors, vaccines

## ✅ Self-Check Questions

Before moving on, make sure you can answer:

1. What's the difference between structure prediction and structure design?
2. Why is RFDiffusion revolutionary compared to previous methods?
3. What are the main components of RFDiffusion?
4. What representation does RFDiffusion use (backbone vs full atom)?
5. When will we need a GPU in this tutorial?

## 📖 Recommended Reading

Before the next notebook:

1. **Quick overview:** [What are diffusion models?](https://yang-song.net/blog/2021/score/) (first 2 sections)
2. **Protein basics:** See `docs/protein_basics.md` in this repo
3. **Optional deep dive:** Read the abstract and intro of the RFDiffusion paper

## ⏭️ Next Notebook

**02_diffusion_basics.ipynb** - Learn the theory behind diffusion models

💡 **No GPU needed for notebook 2!** We'll implement concepts with simple examples.