# 🔍 Protein Quality Assessment Lab

Welcome to the **Protein Quality Assessment Lab**!

In this tutorial, you'll learn how to evaluate protein structures like a professional structural biologist.

### 🎯 Goal
Understand what makes a "good" vs "bad" protein structure by applying validation metrics used by the Protein Data Bank (PDB) and tools like MolProbity.

### 📚 What You'll Learn
1. **Ramachandran Analysis**: Check backbone geometry
2. **Clash Detection**: Find steric overlaps
3. **Bond Geometry**: Validate lengths and angles
4. **Rotamer Analysis**: Verify side-chain conformations
5. **Overall Quality Scoring**: Combine metrics

### ⚠️ The Golden Rule
> **Garbage In, Garbage Out**
>
> A beautiful molecular dynamics simulation on a terrible structure is still terrible. Always validate your starting structures!

---

In [None]:
# 🛠️ SETUP: Install dependencies
import sys
import os

try:
    import google.colab
    IN_COLAB = True
    print("🌐 Running in Google Colab")
    !pip install -q synth-pdb matplotlib numpy biotite py3Dmol ipywidgets
except ImportError:
    IN_COLAB = False
    print("💻 Running locally")

print("✅ Setup complete!")

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import biotite.structure as struc
import biotite.structure.io.pdb as pdb
import io
from synth_pdb.generator import generate_pdb_content
from synth_pdb.validator import PDBValidator
import ipywidgets as widgets
from IPython.display import display

print("📦 All imports successful!")

## Step 1: Generate Test Structures

We'll create two structures:
- **Good Structure**: Properly minimized with realistic geometry
- **Bad Structure**: No energy minimization or side-chain optimization (raw NeRF, i.e. Natural Extension Reference Frame, output)

This lets us compare quality metrics side-by-side.
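
The NeRF construction mentioned above places each new atom from a bond length, a bond angle, and a torsion angle relative to the three atoms before it. The sketch below is a minimal, generic version of that single placement step. The function name `place_atom` and its parameters are illustrative assumptions; synth-pdb's own implementation may differ.

```python
import numpy as np

def place_atom(a, b, c, bond, angle_deg, torsion_deg):
    """Place atom D after atoms A-B-C from internal coordinates.

    bond        : C-D distance
    angle_deg   : B-C-D bond angle in degrees
    torsion_deg : A-B-C-D dihedral in degrees
    (hypothetical helper, for illustration only)
    """
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    theta = np.radians(angle_deg)
    phi = np.radians(torsion_deg)

    # Local orthonormal frame built from the three previous atoms
    bc = c - b
    bc = bc / np.linalg.norm(bc)
    n = np.cross(b - a, bc)
    n = n / np.linalg.norm(n)
    m = np.cross(n, bc)

    # Position of D in the local frame, then mapped back to Cartesian space
    d_local = np.array([
        -bond * np.cos(theta),
        bond * np.sin(theta) * np.cos(phi),
        bond * np.sin(theta) * np.sin(phi),
    ])
    return c + d_local[0] * bc + d_local[1] * m + d_local[2] * n

# Example: extend a three-atom fragment by one atom
new_atom = place_atom([1, 0, 0], [0, 0, 0], [0, 0, 1],
                      bond=1.33, angle_deg=116.0, torsion_deg=-60.0)
```

Because every atom is placed from ideal internal coordinates only, nothing in this step prevents distant parts of the chain from running into each other, which is why an unminimized structure can still contain clashes.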

In [None]:
# Generate a Good Structure (with minimization)
print("🧬 Generating GOOD structure (with energy minimization)...")
sequence = "MKFLKFSLLTAVLLSVVFAFSSCGDDDDAKAAAKAAAKAAAKEAAAKEAAAKA"
structure_def = "1-10:alpha,11-20:beta,21-25:random,26-35:alpha,36-45:beta,46-53:alpha"

good_pdb = generate_pdb_content(
    sequence_str=sequence,
    structure=structure_def,
    optimize_sidechains=True,
    minimize_energy=True  # KEY: Energy minimization ON
)

print("✅ Good structure ready!")
print(f"   Length: {len(good_pdb)} characters")

In [None]:
# Generate a Bad Structure (no minimization)
print("⚠️  Generating BAD structure (no energy minimization)...")

bad_pdb = generate_pdb_content(
    sequence_str=sequence,
    structure=structure_def,
    optimize_sidechains=False,  # KEY: No optimization
    minimize_energy=False        # KEY: No minimization
)

print("✅ Bad structure ready!")
print(f"   Length: {len(bad_pdb)} characters")
print("\n💡 TIP: The 'bad' structure has a realistic backbone but poor side-chain placement.")

## Step 2: Ramachandran Plot Analysis

The **Ramachandran Plot** is the most fundamental quality check.

### 📐 What is it?
It plots the backbone dihedral angles (φ, ψ) for each residue. Only certain combinations are sterically allowed:
- **Blue Region**: Alpha helices (φ ≈ -60°, ψ ≈ -45°)
- **Red Region**: Beta sheets (φ ≈ -120°, ψ ≈ +120°)
- **Forbidden Zones**: Atoms would overlap (bad!)
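
To make the dihedral angles concrete: φ is the torsion over the atoms C(i-1), N(i), CA(i), C(i), and ψ is the torsion over N(i), CA(i), C(i), N(i+1). The sketch below computes one such torsion from four points with plain NumPy. The helper name `dihedral_deg` is an assumption made for illustration; the notebook itself relies on biotite's `dihedral_backbone` for this.

```python
import numpy as np

def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four points.

    Hypothetical helper for illustration; not part of synth-pdb or biotite.
    """
    p0, p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3))
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)

    # Project the outer bonds onto the plane perpendicular to the central bond
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1

    # The angle between the two projections is the dihedral
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.degrees(np.arctan2(y, x))

# Example: four points forming a quarter-turn around the z axis
angle = dihedral_deg([1, 0, 0], [0, 0, 0], [0, 0, 1], [0, 1, 1])
```

Using `arctan2` rather than `arccos` keeps the sign of the angle, which matters here because the Ramachandran plot distinguishes, for example, φ = -60° from φ = +60°.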

### ✅ Good Structure
- 90%+ residues in favored regions
- No outliers in forbidden zones

### ❌ Bad Structure
- Many outliers
- Scattered distribution

In [None]:
def plot_ramachandran(pdb_content, title="Ramachandran Plot"):
    """Extract phi/psi angles and plot Ramachandran diagram."""
    pdb_file = pdb.PDBFile.read(io.StringIO(pdb_content))
    structure = pdb_file.get_structure(model=1)
    
    # Calculate backbone dihedrals
    phi, psi, omega = struc.dihedral_backbone(structure)
    
    # Convert to degrees and filter NaN (termini)
    phi_deg = []
    psi_deg = []
    colors = []
    
    for i in range(len(phi)):
        if not np.isnan(phi[i]) and not np.isnan(psi[i]):
            p = np.degrees(phi[i])
            s = np.degrees(psi[i])
            phi_deg.append(p)
            psi_deg.append(s)
            
            # Color by region
            if -100 < p < -30 and -80 < s < -10:
                colors.append('blue')  # Alpha
            elif -180 < p < -40 and (90 < s < 180 or -180 < s < -160):
                colors.append('red')   # Beta
            elif p > 0:
                colors.append('green') # Left-handed
            else:
                colors.append('orange') # Other/outlier
    
    # Plot
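    # Reuse the current axes if a figure is already open (e.g. a subplot);
    # otherwise create a standalone figure
    if not plt.get_fignums():
        plt.figure(figsize=(7, 7))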
    
    # Background regions (simplified; same bounds as the alpha coloring rule above)
    plt.gca().add_patch(plt.Rectangle((-100, -80), 70, 70,
                                      color='blue', alpha=0.1, label='Alpha favored'))
    plt.gca().add_patch(plt.Rectangle((-180, 90), 140, 90, 
                                      color='red', alpha=0.1, label='Beta favored'))
    
    # Data points
    plt.scatter(phi_deg, psi_deg, c=colors, alpha=0.7, edgecolors='black', s=50)
    
    plt.xlim(-180, 180)
    plt.ylim(-180, 180)
    plt.axhline(0, color='black', linewidth=0.5, alpha=0.3)
    plt.axvline(0, color='black', linewidth=0.5, alpha=0.3)
    plt.xlabel('Phi (φ) degrees', fontsize=12)
    plt.ylabel('Psi (ψ) degrees', fontsize=12)
    plt.title(title, fontsize=14, fontweight='bold')
    plt.grid(True, alpha=0.2)
    plt.legend(loc='upper right')
    plt.tight_layout()
    
    # Calculate statistics
    total = len(phi_deg)
    favored = sum(1 for c in colors if c in ['blue', 'red'])
    outliers = sum(1 for c in colors if c == 'orange')
    
    return {
        'total': total,
        'favored': favored,
        'favored_pct': 100 * favored / total if total > 0 else 0,
        'outliers': outliers,
        'outlier_pct': 100 * outliers / total if total > 0 else 0
    }

print("✅ Ramachandran plotting function ready!")

In [None]:
# Compare Good vs Bad
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

plt.sca(ax1)
good_stats = plot_ramachandran(good_pdb, "✅ GOOD Structure")

plt.sca(ax2)
bad_stats = plot_ramachandran(bad_pdb, "⚠️  BAD Structure")

plt.tight_layout()
plt.show()

print("\n📊 RAMACHANDRAN STATISTICS")
print("=" * 50)
print(f"GOOD Structure: {good_stats['favored_pct']:.1f}% in favored regions")
print(f"                {good_stats['outlier_pct']:.1f}% outliers")
print(f"\nBAD Structure:  {bad_stats['favored_pct']:.1f}% in favored regions")
print(f"                {bad_stats['outlier_pct']:.1f}% outliers")
print("\n💡 INTERPRETATION:")
print("   Good structures should have >90% in favored regions.")
print("   Outliers (orange dots) indicate geometry problems.")

## Step 3: Clash Detection

**Steric clashes** occur when atoms are too close together (overlapping van der Waals radii).

### 🔬 Detection Method
For each atom pair:
1. Calculate distance
2. Compare to sum of van der Waals radii
3. If the atoms overlap by 0.4 Å or more (distance ≤ sum of radii - 0.4 Å) → **CLASH!**
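
The per-pair test in the steps above can be sketched as follows. The radii are approximate Bondi values for the common heavy elements, and the names `VDW_RADII`, `vdw_overlap`, and `is_clash` are assumptions made for this illustration. Validation tools such as MolProbity use their own radius tables and include hydrogens, so treat this as a rough heavy-atom approximation.

```python
# Approximate van der Waals radii in Å (Bondi values; an assumption for this sketch)
VDW_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}

def vdw_overlap(distance, elem1, elem2):
    """How far two atoms penetrate each other's van der Waals spheres (Å).

    Positive values mean the spheres overlap; negative values mean a gap.
    """
    return VDW_RADII[elem1] + VDW_RADII[elem2] - distance

def is_clash(distance, elem1, elem2, min_overlap=0.4):
    """Flag a clash when the overlap reaches min_overlap (0.4 Å by default)."""
    return vdw_overlap(distance, elem1, elem2) >= min_overlap

# Example: two carbons 2.9 Å apart overlap by about 0.5 Å
carbon_pair_clashes = is_clash(2.9, "C", "C")
```

A radius-based test like this is element-aware: a 3.0 Å separation is a borderline contact for two carbons but a comfortable distance for a nitrogen and an oxygen.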

### ✅ Good Structure
- Few or no clashes
- Clashes only in flexible loops (acceptable)

### ❌ Bad Structure
- Many clashes throughout
- Clashes in core regions (unacceptable)

In [None]:
def detect_clashes(pdb_content, clash_threshold=2.0):
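    """Detect steric clashes between heavy atoms.

    Simplification: any pair closer than clash_threshold (in Å) counts as a
    clash, instead of comparing against the sum of van der Waals radii.
    """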
    pdb_file = pdb.PDBFile.read(io.StringIO(pdb_content))
    structure = pdb_file.get_structure(model=1)
    
    # Get all heavy atoms (no hydrogens)
    heavy = structure[structure.element != 'H']
    
    clashes = []
    n_atoms = len(heavy)
    
    # Pairwise distance check (simplified - real tools use spatial indexing)
    for i in range(min(n_atoms, 500)):  # Limit for speed
        for j in range(i+1, min(n_atoms, 500)):
            # Skip if same residue or adjacent residues
            if abs(heavy.res_id[i] - heavy.res_id[j]) <= 1:
                continue
            
            # Calculate distance
            dist = np.linalg.norm(heavy.coord[i] - heavy.coord[j])
            
            # Check for clash
            if dist < clash_threshold:
                clashes.append({
                    'atom1': f"{heavy.res_name[i]}{heavy.res_id[i]}:{heavy.atom_name[i]}",
                    'atom2': f"{heavy.res_name[j]}{heavy.res_id[j]}:{heavy.atom_name[j]}",
                    'distance': dist
                })
    
    return clashes

print("✅ Clash detection function ready!")
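
The nested loop above is easy to read but slow for larger structures. As a hedged alternative, the sketch below does the same fixed-cutoff check with NumPy broadcasting. The function name `find_close_pairs` and its plain-array inputs are assumptions made for illustration. It builds the full distance matrix, so memory grows quadratically with atom count; real validation tools use spatial indexing instead.

```python
import numpy as np

def find_close_pairs(coords, res_ids, cutoff=2.0):
    """Return (i, j, distance) for atom pairs closer than cutoff.

    Pairs in the same or adjacent residues are skipped, mirroring the
    loop-based version. Hypothetical helper for illustration.
    """
    coords = np.asarray(coords, dtype=float)
    res_ids = np.asarray(res_ids)

    # All pairwise distances via broadcasting: shape (n_atoms, n_atoms)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Keep only pairs separated by more than one residue in sequence
    far_in_sequence = np.abs(res_ids[:, None] - res_ids[None, :]) > 1

    # Upper triangle only, so each pair is reported once
    upper = np.triu(np.ones(dist.shape, dtype=bool), k=1)

    i_idx, j_idx = np.nonzero((dist < cutoff) & far_in_sequence & upper)
    return [(int(i), int(j), float(dist[i, j])) for i, j in zip(i_idx, j_idx)]

# Example with four toy atoms; atoms 0 and 1 share a residue
pairs = find_close_pairs(
    coords=[[0, 0, 0], [1.5, 0, 0], [10, 0, 0], [0.5, 0, 0]],
    res_ids=[1, 1, 2, 5],
)
```

With an `AtomArray` such as `heavy` from the cell above, the inputs would be `heavy.coord` and `heavy.res_id`.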

In [None]:
# Detect clashes in both structures
print("🔍 Detecting steric clashes...\n")

good_clashes = detect_clashes(good_pdb)
bad_clashes = detect_clashes(bad_pdb)

print("📊 CLASH STATISTICS")
print("=" * 50)
print(f"✅ GOOD Structure: {len(good_clashes)} clashes detected")
if good_clashes:
    print("   Top 3 clashes:")
    for clash in sorted(good_clashes, key=lambda x: x['distance'])[:3]:
        print(f"     {clash['atom1']} ↔ {clash['atom2']}: {clash['distance']:.2f} Å")

print(f"\n⚠️  BAD Structure: {len(bad_clashes)} clashes detected")
if bad_clashes:
    print("   Top 3 clashes:")
    for clash in sorted(bad_clashes, key=lambda x: x['distance'])[:3]:
        print(f"     {clash['atom1']} ↔ {clash['atom2']}: {clash['distance']:.2f} Å")

print("\n💡 INTERPRETATION (rule of thumb for a peptide of this size):")
print("   <10 clashes = Excellent")
print("   10-50 clashes = Acceptable (may need refinement)")
print("   >50 clashes = Poor quality")
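
Raw clash counts grow with structure size, so they are hard to compare between proteins. MolProbity handles this by reporting a clashscore, the number of serious clashes per 1000 atoms. The sketch below applies the same normalization. The function name `clashscore` is chosen for illustration, and values computed from `detect_clashes` are not directly comparable to MolProbity's, which uses van der Waals radii and explicit hydrogens.

```python
def clashscore(n_clashes, n_atoms):
    """Clashes per 1000 atoms (illustrative normalization only)."""
    if n_atoms <= 0:
        raise ValueError("n_atoms must be positive")
    return 1000.0 * n_clashes / n_atoms

# Example: 5 clashes among 400 atoms
example_score = clashscore(5, 400)
```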

## Step 4: Comprehensive Validation

Now let's use synth-pdb's built-in validator to run a complete quality check.

In [None]:
# Validate Good Structure
print("🔍 Validating GOOD structure...\n")
good_validator = PDBValidator(good_pdb)
good_validator.validate_all()
good_violations = good_validator.get_violations()

print(f"\n{'='*50}")
if not good_violations:
    print("✅ No violations! Structure passes all checks.")
else:
    print(f"⚠️  Found {len(good_violations)} violations:")
    for v in good_violations[:5]:
        print(f"   - {v}")
    if len(good_violations) > 5:
        print(f"   ... and {len(good_violations)-5} more")

In [None]:
# Validate Bad Structure
print("🔍 Validating BAD structure...\n")
bad_validator = PDBValidator(bad_pdb)
bad_validator.validate_all()
bad_violations = bad_validator.get_violations()

print(f"\n{'='*50}")
if not bad_violations:
    print("✅ No violations! Structure passes all checks.")
else:
    print(f"⚠️  Found {len(bad_violations)} violations:")
    for v in bad_violations[:5]:
        print(f"   - {v}")
    if len(bad_violations) > 5:
        print(f"   ... and {len(bad_violations)-5} more")

## Step 5: Interactive Comparison

Use the widget below to toggle between good and bad structures and see the quality differences!

In [None]:
import py3Dmol

# Create toggle widget
structure_selector = widgets.ToggleButtons(
    options=['Good Structure ✅', 'Bad Structure ⚠️'],
    description='View:',
    button_style='info'
)

output = widgets.Output()

def show_structure(change):
    with output:
        output.clear_output(wait=True)
        
        if 'Good' in structure_selector.value:
            pdb_data = good_pdb
            title = "✅ GOOD Structure (Minimized)"
            stats = good_stats
            clashes = len(good_clashes)
            violations = len(good_violations)
        else:
            pdb_data = bad_pdb
            title = "⚠️  BAD Structure (Not Minimized)"
            stats = bad_stats
            clashes = len(bad_clashes)
            violations = len(bad_violations)
        
        # 3D Viewer
        view = py3Dmol.view(width=600, height=400)
        view.addModel(pdb_data, 'pdb')
        view.setStyle({'cartoon': {'color': 'spectrum'}})
        view.zoomTo()
        view.show()
        
        # Quality Report
        print(f"\n{title}")
        print("=" * 50)
        print(f"Ramachandran favored: {stats['favored_pct']:.1f}%")
        print(f"Ramachandran outliers: {stats['outlier_pct']:.1f}%")
        print(f"Steric clashes: {clashes}")
        print(f"Total violations: {violations}")
        
        # Overall grade
        if stats['favored_pct'] > 90 and clashes < 10:
            print("\n🏆 Overall Grade: EXCELLENT")
        elif stats['favored_pct'] > 80 and clashes < 30:
            print("\n👍 Overall Grade: GOOD")
        elif stats['favored_pct'] > 70:
            print("\n⚠️  Overall Grade: ACCEPTABLE (needs refinement)")
        else:
            print("\n❌ Overall Grade: POOR (major issues)")

structure_selector.observe(show_structure, names='value')

display(structure_selector, output)
show_structure(None)  # Initial display
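
The grading rule inside the widget callback can also be pulled out into a small function, which makes it easier to reuse and to test without a notebook front end. The sketch below keeps the same thresholds as the cell above; the function name `grade_structure` is an assumption made for illustration.

```python
def grade_structure(favored_pct, n_clashes):
    """Combine Ramachandran and clash metrics into a coarse grade.

    Thresholds mirror the widget callback; they are tutorial rules of
    thumb, not an official validation standard.
    """
    if favored_pct > 90 and n_clashes < 10:
        return "EXCELLENT"
    if favored_pct > 80 and n_clashes < 30:
        return "GOOD"
    if favored_pct > 70:
        return "ACCEPTABLE (needs refinement)"
    return "POOR (major issues)"

# Example: strong backbone geometry but many clashes
example_grade = grade_structure(favored_pct=95.0, n_clashes=40)
```

Note that a structure with excellent backbone geometry but many clashes falls through to "ACCEPTABLE", because both of the top two grades require a low clash count.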

## 🎓 Key Takeaways

### What Makes a Good Structure?
1. **>90% Ramachandran favored** - Realistic backbone geometry
2. **<10 steric clashes** - No overlapping atoms
3. **Proper bond lengths/angles** - Chemistry makes sense
4. **Good rotamers** - Side chains in low-energy conformations

### When to Use Quality Checks
- ✅ Before starting MD simulations
- ✅ After homology modeling
- ✅ When downloading structures from the PDB (yes, even experimental structures have issues!)
- ✅ After any structure prediction/generation

### Tools for Further Analysis
- **MolProbity** (http://molprobity.biochem.duke.edu/) - Comprehensive validation
- **PROCHECK** - Classic Ramachandran analysis
- **WHAT IF** - Detailed geometry checks
- **PDB-REDO** - Automated structure refinement

---

## Next Steps

Try these experiments:
1. Generate your own structures with different parameters
2. Download a real PDB structure and validate it
3. Compare crystal structures vs NMR structures vs AlphaFold predictions

Remember: **A validated structure is a trustworthy structure!** 🔬