# 🧬 The Hard Decoy Challenge

**Objective**: Learn how to generate high-quality negative samples for training Protein AI models.

In the world of Protein AI (like AlphaFold 3 or RoseTTAFold), generating "good" structures is only half the battle. To train robust models, researchers need **Hard Decoys**: structures that look physically plausible (correct bond lengths, no steric clashes) but are biologically or topologically incorrect.

### Why do we need Hard Decoys?
- **Teaching the Global Minimum**: If a model only ever sees perfect structures, it won't know why they are better than slightly distorted ones.
- **Improving Discriminators**: To train a model to score protein quality, you need a balanced dataset of 'Natives' (Score 1.0) and 'Decoys' (Score 0.0).
- **Robustness**: Hard decoys test if a model is just memorizing patterns or actually understanding biophysics.

### ⚠️ How to Run (Important!)
This notebook requires a specific environment setup. Follow these steps strictly:

1.  **Run All Cells** (`Runtime` -> `Run all` or `Ctrl+F9`).
2.  **Wait for the Crash**: If on Colab, the setup cell will **automatically restart** the session to load libraries. This is normal.
3.  **Local Users**: If you are running locally after editing the library code, **Restart your Kernel** manually to ensure changes take effect.
4.  **Wait 10 Seconds**: Allow the session to reconnect.
5.  **Run All Cells AGAIN**: This time, the setup cell detects that installation is complete (it prints '✅ Dependencies Ready.') and the notebook proceeds normally.

In [None]:
# @title Setup & Installation { display-mode: "form" }
import os
import sys
from pathlib import Path

# Ensure the local synth_pdb source code is prioritized if running from the repo
try:
    current_path = Path(".").resolve()
    repo_root = current_path.parent.parent 
    if (repo_root / "synth_pdb").exists():
        if str(repo_root) not in sys.path:
            sys.path.insert(0, str(repo_root))
            print(f"📌 Added local library to path: {repo_root}")
except Exception:
    pass

if 'google.colab' in str(get_ipython()):
    if not os.path.exists("installed.marker"):
        print("Running on Google Colab. Installing dependencies...")
        get_ipython().run_line_magic('pip', 'install synth-pdb py3Dmol')
        
        with open("installed.marker", "w") as f:
            f.write("done")
        
        print("🔄 Installation complete. KERNEL RESTARTING AUTOMATICALLY...")
        print("⚠️ Please wait 10 seconds, then Run All Cells again.")
        os.kill(os.getpid(), 9)
    else:
        print("✅ Dependencies Ready.")
else:
    import synth_pdb
    print(f"✅ Running locally. Using synth-pdb version: {synth_pdb.__version__} from {synth_pdb.__file__}")

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import py3Dmol
from synth_pdb.batch_generator import BatchedGenerator
from synth_pdb.generator import generate_pdb_content

print("Libraries Loaded.")

## Strategy 1: Torsion Angle Drift (Conformational Noise)

**Objective**: Generate "Near-Native" decoys by adding controlled Gaussian noise to the ideal Ramachandran angles.

In AI training, the drift setting (the `--drift` parameter, passed as `drift=` to `generate_batch` in the cell below) is used to test model sensitivity to backbone precision.
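
To make the idea of torsion drift concrete, here is a minimal NumPy sketch of adding Gaussian noise to ideal helix angles. It does not reproduce synth_pdb's internals: the helper names `wrap_degrees` and `drift_torsions`, the textbook helix angles, and the reading of drift as a standard deviation in degrees are all assumptions made for illustration.

```python
import numpy as np

# Textbook alpha-helix torsions in degrees (assumed values, not read from synth_pdb)
IDEAL_PHI, IDEAL_PSI = -57.0, -47.0

def wrap_degrees(angles):
    """Wrap angles into the interval [-180, 180)."""
    return (np.asarray(angles) + 180.0) % 360.0 - 180.0

def drift_torsions(n_residues, drift, rng):
    """Hypothetical helper: ideal (phi, psi) plus Gaussian noise with std `drift` degrees."""
    phi = IDEAL_PHI + rng.normal(0.0, drift, size=n_residues)
    psi = IDEAL_PSI + rng.normal(0.0, drift, size=n_residues)
    return wrap_degrees(phi), wrap_degrees(psi)

rng = np.random.default_rng(0)
native_phi, native_psi = drift_torsions(16, drift=0.0, rng=rng)
decoy_phi, decoy_psi = drift_torsions(16, drift=15.0, rng=rng)

print("native phi spread:", native_phi.std())
print("decoy phi spread:", decoy_phi.std())
```

With a drift of zero every residue keeps the ideal angles, while a larger drift spreads the angles around the ideal values; wrapping keeps them in the range a Ramachandran plot expects.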

In [None]:
# Generate an ensemble with increasing amounts of noise
sequence = "LEU-LYS-GLU-LEU-GLU-LYS-GLU-LEU-GLU-LYS-GLU-LEU-GLU-LYS-GLU-LEU" # Zipper fragment
generator = BatchedGenerator(sequence, n_batch=100, full_atom=True)

print("Generating Native (Drift = 0.0)...")
native = generator.generate_batch(drift=0.0)

print("Generating Hard Decoy (Drift = 15.0)...")
hard_decoy = generator.generate_batch(drift=15.0)

print("Generation Complete.")

### 📈 Visualization A: The Ramachandran Plot

A **Ramachandran Plot** maps the backbone torsion angles (Phi and Psi) for every residue. 
- **Natives** cluster tightly in favored regions (e.g., the lower-left quadrant, around Φ ≈ -60°, Ψ ≈ -45°, for alpha helices).
- **Decoys** leak into disallowed regions as the noise increases, taking on backbone angles that real proteins rarely adopt because of steric clashes.
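
For readers who want to see what a torsion angle is numerically, the sketch below computes one from four points with NumPy. Phi is the torsion over the atoms C(i-1), N(i), CA(i), C(i), and Psi over N(i), CA(i), C(i), N(i+1). The function name `dihedral_degrees` and the toy coordinates are illustrative assumptions; the notebook itself relies on biotite for this step.

```python
import numpy as np

def dihedral_degrees(p0, p1, p2, p3):
    """Torsion angle in degrees defined by four points (illustrative helper)."""
    p0, p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3))
    b0 = p0 - p1                      # bond pointing back from the central bond
    b1 = p2 - p1                      # central bond (rotation axis)
    b2 = p3 - p2                      # bond pointing forward from the central bond
    b1 = b1 / np.linalg.norm(b1)
    # Project the outer bonds onto the plane perpendicular to the central bond
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return float(np.degrees(np.arctan2(y, x)))

# Toy points (not real atoms): the last point is rotated a quarter turn about the z axis
angle = dihedral_degrees([1, 0, 0], [0, 0, 0], [0, 0, 1], [0, 1, 1])
print(f"toy torsion: {angle:.1f} degrees")
```

Applying this to the four backbone atoms of every residue yields the (Phi, Psi) pairs that the plot scatters.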

In [None]:
def get_rama_angles(pdb_str):
    """Extract phi/psi angles using biotite"""
    import biotite.structure as struc
    import biotite.structure.io.pdb as pdb
    from io import StringIO
    
    text_file = StringIO(pdb_str)
    array = pdb.PDBFile.read(text_file).get_structure(model=1)
    # dihedral_backbone returns (phi, psi, omega) arrays
    phi, psi, omega = struc.dihedral_backbone(array)
    return np.degrees(phi), np.degrees(psi)

def plot_ramachandran(batch, title):
    all_phi = []
    all_psi = []
    # Sample 10 structures from the batch
    for i in range(min(10, batch.coords.shape[0])):
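        phi, psi = get_rama_angles(batch.to_pdb(i))
        # Terminal residues lack phi or psi; drop them jointly so (phi, psi) pairs stay aligned
        valid = ~(np.isnan(phi) | np.isnan(psi))
        all_phi.extend(phi[valid])
        all_psi.extend(psi[valid])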
        
    plt.figure(figsize=(6, 6))
    plt.scatter(all_phi, all_psi, alpha=0.5, s=10, color='#667eea')
    plt.xlim(-180, 180)
    plt.ylim(-180, 180)
    plt.axhline(0, color='grey', lw=1, alpha=0.3)
    plt.axvline(0, color='grey', lw=1, alpha=0.3)
    plt.title(f"Ramachandran: {title}")
    plt.xlabel("Phi (Φ)")
    plt.ylabel("Psi (Ψ)")
    plt.grid(alpha=0.2)
    plt.show()

plot_ramachandran(native, "Native (Ideal Alpha Helix)")
plot_ramachandran(hard_decoy, "Hard Decoy (15° Noise)")

### 🗺️ Visualization B: The Contact Map

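A **Contact Map** is a 2D matrix where each pixel $(i, j)$ encodes how close residues $i$ and $j$ are. The plot below shows the raw Cα-Cα distances (strictly speaking a distance map); applying a distance cutoff would turn it into a binary contact map.
- Perfect structures have clear patterns (helices show a band of short distances running parallel to the diagonal).
- High-drift decoys smear these patterns, showing the model "what not to predict".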
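
A binary contact map can be derived from pairwise distances by thresholding. The sketch below shows this with plain NumPy on toy coordinates. The `contact_map` helper and the 8 Å cutoff are assumptions for illustration (8 Å is a commonly used value, but the right cutoff depends on the task), and the toy points are not a real protein.

```python
import numpy as np

def contact_map(ca_coords, cutoff=8.0):
    """Return a boolean matrix that is True where two CA atoms are closer than `cutoff` (in Å)."""
    ca_coords = np.asarray(ca_coords, dtype=float)
    # Broadcasting builds all pairwise difference vectors, shape (N, N, 3)
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return dist < cutoff

# Toy input: six points spaced 3.8 Å apart on a straight line (not a real fold)
toy_ca = np.column_stack([np.arange(6) * 3.8, np.zeros(6), np.zeros(6)])
contacts = contact_map(toy_ca)
print(contacts.astype(int))
```

In this toy chain only residues up to two positions apart count as contacts, which gives a narrow band around the diagonal.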

In [None]:
def plot_contact_map(batch, title):
    # Get CA atom coordinates for the first model
    c = batch.coords[0]
    atom_names = batch.atom_names
    ca_mask = np.array([name == "CA" for name in atom_names])
    ca_coords = c[ca_mask]
    
    # Calculate pairwise distances
    diff = ca_coords[:, np.newaxis, :] - ca_coords[np.newaxis, :, :]
    dist_matrix = np.sqrt((diff**2).sum(-1))
    
    plt.figure(figsize=(6, 5))
    plt.imshow(dist_matrix, cmap='viridis_r')
    plt.colorbar(label="Distance (Å)")
    plt.title(f"Contact Map: {title}")
    plt.xlabel("Residue Index")
    plt.ylabel("Residue Index")
    plt.show()

plot_contact_map(native, "Native Contacts")
plot_contact_map(hard_decoy, "Decoy Contacts (Scattered)")

## Strategy 2: Label Shuffling (Chemical Mismatch)

**Objective**: Create a physically perfect structure that is chemically impossible.

By shuffling residue labels, we create structures where bulky residues are forced into cramped spaces, or hydrophobic residues are exposed to solvent.
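
One practical detail: for a sequence built from only a few residue types, a random shuffle can leave many positions unchanged. The sketch below reshuffles until a minimum fraction of labels differ from the native. The `shuffle_labels` helper, its parameters, and the 50% threshold are hypothetical choices for illustration, not part of synth_pdb.

```python
import random

def shuffle_labels(residues, min_changed=0.5, seed=0, max_tries=100):
    """Shuffle residue labels until at least `min_changed` of the positions differ."""
    rng = random.Random(seed)          # seeded so the decoy is reproducible
    residues = list(residues)
    shuffled = residues[:]
    for _ in range(max_tries):
        rng.shuffle(shuffled)
        changed = sum(a != b for a, b in zip(residues, shuffled))
        if changed >= min_changed * len(residues):
            return shuffled
    raise ValueError("could not reach the requested mismatch level")

native_seq = "LEU-LYS-GLU-LEU-GLU-LYS-GLU-LEU-GLU-LYS-GLU-LEU-GLU-LYS-GLU-LEU".split("-")
decoy_seq = shuffle_labels(native_seq)

n_changed = sum(a != b for a, b in zip(native_seq, decoy_seq))
print(f"{n_changed} of {len(native_seq)} labels changed")
```

The decoy keeps the same amino acid composition as the native, so a model cannot separate the two by composition alone.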

In [None]:
import random

def create_shuffled_decoy(batch):
    original_seq = batch.sequence
    shuffled_seq = original_seq.copy()
    random.shuffle(shuffled_seq)
    
    print(f"Native Sequence:  {' '.join(original_seq[:8])}...")
    print(f"Shuffled Decoy:  {' '.join(shuffled_seq[:8])}...")
    return shuffled_seq

shuffled_labels = create_shuffled_decoy(native)
print("\n✅ This structural data now points to a nonsensical chemical identity.")

### 🔬 Strategic Insight: Residue-to-Structure Mismatch
Imagine training a model to predict side-chain orientations (Rotamers). If you provide a Shuffled Decoy, the backbone will suggest a tiny Glycine spot, but the label will say "Tryptophan". 

This forces the model to learn that **Backbone Geometry must match Sidechain Chemistry**.

## Strategy 3: Sequence Threading (Fold Mismatch)

**Objective**: Force a sequence onto a fold it cannot naturally adopt.

Example: Threading a **Poly-Glycine** sequence onto the backbone of a **Poly-Tryptophan** alpha helix.
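
One way to apply a threaded sequence to PDB text is to rewrite the residue-name columns and keep only backbone atoms. The sketch below does this in plain Python. The helpers `atom_line` and `thread_sequence` are hypothetical, the toy coordinates are placeholders rather than real geometry, and this is not a synth_pdb feature.

```python
BACKBONE_ATOMS = {"N", "CA", "C", "O"}

def atom_line(serial, name, resname, resseq, x, y, z):
    """Build a minimal fixed-width PDB ATOM record for the toy example."""
    return (f"ATOM  {serial:5d}  {name:<3} {resname:>3} A{resseq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")

def thread_sequence(pdb_text, new_residues):
    """Relabel residues in order of appearance and drop non-backbone atoms."""
    out, res_index, last_key = [], -1, None
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            out.append(line)
            continue
        key = (line[21], line[22:27])      # chain ID + residue number/insertion code
        if key != last_key:                # a new residue starts here
            res_index += 1
            last_key = key
        if line[12:16].strip() not in BACKBONE_ATOMS:
            continue                       # skip template side-chain atoms
        # Columns 18-20 of a PDB ATOM record hold the residue name
        out.append(line[:17] + f"{new_residues[res_index]:>3}" + line[20:])
    return "\n".join(out)

# Toy template: two TRP residues with placeholder coordinates
template = "\n".join(
    atom_line(i * 5 + j + 1, name, "TRP", i + 1, float(j), float(i), 0.0)
    for i in range(2)
    for j, name in enumerate(["N", "CA", "C", "O", "CB"])
)
threaded = thread_sequence(template, ["GLY", "GLY"])
print(threaded)
```

Keeping only the backbone suits a glycine target, which has no side chain. Threading onto larger residues would also require rebuilding side chains, which this sketch does not attempt.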

In [None]:
template_seq = "TRP-TRP-TRP-TRP-TRP-TRP-TRP-TRP-TRP"
thread_seq = "GLY-GLY-GLY-GLY-GLY-GLY-GLY-GLY-GLY"

generator = BatchedGenerator(template_seq, n_batch=1, full_atom=True)
batch = generator.generate_batch()

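print(f"Backbone generated for Template: {template_seq}")
# Note: only the template backbone is built here; thread_seq is not applied to the structure in this cell
print(f"Decoy sequence to thread onto it: {thread_seq}")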

view = py3Dmol.view(width=400, height=300)
view.addModel(batch.to_pdb(0), 'pdb')
view.setStyle({'stick': {'radius': 0.15}, 'cartoon': {'color': 'spectrum'}})
view.zoomTo()
view.show()

### 🏆 The Challenge: Mass Dataset Generation

In a production pipeline, you would use these strategies to generate millions of rows:

```python
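# Mock training loop: alternate native and decoy batches
# (assumes `generator` is a BatchedGenerator created as in the cells above)
for i in range(1000):
    is_native = (i % 2 == 0)
    drift = 0.0 if is_native else 10.0
    label = 1.0 if is_native else 0.0  # Natives score 1.0, Decoys score 0.0
    data = generator.generate_batch(drift=drift)
    # Feed (data, label) to GNN/Transformer...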
```

By generating hard decoys on the fly, you get an effectively unlimited stream of diverse training data, which helps keep your model from overfitting! 🚀