![Dataset Factory Hero](assets/dataset_hero.png)

# 📁 Bulk Dataset Factory: Beyond the Memory Wall 🤖

**Objective**: Master the transition from "Single-Protein" bioinformatics to "Tensor-Driven" AI research.

### 🧠 The Educational Mindset Shift
Traditional structural biology focuses on the **PDB File**: a static, human-readable text record. Modern AI (like AlphaFold 3 or ESMFold) requires a **Tensor**: a dense, multi-dimensional array of numbers.

In this lab, we break through the **"Memory Wall"**: the bottleneck where AI models spend more time *reading files* than actually *learning biology*.
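
To make the bottleneck concrete, here is a minimal sketch comparing a text parse against a binary load of the same coordinates. It uses only NumPy and a temporary directory. The array shape, the file names, and the helper `compare_load_times` are illustrative assumptions rather than part of `synth_pdb`, and the timings will differ from machine to machine.

```python
import tempfile
import time
from pathlib import Path

import numpy as np


def compare_load_times(n_structures=200, n_atoms=28, seed=0):
    """Time a text parse vs. a binary load of the same toy coordinates."""
    rng = np.random.default_rng(seed)
    coords = rng.normal(size=(n_structures, n_atoms, 3)).astype(np.float32)

    with tempfile.TemporaryDirectory() as tmp:
        text_path = Path(tmp) / "coords.txt"
        bin_path = Path(tmp) / "coords.npy"

        # Text stand-in for PDB-style records: one "x y z" line per atom
        np.savetxt(text_path, coords.reshape(-1, 3), fmt="%.3f")
        # Binary version: raw float32 values plus a small header
        np.save(bin_path, coords)

        t0 = time.perf_counter()
        from_text = np.loadtxt(text_path, dtype=np.float32).reshape(coords.shape)
        text_s = time.perf_counter() - t0

        t0 = time.perf_counter()
        from_bin = np.load(bin_path)
        bin_s = time.perf_counter() - t0

    return coords, from_text, from_bin, text_s, bin_s


coords, from_text, from_bin, text_s, bin_s = compare_load_times()
print(f"text parse: {text_s * 1e3:.2f} ms | binary load: {bin_s * 1e3:.2f} ms")
```

The text route has to convert every number from characters back into floats, and it only recovers the values to the three decimals that were written. The binary route reads the bytes as they are.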

**We will cover:**
1. **Vectorized Generation**: Producing 10,000 unique structures in milliseconds.
2. **The Tensor Envelope**: Visualizing the structural diversity of your dataset.
3. **Zero-Copy NPZ Pipelines**: Feeding binary data directly into high-performance GPUs.
4. **PyTorch Integration**: Building a production-ready `DataLoader`.

In [None]:
# @title Setup & Installation { display-mode: "form" }
import os
import sys
from pathlib import Path

# Ensure the local synth_pdb source code is prioritized if running from the repo
try:
    current_path = Path(".").resolve()
    repo_root = current_path.parent.parent 
    if (repo_root / "synth_pdb").exists():
        if str(repo_root) not in sys.path:
            sys.path.insert(0, str(repo_root))
            print(f"📌 Added local library to path: {repo_root}")
except Exception:
    pass

if 'google.colab' in str(get_ipython()):
    if not os.path.exists("installed.marker"):
        print("Running on Google Colab. Installing dependencies...")
        get_ipython().run_line_magic('pip', 'install synth-pdb torch numpy matplotlib py3Dmol')
        
        with open("installed.marker", "w") as f:
            f.write("done")
        
        print("🔄 Installation complete. KERNEL RESTARTING AUTOMATICALLY...")
        print("⚠️ Please wait 10 seconds, then Run All Cells again.")
        os.kill(os.getpid(), 9)
    else:
        print("✅ Dependencies Ready.")
else:
    import synth_pdb
    print(f"✅ Running locally. Using synth-pdb version: {synth_pdb.__version__} from {synth_pdb.__file__}")

In [None]:
import numpy as np
import time
import torch
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt
import py3Dmol
from synth_pdb.batch_generator import BatchedGenerator, BatchedPeptide

print("Libraries Loaded. Accelerating with PyTorch! 🚀")

## 1. High-Speed Generation: 10,000 Structures

We leverage **Numba-optimized vectorization**. Instead of generating one CA atom at a time, we treat the entire batch as a single 3D tensor operation.
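
The idea of batch-level vectorization can be illustrated without `synth_pdb`. The sketch below builds toy CA traces as random walks with a fixed 3.8 Å step, once with nested Python loops and once as a single cumulative sum over a `(batch, length, 3)` array. It is a stand-in for illustration only: the internals of `BatchedGenerator` are not shown in this notebook, and `random_unit_steps`, `chains_loop`, and `chains_vectorized` are hypothetical names.

```python
import numpy as np

CA_STEP = 3.8  # typical CA-CA distance in angstroms


def random_unit_steps(n_batch, n_residues, seed=0):
    """Random unit vectors, one per bond, for every chain in the batch."""
    rng = np.random.default_rng(seed)
    steps = rng.normal(size=(n_batch, n_residues - 1, 3))
    return steps / np.linalg.norm(steps, axis=-1, keepdims=True)


def chains_loop(steps):
    """Reference version: place one atom at a time, one chain at a time."""
    n_batch, n_steps, _ = steps.shape
    coords = np.zeros((n_batch, n_steps + 1, 3))
    for b in range(n_batch):
        for i in range(n_steps):
            coords[b, i + 1] = coords[b, i] + CA_STEP * steps[b, i]
    return coords


def chains_vectorized(steps):
    """Same result as chains_loop, as one cumulative sum over the batch."""
    origin = np.zeros((steps.shape[0], 1, 3))
    return np.concatenate([origin, np.cumsum(CA_STEP * steps, axis=1)], axis=1)


steps = random_unit_steps(n_batch=256, n_residues=28)
coords_vec = chains_vectorized(steps)
coords_loop = chains_loop(steps)
print(coords_vec.shape, np.allclose(coords_vec, coords_loop))
```

Both functions produce the same array. The vectorized one hands the loop over atoms and chains to compiled code, which is where the speedup comes from.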

In [None]:
n_samples = 10000
# FIX: Use explicit hyphenation for the whole sequence to avoid 'METALA' merging errors
sequence = "-".join(["ALA-GLY-SER-LEU-VAL-ILE-MET"] * 4) # 28 residues

print(f"🚀 Generating {n_samples} structures...")
# perf_counter is a monotonic, high-resolution clock suited to short benchmarks
start = time.perf_counter()

generator = BatchedGenerator(sequence, n_batch=n_samples, full_atom=False)
batch = generator.generate_batch(drift=5.0)

elapsed = time.perf_counter() - start
print(f"✅ Done! {n_samples} structures generated in {elapsed:.3f}s")
print(f"Throughput: {n_samples/elapsed:.0f} structures/sec")

## 2. Visualizing Structural Diversity (Statistical Plot)

A dataset is only as good as its **diversity**. If all 10,000 structures look the same, the model learns nothing. Let's visualize the "Atomic Variance" across our batch.
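
The variance reduction is easier to follow on toy data. This sketch assumes coordinates are laid out as `(n_batch, n_positions, 3)`, which is how the next cell treats `batch.coords`. The random-walk ensemble and the function name `positional_variance` are illustrative assumptions.

```python
import numpy as np


def positional_variance(coords):
    """Per-position variance across the batch, averaged over x, y, z."""
    per_axis = np.var(coords, axis=0)  # shape: (n_positions, 3)
    return per_axis.mean(axis=1)       # shape: (n_positions,)


rng = np.random.default_rng(0)
# Toy ensemble: each chain accumulates independent unit-variance steps,
# so positions far from the start differ more from chain to chain.
steps = rng.normal(size=(2000, 28, 3))
toy_coords = np.cumsum(steps, axis=1)

profile = positional_variance(toy_coords)
print(profile.shape, profile[0] < profile[-1])
```

For a random walk like this one, the spread grows with distance from the start of the chain, so the profile rises toward the tail.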

In [None]:
# Per-position variance across the batch (axis 0), then averaged over x, y, z.
# Assumes batch.coords has shape (n_batch, n_positions, 3).
variance = np.var(batch.coords, axis=0).mean(axis=1)

plt.figure(figsize=(10, 5))
plt.plot(variance, color='#667eea', linewidth=3, label="Positional Variance")
plt.fill_between(range(len(variance)), variance, alpha=0.2, color='#667eea')
plt.title("The Entropy Profile: Data Diversity across the Chain")
plt.xlabel("Residue Number")
plt.ylabel("Variance (Å²)")
plt.grid(alpha=0.3)
plt.legend()
plt.show()

print("Educational Insight: Notice how variance typically increases at the 'tail' of the peptide?")
print("This is the 'Propagating Error' of structural drift, a key feature for generating negative samples.")

## 3. Interactive 3D Ensemble View

Let's overlay the first 5 structures in the batch to see the "Envelope" of noise we've created.
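
Before overlaying, each structure is stripped of padding rows and shifted so that its centroid sits at the origin. A minimal NumPy sketch of that step follows, assuming padding atoms are stored as all-zero rows (the same convention the next cell checks for). `center_structure` is a hypothetical helper, not a `synth_pdb` function.

```python
import numpy as np


def center_structure(coords, tol=1e-4):
    """Drop all-zero (padding) rows, then move the centroid to the origin."""
    mask = np.any(np.abs(coords) > tol, axis=1)
    kept = coords[mask]
    if kept.size == 0:
        return kept, mask
    return kept - kept.mean(axis=0), mask


raw = np.array([
    [10.0, 2.0, 1.0],
    [12.0, 4.0, 1.0],
    [0.0, 0.0, 0.0],  # padding row
    [14.0, 6.0, 4.0],
])
centered, mask = center_structure(raw)
print(mask.tolist())
print(centered)
```

One caveat of this convention: a real atom that happens to sit exactly at the origin would be discarded along with the padding.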

In [None]:
try:
    view = py3Dmol.view(width=800, height=400)
    view.setBackgroundColor("#fdfdfd")
    colors = ["#ff9999", "#66b3ff", "#99ff99", "#ffcc99", "#c2c2f0"]

    for i in range(5):
        # 1. Clean and mask coordinates with strict zero-tolerance
        c = batch.coords[i].copy()
        mask = np.any(np.abs(c) > 1e-4, axis=1) # Strip zeros and ghost atoms
        c_clean = c[mask]
        
        if len(c_clean) == 0: continue
        
        # 2. Individual Centering (Per-Model Anchor)
        # Using CA centroid for much better stability than min/max
        ca_idxs = [j for j, name in enumerate(batch.atom_names) if name == "CA"]
        valid_ca = [idx for idx in ca_idxs if mask[idx]]
        if valid_ca:
            center = c[valid_ca].mean(axis=0)
        else:
            center = c_clean.mean(axis=0)
            
        c_centered = c_clean - center
        
        p_tmp = BatchedPeptide(
            c_centered[np.newaxis, ...], 
            batch.sequence, 
            np.array(batch.atom_names)[mask].tolist(), 
            np.array(batch.residue_indices)[mask].tolist()
        )
        
        view.addModel(p_tmp.to_pdb(0), 'pdb')
        # HIGH-VISIBILITY STYLE: Large Spheres (scale 0.3) + Thick Sticks
        # Select the model just added (-1) so styles stay aligned if an empty structure was skipped
        view.setStyle({'model': -1}, {
            "cartoon": {"color": colors[i], "opacity": 0.5}, 
            "stick": {"color": colors[i], "radius": 0.3}, 
            "sphere": {"color": colors[i], "scale": 0.3}
        })

    # 3. Aggressive manual zoom targeting model 0 to ensure viewport is filled
    view.zoomTo({'model': 0})
    view.zoom(2.0)
    view.center()
    view.show()
    
    # Diagnostic Info to prove sanity
    print("✅ Ensemble Visualized with PDB Column-Shift Guard.")
    print(f"Residue 1 Name: '{batch.sequence[0]}' | Residue 7 Name: '{batch.sequence[6]}'")
    
except Exception as e:
    print(f"3D Viewer Error: {e}")

## 4. Binary Export (NPZ) vs. Legacy Text (PDB)

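Why save to NPZ? It's not just about size: binary arrays skip text parsing entirely, and `torch.from_numpy` wraps the loaded array without copying it (**Zero-Copy** on the NumPy-to-PyTorch hop). The compressed archive itself is still decompressed into memory on load.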
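
How much copying happens also depends on the file format. As a hedged sketch: NumPy can memory-map an uncompressed `.npy` file with `mmap_mode="r"`, so slices are read from disk only when they are touched. The file name and array shape below are placeholders, not part of this notebook's pipeline.

```python
import os
import shutil
import tempfile

import numpy as np

rng = np.random.default_rng(0)
coords = rng.normal(size=(1000, 28, 3)).astype(np.float32)

tmp_dir = tempfile.mkdtemp()
npy_path = os.path.join(tmp_dir, "coords.npy")
np.save(npy_path, coords)  # uncompressed, so it can be memory-mapped

# Open lazily: this returns a np.memmap and does not read the whole file
lazy = np.load(npy_path, mmap_mode="r")
is_memmap = isinstance(lazy, np.memmap)

# Copy only the first 64 structures into RAM
first_batch = np.array(lazy[:64])
print(type(lazy).__name__, first_batch.shape)

# Release the mapping before deleting the temporary file
del lazy
shutil.rmtree(tmp_dir, ignore_errors=True)
```

This trades disk space for lazy access. It matters most when the dataset is larger than available RAM.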

In [None]:
os.makedirs("dataset_factory", exist_ok=True)
dataset_path = "dataset_factory/batch_001.npz"

print("Saving to compressed NPZ...")
np.savez_compressed(
    dataset_path,
    coords=batch.coords,
    sequence=np.array([sequence] * n_samples)
)

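# Benchmark Loading
start_npz = time.perf_counter()
# The context manager closes the archive once 'coords' has been read
with np.load(dataset_path) as data:
    tensor_npz = torch.from_numpy(data['coords'])
npz_time = time.perf_counter() - start_npz

print(f"✅ NPZ Load ({n_samples} samples): {npz_time:.4f}s")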

## 5. Production PyTorch DataLoader

The final piece of the pipeline is the `DataLoader`, which handles batching, shuffling, and (when `num_workers` is greater than 0) parallel loading in worker processes.
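
Conceptually, batching and shuffling amount to drawing a random permutation of indices and slicing it into chunks. The NumPy generator below is a simplified illustration of that idea, not how `DataLoader` is implemented (it leaves out worker processes, collation, and pinned memory). `iterate_minibatches` is a hypothetical name.

```python
import numpy as np


def iterate_minibatches(data, batch_size, shuffle=True, seed=0):
    """Yield (indices, batch) pairs that cover the data exactly once."""
    n = len(data)
    if shuffle:
        order = np.random.default_rng(seed).permutation(n)
    else:
        order = np.arange(n)
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        yield idx, data[idx]


# Toy dataset: 10 structures, 28 positions, xyz
toy = np.arange(10 * 28 * 3, dtype=np.float32).reshape(10, 28, 3)
batches = list(iterate_minibatches(toy, batch_size=4))
print([b.shape for _, b in batches])
```

The last batch is smaller whenever the dataset size is not a multiple of `batch_size`. `DataLoader` behaves the same way unless `drop_last=True` is set.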

In [None]:
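class SyntheticProteinDataset(Dataset):
    """In-memory dataset of coordinate tensors loaded from an NPZ archive."""

    def __init__(self, npz_path):
        # Read the whole 'coords' array once and close the archive afterwards
        with np.load(npz_path) as data:
            self.coords = torch.from_numpy(data['coords']).float()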
        
    def __len__(self):
        return len(self.coords)
        
    def __getitem__(self, idx):
        return self.coords[idx]

ds = SyntheticProteinDataset(dataset_path)
loader = DataLoader(ds, batch_size=64, shuffle=True)

sample_batch = next(iter(loader))
print(f"Success! Batch Shape: {sample_batch.shape} (Ready for Neural Network training)")

### 🏆 Next Steps
1. Modify the `drift` parameter in Section 1. How does it change the **Variance Plot** in Section 2?
2. Try generating a batch with `full_atom=True`. How does it affect the NPZ file size?

Mastering the **Data Plane** is 80% of successful AI engineering. Now go build some biology! 🧬🤖