![Orientogram Lab Hero](assets/orientogram_hero.png)

# 📐 Geometry Factory: The trRosetta 6D Orientogram ⚛️

**Objective**: Understand the "Inter-Residue Reference Frame" and how we translate 3D protein folds into 6-dimensional mathematical signatures for AI training.

### 🌟 Why aren't 1D distances enough?
Traditional structural models (like early GNNs) often relied solely on **Distance Maps**. While helpful, distance only tells you how close two points are; it doesn't tell you their relative "twist" or orientation.

---

## 🌟 The Philosophy: From Distances to Frames

The **trRosetta** (transform-restrained Rosetta) paper advanced AI structure prediction by describing every pair of residues with **6D relative orientations** rather than distances alone. This allows models to learn the complex 3D assembly of helices and sheets with much higher precision.

Imagine trying to describe a dance to someone. If you only tell them the **distance** between the dancers' feet, they can't see the full performance. They don't know if the dancers are facing each other, looking away, or leaning in. They are missing the **Orientations**.

In structural biology, residues (amino acids) are like those dancers. 
1. **Early AI (Coarse-Grained)**: Used simple **Distance Maps** ($L \times L$). They knew where the residues were, but not how they "faced" each other.
2. **Modern AI (Frame-Based)**: Uses **6D Orientations**. Every residue is treated as a "Rigid Body" or local coordinate frame. By measuring the 6 relative values between every pair of residues, we capture the full 3D assembly with mathematical completeness.

This notebook demonstrates how **synth-pdb** generates these advanced descriptors, which powered the breakthroughs in models like **trRosetta** and **AlphaFold**.

In [None]:
# @title Setup & Installation { display-mode: "form" }
import os
import sys
from pathlib import Path

try:
    current_path = Path(".").resolve()
    repo_root = current_path.parent.parent 
    if (repo_root / "synth_pdb").exists():
        if str(repo_root) not in sys.path:
            sys.path.insert(0, str(repo_root))
            print(f"📌 Added local library to path: {repo_root}")
except Exception:
    pass

if 'google.colab' in str(get_ipython()):
    if not os.path.exists("installed.marker"):
        print("Running on Google Colab. Installing dependencies...")
        get_ipython().run_line_magic('pip', 'install synth-pdb numpy matplotlib py3Dmol biotite')
        
        with open("installed.marker", "w") as f:
            f.write("done")
        
        print("🔄 Installation complete. KERNEL RESTARTING AUTOMATICALLY...")
        os.kill(os.getpid(), 9)
    else:
        print("✅ Dependencies Ready.")
else:
    import synth_pdb
    print(f"✅ Running locally. Using synth-pdb version: {synth_pdb.__version__}")


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from synth_pdb.batch_generator import BatchedGenerator
from synth_pdb.orientogram import compute_6d_orientations

print("Geometric kernels loaded. Ready to compute orientograms. 📐")


## 1. Defining the 6D Descriptors

For any two residues $i$ and $j$, we define the orientation of residue $j$ relative to $i$ using their $C\alpha$ and $C\beta$ positions (plus $N$ to fix the rotation). We calculate 4 primary tensors:

1. **$d$ (Distance)**: The straight-line distance between $C\beta_i$ and $C\beta_j$. This is the foundation of the "Contact Map".
2. **$\omega$ (Omega)**: The dihedral (twist) angle $C\alpha_i - C\beta_i - C\beta_j - C\alpha_j$. It tells us how the backbones of the two residues are rotated relative to each other.
3. **$\theta$ (Theta)**: The plane angle $C\alpha_i - C\beta_i - C\beta_j$. It describes how residue $i$ "looks at" residue $j$.
4. **$\phi$ (Phi)**: The polar dihedral $N_i - C\alpha_i - C\beta_i - C\beta_j$. It anchors the orientation to the backbone's local coordinate system.

$d$ and $\omega$ come out the same whichever residue is taken as $i$, whereas $\theta$ and $\phi$ change when $i$ and $j$ are swapped. Counting both directions, each pair carries six values ($d$, $\omega$, $\theta_{ij}$, $\theta_{ji}$, $\phi_{ij}$, $\phi_{ji}$), which is where the "6D" comes from.

Let's generate a peptide in a **Beta Sheet** fold, where these angles are particularly well-defined and structured, and analyze its geometric footprint.
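Before running the generator, it can help to see the four definitions above worked out by hand for a single residue pair. The following is a minimal NumPy sketch, not the synth-pdb implementation: the helper names `planar_angle`, `dihedral`, and `pair_descriptors` are hypothetical, and the coordinates are toy values chosen so the answers are easy to verify, not realistic bond geometry.

```python
import numpy as np

def planar_angle(a, b, c):
    """Angle at b (degrees) formed by points a-b-c, in [0, 180]."""
    u = a - b
    v = c - b
    cos_t = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0)))

def dihedral(a, b, c, d):
    """Signed dihedral (degrees) of a-b-c-d about the b-c axis."""
    b0 = a - b
    b1 = c - b
    b2 = d - c
    b1 = b1 / np.linalg.norm(b1)
    # Project the outer bonds onto the plane perpendicular to the b-c axis.
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.degrees(np.arctan2(y, x))

def pair_descriptors(n_i, ca_i, cb_i, ca_j, cb_j):
    """Return (d, omega, theta, phi) for residue j as seen from residue i."""
    d = np.linalg.norm(cb_j - cb_i)
    omega = dihedral(ca_i, cb_i, cb_j, ca_j)
    theta = planar_angle(ca_i, cb_i, cb_j)
    phi = dihedral(n_i, ca_i, cb_i, cb_j)
    return d, omega, theta, phi

# Toy coordinates (hypothetical, axis-aligned so the result is obvious).
n_i = np.array([1.0, 1.0, 0.0])
ca_i = np.array([1.0, 0.0, 0.0])
cb_i = np.array([0.0, 0.0, 0.0])
cb_j = np.array([0.0, 0.0, 5.0])
ca_j = np.array([0.0, 1.0, 5.0])

d, omega, theta, phi = pair_descriptors(n_i, ca_i, cb_i, ca_j, cb_j)
print(f"d={d:.1f} omega={omega:.1f} theta={theta:.1f} phi={phi:.1f}")
# d=5.0 omega=90.0 theta=90.0 phi=-90.0
```

Using `arctan2` on the projected vectors keeps the sign of each dihedral, which is why $\omega$ and $\phi$ can span the full circle while the plane angle $\theta$ stays between 0 and 180 degrees.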

In [None]:
sequence = "ALA-VAL-LEU-ILE-SER-GLY-MET-TRP" * 4 # 32 residues
generator = BatchedGenerator(sequence, n_batch=1, full_atom=False) # Backbone only
batch = generator.generate_batch(conformation='beta') # Beta sheets have distinctive 6D signals

print("Structure Batch Generated.")
print("Computing 6D Orientations...")

orientations = batch.get_6d_orientations()
print(f"Orientations computed for {batch.n_residues} residues.")
print(f"Tensors available: {list(orientations.keys())}")


## 2. Visualizing the Orientogram

Let's look at the 4 primary descriptors for batch member 0:
1. **Distance ($d$)**: $C\beta - C\beta$ Euclidean distance.
2. **$\omega$**: The $C\alpha - C\beta - C\beta - C\alpha$ twist dihedral between the two residues.
3. **$\theta$**: The plane angle at $C\beta_i$ that points residue $i$ toward residue $j$.
4. **$\phi$**: The polar dihedral that ties that direction to residue $i$'s backbone.

Below we plot the four tensors as $L \times L$ heatmaps.

**Educational Insight**: Note how the 6D tensors capture the diagonal structure of the fold differently than a simple distance map.

### How to read these "Images":
- **Symmetry**: $d$ is symmetric ($dist_{i,j} = dist_{j,i}$), and so is $\omega$, because reversing the four atoms of a dihedral leaves its value unchanged. $\theta$ and $\phi$ are defined relative to the "source" residue $i$, so their maps are generally not symmetric.
- **Regularity**: The dashed patterns you see are the hallmark of real protein physics. Beta sheets create rhythmic, staggered patterns in these maps because of the alternating "up-down" nature of the amino acid sidechains in a sheet.
- **AI Readiness**: For a Computer Vision model (like a CNN), these are 4 "channels" (like Red, Green, Blue) that together describe the pairwise geometry of the fold.
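The symmetry point can be checked numerically. The sketch below uses random stand-in coordinates rather than a synth-pdb structure (the arrays `ca` and `cb` are made up for illustration) and builds a distance map and a $\theta$ map with NumPy broadcasting, then compares each map with its transpose.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 8
# Random stand-in coordinates; not real protein geometry.
ca = rng.normal(size=(L, 3)) * 5.0
cb = ca + rng.normal(size=(L, 3))

# diff[i, j] is the vector from Cb_i to Cb_j, shape (L, L, 3).
diff = cb[None, :, :] - cb[:, None, :]
dist = np.linalg.norm(diff, axis=-1)

# theta[i, j]: angle at Cb_i between Ca_i and Cb_j (plane angle, degrees).
u = (ca - cb)[:, None, :]                      # Cb_i -> Ca_i, shape (L, 1, 3)
safe_dist = np.where(dist > 0, dist, 1.0)      # avoid dividing by zero on the diagonal
cos_t = np.sum(u * diff, axis=-1) / (np.linalg.norm(u, axis=-1) * safe_dist)
theta = np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0)))
np.fill_diagonal(theta, 0.0)                   # the angle is undefined for i == j

print("dist symmetric: ", np.allclose(dist, dist.T))
print("theta symmetric:", np.allclose(theta, theta.T))
```

The distance map equals its transpose, while the $\theta$ map does not, because the angle is measured at the source residue's $C\beta$.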

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 11))
plt.subplots_adjust(hspace=0.3, wspace=0.2)

# Using raw strings (r"...") to ensure Python 3.12 compatibility with LaTeX
titles = {
    'dist': r'A. Distance Map (C-beta) ($d$) [$\AA$]', 
    'omega': r'B. Omega Dihedral (Torsion) ($\omega$) [$^\circ$]',
    'theta': r'C. Theta Angle (Plane) ($\theta$) [$^\circ$]',
    'phi': r'D. Phi Dihedral (Polar) ($\phi$) [$^\circ$]'
}
cmaps = {'dist': 'viridis_r', 'omega': 'hsv', 'theta': 'magma', 'phi': 'twilight'}

for i, key in enumerate(['dist', 'omega', 'theta', 'phi']):
    ax = axes[i // 2, i % 2]
    data = orientations[key][0] # First batch member
    
    if key == 'dist':
        im = ax.imshow(data, cmap=cmaps[key], vmax=15.0) # Cap distance for visual clarity
    else:
        # Angles are periodic, so pin the color scale to the full -180 to 180 degree range
        im = ax.imshow(data, cmap=cmaps[key], vmin=-180, vmax=180)
        
    ax.set_title(titles[key], fontweight='bold')
    fig.colorbar(im, ax=ax, shrink=0.8)

plt.suptitle("The 6D Orientogram: A 'Computer Vision' View of Protein Structure", fontsize=16, y=0.95)
plt.show()

## 3. Handling the "Invisible" Residue: Glycine

In 6D geometry, you **must** have a $C\beta$ atom to define the residue's orientation frame. But there's a problem: **Glycine (GLY)** has no $C\beta$! It is the only amino acid whose side chain is just a single Hydrogen atom, yet AI models require a consistent $C\beta$ node for every residue to maintain a rigid frame.

### The Fix: Virtual Reconstruction
AI models solve this by reconstructing a **Virtual C-beta**. Even though it's not physically there in Glycine, we can calculate where it *would* be if Glycine were an L-Alanine.

**synth-pdb** automatically reconstructs the "Ideal L-Alanine Position" for any Glycine in your sequence. It uses the positions of $N, C\alpha,$ and $C$ to "project" the virtual $C\beta$ into space using ideal geometry. This ensures your data tensors are always contiguous, complete, and compatible with model requirements, even for highly flexible Glycine-rich loops.
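As an illustration of the projection idea, here is a minimal sketch of one widely used approximation that places the virtual $C\beta$ as a fixed linear combination of the backbone bond vectors and their cross product. Whether synth-pdb uses these exact coefficients is an assumption; the function name `virtual_cb` and the example backbone coordinates are hypothetical.

```python
import numpy as np

def virtual_cb(n_xyz, ca_xyz, c_xyz):
    """Estimate an ideal C-beta position from N, CA and C coordinates.

    Accepts arrays of shape (3,) or (..., 3). The coefficients are a
    commonly used ideal-geometry fit, assumed here for illustration.
    """
    b = ca_xyz - n_xyz            # N -> CA bond vector
    c = c_xyz - ca_xyz            # CA -> C bond vector
    a = np.cross(b, c)            # normal to the N-CA-C plane
    return -0.58273431 * a + 0.56802827 * b - 0.54067466 * c + ca_xyz

# Example backbone in a local frame with CA at the origin (illustrative values).
n = np.array([-0.525, 1.363, 0.0])
ca = np.array([0.0, 0.0, 0.0])
c = np.array([1.526, 0.0, 0.0])

cb = virtual_cb(n, ca, c)
print("virtual CB:", np.round(cb, 3))
print("CA-CB distance:", round(float(np.linalg.norm(cb - ca)), 2))
```

The cross-product term pushes the virtual atom out of the $N - C\alpha - C$ plane, and the resulting $C\alpha - C\beta$ distance comes out close to the usual 1.53 Å bond length.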

In [None]:
gly_res_idx = [i for i, r in enumerate(batch.sequence) if r == "GLY"]
print(f"Analyzing Glycine at indices: {gly_res_idx}")

# Look at distance to neighboring residues for a GLY entry
for idx in gly_res_idx[:1]:
    dist_row = orientations['dist'][0, idx, :]
    print(f"\n🔎 Virtual C-beta mapping for GLY {idx+1}:")
    print(f"Distances to neighbors: {dist_row[max(0, idx-2):min(idx+3, len(dist_row))]}")
    print("Note how the values are consistent with the rest of the chain!")
print("✅ Virtual C-beta mapping successful.")


## 4. Why does this exist?

This pipeline exists because **predicting 3D coordinates directly is HARD**, while **predicting 2D tensors is FAST** and far more tractable for a neural network.

1. **Training AI**: We generate millions of such tensors from synthetic PDBs. The AI learns the "Language" of these heatmaps.
2. **Prediction**: When we give the AI a new sequence, it predicts these heatmaps. 
3. **Recovery**: We then use a process called "Minimization" or "Folding" to reconstruct the 3D structure that best fits those predicted 6D heatmaps.

By providing these descriptors, **synth-pdb** allows you to bench-test the entire lifecycle of an AI model, from data production to descriptor analysis.

---

### 🏆 Experiment for the User
1. **Ensemble Variance**: Generate a batch with `drift=10.0` and plot the *Standard Deviation* of the distance maps. 📉
2. **Feature Engineering**: Standardize these tensors (e.g. `log(dist)`) to prepare them as direct inputs for a **Convolutional Neural Network** (CNN) classifier.
3. **Alpha vs. Beta**: Try generating structures with `conformation='alpha'` instead of `'beta'`. **Predict**: How will the distance map change? (Hint: Alpha helices stay closer to their immediate neighbors, creating a thick diagonal line!)
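As a starting point for the first two experiments, here is a minimal sketch that runs on random stand-in arrays with the same `(batch, L, L)` layout as the `orientations` dictionary above. The values are random, not real structures. The helper `to_cnn_channels`, the choice of `log1p` over a plain `log` (to keep the zero diagonal finite), and the sine/cosine encoding of the angles are illustrative assumptions, not part of synth-pdb.

```python
import numpy as np

rng = np.random.default_rng(42)
B, L = 4, 16

# Stand-in tensors shaped (batch, L, L); angles are in degrees.
orient = {
    "dist": rng.uniform(3.8, 20.0, size=(B, L, L)),
    "omega": rng.uniform(-180.0, 180.0, size=(B, L, L)),
    "theta": rng.uniform(0.0, 180.0, size=(B, L, L)),
    "phi": rng.uniform(-180.0, 180.0, size=(B, L, L)),
}

# Experiment 1: per-pair spread of the distance across the ensemble.
dist_std = orient["dist"].std(axis=0)          # shape (L, L)

# Experiment 2: turn the four maps into CNN-friendly channels.
def to_cnn_channels(o):
    feats = [np.log1p(o["dist"])]              # compress the long-distance tail
    for key in ("omega", "theta", "phi"):
        rad = np.radians(o[key])
        # sin/cos removes the artificial jump between -180 and 180 degrees.
        feats += [np.sin(rad), np.cos(rad)]
    return np.stack(feats, axis=1)             # shape (B, 7, L, L)

x = to_cnn_channels(orient)
print(dist_std.shape, x.shape)
```

With real data, `dist_std` can be passed straight to `plt.imshow` to see which residue pairs vary most across the ensemble.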

You are now extracting the rich 3D information that powers modern structural AI. **The structural signatures are yours to explore. Happy building.** 🧬📐🤖