# 🌌 AI Latent Space Explorer
### Visualizing How Protein AI Models "See" Structural Diversity

---

## 🎯 What You'll Learn

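Protein structure prediction models such as **trRosetta** and the first-generation **AlphaFold** don't predict 3D coordinates directly (AlphaFold2 later added an end-to-end structure module, but it still reasons over a pairwise residue representation). Instead, they: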
1. Predict **inter-residue geometry** (distances and orientations)
2. Use these 2D "maps" to reconstruct 3D structure

**In this tutorial:**
- 🚀 Generate 500 protein conformations in parallel using `BatchedGenerator`
- 📐 Compute **6D trRosetta orientograms** (the AI's "view" of structure)
- 🎨 Use **PCA** to visualize the high-dimensional "latent space" in 2D
- 🔍 Explore individual structures from the latent space galaxy

> **💡 Why This Matters**: Understanding how AI models represent proteins is crucial for:
> - Protein structure prediction
> - Generative protein design
> - Transfer learning in structural biology

---

In [None]:
# 🔧 Environment Detection & Setup
import sys, os
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print('🌐 Running in Google Colab')
    try:
        import synth_pdb
        print('   ✅ synth-pdb already installed')
    except ImportError:
        print('   📦 Installing synth-pdb...')
        !pip install -q synth-pdb
        print('   ✅ Installation complete')
    import plotly.io as pio
    pio.renderers.default = 'colab'
else:
    print('💻 Running in local Jupyter environment')
    sys.path.append(os.path.abspath('../../'))

print('✅ Environment configured!')

In [None]:
from IPython.display import display, HTML, clear_output
import ipywidgets as widgets
from ipywidgets import interact, IntSlider
import numpy as np
import py3Dmol
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import plotly.graph_objects as go
from synth_pdb import PeptideGenerator
from synth_pdb.batch_generator import BatchedGenerator
import biotite.structure as struc

print('✅ Latent Space Explorer Ready!')

## 📚 Theoretical Foundation

### Protein Representation Learning

**Why not use Cartesian coordinates directly?**

Cartesian coordinates (x, y, z) have several problems:
- **Not rotation-invariant**: Same structure, different orientation = different coordinates
- **Not translation-invariant**: Same structure, different position = different coordinates
- **High dimensional and redundant**: N atoms × 3 coordinates = 3N numbers, 6 of which only encode the global position and orientation
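
The first two points can be checked numerically. The sketch below uses made-up random points as a stand-in for Cβ atoms (not output from `synth_pdb`), applies a random rotation and a shift, and compares raw coordinates with pairwise distances. The helper names `random_rotation` and `distance_map` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "structure": 7 points standing in for C-beta atoms (made-up data).
coords = rng.normal(size=(7, 3))

def random_rotation(rng):
    """Return a random proper rotation matrix (det = +1) via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))   # fix column signs so the factorization is unique
    if np.linalg.det(q) < 0:      # flip one axis to turn a reflection into a rotation
        q[:, 0] = -q[:, 0]
    return q

def distance_map(x):
    """Pairwise Euclidean distances for points x of shape (N, 3); returns (N, N)."""
    diff = x[:, None, :] - x[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Same structure, rotated and shifted.
R = random_rotation(rng)
moved = coords @ R.T + np.array([10.0, -5.0, 3.0])

coords_changed = not np.allclose(coords, moved)
dists_unchanged = np.allclose(distance_map(coords), distance_map(moved))
print("Coordinates changed:   ", coords_changed)
print("Distance map unchanged:", dists_unchanged)
```

The coordinates differ after the move, while every pairwise distance is preserved, which is the property the pairwise representation relies on.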

**Solution: Inter-residue Geometry**

Instead, AI models use **pairwise geometric relationships** between residues:

| Feature | Symbol | Description | Range |
|---------|--------|-------------|-------|
| **Distance** | d | Cβ-Cβ distance | 2-20 Å in trRosetta |
| **Omega** | ω | Dihedral Cα-Cβ-Cβ-Cα about the Cβ-Cβ axis | -180° to +180° |
| **Theta** | θ | Dihedral N-Cα-Cβ-Cβ, seen from the first residue | -180° to +180° |
| **Phi** | φ | Planar angle Cα-Cβ-Cβ, seen from the first residue | 0° to 180° |

d and ω are symmetric, while θ and φ depend on which residue is taken as the reference. Each residue pair (i, j) therefore carries six numbers (d, ω, θ_ij, θ_ji, φ_ij, φ_ji), which together form the **6D orientogram**.
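
As a rough illustration, the sketch below computes these features for a single residue pair, following the trRosetta definitions (ω as the Cα-Cβ-Cβ-Cα dihedral, θ as the N-Cα-Cβ-Cβ dihedral, φ as the Cα-Cβ-Cβ angle). The coordinates are random placeholders and `angle`, `dihedral`, and `pair_features` are hypothetical helpers, not the `synth_pdb` implementation. Glycine has no Cβ, so real pipelines typically place a virtual one from the backbone; that step is left out here.

```python
import numpy as np

def angle(a, b, c):
    """Planar angle a-b-c in degrees, in [0, 180]."""
    ba, bc = a - b, c - b
    cos = np.dot(ba, bc) / (np.linalg.norm(ba) * np.linalg.norm(bc))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def dihedral(p0, p1, p2, p3):
    """Signed dihedral p0-p1-p2-p3 in degrees, in [-180, 180]."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    v = b0 - np.dot(b0, b1) * b1   # component of b0 perpendicular to the axis
    w = b2 - np.dot(b2, b1) * b1   # component of b2 perpendicular to the axis
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.degrees(np.arctan2(y, x))

def pair_features(res_i, res_j):
    """Geometry of residue j as seen from residue i (dicts with 'N', 'CA', 'CB')."""
    return {
        "d": np.linalg.norm(res_j["CB"] - res_i["CB"]),
        "omega": dihedral(res_i["CA"], res_i["CB"], res_j["CB"], res_j["CA"]),
        "theta": dihedral(res_i["N"], res_i["CA"], res_i["CB"], res_j["CB"]),
        "phi": angle(res_i["CA"], res_i["CB"], res_j["CB"]),
    }

# Made-up atom positions for two residues (illustration only).
rng = np.random.default_rng(0)
res_i = {name: rng.normal(size=3) for name in ("N", "CA", "CB")}
res_j = {name: rng.normal(size=3) + 5.0 for name in ("N", "CA", "CB")}

f_ij = pair_features(res_i, res_j)
f_ji = pair_features(res_j, res_i)
print("i -> j:", {k: round(float(v), 2) for k, v in f_ij.items()})
print("j -> i:", {k: round(float(v), 2) for k, v in f_ji.items()})
```

Swapping the two residues leaves `d` and `omega` unchanged but generally changes `theta` and `phi`, which is why the θ and φ maps are not symmetric.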

### Dimensionality Reduction: PCA

**Principal Component Analysis** finds the directions of maximum variance in high-dimensional data.

Given data matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ (n samples, d features):

1. **Center the data**: $\mathbf{X}_{centered} = \mathbf{X} - \bar{\mathbf{X}}$

2. **Compute covariance**: $\mathbf{C} = \frac{1}{n-1}\mathbf{X}_{centered}^T \mathbf{X}_{centered}$

3. **Eigendecomposition**: $\mathbf{C} = \mathbf{V}\mathbf{\Lambda}\mathbf{V}^T$

4. **Project**: $\mathbf{Z} = \mathbf{X}_{centered}\mathbf{V}_k$ (keep top k eigenvectors)
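
The four steps above can be sketched in plain NumPy. The data here is a made-up random matrix, not the orientogram features. `sklearn.decomposition.PCA` computes the same projection through an SVD, so its components can differ from these by a sign.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up data: 200 samples with 5 correlated features.
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# 1. Center each feature (column) on its mean.
X_centered = X - X.mean(axis=0)

# 2. Sample covariance matrix, shape (d, d).
C = X_centered.T @ X_centered / (X.shape[0] - 1)

# 3. Eigendecomposition. eigh is meant for symmetric matrices and returns
#    eigenvalues in ascending order, so reorder them to descending.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project onto the top-k eigenvectors.
k = 2
Z = X_centered @ eigvecs[:, :k]

explained_ratio = eigvals[:k] / eigvals.sum()
print("Projected shape:", Z.shape)
print("Explained variance ratio:", np.round(explained_ratio, 3))
```

The variance of each projected column equals the corresponding eigenvalue, which is what "explained variance" measures.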

**Why PCA for proteins?**
- Reveals the "principal modes" of structural variation
- Reduces hundreds of dimensions (196 in this notebook) to 2-3 for visualization
- Among all linear projections to k dimensions, retains the most variance

> **🔬 Alternative**: t-SNE preserves local structure better but is non-linear and slower

---

## 1. Parallel Structure Generation

We'll generate 500 diverse conformations of a 7-residue peptide using `BatchedGenerator`, which uses vectorized NumPy operations for speed.
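
To give a feel for what "vectorized" means here, the sketch below places one new atom for 500 structures at once using the NeRF (Natural Extension Reference Frame) construction: given three previous atoms plus a bond length, bond angle, and torsion, it computes the next atom's position. This is an illustrative reimplementation under assumed inputs, not the `BatchedGenerator` source; `place_atom`, the 1.5 Å bond length, and the 110° angle are placeholders.

```python
import numpy as np

def place_atom(a, b, c, bond_length, bond_angle, torsion):
    """Place atom d after atoms a, b, c for a whole batch at once.

    a, b, c: arrays of shape (B, 3).
    bond_length: |c-d|; bond_angle: angle b-c-d; torsion: dihedral a-b-c-d.
    Angles are in radians; each may be a scalar or an array of shape (B,).
    """
    n_batch = a.shape[0]
    length = np.broadcast_to(np.asarray(bond_length, dtype=float), (n_batch,))
    ang = np.broadcast_to(np.asarray(bond_angle, dtype=float), (n_batch,))
    tors = np.broadcast_to(np.asarray(torsion, dtype=float), (n_batch,))

    # Local orthonormal frame at c: bc along the last bond, n normal to the a-b-c plane.
    bc = c - b
    bc = bc / np.linalg.norm(bc, axis=-1, keepdims=True)
    n = np.cross(b - a, bc)
    n = n / np.linalg.norm(n, axis=-1, keepdims=True)
    m = np.cross(n, bc)

    # Coordinates of d in the local frame, then mapped back to the global frame.
    x = -length * np.cos(ang)
    y = length * np.sin(ang) * np.cos(tors)
    z = length * np.sin(ang) * np.sin(tors)
    return c + x[:, None] * bc + y[:, None] * m + z[:, None] * n

# Same three seed atoms for every structure, one random torsion per structure.
rng = np.random.default_rng(1)
n_batch = 500
a = np.tile([1.0, 0.0, 0.0], (n_batch, 1))
b = np.tile([0.0, 0.0, 0.0], (n_batch, 1))
c = np.tile([0.0, 0.0, 1.5], (n_batch, 1))
torsions = rng.uniform(-np.pi, np.pi, size=n_batch)

d = place_atom(a, b, c, 1.5, np.radians(110.0), torsions)
print("New atom positions:", d.shape)
```

No Python loop over structures is needed: every arithmetic step acts on the whole `(B, 3)` array, and repeating the call atom by atom grows all chains in lockstep.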


In [None]:
sequence = "TRP-SER-GLY-ALA-VAL-PRO-ILE"
n_batch = 500

display(HTML(f"""
<div style='background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            color: white; padding: 15px; border-radius: 10px;
            font-family: monospace; margin-bottom: 15px;'>
    <b>🚀 Batch Generation</b><br>
    Sequence: {sequence}<br>
    Structures: {n_batch}<br>
    Method: Vectorized NeRF algorithm
</div>
"""))

print(f"Generating {n_batch} structures...")
bg = BatchedGenerator(sequence, n_batch=n_batch)
batch = bg.generate_batch()

print(f"✅ Generated {batch.coords.shape[0]} structures")
print(f"   Shape: {batch.coords.shape} (Batch, Atoms, XYZ)")

## 2. Computing 6D Orientograms

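For every pair of residues in every structure, we calculate the 6D geometric relationship. These are the quantities that **trRosetta predicts** (as binned probability distributions) before reconstructing 3D structure.

**The 6D representation:**
- Distance (d) - Cβ-Cβ distance
- Omega (ω) - dihedral Cα-Cβ-Cβ-Cα
- Theta (θ) - dihedral N-Cα-Cβ-Cβ, from residue i toward residue j
- Phi (φ) - planar angle Cα-Cβ-Cβ, from residue i toward residue j
- θ and φ are direction-dependent, so the (i, j) and (j, i) entries of their maps supply the remaining two values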

This creates a **rotation and translation invariant** representation of structure.
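
For the distance component, a batched computation can be sketched with NumPy broadcasting. The input below is a made-up array of shape (Batch, Residues, 3) standing in for Cβ positions; it does not come from `batch.coords`, whose atom layout is not assumed here. The second half bins the distances into 0.5 Å steps between 2 and 20 Å, similar in spirit to how trRosetta discretizes its distance targets; the class indexing shown is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-in for per-residue C-beta coordinates: (Batch, Residues, 3), made-up values.
cb = rng.normal(scale=5.0, size=(500, 7, 3))

# Broadcasting builds all pairwise difference vectors at once: (B, L, L, 3).
diff = cb[:, :, None, :] - cb[:, None, :, :]
dist = np.linalg.norm(diff, axis=-1)          # (B, L, L)

# Discretize into 0.5 Å bins from 2 to 20 Å.
# np.digitize returns 0 below 2 Å, 1..36 inside the range, and 37 at or beyond 20 Å.
edges = np.arange(2.0, 20.5, 0.5)             # 37 edges: 2.0, 2.5, ..., 20.0
bins = np.digitize(dist, edges)

print("Distance maps:", dist.shape)
print("Bin edges:", edges.size, "-> classes 0 ..", edges.size)
```

Each `dist[b]` is a symmetric matrix with a zero diagonal, so the diagonal always falls into class 0.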


In [None]:
print("Computing 6D orientograms...")
orients = batch.get_6d_orientations()

print(f"✅ Computed orientations for {n_batch} structures")
print(f"   Available features: {list(orients.keys())}")
print(f"   Shape per feature: {orients['dist'].shape} (Batch, Residues, Residues)")

## 3. Dimensionality Reduction with PCA

We flatten the 2D geometry maps into high-dimensional feature vectors and use PCA to project them into 2D for visualization.

**Feature vector construction:**
- Distance maps: 7×7 = 49 features
- Omega maps: 7×7 = 49 features  
- Theta maps: 7×7 = 49 features
- Phi maps: 7×7 = 49 features
- **Total**: 196 dimensions → 2 dimensions via PCA
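
Two refinements to this flattening are sometimes worth considering; the sketch below is an optional variant, not what the notebook cell does. It drops the diagonal entries (a residue paired with itself carries no geometry) and encodes each angle as a sine/cosine pair so that angles just either side of ±180° land close together. It assumes the angle maps are in radians, which is an assumption about `get_6d_orientations()` rather than a documented fact; `build_features` is a hypothetical helper and the input maps are random.

```python
import numpy as np

def build_features(dist, omega, theta, phi):
    """Flatten (B, L, L) maps into one row per structure.

    Diagonal entries are skipped; angles (assumed radians) become sin/cos pairs.
    """
    n_res = dist.shape[1]
    off_diag = ~np.eye(n_res, dtype=bool)          # (L, L) mask, True where i != j
    blocks = [dist[:, off_diag]]                   # (B, L*(L-1))
    for ang in (omega, theta, phi):
        blocks.append(np.sin(ang[:, off_diag]))
        blocks.append(np.cos(ang[:, off_diag]))
    return np.concatenate(blocks, axis=1)

# Made-up maps with the same shapes as the notebook's (500 structures, 7 residues).
rng = np.random.default_rng(5)
shape = (500, 7, 7)
dist = rng.uniform(2.0, 20.0, size=shape)
omega = rng.uniform(-np.pi, np.pi, size=shape)
theta = rng.uniform(-np.pi, np.pi, size=shape)
phi = rng.uniform(0.0, np.pi, size=shape)

features = build_features(dist, omega, theta, phi)
print("Feature matrix:", features.shape)

# Why sin/cos: 179° and -179° are 2° apart on the circle but far apart as raw numbers.
a, b = np.radians(179.0), np.radians(-179.0)
raw_gap = abs(a - b)
enc_gap = np.hypot(np.sin(a) - np.sin(b), np.cos(a) - np.cos(b))
print(f"Raw gap: {raw_gap:.3f} rad, sin/cos gap: {enc_gap:.3f}")
```

With 7 residues there are 42 off-diagonal pairs, giving 42 distance columns plus 6 × 42 angle columns.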


In [None]:
# Flatten all features into one vector per structure
feature_vector = np.concatenate([
    orients['dist'].reshape(n_batch, -1),
    orients['omega'].reshape(n_batch, -1),
    orients['theta'].reshape(n_batch, -1),
    orients['phi'].reshape(n_batch, -1)
], axis=1)

print(f"Feature vector dimensionality: {feature_vector.shape[1]}")

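# Apply PCA (scale-sensitive: features with a larger numeric spread dominate
# the components unless the columns are standardized first)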
pca = PCA(n_components=2)
latent_points = pca.fit_transform(feature_vector)

# Show variance explained
var_explained = pca.explained_variance_ratio_
print(f"\n✅ PCA complete")
print(f"   PC1 variance: {var_explained[0]:.1%}")
print(f"   PC2 variance: {var_explained[1]:.1%}")
print(f"   Total variance captured: {var_explained.sum():.1%}")

## 4. Interactive Latent Space Galaxy

Each point represents one protein conformation. Points close together have similar geometric properties. This is the "latent space" - a compressed representation of structural diversity.
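
One way to make "close together" concrete is to look up the nearest neighbours of a chosen structure in the 2D projection. The sketch below uses random points as a stand-in for `latent_points`; `nearest_in_latent` is a hypothetical helper. Because the projection discards some variance, neighbours in 2D are only approximately neighbours in the full feature space.

```python
import numpy as np

def nearest_in_latent(points, query_index, k=5):
    """Return indices of the k points closest to points[query_index], excluding itself."""
    delta = points - points[query_index]
    d = np.linalg.norm(delta, axis=1)
    d[query_index] = np.inf          # never return the query point itself
    return np.argsort(d)[:k]

# Stand-in for the PCA output: 500 made-up points in 2D.
rng = np.random.default_rng(3)
demo_points = rng.normal(size=(500, 2))

neighbours = nearest_in_latent(demo_points, query_index=0, k=5)
print("Nearest neighbours of structure 0:", neighbours)
```

In the notebook, passing `latent_points` in place of `demo_points` would give structure indices to compare in the explorer in Section 5.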


In [None]:
fig = go.Figure(data=[go.Scatter(
    x=latent_points[:, 0],
    y=latent_points[:, 1],
    mode='markers',
    marker=dict(
        size=8,
        color=np.arange(n_batch),
        colorscale='Viridis',
        showscale=True,
        colorbar=dict(title='Structure ID'),
        line=dict(width=0.5, color='white')
    ),
    text=[f"Structure {i}" for i in range(n_batch)],
    hovertemplate='%{text}<br>PC1: %{x:.2f}<br>PC2: %{y:.2f}<extra></extra>'
)])

fig.update_layout(
    title=dict(
        text=f'Protein Latent Space (PCA of 6D Orientograms)<br><sub>Variance explained: PC1={var_explained[0]:.1%}, PC2={var_explained[1]:.1%}</sub>',
        x=0.5,
        xanchor='center'
    ),
    xaxis_title='Principal Component 1',
    yaxis_title='Principal Component 2',
    width=900, height=600,
    template="plotly_dark",
    hovermode='closest'
)

fig.show()

## 5. Structure Explorer

Use the slider to browse individual structures and see their corresponding distance maps (how the AI "sees" them).


In [None]:
# Output widget for clean updates
out = widgets.Output()

# Slider
slider = IntSlider(min=0, max=n_batch-1, step=1, value=0, description='Structure:', layout=widgets.Layout(width='500px'))

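# Track initialization
_initializing = True

import io
from biotite.structure.io.pdb import PDBFile

def structure_to_pdb_string(atom_array):
    """Serialize a structure to PDB text (assumes a biotite AtomArray)."""
    pdb_file = PDBFile()
    pdb_file.set_structure(atom_array)
    buffer = io.StringIO()
    pdb_file.write(buffer)
    return buffer.getvalue()

def view_from_latent(change=None):
    global _initializing
    if _initializing and change is not None:
        return

    index = slider.value
    coords = batch.coords[index]

    # Build a template structure, then overwrite its coordinates with the batch member
    pgen = PeptideGenerator(sequence)
    res = pgen.generate()

    coords_applied = res.structure.array_length() == coords.shape[0]
    if coords_applied:
        res.structure.coord = coords
        # Re-serialize: res.pdb may have been written before the coordinates were replaced
        pdb_text = structure_to_pdb_string(res.structure)
    else:
        pdb_text = res.pdb

    with out:
        clear_output(wait=True)

        print(f'Structure {index} | PC1={latent_points[index,0]:.2f}, PC2={latent_points[index,1]:.2f}\n')
        if not coords_applied:
            print('⚠️ Atom count mismatch: showing the template structure, not the batch member\n')

        # 3D viewer
        view = py3Dmol.view(width=500, height=400)
        view.addModel(pdb_text, "pdb")
        view.setStyle({'stick': {'colorscheme': 'chainHetatm'}})
        view.setBackgroundColor('#1a1a1a')
        view.zoomTo()
        view.show()  # show() renders the viewer itself; wrapping it in display() prints "None"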
        
        # Distance map
        fig, ax = plt.subplots(1, 1, figsize=(4, 4))
        im = ax.imshow(orients['dist'][index], cmap='magma', vmin=0, vmax=20)
        ax.set_title("Distance Map (AI View)", color='white')
        ax.set_xlabel("Residue", color='white')
        ax.set_ylabel("Residue", color='white')
        ax.tick_params(colors='white')
        fig.patch.set_facecolor('#1a1a1a')
        ax.set_facecolor('#1a1a1a')
        plt.colorbar(im, ax=ax, label='Distance (Å)')
        plt.tight_layout()
        plt.show()
        plt.close(fig)  # Important: close figure to prevent memory leak

# Connect slider
slider.observe(view_from_latent, 'value')

# Display UI
display(widgets.VBox([slider, out]))

# Initialize
_initializing = False
view_from_latent()


---

## 🎓 Key Insights

1. **Geometric Representation**: Models such as trRosetta predict inter-residue geometry (distances + orientations) rather than raw coordinates
2. **Rotation Invariance**: 6D orientograms are invariant to rotation and translation
3. **Latent Space**: PCA reveals the "principal modes" of structural variation
4. **Dimensionality**: 196D → 2D, retaining the fraction of variance reported by `pca.explained_variance_ratio_` in Section 3

## 📖 Further Reading

**Protein Structure Prediction:**
- Jumper et al. (2021). "Highly accurate protein structure prediction with AlphaFold." *Nature* 596:583-589. [DOI: 10.1038/s41586-021-03819-2](https://doi.org/10.1038/s41586-021-03819-2)
- Yang et al. (2020). "Improved protein structure prediction using predicted interresidue orientations." *PNAS* 117:1496-1503. [DOI: 10.1073/pnas.1914677117](https://doi.org/10.1073/pnas.1914677117)

**Protein Representation Learning:**
- Rao et al. (2019). "Evaluating protein transfer learning with TAPE." *NeurIPS* 2019. [arXiv:1906.08230](https://arxiv.org/abs/1906.08230)
- Greener et al. (2018). "Design of metalloproteins and novel protein folds using variational autoencoders." *Sci Rep* 8:16189. [DOI: 10.1038/s41598-018-34533-1](https://doi.org/10.1038/s41598-018-34533-1)

**Dimensionality Reduction:**
- Pearson, K. (1901). "On lines and planes of closest fit to systems of points in space." *Phil Mag* 2:559-572.
- van der Maaten & Hinton (2008). "Visualizing data using t-SNE." *JMLR* 9:2579-2605.

---

<div style='background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 20px; border-radius: 10px; color: white; text-align: center;'>
    <h3>🎉 Exploration Complete!</h3>
    <p>You've mastered protein latent space visualization!</p>
</div>