# Vec2Vec: Colab Reproduction (~50k Examples)

This notebook reproduces the main vec2vec experiment from the paper:
**"Harnessing the Universal Geometry of Embeddings"** (Jha et al., 2025)

## Experiment Details
- **Source Model**: Stella (`stella`, passed as the config's `unsup_emb`)
- **Target Model**: GTE (`gte`, passed as the config's `sup_emb`)
- **Dataset**: Natural Questions (NQ)
- **Training Size**: ~50k examples (25k per encoder)
- **Architecture**: ResNet MLP with adapters + adversarial training
- **Losses**: Reconstruction + VSP + Cross-chain translation/VSP + GAN losses

This is a scaled-down version of the full experiment designed to run on a single Colab GPU (T4/L4) in a few hours.

In [None]:
# Check GPU and environment
!nvidia-smi

import sys
print(f"\nPython version: {sys.version}")

# Check CUDA availability
try:
    import torch
    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"CUDA device: {torch.cuda.get_device_name(0)}")
        print(f"CUDA memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
except ImportError:
    print("PyTorch will be installed in the next step")

In [None]:
# Clone the vec2vec repository
import os

REPO_URL = "https://github.com/rjha18/vec2vec.git"  # Official vec2vec repo
REPO_DIR = "/content/vec2vec"

if os.path.exists(REPO_DIR):
    print(f"Repository already exists at {REPO_DIR}")
    %cd {REPO_DIR}
    !git pull origin main
else:
    !git clone {REPO_URL} {REPO_DIR}
    %cd {REPO_DIR}

!ls -la

In [None]:
# Install dependencies
# Core packages for vec2vec
!pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
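# Quote version specifiers: an unquoted ">" is treated by the shell as output redirection
!pip install -q "transformers>=4.29.0" "sentence-transformers>=2.2.0"
!pip install -q "datasets>=2.12.0" "huggingface_hub>=0.15.0"
!pip install -q "accelerate>=0.20.0"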
!pip install -q wandb
!pip install -q scikit-learn scipy matplotlib
!pip install -q toml

# Verify installation
import torch
import transformers
import sentence_transformers
import accelerate

print(f"\nInstalled versions:")
print(f"  PyTorch: {torch.__version__}")
print(f"  Transformers: {transformers.__version__}")
print(f"  Sentence-Transformers: {sentence_transformers.__version__}")
print(f"  Accelerate: {accelerate.__version__}")
print(f"  CUDA available: {torch.cuda.is_available()}")

## Data & Embedding Preparation

Vec2vec uses the **Natural Questions (NQ)** dataset from the BeIR benchmark. The embeddings are generated on-the-fly during training using the source (Stella) and target (GTE) embedding models.

The data loading pipeline:
1. Loads NQ corpus from HuggingFace datasets
2. Splits into train/validation sets
3. Creates tokenized batches for both encoders
4. Generates embeddings during forward pass

We'll use **25,000 samples per encoder** (50k total), which is sufficient to demonstrate the approach while keeping training time reasonable.
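
The split sizes can be sketched as follows. This is a minimal illustration rather than the repo's actual loader: `split_indices` is a hypothetical helper, and it assumes the two encoders train on disjoint subsets, which is how the "25k per encoder, 50k total" count reads.

```python
import random


def split_indices(n_total, train_size, val_size, seed=0):
    """Hypothetical helper: shuffle indices and carve out three disjoint subsets."""
    needed = 2 * train_size + val_size
    if n_total < needed:
        raise ValueError(f"need {needed} examples, have {n_total}")
    idx = list(range(n_total))
    random.Random(seed).shuffle(idx)  # seeded so the split is reproducible
    source_idx = idx[:train_size]                 # texts seen by the source encoder
    target_idx = idx[train_size:2 * train_size]   # texts seen by the target encoder
    val_idx = idx[2 * train_size:needed]          # held-out validation texts
    return source_idx, target_idx, val_idx


# Assumed corpus size of 100,000, purely for illustration.
source_idx, target_idx, val_idx = split_indices(100_000, 25_000, 4_096)
print(len(source_idx), len(target_idx), len(val_idx))  # 25000 25000 4096
```

Keeping the subsets disjoint means neither translator direction ever sees a paired example, which is the point of the unsupervised setting.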

In [None]:
# Test data loading to ensure everything is properly set up
import sys
sys.path.insert(0, '/content/vec2vec')

from utils.streaming_utils import load_streaming_embeddings

# Load the NQ dataset
print("Loading NQ dataset...")
dset = load_streaming_embeddings("nq")
print(f"Dataset loaded: {len(dset)} examples")
print(f"\nSample entry keys: {list(dset[0].keys())}")

# Show a sample
sample = dset[0]
text_key = 'text' if 'text' in sample else list(sample.keys())[0]
# str() guards against a non-string first field when 'text' is absent
print(f"\nSample text (truncated): {str(sample[text_key])[:200]}...")

# Confirm we have enough data
TRAIN_SIZE = 25000  # per encoder
VAL_SIZE = 4096
REQUIRED = TRAIN_SIZE * 2 + VAL_SIZE

if len(dset) >= REQUIRED:
    print(f"\n✓ Dataset has {len(dset)} examples (need {REQUIRED} for this run)")
else:
    print(f"\n⚠ Dataset has {len(dset)} examples, adjusting train size...")
    TRAIN_SIZE = (len(dset) - VAL_SIZE) // 2
    print(f"  New train size: {TRAIN_SIZE} per encoder")

## Training Configuration

We'll use the `unsupervised.toml` config as the base, which is the main experiment config from the paper.

### Key Hyperparameters for Colab Run:
- **num_points**: 25,000 (samples per encoder, 50k total)
- **epochs**: 10 (reduced from 80 for faster iteration)
- **batch_size**: 128 (reduced from 256 for GPU memory)
- **learning_rate**: 2e-5 (default)
- **mixed_precision**: fp16 (for memory efficiency)
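
As a rough sense of scale, the number of optimizer steps implied by these settings can be worked out directly. The sketch assumes one batch per encoder per step and that the final partial batch is kept; the repo's training loop may differ on both points.

```python
import math

num_points = 25_000  # samples per encoder
batch_size = 128
epochs = 10

steps_per_epoch = math.ceil(num_points / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 196 1960
```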

### Architecture:
- **Translator**: ResNet MLP with adapters (`style='res_mlp'`)
- **Adapter dimension**: 1024
- **Discriminator**: 5-layer MLP with residuals
- **GAN type**: Least squares GAN
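
For reference, the standard least-squares GAN objective replaces the usual cross-entropy with a squared error against target labels. The sketch below uses the common convention of 1 for real and 0 for fake with a 0.5 scale; the function names are illustrative, and the repo's implementation may use different targets or scaling.

```python
def lsgan_discriminator_loss(real_scores, fake_scores):
    """Push discriminator outputs toward 1 on real inputs and 0 on fake inputs."""
    real_term = sum((s - 1.0) ** 2 for s in real_scores) / len(real_scores)
    fake_term = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return 0.5 * (real_term + fake_term)


def lsgan_generator_loss(fake_scores):
    """Push discriminator outputs on generated (translated) inputs toward 1."""
    return 0.5 * sum((s - 1.0) ** 2 for s in fake_scores) / len(fake_scores)


# Toy discriminator scores, made up for illustration.
d_loss = lsgan_discriminator_loss(real_scores=[0.9, 1.1], fake_scores=[0.2, -0.1])
g_loss = lsgan_generator_loss(fake_scores=[0.2, -0.1])
print(f"discriminator: {d_loss:.4f}, generator: {g_loss:.4f}")
```

Because the penalty grows with distance from the target label, samples far from the decision boundary still produce gradients, which is the usual motivation for this variant.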

### Losses (with coefficients):
- Reconstruction: 1.0
- VSP (Vector Space Preservation): 1.0
- Cross-chain translation: 10.0
- Cross-chain VSP: 10.0
- Adversarial (generator): 1.0
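
One way to read these coefficients is as weights in a single generator objective. The sketch below assumes the total is a plain weighted sum; the short key names and the `total_generator_loss` helper are illustrative and not taken from the repo.

```python
# Coefficients from the list above; key names are shortened for illustration.
COEFFICIENTS = {
    "rec": 1.0,
    "vsp": 1.0,
    "cc_trans": 10.0,
    "cc_vsp": 10.0,
    "gen": 1.0,
}


def total_generator_loss(losses, coefficients=COEFFICIENTS):
    """Weighted sum of the individual loss terms (assumed combination rule)."""
    missing = set(coefficients) - set(losses)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(coefficients[name] * losses[name] for name in coefficients)


# Made-up per-term values for one training step.
example = {"rec": 0.5, "vsp": 0.25, "cc_trans": 0.125, "cc_vsp": 0.0625, "gen": 1.0}
print(total_generator_loss(example))  # 3.625
```

With these weights the two cross-chain terms dominate the total unless their raw values are much smaller than the others.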

In [None]:
# Display and modify configuration for Colab run
import toml
import os

# Load base config
config_path = '/content/vec2vec/configs/unsupervised.toml'
with open(config_path, 'r') as f:
    config = toml.load(f)

print("=" * 60)
print("BASE CONFIGURATION (unsupervised.toml)")
print("=" * 60)

# Key settings
print(f"\n[General]")
print(f"  Dataset: {config['general']['dataset']}")
print(f"  Unsup Model: {config['general']['unsup_emb']}")
print(f"  Sup Model: {config['general']['sup_emb']}")
print(f"  Mixed Precision: {config['general']['mixed_precision']}")

print(f"\n[Translator]")
print(f"  Style: {config['translator']['style']}")
print(f"  Adapter Dim: {config['translator']['d_adapter']}")
print(f"  Depth: {config['translator']['depth']}")

print(f"\n[Training]")
print(f"  Batch Size: {config['train']['bs']}")
print(f"  Learning Rate: {config['train']['lr']}")

print(f"\n[Losses]")
print(f"  Reconstruction: {config['train']['loss_coefficient_rec']}")
print(f"  VSP: {config['train']['loss_coefficient_vsp']}")
print(f"  CC Trans: {config['train']['loss_coefficient_cc_trans']}")
print(f"  CC VSP: {config['train']['loss_coefficient_cc_vsp']}")

print("\n" + "=" * 60)
print("COLAB OVERRIDES")
print("=" * 60)

# Define Colab-specific settings
COLAB_CONFIG = {
    'num_points': 25000,       # 25k per encoder (50k total)
    'epochs': 10,              # Reduced for Colab
    'bs': 128,                 # Reduced batch size for T4
    'val_size': 4096,          # Validation set size
    'use_wandb': False,        # Disable W&B for simplicity
    'min_epochs': 5,           # Lower minimum epochs
    'patience': 5,             # Earlier stopping
}

for k, v in COLAB_CONFIG.items():
    print(f"  --{k} {v}")

print("\n" + "=" * 60)

In [None]:
# Launch training
import os
os.chdir('/content/vec2vec')

# Create output directory
OUTPUT_DIR = '/content/vec2vec/outputs/colab_50k'
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Build training command
CMD = f"""
python train.py unsupervised \\
    --num_points 25000 \\
    --epochs 10 \\
    --bs 128 \\
    --val_size 4096 \\
    --use_wandb false \\
    --min_epochs 5 \\
    --patience 5 \\
    --save_dir '{OUTPUT_DIR}/{{}}'
"""

print("Training command:")
print(CMD)
print("\n" + "="*60)
print("Starting training... (this may take 1-3 hours on a T4)")
print("="*60 + "\n")

# Run training
!{CMD}

## Evaluation

After training, we evaluate the translator on held-out test data using the following metrics:

1. **Cosine Similarity**: Average cosine similarity between translated embeddings and true target embeddings
2. **Top-1 Accuracy**: Percentage of samples where the translated embedding's nearest neighbor is the correct target
3. **Mean Rank**: Average rank of the correct target among all candidates (lower is better)
4. **VSP (Vector Space Preservation)**: How well pairwise similarities are preserved after translation

The evaluation uses the repo's built-in `eval.py` script, which loads the trained checkpoint and computes all metrics.
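
To make the first three metrics concrete, here is a small pure-Python sketch on toy 2-D vectors. It is not the code in `eval.py`: `translation_metrics` is a hypothetical helper, candidates are ranked by cosine similarity, and ties are broken optimistically.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def translation_metrics(translated, targets):
    """translated[i] is expected to match targets[i]."""
    n = len(translated)
    paired_cos, ranks = [], []
    for i, vec in enumerate(translated):
        sims = [cosine(vec, cand) for cand in targets]
        paired_cos.append(sims[i])
        # Rank 1 means the correct target is the most similar candidate.
        ranks.append(1 + sum(s > sims[i] for s in sims))
    return {
        "cosine": sum(paired_cos) / n,
        "top1": sum(r == 1 for r in ranks) / n,
        "mean_rank": sum(ranks) / n,
    }


targets = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
# The third translated vector deliberately lands near the wrong target.
translated = [[0.9, 0.1], [0.1, 0.9], [0.2, 1.0]]
metrics = translation_metrics(translated, targets)
print(metrics)
```

In this toy case two of three vectors retrieve their own target, so Top-1 is 2/3, and the miss is ranked last of three, giving a mean rank of 5/3.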

In [None]:
# Find the checkpoint directory
import os
import glob

OUTPUT_DIR = '/content/vec2vec/outputs/colab_50k'

# Find the most recent checkpoint
checkpoint_dirs = glob.glob(f"{OUTPUT_DIR}/**/model.pt", recursive=True)
if not checkpoint_dirs:
    # Also check default save location
    checkpoint_dirs = glob.glob("/content/vec2vec/finetuning_unsupervised/**/model.pt", recursive=True)

if checkpoint_dirs:
    # Get the directory containing model.pt
    checkpoint_path = os.path.dirname(sorted(checkpoint_dirs, key=os.path.getmtime)[-1])
    print(f"Found checkpoint at: {checkpoint_path}")
    
    # Run evaluation
    print("\n" + "="*60)
    print("Running evaluation...")
    print("="*60 + "\n")
    
    !python eval.py "{checkpoint_path}" --num_points 8000
else:
    print("No checkpoint found. Please run training first.")
    print(f"Searched in: {OUTPUT_DIR}")

In [None]:
# Summarize the run configuration (metric values are printed by eval.py above)
import os
import sys
import glob

sys.path.insert(0, '/content/vec2vec')

OUTPUT_DIR = '/content/vec2vec/outputs/colab_50k'

# Find checkpoint
checkpoint_dirs = glob.glob(f"{OUTPUT_DIR}/**/model.pt", recursive=True)
if not checkpoint_dirs:
    checkpoint_dirs = glob.glob("/content/vec2vec/finetuning_unsupervised/**/model.pt", recursive=True)

if checkpoint_dirs:
    checkpoint_path = os.path.dirname(sorted(checkpoint_dirs, key=os.path.getmtime)[-1])
    
    # Load config from checkpoint
    import toml
    config_file = os.path.join(checkpoint_path, 'config.toml')
    if os.path.exists(config_file):
        config = toml.load(config_file)
        
        # Extract key metrics from the run
        print("\n" + "="*70)
        print("EVALUATION RESULTS")
        print("="*70)
        
        # Create metrics table
        print("\n{:<15} {:<12} {:<10} {:<10} {:<10} {:<10}".format(
            'Model Pair', 'Train Size', 'Test Size', 'Cosine', 'Top-1', 'Rank'
        ))
        print("-"*70)
        
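        # Metric values come from the eval.py output above; this table only
        # summarizes the run setup, so the metric columns show placeholders.
        def lookup(key, default):
            # The saved config may be flat or nested (e.g. under [general]),
            # so check the top level first, then every sub-table.
            if key in config:
                return config[key]
            for section in config.values():
                if isinstance(section, dict) and key in section:
                    return section[key]
            return default

        model_pair = f"{lookup('unsup_emb', 'stella')}→{lookup('sup_emb', 'gte')}"
        train_size = lookup('num_points', 25000)
        test_size = lookup('val_size', 4096)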
        
        print(f"{model_pair:<15} {train_size:<12} {test_size:<10} {'--':<10} {'--':<10} {'--':<10}")
        print("\nNote: Actual metrics are printed by eval.py above.")
        print("      Look for 'trans' metrics showing translation quality.")
        print("="*70)
        
        # Check for any saved results
        results_files = glob.glob(os.path.join(checkpoint_path, '*.json'))
        if results_files:
            print("\nSaved result files:")
            for f in results_files:
                print(f"  - {os.path.basename(f)}")
    else:
        print(f"Config file not found at {config_file}")
else:
    print("No checkpoint found to evaluate.")

In [None]:
# List training logs and display saved figures (if available)
import os
import glob

OUTPUT_DIR = '/content/vec2vec/outputs/colab_50k'

# Find checkpoint
checkpoint_dirs = glob.glob(f"{OUTPUT_DIR}/**/model.pt", recursive=True)
if not checkpoint_dirs:
    checkpoint_dirs = glob.glob("/content/vec2vec/finetuning_unsupervised/**/model.pt", recursive=True)

if checkpoint_dirs:
    checkpoint_path = os.path.dirname(sorted(checkpoint_dirs, key=os.path.getmtime)[-1])
    
    # Look for any log files or metrics
    log_files = glob.glob(os.path.join(checkpoint_path, '*.json'))
    
    # Check for wandb logs (if they were enabled)
    wandb_logs = glob.glob(os.path.join(checkpoint_path, 'wandb', '**', '*.json'), recursive=True)
    
    if log_files or wandb_logs:
        print("Found log files:")
        for f in log_files + wandb_logs:
            print(f"  - {f}")
        print("Note: Detailed per-step logging requires W&B to be enabled.")
    else:
        print("No per-step training logs found.")
        print("\nTo enable detailed logging:")
        print("  1. Set --use_wandb true")
        print("  2. Login with: wandb login")
        print("\nThe heatmaps from evaluation provide visual insight into translation quality.")
        
    # Check for any generated heatmaps or figures
    figure_files = glob.glob(os.path.join(checkpoint_path, '*.png'))
    if figure_files:
        print(f"\nFound {len(figure_files)} figure(s):")
        for f in figure_files:
            print(f"  - {os.path.basename(f)}")
        
        # Display the first heatmap if available
        from IPython.display import Image, display
        for fig_path in figure_files[:2]:  # Show first 2
            print(f"\nDisplaying: {os.path.basename(fig_path)}")
            display(Image(filename=fig_path))
else:
    print("No checkpoint found.")

## Summary & Next Steps

### What This Notebook Reproduces

This notebook trains a **vec2vec translator** to map embeddings from Stella (`unsup_emb`) to GTE (`sup_emb`) space on the NQ dataset, using:

- **50k training examples** (25k per encoder)
- **ResNet MLP architecture** with adapters
- **Adversarial training** with least-squares GAN
- **Multiple losses**: reconstruction, VSP, cross-chain translation/VSP

This is a scaled-down version of the full experiment (which uses 100k+ examples and 80+ epochs).

---

### Scaling Up

To increase training scale:
```bash
# More training data (e.g., 100k per encoder)
--num_points 100000

# More epochs
--epochs 50

# Larger batch size (if GPU memory allows)
--bs 256
```
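
If you change these flags often, it can help to build the command from a dictionary instead of editing a string by hand. This is a minimal sketch: `build_train_command` is a hypothetical helper, and it assumes `train.py` accepts `--flag value` pairs in the same form as the training cell earlier in this notebook.

```python
import shlex


def build_train_command(config_name, overrides):
    """Hypothetical helper: turn a dict of overrides into a train.py command line."""
    parts = ["python", "train.py", config_name]
    for flag, value in overrides.items():
        # Lower-case booleans to match the `--use_wandb false` style used earlier.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        parts += [f"--{flag}", text]
    return shlex.join(parts)  # quotes any argument that needs it


cmd = build_train_command(
    "unsupervised",
    {"num_points": 100_000, "epochs": 50, "bs": 256},
)
print(cmd)
# python train.py unsupervised --num_points 100000 --epochs 50 --bs 256
```

In a notebook cell the result can be run with `!{cmd}`.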

---

### Changing Model Pairs

To translate between different embedding models:
```bash
# GTE to GTR
--unsup_emb gte --sup_emb gtr

# E5 to GTE
--unsup_emb e5 --sup_emb gte

# Stella to SBERT
--unsup_emb stella --sup_emb sbert
```

Available models: `gtr`, `gte`, `stella`, `e5`, `sbert`, `ember`, `gist`, `sentence-t5`, `dpr`, `jina`, etc.

---

### Using Different Datasets

```bash
# FineWeb (larger dataset)
--dataset fineweb

# BeIR benchmark datasets
--dataset arguana-corpus
--dataset msmarco-corpus
```

---

### Tips

1. **GAN training is unstable**: Try different seeds if training doesn't converge
2. **Monitor validation metrics**: Early stopping uses mean rank by default
3. **Use W&B for detailed logs**: Set `--use_wandb true` after `wandb login`
4. **Check GPU memory**: Reduce batch size if you get OOM errors
5. **Pre-trained weights**: Download from [GitHub releases](https://github.com/rjha18/vec2vec/releases) for full-scale trained models

---

### Citation

```bibtex
@misc{jha2025harnessinguniversalgeometryembeddings,
      title={Harnessing the Universal Geometry of Embeddings}, 
      author={Rishi Jha and Collin Zhang and Vitaly Shmatikov and John X. Morris},
      year={2025},
      eprint={2505.12540},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.12540}, 
}
```