# Single Embedding Computation (No Padding)

This notebook demonstrates how to generate embeddings **one at a time without padding** using the `/embeddings` endpoint.

## Key Differences from Batch Processing
- ✅ **No Padding**: Each claim processed at its natural length
- ✅ **Single Sample**: One claim per API request
- ✅ **Different Endpoint**: Uses `/embeddings` instead of `/embeddings_batch`
- ⚠️ **Slower**: One API call per claim means more round trips, though no compute is spent on padding tokens

## When to Use This Approach
- When you need exact embeddings without padding artifacts
- For small datasets where speed isn't critical
- When analyzing how padding affects embeddings
- For debugging or understanding individual claim processing
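
To make "padding artifacts" concrete, here is a minimal sketch under one assumption: the embedding is a mean over per-token vectors. Whether the actual service pools this way, or masks padded positions, is not stated in this notebook. The toy vectors, the zero vectors standing in for padded positions, and the `mean_pool` helper are all hypothetical. The point is that averaging over padded positions shifts the result unless a mask excludes them.

```python
def mean_pool(token_vectors, mask=None):
    """Average token vectors; positions where mask is 0 are ignored."""
    if mask is None:
        mask = [1] * len(token_vectors)
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Two real tokens (toy 2-dimensional vectors, purely illustrative)
real_tokens = [[1.0, 3.0], [3.0, 5.0]]
# The same sequence padded to length 4 with zero vectors
padded_tokens = real_tokens + [[0.0, 0.0], [0.0, 0.0]]
attention_mask = [1, 1, 0, 0]

natural = mean_pool(real_tokens)                    # [2.0, 4.0]
unmasked = mean_pool(padded_tokens)                 # [1.0, 2.0]
masked = mean_pool(padded_tokens, attention_mask)   # [2.0, 4.0]

print(natural, unmasked, masked)
```

With the mask applied, the padded sequence gives the same vector as the unpadded one. Without it, the result is pulled toward the padding values.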

## Step 1: Setup and Imports

In [None]:
import sys
import os
from pathlib import Path
import pandas as pd
import numpy as np
import yaml
import json
import requests
from tqdm import tqdm
import time

# Add the project root to Python path
project_root = Path().absolute().parent
sys.path.insert(0, str(project_root))

from models.config_models import PipelineConfig

print("✅ Imports successful")
print(f"📁 Project root: {project_root}")

## Step 2: Load Configuration and Sample Data

In [None]:
# Load configuration
config_file = "configs/embedding_example_config.yaml"
with open(config_file, 'r') as f:
    config_data = yaml.safe_load(f)

# Create PipelineConfig object
config = PipelineConfig(**config_data)

# Load sample data
data_file = config.resolve_template_string(config.input.dataset_path)
df = pd.read_csv(data_file)

print(f"📊 Dataset shape: {df.shape}")
print(f"📋 Columns: {list(df.columns)}")
print(f"🏷️  Label distribution:")
print(df['label'].value_counts())

# For demonstration, let's use a smaller subset
sample_size = min(10, len(df))  # Process only 10 samples for demo
df_sample = df.head(sample_size).copy()
print(f"\n📌 Using {sample_size} samples for demonstration")
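
The cells in this notebook read several fields from the YAML file. The sketch below lists those fields as dotted paths and checks a plain Python dict for them before it is handed to `PipelineConfig`. The field names come from the code in this notebook. Every value in `example_config` (URL, paths, lengths) is a made-up placeholder, and `missing_keys` is a hypothetical helper, not part of `PipelineConfig`.

```python
# Dotted paths that this notebook reads from the config
REQUIRED_PATHS = [
    "input.dataset_path",
    "output.embeddings_dir",
    "model_api.base_url",
    "model_api.endpoints.embeddings_batch",
    "embedding_generation.padding_side",
    "embedding_generation.truncation_side",
    "embedding_generation.max_sequence_length",
]

def missing_keys(config_data, required=REQUIRED_PATHS):
    """Return the dotted paths that are absent from a nested dict."""
    missing = []
    for path in required:
        node = config_data
        for part in path.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(path)
                break
            node = node[part]
    return missing

# Placeholder values for illustration only
example_config = {
    "input": {"dataset_path": "data/claims.csv"},
    "output": {"embeddings_dir": "outputs/embeddings"},
    "model_api": {
        "base_url": "http://localhost:8000",
        "endpoints": {"embeddings_batch": "/embeddings_batch"},
    },
    "embedding_generation": {
        "padding_side": "left",
        "truncation_side": "left",
        "max_sequence_length": 512,
    },
}

print(missing_keys(example_config))   # []
print(missing_keys({"input": {}}))    # every required path
```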

## Step 3: Compare Claim Lengths

Let's examine the natural lengths of our claims to understand the impact of no padding.

In [None]:
# Calculate claim lengths
df_sample['claim_length'] = df_sample['claims'].str.len()
df_sample['word_count'] = df_sample['claims'].str.split().str.len()

print("📏 Claim Length Statistics:")
print(f"  Character count: {df_sample['claim_length'].min()} - {df_sample['claim_length'].max()} (avg: {df_sample['claim_length'].mean():.0f})")
print(f"  Word count: {df_sample['word_count'].min()} - {df_sample['word_count'].max()} (avg: {df_sample['word_count'].mean():.0f})")

print("\n📝 Sample claims with lengths:")
for idx, row in df_sample.iterrows():
    print(f"\n{idx+1}. [Length: {row['claim_length']}, Words: {row['word_count']}]")
    print(f"   {row['claims'][:100]}...")
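
Character and word counts are only rough proxies, since padding operates on tokens. As a hedged illustration, the sketch below estimates what fraction of positions would be padding if every claim were padded to a shared length. It uses whitespace word counts as a stand-in for token counts, which is an assumption. `padding_fraction` is a hypothetical helper, and the second and third claims are invented.

```python
def padding_fraction(lengths, target_length=None):
    """Fraction of positions that would be padding if all sequences
    were padded to target_length (default: the longest sequence)."""
    if target_length is None:
        target_length = max(lengths)
    # Sequences longer than the target would be truncated, not padded
    clipped = [min(n, target_length) for n in lengths]
    total_slots = target_length * len(lengths)
    return 1 - sum(clipped) / total_slots

claims = [
    "Regular exercise improves cardiovascular health",
    "Vitamin C prevents colds",
    "Sleep matters",
]
word_counts = [len(c.split()) for c in claims]   # [5, 4, 2]

print(f"{padding_fraction(word_counts):.3f}")      # padded to the longest claim
print(f"{padding_fraction(word_counts, 8):.3f}")   # padded to a fixed length of 8
```

The wider the spread of lengths, or the larger the fixed target length, the more of each batch is padding.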

## Step 4: Configure Single Embedding API

We'll set up the single embedding endpoint (no batching, no padding).

In [None]:
# API Configuration for single embeddings
api_base_url = config.model_api.base_url
single_embedding_endpoint = f"{api_base_url}/embeddings"  # Note: /embeddings not /embeddings_batch

print(f"🌐 API Configuration:")
print(f"  Base URL: {api_base_url}")
print(f"  Single Embedding Endpoint: {single_embedding_endpoint}")
print(f"  Batch Endpoint (for comparison): {api_base_url}{config.model_api.endpoints['embeddings_batch']}")

# Test single embedding endpoint
test_claim = "Regular exercise improves cardiovascular health"
test_payload = {
    "claim": test_claim  # Note: 'claim' not 'claims' for single endpoint
}

try:
    response = requests.post(single_embedding_endpoint, json=test_payload, timeout=10)
    response.raise_for_status()
    result = response.json()
    
    embedding = result.get('embedding', [])
    if embedding:
        print(f"\n✅ Single embedding API test successful")
        print(f"📏 Embedding dimension: {len(embedding)}")
        print(f"🔢 Sample values (first 5): {embedding[:5]}")
    else:
        print(f"❌ No embedding returned. Response: {result}")
except Exception as e:
    print(f"❌ Single embedding API test failed: {e}")
    print(f"Make sure the API supports single embedding endpoint at {single_embedding_endpoint}")
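
The test above treats any non-empty `embedding` field as success. A slightly stricter check, sketched below, also verifies that the field is a flat list of finite numbers. The response shape `{"embedding": [...]}` is taken from the cell above; `extract_embedding` is a hypothetical helper and not part of the API.

```python
import math

def extract_embedding(result):
    """Return the embedding from a response dict, or raise ValueError."""
    embedding = result.get("embedding")
    if not isinstance(embedding, list) or not embedding:
        raise ValueError("response has no non-empty 'embedding' list")
    for value in embedding:
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"non-numeric value in embedding: {value!r}")
        if not math.isfinite(value):
            raise ValueError("embedding contains NaN or infinity")
    return embedding

print(extract_embedding({"embedding": [0.1, -0.2, 0.3]}))

for bad in ({}, {"embedding": []}, {"embedding": [0.1, "x"]}):
    try:
        extract_embedding(bad)
    except ValueError as err:
        print("rejected:", err)
```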

## Step 5: Generate Single Embeddings (No Padding)

Process each claim individually without any padding.

In [None]:
def generate_single_embedding(claim, endpoint, max_retries=3):
    """Generate embedding for a single claim without padding."""
    
    payload = {
        "claim": claim  # Single claim, no padding parameters
    }
    
    for attempt in range(max_retries):
        try:
            response = requests.post(endpoint, json=payload, timeout=30)
            response.raise_for_status()
            result = response.json()
            
            embedding = result.get('embedding')
            if not embedding:
                raise ValueError(f"No embedding in response: {result}")
                
            return embedding
            
        except Exception as e:
            if attempt == max_retries - 1:
                raise RuntimeError(f"Failed after {max_retries} attempts: {e}")
            time.sleep(1 * (attempt + 1))  # Linear backoff: wait 1 s, then 2 s, between attempts

# Generate embeddings one by one
print("🚀 Generating single embeddings (no padding)...\n")

embeddings = []
processing_times = []

for idx, row in tqdm(df_sample.iterrows(), total=len(df_sample), desc="Processing claims"):
    start_time = time.time()
    
    try:
        embedding = generate_single_embedding(row['claims'], single_embedding_endpoint)
        embeddings.append(embedding)
        
        processing_time = time.time() - start_time
        processing_times.append(processing_time)
        
    except Exception as e:
        print(f"\n❌ Error processing claim {idx}: {e}")
        embeddings.append(None)
        processing_times.append(None)

# Summary statistics
successful = sum(1 for e in embeddings if e is not None)
avg_time = np.mean([t for t in processing_times if t is not None])

print(f"\n✅ Embedding generation complete!")
print(f"📊 Success rate: {successful}/{len(df_sample)} ({successful/len(df_sample)*100:.1f}%)")
print(f"⏱️  Average processing time per claim: {avg_time:.3f} seconds")
print(f"⏱️  Total processing time: {sum(t for t in processing_times if t is not None):.1f} seconds")
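
The retry loop in `generate_single_embedding` waits 1 s and then 2 s between attempts, so the delay grows linearly. If exponential backoff is wanted instead, one possible delay schedule is sketched below. `backoff_delay` and its parameters are hypothetical, and the jitter strategy is one common choice rather than anything this API requires.

```python
import random

def backoff_delay(attempt, base=1.0, cap=30.0, jitter=True, rng=random):
    """Delay in seconds before retry number `attempt` (0-based).

    The delay doubles with each attempt and is capped at `cap`.
    With jitter, a random value in [0, delay] is returned so that
    many clients do not retry at the same instant.
    """
    delay = min(cap, base * (2 ** attempt))
    return rng.uniform(0, delay) if jitter else delay

# Deterministic schedule without jitter
print([backoff_delay(a, jitter=False) for a in range(6)])
# [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

Inside the retry loop, `time.sleep(backoff_delay(attempt))` would take the place of the existing sleep call.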

## Step 6: Compare with Batch Processing (With Padding)

Let's generate embeddings using the batch endpoint with padding for comparison.

In [None]:
# Generate batch embeddings with padding for comparison
batch_endpoint = f"{api_base_url}{config.model_api.endpoints['embeddings_batch']}"

print("🚀 Generating batch embeddings (with padding) for comparison...\n")

# Prepare batch request with padding parameters
batch_payload = {
    "claims": df_sample['claims'].tolist(),
    "padding_side": config.embedding_generation.padding_side,
    "truncation_side": config.embedding_generation.truncation_side,
    "max_length": config.embedding_generation.max_sequence_length
}

try:
    start_time = time.time()
    response = requests.post(batch_endpoint, json=batch_payload, timeout=60)
    response.raise_for_status()
    result = response.json()
    
    batch_embeddings = result.get('embeddings', [])
    batch_time = time.time() - start_time
    
    print(f"✅ Batch processing complete!")
    print(f"📊 Generated {len(batch_embeddings)} embeddings")
    print(f"⏱️  Total batch processing time: {batch_time:.3f} seconds")
    print(f"⚡ Speedup: {sum(t for t in processing_times if t is not None) / batch_time:.1f}x faster")
    
except Exception as e:
    print(f"❌ Batch processing failed: {e}")
    batch_embeddings = []
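
The batch request above sends every claim in one call, which is fine for ten samples. For larger datasets the list would likely need to be split across several requests. A minimal sketch of that split follows. `chunked` and the batch size of 4 are hypothetical; any size limit of the real endpoint is not stated in this notebook.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

claims = [f"claim {i}" for i in range(10)]
batches = list(chunked(claims, 4))

print([len(b) for b in batches])   # [4, 4, 2]
```

Each slice would become the `claims` field of one payload, and the returned embeddings would be concatenated in the same order to stay aligned with the dataframe rows.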

## Step 7: Analyze Differences

Compare embeddings generated with and without padding.

In [None]:
# Compare embeddings if both methods succeeded
if embeddings and batch_embeddings and all(e is not None for e in embeddings[:len(batch_embeddings)]):
    
    print("📊 Comparing Single vs Batch Embeddings:\n")
    
    # Calculate differences
    differences = []
    for i, (single_emb, batch_emb) in enumerate(zip(embeddings, batch_embeddings)):
        if single_emb is not None:
            single_array = np.array(single_emb)
            batch_array = np.array(batch_emb)
            
            # Calculate metrics
            cosine_sim = np.dot(single_array, batch_array) / (np.linalg.norm(single_array) * np.linalg.norm(batch_array))
            l2_distance = np.linalg.norm(single_array - batch_array)
            max_diff = np.max(np.abs(single_array - batch_array))
            
            differences.append({
                'claim_length': df_sample.iloc[i]['claim_length'],
                'cosine_similarity': cosine_sim,
                'l2_distance': l2_distance,
                'max_difference': max_diff
            })
    
    # Display results
    diff_df = pd.DataFrame(differences)
    
    print("📈 Similarity Statistics:")
    print(f"  Cosine Similarity: {diff_df['cosine_similarity'].mean():.6f} (±{diff_df['cosine_similarity'].std():.6f})")
    print(f"  L2 Distance: {diff_df['l2_distance'].mean():.6f} (±{diff_df['l2_distance'].std():.6f})")
    print(f"  Max Element Difference: {diff_df['max_difference'].mean():.6f} (±{diff_df['max_difference'].std():.6f})")
    
    print("\n📊 Per-Sample Comparison:")
    display_df = diff_df.copy()
    display_df.index = [f"Claim {i+1}" for i in range(len(display_df))]
    print(display_df.round(6))
    
    # Correlation with claim length
    if len(diff_df) > 3:
        length_corr = diff_df['claim_length'].corr(diff_df['l2_distance'])
        print(f"\n📏 Correlation between claim length and L2 distance: {length_corr:.3f}")
        
        if abs(length_corr) > 0.3:
            print("   → Padding appears to have length-dependent effects on embeddings")
        else:
            print("   → Padding effects appear relatively consistent across claim lengths")
            
else:
    print("❌ Could not compare embeddings - one or both methods failed")
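
As a sanity check on the metrics in the cell above, the sketch below computes them in plain Python on toy vectors invented for illustration. It also demonstrates a standard identity: for unit-length vectors, the squared L2 distance equals 2 * (1 - cosine similarity), so the two metrics carry the same information once embeddings are normalized.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def l2_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

single = [3.0, 4.0]
batch = [4.0, 3.0]

cos = cosine_similarity(single, batch)    # 24 / 25 = 0.96
dist = l2_distance(single, batch)         # sqrt(2)

u, v = normalize(single), normalize(batch)
unit_dist_sq = l2_distance(u, v) ** 2     # 2 * (1 - 0.96) = 0.08

print(f"{cos:.2f} {dist:.4f} {unit_dist_sq:.2f}")
```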

## Step 8: Visualize Embedding Differences

In [None]:
import matplotlib.pyplot as plt

if 'diff_df' in locals() and len(diff_df) > 0:
    fig, axes = plt.subplots(2, 2, figsize=(12, 8))
    
    # 1. Cosine similarity distribution
    axes[0, 0].hist(diff_df['cosine_similarity'], bins=10, alpha=0.7, color='blue', edgecolor='black')
    axes[0, 0].set_title('Cosine Similarity Distribution\n(Single vs Batch Embeddings)')
    axes[0, 0].set_xlabel('Cosine Similarity')
    axes[0, 0].set_ylabel('Count')
    axes[0, 0].axvline(diff_df['cosine_similarity'].mean(), color='red', linestyle='--', label=f"Mean: {diff_df['cosine_similarity'].mean():.4f}")
    axes[0, 0].legend()
    
    # 2. L2 distance vs claim length
    axes[0, 1].scatter(diff_df['claim_length'], diff_df['l2_distance'], alpha=0.6, s=50)
    axes[0, 1].set_title('L2 Distance vs Claim Length')
    axes[0, 1].set_xlabel('Claim Length (characters)')
    axes[0, 1].set_ylabel('L2 Distance')
    
    # Add trend line if enough points
    if len(diff_df) > 3:
        z = np.polyfit(diff_df['claim_length'], diff_df['l2_distance'], 1)
        p = np.poly1d(z)
        axes[0, 1].plot(diff_df['claim_length'], p(diff_df['claim_length']), "r--", alpha=0.8)
    
    # 3. Processing time comparison
    single_time = sum(t for t in processing_times if t is not None)
    batch_time_value = batch_time if 'batch_time' in locals() else 0
    
    axes[1, 0].bar(['Single\n(No Padding)', 'Batch\n(With Padding)'], 
                   [single_time, batch_time_value],
                   color=['orange', 'green'], alpha=0.7)
    axes[1, 0].set_title('Processing Time Comparison')
    axes[1, 0].set_ylabel('Time (seconds)')
    
    # Add speedup annotation
    if batch_time_value > 0:
        speedup = single_time / batch_time_value
        axes[1, 0].text(0.5, max(single_time, batch_time_value) * 0.8, 
                       f'Batch is {speedup:.1f}x faster', 
                       ha='center', fontsize=12, fontweight='bold')
    
    # 4. Element-wise difference heatmap (for first embedding)
    if embeddings[0] is not None and batch_embeddings:
        diff_vector = np.abs(np.array(embeddings[0]) - np.array(batch_embeddings[0]))
        # Reshape for visualization (create a 2D view)
        plot_size = int(np.sqrt(len(diff_vector)))
        if plot_size ** 2 <= len(diff_vector):
            diff_matrix = diff_vector[:plot_size**2].reshape(plot_size, plot_size)
            im = axes[1, 1].imshow(diff_matrix, cmap='hot', aspect='auto')
            axes[1, 1].set_title(f'Element-wise Differences (First {plot_size}²={plot_size**2} dims)')
            axes[1, 1].set_xlabel('Dimension')
            axes[1, 1].set_ylabel('Dimension')
            plt.colorbar(im, ax=axes[1, 1])
        else:
            axes[1, 1].text(0.5, 0.5, 'Embedding dimension\nnot suitable for\nsquare visualization', 
                           ha='center', va='center', fontsize=12)
            axes[1, 1].set_xlim(0, 1)
            axes[1, 1].set_ylim(0, 1)
    
    plt.tight_layout()
    os.makedirs('outputs', exist_ok=True)  # savefig fails if the directory does not exist
    plt.savefig('outputs/single_vs_batch_comparison.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("\n💾 Comparison plots saved to 'outputs/single_vs_batch_comparison.png'")
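
The heatmap panel folds a one-dimensional difference vector into a square grid and drops any trailing elements that do not fit. A sketch of that reshape in plain Python is below. `to_square_grid` is a hypothetical helper. It uses `math.isqrt`, which returns the exact integer square root and so sidesteps floating-point rounding.

```python
import math

def to_square_grid(values):
    """Fold a flat list into the largest square grid that fits.

    Returns (grid, dropped), where `dropped` is the number of trailing
    elements that did not fit into the square.
    """
    side = math.isqrt(len(values))
    used = side * side
    grid = [values[r * side:(r + 1) * side] for r in range(side)]
    return grid, len(values) - used

grid, dropped = to_square_grid(list(range(10)))
print(grid)      # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
print(dropped)   # 1
```

Reporting the dropped count next to the plot makes it clear how much of the embedding the heatmap leaves out.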

## Step 9: Save Results

In [None]:
# Save the single embeddings (no padding)
if embeddings and any(e is not None for e in embeddings):
    # Prepare output directory
    output_dir = Path(config.resolve_template_string(config.output.embeddings_dir))
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # Create output dataframe
    output_data = []
    for i, (idx, row) in enumerate(df_sample.iterrows()):
        if embeddings[i] is not None:
            output_data.append({
                'mcid': row['mcid'],
                'label': row['label'],
                'embedding': json.dumps(embeddings[i])
            })
    
    output_df = pd.DataFrame(output_data)
    
    # Save to CSV
    output_file = output_dir / "single_embeddings_no_padding.csv"
    output_df.to_csv(output_file, index=False)
    
    print(f"✅ Single embeddings saved to: {output_file}")
    print(f"📊 Saved {len(output_df)} embeddings")
    
    # Also save comparison results if available
    if 'diff_df' in locals():
        comparison_file = output_dir / "padding_comparison_results.csv"
        diff_df.to_csv(comparison_file, index=False)
        print(f"📊 Comparison results saved to: {comparison_file}")
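
Because the embeddings are stored as JSON strings inside a CSV column, they need to be decoded when the file is read back. A minimal round-trip sketch using only the standard library follows. The column names match the cell above; the row values are invented, and an in-memory buffer stands in for the file on disk.

```python
import csv
import io
import json

rows = [
    {"mcid": "A1", "label": 1, "embedding": json.dumps([0.1, 0.2, 0.3])},
    {"mcid": "B2", "label": 0, "embedding": json.dumps([0.4, 0.5, 0.6])},
]

# Write: same column layout as single_embeddings_no_padding.csv
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["mcid", "label", "embedding"])
writer.writeheader()
writer.writerows(rows)

# Read back and decode each JSON string into a list of floats
buffer.seek(0)
loaded = [
    {**row, "embedding": json.loads(row["embedding"])}
    for row in csv.DictReader(buffer)
]

print(loaded[0]["mcid"], loaded[0]["embedding"])   # A1 [0.1, 0.2, 0.3]
```

With pandas, the equivalent step after `pd.read_csv` would be `df["embedding"].apply(json.loads)`.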

## Summary

### 🎯 Key Findings

1. **Processing Method Differences**:
   - **Single (No Padding)**: Each claim processed at natural length
   - **Batch (With Padding)**: All claims padded to same length for efficiency

2. **Performance Trade-offs**:
   - **Speed**: Batch processing is usually faster because it replaces many HTTP round trips with one; the speedup measured on your data is printed in Step 6
   - **Accuracy**: Single processing may be more accurate for varying-length inputs
   - **Memory**: Single processing uses less memory (no padding overhead)

3. **Embedding Differences**:
   - Embeddings from the two methods are expected to be very similar (cosine similarity close to 1); Step 7 reports the measured values
   - Differences may correlate with claim length
   - Padding can introduce subtle artifacts in the embeddings

### 📝 Recommendations

**Use Single Embeddings (No Padding) when**:
- Processing small datasets
- Need exact embeddings without padding artifacts  
- Analyzing embedding quality or debugging
- Claim lengths vary significantly

**Use Batch Embeddings (With Padding) when**:
- Processing large datasets
- Speed is critical
- Claim lengths are relatively uniform
- Small embedding differences are acceptable

### 🔧 Configuration Notes

To switch between methods in your pipeline:
- **Single**: Use endpoint `/embeddings` with payload `{"claim": "..."}`
- **Batch**: Use endpoint `/embeddings_batch` with padding parameters
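
The two options above can be wrapped in one helper that returns the endpoint path and payload for a given mode. This is only a sketch: `build_request` and its default parameter values are hypothetical, while the payload field names are the ones used earlier in this notebook.

```python
def build_request(mode, claims, padding_side="left",
                  truncation_side="left", max_length=512):
    """Return (endpoint_path, payload) for 'single' or 'batch' mode."""
    if mode == "single":
        if len(claims) != 1:
            raise ValueError("single mode expects exactly one claim")
        return "/embeddings", {"claim": claims[0]}
    if mode == "batch":
        return "/embeddings_batch", {
            "claims": list(claims),
            "padding_side": padding_side,
            "truncation_side": truncation_side,
            "max_length": max_length,
        }
    raise ValueError(f"unknown mode: {mode!r}")

path, payload = build_request(
    "single", ["Regular exercise improves cardiovascular health"]
)
print(path, payload)
```

In a pipeline, the real padding settings would come from `config.embedding_generation` rather than the placeholder defaults shown here.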

Choose based on dataset size, how much latency you can tolerate, and how sensitive your downstream results are to small embedding differences.