# Embedding Pipeline Example

This notebook demonstrates how to use the **Embedding Pipeline** to generate embeddings from medical claims using the MediClaimGPT model.

## Overview
The embedding pipeline converts text claims into dense vector representations that capture semantic meaning. These embeddings can then be used for downstream tasks like classification.
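
To make "capturing semantic meaning" concrete, embeddings are typically compared by the angle between them. The sketch below is a minimal illustration using made-up 3-dimensional vectors, not real MediClaimGPT output; the function name `cosine_similarity` and the toy values are assumptions for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (values invented for illustration)
claim_a = [1.0, 0.0, 1.0]
claim_b = [2.0, 0.0, 2.0]  # same direction as claim_a, different magnitude
claim_c = [0.0, 1.0, 0.0]  # orthogonal to claim_a

sim_ab = cosine_similarity(claim_a, claim_b)
sim_ac = cosine_similarity(claim_a, claim_c)
print(f"a vs b: {sim_ab:.3f}, a vs c: {sim_ac:.3f}")
```

Vectors pointing in the same direction score close to 1 regardless of length, while orthogonal vectors score 0.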

## What You'll Learn
1. How to configure the embedding pipeline using YAML
2. How to process medical claims data
3. How to generate embeddings using the MediClaimGPT API
4. How to save and inspect the generated embeddings

## Prerequisites
- MediClaimGPT model server running on `http://localhost:8000`
- Input CSV file with columns: `mcid`, `claims`, `label`
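
A quick header check can catch a malformed input file before the pipeline runs. The following is a minimal sketch using only the standard library; the helper name `missing_columns` and the in-memory sample data are hypothetical, and the real pipeline may validate its input differently.

```python
import csv
import io

REQUIRED_COLUMNS = {"mcid", "claims", "label"}

def missing_columns(csv_text):
    """Return the required columns absent from the CSV header, sorted."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    return sorted(REQUIRED_COLUMNS - {name.strip() for name in header})

# In-memory samples standing in for a real file (invented for illustration)
sample_ok = "mcid,claims,label\n1,Regular exercise improves cardiovascular health,1\n"
sample_bad = "mcid,claims\n1,Some claim\n"

print(missing_columns(sample_ok))   # []
print(missing_columns(sample_bad))  # ['label']
```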

## Step 1: Setup and Imports

In [ ]:
import sys
import os
from pathlib import Path
import pandas as pd
import numpy as np
import yaml
import json

# Add the project root to Python path
project_root = Path().absolute().parent
sys.path.insert(0, str(project_root))

from models.config_models import PipelineConfig
from pipelines.embedding_pipeline import EmbeddingPipeline

print("✅ Imports successful")
print(f"📁 Project root: {project_root}")

## Step 2: Load and Inspect Sample Data

Let's first look at our sample dataset to understand the data structure.

In [ ]:
# Load sample data using config
config_file = "configs/embedding_example_config.yaml"
with open(config_file, 'r') as f:
    config_data = yaml.safe_load(f)

# Use config-defined dataset path
data_file = config_data['input']['dataset_path']
df = pd.read_csv(data_file)

print(f"📊 Dataset shape: {df.shape}")
print(f"📋 Columns: {list(df.columns)}")
print(f"🏷️  Label distribution:")
print(df['label'].value_counts())

print("\n📝 Sample claims:")
for i, row in df.head(3).iterrows():
    label_text = "Evidence-based" if row['label'] == 1 else "Pseudoscientific"
    print(f"\n{i+1}. [{label_text}] {row['claims'][:100]}...")

## Step 3: Load Configuration

The pipeline is configured through the YAML file loaded in Step 2, which makes it easy to modify settings without changing code. The cell below prints the main sections of that configuration.

In [ ]:
# Display configuration using sequential sections
print("🔧 Configuration loaded successfully")
print(f"📝 Job name: {config_data['job']['name']}")
print(f"📁 Output directory: {config_data['job']['output_dir']}")
print(f"🎲 Random seed: {config_data['job']['random_seed']}")

print(f"\n🌐 API Configuration:")
print(f"  Base URL: {config_data['model_api']['base_url']}")
print(f"  Timeout: {config_data['model_api']['timeout']}s")
print(f"  Max retries: {config_data['model_api']['max_retries']}")

print(f"\n⚙️  Embedding Generation Settings:")
print(f"  Batch size: {config_data['embedding_generation']['batch_size']}")
print(f"  Max sequence length: {config_data['embedding_generation']['max_sequence_length']}")
print(f"  Padding side: {config_data['embedding_generation']['padding_side']}")
print(f"  Truncation side: {config_data['embedding_generation']['truncation_side']}")
print(f"  Tokenizer path: {config_data['embedding_generation']['tokenizer_path']}")
print(f"  Output filename: {config_data['embedding_generation']['output_filename']}")

print(f"\n📂 Output Configuration:")
print(f"  Embeddings dir: {config_data['output']['embeddings_dir']}")
print(f"  Logs dir: {config_data['output']['logs_dir']}")

print(f"\n📝 Logging Configuration:")
print(f"  Level: {config_data['logging']['level']}")
print(f"  Log file: {config_data['logging']['file']}")

## Step 4: Test API Connection

Before running the full pipeline, let's verify that the MediClaimGPT API is accessible.

In [ ]:
import requests

# Test API connection using config values
api_url = config_data['model_api']['base_url']
test_endpoint = api_url + config_data['model_api']['endpoints']['embeddings_batch']

try:
    # Test with a simple claim using the new API format from config
    test_payload = {
        "claims": ["Regular exercise improves cardiovascular health"],
        "padding_side": config_data['embedding_generation']['padding_side'],
        "truncation_side": config_data['embedding_generation']['truncation_side'],
        "max_length": config_data['embedding_generation']['max_sequence_length']
    }
    
    response = requests.post(test_endpoint, json=test_payload, timeout=10)
    response.raise_for_status()
    
    result = response.json()
    embedding_dim = len(result['embeddings'][0])
    
    print("✅ API connection successful")
    print(f"📏 Embedding dimension: {embedding_dim}")
    print(f"🔢 Sample embedding (first 5 values): {result['embeddings'][0][:5]}")
    print(f"🌐 Endpoint used: {test_endpoint}")
    print(f"📊 API parameters from config:")
    for key, value in test_payload.items():
        print(f"    {key}: {value}")
    
except requests.exceptions.RequestException as e:
    print(f"❌ API connection failed: {e}")
    print(f"Please ensure the MediClaimGPT server is running on {api_url}")
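
The configuration includes a `max_retries` setting, but the notebook does not show how the pipeline applies it. One plausible interpretation is retrying a failed request with exponential backoff, sketched below. The helper `call_with_retries`, its parameters, and the simulated failure are all assumptions for illustration, not the pipeline's actual implementation.

```python
import time

def call_with_retries(func, max_retries=3, base_delay=0.5,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on a matching exception, retry with exponential backoff."""
    attempt = 0
    while True:
        try:
            return func()
        except retry_on:
            if attempt >= max_retries:
                raise  # retries exhausted, surface the last error
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
            attempt += 1

# Simulated request that fails twice before succeeding (hypothetical)
calls = {"count": 0}

def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"

delays = []  # record the delays instead of actually sleeping
outcome = call_with_retries(flaky_request, max_retries=3,
                            retry_on=(ConnectionError,), sleep=delays.append)
print(outcome, delays)
```

In a real notebook, `func` could wrap the `requests.post` call from the cell above and `retry_on` could be narrowed to `requests.exceptions.RequestException`.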

## Step 5: Initialize and Run Embedding Pipeline

Now we'll create the pipeline configuration object and run the embedding generation process.

In [ ]:
# Initialize the pipeline with configuration
try:
    # Create PipelineConfig object from YAML data
    config = PipelineConfig(**config_data)
    print("✅ Configuration validated successfully")
    
    # Initialize embedding pipeline
    embedding_pipeline = EmbeddingPipeline(config)
    print("✅ Embedding pipeline initialized")
    
    # Display resolved paths (no hardcoding)
    embeddings_dir = config.resolve_template_string(config.output.embeddings_dir)
    logs_dir = config.resolve_template_string(config.output.logs_dir)
    output_filename = config.resolve_template_string(config.embedding_generation.output_filename)
    
    print(f"\n📂 Resolved paths:")
    print(f"  Embeddings directory: {embeddings_dir}")
    print(f"  Logs directory: {logs_dir}")
    print(f"  Output filename: {output_filename}")
    
except Exception as e:
    print(f"❌ Pipeline initialization failed: {e}")
    import traceback
    traceback.print_exc()
    raise

In [ ]:
# Run the embedding generation using pure config
print("🚀 Starting embedding generation...")

try:
    # Run the pipeline - all paths come from config
    results = embedding_pipeline.run()
    
    print("\n✅ Embedding generation completed successfully!")
    print(f"📈 Processed samples: {results['n_samples']}")
    print(f"📏 Embedding dimension: {results['embedding_dim']}")
    print(f"💾 Output file: {results['output_path']}")
    print(f"⏱️  Timestamp: {results['timestamp']}")
    
    # Display embedding statistics if available
    if 'embedding_stats' in results and results['embedding_stats']:
        stats = results['embedding_stats']
        print("\n📊 Embedding Statistics:")
        for key in ('mean_norm', 'std_norm', 'min_norm', 'max_norm'):
            value = stats.get(key)
            # Format only when present: applying ':.3f' to a missing value would raise
            formatted = f"{value:.3f}" if value is not None else "N/A"
            print(f"  📊 {key.replace('_', ' ').capitalize()}: {formatted}")
    
except Exception as e:
    print(f"❌ Embedding generation failed: {e}")
    raise

## Step 6: Inspect Generated Embeddings

Let's examine the generated embeddings to understand their structure and properties.

In [ ]:
# Load and inspect the generated embeddings using config-derived path
output_path = results['output_path']
embeddings_df = pd.read_csv(output_path)

print(f"📊 Embeddings dataset shape: {embeddings_df.shape}")
print(f"📋 Columns: {list(embeddings_df.columns)}")

# Parse the first embedding to check dimension
first_embedding = json.loads(embeddings_df.iloc[0]['embedding'])
print(f"📏 Embedding dimension: {len(first_embedding)}")
print(f"🔢 First 10 values: {first_embedding[:10]}")

# Display sample rows
print("\n📝 Sample embeddings data:")
display_df = embeddings_df[['mcid', 'label']].copy()
display_df['embedding_preview'] = embeddings_df['embedding'].apply(
    lambda x: str(json.loads(x)[:5]) + "..."
)
print(display_df.head())
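
The cell above relies on each `embedding` value being a JSON-encoded list stored in a single CSV column. The sketch below illustrates that storage convention as a round trip with the standard library. The rows are invented, and the exact way the pipeline writes its file is an assumption inferred from the `json.loads` calls above.

```python
import csv
import io
import json

# Invented rows standing in for pipeline output
rows = [
    {"mcid": "A1", "label": 1, "embedding": [0.1, -0.2, 0.3]},
    {"mcid": "B2", "label": 0, "embedding": [0.0, 0.5, -0.5]},
]

# Write: serialize each vector to a JSON string so it fits in one CSV cell
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["mcid", "label", "embedding"])
writer.writeheader()
for row in rows:
    writer.writerow({**row, "embedding": json.dumps(row["embedding"])})

# Read: parse the JSON string back into a list of floats
buffer.seek(0)
parsed = [
    {"mcid": r["mcid"], "label": int(r["label"]), "embedding": json.loads(r["embedding"])}
    for r in csv.DictReader(buffer)
]
print(parsed[0]["embedding"])
```

The CSV writer quotes the JSON string automatically, so the commas inside the list do not split the cell.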

## Step 7: Embedding Analysis

Let's analyze the generated embeddings: norm statistics, plus two-dimensional projections using both linear (PCA) and non-linear (t-SNE) dimensionality reduction.

In [ ]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Parse all embeddings into a matrix
embeddings_matrix = np.array([
    json.loads(emb) for emb in embeddings_df['embedding']
])
labels = embeddings_df['label'].values

print(f"📊 Embeddings matrix shape: {embeddings_matrix.shape}")

# Calculate basic statistics
norms = np.linalg.norm(embeddings_matrix, axis=1)
print(f"\n📊 Embedding Norms Statistics:")
print(f"  Mean: {np.mean(norms):.3f}")
print(f"  Std:  {np.std(norms):.3f}")
print(f"  Min:  {np.min(norms):.3f}")
print(f"  Max:  {np.max(norms):.3f}")

# Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# 1. Norm distribution
axes[0, 0].hist(norms, bins=10, alpha=0.7, edgecolor='black')
axes[0, 0].set_title('Embedding Norm Distribution')
axes[0, 0].set_xlabel('L2 Norm')
axes[0, 0].set_ylabel('Frequency')

# 2. PCA visualization
print("\n🔄 Computing PCA...")
pca = PCA(n_components=2)
embeddings_pca = pca.fit_transform(embeddings_matrix)

# Plot each class separately so every legend entry matches its color
class_styles = [(0, 'red', 'Pseudoscientific'), (1, 'blue', 'Evidence-based')]
for class_label, color, name in class_styles:
    mask = labels == class_label
    axes[0, 1].scatter(embeddings_pca[mask, 0], embeddings_pca[mask, 1],
                       color=color, alpha=0.7, s=50, label=name)
axes[0, 1].set_title('PCA Visualization')
axes[0, 1].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
axes[0, 1].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
axes[0, 1].legend()

# 3. t-SNE visualization
print("🔄 Computing t-SNE (this may take a moment)...")
tsne = TSNE(n_components=2, random_state=42, perplexity=min(5, len(embeddings_matrix)-1))
embeddings_tsne = tsne.fit_transform(embeddings_matrix)

for class_label, color, name in class_styles:
    mask = labels == class_label
    axes[1, 0].scatter(embeddings_tsne[mask, 0], embeddings_tsne[mask, 1],
                       color=color, alpha=0.7, s=50, label=name)
axes[1, 0].set_title('t-SNE Visualization')
axes[1, 0].set_xlabel('t-SNE 1')
axes[1, 0].set_ylabel('t-SNE 2')
axes[1, 0].legend()

# 4. Embedding magnitude by class
class_0_norms = norms[labels == 0]
class_1_norms = norms[labels == 1]

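# Set tick labels separately: the boxplot `labels` keyword is deprecated in newer Matplotlib releases
axes[1, 1].boxplot([class_0_norms, class_1_norms])
axes[1, 1].set_xticklabels(['Pseudoscientific', 'Evidence-based'])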
axes[1, 1].set_title('Embedding Norms by Class')
axes[1, 1].set_ylabel('L2 Norm')

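plt.tight_layout()
# Create the output directory first; savefig fails if it does not exist
os.makedirs('outputs', exist_ok=True)
plt.savefig('outputs/embedding_analysis.png', dpi=300, bbox_inches='tight')
plt.show()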

print("\n💾 Analysis plots saved to 'outputs/embedding_analysis.png'")
print("📊 Visualizations include:")
print("  - Embedding norm distribution")
print("  - PCA projection (linear dimensionality reduction)")
print("  - t-SNE projection (non-linear dimensionality reduction)")
print("  - Class-wise norm comparison")

In [17]:
!conda install Pythae -c conda-forge -y

## Summary

🎉 **Congratulations!** You have successfully run a **pure config-driven embedding pipeline** end to end:

### ✅ Achievements

1. **📝 Pure Configuration Approach**: All parameters defined in YAML with variable references
2. **🔗 Variable References**: Used `${job.name}`, `${job.output_dir}`, and `${output.logs_dir}` throughout
3. **🌐 API Integration**: Requests follow the `embeddings_batch` JSON format, including the `padding_side`, `truncation_side`, and `max_length` controls
4. **📊 Sequential Config Usage**: Each notebook cell builds on config sections sequentially
5. **🚫 Minimal Hardcoding**: Pipeline paths, filenames, and model parameters come from the config rather than from notebook code

### 🏗️ Configuration Highlights

**No Parameter Duplication:**
- Single `batch_size` in `embedding_generation` section
- Single `random_seed` in `job` section  
- Single `max_sequence_length` consolidated in embedding config

**Variable References Used:**
- `${job.name}` → `"embedding_generation_example"`
- `${job.output_dir}` → `"examples/outputs"`
- `${output.embeddings_dir}` → `"examples/outputs/embeddings"`
- `${output.logs_dir}` → `"examples/outputs/logs"`
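
The notebook calls `config.resolve_template_string(...)` to expand these references, but its implementation is not shown. As a rough sketch of how such `${section.key}` expansion could work, the following uses a regular expression and repeated substitution. The helpers `lookup` and `resolve`, the template strings under `output`, and the filename template are assumptions chosen so that the results match the values listed above.

```python
import re

_REFERENCE = re.compile(r"\$\{([^}]+)\}")

def lookup(config, dotted_key):
    """Follow a dotted key such as 'job.name' through nested dicts."""
    node = config
    for part in dotted_key.split("."):
        node = node[part]
    return node

def resolve(config, value, max_depth=10):
    """Expand ${section.key} references, including references to references."""
    for _ in range(max_depth):
        expanded = _REFERENCE.sub(lambda m: str(lookup(config, m.group(1))), value)
        if expanded == value:  # nothing left to expand
            return expanded
        value = expanded
    raise ValueError(f"references nested deeper than {max_depth} levels: {value}")

# Hypothetical config fragment; the templates are assumed, not copied from the real YAML
config = {
    "job": {"name": "embedding_generation_example", "output_dir": "examples/outputs"},
    "output": {
        "embeddings_dir": "${job.output_dir}/embeddings",
        "logs_dir": "${job.output_dir}/logs",
    },
}

embeddings_file = resolve(config, "${output.embeddings_dir}/${job.name}_embeddings.csv")
print(embeddings_file)
# examples/outputs/embeddings/embedding_generation_example_embeddings.csv
```

The depth limit guards against circular references, which would otherwise expand forever.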

**API Format Compatibility:**
```bash
curl -X POST http://localhost:8000/embeddings_batch \
  -H "Content-Type: application/json" \
  -d '{
    "claims": ["claim1", "claim2"],
    "padding_side": "left",
    "truncation_side": "left",
    "max_length": 1024
  }'
```
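
The same request body can be assembled in Python, one payload per batch of claims. The sketch below shows one way to split claims by `batch_size`; the helper `build_batch_payloads` is hypothetical, its defaults simply mirror the curl example, and the pipeline's own batching logic may differ.

```python
import json

def build_batch_payloads(claims, batch_size, padding_side="left",
                         truncation_side="left", max_length=1024):
    """Split claims into batches and build one request body per batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    payloads = []
    for start in range(0, len(claims), batch_size):
        payloads.append({
            "claims": claims[start:start + batch_size],
            "padding_side": padding_side,
            "truncation_side": truncation_side,
            "max_length": max_length,
        })
    return payloads

claims = ["claim1", "claim2", "claim3", "claim4", "claim5"]
payloads = build_batch_payloads(claims, batch_size=2)

print(len(payloads))             # 3
print(json.dumps(payloads[-1]))  # the last batch holds the single leftover claim
```

Each payload could then be sent with `requests.post(test_endpoint, json=payload)`, as in Step 4.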

### 📂 Key Files Generated

- **Embeddings**: `examples/outputs/embeddings/embedding_generation_example_embeddings.csv`
- **Logs**: `examples/outputs/logs/embedding_pipeline.log`
- **Visualizations**: `outputs/embedding_analysis.png`

### 🎯 Next Steps

The generated embeddings are now ready for downstream tasks:
- **Classification Pipeline**: Use embeddings for ML model training
- **Similarity Analysis**: Calculate embedding distances
- **Clustering**: Group claims by semantic similarity
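
As a small taste of the classification use case, the sketch below implements a nearest-centroid classifier in plain Python: each class is summarized by the mean of its embeddings, and a new vector gets the label of the closest mean. This is an illustrative baseline on invented 2-dimensional data, not the project's classification pipeline; all function names here are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fit_centroids(embeddings, labels):
    """Group embeddings by label and compute one centroid per class."""
    grouped = {}
    for vector, label in zip(embeddings, labels):
        grouped.setdefault(label, []).append(vector)
    return {label: centroid(vectors) for label, vectors in grouped.items()}

def predict(centroids, vector):
    """Return the label whose centroid is nearest to the vector."""
    return min(centroids, key=lambda label: euclidean(centroids[label], vector))

# Invented toy embeddings: label 1 clusters on the right, label 0 on the left
train_embeddings = [[0.9, 0.1], [1.1, -0.1], [-1.0, 0.2], [-0.8, -0.2]]
train_labels = [1, 1, 0, 0]

centroids = fit_centroids(train_embeddings, train_labels)
print(predict(centroids, [0.7, 0.3]), predict(centroids, [-0.5, 0.0]))
```

With real embeddings, a library classifier such as logistic regression would usually be a stronger choice; the centroid approach is mainly useful as a sanity check that the classes are separable at all.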

### 🔧 Configuration Benefits

This pure config-driven approach provides:
- **🔄 Reproducibility**: The same config and random seed should reproduce the same results
- **⚡ Flexibility**: Easy parameter tuning via YAML
- **📖 Self-Documentation**: The YAML file records every setting used for a run
- **🎛️ No Code Changes**: Modify behavior through config only
- **🔗 Variable Reuse**: Define once, reference everywhere