# Embedding Pipeline Example

This notebook demonstrates how to use the **Embedding Pipeline** to generate embeddings from medical claims using the MediClaimGPT model.

## Overview
The embedding pipeline converts text claims into dense vector representations that capture semantic meaning. These embeddings can then be used for downstream tasks like classification.
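
To make "capture semantic meaning" concrete, here is a minimal sketch of cosine similarity, a common way to compare two embeddings. The three vectors are invented toy values, not output from MediClaimGPT, and the function name is illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors, invented for illustration only;
# real model embeddings are much longer.
exercise = [0.9, 0.1, 0.0]
jogging = [0.8, 0.2, 0.1]
crystals = [0.0, 0.2, 0.9]

sim_close = cosine_similarity(exercise, jogging)
sim_far = cosine_similarity(exercise, crystals)
print(f"exercise vs jogging:  {sim_close:.3f}")
print(f"exercise vs crystals: {sim_far:.3f}")
```

Vectors pointing in similar directions score close to 1, while nearly orthogonal vectors score close to 0.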

## What You'll Learn
1. How to configure the embedding pipeline using YAML
2. How to process medical claims data
3. How to generate embeddings using the MediClaimGPT API
4. How to save and inspect the generated embeddings

## Prerequisites
- MediClaimGPT model server running on `http://localhost:8000`
- Input CSV file with columns: `mcid`, `claims`, `label` (this notebook treats `label` 1 as evidence-based and 0 as pseudoscientific)
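
A quick header check can catch a malformed input file before any API calls are made. The sketch below uses only the standard library; `missing_columns` is a hypothetical helper, not part of the pipeline, and the sample rows are invented.

```python
import csv
import io

REQUIRED_COLUMNS = {"mcid", "claims", "label"}

def missing_columns(csv_file):
    """Return the required columns absent from the CSV header, sorted."""
    reader = csv.DictReader(csv_file)
    header = set(reader.fieldnames or [])
    return sorted(REQUIRED_COLUMNS - header)

# In-memory stand-ins for real files; the rows are invented.
good = io.StringIO("mcid,claims,label\n1,Some claim text,1\n")
bad = io.StringIO("mcid,text\n1,Some claim text\n")

print(missing_columns(good))  # []
print(missing_columns(bad))   # ['claims', 'label']
```

For a file on disk, pass an open file handle instead of the `io.StringIO` object.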

## Step 1: Setup and Imports

In [ ]:
import sys
import os
from pathlib import Path
import pandas as pd
import numpy as np
import yaml
import json

# Add the project root to Python path
project_root = Path().absolute().parent
sys.path.insert(0, str(project_root))

from models.config_models import PipelineConfig
from pipelines.embedding_pipeline import EmbeddingPipeline

print("✅ Imports successful")
print(f"📁 Project root: {project_root}")

## Step 2: Load and Inspect Sample Data

Let's first look at our sample dataset to understand the data structure.

In [4]:
# Load sample data
data_file = "data/medical_claims_sample.csv"
df = pd.read_csv(data_file)

print(f"📊 Dataset shape: {df.shape}")
print(f"📋 Columns: {list(df.columns)}")
print(f"🏷️  Label distribution:")
print(df['label'].value_counts())

print("\n📝 Sample claims:")
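# Number the samples with enumerate so the output does not depend on the DataFrame index
for i, (_, row) in enumerate(df.head(3).iterrows(), start=1):
    label_text = "Evidence-based" if row['label'] == 1 else "Pseudoscientific"
    print(f"\n{i}. [{label_text}] {row['claims'][:100]}...")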

📊 Dataset shape: (20, 3)
📋 Columns: ['mcid', 'claims', 'label']
🏷️  Label distribution:
label
1    10
0    10
Name: count, dtype: int64

📝 Sample claims:

1. [Evidence-based] Patients with diabetes should monitor their blood glucose levels regularly to maintain optimal glyce...

2. [Evidence-based] Regular aerobic exercise for 30 minutes daily significantly reduces cardiovascular disease risk...

3. [Evidence-based] Smoking cessation dramatically decreases lung cancer incidence within 5-10 years...


## Step 3: Load Configuration

We'll use a YAML configuration file to set up the embedding pipeline. This approach makes it easy to modify settings without changing code.

In [5]:
# Load configuration from YAML file
config_file = "configs/embedding_example_config.yaml"

with open(config_file, 'r') as f:
    config_data = yaml.safe_load(f)

print("🔧 Configuration loaded successfully")
print(f"📝 Job name: {config_data['job']['name']}")
print(f"🌐 API endpoint: {config_data['model_api']['base_url']}")
print(f"📦 Batch size: {config_data['embedding_generation']['batch_size']}")
print(f"🧠 Tokenizer: {config_data['embedding_generation']['tokenizer_path']}")

# Display key configuration sections
print("\n⚙️  Pipeline stages enabled:")
for stage, enabled in config_data['pipeline_stages'].items():
    status = "✅" if enabled else "❌"
    print(f"  {status} {stage}")

🔧 Configuration loaded successfully
📝 Job name: embedding_generation_example
🌐 API endpoint: http://localhost:8000
📦 Batch size: 8
🧠 Tokenizer: /home/kosaraju/mgpt-serve/tokenizer

⚙️  Pipeline stages enabled:
  ✅ embeddings
  ❌ classification
  ❌ evaluation
  ❌ target_word_eval
  ❌ summary_report
  ❌ method_comparison
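
Because this notebook indexes into the parsed YAML directly, a missing section surfaces as a bare `KeyError`. The sketch below is a lightweight pre-check over the key paths this notebook reads. It is an illustration only and not a substitute for the `PipelineConfig` validation in Step 5; `find_missing` and the example dictionary are hypothetical.

```python
# Key paths this notebook reads from the parsed YAML.
REQUIRED_KEYS = [
    ("job", "name"),
    ("model_api", "base_url"),
    ("model_api", "endpoints", "embeddings"),
    ("embedding_generation", "batch_size"),
    ("embedding_generation", "tokenizer_path"),
    ("pipeline_stages",),
]

def find_missing(config, required=REQUIRED_KEYS):
    """Return dotted paths of required keys absent from a nested dict."""
    missing = []
    for path in required:
        node = config
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing

# Hypothetical parsed config with some sections left out.
example = {
    "job": {"name": "demo"},
    "model_api": {"base_url": "http://localhost:8000", "endpoints": {}},
    "pipeline_stages": {"embeddings": True},
}
print(find_missing(example))
```

An empty list means every key the notebook reads is present.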


## Step 4: Test API Connection

Before running the full pipeline, let's verify that the MediClaimGPT API is accessible.

In [6]:
import requests

# Test API connection
api_url = config_data['model_api']['base_url']
test_endpoint = api_url + config_data['model_api']['endpoints']['embeddings']

try:
    # Test with a simple claim
    test_payload = {
        "claims": ["Regular exercise improves cardiovascular health"]
    }
    
    response = requests.post(test_endpoint, json=test_payload, timeout=10)
    response.raise_for_status()
    
    result = response.json()
    embedding_dim = len(result['embeddings'][0])
    
    print("✅ API connection successful")
    print(f"📏 Embedding dimension: {embedding_dim}")
    print(f"🔢 Sample embedding (first 5 values): {result['embeddings'][0][:5]}")
    
except requests.exceptions.RequestException as e:
    print(f"❌ API connection failed: {e}")
    print(f"Please ensure the MediClaimGPT server is running on {api_url}")

✅ API connection successful
📏 Embedding dimension: 768
🔢 Sample embedding (first 5 values): [-0.5630092024803162, 0.10324104130268097, -0.797333836555481, 0.17846648395061493, 0.05581314116716385]
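
The cell above builds the endpoint by plain string concatenation, which yields a malformed URL if the base URL ends with a slash and the path starts with one, or if neither has one. A small helper like the following sketch avoids that. `join_url` is a hypothetical name, and `/embeddings` is a placeholder path that may differ from the value in the config.

```python
def join_url(base_url, endpoint):
    """Join a base URL and an endpoint path with exactly one slash between them."""
    return base_url.rstrip("/") + "/" + endpoint.lstrip("/")

# "/embeddings" is a placeholder path, not necessarily the configured one.
print(join_url("http://localhost:8000", "/embeddings"))
print(join_url("http://localhost:8000/", "embeddings"))
```

Both calls print the same URL regardless of where the slash was supplied.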


## Step 5: Initialize and Run Embedding Pipeline

Now we'll create the pipeline configuration object and run the embedding generation process.

In [ ]:
# Create directories for outputs
os.makedirs("outputs/embeddings", exist_ok=True)
os.makedirs("outputs/logs", exist_ok=True)

# Initialize the pipeline with configuration
try:
    # Create PipelineConfig object from YAML data
    config = PipelineConfig(**config_data)
    print("✅ Configuration validated successfully")
    
    # Initialize embedding pipeline
    embedding_pipeline = EmbeddingPipeline(config)
    print("✅ Embedding pipeline initialized")
    
except Exception as e:
    print(f"❌ Pipeline initialization failed: {e}")
    import traceback
    traceback.print_exc()
    raise

In [ ]:
# Run the embedding generation
dataset_path = data_file
output_path = "outputs/embeddings/sample_embeddings.csv"

print("🚀 Starting embedding generation...")
print(f"📊 Input: {dataset_path}")
print(f"💾 Output: {output_path}")

try:
    # Run the pipeline
    results = embedding_pipeline.run(
        dataset_path=dataset_path,
        output_path=output_path
    )
    
    print("\n✅ Embedding generation completed successfully!")
    print(f"📈 Processed samples: {results['n_samples']}")
    print(f"📏 Embedding dimension: {results['embedding_dim']}")
    print(f"💾 Output file: {results['output_path']}")
    print(f"⏱️  Timestamp: {results['timestamp']}")
    
    # Display embedding statistics if available
    if results.get('embedding_stats'):
        stats = results['embedding_stats']
        print("\n📊 Embedding Statistics:")
        # Format only values that exist; applying ':.3f' to an 'N/A' fallback raises ValueError
        for key, title in [('mean_norm', 'Mean norm'), ('std_norm', 'Std norm'),
                           ('min_norm', 'Min norm'), ('max_norm', 'Max norm')]:
            value = stats.get(key)
            shown = f"{value:.3f}" if value is not None else "N/A"
            print(f"  📊 {title}: {shown}")
    
except Exception as e:
    print(f"❌ Embedding generation failed: {e}")
    raise
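
The internals of `EmbeddingPipeline.run` are not shown in this notebook. As a rough mental model only, a pipeline with a configured batch size typically splits the claims into batches, embeds each batch, and concatenates the results. The sketch below illustrates that pattern under this assumption; `batched`, `embed_all`, and `fake_embed_batch` are hypothetical names, and the fake embedder stands in for the API call.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_all(claims, embed_batch, batch_size=8):
    """Embed every claim by calling `embed_batch` once per batch, preserving order."""
    embeddings = []
    for batch in batched(claims, batch_size):
        embeddings.extend(embed_batch(batch))
    return embeddings

# Stand-in for the API call: one fake 2-dimensional vector per claim.
def fake_embed_batch(batch):
    return [[float(len(text)), 0.0] for text in batch]

claims = [f"claim {i}" for i in range(20)]
sizes = [len(batch) for batch in batched(claims, 8)]
print(sizes)  # [8, 8, 4]

vectors = embed_all(claims, fake_embed_batch)
print(len(vectors))  # 20
```

With 20 claims and a batch size of 8, the last batch is smaller than the rest, so any real embedder has to accept a partial batch.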

## Step 6: Inspect Generated Embeddings

Let's examine the generated embeddings to understand their structure and properties.

In [ ]:
# Load and inspect the generated embeddings
embeddings_df = pd.read_csv(output_path)

print(f"📊 Embeddings dataset shape: {embeddings_df.shape}")
print(f"📋 Columns: {list(embeddings_df.columns)}")

# Parse the first embedding to check dimension
first_embedding = json.loads(embeddings_df.iloc[0]['embedding'])
print(f"📏 Embedding dimension: {len(first_embedding)}")
print(f"🔢 First 10 values: {first_embedding[:10]}")

# Display sample rows
print("\n📝 Sample embeddings data:")
display_df = embeddings_df[['mcid', 'label']].copy()
display_df['embedding_preview'] = embeddings_df['embedding'].apply(
    lambda x: str(json.loads(x)[:5]) + "..."
)
print(display_df.head())
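
The cell above parses the `embedding` column with `json.loads`, which implies each vector is stored as a JSON string inside a single CSV field. The following standard-library sketch shows that round trip on invented rows; it is not the pipeline's own writer.

```python
import csv
import io
import json

# Invented rows; real embeddings are much longer.
rows = [
    {"mcid": "a1", "label": 1, "embedding": [0.25, -0.5, 1.0]},
    {"mcid": "b2", "label": 0, "embedding": [0.0, 0.125, -2.0]},
]

# Write: serialize each vector to a JSON string so it fits in one CSV field.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["mcid", "label", "embedding"])
writer.writeheader()
for row in rows:
    writer.writerow({**row, "embedding": json.dumps(row["embedding"])})

# Read: CSV returns strings, so convert the label and parse the vector back.
buffer.seek(0)
restored = [
    {"mcid": r["mcid"], "label": int(r["label"]), "embedding": json.loads(r["embedding"])}
    for r in csv.DictReader(buffer)
]
print(restored == rows)  # True
```

The CSV writer quotes the JSON string automatically because it contains commas, which is why the vector survives as one field.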

## Step 7: Embedding Analysis

Let's perform some basic analysis on the generated embeddings.

In [ ]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Parse all embeddings into a matrix
embeddings_matrix = np.array([
    json.loads(emb) for emb in embeddings_df['embedding']
])
labels = embeddings_df['label'].values

print(f"📊 Embeddings matrix shape: {embeddings_matrix.shape}")

# Calculate basic statistics
norms = np.linalg.norm(embeddings_matrix, axis=1)
print(f"\n📊 Embedding Norms Statistics:")
print(f"  Mean: {np.mean(norms):.3f}")
print(f"  Std:  {np.std(norms):.3f}")
print(f"  Min:  {np.min(norms):.3f}")
print(f"  Max:  {np.max(norms):.3f}")

# Create visualizations
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Norm distribution
axes[0].hist(norms, bins=10, alpha=0.7, edgecolor='black')
axes[0].set_title('Embedding Norm Distribution')
axes[0].set_xlabel('L2 Norm')
axes[0].set_ylabel('Frequency')

# 2. PCA visualization
pca = PCA(n_components=2)
embeddings_pca = pca.fit_transform(embeddings_matrix)

# Plot each class separately so the legend entries match the point colors
for class_value, color, name in [(0, 'red', 'Pseudoscientific'), (1, 'blue', 'Evidence-based')]:
    mask = labels == class_value
    axes[1].scatter(embeddings_pca[mask, 0], embeddings_pca[mask, 1],
                    c=color, alpha=0.7, label=name)
axes[1].set_title('PCA Visualization')
axes[1].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
axes[1].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
axes[1].legend()

# 3. Embedding magnitude by class
class_0_norms = norms[labels == 0]
class_1_norms = norms[labels == 1]

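# Set tick labels separately; the boxplot `labels` argument is deprecated in newer Matplotlib releases
axes[2].boxplot([class_0_norms, class_1_norms])
axes[2].set_xticklabels(['Pseudoscientific', 'Evidence-based'])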
axes[2].set_title('Embedding Norms by Class')
axes[2].set_ylabel('L2 Norm')

plt.tight_layout()
plt.savefig('outputs/embedding_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n💾 Analysis plots saved to 'outputs/embedding_analysis.png'")

## Step 8: Save Embeddings for Classification

The generated embeddings are now ready to be used in the classification pipeline. Let's create train/test splits.

In [ ]:
from sklearn.model_selection import train_test_split

# Create stratified train/test split
train_df, test_df = train_test_split(
    embeddings_df, 
    test_size=0.2, 
    stratify=embeddings_df['label'], 
    random_state=42
)

# Save splits
train_file = "outputs/embeddings/train_embeddings.csv"
test_file = "outputs/embeddings/test_embeddings.csv"

train_df.to_csv(train_file, index=False)
test_df.to_csv(test_file, index=False)

print(f"📊 Train set: {len(train_df)} samples")
print(f"  Class 0: {sum(train_df['label'] == 0)}")
print(f"  Class 1: {sum(train_df['label'] == 1)}")

print(f"\n📊 Test set: {len(test_df)} samples")
print(f"  Class 0: {sum(test_df['label'] == 0)}")
print(f"  Class 1: {sum(test_df['label'] == 1)}")

print(f"\n💾 Files saved:")
print(f"  Train: {train_file}")
print(f"  Test: {test_file}")
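
To see what `stratify` buys you, the sketch below implements a simplified stratified split in plain Python: it samples the test fraction separately within each class so both splits keep the original class balance. This illustrates the idea only and is not scikit-learn's algorithm; `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, test_indices), drawing the test rows per class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Same class balance as the sample dataset: 10 rows per class.
labels = [1] * 10 + [0] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))        # 16 4
print(sum(labels[i] for i in test_idx))     # 2
```

With 10 rows per class and `test_size=0.2`, each class contributes 2 rows to the test set and 8 to the training set.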

## Summary

🎉 **Congratulations!** You have successfully:

1. ✅ **Configured** the embedding pipeline using YAML
2. ✅ **Loaded** medical claims data
3. ✅ **Generated** embeddings using the MediClaimGPT API
4. ✅ **Analyzed** embedding properties and distributions
5. ✅ **Created** train/test splits for classification

## Next Steps

The generated embeddings are now ready for use in:
- **Classification Pipeline**: Train machine learning models to classify medical claims
- **Similarity Analysis**: Find similar claims based on embedding distances
- **Clustering**: Group claims by semantic similarity
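
As one illustration of the classification use case, the sketch below implements a nearest-centroid classifier: it averages the training vectors of each class and assigns a new vector to the class with the closest mean. The vectors are invented 2-dimensional values, and this is a teaching baseline rather than the classification pipeline's method. It requires Python 3.8+ for `math.dist`.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def nearest_centroid(vector, centroids):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))

# Invented training vectors keyed by label (1 = evidence-based, 0 = pseudoscientific).
training = {
    1: [[1.0, 0.9], [0.8, 1.1]],
    0: [[-1.0, -0.8], [-0.9, -1.2]],
}
centroids = {label: centroid(vectors) for label, vectors in training.items()}

print(nearest_centroid([0.7, 0.6], centroids))    # 1
print(nearest_centroid([-0.5, -0.4], centroids))  # 0
```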

## Key Files Generated

- `outputs/embeddings/sample_embeddings.csv`: Complete embeddings dataset
- `outputs/embeddings/train_embeddings.csv`: Training split for classification
- `outputs/embeddings/test_embeddings.csv`: Test split for evaluation
- `outputs/embedding_analysis.png`: Visualization plots
- `outputs/logs/embedding_pipeline.log`: Detailed execution logs

## Configuration Highlights

The YAML configuration approach provides:
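- **Reproducibility**: Rerunning with the same config file applies the same pipeline settings
- **Flexibility**: Parameters such as batch size can be changed without editing code
- **Documentation**: The config file records the settings used for each run
- **Version Control**: Configuration changes can be tracked alongside the code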
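
One way to make the link between a config and its outputs explicit is to store a fingerprint of the config next to the generated files. The sketch below hashes a canonical JSON rendering of the parsed config. `config_fingerprint` is a hypothetical helper, and the dictionaries are invented examples.

```python
import hashlib
import json

def config_fingerprint(config):
    """SHA-256 hex digest of a JSON-serializable config, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Same settings in a different key order, then one changed value.
a = {"job": {"name": "demo"}, "embedding_generation": {"batch_size": 8}}
b = {"embedding_generation": {"batch_size": 8}, "job": {"name": "demo"}}
c = {"job": {"name": "demo"}, "embedding_generation": {"batch_size": 16}}

print(config_fingerprint(a) == config_fingerprint(b))  # True
print(config_fingerprint(a) == config_fingerprint(c))  # False
```

Sorting the keys before hashing means that reordering the YAML file does not change the fingerprint, while changing any value does.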