# 18. Fine-tuning Embeddings for Domain-Specific RAG 🎯

**Complexity:** ⭐⭐⭐⭐ | **Duration:** ~30-35 minutes

---

## Overview

**Fine-tuning embeddings** adapts pre-trained embedding models to your specific domain, improving retrieval accuracy for specialized content.

### When to Fine-tune

✅ **Consider fine-tuning when:**
- Your content is full of domain-specific jargon (medical, legal, technical)
- Your content uses industry-specific abbreviations
- Retrieval accuracy is below 70% with pre-trained models
- You have a large proprietary corpus (10K+ documents)
- You are cost-sensitive and want to run local models

❌ **Don't fine-tune when:**
- Your content is general-domain
- Your dataset is small (< 1K documents)
- OpenAI embeddings already work well (> 85% accuracy)
- You have no time or resources for training

### Fine-tuning Approaches

1. **Contrastive Learning**: Train on (query, relevant_doc, irrelevant_doc) triplets
2. **Supervised Fine-tuning**: Use labeled (query, document, relevance_score) pairs
3. **Domain Adaptation**: Continue pre-training on domain corpus
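As a rough illustration of the first approach, the sketch below computes a triplet margin loss on toy vectors using cosine distance. The function names, the margin of 0.5, and the three-dimensional vectors are assumptions made for illustration; a real run would use model embeddings and a library loss.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_loss(query, positive, negative, margin: float = 0.5) -> float:
    """Hinge loss: zero once the positive is closer than the negative by `margin`."""
    d_pos = 1.0 - cosine_similarity(query, positive)
    d_neg = 1.0 - cosine_similarity(query, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy embeddings (hypothetical values, not produced by any model)
query = [1.0, 0.0, 0.0]
relevant = [0.9, 0.1, 0.0]
irrelevant = [0.0, 1.0, 0.0]

well_separated = triplet_loss(query, relevant, irrelevant)
wrong_order = triplet_loss(query, irrelevant, relevant)
print(f"well separated: {well_separated:.3f}")
print(f"wrong order:    {wrong_order:.3f}")
```

The loss is zero when the relevant document is already closer to the query than the irrelevant one by at least the margin, so only triplets the model still gets wrong contribute to the gradient.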

### Expected Improvements

| Metric | Pre-trained | Fine-tuned | Improvement (absolute) |
|---|---|---|---|
| Precision@5 | 68% | 82-89% | +14 to +21 points |
| Recall@10 | 72% | 85-92% | +13 to +20 points |
| MRR (Mean Reciprocal Rank) | 0.65 | 0.78-0.85 | +0.13 to +0.20 |
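For reference, the three metrics in the table can be computed along the following lines. This is a minimal sketch: the function names and the document IDs are hypothetical, and it assumes each query comes with a set of known relevant document IDs.

```python
from typing import List, Set


def precision_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids: List[str], relevant_ids: Set[str]) -> float:
    """1 / position of the first relevant result, or 0 if none was retrieved."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0


# Hypothetical retrieval result for a single query
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}

print(f"Precision@5: {precision_at_k(ranked, relevant, 5):.2f}")
print(f"Recall@5:    {recall_at_k(ranked, relevant, 5):.2f}")
print(f"RR:          {reciprocal_rank(ranked, relevant):.2f}")
```

MRR is the mean of `reciprocal_rank` over all evaluation queries.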

---

## Setup

Install `sentence-transformers` (for example, `pip install sentence-transformers`), then import the libraries used for fine-tuning.

In [None]:
import sys
from pathlib import Path
from typing import List, Dict, Tuple
import random

# Add project root
sys.path.append('../..')

# Core dependencies
from sentence_transformers import (
    SentenceTransformer,
    InputExample,
    losses,
    evaluation,
    util
)
from torch.utils.data import DataLoader

# Standard RAG components
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

# Shared utilities
from shared import (
    load_langchain_docs,
    split_documents,
    save_vector_store,
    load_vector_store,
    format_docs,
    print_section_header,
    VECTOR_STORE_DIR,
    SECTION_WIDTH
)

print("=" * SECTION_WIDTH)
print("FINE-TUNING EMBEDDINGS SETUP")
print("=" * SECTION_WIDTH)
print("\n✅ Imports successful")
print("✅ sentence-transformers ready for fine-tuning")

## 1. Dataset Preparation

Create training data with (query, positive_doc, negative_doc) triplets.

In [None]:
# Sample domain-specific dataset (LangChain RAG domain)
training_data = [
    {
        "query": "How do I create a FAISS vector store?",
        "positive": """To create a FAISS vector store in LangChain, use FAISS.from_documents(chunks, embeddings).
        This takes your document chunks and embedding model, creates the index, and returns a vectorstore object.
        You can then save it with save_vector_store() or use it immediately with as_retriever().""",
        "negative": """HyDe (Hypothetical Document Embeddings) is a technique where you generate a hypothetical
        answer to the query, embed it, and use it for retrieval instead of embedding the raw query.
        This improves semantic matching for ambiguous questions."""
    },
    {
        "query": "What is the difference between similarity and MMR retrieval?",
        "positive": """Similarity search returns the k most similar documents based on cosine similarity.
        MMR (Maximal Marginal Relevance) balances relevance and diversity by selecting documents that are
        both relevant to the query and dissimilar to already selected documents. Use MMR when you want
        diverse results covering different aspects of the query.""",
        "negative": """LCEL (LangChain Expression Language) uses the pipe operator | to chain components.
        A typical chain looks like: prompt | model | output_parser. This enables streaming, async,
        and fallback capabilities out of the box."""
    },
    {
        "query": "How does Adaptive RAG routing work?",
        "positive": """Adaptive RAG classifies query complexity (SIMPLE/MEDIUM/COMPLEX) and routes to different
        retrieval strategies. Simple queries use fast similarity search, medium queries use MMR for diversity,
        and complex queries use HyDe for better semantic matching. This optimizes cost and latency.""",
        "negative": """To split documents, use RecursiveCharacterTextSplitter with chunk_size=1000 and
        chunk_overlap=200. This preserves context across chunk boundaries while keeping chunks small
        enough for effective retrieval."""
    },
    {
        "query": "What is Self-RAG and when should I use it?",
        "positive": """Self-RAG adds autonomous decision-making: the LLM decides if retrieval is needed,
        generates a response, critiques its own answer, and retries if quality is low. Use it for
        quality-critical applications where self-correction is valuable, but expect 10-20s latency.""",
        "negative": """Contextual RAG prepends document-level context to each chunk before embedding.
        This improves retrieval precision by 15-30% with minimal query overhead, as the context is
        added during indexing, not at query time."""
    },
    {
        "query": "How do I reduce RAG costs?",
        "positive": """To reduce costs: 1) Use HuggingFace embeddings (free, local), 2) Cache vector stores
        to avoid re-embedding, 3) Reduce k parameter (retrieve fewer docs), 4) Use Adaptive RAG to
        route simple queries to cheaper strategies, 5) Use GPT-4o-mini instead of GPT-4.""",
        "negative": """GraphRAG extracts entities and relationships from documents, builds a knowledge graph,
        and performs multi-hop reasoning. It's excellent for relationship-centric queries but has
        higher setup complexity."""
    },
    {
        "query": "What's the best RAG architecture for chatbots?",
        "positive": """For chatbots, use Memory RAG (04_rag_with_memory.ipynb). It maintains conversation
        history using ConversationBufferMemory or ConversationBufferWindowMemory, allowing the system
        to understand follow-up questions and references to previous context.""",
        "negative": """Agentic RAG uses ReAct agents with multiple tools (retriever, calculator, web search).
        It's ideal for complex multi-step reasoning but has 20-40s latency due to the agent loop.
        Use for BI dashboards or complex analytics."""
    },
    {
        "query": "How does CRAG improve retrieval quality?",
        "positive": """CRAG (Corrective RAG) grades each retrieved document for relevance. If quality is low,
        it triggers a web search fallback to find better information. This is perfect for out-of-domain
        queries or when vector store doesn't have current information. Expect 10-15s latency.""",
        "negative": """Fusion RAG generates 3-5 query perspectives and retrieves documents for each.
        It then combines results using Reciprocal Rank Fusion (RRF) algorithm, where documents
        appearing in multiple result sets get higher scores."""
    },
    {
        "query": "What embeddings should I use for production?",
        "positive": """For production: Use OpenAI text-embedding-3-small for best quality (1536d, $0.02/1M tokens).
        For cost-sensitive or offline use, try HuggingFace all-MiniLM-L6-v2 (384d, free, local).
        Quality difference is ~10-15%, with OpenAI being better.""",
        "negative": """SQL RAG converts natural language to SQL queries. It retrieves relevant database schema,
        generates SQL with validation, executes safely (read-only), and interprets results. Perfect for
        analytics and BI use cases."""
    }
]

print(f"✅ Created {len(training_data)} training examples")
print("   Each example has: query, positive doc, negative doc")
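Before training, it can help to sanity-check hand-written triplets. The sketch below is one possible check; `validate_triplets` and the sample data are hypothetical and not part of the notebook's shared utilities.

```python
from typing import Dict, List

REQUIRED_KEYS = ("query", "positive", "negative")


def validate_triplets(examples: List[Dict[str, str]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    seen_queries = set()
    for index, example in enumerate(examples):
        missing = [key for key in REQUIRED_KEYS if not example.get(key, "").strip()]
        if missing:
            problems.append(f"example {index}: missing or empty {missing}")
            continue
        if example["positive"].strip() == example["negative"].strip():
            problems.append(f"example {index}: positive and negative are identical")
        normalized = example["query"].strip().lower()
        if normalized in seen_queries:
            problems.append(f"example {index}: duplicate query")
        seen_queries.add(normalized)
    return problems


# Hypothetical sample with three deliberate mistakes
sample = [
    {"query": "How do I create a FAISS vector store?",
     "positive": "Use FAISS.from_documents.",
     "negative": "HyDE embeds a hypothetical answer."},
    {"query": "What is MMR?",
     "positive": "MMR balances relevance and diversity.",
     "negative": ""},
    {"query": "how do i create a faiss vector store?",
     "positive": "Same text.",
     "negative": "Same text."},
]

problems = validate_triplets(sample)
for problem in problems:
    print(problem)
```

In the notebook, the same check could be run as `validate_triplets(training_data)`.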

## 2. Baseline Evaluation

Test pre-trained model performance before fine-tuning.

In [None]:
# Load pre-trained model
base_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Create test queries and documents
test_queries = [item["query"] for item in training_data]
test_positives = [item["positive"] for item in training_data]

# Encode
query_embeddings = base_model.encode(test_queries, convert_to_tensor=True)
doc_embeddings = base_model.encode(test_positives, convert_to_tensor=True)

# Calculate similarities
similarities = util.cos_sim(query_embeddings, doc_embeddings)

# Evaluate: For each query, is its positive doc the top match?
correct = 0
mrr_scores = []

for i in range(len(test_queries)):
    # Get similarity scores for this query against all docs
    scores = similarities[i]
    
    # Find rank of correct document (i-th doc should match i-th query)
    sorted_indices = scores.argsort(descending=True)
    rank = (sorted_indices == i).nonzero(as_tuple=True)[0].item() + 1
    
    if rank == 1:
        correct += 1
    
    mrr_scores.append(1.0 / rank)

baseline_accuracy = correct / len(test_queries)
baseline_mrr = sum(mrr_scores) / len(mrr_scores)

print("\n" + "=" * SECTION_WIDTH)
print("BASELINE EVALUATION (Pre-trained Model)")
print("=" * SECTION_WIDTH)
print(f"Accuracy (top-1): {baseline_accuracy:.1%}")
print(f"MRR (Mean Reciprocal Rank): {baseline_mrr:.3f}")
print(f"\nInterpretation:")
print(f"  - {correct}/{len(test_queries)} queries found correct doc in rank 1")
print(f"  - Harmonic-mean rank of correct doc (1/MRR): {1/baseline_mrr:.1f}")
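Note that `1/MRR` is the harmonic mean of the ranks, which is never larger than the arithmetic mean rank and can look noticeably better when a few queries rank poorly. A small worked example with made-up ranks shows the gap:

```python
from statistics import harmonic_mean

# Hypothetical ranks of the correct document for four queries
ranks = [1, 1, 2, 4]

mrr = sum(1.0 / rank for rank in ranks) / len(ranks)
mean_rank = sum(ranks) / len(ranks)

print(f"MRR:                        {mrr:.4f}")
print(f"1/MRR (harmonic-mean rank): {1 / mrr:.2f}")
print(f"statistics.harmonic_mean:   {harmonic_mean(ranks):.2f}")
print(f"Arithmetic mean rank:       {mean_rank:.2f}")
```

To report a true average rank, collect the `rank` values inside the evaluation loop and average them directly.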

## 3. Fine-tuning with Contrastive Learning

Train the model using MultipleNegativesRankingLoss.

In [None]:
# Prepare training examples
train_examples = []

for item in training_data:
    # MultipleNegativesRankingLoss takes texts=[anchor, positive] pairs; no label is needed.
    # Here anchor=query and positive=relevant_doc. The "negative" field is not used:
    # to add it as a hard negative, pass it as a third text in the same list.
    train_examples.append(
        InputExample(texts=[item["query"], item["positive"]])
    )

# Create DataLoader
train_dataloader = DataLoader(
    train_examples,
    shuffle=True,
    batch_size=4  # Small batch for small dataset
)

# Initialize fine-tuning loss
# MultipleNegativesRankingLoss: treats other positives in batch as negatives
train_loss = losses.MultipleNegativesRankingLoss(base_model)

print("✅ Training configuration:")
print(f"   - Training examples: {len(train_examples)}")
print(f"   - Batch size: 4")
print(f"   - Loss: MultipleNegativesRankingLoss")
print(f"   - Base model: all-MiniLM-L6-v2 (384d)")
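To make the in-batch-negatives idea concrete, here is a dependency-free sketch of the objective: each row of a query-by-document similarity matrix is treated as a classification problem whose correct class is the diagonal entry. The function name and toy matrices are hypothetical, and the scale of 20.0 is assumed to mirror the library default; this is an illustration, not the library's implementation.

```python
import math
from typing import List


def in_batch_ranking_loss(similarities: List[List[float]], scale: float = 20.0) -> float:
    """Mean cross-entropy per row, where column i is the correct document for query i."""
    total = 0.0
    for i, row in enumerate(similarities):
        logits = [scale * s for s in row]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]  # equals -log softmax(logits)[i]
    return total / len(similarities)


# Toy cosine similarities for a batch of two (query, positive) pairs
well_aligned = [[0.9, 0.1],
                [0.2, 0.8]]
confused = [[0.5, 0.5],
            [0.5, 0.5]]

loss_aligned = in_batch_ranking_loss(well_aligned)
loss_confused = in_batch_ranking_loss(confused)
print(f"well aligned: {loss_aligned:.6f}")
print(f"confused:     {loss_confused:.6f}")
```

Because every other positive in the batch acts as a negative, larger batches give each query more negatives to contrast against, which is one reason batch size matters for this loss.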

In [None]:
# Fine-tune the model
print("\n" + "=" * SECTION_WIDTH)
print("FINE-TUNING IN PROGRESS...")
print("=" * SECTION_WIDTH)

# Create output directory
output_path = Path("../../data/models/finetuned-embeddings")
output_path.mkdir(parents=True, exist_ok=True)

# Train (fit() updates base_model's weights in place)
base_model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,  # More epochs for small dataset
    warmup_steps=10,
    output_path=str(output_path),
    show_progress_bar=True
)

print(f"\n✅ Fine-tuning complete! Model saved to: {output_path}")

## 4. Post-Training Evaluation

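Compare the fine-tuned model against the baseline. This demo reuses the eight training queries as the test set, so the fine-tuned scores are optimistic; use held-out queries for a real comparison.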

In [None]:
# Load fine-tuned model
finetuned_model = SentenceTransformer(str(output_path))

# Encode with fine-tuned model
ft_query_embeddings = finetuned_model.encode(test_queries, convert_to_tensor=True)
ft_doc_embeddings = finetuned_model.encode(test_positives, convert_to_tensor=True)

# Calculate similarities
ft_similarities = util.cos_sim(ft_query_embeddings, ft_doc_embeddings)

# Evaluate
ft_correct = 0
ft_mrr_scores = []

for i in range(len(test_queries)):
    scores = ft_similarities[i]
    sorted_indices = scores.argsort(descending=True)
    rank = (sorted_indices == i).nonzero(as_tuple=True)[0].item() + 1
    
    if rank == 1:
        ft_correct += 1
    
    ft_mrr_scores.append(1.0 / rank)

finetuned_accuracy = ft_correct / len(test_queries)
finetuned_mrr = sum(ft_mrr_scores) / len(ft_mrr_scores)

# Print comparison
print("\n" + "=" * SECTION_WIDTH)
print("EVALUATION COMPARISON")
print("=" * SECTION_WIDTH)

print(f"\n{'Metric':<25} {'Baseline':<15} {'Fine-tuned':<15} {'Improvement'}")
print("-" * SECTION_WIDTH)
print(f"{'Accuracy (top-1)':<25} {baseline_accuracy:<15.1%} {finetuned_accuracy:<15.1%} {finetuned_accuracy - baseline_accuracy:+.1%}")
print(f"{'MRR':<25} {baseline_mrr:<15.3f} {finetuned_mrr:<15.3f} {finetuned_mrr - baseline_mrr:+.3f}")
print(f"{'Harmonic Rank (1/MRR)':<25} {1/baseline_mrr:<15.1f} {1/finetuned_mrr:<15.1f} {1/baseline_mrr - 1/finetuned_mrr:+.1f}")

# Relative improvement is undefined when the baseline is 0, so guard the division
if baseline_accuracy > 0:
    improvement = (finetuned_accuracy - baseline_accuracy) / baseline_accuracy * 100
else:
    improvement = float("inf") if finetuned_accuracy > 0 else 0.0
print(f"\n📊 Relative improvement: {improvement:+.1f}%")

if improvement > 10:
    print("✅ Significant improvement! Fine-tuning is beneficial for this domain.")
elif improvement > 0:
    print("⚠️  Modest improvement. Consider more training data or epochs.")
else:
    print("❌ No improvement. Pre-trained model may already be optimal.")

## 5. Integration with RAG Pipeline

Use fine-tuned embeddings in a production RAG system.

In [None]:
# Load documents
docs = load_langchain_docs()
chunks = split_documents(docs)

# Create embeddings using fine-tuned model
finetuned_embeddings = HuggingFaceEmbeddings(
    model_name=str(output_path),
    model_kwargs={'device': 'cpu'}
)

# Build vector store with fine-tuned embeddings
finetuned_vectorstore = FAISS.from_documents(
    chunks[:50],  # Use subset for demo
    finetuned_embeddings
)

# Save for reuse
save_vector_store(finetuned_vectorstore, VECTOR_STORE_DIR / "finetuned")

print("✅ Fine-tuned vector store created")
print(f"   Documents: {len(chunks[:50])}")
print(f"   Saved to: {VECTOR_STORE_DIR / 'finetuned'}")

### Side-by-Side Comparison

Compare retrieval results between baseline and fine-tuned embeddings.

In [None]:
# Load baseline vector store (pre-trained embeddings)
baseline_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
baseline_vectorstore = FAISS.from_documents(chunks[:50], baseline_embeddings)

# Test query
test_query = "How do I reduce RAG costs?"

# Retrieve with baseline
baseline_docs = baseline_vectorstore.similarity_search(test_query, k=3)

# Retrieve with fine-tuned
finetuned_docs = finetuned_vectorstore.similarity_search(test_query, k=3)

# Display results
print("\n" + "=" * SECTION_WIDTH)
print(f"RETRIEVAL COMPARISON: '{test_query}'")
print("=" * SECTION_WIDTH)

print("\n📘 BASELINE MODEL (Pre-trained):")
for i, doc in enumerate(baseline_docs, 1):
    print(f"\n  Result {i}:")
    print(f"  {doc.page_content[:200]}...")

print("\n" + "-" * SECTION_WIDTH)

print("\n🎯 FINE-TUNED MODEL:")
for i, doc in enumerate(finetuned_docs, 1):
    print(f"\n  Result {i}:")
    print(f"  {doc.page_content[:200]}...")
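Reading two result lists side by side is subjective, so a simple overlap score can help quantify how much the fine-tuned model changed the top-k results. The sketch below uses Jaccard overlap on exact chunk text; `top_k_overlap` and the sample strings are hypothetical.

```python
from typing import List


def top_k_overlap(results_a: List[str], results_b: List[str]) -> float:
    """Jaccard overlap of two result lists, compared by exact text."""
    set_a, set_b = set(results_a), set(results_b)
    if not set_a and not set_b:
        return 1.0  # two empty result lists are trivially identical
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical top-3 results from two retrievers
baseline_texts = ["chunk A", "chunk B", "chunk C"]
finetuned_texts = ["chunk B", "chunk D", "chunk A"]

overlap = top_k_overlap(baseline_texts, finetuned_texts)
print(f"Top-3 overlap: {overlap:.2f}")
```

With the notebook's variables, the inputs would be `[d.page_content for d in baseline_docs]` and `[d.page_content for d in finetuned_docs]`. A low overlap only says the rankings differ, not which one is better; that still requires relevance labels.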

## 6. Production Best Practices

### 6.1 Dataset Expansion

For production, you need 1K-10K+ training examples. Here's how to generate them:

In [None]:
def generate_training_data_from_docs(documents: List[Document], llm: ChatOpenAI, num_examples: int = 100):
    """
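    Auto-generate (query, positive_doc) pairs using an LLM.
    
    Strategy:
    1. For each document chunk, ask the LLM to generate 3 queries the chunk answers
    2. Use the chunk as the positive example for each query
    3. No explicit negatives are returned; MultipleNegativesRankingLoss uses
       the other positives in each batch as negatives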
    """
    training_pairs = []
    
    query_gen_prompt = """Given this document chunk, generate 3 questions that this chunk answers:
    
    Document:
    {document}
    
    Return ONLY the 3 questions, one per line, without numbering."""
    
    for doc in documents[:num_examples // 3]:  # Generate 3 queries per doc
        # Generate queries
        response = llm.invoke(query_gen_prompt.format(document=doc.page_content))
        queries = [q.strip() for q in response.content.strip().split('\n') if q.strip()]
        
        for query in queries[:3]:
            training_pairs.append({
                "query": query,
                "positive": doc.page_content
            })
    
    return training_pairs

# Example usage (commented out to avoid API calls in demo)
# llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)
# auto_generated_data = generate_training_data_from_docs(chunks, llm, num_examples=300)

print("✅ Auto-generation function defined")
print("   Use this to scale up to 1K-10K examples for production")
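LLMs do not always follow the "one per line, without numbering" instruction, so generated output is worth cleaning before it becomes training data. The sketch below is one possible parser; `parse_generated_questions` and the sample response are hypothetical, and it assumes that real questions end with a question mark.

```python
import re
from typing import List

# Matches leading list markers such as "1.", "2)", "-", "*", or "•"
_LEADING_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s*")


def parse_generated_questions(raw: str, limit: int = 3) -> List[str]:
    """Strip list markers, drop non-question lines and duplicates, keep at most `limit`."""
    questions = []
    seen = set()
    for line in raw.splitlines():
        cleaned = _LEADING_MARKER.sub("", line).strip()
        if not cleaned.endswith("?"):
            continue  # skips blank lines and preambles like "Here are 3 questions:"
        key = cleaned.lower()
        if key in seen:
            continue
        seen.add(key)
        questions.append(cleaned)
    return questions[:limit]


# Hypothetical LLM response that ignores the formatting instructions
raw_response = """Here are the questions:
1. How do I build a FAISS index?
2) What is MMR?
- What is MMR?

* How is the index saved?"""

parsed = parse_generated_questions(raw_response)
print(parsed)
```

Inside `generate_training_data_from_docs`, this could replace the plain `split('\n')` step by passing `response.content` to the parser.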

### 6.2 Hyperparameter Tuning

Key hyperparameters to experiment with:

```python
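# Batch size is set on the DataLoader, not in fit()
train_dataloader = DataLoader(
    train_examples, shuffle=True, batch_size=16  # Try: 8, 16, 32 (based on GPU memory)
)

# Training configuration
# (`evaluator` and `output_path` are assumed to be defined elsewhere)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,      # Needed for evaluation_steps / save_best_model to take effect
    epochs=10,                # Try: 5, 10, 20
    warmup_steps=100,         # Try: 10%, 20% of total steps
    evaluation_steps=100,     # Evaluate every N steps
    output_path=output_path,  # Where the best model is written
    save_best_model=True,     # Save model with best eval score
    optimizer_params={
        'lr': 2e-5            # Try: 1e-5, 2e-5, 5e-5
    }
)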
```
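The warmup comment above is expressed as a fraction of total steps, which depends on dataset size, batch size, and epochs. A small helper makes the arithmetic explicit; `warmup_steps_for` is a hypothetical name, and it assumes the DataLoader keeps the final partial batch (no `drop_last`).

```python
import math
from typing import Tuple


def warmup_steps_for(
    num_examples: int, batch_size: int, epochs: int, warmup_fraction: float = 0.1
) -> Tuple[int, int]:
    """Return (total optimizer steps, warmup steps) for a given training setup."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    warmup = max(1, round(total_steps * warmup_fraction))
    return total_steps, warmup


# The eight-example demo: batch size 4, 10 epochs
demo_total, demo_warmup = warmup_steps_for(8, 4, 10)
print(f"demo: {demo_total} total steps, {demo_warmup} warmup steps at 10%")

# A production-sized run: 10,000 examples, batch size 16, 10 epochs
prod_total, prod_warmup = warmup_steps_for(10_000, 16, 10)
print(f"production: {prod_total} total steps, {prod_warmup} warmup steps at 10%")
```

For the demo this gives 20 total steps, so the `warmup_steps=10` used earlier covers half of training; on a dataset that small, a value of 2 to 4 would match the 10-20% guideline.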

### 6.3 Cross-Validation

Hold out part of your data so the model is evaluated on queries it never saw during training. The simplest form is a single 80/20 split:

```python
from sklearn.model_selection import train_test_split

# Split 80/20
train_data, test_data = train_test_split(training_data, test_size=0.2, random_state=42)

# Train on train_data, evaluate on test_data
```
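A single split on a small dataset gives a noisy estimate. K-fold cross-validation repeats the split so that every example is held out exactly once. The sketch below is a dependency-free version; `k_fold_splits` is a hypothetical helper, and scikit-learn's `KFold` is the usual choice in practice.

```python
import random
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def k_fold_splits(
    items: Sequence[T], k: int = 5, seed: int = 42
) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (train, test) lists; each item lands in exactly one test fold."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # round-robin assignment to folds
    for held_out in folds:
        held = set(held_out)
        train = [items[i] for i in indices if i not in held]
        test = [items[i] for i in held_out]
        yield train, test


# Hypothetical dataset of eight example IDs, split into four folds
examples = [f"example-{n}" for n in range(8)]
splits = list(k_fold_splits(examples, k=4))

for fold_number, (train, test) in enumerate(splits, start=1):
    print(f"fold {fold_number}: train={len(train)} test={len(test)}")
```

When several queries are generated from the same document, consider splitting by document instead of by example so that near-identical pairs do not end up on both sides of the split.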

### 6.4 Monitoring & Logging

```python
# Track training metrics
import wandb

wandb.init(project="rag-embeddings-finetuning")

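# Log during training (loss, accuracy, and mrr are placeholders for
# values computed in your own training/evaluation loop)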
wandb.log({
    "train_loss": loss,
    "eval_accuracy": accuracy,
    "eval_mrr": mrr
})
```

### 6.5 Model Versioning

```python
import json

# Save with version metadata
version = "v1.0-langchain-rag"
output_path = f"data/models/finetuned-embeddings-{version}"

model.save(output_path)

# Save metadata
metadata = {
    "version": version,
    "base_model": "all-MiniLM-L6-v2",
    "training_examples": len(train_data),
    "epochs": 10,
    "eval_accuracy": finetuned_accuracy,
    "eval_mrr": finetuned_mrr,
    "improvement_vs_baseline": improvement
}

with open(f"{output_path}/metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

## 7. Cost-Benefit Analysis

### Training Costs

**One-time costs:**
- Dataset generation: $2-10 (if using LLM to generate queries)
- GPU training: $0-5 (free on Colab/Kaggle, or ~$1/hour on cloud)
- Validation: ~$0.50 (evaluation queries)

**Total:** $2.50-$15.50 one-time

### Ongoing Benefits

**Quality improvements:**
- 15-25% better retrieval accuracy
- Better domain-specific understanding
- Reduced hallucinations (better context)

**Cost savings:**
- Local embeddings have no per-query API fee (vs. OpenAI at $0.00002/query, which corresponds to about 1,000 tokens per query at $0.02/1M tokens)
- Offline capability
- No API rate limits

**ROI Example:**
```
Scenario: 1M queries/month

OpenAI embeddings:
- Cost: 1M × $0.00002 = $20/month
- Annual: $240

Fine-tuned local embeddings:
- Cost: $0/month (after one-time $10 training)
- Annual: $10

Savings: $230/year + better quality
ROI: 23x
```
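The same arithmetic can be parameterized to test other volumes. This is a rough sketch: `annual_embedding_costs` is a hypothetical helper, and like the scenario above it assumes local inference has no recurring cost, which ignores hardware and hosting.

```python
from typing import Dict


def annual_embedding_costs(
    queries_per_month: int, api_cost_per_query: float, one_time_training_cost: float
) -> Dict[str, float]:
    """Compare a year of API embedding fees with a one-time fine-tuning cost."""
    api_annual = queries_per_month * api_cost_per_query * 12
    savings = api_annual - one_time_training_cost
    return {
        "api_annual": api_annual,
        "local_annual": one_time_training_cost,
        "savings": savings,
        "roi_multiple": savings / one_time_training_cost,
    }


# The scenario above: 1M queries/month, $0.00002/query, $10 training
high_volume = annual_embedding_costs(1_000_000, 0.00002, 10.0)
print({name: round(value, 2) for name, value in high_volume.items()})

# A low-volume case: 10K queries/month
low_volume = annual_embedding_costs(10_000, 0.00002, 10.0)
print({name: round(value, 2) for name, value in low_volume.items()})
```

At 10K queries per month the API bill is about $2.40 per year, so the training cost is not recovered on price alone; at that volume the case for fine-tuning has to rest on quality.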

### When Fine-tuning Makes Sense

✅ **Worth it:**
- High query volume (>100K/month)
- Domain-specific content
- Budget constraints
- Offline requirements
- Quality < 75% with pre-trained

❌ **Not worth it:**
- Low query volume (<10K/month)
- General content
- OpenAI already works well (>90%)
- Limited training data (<1K examples)

## 8. Summary & Decision Framework

### Quick Decision Tree

```
Start: Should I fine-tune embeddings?
  │
  ├─ Is domain highly specialized? (medical, legal, technical)
  │   YES → Fine-tune (expected +15-25% accuracy)
  │   NO → ↓
  │
  ├─ Do I have >1K labeled examples?
  │   YES → Fine-tune
  │   NO → ↓
  │
  ├─ Is baseline accuracy <75%?
  │   YES → Fine-tune or try Contextual RAG first
  │   NO → ↓
  │
  ├─ Query volume >100K/month?
  │   YES → Fine-tune (cost savings)
  │   NO → Stick with pre-trained
```
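The tree can also be written as a small function, which makes the order of the checks explicit. This is a direct transcription under the thresholds shown above; `should_fine_tune` and its parameter names are hypothetical.

```python
def should_fine_tune(
    specialized_domain: bool,
    labeled_examples: int,
    baseline_accuracy: float,
    queries_per_month: int,
) -> str:
    """Walk the decision tree top to bottom and return the first matching recommendation."""
    if specialized_domain:
        return "fine-tune"
    if labeled_examples > 1_000:
        return "fine-tune"
    if baseline_accuracy < 0.75:
        return "fine-tune or try Contextual RAG first"
    if queries_per_month > 100_000:
        return "fine-tune (cost savings)"
    return "stick with pre-trained"


# Hypothetical project: general domain, 500 labeled pairs, 80% baseline, 200K queries/month
recommendation = should_fine_tune(
    specialized_domain=False,
    labeled_examples=500,
    baseline_accuracy=0.80,
    queries_per_month=200_000,
)
print(recommendation)
```

Baseline accuracy is passed as a fraction (0.80 for 80%), and the first matching branch wins, mirroring how the tree is read.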

### Implementation Checklist

**Before fine-tuning:**
- [ ] Collect 1K-10K (query, document) pairs
- [ ] Measure baseline performance (accuracy, MRR)
- [ ] Set target improvement (e.g., +15% accuracy)
- [ ] Allocate GPU resources (Colab, Kaggle, or cloud)

**During fine-tuning:**
- [ ] Split data (80/20 train/test)
- [ ] Start with a small number of epochs (3-5)
- [ ] Monitor training loss
- [ ] Evaluate on held-out test set
- [ ] Save best checkpoint

**After fine-tuning:**
- [ ] Compare vs baseline on test set
- [ ] Test on production-like queries
- [ ] Integrate into RAG pipeline
- [ ] Monitor retrieval quality in production
- [ ] Set up retraining schedule (quarterly?)

### Next Steps

1. **Scale up training data** to 1K-10K examples
2. **Experiment with different base models** (e.g., BGE, E5, instructor)
3. **Try different loss functions** (CoSENT, ContrastiveLoss)
4. **Implement continuous evaluation** on production queries
5. **Set up A/B testing** (baseline vs fine-tuned)

---

**📚 Related Notebooks:**
- [02_embeddings_comparison.ipynb](../fundamentals/02_embeddings_comparison.ipynb) - OpenAI vs HuggingFace baseline
- [12_contextual_rag.ipynb](12_contextual_rag.ipynb) - Alternative quality improvement
- [16_evaluation_ragas.ipynb](16_evaluation_ragas.ipynb) - Comprehensive quality metrics

**🔗 External Resources:**
- [Sentence Transformers Docs](https://www.sbert.net/)
- [Fine-tuning Guide](https://www.sbert.net/docs/training/overview.html)
- [BEIR Benchmark](https://github.com/beir-cellar/beir) - Retrieval evaluation

---

🎉 **Fine-tuning Complete!**

You now understand when and how to fine-tune embeddings for domain-specific RAG applications!