# Caching Embeddings
In **Retrieval-Augmented Generation (RAG)** systems, text data is converted into vector embeddings for semantic search and retrieval.
However, generating embeddings repeatedly for the same text adds cost and latency, especially when using hosted APIs such as OpenAI or Azure OpenAI.

**Caching embeddings** means storing previously computed embeddings locally (or in a database) so that later requests for the same text reuse the stored vector instead of recomputing it.

## Use Case
Consider a **customer support chatbot** that answers queries from customers about company policies, refunds, shipping, or product details.
Many customers ask the same or similar questions repeatedly, for example:

- “What’s the refund policy?”

- “How can I return an item?”

- “Do you offer free shipping?”

In such cases, the chatbot needs to generate embeddings for these queries before performing a semantic search.
Without caching, the same embeddings are generated multiple times for identical queries, leading to unnecessary API calls and higher latency.

By implementing **embedding caching**, the system first checks whether an embedding for the given query already exists in the cache (e.g., OpenSearch, Redis, or a local JSON file).
If found, it reuses the stored embedding instead of generating a new one, resulting in:

- Faster response times

- Lower embedding API costs

- Reduced computational overhead
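The check-then-reuse flow described above can be sketched without any external service. This is a minimal illustration, not the notebook's implementation: `fake_embed` is a hypothetical stand-in for a real embedding API call, and `make_cached_embedder` is an illustrative helper name.

```python
from typing import Callable, Dict, List, Tuple

def make_cached_embedder(
    embed_fn: Callable[[str], List[float]]
) -> Tuple[Callable[[str], List[float]], Dict[str, int]]:
    """Wrap an embedding function with an in-memory cache and hit/miss counters."""
    cache: Dict[str, List[float]] = {}
    stats = {"hits": 0, "misses": 0}

    def cached(text: str) -> List[float]:
        if text in cache:
            stats["hits"] += 1       # reuse the stored vector
            return cache[text]
        stats["misses"] += 1         # only a miss reaches the embedding function
        cache[text] = embed_fn(text)
        return cache[text]

    return cached, stats

def fake_embed(text: str) -> List[float]:
    # Hypothetical stand-in for a real (slow, billed) embedding API call
    return [float(len(text)), float(sum(map(ord, text)) % 97)]

cached_embed, stats = make_cached_embedder(fake_embed)
for q in ["What's the refund policy?", "Do you offer free shipping?", "What's the refund policy?"]:
    cached_embed(q)

print(stats)  # {'hits': 1, 'misses': 2}
```

Only the two distinct questions reach the embedding function; the repeated question is served from the cache.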

⚙️ **When It’s Useful**

- Chatbots that handle frequent, repetitive customer questions.

- Scenarios with high query volume and limited variation in phrasing.

- Deployments that scale horizontally, where multiple instances can reuse the same cached embeddings from a shared store.
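A cache keyed on the exact text only hits when two queries are character-for-character identical. One possible way to raise the hit rate when phrasing varies only in case or spacing is to normalize the text before deriving the key. The sketch below is an assumption layered on top of the approach in this notebook; `normalize_query` and `normalized_key` are hypothetical helper names. Note the trade-off: queries that normalize to the same key share whichever vector was stored first.

```python
import hashlib
import unicodedata

def normalize_query(text: str) -> str:
    """Apply Unicode NFKC, fold case, and collapse runs of whitespace."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(folded.split())

def normalized_key(text: str) -> str:
    """Derive the cache key from the normalized form of the text."""
    return hashlib.sha256(normalize_query(text).encode("utf-8")).hexdigest()

a = normalized_key("What's the refund policy?")
b = normalized_key("  what's the   REFUND policy? ")
c = normalized_key("How can I return an item?")

print(a == b, a == c)  # True False
```

Genuinely different wording (for example "refund policy" versus "return policy") still produces a different key, so this does not replace semantic matching.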

In [22]:
import os
import json
import hashlib
import pandas as pd
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

In [23]:
# Initialize OpenAI embedding model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

In [24]:
# Example dataset: customer service knowledge base
data = [
    "Warranty covers accidental screen damage for one year.",
    "Refunds are processed within 7 business days.",
    "Free shipping is available for orders above $50.",
    "Warranty covers accidental screen damage for one year."  # Duplicate
]

df = pd.DataFrame({"text": data})
df

                                                text
0  Warranty covers accidental screen damage for o...
1      Refunds are processed within 7 business days.
2   Free shipping is available for orders above $50.
3  Warranty covers accidental screen damage for o...


In [25]:
def generate_hash(text: str) -> str:
    """
    Generate a consistent hash for each text string.

    :param text: text to be embedded
    """
    return hashlib.md5(text.encode('utf-8')).hexdigest()
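A key derived from the text alone will return a stale vector if the embedding model is later changed, since different models generally produce different vectors (often with different dimensions). A hedged variant is to fold the model name into the key. `generate_cache_key` is a hypothetical helper, and the use of SHA-256 and a NUL separator is an illustrative choice rather than something this notebook prescribes.

```python
import hashlib

def generate_cache_key(text: str, model: str) -> str:
    """Hash the model name together with the text so each model gets its own entries."""
    # The NUL byte separates the two fields so ("ab", "c") and ("a", "bc") cannot collide
    payload = f"{model}\x00{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

text = "Refunds are processed within 7 business days."
key_small = generate_cache_key(text, "text-embedding-3-small")
key_large = generate_cache_key(text, "text-embedding-3-large")

print(key_small != key_large)  # True
```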

#### Caching Mechanism

In [26]:
CACHE_FILE = "embedding_cache.json"

# Load existing cache if available
if os.path.exists(CACHE_FILE):
    with open(CACHE_FILE, "r") as f:
        embedding_cache = json.load(f)
else:
    embedding_cache = {}

In [27]:
def get_or_create_embedding(text: str) -> list:
    """Return the embedding for `text`, calling the API only on a cache miss."""
    key = generate_hash(text)
    if key not in embedding_cache:
        # Cache miss: call the embedding API once and store the result
        embedding_cache[key] = embeddings.embed_query(text)
    return embedding_cache[key]

# Compute embeddings for dataset
df["embedding"] = df["text"].apply(get_or_create_embedding)

# Save updated cache
with open(CACHE_FILE, "w") as f:
    json.dump(embedding_cache, f)
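Applying the lookup row by row issues one API request per cache miss. When many texts are embedded at once, a possible refinement is to collect the misses and send them in a single batch. LangChain embedding classes expose `embed_documents(texts)` for this; the sketch below uses a hypothetical `StubEmbeddings` class with the same method so it runs without an API key, and `embed_with_cache` is an illustrative helper name.

```python
import hashlib
from typing import Dict, List

class StubEmbeddings:
    """Hypothetical stand-in that mimics the embed_documents method of an embedding model."""
    def __init__(self) -> None:
        self.calls = 0

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        self.calls += 1  # count batch requests
        return [[float(len(t))] for t in texts]

def embed_with_cache(texts: List[str], model, cache: Dict[str, List[float]]) -> List[List[float]]:
    """Embed texts, sending only unique uncached texts to the model in one batch."""
    keys = [hashlib.md5(t.encode("utf-8")).hexdigest() for t in texts]
    missing: Dict[str, str] = {}
    for key, text in zip(keys, texts):
        if key not in cache and key not in missing:
            missing[key] = text
    if missing:
        vectors = model.embed_documents(list(missing.values()))
        cache.update(zip(missing.keys(), vectors))
    return [cache[key] for key in keys]

texts = [
    "Warranty covers accidental screen damage for one year.",
    "Refunds are processed within 7 business days.",
    "Free shipping is available for orders above $50.",
    "Warranty covers accidental screen damage for one year.",  # duplicate
]
stub, cache = StubEmbeddings(), {}
vectors = embed_with_cache(texts, stub, cache)

print(stub.calls, len(cache), len(vectors))  # 1 3 4
```

The four rows produce one batch request containing three unique texts, and the duplicate row reuses the vector of the first.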


#### Vector Store (FAISS)

In [28]:
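# Build a FAISS index using cached embeddings
# Note: the duplicate row is embedded only once (cache hit) but is still indexed as its own document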

texts = df["text"].tolist()
vectors = df["embedding"].tolist()

# Create text-embedding pairs for FAISS.from_embeddings
text_embeddings = list(zip(texts, vectors))

vectorstore = FAISS.from_embeddings(
    text_embeddings=text_embeddings,
    embedding=embeddings
)

print("✅ FAISS index built successfully with caching enabled.")
print(f"📊 Index contains {len(texts)} documents")

✅ FAISS index built successfully with caching enabled.
📊 Index contains 4 documents


In [30]:
import time

def format_time(ms):
    """Format time in appropriate units (ms, seconds, or minutes)."""
    if ms < 1000:
        return f"{ms:.2f} ms"
    elif ms < 60000:  # Less than 1 minute
        seconds = ms / 1000
        return f"{seconds:.2f} seconds"
    else:
        minutes = ms / 60000
        return f"{minutes:.2f} minutes"

# Test query
query = "What is the policy for free shipping?"

print(f"🔍 Testing query: '{query}'")
print()

# First time - generate new embedding
print("1️⃣ First time (new embedding):")
start_time = time.perf_counter()  # monotonic, high-resolution timer for short intervals
embedding1 = get_or_create_embedding(query)
time1 = (time.perf_counter() - start_time) * 1000
print(f"   Time: {format_time(time1)}")

# Second time - use cached embedding
print("2️⃣ Second time (cached embedding):")
start_time = time.perf_counter()
embedding2 = get_or_create_embedding(query)
time2 = (time.perf_counter() - start_time) * 1000
print(f"   Time: {format_time(time2)}")

# Show the difference
time_saved = time1 - time2
print(f"\n📈 Results:")
print(f"   Time saved: {format_time(time_saved)}")
print(f"   Speed improvement: {((time1 - time2) / time1) * 100:.1f}%")


🔍 Testing query: 'What is the policy for free shipping?'

1️⃣ First time (new embedding):
   Time: 1.02 seconds
2️⃣ Second time (cached embedding):
   Time: 0.07 ms

📈 Results:
   Time saved: 1.02 seconds
   Speed improvement: 100.0%
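Because a cache hit takes microseconds, the percentage above rounds to 100.0% and hides how large the gap is. A speedup ratio, measured over many repeated hits, can be more informative. The sketch below is illustrative only: `slow_embed` simulates API latency with a hypothetical 20 ms sleep, so the numbers it prints are not measurements of a real embedding API.

```python
import time
from statistics import median

def slow_embed(text):
    time.sleep(0.02)  # hypothetical 20 ms of simulated network latency
    return [float(len(text))]

cache = {}

def cached_embed(text):
    if text not in cache:
        cache[text] = slow_embed(text)
    return cache[text]

def time_call(fn, arg, repeats=1):
    """Return the median wall-clock duration of fn(arg) in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        samples.append(time.perf_counter() - start)
    return median(samples)

query = "What is the policy for free shipping?"
miss_s = time_call(cached_embed, query)               # first call: cache miss
hit_s = time_call(cached_embed, query, repeats=1000)  # later calls: cache hits
speedup = miss_s / max(hit_s, 1e-9)                   # guard against a zero reading

print(f"miss: {miss_s * 1000:.2f} ms | hit: {hit_s * 1e6:.2f} µs | speedup: {speedup:,.0f}x")
```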
