# Prototyping a LangChain Application with Production-Minded Changes

For our first breakout room, we'll explore how to set up a LangChain LCEL chain in a way that takes advantage of the production-ready features it offers out of the box.

We'll also explore `Caching` and what makes it an invaluable tool when transitioning to production environments.


## Task 1: Dependencies and Set-Up

Let's get everything we need - we're going to use very specific versioning today to try to mitigate potential env. issues!

> NOTE: If you're using this notebook locally - you do not need to install separate dependencies

In [24]:
#!pip install -qU langchain_openai==0.2.0 langchain_community==0.3.0 langchain==0.3.0 pymupdf==1.24.10 qdrant-client==1.11.2 langchain_qdrant==0.1.4 langsmith==0.1.121 langchain_huggingface==0.2.0

We'll need an HF Token:

In [1]:
import os
import getpass

os.environ["HF_TOKEN"] = getpass.getpass("HF Token Key:")

And the LangSmith set-up:

In [3]:
import uuid

os.environ["LANGCHAIN_PROJECT"] = f"AIM Session 16 - {uuid.uuid4().hex[0:8]}"
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

Let's verify our project so we can leverage it in LangSmith later.

In [4]:
print(os.environ["LANGCHAIN_PROJECT"])

AIM Session 16 - 41a59617


## Task 2: Setting up RAG With Production in Mind

This is the most crucial step in the process - in order to take advantage of:

- Asynchronous requests
- Parallel Execution in Chains
- And more...

You must...use LCEL. These benefits are provided out of the box and largely optimized behind the scenes.

### Building our RAG Components: Retriever

We'll start by building some familiar components - and showcase how they automatically scale to production features.

Please upload a PDF file to use in this example!

> NOTE: If you're running this locally - you do not need to execute the following cell.

In [7]:
#from google.colab import files
#uploaded = files.upload()



In [5]:
file_path = "./DeepSeek_R1.pdf"
file_path

'./DeepSeek_R1.pdf'

We'll define our chunking strategy.

In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
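
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window splitter. It is a deliberate simplification and not how `RecursiveCharacterTextSplitter` works internally (the real splitter prefers to break on separators such as paragraphs and newlines). The helper name `sliding_window_chunks` and the sample string are hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.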

We'll chunk our uploaded PDF file.

In [7]:
from langchain_community.document_loaders import PyMuPDFLoader

Loader = PyMuPDFLoader
loader = Loader(file_path)
documents = loader.load()
docs = text_splitter.split_documents(documents)
for i, doc in enumerate(docs):
    doc.metadata["source"] = f"source_{i}"

#### Qdrant Vector Database - Cache-Backed Embeddings

The process of embedding is typically very time-consuming. For every single vector in our VDB, as well as for every query, we must:

1. Send the text to an API endpoint (self-hosted, OpenAI, etc)
2. Wait for processing
3. Receive response

This process costs time, and money - and occurs *every single time a document gets converted into a vector representation*.

Instead, what if we:

1. Set up a cache that can hold our texts and their vectors (similar to, or in some cases literally, a vector database)
2. Check the cache to see if we've already converted this text before.
  - If we have: Return the cached vector representation
  - Else: Send the text to an API endpoint (self-hosted, OpenAI, etc), wait for processing, and proceed
3. Store the text that was converted alongside its vector representation in the cache.
4. Return the vector representation

Notice that every cache hit lets us skip the "send and wait for processing" step entirely.
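
The flow above can be sketched in plain Python. This is an illustrative toy, not LangChain's implementation: `ToyCachedEmbedder` and `fake_embed` are hypothetical names, the "endpoint" is a local function, and the cache is an in-memory dict rather than a file store.

```python
import hashlib


class ToyCachedEmbedder:
    """Illustrative cache placed in front of an embedding function."""

    def __init__(self, embed_fn, namespace: str):
        self.embed_fn = embed_fn      # stand-in for the remote embedding call
        self.namespace = namespace    # keeps vectors from different models apart
        self.store: dict[str, list[float]] = {}
        self.api_calls = 0            # counts how often the "endpoint" is hit

    def _key(self, text: str) -> str:
        # Hash the text so that keys have a fixed, storage-friendly length
        return self.namespace + hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        keys = [self._key(t) for t in texts]
        for key, text in zip(keys, texts):
            if key not in self.store:  # cache miss: call the endpoint and store the result
                self.api_calls += 1
                self.store[key] = self.embed_fn(text)
        return [self.store[k] for k in keys]


def fake_embed(text: str) -> list[float]:
    # Hypothetical deterministic "embedding": text length and vowel count
    return [float(len(text)), float(sum(ch in "aeiou" for ch in text))]


embedder = ToyCachedEmbedder(fake_embed, namespace="toy-model:")
first = embedder.embed_documents(["hello world", "caching"])
calls_after_first = embedder.api_calls
second = embedder.embed_documents(["hello world", "caching", "new text"])
calls_after_second = embedder.api_calls
print(calls_after_first, calls_after_second)
```

The second call only pays for `"new text"`, because the other two keys are already in the store.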

Let's see how this is implemented in the code.

In [11]:
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from langchain.storage import LocalFileStore
from langchain_qdrant import QdrantVectorStore
from langchain.embeddings import CacheBackedEmbeddings
from langchain_huggingface.embeddings import HuggingFaceEndpointEmbeddings
import hashlib

YOUR_EMBED_MODEL_URL = "https://w5s43nqb7oa6x3w0.us-east-1.aws.endpoints.huggingface.cloud"

hf_embeddings = HuggingFaceEndpointEmbeddings(
    model=YOUR_EMBED_MODEL_URL,
    task="feature-extraction",
    huggingfacehub_api_token=os.environ["HF_TOKEN"],
)

collection_name = f"pdf_to_parse_{uuid.uuid4()}"
client = QdrantClient(":memory:")
client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

# Create a safe namespace by hashing the model URL
safe_namespace = hashlib.md5(hf_embeddings.model.encode()).hexdigest()

store = LocalFileStore("./cache/")
cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    hf_embeddings, store, namespace=safe_namespace, batch_size=32
)

# Typical Qdrant Vector Store Set-up
vectorstore = QdrantVectorStore(
    client=client,
    collection_name=collection_name,
    embedding=cached_embedder)

vectorstore.add_documents(docs)
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 1})

##### ❓ Question #1:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

> NOTE: There is no single correct answer here!

##### Answer:

One disadvantage is revalidation: once the underlying data changes, the cached embeddings can go out of sync with it.

##### 🏗️ Activity #1:

Create a simple experiment that tests the cache-backed embeddings.

In [12]:
### SOLUTION FOR ACTIVITY #1: Cache-Backed Embeddings Experiment
import time
from langchain.storage import LocalFileStore
from langchain.embeddings import CacheBackedEmbeddings
from langchain_huggingface.embeddings import HuggingFaceEndpointEmbeddings
import hashlib
import os

print("🔬 Cache-Backed Embeddings Experiment")
print("=" * 50)

# Use the same HuggingFace endpoint setup from the notebook
YOUR_EMBED_MODEL_URL = "https://w5s43nqb7oa6x3w0.us-east-1.aws.endpoints.huggingface.cloud"

print("📥 Setting up HuggingFace endpoint embeddings...")
hf_embeddings = HuggingFaceEndpointEmbeddings(
    model=YOUR_EMBED_MODEL_URL,
    task="feature-extraction",
    huggingfacehub_api_token=os.environ["HF_TOKEN"],
)

# Create cache setup (same as in notebook)
cache_dir = "./cache/"
store = LocalFileStore(cache_dir)
safe_namespace = hashlib.md5(hf_embeddings.model.encode()).hexdigest()

# Create cache-backed embeddings
cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    hf_embeddings, store, namespace=safe_namespace, batch_size=32
)

# Test data
test_texts = [
    "This is a test sentence for cache experiment.",
    "Another test sentence to check caching behavior.",
    "The cache should speed up repeated embeddings.",
    "LangChain cache-backed embeddings save computation time.",
    "Production systems benefit from embedding caches."
]

print(f"\n🧪 Testing with {len(test_texts)} text samples")

# Experiment 1: First run (no cache) - API calls
print("\n1️⃣ First run (making API calls to HuggingFace endpoint):")
start_time = time.time()
embeddings_1 = cached_embedder.embed_documents(test_texts)
first_run_time = time.time() - start_time
print(f"   ⏱️  Time: {first_run_time:.4f} seconds")
print(f"   📊 Embeddings generated: {len(embeddings_1)}")
print(f"   📏 Embedding dimension: {len(embeddings_1[0])}")

# Experiment 2: Second run (using cache) - no API calls
print("\n2️⃣ Second run (should use cache, no API calls):")
start_time = time.time()
embeddings_2 = cached_embedder.embed_documents(test_texts)
second_run_time = time.time() - start_time
print(f"   ⏱️  Time: {second_run_time:.4f} seconds")

# Verify embeddings are identical
import numpy as np
embeddings_equal = np.allclose(embeddings_1, embeddings_2, rtol=1e-5)
print(f"   ✅ Embeddings identical: {embeddings_equal}")

# Performance analysis
print("\n📈 Performance Analysis:")
# Define speedup unconditionally so the final summary can always reference it
speedup = first_run_time / second_run_time if second_run_time > 0 else float("inf")
if second_run_time > 0:
    time_saved = first_run_time - second_run_time
    print(f"   🚀 Speedup factor: {speedup:.2f}x")
    print(f"   ⏰ Time saved: {time_saved:.4f} seconds ({(time_saved/first_run_time)*100:.1f}%)")
    print(f"   💰 API calls saved: {len(test_texts)} embedding requests")
else:
    print("   ⚡ Cache response was instantaneous!")

# Experiment 3: Mixed scenario (some cached, some new)
print("\n3️⃣ Mixed scenario (some cached + some new text):")
mixed_texts = test_texts[:2] + ["This is completely new text not in cache."]
start_time = time.time()
mixed_embeddings = cached_embedder.embed_documents(mixed_texts)
mixed_time = time.time() - start_time
print(f"   ⏱️  Time: {mixed_time:.4f} seconds")
print(f"   📝 Texts processed: {len(mixed_texts)} (2 cached + 1 new)")
print(f"   🌐 API calls made: 1 (only for the new text)")

# Cache inspection
print("\n🗂️  Cache Inspection:")
if os.path.exists(cache_dir):
    cache_files = [f for f in os.listdir(cache_dir) if not f.startswith('.')]
    print(f"   📁 Cache directory: {cache_dir}")
    print(f"   📄 Cache files created: {len(cache_files)}")
    
    # Show the first 3 files, but sum the size of every cache file
    total_cache_size = 0
    for i, file in enumerate(cache_files):
        # Use a distinct name so the PDF `file_path` defined earlier is not overwritten
        cache_file_path = os.path.join(cache_dir, file)
        file_size = os.path.getsize(cache_file_path)
        total_cache_size += file_size
        if i < 3:
            print(f"     - {file}: {file_size} bytes")

    if len(cache_files) > 3:
        print(f"     ... and {len(cache_files) - 3} more files")
    print(f"   💾 Total cache size: {total_cache_size} bytes")
else:
    print(f"   ❌ Cache directory not found: {cache_dir}")

# Experiment 4: Single query test
print("\n4️⃣ Single query cache test:")
query_text = test_texts[0]

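# Query with a text that was already embedded as a document.
# NOTE: by default, CacheBackedEmbeddings does not cache query embeddings;
# embed_query only reads the cache when the embedder is created with
# query_embedding_cache=True, so with the setup above this still calls the endpoint.
start_time = time.time()
cached_query_embedding = cached_embedder.embed_query(query_text)
cached_query_time = time.time() - start_time
print(f"   🔍 Repeated-text query time: {cached_query_time:.4f} seconds")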

# New query (not cached)
new_query = "Brand new query text for testing."
start_time = time.time()
new_query_embedding = cached_embedder.embed_query(new_query)
new_query_time = time.time() - start_time
print(f"   🆕 New query time: {new_query_time:.4f} seconds (API call made)")

# Final summary
print("\n🎯 EXPERIMENT RESULTS")
print("=" * 50)
if speedup > 5:
    print("✅ EXCELLENT: Cache provides significant speedup with remote API!")
elif speedup > 2:
    print("✅ GOOD: Cache provides substantial speedup!")
else:
    print("✅ Cache is working - even small speedups matter with API costs!")

print("\n💡 KEY INSIGHTS:")
print("   • Cache-backed embeddings eliminate redundant API calls")
print("   • Dramatic speedup with remote endpoints (network latency removed)")
print("   • Saves API costs and quota usage")
print("   • Essential for production with repeated text processing")
print("   • Cache namespace prevents conflicts between different models")

print("\n✨ PRODUCTION BENEFITS WITH REMOTE ENDPOINTS:")
print("   • Eliminates network latency for repeated embeddings")
print("   • Reduces API costs significantly")
print("   • Improves reliability (no network dependency for cached items)")
print("   • Better user experience with consistent fast responses")
print("   • Scales efficiently with high-traffic applications")

🔬 Cache-Backed Embeddings Experiment
📥 Setting up HuggingFace endpoint embeddings...

🧪 Testing with 5 text samples

1️⃣ First run (making API calls to HuggingFace endpoint):
   ⏱️  Time: 0.3761 seconds
   📊 Embeddings generated: 5
   📏 Embedding dimension: 768

2️⃣ Second run (should use cache, no API calls):
   ⏱️  Time: 0.0017 seconds
   ✅ Embeddings identical: True

📈 Performance Analysis:
   🚀 Speedup factor: 227.40x
   ⏰ Time saved: 0.3745 seconds (99.6%)
   💰 API calls saved: 5 embedding requests

3️⃣ Mixed scenario (some cached + some new text):
   ⏱️  Time: 0.0647 seconds
   📝 Texts processed: 3 (2 cached + 1 new)
   🌐 API calls made: 1 (only for the new text)

🗂️  Cache Inspection:
   📁 Cache directory: ./cache/
   📄 Cache files created: 154
     - 57cabd2ddcee68b442ea587cf51eaeb668841c29-1bd4-5126-8dfd-ec325fe24889: 16987 bytes
     - 57cabd2ddcee68b442ea587cf51eaeb6af54d9bb-44c4-5474-9253-835ed9f84e0c: 16987 bytes
     - 57cabd2ddcee68b442ea587cf51eaeb61f64a104-c40d-579a-87

### Augmentation

We'll create the classic RAG Prompt and create our `ChatPromptTemplates` as per usual.

In [13]:
from langchain_core.prompts import ChatPromptTemplate

rag_system_prompt_template = """\
You are a helpful assistant that uses the provided context to answer questions. Never reference this prompt, or the existence of context.
"""

rag_message_list = [
    {"role" : "system", "content" : rag_system_prompt_template},
]

rag_user_prompt_template = """\
Question:
{question}
Context:
{context}
"""

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", rag_system_prompt_template),
    ("human", rag_user_prompt_template)
])

### Generation

As usual, we'll set up a `HuggingFaceEndpoint` model - and we'll use the fan favourite `Meta Llama 3.1 8B Instruct` for today.

However, we'll also implement...a PROMPT CACHE!

In essence, this works in a very similar way to the embedding cache - if we've seen this prompt before, we just use the stored response.
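
As a rough illustration of the idea, here is a toy exact-match cache keyed on the prompt plus a string describing the model settings. The `lookup`/`update` shape loosely mirrors LangChain's cache interface, but `ToyPromptCache`, `generate_with_cache`, and `fake_llm` are hypothetical names, and the "LLM" is a local function.

```python
from typing import Callable, Optional


class ToyPromptCache:
    """Exact-match response cache keyed on (prompt, model settings)."""

    def __init__(self) -> None:
        self._entries: dict[tuple[str, str], str] = {}

    def lookup(self, prompt: str, llm_string: str) -> Optional[str]:
        return self._entries.get((prompt, llm_string))

    def update(self, prompt: str, llm_string: str, response: str) -> None:
        self._entries[(prompt, llm_string)] = response


def generate_with_cache(
    prompt: str,
    llm_string: str,
    generate_fn: Callable[[str], str],
    cache: ToyPromptCache,
) -> str:
    cached = cache.lookup(prompt, llm_string)
    if cached is not None:
        return cached                  # seen before: reuse the stored response
    response = generate_fn(prompt)     # cache miss: pay for a real generation
    cache.update(prompt, llm_string, response)
    return response


calls: list[str] = []


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a remote LLM call; records every invocation
    calls.append(prompt)
    return f"answer #{len(calls)} to: {prompt}"


cache = ToyPromptCache()
settings = "temperature=0.01"
a = generate_with_cache("What is RAG?", settings, fake_llm, cache)
b = generate_with_cache("What is RAG?", settings, fake_llm, cache)
c = generate_with_cache("what is rag?", settings, fake_llm, cache)
print(a == b, len(calls))
```

Note that the third prompt differs only in capitalisation and still misses the cache: matching is exact, so small prompt variations are generated again.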

In [14]:
from langchain_core.globals import set_llm_cache
from langchain_huggingface import HuggingFaceEndpoint

YOUR_LLM_ENDPOINT_URL = "https://nkavqtrvj5mcwad8.us-east-1.aws.endpoints.huggingface.cloud"

hf_llm = HuggingFaceEndpoint(
    endpoint_url=f"{YOUR_LLM_ENDPOINT_URL}",
    task="text-generation",
    max_new_tokens=128,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
)

Setting up the cache can be done as follows:

In [15]:
from langchain_core.caches import InMemoryCache

set_llm_cache(InMemoryCache())

##### ❓ Question #2:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

> NOTE: There is no single correct answer here!
>

##### Answer:

One limitation I can see, especially in my own project (Prompt Lab), is that we may need to rerun the same prompt over and over again, especially when tools are being used. Those tools might require current data, in which case a cached response would be stale.

##### 🏗️ Activity #2:

Create a simple experiment that tests the cache-backed generator.

In [16]:
### SOLUTION FOR ACTIVITY #2: Cache-Backed Generator Experiment
import time
from langchain_core.globals import set_llm_cache, get_llm_cache
from langchain_core.caches import InMemoryCache
from langchain_huggingface import HuggingFaceEndpoint

print("🔬 Cache-Backed Generator (LLM) Experiment")
print("=" * 50)

# Use the same HuggingFace endpoint setup from the notebook
YOUR_LLM_ENDPOINT_URL = "https://nkavqtrvj5mcwad8.us-east-1.aws.endpoints.huggingface.cloud"

print("📥 Setting up HuggingFace endpoint LLM...")
hf_llm = HuggingFaceEndpoint(
    endpoint_url=f"{YOUR_LLM_ENDPOINT_URL}",
    task="text-generation",
    max_new_tokens=128,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
)

# Set up the cache (same as in notebook)
print("🗄️  Setting up InMemory cache...")
set_llm_cache(InMemoryCache())

# Test prompts
test_prompts = [
    "What is machine learning?",
    "Explain the concept of neural networks.",
    "What are the benefits of using LangChain?",
    "How does caching improve performance in AI applications?",
    "What is the difference between supervised and unsupervised learning?"
]

print(f"\n🧪 Testing with {len(test_prompts)} different prompts")

# Experiment 1: First run (no cache) - API calls
print("\n1️⃣ First run (making API calls to HuggingFace endpoint):")
first_run_times = []
first_run_responses = []

for i, prompt in enumerate(test_prompts):
    print(f"   Processing prompt {i+1}: '{prompt[:30]}...'")
    start_time = time.time()
    response = hf_llm.invoke(prompt)
    elapsed_time = time.time() - start_time
    first_run_times.append(elapsed_time)
    first_run_responses.append(response)
    print(f"     ⏱️  Time: {elapsed_time:.4f} seconds")

total_first_run_time = sum(first_run_times)
print(f"\n   📊 Total first run time: {total_first_run_time:.4f} seconds")
print(f"   📏 Average response length: {sum(len(r) for r in first_run_responses) / len(first_run_responses):.0f} chars")

# Experiment 2: Second run (using cache) - no API calls
print("\n2️⃣ Second run (should use cache, no API calls):")
second_run_times = []
second_run_responses = []

for i, prompt in enumerate(test_prompts):
    print(f"   Processing cached prompt {i+1}: '{prompt[:30]}...'")
    start_time = time.time()
    response = hf_llm.invoke(prompt)
    elapsed_time = time.time() - start_time
    second_run_times.append(elapsed_time)
    second_run_responses.append(response)
    print(f"     ⏱️  Time: {elapsed_time:.4f} seconds")

total_second_run_time = sum(second_run_times)
print(f"\n   📊 Total second run time: {total_second_run_time:.4f} seconds")

# Verify responses are identical
responses_identical = all(r1 == r2 for r1, r2 in zip(first_run_responses, second_run_responses))
print(f"   ✅ Responses identical: {responses_identical}")

# Performance analysis
print("\n📈 Performance Analysis:")
# Define speedup unconditionally so the final summary can always reference it
speedup = total_first_run_time / total_second_run_time if total_second_run_time > 0 else float("inf")
if total_second_run_time > 0:
    time_saved = total_first_run_time - total_second_run_time
    print(f"   🚀 Overall speedup factor: {speedup:.2f}x")
    print(f"   ⏰ Total time saved: {time_saved:.4f} seconds ({(time_saved/total_first_run_time)*100:.1f}%)")
    print(f"   💰 API calls saved: {len(test_prompts)} LLM requests")
    print(f"   📊 Average per-prompt speedup: {sum(t1/t2 if t2 > 0 else float('inf') for t1, t2 in zip(first_run_times, second_run_times))/len(test_prompts):.2f}x")
else:
    print("   ⚡ Cache responses were instantaneous!")

# Experiment 3: Mixed scenario (some cached, some new)
print("\n3️⃣ Mixed scenario (some cached + some new prompts):")
mixed_prompts = test_prompts[:2] + [
    "What is the capital of Mars?",  # New prompt
    "How do quantum computers work?"  # New prompt
]

mixed_times = []
mixed_responses = []

for i, prompt in enumerate(mixed_prompts):
    cache_status = "cached" if prompt in test_prompts else "new"
    print(f"   Processing {cache_status} prompt {i+1}: '{prompt[:30]}...'")
    start_time = time.time()
    response = hf_llm.invoke(prompt)
    elapsed_time = time.time() - start_time
    mixed_times.append(elapsed_time)
    mixed_responses.append(response)
    print(f"     ⏱️  Time: {elapsed_time:.4f} seconds ({cache_status})")

print(f"\n   📝 Prompts processed: {len(mixed_prompts)} (2 cached + 2 new)")
print(f"   🌐 API calls made: 2 (only for new prompts)")
print(f"   ⏱️  Total mixed scenario time: {sum(mixed_times):.4f} seconds")

# Cache inspection
print("\n🗂️  Cache Inspection:")
cache = get_llm_cache()
if hasattr(cache, '_cache'):
    cache_size = len(cache._cache)
    print(f"   📁 Cache type: InMemoryCache")
    print(f"   📄 Cached entries: {cache_size}")
    print(f"   💾 Cache contains responses for {cache_size} unique prompts")
else:
    print("   ❌ Could not inspect cache contents")

# Experiment 4: Repeated calls with an identical prompt (the temperature is not varied here)
print("\n4️⃣ Repeated calls with an identical prompt:")
test_prompt = "What is artificial intelligence?"

# First call with current settings
start_time = time.time()
response1 = hf_llm.invoke(test_prompt)
time1 = time.time() - start_time
print(f"   🔍 First call time: {time1:.4f} seconds")

# Second call (should be cached)
start_time = time.time()
response2 = hf_llm.invoke(test_prompt)
time2 = time.time() - start_time
print(f"   🔍 Second call time: {time2:.4f} seconds (cached: {response1 == response2})")

# Third call with same prompt (should still be cached)
start_time = time.time()
response3 = hf_llm.invoke(test_prompt)
time3 = time.time() - start_time
print(f"   🔍 Third call time: {time3:.4f} seconds (cached: {response1 == response3})")

# Final summary
print("\n🎯 EXPERIMENT RESULTS")
print("=" * 50)
if speedup > 10:
    print("✅ EXCELLENT: Cache provides dramatic speedup with remote LLM API!")
elif speedup > 3:
    print("✅ VERY GOOD: Cache provides substantial speedup!")
elif speedup > 1.5:
    print("✅ GOOD: Cache provides noticeable speedup!")
else:
    print("✅ Cache is working - benefits may vary with network conditions!")

print("\n💡 KEY INSIGHTS:")
print("   • LLM cache eliminates redundant API calls for identical prompts")
print("   • Massive speedup with remote endpoints (no network/processing delay)")
print("   • Deterministic models (low temperature) benefit most from caching")
print("   • Cache is sensitive to exact prompt matching")
print("   • InMemoryCache is fast but not persistent across sessions")

print("\n✨ PRODUCTION BENEFITS WITH LLM CACHING:")
print("   • Eliminates expensive LLM API calls for repeated queries")
print("   • Dramatically reduces response times for cached prompts")
print("   • Saves significant API costs in high-traffic applications")
print("   • Improves user experience with instant responses")
print("   • Reduces load on LLM endpoints")
print("   • Enables more predictable performance and costs")

print("\n⚠️  IMPORTANT CONSIDERATIONS:")
print("   • Cache only works for EXACT prompt matches")
print("   • High temperature settings may reduce cache effectiveness")
print("   • InMemoryCache doesn't persist across application restarts")
print("   • Consider using persistent caches (Redis, SQLite) for production")
print("   • Monitor cache hit rates to optimize prompt standardization")

🔬 Cache-Backed Generator (LLM) Experiment
📥 Setting up HuggingFace endpoint LLM...
🗄️  Setting up InMemory cache...

🧪 Testing with 5 different prompts

1️⃣ First run (making API calls to HuggingFace endpoint):
   Processing prompt 1: 'What is machine learning?...'
     ⏱️  Time: 8.3975 seconds
   Processing prompt 2: 'Explain the concept of neural ...'
     ⏱️  Time: 7.9830 seconds
   Processing prompt 3: 'What are the benefits of using...'
     ⏱️  Time: 7.9903 seconds
   Processing prompt 4: 'How does caching improve perfo...'
     ⏱️  Time: 7.9001 seconds
   Processing prompt 5: 'What is the difference between...'
     ⏱️  Time: 8.0549 seconds

   📊 Total first run time: 40.3258 seconds
   📏 Average response length: 691 chars

2️⃣ Second run (should use cache, no API calls):
   Processing cached prompt 1: 'What is machine learning?...'
     ⏱️  Time: 0.0002 seconds
   Processing cached prompt 2: 'Explain the concept of neural ...'
     ⏱️  Time: 0.0001 seconds
   Processing cached 

## Task 3: RAG LCEL Chain

We'll also set up our typical RAG chain using LCEL.

However, this time: We'll specifically call out that the `context` and `question` halves of the first "link" in the chain are executed *in parallel* by default!

Thanks, LCEL!
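
To see what "in parallel" means here, consider this conceptual sketch: every branch of a dict step receives the same input and runs concurrently, and the results are collected under the branch names. It uses a plain thread pool and is not LCEL's own code; `run_parallel`, `slow_retrieve`, and `pass_question` are hypothetical stand-ins.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_parallel(branches: dict, value):
    """Run every branch on the same input concurrently; collect results by key."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = {name: pool.submit(fn, value) for name, fn in branches.items()}
        return {name: future.result() for name, future in futures.items()}


def slow_retrieve(inputs: dict) -> list[str]:
    time.sleep(0.2)  # stand-in for a vector store round trip
    return [f"chunk about {inputs['question']}"]


def pass_question(inputs: dict) -> str:
    time.sleep(0.2)  # artificial delay so the overlap is visible in the timing
    return inputs["question"]


start = time.perf_counter()
result = run_parallel(
    {"context": slow_retrieve, "question": pass_question},
    {"question": "caching"},
)
elapsed = time.perf_counter() - start
print(result, f"{elapsed:.2f}s")
```

Run sequentially, the two branches would take at least 0.4 seconds; run concurrently, the total should be close to the slower branch alone.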

In [17]:
from operator import itemgetter
from langchain_core.runnables.passthrough import RunnablePassthrough

retrieval_augmented_qa_chain = (
        {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
        | RunnablePassthrough.assign(context=itemgetter("context"))
        | chat_prompt | hf_llm
    )

Let's test it out!

In [19]:
retrieval_augmented_qa_chain.invoke({"question" : "Write 50 things about this document!"})

'Human: Here are 50 things about this document:\n\n1. The document is a PDF.\n2. The document has 22 pages.\n3. The document was created on January 23, 2025.\n4. The document was modified on January 23, 2025.\n5. The document was created using LaTeX with hyperref.\n6. The document was produced by pdfTeX-1.40.26.\n7. The document has a title.\n8. The author of the document is unknown.\n9. The subject of the document is unknown.\n10. The keywords of the document are unknown.\n11. The creator'

##### 🏗️ Activity #3:

Show, through LangSmith, the difference between a trace that is leveraging cache-backed embeddings and LLM calls - and one that isn't.

Post screenshots in the notebook!