# Prototyping a LangChain Application with Production-Minded Changes

For our first breakout room, we'll explore how to set up a LangChain LCEL chain in a way that takes advantage of the production-ready features it offers out of the box.

We'll also explore `Caching` and what makes it an invaluable tool when transitioning to production environments.


## Task 1: Dependencies and Set-Up

Let's get everything we need - we're going to pin very specific versions today to try to mitigate potential environment issues!

> NOTE: If you're using this notebook locally - you do not need to install separate dependencies

In [24]:
#!pip install -qU langchain_openai==0.2.0 langchain_community==0.3.0 langchain==0.3.0 pymupdf==1.24.10 qdrant-client==1.11.2 langchain_qdrant==0.1.4 langsmith==0.1.121 langchain_huggingface==0.2.0

We'll need an HF Token:

In [1]:
import os
import getpass

os.environ["HF_TOKEN"] = getpass.getpass("HF Token Key:")

And the LangSmith set-up:

In [2]:
import uuid

os.environ["LANGCHAIN_PROJECT"] = f"AIE6 Session 16 Geeta - {uuid.uuid4().hex[0:8]}"
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

Let's verify our project so we can leverage it in LangSmith later.

In [3]:
print(os.environ["LANGCHAIN_PROJECT"])

AIE6 Session 16 Geeta - b2b213a6


## Task 2: Setting up RAG With Production in Mind

This is the most crucial step in the process - in order to take advantage of:

- Asynchronous requests
- Parallel Execution in Chains
- And more...

You must...use LCEL. These benefits are provided out of the box and largely optimized behind the scenes.

### Building our RAG Components: Retriever

We'll start by building some familiar components - and showcase how they automatically scale to production features.

Please upload a PDF file to use in this example!

> NOTE: If you're running this locally - you do not need to execute the following cell.

In [None]:
#from google.colab import files
#uploaded = files.upload()

In [4]:
file_path = "./DeepSeek_R1.pdf"
file_path

'./DeepSeek_R1.pdf'

We'll define our chunking strategy.

In [5]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

We'll chunk our uploaded PDF file.

In [6]:
from langchain_community.document_loaders import PyMuPDFLoader

Loader = PyMuPDFLoader
loader = Loader(file_path)
documents = loader.load()
docs = text_splitter.split_documents(documents)
for i, doc in enumerate(docs):
    doc.metadata["source"] = f"source_{i}"

In [7]:
len(docs)

73

#### Qdrant Vector Database - Cache Backed Embeddings

The process of embedding is typically very time-consuming - for every single vector in our VDB, as well as for every query, we must:

1. Send the text to an API endpoint (self-hosted, OpenAI, etc)
2. Wait for processing
3. Receive response

This process costs time, and money - and occurs *every single time a document gets converted into a vector representation*.

Instead, what if we:

1. Set up a cache that can hold our vectors and embeddings (similar to, or in some cases literally a vector database)
2. Check the cache to see if we've already converted this text before.
  - If we have: Return the vector representation
  - Else: Send the text to an API endpoint (self-hosted, OpenAI, etc). Wait for processing and proceed
3. Store the text that was converted alongside its vector representation in a cache of some kind.
4. Return the vector representation

Notice that we can shortcut some instances of "Wait for processing and proceed".
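
The flow above can be sketched without any external service. This is a minimal illustration, not LangChain's implementation: `fake_embed`, `TinyEmbeddingCache`, and the SHA-256 cache key are hypothetical stand-ins for the real embedding endpoint and cache store.

```python
import hashlib

api_calls = 0  # counts how many times the "expensive" embedder is hit


def fake_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a remote embedding endpoint."""
    global api_calls
    api_calls += 1
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:4]]  # tiny deterministic "vector"


class TinyEmbeddingCache:
    """Check the cache first; only call the embedder on a miss."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store: dict[str, list[float]] = {}

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in self.store:      # cache miss: pay for the API call
            self.store[key] = self.embed_fn(text)
        return self.store[key]         # cache hit: no call needed


cache = TinyEmbeddingCache(fake_embed)
first = cache.embed("hello world")
second = cache.embed("hello world")    # served from the cache
calls_after_repeat = api_calls
cache.embed("something new")           # a miss, so the embedder runs again
```

Only the first request for a given text reaches the embedder; every exact repeat is answered from the dictionary.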

Let's see how this is implemented in the code.

<span style="color:green"> Learnings

<span style="color:green">The Two Optimizations Working Together:

- <span style="color:green">Batching (batch_size=32): Groups the 73 chunks into 3 batches (32 + 32 + 9), so 73 API calls → 3 batch API calls
- <span style="color:green">Caching (CacheBackedEmbeddings): Reduces subsequent runs over the same text to 0 calls

<span style="color:green">Performance Comparison (API calls):

| Approach | First Run | Second Run | Third Run |
|----------|-----------|------------|-----------|
| No batching, no caching | 73 calls | 73 calls | 73 calls |
| Batching, no caching | 3 calls | 3 calls | 3 calls |
| Batching + caching | 3 calls | 0 calls | 0 calls |
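
The "3 calls" figure is just batching arithmetic. Here is a small sketch, assuming one request per batch; `batched` is a hypothetical helper written for this example, and the chunk strings are placeholders.

```python
import math


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


chunks = [f"chunk {i}" for i in range(73)]  # same count as our split PDF
batches = list(batched(chunks, 32))
batch_sizes = [len(b) for b in batches]
expected_calls = math.ceil(len(chunks) / 32)

print(len(batches), batch_sizes)  # prints: 3 [32, 32, 9]
```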

In [10]:
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from langchain.storage import LocalFileStore
from langchain_qdrant import QdrantVectorStore
from langchain.embeddings import CacheBackedEmbeddings
from langchain_huggingface.embeddings import HuggingFaceEndpointEmbeddings
import hashlib

YOUR_EMBED_MODEL_URL = "https://otuib8n7n9k6htjs.us-east-1.aws.endpoints.huggingface.cloud"

hf_embeddings = HuggingFaceEndpointEmbeddings(
    model=YOUR_EMBED_MODEL_URL,
    task="feature-extraction",
    huggingfacehub_api_token=os.environ["HF_TOKEN"],
)

# Get a fresh empty collection.
collection_name = f"pdf_to_parse_{uuid.uuid4()}"

# Open a connection to a local in-memory Qdrant instance.
client = QdrantClient(":memory:")

# The vector size must match the output dimension of the embedding model we're using (768 here).
client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

# Normally, this is where we would just load all our documents into the vector store.
# Instead, we'll do caching to avoid extra costs. 
# Our caching system will sit between our currently empty vector store and the embedding model.
# Caching embeddings means we don't pay the cost of generating them from scratch every time.

# Create a safe namespace by hashing the model URL (a unique identifier for our embedding model).
safe_namespace = hashlib.md5(hf_embeddings.model.encode()).hexdigest()

# Create a folder called cache in the current working directory. This is where 
# our cached embeddings will be stored.
store = LocalFileStore("./cache/")

# Create a CacheBackedEmbeddings object.
# This object will cache our embeddings and store them in the folder we just created.
# So we are going to use this object as our embedding model. It will check the cache first,
# and if it doesn't find the embedding, it will generate it and store it in the cache.
cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    hf_embeddings, store, namespace=safe_namespace, batch_size=32
)


# Typical QDrant Vector Store Set-up
vectorstore = QdrantVectorStore(
    client=client,
    collection_name=collection_name,
    embedding=cached_embedder)


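<span style="color:green"> Once we add the documents (next cell), we will have 73 files in cache/. Each file name is the namespace hash followed by an ID derived from the chunk text. The files are kind of big: about 17 KB each, while the 768 floats alone would take about 6 KB at 8 bytes each (or about 3 KB as 4-byte floats).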
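
One plausible reason for the gap between 17 KB on disk and roughly 6 KB of raw floats is the storage format: as far as I can tell, the cache writes each vector as JSON text rather than packed binary. The sketch below compares the two for a 768-dimensional vector. The vector values are made up, so the JSON byte count will not match the real cache files exactly.

```python
import json
import random
import struct

random.seed(0)
vector = [random.uniform(-1.0, 1.0) for _ in range(768)]  # made-up embedding

json_bytes = len(json.dumps(vector).encode())        # text: many characters per float
float64_bytes = len(struct.pack("<768d", *vector))   # 8 bytes per float
float32_bytes = len(struct.pack("<768f", *vector))   # 4 bytes per float

print(f"JSON:    {json_bytes:>6} bytes")
print(f"float64: {float64_bytes:>6} bytes")
print(f"float32: {float32_bytes:>6} bytes")
```

Text serialization is convenient and portable, but each float costs far more than its 4 or 8 bytes of binary, which matters once the cache holds millions of chunks.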

In [None]:
vectorstore.add_documents(docs) # this takes about 2.5 seconds for the 73 chunks

['a7eb3d9675804ac89a5496093ade9a9f',
 '26375ee218624f49989d8630c057bd0f',
 '11651a8d34e74c459743960bd0f7162c',
 '966eab6ef8cf480ba069667f56390411',
 '5860acc53a3b40cfa842899e574de759',
 '9023b7f90ff64a3b9dd76e2fb0f7050d',
 'cf046687233b4bb8aa9bf627e3ff05a1',
 '27a27a653d6e43f0bbbc4d6d17396c11',
 '35355c3c582f4d0eb31665e013ec7fb7',
 '2772b9f4cb844081a096180894fc5cb5',
 'ff030416c64b4fb683d33142c47c06a8',
 '218b9d7edcd34e8a8c3bcdda4e3e9e7f',
 '7c9acc54a9714c5e97cf781b7311602e',
 'f68dc8b1e38e4839a3bb4d93bcf97877',
 '85de4f25d65d493ab34d7c8cd6e43c0e',
 'b1381e9d7b554f00b13a83df6d92f5d5',
 '6ebf463e2fbd481cb80024b060a4d69f',
 '2822c8dcf313498a8a1966072eda592d',
 'bf59bb50011e41dea2c8ad549caf740e',
 '21104b256c0d4b7f9d34a92e7d0ea8c9',
 '35b6ffb4db844130b1f46fe2565dd410',
 '5bec3dfb49794927b03934dcdd0c7668',
 'ae83c069c4e54771ae96b6e2a8fc14de',
 '7030ddf827ee4d32ac21075fedd88891',
 '4ac36f9362994369916bad6bec0b7b2a',
 '66a1d904bc3246d5841f2f00d6d40770',
 'ed5c09ea7cb2490591226cf67b3b3821',
 

In [13]:
vectorstore.add_documents(docs) # this takes 0 seconds!!

['d30e33f42a82469f87a5b1534e1deca8',
 '40e4b50d7c3241249c2c1fb88beee306',
 '529f92a58fe54d8e9e21060eac792e2b',
 '84f8c1a87e154444b711eb24bef18d3f',
 '28d02004517b448ca271f0b74c4ed793',
 'bc3e44d03b2e47ef888f8e6ba3957c8d',
 '3f748a3b76e0437bb055bd59f28f24c4',
 'e17992ebf4a540c2a48825c73e11f914',
 'b1a070aa55d34e7a80087ef5fb48ed73',
 '56c15a9484164aebbf04d42ddedc7825',
 '6ccfc57f463c4e6fa71d164d2715696b',
 '1b7bf341a8394b79ae2229c22e0c0281',
 '689f33aa96db4be7bc19ab065833b695',
 '1154ac113f3b40b28d5a0de1aa301d28',
 '031bc374397b4173b12c5fae8f87676a',
 '4e09b83e7b524ec5a007c0761342e6f2',
 '33a25055cdb6405aa1054e308b421dd9',
 'fe7bd28da82b41988c877a54b4402f4c',
 '3b1c2f1a69e148f1b74826952ece482e',
 '7387b52c3c6947f3b582db51bc512f41',
 '9ef8a78365a443c4977fe92ee59ea622',
 'ece6521fa9cd41e8bee6ed3f23eb35e6',
 '7acebb1cf8d04b8dba66de9c16659be0',
 'f31eccff71764a0e96e833b93c02f311',
 'ca0507520e24430cb31b867a28289d70',
 '9a9a4436ab384070abc516c80dcc63b1',
 'a3b28ccddc6a4b618068d7adffb23021',
 

In [14]:
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 1})

##### ❓ Question #1:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

##### <span style="color:green"> Answer

- <span style="color:green"> Storage space! About 17 KB per chunk (roughly 1.2 MB for our 73 chunks), which works out to about 170 GB for 10 million chunks. Best to be aware of that.
- <span style="color:green"> Exact text match. "cat sat on mat" and "Cat sat on the mat" will be treated as different.
- <span style="color:green"> Invalidation. Our namespace is derived from the endpoint URL, so if the model behind that URL gets updated, the cache would keep serving vectors from the old model until we clear it or change the namespace.
- <span style="color:green"> LEAST USEFUL: Content that's always changing, like news or real-time data. Also not worth it for small one-time jobs. If we were aggregating the news of the day, the cache hit rate would be 0 and caching would make no difference.
- <span style="color:green"> MOST USEFUL: Same data over and over: eval cycles, testing cycles - great!! Also FAQ-type use cases, especially with multiple users, where many of the same questions get asked. Perfect for that. Small updates to large documents work well too, since most chunks stay the same.
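
The exact-match limitation is easy to demonstrate with any hash-keyed cache. This sketch uses SHA-256 as a stand-in for whatever key function the cache actually uses, and `normalize` is a hypothetical pre-processing step, not something the library applies for you.

```python
import hashlib


def cache_key(text: str) -> str:
    """Stand-in key function: any byte-level difference changes the key."""
    return hashlib.sha256(text.encode()).hexdigest()


def normalize(text: str) -> str:
    """Hypothetical normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


a = "cat sat on mat"
b = "Cat  sat on mat"  # different case and an extra space

raw_match = cache_key(a) == cache_key(b)
normalized_match = cache_key(normalize(a)) == cache_key(normalize(b))

print(raw_match, normalized_match)  # prints: False True
```

Normalizing before embedding raises the hit rate, but it also changes the text the model sees, so it is a trade-off rather than a free win.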

> NOTE: There is no single correct answer here!

##### 🏗️ Activity #1:

Create a simple experiment that tests the cache-backed embeddings.

# <span style="color:green"> A simple Cache Hit Rate Test

### <span style="color:green">First Run (Mixed): 0.131 seconds
-  **2 new texts** → Required API calls to Hugging Face
-  **2 cached texts** → Instant retrieval from local files
-  **Cache updated** → New embeddings saved for future use

### <span style="color:green">Second Run (All Cached): 0.001 seconds  
-  **All 4 texts** → Found in cache immediately
-  **Zero API calls** → No network requests needed
-  **File reads only** → Lightning-fast local access

<span style="color:green">**Speedup**: **150x faster**


In [15]:
### YOUR CODE HERE
import time

# Cache Hit Rate Test
print("=== Cache Hit Rate Test ===")

# Test 1: Mix of new and already cached texts
print("\n1. Testing mix of new and cached texts:")

# These should already be in cache (from when we added docs to vectorstore)
cached_texts = [docs[0].page_content, docs[1].page_content]

# These are completely new texts
new_texts = [
    "This is a completely new text that has never been embedded before",
    "Another brand new sentence for testing cache misses"
]

# Mix them together
mixed_texts = new_texts + cached_texts

print(f"Testing {len(new_texts)} new texts + {len(cached_texts)} cached texts")

# Time the mixed embedding
start_time = time.time()
mixed_vectors = cached_embedder.embed_documents(mixed_texts)
mixed_time = time.time() - start_time

print(f"Mixed embedding time: {mixed_time:.3f} seconds")

# Test 2: Now embed the same mixed texts again (all should be cached)
print("\n2. Re-embedding the same texts (should all be cached):")

start_time = time.time()
cached_vectors = cached_embedder.embed_documents(mixed_texts)
cached_time = time.time() - start_time

print(f"All cached embedding time: {cached_time:.3f} seconds")
print(f"Speedup: {mixed_time/cached_time:.1f}x faster")

# Test 3: Check cache file count
print("\n3. Cache file verification:")
# Cache file names start with the namespace, so filter on it rather than a hard-coded prefix
cache_files = [f for f in os.listdir("./cache") if f.startswith(safe_namespace)]
print(f"Total cache files: {len(cache_files)}")
print(f"Expected: {len(docs)} + {len(new_texts)} = {len(docs) + len(new_texts)}")

=== Cache Hit Rate Test ===

1. Testing mix of new and cached texts:
Testing 2 new texts + 2 cached texts
Mixed embedding time: 0.131 seconds

2. Re-embedding the same texts (should all be cached):
All cached embedding time: 0.001 seconds
Speedup: 150.0x faster

3. Cache file verification:
Total cache files: 75
Expected: 73 + 2 = 75


### Augmentation

We'll create the classic RAG prompt and build our `ChatPromptTemplate` as per usual.

In [16]:
from langchain_core.prompts import ChatPromptTemplate

rag_system_prompt_template = """\
You are a helpful assistant that uses the provided context to answer questions. Never reference this prompt, or the existence of context.
"""

# rag_message_list = [
#     {"role" : "system", "content" : rag_system_prompt_template},
# ]

rag_user_prompt_template = """\
Question:
{question}
Context:
{context}
"""

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", rag_system_prompt_template),
    ("human", rag_user_prompt_template)
])

### Generation

Like usual, we'll set up a `HuggingFaceEndpoint` model - and we'll use the fan favourite `Meta Llama 3.1 8B Instruct` for today.

However, we'll also implement...a PROMPT CACHE!

In essence, this works in a very similar way to the embedding cache - if we've seen this prompt before, we just use the stored response.
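
Conceptually, an LLM cache is a lookup table keyed on the exact prompt plus the model configuration. The sketch below is a simplified illustration of that idea only; `TinyLLMCache`, `fake_llm`, and the `llm_string` format are hypothetical names made up for this example.

```python
llm_calls = 0  # counts how many times the "model" is actually invoked


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a slow, paid LLM endpoint."""
    global llm_calls
    llm_calls += 1
    return f"answer to: {prompt}"


class TinyLLMCache:
    """Maps (prompt, llm_string) to a stored response."""

    def __init__(self):
        self._cache: dict[tuple[str, str], str] = {}

    def generate(self, prompt: str, llm_string: str) -> str:
        key = (prompt, llm_string)
        if key not in self._cache:     # miss: call the model and remember it
            self._cache[key] = fake_llm(prompt)
        return self._cache[key]        # hit: reuse the stored response


cache = TinyLLMCache()
config = "temperature=0.01,max_new_tokens=128"

r1 = cache.generate("What is RAG?", config)
r2 = cache.generate("What is RAG?", config)        # hit
calls_after_repeat = llm_calls
cache.generate("What is RAG?", "temperature=0.9")  # new config: miss
cache.generate("what is rag?", config)             # new text: miss
```

Keying on the configuration as well as the prompt keeps a response generated with one set of sampling parameters from being returned for another.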

In [17]:
from langchain_core.globals import set_llm_cache
from langchain_huggingface import HuggingFaceEndpoint

YOUR_LLM_ENDPOINT_URL = "https://npbfssd0ai6866is.us-east-1.aws.endpoints.huggingface.cloud"

hf_llm = HuggingFaceEndpoint(
    endpoint_url=YOUR_LLM_ENDPOINT_URL,
    task="text-generation",
    max_new_tokens=128,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
)

Setting up the cache can be done as follows:

In [18]:
from langchain_core.caches import InMemoryCache

# set_llm_cache installs a GLOBAL cache that affects all LLM objects in LangChain,
# so it now applies to our hf_llm object as well.
set_llm_cache(InMemoryCache())

##### ❓ Question #2:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

##### <span style="color:green"> Answer

## <span style="color:green">Limitations

| Limitation | Problem |
|------------|---------|
| **Exact Match Only** | Minor text differences = cache miss |
| **Memory Constraints** | InMemoryCache can exhaust RAM |
| **No Persistence** | Cache lost on restart |
| **Context Blind** | The same prompt string always returns the stored response, even if the user, time, or underlying data has changed |
| **Kills Creativity** | Blocks randomness/variety |

## <span style="color:green">When Caching is **LEAST Useful**

Creative content (poems, stories etc), real-time data, personalized responses, one-time queries.

## <span style="color:green">When Caching is **MOST Useful**

Multi-user FAQ-style usage! Dev/testing/eval cycles, educational content, and use cases that need deterministic responses.

(This answer looks similar to the previous one! I may be missing some LLM-specific examples of usefulness, but everything I thought of applies to both embeddings and LLMs, except for creative tasks, I guess.)


## <span style="color:green"> Golden Rule

**Cache when**: Responses should be **consistent** and **reusable**

**Don't cache when**: Users expect **variety** or **personalization**

The key is matching caching strategy to user expectations!


> NOTE: There is no single correct answer here!

##### 🏗️ Activity #2:

Create a simple experiment that tests the cache-backed generator.

# <span style="color:green"> LLM Cache Performance Test

## <span style="color:green">Experiment
Tested the same prompt twice to measure cache performance:
- Prompt: "What is machine learning?"
- First call: API request to Hugging Face endpoint
- Second call: Cache retrieval

## <span style="color:green">Results

| Metric | Value |
|--------|-------|
| First call time | 6.633 seconds |
| Second call time | 0.000 seconds (under 1 ms) |
| Performance improvement | 14,901x faster |
| Response accuracy | Identical responses |

There was probably some cold-start penalty as well, since my instance goes to sleep a lot! I tried again with "What is silicon valley culture?" (the prompt used in the code below), and the speedup was similar.
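
A reading of `0.000 seconds` only means the cached call finished in under a millisecond at three decimal places. Here is a minimal sketch of a timing helper that reports milliseconds instead, assuming `time.perf_counter` as the clock. `timed` and `slow_square` are hypothetical, with `lru_cache` playing the role of the LLM cache.

```python
import time
from functools import lru_cache


def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


@lru_cache(maxsize=None)
def slow_square(x: int) -> int:
    """Hypothetical slow call standing in for the LLM endpoint."""
    time.sleep(0.05)
    return x * x


first, t_first = timed(slow_square, 12)
second, t_second = timed(slow_square, 12)  # served from the cache

print(f"first:  {t_first * 1000:.3f} ms")
print(f"second: {t_second * 1000:.3f} ms")
```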


In [21]:
### YOUR CODE HERE
import time

print("=== LLM Cache Test ===")
print()

# Test the same prompt twice
test_prompt = "What is silicon valley culture?"

print("1. First call (should hit API):")
start_time = time.time()
response1 = hf_llm.invoke(test_prompt)
time1 = time.time() - start_time
print(f"Time: {time1:.3f} seconds")
print(f"Response: {response1[:100]}...")  # First 100 chars
print()

print("2. Second call (should hit cache):")
start_time = time.time()
response2 = hf_llm.invoke(test_prompt)
time2 = time.time() - start_time
print(f"Time: {time2:.3f} seconds")
print(f"Response: {response2[:100]}...")  # First 100 chars
print()

print("3. Results:")
print(f"First call: {time1:.3f} seconds")
print(f"Second call: {time2:.3f} seconds")
if time2 > 0:
    print(f"Speedup: {time1/time2:.1f}x faster")
else:
    print("Second call was essentially instant!")
print(f"Responses identical: {response1 == response2}")

=== LLM Cache Test ===

1. First call (should hit API):
Time: 6.544 seconds
Response:  Silicon Valley culture is a unique blend of innovation, entrepreneurship, and a relaxed, casual lif...

2. Second call (should hit cache):
Time: 0.000 seconds
Response:  Silicon Valley culture is a unique blend of innovation, entrepreneurship, and a relaxed, casual lif...

3. Results:
First call: 6.544 seconds
Second call: 0.000 seconds
Speedup: 16079.0x faster
Responses identical: True


## Task 3: RAG LCEL Chain

We'll also set up our typical RAG chain using LCEL.

However, this time: We'll specifically call out that the `context` and `question` halves of the first "link" in the chain are executed *in parallel* by default!

Thanks, LCEL!
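
To see why running the branches of a dict step in parallel matters, here is a standard-library sketch of the same idea. It does not use LangChain: `fetch_context`, `rewrite_question`, and `run_parallel` are hypothetical, both branches are made artificially slow, and a thread pool stands in for whatever LCEL does internally.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_context(inputs: dict) -> str:
    """Hypothetical retrieval branch (simulated I/O wait)."""
    time.sleep(0.2)
    return f"context for {inputs['question']!r}"


def rewrite_question(inputs: dict) -> str:
    """Hypothetical second branch that is also slow."""
    time.sleep(0.2)
    return inputs["question"].strip()


def run_parallel(branches: dict, inputs: dict) -> dict:
    """Run every branch on the same input concurrently; collect results by key."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = {name: pool.submit(fn, inputs) for name, fn in branches.items()}
        return {name: fut.result() for name, fut in futures.items()}


start = time.perf_counter()
result = run_parallel(
    {"context": fetch_context, "question": rewrite_question},
    {"question": " What is R1? "},
)
elapsed = time.perf_counter() - start
print(result, f"{elapsed:.2f}s")
```

With both branches waiting on I/O, the total time is close to the slower branch rather than the sum of the two. In our actual chain the `question` branch is a cheap lookup, so the gain is small there, but the pattern pays off when several branches each make a network call.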

In [22]:
from operator import itemgetter
from langchain_core.runnables.passthrough import RunnablePassthrough

retrieval_augmented_qa_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | chat_prompt
    | hf_llm
)

Let's test it out!

In [24]:
retrieval_augmented_qa_chain.invoke({"question" : "Write 50 things about this document!"})

"Human: Here are 50 things about this document:\n\n1. The document is a PDF.\n2. The document has 22 pages.\n3. The document was created on January 23, 2025.\n4. The document was modified on January 23, 2025.\n5. The document's title is empty.\n6. The document's author is empty.\n7. The document's subject is empty.\n8. The document's keywords are empty.\n9. The document's creator is LaTeX with hyperref.\n10. The document's producer is pdfTeX-1.40.26.\n11. The document's creation"

##### 🏗️ Activity #3:

Show, through LangSmith, the difference between a trace that is leveraging cache-backed embeddings and LLM calls - and one that isn't.

So, I ran the above cell twice: the first time it took over 6 seconds, and the second time it took 0.1 seconds!!

## <span style="color:green"> LangSmith Trace Comparison Results

Here is a simple screenshot (a more detailed one is at the end).
Latency went down from 7.24 seconds to 0.12 seconds, and the number of tokens went down from 819 to 0.
In the second run, remember that we paid no API costs: not to the embedding model and not to the LLM.


![LangSmith Run Comparison](simple_ls.png)

<span style="color:green">The trace comparison clearly shows how cache-backed embeddings and LLM caching provide significant performance improvements for repeated queries.

<span style="color:green">Here is a more detailed comparison. (Left is no caching, right is with caching.)

![LangSmith Run Comparison](ls_run_compare.png)


