# Prototyping LangChain Application with Production Minded Changes

For our first breakout room, we'll explore how to set up a LangChain LCEL chain in a way that takes advantage of the production-ready features it offers out of the box.

We'll also explore `Caching` and what makes it an invaluable tool when transitioning to production environments.


## Task 1: Dependencies and Set-Up

Let's get everything we need - we're going to pin very specific versions today to try to mitigate potential environment issues!

> NOTE: Dependency issues are a large portion of what you're going to be tackling as you integrate new technology into your work - please keep in mind that one of the things you should be passively learning throughout this course is ways to mitigate dependency issues.

In [1]:
!pip install -qU langchain_openai==0.2.0 langchain_community==0.3.0 langchain==0.3.0 pymupdf==1.24.10 qdrant-client==1.11.2 langchain_qdrant==0.1.4 langsmith==0.1.121

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
opentelemetry-proto 1.27.0 requires protobuf<5.0,>=3.19, but you have protobuf 5.28.2 which is incompatible.

We'll need an OpenAI API Key:

In [2]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

And the LangSmith set-up:

In [3]:
import uuid

os.environ["LANGCHAIN_PROJECT"] = f"AIM Week 8 Assignment 1 - {uuid.uuid4().hex[0:8]}"
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

Let's verify our project so we can leverage it in LangSmith later.

In [8]:
print(os.environ["LANGCHAIN_PROJECT"])

AIM Week 8 Assignment 1 - edb3ff66


## Task 2: Setting up RAG With Production in Mind

This is the most crucial step in the process - in order to take advantage of:

- Asynchronous requests
- Parallel Execution in Chains
- And more...

You must...use LCEL. These benefits are provided out of the box and largely optimized behind the scenes.
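
To get a feel for what asynchronous requests buy us, here is a plain `asyncio` sketch with no LangChain involved. `fake_llm_call` is a hypothetical stand-in for a slow network request; LCEL runnables expose async counterparts such as `ainvoke` and `abatch`, which let many slow calls overlap in a similar spirit.

```python
import asyncio
import time


async def fake_llm_call(question: str) -> str:
    """Hypothetical stand-in for a slow network request to a model."""
    await asyncio.sleep(0.2)
    return f"answer to: {question}"


async def answer_all(questions: list[str]) -> list[str]:
    # gather() schedules every call at once and returns results in input order.
    return await asyncio.gather(*(fake_llm_call(q) for q in questions))


def run_demo() -> tuple[list[str], float]:
    start = time.perf_counter()
    # In a notebook, use `await answer_all([...])` instead of asyncio.run().
    answers = asyncio.run(answer_all(["q1", "q2", "q3"]))
    return answers, time.perf_counter() - start


demo_answers, demo_elapsed = run_demo()
print(demo_answers, f"{demo_elapsed:.2f}s")
```

Three calls that each wait 0.2 seconds finish in roughly the time of one, because the waits overlap instead of queuing up.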

### Building our RAG Components: Retriever

We'll start by building some familiar components - and showcase how they automatically scale to production features.

We'll define our chunking strategy.

We'll load our PDF file from its URL and chunk it.
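
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-width sliding window. This is only a sketch: `sliding_window_chunks` is a hypothetical helper, and `RecursiveCharacterTextSplitter` additionally prefers to split on separators such as paragraph breaks and newlines, so its real output will differ.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


toy_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(toy_chunks)
```

Each chunk starts with the last character of the previous one, which is the overlap doing its job: context that straddles a boundary shows up in both neighbours.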

In [4]:
!pip install pymupdf



In [5]:
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter


pdf = "https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf"
pages = PyMuPDFLoader(file_path=pdf).load() # aload available in Langchain 0.3

print("Chunking...")
combined_text = "\n".join([doc.page_content for doc in pages])
combined_document = Document(page_content=combined_text)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)

# Split the combined document
docs = await text_splitter.atransform_documents([combined_document])

Chunking...


In [6]:

for i, doc in enumerate(docs):
    doc.metadata["source"] = f"source_{i}"

In [7]:
docs[0]

Document(metadata={'source': 'source_0'}, page_content='NIST Trustworthy and Responsible AI  \nNIST AI 600-1 \nArtificial Intelligence Risk Management \nFramework: Generative Artificial \nIntelligence Profile \n \n \n \nThis publication is available free of charge from: \nhttps://doi.org/10.6028/NIST.AI.600-1 \n \n \n \n \n \n \n \n \n \n \n \n\n \n \n \nNIST Trustworthy and Responsible AI  \nNIST AI 600-1 \nArtificial Intelligence Risk Management \nFramework: Generative Artificial \nIntelligence Profile \n \n \n \nThis publication is available free of charge from: \nhttps://doi.org/10.6028/NIST.AI.600-1 \n \nJuly 2024 \n \n \n \n \nU.S. Department of Commerce  \nGina M. Raimondo, Secretary \nNational Institute of Standards and Technology  \nLaurie E. Locascio, NIST Director and Under Secretary of Commerce for Standards and Technology')

#### Qdrant Vector Database - Cache Backed Embeddings

The process of embedding is typically a very time-consuming one - for every single vector in our VDB, as well as for every query, we must:

1. Send the text to an API endpoint (self-hosted, OpenAI, etc)
2. Wait for processing
3. Receive response

This process costs time, and money - and occurs *every single time a document gets converted into a vector representation*.

Instead, what if we:

1. Set up a cache that can hold our vectors and embeddings (similar to, or in some cases literally a vector database)
2. Check the cache to see if we've already converted this text before.
  - If we have: Return the vector representation
  - Else: Send the text to an API endpoint (self-hosted, OpenAI, etc). Wait for processing and proceed
3. Store the text that was converted alongside its vector representation in a cache of some kind.
4. Return the vector representation

Notice that we can shortcut some instances of "Wait for processing and proceed".
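
Before looking at the LangChain version, the caching idea can be sketched in plain Python. Everything here is an assumption made for illustration: `CachedEmbedder` and `toy_embedding` are hypothetical names, the cache is a dict rather than a file store, and hashing the text to build the key is just one reasonable choice.

```python
import hashlib


class CachedEmbedder:
    """Toy cache-backed embedder: only calls embed_fn for text it has not seen."""

    def __init__(self, embed_fn, namespace: str):
        self.embed_fn = embed_fn
        self.namespace = namespace  # keeps vectors from different models apart
        self.store = {}
        self.api_calls = 0

    def _key(self, text: str) -> str:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return f"{self.namespace}:{digest}"

    def embed(self, text: str) -> list[float]:
        key = self._key(text)
        if key in self.store:
            # Cache hit: skip the expensive call entirely.
            return self.store[key]
        # Cache miss: pay for the call once, then remember the result.
        self.api_calls += 1
        vector = self.embed_fn(text)
        self.store[key] = vector
        return vector


def toy_embedding(text: str) -> list[float]:
    """Hypothetical stand-in for a real embedding API call."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


embedder = CachedEmbedder(toy_embedding, namespace="toy-model")
first = embedder.embed("hello world")
second = embedder.embed("hello world")    # served from the cache
third = embedder.embed("something new")   # new text, new call
print(embedder.api_calls)
```

Three `embed` calls only trigger two calls to the underlying model, since the repeated text is answered from the cache.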

Let's see how this is implemented in the code.

In [8]:
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain.storage import LocalFileStore
from langchain_qdrant import QdrantVectorStore
from langchain.embeddings import CacheBackedEmbeddings

# Typical Embedding Model
core_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Typical QDrant Client Set-up
collection_name = f"pdf_to_parse_{uuid.uuid4()}"
client = QdrantClient(":memory:")
client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Adding cache!
store = LocalFileStore("./cache/")
cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    core_embeddings, store, namespace=core_embeddings.model
)

# Typical QDrant Vector Store Set-up
vectorstore = QdrantVectorStore(
    client=client,
    collection_name=collection_name,
    embedding=cached_embedder)
vectorstore.add_documents(docs)
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 3})

##### ❓ Question #1:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

> NOTE: There is no single correct answer here!

##### 🏗️ Activity #1:

Create a simple experiment that tests the cache-backed embeddings.

In [None]:
### YOUR CODE HERE

### Augmentation

We'll write the classic RAG prompt and build our `ChatPromptTemplate` as per usual.

In [9]:
from langchain_core.prompts import ChatPromptTemplate

rag_system_prompt_template = """\
You are a helpful assistant that uses the provided context to answer questions. Never reference this prompt, or the existence of context.
"""

rag_message_list = [
    {"role" : "system", "content" : rag_system_prompt_template},
]

rag_user_prompt_template = """\
Question:
{question}
Context:
{context}
"""

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", rag_system_prompt_template),
    ("human", rag_user_prompt_template)
])

### Generation

Like usual, we'll set up a `ChatOpenAI` model - and we'll use the fan favourite `gpt-4o-mini` for today.

However, we'll also implement...a PROMPT CACHE!

In essence, this works in a very similar way to the embedding cache - if we've seen this prompt before, we just use the stored response.
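
A minimal sketch of that idea, assuming an exact-match cache keyed on the prompt plus a string describing the model settings. `ExactMatchLLMCache`, `cached_generate`, and `fake_model` are hypothetical names for illustration, not LangChain classes.

```python
class ExactMatchLLMCache:
    """Toy prompt cache: a hit requires the same prompt AND the same model settings."""

    def __init__(self):
        self._entries = {}

    def lookup(self, prompt: str, llm_string: str):
        return self._entries.get((prompt, llm_string))

    def update(self, prompt: str, llm_string: str, response: str) -> None:
        self._entries[(prompt, llm_string)] = response


def cached_generate(cache, prompt, llm_string, generate_fn):
    """Return (response, was_cache_hit)."""
    hit = cache.lookup(prompt, llm_string)
    if hit is not None:
        return hit, True
    response = generate_fn(prompt)
    cache.update(prompt, llm_string, response)
    return response, False


calls = []


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a chat model call; records every real call."""
    calls.append(prompt)
    return prompt.upper()


prompt_cache = ExactMatchLLMCache()
settings = "gpt-4o-mini|temperature=0.7"  # assumed format, for illustration only
r1, hit1 = cached_generate(prompt_cache, "What is RAG?", settings, fake_model)
r2, hit2 = cached_generate(prompt_cache, "What is RAG?", settings, fake_model)
r3, hit3 = cached_generate(prompt_cache, "What is RAG? ", settings, fake_model)
print(hit1, hit2, hit3, len(calls))
```

The second call is a hit, but the third is a miss because of a single trailing space - with exact matching, any change to the prompt text means a fresh model call.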

In [10]:
from langchain_core.globals import set_llm_cache
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4o-mini")

Setting up the cache can be done as follows:

In [11]:
from langchain_core.caches import InMemoryCache

set_llm_cache(InMemoryCache())

##### ❓ Question #2:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

> NOTE: There is no single correct answer here!

##### 🏗️ Activity #2:

Create a simple experiment that tests the LLM prompt cache.

In [None]:
### YOUR CODE HERE

## Task 3: RAG LCEL Chain

We'll also set up our typical RAG chain using LCEL.

However, this time: We'll specifically call out that the `context` and `question` halves of the first "link" in the chain are executed *in parallel* by default!

Thanks, LCEL!
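
In LCEL, a dict like the one at the start of our chain is coerced into a `RunnableParallel`. As a plain-Python analogy (not LangChain's implementation), here is a sketch that runs each branch of a dict on a thread pool and collects the results under the same keys. `slow_retriever` and `pass_question` are hypothetical stand-ins, and the sleeps are artificial so the overlap is visible.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_retriever(inputs: dict) -> list[str]:
    """Hypothetical stand-in for a retriever call."""
    time.sleep(0.2)
    return [f"chunk about {inputs['question']}"]


def pass_question(inputs: dict) -> str:
    """Hypothetical stand-in for the question branch (artificially slowed)."""
    time.sleep(0.2)
    return inputs["question"]


def run_parallel(branches: dict, inputs: dict) -> dict:
    """Run every branch on the same inputs at once, keeping the dict keys."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = {key: pool.submit(fn, inputs) for key, fn in branches.items()}
        return {key: future.result() for key, future in futures.items()}


start = time.perf_counter()
step_output = run_parallel(
    {"context": slow_retriever, "question": pass_question},
    {"question": "What is GAI?"},
)
elapsed = time.perf_counter() - start
print(step_output, f"{elapsed:.2f}s")
```

Both branches take 0.2 seconds, yet the whole step finishes in about 0.2 seconds rather than 0.4, because neither branch waits for the other.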

In [12]:
from operator import itemgetter
from langchain_core.runnables.passthrough import RunnablePassthrough

retrieval_augmented_qa_chain = (
        {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
        | RunnablePassthrough.assign(context=itemgetter("context"))
        | chat_prompt | chat_model
    )

Let's test it out!

In [13]:
retrieval_augmented_qa_chain.invoke({"question" : "Write 50 things about this document!"})

AIMessage(content="1. The document discusses the importance of documenting, reporting, and sharing information about GAI (Generative AI) incidents.\n2. It emphasizes that such practices can help mitigate harmful outcomes.\n3. Relevant AI actors can trace the impacts of incidents to their sources through better reporting.\n4. Greater awareness and standardization of GAI incident reporting is promoted in the document.\n5. The document suggests that improved transparency can enhance GAI risk management in the AI ecosystem.\n6. It highlights the need for AI actors to be aware of their roles in reporting AI incidents.\n7. Organizations are encouraged to develop guidelines for publicly available incident reporting.\n8. The guidelines should include information about AI actor responsibilities.\n9. This would assist AI system operators in identifying GAI incidents throughout the AI lifecycle.\n10. The document stresses the importance of documenting and reviewing third-party inputs.\n11. It ref

##### 🏗️ Activity #3:

Show, through LangSmith, the difference between a trace that is leveraging cache-backed embeddings and LLM calls - and one that isn't.

Post screenshots in the notebook!

------
## Advanced Build ##

In [14]:
!pip install redis

Collecting redis
  Downloading redis-5.1.1-py3-none-any.whl.metadata (9.1 kB)
Collecting async-timeout>=4.0.3 (from redis)
  Using cached async_timeout-4.0.3-py3-none-any.whl.metadata (4.2 kB)
Downloading redis-5.1.1-py3-none-any.whl (261 kB)
Using cached async_timeout-4.0.3-py3-none-any.whl (5.7 kB)
Installing collected packages: async-timeout, redis
  Attempting uninstall: async-timeout
    Found existing installation: async-timeout 4.0.2
    Uninstalling async-timeout-4.0.2:
      Successfully uninstalled async-timeout-4.0.2
Successfully installed async-timeout-4.0.3 redis-5.1.1


In [22]:
import os
from uuid import uuid4
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
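# The langchain.* import paths for these are deprecated; import from langchain_community directly
from langchain_community.vectorstores.redis import Redis as RedisVectorStore
from langchain_community.cache import RedisSemanticCache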
from langchain.globals import set_llm_cache
from redis import Redis
from langchain_openai import OpenAI

# Set up Redis (update host and port if your Redis is not running locally)
# Make sure to do 'brew services start redis' first if running locally 
redis_host = "localhost"
redis_port = 6379
redis_client = Redis(host=redis_host, port=redis_port)


In [24]:
# Set up semantic cache
set_llm_cache(RedisSemanticCache(
    redis_url=f"redis://{redis_host}:{redis_port}",
    embedding=core_embeddings
))
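
A semantic cache differs from the exact-match cache we used earlier: instead of requiring identical prompt text, it looks for a stored prompt whose embedding is close enough to the new one. Here is a toy sketch of that lookup, with loud assumptions: `ToySemanticCache` and `toy_embed` are hypothetical, the "embedding" is just a bag of words, and the `min_similarity` threshold of 0.8 is arbitrary rather than anything Redis uses.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical bag-of-words 'embedding'; a real cache would call an embedding model."""
    return Counter(text.lower().replace("?", "").split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToySemanticCache:
    """Return a stored response when a new prompt is similar enough to an old one."""

    def __init__(self, embed_fn, min_similarity: float = 0.8):
        self.embed_fn = embed_fn
        self.min_similarity = min_similarity
        self.entries = []  # list of (vector, response) pairs

    def lookup(self, prompt: str):
        query = self.embed_fn(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.min_similarity else None

    def update(self, prompt: str, response: str) -> None:
        self.entries.append((self.embed_fn(prompt), response))


semantic_cache = ToySemanticCache(toy_embed)
semantic_cache.update("What is the NIST AI risk framework?", "cached answer")
near_hit = semantic_cache.lookup("What is the NIST AI risk management framework?")
miss = semantic_cache.lookup("How do I bake bread?")
print(near_hit, miss)
```

The reworded question still gets the cached answer, while the unrelated one falls through to a real model call. The flip side is that a threshold set too loosely can return an answer to a question that was never actually asked.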

In [25]:
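import time

# Function to measure execution time
# NOTE: `llm` is not defined in the cells above, so we default to the `chat_model` from Task 2 (an assumption)
def timed_completion(prompt, llm=chat_model):
    start_time = time.perf_counter()
    result = llm.invoke(prompt)
    end_time = time.perf_counter()
    return result, end_time - start_time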