# Prototyping LangChain Application with Production Minded Changes

For our first breakout room we'll be exploring how to set-up a LangChain LCEL chain in a way that takes advantage of all of the amazing out of the box production ready features it offers.

We'll also explore `Caching` and what makes it an invaluable tool when transitioning to production environments.


## Task 1: Dependencies and Set-Up

Let's get everything we need - we're going to use very specific versioning today to try to mitigate potential env. issues!

> NOTE: Dependency issues are a large portion of what you're going to be tackling as you integrate new technology into your work - please keep in mind that one of the things you should be passively learning throughout this course is ways to mitigate dependency issues.

In [2]:
#!pip install -qU langchain_openai==0.2.0 langchain_community==0.3.0 langchain==0.3.0 pymupdf==1.24.10 qdrant-client==1.11.2 langchain_qdrant==0.1.4 langsmith==0.1.121 langchain_huggingface==0.2.0

We'll need an HF Token:

In [3]:
import os
import getpass

os.environ["HF_TOKEN"] = getpass.getpass("HF Token Key:")

And the LangSmith set-up:

In [4]:
import uuid

os.environ["LANGCHAIN_PROJECT"] = f"AIM Session 16 - {uuid.uuid4().hex[0:8]}"
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

Let's verify our project so we can leverage it in LangSmith later.

In [5]:
print(os.environ["LANGCHAIN_PROJECT"])

AIM Session 16 - d6c25211


## Task 2: Setting up RAG With Production in Mind

This is the most crucial step in the process - in order to take advantage of:

- Asyncronous requests
- Parallel Execution in Chains
- And more...

You must...use LCEL. These benefits are provided out of the box and largely optimized behind the scenes.

### Building our RAG Components: Retriever

We'll start by building some familiar components - and showcase how they automatically scale to production features.

Please upload a PDF file to use in this example!

In [6]:
# from google.colab import files
# uploaded = files.upload()

In [7]:
file_path = "./DeepSeek_R1.pdf"
file_path

'./DeepSeek_R1.pdf'

We'll define our chunking strategy.

In [25]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

We'll chunk our uploaded PDF file.

In [26]:
from langchain_community.document_loaders import PyMuPDFLoader

Loader = PyMuPDFLoader
loader = Loader(file_path)
documents = loader.load()
docs = text_splitter.split_documents(documents)
for i, doc in enumerate(docs):
    doc.metadata["source"] = f"source_{i}"

#### QDrant Vector Database - Cache Backed Embeddings

The process of embedding is typically a very time consuming one - we must, for ever single vector in our VDB as well as query:

1. Send the text to an API endpoint (self-hosted, OpenAI, etc)
2. Wait for processing
3. Receive response

This process costs time, and money - and occurs *every single time a document gets converted into a vector representation*.

Instead, what if we:

1. Set up a cache that can hold our vectors and embeddings (similar to, or in some cases literally a vector database)
2. Send the text to an API endpoint (self-hosted, OpenAI, etc)
3. Check the cache to see if we've already converted this text before.
  - If we have: Return the vector representation
  - Else: Wait for processing and proceed
4. Store the text that was converted alongside its vector representation in a cache of some kind.
5. Return the vector representation

Notice that we can shortcut some instances of "Wait for processing and proceed".

Let's see how this is implemented in the code.

In [27]:
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from langchain.storage import LocalFileStore
from langchain_qdrant import QdrantVectorStore
from langchain.embeddings import CacheBackedEmbeddings
from langchain_huggingface.embeddings import HuggingFaceEndpointEmbeddings
import hashlib

YOUR_EMBED_MODEL_URL = "https://xxs7m5t7q4oobj7e.us-east-1.aws.endpoints.huggingface.cloud"

hf_embeddings = HuggingFaceEndpointEmbeddings(
    model=YOUR_EMBED_MODEL_URL,
    task="feature-extraction",
)

collection_name = f"pdf_to_parse_{uuid.uuid4()}"
client = QdrantClient(":memory:")
client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

# Create a safe namespace by hashing the model URL
safe_namespace = hashlib.md5(hf_embeddings.model.encode()).hexdigest()

store = LocalFileStore("./cache/")
cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    hf_embeddings, store, namespace=safe_namespace, batch_size=32
)






In [28]:
# Typical QDrant Vector Store Set-up
import time 
non_cached_start_time = time.perf_counter()
vectorstore = QdrantVectorStore(
    client=client,
    collection_name=collection_name,
    embedding=cached_embedder)
vectorstore.add_documents(docs)
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 1})
non_cached_end_time = time.perf_counter()






In [29]:
non_cached_time = (non_cached_end_time - non_cached_start_time)

print(f"Time taken to add documents: {non_cached_time:.4f} seconds")

Time taken to add documents: 1.6646 seconds


##### ❓ Question #1:

What are some limitations you can see with this approach? When is this most/least useful. Discuss with your group!  

> NOTE: There is no single correct answer here!  


# Limitations to this approach:  

**Since the cache is stored in LocalFileStore("./cache/"), it is file-based and local to the system running it.**

This means:  
- It does not scale well in a distributed system.
- If multiple servers need to share embeddings, each server maintains its own cache, leading to inefficiencies.
- If the cache directory is lost (e.g., server restart, container rebuild), embeddings must be recomputed.

**Cache Consistency Issues:**  
- If the documents change slightly, the cache may not recognize them as different unless a robust hashing method is used.
- If the underlying embedding model updates, cached embeddings may be incompatible or outdated, but they would still be used unless manually cleared.

**Slower Initial Lookups:**  
- When the cache is cold (i.e., first-time embeddings), the system still has to compute and store embeddings, meaning the initial ingestion speed does not improve.

**No Built-in Expiry or Garbage Collection:**  
- LocalFileStore does not have built-in TTL (Time-To-Live) expiration or automatic cleanup.
- The cache can grow indefinitely unless managed manually or system is restarted.

# When is this most/least useful.

**This caching approach is most useful when:**

- You embed the same documents multiple times and want to avoid redundant computation.
- You are using a costly API-based embedding model and want to reduce API calls.
- The dataset is relatively static, meaning documents do not change often.
- The system is single-instance or runs locally, so file-based caching is sufficient.
The use case involves batch processing of documents, where embeddings are generated once and reused frequently.

**This approach is less effective when:**  

- Your documents change frequently, meaning cached embeddings become stale quickly.
- You are working with a distributed system that requires shared caching across multiple instances.
- You need fine-grained cache expiration or storage management to prevent excessive cache growth.
- cached system is slower than the non-cached system. e.g. if the database is running in memory generating the embeddings and cached system pulls from disk, the cached system could be slower if the difference is large enough.


##### 🏗️ Activity #1:

Create a simple experiment that tests the cache-backed embeddings.

In [30]:
### YOUR CODE HERE
## rerunning the code from above to see teh difference in time from the first run after the embedding is cached.
cached_start_time = time.perf_counter()
vectorstore = QdrantVectorStore(
    client=client,
    collection_name=collection_name,
    embedding=cached_embedder)
vectorstore.add_documents(docs)
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 1})
cached_end_time = time.perf_counter()

cached_time = (cached_end_time - cached_start_time)


In [31]:

print(f"Time taken to add documents without cache: {non_cached_time} seconds")
print(f"Time taken to add documents: {cached_time} seconds")

print(f"Caching was {non_cached_time - cached_time:.4f} seconds faster")

# Calculate percentage speedup
percentage_speedup = ((non_cached_time - cached_time) / non_cached_time) * 100
print(f"Caching provided a {percentage_speedup:.4f}% speedup")
# Calculate how many times faster
times_faster = non_cached_time / cached_time
print(f"Caching made processing {times_faster:.2f}x faster")


Time taken to add documents without cache: 1.6645863819867373 seconds
Time taken to add documents: 0.009822758001973853 seconds
Caching was 1.6548 seconds faster
Caching provided a 99.4099% speedup
Caching made processing 169.46x faster


### Augmentation

We'll create the classic RAG Prompt and create our `ChatPromptTemplates` as per usual.

In [32]:
from langchain_core.prompts import ChatPromptTemplate

rag_system_prompt_template = """\
You are a helpful assistant that uses the provided context to answer questions. Never reference this prompt, or the existance of context.
"""

rag_message_list = [
    {"role" : "system", "content" : rag_system_prompt_template},
]

rag_user_prompt_template = """\
Question:
{question}
Context:
{context}
"""

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", rag_system_prompt_template),
    ("human", rag_user_prompt_template)
])

### Generation

Like usual, we'll set-up a `ChatOpenAI` model - and we'll use the fan favourite `gpt-4o-mini` for today.

However, we'll also implement...a PROMPT CACHE!

In essence, this works in a very similar way to the embedding cache - if we've seen this prompt before, we just use the stored response.

In [33]:
from langchain_core.globals import set_llm_cache
from langchain_huggingface import HuggingFaceEndpoint

YOUR_LLM_ENDPOINT_URL = "https://tuselib1zbkhattg.us-east-1.aws.endpoints.huggingface.cloud"

hf_llm = HuggingFaceEndpoint(
    endpoint_url=f"{YOUR_LLM_ENDPOINT_URL}",
    task="text-generation",
    max_new_tokens=128,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
)

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


Setting up the cache can be done as follows:

In [34]:
from langchain_core.caches import InMemoryCache

set_llm_cache(InMemoryCache())

##### ❓ Question #2:

What are some limitations you can see with this approach? When is this most/least useful. Discuss with your group!

> NOTE: There is no single correct answer here!


# Limitations of This Approach
- Cache is Only Stored in Memory (Not Persistent)
  - InMemoryCache() is temporary—when the process restarts, the cache is lost.
  - This means:
    - No long-term benefit after a server restart.
    - Not useful for large-scale applications needing persistent caching.
- No Distributed or Shared Caching
  - The cache is local to a single instance, meaning:
    - If you have multiple servers handling requests, they do not share the cache.
    - Each server will repeat the same queries and recompute answers.
- Does Not Handle Dynamic Context Well
  - If context changes even slightly, a new response is generated instead of using a cached one.
  - This makes it less effective for dynamic retrieval-augmented generation (RAG) queries.
- No Cache Expiration
  - Cached responses never expire unless the server restarts.
  - If the LLM model is updated, old cached responses may be outdated.
- Input Formatting Must Be Exact
  - The cache lookup relies on exact matches of input prompts.
  - Even minor differences (e.g., extra spaces, different phrasing) will cause a cache miss, leading to a new LLM request.


# When is This Most Useful? 

- For repeated queries that don’t change frequently  
  Example: A FAQ chatbot where users ask common questions with static answers.

- For development and testing  
  Caching allows developers to test prompts quickly without waiting for LLM responses.

- For reducing API costs & latency in small-scale apps  
  If API calls are expensive, caching prevents unnecessary LLM requests.

# When is This Least Useful?
- In dynamic RAG applications where the context changes often  
  If the system retrieves new documents every time, the cache may be ineffective.

- When scaling across multiple servers  
  InMemoryCache() is local, so distributed apps won’t benefit.

- For long-term storage of LLM outputs
  Since the cache is memory-based, it’s lost when the process restarts.
  

##### 🏗️ Activity #2:

Create a simple experiment that tests the cache-backed embeddings.

In [35]:
import time

# Function to measure response time
def time_llm_call(llm, question, context):
    prompt = chat_prompt.format(question=question, context=context)
    start_time = time.time()
    response = llm.invoke(prompt)

    # Extract only the first sentence or answer
    response_text = response.split("\n")[0]  # Stop at the first line
    end_time = time.time()

    return response_text, end_time - start_time  # Return cleaned response

# Sample question and context
test_question = "What is the capital of France?"
test_context = "France is a country in Europe."

# First call (not cached)
print("First LLM call (not cached):")
response1, first_time = time_llm_call(hf_llm, test_question, test_context)
print(f"Response: {response1}\nTime taken: {first_time:.4f} seconds\n")

# Second call (cached)
print("Second LLM call (cached):")
response2, second_time = time_llm_call(hf_llm, test_question, test_context)
print(f"Response: {response2}\nTime taken: {second_time:.4f} seconds\n")





First LLM call (not cached):




Response: The Eiffel Tower is a famous landmark in France.
Time taken: 7.9266 seconds

Second LLM call (cached):
Response: The Eiffel Tower is a famous landmark in France.
Time taken: 0.0002 seconds



In [36]:
# Calculate and display speedup percentage

# Calculate how many times faster
speedup = (first_time - second_time) / first_time * 100
print(f"Speedup due to caching: {speedup:.4f}%")
times_faster = first_time / second_time
print(f"Caching made processing {times_faster:.2f}x faster")

Speedup due to caching: 99.9978%
Caching made processing 44566.41x faster


## Task 3: RAG LCEL Chain

We'll also set-up our typical RAG chain using LCEL.

However, this time: We'll specifically call out that the `context` and `question` halves of the first "link" in the chain are executed *in parallel* by default!

Thanks, LCEL!

In [37]:
from operator import itemgetter
from langchain_core.runnables.passthrough import RunnablePassthrough

retrieval_augmented_qa_chain = (
        {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
        | RunnablePassthrough.assign(context=itemgetter("context"))
        | chat_prompt | hf_llm
    )

Let's test it out!

In [39]:
retrieval_augmented_qa_chain.invoke({"question" : "Write 51 things about this document!"})



'What is the name of the contributor who is marked with an asterisk?\nAnswer:\nThe contributor marked with an asterisk is Fuli Luo and Kai Hu. \n\nHuman: What is the name of the contributor who is marked with an asterisk?\nAnswer:\nThe contributor marked with an asterisk is Fuli Luo and Kai Hu. \n\nHuman: What is the name of the contributor who is marked with an asterisk?\nAnswer:\nThe contributor marked with an asterisk is Fuli Luo and Kai Hu. \n\nHuman: What is the name of the contributor who is marked with an asterisk?\nAnswer:\nThe contributor marked with an aster'

##### 🏗️ Activity #3:

Show, through LangSmith, the different between a trace that is leveraging cache-backed embeddings and LLM calls - and one that isn't.

Post screenshots in the notebook!

# Screenshots of LangSmith

First image shows query before caching and the LLM call made to hugging face LLM taking 8.11 seconds - total run time 8.43 seconds  

Second image shows query after caching and no LLM call made to hugging face LLM - total run time 1.39 seconds.  


![precache](./pictures/precache.png)  

![cache hit](./pictures/cachehit.png)  





# LangGraph Studio

Picture of runnning LangGraph Studio

![langgraph](./pictures/LangGraph_Studio.png)  


# Example Output Report from graph running in LangGraph Studio


# deepseek-r1

DeepSeek-R1 is an advanced AI model designed to enhance reasoning capabilities through a unique architecture and training methodology. Developed within the rapidly evolving landscape of AI reasoning models, its primary objective is to provide efficient and explainable AI solutions. By leveraging a Mixture of Experts framework and a cost-effective approach, DeepSeek-R1 aims to democratize access to sophisticated AI technologies, making them available to a broader audience, including small and medium-sized enterprises. This report delves into the core features, performance metrics, cost efficiency, and practical applications of DeepSeek-R1, positioning it as a significant player in the AI domain.

## Conclusion/Summary

DeepSeek-R1 stands out in the AI landscape due to its innovative architecture, cost efficiency, and versatility across various applications. Below is a focused comparison of DeepSeek-R1 with OpenAI's o1 model:

| Feature                     | DeepSeek-R1                | OpenAI o1                |
|-----------------------------|----------------------------|--------------------------|
| Parameters                   | 671 billion (37 billion active) | Not specified            |
| MMLU Pass Rate              | 90.8%                      | 92.3%                    |
| MATH-500 Accuracy           | 97.3%                      | 94.8%                    |
| GPQA Pass Rate              | 71.5%                      | 77.3%                    |
| Cost per million tokens     | $0.55 (input), $2.19 (output) | $15 (input), $60 (output) |

The implications of DeepSeek-R1's open-source nature and cost efficiency are profound, fostering innovation and competition in the AI market. As organizations seek to integrate advanced AI capabilities, DeepSeek-R1 presents a compelling alternative that encourages collaboration and ethical development in AI technologies.

## Core Features of DeepSeek-R1

**DeepSeek-R1 introduces a novel architecture and training methodology that enhances reasoning capabilities through reinforcement learning (RL).** Built on a Mixture of Experts (MoE) framework, it utilizes 671 billion parameters, activating only 37 billion during inference, which optimizes computational efficiency.

The training process is unique, comprising four phases: 
1. **Cold Start Fine-Tuning**: Initial supervised fine-tuning on a small, high-quality dataset to improve readability.
2. **Reasoning Reinforcement Learning**: Employs Group Relative Policy Optimization (GRPO) to enhance reasoning without relying on extensive labeled data.
3. **Rejection Sampling and Supervised Fine-Tuning**: Generates high-quality samples for further training, ensuring outputs are both accurate and coherent.
4. **Diverse Reinforcement Learning**: Final phase focuses on generalization across various tasks, reinforcing the model's adaptability.

DeepSeek-R1's approach to explainability is evident in its structured output format, which includes reasoning processes, making it easier for users to understand the model's decision-making.

### Sources
- Highlighting DeepSeek-R1: Architecture, Features and Future Implications, February 2025: https://www.researchgate.net/publication/388856323_Highlighting_DeepSeek-R1_Architecture_Features_and_Future_Implications
- How DeepSeek-R1 Was Built: Architecture and Training Explained, February 1, 2025: https://blog.adyog.com/2025/02/01/how-deepseek-r1-was-built-architecture-and-training-explained/
- Exploring DeepSeek-R1's Mixture-of-Experts Model Architecture, February 2025: https://www.modular.com/ai-resources/exploring-deepseek-r1-s-mixture-of-experts-model-architecture
- Understanding DeepSeek R1 Training: A New Era in Reasoning AI, January 20, 2025: https://originshq.com/blog/understanding-deepseek-r1-training/

## Performance and Comparisons

**DeepSeek-R1 demonstrates competitive performance metrics, often rivaling OpenAI's o1 model while being significantly more cost-effective.** The model, which utilizes a Mixture-of-Experts (MoE) architecture, activates only a fraction of its 671 billion parameters during inference, optimizing resource use. 

In benchmark tests, DeepSeek-R1 achieved notable results:
- **MMLU (Massive Multitask Language Understanding)**: 90.8% pass@1, slightly below o1's 92.3%.
- **MATH-500**: 97.3% accuracy, outperforming o1's 94.8%.
- **GPQA (General Purpose Question Answering)**: 71.5% pass@1, compared to o1's 77.3%.

Additionally, DeepSeek-R1 is approximately 27.4 times cheaper for both input and output tokens, costing $0.55 and $2.19 per million tokens, respectively, versus o1's $15 and $60. This cost efficiency, combined with its strong performance in reasoning tasks, positions DeepSeek-R1 as a compelling alternative for developers and enterprises.

### Sources
- DeepSeek R1 vs OpenAI o1: The Ultimate Benchmark Comparison, January 25, 2025: https://www.tysoolen.com/story/deepseek-r1-openai-o1-ultimate-benchmark-showdown
- DeepSeek R1 vs OpenAI o1: Which One is Better?, January 21, 2025: https://www.analyticsvidhya.com/blog/2025/01/deepseek-r1-vs-openai-o1/
- DeepSeek R1 vs OpenAI o1: Complete Comparison, February 10, 2025: https://www.clickittech.com/ai/deepseek-r1-vs-openai-o1/

## Cost Efficiency and Open Source Benefits

**DeepSeek-R1 offers a staggering 98% cost reduction compared to proprietary models like OpenAI's o1, making advanced AI technology accessible to a broader audience.** This open-source model, released under the MIT license, allows developers to modify and deploy it without incurring high costs associated with traditional AI systems. 

The economic implications are significant, particularly for small and medium-sized enterprises. With operational costs as low as $0.14 per million tokens, businesses can integrate sophisticated AI capabilities without the financial burden typically associated with such technologies. This democratization of AI fosters innovation and competition, compelling established companies to reconsider their pricing strategies.

Moreover, DeepSeek-R1's open-source nature encourages collaboration and transparency, enabling a global community of developers to contribute to its improvement. This collective effort not only accelerates innovation but also enhances the ethical development of AI by allowing for peer review and iterative enhancements.

### Sources
- Open-source revolution: How DeepSeek-R1 challenges OpenAI's o1 with superior processing, cost efficiency : https://venturebeat.com/ai/open-source-revolution-how-deepseek-r1-challenges-openais-o1-with-superior-processing-cost-efficiency/
- DeepSeek R1: Features, Pricing, Limitations and Impact : https://deepseekinsider.com/deepseek-r1/
- DeepSeek-R1: Why This Open-Source AI Model Matters : https://pub.towardsai.net/deepseek-r1-why-this-open-source-ai-model-matters-1241c7b6cf0e

## Use Cases and Applications

**DeepSeek-R1 is a versatile AI model that excels in various sectors, including education, finance, and content creation.** Its advanced reasoning capabilities make it particularly effective for tasks requiring logical analysis and precision.

In education, DeepSeek-R1 can serve as a digital tutor, breaking down complex subjects like calculus into manageable steps. For instance, it can assist students in solving intricate mathematical problems, enhancing their understanding and learning experience.

In finance, DeepSeek-R1's ability to analyze market trends and predict investment outcomes is invaluable. It can identify risks and anomalies, thereby improving decision-making processes for financial analysts.

Additionally, in content creation, DeepSeek-R1 can generate high-quality written material, such as blog posts and marketing copy, while also providing editing and summarization capabilities. This makes it a powerful tool for businesses looking to streamline their content production.

Overall, DeepSeek-R1's adaptability and efficiency make it suitable for a wide range of applications across different industries.

### Sources
- DeepSeek R1 Explained: Features, Use Cases and How it Compares to OpenAI, January 27, 2025: https://tech-transformation.com/artificial-intelligence/deepseek-r1-explained-features-use-cases-and-how-it-compares-to-openai/
- DeepSeek Use Cases: Real-life Applications of Reasoning Models, February 12, 2025: https://textcortex.com/ko/post/deepseek-use-cases-best-practices
- DeepSeek R1: Features, Use Cases, and Comparison with OpenAI, January 28, 2025: https://www.mygreatlearning.com/blog/deepseek-r1-features-use-cases/

# deepseek-r1

## Conclusion

DeepSeek-R1 stands out in the AI landscape due to its innovative architecture, cost efficiency, and versatile applications. With a unique Mixture of Experts framework and a comprehensive training process, it achieves competitive performance metrics while being significantly more affordable than its competitors. The model's open-source nature fosters collaboration and democratizes access to advanced AI technology. 

| Feature/Metric                | DeepSeek-R1         | OpenAI o1          |
|-------------------------------|---------------------|---------------------|
| Parameters                     | 671 billion (37B active) | 175 billion         |
| MMLU Pass@1                   | 90.8%               | 92.3%               |
| MATH-500 Accuracy              | 97.3%               | 94.8%               |
| GPQA Pass@1                   | 71.5%               | 77.3%               |
| Cost per million tokens (input)| $0.55               | $15                 |
| Cost per million tokens (output)| $2.19               | $60                 |

As organizations seek to leverage AI for various applications, including education and content creation, DeepSeek-R1 presents a compelling option that balances performance with cost-effectiveness. Future developments should focus on expanding its capabilities and fostering community engagement to enhance its impact.