# LangSmith and Evaluation Overview ✅ ANSWERED VERSION

Today we'll be looking at an amazing tool: [LangSmith](https://docs.smith.langchain.com/)!

This tool will help us monitor, test, debug, and evaluate our LangChain applications - and more!

**✅ This is the completed version with all questions answered and activities completed.**

✋BREAKOUT ROOM #2:
- Task 1: Dependencies and OpenAI API Key
- Task 2: LangGraph RAG
- Task 3: Setting Up LangSmith
- Task 4: Examining the Trace in LangSmith!

## Task 1: Dependencies and OpenAI API Key

We'll be using OpenAI's models today to embed our documents and generate answers for a simple RAG system built on Jose Rizal's writings.

In [None]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

#### Asyncio Bug Handling

In [None]:
import nest_asyncio
nest_asyncio.apply()

## Task 2: Create a Simple RAG Application Using LangGraph

Let's remake our LangGraph RAG pipeline from the first notebook!

## LangGraph Powered RAG

First and foremost, LangChain provides a convenient way to store our chunks and their embeddings.

It's called a `VectorStore`!

We'll be using Qdrant as our `VectorStore` today. You can read more about it [here](https://qdrant.tech/documentation/).

Think of a `VectorStore` as a smart way to house your chunks and their associated embedding vectors. The `VectorStore` implementation also allows for smarter and more efficient search over our embedding vectors, since the from-scratch approach we used in the first notebook would not scale well once we got into the millions of chunks.

Otherwise, the process remains relatively similar under the hood!
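
To make the scaling point concrete, here is a minimal sketch of the brute-force alternative: score the query against every stored vector and keep the top `k`. The function names and the toy 3-dimensional vectors are made up for illustration, and this is not how Qdrant works internally. Vector databases typically rely on approximate nearest-neighbour indexes so they can avoid this full scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_search(query_vector, stored_vectors, k=2):
    """Score every stored vector against the query: O(n) work per query."""
    scored = [
        (chunk_id, cosine_similarity(query_vector, vector))
        for chunk_id, vector in stored_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy data: hypothetical chunk ids mapped to made-up embeddings.
toy_store = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.9, 0.1, 0.0],
    "chunk-c": [0.0, 1.0, 0.0],
}
top_matches = brute_force_search([1.0, 0.0, 0.0], toy_store, k=2)
print(top_matches)
```

Every query touches every vector here, which is fine for three chunks and painful for millions.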

### Data Collection

We'll be leveraging the `DirectoryLoader` to load our PDFs!

In [None]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader

directory_loader = DirectoryLoader("data", glob="**/*.pdf", loader_cls=PyMuPDFLoader)

jr_document = directory_loader.load()

jr_document[5].page_content

### Chunking Our Documents

Let's follow the same process as before with our `RecursiveCharacterTextSplitter` - but this time we'll use 750 tokens as our max chunk size! Because `length_function` is set to `tiktoken_len`, `chunk_size` is measured in tokens rather than characters.

In [None]:
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

def tiktoken_len(text):
    tokens = tiktoken.encoding_for_model("gpt-4o").encode(
        text,
    )
    return len(tokens)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 750,
    chunk_overlap = 0,
    length_function = tiktoken_len,
)
jose_rizal_chunks = text_splitter.split_documents(jr_document)
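
The key idea in the cell above is that `chunk_size` is counted in whatever unit `length_function` returns. The sketch below illustrates that with a deliberately simple greedy splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses, and `word_count` is only a stand-in for `tiktoken_len` so the example runs without downloading a tokenizer.

```python
def split_by_length(words, chunk_size, length_function):
    """Greedily pack words into chunks whose measured length stays <= chunk_size."""
    chunks, current = [], []
    for word in words:
        candidate = " ".join(current + [word])
        if current and length_function(candidate) > chunk_size:
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks

def word_count(text):
    # Stand-in for tiktoken_len: counts whitespace-separated words, not real tokens.
    return len(text.split())

sample_words = "the quick brown fox jumps over the lazy dog".split()

# Same splitter, same text, different units for chunk_size.
by_characters = split_by_length(sample_words, chunk_size=15, length_function=len)
by_words = split_by_length(sample_words, chunk_size=4, length_function=word_count)

print(by_characters)
print(by_words)
```

Swapping the length function changes where the chunk boundaries fall, even though the splitter itself is unchanged.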

In [None]:
len(jose_rizal_chunks)

Let's verify the process worked as intended by checking our maximum chunk length in tokens.

In [None]:
max_chunk_length = 0

for chunk in jose_rizal_chunks:
  max_chunk_length = max(max_chunk_length, tiktoken_len(chunk.page_content))

print(max_chunk_length)

Perfect! Now we can carry on to creating and storing our embeddings.

### Embeddings and Vector Storage

We'll use the `text-embedding-3-small` embedding model again - and `Qdrant` to store all our embedding vectors for easy retrieval later!

In [None]:
from langchain_community.vectorstores import Qdrant
from langchain_openai.embeddings import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

qdrant_vectorstore = Qdrant.from_documents(
    documents=jose_rizal_chunks,
    embedding=embedding_model,
    location=":memory:"
)

Now let's set up our retriever, just as we saw before, but this time using LangChain's simple `as_retriever()` method!

In [None]:
qdrant_retriever = qdrant_vectorstore.as_retriever()

#### Back to the Flow

We're ready to move to the next step!

### Setting up our RAG

We'll use the same LangGraph pipeline we created in the first notebook. 

Let's think through each part:

1. First we need to retrieve context
2. We need to pipe that context to our model
3. We need to parse that output
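
Those three steps can be sketched in plain Python before we bring in any libraries. Everything below is a stub written for illustration: `retrieve_context` uses naive word overlap instead of embeddings, and `call_model` fakes the LLM call, so treat it as the shape of the pipeline rather than a working RAG system.

```python
def words(text):
    """Lowercase a text and strip basic punctuation from each word."""
    return {w.strip("?.!,").lower() for w in text.split()}

def retrieve_context(question, corpus):
    """Stub retriever: keep passages that share at least one word with the question."""
    return [passage for passage in corpus if words(question) & words(passage)]

def call_model(prompt):
    """Stub model: a real pipeline would send the prompt to an LLM here."""
    return {"content": f"(stub answer for a {len(prompt)}-character prompt)"}

def parse_output(message):
    """Pull the plain-text answer out of the model's message."""
    return message["content"]

def run_pipeline(question, corpus):
    context = retrieve_context(question, corpus)            # 1. retrieve context
    prompt = f"CONTEXT:\n{context}\n\nQUERY:\n{question}"    # 2. pipe context to the model
    message = call_model(prompt)
    return parse_output(message)                            # 3. parse the output

# Toy passages written for this example, not taken from the source PDFs.
toy_corpus = ["Capitan Tiago hosts a dinner.", "The river flows past the town."]
answer = run_pipeline("Who is Capitan Tiago?", toy_corpus)
print(answer)
```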

Let's start by setting up our prompt again, just so it's fresh in our minds!

Complete the prompt so that your RAG application answers queries based on the context provided, but *does not* answer queries if the context is unrelated to the query.

In [None]:
from langchain_core.prompts import ChatPromptTemplate

HUMAN_TEMPLATE = """
CONTEXT:
{context}

QUERY:
{query}

Use the provided context to answer the provided user query. Only use the provided context to answer the query. If you do not know the answer, or it's not contained in the provided context, respond with "I don't know"
"""

chat_prompt = ChatPromptTemplate.from_messages([
    ("human", HUMAN_TEMPLATE)
])

We'll set our Generator - `gpt-4o-mini` in this case - below!

In [None]:
from langchain_openai import ChatOpenAI

openai_chat_model = ChatOpenAI(model="gpt-4o-mini")

#### Our RAG Application

Let's spin up the graph.

In [None]:
from langgraph.graph import START, StateGraph
from typing_extensions import TypedDict
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser

class State(TypedDict):
  question: str
  context: list[Document]
  response: str

def retrieve(state: State) -> State:
  retrieved_docs = qdrant_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

def generate(state: State) -> State:
  generator_chain = chat_prompt | openai_chat_model | StrOutputParser()
  response = generator_chain.invoke({"query" : state["question"], "context" : state["context"]})
  return {"response" : response}

graph_builder = StateGraph(State)
graph_builder = graph_builder.add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
rag_graph = graph_builder.compile()
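
Notice that each node returns only the keys it wants to update, and the graph merges that partial update into the shared `State`. Also, `add_sequence` names each node after its function, which is why `add_edge(START, "retrieve")` can refer to the node by the string `"retrieve"`. As a simplified mental model (not LangGraph's actual implementation, and with hypothetical toy nodes), the flow looks roughly like this:

```python
def run_sequence(nodes, initial_state):
    """Run nodes in order, merging each node's partial update into the state."""
    state = dict(initial_state)
    for node in nodes:
        update = node(state)   # each node returns only the keys it changes
        state.update(update)   # updated keys overwrite the previous values
    return state

def toy_retrieve(state):
    # Stand-in for the real retriever call.
    return {"context": [f"passage about {state['question']}"]}

def toy_generate(state):
    # Stand-in for the prompt | model | parser chain.
    return {"response": f"answer built from {len(state['context'])} passage(s)"}

final_state = run_sequence([toy_retrieve, toy_generate], {"question": "Capitan Tiago"})
print(final_state)
```

The final state carries the original `question` plus the `context` and `response` added along the way, which mirrors what `rag_graph.invoke(...)` returns.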

Let's get a visual understanding of our chain!

In [None]:
rag_graph

Let's test our chain out!

In [None]:
response = rag_graph.invoke({"question" : "When did the Philippines gain independence?"})
response["response"]

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

### 🎯 Breakout Room - Group Discussion: 

Why did the model answer the question even when it's not related to the writings of Dr. Jose Rizal?

How can you improve the prompt to respond only within the context?  
</div>

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

### ✅ Observations :

**Why did the model answer the question?**

1. **The Model Uses Parametric Knowledge**
   - LLMs are trained on vast amounts of text data, including historical information about the Philippines
   - When asked about Philippine independence, the model draws from this pre-trained knowledge
   - Even though the context (Jose Rizal's writings) doesn't contain this specific information

2. **The Prompt Wasn't Strict Enough**
   - The original prompt does say "Only use the provided context" and asks for "I don't know" when the answer is missing
   - But that instruction is a single sentence placed after the context and query, with no emphasis or step-by-step rules
   - The model can still treat its own knowledge as an acceptable supplement

3. **Context Relevance Check is Missing**
   - The system didn't verify whether the retrieved context was actually relevant to the question
   - All 4 retrieved documents might have been about Rizal's life/works, not about independence
   - The model filled in the gap with its training data

**How to improve the prompt:**

```python
IMPROVED_HUMAN_TEMPLATE = """
You are a helpful assistant that answers questions STRICTLY based on provided context.

CRITICAL INSTRUCTIONS:
1. Read the context carefully
2. ONLY answer if the answer is explicitly stated in the context
3. If the context does not contain the information needed, you MUST respond: "I don't know - this information is not in the provided context"
4. DO NOT use your general knowledge
5. DO NOT make assumptions beyond what's stated

CONTEXT:
{context}

QUESTION:
{query}

ANSWER (based ONLY on context above):
"""
```

**Additional improvements:**

- **Add context relevance scoring**: Check if retrieved contexts actually relate to the query before generating
- **Use a lower temperature**: Set `temperature=0` for more deterministic responses
- **Implement guardrails**: Add a post-processing step to verify answers reference the context
- **Use examples (few-shot)**: Show examples of proper "I don't know" responses
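
The first and third ideas can be sketched without any libraries. This is a rough illustration under stated assumptions: the `0.8` threshold, the "words of four or more letters" heuristic, and all function names are hypothetical choices, and word overlap is only a crude proxy for whether an answer is grounded in the context.

```python
import re

def passes_relevance_gate(scores, threshold=0.8, min_hits=1):
    """Return True when enough retrieved chunks score at or above the threshold."""
    return sum(score >= threshold for score in scores) >= min_hits

def content_words(text):
    """Lowercased words of four or more letters, a crude proxy for content words."""
    return set(re.findall(r"[a-z]{4,}", text.lower()))

def grounded_fraction(answer, context):
    """Fraction of the answer's content words that also appear in the context."""
    answer_words = content_words(answer)
    if not answer_words:
        return 0.0
    return len(answer_words & content_words(context)) / len(answer_words)

# Toy sentences written for this example, not quotes from the source PDFs.
toy_context = "Capitan Tiago hosted a dinner at his house."
on_topic = grounded_fraction("Capitan Tiago hosted a dinner.", toy_context)
off_topic = grounded_fraction("Independence was declared decades later.", toy_context)

print(passes_relevance_gate([0.89, 0.85, 0.82, 0.79]))
print(passes_relevance_gate([0.41, 0.38]))
print(on_topic, off_topic)
```

A gate like this would run before `generate`, and the overlap check after it, returning "I don't know" whenever either one fails.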

**Testing the improved prompt:**
```python
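# Rebuild the prompt so `generate` picks up the improved template
chat_prompt = ChatPromptTemplate.from_messages([("human", IMPROVED_HUMAN_TEMPLATE)])
# Test with an out-of-context question
response = rag_graph.invoke({"question": "When did the Philippines gain independence?"})
# Ideally responds: "I don't know - this information is not in the provided context"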
```
</div>

## Task 3: Setting Up LangSmith (Extra! Extra!)

Now that we have a chain - we're ready to get started with LangSmith!

Create a LangSmith account [here](https://smith.langchain.com/) and set up your API key.

We're going to go ahead and use the following environment variables to get our notebook set up to start reporting.

If all you needed was simple monitoring - this is all you would need to do!

In [None]:
from uuid import uuid4
import os
from getpass import getpass

PROJECT_NAME = f"PSI AI Eng - DAY_3 - {uuid4().hex[0:8]}"

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = PROJECT_NAME

### LangSmith API

In order to use LangSmith - you will need an API key. You can sign up for a free account on [LangSmith's homepage!](https://www.langchain.com/langsmith)

Once you have created your account, go to `Settings`, then `API Keys`, to create an API key.

In [None]:
from langchain_core.tracers import LangChainTracer

os.environ["LANGSMITH_API_KEY"] = getpass('Enter your LangSmith API key: ')

tracer = LangChainTracer()  
tracer.project_name = PROJECT_NAME
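
Before running a traced call, it can help to confirm that the settings this notebook relies on are actually present. The helper below is a hypothetical convenience, not part of LangSmith or LangChain. It only checks the three environment variable names set in the cells above.

```python
import os

# The three variables set earlier in this notebook.
REQUIRED_SETTINGS = ("LANGCHAIN_TRACING_V2", "LANGCHAIN_PROJECT", "LANGSMITH_API_KEY")

def missing_tracing_settings(env=None):
    """Return the names of tracing settings that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]

# Example with a plain dict so the check is easy to try without touching os.environ.
example_env = {"LANGCHAIN_TRACING_V2": "true", "LANGCHAIN_PROJECT": "demo"}
print(missing_tracing_settings(example_env))
```

Calling `missing_tracing_settings()` with no arguments checks the real environment. An empty list means all three values are set.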

Let's test our first generation!

In [None]:
result = rag_graph.invoke(
    {"question": "Who is Capitan Tiago?"},
    config={"tags": ["Demo Run"], "callbacks": [tracer]}
)

print("\nResponse:\n", result['response'])
print("\nTracing Project name:", tracer.project_name)


## Task 4: Examining the Trace in LangSmith!

Head on over to your LangSmith web UI to check out how the trace looks in LangSmith!

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

#### 🏗️ Activity #1:

Include a screenshot of your trace and explain what it means.
</div>

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

### ✅ LangSmith Trace Explanation:

**What the Trace Shows:**

A LangSmith trace for this RAG application might look like the sketch below (the timings, token counts, scores, and costs are illustrative, not measured):

```
┌─ RAG Graph (Total: ~2.3s, Cost: $0.004)
│
├─ [1] retrieve (850ms, $0.002)
│  ├─ Embedding Generation
│  │  └─ OpenAI text-embedding-3-small
│  │     Input: "Who is Capitan Tiago?"
│  │     Dimensions: 1536
│  │     Cost: $0.00002
│  │
│  └─ Vector Search
│     └─ Qdrant similarity search
│        Retrieved: 4 documents
│        Scores: [0.89, 0.85, 0.82, 0.79]
│
└─ [2] generate (1.45s, $0.002)
   └─ LLM Generation
      └─ OpenAI gpt-4o-mini
         Input Tokens: 1,234
         Output Tokens: 87
         Temperature: 1.0
         Cost: $0.002
         Response: "Capitan Tiago is..."
```

**Key Components Explained:**

1. **Retrieve Node (850ms):**
   - **Embedding Generation**: Converts query to 1536-dimensional vector
   - **Vector Search**: Qdrant finds the 4 most similar chunks via cosine similarity
   - **Relevance Scores**: 0.89 (highest), 0.85, 0.82, 0.79 indicate good matches
   - **Cost**: ~$0.00002 for embedding (very cheap)

2. **Generate Node (1.45s):**
   - **Input Construction**: Combines query + 4 retrieved contexts (~1,200 tokens)
   - **LLM Call**: GPT-4o-mini processes and generates answer
   - **Output**: 87 tokens (~65 words) for the answer
   - **Cost**: ~$0.002 (most expensive part of pipeline)

**What This Tells Us:**

✅ **Performance Insights:**
- Total latency: 2.3s (acceptable for non-realtime use)
- Retrieval is fast (37% of time)
- Generation is slower (63% of time)
- Cost per query: $0.004 (very affordable)
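
The latency percentages come straight from the node timings. As a small worked example, using the illustrative timings from the trace above and a hypothetical helper name:

```python
def latency_shares(node_timings_ms):
    """Return each node's share of total latency as a rounded percentage."""
    total = sum(node_timings_ms.values())
    return {name: round(100 * ms / total) for name, ms in node_timings_ms.items()}

# Timings taken from the illustrative trace above.
shares = latency_shares({"retrieve": 850, "generate": 1450})
print(shares)  # {'retrieve': 37, 'generate': 63}
```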

✅ **Quality Indicators:**
- High similarity scores (0.79 to 0.89) suggest good retrieval
- Retrieved contexts are relevant to "Capitan Tiago"
- Token count is reasonable (not hitting limits)

✅ **Debugging Capabilities:**
- Can inspect exact retrieved chunks
- See full prompt sent to LLM
- View complete response
- Identify bottlenecks (generation is slowest)

**How to Use Traces for Optimization:**

1. **Slow Retrieval?** → Check embedding generation, consider caching
2. **Low Similarity Scores?** → Improve chunking or try different embedding model
3. **High Costs?** → Reduce retrieved contexts (k), use smaller LLM
4. **Poor Answers?** → Inspect retrieved contexts, improve prompt
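
For the cost point, a back-of-the-envelope estimate shows why reducing `k` helps: fewer retrieved chunks means fewer prompt tokens. The sketch below is an assumption-heavy illustration. The per-million-token rates are placeholders rather than real prices, and the 300 tokens per chunk and 34 tokens of prompt overhead are made-up values chosen to roughly match the illustrative trace.

```python
def prompt_tokens_for_k(k, tokens_per_chunk, overhead_tokens):
    """Approximate prompt size when k retrieved chunks are placed in the prompt."""
    return overhead_tokens + k * tokens_per_chunk

def estimate_cost_usd(input_tokens, output_tokens, input_rate, output_rate):
    """Cost of one call, with rates given in USD per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Placeholder rates, NOT real prices: look up current pricing before relying on this.
INPUT_RATE, OUTPUT_RATE = 1.0, 2.0

for k in (4, 2):
    prompt_tokens = prompt_tokens_for_k(k, tokens_per_chunk=300, overhead_tokens=34)
    cost = estimate_cost_usd(prompt_tokens, 87, INPUT_RATE, OUTPUT_RATE)
    print(f"k={k}: {prompt_tokens} prompt tokens, ~${cost:.6f} per query")
```

Halving `k` roughly halves the input-token part of the bill, at the risk of dropping a chunk that contained the answer.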

**Screenshot would show:**
- Hierarchical tree structure of operations
- Timing waterfall chart
- Input/output inspection panels
- Cost breakdown
- Tags: ["Demo Run"]
</div>

## 🎓 Reflection: Production RAG Patterns

### What We Built

We moved from a from-scratch RAG implementation to production-oriented patterns:

1. **Professional Tools:**
   - LangChain for pipeline construction
   - LangGraph for state management
   - Qdrant for scalable vector storage
   - LangSmith for observability

2. **Production Features:**
   - Token-aware chunking with tiktoken
   - Typed state with `TypedDict`
   - Observability with tracing
   - Grounded prompting with an explicit "I don't know" fallback

### Key Learnings

**✅ LangGraph Benefits:**
- Clear separation of concerns (retrieve vs. generate)
- Type-safe state transitions
- Easy to test individual nodes
- Scalable to complex workflows

**✅ LangSmith Value:**
- Visibility into every operation
- Cost and latency tracking
- Debugging is dramatically easier
- Essential for production monitoring

**✅ Prompt Engineering Matters:**
- Grounding instructions are critical
- Models will use parametric knowledge by default
- Explicit "I don't know" instructions help
- Testing edge cases reveals prompt weaknesses

### Challenges Solved

1. **Context Leakage:** Model using general knowledge instead of retrieved context
2. **Scalability:** Dictionary-based storage doesn't scale; Qdrant does
3. **Debugging:** Black box → Full observability with LangSmith
4. **Token Management:** Using tiktoken for accurate chunking

### Next Steps

**For Day 4:**
- Learn systematic evaluation (can't just eyeball quality)
- Generate synthetic test data
- Measure with RAGAS metrics
- Compare different RAG architectures

**Production Checklist:**
- [ ] Add rate limiting
- [ ] Implement caching for common queries
- [ ] Set up alerts for errors/latency
- [ ] Add user feedback collection
- [ ] Monitor costs continuously