# Lesson 9.4: LangSmith for Evaluation and Monitoring

---

In previous lessons, we built LLM applications with LangChain and LangGraph and learned why evaluation and monitoring matter. This lesson takes an in-depth look at **LangSmith**, the platform built by the LangChain team, which provides tools to develop, debug, test, and monitor your LLM applications.

## 1. In-depth Introduction to LangSmith

**LangSmith** is a SaaS (Software as a Service) platform specifically designed to support the entire development lifecycle of LLM-powered applications. It addresses the inherent challenges of working with LLMs, such as stochasticity, difficulty in debugging complex chains, and lack of observability.

* **What is LangSmith?**
    LangSmith provides an intuitive interface and APIs to log, visualize, and analyze "traces" of your LLM application runs. Each trace is a detailed record of every step your application takes, from LLM calls to tool usage, RAG retrieval, and custom logic.
* **Why is LangSmith necessary?**
    * **Complex Debugging:** LLM applications often have multiple interacting components (LLMs, Prompts, Chains, Agents, Tools, Retrievers). LangSmith helps you "see" what's happening internally, pinpointing exactly where errors or unexpected behaviors occur.
    * **Understanding Agent Behavior:** LangSmith is especially useful for Agents and LangGraph graphs, where control flow can be complex and iterative. It visualizes ReAct loops, Agent decisions, and tool results.
    * **Systematic Evaluation:** Provides tools to create evaluation datasets, run automated and manual tests, and compare performance across versions.
    * **Production Monitoring:** Tracks key metrics like latency, token cost, error rate, and output quality over time.
    * **Iterative Improvement:** By providing detailed insights, LangSmith helps you make data-driven decisions to refine prompts, select better models, or optimize architecture.




---

## 2. Setting up and Configuring LangSmith for Your Project

Integrating LangSmith into your LangChain project is straightforward.

1.  **Install the library:**
    ```bash
    pip install langsmith
    ```
2.  **Sign up for a LangSmith account:**
    Visit [smith.langchain.com](https://smith.langchain.com) and sign up for an account. Then create an API key in the settings page; this is the value to use for `LANGCHAIN_API_KEY`.
3.  **Set environment variables:**
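    For LangChain to send traces to LangSmith automatically, set the following environment variables. The first two are required; `LANGCHAIN_PROJECT` is optional, and traces are grouped under a project named `default` when it is not set: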
    ```bash
    export LANGCHAIN_TRACING_V2="true"
    export LANGCHAIN_API_KEY="<your-langsmith-api-key>"
    export LANGCHAIN_PROJECT="<your-project-name>" # Example: "My First LLM App"
    ```
    Or within your Python code (before initializing any LangChain components):
    ```python
    import os
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_API_KEY"] = "YOUR_LANGSMITH_API_KEY"
    os.environ["LANGCHAIN_PROJECT"] = "My First LLM App"
    ```
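    * `LANGCHAIN_TRACING_V2="true"`: Enables tracing. Recent SDK releases also accept `LANGSMITH_`-prefixed equivalents such as `LANGSMITH_TRACING` and `LANGSMITH_API_KEY`; check the documentation for your installed version.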
    * `LANGCHAIN_API_KEY`: Your API key from LangSmith.
    * `LANGCHAIN_PROJECT`: The name of the project under which your traces will be grouped in LangSmith. This is very useful for organizing different experiments and applications.
4.  **Initialize LangSmith Client (Optional but useful):**
    If you want to interact directly with the LangSmith API (e.g., to create datasets, run custom evaluations), you can initialize a `Client`:
    ```python
    from langsmith import Client
    langsmith_client = Client()
    ```
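
Before running anything, it can help to confirm that the variables from step 3 are actually visible to the current process. The following is a minimal sketch that uses only the standard library. The helper names `missing_langsmith_env` and `tracing_enabled` are hypothetical and not part of the LangSmith SDK, and the check is deliberately strict: it treats only the string `true` as "enabled", matching the value used in the setup above.

```python
import os

# Variable names taken from the setup steps above.
REQUIRED_VARS = ("LANGCHAIN_TRACING_V2", "LANGCHAIN_API_KEY")


def missing_langsmith_env(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def tracing_enabled(env=None):
    """True only when the tracing flag is the string "true" (case-insensitive).

    Assumption: this sketch is stricter than the libraries may be; it simply
    mirrors the value recommended in the setup steps.
    """
    env = os.environ if env is None else env
    return env.get("LANGCHAIN_TRACING_V2", "").strip().lower() == "true"


if __name__ == "__main__":
    missing = missing_langsmith_env()
    if missing:
        print("Missing LangSmith settings:", ", ".join(missing))
    elif not tracing_enabled():
        print("LANGCHAIN_TRACING_V2 is set but not 'true'.")
    else:
        print("LangSmith tracing is configured for project:",
              os.environ.get("LANGCHAIN_PROJECT", "default"))
```

Running this at the top of a notebook or script gives a quick signal when a key was exported in a different shell session than the one running your code.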




---

## 3. Tracing and Logging Chains and Agents

Once LangSmith is configured, every invocation of a LangChain `Runnable` (LLM calls, Chains, Agents, Tools, Retrievers) is logged automatically and sent to your LangSmith project.

* **Trace View:** This is the core feature. It shows your application's execution flow as a nested tree of steps. Each node represents one step (a run), and you can click on it to see details:
    * **Inputs/Outputs:** The input and output data for that step.
    * **Logs:** Any logs generated during execution.
    * **Errors:** Any errors or exceptions that occurred.
    * **Duration:** The execution time of the step.
    * **Tokens/Cost:** The number of tokens used and estimated cost (for LLM calls).
* **Run Details:** Every step is recorded as a "Run," and a trace is the tree of runs under one top-level Run. You can view a list of Runs in your project and filter by time, status, or type.
* **Agent Scratchpad:** For Agents, LangSmith clearly displays the `agent_scratchpad`, allowing you to follow the Agent's thought, action, and observation sequence through each iterative step. This is incredibly useful for debugging complex ReAct loops.
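
To make the structure of a trace concrete, here is a small standard-library sketch that models a trace as a tree of runs. The `Run` dataclass and its field names are illustrative assumptions that mirror the items listed above; they are not LangSmith's actual schema, and the sample values are invented.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Run:
    """One step in a trace (illustrative fields, not the LangSmith schema)."""
    name: str
    run_type: str                      # e.g. "chain", "llm", "tool", "retriever"
    inputs: Dict[str, Any]
    outputs: Optional[Dict[str, Any]] = None
    error: Optional[str] = None
    duration_s: float = 0.0
    total_tokens: int = 0
    children: List["Run"] = field(default_factory=list)


def render(run: Run, depth: int = 0) -> List[str]:
    """Flatten the tree into indented lines, like reading a trace top-down."""
    status = "ERROR" if run.error else "ok"
    lines = [f"{'  ' * depth}{run.name} [{run.run_type}] {run.duration_s:.2f}s {status}"]
    for child in run.children:
        lines.extend(render(child, depth + 1))
    return lines


def total_tokens(run: Run) -> int:
    """Sum token usage over the run and all of its descendants."""
    return run.total_tokens + sum(total_tokens(c) for c in run.children)


# A hypothetical agent trace: an LLM call, a tool call, and a final LLM call.
trace = Run(
    name="AgentExecutor", run_type="chain",
    inputs={"input": "What is 123 * 45?"}, outputs={"output": "5535"},
    duration_s=2.10,
    children=[
        Run("ChatOpenAI", "llm", {"messages": "..."}, {"text": "use calculator"},
            duration_s=0.90, total_tokens=120),
        Run("calculator", "tool", {"expression": "123 * 45"}, {"result": "5535"},
            duration_s=0.01),
        Run("ChatOpenAI", "llm", {"messages": "..."}, {"text": "5535"},
            duration_s=1.10, total_tokens=150),
    ],
)

print("\n".join(render(trace)))
print("total tokens:", total_tokens(trace))
```

Reading the rendered lines top to bottom corresponds to expanding nodes in the trace view: the parent run wraps the agent loop, and each child shows one thought, action, or observation step.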




---

## 4. Analyzing Performance, Cost, and Issues

LangSmith not only helps with debugging but also provides analytical tools to understand your application's performance and cost.

* **Overview Statistics:** The project dashboard displays aggregated metrics such as total runs, total cost, and average latency.
* **Performance Analysis:**
    * **Latency Distribution Charts:** Helps identify unusually slow cases.
    * **Cost Analysis:** Tracks cost over time, by model, or by call type.
    * **Bottleneck Identification:** By viewing the execution time of each step in the trace, you can quickly identify components causing the highest latency.
* **Issue Identification and Resolution:**
    * **Error Filtering:** Easily filter failed Runs to inspect and understand the root cause.
    * **A/B Comparison:** LangSmith allows you to run A/B tests by grouping Runs into different "experiments" and comparing metrics between them. This is very useful when you're experimenting with different prompt versions, models, or architectures.
    * **Regression Testing:** After making changes, you can re-run tests on the same dataset and compare the results to confirm that quality, latency, and cost have not regressed.
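
The aggregates described above can be sketched offline to show what the dashboard is computing. This example uses only the standard library; the record fields `latency_s`, `error`, and `total_tokens` are hypothetical names chosen for the sketch, and the sample runs are invented. With the SDK you could likely feed it data adapted from `Client.list_runs(project_name=...)`, but that mapping is an assumption to verify against your installed version.

```python
import math
from statistics import mean
from types import SimpleNamespace


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize_runs(runs):
    """Aggregate latency, error rate, and token usage for an iterable of runs."""
    runs = list(runs)
    if not runs:
        return {"count": 0}
    latencies = sorted(r.latency_s for r in runs)
    return {
        "count": len(runs),
        "error_rate": sum(1 for r in runs if r.error) / len(runs),
        "mean_latency_s": mean(latencies),
        "p95_latency_s": percentile(latencies, 95),
        "total_tokens": sum(r.total_tokens or 0 for r in runs),
    }


def make_run(latency_s, error=None, total_tokens=0):
    """Build a stand-in run record for this sketch."""
    return SimpleNamespace(latency_s=latency_s, error=error, total_tokens=total_tokens)


# Two invented groups of runs, e.g. an old and a new prompt version.
prompt_v1 = [make_run(1.0, None, 300), make_run(0.5, None, 250),
             make_run(3.5, "Timeout", 0), make_run(1.0, None, 280)]
prompt_v2 = [make_run(0.5, None, 200), make_run(0.5, None, 220),
             make_run(1.0, None, 210), make_run(1.0, None, 230)]

for label, group in (("v1", prompt_v1), ("v2", prompt_v2)):
    print(label, summarize_runs(group))
```

Comparing the two summaries side by side is the same idea as an A/B comparison between experiments: the slow, failed run in the first group shows up both in the error rate and in the tail latency.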




---

## 5. Using LangSmith to Create Evaluation Datasets, Run Evaluations, and Compare Versions

LangSmith is not just a monitoring tool but also a powerful platform for LLM application evaluation.

* **Creating Datasets:**
    * You can create evaluation datasets directly in LangSmith by uploading examples (each with a question, optional context, and a ground-truth answer) or by extracting them from existing traces.
    * This data can be used for regression testing or quality evaluation.
* **Running Evaluations:**
    * **Automatic Evaluators:** LangSmith has built-in automatic evaluators (e.g., measuring accuracy, factual consistency, relevance using LLM-as-a-Judge). You can configure these evaluators to run on your dataset.
    * **Human Annotators:** You can route Runs to an annotation queue, where human reviewers score them manually against qualitative criteria.
    * **Version Comparison:** LangSmith allows you to run different versions of your application on the same dataset and visualize evaluation results to compare performance. This helps you make data-driven decisions about which version is better to deploy.
* **Prompt Hub:**
    * LangSmith also provides a Prompt Hub where you can store, version, and share your prompts. This helps ensure prompt consistency and reusability across projects.
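
The dataset and evaluation workflow can also be driven from code. The sketch below is one possible shape, not a definitive implementation: the rows, the dataset name `qa-smoke-test`, and the evaluator `contains_reference` are hypothetical, and the SDK calls (`create_dataset`, `create_examples`, `evaluate`) are written as I understand them, so their exact signatures should be checked against your installed `langsmith` version. The SDK is imported lazily so the evaluator can be tried locally without an API key.

```python
from types import SimpleNamespace

# Hypothetical rows: question, optional context, and a reference answer.
ROWS = [
    {"question": "What is the capital of Vietnam?",
     "context": "Hanoi is the capital of Vietnam.", "answer": "Hanoi"},
    {"question": "How many days are there in 2024?",
     "context": "2024 is a leap year, with 366 days.", "answer": "366"},
]


def split_rows(rows):
    """Split rows into the parallel inputs/outputs lists used for examples."""
    inputs = [{"question": r["question"], "context": r.get("context")} for r in rows]
    outputs = [{"answer": r["answer"]} for r in rows]
    return inputs, outputs


def contains_reference(run, example):
    """Custom evaluator: score 1 if the reference answer appears in the prediction."""
    predicted = (run.outputs or {}).get("answer", "")
    reference = example.outputs["answer"]
    return {"key": "contains_reference",
            "score": int(reference.lower() in predicted.lower())}


def upload_and_evaluate(target, dataset_name="qa-smoke-test", prefix="prompt-v1"):
    """Create the dataset and run one experiment (needs network and an API key).

    `target` takes an example's inputs dict and returns {"answer": "..."}.
    """
    from langsmith import Client
    from langsmith.evaluation import evaluate

    client = Client()
    dataset = client.create_dataset(dataset_name=dataset_name)
    inputs, outputs = split_rows(ROWS)
    client.create_examples(inputs=inputs, outputs=outputs, dataset_id=dataset.id)
    return evaluate(target, data=dataset_name,
                    evaluators=[contains_reference], experiment_prefix=prefix)


# Local check of the evaluator with stand-in objects; no LangSmith call is made.
fake_run = SimpleNamespace(outputs={"answer": "The capital is Hanoi."})
fake_example = SimpleNamespace(outputs={"answer": "Hanoi"})
print(contains_reference(fake_run, fake_example))
```

To compare versions, you would run the evaluation once per application version against the same dataset with a different `prefix` (for example `prompt-v1` and `prompt-v2`) and then compare the resulting experiments in the LangSmith UI.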




---

## 6. Practical Example: Integrating LangSmith into an Existing LangChain Application for Monitoring and Evaluation

We will reuse the simple Q&A system example from Lesson 9.2 and integrate LangSmith to trace runs and prepare for evaluation.

**Preparation:**
* Ensure you have the necessary libraries installed: `langchain-openai`, `langsmith`.
* Set the `OPENAI_API_KEY`, `LANGCHAIN_API_KEY`, `LANGCHAIN_TRACING_V2="true"`, `LANGCHAIN_PROJECT="My LangSmith Q&A App"` environment variables.

In [None]:
# Install libraries if not already installed
# pip install langchain-openai openai langsmith

import os
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langsmith import Client # To interact with LangSmith

# --- Set environment variables for OpenAI and LangSmith API keys ---
# Make sure you replace "YOUR_OPENAI_API_KEY" and "YOUR_LANGSMITH_API_KEY" with your actual keys.
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
# os.environ["LANGCHAIN_API_KEY"] = "YOUR_LANGSMITH_API_KEY"

# Enable tracing and set LangSmith project name
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "My LangSmith Q&A App"

# Initialize LangSmith Client (optional, but useful for advanced tasks)
langsmith_client = Client()

# --- 1. Simple Q&A System (our application) ---
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.7)

def simple_qa_system(question: str, context: str = None) -> str:
    """
    A simple Q&A system that uses an LLM to answer questions based on context.
    """
    # Use template placeholders instead of an f-string. With an f-string, any
    # braces in the user's text would be parsed as template variables, and the
    # inputs would not appear as structured variables in the trace.
    if context:
        user_template = "Context: {context}\n\nQuestion: {question}"
        inputs = {"question": question, "context": context}
    else:
        user_template = "Question: {question}"
        inputs = {"question": question}

    qa_prompt_template = ChatPromptTemplate.from_messages([
        ("system", "You are a helpful Q&A assistant. Answer the following question. If context is provided, use it to answer. Otherwise, answer based on your general knowledge."),
        ("user", user_template),
    ])
    qa_chain = qa_prompt_template | llm | StrOutputParser()

    # When you call invoke/stream/batch on a Runnable, LangSmith will automatically create a trace.
    response = qa_chain.invoke(inputs)
    return response

# --- 2. Test Dataset (to generate runs that LangSmith will trace) ---
test_questions = [
    {"question": "What is the capital of Vietnam?", "context": "Hanoi is the capital of Vietnam."},
    {"question": "Who invented the light bulb?", "context": None},
    {"question": "Calculate 123 multiplied by 45.", "context": None},
    {"question": "How many days are there in 2024?", "context": "2024 is a leap year, with 366 days."},
]

# --- 3. Run the application and observe on LangSmith ---
print("Starting Q&A application runs to generate traces on LangSmith...")

for i, item in enumerate(test_questions):
    question = item["question"]
    context = item["context"]
    print(f"\n--- Question {i+1} ---")
    print(f"Question: {question}")
    if context:
        print(f"Context: {context}")
    
    # Call the Q&A system
    llm_response = simple_qa_system(question, context)
    print(f"AI Response: {llm_response}")

print("\nFinished running questions. Please check your LangSmith dashboard.")
print(f"Visit https://smith.langchain.com and open the project '{os.getenv('LANGCHAIN_PROJECT')}' to see the runs.")

# --- 4. (Optional) Log custom feedback to LangSmith ---
# You can log feedback directly from code if desired.
# For example: Suppose you want to log feedback for the first question.
# After simple_qa_system runs, you can get the run_id from the trace.
# In a real environment, you would get the run_id from a callback or API response.

# Example illustrating how to log feedback (not automatically run in this example)
# run_id_example = "some_run_id_from_langsmith_trace"
# try:
#     langsmith_client.create_feedback(
#         run_id=run_id_example,
#         key="user_satisfaction",
#         score=5, # Example: 1-5 stars
#         comment="Response was very accurate and helpful!"
#     )
#     print(f"\nLogged feedback for run_id: {run_id_example}")
# except Exception as e:
#     print(f"\nCould not log feedback: {e}. Ensure run_id is valid and you have access.")

# --- 5. (Optional) Create Dataset and run Evaluation in LangSmith ---
# This is a conceptual/guidance section, no direct executable code here
# because creating datasets and running evaluations are typically done via the LangSmith UI
# or by using the LangSmith SDK in a more detailed manner.

# Step 1: Create Dataset in LangSmith
#   You can create a new dataset on the LangSmith UI and import examples (question, optional context, ground-truth answer).
#   Or use client.create_dataset() followed by client.create_examples().

# Step 2: Run Evaluation on Dataset
#   On the LangSmith UI, select your Dataset, then choose "Run Evaluation."
#   You can select LLM Evaluators (LLM-as-a-Judge) or connect to Human Evaluators.
#   LangSmith will run your application on each example in the dataset and collect scores.

# Step 3: Compare Versions
#   After running evaluations for multiple versions (e.g., version A with old prompt, version B with new prompt),
#   you can use the "Compare Runs" or "Compare Experiments" feature in LangSmith
#   to view evaluation metrics and detailed traces, helping you decide which version performs better.

print("\n--- End of Practical ---")
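
The feedback example in step 4 needs a `run_id`. One way to obtain it without copying it from the UI is to choose the ID yourself and pass it in the invocation config. The sketch below assumes that your LangChain version honors a `run_id` entry in the `config` passed to `invoke` (please verify this for your installed version); the helper names `invoke_with_run_id` and `log_user_feedback` are hypothetical.

```python
import uuid


def invoke_with_run_id(chain, inputs):
    """Invoke a Runnable-like object with a caller-chosen run ID.

    Assumes chain.invoke(inputs, config=...) honors a "run_id" config entry,
    so the ID of the resulting trace is known in advance.
    """
    run_id = uuid.uuid4()
    result = chain.invoke(inputs, config={"run_id": run_id})
    return result, run_id


def log_user_feedback(client, run_id, score, comment=None):
    """Attach a feedback score to a run through the client's create_feedback()."""
    return client.create_feedback(
        run_id=run_id,
        key="user_satisfaction",
        score=score,
        comment=comment,
    )


# Usage with real objects (requires API keys and network access):
# qa_chain = qa_prompt_template | llm | StrOutputParser()
# answer, run_id = invoke_with_run_id(qa_chain, {"question": "Who invented the light bulb?"})
# log_user_feedback(langsmith_client, run_id, score=5, comment="Accurate and helpful")
```

Both helpers only rely on the `invoke` and `create_feedback` methods, so they can be exercised with simple stand-in objects in unit tests before pointing them at a real chain and client.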