# Step 5 (Optional): Use LangSmith to track evaluation and trace

This notebook uses LangSmith to trace and track the evaluation. You need an API key to run it: sign up at smith.langchain.com and create a new key in the settings. LangSmith offers a free tier with 3,000 traces per month (as of Feb 2024), and this notebook uses approximately 200 observations.

In [1]:
%run 01-llm-app-setup.ipynb

## Tracing
Because we are already using LangChain, we only need to set the following environment variables, along with the API key, to start tracing.

In [2]:
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "LLM Eval Workshop"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"

# If you haven't set it in your ".env" file, you can set your API key here.
# os.environ["LANGCHAIN_API_KEY"] = "ls__..."
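
Before running the rest of the notebook, it can help to confirm that these settings are actually present. The following is a minimal sketch, assuming the four variable names used in the cell above; `missing_tracing_vars` is a hypothetical helper and not part of LangSmith or LangChain.

```python
import os

# Variable names taken from the cell above.
REQUIRED_VARS = (
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_PROJECT",
    "LANGCHAIN_ENDPOINT",
    "LANGCHAIN_API_KEY",
)


def missing_tracing_vars(env=None):
    """Return the required variable names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Report anything that still needs to be configured.
missing = missing_tracing_vars()
if missing:
    print("Missing LangSmith settings:", ", ".join(missing))
```

If the list is non-empty, set the missing values in your `.env` file or with `os.environ` as shown above.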

## Other ways to log traces to LangSmith
If you're not using LangChain, don't worry! There are other ways of using LangSmith; you can find them [here](https://docs.smith.langchain.com/tracing/faq/logging_and_viewing#logging-traces).

For non-LangChain apps, we find adding the `traceable` decorator to be the easiest way to log traces. Here's an example:

In [3]:
from langsmith import traceable
from langsmith.wrappers import wrap_openai
import openai

@traceable(run_type="retriever", name="Retrieve Context")
def retrieve_docs(question: str) -> str:
    docs = retriever.get_relevant_documents(question)
    return format_to_string_list(docs)

# LangSmith also provides a dedicated wrapper for OpenAI's API; alternatively, use `traceable` as above.
# Named `openai_client` so that it is not overwritten by the LangSmith `client` created later in this notebook.
openai_client = wrap_openai(openai.Client())

@traceable(name="RAG Pipeline Trace")
def rag_pipeline(question: str):
    context_list = retrieve_docs(question)
    
    messages = [
        { "role": "system", "content": "You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise." },
        { "role": "user", "content": f"Question: {question} \nContext: {concat_string(context_list)} \nAnswer:"}
    ]
    chat_completion = openai_client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages
    )
    return {
        "answer": chat_completion.choices[0].message.content,
        "context": context_list
    }

rag_pipeline("What is Task Decomposition?")

{'answer': 'Task decomposition is a technique that breaks down complex tasks into smaller and more manageable steps. This approach helps agents, models, or individuals to tackle difficult tasks by dividing them into simpler subtasks. Task decomposition can be achieved through techniques like Chain of Thought and Tree of Thoughts, which prompt the model to think step by step and explore multiple reasoning possibilities at each step.',
 'context': ['Fig. 1. Overview of a LLM-powered autonomous agent system.\nComponent One: Planning#\nA complicated task usually involves many steps. An agent needs to know what they are and plan ahead.\nTask Decomposition#\nChain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed 

## Uploading evaluation to LangSmith
### First, we need to register our eval dataset with LangSmith

In [4]:
import pandas as pd

gen_dataset = pd.read_csv("generated_qa.csv").fillna("")

In [5]:
from langsmith import Client

client = Client()
dataset_name = "RAG QA Dataset v2"


dataset = client.upload_dataframe(
    df=gen_dataset,
    input_keys=["question"],
    output_keys=["ground_truth", "ground_truth_context"],
    name=dataset_name,
    description="Dataset to test out QA with RAG.",
    data_type="kv" # The default
)
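
To make the `input_keys` / `output_keys` split concrete, the sketch below mimics it on a plain dictionary: columns named in `input_keys` become an example's inputs, and columns named in `output_keys` become its reference outputs, which the evaluators later read through `example.outputs`. `split_row` is a hypothetical helper and the sample row is invented for illustration; this is not LangSmith's implementation.

```python
def split_row(row, input_keys, output_keys):
    """Split one dataset row into the inputs/outputs shape of an example."""
    return {
        "inputs": {key: row[key] for key in input_keys},
        "outputs": {key: row[key] for key in output_keys},
    }


# Invented sample row with the same column names as generated_qa.csv.
sample_row = {
    "question": "What is Task Decomposition?",
    "ground_truth": "Breaking a complex task into smaller steps.",
    "ground_truth_context": "Task Decomposition: Chain of thought ...",
}

example = split_row(
    sample_row,
    input_keys=["question"],
    output_keys=["ground_truth", "ground_truth_context"],
)
print(example["inputs"])
print(sorted(example["outputs"]))
```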

### Then, we evaluate our app
First, let's set up our custom evaluators. LangSmith requires each evaluator to return its result as an `EvaluationResult` object.


In [6]:
%run 03-metrics-definition.ipynb

In [7]:
from langsmith.evaluation import EvaluationResult, run_evaluator

@run_evaluator
def ls_context_correctness(run, example) -> EvaluationResult:
    ground_truth_context = example.outputs["ground_truth_context"]
    retrieved_contexts = run.outputs.get("context") or []
    return EvaluationResult(key="context_correctness", score=context_correctness(ground_truth_context, retrieved_contexts))
    
    
@run_evaluator
def ls_ground_truth_context_rank(run, example) -> EvaluationResult:
    ground_truth_context = example.outputs["ground_truth_context"]
    retrieved_contexts = run.outputs.get("context") or []
    return EvaluationResult(key="ground_truth_context_rank", score=ground_truth_context_rank(ground_truth_context, retrieved_contexts))

@run_evaluator
def ls_context_rougel_score(run, example) -> EvaluationResult:
    ground_truth_context = example.outputs["ground_truth_context"]
    retrieved_contexts = run.outputs["context"]
    return EvaluationResult(key="context_rougel_score", score=context_rougel_score(ground_truth_context, retrieved_contexts))
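
Evaluator logic of this shape can be checked offline before any run is sent to LangSmith. The sketch below is an assumption-laden illustration: `SimpleNamespace` objects stand in for the `run` and `example` arguments, `context_correctness_stub` is a hypothetical substitute for the real metric defined in `03-metrics-definition.ipynb`, and a plain dictionary is returned instead of `EvaluationResult` so the snippet has no LangSmith dependency.

```python
from types import SimpleNamespace


def context_correctness_stub(ground_truth_context, retrieved_contexts):
    # Hypothetical stand-in for the workshop's metric:
    # True when the ground-truth chunk is among the retrieved chunks.
    return ground_truth_context in retrieved_contexts


def evaluate_context(run, example):
    """Same structure as the evaluators above, minus the LangSmith types."""
    ground_truth_context = example.outputs["ground_truth_context"]
    # Tolerate a missing key or a None value in the run outputs.
    retrieved_contexts = (run.outputs or {}).get("context") or []
    return {
        "key": "context_correctness",
        "score": context_correctness_stub(ground_truth_context, retrieved_contexts),
    }


# Stand-in objects exposing only the attribute the evaluator reads.
fake_example = SimpleNamespace(outputs={"ground_truth_context": "chunk A"})
hit_run = SimpleNamespace(outputs={"context": ["chunk B", "chunk A"]})
empty_run = SimpleNamespace(outputs={})

print(evaluate_context(hit_run, fake_example))
print(evaluate_context(empty_run, fake_example))
```

Exercising the empty-output case this way shows whether an evaluator degrades to a score or raises when a run returns no context.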


In [8]:
from langchain.smith import RunEvalConfig, run_on_dataset

eval_config = RunEvalConfig(
    custom_evaluators=[ls_context_correctness, ls_ground_truth_context_rank, ls_context_rougel_score],
    
    # You can also use a prebuilt evaluator
    # by providing a name or RunEvalConfig.<configured evaluator>
    evaluators=[
        # You can specify an evaluator by name/enum.
        RunEvalConfig.Criteria("harmfulness"),
        # And also define your own custom LLM evaluator.
        RunEvalConfig.Criteria(
            {
                "helpfulness": "Are the answers helpful and provide new information to the user?"
            }
        ),
    ],
    
    input_key="question",
    reference_key="ground_truth",
    prediction_key="answer"
)


## Run the evaluation

In [9]:
from operator import itemgetter  # needed for itemgetter("question") below

from langsmith import Client

client = Client()

client.run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=itemgetter("question") | rag_chain,
    evaluation=eval_config,
    verbose=True,
    # Any experiment metadata can be specified here
    project_metadata={"version": "0.0.1"},
    
)

View the evaluation results for project 'excellent-gate-27' at:
https://smith.langchain.com/o/1f1a0b6d-5609-5d96-85ab-9e2f8e91c6f3/datasets/60ed6f99-71e1-41aa-91cb-77cca7d1789b/compare?selectedSessions=73298c44-66c1-428e-a5df-6fa0b013ee80

View all tests for Dataset RAG QA Dataset v2 at:
https://smith.langchain.com/o/1f1a0b6d-5609-5d96-85ab-9e2f8e91c6f3/datasets/60ed6f99-71e1-41aa-91cb-77cca7d1789b
[------------------------------------------------->] 76/76

| | feedback.harmfulness | feedback.helpfulness | feedback.context_correctness | feedback.ground_truth_context_rank | feedback.context_rougel_score | error | execution_time | run_id |
|---|---|---|---|---|---|---|---|---|
| count | 76.0 | 76.0 | 76 | 76.0 | 76.0 | 0.0 | 76.0 | 76 |
| unique | | | 2 | | | 0.0 | | 76 |
| top | | | True | | | | | 7e2303bb-2ab3-4801-8cff-2e89672e2e67 |
| freq | | | 59 | | | | | 1 |
| mean | 0.0 | 0.921053 | | 0.0 | 0.663048 | | 9.155459 | |
| std | 0.0 | 0.271448 | | 0.783156 | 0.445221 | | 24.838253 | |
| min | 0.0 | 0.0 | | -1.0 | 0.0 | | 0.839262 | |
| 25% | 0.0 | 1.0 | | 0.0 | 0.112393 | | 1.499281 | |
| 50% | 0.0 | 1.0 | | 0.0 | 1.0 | | 1.991135 | |
| 75% | 0.0 | 1.0 | | 0.0 | 1.0 | | 2.425742 | |
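
A note on reading this summary: `feedback.context_correctness` is boolean, so it is reported with `top` and `freq` rather than a mean. The share of `True` results follows from `freq / count`. The sketch below works through that arithmetic with the numbers from the summary above; the helpfulness count additionally assumes that helpfulness scores are binary (0 or 1), which the quartiles suggest but the output does not state.

```python
# Values copied from the summary above.
count = 76
context_correct_true = 59   # `freq` of the `top` value True
helpfulness_mean = 0.921053

# Share of runs whose retrieved context was judged correct.
context_correct_rate = context_correct_true / count

# Assuming binary helpfulness scores, mean * count gives the number of helpful answers.
helpful_count = round(helpfulness_mean * count)

print(f"context_correctness True: {context_correct_rate:.1%}")
print(f"helpful answers: {helpful_count} of {count}")
```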


{'project_name': 'excellent-gate-27',
 'results': {'4061157c-e376-4b83-a11c-ab3b1f985435': {'input': {'question': 'What is the main purpose of the benchmark mentioned in the text?'},
   'feedback': [EvaluationResult(key='harmfulness', score=0, value='N', comment='The criterion for this assessment is whether the submission is harmful, offensive, or inappropriate.\n\nLooking at the submission, it is a response to a question about the main purpose of a benchmark mentioned in a text. The response is focused on the topic, providing a detailed explanation of the purpose of the benchmark in relation to tool-augmented LLMs and API tools.\n\nThere is no language or content in the submission that could be considered harmful, offensive, or inappropriate. The submission is purely informational and relevant to the question asked.\n\nTherefore, the submission is not harmful, offensive, or inappropriate.\n\nN', correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('a85a4604-ceff-475d-abac-54a

## Please go to smith.langchain.com to see your run

![](langsmith.png)