# Step 5 (Optional): Use LangSmith to track evaluation and trace

This notebook uses LangSmith to trace the application and track its evaluation. You need an API key to run it. LangSmith offers a free tier with 5,000 traces per month (as of June 2024); this notebook uses approximately 200 traces. To get an API key, sign up at smith.langchain.com and create a new key in the settings.

In [1]:
%run 01-llm-app-setup.ipynb

Repo card metadata block was not found. Setting CardData to empty.
100%|██████████| 5/5 [00:00<00:00, 2879.91it/s]


Parsing nodes:   0%|          | 0/5 [00:00<?, ?it/s]

Documents before chunking: 5
Documents after chunking: 35




 = Plain maskray = 
 
 The plain maskray or brown stingray ( Neotrygon annotata ) is a species of stingray in the family Dasyatidae . It is found in shallow , soft-bottomed habitats off northern Australia . Reaching 24 cm ( 9.4 in ) in width , this species has a diamond-shaped , grayish green pectoral fin disc . Its short , whip-like tail has alternating black and white bands and fin folds above and below . There are short rows of thorns on the back and the base of the tail , but otherwise the s


Generating embeddings:   0%|          | 0/35 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


The plain maskray is found in the continental shelf of northern Australia, from the Wellesley Islands in Queensland to the Bonaparte Archipelago in Western Australia, including the Gulf of Carpentaria and the Timor and Arafura Seas. There are also unsubstantiated reports that its range extends to southern Papua New Guinea.

Source: Plain maskray
Relevant Snippet: "The plain maskray inhabits the continental shelf of northern Australia from the Wellesley Islands in Queensland to the Bonaparte Archipelago in Western Australia, including the Gulf of Carpentaria and the Timor and Arafura Seas. There are unsubstantiated reports that its range extends to southern Papua New Guinea."


## Setup

In [2]:
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "GenAI Bootcamp 3.0"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"

# If you haven't set it in your ".env" file, you can set your API key here.
# os.environ["LANGCHAIN_API_KEY"] = "..."
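Before running the cells below, it can help to confirm that the tracing variables are actually set. The following is a minimal sketch, assuming the key is stored in `LANGCHAIN_API_KEY` as in the cell above; `missing_langsmith_vars` is a hypothetical helper name and not part of LangSmith.

```python
import os

# Variables the setup cell relies on; LANGCHAIN_PROJECT is treated as optional here.
REQUIRED_VARS = ("LANGCHAIN_TRACING_V2", "LANGCHAIN_ENDPOINT", "LANGCHAIN_API_KEY")


def missing_langsmith_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_langsmith_vars()
if missing:
    print("Missing LangSmith settings:", ", ".join(missing))
```

An empty list means every variable has a non-empty value; it does not verify that the key itself is valid.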

## Logging to LangSmith

If you're not using LangChain, don't worry! There are other ways to log to LangSmith; they are described in the [LangSmith tracing docs](https://docs.smith.langchain.com/tracing/faq/logging_and_viewing#logging-traces).

For non-LangChain apps, we find adding the `traceable` decorator to be the easiest way to log traces. Here's an example:

In [3]:
from langsmith import traceable

@traceable(name="RAG Pipeline Trace")
def rag_pipeline(question: str):
    model_output = openai_query_engine.query(question)
    response = model_output.response
    context = [node.text for node in model_output.source_nodes]
    return {
        "output": response,
        "context": context
    }

rag_pipeline("What is the scientific name of the brown stingray?")


{'output': 'The scientific name of the brown stingray is *Neotrygon annotata*.\n\nSource: Plain maskray\nRelevant Snippet: "The plain maskray or brown stingray (Neotrygon annotata) is a species of stingray in the family Dasyatidae."',
 'context': ['= Plain maskray = \n \n The plain maskray or brown stingray ( Neotrygon annotata ) is a species of stingray in the family Dasyatidae . It is found in shallow , soft-bottomed habitats off northern Australia . Reaching 24 cm ( 9.4 in ) in width , this species has a diamond-shaped , grayish green pectoral fin disc . Its short , whip-like tail has alternating black and white bands and fin folds above and below . There are short rows of thorns on the back and the base of the tail , but otherwise the skin is smooth . While this species possesses the dark mask-like pattern across its eyes common to its genus , it is not ornately patterned like other maskrays . \n Benthic in nature , the plain maskray feeds mainly on caridean shrimp and polychaete w
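To make the idea of a tracing decorator concrete, here is a toy stand-in written with the standard library only. It is not LangSmith's implementation and sends nothing to any server; `toy_traceable`, `TRACE_LOG`, and `toy_pipeline` are hypothetical names used for illustration. It only shows the kind of information such a decorator can capture around a call: a name, the inputs, the outputs or error, and the elapsed time.

```python
import functools
import time

TRACE_LOG = []  # in-memory stand-in for a tracing backend (not LangSmith)


def toy_traceable(name):
    """Record the inputs, outputs, errors, and duration of each call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            record = {"name": name, "inputs": {"args": args, "kwargs": kwargs}}
            try:
                result = func(*args, **kwargs)
                record["outputs"] = result
                return result
            except Exception as exc:
                record["error"] = repr(exc)
                raise
            finally:
                # Runs on success and on failure, so every call is logged.
                record["seconds"] = time.perf_counter() - start
                TRACE_LOG.append(record)
        return wrapper
    return decorator


@toy_traceable(name="Toy pipeline")
def toy_pipeline(question: str):
    # Placeholder logic; a real pipeline would query the RAG engine.
    return {"output": question.upper(), "context": []}


toy_pipeline("hello")
```

After the call, `TRACE_LOG` holds one record for the run, which is roughly the role a trace plays in the LangSmith UI.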

## Uploading evaluation to LangSmith
### First, we need to register our eval dataset with LangSmith

In [4]:
import pandas as pd

gen_dataset = pd.read_csv("generated_qa.csv").fillna("")

In [5]:
from langsmith import Client

client = Client()
dataset_name = "RAG QA Dataset v2"


dataset = client.upload_dataframe(
    df=gen_dataset,
    input_keys=["question"],
    output_keys=["ground_truth", "ground_truth_context"],
    name=dataset_name,
    description="Dataset to test out QA with RAG.",
    data_type="kv" # The default
)
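As a rough mental model of the upload above, each row of the dataframe becomes one dataset example: the `input_keys` columns form the example's inputs and the `output_keys` columns form its reference outputs. The sketch below mimics that split with plain dictionaries. `split_row` is a hypothetical helper, and the sample row is made up for illustration rather than taken from `generated_qa.csv`.

```python
def split_row(row, input_keys, output_keys):
    """Split one record into (inputs, outputs) dictionaries by column name."""
    inputs = {key: row[key] for key in input_keys}
    outputs = {key: row[key] for key in output_keys}
    return inputs, outputs


# Illustrative row only; the real rows come from generated_qa.csv.
sample_row = {
    "question": "What is the scientific name of the brown stingray?",
    "ground_truth": "Neotrygon annotata",
    "ground_truth_context": "The plain maskray or brown stingray ( Neotrygon annotata ) ...",
}

inputs, outputs = split_row(
    sample_row,
    input_keys=["question"],
    output_keys=["ground_truth", "ground_truth_context"],
)
print(inputs)
print(sorted(outputs))
```

This is why the evaluators later read `example.outputs["ground_truth_context"]`: the reference columns end up under the example's outputs.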

### Then, we evaluate our app
First, let's set up our custom evaluators. LangSmith expects each evaluator to return its result as an `EvaluationResult` object.


In [6]:
%run 03-metrics-definition.ipynb

In [7]:
from langsmith.evaluation import EvaluationResult, run_evaluator

@run_evaluator
def ls_context_correctness(run, example) -> EvaluationResult:
    ground_truth_context = example.outputs["ground_truth_context"]
    # Fall back to an empty list if the run returned no context.
    retrieved_contexts = run.outputs.get("context") or []
    return EvaluationResult(
        key="context_correctness",
        score=context_correctness(ground_truth_context, retrieved_contexts),
    )


@run_evaluator
def ls_ground_truth_context_rank(run, example) -> EvaluationResult:
    ground_truth_context = example.outputs["ground_truth_context"]
    retrieved_contexts = run.outputs.get("context") or []
    return EvaluationResult(
        key="ground_truth_context_rank",
        score=ground_truth_context_rank(ground_truth_context, retrieved_contexts),
    )


@run_evaluator
def ls_context_rougel_score(run, example) -> EvaluationResult:
    ground_truth_context = example.outputs["ground_truth_context"]
    retrieved_contexts = run.outputs.get("context") or []
    return EvaluationResult(
        key="context_rougel_score",
        score=context_rougel_score(ground_truth_context, retrieved_contexts),
    )
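The metric functions wrapped above are defined in `03-metrics-definition.ipynb`, which is not shown here. For readers without that notebook, the following is a hypothetical sketch of what two of them might look like. It is an assumption based only on the names and on the summary statistics further down (a boolean correctness score, and a rank whose minimum is -1), and the `_sketch` suffix marks both functions as stand-ins rather than the real definitions.

```python
def context_correctness_sketch(ground_truth_context, retrieved_contexts):
    """Assumed behavior: True if the ground-truth chunk was retrieved verbatim."""
    return ground_truth_context in retrieved_contexts


def ground_truth_context_rank_sketch(ground_truth_context, retrieved_contexts):
    """Assumed behavior: 0-based position of the ground-truth chunk, or -1 if absent."""
    try:
        return retrieved_contexts.index(ground_truth_context)
    except ValueError:
        return -1


retrieved = ["chunk about habitat", "chunk about taxonomy"]
print(context_correctness_sketch("chunk about taxonomy", retrieved))
print(ground_truth_context_rank_sketch("chunk about taxonomy", retrieved))
print(ground_truth_context_rank_sketch("chunk about diet", retrieved))
```

Exact string matching is the simplest possible choice; the real metrics may normalize whitespace or use fuzzy matching instead.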


In [8]:
from langchain.smith import RunEvalConfig, run_on_dataset

eval_config = RunEvalConfig(
    custom_evaluators=[ls_context_correctness, ls_ground_truth_context_rank, ls_context_rougel_score],
    
    # You can also use a prebuilt evaluator
    # by providing a name or RunEvalConfig.<configured evaluator>
    evaluators=[
        # You can specify an evaluator by name/enum.
        RunEvalConfig.Criteria("harmfulness"),
        # And also define your own custom LLM evaluator.
        RunEvalConfig.Criteria(
            {
                "helpfulness": "Are the answers helpful and provide new information to the user?"
            }
        ),
    ],
    
    input_key="question",
    reference_key="ground_truth",
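    # NOTE: this must match a key returned by the pipeline. The pipeline here
    # returns "output" and "context", so "answer" makes the LLM-based evaluators
    # fail (see the feedback errors in the results below).
    prediction_key="answer"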
)


## Run the evaluation

In [9]:
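# Variant of rag_pipeline that takes the dataset example's input dict
# (keyed by "question") instead of a bare question string.
def llm_call(input):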
    model_output = openai_query_engine.query(input["question"])
    response = model_output.response
    context = [node.text for node in model_output.source_nodes]
    return {
        "output": response,
        "context": context
    }

In [10]:
from langsmith import Client

client = Client()

client.run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=rag_pipeline,
    evaluation=eval_config,
    verbose=True,
    # Any experiment metadata can be specified here
    project_metadata={"version": "0.0.1"},
    
)


View the evaluation results for project 'virtual-protest-75' at:
https://smith.langchain.com/o/1f1a0b6d-5609-5d96-85ab-9e2f8e91c6f3/datasets/10c00bea-ed0e-4516-9515-c2aa87668041/compare?selectedSessions=2e4cc2a7-1355-41ea-8005-c60eccd9a393

View all tests for Dataset RAG QA Dataset v2 at:
https://smith.langchain.com/o/1f1a0b6d-5609-5d96-85ab-9e2f8e91c6f3/datasets/10c00bea-ed0e-4516-9515-c2aa87668041
[------------------------------------------------->] 45/45

| | feedback.harmfulness | feedback.helpfulness | feedback.context_correctness | feedback.ground_truth_context_rank | feedback.context_rougel_score | error | execution_time | run_id |
|---|---|---|---|---|---|---|---|---|
| count | 0.0 | 0.0 | 45 | 45.0 | 45.0 | 0.0 | 45.0 | 45 |
| unique | 0.0 | 0.0 | 2 | | | 0.0 | | 45 |
| top | | | True | | | | | c7f492af-13d5-498d-896d-26c2072962ec |
| freq | | | 30 | | | | | 1 |
| mean | | | | -0.177778 | 0.575388 | | 1.855363 | |
| std | | | | 0.716332 | 0.462988 | | 0.789498 | |
| min | | | | -1.0 | 0.0 | | 0.774108 | |
| 25% | | | | -1.0 | 0.153614 | | 1.312143 | |
| 50% | | | | 0.0 | 1.0 | | 1.622293 | |
| 75% | | | | 0.0 | 1.0 | | 2.233052 | |


{'project_name': 'virtual-protest-75',
 'results': {'c6831904-2e55-4d18-855d-796ddca32456': {'input': {'question': 'What is the full Japanese title of Valkyria Chronicles III?'},
   'feedback': [EvaluationResult(key='harmfulness', score=None, value=None, comment="Error evaluating run c7f492af-13d5-498d-896d-26c2072962ec: Run with ID c7f492af-13d5-498d-896d-26c2072962ec doesn't have the expected prediction key 'answer'. Available prediction keys in this Run are: output, context. Adjust the evaluator's prediction_key or ensure the Run object's outputs the expected key.", correction=None, evaluator_info={}, feedback_config=None, source_run_id=None, target_run_id=None),
    EvaluationResult(key='helpfulness', score=None, value=None, comment="Error evaluating run c7f492af-13d5-498d-896d-26c2072962ec: Run with ID c7f492af-13d5-498d-896d-26c2072962ec doesn't have the expected prediction key 'answer'. Available prediction keys in this Run are: output, context. Adjust the evaluator's prediction
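The feedback above reports that the run has no `answer` key and that the available prediction keys are `output` and `context`. As the message suggests, one likely remedy is to set `prediction_key` to a key the pipeline actually returns (here, probably `"output"`). The sketch below shows a small pre-flight check that could catch the mismatch before a full dataset run; `check_prediction_key` is a hypothetical helper, not a LangSmith or LangChain function.

```python
def check_prediction_key(run_outputs, prediction_key):
    """Raise early if the evaluator's prediction key is missing from the outputs."""
    if prediction_key not in run_outputs:
        available = ", ".join(sorted(run_outputs))
        raise KeyError(
            f"prediction key {prediction_key!r} not found; available keys: {available}"
        )
    return prediction_key


# Shape of the dictionary returned by the pipeline in this notebook.
sample_outputs = {"output": "some answer", "context": ["some chunk"]}

print(check_prediction_key(sample_outputs, "output"))

try:
    check_prediction_key(sample_outputs, "answer")
except KeyError as err:
    print("Mismatch:", err)
```

Running the pipeline on a single question and passing its result through such a check is cheaper than discovering the mismatch after all 45 examples have been evaluated.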

## Please go to smith.langchain.com to see your run

![Screenshot of the LangSmith interface as of June 2024](langsmith.png)