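## Evaluating Your RAG System with LangSmith

Use the LangSmith platform in place of manual RAG system evaluation.

1. Manage datasets
2. Evaluate accuracy
3. Evaluate latency
4. Visualize evaluation results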

## 1. Environment
Third-party packages to install beforehand:

In [None]:
! pip install -U -q langchain langsmith tiktoken unstructured==0.12.5 openai pandas langchain-community chromadb langchain-openai arxiv pymupdf

## 2. Setup

This notebook uses an embedding model, a rerank model, a chat model, and LangChain, which need the following credentials:
1. Embedding model (OpenAI key)
2. Rerank model (Cohere key)
3. Chat model (OpenAI key)
4. LangChain API key

In [25]:
import getpass
import os

# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")
# Set your LangChain (LangSmith) API key
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API key: ")

## 3. Load Data
Load an arXiv paper, using the [RAGAS](https://arxiv.org/pdf/2309.15217) paper as an example.

In [26]:
from langchain_community.document_loaders import ArxivLoader

paper_docs = ArxivLoader(query="2309.15217", load_max_docs=1).load()
len(paper_docs)

1

## 4. Prepare Data

- text splitter
- embedding
- store 

In [27]:
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the paper into chunks of at most 500 characters
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500)
docs = text_splitter.split_documents(paper_docs)

# Embed and store the first 10 chunks only
vectorstore = Chroma.from_documents(docs[:10], OpenAIEmbeddings())

# Expose the vector store as a retriever
retriever = vectorstore.as_retriever()
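
The split, embed, store, and retrieve steps above can be pictured with a dependency-free toy. This is only a sketch: the bag-of-words `embed` function is a stand-in for `OpenAIEmbeddings`, the greedy `split_text` is much simpler than `RecursiveCharacterTextSplitter`, and all function names and the sample text are hypothetical.

```python
import math
import re
from collections import Counter


def split_text(text: str, chunk_size: int) -> list[str]:
    """Greedily pack whitespace-separated words into chunks of at most chunk_size characters.

    A single word longer than chunk_size is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > chunk_size and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


toy_text = (
    "Faithfulness measures whether the answer is grounded in the retrieved context. "
    "Answer relevance measures whether the answer addresses the question. "
    "Context relevance measures how focused the retrieved context is."
)
toy_chunks = split_text(toy_text, chunk_size=80)
top = retrieve("Is the answer grounded in the context?", toy_chunks, k=1)
print(top[0])
```

A real vector store replaces the linear scan in `retrieve` with an index, but the ranking idea (nearest vectors to the query) is the same.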

## 5. Build RAG Chain

In [28]:
### RAG

import openai
from langsmith import traceable
from langsmith.wrappers import wrap_openai

class RagBot:
    
    def __init__(self, retriever, model: str = "gpt-4o-2024-05-13"):
        self._retriever = retriever
        # Wrapping the client instruments the LLM
        self._client = wrap_openai(openai.Client())
        self._model = model

    @traceable()
    def retrieve_docs(self, question):
        return self._retriever.invoke(question)

    @traceable()
    def get_answer(self, question: str):
        similar = self.retrieve_docs(question)
        response = self._client.chat.completions.create(
            model=self._model,
            messages=[
                {
                    "role": "system",
                    "content": "You are an expert language model designed to answer questions about academic papers in the fields of computer science, physics, mathematics, and statistics, among others, as indexed on arXiv."
                    " Use the following docs to produce accurate answers to the user question.\n\n"
                    f"## Docs\n\n{similar}",
                },
                {"role": "user", "content": question},
            ],
        )

        # Evaluators will expect "answer" and "contexts"
        return {
            "answer": response.choices[0].message.content,
            "contexts": [str(doc) for doc in similar],
        }


rag_bot = RagBot(retriever)

In [29]:
response = rag_bot.get_answer("What is ragas?")
response["answer"][:150]

'RAGAS (Retrieval Augmented Generation Assessment) is a framework designed for the reference-free evaluation of Retrieval Augmented Generation (RAG) pi'
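
Note that the system prompt interpolates `similar` directly, so the model receives the string form of the whole document list, metadata included. One possible alternative, sketched here under assumptions, joins only the chunk text. `format_docs` is a hypothetical helper, and `Doc` is a stand-in that mimics only the `page_content` and `metadata` attributes of a LangChain `Document`; the sample sentences are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document (only the attributes used here)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs) -> str:
    """Hypothetical helper: number each chunk and join the text with blank lines."""
    return "\n\n".join(
        f"[{i}] {doc.page_content}" for i, doc in enumerate(docs, start=1)
    )


sample = [
    Doc("RAGAS is a framework for evaluating RAG pipelines."),
    Doc("It does not require ground-truth annotations."),
]
docs_block = format_docs(sample)
print(docs_block)
```

Inside `get_answer`, the result of such a helper could replace `{similar}` in the f-string; whether that improves answers would need to be checked with an evaluation run.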

## 6. Load Ground-Truth Dataset

In [30]:
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'

In [31]:
import json

# Load the JSON file
with open('ragas_qa.json', 'r', encoding='utf-8') as f:
    data = json.load(f)


In [32]:
inputs = [row["question"] for row in data]
outputs = [row["answer"] for row in data]

qa_pairs = [{"question": q, "answer": a} for q, a in zip(inputs, outputs)]
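
The cell above assumes that `ragas_qa.json` is a list of objects, each with a `question` and an `answer` key. A small check before uploading could look like the sketch below; `validate_qa_rows` is a hypothetical helper and the sample rows are made up for illustration.

```python
def validate_qa_rows(rows) -> list[dict]:
    """Keep only rows whose 'question' and 'answer' are non-empty strings."""
    valid = []
    for i, row in enumerate(rows):
        q, a = row.get("question"), row.get("answer")
        if isinstance(q, str) and q.strip() and isinstance(a, str) and a.strip():
            valid.append({"question": q.strip(), "answer": a.strip()})
        else:
            # Report the offending row instead of failing on a KeyError later
            print(f"Skipping row {i}: missing question or answer")
    return valid


sample_rows = [
    {"question": "What is RAGAS?", "answer": "A framework for evaluating RAG pipelines."},
    {"question": "", "answer": "An answer without a question."},
    {"question": "A question without an answer."},
]
clean_rows = validate_qa_rows(sample_rows)
print(len(clean_rows))
```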

In [33]:
from langsmith import Client

# Create the dataset, or reuse it if it already exists.
# Calling create_dataset again with an existing name fails with
# HTTP 409 Conflict ("Dataset with this name already exists.").
client = Client()
dataset_name = "rags-test2"
if client.has_dataset(dataset_name=dataset_name):
    dataset = client.read_dataset(dataset_name=dataset_name)
else:
    dataset = client.create_dataset(
        dataset_name=dataset_name,
        description="QA pairs about ragas2.",
    )
    client.create_examples(
        inputs=[{"question": q} for q in inputs],
        outputs=[{"answer": a} for a in outputs],
        dataset_id=dataset.id,
    )

## 7. Evaluate RAG

In [34]:
# RAG chain
def predict_rag_answer(example: dict):
    """Use this for answer evaluation"""
    response = rag_bot.get_answer(example["question"])
    return {"answer": response["answer"]}

def predict_rag_answer_with_context(example: dict):
    """Use this for evaluation of retrieved documents and hallucinations"""
    response = rag_bot.get_answer(example["question"])
    return {"answer": response["answer"], "contexts": response["contexts"]}

In [None]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

# Evaluator: grade each predicted answer against the reference answer
qa_evaluator = [
    LangChainStringEvaluator(
        "qa",
        prepare_data=lambda run, example: {
            "prediction": run.outputs["answer"],
            "reference": example.outputs["answer"],
            "input": example.inputs["question"],
        },
    ),
]
experiment_results = evaluate(
    predict_rag_answer,
    data=dataset_name,
    evaluators=qa_evaluator,
    experiment_prefix="ragas-paper-qa-gp4o",
    metadata={"variant": "ragas-paper-page2, gpt-4o-2024-05-13"},
)
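
Besides the built-in `"qa"` evaluator, `evaluate` also accepts plain functions that take a run and an example and return a dict with a `key` and a `score`. The following is a minimal sketch of such a function, not a recommended metric: `reference_overlap` is a hypothetical name, the token-overlap score is a crude proxy for correctness, and `SimpleNamespace` objects stand in for the run and example so that the code runs without LangSmith.

```python
import re
from types import SimpleNamespace


def _tokens(text: str) -> set:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def reference_overlap(run, example) -> dict:
    """Fraction of reference-answer tokens that also appear in the prediction."""
    predicted = _tokens(run.outputs["answer"])
    reference = _tokens(example.outputs["answer"])
    score = len(predicted & reference) / len(reference) if reference else 0.0
    return {"key": "reference_overlap", "score": score}


# Stand-ins for the run and example objects passed to evaluators (illustrative data).
fake_run = SimpleNamespace(
    outputs={"answer": "RAGAS is a framework for evaluating RAG pipelines."}
)
fake_example = SimpleNamespace(outputs={"answer": "A framework for RAG evaluation."})
result = reference_overlap(fake_run, fake_example)
print(result)
```

A function like this could be appended to the list passed as `evaluators`, so that its score appears next to the LLM-graded one in the experiment view.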