# Automated, Effective Evaluation of RAG Systems with LangChain and RAGAS

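This notebook focuses on one important component of an LLM-based RAG question-answering system:

- Evaluation

We build the RAG question-answering system with LangChain and evaluate it with the RAGAS framework, which is steadily becoming a standard approach to evaluating RAG systems.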

### Install dependencies

In [None]:
!pip install -U -q langchain openai ragas arxiv pymupdf chromadb wandb tiktoken

In [None]:
import os
import openai
from getpass import getpass

openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

### Data preparation

We use an arXiv paper as the evaluation example, loading it with `ArxivLoader` to serve as the RAG context.

- [`ArxivLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.arxiv.ArxivLoader.html#langchain_community.document_loaders.arxiv.ArxivLoader)


In [None]:
from langchain.document_loaders import ArxivLoader

paper_docs = ArxivLoader(query="2309.15217", load_max_docs=1).load()
len(paper_docs)

In [None]:
for doc in paper_docs:
  print(doc.metadata)

### Text splitting, embedding model, and vector store

We split the text with `RecursiveCharacterTextSplitter`, embed the chunks with `OpenAIEmbeddings()`, and store the vectors in a `VectorStore` (here, `Chroma`).

- `RecursiveCharacterTextSplitter()`
- `OpenAIEmbeddings()`
- `Chroma`
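
To make the splitting step concrete, here is a minimal pure-Python sketch of the recursive idea: try coarse separators first and fall back to finer ones only for pieces that are still too long. The function `split_recursive` and the sample text are hypothetical illustrations, not LangChain's implementation, and the sketch ignores chunk overlap.

```python
def split_recursive(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Toy recursive splitter (an illustration, not LangChain's code)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to hard cuts of chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits into the chunk being built.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(split_recursive(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "alpha beta gamma\n\ndelta epsilon\nzeta " + "x" * 30
chunks = split_recursive(sample, chunk_size=12)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Paragraph and line boundaries are preferred as cut points, so chunks tend to stay semantically coherent; hard character cuts happen only as a last resort.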

In [None]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500)

docs = text_splitter.split_documents(paper_docs)

vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())

In [None]:
len(docs)

In [None]:
print(max([len(chunk.page_content) for chunk in docs]))

We can now retrieve from the `Chroma` vector store through `.as_retriever()`. The main parameter to control is `k`, the number of chunks returned per query.

In [None]:
base_retriever = vectorstore.as_retriever(search_kwargs={"k" : 3})

In [None]:
relevant_docs = base_retriever.get_relevant_documents("What is Retrieval Augmented Generation?")

In [None]:
len(relevant_docs)
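
Conceptually, a retriever with `k = 3` ranks the stored chunk vectors by similarity to the query vector and returns the top three. The sketch below illustrates this with cosine similarity on toy two-dimensional vectors. The helper names and vectors are hypothetical, and the metric a real vector store uses depends on its configuration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors most similar to query_vec."""
    ranked = sorted(
        enumerate(doc_vecs),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [index for index, _ in ranked[:k]]


# Toy embeddings standing in for chunk vectors (hypothetical values).
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
print(top_k([1.0, 0.0], toy_vectors, k=3))
```

A larger `k` gives the LLM more context to draw on but also raises the chance of including irrelevant chunks, which is what the context metrics later in this notebook measure.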

### Create the prompt for answer generation
The `LLM` needs to generate an `answer` to each question from the retrieved `Context`.


In [None]:
from langchain.prompts import PromptTemplate

template = """You are an assistant for question-answering tasks. 
Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know. 

Question: {question} 

Context: {context} 

Answer:
"""

prompt = PromptTemplate(
    template=template, 
    input_variables=["context","question"]
  )

print(prompt)
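
To see what the model actually receives, the template can be filled in by hand. The sketch below uses plain `str.format` on a copy of the template text, assuming the default f-string-style placeholders. The question and context values are placeholders for illustration only.

```python
toy_template = """You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.

Question: {question}

Context: {context}

Answer:
"""

# Fill both placeholders; in the real chain the context comes from the retriever.
filled = toy_template.format(
    question="What is faithfulness?",
    context="<retrieved chunk text goes here>",
)
print(filled)
```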

### Generate the `answer` with the LLM
We define a `chain` from `Runnable` components that implements the full RAG pipeline.

In [None]:
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

rag_chain = (
    {"context": base_retriever,  "question": RunnablePassthrough()} 
    | prompt 
    | llm
    | StrOutputParser() 
)
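
In the chain above, the retriever's output (a list of `Document` objects) is handed to the prompt as-is, so, as far as I can tell, the prompt receives the string representation of that list, metadata included. A common variant joins only the chunk texts first. The sketch below shows such a helper. `ToyDocument` is a hypothetical stand-in that models only the `page_content` attribute, and `format_docs` is an illustrative name, not part of the original notebook.

```python
from dataclasses import dataclass


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a LangChain Document (page_content only)."""
    page_content: str


def format_docs(documents):
    """Join the text of each retrieved chunk, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in documents)


toy_docs = [ToyDocument("First chunk."), ToyDocument("Second chunk.")]
context_text = format_docs(toy_docs)
print(context_text)
```

Under that assumption, the chain's first step could be written as `{"context": base_retriever | format_docs, "question": RunnablePassthrough()}`, which keeps the prompt free of metadata noise.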

#### Build the data RAGAS requires
RAGAS expects four columns: `question`, `answer`, `contexts`, and `ground_truths`.

In [None]:
# Dataset columns required by Ragas: ['question', 'answer', 'contexts', 'ground_truths']
'''
{
    "question": [], <-- questions grounded in the context
    "answer": [], <-- answers generated by the LLM
    "contexts": [], <-- retrieved context chunks
    "ground_truths": [] <-- reference answers
}
'''

from datasets import Dataset

questions = ["What is faithfulness ?", 
             "How many pages are included in the WikiEval dataset, and which years do they cover information from?",
             "Why is evaluating Retrieval Augmented Generation (RAG) systems challenging?",
            ]
ground_truths = [["Faithfulness refers to the idea that the answer should be grounded in the given context."],
                 [" To construct the dataset, we first selected 50 Wikipedia pages covering events that have happened since the start of 2022."],
                ["Evaluating RAG architectures is, however, challenging because there are several dimensions to consider: the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit such passages in a faithful way, or the quality of the generation itself."]]
answers = []
contexts = []

# Generate an answer and collect the retrieved contexts for each question
for query in questions:
    answers.append(rag_chain.invoke(query))
    contexts.append([doc.page_content for doc in base_retriever.get_relevant_documents(query)])

# Assemble the dataset
data = {
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truths": ground_truths
}
dataset = Dataset.from_dict(data)


In [None]:
dataset
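
Before handing the data to an evaluator, it can help to check its shape. The sketch below is a hypothetical helper (`check_ragas_columns` is not a RAGAS function) that verifies the four columns described above: all present, all the same length, and `contexts` and `ground_truths` holding lists of strings per row.

```python
def check_ragas_columns(data):
    """Validate the column layout described above; return the row count."""
    required = ["question", "answer", "contexts", "ground_truths"]
    missing = [name for name in required if name not in data]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    lengths = {name: len(data[name]) for name in required}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"columns differ in length: {lengths}")

    # These two columns hold one list of strings per question.
    for name in ("contexts", "ground_truths"):
        for row in data[name]:
            if not isinstance(row, list) or not all(isinstance(s, str) for s in row):
                raise ValueError(f"each row of {name!r} must be a list of strings")
    return lengths["question"]


# Toy rows for illustration only.
toy_data = {
    "question": ["q1", "q2"],
    "answer": ["a1", "a2"],
    "contexts": [["c1a", "c1b"], ["c2a"]],
    "ground_truths": [["g1"], ["g2"]],
}
print(check_ragas_columns(toy_data))
```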

### Evaluate with RAGAS

In [None]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
    context_precision,
)

result = evaluate(
    dataset = dataset, 
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
)

result
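
For intuition about the scores, the sketch below computes two of the underlying ratios from hand-written binary verdicts. It is a simplified illustration based on my reading of the RAGAS documentation, not the library's code: RAGAS obtains the verdicts by prompting an LLM, and the exact formulation may differ between versions. The function names and verdict lists are hypothetical.

```python
def supported_ratio(verdicts):
    """Fraction of items judged positive, e.g. answer claims supported by
    the context (faithfulness) or ground-truth statements attributable
    to the context (context recall)."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


def context_precision(relevance):
    """Rank-aware precision: mean of precision@k over the ranks k that
    hold a relevant chunk. `relevance` is a 0/1 list in retrieval order."""
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Hypothetical verdicts: 3 of 4 claims supported; chunks 1 and 3 relevant.
print(supported_ratio([1, 1, 0, 1]))
print(context_precision([1, 0, 1]))
```

Because relevant chunks ranked earlier contribute higher precision values, the second score rewards retrievers that put useful context first.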

In [None]:
import pandas as pd
pd.set_option("display.max_colwidth", None)

df = result.to_pandas()
df