# Automated, Effective Evaluation of RAG Systems with LangChain and RAGAS

### Install the dependencies first

In [None]:
!pip install -U -q langchain ragas arxiv pymupdf chromadb wandb tiktoken

In [6]:
import os


#os.environ["OPENAI_API_KEY"] = 
# LangSmith tracing configuration.
# The API key is read from the environment so that it is not hardcoded in the notebook.
LANGSMITH_TRACING = True
LANGSMITH_ENDPOINT = "https://api.smith.langchain.com"
LANGSMITH_API_KEY = os.environ.get("LANGSMITH_API_KEY", "")
LANGSMITH_PROJECT = "pr-another-sound-23"
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = LANGSMITH_ENDPOINT
os.environ['LANGCHAIN_API_KEY'] = LANGSMITH_API_KEY


### Data preparation

The evaluation uses an arXiv paper as its example: `ArxivLoader` loads the paper, which then serves as the context for the RAG pipeline.

- [`ArxivLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.arxiv.ArxivLoader.html#langchain_community.document_loaders.arxiv.ArxivLoader)


In [7]:
from langchain.document_loaders import ArxivLoader

paper_docs = ArxivLoader(query="2309.15217", load_max_docs=1).load()
len(paper_docs)

1

In [8]:
for doc in paper_docs:
  print(doc.metadata)

{'Published': '2023-09-26', 'Title': 'RAGAS: Automated Evaluation of Retrieval Augmented Generation', 'Authors': 'Shahul Es, Jithin James, Luis Espinosa-Anke, Steven Schockaert', 'Summary': 'We introduce RAGAs (Retrieval Augmented Generation Assessment), a framework\nfor reference-free evaluation of Retrieval Augmented Generation (RAG)\npipelines. RAG systems are composed of a retrieval and an LLM based generation\nmodule, and provide LLMs with knowledge from a reference textual database,\nwhich enables them to act as a natural language layer between a user and\ntextual databases, reducing the risk of hallucinations. Evaluating RAG\narchitectures is, however, challenging because there are several dimensions to\nconsider: the ability of the retrieval system to identify relevant and focused\ncontext passages, the ability of the LLM to exploit such passages in a faithful\nway, or the quality of the generation itself. With RAGAs, we put forward a\nsuite of metrics which can be used to eval

### Build the RAG text splitter, embedding model, and vector store

Split the text with `RecursiveCharacterTextSplitter`, encode the chunks with `sentence-transformers/all-MiniLM-L6-v2`, and store them in a `VectorStore`.

- `RecursiveCharacterTextSplitter()`
- `sentence-transformers/all-MiniLM-L6-v2`
- `Chroma`

In [None]:
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500)

docs = text_splitter.split_documents(paper_docs)

embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

vectorstore = Chroma.from_documents(docs, embedding_model)


In [11]:
len(docs)

107

In [12]:
print(max([len(chunk.page_content) for chunk in docs]))

497
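
To make the chunk-size bound concrete, here is a minimal pure-Python sketch of recursive character splitting. It is a simplified illustration, not LangChain's implementation: the function name `split_recursive` and its separator list are assumptions made for this example, and chunk overlap is omitted. It tries coarse separators first and falls back to finer ones, so no chunk exceeds `chunk_size`. That matches the check above, where the longest chunk (497) stays under the configured 500.

```python
def split_recursive(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified recursive splitter (hypothetical helper, no overlap handling)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # Keep merging pieces while the merged chunk still fits
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with finer separators
            chunks.extend(split_recursive(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "alpha beta gamma\n\ndelta epsilon\nzeta " + "x" * 30
chunks = split_recursive(sample, 12)
print(chunks)
print(max(len(c) for c in chunks))
```

Only separators at chunk boundaries are dropped, so the non-whitespace content of the input is preserved across the chunks.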


We can now retrieve with the `Chroma` vector store's `.as_retriever()` method. The main parameter to control is `k`, the number of chunks returned per query.

In [13]:
base_retriever = vectorstore.as_retriever(search_kwargs={"k" : 3})

In [14]:
# get_relevant_documents() is deprecated in recent LangChain releases; invoke() replaces it
relevant_docs = base_retriever.invoke("What is Retrieval Augmented Generation?")


In [15]:
len(relevant_docs)

3
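
As a rough illustration of what `k` controls, the sketch below ranks a toy index by cosine similarity and keeps the top `k` entries. Everything here is an assumption made for illustration: the three-dimensional vectors, the chunk labels, and the choice of cosine similarity. The distance metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, indexed_chunks, k=3):
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(indexed_chunks, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy index of (chunk text, embedding) pairs; all values are made up
toy_index = [
    ("chunk about faithfulness", [0.9, 0.1, 0.0]),
    ("chunk about WikiEval", [0.1, 0.9, 0.0]),
    ("chunk about retrieval", [0.7, 0.2, 0.1]),
    ("chunk about references", [0.0, 0.1, 0.9]),
]

hits = top_k([1.0, 0.0, 0.0], toy_index, k=3)
print(hits)
```

A larger `k` gives the LLM more context to work with, at the cost of a longer prompt and a higher chance of including loosely related chunks.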

### Build the prompt for answer generation
We need the `LLM` to produce an `answer` to each question in a series, based on the retrieved `Context`.


In [16]:
from langchain_core.prompts import PromptTemplate

template = """You are an assistant for question-answering tasks. 
Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know. 

Question: {question} 

Context: {context} 

Answer:
"""

prompt = PromptTemplate(
    template=template,
    input_variables=["context", "question"],
)

print(prompt)

input_variables=['context', 'question'] input_types={} partial_variables={} template="You are an assistant for question-answering tasks. \nUse the following pieces of retrieved context to answer the question. \nIf you don't know the answer, just say that you don't know. \n\nQuestion: {question} \n\nContext: {context} \n\nAnswer:\n"
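
To show how the two placeholders get filled, the sketch below substitutes values into a shortened copy of the template with plain `str.format`. The names `toy_template` and `fill_prompt` are hypothetical, and joining chunk texts with blank lines is a choice made for this sketch. The chain in this notebook passes the retriever output straight into the prompt instead.

```python
# Shortened copy of the template, keeping the same placeholders
toy_template = (
    "Use the following pieces of retrieved context to answer the question.\n\n"
    "Question: {question}\n\n"
    "Context: {context}\n\n"
    "Answer:\n"
)


def fill_prompt(question, context_chunks):
    """Join the chunk texts and substitute both placeholders."""
    context = "\n\n".join(context_chunks)
    return toy_template.format(question=question, context=context)


filled = fill_prompt(
    "What is faithfulness?",
    ["first retrieved chunk", "second retrieved chunk"],
)
print(filled)
```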


### Generate the `answer` with the LLM
Use `Runnable` components to define a `chain` that implements the full RAG pipeline.

In [17]:
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser
# from langchain.chat_models import ChatOpenAI
from langchain.chat_models import ChatOllama

# llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
llm = ChatOllama(model="llama3.1")

rag_chain = (
    {"context": base_retriever,  "question": RunnablePassthrough()} 
    | prompt 
    | llm
    | StrOutputParser() 
)
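
The `|` composition above can be pictured with a toy model in plain Python. This is only an illustration of the piping idea, not how LangChain implements `Runnable`. The `Step` class, the fake retriever, and the fake LLM are all hypothetical stand-ins.

```python
class Step:
    """Toy stand-in for a runnable: wraps a function and supports `|` chaining."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # Run this step first, then feed its output into the next step
        return Step(lambda value: other.invoke(self.invoke(value)))


def fake_retriever(question):
    # Stand-in for the vector store retriever
    return ["ctx A", "ctx B"]


def build_prompt(inputs):
    return "Question: {}\nContext: {}".format(inputs["question"], " | ".join(inputs["context"]))


# Same shape as the chain above: gather inputs, build prompt, call model, parse output
gather = Step(lambda q: {"context": fake_retriever(q), "question": q})
format_prompt = Step(build_prompt)
fake_llm = Step(lambda prompt_text: {"content": prompt_text.upper()})
parse = Step(lambda message: message["content"])

toy_chain = gather | format_prompt | fake_llm | parse
answer = toy_chain.invoke("what is rag?")
print(answer)
```

Each stage only needs to accept the previous stage's output, which is why the retriever, prompt, model, and parser can be swapped independently.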


#### Build the data RAGAs requires
RAGAs expects four fields: `user_input`, `response`, `retrieved_contexts`, and `reference`.

In [None]:
# Columns required by the Ragas dataset format:
# ['user_input', 'response', 'retrieved_contexts', 'reference']
'''
{
    "user_input": [], <-- questions based on the context
    "response": [], <-- answers generated by the LLM
    "retrieved_contexts": [], <-- retrieved context chunks
    "reference": [] <-- ground-truth answers
}
'''

from datasets import Dataset

user_input = [
    "What is faithfulness ?",
    "How many pages are included in the WikiEval dataset, and which years do they cover information from?",
    "Why is evaluating Retrieval Augmented Generation (RAG) systems challenging?",
]
reference = [
    "Faithfulness refers to the idea that the answer should be grounded in the given context.",
    "To construct the dataset, we first selected 50 Wikipedia pages covering events that have happened since the start of 2022.",
    "Evaluating RAG architectures is, however, challenging because there are several dimensions to consider: the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit such passages in a faithful way, or the quality of the generation itself.",
]
response = []
retrieved_contexts = []

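# Generate an answer and collect the retrieved contexts for each question
for query in user_input:
    text = rag_chain.invoke(query)
    response.append(text)
    # Use `doc` as the loop variable to avoid shadowing the `docs` list of chunks
    retrieved_contexts.append([doc.page_content for doc in base_retriever.invoke(query)])

# Build the dataset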
data = {
    "user_input": user_input,
    "response": response,
    "retrieved_contexts": retrieved_contexts,
    "reference": reference
}
dataset = Dataset.from_dict(data)


In [19]:
dataset

Dataset({
    features: ['user_input', 'response', 'retrieved_contexts', 'reference'],
    num_rows: 3
})
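
Before handing data to an evaluator, it can help to check that the four columns line up. The sketch below does this in plain Python. `check_eval_data` and its error messages are hypothetical helpers for illustration, not part of Ragas or `datasets`.

```python
REQUIRED_COLUMNS = ("user_input", "response", "retrieved_contexts", "reference")


def check_eval_data(data):
    """Validate column names, row counts, and context types; return the row count."""
    missing = [name for name in REQUIRED_COLUMNS if name not in data]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    lengths = {name: len(data[name]) for name in REQUIRED_COLUMNS}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"columns differ in length: {lengths}")
    for row, contexts in enumerate(data["retrieved_contexts"]):
        # Each row holds a list of context strings, one per retrieved chunk
        if not isinstance(contexts, list) or not all(isinstance(c, str) for c in contexts):
            raise ValueError(f"row {row}: retrieved_contexts must be a list of strings")
    return lengths["user_input"]


toy_data = {
    "user_input": ["What is faithfulness ?"],
    "response": ["An answer grounded in the context."],
    "retrieved_contexts": [["chunk one", "chunk two"]],
    "reference": ["Faithfulness refers to the idea that the answer should be grounded in the given context."],
}
print(check_eval_data(toy_data))
```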

### Evaluate with RAGAs
- ContextRecall: how well the `retrieved_contexts` cover the `reference`
- Faithfulness: how consistent the `response` is with the `retrieved_contexts`
- Correctness: how well the `response` agrees with the `reference`
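
The sketch below shows the kind of arithmetic behind these scores, assuming the judge LLM has already produced one verdict per claim. The verdict lists and counts are hypothetical, and the helper names are made up for this example. It also assumes that factual correctness is reported as a claim-level F1 score, so treat that part as an assumption rather than a statement of how the library computes it.

```python
import math


def supported_fraction(verdicts):
    """Share of claims judged as supported; NaN for no claims (a choice made for this sketch)."""
    if not verdicts:
        return math.nan
    return sum(bool(v) for v in verdicts) / len(verdicts)


def f1_from_counts(tp, fp, fn):
    """F1 over claims: tp = matched claims, fp = extra response claims, fn = missed reference claims."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else math.nan


# Hypothetical judge verdicts, one boolean per claim
reference_claims_found_in_contexts = [True, False]          # context-recall style
response_claims_backed_by_contexts = [True] * 7 + [False]   # faithfulness style

context_recall = supported_fraction(reference_claims_found_in_contexts)
faithfulness = supported_fraction(response_claims_backed_by_contexts)
factual_f1 = f1_from_counts(tp=1, fp=3, fn=0)

print(context_recall, faithfulness, factual_f1)
```

Under this reading, a long answer that adds correct but unreferenced detail can score high on faithfulness while scoring low on correctness, because the extra claims count against precision.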

In [20]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness

# Wrap the LangChain chat model so that RAGAs can use it as the evaluator LLM
evaluator_llm = LangchainLLMWrapper(llm)

result = evaluate(
    dataset=dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness()],
    llm=evaluator_llm,
)
result

Evaluating: 100%|██████████| 9/9 [00:35<00:00,  3.93s/it]


{'context_recall': 0.8333, 'faithfulness': 0.9583, 'factual_correctness': 0.5400}

In [21]:
import pandas as pd
pd.set_option("display.max_colwidth", None)

df = result.to_pandas()
df

Unnamed: 0,user_input,retrieved_contexts,response,reference,context_recall,faithfulness,factual_correctness
0,What is faithfulness ?,"[First, Faithfulness refers to the idea that the an-\nswer should be grounded in the given context. This\nis important to avoid hallucinations, and to ensure\nthat the retrieved context can act as a justification\nfor the generated answer. Indeed, RAG systems are\noften used in applications where the factual con-\nsistency of the generated text w.r.t. the grounded\nsources is highly important, e.g. in domains such as\nlaw, where information is constantly evolving. Sec-, Faithfulness measures the information\nconsistency of the answer against the\ngiven context. Any claims that are made\nin the answer that cannot be deduced\nfrom context should be penalized.\nGiven an answer and context, assign a\nscore for faithfulness in the range 0-10.\ncontext: [context]\nanswer: [answer]\nTies, where the same score is assigned by the LLM\nto both answer candidates, were broken randomly.\nThe second baseline, shown as GPT Ranking, in-, an answer as(q). When building a RAG system,\nwe usually do not have access to human-annotated\ndatasets or reference answers. We therefore fo-\ncus on metrics that are fully self-contained and\nreference-free. We focus in particular three quality\naspects, which we argue are of central importance.\nFirst, Faithfulness refers to the idea that the an-\nswer should be grounded in the given context. This\nis important to avoid hallucinations, and to ensure]","Faithfulness refers to the idea that the answer should be grounded in the given context, avoiding hallucinations and ensuring that any claims made can be deduced from the context. It measures the information consistency of the answer against the given context, penalizing any claims that cannot be inferred from it. 
Faithfulness is a crucial aspect in evaluating Retrieval Augmented Generation (RAG) pipelines, where systems are expected to act as a natural language layer between users and textual databases without compromising factual consistency.",Faithfulness refers to the idea that the answer should be grounded in the given context.,1.0,0.875,0.25
1,"How many pages are included in the WikiEval dataset, and which years do they cover information from?","[which we refer to as WikiEval4. To construct the\ndataset, we first selected 50 Wikipedia pages cov-\nering events that have happened since the start of\n20225. In selecting these pages, we prioritised\nthose with recent edits. For each of the 50 pages,\nwe then asked ChatGPT to suggest a question that\ncan be answered based on the introductory section\nof the page, using the following prompt:\nYour task is to formulate a question from\ngiven context satisfying the rules given\nbelow:, which are annotated with human judgments. We\ncan then verify to what extent our metrics agree\nwith human assessments of faithfulness, answer\nrelevance and context relevance. Since we are not\naware of any publicly available datasets that could\nbe used for this purpose, we created a new dataset,\nwhich we refer to as WikiEval4. To construct the\ndataset, we first selected 50 Wikipedia pages cov-\nering events that have happened since the start of\n20225. In selecting these pages, we prioritised, opment of such a framework, we have introduced\nWikiEval, a dataset which human judgements of\nthese three different aspects. Finally, we have also\ndescribed RAGAs, our implementation of the three\nconsidered quality aspects. This framework is easy\nto use and can provide deverlopers of RAG sys-\ntems with valuable insights, even in the absence\nof any ground truth. Our evaluation on WikiEval\nhas shown that the predictions from RAGAs are\nclosely aligned with human predictions, especially]","The WikiEval dataset includes 50 Wikipedia pages, covering events since the start of 2022. The exact number of pages included in the dataset is mentioned as 50. 
The years covered by this dataset are from 2022 onwards, which indicates a specific time period rather than a range of years.","To construct the dataset, we first selected 50 Wikipedia pages covering events that have happened since the start of 2022.",0.5,1.0,0.57
2,Why is evaluating Retrieval Augmented Generation (RAG) systems challenging?,"[RAGAS: Automated Evaluation of Retrieval Augmented Generation\nShahul Es†, Jithin James†, Luis Espinosa-Anke∗♢, Steven Schockaert∗\n†Exploding Gradients\n∗CardiffNLP, Cardiff University, United Kingdom\n♢AMPLYFI, United Kingdom\nshahules786@gmail.com,jamesjithin97@gmail.com\n{espinosa-ankel,schockaerts1}@cardiff.ac.uk\nAbstract\nWe introduce RAGAS (Retrieval Augmented\nGeneration Assessment), a framework for\nreference-free evaluation of Retrieval Aug-\nmented Generation (RAG) pipelines.\nRAG, that are only available through APIs.\nWhile the usefulness of retrieval-augmented\nstrategies is clear, their implementation requires\na significant amount of tuning, as the overall per-\nformance will be affected by the retrieval model,\nthe considered corpus, the LM, or the prompt for-\nmulation, among others. Automated evaluation of\nretrieval-augmented systems is thus paramount. In\npractice, RAG systems are often evaluated in terms\nof the language modelling task itself, i.e. by mea-, Abstract\nWe introduce RAGAS (Retrieval Augmented\nGeneration Assessment), a framework for\nreference-free evaluation of Retrieval Aug-\nmented Generation (RAG) pipelines.\nRAG\nsystems are composed of a retrieval and an\nLLM based generation module, and provide\nLLMs with knowledge from a reference textual\ndatabase, which enables them to act as a natu-\nral language layer between a user and textual\ndatabases, reducing the risk of hallucinations.\nEvaluating RAG architectures is, however, chal-]","Evaluating Retrieval Augmented Generation (RAG) systems is challenging because there are several dimensions to consider, including:\n\n1. The ability of the retrieval system to identify relevant and focused context passages.\n2. The ability of the LLM to exploit such passages in a faithful way.\n3. 
The quality of the generation itself.\n\nThis makes it difficult to evaluate RAG architectures using traditional methods.","Evaluating RAG architectures is, however, challenging because there are several dimensions to consider: the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit such passages in a faithful way, or the quality of the generation itself.",1.0,1.0,0.8
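
The per-row scores in the table are consistent with the aggregate result printed earlier if each aggregate is the arithmetic mean of its column. That reading is inferred from the numbers in this notebook; the short check below uses only values copied from the table.

```python
from statistics import mean

# Per-row scores copied from the table above
per_row = {
    "context_recall": [1.0, 0.5, 1.0],
    "faithfulness": [0.875, 1.0, 1.0],
    "factual_correctness": [0.25, 0.57, 0.8],
}

# Average each metric over the three questions, rounded to four decimals
aggregate = {name: round(mean(scores), 4) for name, scores in per_row.items()}
print(aggregate)
```

With only three questions, a single row moves the average a lot: the 0.25 factual correctness on the first question accounts for most of the gap between that metric and the other two.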
