# Evaluating RAG Systems

LangChain has the [data connection](https://python.langchain.com/docs/modules/data_connection/) module, which helps you connect your own data to LLMs and build Retrieval Augmented Generation (RAG) pipelines. We will be testing out three of LangChain's QA evaluators:

1. QA
2. COT_QA
3. CONTEXT_QA


Let's see how to evaluate with those, but first let's build a QA system to test.

## Build a RAG pipeline

In [1]:
from langchain.document_loaders import TextLoader
from langchain.indexes import VectorstoreIndexCreator

loader = TextLoader("nyc_text.txt")
index = VectorstoreIndexCreator().from_loaders([loader])

In [2]:
question = "How did New York City get its name?"
index.query(question)

' New York City was named after King Charles II of England, who granted the lands to his brother, the Duke of York.'

In [44]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI()
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=index.vectorstore.as_retriever(),
    return_source_documents=True
)
result = qa_chain({"query": question})
print(len(result['source_documents']))
result['source_documents'][0]

4


Document(page_content="== Etymology ==\n\nIn 1664, New York was named in honor of the Duke of York, who would become King James II of England. James's elder brother, King Charles II, appointed the Duke as proprietor of the former territory of New Netherland, including the city of New Amsterdam, when England seized it from Dutch control.\n\n\n== History ==", metadata={'source': 'nyc_text.txt'})

In [45]:
result

{'query': 'How did New York City get its name?',
 'result': 'New York City got its name in 1664 when it was renamed by the British after King Charles II granted the lands to his brother, the Duke of York. It was originally settled by the Dutch and named New Amsterdam in 1626. After coming under British control, the city was renamed New York in honor of the Duke of York, who would later become King James II of England.',
 'source_documents': [Document(page_content="== Etymology ==\n\nIn 1664, New York was named in honor of the Duke of York, who would become King James II of England. James's elder brother, King Charles II, appointed the Duke as proprietor of the former territory of New Netherland, including the city of New Amsterdam, when England seized it from Dutch control.\n\n\n== History ==", metadata={'source': 'nyc_text.txt'}),
  Document(page_content="During the Wisconsin glaciation, 75,000 to 11,000 years ago, the New York City area was situated at the edge of a large ice sheet o

## Coldstart eval data

We don't have any evaluation data to run the evaluation against, so let's synthesize some. We need questions and answers for this, and we are going to use [LlamaIndex](https://github.com/jerryjliu/llama_index) to generate them.

In [5]:
from llama_index import VectorStoreIndex, SimpleDirectoryReader, Document
from llama_index.evaluation import DatasetGenerator

with open("./nyc_text.txt") as f:
    docs = [Document(text=f.read())]
    
question_generator = DatasetGenerator.from_documents(docs)
eval_questions = question_generator.generate_questions_from_nodes(5)

eval_questions

['What is the population of New York City as of 2020?',
 'Which borough of New York City has the highest population?',
 'What is the economic significance of New York City?',
 'How did New York City get its name?',
 'What is the significance of the Statue of Liberty in New York City?']

## Run Evaluation chains
Let's start off with
1. `EvaluatorType.CONTEXT_QA`, which checks whether the generated answer is factually consistent with the retrieved context.

In [23]:
examples = []
predictions = []

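for q in eval_questions:
    example = {"query": q}
    # run the chain; the result holds the answer and the retrieved source documents
    result = qa_chain(example)
    # join the retrieved chunks into the single context string the evaluator grades against
    example["context"] = "\n".join([c.page_content for c in result["source_documents"]])

    examples.append(example)
    predictions.append(result)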

In [24]:
from langchain.evaluation import load_evaluator, EvaluatorType

context_qa_eval = load_evaluator(EvaluatorType.CONTEXT_QA)

# evaluation
context_qa_eval.evaluate(examples, predictions)

[{'text': 'CORRECT'},
 {'text': 'CORRECT'},
 {'text': 'CORRECT'},
 {'text': 'CORRECT'},
 {'text': 'CORRECT'}]
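
The per-example verdicts can be collapsed into a single score. This is a minimal sketch, assuming each verdict is a dict whose `text` key holds a bare `CORRECT` or `INCORRECT` grade as in the output above. `accuracy` is a hypothetical helper, not part of LangChain.

```python
def accuracy(verdicts, key="text"):
    """Fraction of verdicts whose grade is exactly CORRECT (0.0 for an empty list)."""
    if not verdicts:
        return 0.0
    correct = sum(1 for v in verdicts if v[key].strip().upper() == "CORRECT")
    return correct / len(verdicts)


# Verdicts copied from the CONTEXT_QA output above.
context_qa_verdicts = [{"text": "CORRECT"}] * 5
print(accuracy(context_qa_verdicts))  # 1.0
```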

2. `EvaluatorType.COT_QA`, which is the same as `EvaluatorType.CONTEXT_QA` but has the grader use chain-of-thought (CoT) reasoning before giving its verdict.

In [27]:
cot_qa_eval = load_evaluator(EvaluatorType.COT_QA)
result = cot_qa_eval.evaluate(examples, predictions)

In [28]:
result

[{'text': "The context states that the population of New York City in 2020 is 8,804,190. The student's answer matches this information exactly. Therefore, the student's answer is correct.\nGRADE: CORRECT"},
 {'text': 'The context mentions that "Brooklyn (Kings County), on the western tip of Long Island, is the city\'s most populous borough." This statement directly aligns with the student\'s answer, which states that Brooklyn has the highest population in New York City. Therefore, the student\'s answer is correct.\nGRADE: CORRECT'},
 {'text': "The student's answer correctly identifies New York City as the headquarters for the U.S. financial industry, including Wall Street, and mentions the presence of large financial companies and startups. The student also correctly identifies New York City as a global hub for business and commerce, attracting capital, business, and tourists from around the world. The student correctly lists several industries that are centered in New York City, such 
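
Because the CoT evaluator returns its reasoning and its verdict in one string, the grade has to be pulled out before it can be counted. This is a rough sketch, assuming the verdict always appears as `GRADE: CORRECT` or `GRADE: INCORRECT`, as in the output above. `extract_grade` is a hypothetical helper, not a LangChain function.

```python
import re

# Assumed format, taken from the output above: "...reasoning...\nGRADE: CORRECT"
GRADE_RE = re.compile(r"GRADE:\s*(CORRECT|INCORRECT)\b", re.IGNORECASE)


def extract_grade(text):
    """Return the last GRADE verdict found in the explanation, or None if absent."""
    matches = GRADE_RE.findall(text)
    return matches[-1].upper() if matches else None


explanation = (
    "The context states that the population of New York City in 2020 is 8,804,190. "
    "The student's answer matches this information exactly.\nGRADE: CORRECT"
)
print(extract_grade(explanation))  # CORRECT
```

Taking the last match guards against explanations that mention the word "GRADE" earlier in the reasoning.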

3. `EvaluatorType.QA`, which compares the generated answer with the ground truth to check for factual inconsistencies.

To run this we need ground truth answers. For this we are going to use LlamaIndex as the source of truth (don't do this in prod).

In [15]:
# create vector index
vector_index = VectorStoreIndex.from_documents(docs)
qe = vector_index.as_query_engine()

In [25]:
# generate answer in examples
for i, e in enumerate(eval_questions):
    r = qe.query(e)
    examples[i]["answer"] = r.response
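
Before running the `QA` evaluator it can help to confirm that every example carries the fields it will be graded on. This is a minimal sketch: `missing_fields` is a hypothetical helper, and the required keys are assumed from the cells above.

```python
def missing_fields(examples, required=("query", "context", "answer")):
    """Map example index -> list of required keys that are absent or empty."""
    problems = {}
    for i, example in enumerate(examples):
        missing = [k for k in required if not example.get(k)]
        if missing:
            problems[i] = missing
    return problems


sample = [
    {"query": "q1", "context": "c1", "answer": "a1"},
    {"query": "q2", "context": "c2"},  # ground truth not generated yet
]
print(missing_fields(sample))  # {1: ['answer']}
```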

In [26]:
# load and evaluate
qa_eval = load_evaluator(EvaluatorType.QA)
qa_eval.evaluate(examples, predictions)

[{'results': 'CORRECT'},
 {'results': 'INCORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'}]
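
To see which question was graded `INCORRECT`, the verdicts can be paired back up with the questions. This is a small sketch, assuming bare grades stored under either `results` or `text` (both key names appear in the outputs above). `grade_of` and `failed_questions` are hypothetical helpers.

```python
def grade_of(verdict):
    """Read the bare grade from a verdict dict, whichever key it is stored under."""
    for key in ("results", "text"):
        if key in verdict:
            return verdict[key].strip().upper()
    raise KeyError("verdict has neither 'results' nor 'text'")


def failed_questions(questions, verdicts):
    """Return the questions whose verdict is anything other than CORRECT."""
    return [q for q, v in zip(questions, verdicts) if grade_of(v) != "CORRECT"]


# Questions and verdicts copied from the outputs above.
questions = [
    "What is the population of New York City as of 2020?",
    "Which borough of New York City has the highest population?",
    "What is the economic significance of New York City?",
    "How did New York City get its name?",
    "What is the significance of the Statue of Liberty in New York City?",
]
verdicts = [
    {"results": "CORRECT"},
    {"results": "INCORRECT"},
    {"results": "CORRECT"},
    {"results": "CORRECT"},
    {"results": "CORRECT"},
]
print(failed_questions(questions, verdicts))
# ['Which borough of New York City has the highest population?']
```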

## Evaluating QA with LangSmith

Let's try evaluating QA systems with LangSmith. Check out the LangSmith intro notebook to learn more about setting LangSmith up.

### 1. Create a dataset

Let's create a dataset and upload the `examples` to it for evaluation.

In [35]:
from langsmith import Client

client = Client()

dataset_name = "NYC test"

dataset_list = client.list_datasets()
dataset_exists = False
dataset = None
for d in dataset_list:
    if d.name == dataset_name:
        dataset_exists=True
        dataset = d
        
if not dataset_exists:
    dataset = client.create_dataset(
        dataset_name=dataset_name
    )
    
dataset.id

UUID('754c4ec6-14ec-4d4c-b955-4a8b4ac2e4c5')
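
The lookup-then-create logic above can be wrapped in a reusable function. This is a sketch only: `get_or_create_dataset` is a hypothetical helper, and it relies on nothing beyond the two client calls already used in the cell above (`list_datasets()` and `create_dataset(dataset_name=...)`).

```python
def get_or_create_dataset(client, name):
    """Return the dataset called `name`, creating it only if none exists yet."""
    for dataset in client.list_datasets():
        if dataset.name == name:
            return dataset
    # No match found, so create the dataset once.
    return client.create_dataset(dataset_name=name)
```

With the `client` and `dataset_name` defined earlier, `dataset = get_or_create_dataset(client, dataset_name)` would replace the loop, and re-running the cell should not create a second dataset.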

In [42]:
e

{'query': 'What is the significance of the Statue of Liberty in New York City?',
 'context': 'from 1785 until 1790, and has been the largest U.S. city since 1790. The Statue of Liberty greeted millions of immigrants as they came to the U.S. by ship in the late 19th and early 20th centuries, and is a symbol of the U.S. and its ideals of liberty and peace. In the 21st century, New York City has emerged as a global node of creativity, entrepreneurship, and as a symbol of freedom and cultural diversity. The New York Times has won the most Pulitzer Prizes for journalism and remains the U.S. media\'s "newspaper of record". In 2019, New York City was voted the greatest city in the world in a survey of over 30,000 people from 48 cities worldwide, citing its cultural diversity.Many districts and monuments in New York City are major landmarks, including three of the world\'s ten most visited tourist attractions in 2013. A record 66.6 million tourists visited New York City in 2019. Times Square i

In [51]:
# add examples to the dataset
for e in examples:
    client.create_example(
        inputs={"query": e["query"]},
        outputs={"answer": e["answer"]},
        dataset_id=dataset.id
    )

### 2. Define the QA chain to evaluate

To evaluate each example independently, LangSmith requires you to pass in a factory function that creates the QA chain to evaluate. This is especially important if the QA chain contains memory or other stateful components, since building a fresh chain per example keeps state from leaking between runs.

In [40]:
def create_qa_chain():
    qa_chain = RetrievalQA.from_chain_type(
        llm,
        retriever=index.vectorstore.as_retriever(),
        return_source_documents=False
    )
    return qa_chain
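
To illustrate why a factory matters, here is a toy sketch that involves no LangChain code at all. `ToyChain` and `create_toy_chain` are hypothetical stand-ins for a chain with memory and its factory.

```python
class ToyChain:
    """Stand-in for a stateful chain: it remembers every query it has seen."""

    def __init__(self):
        self.history = []

    def __call__(self, inputs):
        self.history.append(inputs["query"])
        return {"result": f"seen {len(self.history)} queries"}


def create_toy_chain():
    """Factory: every call returns a chain with empty history."""
    return ToyChain()


# One shared instance: the second example sees state left by the first.
shared = ToyChain()
print(shared({"query": "a"})["result"])  # seen 1 queries
print(shared({"query": "b"})["result"])  # seen 2 queries

# A fresh instance per example: each run starts clean.
print(create_toy_chain()({"query": "a"})["result"])  # seen 1 queries
print(create_toy_chain()({"query": "b"})["result"])  # seen 1 queries
```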

### 3. Evaluate



In [48]:
from langchain.smith import RunEvalConfig, run_on_dataset

eval_config = RunEvalConfig(
    evaluators=[
        "qa",
        "cot_qa",
        "context_qa",
    ],
    reference_key="answer",
)

In [52]:
run_on_dataset(
    client=client,
    dataset_name=dataset_name,
    llm_or_chain_factory=create_qa_chain,
    evaluation=eval_config,
    verbose=True,
)

View the evaluation results for project '2023-08-01-23-47-46-RetrievalQA' at:
https://smith.langchain.com/projects/p/8856c6e9-98c5-47fc-9319-d72942a9af5b?eval=true
5 processed

Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 1.0 seconds as it raised ServiceUnavailableError: The server is overloaded or not ready yet..


{'project_name': '2023-08-01-23-47-46-RetrievalQA',
 'results': {'6ddbfe30-309c-401e-8a6b-e17f091cfa1b': ['The Statue of Liberty in New York City is significant as a symbol of the United States and its ideals of liberty and peace. It greeted millions of immigrants as they arrived in the U.S. by ship in the late 19th and early 20th centuries, representing the hope for a new life and freedom. It has since become an iconic landmark and a symbol of cultural diversity and freedom.'],
  '8d0bf6f3-096b-4092-b2e9-cacc0b5f21de': ['New York City was named in honor of the Duke of York, who later became King James II of England. In 1664, King Charles II appointed the Duke as proprietor of the former territory of New Netherland, which included the city of New Amsterdam (now New York City), when England seized it from Dutch control. The city was then renamed New York after the Duke of York.'],
  'b3704607-ed11-4da5-bc01-3788e58e28c0': ["New York City has significant economic significance in several 